---
library_name: pytorch
license: mit
pipeline_tag: other
tags:
- arc-prize-2025
- program-synthesis
- tiny-recursive-models
- recursive-reasoning
- resume-training
- act
- reproducibility
datasets:
- arc-prize-2025
model-index:
- name: Tiny Recursive Models — ARC-AGI-2 (Resume Step 119432)
  results:
  - task:
      type: program-synthesis
      name: ARC Prize 2025 (legacy evaluation mapping)
    dataset:
      name: ARC Prize 2025 Public Evaluation
      type: arc-prize-2025
      split: evaluation
    metrics:
    - type: accuracy
      name: ARC Task Solve Rate (pass@1)
      value: 0.0083
    - type: accuracy
      name: ARC Task Solve Rate (pass@2)
      value: 0.0083
    - type: accuracy
      name: ARC Task Solve Rate (pass@10)
      value: 0.0083
    - type: accuracy
      name: ARC Task Solve Rate (pass@100)
      value: 0.0083
---
|
|
|
|
|
# Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432) |
|
|
|
|
|
## Overview
|
|
- 8× H200 resume snapshot at global step 119 432 from run `trm_arc2_8gpu_resume_step115815_plus100k_v2` (TinyRecursiveModels commit `e7b68717`). |
|
|
- Consolidated `model.ckpt` with accompanying configuration and provenance files. Integrity hash (sha256; see the verification sketch below): `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`.
|
|
- Current evaluation on the ARC public evaluation split is approximately 0.83% pass@1, attributable to duplicate candidate generation in our evaluation pipeline after the resume. This release is intended for reproducibility and analysis rather than leaderboard submissions. |
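A quick way to confirm the artifact matches the published digest — a minimal sketch assuming `model.ckpt` has already been downloaded to the working directory:

```python
import hashlib

# Published digest from the overview above (also recorded in MANIFEST.txt).
EXPECTED = "2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e"

h = hashlib.sha256()
with open("model.ckpt", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
        h.update(chunk)
assert h.hexdigest() == EXPECTED, f"checksum mismatch: {h.hexdigest()}"
print("model.ckpt verified")
```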
|
|
|
|
|
## About “119 434” vs “119 432”
|
|
- Internal tracking referenced “step 119 434”; the persistent shard is `step_119432`. W&B logs confirm the resume guard initialized at step 115 815 and advanced into the 119k block. No distinct 119 434 shard remains. |
|
|
|
|
|
## Contents
|
|
- `model.ckpt` — Consolidated PyTorch checkpoint (weights, optimizer, EMA) reflecting `step_119432/*`. |
|
|
- `COMMANDS.txt`, `COMMANDS_resumed.txt` — Exact `torchrun` invocations (8× H200) and resume parameters. |
|
|
- `ENVIRONMENT.txt`, `all_config.yaml` — Hydra-resolved configurations captured on the training pod. |
|
|
- `MANIFEST.txt` — Packaging metadata (step, source path, timestamp, sha256). |
|
|
- `TRM_COMMIT.txt` — Upstream TinyRecursiveModels commit (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`). |
|
|
- `dataset-metadata.json` — Kaggle packaging manifest (legacy identifier mapping). W&B CSV/summary are included for convenience. |
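To confirm which of these files ship with the release, the repository can be enumerated directly. A sketch using the standard `huggingface_hub` API; the repo id matches the one in the Reproduction snippet below:

```python
from huggingface_hub import list_repo_files

# Enumerate the packaged artifacts described above.
for name in list_repo_files("seconds-0/trm-arc2-8gpu"):
    print(name)
```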
|
|
|
|
|
## Kaggle Status
|
|
- This checkpoint is not accompanied by a validated Kaggle submission from our pipeline. In our evaluation path, inference produced duplicate candidates across attempts, yielding ≈0.83% pass@1 on the public evaluation split. |
|
|
- A third‑party implementation demonstrating TRM inference exists on Kaggle; interested users can locate it by searching the platform. That implementation is independent of this release.
|
|
|
|
|
## Current Kaggle Limitations
|
|
- Unique‑candidate enforcement: A production overlay inadvertently bypassed the uniqueness filter, allowing duplicates even with `ARC_SAMPLING_COUNT > 1`. |
|
|
- Sampler configuration propagation: Variables such as `ARC_SAMPLING_COUNT`, `ARC_SAMPLING_MODE=sample`, and temperature did not reliably reach the evaluator under our overlays. |
|
|
- Sampling degeneracy: Under misconfiguration, multinomial sampling collapsed to identical outputs across attempts (illustrated in the sketch after this list).
|
|
- Limited attempt‑level telemetry: Evaluators emitted aggregate metrics only; per‑attempt logits and strings were not retained due to GPU runtime constraints and Kaggle logging limits. This prevents visualization of candidate selection and diversity. |
|
|
- Identifier mapping: The checkpoint was trained with the legacy identifier mapping, whereas some evaluators assume the sorted mapping. Without remapping or a compatibility layer, comparisons can be brittle. |
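To make the degeneracy concrete, here is a minimal, self-contained illustration (not our evaluator code) of how multinomial sampling collapses once a near-zero temperature sharpens the distribution onto a single token:

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([[4.0, 3.5, 3.0, 2.5]])

for temperature in (1.0, 0.01):
    probs = torch.softmax(logits / temperature, dim=-1)
    # Eight independent draws per setting; at T=0.01 the distribution is
    # effectively one-hot, so every "attempt" returns the same candidate.
    draws = [torch.multinomial(probs, num_samples=1).item() for _ in range(8)]
    print(f"T={temperature}: {draws}")
```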
|
|
|
|
|
## Attempted Mitigations
|
|
- Relaunched controlled evaluator pods with explicit sampling parameters and verified resume‑guard logs; duplicates persisted. |
|
|
- Instrumented CPU/GPU debug evaluators (`scripts/debug_eval_cpu.py --samples 8 --log-attempts`); candidate diversity remained near zero. |
|
|
- Adjusted temperature and top‑k settings; no material improvement under the broken overlay path. |
|
|
- Prepared Kaggle datasets/notebooks and executed end‑to‑end; duplicate attempts persisted and no leaderboard submission was made. |
|
|
|
|
|
## Evaluation Guidance
|
|
- Enforce a strict unique‑candidate guard prior to scoring (a minimal sketch follows this list).
|
|
- Validate that sampling environment variables propagate into the inference process. |
|
|
- Capture per‑attempt outputs and, where possible, logits to diagnose diversity. |
|
|
- Match the identifier mapping to the checkpoint (legacy vs sorted) and use the corresponding dataset builder. |
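A minimal sketch of the first two checks. The guard and the propagation assertions are illustrative, not our production evaluator; the environment variable names come from the limitations above:

```python
import os

def unique_candidates(candidates):
    """Drop duplicate attempts before scoring (order-preserving)."""
    seen, kept = set(), []
    for grid in candidates:
        key = repr(grid)
        if key not in seen:
            seen.add(key)
            kept.append(grid)
    return kept

# Fail fast if sampler configuration never reached the inference process.
assert int(os.environ.get("ARC_SAMPLING_COUNT", "1")) > 1, "ARC_SAMPLING_COUNT not propagated"
assert os.environ.get("ARC_SAMPLING_MODE") == "sample", "ARC_SAMPLING_MODE not propagated"
```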
|
|
|
|
|
## Reproduction
|
|
```python |
|
|
from huggingface_hub import hf_hub_download |
|
|
import torch |
|
|
|
|
|
ckpt_path = hf_hub_download("seconds-0/trm-arc2-8gpu", "model.ckpt") |
|
|
# Full checkpoint (weights, optimizer, EMA); PyTorch >= 2.6 defaults to
# weights_only=True, which rejects pickled non-tensor state.
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)
|
|
print(state["hyperparameters"]["arch"]["hidden_size"]) # 512 |
|
|
``` |
|
|
|
|
|
```bash
|
|
# CoreWeave resume (reference) |
|
|
kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml |
|
|
# Requires configmaps: trm-common-script, trm-pyshim-cm, trm-eval-overlay-cm |
|
|
``` |
|
|
|
|
|
## Resume Guard Signals
|
|
- `RESUME_CHECKPOINT_PATH=/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815` |
|
|
- `RESUME_EXPECTED_STEP=115815` |
|
|
- Pod logs contain: `[resume] initializing train_state.step to 115815` before training proceeds. |
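A small sketch for checking the log signal above after the fact. The log path is illustrative; the expected line is quoted from the list above:

```python
EXPECTED_LINE = "[resume] initializing train_state.step to 115815"

# Hypothetical log capture, e.g.: kubectl logs <training-pod> > trm_resume.log
with open("trm_resume.log") as f:
    assert any(EXPECTED_LINE in line for line in f), "resume guard line not found"
print("resume guard confirmed: training resumed from step 115815")
```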
|
|
|
|
|
## Known Issues and Next Steps
|
|
- Duplicate candidate generation: restore uniqueness enforcement, validate environment propagation, and re‑verify multinomial sampling. |
|
|
- Identifier mapping mismatch: this release remains “legacy”; sorted‑mapping evaluations require remapping or fine‑tuning. |
|
|
- Observability: add candidate‑level telemetry and visualizations to support reliable evaluation claims. |
|
|
|
|
|
## Ethics, License, and Intended Use
|
|
- MIT license. Intended for research and educational use. Users should independently validate evaluation protocols and candidate diversity prior to reporting results. |
|
|
|
|
|
## Acknowledgements
|
|
- Built on the Tiny Recursive Models codebase and evaluated against ARC Prize 2025 materials. A community‑maintained Kaggle implementation demonstrates a functioning TRM inference pipeline and can be located via search on Kaggle. |
|
|
|