Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)
What’s new (Nov 2025). This refresh publishes the best-performing checkpoint from the CoreWeave resume campaign—trm_arc2_8gpu_resume_step115815_plus100k_v2 at global step 119 432. The job resumed from TinyRecursiveModels commit e7b68717 with the full resume guard stack (trm-common-script + trm-pyshim) and legacy ARC identifier mapping. This is the same checkpoint we attempted to ship to Kaggle; the submission stalled at 0.83 % pass@1 because every task duplicated attempts, so we are documenting the shortfall here instead of claiming leaderboard progress.
Why the name mentions 119 434. Internal tracking labelled this snapshot “step 119 434”, but the persisted shard on the CoreWeave PVC is `step_119432`. The W&B records for the run confirm that the resume guard initialized at the expected 115 815 step and advanced to the 119k block; no 119 434 shard survived the routine pruning. When downstream tooling expects the 119 434 identifier, point it at this artifact and note the two-step discrepancy.
Checkpoint Snapshot
- Run name: `trm_arc2_8gpu_resume_step115815_plus100k_v2`
- Global step: 119 432 (3 617 optimizer updates after the 115 815 resume point)
- Architecture: Tiny Recursive Model ACT V1 (`L_layers=2`, `H_cycles=3`, `L_cycles=4`, hidden size 512, 8 heads, RoPE, bfloat16 activations)
- Optimizer: Adam-atan2 (`beta1=0.9`, `beta2=0.95`, `weight_decay=0.1`, EMA 0.999, global batch size 768)
- Dataset builder: Legacy identifier order (`dataset/build_arc_dataset_legacy.py`) targeting `arc2concept-aug-1000`
- Resume provenance: `RESUME_CHECKPOINT_PATH` → `/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`, `RESUME_EXPECTED_STEP` → `115815`; `[resume] initializing train_state.step to 115815` appears in pod logs before training continues (see the sketch after this list).
- PVC retention: Latest PVC shards now extend to `step_662428`; earlier 119k shards were pruned after packaging this export.
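For context, here is a minimal sketch of the contract the resume guard enforces, reconstructed from the environment variables and log line above. This is an illustration only, not the upstream TinyRecursiveModels implementation.

```python
import os

# Derive the step encoded in the checkpoint path, e.g. ".../step_115815" -> 115815.
resume_path = os.environ["RESUME_CHECKPOINT_PATH"]
expected_step = int(os.environ["RESUME_EXPECTED_STEP"])

step_from_path = int(os.path.basename(resume_path).rsplit("_", 1)[-1])
assert step_from_path == expected_step, (
    f"checkpoint step {step_from_path} != RESUME_EXPECTED_STEP {expected_step}"
)

# The real training loop then seeds its step counter from this value and emits the guard line.
print(f"[resume] initializing train_state.step to {expected_step}")
```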
Files Included
| Path | Description |
|---|---|
| `model.ckpt` | Consolidated PyTorch checkpoint (optimizer, EMA, and weights) containing `step_119432/*` tensors. SHA-256: `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`. |
| `COMMANDS.txt` / `COMMANDS_resumed.txt` | Torch distributed launch (8 × H200) showing the resume flags and dataset path. |
| `ENVIRONMENT.txt` | Hydra-resolved configuration captured on CoreWeave after overlays. |
| `MANIFEST.txt` | Packaging metadata (checkpoint step, source path, timestamp, sha256). |
| `TRM_COMMIT.txt` | Upstream TinyRecursiveModels Git SHA (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`). |
| `all_config.yaml` | Structured config snapshot exported alongside the checkpoint. |
| `dataset-metadata.json` | Kaggle dataset manifest (kept for parity with previous releases). |
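Before loading, you can check a download against the SHA-256 listed in the table. A minimal sketch, assuming `model.ckpt` sits in the working directory:

```python
import hashlib

EXPECTED_SHA256 = "2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

assert sha256_of("model.ckpt") == EXPECTED_SHA256, "model.ckpt does not match MANIFEST.txt"
```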
Evaluation Status
- Validation (CoreWeave pod evaluator, legacy mapping): pass@1 = 0.83 %, with identical scores for pass@2/5/10/100 because the samples were duplicates. Mean token accuracy ≈ 70.1 %, `train/lm_loss` ≈ 0.134 at resume, `all/lm_loss` ≈ 1.56.
- Kaggle inference notebook (test split): also produced 259/259 duplicate attempts, yielding 0.83 % pass@1 and no leaderboard improvement. The issue remains unresolved; do not submit this checkpoint to Kaggle until the sampler divergence is fixed.
- Copy-mode diagnostics (`scripts/debug_eval_cpu.py` in legacy mode): 0/120 grid matches (consistent with earlier baselines).
The metrics bundled here are sufficient to reproduce our internal dashboards without requiring live W&B access. If you have Weights & Biases credentials, the run is listed under trm_arc2_8gpu_resume_step115815_plus100k_v2 in project trm-arc2; the first logged step after resume exceeds 115 815, confirming the guard executed.
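If you do have credentials, here is a minimal sketch of that check. The entity name is a placeholder, and `history()` returns a sampled subset of points, so treat this as an illustration rather than the project's tooling:

```python
import wandb

# Placeholder entity; the project and run name are the ones listed above.
api = wandb.Api()
run = api.runs(
    "YOUR_ENTITY/trm-arc2",
    filters={"display_name": "trm_arc2_8gpu_resume_step115815_plus100k_v2"},
)[0]

# The earliest logged step should be at or above the 115 815 resume point.
rows = run.history(keys=["train/lm_loss"], pandas=False)
first_step = min(row["_step"] for row in rows)
print(first_step)
assert first_step >= 115_815, "resume guard did not advance train_state.step"
```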
Inference & Reproduction
```python
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download("seconds0/trm-arc2-8gpu", "model.ckpt")
state = torch.load(ckpt_path, map_location="cpu")
print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
```
To recreate the CoreWeave launch:
```bash
kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
# Ensure ConfigMaps trm-common-script, trm-pyshim-cm, and trm-eval-overlay are applied first.
```
Before submitting jobs, verify:

- `RESUME_CHECKPOINT_PATH` points to the 115 815 shard.
- `[resume] initializing train_state.step to 115815` appears once training boots (a log check sketch follows this list).
- The first W&B point is ≥ 115 815 with `train/lm_loss` ≈ 0.13.
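A minimal sketch of the log check, run from outside the pod. The job name here is a placeholder inferred from the manifest filename; substitute the actual `metadata.name`:

```python
import subprocess

# Placeholder job name; use the metadata.name defined in trm-train-8gpu-resume.yaml.
logs = subprocess.run(
    ["kubectl", "logs", "job/trm-train-8gpu-resume", "--all-containers=true"],
    capture_output=True, text=True, check=True,
).stdout

guard_line = "[resume] initializing train_state.step to 115815"
print(guard_line in logs)  # expected: True once training has booted
```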
Known Gaps & Next Steps
- Sampler instability – Deduplicate sampler outputs before retrying Kaggle submissions (a duplicate-check sketch follows this list).
- Identifier remapping – Remains legacy-only; switching to sorted identifiers requires remapping or finetuning.
- W&B rehydration – Set `WANDB_API_KEY` locally if you need fresh metrics; the release ships cached configs only.
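As a starting point for the sampler fix, here is a minimal sketch that flags duplicated attempts in a submission file before upload. It assumes the public Kaggle ARC Prize layout and a file named `submission.json`; adjust the keys if your notebook writes a different structure:

```python
import json

# Assumed layout: {task_id: [{"attempt_1": grid, "attempt_2": grid}, ...]}.
with open("submission.json") as f:
    submission = json.load(f)

duplicated = [
    task_id
    for task_id, outputs in submission.items()
    if any(out.get("attempt_1") == out.get("attempt_2") for out in outputs)
]

# The Kaggle run described above reported 259/259 duplicate attempts; a fixed sampler
# should drive this count toward zero before resubmitting.
print(f"{len(duplicated)}/{len(submission)} tasks have identical attempt_1 and attempt_2")
```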
Please cite the Tiny Recursive Models paper and ARC Prize 2025 when using this checkpoint. Contributions, bug reports, and sampler fixes are welcome via the repository issues.
Evaluation results
- ARC Task Solve Rate (pass@1) on ARC Prize 2025 Public Evaluation: 0.008 (self-reported)
- ARC Task Solve Rate (pass@2) on ARC Prize 2025 Public Evaluation: 0.008 (self-reported)
- ARC Task Solve Rate (pass@10) on ARC Prize 2025 Public Evaluation: 0.008 (self-reported)
- ARC Task Solve Rate (pass@100) on ARC Prize 2025 Public Evaluation: 0.008 (self-reported)