Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)
What’s new (Nov 2025). This refresh publishes the best-performing checkpoint from the CoreWeave resume campaign—trm_arc2_8gpu_resume_step115815_plus100k_v2 at global step 119 432. The job resumed from TinyRecursiveModels commit e7b68717 with the full resume guard stack (trm-common-script + trm-pyshim) and legacy ARC identifier mapping. This is the same checkpoint we attempted to ship to Kaggle; the submission stalled at 0.83 % pass@1 because every task duplicated attempts, so we are documenting the shortfall here instead of claiming leaderboard progress.
Why the name mentions 119 434. Internal tracking labelled this snapshot “step 119 434”, but the persisted shard on the CoreWeave PVC is `step_119432`. The W&B records for the run confirm that the resume guard initialized at the expected 115 815 step and advanced to the 119k block; no 119 434 shard survived the routine pruning. When downstream tooling expects the 119 434 identifier, point it at this artifact and note the two-step discrepancy.
Checkpoint Snapshot
- Run name: `trm_arc2_8gpu_resume_step115815_plus100k_v2`
- Global step: 119 432 (3 617 optimizer updates after the 115 815 resume point)
- Architecture: Tiny Recursive Model ACT V1 (`L_layers=2`, `H_cycles=3`, `L_cycles=4`, hidden size 512, 8 heads, RoPE, bfloat16 activations)
- Optimizer: Adam-atan2 (`beta1=0.9`, `beta2=0.95`, `weight_decay=0.1`, EMA 0.999, global batch size 768)
- Dataset builder: Legacy identifier order (`dataset/build_arc_dataset_legacy.py`) targeting `arc2concept-aug-1000`
- Resume provenance: `RESUME_CHECKPOINT_PATH` → `/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`, `RESUME_EXPECTED_STEP` → `115815`; `[resume] initializing train_state.step to 115815` appears in pod logs before training continues (see the sketch after this list).
- PVC retention: Latest PVC shards now extend to `step_662428`; earlier 119k shards were pruned after packaging this export.
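For context, here is a minimal sketch of the contract the resume guard enforces, reconstructed from the environment variables and log line above. This is an illustration only, not the upstream TinyRecursiveModels implementation.

```python
import os

# Derive the step encoded in the checkpoint path, e.g. ".../step_115815" -> 115815.
resume_path = os.environ["RESUME_CHECKPOINT_PATH"]
expected_step = int(os.environ["RESUME_EXPECTED_STEP"])

step_from_path = int(os.path.basename(resume_path).rsplit("_", 1)[-1])
assert step_from_path == expected_step, (
    f"checkpoint step {step_from_path} != RESUME_EXPECTED_STEP {expected_step}"
)

# The real training loop then seeds its step counter from this value and emits the guard line.
print(f"[resume] initializing train_state.step to {expected_step}")
```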
Files Included
| Path | Description |
|---|---|
| `model.ckpt` | Consolidated PyTorch checkpoint (optimizer, EMA, and weights) containing `step_119432/*` tensors. SHA-256: `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`. |
| `COMMANDS.txt` / `COMMANDS_resumed.txt` | Torch distributed launch (8 × H200) showing the resume flags and dataset path. |
| `ENVIRONMENT.txt` | Hydra-resolved configuration captured on CoreWeave after overlays. |
| `MANIFEST.txt` | Packaging metadata (checkpoint step, source path, timestamp, sha256). |
| `TRM_COMMIT.txt` | Upstream TinyRecursiveModels Git SHA (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`). |
| `all_config.yaml` | Structured config snapshot exported alongside the checkpoint. |
| `dataset-metadata.json` | Kaggle dataset manifest (kept for parity with previous releases). |
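Before loading, you can check a download against the SHA-256 listed in the table. A minimal sketch, assuming `model.ckpt` sits in the working directory:

```python
import hashlib

EXPECTED_SHA256 = "2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e"

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

assert sha256_of("model.ckpt") == EXPECTED_SHA256, "model.ckpt does not match MANIFEST.txt"
```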
Evaluation Status
- Validation (CoreWeave pod evaluator, legacy mapping): pass@1 = 0.83 %, with identical scores for pass@2/5/10/100 because the samples were duplicates. Mean token accuracy ≈ 70.1 %, `train/lm_loss` ≈ 0.134 at resume, `all/lm_loss` ≈ 1.56.
- Kaggle inference notebook (test split): also produced 259/259 duplicate attempts, yielding 0.83 % pass@1 and no leaderboard improvement. The issue remains unresolved; do not submit this checkpoint to Kaggle until the sampler divergence is fixed.
- Copy-mode diagnostics (`scripts/debug_eval_cpu.py` in legacy mode): 0/120 grid matches (consistent with earlier baselines).
The metrics bundled here are sufficient to reproduce our internal dashboards without requiring live W&B access. If you have Weights & Biases credentials, the run is listed under trm_arc2_8gpu_resume_step115815_plus100k_v2 in project trm-arc2; the first logged step after resume exceeds 115 815, confirming the guard executed.
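If you do have credentials, here is a minimal sketch of that check. The entity name is a placeholder, and `history()` returns a sampled subset of points, so treat this as an illustration rather than the project's tooling:

```python
import wandb

# Placeholder entity; the project and run name are the ones listed above.
api = wandb.Api()
run = api.runs(
    "YOUR_ENTITY/trm-arc2",
    filters={"display_name": "trm_arc2_8gpu_resume_step115815_plus100k_v2"},
)[0]

# The earliest logged step should be at or above the 115 815 resume point.
rows = run.history(keys=["train/lm_loss"], pandas=False)
first_step = min(row["_step"] for row in rows)
print(first_step)
assert first_step >= 115_815, "resume guard did not advance train_state.step"
```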
Inference & Reproduction
```python
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download("seconds0/trm-arc2-8gpu", "model.ckpt")
state = torch.load(ckpt_path, map_location="cpu")
print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
```
To recreate the CoreWeave launch:
```bash
kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
# Ensure ConfigMaps trm-common-script, trm-pyshim-cm, and trm-eval-overlay are applied first.
```
Before submitting jobs, verify:

- `RESUME_CHECKPOINT_PATH` points to the 115 815 shard.
- `[resume] initializing train_state.step to 115815` appears once training boots (a log check sketch follows this list).
- The first W&B point is ≥ 115 815 with `train/lm_loss` ≈ 0.13.
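A minimal sketch of the log check, run from outside the pod. The job name here is a placeholder inferred from the manifest filename; substitute the actual `metadata.name`:

```python
import subprocess

# Placeholder job name; use the metadata.name defined in trm-train-8gpu-resume.yaml.
logs = subprocess.run(
    ["kubectl", "logs", "job/trm-train-8gpu-resume", "--all-containers=true"],
    capture_output=True, text=True, check=True,
).stdout

guard_line = "[resume] initializing train_state.step to 115815"
print(guard_line in logs)  # expected: True once training has booted
```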
Known Gaps & Next Steps
- Sampler instability – Deduplicate sampler outputs before retrying Kaggle submissions (a duplicate-check sketch follows this list).
- Identifier remapping – Remains legacy-only; switching to sorted identifiers requires remapping or finetuning.
- W&B rehydration – Set `WANDB_API_KEY` locally if you need fresh metrics; the release ships cached configs only.
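As a starting point for the sampler fix, here is a minimal sketch that flags duplicated attempts in a submission file before upload. It assumes the public Kaggle ARC Prize layout and a file named `submission.json`; adjust the keys if your notebook writes a different structure:

```python
import json

# Assumed layout: {task_id: [{"attempt_1": grid, "attempt_2": grid}, ...]}.
with open("submission.json") as f:
    submission = json.load(f)

duplicated = [
    task_id
    for task_id, outputs in submission.items()
    if any(out.get("attempt_1") == out.get("attempt_2") for out in outputs)
]

# The Kaggle run described above reported 259/259 duplicate attempts; a fixed sampler
# should drive this count toward zero before resubmitting.
print(f"{len(duplicated)}/{len(submission)} tasks have identical attempt_1 and attempt_2")
```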
Please cite the Tiny Recursive Models paper and ARC Prize 2025 when using this checkpoint. Contributions, bug reports, and sampler fixes are welcome via the repository issues.
Evaluation results
- ARC Task Solve Rate (pass@1) on ARC Prize 2025 Public Evaluation: 0.008 (self-reported)
- ARC Task Solve Rate (pass@2) on ARC Prize 2025 Public Evaluation: 0.008 (self-reported)
- ARC Task Solve Rate (pass@10) on ARC Prize 2025 Public Evaluation: 0.008 (self-reported)
- ARC Task Solve Rate (pass@100) on ARC Prize 2025 Public Evaluation: 0.008 (self-reported)