---
library_name: pytorch
license: mit
pipeline_tag: other
tags:
- arc-prize-2025
- program-synthesis
- tiny-recursive-models
- recursive-reasoning
- resume-training
- act
- reproducibility
datasets:
- arc-prize-2025
model-index:
- name: Tiny Recursive Models — ARC-AGI-2 (Resume Step 119432)
  results:
  - task:
      type: program-synthesis
      name: ARC Prize 2025 (legacy evaluation mapping)
    dataset:
      name: ARC Prize 2025 Public Evaluation
      type: arc-prize-2025
      split: evaluation
    metrics:
    - type: accuracy
      name: ARC Task Solve Rate (pass@1)
      value: 0.0083
    - type: accuracy
      name: ARC Task Solve Rate (pass@2)
      value: 0.0083
    - type: accuracy
      name: ARC Task Solve Rate (pass@10)
      value: 0.0083
    - type: accuracy
      name: ARC Task Solve Rate (pass@100)
      value: 0.0083
---
# Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)

## Overview
- 8× H200 resume snapshot at global step 119 432 from run `trm_arc2_8gpu_resume_step115815_plus100k_v2` (TinyRecursiveModels commit `e7b68717`).
- Consolidated `model.ckpt` with accompanying configuration and provenance files. Integrity hash: `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`.
- Current evaluation on the ARC public evaluation split is approximately 0.83% pass@1, attributable to duplicate candidate generation in our evaluation pipeline after the resume. This release is intended for reproducibility and analysis rather than leaderboard submissions.
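To confirm a download matches the published integrity hash, a minimal check (assuming, per `MANIFEST.txt`, that the sha256 above covers the consolidated `model.ckpt`):

```python
import hashlib

from huggingface_hub import hf_hub_download

EXPECTED_SHA256 = "2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e"

ckpt_path = hf_hub_download("seconds-0/trm-arc2-8gpu", "model.ckpt")

digest = hashlib.sha256()
with open(ckpt_path, "rb") as f:
    # Stream in 1 MiB chunks so a multi-GB checkpoint never sits in memory whole.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        digest.update(chunk)

assert digest.hexdigest() == EXPECTED_SHA256, "checkpoint hash mismatch"
```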
## About “119 434” vs “119 432”
- Internal tracking referenced “step 119 434”; the persistent shard is `step_119432`. W&B logs confirm the resume guard initialized at step 115 815 and advanced into the 119k block. No distinct 119 434 shard remains.
## Contents
- `model.ckpt` — Consolidated PyTorch checkpoint (weights, optimizer, EMA) reflecting `step_119432/*`.
- `COMMANDS.txt`, `COMMANDS_resumed.txt` — Exact `torchrun` invocations (8× H200) and resume parameters.
- `ENVIRONMENT.txt`, `all_config.yaml` — Hydra-resolved configurations captured on the training pod.
- `MANIFEST.txt` — Packaging metadata (step, source path, timestamp, sha256).
- `TRM_COMMIT.txt` — Upstream TinyRecursiveModels commit (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`).
- `dataset-metadata.json` — Kaggle packaging manifest (legacy identifier mapping).
- W&B CSV/summary files are included for convenience.
## Kaggle Status
- This checkpoint is not accompanied by a validated Kaggle submission from our pipeline. In our evaluation path, inference produced duplicate candidates across attempts, yielding ≈0.83% pass@1 on the public evaluation split.
- A third‑party Kaggle implementation demonstrating TRM inference is available on Kaggle; interested users can locate it by searching the platform. That implementation is independent of this release.
## Current Kaggle Limitations
- Unique‑candidate enforcement: A production overlay inadvertently bypassed the uniqueness filter, allowing duplicates even with `ARC_SAMPLING_COUNT > 1`.
- Sampler configuration propagation: Variables such as `ARC_SAMPLING_COUNT`, `ARC_SAMPLING_MODE=sample`, and temperature did not reliably reach the evaluator under our overlays.
- Sampling degeneracy: Under misconfiguration, multinomial sampling collapsed to identical outputs across attempts.
- Limited attempt‑level telemetry: Evaluators emitted aggregate metrics only; per‑attempt logits and strings were not retained due to GPU runtime constraints and Kaggle logging limits. This prevents visualization of candidate selection and diversity.
- Identifier mapping: The checkpoint was trained with the legacy identifier mapping, whereas some evaluators assume the sorted mapping. Without remapping or a compatibility layer, comparisons can be brittle.
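The degeneracy failure mode is easy to reproduce outside the evaluator. In the sketch below (illustrative code, not the TRM evaluation path), a flag standing in for a missing `ARC_SAMPLING_MODE=sample` forces the greedy branch, and all attempts collapse to the argmax:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(32)  # stand-in for per-cell output logits

def sample_attempt(logits: torch.Tensor, temperature: float, greedy: bool) -> int:
    # If the sampling mode never reaches the evaluator, every attempt
    # falls through to the greedy branch and returns the same candidate.
    if greedy:
        return int(logits.argmax())
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

greedy_unique = {sample_attempt(logits, 1.0, greedy=True) for _ in range(10)}
sampled_unique = {sample_attempt(logits, 1.0, greedy=False) for _ in range(10)}
print(len(greedy_unique), len(sampled_unique))  # 1 unique vs. several unique
```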
## Attempted Mitigations
- Relaunched controlled evaluator pods with explicit sampling parameters and verified resume‑guard logs; duplicates persisted.
- Instrumented CPU/GPU debug evaluators (`scripts/debug_eval_cpu.py --samples 8 --log-attempts`); candidate diversity remained near zero.
- Adjusted temperature and top‑k settings; no material improvement under the broken overlay path.
- Prepared Kaggle datasets/notebooks and executed end‑to‑end; duplicate attempts persisted and no leaderboard submission was made.
## Evaluation Guidance
- Enforce a strict unique‑candidate guard prior to scoring (see the sketch after this list).
- Validate that sampling environment variables propagate into the inference process.
- Capture per‑attempt outputs and, where possible, logits to diagnose diversity.
- Match the identifier mapping to the checkpoint (legacy vs sorted) and use the corresponding dataset builder.
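A minimal sketch of the unique‑candidate guard; the helper name is illustrative, and it assumes candidate grids are serialized to comparable strings before scoring:

```python
def unique_candidates(attempts: list[str], k: int) -> list[str]:
    """Deduplicate attempts in order, keeping at most k distinct candidates."""
    seen: set[str] = set()
    unique: list[str] = []
    for cand in attempts:
        if cand not in seen:
            seen.add(cand)
            unique.append(cand)
        if len(unique) == k:
            break
    return unique

# With duplicate generation, pass@k silently degrades to pass@1, which is
# exactly the flat 0.83% pattern seen across pass@1 through pass@100 above.
print(unique_candidates(["grid_a", "grid_a", "grid_a", "grid_b"], k=2))
# ['grid_a', 'grid_b']
```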
## Reproduction
```python
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download("seconds-0/trm-arc2-8gpu", "model.ckpt")
state = torch.load(ckpt_path, map_location="cpu")
print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
```
```bash
# CoreWeave resume (reference)
kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
# Requires configmaps: trm-common-script, trm-pyshim-cm, trm-eval-overlay-cm
```
## Resume Guard Signals
- `RESUME_CHECKPOINT_PATH=/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`
- `RESUME_EXPECTED_STEP=115815`
- Pod logs contain `[resume] initializing train_state.step to 115815` before training proceeds.
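A convenience check for the guard line in a captured pod log; the expected string is the one quoted above, while `pod.log` is a hypothetical local copy:

```python
from pathlib import Path

EXPECTED = "[resume] initializing train_state.step to 115815"

log_text = Path("pod.log").read_text()  # hypothetical local copy of the pod log
if EXPECTED not in log_text:
    raise RuntimeError("resume guard line not found; check RESUME_* variables")
print("resume guard OK")
```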
## Known Issues and Next Steps
- Duplicate candidate generation: restore uniqueness enforcement, validate environment propagation, and re‑verify multinomial sampling.
- Identifier mapping mismatch: this release remains “legacy”; sorted‑mapping evaluations require remapping or fine‑tuning.
- Observability: add candidate‑level telemetry and visualizations to support reliable evaluation claims.
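For the mapping mismatch, the general shape of a compatibility layer is a permutation between the two orderings of task identifiers. The sketch below is purely illustrative: it assumes both mappings cover the same identifier set and differ only in ordering, which should be verified against the actual dataset builders:

```python
def build_remap(legacy_order: list[str], sorted_order: list[str]) -> dict[int, int]:
    """Map a class index under the legacy ordering to its sorted-order index."""
    sorted_index = {ident: i for i, ident in enumerate(sorted_order)}
    return {i: sorted_index[ident] for i, ident in enumerate(legacy_order)}

legacy = ["task_c", "task_a", "task_b"]      # hypothetical legacy ordering
print(build_remap(legacy, sorted(legacy)))   # {0: 2, 1: 0, 2: 1}
```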
## Ethics, License, and Intended Use
- MIT license. Intended for research and educational use. Users should independently validate evaluation protocols and candidate diversity prior to reporting results.
## Acknowledgements
- Built on the Tiny Recursive Models codebase and evaluated against ARC Prize 2025 materials. A community‑maintained Kaggle implementation demonstrates a functioning TRM inference pipeline and can be located via search on Kaggle.