---
library_name: pytorch
license: mit
pipeline_tag: other
tags:
  - arc-prize-2025
  - program-synthesis
  - tiny-recursive-models
  - recursive-reasoning
  - resume-training
  - act
  - reproducibility
datasets:
  - arc-prize-2025
model-index:
  - name: Tiny Recursive Models — ARC-AGI-2 (Resume Step 119432)
    results:
      - task:
          type: program-synthesis
          name: ARC Prize 2025 (legacy evaluation mapping)
        dataset:
          name: ARC Prize 2025 Public Evaluation
          type: arc-prize-2025
          split: evaluation
        metrics:
          - type: accuracy
            name: ARC Task Solve Rate (pass@1)
            value: 0.0083
          - type: accuracy
            name: ARC Task Solve Rate (pass@2)
            value: 0.0083
          - type: accuracy
            name: ARC Task Solve Rate (pass@10)
            value: 0.0083
          - type: accuracy
            name: ARC Task Solve Rate (pass@100)
            value: 0.0083
---

# Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)

## Overview

- 8× H200 resume snapshot at global step 119 432 from run `trm_arc2_8gpu_resume_step115815_plus100k_v2` (TinyRecursiveModels commit `e7b68717`).
- Consolidated `model.ckpt` with accompanying configuration and provenance files. Integrity hash (sha256): `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`.
- Current evaluation on the ARC public evaluation split is approximately 0.83% pass@1, attributable to duplicate candidate generation in our evaluation pipeline after the resume. This release is intended for reproducibility and analysis rather than leaderboard submissions.
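
Before analyzing the checkpoint, it is worth verifying the download against the sha256 listed above. A minimal sketch (the helper name `sha256sum` is ours, not part of the release):

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a multi-GB checkpoint never sits fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e"
# assert sha256sum("model.ckpt") == EXPECTED, "checkpoint does not match MANIFEST.txt"
```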

## About “119 434” vs “119 432”

- Internal tracking referenced “step 119 434”; the persistent shard is `step_119432`. W&B logs confirm the resume guard initialized at step 115 815 and advanced into the 119k block. No distinct 119 434 shard remains.

## Contents

- `model.ckpt` — Consolidated PyTorch checkpoint (weights, optimizer, EMA) reflecting `step_119432/*`.
- `COMMANDS.txt`, `COMMANDS_resumed.txt` — Exact `torchrun` invocations (8× H200) and resume parameters.
- `ENVIRONMENT.txt`, `all_config.yaml` — Hydra-resolved configurations captured on the training pod.
- `MANIFEST.txt` — Packaging metadata (step, source path, timestamp, sha256).
- `TRM_COMMIT.txt` — Upstream TinyRecursiveModels commit (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`).
- `dataset-metadata.json` — Kaggle packaging manifest (legacy identifier mapping). W&B CSV/summary files are included for convenience.

## Kaggle Status

- This checkpoint is not accompanied by a validated Kaggle submission from our pipeline. In our evaluation path, inference produced duplicate candidates across attempts, yielding ≈0.83% pass@1 on the public evaluation split.
- A third-party Kaggle implementation demonstrating TRM inference exists and can be located by searching the platform. That implementation is independent of this release.

## Current Kaggle Limitations

- **Unique-candidate enforcement:** A production overlay inadvertently bypassed the uniqueness filter, allowing duplicates even with `ARC_SAMPLING_COUNT > 1`.
- **Sampler configuration propagation:** Variables such as `ARC_SAMPLING_COUNT`, `ARC_SAMPLING_MODE=sample`, and the sampling temperature did not reliably reach the evaluator under our overlays.
- **Sampling degeneracy:** Under misconfiguration, multinomial sampling collapsed to identical outputs across attempts.
- **Limited attempt-level telemetry:** Evaluators emitted aggregate metrics only; per-attempt logits and strings were not retained due to GPU runtime constraints and Kaggle logging limits. This prevents visualization of candidate selection and diversity.
- **Identifier mapping:** The checkpoint was trained with the legacy identifier mapping, whereas some evaluators assume the sorted mapping. Without remapping or a compatibility layer, comparisons can be brittle.
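
The uniqueness filter that the overlay bypassed can be sketched as an order-preserving dedup applied before scoring (function name and signature are illustrative, not the evaluator's actual API):

```python
def enforce_unique_candidates(candidates: list[str], max_attempts: int) -> list[str]:
    """Keep the first occurrence of each candidate, then cap at the attempt budget.

    With this guard bypassed, N identical attempts score exactly like one,
    which is consistent with the flat pass@k numbers reported above.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for cand in candidates:
        if cand not in seen:
            seen.add(cand)
            unique.append(cand)
    return unique[:max_attempts]
```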

## Attempted Mitigations

- Relaunched controlled evaluator pods with explicit sampling parameters and verified resume-guard logs; duplicates persisted.
- Instrumented CPU/GPU debug evaluators (`scripts/debug_eval_cpu.py --samples 8 --log-attempts`); candidate diversity remained near zero.
- Adjusted temperature and top-k settings; no material improvement under the broken overlay path.
- Prepared Kaggle datasets/notebooks and executed end-to-end; duplicate attempts persisted and no leaderboard submission was made.
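
“Diversity near zero” can be quantified with a simple unique-fraction metric over the attempts for one task (our own illustrative measure, not output of the debug evaluator):

```python
def candidate_diversity(attempts: list[str]) -> float:
    """Fraction of distinct outputs across attempts.

    1.0 means every attempt differed; 1/len(attempts) means total collapse
    to a single repeated candidate (the failure mode observed here).
    """
    if not attempts:
        return 0.0
    return len(set(attempts)) / len(attempts)
```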

## Evaluation Guidance

- Enforce a strict unique-candidate guard prior to scoring.
- Validate that sampling environment variables propagate into the inference process.
- Capture per-attempt outputs and, where possible, logits to diagnose diversity.
- Match the identifier mapping to the checkpoint (legacy vs. sorted) and use the corresponding dataset builder.
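
One way to validate propagation is to parse the sampling variables inside the inference process itself and fail fast on an inconsistent combination. A sketch under assumptions: `ARC_SAMPLING_TEMPERATURE` is a name we invented for the temperature variable, and the defaults below are guesses, not the evaluator's actual behavior.

```python
import os

def read_sampling_config(env=os.environ) -> dict:
    """Read sampling settings exactly as the running evaluator process sees them."""
    return {
        "count": int(env.get("ARC_SAMPLING_COUNT", "1")),
        "mode": env.get("ARC_SAMPLING_MODE", "argmax"),
        # Hypothetical variable name; the real temperature setting may differ.
        "temperature": float(env.get("ARC_SAMPLING_TEMPERATURE", "1.0")),
    }

def assert_sampling_config(cfg: dict) -> None:
    """Fail fast instead of silently producing identical attempts."""
    if cfg["count"] > 1 and cfg["mode"] != "sample":
        raise RuntimeError(
            f"{cfg['count']} attempts requested but mode={cfg['mode']!r}; "
            "every attempt would be identical"
        )
```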

## Reproduction

```python
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download("seconds-0/trm-arc2-8gpu", "model.ckpt")
state = torch.load(ckpt_path, map_location="cpu")
print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
```

```shell
# CoreWeave resume (reference)
kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
# Requires configmaps: trm-common-script, trm-pyshim-cm, trm-eval-overlay-cm
```

## Resume Guard Signals

- `RESUME_CHECKPOINT_PATH=/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`
- `RESUME_EXPECTED_STEP=115815`
- Pod logs contain `[resume] initializing train_state.step to 115815` before training proceeds.

## Known Issues and Next Steps

- Duplicate candidate generation: restore uniqueness enforcement, validate environment propagation, and re-verify multinomial sampling.
- Identifier mapping mismatch: this release remains “legacy”; sorted-mapping evaluations require remapping or fine-tuning.
- Observability: add candidate-level telemetry and visualizations to support reliable evaluation claims.
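
Re-verifying multinomial sampling can start from a tiny reference implementation: with a positive temperature and non-degenerate logits, repeated draws must produce more than one distinct index. This is a standalone sketch in pure Python, not the evaluator's sampler.

```python
import math
import random

def sample_token(logits: list[float], temperature: float = 1.0, rng=random) -> int:
    """Temperature-scaled multinomial draw over a logit vector."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# A collapsed sampler would return the same index on every call; a healthy
# one spreads draws across indices for near-uniform logits like these.
draws = {sample_token([0.0, 0.1, 0.2]) for _ in range(200)}
```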

## Ethics, License, and Intended Use

MIT license. Intended for research and educational use. Users should independently validate evaluation protocols and candidate diversity before reporting results.

## Acknowledgements

Built on the Tiny Recursive Models codebase and evaluated against ARC Prize 2025 materials. A community-maintained Kaggle implementation demonstrates a functioning TRM inference pipeline and can be located via search on Kaggle.