---
library_name: pytorch
license: mit
pipeline_tag: other
tags:
  - arc-prize-2025
  - program-synthesis
  - tiny-recursive-models
  - recursive-reasoning
  - resume-training
  - act
  - reproducibility
datasets:
  - arc-prize-2025
model-index:
  - name: Tiny Recursive Models — ARC-AGI-2 (Resume Step 119432)
    results:
      - task:
          type: program-synthesis
          name: ARC Prize 2025 (legacy evaluation mapping)
        dataset:
          name: ARC Prize 2025 Public Evaluation
          type: arc-prize-2025
          split: evaluation
        metrics:
          - type: accuracy
            name: ARC Task Solve Rate (pass@1)
            value: 0.0083
          - type: accuracy
            name: ARC Task Solve Rate (pass@2)
            value: 0.0083
          - type: accuracy
            name: ARC Task Solve Rate (pass@10)
            value: 0.0083
          - type: accuracy
            name: ARC Task Solve Rate (pass@100)
            value: 0.0083
---

# Tiny Recursive Models — ARC-AGI-2 (8× H200 Resume, Step 119 432)

## Overview

- 8× H200 resume snapshot at global step 119 432 from run `trm_arc2_8gpu_resume_step115815_plus100k_v2` (TinyRecursiveModels commit `e7b68717`).
- Consolidated `model.ckpt` with accompanying configuration and provenance files. Integrity hash (sha256): `2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e`.
- Current evaluation on the ARC public evaluation split is approximately 0.83% pass@1, attributable to duplicate candidate generation in our evaluation pipeline after the resume. This release is intended for reproducibility and analysis rather than leaderboard submissions.
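
Before analyzing the checkpoint, it is worth verifying the download against the sha256 listed above. A minimal sketch (the helper name `sha256sum` is ours, not part of the release):

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a multi-GB checkpoint never sits fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "2bc8bb3a5a85cd73e169a6fd285f9138427db894bd157edc20e92a58ed8ee33e"
# assert sha256sum("model.ckpt") == EXPECTED, "checkpoint does not match MANIFEST.txt"
```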

## About “119 434” vs “119 432”

- Internal tracking referenced “step 119 434”; the persistent shard is `step_119432`. W&B logs confirm the resume guard initialized at step 115 815 and advanced into the 119k block. No distinct 119 434 shard remains.

## Contents

- `model.ckpt` — Consolidated PyTorch checkpoint (weights, optimizer, EMA) reflecting `step_119432/*`.
- `COMMANDS.txt`, `COMMANDS_resumed.txt` — Exact `torchrun` invocations (8× H200) and resume parameters.
- `ENVIRONMENT.txt`, `all_config.yaml` — Hydra-resolved configurations captured on the training pod.
- `MANIFEST.txt` — Packaging metadata (step, source path, timestamp, sha256).
- `TRM_COMMIT.txt` — Upstream TinyRecursiveModels commit (`e7b68717f0a6c4cbb4ce6fbef787b14f42083bd9`).
- `dataset-metadata.json` — Kaggle packaging manifest (legacy identifier mapping). W&B CSV/summary files are included for convenience.

## Kaggle Status

- This checkpoint is not accompanied by a validated Kaggle submission from our pipeline. In our evaluation path, inference produced duplicate candidates across attempts, yielding ≈0.83% pass@1 on the public evaluation split.
- A third-party Kaggle implementation demonstrating TRM inference exists and can be located by searching the platform. That implementation is independent of this release.

## Current Kaggle Limitations

- **Unique-candidate enforcement:** A production overlay inadvertently bypassed the uniqueness filter, allowing duplicates even with `ARC_SAMPLING_COUNT > 1`.
- **Sampler configuration propagation:** Variables such as `ARC_SAMPLING_COUNT`, `ARC_SAMPLING_MODE=sample`, and the sampling temperature did not reliably reach the evaluator under our overlays.
- **Sampling degeneracy:** Under misconfiguration, multinomial sampling collapsed to identical outputs across attempts.
- **Limited attempt-level telemetry:** Evaluators emitted aggregate metrics only; per-attempt logits and strings were not retained due to GPU runtime constraints and Kaggle logging limits. This prevents visualization of candidate selection and diversity.
- **Identifier mapping:** The checkpoint was trained with the legacy identifier mapping, whereas some evaluators assume the sorted mapping. Without remapping or a compatibility layer, comparisons can be brittle.
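
The uniqueness filter that the overlay bypassed can be sketched as an order-preserving dedup applied before scoring (function name and signature are illustrative, not the evaluator's actual API):

```python
def enforce_unique_candidates(candidates: list[str], max_attempts: int) -> list[str]:
    """Keep the first occurrence of each candidate, then cap at the attempt budget.

    With this guard bypassed, N identical attempts score exactly like one,
    which is consistent with the flat pass@k numbers reported above.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for cand in candidates:
        if cand not in seen:
            seen.add(cand)
            unique.append(cand)
    return unique[:max_attempts]
```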

## Attempted Mitigations

- Relaunched controlled evaluator pods with explicit sampling parameters and verified resume-guard logs; duplicates persisted.
- Instrumented CPU/GPU debug evaluators (`scripts/debug_eval_cpu.py --samples 8 --log-attempts`); candidate diversity remained near zero.
- Adjusted temperature and top-k settings; no material improvement under the broken overlay path.
- Prepared Kaggle datasets/notebooks and executed end-to-end; duplicate attempts persisted and no leaderboard submission was made.
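
“Diversity near zero” can be quantified with a simple unique-fraction metric over the attempts for one task (our own illustrative measure, not output of the debug evaluator):

```python
def candidate_diversity(attempts: list[str]) -> float:
    """Fraction of distinct outputs across attempts.

    1.0 means every attempt differed; 1/len(attempts) means total collapse
    to a single repeated candidate (the failure mode observed here).
    """
    if not attempts:
        return 0.0
    return len(set(attempts)) / len(attempts)
```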

## Evaluation Guidance

- Enforce a strict unique-candidate guard prior to scoring.
- Validate that sampling environment variables propagate into the inference process.
- Capture per-attempt outputs and, where possible, logits to diagnose diversity.
- Match the identifier mapping to the checkpoint (legacy vs. sorted) and use the corresponding dataset builder.
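
One way to validate propagation is to parse the sampling variables inside the inference process itself and fail fast on an inconsistent combination. A sketch under assumptions: `ARC_SAMPLING_TEMPERATURE` is a name we invented for the temperature variable, and the defaults below are guesses, not the evaluator's actual behavior.

```python
import os

def read_sampling_config(env=os.environ) -> dict:
    """Read sampling settings exactly as the running evaluator process sees them."""
    return {
        "count": int(env.get("ARC_SAMPLING_COUNT", "1")),
        "mode": env.get("ARC_SAMPLING_MODE", "argmax"),
        # Hypothetical variable name; the real temperature setting may differ.
        "temperature": float(env.get("ARC_SAMPLING_TEMPERATURE", "1.0")),
    }

def assert_sampling_config(cfg: dict) -> None:
    """Fail fast instead of silently producing identical attempts."""
    if cfg["count"] > 1 and cfg["mode"] != "sample":
        raise RuntimeError(
            f"{cfg['count']} attempts requested but mode={cfg['mode']!r}; "
            "every attempt would be identical"
        )
```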

## Reproduction

```python
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download("seconds-0/trm-arc2-8gpu", "model.ckpt")
state = torch.load(ckpt_path, map_location="cpu")
print(state["hyperparameters"]["arch"]["hidden_size"])  # 512
```

```shell
# CoreWeave resume (reference)
kubectl apply -f infra/kubernetes/trm-train-8gpu-resume.yaml
# Requires configmaps: trm-common-script, trm-pyshim-cm, trm-eval-overlay-cm
```

## Resume Guard Signals

- `RESUME_CHECKPOINT_PATH=/workspace/TinyRecursiveModels/checkpoints/Arc2concept-aug-1000-ACT-torch/trm_arc2_8gpu_resume_plus100k/step_115815`
- `RESUME_EXPECTED_STEP=115815`
- Pod logs contain `[resume] initializing train_state.step to 115815` before training proceeds.

## Known Issues and Next Steps

- Duplicate candidate generation: restore uniqueness enforcement, validate environment propagation, and re-verify multinomial sampling.
- Identifier mapping mismatch: this release remains “legacy”; sorted-mapping evaluations require remapping or fine-tuning.
- Observability: add candidate-level telemetry and visualizations to support reliable evaluation claims.
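
Re-verifying multinomial sampling can start from a tiny reference implementation: with a positive temperature and non-degenerate logits, repeated draws must produce more than one distinct index. This is a standalone sketch in pure Python, not the evaluator's sampler.

```python
import math
import random

def sample_token(logits: list[float], temperature: float = 1.0, rng=random) -> int:
    """Temperature-scaled multinomial draw over a logit vector."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# A collapsed sampler would return the same index on every call; a healthy
# one spreads draws across indices for near-uniform logits like these.
draws = {sample_token([0.0, 0.1, 0.2]) for _ in range(200)}
```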

## Ethics, License, and Intended Use

MIT license. Intended for research and educational use. Users should independently validate evaluation protocols and candidate diversity before reporting results.

## Acknowledgements

Built on the Tiny Recursive Models codebase and evaluated against ARC Prize 2025 materials. A community-maintained Kaggle implementation demonstrates a functioning TRM inference pipeline and can be located via search on Kaggle.