Model Card for so101_orange_pick_gr00tn1.5_model

NVIDIA Isaac GR00T N1.5 model fine-tuned on the so101_orange_pick dataset for experimenting with the LeRobot SO-101 robot arm in a dual-camera setup.

The video preview was generated from training data in the dataset.

Model Details

Model Description

This model is a version of NVIDIA's GR00T N1.5 fine-tuned on the so101_orange_pick dataset. The model is relevant in the context of LightwheelAI's LeIsaac, where it addresses the standard task "LeIsaac-SO101-PickOrange-v0". In this task, an SO-ARM101 robot arm picks up three oranges from a table and places them in a bowl, one after another. The robot is equipped with a front camera and a wrist camera.

Uses

The model is intended for researchers and hobbyists who would like to experiment with LeRobot, GR00T inference, LeIsaac and NVIDIA's Isaac Sim.

How to Get Started with the Model

To learn more about how to set up the environment for using the model as an inference service, refer to the instructions in the Isaac-GR00T repo.

You can run the inference server for the model with the following command:

 python scripts/inference_service.py --model-path flrs/so101_orange_pick_gr00tn1.5_model --server --embodiment-tag new_embodiment --data-config so100_dualcam
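
Once the server is running, you can query it from a client process. The snippet below is only a minimal sketch: the module path, class name, port, and observation keys (front/wrist camera frames, the 6-dimensional state, and the task description) are assumptions based on the so100_dualcam data config, so check the Isaac-GR00T repo for the exact client API and key names.

    # Minimal client sketch (assumption: module path, class name, and observation
    # keys below match the Isaac-GR00T version you installed; adjust as needed).
    import numpy as np
    from gr00t.eval.service import ExternalRobotInferenceClient  # assumed module path

    client = ExternalRobotInferenceClient(host="localhost", port=5555)

    # Dummy observation in the assumed dual-camera layout:
    # two RGB frames plus the 6-dimensional state (5 arm joints + gripper).
    obs = {
        "video.front": np.zeros((1, 480, 640, 3), dtype=np.uint8),
        "video.wrist": np.zeros((1, 480, 640, 3), dtype=np.uint8),
        "state.single_arm": np.zeros((1, 5), dtype=np.float32),
        "state.gripper": np.zeros((1, 1), dtype=np.float32),
        "annotation.human.task_description": ["Pick up the oranges and place them in the bowl."],
    }

    action = client.get_action(obs)  # dict of predicted action chunks
    print(action)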

Training Details

Training Data

See the Dataset Card for more information on the training data.

All data in the dataset was used for training and no test data was withheld.

You can preview the dataset in the LeRobot Dataset Visualizer.

Training Procedure

Training was done on an NVIDIA L4 GPU, driver version 550.54.15, CUDA 12.4.

  1. Set up the environment as described in the Isaac-GR00T repo.
  2. Download the dataset from the Hugging Face Hub, e.g. via the Hugging Face CLI:
    hf download --repo-type dataset --local-dir ./dataset wantobcm/so101_orange_pick_gr00tn1.5
    
  3. To train the model, you need a modality file. Create the file dataset/meta/modality.json with the following content (a sanity check for these index ranges is sketched after this list):
    {
        "state": {
            "single_arm": {
                "start": 0,
                "end": 5
            },
            "gripper": {
                "start": 5,
                "end": 6
            }
        },
        "action": {
            "single_arm": {
                "start": 0,
                "end": 5
            },
            "gripper": {
                "start": 5,
                "end": 6
            }
        },
        "video": {
            "front": {
                "original_key": "observation.images.front"
            },
            "wrist": {
                "original_key": "observation.images.wrist"
            }
        },
        "annotation": {
            "human.task_description": {
                "original_key": "task_index"
            }
        }
    }
    
  4. Train (fine-tune) the model with the following command:
    python scripts/gr00t_finetune.py \
        --dataset-path ./dataset \
        --num-gpus 1 \
        --output-dir ./so101_orange_pick_gr00tn1.5_model \
        --max-steps 6000 \
        --data-config so100_dualcam \
        --video-backend torchvision_av \
        --no-tune_diffusion_model \
        --dataloader-num-workers 1 \
        --batch-size 16 \
        --dataloader-prefetch-factor 1
    

    Note: The following adjustments were made to accommodate infrastructure limitations:

    • --num-gpus 1
    • --dataloader-num-workers 1
    • --batch-size 16
    • --dataloader-prefetch-factor 1
    • --no-tune_diffusion_model
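
The start/end indices in the modality.json from step 3 slice the flattened state and action vectors of the dataset: indices 0–4 cover the five arm joints and index 5 the gripper. The following sketch, assuming the standard LeRobot parquet layout with "observation.state" and "action" columns, checks that the dataset dimensions match those ranges:

    # Sanity check (assumption: the dataset follows the standard LeRobot layout,
    # with per-episode parquet files containing "observation.state" and "action").
    import glob
    import json

    import pandas as pd

    with open("dataset/meta/modality.json") as f:
        modality = json.load(f)

    # Expected width of the flattened state/action vectors per modality.json.
    expected_dim = max(m["end"] for m in modality["state"].values())  # 6 = 5 joints + gripper

    parquet_files = sorted(glob.glob("dataset/data/**/*.parquet", recursive=True))
    df = pd.read_parquet(parquet_files[0])

    state_dim = len(df["observation.state"].iloc[0])
    action_dim = len(df["action"].iloc[0])
    assert state_dim == expected_dim, f"state has {state_dim} dims, expected {expected_dim}"
    assert action_dim == expected_dim, f"action has {action_dim} dims, expected {expected_dim}"
    print(f"OK: state/action vectors have {expected_dim} dimensions, matching modality.json")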

Training Hyperparameters

Default hyperparameters for GR00T were used (as of commit 1259d62), except for the ones set via command line arguments above.

Evaluation

Evaluation was run via the following GR00T evaluation script:

python scripts/eval_policy.py \
  --model_path ./so101_orange_pick_gr00tn1.5_model \
  --embodiment-tag new_embodiment \
  --data-config so100_dualcam \
  --dataset_path ./dataset \
  --modality-keys single_arm gripper \
  --trajs 40

Testing Data, Factors & Metrics

Testing Data

Evaluation of the model was done on the full training set, as no separate test set was withheld. For the training set, see the dataset card linked above.

Metrics

In accordance with the output of the evaluation script, the MSE (Mean Squared Error) over all evaluated trajectories was used as the metric.
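
Roughly speaking, the reported number is the mean squared error between the actions predicted by the policy and the ground-truth actions recorded in the dataset, averaged over the evaluated trajectories. The following is a minimal sketch of that computation (not the evaluation script itself):

    # Sketch of the metric, not the actual eval_policy.py implementation:
    # per-trajectory MSE between predicted and ground-truth actions,
    # averaged over all evaluated trajectories.
    import numpy as np

    def trajectory_mse(predicted: np.ndarray, ground_truth: np.ndarray) -> float:
        """MSE over all timesteps and action dimensions of one trajectory."""
        return float(np.mean((predicted - ground_truth) ** 2))

    def dataset_mse(pairs: list[tuple[np.ndarray, np.ndarray]]) -> float:
        """Average per-trajectory MSE over (predicted, ground_truth) pairs."""
        return float(np.mean([trajectory_mse(p, g) for p, g in pairs]))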

Results

The MSE for all 40 trajectories was 23.752.

An example trajectory is visualized in the following image (image created via the evaluation script for trajectory 10):

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: NVIDIA L4, 4 vCPUs, 16 GB memory, 300 GB SSD
  • Hours used: 3
  • Cloud Provider: Google Cloud Platform (GCP)
  • Compute Region: us-east1-c
  • Carbon Emitted: 0.08 kg CO2eq (estimate via Machine Learning Impact calculator)
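
For reference, such estimates follow the usual energy-times-carbon-intensity reasoning. The sketch below assumes the GPU's rated power (the NVIDIA L4 is rated at 72 W) and uses a placeholder grid carbon intensity for us-east1 chosen only to illustrate the arithmetic; it is not the calculator's exact method.

    # Back-of-the-envelope emissions estimate (sketch, not the calculator's method).
    gpu_power_kw = 0.072                # NVIDIA L4 rated at 72 W
    hours = 3                           # training time reported above
    grid_intensity_kg_per_kwh = 0.37    # assumed placeholder; look up the current figure

    energy_kwh = gpu_power_kw * hours                       # ~0.22 kWh
    emissions_kg = energy_kwh * grid_intensity_kg_per_kwh   # ~0.08 kg CO2eq
    print(f"{emissions_kg:.2f} kg CO2eq")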

Model Card Contact

Florian Roscheck
