# Zebra-CoT: A Dataset for Interleaved Vision-Language Reasoning

![Image](assets/zebra_cot_datacard.png)
### Training BAGEL on Zebra-CoT

This repository is adapted from the [Bagel](https://github.com/ByteDance-Seed/Bagel) repository.
### Setup

```bash
git clone https://github.com/multimodal-reasoning-lab/Bagel-Zebra-CoT.git
cd Bagel-Zebra-CoT
conda create -n bagel python=3.10 -y
conda activate bagel
pip install -r requirements.txt
pip install flash_attn --no-build-isolation
```
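To confirm the environment built correctly, a quick import check like the following should suffice (a minimal sketch; it assumes a CUDA-capable machine):

```python
# Sanity check: verify torch sees the GPU and flash_attn imports cleanly.
import torch
import flash_attn

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
print(f"flash_attn {flash_attn.__version__}")
```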

### Download checkpoint

Set `HF_HOME` in `download_model.py` to the directory where the checkpoint should be stored, then run:

```bash
python download_model.py
```

You can also do this directly from Python if `HF_HOME` is already set in your environment.
```python
from huggingface_hub import snapshot_download

# Downloads into the Hugging Face cache under HF_HOME; allow_patterns limits
# the pull to config, weight, and documentation files.
snapshot_download(
    repo_id="multimodal-reasoning-lab/Bagel-Zebra-CoT",
    local_dir_use_symlinks=False,
    resume_download=True,
    allow_patterns=["*.json", "*.safetensors", "*.bin", "*.py", "*.md", "*.txt"],
)
```
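If `HF_HOME` is not set yet, you can point it at a directory with enough disk space before the download (a sketch; the path below is a placeholder):

```python
import os

# Set before importing huggingface_hub, which reads HF_HOME at import time.
# The path below is a placeholder; pick a disk with enough free space.
os.environ["HF_HOME"] = "/path/with/space/huggingface"
```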

### Inference

![Image](assets/bagel-cot-example.png)

The inference script (`infz_bf16.py`) natively supports interleaved text and visual reasoning. To customize it for your
specific use case:

##### 1. Model Checkpoint Path

Update the checkpoint path to point to your model:

```python
checkpoint_dir = "/path/to/your/HF_HOME/models/Bagel-Zebra-CoT"
```

For example, when downloaded through the Hugging Face cache under `HF_HOME`, the checkpoint folder is:

```python
checkpoint_dir = f"{HF_HOME}/models--multimodal-reasoning-lab--Bagel-Zebra-CoT/snapshots/c1ff3c56dd5909841523e3a6b554c77d919c2b28"
```

If you downloaded to a local directory instead, use that path:

```python
checkpoint_dir = f"{HF_HOME}/models/Bagel-Zebra-CoT"
```
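Alternatively, since `snapshot_download` returns the resolved local path, you can avoid hard-coding the snapshot hash altogether (a sketch, not what the script does by default):

```python
from huggingface_hub import snapshot_download

# Returns the local snapshot directory; downloads only files that are missing.
checkpoint_dir = snapshot_download(repo_id="multimodal-reasoning-lab/Bagel-Zebra-CoT")
```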

##### 2. Setting up prompt and images

Edit the prompt and image variables in `infz_bf16.py` (around lines 203-211):

**For single image problems:**
```python
prompt = "Your question here"
image = Image.open('path/to/your/image.png')
```

**For multiple image problems:**
```python
prompt = "Your question about multiple images"
image_1 = Image.open('path/to/image1.jpg')
image_2 = Image.open('path/to/image2.jpg')
image_3 = Image.open('path/to/image3.jpg')
image = [image_1, image_2, image_3]  # List of images
```

**For text-only problems:**
```python
prompt = "Your text-only question"
image = None
```
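In all three cases, `image` ends up as `None`, a single `PIL.Image`, or a list of images. If you script these cases yourself, a small helper like this (hypothetical, not part of the repo) keeps the handling uniform:

```python
from typing import List, Optional, Union

from PIL import Image


def load_images(
    paths: Optional[Union[str, List[str]]],
) -> Optional[Union[Image.Image, List[Image.Image]]]:
    """Map zero, one, or many file paths onto the `image` formats the script accepts."""
    if paths is None:
        return None  # text-only problem
    if isinstance(paths, str):
        return Image.open(paths)  # single-image problem
    return [Image.open(p) for p in paths]  # multi-image problem


# e.g. image = load_images(["path/to/image1.jpg", "path/to/image2.jpg"])
```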

##### 3. Inference Parameters

You can adjust the generation parameters in the `inference_hyper` dictionary:

```python
inference_hyper = dict(
    do_sample=True,
    text_temperature=0.3,
    cfg_text_scale=4.0,
    cfg_img_scale=2.0,
    cfg_interval=[0.0, 1.0],
    timestep_shift=3.0,
    num_timesteps=50,
    cfg_renorm_min=0.0,
    cfg_renorm_type="text_channel",
)
```
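Roughly: `text_temperature` controls sampling randomness for text tokens, `cfg_text_scale` and `cfg_img_scale` are classifier-free-guidance weights for the text and image conditions, and `num_timesteps` sets the number of diffusion denoising steps per generated image. In the upstream BAGEL notebook these parameters are unpacked directly into the inferencer call; in this script the call should look roughly like the following (the `inferencer` name and signature are assumed from upstream and may differ here):

```python
# Sketch based on the upstream BAGEL notebook; verify the actual call in infz_bf16.py.
output = inferencer(image=image, text=prompt, **inference_hyper)
```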

For details, refer to the original Jupyter notebook: [inference.ipynb](inference.ipynb).

#### Example Use Cases

```python
prompt = "Subtract all cylinders. Add 1 red sphere. How many objects are left?"
image = Image.open('test_images/image.png')
```

### Training
For training, run:

```bash
bash scripts/train.sh
```

For details, please refer to the original repo [README](https://github.com/bytedance-seed/BAGEL).

The interleaved reasoning data customized for Zebra-CoT can be found in [think_trace_dataset.py](data/interleave_datasets/think_trace_dataset.py).

### Cite
```bibtex
@misc{li2025zebracot,
      title={Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning},
      author={Ang Li and Charles Wang and Kaiyu Yue and Zikui Cai and Ollie Liu and Deqing Fu and Peng Guo and Wang Bill Zhu and Vatsal Sharan and Robin Jia and Willie Neiswanger and Furong Huang and Tom Goldstein and Micah Goldblum},
      year={2025},
      eprint={2507.16746},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.16746},
}
```