Add comprehensive model card for LaCoT

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +120 -0
README.md ADDED
@@ -0,0 +1,120 @@
+ ---
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ ---
+
+ # LaCoT: Latent Chain-of-Thought for Visual Reasoning
+
+ This repository contains the official implementation of the paper [Latent Chain-of-Thought for Visual Reasoning](https://huggingface.co/papers/2510.23925). LaCoT proposes a novel approach to improve the interpretability and reliability of Large Vision-Language Models (LVLMs) by reformulating reasoning as posterior inference.
+
+ <p align="center" width="100%">
+ <img src="https://github.com/heliossun/LaCoT/raw/main/docs/framework.jpg" width="50%" height="50%">
+ </p>
+
+ ## Abstract
+
+ Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
+
+ ## Links
+
+ * **Paper**: [Latent Chain-of-Thought for Visual Reasoning](https://huggingface.co/papers/2510.23925)
+ * **GitHub Repository**: [https://github.com/heliossun/LaCoT](https://github.com/heliossun/LaCoT)
+
+ ## Model Checkpoints
+
+ * **LaCoT 7B**: [ZachSun/Qwen2.5VL-GFN-7B-1024](https://huggingface.co/ZachSun/Qwen2.5VL-GFN-7B-1024)
+ * **LaCoT 3B**: [ZachSun/Qwen2.5-gfn-3B](https://huggingface.co/ZachSun/Qwen2.5-gfn-3B)
+ * **SFT 7B**: [ZachSun/Qwen2.5-gfn-sft-7b-250k](https://huggingface.co/ZachSun/Qwen2.5-gfn-sft-7b-250k)
+ * **SFT 3B**: [ZachSun/Qwen2.5-gfn-sft-3b-250k](https://huggingface.co/ZachSun/Qwen2.5-gfn-sft-3b-250k)
+
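+ Below is a minimal inference sketch (not from the original card), assuming the LaCoT checkpoints keep the standard Qwen2.5-VL chat interface in 🤗 Transformers; the prompt and image are placeholders:
+
+ ```python
+ # Illustrative sketch: load a LaCoT checkpoint as a standard Qwen2.5-VL model.
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+ from qwen_vl_utils import process_vision_info
+
+ model_id = "ZachSun/Qwen2.5VL-GFN-7B-1024"
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ messages = [{
+     "role": "user",
+     "content": [
+         {"type": "image", "image": "https://github.com/heliossun/LaCoT/raw/main/docs/framework.jpg"},
+         {"type": "text", "text": "Describe this figure step by step."},
+     ],
+ }]
+
+ # Build the chat prompt and pack the visual inputs exactly as for Qwen2.5-VL.
+ text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
+                    padding=True, return_tensors="pt").to(model.device)
+
+ output_ids = model.generate(**inputs, max_new_tokens=512)
+ trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
+ print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
+ ```
+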
+ ## Data Preparation
+
+ * [Stage-1 SFT Dataset](https://huggingface.co/datasets/ZachSun/visual-cot/blob/main/llava-cot%2Br1ov-250k.json): Download the dataset.
+ * [Stage-2 RL Dataset](https://huggingface.co/datasets/ZachSun/visual-cot/blob/main/gfn-3k.json): Download the dataset.
+ * Prepare the raw images following [LLaVA-CoT](https://github.com/PKU-YuanGroup/LLaVA-CoT) and [R1-Onevision](https://github.com/Fancy-MLLM/R1-Onevision) (you may also follow our [script](https://github.com/heliossun/qwen2.5-laCoT/blob/main/get_r1_ov_data.py) to prepare the R1-Onevision data).
+
+ Note:
+ 1. Download **LLaVA-CoT** into the folder **cot**.
+ 2. Download **R1-Onevision** into the folder **cot/r1ov-image**.
+
+ The final data path should look like this:
+ ```bash
+ cot
+ ├── ai2d
+ ├── chartqa
+ ├── CLEVR_v1.0
+ ├── coco
+ ├── docvqa
+ ├── geoqa+
+ ├── gqa
+ ├── llava
+ ├── ocr_vqa
+ ├── pisc
+ ├── r1ov-image
+ ├── sam
+ ├── share_textvqa
+ ├── sqa
+ ├── textvqa
+ ├── vg
+ ├── web-celebrity
+ ├── web-landmark
+ └── wikiart
+ ```
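+
+ The two annotation JSONs can also be fetched programmatically. A minimal sketch with `huggingface_hub` (filenames inferred from the dataset links above); the raw images still follow the steps listed earlier:
+
+ ```python
+ # Illustrative: download the Stage-1 and Stage-2 annotation files from the
+ # ZachSun/visual-cot dataset repo.
+ from huggingface_hub import hf_hub_download
+
+ sft_json = hf_hub_download(repo_id="ZachSun/visual-cot", repo_type="dataset",
+                            filename="llava-cot+r1ov-250k.json")  # Stage-1 SFT annotations
+ rl_json = hf_hub_download(repo_id="ZachSun/visual-cot", repo_type="dataset",
+                           filename="gfn-3k.json")                # Stage-2 RL annotations
+ print(sft_json, rl_json)
+ ```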
+
+ ## Installation
+
+ #### 1. **Clone this repository and navigate to the LaCoT folder:**
+
+ ```bash
+ git clone https://github.com/heliossun/LaCoT.git
+ cd LaCoT
+ ```
+
+ #### 2. **Install the inference package:**
+
+ ```bash
+ conda create -n qwen python=3.10 -y
+ conda activate qwen
+
+ ### if ImportError: /lib64/libc.so.6: version `GLIBC_2.32' not found
+ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
+ pip install flash-attn==2.7.4.post1 --no-build-isolation
+ pip install git+https://github.com/huggingface/transformers accelerate
+ pip install qwen-vl-utils[decord]
+ ## Install required packages
+ pip install deepspeed
+ pip install peft
+ pip install ujson
+ pip install liger_kernel
+ pip install dataset
+ pip install torchvision
+ pip install wandb
+ # use transformers==4.51.3 for training
+ ```
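+
+ A quick sanity check after installation (illustrative, not from the original card): confirm the core packages import and the GPU is visible.
+
+ ```python
+ # Verify the environment set up above: torch/CUDA, transformers version,
+ # and the optional flash-attn / qwen-vl-utils packages.
+ import torch
+ import transformers
+
+ print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
+ print("transformers:", transformers.__version__)  # the card suggests 4.51.3 for training
+
+ try:
+     import flash_attn      # noqa: F401
+     import qwen_vl_utils   # noqa: F401
+     print("flash-attn and qwen-vl-utils import OK")
+ except ImportError as err:
+     print("missing optional dependency:", err)
+ ```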
+
+ ## Training
+
+ **Stage-1 SFT:**
+ You may follow the [training script](https://github.com/heliossun/LaCoT/blob/main/scripts/finetune.sh).
+
+ **Stage-2 GFN:**
+ You may follow the [training script](https://github.com/heliossun/LaCoT/blob/main/scripts/finetune_gfn.sh).
+ You may adjust the following hyperparameters in the training script:
+ ```bash
+ --explore_nums 6 \              # number of explorations
+ --explore_min_bs 2 \            # batch size for exploration
+ --rat_max_len 1024 \            # explored rationale's max sequence length
+ --rat_min_len 64 \              # explored rationale's min sequence length
+ --reward_tolarent_start 1.5 \   # higher means accepting lower-reward explorations during policy gradient
+ --reward_tolarent_end 1 \
+ --reward_tolarent_horizon 50 \  # warmup steps
+ ```
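+
+ For intuition only: the `reward_tolarent_*` triple behaves like an acceptance threshold that is annealed from `reward_tolarent_start` to `reward_tolarent_end` over `reward_tolarent_horizon` warmup steps. The sketch below assumes a simple linear anneal; the authoritative schedule is in the training script linked above.
+
+ ```python
+ # Hypothetical linear anneal of the reward tolerance over the warmup horizon.
+ # It mirrors the flag names above but is NOT the repository's implementation.
+ def reward_tolerance(step: int, start: float = 1.5, end: float = 1.0, horizon: int = 50) -> float:
+     if step >= horizon:
+         return end
+     return start + (end - start) * (step / horizon)
+
+ for s in (0, 10, 25, 50, 100):
+     print(s, round(reward_tolerance(s), 3))
+ ```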
+
+ ## Evaluation
+
+ We integrate our model into [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main) for evaluation. After installing lmms-eval, please check the scripts in [models](https://github.com/heliossun/LaCoT/tree/main/lmms-eval/models) for more details.
+
+ ## Citation
+
+ If you find this work useful for your research or applications, please cite the related papers/blogs.