ZachSun
/

Qwen2.5-gfn-3B

+---
+pipeline_tag: image-text-to-text
+library_name: transformers
+---
+# LaCoT: Latent Chain-of-Thought for Visual Reasoning
+This repository contains the official implementation of the paper [Latent Chain-of-Thought for Visual Reasoning](https://huggingface.co/papers/2510.23925). LaCoT proposes a novel approach to improve the interpretability and reliability of Large Vision-Language Models (LVLMs) by reformulating reasoning as posterior inference.
+<p align="center" width="100%">
+<img src="https://github.com/heliossun/LaCoT/raw/main/docs/framework.jpg" width="50%" height="50%">
+</p>
+## Abstract
+Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
+## Links
+*   **Paper**: [Latent Chain-of-Thought for Visual Reasoning](https://huggingface.co/papers/2510.23925)
+*   **GitHub Repository**: [https://github.com/heliossun/LaCoT](https://github.com/heliossun/LaCoT)
+## Model Checkpoints
+*   **LaCoT 7B**: [ZachSun/Qwen2.5VL-GFN-7B-1024](https://huggingface.co/ZachSun/Qwen2.5VL-GFN-7B-1024)
+*   **LaCoT 3B**: [ZachSun/Qwen2.5-gfn-3B](https://huggingface.co/ZachSun/Qwen2.5-gfn-3B)
+*   **SFT 7B**: [ZachSun/Qwen2.5-gfn-sft-7b-250k](https://huggingface.co/ZachSun/Qwen2.5-gfn-sft-7b-250k)
+*   **SFT 3B**: [ZachSun/Qwen2.5-gfn-sft-3b-250k](https://huggingface.co/ZachSun/Qwen2.5-gfn-sft-3b-250k)
+## Data Preparation
+*   [Stage-1 SFT Dataset](https://huggingface.co/datasets/ZachSun/visual-cot/blob/main/llava-cot%2Br1ov-250k.json): Download the dataset.
+*   [Stage-2 RL Dataset](https://huggingface.co/datasets/ZachSun/visual-cot/blob/main/gfn-3k.json): Download the dataset.
+*   Prepare the raw images following: [LLaVA-CoT](https://github.com/PKU-YuanGroup/LLaVA-CoT) and [R1-Onevision](https://github.com/Fancy-MLLM/R1-Onevision) (you may also follow our [script](https://github.com/heliossun/qwen2.5-laCoT/blob/main/get_r1_ov_data.py) to prepare R1-Onevision data).
+Note:
+1.  Download **LLaVA-CoT** in folder **cot**.
+2.  Download **R1-Onevision** in folder **cot/r1ov-image**
+The final data path should look like this:
+```bash
+cot
+├── ai2d
+├── chartqa
+├── CLEVR_v1.0
+├── coco
+├── docvqa
+├── geoqa+
+├── gqa
+├── llava
+├── ocr_vqa
+├── pisc
+├── r1ov-image
+├── sam
+├── share_textvqa
+├── sqa
+├── textvqa
+├── vg
+├── web-celebrity
+├── web-landmark
+└── wikiart
+```
+## Installation
+#### 1. **Clone this repository and navigate to the LLaVA folder:**
+```bash
+git clone https://github.com/heliossun/LaCoT.git
+cd LaCoT
+```
+#### 2. **Install the inference package:**
+```bash
+conda create -n qwen python=3.10 -y
+conda activate qwen
+### if ImportError: /lib64/libc.so.6: version `GLIBC_2.32' not found
+pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
+pip install flash-attn==2.7.4.post1 --no-build-isolation
+pip install git+https://github.com/huggingface/transformers accelerate
+pip install qwen-vl-utils[decord]
+## Install required packages
+pip install deepspeed
+pip install peft
+pip install ujson
+pip install liger_kernel
+pip install dataset
+pip install torchvision
+pip install wandb
+# use transformers==4.51.3 for training
+```
+## Training
+**Stage1 SFT:**
+You may follow [training code](https://github.com/heliossun/LaCoT/blob/main/scripts/finetune.sh)
+**Stage2 GFN:**
+You may follow [training code](https://github.com/heliossun/LaCoT/blob/main/scripts/finetune_gfn.sh)
+You may adjust the following hyperparameters in the training script
+```bash
+--explore_nums 6 \ # number of exploration
+--explore_min_bs 2 \ # batch size for exploration
+--rat_max_len 1024 \ # explored rational's max sequence length
+--rat_min_len 64 \
+--reward_tolarent_start 1.5 \ # higher means accepting low reward exploration during policy gradient
+--reward_tolarent_end 1 \
+--reward_tolarent_horizon 50 \ # warmup steps
+```
+## Evaluation
+We implement our model card in [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main) for evaluation. After installation, please check the scripts in [models](https://github.com/heliossun/LaCoT/tree/main/lmms-eval/models) for more detail.
+## Citation
+If you find it useful for your research and applications, please cite related papers/blogs.