🚀 CRIT-VL-38B

CRIT-VL-38B is a large-scale Vision-Language Model (VLM) fine-tuned for complex cross-modal multi-hop reasoning. It is trained to connect textual context with visual cues across multiple images, targeting the hallucination and grounding issues prevalent in existing VLMs.

This model is the official open-source release accompanying the paper accepted to CVPR 2026: "CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning".

📊 Model Details

  • Base Model: InternVL-3.5-Pretrained (38B)
  • Architecture: Vision-Language Model with merged LoRA weights.
  • Training Data Recipe: The model was trained with supervised fine-tuning (SFT) on an optimized combination of the following datasets:
    • LLaVA-Onevision-Instruct
    • CRIT (+ Korean extension)
    • R1-Onevision (+ Korean extension)
  • Training Infrastructure: Trained in an AWS ParallelCluster / Slurm environment on 64 H200 GPUs. Training used DeepSpeed ZeRO Stage 3 and gradient checkpointing, both of which reduce per-GPU memory use.
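
For readers setting up a comparable run, the ZeRO Stage 3 setting named above can be written as a DeepSpeed configuration. The following is a minimal sketch, not the authors' actual configuration, which is not published in this card. Only `"stage": 3` comes from the description above; bf16 is assumed from the dtype used in the Quick Start, and the remaining keys are illustrative choices.

```python
import json

# Hypothetical DeepSpeed config. Only "stage": 3 is taken from this model card;
# every other value is an assumption made for illustration.
ds_config = {
    "zero_optimization": {
        "stage": 3,            # shard parameters, gradients, and optimizer states
        "overlap_comm": True,  # assumed: overlap communication with computation
        "stage3_gather_16bit_weights_on_model_save": True,  # assumed: save full weights
    },
    "bf16": {"enabled": True},  # assumed from the bfloat16 dtype in the Quick Start
}

# Serialize so the config can be saved to a file and passed to a DeepSpeed launcher.
config_text = json.dumps(ds_config, indent=2)
print(config_text)

# Gradient checkpointing is usually switched on at the model or trainer level
# rather than in this file, e.g. model.gradient_checkpointing_enable() in
# Hugging Face Transformers, provided the model class supports it.
```

Batch size, learning rate, and optimizer settings are deliberately left out, since the card does not state them.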

💻 Quick Start

To use CRIT-VL-38B, you need to allow custom code execution (trust_remote_code=True), because the model uses the InternVL architecture.

import torch
from transformers import AutoTokenizer, AutoModel

path = "KU-MIIL/CRIT-VL-38B"

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
# In bfloat16 the weights alone take roughly 76 GB (38B parameters x 2 bytes),
# so a single 80 GB GPU leaves very little headroom for activations.
# With less memory, consider using quantization (e.g., load_in_8bit=True).
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda()

# Example: Generate a response (Modify the prompt and image structure according to InternVL documentation)
# response = model.chat(tokenizer, pixel_values, question, generation_config)
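
The memory guidance in the loading code can be checked with simple arithmetic. The sketch below estimates the memory needed for the weights alone at a few precisions. It assumes the nominal 38B parameter count from the model name and ignores activations, the KV cache, and framework overhead, so real usage will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 38e9  # nominal parameter count, assumed from the model name

# Bytes per parameter for a few common storage precisions.
precisions = [("bfloat16", 2), ("int8", 1), ("4-bit", 0.5)]

for label, nbytes in precisions:
    print(f"{label:>8}: ~{weight_memory_gb(NUM_PARAMS, nbytes):.0f} GB")
```

This gives roughly 76 GB in bfloat16, 38 GB in 8-bit, and 19 GB in 4-bit, which is why quantization is worth considering when less than about 80 GB of GPU memory is available.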

📖 Citation

If you find this model or the CRIT dataset useful in your research, please consider citing our CVPR 2026 paper:

@inproceedings{crit2026,
  title={CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning},
  author={Junyoung Sung and Seungwoo Lyu and Minjun Kim and Sumin An and Arsha Nagrani and Paul Hongsuck Seo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}

🏢 Acknowledgements

This project was conducted by the Multimodal Interactive Intelligence Laboratory (MIIL) at Korea University.
