---
license: apache-2.0
tags:
- multimodal
- vision-language
- video understanding
- visuospatial cognition
- spatial reasoning
- vlm
- llava
- qwen
- siglip
- hiera
- sam2
- dual-encoder
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
model_name: ViCA2-7B-Stage1
---
## Usage and Full Documentation
For a detailed model description, training setup, datasets, evaluation results, and inference code, **please refer to the following links**:
- [GitHub repository](https://github.com/nkkbr/ViCA)
- [Hugging Face model page](https://huggingface.co/nkkbr/ViCA2)