---
license: apache-2.0
tags:
- multimodal
- vision-language
- video understanding
- visuospatial cognition
- spatial reasoning
- vlm
- llava
- qwen
- siglip
- hiera
- sam2
- dual-encoder
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
model_name: ViCA2-7B-Stage1
---
## Usage and Full Documentation
For a detailed model description, training setup, datasets, evaluation results, and inference code, **please refer to the following links**:
- [GitHub repository](https://github.com/nkkbr/ViCA)
- [Hugging Face model page](https://huggingface.co/nkkbr/ViCA2)