---
base_model:
- Qwen/Qwen3-VL-2B-Instruct
tags:
- GGUF
---
# Qwen3-VL-2B-Instruct
Note: currently only NexaSDK supports this model's GGUF.
## Quickstart

- Download NexaSDK with one click.
- Run one line of code in your terminal:

```bash
nexa infer NexaAI/Qwen3-VL-2B-Instruct-GGUF
```
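If NexaSDK serves the model behind an OpenAI-compatible endpoint, a minimal Python sketch of a single image-plus-text query might look like the following. The server address, port, and model id below are assumptions, not documented values; check the NexaSDK documentation for the exact setup.

```python
# Minimal sketch, assuming NexaSDK exposes an OpenAI-compatible chat
# endpoint on localhost:8080. The address, port, and model id are
# assumptions; consult the NexaSDK docs for the real values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="NexaAI/Qwen3-VL-2B-Instruct-GGUF",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```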
## Model Description
Qwen3-VL-2B-Instruct is a 2-billion-parameter, instruction-tuned vision-language model in the Qwen3-VL family. It is designed for efficient multimodal understanding and generation, combining strong text skills with image and video perception, which makes it well suited to edge and on-device deployment. It supports long contexts (up to 256K tokens) and features an upgraded architecture for better spatial, visual, and temporal reasoning.
## Features
- Multimodal I/O: Understands images and long videos, performs OCR, and handles mixed image-text prompts.
- Long-context reasoning: Up to 256K context for books, documents, or extended visual analysis.
- Spatial & temporal understanding: Improved grounding and temporal event tracking for videos.
- Agentic capabilities: Recognizes UI elements and reasons about screen layouts for tool use.
- Lightweight footprint: 2B parameters for efficient inference across CPU, GPU, or NPU.
## Use Cases
- Visual question answering, captioning, and summarization
- OCR and document understanding (multi-page, multilingual; see the sketch after this list)
- Video analysis and highlight detection
- On-device visual assistants and UI automation agents
- Edge analytics and lightweight IoT vision tasks
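As a concrete example of the OCR and document-understanding use case, a local page image can be sent through the same chat interface by inlining it as a base64 data URL. This is a sketch under the same endpoint assumptions as the Quickstart example; the file path is a placeholder.

```python
# Sketch: OCR a local document page by inlining it as a base64 data URL.
# Endpoint, model id, and file path are assumptions/placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

with open("page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="NexaAI/Qwen3-VL-2B-Instruct-GGUF",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe all text on this page, preserving layout."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```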
## Inputs and Outputs

### Inputs
- Text prompts
- Images (single or multiple)
- Videos or frame sequences (see the frame-sequence sketch below)
- Mixed multimodal chat turns
### Outputs
- Natural language answers, captions, and visual reasoning
- OCR text and structured visual information
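For video or frame-sequence inputs, one common pattern is to sample a handful of frames and send them as consecutive image parts within a single turn. Again, this is a sketch under the same endpoint assumptions as the Quickstart example; the frame paths are placeholders.

```python
# Sketch: pass a sampled frame sequence as multiple image parts in one
# chat turn. Endpoint, model id, and frame paths are assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def as_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

frames = ["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"]  # placeholders
content = [{"type": "text",
            "text": "Describe what happens across these frames."}]
content += [{"type": "image_url", "image_url": {"url": as_data_url(p)}}
            for p in frames]

response = client.chat.completions.create(
    model="NexaAI/Qwen3-VL-2B-Instruct-GGUF",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```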
## License
This model is released under the Apache 2.0 License.
Please refer to the Hugging Face model card for detailed licensing and usage information.