---
base_model:
- Qwen/Qwen3-VL-2B-Instruct
tags:
- GGUF
---

# Qwen3-VL-2B-Instruct

Currently, only [NexaSDK](https://github.com/NexaAI/nexa-sdk) supports this GGUF.

## Quickstart

- Download [NexaSDK](https://github.com/NexaAI/nexa-sdk) with one click
- Run one line of code in your terminal:

```
nexa infer NexaAI/Qwen3-VL-2B-Instruct-GGUF
```

## Model Description

**Qwen3-VL-2B-Instruct** is a 2-billion-parameter, instruction-tuned vision-language model in the Qwen3-VL family. It is designed for efficient multimodal understanding and generation, combining strong text skills with image and video perception, which makes it well suited for edge and on-device deployment. It supports long contexts (up to 256K tokens) and features an upgraded architecture for improved spatial, visual, and temporal reasoning.

## Features

- **Multimodal I/O**: Understands images and long videos, performs OCR, and handles mixed image-text prompts.
- **Long-context reasoning**: Up to 256K context for books, documents, or extended visual analysis.
- **Spatial & temporal understanding**: Improved grounding and temporal event tracking for videos.
- **Agentic capabilities**: Recognizes UI elements and reasons about screen layouts for tool use.
- **Lightweight footprint**: 2B parameters for efficient inference across CPU, GPU, or NPU.

## Use Cases

- Visual question answering, captioning, and summarization
- OCR and document understanding (multi-page, multilingual)
- Video analysis and highlight detection
- On-device visual assistants and UI automation agents
- Edge analytics and lightweight IoT vision tasks

## Inputs and Outputs

**Inputs**
- Text prompts
- Images (single or multiple)
- Videos or frame sequences
- Mixed multimodal chat turns

**Outputs**
- Natural language answers, captions, and visual reasoning
- OCR text and structured visual information

## License

This model is released under the **Apache 2.0 License**. Please refer to the Hugging Face model card for detailed licensing and usage information.
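
## Manual Download (Optional)

The Quickstart above runs the model directly through NexaSDK. If you prefer to fetch the GGUF files yourself first (for example, to keep a local copy), a minimal sketch using the Hugging Face CLI is shown below. This assumes the `huggingface_hub` package is installed; pointing NexaSDK at a locally downloaded path is not covered here and may require options beyond the one-line command above.

```
# Download all files from this repo into a local folder (hypothetical target directory)
huggingface-cli download NexaAI/Qwen3-VL-2B-Instruct-GGUF --local-dir ./Qwen3-VL-2B-Instruct-GGUF
```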