---
base_model:
- Qwen/Qwen3-VL-2B-Instruct
tags:
- GGUF
---
						
# Qwen3-VL-2B-Instruct

> [!NOTE]
> Currently, only [NexaSDK](https://github.com/NexaAI/nexa-sdk) supports this model's GGUF.
					
						
## Quickstart
- Download [NexaSDK](https://github.com/NexaAI/nexa-sdk) with one click
- Run it with one line of code in your terminal:

```
nexa infer NexaAI/Qwen3-VL-2B-Instruct-GGUF
```
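If you prefer to fetch or cache the GGUF files yourself, the sketch below is a minimal example using the `huggingface_hub` Python library; the `local_dir` path is illustrative. If you use the `nexa infer` command above, this manual download is typically not needed.

```
# Minimal sketch: download the GGUF files from this repo with huggingface_hub
# (pip install huggingface_hub). The local_dir path is illustrative.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="NexaAI/Qwen3-VL-2B-Instruct-GGUF",
    local_dir="./Qwen3-VL-2B-Instruct-GGUF",
)
print(f"Files downloaded to: {local_path}")
```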
					
						
## Model Description
**Qwen3-VL-2B-Instruct** is a 2-billion-parameter, instruction-tuned vision-language model in the Qwen3-VL family. It is designed for efficient multimodal understanding and generation, combining strong text skills with image and video perception, which makes it well suited to edge and on-device deployment. It supports long contexts (up to 256K tokens) and features an upgraded architecture for better spatial, visual, and temporal reasoning.
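For reference, here is a minimal, illustrative sketch of running the original (non-GGUF) base checkpoint with Hugging Face Transformers. It assumes a transformers release with Qwen3-VL support plus torch (and accelerate for `device_map="auto"`); the image URL is a placeholder, and the exact preprocessing may differ from the canonical example on the base model card. The GGUF files in this repo are run with NexaSDK as shown in the Quickstart.

```
# Illustrative only: loads the original Qwen/Qwen3-VL-2B-Instruct checkpoint
# (not the GGUF in this repo). Assumes a transformers version with Qwen3-VL
# support; device_map="auto" additionally requires accelerate.
from PIL import Image
import requests
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Placeholder image; use any local file or URL.
image = Image.open(requests.get("https://example.com/demo.jpg", stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```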
					
						
## Features
- **Multimodal I/O**: Understands images and long videos, performs OCR, and handles mixed image-text prompts.
- **Long-context reasoning**: Up to 256K context for books, documents, or extended visual analysis.
- **Spatial & temporal understanding**: Improved grounding and temporal event tracking for videos.
- **Agentic capabilities**: Recognizes UI elements and reasons about screen layouts for tool use.
- **Lightweight footprint**: 2B parameters for efficient inference across CPU, GPU, or NPU.

## Use Cases
- Visual question answering, captioning, and summarization
- OCR and document understanding (multi-page, multilingual)
- Video analysis and highlight detection
- On-device visual assistants and UI automation agents
- Edge analytics and lightweight IoT vision tasks
					
						
## Inputs and Outputs
**Inputs**
- Text prompts
- Images (single or multiple)
- Videos or frame sequences
- Mixed multimodal chat turns (see the sketch below)

**Outputs**
- Natural language answers, captions, and visual reasoning
- OCR text and structured visual information
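A mixed multimodal chat turn is usually expressed as a list of role/content messages whose content interleaves text with image or video parts. The sketch below is illustrative only: the field names follow the common multimodal chat convention, and the exact schema depends on the runtime you use.

```
# Illustrative message structure for a mixed image + text turn.
# Field names follow the common multimodal chat convention; your runtime's
# exact schema may differ. "invoice_page1.png" is a placeholder path.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "invoice_page1.png"},
            {"type": "text", "text": "Extract the total amount and the due date."},
        ],
    },
]
```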
					
						
## License
This model is released under the **Apache 2.0 License**.
Please refer to the Hugging Face model card for detailed licensing and usage information.