---
base_model:
- Qwen/Qwen3-VL-2B-Instruct
tags:
- GGUF
---
						
# Qwen3-VL-2B-Instruct

> [!NOTE]
> Currently, only [NexaSDK](https://github.com/NexaAI/nexa-sdk) supports this model's GGUF.
					
						
## Quickstart
- Download [NexaSDK](https://github.com/NexaAI/nexa-sdk) with one click
- Run it with one line of code in your terminal:

```
nexa infer NexaAI/Qwen3-VL-2B-Instruct-GGUF
```
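If you prefer to fetch or cache the GGUF files yourself, the sketch below is a minimal example using the `huggingface_hub` Python library; the `local_dir` path is illustrative. If you use the `nexa infer` command above, this manual download is typically not needed.

```
# Minimal sketch: download the GGUF files from this repo with huggingface_hub
# (pip install huggingface_hub). The local_dir path is illustrative.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="NexaAI/Qwen3-VL-2B-Instruct-GGUF",
    local_dir="./Qwen3-VL-2B-Instruct-GGUF",
)
print(f"Files downloaded to: {local_path}")
```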
					
						
## Model Description
**Qwen3-VL-2B-Instruct** is a 2-billion-parameter, instruction-tuned vision-language model in the Qwen3-VL family. It is designed for efficient multimodal understanding and generation, combining strong text skills with image and video perception, which makes it well suited to edge and on-device deployment. It supports long contexts (up to 256K tokens) and features an upgraded architecture for better spatial, visual, and temporal reasoning.
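For reference, here is a minimal, illustrative sketch of running the original (non-GGUF) base checkpoint with Hugging Face Transformers. It assumes a transformers release with Qwen3-VL support plus torch (and accelerate for `device_map="auto"`); the image URL is a placeholder, and the exact preprocessing may differ from the canonical example on the base model card. The GGUF files in this repo are run with NexaSDK as shown in the Quickstart.

```
# Illustrative only: loads the original Qwen/Qwen3-VL-2B-Instruct checkpoint
# (not the GGUF in this repo). Assumes a transformers version with Qwen3-VL
# support; device_map="auto" additionally requires accelerate.
from PIL import Image
import requests
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Placeholder image; use any local file or URL.
image = Image.open(requests.get("https://example.com/demo.jpg", stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```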
					
						
## Features
- **Multimodal I/O**: Understands images and long videos, performs OCR, and handles mixed image-text prompts.
- **Long-context reasoning**: Up to 256K context for books, documents, or extended visual analysis.
- **Spatial & temporal understanding**: Improved grounding and temporal event tracking for videos.
- **Agentic capabilities**: Recognizes UI elements and reasons about screen layouts for tool use.
- **Lightweight footprint**: 2B parameters for efficient inference across CPU, GPU, or NPU.

## Use Cases
- Visual question answering, captioning, and summarization
- OCR and document understanding (multi-page, multilingual)
- Video analysis and highlight detection
- On-device visual assistants and UI automation agents
- Edge analytics and lightweight IoT vision tasks
					
						
## Inputs and Outputs
**Inputs**
- Text prompts
- Images (single or multiple)
- Videos or frame sequences
- Mixed multimodal chat turns (see the sketch below)

**Outputs**
- Natural language answers, captions, and visual reasoning
- OCR text and structured visual information
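A mixed multimodal chat turn is usually expressed as a list of role/content messages whose content interleaves text with image or video parts. The sketch below is illustrative only: the field names follow the common multimodal chat convention, and the exact schema depends on the runtime you use.

```
# Illustrative message structure for a mixed image + text turn.
# Field names follow the common multimodal chat convention; your runtime's
# exact schema may differ. "invoice_page1.png" is a placeholder path.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "invoice_page1.png"},
            {"type": "text", "text": "Extract the total amount and the due date."},
        ],
    },
]
```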
					
						
## License
This model is released under the **Apache 2.0 License**.
Please refer to the Hugging Face model card for detailed licensing and usage information.