---
base_model:
- Qwen/Qwen3-VL-2B-Instruct
tags:
- GGUF
---
# Qwen3-VL-2B-Instruct
Note: currently only NexaSDK supports this model's GGUF.
## Quickstart

- Download NexaSDK with one click.
- Run one line of code in your terminal:

```bash
nexa infer NexaAI/Qwen3-VL-2B-Instruct-GGUF
```
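If NexaSDK serves the model behind an OpenAI-compatible endpoint, a minimal Python sketch of a single image-plus-text query might look like the following. The server address, port, and model id below are assumptions, not documented values; check the NexaSDK documentation for the exact setup.

```python
# Minimal sketch, assuming NexaSDK exposes an OpenAI-compatible chat
# endpoint on localhost:8080. The address, port, and model id are
# assumptions; consult the NexaSDK docs for the real values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="NexaAI/Qwen3-VL-2B-Instruct-GGUF",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```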
## Model Description
Qwen3-VL-2B-Instruct is a 2-billion-parameter, instruction-tuned vision-language model in the Qwen3-VL family. It is designed for efficient multimodal understanding and generation, combining strong text skills with image and video perception, which makes it well suited to edge and on-device deployment. It supports long contexts (up to 256K tokens) and features an upgraded architecture for better spatial, visual, and temporal reasoning.
## Features
- Multimodal I/O: Understands images and long videos, performs OCR, and handles mixed image-text prompts.
- Long-context reasoning: Up to 256K context for books, documents, or extended visual analysis.
- Spatial & temporal understanding: Improved grounding and temporal event tracking for videos.
- Agentic capabilities: Recognizes UI elements and reasons about screen layouts for tool use.
- Lightweight footprint: 2B parameters for efficient inference across CPU, GPU, or NPU.
## Use Cases
- Visual question answering, captioning, and summarization
- OCR and document understanding (multi-page, multilingual; see the sketch after this list)
- Video analysis and highlight detection
- On-device visual assistants and UI automation agents
- Edge analytics and lightweight IoT vision tasks
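As a concrete example of the OCR and document-understanding use case, a local page image can be sent through the same chat interface by inlining it as a base64 data URL. This is a sketch under the same endpoint assumptions as the Quickstart example; the file path is a placeholder.

```python
# Sketch: OCR a local document page by inlining it as a base64 data URL.
# Endpoint, model id, and file path are assumptions/placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

with open("page.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="NexaAI/Qwen3-VL-2B-Instruct-GGUF",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcribe all text on this page, preserving layout."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```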
## Inputs and Outputs

### Inputs
- Text prompts
- Images (single or multiple)
- Videos or frame sequences (see the frame-sequence sketch below)
- Mixed multimodal chat turns
### Outputs
- Natural language answers, captions, and visual reasoning
- OCR text and structured visual information
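For video or frame-sequence inputs, one common pattern is to sample a handful of frames and send them as consecutive image parts within a single turn. Again, this is a sketch under the same endpoint assumptions as the Quickstart example; the frame paths are placeholders.

```python
# Sketch: pass a sampled frame sequence as multiple image parts in one
# chat turn. Endpoint, model id, and frame paths are assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def as_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

frames = ["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"]  # placeholders
content = [{"type": "text",
            "text": "Describe what happens across these frames."}]
content += [{"type": "image_url", "image_url": {"url": as_data_url(p)}}
            for p in frames]

response = client.chat.completions.create(
    model="NexaAI/Qwen3-VL-2B-Instruct-GGUF",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```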
## License
This model is released under the Apache 2.0 License.
Please refer to the Hugging Face model card for detailed licensing and usage information.