---
base_model:
  - Qwen/Qwen3-VL-2B-Instruct
tags:
  - GGUF
---

Qwen3-VL-2B-Instruct

Note: currently, only NexaSDK supports the GGUF build of this model.

Quickstart:

  • Download NexaSDK with one click.
  • Run one line of code in your terminal:
nexa infer NexaAI/Qwen3-VL-2B-Instruct-GGUF

Model Description

Qwen3-VL-2B-Instruct is a 2-billion-parameter, instruction-tuned vision-language model in the Qwen3-VL family. It combines strong text skills with image and video perception for efficient multimodal understanding and generation, making it well suited for edge and on-device deployment. It supports long contexts (up to 256K tokens) and features an upgraded architecture for better spatial, visual, and temporal reasoning.
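The GGUF build on this page is meant to be run through NexaSDK as shown in the Quickstart above. As a rough illustration of the model's chat-style multimodal interface, the sketch below loads the original base checkpoint (Qwen/Qwen3-VL-2B-Instruct, listed in the metadata) with Hugging Face transformers; it assumes a transformers version with Qwen3-VL support, and the image URL is a placeholder.

from transformers import AutoModelForImageTextToText, AutoProcessor

# Base (non-GGUF) checkpoint listed in the metadata above.
model_id = "Qwen/Qwen3-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# One mixed image-text chat turn; the image URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/receipt.jpg"},
            {"type": "text", "text": "Read the total amount on this receipt."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)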

Features

  • Multimodal I/O: Understands images and long videos, performs OCR, and handles mixed image-text prompts.
  • Long-context reasoning: Up to 256K context for books, documents, or extended visual analysis.
  • Spatial & temporal understanding: Improved grounding and temporal event tracking for videos.
  • Agentic capabilities: Recognizes UI elements and reasons about screen layouts for tool use.
  • Lightweight footprint: 2B parameters for efficient inference across CPU, GPU, or NPU.

Use Cases

  • Visual question answering, captioning, and summarization
  • OCR and document understanding (multi-page, multilingual)
  • Video analysis and highlight detection
  • On-device visual assistants and UI automation agents
  • Edge analytics and lightweight IoT vision tasks

Inputs and Outputs

Inputs

  • Text prompts
  • Images (single or multiple)
  • Videos or frame sequences
  • Mixed multimodal chat turns

Outputs

  • Natural language answers, captions, and visual reasoning
  • OCR text and structured visual information
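To make the input format concrete, here is a minimal sketch of one mixed image-text chat turn sent through the OpenAI-compatible Python client. It assumes the runtime serving this GGUF exposes an OpenAI-compatible endpoint; the base_url, API key, and image URL are placeholders, not something this model card specifies.

from openai import OpenAI

# Placeholder endpoint; point this at whatever server is hosting the model locally.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="NexaAI/Qwen3-VL-2B-Instruct-GGUF",
    messages=[
        {
            "role": "user",
            "content": [
                # One image plus a text instruction in a single user turn.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Summarize the trend shown in this chart."},
            ],
        }
    ],
    max_tokens=256,
)

print(response.choices[0].message.content)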

License

This model is released under the Apache 2.0 License.
Please refer to the Hugging Face model card for detailed licensing and usage information.