ShuaiBai623 committed
Commit a55c03c · verified · 1 Parent(s): 35c995b

Update README.md

Files changed (1):
  1. README.md +53 -7

README.md CHANGED
@@ -1,10 +1,10 @@
- ---
- license: apache-2.0
- pipeline_tag: image-text-to-text
- library_name: transformers
- base_model:
- - Qwen/Qwen3-VL-4B-Thinking
- ---
  <a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
9
  <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
10
  </a>
@@ -75,6 +75,52 @@ Available in Dense and MoE architectures that scale from edge to cloud, with Ins
75
  **Pure text performance**
76
  ![](https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-VL/qwen3vl_4b_8b_text_thinking.jpg)
77
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
78
 
79
  ### Generation Hyperparameters
80
  #### VL
 
+ ---
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ base_model:
+ - Qwen/Qwen3-VL-4B-Thinking
+ ---
  <a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
      <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
  </a>
@@ -75,6 +75,52 @@ Available in Dense and MoE architectures that scale from edge to cloud, with Ins
  **Pure text performance**
  ![](https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-VL/qwen3vl_4b_8b_text_thinking.jpg)

+ ## How to Use
+
+ To use these models with `llama.cpp`, please ensure you are using the **latest version**, either by [building from source](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) or by downloading the most recent [release](https://github.com/ggml-org/llama.cpp/releases/tag/b6907) for your platform.
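+
+ A minimal build sketch (CPU-only; see the build guide linked above for GPU backends and platform-specific options):
+
+ ```bash
+ # Clone and build llama.cpp; enable a GPU backend (e.g. -DGGML_CUDA=ON) if needed
+ git clone https://github.com/ggml-org/llama.cpp
+ cd llama.cpp
+ cmake -B build
+ cmake --build build --config Release -j
+ ```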
+
+ You can run inference via the command line or through a web-based chat interface.
+
+ ### CLI Inference (`llama-mtmd-cli`)
+
+ For example, to run Qwen3-VL-4B-Thinking with an FP16 vision encoder and a Q8_0-quantized LLM:
+
+ ```bash
+ llama-mtmd-cli \
+     -m path/to/Qwen3VL-4B-Thinking-Q8_0.gguf \
+     --mmproj path/to/mmproj-Qwen3VL-4B-Thinking-F16.gguf \
+     --image test.jpeg \
+     -p "What is the publisher name of the newspaper?" \
+     --temp 1.0 --top-k 20 --top-p 0.95 -n 1024
+ ```
+
+ ### Web Chat (using `llama-server`)
+
+ To serve Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI:
+
+ ```bash
+ llama-server \
+     -m path/to/Qwen3VL-235B-A22B-Instruct-Q4_K_M-split-00001-of-00003.gguf \
+     --mmproj path/to/mmproj-Qwen3VL-235B-A22B-Instruct-Q8_0.gguf
+ ```
+
+ > **Tip**: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.
+
+ Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the [official documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
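+
+ For example, a minimal text-only request to the endpoint could look like this (a sketch assuming the server is listening on the default port 8080):
+
+ ```bash
+ # OpenAI-compatible chat completion request against the local llama-server instance
+ curl http://localhost:8080/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "messages": [
+       {"role": "user", "content": "Give a one-sentence summary of the Qwen3-VL model family."}
+     ],
+     "temperature": 1.0,
+     "top_p": 0.95
+   }'
+ ```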
+
+ ### Quantize Your Custom Model
+
+ You can further quantize the FP16 weights to other precision levels. For example, to quantize the model to 2-bit:
+
+ ```bash
+ # Quantize to 2-bit (IQ2_XXS) using 8 threads
+ llama-quantize \
+     path/to/Qwen3VL-235B-A22B-Instruct-F16.gguf \
+     path/to/Qwen3VL-235B-A22B-Instruct-IQ2_XXS.gguf \
+     iq2_xxs 8
+ ```
+
+ For a full list of supported quantization types and detailed instructions, refer to the [quantization documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md).
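+
+ If you are starting from the original Hugging Face checkpoint rather than a published F16 GGUF, the conversion script shipped with llama.cpp can produce one first (a sketch; the exact paths and the level of support for this architecture depend on your llama.cpp version):
+
+ ```bash
+ # Convert the Hugging Face checkpoint to an F16 GGUF, then quantize as shown above
+ python convert_hf_to_gguf.py path/to/Qwen3-VL-235B-A22B-Instruct \
+     --outfile path/to/Qwen3VL-235B-A22B-Instruct-F16.gguf \
+     --outtype f16
+ ```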

  ### Generation Hyperparameters
  #### VL