Update README.md

README.md CHANGED

@@ -1,10 +1,10 @@
---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
- Qwen/Qwen3-VL-4B-Thinking
---
<a href="https://chat.qwenlm.ai/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>

@@ -75,6 +75,52 @@ Available in Dense and MoE architectures that scale from edge to cloud, with Ins

**Pure text performance**
![image](https://qianwen-res.oss-accelerate.aliyuncs.com/Qwen3-VL/qwen3vl-4b-8b-text.jpg)

## How to Use

To use these models with `llama.cpp`, please make sure you are running the **latest version**, either by [building it from source](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) or by downloading the most recent [release](https://github.com/ggml-org/llama.cpp/releases/tag/b6907) for your platform.
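
If you build from source, a typical CPU-only build looks like the sketch below; backend flags such as `-DGGML_CUDA=ON` are optional and hardware-dependent, so treat the exact options as assumptions to check against the build docs linked above:

```bash
# Clone and build llama.cpp; the resulting binaries land in build/bin/
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
```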

You can run inference via the command line or through a web-based chat interface.

### CLI Inference (`llama-mtmd-cli`)

For example, to run Qwen3-VL-4B-Thinking with an FP16 vision encoder and a Q8_0-quantized LLM:

```bash
llama-mtmd-cli \
    -m path/to/Qwen3VL-4B-Thinking-Q8_0.gguf \
    --mmproj path/to/mmproj-Qwen3VL-4B-Thinking-F16.gguf \
    --image test.jpeg \
    -p "What is the publisher name of the newspaper?" \
    --temp 1.0 --top-k 20 --top-p 0.95 -n 1024
```

### Web Chat (using `llama-server`)

To serve Qwen3-VL-235B-A22B-Instruct via an OpenAI-compatible API with a web UI:

```bash
llama-server \
    -m path/to/Qwen3VL-235B-A22B-Instruct-Q4_K_M-split-00001-of-00003.gguf \
    --mmproj path/to/mmproj-Qwen3VL-235B-A22B-Instruct-Q8_0.gguf
```

> **Tip**: For models split into multiple GGUF files, simply specify the first shard (e.g., `...-00001-of-00003.gguf`). llama.cpp will automatically load all parts.

Once the server is running, open your browser to `http://localhost:8080` to access the built-in chat interface, or send requests to the `/v1/chat/completions` endpoint. For more details, refer to the [official documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md).
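
As a concrete sketch of the API, a request following the OpenAI chat format might look like the example below. The model name, image URL, and default port here are placeholder assumptions; images can typically be passed as an `image_url` content part (a remote URL or a base64 data URI):

```bash
# Hypothetical request against a local llama-server instance
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-vl",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is the publisher name of the newspaper?"},
          {"type": "image_url", "image_url": {"url": "https://example.com/test.jpeg"}}
        ]
      }
    ]
  }'
```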

### Quantize Your Custom Model

You can further quantize the FP16 weights to other precision levels. For example, to quantize the model to 2-bit (the trailing `8` is the thread count):

```bash
# Quantize to 2-bit (IQ2_XXS) using 8 threads
llama-quantize \
    path/to/Qwen3VL-235B-A22B-Instruct-F16.gguf \
    path/to/Qwen3VL-235B-A22B-Instruct-IQ2_XXS.gguf \
    iq2_xxs 8
```
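
Very low-bit IQ quants generally preserve quality much better when guided by an importance matrix. As a hedged sketch (tool names and flags per recent llama.cpp builds, and the calibration file is a placeholder; verify both against the documentation linked below):

```bash
# Build an importance matrix from a calibration text file, then pass it
# to llama-quantize so the low-bit quantization is weighted by activation importance
llama-imatrix \
    -m path/to/Qwen3VL-235B-A22B-Instruct-F16.gguf \
    -f calibration.txt \
    -o imatrix.dat

llama-quantize \
    --imatrix imatrix.dat \
    path/to/Qwen3VL-235B-A22B-Instruct-F16.gguf \
    path/to/Qwen3VL-235B-A22B-Instruct-IQ2_XXS.gguf \
    iq2_xxs 8
```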

For a full list of supported quantization types and detailed instructions, refer to the [quantization documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/quantize/README.md).

### Generation Hyperparameters
#### VL