---
license: mit
datasets:
- liumindmind/NekoQA-10K
language:
- zh
base_model:
- hhzm/qwen3-14b-meow
pipeline_tag: text-generation
---

# **DISCLAIMER: This model is an experimental project by a beginner in fine-tuning. Output quality is not guaranteed, so please do not use it for production or professional work.**

Install a recent version of vLLM to serve this model:

```sh
pip install "vllm>=0.8.5"
```
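
Optionally, a quick sanity check that the installed version meets the minimum:

```sh
# optional: print the installed vLLM version
python -c "import vllm; print(vllm.__version__)"
```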

Use `--enable-auto-tool-choice --tool-call-parser hermes` to enable tool calling.

```sh
# enable reasoning
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser hermes

# disable reasoning
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 --chat-template qwen3-14b-meow-gptq-w8a8/qwen3_nonthinking.jinja --enable-auto-tool-choice --tool-call-parser hermes
```
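
For reference, here is a minimal sketch of a request against the resulting OpenAI-compatible endpoint, assuming vLLM's default port 8000 (adjustable with `--port`):

```sh
# minimal sketch: chat request to the server started above (assumes default port 8000)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hhzm/qwen3-14b-meow-gptq-w8a8",
    "messages": [{"role": "user", "content": "你好喵！"}]
  }'
```

With `--reasoning-parser qwen3`, vLLM should return the model's thinking separately from the final answer (as a `reasoning_content` field in the message); with the tool-calling flags set, tool definitions can be passed in the standard OpenAI `tools` field.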

For a longer context window (more than 40960 tokens), use YaRN rope scaling; the scaling factor is adjustable.

The environment variable `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` (already set in the commands above) is required to enable context lengths greater than 40960.

```sh
# enable YaRN rope scaling
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max_model_len 131072
```
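
Put together with the earlier command, a long-context launch with reasoning enabled would look like this (the same flags from above, just combined):

```sh
# serve with reasoning and a 131072-token YaRN-scaled context window
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 \
  --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser hermes \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max_model_len 131072
```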

This model is expected to be compatible with older Volta and Turing generation GPUs, since it was trained in FP16 with FlashAttention-2 disabled.
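
On those GPUs, which lack native bfloat16 support, it may also be worth pinning the dtype explicitly at serve time; this sketch uses vLLM's standard `--dtype` flag:

```sh
# minimal sketch: force FP16 on pre-Ampere GPUs without bfloat16 support
vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 --dtype float16 \
  --enable-auto-tool-choice --tool-call-parser hermes
```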