DISCLAIMER: This model is an experimental project by a beginner in fine-tuning. Output quality is not guaranteed, so please do not use it for production or professional work.
```
pip install "vllm>=0.8.5"
```

Use `--enable-auto-tool-choice --tool-call-parser hermes` to enable tool calling.
```bash
# enable reasoning
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser hermes

# disable reasoning
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 --chat-template qwen3-14b-meow-gptq-w8a8/qwen3_nonthinking.jinja --enable-auto-tool-choice --tool-call-parser hermes
```
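Once the server is up, tool calling can be exercised through vLLM's OpenAI-compatible API. A minimal sketch, assuming the server runs on `localhost:8000`; `get_weather` is a hypothetical tool used only for illustration:

```bash
# Minimal client sketch against the server started above.
# "get_weather" is a hypothetical tool, not part of the model.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hhzm/qwen3-14b-meow-gptq-w8a8",
    "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
# With --reasoning-parser qwen3, the response message carries a separate
# "reasoning_content" field; hermes-parsed tool calls appear in "tool_calls".
```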
For a context window longer than 40,960 tokens, use YaRN RoPE scaling; the scaling factor is adjustable. The environment variable `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` is required to enable context lengths greater than 40,960.
```bash
# enable YaRN rope scaling
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max_model_len 131072
```
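For example, the reasoning-enabled serve command above can be extended with these flags (a sketch combining the commands in this card; factor 4.0 on a 32,768-token base gives a 131,072-token window):

```bash
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 \
  --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser hermes \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max_model_len 131072
```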
This model is expected to be compatible with older Volta and Turing generation GPUs, since it was trained with FlashAttention-2 disabled and in FP16.
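Volta and Turing GPUs lack bfloat16 support, so forcing FP16 at serve time may be needed there. A minimal sketch, assuming the standard vLLM `--dtype` option (this command is not from this card):

```bash
# Force FP16 on pre-Ampere GPUs (Volta/Turing), which do not support bfloat16.
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 \
  --dtype float16 --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser hermes
```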
Model tree for hhzm/qwen3-14b-meow-gptq-w8a8:
- Base model: Qwen/Qwen3-14B-Base
- Finetuned: Qwen/Qwen3-14B
- Finetuned: mlabonne/Qwen3-14B-abliterated
- Finetuned: hhzm/qwen3-14b-meow