---
license: mit
datasets:
- liumindmind/NekoQA-10K
language:
- zh
base_model:
- hhzm/qwen3-14b-meow
pipeline_tag: text-generation
---

# **DISCLAIMER: This model is an experimental project by a beginner in fine-tuning. Output quality is not guaranteed, so please do not use it for production or professional work.**

Install a recent version of vLLM to serve this model:

```sh
pip install "vllm>=0.8.5"
```
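
Optionally, a quick sanity check that the installed version meets the minimum:

```sh
# optional: print the installed vLLM version
python -c "import vllm; print(vllm.__version__)"
```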

Use `--enable-auto-tool-choice --tool-call-parser hermes` to enable tool calling.

```sh
# enable reasoning
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser hermes

# disable reasoning
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 --chat-template qwen3-14b-meow-gptq-w8a8/qwen3_nonthinking.jinja --enable-auto-tool-choice --tool-call-parser hermes
```
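
For reference, here is a minimal sketch of a request against the resulting OpenAI-compatible endpoint, assuming vLLM's default port 8000 (adjustable with `--port`):

```sh
# minimal sketch: chat request to the server started above (assumes default port 8000)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hhzm/qwen3-14b-meow-gptq-w8a8",
    "messages": [{"role": "user", "content": "你好喵！"}]
  }'
```

With `--reasoning-parser qwen3`, vLLM should return the model's thinking separately from the final answer (as a `reasoning_content` field in the message); with the tool-calling flags set, tool definitions can be passed in the standard OpenAI `tools` field.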

For a longer context window (more than 40960 tokens), use YaRN rope scaling; the scaling factor is adjustable.

The environment variable `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` (already set in the commands above) is required to enable context lengths greater than 40960.

```sh
# enable YaRN rope scaling
--rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max_model_len 131072
```
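
Put together with the earlier command, a long-context launch with reasoning enabled would look like this (the same flags from above, just combined):

```sh
# serve with reasoning and a 131072-token YaRN-scaled context window
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 \
  --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser hermes \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max_model_len 131072
```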

This model is expected to be compatible with older Volta and Turing generation GPUs, since it was trained in FP16 with FlashAttention-2 disabled.
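
On those GPUs, which lack native bfloat16 support, it may also be worth pinning the dtype explicitly at serve time; this sketch uses vLLM's standard `--dtype` flag:

```sh
# minimal sketch: force FP16 on pre-Ampere GPUs without bfloat16 support
vllm serve hhzm/qwen3-14b-meow-gptq-w8a8 --dtype float16 \
  --enable-auto-tool-choice --tool-call-parser hermes
```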