Carnice-9b W8A16 AWQ

A W8A16 AWQ quantization (8-bit symmetric weights, FP16 activations) of kai-os/Carnice-9b, optimized for serving on Ampere GPUs (RTX 30-series) with vLLM.
How it works:

kai-os/Carnice-9b is a fine-tune of Qwen/Qwen3.5-9B that drops the vision components and uses the Qwen3_5ForCausalLM architecture, which vLLM does not support natively.

To work around this, this quantized checkpoint re-wraps the weights in the Qwen3_5ForConditionalGeneration architecture (matching the original Qwen/Qwen3.5-9B config), so vLLM can load it with --language-model-only and serve text-only inference.
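In practice, the re-wrap amounts to pointing the checkpoint's config back at the multimodal architecture class. A minimal sketch of the idea (the helper name is hypothetical, and the real checkpoint also carries over the other config entries from the base model):

```python
import json

# Hypothetical sketch: point a text-only checkpoint's config back at the
# original multimodal architecture class so vLLM's loader accepts it.
def rewrap_architecture(config: dict) -> dict:
    patched = dict(config)
    # Swap the unsupported causal-LM class for the one vLLM recognizes.
    patched["architectures"] = ["Qwen3_5ForConditionalGeneration"]
    return patched

# Usage: patch the checkpoint's config.json in place.
# with open("config.json") as f:
#     cfg = json.load(f)
# with open("config.json", "w") as f:
#     json.dump(rewrap_architecture(cfg), f, indent=2)
```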
Quantization details:
- Method: AWQ (Activation-aware Weight Quantization) via llm-compressor
- Weights: 8-bit (per-channel, symmetric)
- Activations: FP16
- Ignored layers: `linear_attn`, `lm_head`, `mtp`
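An llm-compressor run with these settings could be expressed roughly as the following recipe sketch. This is a reconstruction from the bullet points above, not the author's actual recipe, and the exact field values are assumptions:

```yaml
# Hypothetical llm-compressor recipe matching the settings above:
# AWQ, 8-bit symmetric per-channel weights, activations left in FP16.
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore: ["linear_attn", "lm_head", "mtp"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 8
            type: int
            symmetric: true
            strategy: channel
```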
Performance:

Tested on a dual RTX 3090 rig (48 GB total VRAM), single request:
- Avg prompt throughput: 8061.9 tokens/s
- Avg generation throughput: 92.9 tokens/s
Usage:

vLLM:

On one GPU:

```shell
# --language-model-only loads only the text parameters
vllm serve TurbulenceDeterministe/Caranice-9b-W8A16-AWQ \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --language-model-only \
  --tensor-parallel-size 1
```
On multiple GPUs (you need to install the Conch Triton kernels):

```shell
pip install conch-triton-kernels

# --language-model-only loads only the text parameters
vllm serve TurbulenceDeterministe/Caranice-9b-W8A16-AWQ \
  --max-model-len auto \
  --reasoning-parser qwen3 \
  --language-model-only \
  --tensor-parallel-size 2
```
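Either command exposes an OpenAI-compatible API on port 8000 by default. A minimal stdlib-only client sketch (the helper names are illustrative, not part of vLLM):

```python
import json
import urllib.request

MODEL = "TurbulenceDeterministe/Caranice-9b-W8A16-AWQ"

def build_payload(prompt: str) -> dict:
    """Build a chat-completions request body for the served model."""
    return {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST a prompt to the vLLM server and return the reply text."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (requires the server to be running):
# print(chat("Why is the sky blue?"))
```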