Carnice-9b W8A16 AWQ

8-bit symmetric AWQ quantization of kai-os/Carnice-9b, optimized for Ampere GPUs (RTX 30-series) with vLLM.

How it works:

kai-os/Carnice-9b is a fine-tune of Qwen/Qwen3.5-9B that drops the visual components and uses the Qwen3_5ForCausalLM architecture, which vLLM does not natively support.

As a workaround, this quantized checkpoint re-wraps the weights into the original Qwen3_5ForConditionalGeneration architecture (matching the Qwen/Qwen3.5-9B config), so vLLM can load it with --language-model-only and serve text-only inference.
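
The re-wrap is, at its core, a config-level change. A minimal, hypothetical sketch of the idea (a standalone toy example — the real checkpoint's config.json contains many more fields):

```shell
# Hypothetical illustration of the architecture re-wrap:
# point the checkpoint's config back at the conditional-generation
# class so vLLM's Qwen3.5 loader accepts it with --language-model-only.
cat > config.json <<'EOF'
{"architectures": ["Qwen3_5ForCausalLM"]}
EOF

python3 - <<'EOF'
import json

with open("config.json") as f:
    cfg = json.load(f)

# Swap the text-only class for the original multimodal wrapper class.
cfg["architectures"] = ["Qwen3_5ForConditionalGeneration"]

with open("config.json", "w") as f:
    json.dump(cfg, f, indent=2)
EOF

cat config.json
```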

Quantization details:

  • Method: AWQ (Activation-aware Weight Quantization) via llm-compressor
  • Bits: 8 (per-channel, symmetric)
  • Activations: FP16
  • Ignored layers: linear_attn, lm_head, mtp
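
For reference, a recipe with these settings might look roughly like the following in llm-compressor's YAML recipe format. This is a hypothetical sketch, not the recipe actually used for this checkpoint: the ignore regexes and group layout are assumptions based on the bullet list above.

```yaml
# Hypothetical llm-compressor recipe sketch: W8A16 AWQ,
# per-channel symmetric INT8 weights, FP16 activations untouched.
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore: ["lm_head", "re:.*linear_attn.*", "re:.*mtp.*"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 8
            type: int
            symmetric: true
            strategy: channel
```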

Performance:

Tested on a dual RTX 3090 rig (48 GB total VRAM) with a single request: average prompt throughput 8061.9 tokens/s, average generation throughput 92.9 tokens/s.

Usage:

vLLM:

On one GPU:

# --language-model-only loads only the text parameters
vllm serve TurbulenceDeterministe/Carnice-9b-W8A16-AWQ \
    --max-model-len auto \
    --reasoning-parser qwen3 \
    --language-model-only \
    --tensor-parallel-size 1
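
Once the server is up, you can query it through vLLM's OpenAI-compatible HTTP API. The example below assumes vLLM's default bind address and port (localhost:8000); adjust if you pass --host/--port:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "TurbulenceDeterministe/Carnice-9b-W8A16-AWQ",
        "messages": [{"role": "user", "content": "Say hello in one word."}],
        "max_tokens": 32
      }'
```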

On multiple GPUs (requires the Conch Triton kernels):

pip install conch-triton-kernels

# --language-model-only loads only the text parameters
vllm serve TurbulenceDeterministe/Carnice-9b-W8A16-AWQ \
    --max-model-len auto \
    --reasoning-parser qwen3 \
    --language-model-only \
    --tensor-parallel-size 2
Safetensors · 8B params · tensor types: I64, I32, BF16

Model tree for TurbulenceDeterministe/Carnice-9b-W8A16-AWQ

  • Finetuned from: Qwen/Qwen3.5-9B
  • Quantized: 2 models (including this one)