Qwen3.5-9B MLX TurboQuant TQ3

This repo packages the current best TurboQuant runtime recipe we have measured for mlx-community/Qwen3.5-9B-MLX-4bit: TurboQuant 3-bit KV-cache compression (TQ3) on Apple Silicon.

Short version:

  • this is not a new checkpoint
  • it is a reproducible inference overlay on top of the base MLX model
  • it ships the usage path, benchmark evidence, and artifacts in one place

What this repo is — and is not

This repo does:

  • point to the base model weights: mlx-community/Qwen3.5-9B-MLX-4bit
  • replace compatible runtime KVCache slots with TurboQuantKVCache(bits=3)
  • document the measured performance/memory tradeoff
  • provide a copy-paste runnable example

This repo does not:

  • publish modified model weights
  • claim that TurboQuant is a new checkpoint format
  • hide the fact that this is a runtime KV-cache method

That distinction matters. TurboQuant changes the inference-time cache representation, not the underlying model parameters.

Why use this?

For the measured Qwen3.5-9B MLX workload, TQ3 is the current best operating point on the speed/compression frontier:

  • consistently memory-positive
  • near FP16 speed at short context
  • increasingly compelling as context length grows

If you want the cleanest current recommendation rather than a menu of half-baked variants, this is it.

How to use

Requirements:

  • Apple Silicon
  • Python 3.10+
  • git (with Git LFS) or the hf CLI from huggingface_hub

Recommended hardware:

  • Apple Silicon with at least 32 GB unified memory
  • 64 GB is safer if you want comfortable long-context headroom
  • This package is aimed at local MLX inference, not CPU-only laptops
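To see why 3-bit cache storage matters for long-context headroom, here is a back-of-the-envelope footprint estimate. This is only a sketch: the layer/head/dimension values below are illustrative placeholders, not confirmed Qwen3.5-9B config values, and it ignores quantization scale overhead.

```python
# Rough KV-cache footprint estimator. The n_layers/n_kv_heads/head_dim values
# are illustrative placeholders, NOT confirmed Qwen3.5-9B config values.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value):
    # 2x for keys + values; bits -> bytes (scale/metadata overhead ignored)
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bits_per_value / 8

seq_len = 32_768
fp16 = kv_cache_bytes(seq_len, n_layers=36, n_kv_heads=8, head_dim=128, bits_per_value=16)
tq3 = kv_cache_bytes(seq_len, n_layers=36, n_kv_heads=8, head_dim=128, bits_per_value=3)
print(f"FP16 cache: {fp16 / 2**20:.0f} MiB, TQ3 cache: {tq3 / 2**20:.0f} MiB")
# -> FP16 cache: 4608 MiB, TQ3 cache: 864 MiB
```

The exact numbers depend on the real model config, but the 16-to-3 bit ratio is what buys the long-context headroom.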

Option 1: clone the full bundle from Hugging Face

git lfs install
git clone https://huggingface.co/alexcovo/qwen35-9b-mlx-turboquant-tq3
cd qwen35-9b-mlx-turboquant-tq3

python3 -m venv .venv
source .venv/bin/activate

pip install -U pip
pip install mlx mlx-lm
pip install -e .

python examples/run_qwen35_tq3.py

Option 2: download without git

hf download alexcovo/qwen35-9b-mlx-turboquant-tq3 --local-dir qwen35-9b-mlx-turboquant-tq3
cd qwen35-9b-mlx-turboquant-tq3

python3 -m venv .venv
source .venv/bin/activate

pip install -U pip
pip install mlx mlx-lm
pip install -e .

python examples/run_qwen35_tq3.py

The example script loads the model from the local repo directory, so the weights on disk are used directly.

Smoke test:

python -c "import mlx.core as mx; from pathlib import Path; from mlx_lm import load; from mlx_lm.models.cache import KVCache; from turboquant_mlx import TurboQuantKVCache, apply_patch; root=Path('.').resolve(); model, tokenizer = load(str(root)); apply_patch(); cache=[TurboQuantKVCache(bits=3) if isinstance(c, KVCache) else c for c in model.make_cache()]; logits=model(mx.array(tokenizer.encode('TurboQuant smoke test.'))[None], cache=cache); mx.eval(logits); token=mx.argmax(logits[:, -1, :], axis=-1); logits=model(token.reshape(1, 1), cache=cache); mx.eval(logits); print('TQ3 local bundle OK')"

Troubleshooting:

  • If git clone fetches pointer files instead of weights, run git lfs install first and clone again.
  • If hf is missing, install it with pip install -U "huggingface_hub[cli]".
  • If pip install -e . fails, make sure you are inside the downloaded repo root.
  • If MLX import fails, verify you are on Apple Silicon with a recent Python 3 environment.

Minimal usage

from pathlib import Path

import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import KVCache
from turboquant_mlx import TurboQuantKVCache, apply_patch

repo_root = Path('.').resolve()

model, tokenizer = load(str(repo_root))
apply_patch()

cache = []
for c in model.make_cache():
    cache.append(TurboQuantKVCache(bits=3) if isinstance(c, KVCache) else c)

prompt = "Explain why KV-cache compression matters for long-context decoding."
input_ids = mx.array(tokenizer.encode(prompt))[None]
logits = model(input_ids, cache=cache)
mx.eval(logits)

Full runnable example:

  • examples/run_qwen35_tq3.py

What changes under the hood?

TurboQuant compresses the KV cache at inference time using:

  • randomized Hadamard rotation
  • scalar quantization
  • bit-packed storage
  • fused Metal kernels on the hot path

The goal is simple: reduce KV-cache footprint while keeping decode throughput and output quality on the useful side of the tradeoff.
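The first two stages can be sketched in plain NumPy. This is an illustrative toy, not the shipped implementation: the real path uses a randomized rotation, bit-packed storage, and fused Metal kernels, whereas the sketch uses a plain (unrandomized) Hadamard transform and skips packing.

```python
import numpy as np

# Toy sketch of the TurboQuant stages: rotate -> 3-bit scalar quantize ->
# dequantize -> rotate back. A real randomized Hadamard rotation would also
# apply a random sign-flip diagonal before the transform.

def hadamard(n):
    # Sylvester construction; n must be a power of two. Normalized so H @ H.T = I.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_3bit(x):
    # Per-row symmetric scalar quantization to 8 signed levels (-4..3).
    scale = np.abs(x).max(axis=-1, keepdims=True) / 4.0
    q = np.clip(np.round(x / scale), -4, 3).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
head_dim = 64
kv = rng.standard_normal((16, head_dim))   # toy stand-in for KV rows

H = hadamard(head_dim)
rotated = kv @ H                           # spread outliers across the row
q, scale = quantize_3bit(rotated)          # q would be bit-packed in practice
recon = (q * scale) @ H.T                  # dequantize and rotate back

err = np.abs(recon - kv).mean()
print(f"mean abs reconstruction error: {err:.3f}")
```

The rotation matters because scalar quantization is cheapest when values are well spread: rotating first flattens per-row outliers, so a single scale per row wastes fewer of the 8 levels.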

Benchmark snapshot

Qwen3.5-9B publication-grade curve

Representative medians from the publication-grade sweep:

Prompt scale   FP16 tok/s   TQ4 tok/s   TQ3 tok/s   TQ4 cache   TQ3 cache
1.00x          52.90        51.58       51.67       56.2 MB     50.6 MB
1.50x          43.42        48.71       48.65       71.4 MB     63.0 MB
2.25x          47.97        45.74       48.26       94.1 MB     81.5 MB
3.00x          46.62        45.49       45.35       116.8 MB    100.1 MB

Interpretation:

  • TurboQuant is best understood here as a frontier extender: memory savings grow with context length while throughput stays near FP16.
  • TQ3 is the current best speed/compression operating point: it matches TQ4 throughput within noise while using a consistently smaller cache.
  • adaptive bit splits are not the preferred path on current evidence.
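As a quick sanity check on the cache columns, the TQ3-over-TQ4 savings implied by the table (values copied from above) grow with prompt scale:

```python
# Cache sizes (MB) from the benchmark table above: (TQ4, TQ3) per prompt scale.
rows = {
    "1.00x": (56.2, 50.6),
    "1.50x": (71.4, 63.0),
    "2.25x": (94.1, 81.5),
    "3.00x": (116.8, 100.1),
}
for scale, (tq4, tq3) in rows.items():
    saving = 100 * (1 - tq3 / tq4)
    print(f"{scale}: TQ3 uses {saving:.1f}% less cache than TQ4")
# savings grow monotonically, from ~10% at 1.00x to ~14% at 3.00x
```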

Files in this repo

  • examples/run_qwen35_tq3.py: minimal runnable example
  • benchmarks/README.md: benchmark summary
  • benchmarks/summary_rows.tsv: median benchmark table
  • benchmarks/raw_rows.tsv: raw benchmark rows
  • benchmarks/run_metadata.json: run metadata
  • assets/curve.png / assets/curve.svg: chart artifacts

Source repo

Implementation and ongoing development live here:

Benchmark driver used for this sweep:

  • benchmarks/qwen35_9b_publication_curve.py in the source repo

Honesty note

This repository is intentionally explicit: it packages a practical runtime recipe and the supporting evidence, not a new set of model weights.
