Qwen3.5-9B MLX TurboQuant TQ3
This repo packages the current best TurboQuant runtime recipe we have measured for mlx-community/Qwen3.5-9B-MLX-4bit: TurboQuant 3-bit KV-cache compression (TQ3) on Apple Silicon.
Short version:
- this is not a new checkpoint
- it is a reproducible inference overlay on top of the base MLX model
- it ships the usage path, benchmark evidence, and artifacts in one place
What this repo is — and is not
This repo does:
- point to the base model weights: mlx-community/Qwen3.5-9B-MLX-4bit
- replace compatible runtime KVCache slots with TurboQuantKVCache(bits=3)
- document the measured performance/memory tradeoff
- provide a copy-paste runnable example
This repo does not:
- publish modified model weights
- claim that TurboQuant is a new checkpoint format
- hide the fact that this is a runtime KV-cache method
That distinction matters. TurboQuant changes the inference-time cache representation, not the underlying model parameters.
Why use this?
For the measured Qwen3.5-9B MLX workload, TQ3 is the current best operating point on the speed/compression frontier:
- consistently memory-positive
- near FP16 speed at short context
- increasingly compelling as context length grows
If you want the cleanest current recommendation rather than a menu of half-baked variants, this is it.
How to use
Requirements:
- Apple Silicon
- Python 3.10+
- git or hf (the Hugging Face CLI)
Recommended hardware:
- Apple Silicon with at least 32 GB unified memory
- 64 GB is safer if you want comfortable long-context headroom
- This package is aimed at local MLX inference, not CPU-only laptops
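Before creating the virtual environment, a quick pre-flight check (a small sketch, not part of the packaged scripts) can confirm the requirements above:

```python
import platform
import sys

# Pre-flight check: TurboQuant's Metal kernels need Apple Silicon
# (arm64 macOS), and the examples assume Python 3.10+.
machine = platform.machine()
is_apple_silicon = platform.system() == "Darwin" and machine == "arm64"
python_ok = sys.version_info >= (3, 10)
print(f"arch={machine} apple_silicon={is_apple_silicon} python>=3.10: {python_ok}")
```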
Option 1: clone the full bundle from Hugging Face
git lfs install
git clone https://huggingface.co/alexcovo/qwen35-9b-mlx-turboquant-tq3
cd qwen35-9b-mlx-turboquant-tq3
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install mlx mlx-lm
pip install -e .
python examples/run_qwen35_tq3.py
Option 2: download without git
hf download alexcovo/qwen35-9b-mlx-turboquant-tq3 --local-dir qwen35-9b-mlx-turboquant-tq3
cd qwen35-9b-mlx-turboquant-tq3
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install mlx mlx-lm
pip install -e .
python examples/run_qwen35_tq3.py
The example script loads the model from the local repo directory, so the weights on disk are used directly.
Smoke test:
python -c "import mlx.core as mx; from pathlib import Path; from mlx_lm import load; from mlx_lm.models.cache import KVCache; from turboquant_mlx import TurboQuantKVCache, apply_patch; root=Path('.').resolve(); model, tokenizer = load(str(root)); apply_patch(); cache=[TurboQuantKVCache(bits=3) if isinstance(c, KVCache) else c for c in model.make_cache()]; logits=model(mx.array(tokenizer.encode('TurboQuant smoke test.'))[None], cache=cache); mx.eval(logits); token=mx.argmax(logits[:, -1, :], axis=-1); logits=model(token.reshape(1, 1), cache=cache); mx.eval(logits); print('TQ3 local bundle OK')"
Troubleshooting:
- If git clone fetches pointer files instead of weights, run git lfs install first and clone again.
- If hf is missing, install it with pip install -U "huggingface_hub[cli]".
- If pip install -e . fails, make sure you are inside the downloaded repo root.
- If MLX import fails, verify you are on Apple Silicon with a recent Python 3 environment.
Minimal usage
from pathlib import Path
import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import KVCache
from turboquant_mlx import TurboQuantKVCache, apply_patch
repo_root = Path('.').resolve()
model, tokenizer = load(str(repo_root))
apply_patch()
cache = []
for c in model.make_cache():
cache.append(TurboQuantKVCache(bits=3) if isinstance(c, KVCache) else c)
prompt = "Explain why KV-cache compression matters for long-context decoding."
input_ids = mx.array(tokenizer.encode(prompt))[None]
logits = model(input_ids, cache=cache)
mx.eval(logits)
Full runnable example:
examples/run_qwen35_tq3.py
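The snippet above stops after the prefill forward pass. Decoding continues by feeding each sampled token back in with the same cache list, exactly as the smoke test does for one step. Here is a toy sketch of that loop shape, with a stub standing in for the real model call so it runs without weights (the stub's names and values are illustrative, not part of the package):

```python
import random

random.seed(0)
vocab_size, eos_id, max_new = 16, 3, 32   # toy values for the stub

def model_step(token_id, cache):
    # Stub for model(token.reshape(1, 1), cache=cache): the real call
    # returns logits and mutates the per-layer KV caches in place.
    cache.append(token_id)
    return [random.gauss(0.0, 1.0) for _ in range(vocab_size)]

cache = []
token = 5                                  # last prompt token after prefill
generated = []
for _ in range(max_new):
    logits = model_step(token, cache)
    token = max(range(vocab_size), key=logits.__getitem__)  # greedy argmax
    if token == eos_id:
        break
    generated.append(token)

print(f"generated {len(generated)} tokens before EOS/limit")
```

With the real model, only `model_step` changes: the loop structure, the shared `cache` list, and the greedy argmax are the same as in the smoke test.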
What changes under the hood?
TurboQuant compresses the KV cache at inference time using:
- randomized Hadamard rotation
- scalar quantization
- bit-packed storage
- fused Metal kernels on the hot path
The goal is simple: reduce KV-cache footprint while keeping decode throughput and output quality on the useful side of the tradeoff.
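As an illustration of the first three steps, here is a toy NumPy sketch on a single 8-element vector (purely illustrative: the real implementation runs as fused Metal kernels over whole cache tensors, and its exact scaling scheme may differ):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction: orthonormal n x n Hadamard matrix, n a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)                     # stand-in for one cached K/V vector
sign = rng.choice([-1.0, 1.0], size=d)     # random signs make the rotation "randomized"
H = hadamard(d)
rot = H @ (sign * x)                       # randomized Hadamard rotation

# Scalar quantization at 3 bits: 2**3 = 8 uniform levels over the rotated range.
levels = 2 ** 3
zero = rot.min()
scale = (rot.max() - zero) / (levels - 1)
q = np.round((rot - zero) / scale).astype(np.uint8)   # codes in [0, 7]

# Bit-packing: 8 codes x 3 bits = 24 bits = 3 bytes of storage.
bits = ((q[:, None] >> np.arange(2, -1, -1)) & 1).astype(np.uint8)
packed = np.packbits(bits)

# Dequantize and undo the rotation to recover an approximation of x.
x_hat = sign * (H.T @ (q * scale + zero))
err = np.abs(x - x_hat).max()
print(f"packed bytes: {packed.size}, max abs reconstruction error: {err:.3f}")
```

The rotation spreads outliers across all dimensions before quantizing, which is what lets a uniform 3-bit code work well; the orthonormal inverse then maps the dequantized values back.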
Benchmark snapshot
Representative medians from the publication-grade sweep:
| Prompt scale | FP16 tok/s | TQ4 tok/s | TQ3 tok/s | TQ4 cache | TQ3 cache |
|---|---|---|---|---|---|
| 1.00x | 52.90 | 51.58 | 51.67 | 56.2 MB | 50.6 MB |
| 1.50x | 43.42 | 48.71 | 48.65 | 71.4 MB | 63.0 MB |
| 2.25x | 47.97 | 45.74 | 48.26 | 94.1 MB | 81.5 MB |
| 3.00x | 46.62 | 45.49 | 45.35 | 116.8 MB | 100.1 MB |
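One way to read the table, with the medians re-typed from above (derived ratios only, no new measurements):

```python
# Medians copied from the benchmark table:
# (prompt scale, FP16 tok/s, TQ3 tok/s, TQ4 cache MB, TQ3 cache MB)
rows = [
    (1.00, 52.90, 51.67, 56.2, 50.6),
    (1.50, 43.42, 48.65, 71.4, 63.0),
    (2.25, 47.97, 48.26, 94.1, 81.5),
    (3.00, 46.62, 45.35, 116.8, 100.1),
]
for scale, fp16, tq3, tq4_mb, tq3_mb in rows:
    speed_ratio = tq3 / fp16             # TQ3 throughput relative to FP16
    cache_saving = 1 - tq3_mb / tq4_mb   # cache shrink relative to TQ4
    print(f"{scale:.2f}x: TQ3 at {speed_ratio:.1%} of FP16 speed, "
          f"cache {cache_saving:.1%} smaller than TQ4")
```

At every scale TQ3's cache is roughly 10-14% smaller than TQ4's while staying within a few percent of FP16 throughput, and ahead of it at 1.50x.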
Interpretation:
- TurboQuant is best understood here as a frontier extender first.
- TQ3 is the current best speed/compression operating point.
- adaptive splits are not the preferred path on current evidence.
Files in this repo
- examples/run_qwen35_tq3.py: minimal runnable example
- benchmarks/README.md: benchmark summary
- benchmarks/summary_rows.tsv: median benchmark table
- benchmarks/raw_rows.tsv: raw benchmark rows
- benchmarks/run_metadata.json: run metadata
- assets/curve.png / assets/curve.svg: chart artifacts
Source repo
Implementation and ongoing development live here:
Benchmark driver used for this sweep:
benchmarks/qwen35_9b_publication_curve.py in the source repo
Honesty note
This repository is intentionally explicit: it packages a practical runtime recipe and the supporting evidence, not a new set of model weights.