Qwen3.5-9B MLX TurboQuant TQ3
This repo packages the current best TurboQuant runtime recipe we have measured for mlx-community/Qwen3.5-9B-MLX-4bit: TurboQuant 3-bit KV-cache compression (TQ3) on Apple Silicon.
Short version:
- this is not a new checkpoint
- it is a reproducible inference overlay on top of the base MLX model
- it ships the usage path, benchmark evidence, and artifacts in one place
What this repo is — and is not
This repo does:
- point to the base model weights: mlx-community/Qwen3.5-9B-MLX-4bit
- replace compatible runtime KVCache slots with TurboQuantKVCache(bits=3)
- document the measured performance/memory tradeoff
- provide a copy-paste runnable example
This repo does not:
- publish modified model weights
- claim that TurboQuant is a new checkpoint format
- hide the fact that this is a runtime KV-cache method
That distinction matters. TurboQuant changes the inference-time cache representation, not the underlying model parameters.
Why use this?
For the measured Qwen3.5-9B MLX workload, TQ3 is the current best operating point on the speed/compression frontier:
- consistently memory-positive
- near FP16 speed at short context
- increasingly compelling as context length grows
If you want the cleanest current recommendation rather than a menu of half-baked variants, this is it.
How to use
Requirements:
- Apple Silicon
- Python 3.10+
- git or hf (the Hugging Face CLI)
Recommended hardware:
- Apple Silicon with at least 32 GB unified memory
- 64 GB is safer if you want comfortable long-context headroom
- This package is aimed at local MLX inference, not CPU-only laptops
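Before creating the virtual environment, a quick pre-flight check (a small sketch, not part of the packaged scripts) can confirm the requirements above:

```python
import platform
import sys

# Pre-flight check: TurboQuant's Metal kernels need Apple Silicon
# (arm64 macOS), and the examples assume Python 3.10+.
machine = platform.machine()
is_apple_silicon = platform.system() == "Darwin" and machine == "arm64"
python_ok = sys.version_info >= (3, 10)
print(f"arch={machine} apple_silicon={is_apple_silicon} python>=3.10: {python_ok}")
```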
Option 1: clone the full bundle from Hugging Face
git lfs install
git clone https://huggingface.co/alexcovo/qwen35-9b-mlx-turboquant-tq3
cd qwen35-9b-mlx-turboquant-tq3
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install mlx mlx-lm
pip install -e .
python examples/run_qwen35_tq3.py
Option 2: download without git
hf download alexcovo/qwen35-9b-mlx-turboquant-tq3 --local-dir qwen35-9b-mlx-turboquant-tq3
cd qwen35-9b-mlx-turboquant-tq3
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install mlx mlx-lm
pip install -e .
python examples/run_qwen35_tq3.py
The example script loads the model from the local repo directory, so the weights on disk are used directly.
Smoke test:
python -c "import mlx.core as mx; from pathlib import Path; from mlx_lm import load; from mlx_lm.models.cache import KVCache; from turboquant_mlx import TurboQuantKVCache, apply_patch; root=Path('.').resolve(); model, tokenizer = load(str(root)); apply_patch(); cache=[TurboQuantKVCache(bits=3) if isinstance(c, KVCache) else c for c in model.make_cache()]; logits=model(mx.array(tokenizer.encode('TurboQuant smoke test.'))[None], cache=cache); mx.eval(logits); token=mx.argmax(logits[:, -1, :], axis=-1); logits=model(token.reshape(1, 1), cache=cache); mx.eval(logits); print('TQ3 local bundle OK')"
Troubleshooting:
- If git clone fetches pointer files instead of weights, run git lfs install first and clone again.
- If hf is missing, install it with pip install -U "huggingface_hub[cli]".
- If pip install -e . fails, make sure you are inside the downloaded repo root.
- If MLX import fails, verify you are on Apple Silicon with a recent Python 3 environment.
Minimal usage
from pathlib import Path
import mlx.core as mx
from mlx_lm import load
from mlx_lm.models.cache import KVCache
from turboquant_mlx import TurboQuantKVCache, apply_patch
repo_root = Path('.').resolve()
model, tokenizer = load(str(repo_root))
apply_patch()
cache = []
for c in model.make_cache():
cache.append(TurboQuantKVCache(bits=3) if isinstance(c, KVCache) else c)
prompt = "Explain why KV-cache compression matters for long-context decoding."
input_ids = mx.array(tokenizer.encode(prompt))[None]
logits = model(input_ids, cache=cache)
mx.eval(logits)
Full runnable example:
examples/run_qwen35_tq3.py
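The snippet above stops after the prefill forward pass. Decoding continues by feeding each sampled token back in with the same cache list, exactly as the smoke test does for one step. Here is a toy sketch of that loop shape, with a stub standing in for the real model call so it runs without weights (the stub's names and values are illustrative, not part of the package):

```python
import random

random.seed(0)
vocab_size, eos_id, max_new = 16, 3, 32   # toy values for the stub

def model_step(token_id, cache):
    # Stub for model(token.reshape(1, 1), cache=cache): the real call
    # returns logits and mutates the per-layer KV caches in place.
    cache.append(token_id)
    return [random.gauss(0.0, 1.0) for _ in range(vocab_size)]

cache = []
token = 5                                  # last prompt token after prefill
generated = []
for _ in range(max_new):
    logits = model_step(token, cache)
    token = max(range(vocab_size), key=logits.__getitem__)  # greedy argmax
    if token == eos_id:
        break
    generated.append(token)

print(f"generated {len(generated)} tokens before EOS/limit")
```

With the real model, only `model_step` changes: the loop structure, the shared `cache` list, and the greedy argmax are the same as in the smoke test.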
What changes under the hood?
TurboQuant compresses the KV cache at inference time using:
- randomized Hadamard rotation
- scalar quantization
- bit-packed storage
- fused Metal kernels on the hot path
The goal is simple: reduce KV-cache footprint while keeping decode throughput and output quality on the useful side of the tradeoff.
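As an illustration of the first three steps, here is a toy NumPy sketch on a single 8-element vector (purely illustrative: the real implementation runs as fused Metal kernels over whole cache tensors, and its exact scaling scheme may differ):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    # Sylvester construction: orthonormal n x n Hadamard matrix, n a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)                     # stand-in for one cached K/V vector
sign = rng.choice([-1.0, 1.0], size=d)     # random signs make the rotation "randomized"
H = hadamard(d)
rot = H @ (sign * x)                       # randomized Hadamard rotation

# Scalar quantization at 3 bits: 2**3 = 8 uniform levels over the rotated range.
levels = 2 ** 3
zero = rot.min()
scale = (rot.max() - zero) / (levels - 1)
q = np.round((rot - zero) / scale).astype(np.uint8)   # codes in [0, 7]

# Bit-packing: 8 codes x 3 bits = 24 bits = 3 bytes of storage.
bits = ((q[:, None] >> np.arange(2, -1, -1)) & 1).astype(np.uint8)
packed = np.packbits(bits)

# Dequantize and undo the rotation to recover an approximation of x.
x_hat = sign * (H.T @ (q * scale + zero))
err = np.abs(x - x_hat).max()
print(f"packed bytes: {packed.size}, max abs reconstruction error: {err:.3f}")
```

The rotation spreads outliers across all dimensions before quantizing, which is what lets a uniform 3-bit code work well; the orthonormal inverse then maps the dequantized values back.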
Benchmark snapshot
Representative medians from the publication-grade sweep:
| Prompt scale | FP16 tok/s | TQ4 tok/s | TQ3 tok/s | TQ4 cache | TQ3 cache |
|---|---|---|---|---|---|
| 1.00x | 52.90 | 51.58 | 51.67 | 56.2 MB | 50.6 MB |
| 1.50x | 43.42 | 48.71 | 48.65 | 71.4 MB | 63.0 MB |
| 2.25x | 47.97 | 45.74 | 48.26 | 94.1 MB | 81.5 MB |
| 3.00x | 46.62 | 45.49 | 45.35 | 116.8 MB | 100.1 MB |
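One way to read the table, with the medians re-typed from above (derived ratios only, no new measurements):

```python
# Medians copied from the benchmark table:
# (prompt scale, FP16 tok/s, TQ3 tok/s, TQ4 cache MB, TQ3 cache MB)
rows = [
    (1.00, 52.90, 51.67, 56.2, 50.6),
    (1.50, 43.42, 48.65, 71.4, 63.0),
    (2.25, 47.97, 48.26, 94.1, 81.5),
    (3.00, 46.62, 45.35, 116.8, 100.1),
]
for scale, fp16, tq3, tq4_mb, tq3_mb in rows:
    speed_ratio = tq3 / fp16             # TQ3 throughput relative to FP16
    cache_saving = 1 - tq3_mb / tq4_mb   # cache shrink relative to TQ4
    print(f"{scale:.2f}x: TQ3 at {speed_ratio:.1%} of FP16 speed, "
          f"cache {cache_saving:.1%} smaller than TQ4")
```

At every scale TQ3's cache is roughly 10-14% smaller than TQ4's while staying within a few percent of FP16 throughput, and ahead of it at 1.50x.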
Interpretation:
- TurboQuant is best understood here as a frontier extender first.
- TQ3 is the current best speed/compression operating point.
- adaptive splits are not the preferred path on current evidence.
Files in this repo
- examples/run_qwen35_tq3.py: minimal runnable example
- benchmarks/README.md: benchmark summary
- benchmarks/summary_rows.tsv: median benchmark table
- benchmarks/raw_rows.tsv: raw benchmark rows
- benchmarks/run_metadata.json: run metadata
- assets/curve.png / assets/curve.svg: chart artifacts
Source repo
Implementation and ongoing development live here:
Benchmark driver used for this sweep:
benchmarks/qwen35_9b_publication_curve.py in the source repo
Honesty note
This repository is intentionally explicit: it packages a practical runtime recipe and the supporting evidence, not a new set of model weights.