Gemma 4 31B-it — TurboQuant+ GGUF
Best quality you can realistically run on constrained hardware. Better quality than Q4_K_M while still fitting in memory where Q8_0 does not.
Q8_0 is often too large. Q4_K_M fits but drops quality. Config-I also fits and stays slightly closer to Q8_0 output (median KLD 0.125 vs 0.132 for Q4_K_M).
TurboQuant+ Config-I quantization of google/gemma-4-31b-it. Config-I applies WHT-domain compression (TQ4_1S) to attention and gate/up tensors while keeping boundary layers and ffn_down at higher precision for optimal quality. See the getting started guide for details.
Requires the TurboQuant+ llama.cpp fork at tag tqp-v0.1.0. Will NOT work with stock llama.cpp. TurboQuant+ is an independent research project. These quantization types have not been merged into upstream ggml-org/llama.cpp. Do not file issues there for TQ models.
Files
| File | Quant | Size | vs Q8_0 | KLD vs Q8_0 | KLD vs Q4_K_M |
|---|---|---|---|---|---|
| Gemma-4-31B-it-Config-I.gguf | Config-I | 18.9 GB | 62% of Q8_0 (30.4 GB) | 0.125 | 5% better |
Why Config-I over Q4_K_M?
Config-I is ~9% larger than Q4_K_M (18.9 GB vs 17.4 GB) but delivers measurably better quality: it is closer to Q8_0 output than Q4_K_M on both metrics below.
| Quant | Size | Median KLD | Same top token |
|---|---|---|---|
| Q8_0 | 30.4 GB | baseline | baseline |
| Q4_K_M | 17.4 GB | 0.132 | 74.2% |
| Config-I | 18.9 GB | 0.125 | 74.9% |
Lower KLD = closer to Q8_0 output. Higher same-top-token = more likely to pick the same token as the Q8_0 baseline.
Note: Standard wikitext PPL produces invalid numbers on Gemma 4 due to tokenizer differences. KL-divergence against Q8_0 logits is used instead, which directly measures how much quantization changes model outputs.
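To make the two metrics concrete, here is a minimal Python sketch of how a per-token KL-divergence and a same-top-token rate can be computed from two sets of logits. It is an illustration only, not the tooling used to produce the numbers in this card: the function names, the toy logits, and the choice of natural-log units are assumptions.

```python
import math
import statistics

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) for one token position, in nats (assumed unit)."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def same_top_token(ref_logits, test_logits):
    """True when both models rank the same token first."""
    argmax = lambda xs: max(range(len(xs)), key=xs.__getitem__)
    return argmax(ref_logits) == argmax(test_logits)

def summarize(ref_rows, test_rows):
    """Median KLD and top-token agreement over all token positions."""
    klds = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    hits = [same_top_token(r, t) for r, t in zip(ref_rows, test_rows)]
    return statistics.median(klds), sum(hits) / len(hits)

# Toy logits over a 3-token vocabulary (hypothetical values).
ref_rows = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.0]]    # stands in for Q8_0
test_rows = [[1.9, 1.2, 0.0], [2.5, 0.5, 0.0]]   # stands in for a smaller quant
median_kld, agreement = summarize(ref_rows, test_rows)
print(f"median KLD {median_kld:.3f}, same top token {agreement:.1%}")
```

A median is reported rather than a mean so that a few positions with very large divergence do not dominate the summary.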
Compatibility
| Field | Value |
|---|---|
| Fork | TheTom/llama-cpp-turboquant |
| Tag | tqp-v0.1.0 |
| Backends | Metal, CUDA, ROCm/HIP, Vulkan |
| Quantized on | 2026-04-08 |
No forward-compatibility guarantee. This model was built and validated against the tag above. Future fork updates may change the format. If decode produces garbage, rebuild from this tag.
Benchmarks (Apple M5 Max 128GB, Metal)
Speed
On Metal, Config-I trades decode speed for better quality than Q4_K_M at a much smaller memory footprint than Q8_0. CUDA performance has not yet been validated for this model; based on other TQ4_1S results, it is expected to be higher than on Metal.
Metal (Apple M5 Max 128GB, recommended -ctk q8_0 -ctv turbo4)
| Config | pp512 | pp2048 | tg128 | Size |
|---|---|---|---|---|
| Q8_0 baseline (f16 KV) | 488 t/s | 389 t/s | 15.5 t/s | 30.4 GB |
| Q4_K_M (f16 KV) | 468 t/s | — | 24.4 t/s | 17.4 GB |
| Config-I + turbo4 KV | 356 t/s | 310 t/s | 9.6 t/s | 18.9 GB (KV cache also smaller) |
Config-I decodes ~2.5x slower than Q4_K_M on Metal (9.6 vs 24.4 t/s). The value here is quality, not speed. If you need maximum throughput, use Q4_K_M. If you need the best quality that fits in memory, use Config-I.
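The selection rule above can be sketched as a small helper. The sizes, KLD values, and tg128 speeds are copied from the tables in this card; the `pick_quant` function and its `headroom_gb` parameter are hypothetical, and the 2 GB default is only a placeholder for KV cache and runtime overhead, not a measured figure.

```python
# Figures from this card (Apple M5 Max, Metal). Q8_0 is the KLD baseline,
# so its divergence from itself is 0 by definition. Note that the Config-I
# speed was measured with turbo4 KV, the other two with f16 KV.
QUANTS = [
    {"name": "Q8_0", "size_gb": 30.4, "median_kld": 0.0, "tg128": 15.5},
    {"name": "Q4_K_M", "size_gb": 17.4, "median_kld": 0.132, "tg128": 24.4},
    {"name": "Config-I", "size_gb": 18.9, "median_kld": 0.125, "tg128": 9.6},
]

def pick_quant(budget_gb, headroom_gb=2.0, prefer="quality"):
    """Return the best quant that fits in budget_gb, or None if nothing fits.

    headroom_gb is an assumed allowance for KV cache and runtime buffers.
    """
    fits = [q for q in QUANTS if q["size_gb"] + headroom_gb <= budget_gb]
    if not fits:
        return None
    if prefer == "speed":
        return max(fits, key=lambda q: q["tg128"])["name"]
    return min(fits, key=lambda q: q["median_kld"])["name"]

print(pick_quant(24))                  # quality first on a 24 GB budget
print(pick_quant(24, prefer="speed"))  # throughput first on the same budget
```

With enough memory for Q8_0 the helper simply returns Q8_0; Config-I only wins in the range where Q8_0 no longer fits.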
Context Scaling (tg128 decode at varying context length, Metal)
| Context | 1K | 4K | 8K | 16K | 32K |
|---|---|---|---|---|---|
| t/s | 9.8 | 9.4 | 9.6 | 9.6 | 9.7 |
Decode performance is effectively flat across all tested context lengths (1K to 32K). No degradation cliff.
Quality (KL-divergence vs Q8_0, wikitext-2-raw, 512 context, 5 chunks)
| Config | Median KLD | Same top token |
|---|---|---|
| Q4_K_M | 0.132 | 74.2% |
| Config-I | 0.125 | 74.9% |
Recommended KV Cache Settings
TurboQuant+ also compresses the KV cache at runtime via -ctk and -ctv flags. This doesn't change the model file, just how much memory the context window uses. KV compression further reduces runtime memory usage, allowing larger context windows within the same hardware limits.
| K cache (-ctk) | V cache (-ctv) | KV buffer (16K ctx) | Recommendation |
|---|---|---|---|
| f16 | f16 | 2,480 MiB | default, short context |
| q8_0 | turbo4 | 988 MiB (60% smaller) | recommended for long context |
| q8_0 | turbo3 | 901 MiB (64% smaller) | aggressive, more KV savings |
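The percentages in the table follow directly from the buffer sizes. A short Python check, using only the MiB figures listed above (the function names are illustrative):

```python
# KV buffer sizes at 16K context, copied from the table above (MiB).
KV_BUFFER_MIB = {
    ("f16", "f16"): 2480,
    ("q8_0", "turbo4"): 988,
    ("q8_0", "turbo3"): 901,
}

def savings_pct(ctk, ctv):
    """Percent reduction in KV buffer size relative to the f16/f16 default."""
    base = KV_BUFFER_MIB[("f16", "f16")]
    return 100.0 * (1 - KV_BUFFER_MIB[(ctk, ctv)] / base)

def freed_mib(ctk, ctv):
    """Absolute memory freed relative to the f16/f16 default, in MiB."""
    return KV_BUFFER_MIB[("f16", "f16")] - KV_BUFFER_MIB[(ctk, ctv)]

for ctk, ctv in KV_BUFFER_MIB:
    print(f"-ctk {ctk} -ctv {ctv}: {savings_pct(ctk, ctv):.0f}% smaller, "
          f"{freed_mib(ctk, ctv)} MiB freed")
```

These figures apply to a 16K context only. The sketch deliberately does not extrapolate to other context lengths, since this card does not state how the buffer scales.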
Download
# Install huggingface-cli
brew install huggingface-cli
# or: pip install huggingface-hub
# Download
hf download pidtom/Gemma-4-31B-it-TQPlus Gemma-4-31B-it-Config-I.gguf --local-dir .
How to Run
# Clone and build the TurboQuant+ fork
git clone https://github.com/TheTom/llama-cpp-turboquant.git
cd llama-cpp-turboquant
git checkout tqp-v0.1.0
# Build for Metal (macOS)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# Build for CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# Build for Vulkan
cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
# Run (default KV cache)
./build/bin/llama-cli -m Gemma-4-31B-it-Config-I.gguf -ngl 99 -c 8192
# Run with recommended KV compression (for long context)
./build/bin/llama-cli -m Gemma-4-31B-it-Config-I.gguf -ngl 99 -c 32768 -ctk q8_0 -ctv turbo4
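If you launch the model from a script, the same command lines can be assembled programmatically. The sketch below only builds the argument list from the flags shown above (-m, -ngl, -c, -ctk, -ctv); the `llama_cli_args` helper is hypothetical and not part of the fork.

```python
import shlex

def llama_cli_args(model, ctx=8192, ngl=99, ctk=None, ctv=None,
                   binary="./build/bin/llama-cli"):
    """Build an argv list for llama-cli using the flags from this card."""
    args = [binary, "-m", model, "-ngl", str(ngl), "-c", str(ctx)]
    if ctk:
        args += ["-ctk", ctk]   # K cache type, e.g. q8_0
    if ctv:
        args += ["-ctv", ctv]   # V cache type, e.g. turbo4
    return args

args = llama_cli_args("Gemma-4-31B-it-Config-I.gguf",
                      ctx=32768, ctk="q8_0", ctv="turbo4")
print(shlex.join(args))
```

Passing the list to `subprocess.run(args, check=True)` would start the process; a list is used rather than a single string so that no shell quoting is needed.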
What is TurboQuant+?
TurboQuant+ applies Walsh-Hadamard Transform (WHT) domain quantization to compress model weights beyond standard GGUF quant types. This achieves lower bits-per-weight at equivalent or better quality by exploiting structured redundancy in the weight matrices.
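As a rough illustration of what "WHT-domain quantization" means, the Python sketch below transforms a block of weights with an orthonormal Walsh-Hadamard transform, quantizes the coefficients to 4 bits, and transforms back. This is a toy under stated assumptions and not the TQ4_1S format: the block size, the single per-block scale, and the symmetric rounding are all placeholders. One commonly cited motivation for transforms of this kind is that they spread a single large weight across the whole block, so one scale covers the values more evenly.

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b   # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def quantize_4bit(block):
    """Symmetric round-to-nearest 4-bit codes with one scale per block (toy scheme)."""
    peak = max(abs(x) for x in block)
    scale = peak / 7.0 if peak > 0 else 1.0
    codes = [max(-8, min(7, round(x / scale))) for x in block]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def roundtrip(block):
    """Quantize in the WHT domain, then return to the weight domain."""
    codes, scale = quantize_4bit(fwht(block))
    return fwht(dequantize(codes, scale))   # the orthonormal WHT is its own inverse

weights = [8.0, 0.1, -0.2, 0.05, 0.0, 0.1, -0.1, 0.2]   # hypothetical block with one outlier
print([round(w, 3) for w in roundtrip(weights)])
```

Because the transform is orthonormal, the squared error introduced in the transform domain equals the squared error seen in the reconstructed weights, which makes the quantization error easy to reason about.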