This model is part of the MR-GPTQ collection of MXFP4 and NVFP4 quantized models.
This model was obtained by quantizing the weights of Llama-3.1-8B-Instruct to the MXFP4 data type. This optimization reduces the number of bits per parameter from 16 to 4.25, cutting disk size and GPU memory requirements by approximately 73%.
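The 4.25 bits per parameter follow from the MXFP4 format: each weight is stored as a 4-bit value and every block of 32 weights shares an 8-bit scale, i.e. 4 + 8/32 = 4.25 bits, so the footprint shrinks by (16 − 4.25)/16 ≈ 73%.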
MR-GPTQ quantized models with QuTLASS kernels are supported in the following integrations:
- transformers: supported on the main branch (see the documentation).
- vLLM: supported via FP-Quant and the transformers integration.

Evaluation

This model was evaluated on a subset of the OpenLLM v1 benchmarks and on Platinum Bench; recovery is reported as the quantized model's score as a percentage of the unquantized baseline's score. Model outputs were generated with the vLLM engine.
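A rough sketch of that generation setup with vLLM is shown below. The repository ID follows the model name in the results table, and the prompt and sampling parameters are illustrative assumptions, not the settings used for the benchmarks.

```python
# Minimal sketch: generating outputs with the vLLM engine.
# The repository ID and sampling settings are assumptions for illustration,
# not the exact configuration used for the reported evaluations.
from vllm import LLM, SamplingParams

llm = LLM(model="ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp")
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Explain in one paragraph what MXFP4 weight quantization does."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```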
OpenLLM v1 results
| Model | MMLU-CoT | GSM8k | Hellaswag | Winogrande | Average | Recovery (%) |
|---|---|---|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 0.7276 | 0.8506 | 0.8001 | 0.7790 | 0.7893 | – |
| ISTA-DASLab/Llama-3.1-8B-Instruct-MR-GPTQ-mxfp | 0.6754 | 0.7892 | 0.7737 | 0.7324 | 0.7427 | 94.09 |
Platinum bench results
Below we report recoveries on individual tasks as well as the average recovery.
Recovery by Task
| Task | Recovery (%) |
|---|---|
| SingleOp | 97.94 |
| SingleQ | 95.95 |
| MultiArith | 98.22 |
| SVAMP | 95.08 |
| GSM8K | 93.69 |
| MMLU-Math | 80.54 |
| BBH-LogicalDeduction-3Obj | 89.87 |
| BBH-ObjectCounting | 82.03 |
| BBH-Navigate | 90.66 |
| TabFact | 86.92 |
| HotpotQA | 96.81 |
| SQuAD | 98.46 |
| DROP | 94.33 |
| Winograd-WSC | 89.47 |
| Average | 92.14 |
Base model
meta-llama/Llama-3.1-8B