Instructions for using RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4 with libraries, inference providers, notebooks, and local apps.
- Libraries
- Transformers
How to use RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4")
model = AutoModelForCausalLM.from_pretrained("RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4 with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```bash
docker model run hf.co/RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4
```
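The OpenAI-compatible endpoint can also be called from Python. A minimal sketch with the official `openai` client, assuming the vLLM server started above is listening on port 8000:

```python
# Sketch: query the local vLLM server through its OpenAI-compatible API.
# Assumes `vllm serve "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4"` is already running on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server ignores the key

completion = client.completions.create(
    model="RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```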
- SGLang
How to use RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4 with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4 with Docker Model Runner:
```bash
docker model run hf.co/RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4
```
GLM-4.7-Flash-Marlin-MMFP4
MMFP4-quantized GLM-4.7-Flash — a 30B-A3B MoE model compressed to 4 bits per weight using GPTQ with actorder and Metal Marlin's E2M1 FP4 format.
| Metric | Value |
|---|---|
| Effective bits | 4.0 bpw |
| Compression | 4× vs FP16 |
| Model size | ~16 GB (vs ~60 GB FP16) |
| Parameters | 29.3B |
| Format | HuggingFace sharded safetensors |
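As a rough sanity check on the size figure, 29.3B weights at 4 bits plus the FP16 per-group scales already account for most of the ~16 GB (a back-of-the-envelope sketch; the remainder comes from the FP16 embeddings, LM head, norms, and router weights):

```python
# Back-of-the-envelope size check (ignores that some tensors stay in FP16).
params = 29.3e9                 # parameter count from the table above
packed = params * 4 / 8         # 4 bits per weight            -> ~14.7 GB
scales = params / 128 * 2       # one FP16 scale per 128 weights -> ~0.5 GB
print(f"~{(packed + scales) / 1e9:.1f} GB of packed weights + scales")
```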
Model Description
This is a quantized version of zai-org/GLM-4.7-Flash, the strongest model in the 30B class that balances performance and efficiency.
GLM-4.7-Flash features:
- 30B-A3B MoE architecture (64 experts + shared expert, 2-4 active per token)
- Multi-head Latent Attention (MLA) for 8× KV cache compression
- State-of-the-art reasoning (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- Bilingual (English + Chinese)
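For intuition on the first bullet, the sketch below shows toy top-k expert routing with a shared expert. It illustrates the "a few experts active per token" idea only; it is not GLM's implementation, and all names and shapes are made up:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, experts, shared_expert, top_k=2):
    """Toy top-k MoE routing with a shared expert (illustration only).

    x:             (tokens, hidden) activations
    router_w:      (hidden, n_experts) router weights (kept in FP16 in this checkpoint)
    experts:       list of n_experts callables, each (tokens, hidden) -> (tokens, hidden)
    shared_expert: callable applied to every token
    """
    probs = F.softmax(x @ router_w, dim=-1)      # (tokens, n_experts) routing probabilities
    weights, idx = probs.topk(top_k, dim=-1)     # keep only the top-k experts per token
    out = shared_expert(x)                       # the shared expert sees every token
    for k in range(top_k):
        for e in range(router_w.shape[-1]):
            mask = idx[:, k] == e                # tokens routed to expert e at rank k
            if mask.any():
                out[mask] += weights[mask, k, None] * experts[e](x[mask])
    return out
```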
Quantization Details
Quantized using MR-GPTQ (Metal Marlin GPTQ) with CUDA acceleration:
Method
- Format: MMFP4 (E2M1 FP4) — Metal Marlin's native FP4 format
- Quantization: GPTQ with actorder (activation-order column permutation)
- Hessian calibration: Pre-computed Hessians for attention layers
- Expert quantization: Identity Hessian with actorder (no calibration data for MoE experts)
- Group size: 128
- Hardware: NVIDIA RTX 3090 Ti (CUDA-accelerated Cholesky factorization)
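For illustration, here is a minimal per-group E2M1 round-to-nearest sketch. It shows only the FP4 grid and per-group scaling; the actual MR-GPTQ pass additionally walks columns in activation order and uses the (pre-computed or identity) Hessian to compensate rounding error:

```python
import torch

# The eight non-negative E2M1 (FP4) magnitudes; a sign bit gives the full 16-value grid.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_e2m1(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Round-trip a weight tensor through E2M1 with one scale per group of 128 values.

    Sketch only; assumes w.numel() is a multiple of group_size.
    """
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    scales = w.abs().amax(dim=1, keepdim=True) / E2M1_GRID.max()   # map each group's max to ±6
    scales = scales.clamp(min=1e-8)
    scaled = w / scales
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)  # nearest grid point
    return (E2M1_GRID[idx] * scaled.sign() * scales).reshape(orig_shape)
```

The per-layer rounding error of this simple scheme can be eyeballed as `(w - fake_quantize_e2m1(w)).abs().mean()`; GPTQ's error compensation is what keeps the end-to-end loss small.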
Quantization Statistics
| Component | Bit Width | Notes |
|---|---|---|
| Embeddings | FP16 | Full precision |
| LM Head | FP16 | Full precision |
| Attention (q/k/v/o) | 4-bit | GPTQ with Hessians |
| MoE Experts (64×) | 4-bit | GPTQ with actorder |
| Layer Norms | FP16 | Full precision |
| Router Weights | FP16 | Full precision |
- Total tensors: 19,066
- Shards: 48 safetensors files
- Quantization time: ~20 minutes (RTX 3090 Ti)
Files
GLM-4.7-Flash-Marlin-MMFP4/
├── model-00001-of-00048.safetensors # Layer 0 (embeddings)
├── model-00002-of-00048.safetensors # Layer 1
├── ...
├── model-00048-of-00048.safetensors # Layer 47 + lm_head
├── model.safetensors.index.json # Weight map
├── config.json # Model config
├── generation_config.json
├── tokenizer.json # Tokenizer
└── tokenizer_config.json
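The `model.safetensors.index.json` weight map records which shard holds each tensor, so individual tensors can be inspected without loading the whole model. A sketch assuming a local snapshot of this repo (the path and tensor name are hypothetical):

```python
import json
from pathlib import Path
from safetensors import safe_open

repo_dir = Path("GLM-4.7-Flash-Marlin-MMFP4")        # hypothetical local snapshot of this repo

# The index maps every tensor name to the shard file that contains it.
index = json.loads((repo_dir / "model.safetensors.index.json").read_text())
name = "model.layers.0.self_attn.q_proj.weight"       # hypothetical tensor name
shard = index["weight_map"][name]

# Load just that tensor from its shard.
with safe_open(str(repo_dir / shard), framework="pt") as f:
    tensor = f.get_tensor(name)
print(name, tensor.shape, tensor.dtype)
```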
Usage
With Metal Marlin (Apple Silicon)
```python
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

# Load the MMFP4 checkpoint on the Apple GPU (MPS backend).
model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    device="mps",
)

# The tokenizer is unchanged by quantization, so it comes from the original repo.
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
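If the GLM-4.7-Flash tokenizer ships a chat template, the prompt can also be built with `apply_chat_template` instead of hard-coding the role tags (a sketch reusing the `model` and `tokenizer` loaded above; assumes a chat template is defined):

```python
# Sketch: build the prompt from the tokenizer's chat template (assumes one is defined).
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant-turn marker
    return_tensors="pt",
).to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```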
Tensor Format
Each quantized weight tensor has corresponding scale factors:
- `{name}.weight`: packed FP4 weights (uint8)
- `{name}.scales`: FP16 per-group scales (group_size=128)
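A minimal dequantization sketch for this layout, assuming two FP4 codes per uint8 with the low nibble first; the real Metal Marlin kernels use their own interleaved packing, so this is illustrative only:

```python
import torch

# Signed E2M1 lookup table: codes 0-7 are +{0, 0.5, 1, 1.5, 2, 3, 4, 6}, codes 8-15 their negatives.
E2M1_LUT = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                         -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_mmfp4(packed: torch.Tensor, scales: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Expand packed FP4 codes back to FP16 using per-group scales.

    packed: (rows, cols // 2) uint8, two codes per byte (low nibble first -- an assumption)
    scales: (rows, cols // group_size) FP16 per-group scales
    """
    lo = (packed & 0x0F).long()
    hi = (packed >> 4).long()
    codes = torch.stack([lo, hi], dim=-1).reshape(packed.shape[0], -1)   # (rows, cols)
    values = E2M1_LUT[codes]                                             # decode to FP32
    groups = values.reshape(values.shape[0], -1, group_size)             # (rows, n_groups, 128)
    return (groups * scales.unsqueeze(-1).float()).reshape(values.shape[0], -1).half()
```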
Hardware Requirements
| Device | Memory | Notes |
|---|---|---|
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
| Apple M2 Ultra | 36 GB+ | Via Metal Marlin |
Benchmarks
Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
Quantized Model Notes
- GPTQ with actorder minimizes quality loss vs RTN
- Expected degradation: ~1-2% on benchmarks vs FP16
- E2M1 FP4 format optimized for Metal Performance Shaders
Comparison with Trellis Quant
| Model | Format | Size | Bits | Method |
|---|---|---|---|---|
| GLM-4.7-Flash-Trellis-MM | Trellis | 14 GB | 3.78 bpw | EXL3-style mixed precision |
| This model | MMFP4 | 16 GB | 4.0 bpw | GPTQ + actorder |
Choose Trellis for smaller size, MMFP4 for simpler tensor format and potentially better compatibility.
Limitations
- Metal Marlin required for optimal inference on Apple Silicon
- No speculative decoding yet
- Quality loss: ~1-2% on benchmarks vs FP16 (typical for 4-bit quantization)
Credits
- Original model: Z.AI / GLM Team
- Quantization method: GPTQ with actorder
- Quantization toolkit: Metal Marlin
Citation
If you use this model, please cite the original GLM-4.5 paper:
```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```
License
This quantized model inherits the MIT License from the original GLM-4.7-Flash model.