Instructions for using RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4 with libraries, inference providers, notebooks, and local apps.
- Libraries
- Transformers
How to use RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4")
model = AutoModelForCausalLM.from_pretrained("RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4 with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```bash
docker model run hf.co/RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4
```
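The OpenAI-compatible endpoint can also be called from Python. A minimal sketch with the official `openai` client, assuming the vLLM server started above is listening on port 8000:

```python
# Sketch: query the local vLLM server through its OpenAI-compatible API.
# Assumes `vllm serve "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4"` is already running on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server ignores the key

completion = client.completions.create(
    model="RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```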
- SGLang
How to use RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4 with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4 with Docker Model Runner:
```bash
docker model run hf.co/RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4
```
GLM-4.7-Flash-Marlin-MMFP4
MMFP4-quantized GLM-4.7-Flash — a 30B-A3B MoE model compressed to 4 bits per weight using GPTQ with actorder and Metal Marlin's E2M1 FP4 format.
| Metric | Value |
|---|---|
| Effective bits | 4.0 bpw |
| Compression | 4× vs FP16 |
| Model size | ~16 GB (vs ~60 GB FP16) |
| Parameters | 29.3B |
| Format | HuggingFace sharded safetensors |
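As a rough sanity check on the size figure, 29.3B weights at 4 bits plus the FP16 per-group scales already account for most of the ~16 GB (a back-of-the-envelope sketch; the remainder comes from the FP16 embeddings, LM head, norms, and router weights):

```python
# Back-of-the-envelope size check (ignores that some tensors stay in FP16).
params = 29.3e9                 # parameter count from the table above
packed = params * 4 / 8         # 4 bits per weight            -> ~14.7 GB
scales = params / 128 * 2       # one FP16 scale per 128 weights -> ~0.5 GB
print(f"~{(packed + scales) / 1e9:.1f} GB of packed weights + scales")
```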
Model Description
This is a quantized version of zai-org/GLM-4.7-Flash, the strongest model in the 30B class that balances performance and efficiency.
GLM-4.7-Flash features:
- 30B-A3B MoE architecture (64 experts + shared expert, 2-4 active per token)
- Multi-head Latent Attention (MLA) for 8× KV cache compression
- State-of-the-art reasoning (91.6% on AIME 2025, 59.2% on SWE-bench Verified)
- Bilingual (English + Chinese)
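For intuition on the first bullet, the sketch below shows toy top-k expert routing with a shared expert. It illustrates the "a few experts active per token" idea only; it is not GLM's implementation, and all names and shapes are made up:

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router_w, experts, shared_expert, top_k=2):
    """Toy top-k MoE routing with a shared expert (illustration only).

    x:             (tokens, hidden) activations
    router_w:      (hidden, n_experts) router weights (kept in FP16 in this checkpoint)
    experts:       list of n_experts callables, each (tokens, hidden) -> (tokens, hidden)
    shared_expert: callable applied to every token
    """
    probs = F.softmax(x @ router_w, dim=-1)      # (tokens, n_experts) routing probabilities
    weights, idx = probs.topk(top_k, dim=-1)     # keep only the top-k experts per token
    out = shared_expert(x)                       # the shared expert sees every token
    for k in range(top_k):
        for e in range(router_w.shape[-1]):
            mask = idx[:, k] == e                # tokens routed to expert e at rank k
            if mask.any():
                out[mask] += weights[mask, k, None] * experts[e](x[mask])
    return out
```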
Quantization Details
Quantized using MR-GPTQ (Metal Marlin GPTQ) with CUDA acceleration:
Method
- Format: MMFP4 (E2M1 FP4) — Metal Marlin's native FP4 format
- Quantization: GPTQ with actorder (activation-order column permutation)
- Hessian calibration: Pre-computed Hessians for attention layers
- Expert quantization: Identity Hessian with actorder (no calibration data for MoE experts)
- Group size: 128
- Hardware: NVIDIA RTX 3090 Ti (CUDA-accelerated Cholesky factorization)
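For illustration, here is a minimal per-group E2M1 round-to-nearest sketch. It shows only the FP4 grid and per-group scaling; the actual MR-GPTQ pass additionally walks columns in activation order and uses the (pre-computed or identity) Hessian to compensate rounding error:

```python
import torch

# The eight non-negative E2M1 (FP4) magnitudes; a sign bit gives the full 16-value grid.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_e2m1(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Round-trip a weight tensor through E2M1 with one scale per group of 128 values.

    Sketch only; assumes w.numel() is a multiple of group_size.
    """
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    scales = w.abs().amax(dim=1, keepdim=True) / E2M1_GRID.max()   # map each group's max to ±6
    scales = scales.clamp(min=1e-8)
    scaled = w / scales
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)  # nearest grid point
    return (E2M1_GRID[idx] * scaled.sign() * scales).reshape(orig_shape)
```

The per-layer rounding error of this simple scheme can be eyeballed as `(w - fake_quantize_e2m1(w)).abs().mean()`; GPTQ's error compensation is what keeps the end-to-end loss small.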
Quantization Statistics
| Component | Bit Width | Notes |
|---|---|---|
| Embeddings | FP16 | Full precision |
| LM Head | FP16 | Full precision |
| Attention (q/k/v/o) | 4-bit | GPTQ with Hessians |
| MoE Experts (64×) | 4-bit | GPTQ with actorder |
| Layer Norms | FP16 | Full precision |
| Router Weights | FP16 | Full precision |
- Total tensors: 19,066
- Shards: 48 safetensors files
- Quantization time: ~20 minutes (RTX 3090 Ti)
Files
GLM-4.7-Flash-Marlin-MMFP4/
├── model-00001-of-00048.safetensors # Layer 0 (embeddings)
├── model-00002-of-00048.safetensors # Layer 1
├── ...
├── model-00048-of-00048.safetensors # Layer 47 + lm_head
├── model.safetensors.index.json # Weight map
├── config.json # Model config
├── generation_config.json
├── tokenizer.json # Tokenizer
└── tokenizer_config.json
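The `model.safetensors.index.json` weight map records which shard holds each tensor, so individual tensors can be inspected without loading the whole model. A sketch assuming a local snapshot of this repo (the path and tensor name are hypothetical):

```python
import json
from pathlib import Path
from safetensors import safe_open

repo_dir = Path("GLM-4.7-Flash-Marlin-MMFP4")        # hypothetical local snapshot of this repo

# The index maps every tensor name to the shard file that contains it.
index = json.loads((repo_dir / "model.safetensors.index.json").read_text())
name = "model.layers.0.self_attn.q_proj.weight"       # hypothetical tensor name
shard = index["weight_map"][name]

# Load just that tensor from its shard.
with safe_open(str(repo_dir / shard), framework="pt") as f:
    tensor = f.get_tensor(name)
print(name, tensor.shape, tensor.dtype)
```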
Usage
With Metal Marlin (Apple Silicon)
```python
from metal_marlin import MarlinForCausalLM
from transformers import AutoTokenizer

# Load the MMFP4 checkpoint on the Apple GPU (MPS backend).
model = MarlinForCausalLM.from_pretrained(
    "RESMP-DEV/GLM-4.7-Flash-Marlin-MMFP4",
    device="mps",
)

# The tokenizer is unchanged by quantization, so it comes from the original repo.
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.7-Flash")

prompt = "<|user|>\nExplain quantum computing in simple terms.\n<|assistant|>\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
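If the GLM-4.7-Flash tokenizer ships a chat template, the prompt can also be built with `apply_chat_template` instead of hard-coding the role tags (a sketch reusing the `model` and `tokenizer` loaded above; assumes a chat template is defined):

```python
# Sketch: build the prompt from the tokenizer's chat template (assumes one is defined).
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant-turn marker
    return_tensors="pt",
).to("mps")

output = model.generate(input_ids, max_new_tokens=256, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```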
Tensor Format
Each quantized weight tensor has corresponding scale factors:
- `{name}.weight`: packed FP4 weights (uint8)
- `{name}.scales`: FP16 per-group scales (group_size=128)
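A minimal dequantization sketch for this layout, assuming two FP4 codes per uint8 with the low nibble first; the real Metal Marlin kernels use their own interleaved packing, so this is illustrative only:

```python
import torch

# Signed E2M1 lookup table: codes 0-7 are +{0, 0.5, 1, 1.5, 2, 3, 4, 6}, codes 8-15 their negatives.
E2M1_LUT = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
                         -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0])

def dequantize_mmfp4(packed: torch.Tensor, scales: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Expand packed FP4 codes back to FP16 using per-group scales.

    packed: (rows, cols // 2) uint8, two codes per byte (low nibble first -- an assumption)
    scales: (rows, cols // group_size) FP16 per-group scales
    """
    lo = (packed & 0x0F).long()
    hi = (packed >> 4).long()
    codes = torch.stack([lo, hi], dim=-1).reshape(packed.shape[0], -1)   # (rows, cols)
    values = E2M1_LUT[codes]                                             # decode to FP32
    groups = values.reshape(values.shape[0], -1, group_size)             # (rows, n_groups, 128)
    return (groups * scales.unsqueeze(-1).float()).reshape(values.shape[0], -1).half()
```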
Hardware Requirements
| Device | Memory | Notes |
|---|---|---|
| Apple M4 Max | 36 GB+ | Via Metal Marlin |
| Apple M2 Ultra | 36 GB+ | Via Metal Marlin |
Benchmarks
Original Model Performance (from Z.AI)
| Benchmark | GLM-4.7-Flash | Qwen3-30B-A3B | GPT-OSS-20B |
|---|---|---|---|
| AIME 2025 | 91.6 | 85.0 | 91.7 |
| GPQA | 75.2 | 73.4 | 71.5 |
| SWE-bench Verified | 59.2 | 22.0 | 34.0 |
| τ²-Bench | 79.5 | 49.0 | 47.7 |
| BrowseComp | 42.8 | 2.29 | 28.3 |
Quantized Model Notes
- GPTQ with actorder minimizes quality loss vs RTN
- Expected degradation: ~1-2% on benchmarks vs FP16
- E2M1 FP4 format optimized for Metal Performance Shaders
Comparison with Trellis Quant
| Model | Format | Size | Bits | Method |
|---|---|---|---|---|
| GLM-4.7-Flash-Trellis-MM | Trellis | 14 GB | 3.78 bpw | EXL3-style mixed precision |
| This model | MMFP4 | 16 GB | 4.0 bpw | GPTQ + actorder |
Choose Trellis for smaller size, MMFP4 for simpler tensor format and potentially better compatibility.
Limitations
- Metal Marlin required for optimal inference on Apple Silicon
- No speculative decoding yet
- Quality loss: ~1-2% on benchmarks vs FP16 (typical for 4-bit quantization)
Credits
- Original model: Z.AI / GLM Team
- Quantization method: GPTQ with actorder
- Quantization toolkit: Metal Marlin
Citation
If you use this model, please cite the original GLM-4.5 paper:
```bibtex
@misc{glm2025glm45,
  title={GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models},
  author={GLM Team and Aohan Zeng and Xin Lv and others},
  year={2025},
  eprint={2508.06471},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2508.06471},
}
```
License
This quantized model inherits the MIT License from the original GLM-4.7-Flash model.