Is it possible to do this for the full GLM 4.5?

#1
by InfernalDread

Thank you for producing this variant of GLM 4.5 Air; I am getting really good results from it! Would it be possible to make an MXFP4 version of the full GLM 4.5 as well? The efficiency is really good!

Thank you for your time and effort!

Owner

I think it's possible. Most of the work was done by llama.cpp.

I use llama.cpp to quantize the model. My Mac mini has 64GB of unified memory, so I can't verify whether the large model works properly.

You can download and quantize the model yourself. Here is my workflow:

  1. Pull the latest code of llama.cpp and build it.
  2. Download the GGUF format full precision model.
  3. Quantize it with the llama-quantize command:
     llama-quantize --keep-split GLM-4.5-Air-BF16-00001-of-00005.gguf GLM-4.5-Air-MXFP4_MOE.gguf MXFP4_MOE
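For reference, here is roughly what that workflow looks like end to end as shell commands. This is a sketch rather than the exact session: the build follows the standard llama.cpp CMake instructions, and the input/output file names for the full GLM 4.5 splits are placeholders.

```bash
# 1. Pull the latest llama.cpp and build it (standard CMake build;
#    see the llama.cpp README for platform-specific options).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# 2. Download the BF16 GGUF splits of the model you want to quantize.

# 3. Quantize to MXFP4_MOE. Pass the first split; with --keep-split the
#    output is written as matching split files instead of one huge file.
#    File names below are placeholders for the full GLM 4.5 splits.
./build/bin/llama-quantize --keep-split \
    /path/to/GLM-4.5-BF16-00001-of-000NN.gguf \
    /path/to/GLM-4.5-MXFP4_MOE.gguf \
    MXFP4_MOE
```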

Do you have an estimate as to how much VRAM/RAM combined is required to quantize the big one (Full GLM 4.5)? This would be my first time, so I am not sure of the memory requirements.

Owner

The quantization process is performed layer by layer, so you don't need to load the entire model into memory. I guess 16GB is enough.

I downloaded the unsloth/GLM-4.5-Air-GGUF BF16 model, which is 221GB, and quantized it on my 64GB Mac mini.

You can download a small full-precision GGUF model (maybe Qwen/Qwen3-4B-Instruct-2507) and quantize it to Q4_0; that won't take much time. Once that succeeds, you can try GLM-4.5.
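A minimal dry run with a small model might look like the following. This assumes you already have a single-file BF16 GGUF of the small model on disk; the file names are illustrative.

```bash
# Quantize a small BF16 GGUF to Q4_0 as a quick end-to-end test.
# File names are illustrative; adjust them to whatever you downloaded.
./build/bin/llama-quantize \
    Qwen3-4B-Instruct-2507-BF16.gguf \
    Qwen3-4B-Instruct-2507-Q4_0.gguf \
    Q4_0
```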

Remember to download the large split model with LM Studio. That way you don't need to download individual files one by one.
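If you prefer the command line, the huggingface-cli downloader can also fetch all the split files in one go. The include pattern below is an assumption about how the BF16 splits are named in the unsloth repo, so check the repo's file listing first.

```bash
# Download every BF16 split of the GGUF repo into a local directory.
# The --include pattern is a guess at the file naming; verify it against the repo.
huggingface-cli download unsloth/GLM-4.5-Air-GGUF \
    --include "*BF16*" \
    --local-dir ./GLM-4.5-Air-BF16
```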

When you have questions, ask DeepSeek/Qwen/GLM/GPT first; they are smart and convenient.

Yes, this is an amazing result. Please post this on the Reddit LocalLLaMA group! I'm running the full 128K context on a 128GB Mac, which is not possible with other quants at this quality. The Qwen3 235B model would be a great target for this.

I just promoted this quant there!

I've created mxfp4 versions of the Thinking and Instruct 235B models below. I've not tested them yet, but they should work just fine, using the method outlined above.

https://huggingface.co/sm54/Qwen3-235B-A22B-Thinking-2507-MXFP4_MOE
https://huggingface.co/sm54/Qwen3-235B-A22B-Instruct-2507-MXFP4_MOE

I've created an mxfp4 quant of GLM-4.5 here:

https://huggingface.co/sm54/GLM-4.5-MXFP4_MOE
