Is it possible to do this for the full GLM 4.5?

#1
by InfernalDread

Thank you for producing this variant of GLM 4.5 Air; I am getting really good results from it! Would it be possible to make an MXFP4 version of the full GLM 4.5 as well? The efficiency is really good!

Thank you for your time and effort!

Owner

I think it's possible. Most of the work was done by llama.cpp.

I use llama.cpp to quantize the model. My Mac mini has 64GB of unified memory, so I can't verify whether the large model works properly.

You can download and quantize the model yourself. Here is my workflow:

  1. Pull the latest code of llama.cpp and build it.
  2. Download the GGUF format full precision model.
  3. Quantize it with the llama-quantize command:
     llama-quantize --keep-split GLM-4.5-Air-BF16-00001-of-00005.gguf GLM-4.5-Air-MXFP4_MOE.gguf MXFP4_MOE
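For reference, here is roughly what that workflow looks like end to end as shell commands. This is a sketch rather than the exact session: the build follows the standard llama.cpp CMake instructions, and the input/output file names for the full GLM 4.5 splits are placeholders.

```bash
# 1. Pull the latest llama.cpp and build it (standard CMake build;
#    see the llama.cpp README for platform-specific options).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j

# 2. Download the BF16 GGUF splits of the model you want to quantize.

# 3. Quantize to MXFP4_MOE. Pass the first split; with --keep-split the
#    output is written as matching split files instead of one huge file.
#    File names below are placeholders for the full GLM 4.5 splits.
./build/bin/llama-quantize --keep-split \
    /path/to/GLM-4.5-BF16-00001-of-000NN.gguf \
    /path/to/GLM-4.5-MXFP4_MOE.gguf \
    MXFP4_MOE
```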

Do you have an estimate as to how much VRAM/RAM combined is required to quantize the big one (Full GLM 4.5)? This would be my first time, so I am not sure of the memory requirements.

Owner

The quantization process is performed layer by layer, so you don't need to load the entire model into memory. I guess 16GB is enough.

I downloaded the unsloth/GLM-4.5-Air-GGUF BF16 model, which is 221GB, and quantized it on my 64GB Mac mini.

You can download a small full-precision GGUF model (maybe Qwen/Qwen3-4B-Instruct-2507) and quantize it to Q4_0; that won't take much time. Once that succeeds, you can try GLM-4.5.
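A minimal dry run with a small model might look like the following. This assumes you already have a single-file BF16 GGUF of the small model on disk; the file names are illustrative.

```bash
# Quantize a small BF16 GGUF to Q4_0 as a quick end-to-end test.
# File names are illustrative; adjust them to whatever you downloaded.
./build/bin/llama-quantize \
    Qwen3-4B-Instruct-2507-BF16.gguf \
    Qwen3-4B-Instruct-2507-Q4_0.gguf \
    Q4_0
```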

Remember to download the large split model with LM Studio. That way you don't need to download individual files one by one.
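If you prefer the command line, the huggingface-cli downloader can also fetch all the split files in one go. The include pattern below is an assumption about how the BF16 splits are named in the unsloth repo, so check the repo's file listing first.

```bash
# Download every BF16 split of the GGUF repo into a local directory.
# The --include pattern is a guess at the file naming; verify it against the repo.
huggingface-cli download unsloth/GLM-4.5-Air-GGUF \
    --include "*BF16*" \
    --local-dir ./GLM-4.5-Air-BF16
```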

When you have questions, ask DeepSeek/Qwen/GLM/GPT first; they are smart and convenient.

Yes, this is an amazing result. Please post this on the Reddit LocalLLaMA group! I'm running the full 128K context on a 128GB Mac, which is not possible with other quants at this quality. The Qwen3 235B model would be a great target for this.

I just promoted this quant there!

I've created mxfp4 versions of the Thinking and Instruct 235B models below. I've not tested them yet, but they should work just fine, using the method outlined above.

https://huggingface.co/sm54/Qwen3-235B-A22B-Thinking-2507-MXFP4_MOE
https://huggingface.co/sm54/Qwen3-235B-A22B-Instruct-2507-MXFP4_MOE

I've created an mxfp4 quant of GLM-4.5 here:

https://huggingface.co/sm54/GLM-4.5-MXFP4_MOE
