Missing MTP layer, please add it back!

#1
by TheDrummer - opened

Hey cerebras! This looks promising.

@zerofata noticed that you are missing the MTP layer, which is necessary for llama.cpp / GGUF conversion.

You can reintroduce it using this script: https://rentry.co/GLM4-5-Air-LoraFineTune#re-adding-the-mtp-layer

We've tested it on our own GLM 4.5 Air tunes and can confirm that it works. Could you update your repo / create a new repo for it? (I'd do it, but I'm trying to save on storage space :^))

To save you a click, here's Zerofata's solution for reintroducing the MTP layers (allowing us to GGUF the full weights)

import torch
import json
import glob
import os
from safetensors import safe_open
from safetensors.torch import save_file
import shutil

def attach_mtp_layer(base_model_path, merged_model_path, output_path):
    print("Extracting layer 46 from base model...")
    layer_46_weights = {}
    safetensor_files = glob.glob(f"{base_model_path}/*.safetensors")

    for file in safetensor_files:
        with safe_open(file, framework="pt") as f:
            for key in f.keys():
                if "model.layers.46." in key:
                    layer_46_weights[key] = f.get_tensor(key)

    print(f"Found {len(layer_46_weights)} layer 46 tensors")

    # Copy merged model to output
    print(f"Copying merged model to {output_path}...")
    shutil.copytree(merged_model_path, output_path, dirs_exist_ok=True)

    # Load and update the index
    index_path = f"{output_path}/model.safetensors.index.json"
    with open(index_path, 'r') as f:
        index = json.load(f)

    # Find the last shard file
    weight_map = index["weight_map"]
    last_shard = sorted(set(weight_map.values()))[-1]
    shard_path = f"{output_path}/{last_shard}"

    print(f"Adding layer 46 to {last_shard}...")

    # Load existing weights from last shard
    existing_weights = {}
    with safe_open(shard_path, framework="pt") as f:
        for key in f.keys():
            existing_weights[key] = f.get_tensor(key)

    # Add layer 46 weights
    existing_weights.update(layer_46_weights)

    # Update weight_map in index
    for key in layer_46_weights:
        weight_map[key] = last_shard

    # Save updated shard
    save_file(existing_weights, shard_path, metadata={"format": "pt"})

    # Update and save index
    with open(index_path, 'w') as f:
        json.dump(index, f, indent=2)

    # Update config to reflect MTP layer
    config_path = f"{output_path}/config.json"
    with open(config_path, 'r') as f:
        config = json.load(f)

    config['num_nextn_predict_layers'] = 1

    with open(config_path, 'w') as f:
        json.dump(config, f, indent=2)

    print("βœ… MTP layer 46 successfully attached!")

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--base", required=True, help="Base model with layer 46")
    parser.add_argument("--merged", required=True, help="Merged model missing layer 46")
    parser.add_argument("--out", required=True, help="Output path for fixed model")

    args = parser.parse_args()
    attach_mtp_layer(args.base, args.merged, args.out)
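
For reference, a hypothetical invocation of the script above (the paths are placeholders for your local copies) would look something like:

python attach_mtp_layer.py --base /path/to/GLM-4.5-Air --merged /path/to/merged-model --out /path/to/fixed-model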
Cerebras org

@TheDrummer Appreciate the feedback and thanks for bringing this to our attention. We are working on re-uploading the diff asap. Stay tuned!

Thanks @lazarevich !

@bartowski my manski, do your thing!

Unfortunately running into a new issue now during conversion when modifying tensors:

KeyError: 'model.layers.46.mlp.experts.8.down_proj.weight'

Cerebras org

@bartowski could you please provide more details on where in your code this error is coming from? There is a tensor with the key model.layers.46.mlp.experts.8.down_proj.weight in model-00033-of-00033.safetensors on the latest commit.

@lazarevich Although I'm not bartowski and I'm not sure if it's relevant, maybe it helps in some way: I get the same error when using the ggml-org/gguf-my-repo space.

It happens at pretty much the first step there.

Error converting to fp16:
INFO:hf-to-gguf:Loading model: GLM-4.5-Air-REAP-82B-A12B
INFO:hf-to-gguf:Model architecture: Glm4MoeForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00033.safetensors'
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> F16, shape = {4096, 151552}
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}

etc. until part 33

INFO:hf-to-gguf:gguf: loading model part 'model-00033-of-00033.safetensors'
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F16, shape = {4096, 151552}
INFO:hf-to-gguf:blk.45.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.45.ffn_down_exps.weight, torch.bfloat16 --> F16, shape = {1408, 4096, 96}
INFO:hf-to-gguf:blk.45.ffn_gate_exps.weight, torch.bfloat16 --> F16, shape = {4096, 1408, 96}
INFO:hf-to-gguf:blk.45.ffn_up_exps.weight, torch.bfloat16 --> F16, shape = {4096, 1408, 96}
INFO:hf-to-gguf:blk.45.exp_probs_b.bias, torch.float32 --> F32, shape = {96}
INFO:hf-to-gguf:blk.45.ffn_gate_inp.weight, torch.bfloat16 --> F32, shape = {4096, 96}
INFO:hf-to-gguf:blk.45.ffn_down_shexp.weight, torch.bfloat16 --> F16, shape = {1408, 4096}
INFO:hf-to-gguf:blk.45.ffn_gate_shexp.weight, torch.bfloat16 --> F16, shape = {4096, 1408}
INFO:hf-to-gguf:blk.45.ffn_up_shexp.weight, torch.bfloat16 --> F16, shape = {4096, 1408}
INFO:hf-to-gguf:blk.45.post_attention_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.46.nextn.eh_proj.weight, torch.bfloat16 --> F16, shape = {8192, 4096}
INFO:hf-to-gguf:blk.46.nextn.embed_tokens.weight, torch.bfloat16 --> F16, shape = {4096, 151552}
INFO:hf-to-gguf:blk.46.nextn.enorm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.46.nextn.hnorm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.46.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
Traceback (most recent call last):
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 9640, in <module>
    main()
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 9634, in main
    model_instance.write()
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 432, in write
    self.prepare_tensors()
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 7331, in prepare_tensors
    super().prepare_tensors()
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 303, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 7310, in modify_tensors
    datas.append(self._experts[bid][ename])
                 ~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: 'model.layers.46.mlp.experts.8.down_proj.weight'

Sorry missed the ping, but yeah that's the exact trace ^

Thanks @belisarius

I managed to convert the version without MTP (the original upload) to GGUF by modifying the value of num_nextn_predict_layers to 0 (it was incorrectly set to 1). This could be useful info for those converting these models.
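
In case it helps anyone, a minimal sketch of that config tweak (the path is a placeholder, and this assumes you are working with the original upload that ships without the MTP tensors):

import json

config_path = "GLM-4.5-Air-REAP-82B-A12B/config.json"  # placeholder path to your local copy

with open(config_path) as f:
    config = json.load(f)

# The original upload has no layer-46 (MTP) tensors, so advertise zero MTP layers
# to keep convert_hf_to_gguf.py from looking for them.
config["num_nextn_predict_layers"] = 0

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)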

Cerebras org

I updated the MTP layer so that it contains only 96 experts, as specified in the model config. I believe that mismatch was the root cause of the KeyError: 'model.layers.46.mlp.experts.8.down_proj.weight'.

On commit 7d219fc64, I'm able to do the conversion to 16-bit GGUF locally.
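
For reference, a typical fp16 conversion command (model path and output name are placeholders) would be something like:

python llama.cpp/convert_hf_to_gguf.py /path/to/GLM-4.5-Air-REAP-82B-A12B --outtype f16 --outfile glm-4.5-air-reap-82b-a12b-f16.gguf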

Cerebras org

@ellenhp @bartowski can you please try on commit 7d219fc64 without llama.cpp patches?

@lazarevich how was the pruned MTP tensor produced? I haven't seen an update to the cerebras reap repository, and I'm curious because I'd like to do the same for the two GLM-4.6 prunes I've uploaded:

@noctrex can you do an MXFP4 MoE quant?

Cerebras org

@AesSedai note that we have not yet done calibration on the MTP layer; we just sliced the experts and the router dim to fit the new number of experts. So while this works for conversion, the MTP layer itself won't be as effective. We're working on an update with a properly pruned MTP layer, so it can be well utilized wherever supported.

For llama.cpp conversion right now, you might want to do the same and just slice the first 96 experts / router dims.
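
For illustration, a rough sketch of what such slicing could look like (not necessarily the exact procedure used upstream), assuming GLM-4.5's HF tensor naming (model.layers.46.mlp.experts.<i>.*, model.layers.46.mlp.gate.*) and that the whole MTP layer sits in a single shard; file names are placeholders:

import re
from safetensors import safe_open
from safetensors.torch import save_file

NUM_KEPT_EXPERTS = 96  # must match the expert count in the pruned config
EXPERT_KEY = re.compile(r"model\.layers\.46\.mlp\.experts\.(\d+)\.")

def slice_mtp_layer(shard_in, shard_out):
    kept = {}
    with safe_open(shard_in, framework="pt") as f:
        for key in f.keys():
            m = EXPERT_KEY.search(key)
            if m:
                # Keep only the first 96 routed experts of the MTP layer.
                if int(m.group(1)) < NUM_KEPT_EXPERTS:
                    kept[key] = f.get_tensor(key)
                continue
            tensor = f.get_tensor(key)
            if key.startswith("model.layers.46.mlp.gate."):
                # Router weight / correction bias are indexed by expert along dim 0;
                # slice them to match the kept experts.
                tensor = tensor[:NUM_KEPT_EXPERTS].contiguous()
            kept[key] = tensor
    save_file(kept, shard_out, metadata={"format": "pt"})

slice_mtp_layer("model-00033-of-00033.safetensors", "model-00033-of-00033.sliced.safetensors")

Any expert keys dropped this way also need to be removed from the weight_map in model.safetensors.index.json.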

I think that since MTP isn't supported in llama.cpp yet (outside of being included in the GGUF at least), and since a calibrated selection isn't possible yet, it might just be better to omit the MTP layer by setting num_nextn_predict_layers to 0 in the config and then quantize, or to just wait until the proper pruning is available so that the quantization doesn't have to be redone. The community being what it is though, I feel like the former option of just excluding the MTP layer will happen in the short term, and re-quants will likely follow when MTP pruning is working in REAP :)

Thanks for the information!

@bartowski are you excluding the MTP layer?

@noctrex when you create the GGUF, can you exclude them? It will hopefully make the size smaller.
When loading the original Air GGUF with the MTP layer, llama.cpp reports:

model has unused tensor blk.46.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.attn_q.weight (size = 28311552 bytes) -- ignoring
model has unused tensor blk.46.attn_k.weight (size = 2359296 bytes) -- ignoring
model has unused tensor blk.46.attn_v.weight (size = 2359296 bytes) -- ignoring
model has unused tensor blk.46.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.46.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.46.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.46.attn_output.weight (size = 28311552 bytes) -- ignoring
model has unused tensor blk.46.post_attention_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_inp.weight (size = 2097152 bytes) -- ignoring
model has unused tensor blk.46.exp_probs_b.bias (size = 512 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_exps.weight (size = 242221056 bytes) -- ignoring
model has unused tensor blk.46.ffn_down_exps.weight (size = 507510784 bytes) -- ignoring
model has unused tensor blk.46.ffn_up_exps.weight (size = 242221056 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_shexp.weight (size = 3244032 bytes) -- ignoring
model has unused tensor blk.46.ffn_down_shexp.weight (size = 3964928 bytes) -- ignoring
model has unused tensor blk.46.ffn_up_shexp.weight (size = 3244032 bytes) -- ignoring
model has unused tensor blk.46.nextn.eh_proj.weight (size = 11010048 bytes) -- ignoring
model has unused tensor blk.46.nextn.embed_tokens.weight (size = 203685888 bytes) -- ignoring
model has unused tensor blk.46.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.nextn.shared_head_head.weight (size = 203685888 bytes) -- ignoring
model has unused tensor blk.46.nextn.shared_head_norm.weight (size = 16384 bytes) -- ignoring

So it would be better if they could be omitted.

@engrtipusultan Did an experimental quant of it:
https://huggingface.co/noctrex/GLM-4.5-Air-REAP-82B-A12B-MXFP4_MOE-GGUF
Feel free to try it out and share the experience

I have downloaded your version. I am still getting those "model has unused tensor" messages.
The model itself is dumber than GLM-4.5-Air-UD-Q2_K_XL.gguf.
I am not sure if that is because of your quants or the model itself. I am behind a very low-bandwidth connection; it took me 2 days to download it.
Your quant blindly makes tool calls even when they are not required, whereas the original Air mentioned above says a tool is not required because the ask is simple.
Also, I asked the REAP one to "give me a TCP client/server implementation in Rust." It thought about it, made three versions in its thinking, and gave me the simplest one. A waste of tokens. Whereas the original Q2 mentioned earlier planned what it needed to do in its thinking and simply gave me the code in the output. Much more efficient.

Thank you everyone for your effort. For now I think I will stick with GLM-4.5-Air-UD-Q2_K_XL.gguf, which appears to me much better than the REAP one.
