Missing MTP layer, please add it back!

#1
by TheDrummer - opened

Hey cerebras! This looks promising.

@zerofata noticed that you are missing the MTP layer, which is necessary for llama.cpp / GGUF conversion.

You can reintroduce it using this script: https://rentry.co/GLM4-5-Air-LoraFineTune#re-adding-the-mtp-layer

We've tested it on our own GLM 4.5 Air tunes and can confirm that it works. Could you update your repo / create a new repo for it? (I'd do it, but I'm trying to save on storage space :^))

To save you a click, here's Zerofata's solution for reintroducing the MTP layers (allowing us to GGUF the full weights)

import torch
import json
import glob
import os
from safetensors import safe_open
from safetensors.torch import save_file
import shutil

def attach_mtp_layer(base_model_path, merged_model_path, output_path):
    print("Extracting layer 46 from base model...")
    layer_46_weights = {}
    safetensor_files = glob.glob(f"{base_model_path}/*.safetensors")

    for file in safetensor_files:
        with safe_open(file, framework="pt") as f:
            for key in f.keys():
                if "model.layers.46." in key:
                    layer_46_weights[key] = f.get_tensor(key)

    print(f"Found {len(layer_46_weights)} layer 46 tensors")

    # Copy merged model to output
    print(f"Copying merged model to {output_path}...")
    shutil.copytree(merged_model_path, output_path, dirs_exist_ok=True)

    # Load and update the index
    index_path = f"{output_path}/model.safetensors.index.json"
    with open(index_path, 'r') as f:
        index = json.load(f)

    # Find the last shard file
    weight_map = index["weight_map"]
    last_shard = sorted(set(weight_map.values()))[-1]
    shard_path = f"{output_path}/{last_shard}"

    print(f"Adding layer 46 to {last_shard}...")

    # Load existing weights from last shard
    existing_weights = {}
    with safe_open(shard_path, framework="pt") as f:
        for key in f.keys():
            existing_weights[key] = f.get_tensor(key)

    # Add layer 46 weights
    existing_weights.update(layer_46_weights)

    # Update weight_map in index
    for key in layer_46_weights:
        weight_map[key] = last_shard

    # Save updated shard
    save_file(existing_weights, shard_path, metadata={"format": "pt"})

    # Update and save index
    with open(index_path, 'w') as f:
        json.dump(index, f, indent=2)

    # Update config to reflect MTP layer
    config_path = f"{output_path}/config.json"
    with open(config_path, 'r') as f:
        config = json.load(f)

    config['num_nextn_predict_layers'] = 1

    with open(config_path, 'w') as f:
        json.dump(config, f, indent=2)

    print("βœ… MTP layer 46 successfully attached!")

if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("--base", required=True, help="Base model with layer 46")
    parser.add_argument("--merged", required=True, help="Merged model missing layer 46")
    parser.add_argument("--out", required=True, help="Output path for fixed model")

    args = parser.parse_args()
    attach_mtp_layer(args.base, args.merged, args.out)
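
For reference, a hypothetical invocation of the script above (the paths are placeholders for your local copies) would look something like:

python attach_mtp_layer.py --base /path/to/GLM-4.5-Air --merged /path/to/merged-model --out /path/to/fixed-model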
Cerebras org

@TheDrummer Appreciate the feedback and thanks for bringing this to our attention. We are working on re-uploading the diff asap. Stay tuned!

Thanks @lazarevich !

@bartowski my manski, do your thing!

Unfortunately running into a new issue now during conversion when modifying tensors:

KeyError: 'model.layers.46.mlp.experts.8.down_proj.weight'

Cerebras org

@bartowski could you please provide more details on where in your code this error is coming from? There is a tensor with the key model.layers.46.mlp.experts.8.down_proj.weight in model-00033-of-00033.safetensors on the latest commit.

@lazarevich Although I'm not bartowski and I'm not sure if it's relevant, maybe it helps in some way: I get the same error when using the ggml-org/gguf-my-repo space.

It happens at pretty much the first step there.

Error converting to fp16:
INFO:hf-to-gguf:Loading model: GLM-4.5-Air-REAP-82B-A12B
INFO:hf-to-gguf:Model architecture: Glm4MoeForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-00033.safetensors'
INFO:hf-to-gguf:token_embd.weight, torch.bfloat16 --> F16, shape = {4096, 151552}
INFO:hf-to-gguf:blk.0.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}

etc. until part 33

INFO:hf-to-gguf:gguf: loading model part 'model-00033-of-00033.safetensors'
INFO:hf-to-gguf:output.weight, torch.bfloat16 --> F16, shape = {4096, 151552}
INFO:hf-to-gguf:blk.45.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.45.ffn_down_exps.weight, torch.bfloat16 --> F16, shape = {1408, 4096, 96}
INFO:hf-to-gguf:blk.45.ffn_gate_exps.weight, torch.bfloat16 --> F16, shape = {4096, 1408, 96}
INFO:hf-to-gguf:blk.45.ffn_up_exps.weight, torch.bfloat16 --> F16, shape = {4096, 1408, 96}
INFO:hf-to-gguf:blk.45.exp_probs_b.bias, torch.float32 --> F32, shape = {96}
INFO:hf-to-gguf:blk.45.ffn_gate_inp.weight, torch.bfloat16 --> F32, shape = {4096, 96}
INFO:hf-to-gguf:blk.45.ffn_down_shexp.weight, torch.bfloat16 --> F16, shape = {1408, 4096}
INFO:hf-to-gguf:blk.45.ffn_gate_shexp.weight, torch.bfloat16 --> F16, shape = {4096, 1408}
INFO:hf-to-gguf:blk.45.ffn_up_shexp.weight, torch.bfloat16 --> F16, shape = {4096, 1408}
INFO:hf-to-gguf:blk.45.post_attention_norm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.46.nextn.eh_proj.weight, torch.bfloat16 --> F16, shape = {8192, 4096}
INFO:hf-to-gguf:blk.46.nextn.embed_tokens.weight, torch.bfloat16 --> F16, shape = {4096, 151552}
INFO:hf-to-gguf:blk.46.nextn.enorm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.46.nextn.hnorm.weight, torch.bfloat16 --> F32, shape = {4096}
INFO:hf-to-gguf:blk.46.attn_norm.weight, torch.bfloat16 --> F32, shape = {4096}
Traceback (most recent call last):
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 9640, in <module>
    main()
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 9634, in main
    model_instance.write()
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 432, in write
    self.prepare_tensors()
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 7331, in prepare_tensors
    super().prepare_tensors()
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 303, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/app/./llama.cpp/convert_hf_to_gguf.py", line 7310, in modify_tensors
    datas.append(self._experts[bid][ename])
                 ~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: 'model.layers.46.mlp.experts.8.down_proj.weight'

Sorry missed the ping, but yeah that's the exact trace ^

Thanks @belisarius

I managed to convert the version without MTP (the original upload) to GGUF by modifying the value of num_nextn_predict_layers to 0 (it was incorrectly set to 1). This could be useful info for those converting these models.
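
In case it helps anyone, a minimal sketch of that config tweak (the path is a placeholder, and this assumes you are working with the original upload that ships without the MTP tensors):

import json

config_path = "GLM-4.5-Air-REAP-82B-A12B/config.json"  # placeholder path to your local copy

with open(config_path) as f:
    config = json.load(f)

# The original upload has no layer-46 (MTP) tensors, so advertise zero MTP layers
# to keep convert_hf_to_gguf.py from looking for them.
config["num_nextn_predict_layers"] = 0

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)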

Cerebras org

I updated the MTP layer so that it contains only 96 experts, as specified in the model config. I believe that mismatch was the root cause of the KeyError: 'model.layers.46.mlp.experts.8.down_proj.weight'.

On commit 7d219fc64, I'm able to do the conversion to 16-bit GGUF locally.
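
For reference, a typical fp16 conversion command (model path and output name are placeholders) would be something like:

python llama.cpp/convert_hf_to_gguf.py /path/to/GLM-4.5-Air-REAP-82B-A12B --outtype f16 --outfile glm-4.5-air-reap-82b-a12b-f16.gguf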

Cerebras org

@ellenhp @bartowski can you please try on commit 7d219fc64 without llama.cpp patches?

@lazarevich how was the pruned MTP tensor produced? I haven't seen an update to the cerebras reap repository, and I'm curious because I'd like to do the same for the two GLM-4.6 prunes I've uploaded:

@noctrex can you do an MXFP4 MoE quant?

Cerebras org

@AesSedai note that we have not yet done calibration on the MTP layer; we just sliced the experts and the router dim to fit the new number of experts. So while this works for conversion, the MTP layer itself won't be as effective. We're working on an update with a properly pruned MTP layer, so it can be well utilized wherever supported.

For llama.cpp conversion right now, you might want to do the same and just slice the first 96 experts / router dims.
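
For illustration, a rough sketch of what such slicing could look like (not necessarily the exact procedure used upstream), assuming GLM-4.5's HF tensor naming (model.layers.46.mlp.experts.<i>.*, model.layers.46.mlp.gate.*) and that the whole MTP layer sits in a single shard; file names are placeholders:

import re
from safetensors import safe_open
from safetensors.torch import save_file

NUM_KEPT_EXPERTS = 96  # must match the expert count in the pruned config
EXPERT_KEY = re.compile(r"model\.layers\.46\.mlp\.experts\.(\d+)\.")

def slice_mtp_layer(shard_in, shard_out):
    kept = {}
    with safe_open(shard_in, framework="pt") as f:
        for key in f.keys():
            m = EXPERT_KEY.search(key)
            if m:
                # Keep only the first 96 routed experts of the MTP layer.
                if int(m.group(1)) < NUM_KEPT_EXPERTS:
                    kept[key] = f.get_tensor(key)
                continue
            tensor = f.get_tensor(key)
            if key.startswith("model.layers.46.mlp.gate."):
                # Router weight / correction bias are indexed by expert along dim 0;
                # slice them to match the kept experts.
                tensor = tensor[:NUM_KEPT_EXPERTS].contiguous()
            kept[key] = tensor
    save_file(kept, shard_out, metadata={"format": "pt"})

slice_mtp_layer("model-00033-of-00033.safetensors", "model-00033-of-00033.sliced.safetensors")

Any expert keys dropped this way also need to be removed from the weight_map in model.safetensors.index.json.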

I think that since MTP isn't supported in llama.cpp yet (outside of being included in the GGUF at least), and since a calibrated selection isn't possible yet, it might just be better to omit the MTP layer by setting num_nextn_predict_layers to 0 in the config and then quantize, or to just wait until the proper pruning is available so that the quantization doesn't have to be redone. The community being what it is though, I feel like the former option of just excluding the MTP layer will happen in the short term, and re-quants will likely follow when MTP pruning is working in REAP :)

Thanks for the information!

@bartowski are you excluding the MTP layer?

@noctrex when you create the GGUF, can you exclude them? It will hopefully make the size smaller.
When loading the original Air GGUF with the MTP layer, llama.cpp reports:

model has unused tensor blk.46.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.attn_q.weight (size = 28311552 bytes) -- ignoring
model has unused tensor blk.46.attn_k.weight (size = 2359296 bytes) -- ignoring
model has unused tensor blk.46.attn_v.weight (size = 2359296 bytes) -- ignoring
model has unused tensor blk.46.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.46.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.46.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.46.attn_output.weight (size = 28311552 bytes) -- ignoring
model has unused tensor blk.46.post_attention_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_inp.weight (size = 2097152 bytes) -- ignoring
model has unused tensor blk.46.exp_probs_b.bias (size = 512 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_exps.weight (size = 242221056 bytes) -- ignoring
model has unused tensor blk.46.ffn_down_exps.weight (size = 507510784 bytes) -- ignoring
model has unused tensor blk.46.ffn_up_exps.weight (size = 242221056 bytes) -- ignoring
model has unused tensor blk.46.ffn_gate_shexp.weight (size = 3244032 bytes) -- ignoring
model has unused tensor blk.46.ffn_down_shexp.weight (size = 3964928 bytes) -- ignoring
model has unused tensor blk.46.ffn_up_shexp.weight (size = 3244032 bytes) -- ignoring
model has unused tensor blk.46.nextn.eh_proj.weight (size = 11010048 bytes) -- ignoring
model has unused tensor blk.46.nextn.embed_tokens.weight (size = 203685888 bytes) -- ignoring
model has unused tensor blk.46.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.46.nextn.shared_head_head.weight (size = 203685888 bytes) -- ignoring
model has unused tensor blk.46.nextn.shared_head_norm.weight (size = 16384 bytes) -- ignoring

So it would be better if they could be omitted.

@engrtipusultan Did an experimental quant of it:
https://huggingface.co/noctrex/GLM-4.5-Air-REAP-82B-A12B-MXFP4_MOE-GGUF
Feel free to try it out and share the experience

I have downloaded your version. I am still getting those "model has unused tensor" messages.
The model itself is dumber than GLM-4.5-Air-UD-Q2_K_XL.gguf.
I am not sure if that is because of your quants or the model itself. I am behind a very low-bandwidth connection; it took me 2 days to download it.
Your quant blindly makes tool calls even when they are not required, whereas the original Air mentioned above says a tool is not required because the ask is simple.
Also, I asked the REAP one to "give me a TCP client/server implementation in Rust." It thought about it, made three versions in its thinking, and gave me the simplest one. A waste of tokens. Whereas the original Q2 mentioned earlier planned what it needed to do in its thinking and simply gave me the code in the output. Much more efficient.

Thank you everyone for your effort. For now I think I will stick with GLM-4.5-Air-UD-Q2_K_XL.gguf, which appears to me much better than the REAP one.
