Looking forward to trying this!

Downloading now :)

I have a weird NVL2 GH200 system (https://dnhkng.github.io/posts/hopper/), so I have lots of RAM (960GB) for model quantisation. Happy to help with conversions, just reach out!

PS. I will try them both (and I was a big fan of Exllama back in version 2), but if you had to pick one, this or mratsim/GLM-4.7-EXL3? I understand this model works very well with Claude Code.

OK, now that I've died a little (reading about your system) and resurrected myself... how on Earth did you get that beast for such a price? I also live in Germany, and I would go on foot to any Bavarian forest, cash in one hand and a torch in the other, looking for Bernhard! 👊

It was the deal of the century; pure luck I saw the offer on r/LocalLlama while browsing the new section (which I never normally do). I think he only sold it to me because I offered to come and collect it, and he didn't want to bother with postage, insurance, etc.

Just wrote to him... maybe he'll do the same for another German willing to travel and pick it up! :) Wish me luck!

TabbyAPI / ExLlama v3 currently doesn't work with tool calling (https://github.com/theroyallab/tabbyAPI/pull/378#issuecomment-3679072283), which is why I've been quantizing MiniMax.

I've been trying my hand at vibe-coding / reasoning for 2 weeks now:

One thing is that on 2x RTX Pro 6000, GLM-4.7-3.84bpw runs at 27-37 tok/s (depending on sampler, context length, ...), while this MiniMax quant does 80-110 tok/s.

The other model I want to try is MiMo-V2-Flash, as it uses attention sinks like GPT-OSS. On the current AWQ quant I reached over 130 tok/s, but it kept looping, and there is no proper "all expert calibration" file in llmcompressor, which can have a devastating impact on coding abilities: https://avtc.github.io/aquarium-side-by-side/

I finally tried MiMo-V2-Flash on OpenRouter, and it often finishes the thinking and answer before GLM-4.7 has even started. However, the main repo has several open issues about problems with Claude Code.

For ExLlama v3 models, what about using a proxy between Claude Code and Tabby? Something like this might work, and it's less work for the TabbyAPI maintainers: https://github.com/1rgs/claude-code-proxy
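
The core of that proxy idea is just translating Claude Code's Anthropic Messages requests into OpenAI-style chat completions that Tabby understands. Here is a minimal sketch of the translation, assuming a local TabbyAPI at 127.0.0.1:5000 and deliberately ignoring streaming, system prompts and tool calls (the linked claude-code-proxy handles those properly):

```python
# Minimal sketch of the Anthropic -> OpenAI translation a proxy would do.
# Assumes TabbyAPI serves an OpenAI-compatible endpoint at TABBY_URL;
# streaming, system prompts and tool calls are deliberately left out.
import httpx
from fastapi import FastAPI, Request

TABBY_URL = "http://127.0.0.1:5000/v1/chat/completions"  # assumption

app = FastAPI()

@app.post("/v1/messages")
async def messages(request: Request):
    body = await request.json()
    # Map the Anthropic Messages fields onto the OpenAI chat-completions schema.
    openai_payload = {
        "model": body.get("model", "local"),
        "max_tokens": body.get("max_tokens", 1024),
        "messages": [
            {"role": m["role"], "content": m["content"]}
            for m in body["messages"]
        ],
    }
    async with httpx.AsyncClient(timeout=600) as client:
        resp = await client.post(TABBY_URL, json=openai_payload)
    choice = resp.json()["choices"][0]["message"]
    # Wrap the completion back into an Anthropic-style response.
    return {
        "type": "message",
        "role": "assistant",
        "content": [{"type": "text", "text": choice["content"]}],
        "stop_reason": "end_turn",
    }
```

Claude Code can then be pointed at the proxy by setting ANTHROPIC_BASE_URL to the proxy's address, which is also how claude-code-proxy hooks itself in.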

I saw this patch for llama.cpp (https://github.com/ggml-org/llama.cpp/discussions/18005) and worked in this direction for a while, but it's probably only suitable for my Frankenstein server. On Q8, using unified RAM, I was getting slightly higher initial speeds for the first few dozen tokens (>50 tok/s), but then it falls back to about 10-15.

It would be interesting to examine which experts REAP keeps, load only those into VRAM, and keep the rest in system RAM, for better coding.

I'm not even talking about Claude Code; even plain chat-completions tool calls don't work.
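
For reference, this is roughly the kind of plain request that fails: an OpenAI-style chat completion with a tools array sent straight to the backend. This is only a sketch, assuming TabbyAPI on 127.0.0.1:5000; the model name and the tool definition are made up:

```python
# Sketch of a plain OpenAI-style chat-completions request with a tool
# definition, assuming TabbyAPI is listening on 127.0.0.1:5000.
import requests

payload = {
    "model": "MiniMax-exl3",  # hypothetical model name
    "messages": [
        {"role": "user", "content": "What is the weather in Berlin?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions", json=payload, timeout=600
)
# A working backend should return a tool_calls entry here instead of plain text.
print(resp.json()["choices"][0]["message"])
```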

I think you either need to make tabbyAPI support the tool-calling format of GLM models (XML), or you need to convince GLM to do tool calls in JSON format. At the proxy level, tool calls have a specific wire format that you can't change.
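
To make the mismatch concrete: GLM-style models emit their tool calls as XML-ish tags inside the assistant text, while the OpenAI wire format expects a structured tool_calls array with JSON-string arguments. Something in the stack has to do a conversion roughly like the one below. This is a sketch only; the <tool_call>/<name>/<arguments> tag layout is made up for illustration and is not GLM's actual chat template:

```python
# Rough sketch of converting an XML-style tool call emitted as plain text
# into the OpenAI tool_calls wire format. The <tool_call> tag layout here is
# an assumption for illustration, not GLM's actual chat template.
import json
import re
import uuid

def xml_to_openai_tool_calls(text: str) -> list[dict]:
    calls = []
    pattern = re.compile(
        r"<tool_call>\s*<name>(.*?)</name>\s*<arguments>(.*?)</arguments>\s*</tool_call>",
        re.DOTALL,
    )
    for name, args in pattern.findall(text):
        calls.append(
            {
                "id": f"call_{uuid.uuid4().hex[:8]}",
                "type": "function",
                "function": {
                    "name": name.strip(),
                    # OpenAI expects the arguments as a JSON string.
                    "arguments": json.dumps(json.loads(args)),
                },
            }
        )
    return calls

example = (
    "<tool_call><name>get_weather</name>"
    '<arguments>{"city": "Berlin"}</arguments></tool_call>'
)
print(xml_to_openai_tool_calls(example))
```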

This works great with Claude Code at first glance. I will test it more thoroughly over the weekend, but it was happily editing files, grepping and examining my code, all running locally 🤩

Keep the good news coming! It would be such a relief to have someone else's confirmation that it works well locally with Claude Code. Curious how it behaves with long context and some large code refactoring... but your behemoth will probably not break a sweat in the process.

Please report back if you get more tests done. Thanks for the effort!
