4 bpw suggestions
Another great quant, thanks! If it's not too much trouble, maybe maximize for 2x96 GB with max context. That might be a nice target.
Thanks! The existing IQ5_K 157.771 GiB (5.926 BPW) will fit comfortably in 192GB VRAM with plenty of context. Generally there isn't much reason to go above this quality as it uses iq6_k for the ffn_down_exps tensors and iq5_k for the ffn_(gate|up)_exps tensors.
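For anyone who wants to sanity-check the fit, here is a rough back-of-the-envelope sketch. It only uses the two numbers quoted above (157.771 GiB and 5.926 BPW). Reading "192GB" as decimal gigabytes is an assumption, the function names are made up for illustration, and KV cache and compute buffers are not modelled at all.

```python
GIB = 1024 ** 3  # bytes per GiB

def approx_weights(size_gib: float, bpw: float) -> float:
    """Rough number of weights implied by a file size and average bits per weight."""
    return size_gib * GIB * 8 / bpw

def headroom_gib(size_gib: float, vram_gb: float) -> float:
    """GiB left after loading the weights, assuming vram_gb is decimal GB (10^9 bytes)."""
    vram_gib = vram_gb * 1000 ** 3 / GIB
    return vram_gib - size_gib

weights = approx_weights(157.771, 5.926)
spare = headroom_gib(157.771, 192)
print(f"about {weights / 1e9:.1f}B weights")
print(f"about {spare:.1f} GiB left for KV cache and buffers")
```

If the cards actually expose closer to 96 GiB each, the headroom is larger than this estimate, so treat the decimal reading as the pessimistic case.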
I'll have some perplexity/kld data soon to see the details!
Well then request fulfilled, thank you!
How do you run a perplexity test? I'm curious about the unsloth quant I'm using.
Look at the logs folder for my exact syntax. Keep in mind that different backends might have some offset relative to each other, e.g. CUDA vs CPU (I run CPU-only testing), so be careful attempting to compare across model providers.
ask me if you have any questions
the wiki.test.raw is available here: https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/blob/main/wiki.test.raw.gz (gunzip it first of course)
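If gunzip isn't handy (e.g. on Windows), the same decompression step can be sketched with the Python standard library. The helper name `gunzip` and the local file name are assumptions for illustration; the file still has to be downloaded from the link above first.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional

def gunzip(src: str, dst: Optional[str] = None) -> Path:
    """Decompress a .gz file; by default writes next to it without the .gz suffix."""
    src_path = Path(src)
    dst_path = Path(dst) if dst else src_path.with_suffix("")
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, no full read into memory
    return dst_path

# Example (assumes wiki.test.raw.gz is already in the current directory):
# gunzip("wiki.test.raw.gz")  # writes wiki.test.raw
```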
i have some notes and links about this spread about too
Does this command look correct if I test on CUDA?
llama-perplexity.exe --ctx-size 512 --batch-size 512 -f web_feed\wiki-text-raw.txt -fa 1 -ngl 999 --seed 1337 --threads 1 -m MiniMax-M2.7-UD-Q4_K_XL-00001-of-00004.gguf
perplexity: tokenizing the input ..
perplexity: tokenization took 994.582 ms
perplexity: calculating perplexity over 552 chunks, n_ctx=512, batch_size=512, n_seq=1
perplexity: 0.54 seconds per pass - ETA 4.98 minutes
[1]4.0207,[2]5.2011,[3]5.0865,[4]5.6036,[5]5.7799,[6]6.3037,[7]6.6499,[8]7.6539,[9]8.0681,[10]8.2523,[11]8.3057,[12]8.6900,[13]8.6879,[14]8.5312,[15]8.6820,[16]8.2602,[17]8.4055,[18]8.3708,[19]8.1997,[20]7.9595,[21]7.9015,[22]7.6729,[23]7.4212,[24]7.2696,[25]6.9098,[26]6.7656,[27]6.8904,[28]6.9075,[29]6.9217,[30]6.9122,[31]6.8547,[32]-nan,[33]-nan,[34]-nan,[35]-nan,[36]-nan,[37]-nan,[38]-nan,[39]-nan,[40]-nan,[41]-nan,[42]-nan,[43]-nan,[44]-nan,[45]-nan,[46]-nan,[47]-nan,[48]-nan,[49]-nan,[50]-nan,[51]-nan,[52]-nan,[53]-nan,[54]-nan,[55]-nan,[56]-nan,[57]-nan,[58]-nan,[59]-nan,[60]-nan,[61]-nan,[62]-nan,[63]-nan,[64]-nan,[65]-nan,[66]-nan,[67]-nan,[68]-nan,[69]-nan,[70]-nan,[71]-nan,[72]-nan,[73]-nan,[74]-nan,[75]-nan,[76]-nan,[77]-nan,[78]-nan,[79]-nan,[80]-nan,[81]-nan,[82]-nan,[83]-nan,[84]-nan,[85]-nan,[86]-nan,[87]-nan,[88]-nan,[89]-nan,[90]-nan,[91]-nan,[92]-nan,[93]-nan,[94]-nan,[95]-nan,[96]-nan,[97]-nan,[98]-nan,[99]-nan,[100]-nan,[101]-nan,[102]-nan,[103]-nan,[104]-nan,[105]-nan,[106]-nan,[107]-nan,[108]-nan,[109]-nan,[110]-nan,[111]-nan,[112]-nan,[113]-nan,[114]-nan,[115]-nan,[116]-nan,[117]-nan,[118]-nan,[119]-nan,[120]-nan,[121]-nan,[122]-nan,[123]-nan,[124]-nan,[125]-nan,[126]-nan,[127]-nan,[128]-nan,[129]-nan,[130]-nan,[131]-nan,[132]-nan,[133]-nan,[134]-nan,[135]-nan,[136]-nan,[137]-nan,[138]-nan,[139]-nan,[140]-nan,[141]-nan,[142]-nan,[143]-nan,[144]-nan,[145]-nan,[146]-nan,[147]-nan,[148]-nan,[149]-nan,[150]-nan,[151]-nan,[152]-nan,[153]-nan,[154]-nan,[155]-nan,[156]-nan,[157]-nan,[158]-nan,[159]-nan,[160]-nan,[161]-nan,[162]-nan,[163]-nan,[164]-nan,[165]-nan,[166]-nan,[167]-nan,[168]-nan,[169]-nan,[170]-nan,[171]-nan,[172]-nan,[173]-nan,[174]-nan,[175]-nan,[176]-nan,[177]-nan,[178]-nan,[179]-nan,[180]-nan,[181]-nan,[182]-nan,[183]-nan,[184]-nan,[185]-nan,[186]-nan,[187]-nan,[188]-nan,[189]-nan,[190]-nan,[191]-nan,[192]-nan,[193]-nan,[194]-nan,[195]-nan
Well that seems to have gone off track. No big deal, just a curiosity.
for CUDA full offload that looks about right... the numbers seem okay, what do you mean it went off track?
example for full GPU offload:
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
-ngl 999 \
--seed 1337 \
--ctx-size 512 \
--threads 1 \
--no-mmap
yeah should be fine... the seed doesn't matter as no sampling is done here... i just like it haha
It turned into nans after [32]. Trying your command also turns into nans at [32]. You might have to scroll right to see where the nan sequence begins.
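Rather than scrolling right, a small script can find where the values stop being finite. This is only a sketch that assumes the `[N]value` format shown in the log above; the function names are made up for illustration.

```python
import math
import re

# Matches entries like "[31]6.8547" or "[32]-nan" from llama-perplexity output.
CHUNK_RE = re.compile(r"\[(\d+)\](-?nan|-?inf|[-+]?\d+(?:\.\d+)?)", re.IGNORECASE)

def parse_chunks(log_text):
    """Return a list of (chunk_index, value) pairs found in the log text."""
    return [(int(idx), float(val)) for idx, val in CHUNK_RE.findall(log_text)]

def first_non_finite(chunks):
    """Return the index of the first chunk whose value is nan or inf, else None."""
    for idx, val in chunks:
        if not math.isfinite(val):
            return idx
    return None

sample = "[30]6.9122,[31]6.8547,[32]-nan,[33]-nan"
print(first_non_finite(parse_chunks(sample)))
```

To use it on a real run, redirect the llama-perplexity output to a file and pass the file contents to `parse_chunks`.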
I don't think there's anything wrong with evaluation since I can generate long sequences of 30-50k tokens in opencode. Maybe llama-perplexity has an issue there.
ooooh ... nan is bad.. that typically indicates a numerical issue either with the backend kernels or the quant itself... i have not seen any nans on my quants running on CPU backend...
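One possible reading of why every value after [32] is nan, rather than just a few: if the bracketed numbers are a running estimate over all chunks so far (an assumption about how llama-perplexity reports them), then a single bad chunk poisons everything after it. The sketch below uses made-up per-chunk numbers purely to illustrate that; perplexity is taken as exp of the mean negative log-likelihood per token.

```python
import math

def running_perplexity(chunk_stats):
    """chunk_stats: list of (summed negative log-likelihood, token count) per chunk.

    Returns the cumulative perplexity after each chunk.
    """
    total_nll, total_tokens, out = 0.0, 0, []
    for nll, n_tokens in chunk_stats:
        total_nll += nll          # nan + anything stays nan from here on
        total_tokens += n_tokens
        out.append(math.exp(total_nll / total_tokens))
    return out

# Hypothetical values, not taken from any real run: the third chunk is nan.
chunks = [(512.0, 256), (540.0, 256), (float("nan"), 256), (500.0, 256)]
ppl = running_perplexity(chunks)
print(ppl)
```

So the log alone does not say whether only chunk 32 is broken or every chunk from 32 onward is.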
check here for ik_llama.cpp windows builds (which you might already be using?) https://github.com/Thireus/ik_llama.cpp/releases
It's strange. I get that after building the latest ik_llama.cpp for Windows, but also see the same error after [32] with llama.cpp. Maybe it's just an unsloth UD Q4_K_XL thing.
You can try to run it with ik_llama.cpp and add --validate-quants to the command and it should tell you if there are blocks of 0 in the quant... i run all my quants through that before releasing... hrmm.. thanks for sharing some info, if it persists might want to let unsloth know...
kind of interesting find in the wild here with nans showing up on 4ish BPW mainline mixes across some models...
Looks like ppl here is 0 around Q4 with the same nan issue. https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/discussions/1#69dbfd62ab6e80fe0c444fda
@ndroidph it partially worked looking at the log: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/blob/main/kld_data/aes_sedai/MiniMax-M2.7-Q4_K_M.md
but after the first couple of batches it started throwing nan.
That's similar to what I see after batch 32.
For folks who haven't seen https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/ - the NaN issue has been fixed for all our quants - Aes also re-uploaded Q4_K_M with the Q6_K fix.
Bart is still investigating, but will most likely upload in the next few days.
The issue isn't isolated to us - 10/26 (38%) of Bartowski's quants also NaN whilst 5/23 (22%) of ours NaN. So it's a widespread issue.
blk.61.ffn_down_exps overflows under Q4_K and Q5_K, so Q6_K must be used
We've narrowed it down a bit. So far it seems the issue:
- does not happen on CPU-only compiled backend
- ik_llama.cpp --validate-quants does not find any issues with the quant
- works okay on Vulkan backend unless forced to int8 where it will begin throwing nan
so it could be an issue with the CUDA backend path specific to q4_K mmq kernel perhaps, but still trying to isolate exactly the issue.