Dec 29 2025: Qwen3-Next updated with iMatrix + Improved performance!

#3 · opened by danielhanchen

Dec 29th 2025: Now updated with imatrix and working well. Previously the imatrix was partially broken, but we managed to fix it!

Quantized Qwen3-Next uploads should now be much improved, especially at lower bit rates! :)

Thanks also go to llama.cpp; they optimized inference for this model even further.

Yes, you will need to redownload.
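If you grabbed the earlier files, one way to pull just the refreshed quant is with the Hugging Face CLI; the include pattern below is only an example, so swap in whichever quant you actually use:

    huggingface-cli download unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF \
        --include "*UD-Q4_K_XL*" \
        --local-dir Qwen3-Next-80B-A3B-Instruct-GGUF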

danielhanchen pinned discussion
danielhanchen changed discussion title from Qwen3-Next now updated with iMatrix! to Qwen3-Next updated with iMatrix + Improved performance!

As 80B is a heavy model, I appreciate improvements to the lower quants as much as possible. I can only run Q3.

This new model upload isn't working out for me at all. At times it repeats itself over and over. I've compiled the most recent llama.cpp. I received this update unintentionally by rerunning llama-server -hf blah.
I'm sad because I was running a large context before, and now I can't have a coherent conversation. The original reason for the restart was that I was tracking down why my KV cache was being invalidated, causing slow responses. I suspected SWA, because I still had VRAM available but was seeing things get evicted.
Now it's just downright unreliable.

build/bin/llama-server -hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF --hf-file Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf -ngl 999 --ctx-size 262144 --host 0.0.0.0 --port 39281 --threads 16 --alias qwen3-next --chat-template chatml

Maybe I'm doing something wrong? Previously I was able to go well above 120k tokens. Now I really can't get stable output, and the model is crazier than normal. All that changed was the GGUF, and then a newer llama.cpp, which made no difference to the weird new results.

UD-Q5_K_XL seems to have gotten dramatically dumber since the update (that's the one I'm using); it can't do what it easily did before (I have a suite of private benchmarks). I ask it to answer 87 questions and it either hallucinates 100 questions or answers only 7 or so and stops. Even smaller Qwens answer the right number of questions, and the previous UD-Q5_K_XL of this model not only answered the right number, it aced most of them. Now it can't even figure out how many questions there are (I ask them all at once).

I asked it 87 multiple choice questions in a single query. I told it to just provide the answer for each one. Very simple for most modern models to do, including Qwen3-4b.

Here's the answer it gave me:

  1. b
  2. c
  3. b
  4. a
  5. a
  6. c
  7. d
  8. c
  9. a
  10. a
  11. c
  12. c
  13. d
  14. c
  15. a
  16. a
  17. a
  18. a

Then I asked: "how many questions did you answer? How many questions are there total?"
Answer:
I answered 18 questions.

There are 18 questions in total in your message.

So I asked: "Can you list all the questions and the question itself for each one?"
Answer:
Actually, in your previous message, you only included 18 answers (the ones I mistakenly thought were questions), but you did not actually list the full 18 questions with their full text β€” you only pasted my previous incorrect response of 18 answers.

In fact, your last message did not contain 18 full questions with full text β€” it only contained my previous incorrect reply.

So, to be clear:

βœ… You asked: β€œCan you list all the questions and the question itself for each one?”
βœ… But you did not paste 11 (or 18) full questions β€” you only asked me to list them, but didn’t paste the actual 11 questions.

So, I cannot list questions that were never pasted.

βœ… You asked: β€œCan you list all the questions and the question itself for each one?”
❌ But you did not paste any questions β€” just asked me to list them.

So, I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked

As you can see, the model is broken; it would never have gone into a schizoid loop like this before.

I'm using llama-server b7324. All other models work just fine, and this one did too prior to the GGUF update. I haven't tested it on older llama.cpp builds, so I'm not 100% sure yet whether it was the GGUF update or a llama.cpp update that broke something. The /slots endpoint confirmed that the full suite of 87 multiple-choice questions was always in its context. The older version of the GGUF answered 78 out of 87 correctly, did not mess up the number of questions, and never got into any kind of repetitive loop.
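(For anyone who wants to run the same check: it's just a GET against the server's slots endpoint, e.g. with curl. The default port is shown here, and some builds gate this endpoint behind a server flag, so treat it as a sketch.)

    curl http://localhost:8080/slots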

I was able to confirm that Bartowski's Q5_K_M still works as expected. So at least that rules out llama.cpp issues; something is wonky with the new GGUFs :(

Similar issue: responses get stuck in a loop using the just-downloaded Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf with llama.cpp.

When the old Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf is used with the latest llama.cpp, there is no issue.

It also loops on Q3.

It also loops on Q8. The old GGUF file works fine.

Unsloth AI org

Hey guys, we're going to investigate; it could be an issue with the conversion.

We've now reverted the uploads to our previous GGUFs. Can you guys see if the problem still persists?

For some reason we could not reproduce the issue in our own testing, but after checking with some others, it seems it doesn't happen about 95% of the time and only shows up maybe 5% of the time. So we've reverted it.

Also, thank you for letting us know about the issue; we really appreciate it!

CC: @Billtown @Gogich77 @LaikaFramework @maigonis @shimmyshimmer

Hi, thanks for your hard work. I used the UD-IQ2XXS version; although it has some repetition issues, it's only 26GB and barely fits in my system. Now the smallest quant is 29GB and that's too big...

Unsloth AI org

Ohh hmm, we'll see if we can upload smaller ones. TBH you're better off using a smaller model at higher precision if that's the case.

Yeah, you're right. I'm really looking forward to Kimi Linear 48B, since Qwen only has 30B and then 80B.

I used the previous version of the Q4_K_XL UD quant, downloaded a few days ago, and it output loops and repetitions. Now the reverted earlier version of the model works normally for me.

The issue has disappeared and the model is back to full power!

Is there a way I can privately message you my benchmark prompt? (I want to keep it out of any public setting so models don't get trained on it.) Maybe on Reddit, if you're OK with that?

You can test it on the UD-Q5_K_XL of both versions. The updated imatrix GGUF was always broken and could never answer the prompt, and the current re-upload always succeeds, so it should be easy to reproduce repeatedly.

Does anyone know how to REAP it down to 60B?
That version at 3bpw would be great.

I think you'd be better off adding more system RAM (to 64 GB total or more), which lets you work with Q4 quants. I used CPU offload in LM Studio with only the active MoE experts in VRAM: approximately ~3 GB of VRAM used (and ~50 GB of system RAM), at ~14 tokens/second.
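For llama.cpp users, a rough equivalent of that setup is keeping the MoE expert tensors in system RAM with a tensor override; the flag and regex below are a sketch and may differ by build, and the quant filename is just an example:

    llama-server -m Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
        -ngl 999 \
        --override-tensor ".ffn_.*_exps.=CPU" \
        --ctx-size 32768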

The goal is different for me. I have a 2.1bpw version running on an L4 at 60 tps.
I expect a REAP25 version at 3bpw to be equally fast but give better answers.

Q8 works well, no problem.

I find it worrying that the cause couldn't be found; I hope it's not something to do with imatrix that could potentially happen for other models too...

My guess is it's related to the Qwen3-Next architecture, since this is its initial implementation, rather than a global thing.

danielhanchen changed discussion title from Qwen3-Next updated with iMatrix + Improved performance! to Dec 29: Qwen3-Next updated with iMatrix + Improved performance!
Unsloth AI org

Dec 29: updated with proper imatrix support this time! Feel free to check it out.

danielhanchen changed discussion title from Dec 29: Qwen3-Next updated with iMatrix + Improved performance! to Dec 29 2025: Qwen3-Next updated with iMatrix + Improved performance!
