Dec 29 2025: Qwen3-Next updated with iMatrix + Improved performance!

#3 · opened by danielhanchen

Dec 29th 2025: Now updated with imatrix and working well. Previously the imatrix was partially broken, but we managed to fix it!

Quantized Qwen3-Next uploads should now be much improved, especially at lower bit rates! :)

Thanks also go to llama.cpp; they optimized inference for this model even further.

Yes, you will need to redownload.
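If you grabbed the earlier files, one way to pull just the refreshed quant is with the Hugging Face CLI; the include pattern below is only an example, so swap in whichever quant you actually use:

    huggingface-cli download unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF \
        --include "*UD-Q4_K_XL*" \
        --local-dir Qwen3-Next-80B-A3B-Instruct-GGUF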

danielhanchen pinned discussion
danielhanchen changed discussion title from Qwen3-Next now updated with iMatrix! to Qwen3-Next updated with iMatrix + Improved performance!

As 80B is a heavy model, I appreciate improvements to the lower quants as much as possible. I can only run Q3.

This new model upload isn't working out for me at all. At times it repeats itself over and over. I've compiled the most recent llama.cpp. I received this update unintentionally by rerunning llama-server -hf blah.
I'm sad because I was running a large context before, and now I can't have a coherent conversation. The original reason for the restart was that I was tracking down why my KV cache was being invalidated, causing slow responses. I suspected SWA, because I still had VRAM available but was seeing things get evicted.
Now it's just downright unreliable.

build/bin/llama-server -hf unsloth/Qwen3-Next-80B-A3B-Instruct-GGUF --hf-file Qwen3-Next-80B-A3B-Instruct-Q4_0.gguf -ngl 999 --ctx-size 262144 --host 0.0.0.0 --port 39281 --threads 16 --alias qwen3-next --chat-template chatml

Maybe I'm doing something wrong? Previously I was able to go well above 120k tokens. Now I really can't get stable output, and the model is crazier than normal. All that changed was the GGUF, and then a newer llama.cpp, which made no difference to the weird new results.

UD-Q5_K_XL seems to have gotten dramatically dumber since the update (that's the one I'm using); it can't do what it easily did before (I have a suite of private benchmarks). I ask it to answer 87 questions and it either hallucinates 100 questions or answers only 7 or so and stops. Even smaller Qwens answer the right number of questions, and the previous UD-Q5_K_XL of this model not only answered the right number, it aced most of them. Now it can't even figure out how many questions there are (I ask them all at once).

I asked it 87 multiple choice questions in a single query. I told it to just provide the answer for each one. Very simple for most modern models to do, including Qwen3-4b.

Here's the answer it gave me:

  1. b
  2. c
  3. b
  4. a
  5. a
  6. c
  7. d
  8. c
  9. a
  10. a
  11. c
  12. c
  13. d
  14. c
  15. a
  16. a
  17. a
  18. a

Then I asked: "how many questions did you answer? How many questions are there total?"
Answer:
I answered 18 questions.

There are 18 questions in total in your message.

So I asked: "Can you list all the questions and the question itself for each one?"
Answer:
Actually, in your previous message, you only included 18 answers (the ones I mistakenly thought were questions), but you did not actually list the full 18 questions with their full text β€” you only pasted my previous incorrect response of 18 answers.

In fact, your last message did not contain 18 full questions with full text β€” it only contained my previous incorrect reply.

So, to be clear:

βœ… You asked: β€œCan you list all the questions and the question itself for each one?”
βœ… But you did not paste 11 (or 18) full questions β€” you only asked me to list them, but didn’t paste the actual 11 questions.

So, I cannot list questions that were never pasted.

βœ… You asked: β€œCan you list all the questions and the question itself for each one?”
❌ But you did not paste any questions β€” just asked me to list them.

So, I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked me to list the questions β€” but you never pasted them.

So I cannot list what wasn’t provided.

βœ… You asked

As you can see, the model is broken; it would never have gone into a schizoid loop like this before.

I'm using llama-server b7324. All other models work just fine, and this one did too prior to the GGUF update. I haven't tested it on older llama.cpp builds, so I'm not 100% sure yet whether it was the GGUF update or a llama.cpp update that broke something. The /slots endpoint confirmed that the full suite of 87 multiple-choice questions was always in its context. The older version of the GGUF answered 78 out of 87 correctly, did not mess up the number of questions, and never got into any kind of repetitive loop.
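(For anyone who wants to run the same check: it's just a GET against the server's slots endpoint, e.g. with curl. The default port is shown here, and some builds gate this endpoint behind a server flag, so treat it as a sketch.)

    curl http://localhost:8080/slots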

I was able to confirm that Bartowski's Q5_K_M still works as expected. So at least that rules out llama.cpp issues; something is wonky with the new GGUFs :(

Similar issue: responses get stuck in a loop using the just-downloaded Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf with llama.cpp.

When the old Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf is used with the latest llama.cpp, there is no issue.

It also loops on Q3.

It also loops on Q8. The old GGUF file works fine.

Unsloth AI org

Hey guys, we're going to investigate; it could be an issue with the conversion.

We've now reverted the uploads to our previous GGUFs. Can you guys see if the problem still persists?

For some reason we could not reproduce the issue in our own testing, but after checking with some others, it seems it doesn't happen about 95% of the time and only shows up maybe 5% of the time. So we've reverted it.

Also, thank you for letting us know about the issue; we really appreciate it!

CC: @Billtown @Gogich77 @LaikaFramework @maigonis @shimmyshimmer

Hi, thanks for your hard work. I used the UD-IQ2XXS version; although it has some repetition issues, it's only 26GB and barely fits in my system. Now the smallest quant is 29GB and that's too big...

Unsloth AI org

Ohh hmm, we'll see if we can upload smaller ones. TBH you're better off using a smaller model at higher precision if that's the case.

Yeah, you're right. I'm really looking forward to Kimi Linear 48B, since Qwen only has 30B and then 80B.

I used the previous version of the Q4_K_XL UD quant, downloaded a few days ago, and it output loops and repetitions. Now the reverted earlier version of the model works normally for me.

The issue has disappeared and the model is back to full power!

Is there a way I can privately message you my benchmark prompt? (I want to keep it out of any public setting so models don't get trained on it.) Maybe on Reddit, if you're OK with that?

You can test it on the UD-Q5_K_XL of both versions. The updated imatrix GGUF was always broken and could never answer the prompt, and the current re-upload always succeeds, so it should be easy to reproduce repeatedly.

Does anyone know how to REAP it down to 60B?
That version at 3bpw would be great.

I think you'd be better off adding more system RAM (to 64 GB total or more), which lets you work with Q4 quants. I used CPU offload in LM Studio with only the active MoE experts in VRAM: approximately ~3 GB of VRAM used (and ~50 GB of system RAM), at ~14 tokens/second.
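For llama.cpp users, a rough equivalent of that setup is keeping the MoE expert tensors in system RAM with a tensor override; the flag and regex below are a sketch and may differ by build, and the quant filename is just an example:

    llama-server -m Qwen3-Next-80B-A3B-Instruct-Q4_K_M.gguf \
        -ngl 999 \
        --override-tensor ".ffn_.*_exps.=CPU" \
        --ctx-size 32768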

The goal is different for me. I have a 2.1bpw version running on an L4 at 60 tps.
I expect a REAP25 version at 3bpw to be equally fast but give better answers.

Q8 works well, no problem.

I find it worrying that the cause couldn't be found; I hope it's not something to do with imatrix that could potentially happen for other models too...

My guess is it's related to the Qwen3-Next architecture, since this is its initial implementation, rather than a global thing.

danielhanchen changed discussion title from Qwen3-Next updated with iMatrix + Improved performance! to Dec 29: Qwen3-Next updated with iMatrix + Improved performance!
Unsloth AI org

Dec 29: updated with proper imatrix support this time! Feel free to check it out.

danielhanchen changed discussion title from Dec 29: Qwen3-Next updated with iMatrix + Improved performance! to Dec 29 2025: Qwen3-Next updated with iMatrix + Improved performance!
