IQ2_KS

#1 opened by gghfez

Thanks for doing these, I'm looking forward to trying this model!
Are you doing an IQ2_KS for this one?
(I'm using your IQ2_KS for the previous release with 256GB RAM + 6x24GB VRAM)

Is IQ2_KS good enough for you in terms of quality?

I'll raise my hand for IQ2_KS as well. :-)

Is IQ2_KS good enough for you in terms of quality?

I don't know yet for this one, but for K2, yes. Specifically, ubergarm's IQ2_KS is the only way I can run it locally without it being obviously lobotomized.

That quant/model is able to find logic issues in my fairly bespoke coding projects that Opus 4.1 misses and it's my favorite model for creative writing.

I just tried out the unsloth IQ2_XXS by regenerating the last response in my K2 chats, and it's a lot worse: it misses bugs K2 found, is inattentive in creative writing, etc. It also uses more memory, so I have to place more tensors on CPU.

Hopefully an IQ2_KS will be as great as the K2 one.

Owner

Dealing with some hardware stuff, but I got the imatrix uploaded. I'll prioritize cooking the IQ2_KS first and then do some other sizes.

Thanks and appreciate the feedback!

Owner

Also heads up @Thireus - the new imatrix is up as you saw already, but while using it now I notice it is missing importance weights for the first dense layer's ffn_(gate|down|up) (blk 0 only on Kimi-K2) as well as the shared expert ffn_(gate|down|up)_shexp. I'll be leaving all of those at full q8_0 for this round, and will probably leave the attn tensors at q8_0 as well, since they are a small percentage of the overall weights and the original model seemed quite sensitive to quantization there.

example messages during quantizing:

====== llama_model_quantize_internal: did not find weights for blk.0.ffn_gate.weight
...
====== llama_model_quantize_internal: did not find weights for blk.56.ffn_up_shexp.weight

The imatrix seems to have everything it needs for the routed exps, which are the most important given we're quantizing those the most aggressively.
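
For context, this is roughly what that kind of recipe looks like with ik_llama.cpp's llama-quantize and its --custom-q regex rules. This is a minimal sketch only; the paths and exact tensor regexes are illustrative rather than the actual recipe used:

#!/usr/bin/env bash
# Sketch: keep attention, the first (dense) ffn layer, and the shared experts
# at q8_0, and quantize only the routed experts down to iq2_ks.
custom="
# Attention tensors
blk\..*\.attn_.*=q8_0
# First dense ffn layer (blk 0 only on Kimi-K2)
blk\.0\.ffn_(gate|down|up)\.weight=q8_0
# Shared expert tensors
blk\..*\.ffn_(gate|down|up)_shexp\.weight=q8_0
# Routed experts get the small quant
blk\..*\.ffn_(gate|down|up)_exps\.weight=iq2_ks
"
# Collapse the rules into the comma-separated form --custom-q expects
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')

./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$custom" \
    /path/to/Kimi-K2-BF16.gguf \
    /path/to/Kimi-K2-IQ2_KS.gguf \
    IQ2_KS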

Also I was unable to run imatrix with --layer-importance as it gave this error:

llama_kv_cache_init:        CPU KV buffer size =    34.31 MiB
llama_new_context_with_model: KV self size  =   34.31 MiB, c^KV (f16):   34.31 MiB, kv^T: not used
llama_new_context_with_model:        CPU  output buffer size =     0.63 MiB
llama_new_context_with_model:        CPU compute buffer size =   334.00 MiB
llama_new_context_with_model: graph nodes  = 3340
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 192 (n_threads_batch = 384) / 768 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 551.937 ms
compute_imatrix: computing over 826 chunks with batch_size 512
================= Adjusted mainline llama.cpp MLA tensors to ik_llama.cpp
======================================= HAVE_FANCY_SIMD is defined
Oops, inconsistent ffn vs last_input size

This Oops may be related to the missing importance weights above, but I didn't have time to debug it further.

FWIW I used the triton-cpu method to cast the fp8 safetensors to bf16. Then I used mainline llama.cpp's convert_hf_to_gguf.py, switched over to ik_llama.cpp for quantizing the pure q8_0 and getting the imatrix from it, and am now quantizing the rest from the bf16 GGUF.
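
For anyone wanting to reproduce that pipeline, here is a rough sketch; binary names, flags, calibration data, and paths are assumptions based on recent mainline llama.cpp / ik_llama.cpp builds, not an exact transcript of the commands used:

# 1. Cast the fp8 safetensors to bf16 (triton-cpu method; tooling for that
#    step lives outside llama.cpp, so it is omitted here).

# 2. Convert the bf16 safetensors to a bf16 GGUF with mainline llama.cpp.
python llama.cpp/convert_hf_to_gguf.py /path/to/Kimi-K2-bf16 \
    --outfile Kimi-K2-BF16.gguf --outtype bf16

# 3. Make a pure q8_0 GGUF with ik_llama.cpp to compute the imatrix against.
./ik_llama.cpp/build/bin/llama-quantize --pure \
    Kimi-K2-BF16.gguf Kimi-K2-Q8_0.gguf Q8_0

# 4. Compute the importance matrix from the q8_0 model.
./ik_llama.cpp/build/bin/llama-imatrix -m Kimi-K2-Q8_0.gguf \
    -f calibration_data.txt -o imatrix.dat

# 5. Quantize the final mixes from the bf16 GGUF using that imatrix.
./ik_llama.cpp/build/bin/llama-quantize --imatrix imatrix.dat \
    Kimi-K2-BF16.gguf Kimi-K2-IQ2_KS.gguf IQ2_KS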

@ubergarm , thanks for the heads up!

Owner

@gghfez @mtcl @whoisjeremylam

Okie folks, first one is uploaded: IQ2_KS 289.820 GiB (2.425 BPW) !!!

It is a bit heavy on VRAM given the attn/first dense layer/shared expert tensors are all full Q8_0, but it should give the best quality despite the smaller routed experts. I'll have a few more sizes available later today if all goes well.
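
As a rough illustration of fitting a quant like this across RAM and multiple GPUs, something along these lines is the usual pattern with -ot/--override-tensor; the layer ranges, device names, and context size below are placeholders, not a tested recipe:

# Sketch: offload all layers, pin a few routed-expert layers to each GPU, and
# push the remaining routed experts to CPU RAM (attn/shexp/dense stay on GPU).
./build/bin/llama-server \
    -m Kimi-K2-Instruct-IQ2_KS.gguf \
    --n-gpu-layers 99 \
    -ot "blk\.(0|1|2|3)\.ffn_.*_exps=CUDA0" \
    -ot "blk\.(4|5|6|7)\.ffn_.*_exps=CUDA1" \
    -ot "exps=CPU" \
    --ctx-size 32768 \
    --host 0.0.0.0 --port 5000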

Cheers!

Oh, I just realized that the OpenAI API has the "Responses API" and the older "Chat Completions API", but they are different, with different behaviors and JSON responses: https://github.com/openai/openai-python?tab=readme-ov-file#usage This might be important if you are trying to get tool use working with your client.
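
A quick way to see the difference is to hit both endpoints directly. This is a hedged sketch against a local OpenAI-compatible server: the request/response shapes below are what the official OpenAI API defines, and whether a given local server actually implements /v1/responses is not something to assume:

# Older Chat Completions API: send "messages", read choices[0].message.content
curl http://localhost:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"kimi-k2","messages":[{"role":"user","content":"hello"}]}'

# Newer Responses API: send "input"; the reply comes back as a list of
# "output" items (e.g. output_text) rather than "choices"
curl http://localhost:5000/v1/responses \
    -H "Content-Type: application/json" \
    -d '{"model":"kimi-k2","input":"hello"}'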

-ooae (offload only activated experts) flag?

I've never tried it myself, but the PR suggests it can give a good speedup for some models while slowing down others, depending on how the routed experts are used: https://github.com/ikawrakow/ik_llama.cpp/pull/698

Right, seemed to have no effect for Kimi-K2 IQ2_KS.

Sorry @ubergarm ! I've been away and then got very busy with work.

  1. How are you running llama-server, especially regarding stuff like --jinja or not, whether you are using your own myCustomTemplate.jinja chat template, and any other args like --reasoning-format and --reasoning-budget? If you're not using any of those, do the llama-server debug logs look like they are showing the correct expected template?

I'm using --jinja with the built-in template.

  2. How are you calling the Python tool, especially making sure to hit the correct API endpoints (e.g. /v1/* are the OpenAI-API-compliant ones now, pretty sure, and the older endpoints are not behind /v1/*, if I understand the recent ik_llama.cpp changes correctly)?

Sure, here is the command that I am using:

python tool_calls_eval.py samples.jsonl \
    --model kimi-k2-0905 \
    --base-url http://192.168.100.200:5000/v1 \
    --api-key not_used \
    --concurrency 1 \
    --output results.jsonl \
    --summary summary.json

  3. Does this test expect it to actually do something with the tool calls, or just set them up correctly? If it just sets them up, then I guess it should be able to test ik_llama.cpp without additional tool-call framework stuff?

I'm not sure to be honest... I haven't looked into what the script is actually doing...

Would be curious to see how these quants perform on that!

That's a great question that I thought might be answered by running this script!

EDIT: A new PR about tool-calling just came in, if you want to apply it, test, and report back on that GitHub issue thread: https://github.com/ikawrakow/ik_llama.cpp/pull/799

I did just try it today and unfortunately I get a core dump. I've raised an issue, since core dumping presumably isn't good: https://github.com/ikawrakow/ik_llama.cpp/issues/865

@whoisjeremylam Someone ran that tool on Kimi-K2 with llama.cpp (from reddit).

k2vv-llamacpp, reddit post

From his repo, he had to patch the latest jinja file from the kimi repo:

--- chat_template.jinja 2025-10-28 14:32:49.869564619 -0500
+++ ../chat_template_fixed.jinja        2025-10-29 09:18:41.848487773 -0500
@@ -15,7 +15,7 @@


 {%- if tools -%}
-  <|im_system|>tool_declare<|im_middle|>{{ tools | tojson(separators=(',', ':')) }}<|im_end|>
+  <|im_system|>tool_declare<|im_middle|>{{ tools | tojson }}<|im_end|>
 {%- endif -%}
 {% for message in messages %}
   {%- if loop.first and messages[0]['role'] != 'system' -%}

https://github.com/usrlocalben/k2vv-llamacpp?tab=readme-ov-file#3-chat-template-trouble

But he also had issues with these ^ quants:

At the time K2 was new, at least two different people had concurrent WiP towards conversion scripts and llama.cpp support. I recall much of the noise in the WiP development was around the tokenizer, but some of the details are fuzzy now. Comparing some of the common GGUFs available, there is variation in tokens, both string/num table and special-token assignments. I first noticed this when trying to use jukofyork's DRAFT models, which would not bind with e.g. ubergarm's quants due to special-token mismatches.

https://github.com/usrlocalben/k2vv-llamacpp?tab=readme-ov-file#additional-note-about-k2-ggufs

@gghfez thank you.

I'll apply the patch to the chat template and specify that chat template file manually along with --jinja!

Owner

@gghfez

Hey thanks for linking @usrlocalben 's reddit post and recent GitHub repo! The tl;dr for me is that both ik_llama.cpp and mainline llama.cpp work well with Kimi-K2 tool-calling when provided an updated template, e.g. llama-server ... --jinja --chat-template-file myupdatetemplate.jinja (the feature --chat-template-file is not documented on ik's llama-server --help, but it overrides the GGUF's built-in template with whatever you want to use).

Right, my own procedure is to always use the official chat template as provided in tokenizer_config.json and the default values from the safetensors config.json (e.g. for things like rope and yarn). Then I never update the GGUF, since users can override the chat template or any key-value metadata themselves at runtime.

This makes it so there's no need to release two versions of a big GGUF for "super long context 128k extended yarn" when the user can just do stuff like --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 40960 --override-kv qwen3.context_length=int:163840... Sure, it is a PITA and ollama people might not be able to do it, but anyway...
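
Put together, here is a hedged example of what that runtime-override approach looks like; the flags follow mainline llama-server conventions, and the yarn numbers are the Qwen3 example from above, not Kimi-K2 settings:

# Override the built-in chat template and extend context via yarn at runtime
# instead of baking either into the GGUF. Values are illustrative only.
./build/bin/llama-server \
    -m model.gguf \
    --jinja \
    --chat-template-file chat_template_fixed.jinja \
    --rope-scaling yarn \
    --rope-scale 4 \
    --yarn-orig-ctx 40960 \
    --override-kv qwen3.context_length=int:163840 \
    --ctx-size 131072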

Cheers y'all!

"super long context 128k extended yarn"

LOL. I didn't know people were doing this!

the feature --chat-template-file is not documented on ik's llama-server --help

Yeah, I had a funny experience with that: I was all ready to port it over from llama.cpp, then realized it's already there.

@whoisjeremylam Hey mate, could you share the results for your tool_calls_eval benchmark?
