Update links to croco.cpp and Thireus builds.

3df769d 2 months ago

18 kB

	---
	quantized_by: ubergarm
	pipeline_tag: text-generation
	base_model: moonshotai/Kimi-K2-Instruct-0905
	license: other
	license_name: modified-mit
	license_link: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/LICENSE
	base_model_relation: quantized
	tags:
	- mla
	- imatrix
	- conversational
	- ik_llama.cpp
	---

	## `ik_llama.cpp` imatrix Quantizations of moonshotai/Kimi-K2-Instruct-0905
	This quant collection REQUIRES [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support the ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

	NOTE `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants.

	Some of ik's new quants are supported with [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP. For pre-built Windows binaries of ik_llama.cpp check out [Thireus' fork here](https://github.com/Thireus/ik_llama.cpp/releases).

	These quants provide best in class perplexity for the given memory footprint.

	## Big Thanks
	Shout out to Wendell and the Level1Techs crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs)! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

	Also thanks to all the folks in the quanting and inferencing community on [BeaverAI Club Discord](https://huggingface.co/BeaverAI) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models!

	## Notes

	* The current imatrix dat file seems to be missing entries for just the single dense layer and shared expert so all my recipes are using `q8_0` for those.
	* For notes on tool calling api endpoints checkout details from this PR: https://github.com/ikawrakow/ik_llama.cpp/pull/723
	* `smol` here simply means the routed experts recipe uses the same quantization for down as well as (gate\|up) tensors.

	## Quant Collection
	Compare with baseline perplexity of full size `Q8_0` 1016.117 GiB (8.504 BPW)

	Final estimate: PPL = 2.4443 +/- 0.01175

	![Perplexity Chart](images/perplexity.png "Chart showing Perplexity improving as BPW increases.")

	### `smol-IQ5_KS` 632.664 GiB (5.295 BPW)
	Final estimate: PPL = 2.4526 +/- 0.01182

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	## Attention [0-60] (GPU)
	blk\..*\.attn_k_b\.weight=q8_0
	blk\..*\.attn_v_b\.weight=q8_0

	# Balance of attn tensors
	blk\..*\.attn_kv_a_mqa\.weight=q8_0
	blk\..*\.attn_q_a\.weight=q8_0
	blk\..*\.attn_q_b\.weight=q8_0
	blk\..*\.attn_output\.weight=q8_0

	## First Single Dense Layer [0] (GPU)
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	## Shared Expert [1-60] (GPU)
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	## Routed Experts [1-60] (CPU)
	blk\..*\.ffn_down_exps\.weight=iq5_ks
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq5_ks

	## Token embedding and output tensors (GPU)
	token_embd\.weight=iq6_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ5_KS.gguf \
	IQ5_KS \
	192
	```

	</details>

	### `smol-IQ4_KSS` 485.008 GiB (4.059 BPW)
	Final estimate: PPL = 2.5185 +/- 0.01221

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	## Attention [0-60] (GPU)
	blk\..*\.attn_k_b\.weight=q8_0
	blk\..*\.attn_v_b\.weight=q8_0

	# Balance of attn tensors
	blk\..*\.attn_kv_a_mqa\.weight=q8_0
	blk\..*\.attn_q_a\.weight=q8_0
	blk\..*\.attn_q_b\.weight=q8_0
	blk\..*\.attn_output\.weight=q8_0

	## First Single Dense Layer [0] (GPU)
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	## Shared Expert [1-60] (GPU)
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	## Routed Experts [1-60] (CPU)
	blk\..*\.ffn_down_exps\.weight=iq4_kss
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq4_kss

	## Token embedding and output tensors (GPU)
	token_embd\.weight=iq6_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ4_KSS.gguf \
	IQ4_KSS \
	192
	```

	</details>

	### `IQ4_KS` 553.624 GiB (4.633 BPW)
	Final estimate: PPL = 2.4641 +/- 0.01190

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	## Attention [0-60] (GPU)
	blk\..*\.attn_k_b\.weight=q8_0
	blk\..*\.attn_v_b\.weight=q8_0

	# Balance of attn tensors
	blk\..*\.attn_kv_a_mqa\.weight=q8_0
	blk\..*\.attn_q_a\.weight=q8_0
	blk\..*\.attn_q_b\.weight=q8_0
	blk\..*\.attn_output\.weight=q8_0

	## First Single Dense Layer [0] (GPU)
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	## Shared Expert [1-60] (GPU)
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	## Routed Experts [1-60] (CPU)
	blk\..*\.ffn_down_exps\.weight=iq5_ks
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq4_ks

	## Token embedding and output tensors (GPU)
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 1 -m 1 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-IQ4_KS.gguf \
	IQ4_KS \
	192
	```

	</details>

	### `IQ3_KS` 420.558 GiB (3.520 BPW)
	Final estimate: PPL = 2.5640 +/- 0.01262

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	## Attention [0-60] (GPU)
	blk\..*\.attn_k_b\.weight=q8_0
	blk\..*\.attn_v_b\.weight=q8_0

	# Balance of attn tensors
	blk\..*\.attn_kv_a_mqa\.weight=q8_0
	blk\..*\.attn_q_a\.weight=q8_0
	blk\..*\.attn_q_b\.weight=q8_0
	blk\..*\.attn_output\.weight=q8_0

	## First Single Dense Layer [0] (GPU)
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	## Shared Expert [1-60] (GPU)
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	## Routed Experts [1-60] (CPU)
	blk\..*\.ffn_down_exps\.weight=iq4_kss
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq3_ks

	## Token embedding and output tensors (GPU)
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-IQ3_KS.gguf \
	IQ3_KS \
	192
	```

	</details>

	### `smol-IQ3_KS` 388.258 GiB (3.249 BPW)
	Final estimate: PPL = 2.5902 +/- 0.01284

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	## Attention [0-60] (GPU)
	blk\..*\.attn_k_b\.weight=q8_0
	blk\..*\.attn_v_b\.weight=q8_0

	# Balance of attn tensors
	blk\..*\.attn_kv_a_mqa\.weight=q8_0
	blk\..*\.attn_q_a\.weight=q8_0
	blk\..*\.attn_q_b\.weight=q8_0
	blk\..*\.attn_output\.weight=q8_0

	## First Single Dense Layer [0] (GPU)
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	## Shared Expert [1-60] (GPU)
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	## Routed Experts [1-60] (CPU)
	blk\..*\.ffn_down_exps\.weight=iq3_ks
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq3_ks

	## Token embedding and output tensors (GPU)
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ3_KS.gguf \
	IQ3_KS \
	192
	```

	</details>

	### `IQ2_KL` 358.419 GiB (3.000 BPW)
	Final estimate: PPL = 2.7993 +/- 0.01416

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	## Attention [0-60] (GPU)
	blk\..*\.attn_k_b\.weight=q8_0
	blk\..*\.attn_v_b\.weight=q8_0

	# Balance of attn tensors
	blk\..*\.attn_kv_a_mqa\.weight=q8_0
	blk\..*\.attn_q_a\.weight=q8_0
	blk\..*\.attn_q_b\.weight=q8_0
	blk\..*\.attn_output\.weight=q8_0

	## First Single Dense Layer [0] (GPU)
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	## Shared Expert [1-60] (GPU)
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	## Routed Experts [1-60] (CPU)
	blk\..*\.ffn_down_exps\.weight=iq3_k
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq2_kl

	## Token embedding and output tensors (GPU)
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-IQ2_KL.gguf \
	IQ2_KL \
	192
	```

	</details>

	### `smol-IQ2_KL` 329.195 GiB (2.755 BPW)
	Final estimate: PPL = 2.9294 +/- 0.01499

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	## Attention [0-60] (GPU)
	blk\..*\.attn_k_b\.weight=q8_0
	blk\..*\.attn_v_b\.weight=q8_0

	# Balance of attn tensors
	blk\..*\.attn_kv_a_mqa\.weight=q8_0
	blk\..*\.attn_q_a\.weight=q8_0
	blk\..*\.attn_q_b\.weight=q8_0
	blk\..*\.attn_output\.weight=q8_0

	## First Single Dense Layer [0] (GPU)
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	## Shared Expert [1-60] (GPU)
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	## Routed Experts [1-60] (CPU)
	blk\..*\.ffn_down_exps\.weight=iq2_kl
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq2_kl

	## Token embedding and output tensors (GPU)
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 1 -m 1 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ2_KL.gguf \
	IQ2_KL \
	192
	```

	</details>

	### `IQ2_KS` 289.820 GiB (2.425 BPW)
	Final estimate: PPL = 3.2478 +/- 0.01721

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	## Attention [0-60] (GPU)
	blk\..*\.attn_k_b\.weight=q8_0
	blk\..*\.attn_v_b\.weight=q8_0

	# Balance of attn tensors
	blk\..*\.attn_kv_a_mqa\.weight=q8_0
	blk\..*\.attn_q_a\.weight=q8_0
	blk\..*\.attn_q_b\.weight=q8_0
	blk\..*\.attn_output\.weight=q8_0

	## First Single Dense Layer [0] (GPU)
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	## Shared Expert [1-60] (GPU)
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	## Routed Experts [1-60] (CPU)
	blk\..*\.ffn_down_exps\.weight=iq2_kl
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq2_ks

	## Token embedding and output tensors (GPU)
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 1 -m 1 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-IQ2_KS.gguf \
	IQ2_KS \
	192
	```

	</details>

	### `smol-IQ2_KS` 270.133 GiB (2.261 BPW)
	Final estimate: PPL = 3.4977 +/- 0.01924

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	## Attention [0-60] (GPU)
	blk\..*\.attn_k_b\.weight=q8_0
	blk\..*\.attn_v_b\.weight=q8_0

	# Balance of attn tensors
	blk\..*\.attn_kv_a_mqa\.weight=q8_0
	blk\..*\.attn_q_a\.weight=q8_0
	blk\..*\.attn_q_b\.weight=q8_0
	blk\..*\.attn_output\.weight=q8_0

	## First Single Dense Layer [0] (GPU)
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	## Shared Expert [1-60] (GPU)
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	## Routed Experts [1-60] (CPU)
	blk\..*\.ffn_down_exps\.weight=iq2_ks
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq2_ks

	## Token embedding and output tensors (GPU)
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ2_KS.gguf \
	IQ2_KS \
	192
	```

	</details>

	### `smol-IQ1_KT` 218.936 GiB (1.832 BPW)
	Final estimate: PPL = 4.2224 +/- 0.02443

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	## Attention [0-60] (GPU)
	blk\..*\.attn_k_b\.weight=q8_0
	blk\..*\.attn_v_b\.weight=q8_0

	# Balance of attn tensors
	blk\..*\.attn_kv_a_mqa\.weight=q8_0
	blk\..*\.attn_q_a\.weight=q8_0
	blk\..*\.attn_q_b\.weight=q8_0
	blk\..*\.attn_output\.weight=q8_0

	## First Single Dense Layer [0] (GPU)
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	## Shared Expert [1-60] (GPU)
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	## Routed Experts [1-60] (CPU)
	blk\..*\.ffn_down_exps\.weight=iq1_kt
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq1_kt

	## Token embedding and output tensors (GPU)
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/imatrix-Kimi-K2-Instruct-0905-Q8_0.dat \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-384x14B-Instruct-safetensors-0905-BF16-00001-of-00046.gguf \
	/mnt/data/models/ubergarm/Kimi-K2-Instruct-0905-GGUF/Kimi-K2-Instruct-0905-smol-IQ1_KT.gguf \
	IQ1_KT \
	192
	```

	</details>

	## Example Commands
	### Hybrid (multiple) CUDA + CPU
	```bash
	# Two CUDA devices with enough VRAM to offload more layers
	# Keep in mind Kimi-K2 starts at 1 unlike DeepSeek at 3 (first dense layers)
	./build/bin/llama-server \
	--model "$model"\
	--alias ubergarm/Kimi-K2-Instruct-0905 \
	--ctx-size 32768 \
	-ctk q8_0 \
	-fa -fmoe \
	-mla 3 \
	-ngl 99 \
	-ot "blk\.(1\|2\|3)\.ffn_.*=CUDA0" \
	-ot "blk\.(4\|5\|6)\.ffn_.*=CUDA1" \
	-ot exps=CPU \
	--parallel 1 \
	--threads 48 \
	--threads-batch 64 \
	--host 127.0.0.1 \
	--port 8080
	```

	### CPU-Only (no GPU)
	```bash
	# compile
	cmake -B build -DGGML_CUDA=0 -DGGML_BLAS=0 -DGGML_VULKAN=0
	cmake --build build --config Release -j $(nproc)

	# run server
	# single CPU of a dual socket rig configured one NUMA per socket
	numactl -N 0 -m 0 \
	./build/bin/llama-server \
	--model "$model"\
	--alias ubergarm/Kimi-K2-Instruct-0905 \
	--ctx-size 98304 \
	-ctk q8_0 \
	-fa -fmoe \
	-mla 3 \
	--parallel 1 \
	--threads 128 \
	--threads-batch 192 \
	--numa numactl \
	--host 127.0.0.1 \
	--port 8080
	```

	## References
	* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)