Kimi-K2-Instruct-0905 MLX 8-bit

An MLX 8-bit quantized version of moonshotai/Kimi-K2-Instruct-0905, Moonshot AI's instruction-following Mixture-of-Experts language model built on the DeepSeek V3 architecture.

Model Details

Architecture: DeepSeek V3 (Kimi K2)

  • Parameters: ~1T total (Mixture of Experts), ~32B activated per token
    • 384 routed experts
    • 8 experts per token
    • 1 shared expert
  • Hidden Size: 7168
  • Layers: 61
  • Context Length: 262,144 tokens
  • Quantization: MLX 8-bit (8.501 bits per weight)
  • Size: 1.0 TB
  • Original Model: moonshotai/Kimi-K2-Instruct-0905

Features

  • Long context support (262K tokens)
  • Advanced Mixture of Experts (MoE) architecture with 384 experts
  • Optimized for Apple Silicon with MLX framework
  • High-quality 8-bit quantization maintains excellent performance
  • Instruction-following and multi-turn conversation capabilities
  • Native Metal acceleration on M1/M2/M3/M4 Macs

Installation

pip install mlx-lm
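
To confirm that MLX is installed and running on the Metal GPU rather than the CPU, a quick sanity check looks like this (output varies by machine):

import mlx.core as mx
from importlib.metadata import version

# Should report a gpu device on Apple Silicon; cpu means no Metal backend is active
print(mx.default_device())
print("mlx-lm version:", version("mlx-lm"))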

Usage

Python API

from mlx_lm import load, generate

# Load the model
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# Generate text
prompt = "Explain quantum computing in simple terms."
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
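
Recent mlx-lm releases configure decoding through a sampler object rather than raw temperature/top-p arguments on generate; the sketch below assumes that newer make_sampler API, so adjust for older versions:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# Temperature and nucleus sampling; argument names follow recent mlx-lm releases
sampler = make_sampler(temp=0.7, top_p=0.95)

response = generate(
    model,
    tokenizer,
    prompt="Summarize the benefits of on-device inference.",
    max_tokens=300,
    sampler=sampler,
)
print(response)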

Command Line

mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
  --prompt "Write a Python function to calculate Fibonacci numbers." \
  --max-tokens 500

Chat Format

The model uses the ChatML format:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{assistant response}<|im_end|>
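
The tokenizer bundled with this repo carries the exact chat template (special tokens included), so building prompts with apply_chat_template is more reliable than hand-writing the markup above; this uses the standard Hugging Face tokenizer API, which mlx-lm's tokenizer wrapper passes through:

from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

# Render the conversation with the model's own template and append the assistant header
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)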

Multi-turn Conversation Example

from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

conversation = """<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a Python function to reverse a string.<|im_end|>
<|im_start|>assistant
"""

response = generate(model, tokenizer, prompt=conversation, max_tokens=300)
print(response)
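
For longer exchanges, the usual pattern is to keep a running messages list, re-render it through the chat template each turn, and append the model's reply to the history; a minimal loop (with a hypothetical follow-up question) might look like:

from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

for user_turn in [
    "Write a Python function to reverse a string.",
    "Now make it handle None by returning an empty string.",
]:
    messages.append({"role": "user", "content": user_turn})
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    reply = generate(model, tokenizer, prompt=prompt, max_tokens=300)
    print(f"User: {user_turn}\nAssistant: {reply}\n")
    # Keep the assistant's answer in the history so the next turn has full context
    messages.append({"role": "assistant", "content": reply})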

System Requirements

Minimum:

  • 1.1 TB free disk space
  • 64 GB RAM (unified memory) at minimum; note that the ~1 TB of weights far exceeds this (see Performance Notes)
  • Apple Silicon Mac (M1 or later)
  • macOS 12.0 or later

Recommended:

  • 128 GB+ unified memory
  • M2 Ultra, M3 Max, or M4 Max/Ultra
  • Fast SSD storage

Performance Notes

  • Memory Usage: ~1 TB of weights plus ~20-40 GB of runtime overhead (see the snippet after this list)
  • Inference Speed: Depends on hardware (faster on M2 Ultra/M3 Max)
  • Quantization: 8-bit quantization maintains near-original model quality
  • MoE Efficiency: Only 8 experts activated per token (not all 384)
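
To see what a run actually consumes, MLX exposes a peak-memory counter; it lives under mx.metal in older releases and at the top level in newer ones, so this sketch tries both:

import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
generate(model, tokenizer, prompt="Hello", max_tokens=32)

# The peak-memory API moved from mx.metal to the top-level namespace in newer MLX releases
get_peak = getattr(mx, "get_peak_memory", None) or mx.metal.get_peak_memory
print(f"Peak MLX memory: {get_peak() / 1e9:.1f} GB")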

Model Variants

If you need different quantization levels or formats (or want to convert your own; see the sketch after this list):

  • MLX 6-bit (coming soon): richardyoung/Kimi-K2-Instruct-0905-MLX-6bit
  • MLX 4-bit (coming soon): richardyoung/Kimi-K2-Instruct-0905-MLX-4bit
  • Original Model: moonshotai/Kimi-K2-Instruct-0905
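
Until the lower-bit uploads land, you can produce an equivalent conversion locally with mlx-lm's Python convert API; parameter names below follow current mlx-lm releases, and the job needs disk space for both the original and the converted weights:

from mlx_lm import convert

# Convert and quantize the original release to 4-bit MLX (paths and bit-width are illustrative)
convert(
    hf_path="moonshotai/Kimi-K2-Instruct-0905",
    mlx_path="Kimi-K2-Instruct-0905-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)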

Limitations

  • Requires Apple Silicon (not compatible with x86/CUDA)
  • Very large model size (1 TB) requires significant storage
  • High memory requirements (64+ GB unified memory)
  • Inference speed depends heavily on available RAM and SSD speed
  • Optimized primarily for English and Chinese; output quality in other languages may be lower

Technical Details

Quantization Method

This model was quantized using MLX's built-in quantization:

mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
  -q --q-bits 8 --trust-remote-code

Result: 8.501 bits per weight (slightly above 8 bits because each group of weights also stores quantization scale and bias values; see the arithmetic sketch below)
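
Assuming MLX's default group size of 64 with a 16-bit scale and 16-bit bias per group, the effective figure works out to about 8.5 bits per weight, which matches the reported 8.501:

# Effective bits per weight for 8-bit affine quantization with group size 64,
# assuming one fp16 scale and one fp16 bias stored per group of weights
q_bits, group_size, scale_bits, bias_bits = 8, 64, 16, 16
effective_bpw = q_bits + (scale_bits + bias_bits) / group_size
print(effective_bpw)  # 8.5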

Architecture Highlights

  • Rope Scaling: YaRN with 64x factor for extended context
  • KV Compression: Multi-head Latent Attention (MLA) with low-rank key-value compression (kv_lora_rank 512)
  • Query Compression: low-rank query projection (q_lora_rank 1536)
  • MoE Routing: Top-8 expert selection with sigmoid scoring
  • Base Weights: the original Moonshot AI release ships in FP8 (e4m3) format, which this repo re-quantizes to MLX 8-bit
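
These values can be checked directly from the repo's config.json without downloading any weights; the field names below follow the DeepSeek-V3-style configuration that Kimi K2 uses, and missing keys simply print None:

import json
from huggingface_hub import hf_hub_download

# Fetch only the config file from the quantized repo
config_path = hf_hub_download("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit", "config.json")
with open(config_path) as f:
    cfg = json.load(f)

for key in ("kv_lora_rank", "q_lora_rank", "n_routed_experts",
            "num_experts_per_tok", "rope_scaling", "max_position_embeddings"):
    print(key, "=", cfg.get(key))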

Citation

If you use this model, please cite the original Kimi K2 work:

@misc{kimi-k2-2025,
  title={Kimi K2: Advancing Long-Context Language Models},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}

License

Same as the base model: Modified MIT License (see the moonshotai/Kimi-K2-Instruct-0905 model card for the full terms)

Links

  • Original model: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905
  • MLX framework: https://github.com/ml-explore/mlx

Quantized by: richardyoung
Format: MLX 8-bit
Created: 2025-10-25
