Kimi-K2-Instruct-0905 MLX 8-bit

An MLX 8-bit quantized version of moonshotai/Kimi-K2-Instruct-0905, Moonshot AI's instruction-following Mixture-of-Experts language model built on the DeepSeek V3 architecture.

Model Details

Architecture: DeepSeek V3 (Kimi K2)

  • Parameters: ~1T total (Mixture of Experts), ~32B activated per token
    • 384 routed experts
    • 8 experts per token
    • 1 shared expert
  • Hidden Size: 7168
  • Layers: 61
  • Context Length: 262,144 tokens
  • Quantization: MLX 8-bit (8.501 bits per weight)
  • Size: 1.0 TB
  • Original Model: moonshotai/Kimi-K2-Instruct-0905

Features

  • Long context support (262K tokens)
  • Advanced Mixture of Experts (MoE) architecture with 384 experts
  • Optimized for Apple Silicon with MLX framework
  • High-quality 8-bit quantization maintains excellent performance
  • Instruction-following and multi-turn conversation capabilities
  • Native Metal acceleration on M1/M2/M3/M4 Macs

Installation

pip install mlx-lm
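
To confirm that MLX is installed and running on the Metal GPU rather than the CPU, a quick sanity check looks like this (output varies by machine):

import mlx.core as mx
from importlib.metadata import version

# Should report a gpu device on Apple Silicon; cpu means no Metal backend is active
print(mx.default_device())
print("mlx-lm version:", version("mlx-lm"))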

Usage

Python API

from mlx_lm import load, generate

# Load the model
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# Generate text
prompt = "Explain quantum computing in simple terms."
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
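
Recent mlx-lm releases configure decoding through a sampler object rather than raw temperature/top-p arguments on generate; the sketch below assumes that newer make_sampler API, so adjust for older versions:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# Temperature and nucleus sampling; argument names follow recent mlx-lm releases
sampler = make_sampler(temp=0.7, top_p=0.95)

response = generate(
    model,
    tokenizer,
    prompt="Summarize the benefits of on-device inference.",
    max_tokens=300,
    sampler=sampler,
)
print(response)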

Command Line

mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
  --prompt "Write a Python function to calculate Fibonacci numbers." \
  --max-tokens 500

Chat Format

The model uses the ChatML format:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{assistant response}<|im_end|>
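
The tokenizer bundled with this repo carries the exact chat template (special tokens included), so building prompts with apply_chat_template is more reliable than hand-writing the markup above; this uses the standard Hugging Face tokenizer API, which mlx-lm's tokenizer wrapper passes through:

from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

# Render the conversation with the model's own template and append the assistant header
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)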

Multi-turn Conversation Example

from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

conversation = """<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a Python function to reverse a string.<|im_end|>
<|im_start|>assistant
"""

response = generate(model, tokenizer, prompt=conversation, max_tokens=300)
print(response)
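
For longer exchanges, the usual pattern is to keep a running messages list, re-render it through the chat template each turn, and append the model's reply to the history; a minimal loop (with a hypothetical follow-up question) might look like:

from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

messages = [{"role": "system", "content": "You are a helpful coding assistant."}]

for user_turn in [
    "Write a Python function to reverse a string.",
    "Now make it handle None by returning an empty string.",
]:
    messages.append({"role": "user", "content": user_turn})
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    reply = generate(model, tokenizer, prompt=prompt, max_tokens=300)
    print(f"User: {user_turn}\nAssistant: {reply}\n")
    # Keep the assistant's answer in the history so the next turn has full context
    messages.append({"role": "assistant", "content": reply})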

System Requirements

Minimum:

  • 1.1 TB free disk space
  • 64 GB RAM (unified memory) at minimum; note that the ~1 TB of weights far exceeds this (see Performance Notes)
  • Apple Silicon Mac (M1 or later)
  • macOS 12.0 or later

Recommended:

  • 128 GB+ unified memory
  • M2 Ultra, M3 Max, or M4 Max/Ultra
  • Fast SSD storage

Performance Notes

  • Memory Usage: ~1 TB of weights plus ~20-40 GB of runtime overhead (see the snippet after this list)
  • Inference Speed: Depends on hardware (faster on M2 Ultra/M3 Max)
  • Quantization: 8-bit quantization maintains near-original model quality
  • MoE Efficiency: Only 8 experts activated per token (not all 384)
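
To see what a run actually consumes, MLX exposes a peak-memory counter; it lives under mx.metal in older releases and at the top level in newer ones, so this sketch tries both:

import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
generate(model, tokenizer, prompt="Hello", max_tokens=32)

# The peak-memory API moved from mx.metal to the top-level namespace in newer MLX releases
get_peak = getattr(mx, "get_peak_memory", None) or mx.metal.get_peak_memory
print(f"Peak MLX memory: {get_peak() / 1e9:.1f} GB")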

Model Variants

If you need different quantization levels or formats (or want to convert your own; see the sketch after this list):

  • MLX 6-bit (coming soon): richardyoung/Kimi-K2-Instruct-0905-MLX-6bit
  • MLX 4-bit (coming soon): richardyoung/Kimi-K2-Instruct-0905-MLX-4bit
  • Original Model: moonshotai/Kimi-K2-Instruct-0905
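
Until the lower-bit uploads land, you can produce an equivalent conversion locally with mlx-lm's Python convert API; parameter names below follow current mlx-lm releases, and the job needs disk space for both the original and the converted weights:

from mlx_lm import convert

# Convert and quantize the original release to 4-bit MLX (paths and bit-width are illustrative)
convert(
    hf_path="moonshotai/Kimi-K2-Instruct-0905",
    mlx_path="Kimi-K2-Instruct-0905-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)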

Limitations

  • Requires Apple Silicon (not compatible with x86/CUDA)
  • Very large model size (1 TB) requires significant storage
  • High memory requirements (64+ GB unified memory)
  • Inference speed depends heavily on available RAM and SSD speed
  • Optimized primarily for English and Chinese; output quality in other languages may be lower

Technical Details

Quantization Method

This model was quantized using MLX's built-in quantization:

mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
  -q --q-bits 8 --trust-remote-code

Result: 8.501 bits per weight (slightly above 8 bits because each group of weights also stores quantization scale and bias values; see the arithmetic sketch below)
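
Assuming MLX's default group size of 64 with a 16-bit scale and 16-bit bias per group, the effective figure works out to about 8.5 bits per weight, which matches the reported 8.501:

# Effective bits per weight for 8-bit affine quantization with group size 64,
# assuming one fp16 scale and one fp16 bias stored per group of weights
q_bits, group_size, scale_bits, bias_bits = 8, 64, 16, 16
effective_bpw = q_bits + (scale_bits + bias_bits) / group_size
print(effective_bpw)  # 8.5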

Architecture Highlights

  • Rope Scaling: YaRN with 64x factor for extended context
  • KV Compression: Multi-head Latent Attention (MLA) with low-rank key-value compression (kv_lora_rank 512)
  • Query Compression: low-rank query projection (q_lora_rank 1536)
  • MoE Routing: Top-8 expert selection with sigmoid scoring
  • Base Weights: the original Moonshot AI release ships in FP8 (e4m3) format, which this repo re-quantizes to MLX 8-bit
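
These values can be checked directly from the repo's config.json without downloading any weights; the field names below follow the DeepSeek-V3-style configuration that Kimi K2 uses, and missing keys simply print None:

import json
from huggingface_hub import hf_hub_download

# Fetch only the config file from the quantized repo
config_path = hf_hub_download("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit", "config.json")
with open(config_path) as f:
    cfg = json.load(f)

for key in ("kv_lora_rank", "q_lora_rank", "n_routed_experts",
            "num_experts_per_tok", "rope_scaling", "max_position_embeddings"):
    print(key, "=", cfg.get(key))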

Citation

If you use this model, please cite the original Kimi K2 work:

@misc{kimi-k2-2025,
  title={Kimi K2: Advancing Long-Context Language Models},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}

License

Same as the base model: Modified MIT License (see the moonshotai/Kimi-K2-Instruct-0905 model card for the full terms)

Links

  • Original model: https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905
  • MLX framework: https://github.com/ml-explore/mlx

Quantized by: richardyoung
Format: MLX 8-bit
Created: 2025-10-25
