---
license: apache-2.0
base_model: moonshotai/Kimi-K2-Instruct-0905
tags:
- mlx
- quantized
- kimi
- deepseek-v3
- moe
- instruction-following
- 8-bit
model_type: kimi_k2
pipeline_tag: text-generation
---

# Kimi-K2-Instruct-0905 MLX 8-bit

MLX 8-bit quantized version of [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905), a state-of-the-art instruction-following language model based on the DeepSeek V3 architecture.

## Model Details

**Architecture:** DeepSeek V3 (Kimi K2)

- **Parameters:** ~1T total (Mixture of Experts)
  - 384 routed experts
  - 8 experts per token
  - 1 shared expert
- **Hidden Size:** 7168
- **Layers:** 61
- **Context Length:** 262,144 tokens
- **Quantization:** MLX 8-bit (8.501 bits per weight)
- **Size:** 1.0 TB
- **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)

## Features

- Long context support (262K tokens)
- Advanced Mixture of Experts (MoE) architecture with 384 experts
- Optimized for Apple Silicon with the MLX framework
- High-quality 8-bit quantization that maintains near-original model quality
- Instruction-following and multi-turn conversation capabilities
- Native Metal acceleration on M1/M2/M3/M4 Macs

## Installation

```bash
pip install mlx-lm
```

## Usage

### Python API

```python
from mlx_lm import load, generate

# Load the model
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# Generate text
prompt = "Explain quantum computing in simple terms."
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
```

### Command Line

```bash
mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
  --prompt "Write a Python function to calculate Fibonacci numbers." \
  --max-tokens 500
```

### Chat Format

The model uses the ChatML format:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{assistant response}<|im_end|>
```

### Multi-turn Conversation Example

```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

conversation = """<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a Python function to reverse a string.<|im_end|>
<|im_start|>assistant
"""

response = generate(model, tokenizer, prompt=conversation, max_tokens=300)
print(response)
```

## System Requirements

**Minimum:**

- 1.1 TB free disk space
- 64 GB RAM (unified memory)
- Apple Silicon Mac (M1 or later)
- macOS 12.0 or later

**Recommended:**

- 128 GB+ unified memory
- M2 Ultra, M3 Max, or M4 Max/Ultra
- Fast SSD storage

## Performance Notes

- **Memory Usage:** ~1 TB model size plus ~20-40 GB runtime overhead
- **Inference Speed:** depends on hardware (faster on M2 Ultra/M3 Max)
- **Quantization:** 8-bit quantization maintains near-original model quality
- **MoE Efficiency:** only 8 of the 384 experts are activated per token

## Model Variants

If you need different quantization levels or formats:

- **MLX 6-bit** (coming soon): `richardyoung/Kimi-K2-Instruct-0905-MLX-6bit`
- **MLX 4-bit** (coming soon): `richardyoung/Kimi-K2-Instruct-0905-MLX-4bit`
- **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)

## Limitations

- Requires Apple Silicon (not compatible with x86/CUDA)
- Very large model size (1 TB) requires significant storage
- High memory requirements (64+ GB unified memory)
- Inference speed depends heavily on available RAM and SSD speed
- Chinese-English bilingual model; primarily optimized for those two languages
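The ChatML prompts hand-written in the **Usage** examples above can also be produced by the tokenizer's bundled chat template. A minimal sketch, assuming the converted tokenizer ships with a chat template (which `mlx_lm`'s tokenizer wrapper exposes through the standard `apply_chat_template` method):

```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a string."},
]

# Render the conversation with the tokenizer's own chat template and
# append the assistant header so generation continues from there.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=300)
print(response)
```

Relying on the bundled template avoids prompt drift if the exact chat markers ever differ from the sketch shown under **Chat Format**.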
## Technical Details

### Quantization Method

This model was quantized using MLX's built-in quantization:

```bash
mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
  -q --q-bits 8 --trust-remote-code
```

**Result:** 8.501 bits per weight. The figure is slightly above 8 because MLX stores a scale and bias for each group of quantized weights (roughly 0.5 extra bits per weight at the default group size of 64), plus a small amount of metadata.

### Architecture Highlights

- **RoPE Scaling:** YaRN with a 64x factor for extended context
- **KV Compression:** LoRA-based key-value compression (rank 512)
- **Query Compression:** Q-LoRA rank 1536
- **MoE Routing:** top-8 expert selection with sigmoid scoring
- **Base Precision:** the original checkpoint ships pre-quantized to FP8 (e4m3)

## Citation

If you use this model, please cite the original Kimi K2 work:

```bibtex
@misc{kimi-k2-2025,
  title={Kimi K2: Advancing Long-Context Language Models},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}
```

## License

Same as the base model: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Links

- **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
- **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx)
- **MLX LM:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms)

---

**Quantized by:** richardyoung

**Format:** MLX 8-bit

**Created:** 2025-10-25
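The architecture and quantization figures quoted in this card can be cross-checked against the repository's `config.json` without downloading the weights. A minimal sketch; the field names are assumed to follow the DeepSeek-V3 configuration schema and MLX's conversion layout, so verify them against the actual file:

```python
import json
from huggingface_hub import hf_hub_download  # typically installed as a dependency of mlx-lm

# Fetch only config.json (a few KB); the ~1 TB of weights stays remote.
path = hf_hub_download("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit", "config.json")
with open(path) as f:
    cfg = json.load(f)

# DeepSeek-V3-style field names (assumed): expert counts, width, depth, context.
for key in ("n_routed_experts", "num_experts_per_tok", "hidden_size",
            "num_hidden_layers", "max_position_embeddings"):
    print(f"{key}: {cfg.get(key)}")

# MLX-converted models record their quantization settings in the config.
print("quantization:", cfg.get("quantization"))
```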