---
license: apache-2.0
base_model: moonshotai/Kimi-K2-Instruct-0905
tags:
- mlx
- quantized
- kimi
- deepseek-v3
- moe
- instruction-following
- 8-bit
model_type: kimi_k2
pipeline_tag: text-generation
---

# Kimi-K2-Instruct-0905 MLX 8-bit

MLX 8-bit quantized version of [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905), a state-of-the-art instruction-following language model based on the DeepSeek V3 architecture.

## Model Details

**Architecture:** DeepSeek V3 (Kimi K2)

- **Parameters:** ~1T total (Mixture of Experts)
  - 384 routed experts
  - 8 experts per token
  - 1 shared expert
- **Hidden Size:** 7168
- **Layers:** 61
- **Context Length:** 262,144 tokens
- **Quantization:** MLX 8-bit (8.501 bits per weight)
- **Size:** 1.0 TB
- **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)

## Features

- Long context support (262K tokens)
- Advanced Mixture of Experts (MoE) architecture with 384 experts
- Optimized for Apple Silicon with the MLX framework
- High-quality 8-bit quantization that maintains near-original model quality
- Instruction-following and multi-turn conversation capabilities
- Native Metal acceleration on M1/M2/M3/M4 Macs

## Installation

```bash
pip install mlx-lm
```

## Usage

### Python API

```python
from mlx_lm import load, generate

# Load the model
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# Generate text
prompt = "Explain quantum computing in simple terms."
response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
```

### Command Line

```bash
mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
  --prompt "Write a Python function to calculate Fibonacci numbers." \
  --max-tokens 500
```

### Chat Format

The model uses the ChatML format:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant
{assistant response}<|im_end|>
```

### Multi-turn Conversation Example

```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

conversation = """<|im_start|>system
You are a helpful coding assistant.<|im_end|>
<|im_start|>user
Write a Python function to reverse a string.<|im_end|>
<|im_start|>assistant
"""

response = generate(model, tokenizer, prompt=conversation, max_tokens=300)
print(response)
```

## System Requirements

**Minimum:**

- 1.1 TB free disk space
- 64 GB RAM (unified memory)
- Apple Silicon Mac (M1 or later)
- macOS 12.0 or later

**Recommended:**

- 128 GB+ unified memory
- M2 Ultra, M3 Max, or M4 Max/Ultra
- Fast SSD storage

## Performance Notes

- **Memory Usage:** ~1 TB model size plus ~20-40 GB runtime overhead
- **Inference Speed:** depends on hardware (faster on M2 Ultra/M3 Max)
- **Quantization:** 8-bit quantization maintains near-original model quality
- **MoE Efficiency:** only 8 of the 384 experts are activated per token

## Model Variants

If you need different quantization levels or formats:

- **MLX 6-bit** (coming soon): `richardyoung/Kimi-K2-Instruct-0905-MLX-6bit`
- **MLX 4-bit** (coming soon): `richardyoung/Kimi-K2-Instruct-0905-MLX-4bit`
- **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)

## Limitations

- Requires Apple Silicon (not compatible with x86/CUDA)
- Very large model size (1 TB) requires significant storage
- High memory requirements (64+ GB unified memory)
- Inference speed depends heavily on available RAM and SSD speed
- Chinese-English bilingual model; primarily optimized for those two languages
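The ChatML prompts hand-written in the **Usage** examples above can also be produced by the tokenizer's bundled chat template. A minimal sketch, assuming the converted tokenizer ships with a chat template (which `mlx_lm`'s tokenizer wrapper exposes through the standard `apply_chat_template` method):

```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a string."},
]

# Render the conversation with the tokenizer's own chat template and
# append the assistant header so generation continues from there.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=300)
print(response)
```

Relying on the bundled template avoids prompt drift if the exact chat markers ever differ from the sketch shown under **Chat Format**.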
## Technical Details

### Quantization Method

This model was quantized using MLX's built-in quantization:

```bash
mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
  -q --q-bits 8 --trust-remote-code
```

**Result:** 8.501 bits per weight. The figure is slightly above 8 because MLX stores a scale and bias for each group of quantized weights (roughly 0.5 extra bits per weight at the default group size of 64), plus a small amount of metadata.

### Architecture Highlights

- **RoPE Scaling:** YaRN with a 64x factor for extended context
- **KV Compression:** LoRA-based key-value compression (rank 512)
- **Query Compression:** Q-LoRA rank 1536
- **MoE Routing:** top-8 expert selection with sigmoid scoring
- **Base Precision:** the original checkpoint ships pre-quantized to FP8 (e4m3)

## Citation

If you use this model, please cite the original Kimi K2 work:

```bibtex
@misc{kimi-k2-2025,
  title={Kimi K2: Advancing Long-Context Language Models},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}
```

## License

Same as the base model: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Links

- **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
- **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx)
- **MLX LM:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms)

---

**Quantized by:** richardyoung

**Format:** MLX 8-bit

**Created:** 2025-10-25
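The architecture and quantization figures quoted in this card can be cross-checked against the repository's `config.json` without downloading the weights. A minimal sketch; the field names are assumed to follow the DeepSeek-V3 configuration schema and MLX's conversion layout, so verify them against the actual file:

```python
import json
from huggingface_hub import hf_hub_download  # typically installed as a dependency of mlx-lm

# Fetch only config.json (a few KB); the ~1 TB of weights stays remote.
path = hf_hub_download("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit", "config.json")
with open(path) as f:
    cfg = json.load(f)

# DeepSeek-V3-style field names (assumed): expert counts, width, depth, context.
for key in ("n_routed_experts", "num_experts_per_tok", "hidden_size",
            "num_hidden_layers", "max_position_embeddings"):
    print(f"{key}: {cfg.get(key)}")

# MLX-converted models record their quantization settings in the config.
print("quantization:", cfg.get("quantization"))
```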