---
license: apache-2.0
base_model: moonshotai/Kimi-K2-Instruct-0905
tags:
- mlx
- quantized
- kimi
- deepseek-v3
- moe
- instruction-following
- 8-bit
- apple-silicon
model_type: kimi_k2
pipeline_tag: text-generation
language:
- en
- zh
library_name: mlx
---
# 🌙 Kimi K2 Instruct - MLX 8-bit

### State-of-the-Art 1-Trillion-Parameter MoE Model, Optimized for Apple Silicon

[![MLX](https://img.shields.io/badge/MLX-Optimized-blue?logo=apple)](https://github.com/ml-explore/mlx)
[![Model Size](https://img.shields.io/badge/Size-1.0_TB-green)](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit)
[![Quantization](https://img.shields.io/badge/Quantization-8--bit-orange)](https://github.com/ml-explore/mlx)
[![Context](https://img.shields.io/badge/Context-262K_tokens-purple)](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

**[Original Model](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)** | **[MLX Framework](https://github.com/ml-explore/mlx)** | **[More Quantizations](#-other-quantization-options)**

---
## 📖 What is This?

This is a **high-quality 8-bit quantized version** of Kimi K2 Instruct, optimized to run on **Apple Silicon** (M1/M2/M3/M4) Macs using the MLX framework. Think of it as taking a massive 1-trillion-parameter AI model and packing it into ~1 TB of weights while keeping almost all of its intelligence intact!

### ✨ Why You'll Love It

- 🚀 **Massive Context Window** - Handle up to 262,144 tokens (~200,000 words!)
- 🧠 **1T Parameters (32B active)** - One of the most capable open-weights models available
- ⚡ **Apple Silicon Native** - Fully optimized for M-series chips with Metal acceleration
- 🎯 **8-bit Precision** - Best quality-to-size ratio for serious work
- 🌏 **Bilingual** - Fluent in both English and Chinese
- 💬 **Instruction-Tuned** - Ready for conversations, coding, analysis, and more

## 🎯 Quick Start

### Installation

```bash
pip install mlx-lm
```

### Your First Generation (3 lines of code!)

```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200))
```

That's it! 🎉

## 💻 System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| **Mac** | M1 or newer | M2 Ultra / M3 Max / M4 Max+ |
| **Memory** | 64 GB unified | 128 GB+ unified |
| **Storage** | 1.1 TB free | Fast SSD (2+ TB) |
| **macOS** | 12.0+ | Latest version |

> ⚠️ **Note:** This is a HUGE model! The weights alone are ~1 TB, so make sure you have enough RAM and storage; if the weights don't fit in unified memory, expect heavy swapping and much slower generation.

## 📚 Usage Examples

### Command Line Interface

```bash
mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
  --prompt "Write a Python script to analyze CSV files." \
  --max-tokens 500
```

### Chat Conversation

```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in coding and problem-solving."},
    {"role": "user", "content": "Can you help me optimize this Python code?"},
]

# Render the conversation with the model's own chat template
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
```

### Advanced: Streaming Output

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# stream_generate yields output incrementally; in recent mlx-lm versions
# each item is a GenerationResponse whose .text holds the new text segment
for response in stream_generate(
    model,
    tokenizer,
    prompt="Tell me about the future of AI:",
    max_tokens=500,
):
    print(response.text, end="", flush=True)
```
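### Optional: Pre-download the Weights

Because the quantized weights are roughly 1 TB, it can be convenient to fetch them before the first `load(...)` call. Below is a minimal sketch using the Hugging Face CLI; it assumes `huggingface_hub` (with its CLI extra) is installed and simply fills the local cache that `mlx_lm` reads from.

```bash
pip install -U "huggingface_hub[cli]"

# Pull the full repository into the local Hugging Face cache;
# mlx_lm's load() will then find the weights without re-downloading
huggingface-cli download richardyoung/Kimi-K2-Instruct-0905-MLX-8bit
```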
## 🏗️ Architecture Highlights

<details>
<summary>Click to expand technical details</summary>

### Model Specifications

| Feature | Value |
|---------|-------|
| **Total Parameters** | ~1 Trillion (32B activated per token) |
| **Architecture** | DeepSeek V3 (MoE) |
| **Experts** | 384 routed + 1 shared |
| **Active Experts** | 8 per token |
| **Hidden Size** | 7168 |
| **Layers** | 61 |
| **Attention Heads** | 64 |
| **Context Length** | 262,144 tokens |
| **Quantization** | 8.501 bits per weight |

### Advanced Features

- **🎯 YaRN RoPE Scaling** - 64x factor for extended context
- **🗜️ KV Compression** - LoRA-based (rank 512)
- **⚡ Query Compression** - Q-LoRA (rank 1536)
- **🧮 MoE Routing** - Top-8 expert selection with sigmoid scoring (see the sketch below)
- **🔧 FP8 Training** - Pre-quantized with e4m3 precision

</details>
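The routing bullet above deserves a concrete picture. The snippet below is a minimal, illustrative sketch of top-k sigmoid routing in MLX; the function name, shapes, and renormalization step are assumptions for illustration, not the model's actual routing code.

```python
import mlx.core as mx

def route_tokens(router_logits: mx.array, top_k: int = 8):
    """Illustrative top-k sigmoid routing; router_logits has shape [tokens, num_experts]."""
    scores = mx.sigmoid(router_logits)                   # sigmoid scoring instead of softmax
    order = mx.argsort(-scores, axis=-1)                 # experts sorted by descending score
    top_idx = order[..., :top_k]                         # indices of the 8 selected experts
    top_scores = mx.take_along_axis(scores, top_idx, axis=-1)
    gates = top_scores / mx.sum(top_scores, axis=-1, keepdims=True)  # renormalized gate weights
    return top_idx, gates

# Example: route 4 tokens across 384 experts
experts, gates = route_tokens(mx.random.normal((4, 384)))
print(experts.shape, gates.shape)  # (4, 8) (4, 8)
```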
## 🎨 Other Quantization Options

Choose the right balance for your needs:

| Quantization | Size | Quality | Speed | Best For |
|--------------|------|---------|-------|----------|
| **8-bit** (you are here) | ~1 TB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Production, best quality |
| [6-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit) | ~800 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Sweet spot for most users |
| [4-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | ~570 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Faster inference |
| [2-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-2bit) | ~320 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Experimental |
| [Original](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) | ~1 TB (FP8) | ⭐⭐⭐⭐⭐ | ⭐⭐ | Research only |

## 🔧 How It Was Made

This model was quantized using MLX's built-in quantization:

```bash
mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
  -q --q-bits 8 \
  --trust-remote-code
```

**Result:** 8.501 bits per weight (includes metadata overhead)
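As a sanity check, you can back the ~1 TB figure out of the reported bits-per-weight. The snippet below is a rough estimate using an assumed parameter count, not a measurement of the actual files.

```python
# Rough on-disk size estimate from the reported bits-per-weight
total_params = 1.0e12        # assumed ~1 trillion total parameters
bits_per_weight = 8.501      # reported effective rate, including quantization metadata

size_tb = total_params * bits_per_weight / 8 / 1e12
print(f"~{size_tb:.2f} TB of weights")   # ≈ 1.06 TB, consistent with the ~1 TB figure above
```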
## ⚡ Performance Tips

<details>
<summary>Getting the best performance</summary>

1. **Close other applications** - Free up as much RAM as possible
2. **Use an external SSD** - If your internal drive is full (see the sketch after this list)
3. **Monitor memory** - Watch Activity Monitor during inference
4. **Limit generation length** - If you hit out-of-memory errors, reduce `max_tokens`
5. **Keep your Mac cool** - Good airflow helps maintain peak performance

</details>
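For tip 2, one way to keep the ~1 TB of weights off your internal drive is to point the Hugging Face cache at an external SSD before downloading or loading the model. The volume path below is only an example; adjust it to your own drive.

```bash
# Store the Hugging Face cache (and therefore these weights) on an external SSD
export HF_HOME=/Volumes/FastSSD/huggingface   # example path, not a required location

# Subsequent downloads and loads use that cache
mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
  --prompt "Hello!" \
  --max-tokens 50
```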
## ⚠️ Known Limitations

- 🍎 **Apple Silicon Only** - Won't work on Intel Macs or NVIDIA GPUs
- 💾 **Huge Storage Needs** - Make sure you have 1.1 TB+ free
- 🐏 **RAM Intensive** - 64 GB unified memory is a bare minimum; if the ~1 TB of weights don't fit in RAM, expect heavy swapping and dramatically slower generation
- 🐌 **Slower on M1** - Best performance on M2 Ultra or newer
- 🌐 **Bilingual Focus** - Optimized for English and Chinese

## 📄 License

Apache 2.0 - Same as the original model. Free for commercial use!

## 🙏 Acknowledgments

- **Original Model:** [Moonshot AI](https://www.moonshot.cn/) for creating Kimi K2
- **Framework:** Apple's [MLX team](https://github.com/ml-explore/mlx) for the amazing framework
- **Inspiration:** DeepSeek V3 architecture

## 📚 Citation

If you use this model in your research or product, please cite:

```bibtex
@misc{kimi-k2-2025,
  title={Kimi K2: Advancing Long-Context Language Models},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}
```

## 🔗 Useful Links

- 📦 **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
- 🛠️ **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx)
- 📖 **MLX LM Docs:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms)
- 💬 **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit/discussions)

---
**Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)**

*If you find this useful, please ⭐ star the repo and share with others!*

**Created:** October 2025 | **Format:** MLX 8-bit