---
license: apache-2.0
datasets:
- open-r1/Mixture-of-Thoughts
language:
- en
- ar
- fr
- es
base_model:
- mistralai/Mixtral-8x7B-Instruct-v0.1
pipeline_tag: text-generation
library_name: transformers
tags:
- reasoning
- r1
- deepseek
- mixtral
- MoE
- thinking
- code
- science
- math
metrics:
- accuracy
new_version: ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit
---

# Mixtral-8x7B-DeepSeek-R1-Distill

A reasoning-enhanced version of Mixtral-8x7B-Instruct-v0.1, fine-tuned on reasoning responses generated by DeepSeek's reasoning model.

## Model Details

### Model Description

This model is a fine-tuned version of Mixtral-8x7B-Instruct-v0.1, trained on reasoning-rich datasets to improve its step-by-step thinking and problem-solving capabilities. The model learns to generate explicit reasoning traces similar to those produced by advanced reasoning models such as DeepSeek-R1.

- **Developed by:** ykarout
- **Model type:** Mixture of Experts (MoE) Language Model
- **Language(s) (NLP):** English, Arabic, French, Spanish (inherited from the base model)
- **License:** Apache 2.0
- **Finetuned from model:** mistralai/Mixtral-8x7B-Instruct-v0.1

### Model Sources

- **Base Repository:** https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
- **Training Dataset:** open-r1/Mixture-of-Thoughts

## Uses

### Direct Use

This model is designed for tasks requiring explicit reasoning and step-by-step problem solving, including:

- Mathematical problem solving with detailed explanations
- Logical reasoning tasks
- Code generation with explanatory comments
- Scientific analysis and hypothesis formation
- Complex question answering with reasoning traces

### Downstream Use

The model can be further fine-tuned for domain-specific reasoning tasks or integrated into applications requiring transparent AI reasoning processes.

### Out-of-Scope Use

- Real-time applications requiring sub-second responses (due to reasoning overhead)
- Tasks where reasoning explanations are not desired
- Applications requiring factual accuracy without verification (the model may hallucinate during reasoning)

## Bias, Risks, and Limitations

- **Reasoning Overhead:** Generates longer responses due to explicit thinking processes
- **Inherited Biases:** Retains biases from the base Mixtral model and the training data
- **Hallucination Risk:** May generate plausible but incorrect reasoning steps
- **Language Bias:** Reasoning capabilities may be stronger in English than in the other supported languages

### Recommendations

Users should validate reasoning outputs, especially for critical applications. The model works best when prompted to "think step by step" or "show your reasoning."

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
model = AutoModelForCausalLM.from_pretrained(
    "ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example reasoning prompt
prompt = """[INST] Solve this step by step: If a train travels 120 km in 2 hours, and then 180 km in 3 hours, what is its average speed for the entire journey? [/INST]"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
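
Alternatively, the tokenizer's chat template can build the `[INST]` formatting automatically. A minimal sketch, continuing from the snippet above and assuming this repository keeps the base Mixtral chat template (the prompt text is illustrative):

```python
# Let the chat template handle the instruction formatting (assumes the base
# Mixtral chat template is preserved in this repository's tokenizer).
messages = [{"role": "user", "content": "Solve this step by step: what is 37 * 43?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```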
[/INST]""" inputs = tokenizer(prompt, return_tensors="pt") outputs = model.generate( **inputs, max_new_tokens=512, temperature=0.7, do_sample=True ) response = tokenizer.decode(outputs[0], skip_special_tokens=True) print(response) ``` ## Training Details ### Training Data The model was fine-tuned on the open-r1/Mixture-of-Thoughts dataset, which contains reasoning responses generated by DeepSeek's reasoning model across various domains including mathematics, science, coding, and logical reasoning. ### Training Procedure #### Training Hyperparameters - **Training regime:** bf16 mixed precision - **Optimizer:** AdamW with fused implementation - **Learning rate:** 5e-6 (reduced from initial 1e-5 for stability) - **Batch size:** 8 per device - **Gradient accumulation steps:** 1 - **Max sequence length:** 8192 tokens - **Epochs:** 1 - **Gradient clipping:** 0.1 (tightened for stability) - **Learning rate scheduler:** Cosine with 10% warmup - **Weight decay:** 0.01 #### Training Infrastructure - **Hardware:** Single NVIDIA H200 GPU - **Framework:** Transformers + TRL SFTTrainer - **Gradient checkpointing:** Enabled - **Memory optimizations:** Remove unused columns, persistent data loaders #### Speeds, Sizes, Times - **Training time:** Approximately 15 hours for full epoch - **Peak memory usage:** ~140GB on H200 - **Tokens processed:** ~15M tokens - **Final model size:** ~90GB (bf16 precision) ## Evaluation ### Testing Data, Factors & Metrics #### Testing Data Evaluation pending on standard reasoning benchmarks including: - GSM8K (mathematical reasoning) - MATH dataset - LogiQA (logical reasoning) - Code reasoning tasks #### Metrics - **Primary:** Token-level accuracy during training - **Secondary:** Loss convergence and gradient stability - **Planned:** Human evaluation of reasoning quality ### Results **Training Metrics:** - **Final training loss:** ~0.6 (converged from ~0.85) - **Token accuracy:** Stabilized around 78-84% - **Training stability:** Achieved after hyperparameter tuning Comprehensive evaluation results on reasoning benchmarks will be updated post-training completion. ## Model Examination The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning trace quality is ongoing. 
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Evaluation is pending on standard reasoning benchmarks, including:

- GSM8K (mathematical reasoning)
- MATH dataset
- LogiQA (logical reasoning)
- Code reasoning tasks

#### Metrics

- **Primary:** Token-level accuracy during training
- **Secondary:** Loss convergence and gradient stability
- **Planned:** Human evaluation of reasoning quality

### Results

**Training Metrics:**

- **Final training loss:** ~0.6 (converged from ~0.85)
- **Token accuracy:** Stabilized around 78-84%
- **Training stability:** Achieved after hyperparameter tuning

Comprehensive results on reasoning benchmarks will be added once evaluation is complete.

## Model Examination

The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning-trace quality is ongoing.

## Environmental Impact

**Estimated Training Impact:**

- **Hardware Type:** NVIDIA H200 (141 GB HBM3e)
- **Hours used:** ~15 hours
- **Cloud Provider:** Academic cluster
- **Compute Region:** [Location specific]
- **Estimated Carbon Emitted:** ~2-3 kg CO2eq (approximate)

## Technical Specifications

### Model Architecture and Objective

- **Base Architecture:** Mixtral-8x7B-Instruct-v0.1 (Mixture of Experts)
- **Active Parameters:** ~13B (2 of 8 experts activated per token)
- **Total Parameters:** ~47B
- **Training Objective:** Causal language modeling with reasoning supervision
- **Attention:** Grouped-query attention with 32k context capability

### Compute Infrastructure

#### Hardware

- **Training:** NVIDIA H200 (141 GB HBM3e)
- **Memory:** 139 GB peak utilization
- **Precision:** bfloat16

#### Software

- **Framework:** PyTorch + Transformers + TRL
- **CUDA:** Compatible with recent versions
- **Optimization:** Flash Attention, gradient checkpointing

## Citation

**BibTeX:**

```bibtex
@misc{mixtral-deepseek-r1-distill,
  title={Mixtral-8x7B-DeepSeek-R1-Distill: Reasoning-Enhanced Mixture of Experts},
  author={ykarout},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit}
}
```

## Model Card Contact

For questions or issues, please contact the author through the Hugging Face model repository.
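
For reference, the architecture figures listed under Technical Specifications (experts per layer, experts routed per token, context length) can be checked directly from the model configuration. A minimal sketch using the standard `transformers` `AutoConfig` API; exact field availability depends on the exported config:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")

# Mixtral-style MoE fields: total experts per layer and experts routed per token.
print("experts per layer:", config.num_local_experts)
print("experts per token:", config.num_experts_per_tok)
# Context length as declared by the config.
print("max positions:", config.max_position_embeddings)
```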