---
license: apache-2.0
datasets:
- open-r1/Mixture-of-Thoughts
language:
- en
- ar
- fr
- es
base_model:
- mistralai/Mixtral-8x7B-Instruct-v0.1
pipeline_tag: text2text-generation
library_name: transformers
tags:
- reasoning
- r1
- deepseek
- mixtral
- MoE
- thinking
- code
- science
- math
metrics:
- accuracy
new_version: ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit
---

# Mixtral-8x7B-DeepSeek-R1-Distill

A reasoning-enhanced version of Mixtral-8x7B-Instruct-v0.1, fine-tuned on reasoning responses generated by DeepSeek's reasoning model.

## Model Details

### Model Description

This model is a fine-tuned version of Mixtral-8x7B-Instruct-v0.1 that has been trained on reasoning-rich datasets to improve its step-by-step thinking and problem-solving capabilities. The model learns to generate explicit reasoning traces similar to those produced by advanced reasoning models like DeepSeek-R1.

- **Developed by:** ykarout
- **Model type:** Mixture of Experts (MoE) Language Model
- **Language(s) (NLP):** English, Arabic, French, Spanish (inherited from base model)
- **License:** Apache 2.0
- **Finetuned from model:** mistralai/Mixtral-8x7B-Instruct-v0.1

### Model Sources

- **Base Repository:** https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
- **Training Dataset:** open-r1/Mixture-of-Thoughts

## Uses

### Direct Use

This model is designed for tasks requiring explicit reasoning and step-by-step problem solving, including:

- Mathematical problem solving with detailed explanations
- Logical reasoning tasks
- Code generation with explanatory comments
- Scientific analysis and hypothesis formation
- Complex question answering with reasoning traces

### Downstream Use

The model can be further fine-tuned for domain-specific reasoning tasks or integrated into applications requiring transparent AI reasoning processes.

### Out-of-Scope Use

- Real-time applications requiring sub-second responses (due to reasoning overhead)
- Tasks where reasoning explanations are not desired
- Applications requiring factual accuracy without verification (model may hallucinate during reasoning)

## Bias, Risks, and Limitations

- **Reasoning Overhead:** Generates longer responses due to explicit thinking processes
- **Inherited Biases:** Retains biases from the base Mixtral model and training data
- **Hallucination Risk:** May generate plausible but incorrect reasoning steps
- **Language Bias:** Reasoning capabilities may be stronger in English than other supported languages

### Recommendations

Users should validate reasoning outputs, especially for critical applications. The model works best when prompted to "think step by step" or "show your reasoning."

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
model = AutoModelForCausalLM.from_pretrained(
    "ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example reasoning prompt
prompt = """<s>[INST] Solve this step by step: If a train travels 120 km in 2 hours, and then 180 km in 3 hours, what is its average speed for the entire journey? [/INST]"""

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Training Details

### Training Data

The model was fine-tuned on the open-r1/Mixture-of-Thoughts dataset, which contains reasoning responses generated by DeepSeek's reasoning model across various domains including mathematics, science, coding, and logical reasoning.

### Training Procedure

#### Training Hyperparameters

- **Training regime:** bf16 mixed precision
- **Optimizer:** AdamW with fused implementation
- **Learning rate:** 5e-6 (reduced from initial 1e-5 for stability)
- **Batch size:** 8 per device
- **Gradient accumulation steps:** 1
- **Max sequence length:** 8192 tokens
- **Epochs:** 1
- **Gradient clipping:** 0.1 (tightened for stability)
- **Learning rate scheduler:** Cosine with 10% warmup
- **Weight decay:** 0.01

#### Training Infrastructure

- **Hardware:** Single NVIDIA H200 GPU
- **Framework:** Transformers + TRL SFTTrainer
- **Gradient checkpointing:** Enabled
- **Memory optimizations:** Remove unused columns, persistent data loaders

#### Speeds, Sizes, Times

- **Training time:** Approximately 15 hours for full epoch
- **Peak memory usage:** ~140GB on H200
- **Tokens processed:** ~15M tokens
- **Final model size:** ~90GB (bf16 precision)

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Evaluation pending on standard reasoning benchmarks including:
- GSM8K (mathematical reasoning)
- MATH dataset
- LogiQA (logical reasoning)
- Code reasoning tasks

#### Metrics

- **Primary:** Token-level accuracy during training
- **Secondary:** Loss convergence and gradient stability
- **Planned:** Human evaluation of reasoning quality

### Results

**Training Metrics:**
- **Final training loss:** ~0.6 (converged from ~0.85)
- **Token accuracy:** Stabilized around 78-84%
- **Training stability:** Achieved after hyperparameter tuning

Comprehensive evaluation results on reasoning benchmarks will be updated post-training completion.

## Model Examination

The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning trace quality is ongoing.

## Environmental Impact

**Estimated Training Impact:**
- **Hardware Type:** NVIDIA H200 (140GB HBM3)
- **Hours used:** ~15 hours
- **Cloud Provider:** Academic cluster
- **Compute Region:** [Location specific]
- **Estimated Carbon Emitted:** ~2-3 kg CO2eq (approximate)

## Technical Specifications

### Model Architecture and Objective

- **Base Architecture:** Mixtral-8x7B-Instruct-v0.1 (Mixture of Experts)
- **Active Parameters:** ~13B (2 experts activated per token)
- **Total Parameters:** ~47B
- **Training Objective:** Causal language modeling with reasoning supervision
- **Attention:** Sliding window attention with 32k context capability

### Compute Infrastructure

#### Hardware
- **Training:** NVIDIA H200 (132GB HBM3)
- **Memory:** 139GB peak utilization
- **Precision:** bfloat16

#### Software
- **Framework:** PyTorch + Transformers + TRL
- **CUDA:** Compatible with latest versions
- **Optimization:** Flash Attention, gradient checkpointing

## Citation

**BibTeX:**
```bibtex
@model{mixtral-deepseek-r1-distill,
  title={Mixtral-8x7B-DeepSeek-R1-Distill: Reasoning-Enhanced Mixture of Experts},
  author={ykarout},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit}
}
```


## Model Card Contact

For questions or issues, please contact through Hugging Face