# Model Card for MLX GPT-OSS-120B GSM8K Evaluation

## Model Description
This model card documents the evaluation of the MLX GPT-OSS-120B model on the GSM8K mathematical reasoning benchmark using a few-shot testing methodology. The evaluation was conducted with a custom testing harness that leverages Apple's MLX framework for efficient inference on Apple Silicon.
- Model Type: Transformer-based language model
- Model Size: 120 billion parameters
- Framework: MLX (Apple Silicon optimized)
- Evaluation Method: Few-shot testing with 2 demonstration examples (see the prompt sketch after this list)
- Dataset: GSM8K main test set (1,319 samples)
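For illustration, here is a minimal sketch of how a 2-shot GSM8K prompt could be assembled. The demonstrations and the `Q:`/`A:` formatting shown here are assumptions; the evaluator's actual examples are not published in this card.

```python
# Hypothetical 2-shot prompt assembly; the evaluator's actual demonstration
# examples and separators may differ.
FEW_SHOT_EXAMPLES = [
    ("Natalia sold clips to 48 of her friends in April, and then she sold "
     "half as many clips in May. How many clips did Natalia sell altogether "
     "in April and May?",
     "In May she sold 48 / 2 = 24 clips, so she sold 48 + 24 = 72 clips in "
     "total. The answer is 72."),
    ("Weng earns $12 an hour for babysitting. Yesterday, she just did 50 "
     "minutes of babysitting. How much did she earn?",
     "50 minutes is 50 / 60 of an hour, so she earned 12 * 50 / 60 = 10 "
     "dollars. The answer is 10."),
]

def build_prompt(question: str) -> str:
    """Prefix a question with the two worked demonstrations."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXAMPLES)
    return f"{shots}\n\nQ: {question}\nA:"
```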
## Evaluation Results
The model was evaluated on the GSM8K mathematical reasoning benchmark using the following testing protocol:
| Metric | Value |
|---|---|
| Accuracy | Calculating... |
| Total Problems | 1,319 |
| Few-shot Examples | 2 |
| Max Tokens Generated | 512 |
| Temperature | Default (0.7) |
Note: Final accuracy results will be populated after the evaluation completes.
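The card does not show the evaluator's internal inference code. As a rough sketch, a single completion under the settings above could be produced with the `mlx_lm` package (an assumption; the actual inference path is internal to the testing framework, and temperature handling varies across `mlx_lm` versions, so it is left at the library default here):

```python
from mlx_lm import load, generate

# Load the MLX-converted weights (placeholder path).
model, tokenizer = load("/path/to/your/model")

prompt = ("Q: A robe takes 2 bolts of blue fiber and half that much "
          "white fiber. How many bolts in total does it take?\nA:")

# Cap generation at the 512-token limit from the table above.
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
```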
## Usage
The evaluation was conducted using the following Python script:
```python
from mlx_gpt_oss_120b_few_shot_testing_gsm8k import MLXGPTGSM8KEvaluator

# Initialize evaluator
evaluator = MLXGPTGSM8KEvaluator(
    model_path="/path/to/your/model",
    data_path="/path/to/gsm8k_main_test_20250902_110036.json"
)

# Run evaluation
results, accuracy = evaluator.evaluate_gsm8k(num_samples=1319)
```
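Here `results` presumably holds the per-sample records and `accuracy` the overall fraction of correct answers; passing `num_samples=1319` runs the entire GSM8K main test set.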
## Evaluation Methodology
The evaluation process follows this structured approach:
```mermaid
flowchart TD
    A[Start Evaluation] --> B[Load MLX GPT-OSS-120B Model]
    B --> C[Load GSM8K Dataset<br/>1319 samples]
    C --> D[Create Few-Shot Prompts<br/>2 examples per question]
    subgraph EvaluationLoop [Per-Sample Processing]
        D --> E[Generate Model Response]
        E --> F[Extract Numerical Answer]
        F --> G[Compare with Expected Answer]
        G --> H[Record Accuracy]
    end
    H --> I[Save Intermediate Results<br/>Every 10 samples]
    I --> J[Calculate Final Accuracy]
    J --> K[Generate Comprehensive Reports<br/>JSON, TXT, Logs]
    K --> L[End Evaluation]
```
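As a sketch of the per-sample loop with the every-10-samples checkpointing shown in the flowchart (the function names, record fields, and checkpoint path here are illustrative, not the evaluator's actual API):

```python
import json
from typing import Callable, Optional

def run_eval(samples: list[dict],
             generate_fn: Callable[[str], str],
             extract_fn: Callable[[str], Optional[float]],
             checkpoint_path: str = "intermediate_results.json"):
    """Evaluate each sample, checkpointing partial results every 10 samples."""
    records, correct = [], 0
    for i, sample in enumerate(samples, start=1):
        predicted = extract_fn(generate_fn(sample["question"]))
        expected = extract_fn(sample["answer"])
        hit = predicted is not None and predicted == expected
        correct += hit
        records.append({"question": sample["question"],
                        "predicted": predicted,
                        "expected": expected,
                        "correct": hit})
        if i % 10 == 0:  # intermediate save, as in the flowchart
            with open(checkpoint_path, "w") as f:
                json.dump(records, f, indent=2)
    return records, correct / len(samples)
```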
**Key Components:**
- Few-shot Prompting: Each question is prefixed with 2 worked examples demonstrating the expected reasoning format (see the prompt sketch in the Model Description section)
- Answer Extraction: Uses regex patterns to extract numerical answers from model responses (a sketch follows this list)
- Accuracy Calculation: Compares extracted answers with ground truth values
- Comprehensive Logging: Detailed logs and intermediate result saving
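A minimal sketch of regex-based answer extraction, assuming a last-number heuristic; the evaluator's actual patterns are not shown in this card. Because GSM8K references end with `#### <answer>`, the same extractor can be applied to both the model response and the ground-truth solution.

```python
import re
from typing import Optional

def extract_answer(text: str) -> Optional[float]:
    """Take the last number in the text as the answer (illustrative heuristic)."""
    # Drop thousands separators so "1,319" parses as a single number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

assert extract_answer("48 + 24 = 72. The answer is 72.") == 72.0
assert extract_answer("She earned $10.\n#### 10") == 10.0
```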
## Files Generated
The evaluation script produces the following output files:
- `gsm8k_evaluation_YYYYMMDD_HHMMSS.log` - Detailed execution log
- `gpt_oss_output_YYYYMMDD_HHMMSS/` - Directory containing:
  - `final_results.json` - Complete evaluation results
  - `intermediate_results.json` - Periodic saves during evaluation
  - `summary.json` - Evaluation metrics summary
  - `results_summary.txt` - Human-readable summary
## Limitations
- Runs with `num_samples` set below 1,319 evaluate only a subset of the full GSM8K test set
- Performance may vary based on the specific few-shot examples used
- Answer extraction relies on pattern matching which may not capture all valid answer formats
- Computational requirements are significant due to model size
## Environmental Impact
The evaluation was conducted on Apple Silicon hardware, which typically offers improved energy efficiency compared to traditional GPU setups. The MLX framework further optimizes resource utilization for Apple hardware.
## Citation
If you use this evaluation methodology or results in your research, please acknowledge:
Evaluation of GPT-OSS-120B using MLX framework on GSM8K mathematical reasoning benchmark.
## Contact
For questions about this evaluation, please open an issue in the respective repository.
This model card was generated based on the evaluation of MLX GPT-OSS-120B on the GSM8K dataset.