Model Card for MLX GPT-OSS-120B GSM8K Evaluation

Model Description

This model card documents the evaluation results of the MLX GPT-OSS-120B model on the GSM8K mathematical reasoning benchmark using few-shot testing methodology. The evaluation was conducted using a custom testing framework that leverages Apple's MLX framework for efficient inference on Apple Silicon.

  • Model Type: Transformer-based language model
  • Model Size: 120 billion parameters
  • Framework: MLX (Apple Silicon optimized)
  • Evaluation Method: Few-shot testing with 2 demonstration examples
  • Dataset: GSM8K main test set (1,319 samples)

Evaluation Results

The model was evaluated on the GSM8K mathematical reasoning benchmark using the following testing protocol:

Metric                Value
Accuracy              Calculating...
Total Problems        1,319
Few-shot Examples     2
Max Tokens Generated  512
Temperature           Default (0.7)
Note: Final accuracy results will be populated after the evaluation completes.
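The run settings above can be collected into a small configuration object; the dictionary keys below are illustrative names, not the evaluator's actual API:

```python
# Illustrative run configuration mirroring the table above.
# Key names are assumptions, not the evaluator's actual parameters.
GENERATION_CONFIG = {
    "num_samples": 1319,   # full GSM8K main test set
    "num_few_shot": 2,     # demonstration examples per prompt
    "max_tokens": 512,     # cap on tokens generated per answer
    "temperature": 0.7,    # default sampling temperature
}
```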

Usage

The evaluation was conducted using the following Python script:

from mlx_gpt_oss_120b_few_shot_testing_gsm8k import MLXGPTGSM8KEvaluator

# Initialize evaluator
evaluator = MLXGPTGSM8KEvaluator(
    model_path="/path/to/your/model",
    data_path="/path/to/gsm8k_main_test_20250902_110036.json"
)

# Run evaluation
results, accuracy = evaluator.evaluate_gsm8k(num_samples=1319)

Evaluation Methodology

The evaluation process follows this structured approach:

flowchart TD
    A[Start Evaluation] --> B[Load MLX GPT-OSS-120B Model]
    B --> C[Load GSM8K Dataset<br/>1319 samples]
    C --> D[Create Few-Shot Prompts<br/>2 examples per question]
    
    subgraph EvaluationLoop [Per-Sample Processing]
        E[Generate Model Response] --> F[Extract Numerical Answer]
        F --> G[Compare with Expected Answer]
        G --> H[Record Accuracy]
    end

    D --> E
    H --> I[Save Intermediate Results<br/>Every 10 samples]
    I --> J[Calculate Final Accuracy]
    J --> K[Generate Comprehensive Reports<br/>JSON, TXT, Logs]
    K --> L[End Evaluation]
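In code, the per-sample loop in the flowchart might look like the following sketch. Here generate_response and extract_answer are hypothetical callables standing in for the evaluator's model call and answer parser, not the script's actual internals:

```python
import json

def evaluate(samples, generate_response, extract_answer,
             save_path="intermediate_results.json"):
    """Minimal sketch of the per-sample evaluation loop.

    `generate_response` and `extract_answer` are hypothetical
    stand-ins for the evaluator's model call and answer parser.
    """
    results, correct = [], 0
    for i, sample in enumerate(samples, start=1):
        response = generate_response(sample["question"])   # generate model response
        predicted = extract_answer(response)               # extract numerical answer
        is_correct = predicted == sample["answer"]         # compare with expected answer
        correct += is_correct                              # record accuracy
        results.append({"question": sample["question"],
                        "predicted": predicted,
                        "expected": sample["answer"],
                        "correct": bool(is_correct)})
        if i % 10 == 0:                                    # save intermediate results
            with open(save_path, "w") as f:
                json.dump(results, f)
    accuracy = correct / len(samples) if samples else 0.0  # final accuracy
    return results, accuracy
```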

Key Components:

  1. Few-shot Prompting: Each question is prefixed with 2 worked examples demonstrating the expected reasoning format
  2. Answer Extraction: Uses regex patterns to extract numerical answers from model responses
  3. Accuracy Calculation: Compares extracted answers with ground truth values
  4. Comprehensive Logging: Detailed logs and intermediate result saving
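A minimal version of the regex-based answer extraction (component 2) might look like this. The patterns are assumptions about typical GSM8K-style outputs (a "#### <number>" marker, or a final number in the text), not the script's exact regexes:

```python
import re

def extract_numerical_answer(text):
    """Pull a numerical answer from a model response.

    Prefers the GSM8K-style '#### <number>' marker; otherwise falls
    back to the last number in the text. Pattern choices here are
    illustrative assumptions, not the evaluator's implementation.
    """
    # GSM8K ground-truth answers end with '#### <number>'.
    marker = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if marker:
        return float(marker.group(1).replace(",", ""))
    # Fallback: take the last number mentioned in the response.
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if numbers:
        return float(numbers[-1].replace(",", ""))
    return None
```

As the Limitations section notes, pattern matching like this can miss valid but unusually formatted answers (spelled-out numbers, fractions, units attached to digits).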

Files Generated

The evaluation script produces the following output files:

  • gsm8k_evaluation_YYYYMMDD_HHMMSS.log - Detailed execution log
  • gpt_oss_output_YYYYMMDD_HHMMSS/ - Directory containing:
    • final_results.json - Complete evaluation results
    • intermediate_results.json - Periodic saves during evaluation
    • summary.json - Evaluation metrics summary
    • results_summary.txt - Human-readable summary
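The summary outputs could be produced along these lines; the file names follow the list above, while the field names and directory default are illustrative:

```python
import json
from pathlib import Path

def write_reports(results, accuracy, out_dir="gpt_oss_output_example"):
    """Write JSON and human-readable summaries of an evaluation run.

    Mirrors the file layout listed above; the summary fields are
    illustrative assumptions, not the script's exact schema.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Complete per-sample results.
    (out / "final_results.json").write_text(json.dumps(results, indent=2))
    # Headline metrics.
    summary = {"total_problems": len(results), "accuracy": accuracy}
    (out / "summary.json").write_text(json.dumps(summary, indent=2))
    # Human-readable summary.
    (out / "results_summary.txt").write_text(
        f"GSM8K evaluation\nProblems: {len(results)}\nAccuracy: {accuracy:.2%}\n"
    )
    return out
```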

Limitations

  • When num_samples is set below 1,319, results reflect only a subset of the GSM8K test set
  • Performance may vary based on the specific few-shot examples used
  • Answer extraction relies on pattern matching which may not capture all valid answer formats
  • Computational requirements are significant due to model size

Environmental Impact

The evaluation was conducted on Apple Silicon hardware, which typically offers improved energy efficiency compared to traditional GPU setups. The MLX framework further optimizes resource utilization for Apple hardware.

Citation

If you use this evaluation methodology or results in your research, please acknowledge:

Evaluation of GPT-OSS-120B using MLX framework on GSM8K mathematical reasoning benchmark.

Contact

For questions about this evaluation, please open an issue in the respective repository.


This model card was generated based on the evaluation of MLX GPT-OSS-120B on the GSM8K dataset.
