QTM7-4B

QTM7-4B is a proof-of-concept math & code reasoning model finetuned from Qwen/Qwen3-4B-Base.
It was trained for ~4 hours on a single A100 GPU, using lightweight datasets focused on mathematical reasoning and structured problem solving.
The project demonstrates what can be achieved on minimal compute and budget (≈$20 total cost).
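A minimal loading and generation sketch, assuming the checkpoint loads with the standard Transformers causal-LM API under the repo id Ma7ee7/QTM7-4b-1771; the prompt format and generation settings are illustrative, not an authoritative usage recipe:

```python
# Minimal inference sketch; repo id, prompt format, and generation settings
# are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Ma7ee7/QTM7-4b-1771"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "A train travels 60 km in 45 minutes. What is its average speed in km/h? Reason step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```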


UPDATE: Observed Performance Shift

This model was trained on math and code datasets with the explicit intent of achieving higher performance in structured reasoning than the base Qwen3-4B model. While quantitative GSM8K metrics show improved math ability, recent qualitative testing suggests an unexpected side effect:

QTM7-4B exhibits significantly enhanced performance in creative writing, narrative generation, and descriptive tasks compared to the Qwen3-4B base model.

The focused finetuning appears to have improved the model's handling of complex instructions and structure, and this carries over into a markedly better ability to generate cohesive and evocative creative content.
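One quick way to probe this behaviour is a sampling run on a creative prompt, reusing the model and tokenizer objects from the loading sketch above; the prompt and sampling settings here are illustrative, not the settings used in the informal testing described:

```python
# Illustrative creative-writing probe; sampling parameters are assumptions.
creative_prompt = "Write a short, atmospheric opening paragraph for a story set in a lighthouse during a storm."
inputs = tokenizer(creative_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```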


Model Details

  • Developed by: Independent researcher (solo project)
  • Funding: Self-funded (~$20 total compute cost)
  • Model type: Decoder-only transformer for text generation
  • Language(s): English
  • License: Apache-2.0
  • Finetuned from: Qwen/Qwen3-4B-Base


Uses

Direct Use

  • Research into math & code reasoning
  • Proof-of-concept for low-budget finetuning on large language models
  • New focus: evaluating how low-resource finetuning affects creative writing and narrative coherence

Downstream Use

  • Potential basis for math problem solvers or code reasoning assistants
  • Experiments in lightweight alignment or evaluation pipelines

Out-of-Scope

  • Not suitable for safety-critical, legal, or medical applications
  • Not RLHF-aligned; outputs may be unfiltered or ungrounded

Bias, Risks, and Limitations

  • Inherits biases from Qwen3-4B-Base
  • Untested on broader NLP benchmarks (MMLU, ARC, etc.)
  • Training was short (~2 hours net, ~4 GPU hours total), so coverage is shallow
  • General conversational ability remains base-model level

Recommendation: Treat outputs as experimental. Do not deploy in production or decision-making contexts.


Training Details

Training Data

  • unsloth/OpenMathReasoning-mini — math reasoning dataset
  • nvidia/OpenCodeReasoning — code reasoning tasks
  • No GSM8K contamination was found in either the training or post-training data.
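Both datasets are available on the Hugging Face Hub; a loading sketch with Hugging Face Datasets (config and split names are assumptions, check each dataset card):

```python
# Loading sketch for the two finetuning datasets; config/split names below
# are assumptions, verify them on each dataset card.
from datasets import load_dataset

math_ds = load_dataset("unsloth/OpenMathReasoning-mini")      # math reasoning traces
code_ds = load_dataset("nvidia/OpenCodeReasoning", "split_0")  # code reasoning tasks
print(math_ds)
print(code_ds)
```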

Procedure

  • Mixed precision: fp16
  • Optimizer: AdamW (standard defaults)
  • Duration: ~4 hours on 1x NVIDIA A100
  • Checkpoint size: ~16 GB (fp16)
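For reference, a hedged sketch of the kind of Trainer configuration implied by the bullets above (fp16 mixed precision, AdamW with defaults); the hyperparameters shown are placeholders, not the exact values used for this checkpoint:

```python
# Illustrative TrainingArguments matching the listed setup (fp16 + AdamW defaults);
# batch size, learning rate, and schedule are placeholder values.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qtm7-4b-sft",
    per_device_train_batch_size=1,     # assumption
    gradient_accumulation_steps=8,     # assumption
    learning_rate=2e-5,                # assumption
    num_train_epochs=1,                # assumption
    fp16=True,                         # mixed precision, as listed above
    optim="adamw_torch",               # AdamW with standard defaults
    logging_steps=50,
    save_steps=500,
)
```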

Evaluation

Setup

  • Compared against Qwen/Qwen3-4B (post-trained version)
  • Dataset: GSM8K test split (subset of 300 “hard” problems)
  • Metrics: Exact match on final numeric answer
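A sketch of what "exact match on the final numeric answer" can look like in practice; the regex and normalisation are assumptions, not the exact harness used here:

```python
# Hedged exact-match sketch: compare the last number in the model output to the
# GSM8K reference answer (the value after '####'). Details are illustrative.
import re

NUM_RE = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")

def last_number(text):
    matches = NUM_RE.findall(text)
    return matches[-1].replace(",", "") if matches else None

def exact_match(prediction, reference):
    gold = reference.split("####")[-1].strip().replace(",", "")
    return last_number(prediction) == gold

print(exact_match("... so the answer is 42.", "#### 42"))  # True
```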

Results

Training Loss Curve
Stable convergence toward ~0.63 by step 1750, even as difficulty increased.



GSM8K Accuracy (Sampled)
QTM7-4B* scored ~80.7% vs Qwen3-4B’s ~28.0%.



Head-to-Head Outcomes
QTM7-4B* won most direct comparisons.

  • Only QTM7-4B* correct → 171
  • Both correct → 71
  • Both wrong → 45
  • Only Qwen correct → 13



Outcome Breakdown by Model (GSM8K subset)
Side-by-side percentages for correct answers versus the two error types (final-answer mismatch and truncated output).

  • QTM7-4B*: 80.7% correct, 7.3% mismatch, 12.0% truncated
  • Qwen3-4B: 28.0% correct, 72.0% mismatch, 0% truncated
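As a consistency check (not part of the original evaluation code), the head-to-head counts above reproduce these accuracy figures:

```python
# Recompute per-model accuracy from the head-to-head counts (300 problems).
only_qtm, both_correct, both_wrong, only_qwen = 171, 71, 45, 13
total = only_qtm + both_correct + both_wrong + only_qwen
print(f"QTM7-4B:  {(only_qtm + both_correct) / total:.1%}")   # 80.7%
print(f"Qwen3-4B: {(only_qwen + both_correct) / total:.1%}")  # 28.0%
```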



* QTM7-4B here refers to the ~2-hour training checkpoint.


Environmental Impact

Estimated using MLCO2 Impact Calculator:

  • Hardware: NVIDIA A100 (80GB)
  • GPU hours: ~4
  • Cloud Provider: Google Colab (us-central assumed)
  • Carbon emitted: 1.2 kg CO2eq

(About the same as driving ~5 km in a gasoline car.)
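For intuition, estimates like this boil down to power × time × grid carbon intensity; the figures below are assumptions chosen to illustrate the calculation, not the MLCO2 calculator's actual inputs for this run:

```python
# Back-of-envelope sketch of a GPU-training carbon estimate (all inputs assumed).
gpu_power_kw = 0.4       # assumed A100 SXM board power at full utilisation
hours = 4
grid_intensity = 0.75    # assumed kg CO2eq per kWh
print(f"{gpu_power_kw * hours * grid_intensity:.1f} kg CO2eq")  # 1.2
```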


Technical Specifications

  • Architecture: Qwen3-4B transformer (4B params, decoder-only, rotary embeddings, SwiGLU, grouped query attention)
  • Objective: Causal LM finetuning on reasoning tasks
  • Software: PyTorch + Hugging Face Transformers + Datasets
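The architecture details above can be sanity-checked from the model config; a sketch assuming the standard Qwen3 config fields in Transformers:

```python
# Config inspection sketch; field names follow the standard Qwen3 config and
# may differ if the repo ships a customised config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Ma7ee7/QTM7-4b-1771")
print(cfg.model_type)                                     # expected: "qwen3"
print(cfg.num_attention_heads, cfg.num_key_value_heads)   # grouped-query attention
print(cfg.hidden_size, cfg.num_hidden_layers)             # overall model scale
```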

Summary

QTM7-4B is a minimal-budget proof-of-concept showing that:

  • Small compute can still move the needle on reasoning with focused datasets.
  • Math reasoning gains were observed even with short finetunes.
  • The model is not benchmarked broadly, but shows promise as a low-resource experiment.
