QTM7-4B
QTM7-4B is a proof-of-concept math and code reasoning model, briefly finetuned from Qwen/Qwen3-4B-Base.
It was finetuned for ~4 hours on a single A100 GPU, using lightweight datasets focused on mathematical reasoning and structured problem solving.
This project demonstrates what can be achieved on a minimal compute budget (≈$20 total cost).
UPDATE: Observed Performance Shift
This model was trained on math and code datasets with the goal of improving structured reasoning over the Qwen3-4B base model. Quantitative GSM8K metrics do show the expected gains in math ability, but recent qualitative testing suggests an unexpected side effect:
QTM7-4B exhibits noticeably stronger performance on creative writing, narrative generation, and descriptive tasks than the Qwen3-4B base model.
The focused finetuning appears to have improved the model's handling of complex instructions and structure, and this carries over into more cohesive and evocative creative content.
Model Details
- Developed by: Independent researcher (solo project)
- Funding: Self-funded (~$20 total compute cost)
- Model type: Decoder-only transformer for text generation
- Language(s): English
- License: Apache-2.0
- Finetuned from: Qwen/Qwen3-4B-Base
Sources
- Repository: Ma7ee7/QTM7-4b-2hr-checkpoint
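A minimal loading sketch, assuming the Hub checkpoint is in standard Transformers format; the prompt and decoding settings are illustrative, not recommendations:

```python
# Minimal inference sketch; prompt and generation settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Ma7ee7/QTM7-4b-2hr-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "A train covers 60 km in 45 minutes. What is its average speed in km/h?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```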
Uses
Direct Use
- Research into math & code reasoning
- Proof-of-concept for low-budget finetuning on large language models
- New focus: evaluating how low-resource finetuning affects creative writing and narrative coherence.
Downstream Use
- Potential basis for math problem solvers or code reasoning assistants
- Experiments in lightweight alignment or evaluation pipelines
Out-of-Scope
- Not suitable for safety-critical, legal, or medical applications
- Not RLHF-aligned; outputs may be unfiltered or ungrounded
Bias, Risks, and Limitations
- Inherits biases from Qwen3-4B-Base
- Untested on broader NLP benchmarks (MMLU, ARC, etc.)
- Training was short (~2 hours of net training within ~4 total GPU-hours), so coverage is shallow
- General conversational ability remains base-model level
Recommendation: Treat outputs as experimental. Do not deploy in production or decision-making contexts.
Training Details
Training Data
- unsloth/OpenMathReasoning-mini — math reasoning dataset
- nvidia/OpenCodeReasoning — code reasoning tasks
- No GSM8K contamination was found in either the training or post-training data.
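A quick data-inspection sketch for the two datasets. The "split_0" config name for nvidia/OpenCodeReasoning is an assumption; verify the exact configs, splits, and column names on each dataset card before building prompts:

```python
# Dataset inspection sketch; "split_0" for nvidia/OpenCodeReasoning is an
# assumption, so check configs/splits/columns on the dataset cards first.
from datasets import load_dataset

math_ds = load_dataset("unsloth/OpenMathReasoning-mini")
code_ds = load_dataset("nvidia/OpenCodeReasoning", "split_0")

print(math_ds)   # splits and columns for the math reasoning traces
print(code_ds)   # splits and columns for the code reasoning tasks
```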
Procedure
- Mixed precision: fp16
- Optimizer: AdamW (standard defaults)
- Duration: ~4 hours on 1x NVIDIA A100
- Checkpoint size: ~16 GB (fp16)
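A training-setup sketch under the settings listed above (fp16 mixed precision, AdamW defaults, a single GPU). The inline toy dataset, batch size, and sequence length are illustrative stand-ins, not the exact configuration used for this checkpoint:

```python
# Training-setup sketch: fp16 mixed precision, AdamW defaults, single GPU.
# The toy dataset and hyperparameters below are illustrative stand-ins.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_id = "Qwen/Qwen3-4B-Base"
tokenizer = AutoTokenizer.from_pretrained(base_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_id)

# Stand-in for the formatted math/code reasoning examples.
train_ds = Dataset.from_dict({
    "text": [
        "Problem: What is 17 * 24?\nSolution: 17 * 24 = 408. Answer: 408",
        "Problem: Write a function that reverses a string.\n"
        "Solution: def rev(s): return s[::-1]",
    ]
})
tokenized = train_ds.map(
    lambda b: tokenizer(b["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="qtm7-4b-sft",
    fp16=True,                      # mixed precision, as noted above
    optim="adamw_torch",            # AdamW with standard defaults
    per_device_train_batch_size=1,  # illustrative; sized for one A100
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```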
Evaluation
Setup
- Compared against Qwen/Qwen3-4B (post-trained version)
- Dataset: GSM8K test split (subset of 300 “hard” problems)
- Metrics: Exact match on final numeric answer
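A scoring sketch for the exact-match metric, assuming GSM8K-style references where the gold answer follows "####"; the number-extraction heuristic is an illustration, not necessarily the parser used for these results:

```python
# Exact-match scoring sketch: compare the last number in a generation with
# the gold answer that GSM8K stores after "####".
import re

def final_number(text: str):
    """Return the last number in the text, with thousands separators removed."""
    nums = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return nums[-1] if nums else None

def exact_match(prediction: str, reference: str) -> bool:
    gold = reference.split("####")[-1].strip().replace(",", "")
    pred = final_number(prediction)
    return pred is not None and float(pred) == float(gold)

print(exact_match("The trip takes 3 hours, so the answer is 42.", "#### 42"))  # True
```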
Results
Training Loss Curve
The training loss converged stably toward ~0.63 by step 1750, even as difficulty increased.
GSM8K Accuracy (Sampled)
QTM7-4B* scored ~80.7% vs Qwen3-4B’s ~28.0%.
Head-to-Head Outcomes
QTM7-4B* won most direct comparisons across the 300-problem subset:
- Only QTM7-4B* correct → 171
- Both correct → 71
- Both wrong → 45
- Only Qwen correct → 13
Outcome Breakdown by Model (GSM8K subset)
Side-by-side percentages for correct answers versus error types (answer mismatch or truncated output):
- QTM7-4B*: 80.7% correct, 7.3% mismatch, 12.0% truncated
- Qwen3-4B: 28.0% correct, 72.0% mismatch, 0% truncated
* QTM7-4B = 2hr checkpoint
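The head-to-head counts and the accuracy figures describe the same 300-problem subset and are mutually consistent, as this quick check shows:

```python
# The accuracy numbers follow directly from the head-to-head counts over the
# same 300-problem GSM8K subset.
only_qtm7, both_correct, both_wrong, only_qwen = 171, 71, 45, 13
total = only_qtm7 + both_correct + both_wrong + only_qwen   # 300

qtm7_acc = (only_qtm7 + both_correct) / total               # 242/300 ≈ 80.7%
qwen_acc = (only_qwen + both_correct) / total               #  84/300 = 28.0%
print(f"QTM7-4B: {qtm7_acc:.1%}   Qwen3-4B: {qwen_acc:.1%}")
```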
Environmental Impact
Estimated using MLCO2 Impact Calculator:
- Hardware: NVIDIA A100 (80GB)
- GPU hours: ~4
- Cloud Provider: Google Colab (us-central assumed)
- Carbon emitted: ≈ 1.2 kg CO2eq
(About the same as driving ~5 km in a gasoline car.)
Technical Specifications
- Architecture: Qwen3-4B transformer (4B params, decoder-only, rotary embeddings, SwiGLU, grouped query attention)
- Objective: Causal LM finetuning on reasoning tasks
- Software: PyTorch + Hugging Face Transformers + Datasets
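A quick way to confirm the architecture details listed above, assuming a recent Transformers release with Qwen3 support:

```python
# Architecture check: read the base model's config from the Hub.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen3-4B-Base")
print(cfg.model_type)                                    # "qwen3"
print(cfg.num_hidden_layers)                             # decoder depth
print(cfg.num_attention_heads, cfg.num_key_value_heads)  # GQA: more query heads than KV heads
print(cfg.hidden_act)                                    # "silu" (SwiGLU MLP)
```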
Summary
QTM7-4B is a minimal-budget proof-of-concept showing that:
- Small compute can still move the needle on reasoning with focused datasets.
- Math reasoning gains were observed even with short finetunes.
- The model is not benchmarked broadly, but shows promise as a low-resource experiment.