---
tags:
  - text-generation
  - reasoning
  - coding
  - mathematics
  - quantization
license: apache-2.0
datasets:
  - synthetic
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
language:
  - en
  - hi
library_name: transformers
pipeline_tag: text-generation
---

Alpie-Core: 4-bit Quantized Reasoning Model

📄 Technical Report: Alpie_Core.pdf

1. Introduction

Alpie-Core is one of the world's first fine-tuned 4-bit reasoning models, demonstrating that an aggressively quantized model can surpass full-precision baselines in reasoning, mathematics, and coding. By combining quantization-aware training with synthetic, STEM-rich datasets, Alpie-Core delivers frontier-level reasoning while remaining practical for real-world deployment at scale.

2. Model Summary

  • Base Architecture: DeepSeek-R1-Distill-Qwen-32B
  • Parameters: 32 billion (quantized to 4-bit)
  • Training Method: Supervised Fine-Tuning (SFT) using LoRA/QLoRA techniques
  • Quantization: 4-bit NF4 with double quantization
  • Context Length: 65k tokens
  • Max Output Length: 16,384 tokens
  • License: Apache 2.0

3. Approach

Alpie-Core has undergone extensive supervised fine-tuning (SFT) to strengthen reasoning, robustness, and safety. The training leveraged a diverse mixture of curated open-source datasets and proprietary synthetic data, optimized with high-quality LLM-generated responses. The fine-tuning process emphasized adherence to rigorous safety and usability standards, including:

1. User Understanding and Clarity – ensuring outputs are direct, interpretable, and pedagogically sound.

2. Security and Ethical Guidelines – filtering unsafe or harmful generations during and after training.

3. Limitations, Disclaimers, and Knowledge Boundaries – transparently communicating uncertainty and scope.

4. Handling Complex and Sensitive Topics – balancing informativeness with responsible guardrails.

5. Safety and Respectful Engagement – maintaining politeness, inclusivity, and cultural sensitivity.

6. Confidentiality and Responsible Use – preventing leakage of private training data, proprietary prompts, or internal reasoning traces.

This SFT approach enables Alpie-Core to deliver reliable, aligned, and context-aware responses while maintaining safety across a broad range of use cases.

4. Model Features

  1. Supports Streaming – Real-time token-level responses
  2. OpenAI-Compatible API – Seamless integration with OpenAI client libraries (see the example after this list)
  3. 65K Context Length – Handles very large inputs and conversations
  4. 16,384 Max Output Length – Enables extremely long generations
  5. 4-Bit Quantization – Memory-efficient and optimized for deployment
  6. High Throughput Inference – Powered by vLLM for efficient large-scale serving
  7. Low Latency Inference – Fast response times optimized for production
  8. Customizable Safety & Moderation Filters – Built-in guardrails for safer outputs
  9. Supports Function Calling / Tool Use – Enables structured outputs and external API integration
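
Because the API is OpenAI-compatible, the standard OpenAI Python client can be used directly. The snippet below is a minimal sketch, not documented configuration: the base URL, API key, and served model name are placeholder assumptions for a self-hosted endpoint (for example one started with vLLM).

from openai import OpenAI

# Point the client at a self-hosted, OpenAI-compatible endpoint (URL and key are placeholders)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="169Pi/Alpie-Core-4-bit",  # model name as registered on the serving endpoint (assumption)
    messages=[{"role": "user", "content": "Explain the AM-GM inequality with a short example."}],
    max_tokens=512,
    stream=True,  # token-level streaming, as described in the feature list above
)

for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)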

5. Key Highlights

  1. Frontier Performance in 4-bit: 81.28% MMLU, 92.75% GSM8K, 57.8% SWE-Bench Verified

  2. STEM + Coding Excellence: Outperforms full-precision peers in mathematics and programming

  3. Enhanced Content Access: Provides factual responses to geopolitically sensitive topics

  4. Quantization Efficiency: A 4-bit quantized variant achieves competitive performance retention compared to full-precision models, demonstrating that aggressive quantization can preserve task accuracy while substantially reducing hardware requirements.

  5. Benchmark Competitiveness: Across more than ten standard evaluation benchmarks, the model demonstrates performance on par with or exceeding that of larger 70B+ parameter systems, highlighting the effectiveness of our training and optimization strategies.

  6. Environmental Benefits: Through quantization and efficiency-focused design, the model requires significantly fewer computational resources. This translates into lower energy consumption and reduced carbon footprint relative to full-precision deployments.

6. Benchmark Results

Benchmark comparison charts: GSM8K, AIME, BBH, Humanity's Last Exam, and combined results.

| Benchmark | Alpie-Core (32B-4bit) | DeepSeek-V2 (236B) | Qwen2.5 72B | Llama 3.1 405B | Llama 3.1 70B | Gemma-3 27B-PT | Mistral-Small-24B-Base-2501 |
|---|---|---|---|---|---|---|---|
| MMLU (5-shot) | 81.28% | 78.4% | 85.0% | 84.4% | 79.3% | 78.6% | 80.73% |
| GSM8K (8-shot) | 92.75% | 81.6% | 88.3% | 83.5% | - | 82.2% | 80.73% |
| BBH (3-shot) | 85.12% | 78.8% | 79.8% | 82.9% | 81.6% | 77.7% | - |
| MMLU-Pro (5-shot) | 64.78% | 51.4% | 58.3% | 52.8% | 53.8% | 52.2% | 54.37% |
| MBPP (pass@1) | 75.20% | 65.0% | 72.6% | 68.4% | - | 65.6% | 69.64% |
| HumanEval (pass@1) | 57.23% | 43.3% | 53.0% | 54.9% | - | 48.8% | - |

SWE-Bench Verified Performance

| Rank | Model | Accuracy (%) | Performance vs Alpie |
|---|---|---|---|
| 1 | Alpie Core | 57.8 | Alpie |
| 2 | Qwen3-Coder-30B-A3B-Instruct | 51.6 | Below Alpie |
| 3 | o1 | 48.9 | Below Alpie |
| 4 | o3-mini (high) | 49.3 | Below Alpie |
| 5 | Claude 3.5 Sonnet | 49.0 | Below Alpie |
| 6 | DeepSeek R1 | 49.2 | Below Alpie |
| 7 | Devstral | 46.8 | Below Alpie |

Humanity's Last Exam Leaderboard Performance

| Rank | Model | Accuracy (%) | Performance vs Alpie |
|---|---|---|---|
| 1 | GPT 4.5 Preview | 5.8 | Above Alpie |
| 2 | Claude Sonnet 4 | 5.42 | Above Alpie |
| 3 | Alpie Core 32B (4-bit) | 5.41 | Alpie |
| 4 | Llama 4 Maverick | 5.34 | Below Alpie |
| 5 | GPT 4.1 | 4.97 | Below Alpie |
| 6 | Kimi K2 Instruct | 4.68 | Below Alpie |
| 7 | DeepSeek V3 | 4.55 | Below Alpie |
| 8 | Gemini 1.5 Pro 002 | 4.55 | Below Alpie |

Additional Benchmarks

| Benchmark | Alpie-Core (32B-4bit) | Category |
|---|---|---|
| AIME | 47.34% | Advanced Mathematics |
| GPQA (Diamond) | 40.91% | Graduate-level QA |
| TruthfulQA (MC2) | 60.05% | Truthfulness |
| HellaSwag | 84.66% | Commonsense |
| PIQA | 83.24% | Physical Reasoning |
| ARC Challenge | 67.58% | Science QA |
| CommonSenseQA | 87.06% | Commonsense |
| AGIEval | 64.98% | General Intelligence |
| Winogrande | 79.53% | Commonsense Reasoning |
| MATH-500 | 70.00% | Advanced Mathematics |

7. Training Details

  • Hardware: 8× NVIDIA H100 (Hopper) 80 GB GPUs
  • Fine-tuning Method: LoRA/QLoRA with the following configuration (see the sketch after this list):
    • LoRA Alpha: 16
    • LoRA Dropout: 0.05
    • LoRA Rank: 16
  • Quantization: 4-bit NF4 + Double Quantization + FP16 compute
  • Dataset Domains: Mathematics, coding, reasoning, science, general knowledge, competitive exams, Indian context + law, multilingual (Hindi and Hinglish)
  • Synthetic Data Advantage: +15-20% performance boost in STEM & coding domains
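
For reference, the configuration above corresponds roughly to the following Hugging Face peft + bitsandbytes setup. This is a hedged reconstruction from the listed hyperparameters, not the exact training script; in particular, target_modules and other unlisted arguments are assumptions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 with double quantization and FP16 compute, as listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA hyperparameters from the list above; target_modules is an assumption
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()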

8. Environmental Impact

Carbon Footprint

We estimated the environmental impact of training Alpie-Core (32B) on 8× NVIDIA H100-80GB GPUs by calculating carbon emissions from GPU energy consumption, using the formula:

CO₂e (kg) = Grid CO₂ Factor (kg/kWh) × Runtime (hours) × Power per GPU (kW) × Number of GPUs

Training parameters:

  • Grid CO₂ Factor (Azure average): 0.364 kg CO₂e per kWh
  • Runtime: 408 hours
  • GPUs: 8× H100-80GB

We report results under two assumption modes:

  • Realistic mode (average training draw ≈ 250 W per GPU = 0.25 kW): 0.364 × 408 × 0.25 × 8 ≈ 298 kg CO₂e
  • Conservative mode (near TDP ≈ 700 W per GPU = 0.70 kW): 0.364 × 408 × 0.70 × 8 ≈ 835 kg CO₂e

The total training footprint therefore ranges from ~298 kg CO₂e (realistic) to ~835 kg CO₂e (conservative worst case).
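
The same formula can be evaluated directly; the script below reproduces the two estimates (they match the reported figures up to rounding).

# CO2e (kg) = grid factor (kg/kWh) x runtime (h) x power per GPU (kW) x number of GPUs
GRID_FACTOR_KG_PER_KWH = 0.364   # Azure average grid intensity
RUNTIME_HOURS = 408
NUM_GPUS = 8

def co2e_kg(power_per_gpu_kw: float) -> float:
    """Estimated training emissions in kg CO2e for a given per-GPU power draw."""
    return GRID_FACTOR_KG_PER_KWH * RUNTIME_HOURS * power_per_gpu_kw * NUM_GPUS

print(f"Realistic (0.25 kW/GPU):    {co2e_kg(0.25):.0f} kg CO2e")   # ~297 kg
print(f"Conservative (0.70 kW/GPU): {co2e_kg(0.70):.0f} kg CO2e")   # ~832 kg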

9. Use Cases

Best for STEM, complex mathematical reasoning, coding, and Indian context

1. STEM: Excels at solving advanced problems in science, technology, engineering, and mathematics with high accuracy.

2. Complex Mathematical Reasoning: Handles multi-step logical and quantitative reasoning tasks with strong reliability.

3. Coding: Supports software development, debugging, and algorithmic problem-solving across multiple programming languages.

4. Indian Context: Provides culturally aware insights, competitive exam assistance (JEE, NEET, UPSC), and multilingual support in Hindi/Hinglish.

10. Safety and Limitations

Enhanced Content Access

Unlike the base DeepSeek model, Alpie-Core provides factual, balanced responses to geopolitically sensitive questions, offering global accessibility and factual accuracy on topics like Taiwan's status, Arunachal Pradesh sovereignty, and other sensitive geopolitical issues.

Current Limitations

  • Multilingual reasoning in Hindi/Hinglish shows room for improvement
  • Fixed knowledge cutoff without real-time information retrieval
  • Occasional struggles with complex multi-hop mathematical reasoning
  • Potential hallucinations in factual question-answering

Mitigations

  • Safety classifiers and output filtering systems
  • Model-assisted safety pipeline using RLHF
  • Comprehensive adversarial testing by domain experts

11. How to Use

Non-Streaming Inference

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig
import torch

# Load LoRA adapter configuration to find the base model
peft_model_id = "169Pi/Alpie-Core-4-bit"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load LoRA weights
model = PeftModel.from_pretrained(base_model, peft_model_id)

# Ensure evaluation mode
model.eval()

# Sample inference
prompt = "Solve the Riemann Hypothesis and provide a final answer?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=1000)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Response:\n", response)

Streaming Inference

from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
from peft import PeftModel, PeftConfig
import torch

# Load LoRA adapter configuration to find the base model
peft_model_id = "169Pi/Alpie-Core-4-bit"
config = PeftConfig.from_pretrained(peft_model_id)

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load LoRA weights
model = PeftModel.from_pretrained(base_model, peft_model_id)

# Ensure evaluation mode
model.eval()

# Initialize streamer
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Sample streaming inference
prompt = "Solve the Riemann Hypothesis and provide a final answer?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

print("Streaming Response:")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1000,
        streamer=streamer,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

Deployment Options

  • Transformers: Python, PyTorch integration
  • vLLM: High-throughput inference
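
For high-throughput serving with vLLM, the sketch below is a minimal example under stated assumptions: the LoRA adapter has first been merged into the base weights (e.g. via PeftModel.merge_and_unload()) and saved to a local path, and the path and sampling settings are placeholders rather than documented values.

from vllm import LLM, SamplingParams

# Path to merged (base + LoRA) weights; placeholder produced offline with merge_and_unload()
llm = LLM(model="/path/to/alpie-core-merged", max_model_len=65536)  # 65K context as noted above

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=1024)
outputs = llm.generate(["Prove that the square root of 2 is irrational."], params)

for out in outputs:
    print(out.outputs[0].text)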

12. Citation

@misc{alpie2025core,
  title     = {Alpie-Core: A 4-bit Quantized Reasoning Model Surpassing Full-Precision Benchmarks},
  author    = {Alpie AI},
  year      = {2025},
  url       = {https://huggingface.co/alpie/Alpie-Core-4bit}
}

13. License

Apache 2.0 – Free for research and commercial use

14. Acknowledgements / Credits

We thank DeepSeek for the original model, which served as the foundation for this work. Our team fine-tuned the model and applied 4-bit quantization, improving efficiency and accuracy on downstream tasks. This model is built with respect for the contributions of the original authors and aims to provide a safe, high-performance solution for reasoning and inference.

15. Contact

For technical inquiries and support: [email protected]


For technical details, training methodology, and comprehensive evaluation results, please refer to our technical report.