|
|
--- |
|
|
datasets: |
|
|
- zwhe99/DeepMath-103K |
|
|
base_model: |
|
|
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
|
|
--- |
|
|
# AutoDeco |
|
|
Official Implementation of "[The End of Manual Decoding: Towards Truly End-to-End Language Models](https://arxiv.org/abs/2510.26697)" |
|
|
|
|
|
**AutoDeco** is a framework that equips Large Language Models (LLMs) with token-level adaptive decoding. By adding lightweight prediction heads on top of a pre-trained model, AutoDeco dynamically predicts the optimal temperature and top-p for each token during decoding. |
|
|
|
|
|
## 🎯 Key Features |
|
|
|
|
|
- **Token-Level Decoding Parameter Prediction**: Dynamically predict decoding parameters (temperature and top-p) for each generated token |
|
|
- **Lightweight Design**: Only adds two small MLP prediction heads (~5MB), without modifying the base model |
|
|
- **Universal Architecture**: Supports multiple mainstream LLM architectures (Llama, Qwen2/2.5, Qwen3, MoE models, etc.) |
|
|
- **End-to-End Training**: Trained end-to-end with only the standard cross-entropy loss, through which gradients implicitly reach the prediction heads |
|
|
- **Flexible Training**: Supports independent training of temperature head, top-p head, or joint training |
|
|
- **Efficient Deployment**: Only the AutoDeco head weights are saved during training; they are merged with the base model for inference |
|
|
|
|
|
## 🏗️ Architecture |
|
|
|
|
|
The AutoDeco framework consists of two core components: |
|
|
|
|
|
 |
|
|
|
|
|
### Model Workflow |
|
|
|
|
|
``` |
|
|
Input Tokens |
|
|
↓ |
|
|
Base LLM (frozen during head training) |
|
|
↓ |
|
|
Hidden States |
|
|
├──→ LM Head → Logits |
|
|
├──→ TempHead → Temperature |
|
|
└──→ TopPHead → Top-P |
|
|
``` |
|
|
|
|
|
During training, the base LLM parameters are frozen, and only the two prediction heads are trained. |
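
As a rough illustration (a sketch under assumed layer sizes and activations, not the paper's exact head design), each head can be viewed as a small MLP that maps each position's hidden state to one scalar:

```python
import torch
import torch.nn as nn

class AutoDecoHead(nn.Module):
    """Illustrative AutoDeco-style prediction head: a small MLP mapping each
    hidden state to one scalar per token. Layer sizes and output activations
    are assumptions for this sketch, not the released architecture."""

    def __init__(self, hidden_size: int, intermediate_size: int = 128,
                 out_act: nn.Module = nn.Softplus()):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate_size),
            nn.SiLU(),
            nn.Linear(intermediate_size, 1),
            out_act,
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size] from the frozen base LLM
        return self.mlp(hidden_states).squeeze(-1)  # [batch, seq_len]

hidden_size = 4096                                            # e.g. Llama-3.1-8B
temp_head = AutoDecoHead(hidden_size)                         # positive temperature
top_p_head = AutoDecoHead(hidden_size, out_act=nn.Sigmoid())  # top-p in (0, 1)
```

At this scale, two such heads add roughly a million parameters in total, consistent with the ~1-2M figures in the table below.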
|
|
|
|
|
## 🤖 Supported Models |
|
|
|
|
|
AutoDeco supports all mainstream autoregressive LLMs; the model architectures below are unified under a single `AutoDecoModelForCausalLM` interface. |
|
|
|
|
|
|
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
| **Base Model** | **#Base Params** | **#AutoDeco Params** | **Download** | |
|
|
| :------------: | :------------: | :------------: | :------------: | |
|
|
| Llama-3.1-Nemotron-Nano-8B-v1 | 8B | 2.1M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-Llama-Nemotron-8B) | |
|
|
| DeepSeek-R1-Distill-Qwen-7B | 7B | 1.84M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-R1-Distill-Qwen-7B) | |
|
|
| Qwen3-30B-A3B-Instruct-2507 | 30B | 1.05M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-Qwen3-30B-A3B-Instruct-2507) | |
|
|
| OpenAI-GPT-OSS-20B | 20B | 1.48M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-GPT-Oss-20B) | |
|
|
| OpenAI-GPT-OSS-120B | 120B | 1.48M | [🤗 HuggingFace](https://huggingface.co/Jadeislaw/AutoDeco-GPT-Oss-120B) | |
|
|
| Qwen3-235B-A22B-Thinking | 235B | 2.1M | [🤗 HuggingFace](https://huggingface.co/zacks917/AutoDeco-Qwen3-235B-A22B-Thinking-2507) | |
|
|
| DeepSeek-V3.1-Terminus | 671B | - | Coming Soon | |
|
|
|
|
|
</div> |
|
|
|
|
|
|
|
|
|
|
|
## 🚀 Installation |
|
|
|
|
|
### Recommended Requirements |
|
|
|
|
|
- Python >= 3.10 |
|
|
- PyTorch >= 2.0 |
|
|
- CUDA >= 12.0 (recommended for training) |
|
|
|
|
|
### Install Dependencies |
|
|
|
|
|
```bash |
|
|
# Clone the repository and enter it (replace <repo-url> with the AutoDeco repo URL)
git clone <repo-url>
cd AutoDeco
|
|
|
|
|
# Install core dependencies |
|
|
pip install -r requirements.txt |
|
|
|
|
|
# Optional: for training monitoring |
|
|
pip install wandb |
|
|
``` |
|
|
|
|
|
## 💡 Quick Start |
|
|
|
|
|
### Initialize AutoDeco Model |
|
|
|
|
|
```bash
|
|
python script/construct_autodeco.py \ |
|
|
--base_model_name_or_path path_to_your_base_LLM \ |
|
|
--output_dir path_to_your_AutoDeco_model |
|
|
``` |
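
Once constructed, the model can be loaded through the unified `AutoDecoModelForCausalLM` interface. A minimal sketch, assuming the class is importable from `model/templlm_auto.py` as in the project layout:

```python
from transformers import AutoTokenizer
from model.templlm_auto import AutoDecoModelForCausalLM  # assumed import path

model_path = "path_to_your_AutoDeco_model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoDecoModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("What is the meaning of life?", return_tensors="pt")
outputs = model(**inputs)

# Alongside the usual LM logits, the forward pass exposes the head predictions:
# outputs.logits       - regular language-model logits
# outputs.temp_logits  - predicted per-token temperature
# outputs.top_p_logits - predicted per-token top-p
```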
|
|
|
|
|
|
|
|
|
|
## 🔥 Training |
|
|
|
|
|
### Prepare Training Data |
|
|
|
|
|
Training data should be in JSONL format, with one sample per line. AutoDeco expects prompt-completion pairs, with any chat template already applied to the prompt: |
|
|
|
|
|
|
|
|
```json
{
  "prompt": "formatted prompt text",
  "completion": "expected completion"
}
```

For example:

```json
{
  "prompt": "<|im_start|>user\nEvaluate the limit:$$\\lim_{(x, y) \\to (1, 2)} \\frac{(x-1)(y-2)-x+3}{x^2-2x+y^2-4}$$\nMake sure you output the final answer within \\boxed{}<|im_end|>\n<|im_start|>assistant\n",
  "completion": "......### ✅ Final Answer:\n$$\n\\boxed{-1}\n$$"
}
```
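
If your data starts as raw question-answer pairs, a minimal sketch for producing this format via the tokenizer's chat template (paths and field names below are illustrative):

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path_to_your_base_LLM")

# hypothetical raw sample
sample = {
    "question": "Evaluate the limit ... within \\boxed{}",
    "answer": "### ✅ Final Answer:\n$$\n\\boxed{-1}\n$$",
}

# render the user turn with the model's chat template, ending at the
# assistant header so the completion continues from there
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": sample["question"]}],
    tokenize=False,
    add_generation_prompt=True,
)

with open("data/train_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps({"prompt": prompt, "completion": sample["answer"]},
                       ensure_ascii=False) + "\n")
```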
|
|
|
|
|
### Train AutoDeco Heads |
|
|
|
|
|
Use the provided training script: |
|
|
|
|
|
```bash |
|
|
# Edit script/trl_train.sh to configure parameters |
|
|
# Key parameters: |
|
|
# - MODEL_NAME_OR_PATH: Your initialized AutoDeco Model Path |
|
|
# - DATA_NAME: Training data filename (in data directory) |
|
|
# - MAX_LENGTH: Maximum sequence length |
|
|
# - train_temp: Whether to train temperature head |
|
|
# - train_top_p: Whether to train top-p head |
|
|
|
|
|
bash script/trl_train.sh |
|
|
``` |
|
|
|
|
|
Example training configuration: |
|
|
|
|
|
```bash |
|
|
# Train only temperature head |
|
|
accelerate launch trl_train.py \ |
|
|
--model_name_or_path AutoDeco-Llama-3.1-8B \ |
|
|
--dataset_name train_data.jsonl \ |
|
|
--train_temp true \ |
|
|
--train_top_p false \ |
|
|
--learning_rate 5e-6 \ |
|
|
--num_train_epochs 1 \ |
|
|
--output_dir ckpt/llama3_temp_head |
|
|
``` |
|
|
|
|
|
## 📊 Inference |
|
|
|
|
|
### Batch Evaluation with vLLM |
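
The vLLM integration requires installing vLLM from source first:

```bash
cd vllm
pip install -e .
```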
|
|
|
|
|
```bash |
|
|
# Single evaluation |
|
|
python llm_eval.py \ |
|
|
--model_name_or_path ckpt/autodeco_model \ |
|
|
--dataset aime24 \ |
|
|
--temp 1.0 \ |
|
|
--top_p 1.0 \ |
|
|
--k 16 \ |
|
|
--seed 42 |
|
|
|
|
|
# Batch evaluation with script (automatically generates multiple random seeds) |
|
|
bash script/test_generation.sh aime24 1.0 1.0 -1 1.0 path/to/model |
|
|
``` |
|
|
|
|
|
Evaluation results are saved in the `generation_log/` directory, including: |
|
|
- Pass@K metrics (see the estimator sketch below) |
|
|
- Average accuracy |
|
|
- Detailed generation results for each sample |
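
For reference, Pass@K over n samples with c correct is conventionally computed with the unbiased estimator 1 - C(n-c, k)/C(n, k); whether `llm_eval.py` uses exactly this form is an assumption. A self-contained sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K: 1 - C(n-c, k) / C(n, k) for n samples, c correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 16 samples per problem (--k 16), 5 of them correct, Pass@8:
print(f"{pass_at_k(16, 5, 8):.3f}")
```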
|
|
|
|
|
### Deploy with vLLM |
|
|
```bash
# example: serve a full model with merged heads (see "Merge AutoDeco Heads" below)
vllm serve path_to_your_full_model
```
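
Once served, the model is reachable through vLLM's OpenAI-compatible API (port 8000 by default). A hypothetical request; it sets no manual sampling parameters, on the assumption that the AutoDeco-patched server applies the predicted temperature and top-p:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "path_to_your_full_model",
        "prompt": "What is the meaning of life?",
        "max_tokens": 256
      }'
```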
|
|
|
|
|
## 📁 Project Structure |
|
|
```
AutoDeco/
├── model/                             # Model definitions
│   └── templlm_auto.py                # Unified AutoDeco model (recommended)
│
├── trainer/                           # Trainers
│   └── trl_Temp.py                    # AutoDeco trainer
│
├── script/                            # Scripts
│   ├── construct_autodeco.py          # Initialize an AutoDeco model from a base LLM
│   ├── trl_train.sh                   # Training launch script
│   ├── test_generation.sh             # Batch evaluation script
│   └── merge_autodeco.py              # Merge or split AutoDeco heads
│
├── config/                            # Configuration files
│   └── deepspeed/                     # DeepSpeed configuration
│       └── deepspeed_zero3_gradaccu4.yaml
│
├── trl_train.py                       # Training entry point
├── llm_eval.py                        # Evaluation entry point (vLLM)
├── boxed_extract.py                   # Answer extraction tool
├── requirements.txt                   # Dependencies
└── README.md                          # This document
```
|
|
|
|
|
## 🔧 Advanced Usage |
|
|
|
|
|
### 1. Extract AutoDeco Heads from a Full Model |
|
|
|
|
|
```bash
python script/merge_autodeco.py split \
|
|
--full-checkpoint path_to_your_full_model \ |
|
|
--output path_to_split_head |
|
|
``` |
|
|
|
|
|
This generates a lightweight checkpoint (~5MB) containing: |
|
|
- `config.json`: AutoDeco configuration (including base_model_name_or_path) |
|
|
- `autodeco_heads.safetensors`: Head weights |
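
A minimal sketch for inspecting such a checkpoint with the `safetensors` library; the tensor names inside the file aren't documented here, so they are printed rather than assumed:

```python
import json
from safetensors.torch import load_file

head_dir = "path_to_split_head"

with open(f"{head_dir}/config.json") as f:
    cfg = json.load(f)
print("base model:", cfg.get("base_model_name_or_path"))

# enumerate the head weights stored in the ~5MB checkpoint
heads = load_file(f"{head_dir}/autodeco_heads.safetensors")
for name, tensor in heads.items():
    print(name, tuple(tensor.shape))
```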
|
|
|
|
|
### 2. Merge AutoDeco Heads to Base Model (for vLLM Deployment) |
|
|
|
|
|
To create a complete model checkpoint with the heads merged in, as required by inference engines such as vLLM: |
|
|
|
|
|
```bash
python script/merge_autodeco.py merge \
|
|
--autodeco-path path_to_autodeco_heads \ |
|
|
--base-model-path path_to_base_LLM \ |
|
|
--output path_to_your_full_model |
|
|
``` |
|
|
|
|
|
|
|
|
## 📝 Citation |
|
|
|
|
|
If you use AutoDeco in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{wang2025endmanualdecodingtruly, |
|
|
title={The End of Manual Decoding: Towards Truly End-to-End Language Models}, |
|
|
author={Zhichao Wang and Dongyang Ma and Xinting Huang and Deng Cai and Tian Lan and Jiahao Xu and Haitao Mi and Xiaoying Tang and Yan Wang}, |
|
|
year={2025}, |
|
|
eprint={2510.26697}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2510.26697}, |
|
|
} |
|
|
``` |
|
|
|
|
|
<!-- ## Acknowledgments |
|
|
|
|
|
- Built on [Transformers](https://github.com/huggingface/transformers) and [TRL](https://github.com/huggingface/trl) |
|
|
- Training framework uses [DeepSpeed](https://github.com/microsoft/DeepSpeed) |
|
|
- Inference optimization uses [vLLM](https://github.com/vllm-project/vllm) --> |