---
|
|
language: en |
|
|
tags: |
|
|
- 2048-game |
|
|
- reinforcement-learning |
|
|
- qwen |
|
|
- game-playing |
|
|
- rl |
|
|
- grpo |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- custom |
|
|
--- |
|
|
|
|
|
# agent-2048-game-qwen-7b-8k-ds |
|
|
|
|
|
This model is a specialized game-playing AI trained to play the 2048 puzzle game using Group Relative Policy Optimization (GRPO), a reinforcement learning technique. Built on the Qwen-7B architecture, it demonstrates strategic planning and spatial reasoning within the game.
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Base Model:** Qwen-7B-Instruct |
|
|
- **Training Approach:** Group Relative Policy Optimization (GRPO) |
|
|
- **Training Dataset:** 8,000 carefully curated game states |
|
|
- **Hardware Used:** Single RTX 4090 (24GB) |
|
|
- **Training Time:** ~10 hours |
|
|
- **Framework:** Implemented with the trl library and accelerated by Unsloth
|
|
|
|
|
### Training Configuration |
|
|
- **Learning Rate:** 4e-5 (optimized after extensive testing) |
|
|
- **LoRA Rank:** 16 |
|
|
- **Max Sequence Length:** 1000 tokens |
|
|
- **Batch Size:** 1 (with gradient accumulation steps of 4) |
|
|
- **Optimizer:** paged_adamw_8bit |
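For reference, the hyperparameters above can be collected into a single configuration object; the sketch below uses a plain dictionary with illustrative key names (the actual trl/Unsloth training script is not reproduced here):

```python
# Hyperparameters reported in this card, gathered in one place.
# Key names are illustrative, not the exact trainer arguments.
training_config = {
    "learning_rate": 4e-5,             # tuned value reported above
    "lora_rank": 16,                   # LoRA adapter rank
    "max_seq_length": 1000,            # tokens
    "per_device_train_batch_size": 1,  # micro-batch size
    "gradient_accumulation_steps": 4,  # accumulate before each update
    "optim": "paged_adamw_8bit",       # memory-efficient 8-bit AdamW
}

# The effective batch size is the micro-batch size times the
# number of gradient accumulation steps.
effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])
```

With these settings the effective batch size is 4, which keeps the run within a single 24GB GPU.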
|
|
|
|
|
## Intended Use |
|
|
|
|
|
This model is designed to play the 2048 game by: |
|
|
1. Analyzing the current board state |
|
|
2. Planning strategic moves |
|
|
3. Maximizing score and achieving high-value tiles |
|
|
4. Maintaining efficient board organization |
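The move semantics the model reasons about can be sketched in plain Python; the helper below (hypothetical, not part of the model) applies a `left` move, which is useful for checking a predicted action against the game rules:

```python
def merge_row_left(row):
    """Slide one row left and merge equal neighbours, 2048-style.
    0 represents an empty cell."""
    tiles = [t for t in row if t != 0]           # drop empties
    merged = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)          # each pair merges once
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))

def move_left(board):
    """Apply a left move to a 4x4 board given as a list of 4 rows."""
    return [merge_row_left(row) for row in board]
```

For example, `move_left([[2, 2, 4, 0]] + [[0, 0, 0, 0]] * 3)` returns a board whose first row is `[4, 4, 0, 0]`.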
|
|
|
|
|
## Training Data |
|
|
|
|
|
The training data was generated through a multi-stage pipeline:
|
|
- Simulated gameplay for realistic board states |
|
|
- Custom difficulty scoring system |
|
|
- 5-level difficulty classification |
|
|
- Balanced sampling across difficulty levels |
|
|
- Parallel processing for efficient generation |
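Balanced sampling across the five difficulty levels can be sketched as follows; the function and field names are hypothetical, since the actual pipeline code is not published with this card:

```python
import random
from collections import defaultdict

def balanced_sample(states, per_level, seed=0):
    """Draw an equal number of game states from each difficulty level.

    `states` is a list of (board, difficulty) pairs, with difficulty
    classified into levels 1..5.
    """
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for board, level in states:
        by_level[level].append(board)
    sample = []
    for level in sorted(by_level):
        # Sample the same number of boards from every level.
        sample.extend(rng.sample(by_level[level], per_level))
    return sample
```

Drawing a fixed count per level is what keeps hard late-game boards from being swamped by the far more common easy positions.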
|
|
|
|
|
## Training Approach |
|
|
|
|
|
### Reward System |
|
|
The model was trained using multiple reward components: |
|
|
1. **Density Reward:** Encourages efficient tile merging and space utilization |
|
|
2. **Highest Tile Reward:** Incentivizes creation of high-value tiles |
|
|
3. **Survival Reward:** Promotes moves that maintain game continuity |
|
|
4. **Format Compliance:** Ensures proper response structure |
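A simple weighted sum of these components might look like the sketch below; the weights and helper definitions are illustrative assumptions, not the values used in training:

```python
def density_reward(board):
    """Fraction of empty cells: rewards merging and space use (illustrative)."""
    cells = [c for row in board for c in row]
    return cells.count(0) / len(cells)

def highest_tile_reward(board, target=2048):
    """Highest tile relative to the 2048 target (illustrative)."""
    return max(c for row in board for c in row) / target

def total_reward(board, survived, format_ok, w=(1.0, 1.0, 0.5, 0.5)):
    """Combine the four reward components described above with weights w."""
    return (w[0] * density_reward(board)
            + w[1] * highest_tile_reward(board)
            + w[2] * (1.0 if survived else 0.0)
            + w[3] * (1.0 if format_ok else 0.0))
```

In GRPO the reward is compared across a group of sampled responses for the same board, so only the relative ordering of these scores within a group matters.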
|
|
|
|
|
### Optimization |
|
|
- Used Unsloth, which reports roughly 2x faster fine-tuning
|
|
- 4-bit quantization for efficient training |
|
|
- Implemented efficient LoRA adaptation |
|
|
|
|
|
## Performance and Limitations |
|
|
|
|
|
### Strengths |
|
|
- Strong strategic planning capabilities |
|
|
- Efficient tile merging and space management |
|
|
- Consistent high-score achievement |
|
|
- Structured decision-making process |
|
|
|
|
|
### Limitations |
|
|
- Performance may vary with random seeds |
|
|
- Success not guaranteed due to game's inherent randomness |
|
|
- Model requires specific input formatting |
|
|
|
|
|
## Example Usage |
|
|
|
|
|
```python |
|
|
# Format your 4x4 game board as a string |
|
|
board_state = """ |
|
|
2 | 4 | 8 | 16 |
|
|
. | . | 2 | 4 |
|
|
. | . | . | 2 |
|
|
. | . | . | . |
|
|
""" |
|
|
|
|
|
# Model will output one of: up, down, left, right |
|
|
``` |
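Because the model must answer with exactly one of the four directions, a small parser (a hypothetical helper, assuming plain-text replies) can extract and validate the chosen move:

```python
import re

VALID_MOVES = {"up", "down", "left", "right"}

def extract_move(response):
    """Return the first valid move named in the model's response,
    or None if the output does not follow the expected format."""
    for token in re.findall(r"[a-z]+", response.lower()):
        if token in VALID_MOVES:
            return token
    return None
```

For example, `extract_move("I choose LEFT to merge the 2s.")` returns `"left"`, while output that names no valid direction yields `None`.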
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{dalal2024agent2048blog, |
|
|
author = {Dalal, Hrishbh}, |
|
|
title = {Agent 2048: Forging Strategic Gameplay in an AI Through Data, Rewards, and RL}, |
|
|
year = {2024}, |
|
|
month = {March}, |
|
|
url = {https://yourwebsite.com/blog/ai-agent-plays-2048}, |
|
|
note = {[Blog post] Accessed: March 30, 2024} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Author |
|
|
|
|
|
Hrishbh Dalal |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
Special thanks to the research community on Twitter/X for valuable feedback on data generation strategies and training approaches. |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the Apache 2.0 license. |
|
|
|