---
language: en
tags:
- 2048-game
- reinforcement-learning
- qwen
- game-playing
- rl
- grpo
license: apache-2.0
datasets:
- custom
---
# agent-2048-game-qwen-7b-8k-ds
This model is a specialized game-playing AI trained to master the 2048 puzzle game using advanced reinforcement learning techniques. Based on the Qwen-7B architecture, it demonstrates sophisticated strategic planning and spatial reasoning capabilities.
## Model Description
- **Base Model:** Qwen-7B-Instruct
- **Training Approach:** Group Relative Policy Optimization (GRPO)
- **Training Dataset:** 8,000 carefully curated game states
- **Hardware Used:** Single RTX 4090 (24GB)
- **Training Time:** ~10 hours
- **Framework:** Implemented with the TRL library and accelerated by Unsloth
### Training Configuration
- **Learning Rate:** 4e-5 (optimized after extensive testing)
- **LoRA Rank:** 16
- **Max Sequence Length:** 1000 tokens
- **Batch Size:** 1 (with gradient accumulation steps of 4)
- **Optimizer:** paged_adamw_8bit
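The settings above can be summarized as a plain configuration dict; the key names here are illustrative (the exact argument names used by TRL/Unsloth in the actual run may differ):

```python
# Illustrative summary of the hyperparameters listed above.
training_config = {
    "learning_rate": 4e-5,
    "lora_rank": 16,
    "max_seq_length": 1000,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "optim": "paged_adamw_8bit",
}

# With gradient accumulation, the effective batch size is
# batch_size x accumulation_steps.
effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])
print(effective_batch)  # 4
```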
## Intended Use
This model is designed to play the 2048 game by:
1. Analyzing the current board state
2. Planning strategic moves
3. Maximizing score and achieving high-value tiles
4. Maintaining efficient board organization
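For reference, the game mechanics the model reasons about can be sketched as a minimal left-move implementation (this is an illustration of 2048's rules, not the training environment's actual code):

```python
def merge_row_left(row):
    """Slide one row left and merge equal neighbors, 2048-style.
    0 represents an empty cell."""
    tiles = [v for v in row if v != 0]          # slide: drop empties
    merged, i = [], 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)          # merge one equal pair
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))  # pad with empties

def move_left(board):
    """Apply a left move to a board given as a list of rows."""
    return [merge_row_left(row) for row in board]
```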
## Training Data
The training data was generated through a sophisticated pipeline:
- Simulated gameplay for realistic board states
- Custom difficulty scoring system
- 5-level difficulty classification
- Balanced sampling across difficulty levels
- Parallel processing for efficient generation
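The exact difficulty scoring system is not published with this card; as a rough sketch, a 5-level classifier could combine the highest tile and the remaining free space, for example:

```python
import math

def difficulty_level(board, levels=5):
    """Hypothetical 5-level difficulty score for a board state.
    Stand-in heuristic: higher tiles and fewer empty cells -> harder."""
    flat = [v for row in board for v in row]
    highest = max(flat)
    empties = sum(1 for v in flat if v == 0)
    # Normalize: log2(2048) = 11 caps the tile term at ~1.0.
    score = (math.log2(highest) / 11) * 0.7 + (1 - empties / 16) * 0.3
    return min(levels, max(1, 1 + int(score * levels)))
```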
## Training Approach
### Reward System
The model was trained using multiple reward components:
1. **Density Reward:** Encourages efficient tile merging and space utilization
2. **Highest Tile Reward:** Incentivizes creation of high-value tiles
3. **Survival Reward:** Promotes moves that maintain game continuity
4. **Format Compliance:** Ensures proper response structure
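The card lists the reward components but not their formulas or weights; one plausible sketch of how such a composite reward could be wired together (all definitions and weights here are assumptions):

```python
import math
import re

def density_reward(board):
    """Space utilization: fraction of empty cells on the board."""
    flat = [v for row in board for v in row]
    return sum(1 for v in flat if v == 0) / len(flat)

def highest_tile_reward(board):
    """High-value tiles, scaled so a 2048 tile gives 1.0."""
    return math.log2(max(v for row in board for v in row)) / 11

def survival_reward(game_over):
    """Reward moves that keep the game alive."""
    return 0.0 if game_over else 1.0

def format_reward(response):
    """Reward responses that contain exactly one valid move token."""
    moves = re.findall(r"\b(up|down|left|right)\b", response.lower())
    return 1.0 if len(moves) == 1 else 0.0

def total_reward(board, game_over, response, weights=(0.3, 0.3, 0.2, 0.2)):
    parts = (density_reward(board), highest_tile_reward(board),
             survival_reward(game_over), format_reward(response))
    return sum(w * p for w, p in zip(weights, parts))
```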
### Optimization
- Utilized Unsloth for 2x faster fine-tuning
- 4-bit quantization for efficient training
- Implemented efficient LoRA adaptation
## Performance and Limitations
### Strengths
- Strong strategic planning capabilities
- Efficient tile merging and space management
- Consistent high-score achievement
- Structured decision-making process
### Limitations
- Performance may vary with random seeds
- Success not guaranteed due to game's inherent randomness
- Model requires specific input formatting
## Example Usage
```python
# Format your 4x4 game board as a string
board_state = """
2 | 4 | 8 | 16
. | . | 2 | 4
. | . | . | 2
. | . | . | .
"""
# Model will output one of: up, down, left, right
```
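Hypothetical helpers for round-tripping this format — parsing the board string into a grid and extracting the chosen move from the model's raw text (the function names are illustrative, not part of the model's API):

```python
import re

def parse_board(board_state):
    """Parse the pipe-delimited board string into a grid of ints,
    with 0 standing in for the empty-cell marker '.'."""
    rows = []
    for line in board_state.strip().splitlines():
        cells = [c.strip() for c in line.split("|")]
        rows.append([0 if c == "." else int(c) for c in cells])
    return rows

def extract_move(response):
    """Pull the first valid move out of the model's text, or None."""
    m = re.search(r"\b(up|down|left|right)\b", response.lower())
    return m.group(1) if m else None
```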
## Citation
```bibtex
@misc{dalal2024agent2048blog,
  author = {Dalal, Hrishbh},
  title  = {Agent 2048: Forging Strategic Gameplay in an AI Through Data, Rewards, and RL},
  year   = {2024},
  month  = {March},
  url    = {https://yourwebsite.com/blog/ai-agent-plays-2048},
  note   = {[Blog post] Accessed: March 30, 2024}
}
```
## Author
Hrishbh Dalal
## Acknowledgments
Special thanks to the research community on Twitter/X for valuable feedback on data generation strategies and training approaches.
## License
This model is released under the Apache 2.0 license.