---
language: en
tags:
- 2048-game
- reinforcement-learning
- qwen
- game-playing
- rl
- grpo
license: apache-2.0
datasets:
- custom
---
# agent-2048-game-qwen-7b-8k-ds
This model is a specialized game-playing AI trained to master the 2048 puzzle game using advanced reinforcement learning techniques. Based on the Qwen-7B architecture, it demonstrates sophisticated strategic planning and spatial reasoning capabilities.
## Model Description
- **Base Model:** Qwen-7B-Instruct
- **Training Approach:** Group Relative Policy Optimization (GRPO)
- **Training Dataset:** 8,000 carefully curated game states
- **Hardware Used:** Single RTX 4090 (24GB)
- **Training Time:** ~10 hours
- **Framework:** Implemented with the TRL library and accelerated by Unsloth
### Training Configuration
- **Learning Rate:** 4e-5 (optimized after extensive testing)
- **LoRA Rank:** 16
- **Max Sequence Length:** 1000 tokens
- **Batch Size:** 1 (with gradient accumulation steps of 4)
- **Optimizer:** paged_adamw_8bit
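The settings above can be summarized as a plain configuration dict; the key names here are illustrative (the exact argument names used by TRL/Unsloth in the actual run may differ):

```python
# Illustrative summary of the hyperparameters listed above.
training_config = {
    "learning_rate": 4e-5,
    "lora_rank": 16,
    "max_seq_length": 1000,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "optim": "paged_adamw_8bit",
}

# With gradient accumulation, the effective batch size is
# batch_size x accumulation_steps.
effective_batch = (training_config["per_device_train_batch_size"]
                   * training_config["gradient_accumulation_steps"])
print(effective_batch)  # 4
```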
## Intended Use
This model is designed to play the 2048 game by:
1. Analyzing the current board state
2. Planning strategic moves
3. Maximizing score and achieving high-value tiles
4. Maintaining efficient board organization
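For reference, the game mechanics the model reasons about can be sketched as a minimal left-move implementation (this is an illustration of 2048's rules, not the training environment's actual code):

```python
def merge_row_left(row):
    """Slide one row left and merge equal neighbors, 2048-style.
    0 represents an empty cell."""
    tiles = [v for v in row if v != 0]          # slide: drop empties
    merged, i = [], 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)          # merge one equal pair
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))  # pad with empties

def move_left(board):
    """Apply a left move to a board given as a list of rows."""
    return [merge_row_left(row) for row in board]
```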
## Training Data
The training data was generated through a sophisticated pipeline:
- Simulated gameplay for realistic board states
- Custom difficulty scoring system
- 5-level difficulty classification
- Balanced sampling across difficulty levels
- Parallel processing for efficient generation
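The exact difficulty scoring system is not published with this card; as a rough sketch, a 5-level classifier could combine the highest tile and the remaining free space, for example:

```python
import math

def difficulty_level(board, levels=5):
    """Hypothetical 5-level difficulty score for a board state.
    Stand-in heuristic: higher tiles and fewer empty cells -> harder."""
    flat = [v for row in board for v in row]
    highest = max(flat)
    empties = sum(1 for v in flat if v == 0)
    # Normalize: log2(2048) = 11 caps the tile term at ~1.0.
    score = (math.log2(highest) / 11) * 0.7 + (1 - empties / 16) * 0.3
    return min(levels, max(1, 1 + int(score * levels)))
```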
## Training Approach
### Reward System
The model was trained using multiple reward components:
1. **Density Reward:** Encourages efficient tile merging and space utilization
2. **Highest Tile Reward:** Incentivizes creation of high-value tiles
3. **Survival Reward:** Promotes moves that maintain game continuity
4. **Format Compliance:** Ensures proper response structure
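The card lists the reward components but not their formulas or weights; one plausible sketch of how such a composite reward could be wired together (all definitions and weights here are assumptions):

```python
import math
import re

def density_reward(board):
    """Space utilization: fraction of empty cells on the board."""
    flat = [v for row in board for v in row]
    return sum(1 for v in flat if v == 0) / len(flat)

def highest_tile_reward(board):
    """High-value tiles, scaled so a 2048 tile gives 1.0."""
    return math.log2(max(v for row in board for v in row)) / 11

def survival_reward(game_over):
    """Reward moves that keep the game alive."""
    return 0.0 if game_over else 1.0

def format_reward(response):
    """Reward responses that contain exactly one valid move token."""
    moves = re.findall(r"\b(up|down|left|right)\b", response.lower())
    return 1.0 if len(moves) == 1 else 0.0

def total_reward(board, game_over, response, weights=(0.3, 0.3, 0.2, 0.2)):
    parts = (density_reward(board), highest_tile_reward(board),
             survival_reward(game_over), format_reward(response))
    return sum(w * p for w, p in zip(weights, parts))
```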
### Optimization
- Utilized Unsloth for 2x faster fine-tuning
- 4-bit quantization for efficient training
- Implemented efficient LoRA adaptation
## Performance and Limitations
### Strengths
- Strong strategic planning capabilities
- Efficient tile merging and space management
- Consistent high-score achievement
- Structured decision-making process
### Limitations
- Performance may vary with random seeds
- Success not guaranteed due to game's inherent randomness
- Model requires specific input formatting
## Example Usage
```python
# Format your 4x4 game board as a string
board_state = """
2 | 4 | 8 | 16
. | . | 2 | 4
. | . | . | 2
. | . | . | .
"""
# Model will output one of: up, down, left, right
```
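Hypothetical helpers for round-tripping this format — parsing the board string into a grid and extracting the chosen move from the model's raw text (the function names are illustrative, not part of the model's API):

```python
import re

def parse_board(board_state):
    """Parse the pipe-delimited board string into a grid of ints,
    with 0 standing in for the empty-cell marker '.'."""
    rows = []
    for line in board_state.strip().splitlines():
        cells = [c.strip() for c in line.split("|")]
        rows.append([0 if c == "." else int(c) for c in cells])
    return rows

def extract_move(response):
    """Pull the first valid move out of the model's text, or None."""
    m = re.search(r"\b(up|down|left|right)\b", response.lower())
    return m.group(1) if m else None
```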
## Citation
```bibtex
@misc{dalal2024agent2048blog,
  author = {Dalal, Hrishbh},
  title  = {Agent 2048: Forging Strategic Gameplay in an AI Through Data, Rewards, and RL},
  year   = {2024},
  month  = {March},
  url    = {https://yourwebsite.com/blog/ai-agent-plays-2048},
  note   = {[Blog post] Accessed: March 30, 2024}
}
```
## Author
Hrishbh Dalal
## Acknowledgments
Special thanks to the research community on Twitter/X for valuable feedback on data generation strategies and training approaches.
## License
This model is released under the Apache 2.0 license.