# HRM-MoE: Hierarchical Recurrent Memory with Mixture of Experts
HRM-MoE is an experimental language model that combines:
- Hierarchical Recurrent Memory (HRM) architecture for deep reasoning
- Mixture of Experts (MoE) for efficient scaling
## Model Description
This model integrates HRM's Specialist/Manager hierarchy into a Mixture of Experts framework, allowing different experts to specialize in various aspects of language understanding.
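The Specialist/Manager split is easiest to see in code. The sketch below is only illustrative, assuming the Specialist is a fast per-token recurrent module and the Manager a slower module that updates every few steps; the module names, cell types, and update stride are assumptions, not taken from the released implementation.

```python
import torch
import torch.nn as nn

class HRMExpertSketch(nn.Module):
    """Illustrative HRM-style expert: a Specialist GRU updates every token,
    while a Manager GRU summarizes the Specialist state at a coarser stride."""

    def __init__(self, d_model: int = 512, stride: int = 4):
        super().__init__()
        self.specialist = nn.GRUCell(d_model, d_model)  # low-level, per-token updates
        self.manager = nn.GRUCell(d_model, d_model)     # high-level, every `stride` tokens
        self.out = nn.Linear(2 * d_model, d_model)
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        b, t, d = x.shape
        h_s = x.new_zeros(b, d)  # Specialist hidden state
        h_m = x.new_zeros(b, d)  # Manager hidden state
        outs = []
        for i in range(t):
            h_s = self.specialist(x[:, i], h_s)
            if (i + 1) % self.stride == 0:  # the Manager steps less often
                h_m = self.manager(h_s, h_m)
            outs.append(self.out(torch.cat([h_s, h_m], dim=-1)))
        return torch.stack(outs, dim=1)
```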
## Architecture
- Total Parameters: 228,352,512
- Expert Parameters: 204,506,112
- Non-Expert Parameters: 23,846,400
- Embedding Dimension: 512
- Layers: 6
- Attention Heads: 8
- FFN Dimension: 2048
- Number of Experts: 8
- Experts per Token: 2
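For orientation, the hyperparameters above can be collected into a configuration object along the lines below. The field names (and the `vocab_size` inferred from the t5-small tokenizer used in the Usage section) are assumptions, not the repository's actual config class.

```python
from dataclasses import dataclass

@dataclass
class HRMMoEConfig:
    # Values taken from the architecture summary above; field names are assumptions.
    vocab_size: int = 32128        # t5-small vocabulary size (assumed from the Usage section)
    d_model: int = 512             # embedding dimension
    n_layers: int = 6
    n_heads: int = 8
    d_ffn: int = 2048
    n_experts: int = 8
    experts_per_token: int = 2     # top-k routing with k = 2
```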
## Expert Types
- GLU/GEGLU Experts (4): Standard gated linear units
- Pattern Experts (2): Deep FFN for pattern recognition
- Local Conv Experts (1): Local neighborhood operations
- HRM Experts (1): Hierarchical reasoning with Specialist/Manager modules
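As a rough illustration of how a pool like this is dispatched, here is a minimal top-2 MoE layer. It uses plain FFN experts and a simple softmax router for brevity; the real layer mixes the four expert families above and, as noted under Features, routes with Gumbel-Softmax. The assumption that every expert shares a per-token `(d_model -> d_model)` interface is this sketch's, not the repository's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoESketch(nn.Module):
    """Simplified top-2 MoE layer: each token goes to its two highest-scoring
    experts and the outputs are combined with renormalized gate weights."""

    def __init__(self, d_model: int = 512, d_ffn: int = 2048, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)         # (tokens, n_experts)
        weights, idx = gates.topk(self.k, dim=-1)         # top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                             # tokens that selected expert e
            if mask.any():
                token_ids, slot = mask.nonzero(as_tuple=True)
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```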
## Training
- Dataset: wikimedia/wikipedia
- Training Epochs: 1
- Batch Size: 8 (effective: 32)
- Learning Rate: 5e-05 → 1e-06 (cosine)
- Mixed Precision: Enabled
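The cosine decay from 5e-05 down to 1e-06 can be written as a standard schedule; the snippet below is a generic sketch, where the optimizer choice, the total step count, and the absence of warmup are assumptions rather than values read from the training script.

```python
import math
import torch

def cosine_lr(step: int, total_steps: int, lr_max: float = 5e-5, lr_min: float = 1e-6) -> float:
    """Cosine decay from lr_max down to lr_min over total_steps."""
    progress = min(step / max(total_steps, 1), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Example: attach the schedule to any optimizer via LambdaLR (10_000 steps is a placeholder).
model = torch.nn.Linear(512, 512)  # stand-in module
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda s: cosine_lr(s, total_steps=10_000) / 5e-5)
```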
## Latest Performance (Epoch 0)
- Validation Loss: 6.3832
- Validation Perplexity: 591.83
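Validation perplexity is simply the exponential of the mean validation cross-entropy, which is consistent with the two numbers above:

```python
import math

val_loss = 6.3832
print(math.exp(val_loss))  # ≈ 591.8, matching the reported validation perplexity
```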
## Features
- Adaptive Routing: Gumbel-Softmax with temperature annealing
- Load Balancing: Importance, load, and entropy regularization
- Expert Specialization: Diverse expert types for different aspects of language
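A hedged sketch of the two routing ideas named above: gate probabilities drawn with Gumbel-Softmax under an annealed temperature, plus simple importance- and entropy-style balancing terms. The annealing schedule, the exact loss formulas, and their weighting are assumptions; the training script may combine them differently.

```python
import torch
import torch.nn.functional as F

def route_with_gumbel(logits: torch.Tensor, step: int, total_steps: int,
                      t_start: float = 1.0, t_end: float = 0.1) -> torch.Tensor:
    """Soft routing probabilities via Gumbel-Softmax, with the temperature
    annealed linearly from t_start to t_end over training (schedule is an assumption)."""
    tau = t_start + (t_end - t_start) * min(step / max(total_steps, 1), 1.0)
    return F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)

def balance_losses(probs: torch.Tensor) -> torch.Tensor:
    """Simple auxiliary terms: squared coefficient of variation of per-expert
    importance (encourages even usage) minus the mean routing entropy."""
    importance = probs.sum(dim=0)                               # (n_experts,)
    cv_sq = importance.var() / (importance.mean() ** 2 + 1e-9)  # load-balancing term
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    return cv_sq - entropy  # minimized jointly with the language-modeling loss
```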
## Usage

```python
import torch
from transformers import T5Tokenizer

# Load tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small", use_fast=False)

# Load model (you'll need the model architecture from the repo)
# See: https://github.com/your-repo/hrm-moe

# Generate text
# (example code here)
```
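For concreteness, here is a rough sketch of what greedy decoding could look like once the model class from the repository is importable. The `HRMMoEForCausalLM` name, the checkpoint path, and the assumption that the forward pass returns `.logits` are placeholders, not part of this model card.

```python
import torch
from transformers import T5Tokenizer

# Hypothetical: the model class lives in the HRM-MoE repository, not in transformers.
# from hrm_moe.modeling import HRMMoEForCausalLM
# model = HRMMoEForCausalLM.from_pretrained("path/to/checkpoint").eval()

tokenizer = T5Tokenizer.from_pretrained("t5-small", use_fast=False)

@torch.no_grad()
def greedy_generate(model, prompt: str, max_new_tokens: int = 50) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits                     # assumes a causal-LM forward returning logits
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:   # stop at end-of-sequence
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```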
## Citation
If you use this model, please cite the original HRM paper:
```bibtex
@article{hrm2024,
  title={Hierarchical Reasoning Model},
  author={...},
  journal={arXiv preprint},
  year={2024}
}
```
## License
Apache 2.0
Generated with HRM-MoE Training Script