# HRM-MoE: Hierarchical Recurrent Memory with Mixture of Experts
HRM-MoE is an experimental language model that combines:
- Hierarchical Recurrent Memory (HRM) architecture for deep reasoning
- Mixture of Experts (MoE) for efficient scaling
## Model Description
This model integrates HRM's Specialist/Manager hierarchy into a Mixture of Experts framework, allowing different experts to specialize in various aspects of language understanding.
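The Specialist/Manager split is easiest to see in code. The sketch below is only illustrative, assuming the Specialist is a fast per-token recurrent module and the Manager a slower module that updates every few steps; the module names, cell types, and update stride are assumptions, not taken from the released implementation.

```python
import torch
import torch.nn as nn

class HRMExpertSketch(nn.Module):
    """Illustrative HRM-style expert: a Specialist GRU updates every token,
    while a Manager GRU summarizes the Specialist state at a coarser stride."""

    def __init__(self, d_model: int = 512, stride: int = 4):
        super().__init__()
        self.specialist = nn.GRUCell(d_model, d_model)  # low-level, per-token updates
        self.manager = nn.GRUCell(d_model, d_model)     # high-level, every `stride` tokens
        self.out = nn.Linear(2 * d_model, d_model)
        self.stride = stride

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        b, t, d = x.shape
        h_s = x.new_zeros(b, d)  # Specialist hidden state
        h_m = x.new_zeros(b, d)  # Manager hidden state
        outs = []
        for i in range(t):
            h_s = self.specialist(x[:, i], h_s)
            if (i + 1) % self.stride == 0:  # the Manager steps less often
                h_m = self.manager(h_s, h_m)
            outs.append(self.out(torch.cat([h_s, h_m], dim=-1)))
        return torch.stack(outs, dim=1)
```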
## Architecture
- Total Parameters: 228,352,512
- Expert Parameters: 204,506,112
- Non-Expert Parameters: 23,846,400
- Embedding Dimension: 512
- Layers: 6
- Attention Heads: 8
- FFN Dimension: 2048
- Number of Experts: 8
- Experts per Token: 2
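For orientation, the hyperparameters above can be collected into a configuration object along the lines below. The field names (and the `vocab_size` inferred from the t5-small tokenizer used in the Usage section) are assumptions, not the repository's actual config class.

```python
from dataclasses import dataclass

@dataclass
class HRMMoEConfig:
    # Values taken from the architecture summary above; field names are assumptions.
    vocab_size: int = 32128        # t5-small vocabulary size (assumed from the Usage section)
    d_model: int = 512             # embedding dimension
    n_layers: int = 6
    n_heads: int = 8
    d_ffn: int = 2048
    n_experts: int = 8
    experts_per_token: int = 2     # top-k routing with k = 2
```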
## Expert Types
- GLU/GEGLU Experts (4): Standard gated linear units
- Pattern Experts (2): Deep FFN for pattern recognition
- Local Conv Experts (1): Local neighborhood operations
- HRM Experts (1): Hierarchical reasoning with Specialist/Manager modules
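As a rough illustration of how a pool like this is dispatched, here is a minimal top-2 MoE layer. It uses plain FFN experts and a simple softmax router for brevity; the real layer mixes the four expert families above and, as noted under Features, routes with Gumbel-Softmax. The assumption that every expert shares a per-token `(d_model -> d_model)` interface is this sketch's, not the repository's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoESketch(nn.Module):
    """Simplified top-2 MoE layer: each token goes to its two highest-scoring
    experts and the outputs are combined with renormalized gate weights."""

    def __init__(self, d_model: int = 512, d_ffn: int = 2048, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)         # (tokens, n_experts)
        weights, idx = gates.topk(self.k, dim=-1)         # top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                             # tokens that selected expert e
            if mask.any():
                token_ids, slot = mask.nonzero(as_tuple=True)
                out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out
```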
## Training
- Dataset: wikimedia/wikipedia
- Training Epochs: 1
- Batch Size: 8 (effective: 32)
- Learning Rate: 5e-05 → 1e-06 (cosine)
- Mixed Precision: Enabled
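The cosine decay from 5e-05 down to 1e-06 can be written as a standard schedule; the snippet below is a generic sketch, where the optimizer choice, the total step count, and the absence of warmup are assumptions rather than values read from the training script.

```python
import math
import torch

def cosine_lr(step: int, total_steps: int, lr_max: float = 5e-5, lr_min: float = 1e-6) -> float:
    """Cosine decay from lr_max down to lr_min over total_steps."""
    progress = min(step / max(total_steps, 1), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Example: attach the schedule to any optimizer via LambdaLR (10_000 steps is a placeholder).
model = torch.nn.Linear(512, 512)  # stand-in module
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lambda s: cosine_lr(s, total_steps=10_000) / 5e-5)
```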
## Latest Performance (Epoch 0)
- Validation Loss: 6.3832
- Validation Perplexity: 591.83
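Validation perplexity is simply the exponential of the mean validation cross-entropy, which is consistent with the two numbers above:

```python
import math

val_loss = 6.3832
print(math.exp(val_loss))  # ≈ 591.8, matching the reported validation perplexity
```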
## Features
- Adaptive Routing: Gumbel-Softmax with temperature annealing
- Load Balancing: Importance, load, and entropy regularization
- Expert Specialization: Diverse expert types for different aspects of language
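A hedged sketch of the two routing ideas named above: gate probabilities drawn with Gumbel-Softmax under an annealed temperature, plus simple importance- and entropy-style balancing terms. The annealing schedule, the exact loss formulas, and their weighting are assumptions; the training script may combine them differently.

```python
import torch
import torch.nn.functional as F

def route_with_gumbel(logits: torch.Tensor, step: int, total_steps: int,
                      t_start: float = 1.0, t_end: float = 0.1) -> torch.Tensor:
    """Soft routing probabilities via Gumbel-Softmax, with the temperature
    annealed linearly from t_start to t_end over training (schedule is an assumption)."""
    tau = t_start + (t_end - t_start) * min(step / max(total_steps, 1), 1.0)
    return F.gumbel_softmax(logits, tau=tau, hard=False, dim=-1)

def balance_losses(probs: torch.Tensor) -> torch.Tensor:
    """Simple auxiliary terms: squared coefficient of variation of per-expert
    importance (encourages even usage) minus the mean routing entropy."""
    importance = probs.sum(dim=0)                               # (n_experts,)
    cv_sq = importance.var() / (importance.mean() ** 2 + 1e-9)  # load-balancing term
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()
    return cv_sq - entropy  # minimized jointly with the language-modeling loss
```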
## Usage

```python
import torch
from transformers import T5Tokenizer

# Load tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-small", use_fast=False)

# Load model (you'll need the model architecture from the repo)
# See: https://github.com/your-repo/hrm-moe

# Generate text
# (example code here)
```
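For concreteness, here is a rough sketch of what greedy decoding could look like once the model class from the repository is importable. The `HRMMoEForCausalLM` name, the checkpoint path, and the assumption that the forward pass returns `.logits` are placeholders, not part of this model card.

```python
import torch
from transformers import T5Tokenizer

# Hypothetical: the model class lives in the HRM-MoE repository, not in transformers.
# from hrm_moe.modeling import HRMMoEForCausalLM
# model = HRMMoEForCausalLM.from_pretrained("path/to/checkpoint").eval()

tokenizer = T5Tokenizer.from_pretrained("t5-small", use_fast=False)

@torch.no_grad()
def greedy_generate(model, prompt: str, max_new_tokens: int = 50) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits                     # assumes a causal-LM forward returning logits
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:   # stop at end-of-sequence
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```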
## Citation
If you use this model, please cite the original HRM paper:
```bibtex
@article{hrm2024,
  title={Hierarchical Reasoning Model},
  author={...},
  journal={arXiv preprint},
  year={2024}
}
```
## License
Apache 2.0
Generated with HRM-MoE Training Script