Imagen
A prototype modern diffusion transformer trained on high-quality recaptioned data
| ![]() | ![]() | ![]() | ![]() | ![]() |
|---|---|---|---|---|
| "A silver pot on wood" | "Entrance of a luxury house" | "A house in a green field" | "A forest" | "A handwritten poem" |
Pipeline: Text Prompt → T5-Large (770M) → DiT-320M (+ noise and timestep) → SDXL VAE Decoder → Image (256×256)
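As a rough illustration of that pipeline, the sketch below strings the three stages together. Only the T5 and VAE calls use real `transformers`/`diffusers` APIs; the DiT denoising loop is a placeholder comment, and the token length (77) and latent scaling are assumptions, not values taken from this repo.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import AutoencoderKL

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Text Prompt -> T5-Large: encode the prompt into a (1, L, 1024) context.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-large").to(device).eval()
tokens = tokenizer("A forest at sunset", return_tensors="pt",
                   padding="max_length", max_length=77, truncation=True).to(device)
with torch.no_grad():
    context = text_encoder(**tokens).last_hidden_state

# 2) DiT-320M: denoise a random latent over the sampling timesteps, conditioned
#    on the context and timestep. The DiT model and its step function are placeholders.
latent = torch.randn(1, 4, 32, 32, device=device)   # 256x256 image -> 32x32x4 SDXL latent
# for t in ddim_timesteps:
#     latent = ddim_step(dit, latent, t, context, cfg_scale=7.5)   # hypothetical

# 3) SDXL VAE Decoder: map the final latent back to a 256x256 RGB image.
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").to(device).eval()
with torch.no_grad():
    image = vae.decode(latent / vae.config.scaling_factor).sample  # scaling is an assumption
```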
Model specifications:

| Component | Value |
|---|---|
| Architecture | Diffusion Transformer (DiT) |
| Parameters | ~320M (DiT only) |
| Hidden Dimension | 1024 |
| Transformer Depth | 12 layers |
| Attention Heads | 16 |
| Patch Size | 2×2 |
| MLP Ratio | 4× |
| Context Dimension | 1024 (T5-Large) |

Each transformer block combines the following modern components (a minimal sketch follows the table):
| Feature | Description | Origin |
|---|---|---|
| RoPE | Rotary Positional Embeddings for 2D patches | LLaMA, Flux |
| QK-Normalization | Stabilizes attention at scale | ViT-22B |
| SwiGLU | Gated activation for better gradient flow | PaLM, LLaMA |
| AdaLN-Zero | Adaptive layer norm for timestep conditioning | DiT |
| RMSNorm | Faster than LayerNorm with similar quality | LLaMA |
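As referenced above, here is a sketch of how these pieces could fit together in one transformer block. It is illustrative only: the real layer names and initialization details are assumptions, the 2D RoPE application is stubbed out as a comment, and `nn.RMSNorm` requires PyTorch ≥ 2.4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiTBlock(nn.Module):
    """Illustrative block: RMSNorm + QK-norm + SwiGLU + AdaLN-Zero (RoPE omitted)."""
    def __init__(self, dim=1024, heads=16, mlp_ratio=4):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.norm1 = nn.RMSNorm(dim, elementwise_affine=False)
        self.norm2 = nn.RMSNorm(dim, elementwise_affine=False)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.q_norm = nn.RMSNorm(self.head_dim)        # QK-normalization per head
        self.k_norm = nn.RMSNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)
        hidden = int(dim * mlp_ratio)
        self.w1 = nn.Linear(dim, hidden)               # SwiGLU gate
        self.w2 = nn.Linear(dim, hidden)               # SwiGLU value
        self.w3 = nn.Linear(hidden, dim)
        # AdaLN-Zero: conditioning embedding -> 6 modulation tensors, with the
        # projection zero-initialized so each block starts as the identity.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[1].weight)
        nn.init.zeros_(self.adaLN[1].bias)

    def forward(self, x, c):                           # x: (B, N, dim), c: (B, dim)
        shift1, scale1, gate1, shift2, scale2, gate2 = self.adaLN(c).chunk(6, dim=-1)

        # Attention branch with AdaLN-Zero modulation.
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        B, N, _ = h.shape
        q, k, v = self.qkv(h).view(B, N, 3, self.heads, self.head_dim).unbind(2)
        q, k = self.q_norm(q), self.k_norm(k)          # 2D RoPE would be applied to q, k here
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).reshape(B, N, -1)
        x = x + gate1.unsqueeze(1) * self.proj(attn)

        # SwiGLU MLP branch.
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        x = x + gate2.unsqueeze(1) * self.w3(F.silu(self.w1(h)) * self.w2(h))
        return x
```

With `dim=1024`, 16 heads, and an MLP ratio of 4, the shapes match the specification table above.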
Training setup:

| Setting | Value |
|---|---|
| Resolution | 256×256 |
| Batch Size | 80 (effective, 10 × 8 accumulation) |
| Learning Rate | 2×10⁻⁴ |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Precision | bfloat16 |
| EMA Decay | 0.9999 |
| Warmup Steps | 500 |
| Gradient Clipping | 1.0 |
| Timesteps | 1000 |
| Schedule | Cosine (Improved DDPM) |
| Prediction | ε-prediction |
| CFG Dropout | 10% |
| Sampler | DDIM |
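To make the table concrete, here is a conceptual sketch of a single training step under these settings. The `dit`, `latents`, `context`, and `null_context` objects are placeholders, bfloat16 autocast and gradient accumulation are omitted, and the actual training loop almost certainly differs in detail.

```python
import math
import torch
import torch.nn.functional as F

T = 1000  # diffusion timesteps

def cosine_alphas_cumprod(T, s=0.008):
    # Improved-DDPM cosine schedule for the cumulative signal level alpha_bar(t).
    t = torch.linspace(0, T, T + 1) / T
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    return (f / f[0])[1:].clamp(1e-5, 0.9999)

alphas_cumprod = cosine_alphas_cumprod(T)

def training_step(dit, optimizer, latents, context, null_context):
    """latents: (B, 4, 32, 32) VAE latents; context/null_context: (B, L, 1024) T5 embeddings."""
    B = latents.shape[0]
    t = torch.randint(0, T, (B,), device=latents.device)
    noise = torch.randn_like(latents)
    a = alphas_cumprod.to(latents.device)[t].view(B, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise       # forward process q(x_t | x_0)

    # Classifier-free guidance dropout: swap ~10% of prompts for the null embedding.
    drop = torch.rand(B, device=latents.device) < 0.10
    context = torch.where(drop.view(B, 1, 1), null_context, context)

    pred = dit(noisy, t, context)                             # model predicts the added noise
    loss = F.mse_loss(pred, noise)                            # eps-prediction objective

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(dit.parameters(), 1.0)     # gradient clipping at 1.0
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.AdamW(dit.parameters(), lr=2e-4, betas=(0.9, 0.95))
# EMA update after every optimizer step (decay 0.9999):
#   for p_ema, p in zip(ema_dit.parameters(), dit.parameters()):
#       p_ema.mul_(0.9999).add_(p.detach(), alpha=1 - 0.9999)
```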
Install the dependencies:

```bash
pip install torch transformers diffusers einops huggingface_hub
```
Download and run the inference script:
```bash
# Download inference script
wget https://huggingface.co/kerzgrr/imagenv1m/resolve/main/inference.py

# Generate an image
python inference.py "A cat sitting on a windowsill"

# With options
python inference.py "A forest at sunset" --steps 100 --cfg 7.5 --seed 42 --output forest.png
```
The script automatically downloads all required models from HuggingFace:
- `kerzgrr/imagenv1m`
- `google/flan-t5-large`
- `stabilityai/sdxl-vae`

All model sizes are available as separate repositories in the Imagen Collection.
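If you prefer to fetch the pieces manually instead of relying on `inference.py`, something like the sketch below should work. The DiT checkpoint filename (`model.pt`) is an assumption; check the repository for the actual file name.

```python
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import AutoencoderKL

# DiT weights from this repo -- the filename "model.pt" is an assumption.
ckpt_path = hf_hub_download("kerzgrr/imagenv1m", "model.pt")

# Text encoder and VAE from their upstream repositories.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-large")
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
```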
| Version | Parameters | Repository | Status |
|---|---|---|---|
| v1-medium | ~320M | kerzgrr/imagenv1m | Available |
| v1-nano | ~50M | kerzgrr/imagenv1n | Planned |
| v1-mini | ~150M | kerzgrr/imagenv1s | Planned |
| v1-large | ~700M | kerzgrr/imagenv1l | Planned |
| v1-xlarge | ~1.5B | kerzgrr/imagenv1xl | Planned |
The checkpoint is a single dictionary:

```python
{
    "model_state_dict": ...,  # DiT weights (EMA)
    "step": int,              # Training step
    "config": dict,           # Model config
}
```
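A minimal sketch for consuming this layout is shown below; the local filename and the `DiT` constructor are placeholders.

```python
import torch

# Filename is a placeholder for the downloaded checkpoint path.
checkpoint = torch.load("model.pt", map_location="cpu")

config = checkpoint["config"]   # hyperparameters used to build the DiT
step = checkpoint["step"]       # training step the checkpoint was saved at

# dit = DiT(**config)                                   # hypothetical constructor
# dit.load_state_dict(checkpoint["model_state_dict"])   # EMA weights
# dit.eval()
```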
If you use this model, please cite:

```bibtex
@misc{imagenv1m,
  title={Imagen v1 Medium: A Diffusion Transformer for Text-to-Image Generation},
  author={kerzgrr},
  year={2025},
  url={https://huggingface.co/kerzgrr/imagenv1m}
}
```
Built on the shoulders of giants: this model reuses ideas and components from DiT, LLaMA, Flux, ViT-22B, PaLM, the T5 text encoder, and the SDXL VAE.
Trained for around 3 days on an RTX 5090 Laptop GPU, specifically in an MSI Titan 18 HX Dragon Edition Norse Myth A2XW.