🎨 Imagen DiT-320M

Frontier Diffusion Transformer for Text-to-Image Generation


A prototype modern diffusion transformer trained on high-quality recaptioned data


πŸ–ΌοΈ Sample Generations

"A silver pot on wood" "Entrance of a luxury house" "A house in a green field" "A forest" "A handwritten poem"

πŸ—οΈ Model Architecture

Pipeline: Text Prompt → T5-Large (770M) text encoder → DiT-320M (denoises noisy latents, conditioned on the timestep) → SDXL VAE Decoder → Image (256×256)

DiT Block (Γ—12)

Each transformer block contains (a code sketch follows the list):

  • Self-Attention with RoPE positional embeddings and QK-Normalization
  • Cross-Attention for text conditioning from T5
  • SwiGLU MLP with 4Γ— hidden expansion
  • AdaLN-Zero for timestep conditioning
  • RMSNorm instead of LayerNorm for efficiency
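For concreteness, here is a minimal PyTorch sketch of one such block. It is illustrative only, not the repository's exact code: RoPE and QK-normalization are omitted (nn.MultiheadAttention does not expose the per-head Q/K hooks they need), and AdaLN-Zero here modulates only the self-attention and MLP paths.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by RMS only (no mean subtraction), then scale.
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class DiTBlock(nn.Module):
    def __init__(self, dim=1024, heads=16, ctx_dim=1024, mlp_ratio=4):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.norm3 = RMSNorm(dim)
        self.mlp = SwiGLU(dim, dim * mlp_ratio)
        # AdaLN-Zero: the timestep embedding produces shift/scale/gate terms;
        # zero init makes each block start as the identity function.
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, ctx, t_emb):
        # x: (B, N, dim) image tokens, ctx: (B, L, ctx_dim) T5 states, t_emb: (B, dim)
        s1, sc1, g1, s2, sc2, g2 = self.ada(t_emb)[:, None, :].chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1) + s1
        x = x + g1 * self.self_attn(h, h, h, need_weights=False)[0]
        x = x + self.cross_attn(self.norm2(x), ctx, ctx, need_weights=False)[0]
        h = self.norm3(x) * (1 + sc2) + s2
        return x + g2 * self.mlp(h)

block = DiTBlock()
out = block(torch.randn(2, 256, 1024), torch.randn(2, 77, 1024), torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 256, 1024])
```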

Technical Specifications

| Spec | Value |
|---|---|
| Architecture | Diffusion Transformer (DiT) |
| Parameters | ~320M (DiT only) |
| Hidden Dimension | 1024 |
| Transformer Depth | 12 layers |
| Attention Heads | 16 |
| Patch Size | 2×2 |
| MLP Ratio | 4× |
| Context Dimension | 1024 (T5-Large) |
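These numbers fix the token geometry: the SDXL VAE downsamples 8×, so a 256×256 image becomes a 4×32×32 latent, and 2×2 patches give 16×16 = 256 tokens of width 1024. A sketch, assuming the usual DiT conv patch-embedding (which this repo may or may not use):

```python
import torch
import torch.nn as nn

# 256x256 RGB -> SDXL VAE (8x downsample, 4 latent channels) -> 4x32x32 latents.
latents = torch.randn(1, 4, 32, 32)

# DiT-style patch embedding: a stride-2 conv turns each 2x2 latent patch into
# one token, giving 16x16 = 256 tokens at the model's hidden dim (1024).
patch_embed = nn.Conv2d(4, 1024, kernel_size=2, stride=2)
tokens = patch_embed(latents).flatten(2).transpose(1, 2)
print(tokens.shape)  # torch.Size([1, 256, 1024])
```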

Modern Features

| Feature | Description | Origin |
|---|---|---|
| RoPE | Rotary Positional Embeddings for 2D patches | LLaMA, Flux |
| QK-Normalization | Stabilizes attention at scale | ViT-22B |
| SwiGLU | Gated activation for better gradient flow | PaLM, LLaMA |
| AdaLN-Zero | Adaptive layer norm for timestep conditioning | DiT |
| RMSNorm | Faster than LayerNorm with similar quality | LLaMA |
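Of these, 2D RoPE is the least standard ingredient. Below is one common way to implement it, rotating half of each head's feature pairs by the row index and the other half by the column index; this is an assumption about the approach, not this repo's exact code.

```python
import torch

def rope_2d_angles(h: int, w: int, head_dim: int, base: float = 10000.0):
    """Per-token rotation angles: half the rotating pairs track the row index,
    the other half the column index. Requires head_dim divisible by 4."""
    quarter = head_dim // 4
    inv_freq = 1.0 / (base ** (torch.arange(quarter).float() / quarter))
    row_ang = torch.arange(h).float()[:, None] * inv_freq        # (h, quarter)
    col_ang = torch.arange(w).float()[:, None] * inv_freq        # (w, quarter)
    ang = torch.cat([
        row_ang[:, None, :].expand(h, w, quarter),
        col_ang[None, :, :].expand(h, w, quarter),
    ], dim=-1).reshape(h * w, head_dim // 2)                     # (N, head_dim/2)
    return ang.cos(), ang.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate interleaved (even, odd) feature pairs of q or k.
    x: (batch, heads, N, head_dim); cos/sin: (N, head_dim/2)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

cos, sin = rope_2d_angles(h=16, w=16, head_dim=64)   # 1024 dim / 16 heads = 64
q = torch.randn(1, 16, 256, 64)
print(apply_rope(q, cos, sin).shape)                 # torch.Size([1, 16, 256, 64])
```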

πŸ“Š Training Details

Dataset

Trained on Recap-DataComp-1B from UCSC-VLAA, a version of DataComp-1B whose images were recaptioned with higher-quality synthetic captions (see Acknowledgments).

Training Configuration

| Setting | Value |
|---|---|
| Resolution | 256×256 |
| Batch Size | 80 (effective; 10 × 8 accumulation) |
| Learning Rate | 2×10⁻⁴ |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Precision | bfloat16 |
| EMA Decay | 0.9999 |
| Warmup Steps | 500 |
| Gradient Clipping | 1.0 |
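Mapped onto a standard PyTorch loop, those settings look roughly like the sketch below. The model, data, and loss are toy placeholders; the actual training script is not published in this card.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device)              # stand-in for the DiT
ema = torch.nn.Linear(8, 8).to(device)                # EMA copy of the weights
ema.load_state_dict(model.state_dict())
data = [torch.randn(10, 8) for _ in range(32)]        # micro-batches of 10

opt = torch.optim.AdamW(model.parameters(), lr=2e-4, betas=(0.9, 0.95))
warmup = torch.optim.lr_scheduler.LambdaLR(opt, lambda s: min(1.0, (s + 1) / 500))
ACCUM, EMA_DECAY = 8, 0.9999                          # 10 x 8 = effective batch 80

for i, batch in enumerate(data):
    with torch.autocast(device, dtype=torch.bfloat16):
        loss = model(batch.to(device)).pow(2).mean() / ACCUM   # placeholder loss
    loss.backward()
    if (i + 1) % ACCUM == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()
        warmup.step()
        opt.zero_grad()
        with torch.no_grad():                          # EMA update, decay 0.9999
            for p, p_ema in zip(model.parameters(), ema.parameters()):
                p_ema.lerp_(p, 1.0 - EMA_DECAY)
```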

Diffusion Process

| Setting | Value |
|---|---|
| Timesteps | 1000 |
| Schedule | Cosine (Improved DDPM) |
| Prediction | ε-prediction |
| CFG Dropout | 10% |
| Sampler | DDIM |
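For reference, here is a short sketch of the cosine schedule and ε-prediction objective from Improved DDPM, plus the CFG-dropout idea; shapes and null-embedding handling are assumptions, not this repo's exact code.

```python
import torch

def cosine_betas(T: int = 1000, s: float = 0.008) -> torch.Tensor:
    """Cosine noise schedule from Nichol & Dhariwal, 'Improved DDPM' (2021)."""
    t = torch.linspace(0, T, T + 1) / T
    f = torch.cos((t + s) / (1 + s) * torch.pi / 2) ** 2
    alpha_bar = f / f[0]
    return (1 - alpha_bar[1:] / alpha_bar[:-1]).clamp(max=0.999)

alpha_bar = torch.cumprod(1 - cosine_betas(), dim=0)   # (1000,)

# Epsilon-prediction: the model is trained to regress the injected noise.
x0 = torch.randn(4, 4, 32, 32)                         # clean latents (toy)
t = torch.randint(0, 1000, (4,))
eps = torch.randn_like(x0)
ab = alpha_bar[t].view(-1, 1, 1, 1)
x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps           # forward process q(x_t | x0)
# loss = F.mse_loss(model(x_t, t, text_emb), eps)

# 10% CFG dropout: randomly replace the text embedding with a null embedding
# so the model also learns the unconditional distribution.
drop = torch.rand(x0.shape[0]) < 0.10                  # per-sample mask

# At sampling time (DDIM), classifier-free guidance combines two predictions:
#   eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
```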

πŸš€ Quick Start

Installation

```bash
pip install torch transformers diffusers einops huggingface_hub
```

Inference

Download and run the inference script:

```bash
# Download inference script
wget https://huggingface.co/kerzgrr/imagenv1m/resolve/main/inference.py

# Generate an image
python inference.py "A cat sitting on a windowsill"

# With options
python inference.py "A forest at sunset" --steps 100 --cfg 7.5 --seed 42 --output forest.png
```

The script automatically downloads all required models from HuggingFace:

  • DiT checkpoint from kerzgrr/imagenv1m
  • T5-Large from google/flan-t5-large
  • SDXL VAE from stabilityai/sdxl-vae
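If you prefer to load the pieces manually rather than via inference.py, the text encoder and VAE come straight from transformers and diffusers; only the DiT needs the repo's own model code. A sketch:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel
from diffusers import AutoencoderKL

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-large")
vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")

# T5-Large hidden states are 1024-wide, matching the DiT's context dimension.
tokens = tokenizer(["A forest at sunset"], return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(**tokens).last_hidden_state
print(text_emb.shape)  # (1, seq_len, 1024)

# After the DiT denoises a (1, 4, 32, 32) latent, decode it with the VAE
# (latent scaling follows the usual diffusers convention):
# image = vae.decode(latent / vae.config.scaling_factor).sample
```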

πŸ“ Model Versions

Each model size is (or will be) published as a separate repository in the Imagen Collection.

Available Now

| Version | Parameters | Repository | Status |
|---|---|---|---|
| v1-medium | ~320M | kerzgrr/imagenv1m | ✅ Available |

Coming Soon

| Version | Parameters | Repository | Status |
|---|---|---|---|
| v1-nano | ~50M | kerzgrr/imagenv1n | 🔜 Planned |
| v1-mini | ~150M | kerzgrr/imagenv1s | 🔜 Planned |
| v1-large | ~700M | kerzgrr/imagenv1l | 🔜 Planned |
| v1-xlarge | ~1.5B | kerzgrr/imagenv1xl | 🔜 Planned |

Checkpoint Contents

```python
{
    "model_state_dict": ...,      # DiT weights (EMA)
    "step": int,                  # Training step
    "config": dict,               # Model config
}
```
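A quick way to inspect those keys is shown below; note the checkpoint filename is a guess, so check the repository's file listing for the real name.

```python
import torch
from huggingface_hub import hf_hub_download

# Filename is assumed; see the repository's "Files" tab for the actual name.
path = hf_hub_download("kerzgrr/imagenv1m", "model.pt")
ckpt = torch.load(path, map_location="cpu")
print(ckpt["step"])     # training step at save time
print(ckpt["config"])   # model hyperparameters
# dit.load_state_dict(ckpt["model_state_dict"])   # EMA weights
```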

⚠️ Limitations

  • Quality: A research prototype; outputs are not production-grade
  • Resolution: Currently supports only 256×256
  • Subjects: May struggle with very specific prompts
  • Text Rendering: Cannot yet render legible text within images
  • Hands/Anatomy: Human and animal anatomy is often rendered incorrectly

πŸ“œ Citation

```bibtex
@misc{imagenv1m,
  title={Imagen v1 Medium: A Diffusion Transformer for Text-to-Image Generation},
  author={kerzgrr},
  year={2025},
  url={https://huggingface.co/kerzgrr/imagenv1m}
}
```

πŸ™ Acknowledgments

Built on the shoulders of giants:

  • DiT - Original Diffusion Transformer architecture (Peebles & Xie, 2023)
  • SDXL VAE - Latent autoencoder from Stability AI
  • Flan-T5 - Text encoder from Google
  • Recap-DataComp-1B - Training dataset from UCSC-VLAA

Trained for around 3 days on a single laptop RTX 5090, inside an MSI Titan 18 HX Dragon Edition Norse Myth A2XW.
