---
license: mit
pipeline_tag: text-to-image
library_name: diffusers
---
# AMD Nitro-E

## Introduction

Nitro-E is a family of text-to-image diffusion models focused on highly efficient training. With just 304M parameters, Nitro-E is designed to be resource-friendly for both training and inference. Training takes only 1.5 days on a single node with 8 AMD Instinct™ MI300X GPUs. On the inference side, Nitro-E delivers a throughput of 18.8 samples per second (batch size 32, 512px images) on a single AMD Instinct MI300X GPU; the distilled version further increases throughput to 39.3 samples per second. The release consists of:

* [Nitro-E-512px](https://huggingface.co/amd/Nitro-E/blob/main/Nitro-E-512px.safetensors): an EMMDiT-based 20-step model trained from scratch.
* [Nitro-E-512px-dist](https://huggingface.co/amd/Nitro-E/blob/main/Nitro-E-512px-dist.safetensors): an EMMDiT-based model distilled from Nitro-E-512px.
* [Nitro-E-512px-GRPO](https://huggingface.co/amd/Nitro-E/tree/main/ckpt_grpo_512px): a post-trained model fine-tuned from Nitro-E-512px using the Group Relative Policy Optimization (GRPO) strategy.

⚡️ [Open-source code](https://github.com/AMD-AGI/Nitro-E)!
⚡️ [Technical blog](https://advanced-micro-devices-rocm-blogs--1559.com.readthedocs.build/projects/internal/en/1559/artificial-intelligence/nitro-e/README.html)!
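
As a rough sense of scale, the quoted throughput numbers imply the following amortized per-image latencies (simple arithmetic over the batch, not an additional benchmark):

```python
# Back-of-envelope per-image latency implied by the quoted throughputs
# (amortized over a batch of 32; arithmetic only, not a measured benchmark).
base_throughput = 18.8   # samples/s, Nitro-E-512px on one MI300X GPU
dist_throughput = 39.3   # samples/s, distilled 4-step model

base_ms = 1000 / base_throughput   # ~53 ms per image
dist_ms = 1000 / dist_throughput   # ~25 ms per image
print(f"base: {base_ms:.1f} ms/image, distilled: {dist_ms:.1f} ms/image")
```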

## Details

* **Model architecture**: We propose the Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis with low training-resource requirements. Our design philosophy centers on token reduction, since computational cost scales significantly with token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient, lightweight module for computing modulation parameters in transformer blocks. See our technical blog post for more details.
* **Dataset**: Our models were trained on a dataset of ~25M images consisting of both real and synthetic data sources that are openly available on the internet. We make use of the following datasets for training: [Segment-Anything-1B](https://ai.meta.com/datasets/segment-anything/), [JourneyDB](https://journeydb.github.io/), [DiffusionDB](https://github.com/poloclub/diffusiondb), and [DataComp](https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B) as prompts for the generated data.
* **Training cost**: The Nitro-E-512px model requires only 1.5 days of training from scratch on a single node with 8 AMD Instinct™ MI300X GPUs.
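
To give a feel for the idea behind Alternating Subregion Attention, here is a minimal PyTorch sketch that restricts self-attention to non-overlapping subregions of the token sequence and alternates (shifts) the region boundaries between blocks so information still mixes globally. The function name, the single-head attention, and the shift scheme are illustrative assumptions for this sketch, not the released implementation — see the open-source repo for the actual code:

```python
import torch
import torch.nn.functional as F

def subregion_attention(x, num_regions, shift):
    """Sketch: self-attention restricted to non-overlapping subregions.
    `shift` offsets the region boundaries so that alternating blocks
    mix tokens across region borders.
    x: (batch, tokens, dim); tokens must be divisible by num_regions.
    """
    b, n, d = x.shape
    if shift:
        x = torch.roll(x, shifts=n // (2 * num_regions), dims=1)
    # split the sequence into regions and attend within each region only:
    # cost scales with (n / num_regions)^2 per region instead of n^2
    r = x.reshape(b * num_regions, n // num_regions, d)
    out = F.scaled_dot_product_attention(r, r, r).reshape(b, n, d)
    if shift:
        out = torch.roll(out, shifts=-n // (2 * num_regions), dims=1)
    return out

x = torch.randn(2, 64, 32)
# even block: fixed regions; odd block: shifted regions
y0 = subregion_attention(x, num_regions=4, shift=False)
y1 = subregion_attention(y0, num_regions=4, shift=True)
print(y1.shape)  # torch.Size([2, 64, 32])
```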

## Quickstart

* **Image generation with 20 steps**:
```python
import torch
from core.tools.inference_pipe import init_pipe

device = torch.device('cuda:0')
dtype = torch.bfloat16
resolution = 512
repo_name = "amd/Nitro-E"
ckpt_name = 'Nitro-E-512px.safetensors'
use_grpo = True

if use_grpo:
    # load the GRPO post-trained checkpoint on top of the base model
    pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name, ckpt_path_grpo='ckpt_grpo_512px')
else:
    pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)

prompt = 'A hot air balloon in the shape of a heart grand canyon'
images = pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=20, guidance_scale=4.5).images
```
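
Assuming the pipeline follows the usual diffusers convention of returning a list of PIL images in `.images`, the results can be written to disk as below. A blank placeholder image stands in for a real generation so the snippet is self-contained:

```python
from PIL import Image

# Placeholder for `pipe(...).images`; a real run returns PIL images like these.
images = [Image.new('RGB', (512, 512)) for _ in range(2)]

for i, image in enumerate(images):
    image.save(f'nitro_e_sample_{i}.png')  # one PNG per generated sample
```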

* **Image generation with 4 steps**:
```python
import torch
from core.tools.inference_pipe import init_pipe

device = torch.device('cuda:0')
dtype = torch.bfloat16
resolution = 512
repo_name = "amd/Nitro-E"
ckpt_name = 'Nitro-E-512px-dist.safetensors'

pipe = init_pipe(device, dtype, resolution, repo_name=repo_name, ckpt_name=ckpt_name)
prompt = 'A hot air balloon in the shape of a heart grand canyon'

# the distilled model runs without classifier-free guidance
images = pipe(prompt=prompt, width=resolution, height=resolution, num_inference_steps=4, guidance_scale=0).images
```

## License

Copyright (c) 2025 Advanced Micro Devices, Inc. All Rights Reserved.

This project is licensed under the [MIT License](https://mit-license.org/).