Emu3.5: Native Multimodal Models are World Learners
Emu3.5 Team, BAAI
 
 
| 🔹 | Core Concept | Description | 
|---|---|---|
| 🧠 | Unified World Modeling | Predicts the next state jointly across vision and language, enabling coherent world modeling and generation. | 
| 🧩 | End-to-End Pretraining | Trained with a unified next-token prediction objective over interleaved vision–language sequences. | 
| 📚 | 10T+ Multimodal Tokens | Pre-trained on more than 10 trillion interleaved tokens from video frames and transcripts, capturing spatiotemporal structure. |
| 🔄 | Native Multimodal I/O | Processes and generates interleaved visual–text sequences without modality adapters or task-specific heads. | 
| 🎯 | RL Post-Training | Large-scale reinforcement learning enhances reasoning, compositionality, and generation quality. | 
| ⚡ | Discrete Diffusion Adaptation (DiDA) | Converts sequential decoding into bidirectional parallel prediction, achieving ≈20× faster inference without performance loss. |
| 🖼️ | Versatile Generation | Excels in long-horizon vision–language generation, any-to-image (X2I) synthesis, and text-rich image creation. | 
| 🌐 | Generalizable World Modeling | Enables spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios. |
| 🏆 | Performance Benchmark | Matches Gemini 2.5 Flash Image (Nano Banana) on image generation and editing, and outperforms it on interleaved generation tasks. |
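
To make the "unified next-token prediction over interleaved vision–language sequences" idea in the table concrete, here is a toy sketch. It is not the Emu3.5 implementation: the whitespace tokenizer, the `<boi>`/`<eoi>` markers, the `<img_N>` token naming, and both helper functions are hypothetical and only illustrate how text and discrete image codes can share one sequence with a single prediction objective.

```python
from typing import List, Tuple, Union

# Hypothetical image boundary markers, for illustration only;
# the real vocabulary and special tokens may differ.
BOI, EOI = "<boi>", "<eoi>"

# A segment is either a text string or a list of discrete image codes.
Segment = Union[str, List[int]]


def flatten_interleaved(segments: List[Segment]) -> List[str]:
    """Flatten text and image segments into one token sequence."""
    tokens: List[str] = []
    for seg in segments:
        if isinstance(seg, str):
            tokens.extend(seg.split())  # toy whitespace tokenizer
        else:
            tokens.append(BOI)
            tokens.extend(f"<img_{code}>" for code in seg)
            tokens.append(EOI)
    return tokens


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Build (context, target) pairs: each token is predicted from its prefix."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Text, then an image (two toy codes), then more text, all in one stream.
sequence = flatten_interleaved(["crack the egg", [17, 4], "whisk it"])
pairs = next_token_pairs(sequence)
```

In this sketch the same objective covers both modalities: some targets are words and some are image codes, and no modality-specific head is involved.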
```bash
git clone https://github.com/baaivision/Emu3.5
cd Emu3.5
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
```
Edit `configs/config.py` to set:

- `model_path`, `vq_path`
- `task_type` in `{t2i, x2i, howto, story, explore, vla}`; `use_image` controls `<|IMAGE|>` usage (set to true when reference images are provided)
- `sampling_params` (`classifier_free_guidance`, `temperature`, `top_k`/`top_p`, etc.)

Then run inference:

```bash
python inference.py --cfg configs/config.py
```
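
As a rough illustration of the settings listed above, a config file might look like the following. This is a sketch, not the repository's actual `configs/config.py`: the field names come from the list above, but the module-level layout, every value, and the `check_config` helper are assumptions made for illustration.

```python
# Hypothetical sketch of a config; all paths and values are placeholders.
VALID_TASK_TYPES = {"t2i", "x2i", "howto", "story", "explore", "vla"}

model_path = "/path/to/model/weights"    # placeholder path
vq_path = "/path/to/vision/tokenizer"    # placeholder path
task_type = "x2i"                        # one of VALID_TASK_TYPES
use_image = True                         # True because x2i supplies reference images
sampling_params = dict(                  # illustrative values, not recommendations
    classifier_free_guidance=3.0,
    temperature=1.0,
    top_k=1024,
    top_p=1.0,
)


def check_config(task_type, use_image, sampling_params):
    """Return a list of human-readable problems (empty if none were found)."""
    errors = []
    if task_type not in VALID_TASK_TYPES:
        errors.append(f"unknown task_type: {task_type!r}")
    if not isinstance(use_image, bool):
        errors.append("use_image must be True or False")
    if sampling_params.get("temperature", 1.0) <= 0:
        errors.append("temperature must be positive")
    return errors


problems = check_config(task_type, use_image, sampling_params)
```

A quick check like this before launching a run can catch a mistyped `task_type` early, which is cheaper than discovering it after the model has loaded.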
Protobuf outputs are written to outputs/<exp_name>/proto/. For better throughput, we recommend ≥2 GPUs.
To visualize generated protobuf files:
```bash
python src/utils/vis_proto.py --input <input_proto_file> --output <output_dir>
```
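
When a run produces several protobuf files, the visualization command above can be generated once per file. The helper below is a minimal sketch under stated assumptions: `build_vis_commands` is a hypothetical name, it treats every regular file in the proto directory as an input, and it gives each file its own output subdirectory, which the repository does not require.

```python
from pathlib import Path
from typing import List


def build_vis_commands(proto_dir: Path, output_dir: Path) -> List[List[str]]:
    """Build one vis_proto.py command per file found in proto_dir."""
    commands = []
    # Only regular files are considered; subdirectories are skipped.
    for proto_file in sorted(p for p in proto_dir.iterdir() if p.is_file()):
        commands.append([
            "python", "src/utils/vis_proto.py",
            "--input", str(proto_file),
            # Assumption: one output subdirectory per input file.
            "--output", str(output_dir / proto_file.stem),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, for example with `proto_dir` set to `outputs/<exp_name>/proto` for the experiment in question.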
```bibtex
@misc{cui2025emu35nativemultimodalmodels,
      title={Emu3.5: Native Multimodal Models are World Learners},
      author={Yufeng Cui and Honghao Chen and Haoge Deng and Xu Huang and Xinghang Li and Jirong Liu and Yang Liu and Zhuoyan Luo and Jinsheng Wang and Wenxuan Wang and Yueze Wang and Chengyuan Wang and Fan Zhang and Yingli Zhao and Ting Pan and Xianduo Li and Zecheng Hao and Wenxuan Ma and Zhuo Chen and Yulong Ao and Tiejun Huang and Zhongyuan Wang and Xinlong Wang},
      year={2025},
      eprint={2510.26583},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.26583},
}
```