---
license: apache-2.0
---
## Architecture

**RynnEC** handles a variety of input types, including images, videos, visual prompts, and task instructions. Visual inputs are processed by a vision encoder with an any-resolution strategy, while visual prompts are handled by a region encoder that extracts fine-grained object-level features. Textual inputs are converted into a unified token stream through tokenization. For video segmentation tasks, a mask decoder transforms the output segmentation embeddings into binary masks.
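The components above compose into a single pipeline: each modality is mapped into token space, the language model reasons over the merged stream, and the mask decoder post-processes segmentation outputs. The sketch below illustrates that dataflow only; all class names, signatures, and the `segment` flag are illustrative assumptions, not the repository's actual API.

```python
import torch

# Schematic sketch of the RynnEC dataflow. Every name here is an
# illustrative placeholder, not the real implementation.
class RynnECSketch(torch.nn.Module):
    def __init__(self, vision_encoder, region_encoder, text_embedder, llm, mask_decoder):
        super().__init__()
        self.vision_encoder = vision_encoder  # any-resolution visual features
        self.region_encoder = region_encoder  # fine-grained features for visual prompts
        self.text_embedder = text_embedder    # token ids -> token embeddings
        self.llm = llm                        # language-model backbone
        self.mask_decoder = mask_decoder      # segmentation embeddings -> binary masks

    def forward(self, frames, region_prompts, instruction_ids, segment=False):
        # Encode each modality into token embeddings.
        visual_tokens = self.vision_encoder(frames)
        region_tokens = self.region_encoder(frames, region_prompts)
        text_tokens = self.text_embedder(instruction_ids)
        # Merge everything into one unified token stream for the LLM.
        stream = torch.cat([visual_tokens, region_tokens, text_tokens], dim=1)
        hidden_states = self.llm(stream)
        if not segment:
            return hidden_states
        # For video segmentation, the mask decoder turns the LLM's
        # segmentation embeddings into per-frame binary masks.
        masks = self.mask_decoder(hidden_states, visual_tokens) > 0
        return hidden_states, masks
```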
## Model Zoo

| Model     | Base Model            | HF Link |
| --------- | --------------------- | ------- |
| RynnEC-2B | Qwen2.5-1.5B-Instruct | [Alibaba-DAMO-Academy/RynnEC-2B](https://huggingface.co/Alibaba-DAMO-Academy/RynnEC-2B) |
| RynnEC-7B | Qwen2.5-7B-Instruct   | [Alibaba-DAMO-Academy/RynnEC-7B](https://huggingface.co/Alibaba-DAMO-Academy/RynnEC-7B) |

## Main Results

Benchmark comparison across object cognition and spatial cognition. With a highly efficient **2B**-parameter architecture, **RynnEC-2B** achieves state-of-the-art (SOTA) performance on complex spatial cognition tasks.
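A minimal loading sketch for the checkpoints above, assuming the Hugging Face repos ship custom modeling code that works with `trust_remote_code=True`; verify the exact entry-point classes on the model card before relying on this:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumption: the repo exposes a transformers-compatible interface via
# trust_remote_code; the actual classes may differ.
model_id = "Alibaba-DAMO-Academy/RynnEC-2B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # pick the dtype stored in the checkpoint
    device_map="auto",    # requires accelerate; places weights automatically
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```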
## Citation

If you find RynnEC useful for your research and applications, please cite it using this BibTeX:

```bibtex
@misc{dang2025rynnecbringingmllmsembodied,
      title={RynnEC: Bringing MLLMs into Embodied World},
      author={Ronghao Dang and Yuqian Yuan and Yunxuan Mao and Kehan Li and Jiangpin Liu and Zhikai Wang and Xin Li and Fan Wang and Deli Zhao},
      year={2025},
      eprint={2508.14160},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.14160},
}
```