## Motivation
Two lingering clouds cast shadows over the widespread exploration and promotion of native VLMs:
- What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome?
- How can research on native VLMs be made more accessible and democratized, thereby accelerating progress in the field?
We construct native VLMs from first principles, whose primitives should:
- effectively align pixel and word representations within a shared semantic space;
- seamlessly integrate the strengths of separate vision and language modules;
- inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning.
## Highlight
With only 390M image-text examples, NEO develops strong visual perception from scratch inside a dense and monolithic model via elaborate primitives.
NEO serves as a cornerstone for scalable and powerful native VLMs, paired with reusable components that foster a cost-effective and extensible ecosystem.
## Model Overview
NEO1_0-2B has the following features (summarized in the configuration sketch after this list):

- Model Type: Native Vision-Language Model
- Model Mode: Mixed Native-Attn & Native-RoPE
- Layer Parameters: 56M vs. 50M (Qwen3-1.7B)
- Model Parameters: 2.2B (Non-Embedding)
- Number of Layers: 40 (12 for Pre-Buffer & 28 for Post-LLM)
- Number of Heads: 16 for Q and 8 for KV (GQA)
- Head Dimensions: 128 * 2 for QK and 128 for V
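For readers who prefer code over bullet points, the sketch below restates these hyperparameters as a plain Python dataclass. It is purely illustrative: the field names are our own and do not mirror the keys in the released configuration files.

```python
from dataclasses import dataclass


@dataclass
class NeoArchSketch:
    """Illustrative summary of the NEO1_0-2B hyperparameters listed above.

    Field names are our own shorthand, not the released config keys.
    """

    num_layers: int = 40             # 12 Pre-Buffer + 28 Post-LLM blocks
    num_pre_buffer_layers: int = 12
    num_post_llm_layers: int = 28
    num_q_heads: int = 16            # grouped-query attention (GQA)
    num_kv_heads: int = 8
    qk_head_dim: int = 128 * 2       # 128 * 2 for Q/K
    v_head_dim: int = 128            # 128 for V
    non_embedding_params: str = "2.2B"

    def gqa_group_size(self) -> int:
        # Each KV head is shared by num_q_heads / num_kv_heads query heads.
        return self.num_q_heads // self.num_kv_heads


if __name__ == "__main__":
    cfg = NeoArchSketch()
    assert cfg.num_pre_buffer_layers + cfg.num_post_llm_layers == cfg.num_layers
    print(f"GQA group size: {cfg.gqa_group_size()}")  # -> 2
```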
## Model Performance
## Model Weights
We release the 2B weights of NEO1_0 in Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT).
| Model name | Weight |
|---|---|
| NEO-2B-PT | NEO-2B-PT (HF link) |
| NEO-2B-MT | NEO-2B-MT (HF link) |
| NEO-2B-SFT | NEO-2B-SFT (HF link) |
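A minimal loading sketch for one of these checkpoints is shown below. The repo ID is a placeholder, not the real one; use the Hugging Face links in the table above, and consult the released code for the intended model/processor classes, inference entry points, and chat template.

```python
import torch
from transformers import AutoModel, AutoProcessor

# Placeholder repo ID; replace with the actual NEO-2B-SFT Hugging Face repo.
repo_id = "path/to/NEO-2B-SFT"

# trust_remote_code lets transformers pick up NEO's custom architecture code,
# assuming the checkpoints ship it in the usual remote-code layout.
model = AutoModel.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
```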
## Citation
If NEO is helpful for your research, please consider giving the repository a star and citing:
```bibtex
@article{Diao2025NEO,
  title   = {From Pixels to Words -- Towards Native Vision-Language Primitives at Scale},
  author  = {Diao, Haiwen and Li, Mingxuan and Wu, Silei and Dai, Linjun and Wang, Xiaohua and Deng, Hanming and Lu, Lewei and Lin, Dahua and Liu, Ziwei},
  journal = {arXiv preprint arXiv:2510.14979},
  year    = {2025}
}
```