From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

| Paper | Code |

🌟🌟 Motivation

Two lingering clouds cast shadows over the widespread exploration and adoption of native vision-language models (VLMs):

  • What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome?

  • How can research on native VLMs be made more accessible and democratized, thereby accelerating progress in the field?

We construct native VLMs from first principles, whose primitives should:

  • effectively align pixel and word representations within a shared semantic space;

  • seamlessly integrate the strengths of separate vision and language modules;

  • inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning.

🚀🚀 Highlights

  • With only 390M image-text examples, NEO develops strong visual perception from scratch inside a dense, monolithic model via carefully designed primitives.

  • NEO serves as a cornerstone for scalable and powerful native VLMs, paired with reusable components that foster a cost-effective and extensible ecosystem.

🧑‍🎨🧑‍🎨 Model Overview

NEO1_0-2B has the following features (see the attention-shape sketch after this list):

  • Model Type: Native Vision-Language Models

  • Model Mode: Mixed Native-Attn & Native-RoPE

  • Parameters per Layer: 56M (vs. 50M for Qwen3-1.7B)

  • Model Parameters: 2.2B (Non-Embedding)

  • Number of Layers: 40 (12 for Pre-Buffer & 28 for Post-LLM)

  • Number of Heads: 16 for Q and 8 for KV (GQA)

  • Head Dimensions: 128 * 2 for QK and 128 for V
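
Below is a minimal PyTorch sketch of how the listed attention configuration (16 query heads, 8 KV heads under GQA, 128 × 2 dimensions per QK head, 128 per V head) plays out at the tensor level. The names and the module-free style are illustrative assumptions for exposition, not NEO's actual implementation.

```python
import torch

# Hyperparameters copied from the list above; everything else is illustrative.
num_q_heads = 16       # query heads
num_kv_heads = 8       # key/value heads (GQA: each KV head serves 2 Q heads)
qk_head_dim = 128 * 2  # per-head dimension for Q and K
v_head_dim = 128       # per-head dimension for V

batch, seq = 1, 8
q = torch.randn(batch, num_q_heads, seq, qk_head_dim)
k = torch.randn(batch, num_kv_heads, seq, qk_head_dim)
v = torch.randn(batch, num_kv_heads, seq, v_head_dim)

# Grouped-query attention: repeat each KV head across its group of Q heads.
group = num_q_heads // num_kv_heads       # = 2
k = k.repeat_interleave(group, dim=1)     # -> (1, 16, 8, 256)
v = v.repeat_interleave(group, dim=1)     # -> (1, 16, 8, 128)

attn = torch.softmax(q @ k.transpose(-2, -1) / qk_head_dim**0.5, dim=-1)
out = attn @ v
print(out.shape)  # torch.Size([1, 16, 8, 128])
```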

🔥🔥 Model Performance

📚📚 Model Weights

We release the 2B weights of NEO1_0 in Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT).
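
As a loading sketch, assuming the released checkpoints follow the standard transformers interface with custom remote code (the exact model and processor classes are assumptions about the release, not confirmed by this card):

```python
import torch
from transformers import AutoModel, AutoProcessor

# Hypothetical loading sketch. The repo id comes from this page; whether the
# release uses AutoModel/AutoProcessor with remote code is an assumption.
repo = "Paranioar/NEO1_0-2B-SFT"  # SFT variant; PT/MT repos assumed analogous
model = AutoModel.from_pretrained(repo, trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
```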

✒️✒️ Citation

If NEO is helpful for your research, please consider giving it a star ⭐ and citing it 📝:

```bibtex
@article{Diao2025NEO,
  title        = {From Pixels to Words--Towards Native Vision-Language Primitives at Scale},
  author       = {Diao, Haiwen and Li, Mingxuan and Wu, Silei and Dai, Linjun and Wang, Xiaohua and Deng, Hanming and Lu, Lewei and Lin, Dahua and Liu, Ziwei},
  journal      = {arXiv preprint arXiv:2510.14979},
  year         = {2025}
}
```