Vision Tokens vs Text Tokens: Understanding the 10× Compression
The Claim
DeepSeek-OCR demonstrates that roughly 100 vision tokens can represent approximately 1000 text tokens with 97%+ decoding accuracy. At first glance, this looks like a simple 10× compression ratio. What most people miss is why it is possible: the two types of tokens carry information in fundamentally different ways.
One Vision Token Contains Way More Information
A Concrete Example
Let's look at what's actually in each type of token:
Text Token:
Token: "Annual"
Information: One word (or subword)
Representation: Token ID → Embedding (4096-dim vector)
Vision Token (from DeepSeek-OCR):
After encoding a 1024×1024 document image:
- Initial patches: 1024/16 × 1024/16 = 4096 patches (16×16 pixels each)
- After 16× compression: 4096/16 = 256 vision tokens
- Each vision token represents: 64×64 pixels
At typical document DPI (150-200), in a 64×64 pixel region:
- Characters: ~6-8 chars wide × 4-5 lines tall
- Words: approximately 5-8 words
- Plus: font style, size, layout, spacing information
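The patch arithmetic above can be checked with a few lines of Python. This is a minimal sketch that uses only the numbers quoted in the text (a 1024-pixel square image, 16-pixel patches, 16× token compression); the function name and the dictionary keys are illustrative, not part of DeepSeek-OCR.

```python
import math


def vision_token_geometry(image_size: int, patch_size: int, compression: int) -> dict:
    """Count patches and vision tokens for a square image (illustrative helper)."""
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    num_tokens = num_patches // compression
    # Spread the image area evenly over the remaining tokens.
    pixels_per_token = (image_size * image_size) // num_tokens
    return {
        "num_patches": num_patches,
        "num_tokens": num_tokens,
        "token_side_px": math.isqrt(pixels_per_token),
    }


geometry = vision_token_geometry(image_size=1024, patch_size=16, compression=16)
print(geometry)
# {'num_patches': 4096, 'num_tokens': 256, 'token_side_px': 64}
```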
Visualizing a Vision Token
Here's what one vision token might contain in a document:
┌──────────────────────────────┐ ← 64 pixels wide
│                              │
│  Annual Revenue Growth       │ ← Line 1 (3 words)
│  Q4 2024: $2.1M              │ ← Line 2 (3 words)
│  Increase: 15.3%             │ ← Line 3 (2 words)
│                              │
└──────────────────────────────┘
      ↓ (Vision Encoder)
  One 4096-dim vector
vs.
Text tokens: ["Annual", "Revenue", "Growth", "Q4", "2024", ":", "$", "2", ".", "1", "M", ...]
             a dozen or more separate tokens for the same content (the exact count depends on the tokenizer)
The Information Density Gap
| Type | Coverage | Information Content | 
|---|---|---|
| Text Token | 1 word or subword | ~1 word of text or less | 
| Vision Token | 64×64 pixels | ~5-8 words + layout + formatting | 
By this estimate, a vision token carries roughly 5-10× as much textual content as a text token, plus layout and formatting cues, yet both are mapped to vectors of the same size (4096-dim).
This is why 100 vision tokens can effectively represent 1000 text tokens: the information density is fundamentally different.
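As a back-of-envelope check on the density argument, the sketch below computes the compression ratio and estimates how many text tokens 100 vision tokens might cover. The 5-8 words per vision token come from the table above; the figure of about 1.3 text tokens per word is an assumed rule of thumb for English subword tokenizers, not a number from the paper, and both function names are hypothetical.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def estimated_text_tokens(vision_tokens: int,
                          words_per_vision_token: float,
                          tokens_per_word: float = 1.3) -> float:
    """Rough text-token equivalent of a set of vision tokens.

    tokens_per_word is an assumption (about 1.3 for English subword
    tokenizers); real values vary by tokenizer and content.
    """
    return vision_tokens * words_per_vision_token * tokens_per_word


ratio = compression_ratio(text_tokens=1000, vision_tokens=100)
low = estimated_text_tokens(100, words_per_vision_token=5)
high = estimated_text_tokens(100, words_per_vision_token=8)

print(f"compression ratio: {ratio:.1f}x")
print(f"estimated text tokens for 100 vision tokens: {low:.0f} to {high:.0f}")
```

Under these assumed figures the estimate falls in the same order of magnitude as the 1000 text tokens quoted above.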
Why Do They End Up the Same Embedding Size?
Despite containing vastly different amounts of information, both token types end up as 4096-dimensional vectors. But they get there very differently.
The 4096-Dim Latent Space
The embedding dimension is chosen for representation richness: enough dimensions to capture semantic relationships and allow attention mechanisms to work. This is a learned, dense, continuous space.
Different Journeys to 4096-Dim
Text Tokens: Through the Vocabulary
Token ID: 42 ("Annual")
   ↓
[Implicitly: 129K-dimensional vocabulary space] 
   ↓
Embedding lookup → 4096-dim
   ↓
LLM processing (4096-dim)
   ↓
Output projection → 129K logits
   ↓
Softmax → next token ID
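The text-token path can be sketched in plain Python. This is a toy illustration, not the model's actual code: the vocabulary and hidden sizes are shrunk to 50 and 8 (standing in for roughly 129K and 4096), the weights are random, and the transformer layers between the embedding lookup and the output projection are skipped entirely.

```python
import math
import random

VOCAB_SIZE = 50   # toy stand-in for the ~129K vocabulary
HIDDEN_DIM = 8    # toy stand-in for the 4096-dim latent space

rng = random.Random(0)
# One row per token ID: the embedding lookup is just row selection.
embedding_table = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)]
                   for _ in range(VOCAB_SIZE)]
# Output projection: one row of weights per vocabulary entry.
output_projection = [[rng.gauss(0, 1) for _ in range(HIDDEN_DIM)]
                     for _ in range(VOCAB_SIZE)]


def embed(token_id: int) -> list:
    """Token ID -> dense vector."""
    return embedding_table[token_id]


def to_logits(hidden: list) -> list:
    """Dense vector -> one score per vocabulary entry."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in output_projection]


def softmax(logits: list) -> list:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


token_id = 42
hidden = embed(token_id)            # transformer layers omitted in this sketch
probs = softmax(to_logits(hidden))  # expands back out to vocabulary size
next_token_id = max(range(VOCAB_SIZE), key=probs.__getitem__)

print(len(hidden), len(probs))
```

The point of the sketch is the shape change: a discrete ID becomes a small dense vector on the way in and is expanded back to a full vocabulary-sized distribution on the way out.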
Vision Tokens: Direct Compression
Raw pixels: 64×64×3 = 12,288 values
   ↓
Vision encoder → 4096-dim
   ↓
LLM processing (4096-dim)
   ↓
[No output - vision tokens are input only]
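Vision tokens are already continuous: the encoder compresses pixels directly into the latent space, and they stay there. There is no vocabulary lookup on the way in and no expansion to 129K logits on the way out, because the model never has to predict a vision token.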
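For contrast with the text path, here is a toy sketch of the vision path. A single random linear projection stands in for the vision encoder, which in practice is a much deeper network; the hidden size is shrunk to 8 (standing in for 4096), and the pixel values are random. Only the 64×64×3 region size comes from the text.

```python
import random

REGION_SIDE = 64   # pixels covered by one vision token, per the text
CHANNELS = 3       # RGB
HIDDEN_DIM = 8     # toy stand-in for the 4096-dim latent space

rng = random.Random(0)

# Flattened pixel region: 64 * 64 * 3 = 12,288 continuous values in [0, 1).
region = [rng.random() for _ in range(REGION_SIDE * REGION_SIDE * CHANNELS)]

# Assumption: one linear map replaces the real multi-stage encoder.
projection = [[rng.gauss(0, 0.01) for _ in range(len(region))]
              for _ in range(HIDDEN_DIM)]


def encode_region(pixels: list) -> list:
    """Project raw pixel values straight into the latent space."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in projection]


vision_token = encode_region(region)

# No token ID, no vocabulary, no logits: input and output are both continuous.
print(len(region), len(vision_token))
# 12288 8
```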
References
- DeepSeek-OCR Paper: "DeepSeek-OCR: Contexts Optical Compression"
- Fox benchmark compression results (Table 2, page 10)
- DeepEncoder architecture (Section 3.2, pages 5-7)
 
					