---
license: apache-2.0
pipeline_tag: image-to-text
language:
- en
- fr
- de
- es
- it
- nl
- pt
- sv
- da
library_name: transformers
tags:
- ocr
- document-understanding
- vision-language
- pdf
- tables
- forms
---

<div align="center">
  <img src="lightonocr-banner.png" alt="LightOn OCR-1B Banner" width="400"/>
</div>

# LightOnOCR-1B-1025

Full BF16 version of the model. We recommend this variant for inference and further fine-tuning.

**LightOnOCR-1B** is a compact, end-to-end vision–language model for Optical Character Recognition (OCR) and document understanding. It achieves state-of-the-art accuracy in its weight class while being several times faster and cheaper than larger general-purpose VLMs.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/lightonai/LightOnOCR-1B-1025/blob/main/notebook.ipynb)

📝 **[Read the full blog post](https://huggingface.co/blog/lightonai/lightonocr/)** | 🚀 **[Try the demo](https://huggingface.co/spaces/lightonai/LightOnOCR-1B-Demo)** | 📓 **[Finetuning notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)**

**Highlights**

* ⚡ **Speed:** 5× faster than dots.ocr, 2× faster than PaddleOCR-VL-0.9B, and 1.73× faster than DeepSeek-OCR
* 💸 **Efficiency:** Processes 5.71 pages/s on a single H100 (~493k pages/day) for **<$0.01 per 1,000 pages**
* 🧠 **End-to-End:** Fully differentiable, no external OCR pipeline
* 🧾 **Versatile:** Handles tables, receipts, forms, multi-column layouts, and math notation
* 🌍 **Compact variants:** 32k and 16k vocab options for European languages

---

## Model Overview

**LightOnOCR** combines a Vision Transformer encoder (Pixtral-based) with a lightweight text decoder (Qwen3-based) distilled from high-quality open VLMs.
It is optimized for document parsing tasks, producing accurate, layout-aware text extraction from high-resolution pages.


---

## Benchmarks

| Model              | ArXiv | Old Scans | Math | Tables | Multi-Column | Tiny Text | Base | Overall |
| :----------------- | :---: | :-------: | :--: | :----: | :----------: | :-------: | :--: | :-----: |
| [LightOnOCR-1B-1025](https://huggingface.co/lightonai/LightOnOCR-1B-1025) (151k vocab) | 81.4 | 71.6 | 76.4 | 35.2 | 80.0 | 88.7 | 99.5 | **76.1** |
| [LightOnOCR-1B-32k](https://huggingface.co/lightonai/LightOnOCR-0.9B-32k-1025) (32k vocab) | 80.6 | 66.2 | 73.5 | 33.5 | 71.2 | 87.6 | 99.5 | **73.1** |
| [LightOnOCR-1B-16k](https://huggingface.co/lightonai/LightOnOCR-0.9B-16k-1025) (16k vocab) | 82.3 | 72.9 | 75.3 | 33.5 | 78.6 | 85.1 | 99.8 | **75.4** |

All benchmarks were evaluated with **vLLM** on olmOCR-Bench.

---

## Installation

```bash
uv venv --python 3.12 --seed
source .venv/bin/activate

export VLLM_COMMIT=e88bdd60d9a25d985168c9f4a60ab10095236d7c
uv pip install vllm \
    'triton-kernels @ git+https://github.com/triton-lang/[email protected]#subdirectory=python/triton_kernels' \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT} \
    --prerelease=allow

uv pip install pypdfium2 pillow requests
```

## Start Server

```bash
vllm serve lightonai/LightOnOCR-1B-1025 \
    --limit-mm-per-prompt '{"image": 1}' \
    --async-scheduling
```

## PDF Inference

```python
import base64
import requests
import pypdfium2 as pdfium
import io

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/LightOnOCR-1B-1025"

# Download PDF from arXiv
pdf_url = "https://arxiv.org/pdf/2412.13663"
pdf_data = requests.get(pdf_url).content

# Open PDF and convert first page to image
pdf = pdfium.PdfDocument(pdf_data)
page = pdf[0]
# Render at 200 DPI (scale factor = 200/72 ≈ 2.77)
pil_image = page.render(scale=2.77).to_pil()

# Convert to base64
buffer = io.BytesIO()
pil_image.save(buffer, format="PNG")
image_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')

# Make request
payload = {
    "model": MODEL,
    "messages": [{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_base64}"}
        }]
    }],
    "max_tokens": 4096,
    "temperature": 0.2,
    "top_p": 0.9,
}

response = requests.post(ENDPOINT, json=payload)
text = response.json()['choices'][0]['message']['content']
print(text)
```
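
The example above handles a single page. For whole documents, one straightforward extension is to render every page and send one request per page; the `ocr_page` helper below is an illustrative sketch (the function name and the simple sequential loop are ours, not part of the model card) that reuses the endpoint, sampling parameters, and the `pdf` object from the example above.

```python
def ocr_page(pil_image, endpoint=ENDPOINT, model=MODEL):
    """Send one rendered page to the vLLM server and return the extracted text."""
    buffer = io.BytesIO()
    pil_image.save(buffer, format="PNG")
    image_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            }],
        }],
        "max_tokens": 4096,
        "temperature": 0.2,
        "top_p": 0.9,
    }
    response = requests.post(endpoint, json=payload)
    return response.json()["choices"][0]["message"]["content"]

# OCR every page of the PDF opened above, one request per page.
pages_text = [ocr_page(pdf[i].render(scale=2.77).to_pil()) for i in range(len(pdf))]
print("\n\n".join(pages_text))
```
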
---

## Rendering and Preprocessing Tips

* Render PDFs to **PNG** or **JPEG** with the longest side at roughly **1540 px** (see the sketch below)
* Maintain the aspect ratio to preserve text geometry
* Use one image per page; batching is handled by vLLM
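
The snippet below illustrates the first two tips: it computes the render scale so the longest side of the page lands near 1540 px while the aspect ratio is preserved. The input path `document.pdf` is a placeholder; adapt it to your file.

```python
import pypdfium2 as pdfium

TARGET_LONGEST_SIDE = 1540  # recommended longest dimension, in pixels

pdf = pdfium.PdfDocument("document.pdf")  # placeholder path
page = pdf[0]

# Page size is reported in PDF points (1 pt = 1/72 inch). Pick the scale that
# maps the longer side to ~1540 px; using the same factor on both axes
# preserves the aspect ratio.
width_pt, height_pt = page.get_size()
scale = TARGET_LONGEST_SIDE / max(width_pt, height_pt)

pil_image = page.render(scale=scale).to_pil()
pil_image.save("page_0.png")  # PNG keeps text edges crisp
print(pil_image.size)
```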

---

## Variants

| Variant                                                                            | Description                                   |
| :--------------------------------------------------------------------------------- | :-------------------------------------------- |
| **[LightOnOCR-1B-1025](https://huggingface.co/lightonai/LightOnOCR-1B-1025)**      | Full multilingual model (default)             |
| **[LightOnOCR-1B-32k](https://huggingface.co/lightonai/LightOnOCR-0.9B-32k-1025)** | Fastest pruned-vocabulary version (32k tokens) optimized for European languages |
| **[LightOnOCR-1B-16k](https://huggingface.co/lightonai/LightOnOCR-0.9B-16k-1025)** | Most compact variant with smallest vocabulary          |

---

## Fine-tuning

**Transformers integration is coming soon for training and inference.**

LightOnOCR is fully differentiable and supports:

* LoRA fine-tuning
* Domain adaptation (receipts, scientific articles, forms, etc.)
* Multilingual fine-tuning with task-specific corpora

📓 **[Finetuning notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)**
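
As a rough sketch of what a LoRA setup could look like once the Transformers integration is available (the `AutoModelForVision2Seq` loading path, the target module names, and the hyperparameters below are our assumptions for illustration, not a documented API for this checkpoint):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "lightonai/LightOnOCR-1B-1025"
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype="bfloat16")
processor = AutoProcessor.from_pretrained(model_id)

# Attach low-rank adapters to the attention projections of the text decoder;
# rank, alpha, and dropout are illustrative starting points, not tuned values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The finetuning notebook linked above covers the end-to-end workflow (data loading, collation, and training loop); the sketch only shows the adapter configuration step.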

---

## Data

Trained on a diverse large-scale PDF corpus covering:

* Scientific papers, books, receipts, invoices, tables, forms, and handwritten text
* Multiple languages (Latin alphabet dominant)
* Real and synthetic document scans

The dataset will be released under an open license.

---

## License

Apache License 2.0

---

## Citation

```
@misc{lightonocr2025,
  title        = {LightOnOCR-1B: End-to-End and Efficient Domain-Specific Vision-Language Models for OCR},
  author       = {Said Taghadouini and Baptiste Aubertin and Adrien Cavaillès},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/blog/lightonai/lightonocr}}
}
```