---
license: apache-2.0
pipeline_tag: image-to-text
language:
- en
- fr
- de
- es
- it
- nl
- pt
- sv
- da
library_name: transformers
tags:
- ocr
- document-understanding
- vision-language
- pdf
- tables
- forms
---
<div align="center">
<img src="lightonocr-banner.png" alt="LightOn OCR-1B Banner" width="400"/>
</div>
# LightOnOCR-1B-1025
Full BF16 version of the model. We recommend this variant for inference and further fine-tuning.
**LightOnOCR-1B** is a compact, end-to-end vision–language model for Optical Character Recognition (OCR) and document understanding. It achieves state-of-the-art accuracy in its weight class while being several times faster and cheaper than larger general-purpose VLMs.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/lightonai/LightOnOCR-1B-1025/blob/main/notebook.ipynb)
📝 **[Read the full blog post](https://huggingface.co/blog/lightonai/lightonocr/)** | 🚀 **[Try the demo](https://huggingface.co/spaces/lightonai/LightOnOCR-1B-Demo)** | 📓 **[Finetuning notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)**
**Highlights**
* ⚡ **Speed:** 5× faster than dots.ocr, 2× faster than PaddleOCR-VL-0.9B, 1.73× faster than DeepSeekOCR
* 💸 **Efficiency:** Processes 5.71 pages/s on a single H100 (~493k pages/day) for **<$0.01 per 1,000 pages**
* 🧠 **End-to-End:** Fully differentiable, no external OCR pipeline
* 🧾 **Versatile:** Handles tables, receipts, forms, multi-column layouts, and math notation
* 🌍 **Compact variants:** 32k and 16k vocab options for European languages
---
## Model Overview
**LightOnOCR** combines a Pixtral-based Vision Transformer encoder with a lightweight Qwen3-based text decoder, distilled from high-quality open VLMs.
It is optimized for document parsing tasks, producing accurate, layout-aware text extraction from high-resolution pages.
---
## Benchmarks
| Model | ArXiv | Old Scans | Math | Tables | Multi-Column | Tiny Text | Base | Overall |
| :----------------- | :---: | :-------: | :--: | :----: | :----------: | :-------: | :--: | :-----: |
| [LightOnOCR-1B-1025](https://huggingface.co/lightonai/LightOnOCR-1B-1025) (151k vocab) | 81.4 | 71.6 | 76.4 | 35.2 | 80.0 | 88.7 | 99.5 | **76.1** |
| [LightOnOCR-1B-32k](https://huggingface.co/lightonai/LightOnOCR-0.9B-32k-1025) (32k vocab) | 80.6 | 66.2 | 73.5 | 33.5 | 71.2 | 87.6 | 99.5 | **73.1** |
| [LightOnOCR-1B-16k](https://huggingface.co/lightonai/LightOnOCR-0.9B-16k-1025) (16k vocab) | 82.3 | 72.9 | 75.3 | 33.5 | 78.6 | 85.1 | 99.8 | **75.4** |
All benchmarks were evaluated with **vLLM** on Olmo-Bench.
---
## Installation
```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
export VLLM_COMMIT=e88bdd60d9a25d985168c9f4a60ab10095236d7c
uv pip install vllm \
'triton-kernels @ git+https://github.com/triton-lang/[email protected]#subdirectory=python/triton_kernels' \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT} \
--prerelease=allow
uv pip install pypdfium2 pillow requests
```
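To verify the pinned build installed correctly, a quick import check suffices (a minimal sanity check; the exact version string depends on the commit you pinned):
```python
# Sanity check: the pinned vLLM wheel should import cleanly.
import vllm

print(vllm.__version__)
```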
## Start Server
```bash
vllm serve lightonai/LightOnOCR-1B-1025 \
--limit-mm-per-prompt '{"image": 1}' \
--async-scheduling
```
## PDF Inference
```python
import base64
import requests
import pypdfium2 as pdfium
import io
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/LightOnOCR-1B-1025"
# Download PDF from arXiv
pdf_url = "https://arxiv.org/pdf/2412.13663"
pdf_data = requests.get(pdf_url).content
# Open PDF and convert first page to image
pdf = pdfium.PdfDocument(pdf_data)
page = pdf[0]
# Render at 200 DPI (scale factor = 200/72 ≈ 2.77)
pil_image = page.render(scale=2.77).to_pil()
# Convert to base64
buffer = io.BytesIO()
pil_image.save(buffer, format="PNG")
image_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
# Make request
payload = {
"model": MODEL,
"messages": [{
"role": "user",
"content": [{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{image_base64}"}
}]
}],
"max_tokens": 4096,
"temperature": 0.2,
"top_p": 0.9,
}
response = requests.post(ENDPOINT, json=payload)
text = response.json()['choices'][0]['message']['content']
print(text)
```
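For whole documents, the same request can be issued once per page. The sketch below renders pages sequentially (PDFium is not thread-safe) and parallelizes only the HTTP calls so vLLM can batch them server-side; it reuses the endpoint and sampling settings from the example above, and the worker count is an illustrative choice:
```python
import base64
import io
from concurrent.futures import ThreadPoolExecutor

import pypdfium2 as pdfium
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/LightOnOCR-1B-1025"

def render_page(page) -> str:
    """Render one PDF page at ~200 DPI and return a base64-encoded PNG."""
    pil_image = page.render(scale=2.77).to_pil()
    buffer = io.BytesIO()
    pil_image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def ocr_image(image_base64: str) -> str:
    """Send one page image to the server and return the transcription."""
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            }],
        }],
        "max_tokens": 4096,
        "temperature": 0.2,
        "top_p": 0.9,
    }
    response = requests.post(ENDPOINT, json=payload)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

pdf = pdfium.PdfDocument(requests.get("https://arxiv.org/pdf/2412.13663").content)
# Render sequentially, then fan the requests out across a thread pool.
images = [render_page(pdf[i]) for i in range(len(pdf))]
with ThreadPoolExecutor(max_workers=8) as pool:
    pages_text = list(pool.map(ocr_image, images))
print("\n\n".join(pages_text))
```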
---
## Rendering and Preprocessing Tips
* Render PDFs to **PNG** or **JPEG** at a target longest dimension of **1540px** (see the sketch after this list)
* Maintain aspect ratio to preserve text geometry
* Use one image per page; batching supported by vLLM
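The fixed `scale=2.77` in the inference example targets 200 DPI; to hit the recommended 1540px longest side exactly, derive the scale from the page geometry instead (a minimal sketch using pypdfium2, with a hypothetical `document.pdf` path):
```python
import pypdfium2 as pdfium

TARGET_LONGEST_PX = 1540

pdf = pdfium.PdfDocument("document.pdf")
page = pdf[0]

# Page dimensions come back in PDF points (1/72 inch); choosing the scale
# from the longer side maps it to 1540px while preserving the aspect ratio.
width_pts, height_pts = page.get_size()
scale = TARGET_LONGEST_PX / max(width_pts, height_pts)
pil_image = page.render(scale=scale).to_pil()
print(pil_image.size)  # longest dimension ≈ 1540
```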
---
## Variants
| Variant | Description |
| :--------------------------------------------------------------------------------- | :-------------------------------------------- |
| **[LightOnOCR-1B-1025](https://huggingface.co/lightonai/LightOnOCR-1B-1025)** | Full multilingual model (default) |
| **[LightOnOCR-1B-32k](https://huggingface.co/lightonai/LightOnOCR-0.9B-32k-1025)** | Fastest pruned-vocabulary version (32k tokens) optimized for European languages |
| **[LightOnOCR-1B-16k](https://huggingface.co/lightonai/LightOnOCR-0.9B-16k-1025)** | Most compact variant with smallest vocabulary |
---
## Fine-tuning
**Transformers integration is coming soon for training and inference.**
LightOnOCR is fully differentiable and supports:
* LoRA fine-tuning (see the sketch below)
* Domain adaptation (receipts, scientific articles, forms, etc.)
* Multilingual fine-tuning with task-specific corpora
📓 **[Finetuning notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)**
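Until the Transformers integration lands, the notebook above is the working recipe. For orientation only, a LoRA setup would follow the standard PEFT pattern; the hyperparameters and target-module names below are illustrative assumptions, not the model's confirmed configuration:
```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA hyperparameters; tune rank and alpha for your domain.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Hypothetical target modules: the attention projections of the decoder.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Once the model loads through Transformers, wrapping it is one call:
# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
```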
---
## Data
Trained on a diverse large-scale PDF corpus covering:
* Scientific papers, books, receipts, invoices, tables, forms, and handwritten text
* Multiple languages (Latin alphabet dominant)
* Real and synthetic document scans
The dataset will be released under an open license.
---
## License
Apache License 2.0
---
## Citation
```bibtex
@misc{lightonocr2025,
title = {LightOnOCR-1B: End-to-End and Efficient Domain-Specific Vision-Language Models for OCR},
author = {Said Taghadouini and Baptiste Aubertin and Adrien Cavaillès},
year = {2025},
howpublished = {\url{https://huggingface.co/blog/lightonai/lightonocr}}
}
``` |