---
license: apache-2.0
pipeline_tag: image-to-text
language:
- en
- fr
- de
- es
- it
- nl
- pt
- sv
- da
library_name: transformers
tags:
- ocr
- document-understanding
- vision-language
- pdf
- tables
- forms
---
<div align="center">
<img src="lightonocr-banner.png" alt="LightOn OCR-1B Banner" width="400"/>
</div>
# LightOnOCR-1B-1025
Full BF16 version of the model. We recommend this variant for inference and further fine-tuning.
**LightOnOCR-1B** is a compact, end-to-end vision–language model for Optical Character Recognition (OCR) and document understanding. It achieves state-of-the-art accuracy in its weight class while being several times faster and cheaper than larger general-purpose VLMs.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https%3A//huggingface.co/lightonai/LightOnOCR-1B-1025/blob/main/notebook.ipynb)
📝 **[Read the full blog post](https://huggingface.co/blog/lightonai/lightonocr/)** | 🚀 **[Try the demo](https://huggingface.co/spaces/lightonai/LightOnOCR-1B-Demo)** | 📓 **[Finetuning notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)**
**Highlights**
* ⚡ **Speed:** 5× faster than dots.ocr, 2× faster than PaddleOCR-VL-0.9B, 1.73× faster than DeepSeekOCR
* 💸 **Efficiency:** Processes 5.71 pages/s on a single H100 (~493k pages/day) for **<$0.01 per 1,000 pages**
* 🧠 **End-to-End:** Fully differentiable, no external OCR pipeline
* 🧾 **Versatile:** Handles tables, receipts, forms, multi-column layouts, and math notation
* 🌍 **Compact variants:** 32k and 16k vocab options for European languages
---
## Model Overview
**LightOnOCR** combines a Pixtral-based Vision Transformer encoder with a lightweight Qwen3-based text decoder, distilled from high-quality open VLMs.
It is optimized for document parsing tasks, producing accurate, layout-aware text extraction from high-resolution pages.
---
## Benchmarks
| Model | ArXiv | Old Scans | Math | Tables | Multi-Column | Tiny Text | Base | Overall |
| :----------------- | :---: | :-------: | :--: | :----: | :----------: | :-------: | :--: | :-----: |
| [LightOnOCR-1B-1025](https://huggingface.co/lightonai/LightOnOCR-1B-1025) (151k vocab) | 81.4 | 71.6 | 76.4 | 35.2 | 80.0 | 88.7 | 99.5 | **76.1** |
| [LightOnOCR-1B-32k](https://huggingface.co/lightonai/LightOnOCR-0.9B-32k-1025) (32k vocab) | 80.6 | 66.2 | 73.5 | 33.5 | 71.2 | 87.6 | 99.5 | **73.1** |
| [LightOnOCR-1B-16k](https://huggingface.co/lightonai/LightOnOCR-0.9B-16k-1025) (16k vocab) | 82.3 | 72.9 | 75.3 | 33.5 | 78.6 | 85.1 | 99.8 | **75.4** |
All benchmarks were evaluated with **vLLM** on Olmo-Bench.
---
## Installation
```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
export VLLM_COMMIT=e88bdd60d9a25d985168c9f4a60ab10095236d7c
uv pip install vllm \
'triton-kernels @ git+https://github.com/triton-lang/[email protected]#subdirectory=python/triton_kernels' \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/${VLLM_COMMIT} \
--prerelease=allow
uv pip install pypdfium2 pillow requests
```
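To verify the pinned build installed correctly, a quick import check suffices (a minimal sanity check; the exact version string depends on the commit you pinned):
```python
# Sanity check: the pinned vLLM wheel should import cleanly.
import vllm

print(vllm.__version__)
```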
## Start Server
```bash
vllm serve lightonai/LightOnOCR-1B-1025 \
--limit-mm-per-prompt '{"image": 1}' \
--async-scheduling
```
## PDF Inference
```python
import base64
import requests
import pypdfium2 as pdfium
import io
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/LightOnOCR-1B-1025"
# Download PDF from arXiv
pdf_url = "https://arxiv.org/pdf/2412.13663"
pdf_data = requests.get(pdf_url).content
# Open PDF and convert first page to image
pdf = pdfium.PdfDocument(pdf_data)
page = pdf[0]
# Render at 200 DPI (scale factor = 200/72 ≈ 2.77)
pil_image = page.render(scale=2.77).to_pil()
# Convert to base64
buffer = io.BytesIO()
pil_image.save(buffer, format="PNG")
image_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
# Make request
payload = {
"model": MODEL,
"messages": [{
"role": "user",
"content": [{
"type": "image_url",
"image_url": {"url": f"data:image/png;base64,{image_base64}"}
}]
}],
"max_tokens": 4096,
"temperature": 0.2,
"top_p": 0.9,
}
response = requests.post(ENDPOINT, json=payload)
text = response.json()['choices'][0]['message']['content']
print(text)
```
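For whole documents, the same request can be issued once per page. The sketch below renders pages sequentially (PDFium is not thread-safe) and parallelizes only the HTTP calls so vLLM can batch them server-side; it reuses the endpoint and sampling settings from the example above, and the worker count is an illustrative choice:
```python
import base64
import io
from concurrent.futures import ThreadPoolExecutor

import pypdfium2 as pdfium
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/LightOnOCR-1B-1025"

def render_page(page) -> str:
    """Render one PDF page at ~200 DPI and return a base64-encoded PNG."""
    pil_image = page.render(scale=2.77).to_pil()
    buffer = io.BytesIO()
    pil_image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def ocr_image(image_base64: str) -> str:
    """Send one page image to the server and return the transcription."""
    payload = {
        "model": MODEL,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_base64}"},
            }],
        }],
        "max_tokens": 4096,
        "temperature": 0.2,
        "top_p": 0.9,
    }
    response = requests.post(ENDPOINT, json=payload)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

pdf = pdfium.PdfDocument(requests.get("https://arxiv.org/pdf/2412.13663").content)
# Render sequentially, then fan the requests out across a thread pool.
images = [render_page(pdf[i]) for i in range(len(pdf))]
with ThreadPoolExecutor(max_workers=8) as pool:
    pages_text = list(pool.map(ocr_image, images))
print("\n\n".join(pages_text))
```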
---
## Rendering and Preprocessing Tips
* Render PDFs to **PNG** or **JPEG** at a target longest dimension of **1540px** (see the sketch after this list)
* Maintain aspect ratio to preserve text geometry
* Use one image per page; batching supported by vLLM
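The fixed `scale=2.77` in the inference example targets 200 DPI; to hit the recommended 1540px longest side exactly, derive the scale from the page geometry instead (a minimal sketch using pypdfium2, with a hypothetical `document.pdf` path):
```python
import pypdfium2 as pdfium

TARGET_LONGEST_PX = 1540

pdf = pdfium.PdfDocument("document.pdf")
page = pdf[0]

# Page dimensions come back in PDF points (1/72 inch); choosing the scale
# from the longer side maps it to 1540px while preserving the aspect ratio.
width_pts, height_pts = page.get_size()
scale = TARGET_LONGEST_PX / max(width_pts, height_pts)
pil_image = page.render(scale=scale).to_pil()
print(pil_image.size)  # longest dimension ≈ 1540
```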
---
## Variants
| Variant | Description |
| :--------------------------------------------------------------------------------- | :-------------------------------------------- |
| **[LightOnOCR-1B-1025](https://huggingface.co/lightonai/LightOnOCR-1B-1025)** | Full multilingual model (default) |
| **[LightOnOCR-1B-32k](https://huggingface.co/lightonai/LightOnOCR-0.9B-32k-1025)** | Fastest pruned-vocabulary version (32k tokens) optimized for European languages |
| **[LightOnOCR-1B-16k](https://huggingface.co/lightonai/LightOnOCR-0.9B-16k-1025)** | Most compact variant with smallest vocabulary |
---
## Fine-tuning
**Transformers integration is coming soon for training and inference.**
LightOnOCR is fully differentiable and supports:
* LoRA fine-tuning (see the sketch below)
* Domain adaptation (receipts, scientific articles, forms, etc.)
* Multilingual fine-tuning with task-specific corpora
📓 **[Finetuning notebook](https://colab.research.google.com/drive/1WjbsFJZ4vOAAlKtcCauFLn_evo5UBRNa?usp=sharing)**
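Until the Transformers integration lands, the notebook above is the working recipe. For orientation only, a LoRA setup would follow the standard PEFT pattern; the hyperparameters and target-module names below are illustrative assumptions, not the model's confirmed configuration:
```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA hyperparameters; tune rank and alpha for your domain.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Hypothetical target modules: the attention projections of the decoder.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Once the model loads through Transformers, wrapping it is one call:
# peft_model = get_peft_model(model, lora_config)
# peft_model.print_trainable_parameters()
```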
---
## Data
Trained on a diverse large-scale PDF corpus covering:
* Scientific papers, books, receipts, invoices, tables, forms, and handwritten text
* Multiple languages (Latin alphabet dominant)
* Real and synthetic document scans
The dataset will be released under an open license.
---
## License
Apache License 2.0
---
## Citation
```bibtex
@misc{lightonocr2025,
title = {LightOnOCR-1B: End-to-End and Efficient Domain-Specific Vision-Language Models for OCR},
author = {Said Taghadouini and Baptiste Aubertin and Adrien Cavaillès},
year = {2025},
howpublished = {\url{https://huggingface.co/blog/lightonai/lightonocr}}
}
``` |