LightOn AI

Team

company

Verified

https://lighton.ai

LightOnIO

lightonai

AI & ML interests

None defined yet.

Recent Activity

ameliechatelain updated a dataset 1 day ago

lightonai/veracier-industries

ADgui updated a dataset 1 day ago

lightonai/veracier-industries

ADgui published a dataset 1 day ago

lightonai/veracier-industries

Papers

ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Articles

DenseOn with the LateOn: Open State-of-the-Art Single and Multi-Vector Models

7 days ago

ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models?

Feb 19

LateOn-Code & ColGrep: LightOn unveils state-of-the-art code retrieval models and code search tooling

Feb 12

LightOnOCR-2-1B: a lightweight high-performance end-to-end OCR model family

Jan 19

LightOnOCR-1B: The Case for End-to-End and Efficient Domain-Specific Vision-Language Models for OCR

Oct 23, 2025

lightonai's collections 14

DenseOn & LateOn

A collection of open state-of-the-art single and multi-vector models

lightonai/LateOn

Sentence Similarity • 0.1B • Updated 7 days ago • 1.43k • 33
lightonai/DenseOn

Sentence Similarity • 0.1B • Updated 7 days ago • 586 • 22
lightonai/LateOn-unsupervised

Sentence Similarity • 0.1B • Updated 7 days ago • 114 • 7
lightonai/DenseOn-unsupervised

Sentence Similarity • 0.1B • Updated 7 days ago • 19 • 7

ColBERT-Zero 🐶

First large-scale fully pre-trained ColBERT model using only public data, outperforming GTE-ModernColBERT and GTE-ModernBERT

ColBERT-Zero: To Pre-train Or Not To Pre-train ColBERT models

Paper • 2602.16609 • Published Feb 18 • 7
lightonai/ColBERT-Zero

Sentence Similarity • 0.1B • Updated Feb 23 • 5.34k • 37
lightonai/ColBERT-Zero-supervised

Sentence Similarity • 0.1B • Updated Feb 23 • 60 • 3
lightonai/ColBERT-Zero-unsupervised

Sentence Similarity • 0.1B • Updated Feb 23 • 39 • 2

OriOn 💫

Visual long-document VLMs based on Mistral-Small-3.1-24B-Instruct-2503 and Qwen3-VL-32B-Instruct

lightonai/OriOn-Qwen-SR1

Image-Text-to-Text • 33B • Updated 19 days ago • 190 • 4
lightonai/OriOn-Qwen

33B • Updated Feb 18 • 13 • 8
lightonai/OriOn-Mistral

24B • Updated Feb 18 • 39 • 4
lightonai/MMLBD-C

Viewer • Updated Feb 18 • 1.08k • 202 • 5

LightOnOCR 🦉

The Case for End-to-End and Efficient Domain-Specific Vision-Language Models for OCR

lightonai/LightOnOCR-1B-1025

Image-to-Text • Updated Feb 20 • 149k • 248
lightonai/LightOnOCR-0.9B-16k-1025

Updated Feb 20 • 125 • 12
lightonai/LightOnOCR-0.9B-32k-1025

Updated Feb 20 • 21 • 19
LightOnOCR 1B Demo

Running • Agents • 42 • Extract text from images or PDFs with OCR

Ettin

A collection of SOTA, open-data, paired encoder-only and decoder-only models ranging from 17M to 1B parameters

Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Paper • 2507.11412 • Published Jul 15, 2025 • 31
jhu-clsp/ettin-encoder-17m

Fill-Mask • Updated Jul 16, 2025 • 7.12k • 15
jhu-clsp/ettin-encoder-32m

Feature Extraction • Updated Jul 18, 2025 • 2.52k • 11
jhu-clsp/ettin-encoder-150m

Fill-Mask • Updated Jul 18, 2025 • 21.7k • 10

PAGnol 🇫🇷

French language models. These models were trained in early 2021, following the scaling laws known at the time, on the exact same training data as CamemBERT.

lightonai/pagnol-small

Text Generation • Updated Mar 21, 2024 • 818 • 1
lightonai/pagnol-medium

Text Generation • 0.4B • Updated Jan 6, 2025 • 37 • 1
lightonai/pagnol-large

Text Generation • Updated Mar 24, 2024 • 21 • 1
lightonai/pagnol-xl

Text Generation • 2B • Updated Nov 7, 2024 • 44 • 1

Mamba 🐍

lightonai/mambaoutai

Text Generation • 2B • Updated Apr 25, 2024 • 28 • 5

LightOnOCR-2 🦉

LightOnOCR-2-1B: a lightweight high-performance end-to-end OCR model family

lightonai/LightOnOCR-2-1B

Image-Text-to-Text • 1B • Updated 7 days ago • 831k • 668
lightonai/LightOnOCR-2-1B-bbox

Image-Text-to-Text • 1B • Updated Jan 23 • 4.46k • 25
LightOnOCR 2 1B Demo

Running on Zero • Agents • Featured • 113 • Extract text from images or PDFs with OCR
lightonai/LightOnOCR-2-1B-base

Image-Text-to-Text • 1B • Updated Jan 21 • 8.94k • 11

LateOn-Code 💻

State-of-the-art late interaction code retrieval models

lightonai/LateOn-Code-edge

Sentence Similarity • 16.8M • Updated Feb 12 • 2.02k • 27
lightonai/LateOn-Code

Sentence Similarity • 0.1B • Updated Feb 12 • 261 • 25
lightonai/LateOn-Code-edge-pretrain

Sentence Similarity • 16.8M • Updated Feb 12 • 30 • 4
lightonai/LateOn-Code-pretrain

Sentence Similarity • 0.1B • Updated Feb 13 • 26 • 3

PyLate 🐕

State-of-the-art late interaction models trained using PyLate

lightonai/Reason-ModernColBERT

Sentence Similarity • 0.1B • Updated 14 days ago • 6.94k • 241
lightonai/GTE-ModernColBERT-v1

Sentence Similarity • Updated Jan 21 • 107k • 167
lightonai/LateOn-Code-edge

Sentence Similarity • 16.8M • Updated Feb 12 • 2.02k • 27
lightonai/LateOn-Code

Sentence Similarity • 0.1B • Updated Feb 12 • 261 • 25

Embeddings datasets ⚡️

This collection gathers datasets for embedding pre-training and fine-tuning.

lightonai/embeddings-pre-training

Viewer • Updated 12 days ago • 1.38B • 1.66k • 41
lightonai/nanobeir-multilingual

Viewer • Updated Sep 16, 2025 • 522k • 405 • 11

ModernBERT

Bringing BERT into modernity via both architecture changes and scaling

answerdotai/ModernBERT-base

Fill-Mask • 0.1B • Updated Jan 15, 2025 • 1.11M • 1.03k
lightonai/GTE-ModernColBERT-v1

Sentence Similarity • Updated Jan 21 • 107k • 167
lightonai/Reason-ModernColBERT

Sentence Similarity • 0.1B • Updated 14 days ago • 6.94k • 241
lightonai/modernbert-embed-large

Sentence Similarity • 0.4B • Updated May 14, 2025 • 8.21k • 33

RITA 🧿

A suite of autoregressive generative models for protein sequences, with up to 1.2B parameters, trained on over 280 million protein sequences.

lightonai/RITA_s

Text Generation • 85.1M • Updated Nov 13, 2024 • 4.55k • 3
lightonai/RITA_m

Text Generation • 0.3B • Updated Jan 6, 2025 • 50
lightonai/RITA_l

Text Generation • Updated May 19, 2022 • 2.92k
lightonai/RITA_xl

Text Generation • 1B • Updated Dec 10, 2024 • 3.97k • 3

ArabicWeb24-ablation-models

900M models trained on 25B tokens to compare different data processing choices (filtering, sentence dedup, minhash, etc.)

lightonai/ArabicWeb24-ablation-model-v1

Text Generation • Updated Aug 19, 2024 • 15
lightonai/ArabicWeb24-ablation-model-v5

Text Generation • Updated Aug 19, 2024 • 10


Team members 24
