arxiv:2601.14251

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Published on Jan 20

Submitted by Said Taghadouini (LightOn AI) on Jan 21


Authors: Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin

Abstract

AI-generated summary: LightOnOCR-2-1B is a compact 1B-parameter vision-language model that performs end-to-end document image-to-text conversion with improved localization and robustness through specialized training techniques.

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
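To make the "IoU-based rewards" over normalized bounding boxes concrete, here is a minimal sketch of intersection-over-union between two boxes. The box layout `(x_min, y_min, x_max, y_max)` with coordinates in [0, 1] and the function name `iou` are assumptions for illustration; the abstract does not specify the paper's exact box format or how the reward is shaped from the IoU value.

```python
from typing import Tuple

# Assumed layout: (x_min, y_min, x_max, y_max), each normalized to [0, 1].
Box = Tuple[float, float, float, float]


def iou(pred: Box, target: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (illustrative only)."""
    px0, py0, px1, py1 = pred
    tx0, ty0, tx1, ty1 = target

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(px1, tx1) - max(px0, tx0))
    inter_h = max(0.0, min(py1, ty1) - max(py0, ty0))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the overlap counted twice.
    union = (px1 - px0) * (py1 - py0) + (tx1 - tx0) * (ty1 - ty0) - inter
    return inter / union if union > 0 else 0.0
```

Under these assumptions, a reward could be the IoU itself: identical boxes score the maximum and disjoint boxes score zero, which gives a verifiable signal for the predicted image regions.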


Community


librarian-bot

about 10 hours ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

NVIDIA Nemotron Parse 1.1 (2025)
Qwen3-VL Technical Report (2025)
STEP3-VL-10B Technical Report (2026)
dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model (2025)
HunyuanOCR Technical Report (2025)
DAVE: A VLM Vision Encoder for Document Understanding and Web Agents (2025)
CropVLM: Learning to Zoom for Fine-Grained Vision-Language Perception (2025)


Models citing this paper 11


Datasets citing this paper 2

Spaces citing this paper 7

Collections including this paper 4