DeepSeek-OCR running in Google Colab
The URL of the shared Google Colab notebook:
https://colab.research.google.com/drive/1Fjzv3UYNoOt28HpM0RMUc8kG34EFgvuu?usp=sharing
DeepSeek-OCR is a new open-source, vision-language OCR model from DeepSeek-AI (the same lab behind the DeepSeek-V and DeepSeek-R series).
It's built to read complex, real-world documents (screenshots, PDFs, forms, tables, and handwritten or noisy text) and output clean, structured Markdown.
Core capabilities
Multimodal (Vision + Language):
Uses a hybrid vision encoder + causal text decoder to "see" layouts and generate text like a language model, rather than just classifying characters.
Markdown output:
Instead of raw text, it structures output with Markdown syntax (headings, bullet lists, tables, and inline formatting), which makes the results ideal for direct use in notebooks or LLM pipelines.
PDF-aware:
Includes a built-in PDF runner that automatically slices pages into tiles, processes each region, and re-assembles multi-page outputs.
Adaptive tiling ("crop_mode"):
Automatically splits large pages into overlapping tiles for better recognition of dense, small fonts (the "Gundam mode" mentioned in their docs).
Vision backbone:
Based on DeepSeek-V2's VL-encoder (≈3B parameters) trained on massive document + scene-text corpora.
Handles resolutions up to 1280 × 1280 px and dynamically scales lower.
Language head:
Uses the same causal decoder family as DeepSeek-V2, fine-tuned for text reconstruction, so it can reason about table alignment, code blocks, and list structures.
Open and MIT-licensed:
Weights and inference code are fully open under the MIT license, allowing integration into other projects or retraining for domain-specific OCR.
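The adaptive tiling described above can be illustrated with a small standalone sketch. This is not DeepSeek-OCR's internal code; the 640 px tile size and 64 px overlap are assumed values for demonstration only.

```python
# Illustrative sketch of adaptive tiling: split a large page into
# overlapping tiles so dense, small fonts stay legible per tile.
# Tile size and overlap are assumed values, not the model's documented ones.

def tile_boxes(width, height, tile=640, overlap=64):
    """Return (left, top, right, bottom) boxes covering the page with overlap."""
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes

# A 1280x1280 page with 640 px tiles and 64 px overlap -> 3x3 = 9 tiles
print(len(tile_boxes(1280, 1280)))
```

Each tile would then be encoded separately and the per-tile outputs merged, which is conceptually what the built-in PDF runner does page by page.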
What's new about its approach
Traditional OCR (e.g., Tesseract, PaddleOCR) detects and classifies glyphs.
DeepSeek-OCR interprets the entire document as a multimodal sequence:
Encode the image as patches (visual tokens).
Feed the vision tokens + prompt to the text decoder.
The model "writes out" the text and structure directly.
This end-to-end generation-based OCR means:
No need for bounding-box parsing pipelines.
Better recovery of formatting and logical order.
Robust to blur, background noise, and complex layouts.
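Step 1 of the pipeline above (image to visual tokens) can be sketched at the token level. The 16-pixel patch size here is an assumption for illustration, not the model's documented value.

```python
# Toy illustration of image -> visual tokens, not the actual encoder.
# A ViT-style encoder cuts the image into fixed-size patches, and each
# patch becomes one visual token fed to the text decoder.

def num_visual_tokens(width, height, patch=16):
    """Number of patch tokens for a given resolution (assumed 16 px patches)."""
    return (width // patch) * (height // patch)

# A full 1280x1280 page at 16 px patches yields 80 * 80 = 6400 visual tokens;
# these tokens plus the text prompt are what the decoder conditions on.
print(num_visual_tokens(1280, 1280))
```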
Performance & requirements
Model size: ~6.7 GB BF16 (≈3B parameters).
Runs best on L4 / A100 GPUs (≥ 16 GB VRAM).
Works with Transformers 4.46+ using attn_implementation="eager".
On A100, achieves ~2500 tokens/s in vLLM mode for PDFs.
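A minimal loading sketch for a GPU runtime follows. The model id and keyword arguments are taken from the public Hugging Face model card; treat the exact signature as an assumption and verify it against your Transformers version. The heavy call is wrapped in a function so the sketch can be read without a GPU session.

```python
# Hedged sketch: loading DeepSeek-OCR with Transformers 4.46+.
# Verify the model id and kwargs against the model card for your version.
MODEL_ID = "deepseek-ai/DeepSeek-OCR"

def load_deepseek_ocr():
    # Imported lazily so this file can be inspected without a GPU runtime.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        MODEL_ID,
        trust_remote_code=True,
        use_safetensors=True,
        attn_implementation="eager",  # required with Transformers 4.46+, per above
        torch_dtype=torch.bfloat16,
    )
    return tokenizer, model.eval().cuda()
```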
In short
DeepSeek-OCR = OCR reimagined as text generation.
It reads like a human: "see → understand → write."
You get Markdown that preserves layout, context, and meaning: a major step up from bounding-box OCR engines.
Your session crashed after using all available RAM.
Try changing the runtime type to a T4 GPU or another accelerator.
Is it possible to get bounding-box coordinates from DeepSeek-OCR?
Does it support bounding boxes, labels, and confidence scores?
Yes
Can you give me the code to retrieve that?
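Here is a hedged sketch of how to pull boxes out of the raw output. When prompted with the grounding task token ("<|grounding|>..."), DeepSeek-OCR interleaves text with reference/detection tags; the exact tag format and the 0-999 coordinate grid below are assumptions taken from the model card, and the model does not emit confidence scores.

```python
import re

# Assumed raw-output format: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
DET_PATTERN = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"
    r"<\|det\|>\[\[(?P<box>[\d,\s]+)\]\]<\|/det\|>"
)

def extract_boxes(raw_output, width, height):
    """Return (label, (x1, y1, x2, y2)) pairs in pixel coordinates."""
    def to_px(value, size):
        # Coordinates assumed normalized to a 0-999 grid.
        return round(int(value) / 999 * size)

    results = []
    for m in DET_PATTERN.finditer(raw_output):
        x1, y1, x2, y2 = m.group("box").split(",")
        results.append((
            m.group("label"),
            (to_px(x1, width), to_px(y1, height),
             to_px(x2, width), to_px(y2, height)),
        ))
    return results

sample = "<|ref|>Title<|/ref|><|det|>[[100, 50, 900, 120]]<|/det|>"
print(extract_boxes(sample, 999, 999))
```

If your model version emits a different tag layout, adjust `DET_PATTERN` accordingly; the scaling helper is the only part that depends on the 0-999 assumption.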
How do I fine-tune this model?
Your session crashed after using all available RAM.
Some users reported that the notebook crashes due to limited VRAM on free Colab sessions.
I've loaded DeepSeek-OCR in 4-bit precision, which reduces memory usage and allows it to run on a T4 GPU.
You can test it here:
https://colab.research.google.com/github/Alireza-Akhavan/LLM/blob/main/deepseek_ocr_inference_4bit.ipynb
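A back-of-the-envelope calculation shows why 4-bit quantization (e.g., via bitsandbytes `load_in_4bit`) makes the difference on a 16 GB T4. This counts weight memory only; activations and the KV cache add overhead on top.

```python
# Rough weight-memory footprint for a ~3B-parameter model at different precisions.

def weight_footprint_gb(n_params, bits):
    """Bytes for the weights alone, in GB (decimal)."""
    return n_params * bits / 8 / 1e9

params = 3e9  # ~3B parameters, per the notes above
print(f"BF16 : {weight_footprint_gb(params, 16):.1f} GB")  # ~6 GB
print(f"4-bit: {weight_footprint_gb(params, 4):.1f} GB")   # ~1.5 GB
```

At ~1.5 GB of weights, the model leaves ample headroom on a 16 GB T4, whereas the BF16 weights alone already consume a large share of a free-tier session.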
Is it possible to stop the model from writing out the result?