DeepSeek-OCR running in Google Colab
The URL of the shared Google Colab notebook:
https://colab.research.google.com/drive/1Fjzv3UYNoOt28HpM0RMUc8kG34EFgvuu?usp=sharing
DeepSeek-OCR is a new open-source, vision-language OCR model from DeepSeek-AI (the same lab behind the DeepSeek-V and DeepSeek-R series).
It's built to read complex, real-world documents (screenshots, PDFs, forms, tables, and handwritten or noisy text) and output clean, structured Markdown.
Core capabilities
Multimodal (Vision + Language):
Uses a hybrid vision encoder + causal text decoder to "see" layouts and generate text like a language model, rather than just classifying characters.
Markdown output:
Instead of raw text, it structures output with Markdown syntax (headings, bullet lists, tables, and inline formatting), which makes the results ideal for direct use in notebooks or LLM pipelines.
PDF-aware:
Includes a built-in PDF runner that automatically slices pages into tiles, processes each region, and re-assembles multi-page outputs.
Adaptive tiling ("crop_mode"):
Automatically splits large pages into overlapping tiles for better recognition of dense, small fonts (the "Gundam mode" mentioned in their docs).
Vision backbone:
Based on DeepSeek-V2's VL-encoder (≈3B parameters) trained on massive document + scene-text corpora.
Handles resolutions up to 1280 × 1280 px and dynamically scales lower.
Language head:
Uses the same causal decoder family as DeepSeek-V2, fine-tuned for text reconstruction, so it can reason about table alignment, code blocks, and list structures.
Open and MIT-licensed:
Weights and inference code are fully open under the MIT license, allowing integration into other projects or retraining for domain-specific OCR.
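The adaptive tiling described above can be illustrated with a small standalone sketch. This is not DeepSeek-OCR's internal code; the 640 px tile size and 64 px overlap are assumed values for demonstration only.

```python
# Illustrative sketch of adaptive tiling: split a large page into
# overlapping tiles so dense, small fonts stay legible per tile.
# Tile size and overlap are assumed values, not the model's documented ones.

def tile_boxes(width, height, tile=640, overlap=64):
    """Return (left, top, right, bottom) boxes covering the page with overlap."""
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            right = min(left + tile, width)
            bottom = min(top + tile, height)
            boxes.append((left, top, right, bottom))
    return boxes

# A 1280x1280 page with 640 px tiles and 64 px overlap -> 3x3 = 9 tiles
print(len(tile_boxes(1280, 1280)))
```

Each tile would then be encoded separately and the per-tile outputs merged, which is conceptually what the built-in PDF runner does page by page.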
What's new about its approach
Traditional OCR (e.g., Tesseract, PaddleOCR) detects and classifies glyphs.
DeepSeek-OCR interprets the entire document as a multimodal sequence:
Encode the image as patches (visual tokens).
Feed the vision tokens + prompt to the text decoder.
The model "writes out" the text and structure directly.
This end-to-end generation-based OCR means:
No need for bounding-box parsing pipelines.
Better recovery of formatting and logical order.
Robust to blur, background noise, and complex layouts.
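Step 1 of the pipeline above (image to visual tokens) can be sketched at the token level. The 16-pixel patch size here is an assumption for illustration, not the model's documented value.

```python
# Toy illustration of image -> visual tokens, not the actual encoder.
# A ViT-style encoder cuts the image into fixed-size patches, and each
# patch becomes one visual token fed to the text decoder.

def num_visual_tokens(width, height, patch=16):
    """Number of patch tokens for a given resolution (assumed 16 px patches)."""
    return (width // patch) * (height // patch)

# A full 1280x1280 page at 16 px patches yields 80 * 80 = 6400 visual tokens;
# these tokens plus the text prompt are what the decoder conditions on.
print(num_visual_tokens(1280, 1280))
```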
Performance & requirements
Model size: ~6.7 GB BF16 (≈3B parameters).
Runs best on L4 / A100 GPUs (≥ 16 GB VRAM).
Works with Transformers 4.46+ using attn_implementation="eager".
On A100, achieves ~2500 tokens/s in vLLM mode for PDFs.
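A minimal loading sketch for a GPU runtime follows. The model id and keyword arguments are taken from the public Hugging Face model card; treat the exact signature as an assumption and verify it against your Transformers version. The heavy call is wrapped in a function so the sketch can be read without a GPU session.

```python
# Hedged sketch: loading DeepSeek-OCR with Transformers 4.46+.
# Verify the model id and kwargs against the model card for your version.
MODEL_ID = "deepseek-ai/DeepSeek-OCR"

def load_deepseek_ocr():
    # Imported lazily so this file can be inspected without a GPU runtime.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        MODEL_ID,
        trust_remote_code=True,
        use_safetensors=True,
        attn_implementation="eager",  # required with Transformers 4.46+, per above
        torch_dtype=torch.bfloat16,
    )
    return tokenizer, model.eval().cuda()
```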
In short
DeepSeek-OCR = OCR reimagined as text generation.
It reads like a human: "see → understand → write."
You get Markdown that preserves layout, context, and meaning: a major step up from bounding-box OCR engines.
Your session crashed after using all available RAM.
Try changing the runtime type to a T4 GPU or another accelerator.
Is it possible to get bounding-box coordinates from DeepSeek-OCR?
Does it support bounding boxes, labels, and confidence scores?
Yes
Can you give me the code to retrieve that?
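Here is a hedged sketch of how to pull boxes out of the raw output. When prompted with the grounding task token ("<|grounding|>..."), DeepSeek-OCR interleaves text with reference/detection tags; the exact tag format and the 0-999 coordinate grid below are assumptions taken from the model card, and the model does not emit confidence scores.

```python
import re

# Assumed raw-output format: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
DET_PATTERN = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>"
    r"<\|det\|>\[\[(?P<box>[\d,\s]+)\]\]<\|/det\|>"
)

def extract_boxes(raw_output, width, height):
    """Return (label, (x1, y1, x2, y2)) pairs in pixel coordinates."""
    def to_px(value, size):
        # Coordinates assumed normalized to a 0-999 grid.
        return round(int(value) / 999 * size)

    results = []
    for m in DET_PATTERN.finditer(raw_output):
        x1, y1, x2, y2 = m.group("box").split(",")
        results.append((
            m.group("label"),
            (to_px(x1, width), to_px(y1, height),
             to_px(x2, width), to_px(y2, height)),
        ))
    return results

sample = "<|ref|>Title<|/ref|><|det|>[[100, 50, 900, 120]]<|/det|>"
print(extract_boxes(sample, 999, 999))
```

If your model version emits a different tag layout, adjust `DET_PATTERN` accordingly; the scaling helper is the only part that depends on the 0-999 assumption.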
How do I fine-tune this model?
Your session crashed after using all available RAM.
Some users reported that the notebook crashes due to limited VRAM on free Colab sessions.
I've loaded DeepSeek-OCR in 4-bit precision, which reduces memory usage and allows it to run on a T4 GPU.
You can test it here:
https://colab.research.google.com/github/Alireza-Akhavan/LLM/blob/main/deepseek_ocr_inference_4bit.ipynb
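A back-of-the-envelope calculation shows why 4-bit quantization (e.g., via bitsandbytes `load_in_4bit`) makes the difference on a 16 GB T4. This counts weight memory only; activations and the KV cache add overhead on top.

```python
# Rough weight-memory footprint for a ~3B-parameter model at different precisions.

def weight_footprint_gb(n_params, bits):
    """Bytes for the weights alone, in GB (decimal)."""
    return n_params * bits / 8 / 1e9

params = 3e9  # ~3B parameters, per the notes above
print(f"BF16 : {weight_footprint_gb(params, 16):.1f} GB")  # ~6 GB
print(f"4-bit: {weight_footprint_gb(params, 4):.1f} GB")   # ~1.5 GB
```

At ~1.5 GB of weights, the model leaves ample headroom on a 16 GB T4, whereas the BF16 weights alone already consume a large share of a free-tier session.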
Is it possible to stop the model from writing out the result?