Add a plugin-first PaddleOCR package for scanned PDF fallback by jimmyzhuu · Pull Request #1652 · microsoft/markitdown

jimmyzhuu · 2026-03-30T14:43:07Z

Summary

This draft PR proposes a new package, markitdown-paddleocr, as a plugin-first OCR option for scanned PDFs.

The scope is intentionally narrow:

PDF only
plugin-first
opt-in only
no changes to MarkItDown core APIs
no changes to default converter behavior

This is presented as a draft because my main goal is to make the implementation and tradeoffs easy to review before assuming this package belongs in the main repository.

Why this package exists

MarkItDown already has two strong paths today:

built-in extraction for machine-readable PDFs
an OCR plugin path based on LLM vision models

This package is aimed at a different use case that is not fully covered by those two paths:

local / offline OCR
CPU-friendly deployment options
no dependency on an LLM API
practical support for Chinese scanned report-style pages

In local testing with representative Chinese samples, the current built-in MarkItDown flow returned empty output for scanned pages, while a PaddleOCR-based fallback recovered meaningful text. The strongest results were on report-style and table-heavy pages.

Design goals

The implementation is deliberately conservative.

It does not try to replace the built-in PDF converter, and it does not try to replace the existing markitdown-ocr package.

Instead, it uses the following strategy:

run the built-in PdfConverter first
if the built-in converter returns non-empty markdown, return that result unchanged
only if the built-in result is empty, render full PDF pages to images and run PaddleOCR

This means:

normal machine-readable PDFs continue to use the existing built-in extraction path
OCR is only used as a fallback for scanned or image-only PDFs
default user behavior is preserved unless the plugin is explicitly installed and enabled

Why plugin-first

This package follows the plugin-first direction already present in MarkItDown:

backend-specific functionality can live outside the core package
optional runtimes and dependencies stay opt-in
different users can choose different OCR tradeoffs without expanding the default core surface area

That is also why this package does not introduce new core constructor arguments or provider-specific CLI flags in MarkItDown itself.

Why PaddleOCR specifically

The motivation here is not provider branding. The motivation is that PaddleOCR fills a practical gap in the current extension landscape:

it supports local execution
it can be used in environments where users do not want an LLM API dependency
it performs well in Chinese OCR scenarios that are common in report, disclosure, and table-heavy documents
as of March 30, 2026, PaddleOCR's GitHub repository shows 73.5k stars, making it the most-starred open-source OCR project on GitHub at the time of writing

That made it a reasonable candidate for a plugin package that complements, rather than replaces, the existing LLM-based OCR path.

What is included in this PR

This PR adds a new package under packages/markitdown-paddleocr with:

a pyproject.toml plugin package definition
a PaddleOCRService with lazy backend initialization
a PdfConverterWithPaddleOCR that only falls back when the built-in PDF result is empty
a plugin registration entry point
unit tests for converter behavior and plugin registration
demo and documentation materials
a small local comparison helper script
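
To illustrate what "lazy backend initialization" could look like, here is a minimal sketch. It is not the code in this PR: `LazyOCRService` and `backend_factory` are hypothetical names, and the factory stands in for whatever constructs the real PaddleOCR engine. The heavy backend is not built when the plugin is imported or registered, only on first use.

```python
from typing import Any, Callable, Optional


class LazyOCRService:
    """Hypothetical stand-in for a service that defers backend creation."""

    def __init__(self, backend_factory: Callable[[], Any]) -> None:
        # Store the factory only; nothing expensive happens here.
        self._backend_factory = backend_factory
        self._backend: Optional[Any] = None

    @property
    def backend(self) -> Any:
        # Build the backend on first access, then reuse the same instance.
        if self._backend is None:
            self._backend = self._backend_factory()
        return self._backend


# Usage with a fake factory that records how often it is called.
init_calls = []


def fake_factory() -> object:
    init_calls.append("init")
    return object()


service = LazyOCRService(fake_factory)
calls_before_use = len(init_calls)  # factory not called yet
first = service.backend
second = service.backend            # reuses the cached backend
```

With this shape, installing and enabling the plugin costs nothing for PDFs that never reach the OCR fallback.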

What is explicitly out of scope

The following are explicitly out of scope for this PR:

no DOCX support
no PPTX support
no XLSX support
no image interleaving into mixed-layout PDFs
no table reconstruction into rich Markdown tables
no changes to MarkItDown core APIs
no automatic OCR on all PDFs
no replacement of markitdown-ocr
no cloud-provider-specific configuration surface in MarkItDown core

Implementation notes

Fallback behavior

The main converter is intentionally narrow and easy to reason about:

if built-in extraction succeeds, its output wins
if built-in extraction is empty, OCR fallback runs
if OCR fallback fails, the converter still degrades gracefully
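
The three rules above can be sketched as a small function. This is an illustration under stated assumptions, not the converter in this PR: `convert_with_fallback`, `builtin_convert`, and `ocr_convert` are hypothetical names, treating whitespace-only output as empty is an assumption, and so is the choice to return the empty built-in result when OCR raises.

```python
from typing import Callable


def convert_with_fallback(
    builtin_convert: Callable[[bytes], str],
    ocr_convert: Callable[[bytes], str],
    pdf_bytes: bytes,
) -> str:
    """Run built-in extraction first and use OCR only when it yields nothing."""
    markdown = builtin_convert(pdf_bytes)

    # Rule 1: non-empty built-in output wins and is returned unchanged.
    # (Assumption: whitespace-only output counts as empty.)
    if markdown.strip():
        return markdown

    # Rule 2: built-in output is empty, so try the OCR fallback.
    try:
        return ocr_convert(pdf_bytes)
    except Exception:
        # Rule 3 (assumption): on OCR failure, fall back to the empty
        # built-in result instead of propagating the error.
        return markdown


def failing_ocr(_: bytes) -> str:
    raise RuntimeError("OCR backend unavailable")


text_pdf = convert_with_fallback(lambda b: "# Title", lambda b: "ocr text", b"%PDF")
scanned_pdf = convert_with_fallback(lambda b: "", lambda b: "ocr text", b"%PDF")
broken_ocr = convert_with_fallback(lambda b: "", failing_ocr, b"%PDF")
```

Passing the two converters in as callables keeps the decision logic testable without a real PDF or OCR engine.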

Dependency boundary

The package keeps Paddle-specific dependencies isolated to the plugin package rather than expanding the dependencies of packages/markitdown.

Configuration model

The plugin reads opt-in kwargs during plugin registration:

paddleocr_enabled
paddleocr_lang
paddleocr_kwargs

These stay inside the plugin path and do not require MarkItDown core API changes.
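
A rough sketch of how opt-in kwargs might be read at registration time follows. Only the three kwarg names come from this PR; `FakeRegistry`, the converter dictionary, and the `"ch"` default language are assumptions made so the example runs on its own, and the real plugin would receive a MarkItDown instance instead of the stand-in.

```python
from typing import Any, Dict, List


class FakeRegistry:
    """Hypothetical stand-in for the object a plugin registers converters on."""

    def __init__(self) -> None:
        self.converters: List[Dict[str, Any]] = []

    def register_converter(self, converter: Dict[str, Any]) -> None:
        self.converters.append(converter)


def register_converters(registry: FakeRegistry, **kwargs: Any) -> None:
    """Register the OCR fallback only when the caller explicitly opts in."""
    # Opt-in: without paddleocr_enabled=True nothing is registered.
    if not kwargs.get("paddleocr_enabled", False):
        return

    lang = kwargs.get("paddleocr_lang", "ch")  # default is an assumption
    backend_kwargs = dict(kwargs.get("paddleocr_kwargs") or {})

    # A plain dict stands in for the real converter object here.
    registry.register_converter(
        {"name": "pdf-paddleocr", "lang": lang, "backend_kwargs": backend_kwargs}
    )


default_registry = FakeRegistry()
register_converters(default_registry)  # no opt-in, nothing registered

opted_in_registry = FakeRegistry()
register_converters(
    opted_in_registry,
    paddleocr_enabled=True,
    paddleocr_lang="en",
    paddleocr_kwargs={"use_gpu": False},  # illustrative value only
)
```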

Local demo observations

Local demo notes are included in packages/markitdown-paddleocr/DEMO.md.

In the local samples used during testing:

built-in MarkItDown returned empty output for scanned Chinese pages
PaddleOCR recovered useful text from report-style pages
PaddleOCR also recovered useful text from table-heavy pages
textbook / handwritten pages improved as well, though with more OCR noise than report pages

This does not prove that the package solves every OCR scenario. It does suggest that it covers a real gap with a narrow plugin-first design.

Tests

This package currently includes unit tests that verify:

built-in PDF output is preserved when already available
OCR fallback only runs when built-in output is empty
plugin registration remains opt-in and receives kwargs correctly
OCR result normalization handles nested PaddleOCR-style outputs
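
For the last point, a minimal sketch of what normalizing nested OCR output could involve is shown below. It is not the normalizer in this PR. `extract_text_lines` is a hypothetical name, and the sample shape (a list of pages, each a list of `[box, (text, score)]` entries, with `None` for an empty page) is an assumption modeled on list-style PaddleOCR results.

```python
from typing import Any, List


def extract_text_lines(result: Any) -> List[str]:
    """Collect recognized text from arbitrarily nested list/tuple output."""
    lines: List[str] = []

    def walk(node: Any) -> None:
        if not isinstance(node, (list, tuple)):
            return  # ignore None, numbers, and other scalars
        # Treat a (text, score) pair as a leaf holding one recognized line.
        if (
            len(node) == 2
            and isinstance(node[0], str)
            and isinstance(node[1], (int, float))
        ):
            if node[0].strip():
                lines.append(node[0].strip())
            return
        for child in node:
            walk(child)

    walk(result)
    return lines


# Assumed sample: two recognized lines on page 1, an empty page 2.
sample = [
    [
        [[[0, 0], [10, 0], [10, 5], [0, 5]], ("年度报告", 0.98)],
        [[[0, 6], [10, 6], [10, 11], [0, 11]], ("营业收入", 0.95)],
    ],
    None,
]
normalized = extract_text_lines(sample)
```

Walking the structure recursively, rather than indexing a fixed depth, is one way to tolerate differences in nesting between backend versions.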

Why open this as a draft

I wanted to surface the full implementation for review while still signaling that I am open to feedback on repository fit.

If maintainers feel that:

this should stay as an external plugin only,
this should be split further, or
only part of this belongs upstream,

I am happy to adjust the scope.

Backward compatibility

This PR does not change the behavior of the default MarkItDown installation.

Users only get this functionality if they install the plugin package and opt in to it.

Related context

Related docs PR: Clarify plugin-first OCR extension docs #1651
Related discussion: Question: would a plugin-first offline OCR backend for scanned PDFs fit the current MarkItDown direction? #1650

Prepare paddleocr plugin package

57e4f74

jimmyzhuu wants to merge 1 commit into microsoft:main from jimmyzhuu:codex/paddleocr-plugin-clean