Add a plugin-first PaddleOCR package for scanned PDF fallback #1652

Draft
jimmyzhuu wants to merge 1 commit into microsoft:main from jimmyzhuu:codex/paddleocr-plugin-clean


jimmyzhuu commented Mar 30, 2026

Summary

This draft PR proposes a new package, markitdown-paddleocr, as a plugin-first OCR option for scanned PDFs.

The scope is intentionally narrow:

  • PDF only
  • plugin-first
  • opt-in only
  • no changes to MarkItDown core APIs
  • no changes to default converter behavior

This is presented as a draft because my main goal is to make the implementation and tradeoffs easy to review before assuming this package belongs in the main repository.

Why this package exists

MarkItDown already has two strong paths today:

  • built-in extraction for machine-readable PDFs
  • an OCR plugin path based on LLM vision models

This package is aimed at a different use case that is not fully covered by those two paths:

  • local / offline OCR
  • CPU-friendly deployment options
  • no dependency on an LLM API
  • practical support for Chinese scanned report-style pages

In local testing with representative Chinese samples, the current built-in MarkItDown flow returned empty output for scanned pages, while a PaddleOCR-based fallback recovered meaningful text. The strongest results were on report-style and table-heavy pages.

Design goals

The implementation is deliberately conservative.

It does not try to replace the built-in PDF converter, and it does not try to replace the existing markitdown-ocr package.

Instead, it uses the following strategy:

  1. run the built-in PdfConverter first
  2. if the built-in converter returns non-empty markdown, return that result unchanged
  3. only if the built-in result is empty, render full PDF pages to images and run PaddleOCR
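The three steps above can be sketched as follows. This is a minimal illustration, not the code in this PR: `convert_with_fallback`, `builtin_convert`, and `ocr_convert` are hypothetical names, and treating whitespace-only output as "empty" is an assumption.

```python
from typing import Callable


def convert_with_fallback(
    pdf_bytes: bytes,
    builtin_convert: Callable[[bytes], str],
    ocr_convert: Callable[[bytes], str],
) -> str:
    """Run the built-in converter first and use OCR only when it yields nothing.

    Both callables are hypothetical stand-ins: `builtin_convert` for the
    built-in PDF converter, `ocr_convert` for "render pages, then run OCR".
    """
    # Step 1: always try the built-in extraction path first.
    markdown = builtin_convert(pdf_bytes)

    # Step 2: non-empty built-in output is returned unchanged.
    # Assumption: whitespace-only output counts as empty.
    if markdown.strip():
        return markdown

    # Step 3: only now pay the cost of rendering pages and running OCR.
    return ocr_convert(pdf_bytes)
```

Because the OCR callable is only invoked in step 3, machine-readable PDFs never touch the OCR backend.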

This means:

  • normal machine-readable PDFs continue to use the existing built-in extraction path
  • OCR is only used as a fallback for scanned or image-only PDFs
  • default user behavior is preserved unless the plugin is explicitly installed and enabled

Why plugin-first

This package follows the plugin-first direction already present in MarkItDown:

  • backend-specific functionality can live outside the core package
  • optional runtimes and dependencies stay opt-in
  • different users can choose different OCR tradeoffs without expanding the default core surface area

That is also why this package does not introduce new core constructor arguments or provider-specific CLI flags in MarkItDown itself.

Why PaddleOCR specifically

The motivation here is not provider branding. The motivation is that PaddleOCR fills a practical gap in the current extension landscape:

  • it supports local execution
  • it can be used in environments where users do not want an LLM API dependency
  • it performs well in Chinese OCR scenarios that are common in report, disclosure, and table-heavy documents
  • as of March 30, 2026, PaddleOCR's GitHub repository shows 73.5k stars, making it the most-starred open-source OCR project on GitHub at the time of writing

That made it a reasonable candidate for a plugin package that complements, rather than replaces, the existing LLM-based OCR path.

What is included in this PR

This PR adds a new package under packages/markitdown-paddleocr with:

  • a pyproject.toml plugin package definition
  • a PaddleOCRService with lazy backend initialization
  • a PdfConverterWithPaddleOCR that only falls back when the built-in PDF result is empty
  • a plugin registration entry point
  • unit tests for converter behavior and plugin registration
  • demo and documentation materials
  • a small local comparison helper script
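To illustrate what "lazy backend initialization" means for the service listed above, here is a small sketch. `LazyOCRService` and `backend_factory` are hypothetical names, not the PR's `PaddleOCRService`; the point is only that the expensive backend is constructed on first use rather than at import or registration time.

```python
from typing import Any, Callable, Optional


class LazyOCRService:
    """Hypothetical sketch: build the OCR backend on first access only."""

    def __init__(self, backend_factory: Callable[[], Any]) -> None:
        # In a real plugin the factory would presumably construct the
        # PaddleOCR engine; here it is any zero-argument callable.
        self._backend_factory = backend_factory
        self._backend: Optional[Any] = None

    @property
    def backend(self) -> Any:
        # Create the backend once, then reuse the same instance.
        if self._backend is None:
            self._backend = self._backend_factory()
        return self._backend
```

With this shape, installing and enabling the plugin costs nothing for PDFs that never reach the OCR fallback, since model loading is deferred until `backend` is first read.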

What is explicitly out of scope

This PR does not attempt to do any of the following:

  • no DOCX support
  • no PPTX support
  • no XLSX support
  • no image interleaving into mixed-layout PDFs
  • no table reconstruction into rich Markdown tables
  • no changes to MarkItDown core APIs
  • no automatic OCR on all PDFs
  • no replacement of markitdown-ocr
  • no cloud-provider-specific configuration surface in MarkItDown core

Implementation notes

Fallback behavior

The main converter is intentionally narrow and easy to reason about:

  • if built-in extraction succeeds, its output wins
  • if built-in extraction is empty, OCR fallback runs
  • if OCR fallback fails, the converter still degrades gracefully
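One way to read "degrades gracefully" is sketched below. The assumption here is that a failing OCR backend is logged and the (empty) built-in result is returned instead of raising; the PR may handle this differently. `run_ocr_safely` and the logger name are hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("paddleocr_fallback_sketch")  # hypothetical logger name


def run_ocr_safely(run_ocr: Callable[[], str], fallback_markdown: str = "") -> str:
    """Return OCR text, or `fallback_markdown` if OCR raises or yields nothing."""
    try:
        text = run_ocr()
    except Exception as exc:  # assumption: any backend failure is treated as non-fatal
        logger.warning("OCR fallback failed, keeping built-in result: %s", exc)
        return fallback_markdown
    # An OCR pass that produces only whitespace is treated like a failure.
    return text if text and text.strip() else fallback_markdown
```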

Dependency boundary

The package keeps Paddle-specific dependencies isolated to the plugin package rather than expanding the dependencies of packages/markitdown.

Configuration model

The plugin reads opt-in kwargs during plugin registration:

  • paddleocr_enabled
  • paddleocr_lang
  • paddleocr_kwargs

These stay inside the plugin path and do not require MarkItDown core API changes.
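A rough sketch of how opt-in kwargs could be read at registration time is shown below. The `register_converters(markitdown, **kwargs)` hook and the `__plugin_interface_version__` marker follow the convention of the MarkItDown sample plugin as I understand it; `PaddleOCRSettings`, `read_settings`, `PlaceholderConverter`, and the `"ch"` default language are illustrative assumptions, not the PR's actual code.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

__plugin_interface_version__ = 1  # marker following the sample plugin convention


@dataclass(frozen=True)
class PaddleOCRSettings:
    """Hypothetical container for the three opt-in kwargs."""

    lang: str = "ch"  # assumed default, not confirmed by the PR
    backend_kwargs: Dict[str, Any] = field(default_factory=dict)


def read_settings(**kwargs: Any) -> Optional[PaddleOCRSettings]:
    """Return settings only when the caller explicitly opted in."""
    if not kwargs.get("paddleocr_enabled", False):
        return None
    return PaddleOCRSettings(
        lang=kwargs.get("paddleocr_lang", "ch"),
        backend_kwargs=dict(kwargs.get("paddleocr_kwargs") or {}),
    )


class PlaceholderConverter:
    """Stand-in for the real fallback converter; it only stores its settings."""

    def __init__(self, settings: PaddleOCRSettings) -> None:
        self.settings = settings


def register_converters(markitdown: Any, **kwargs: Any) -> None:
    """Plugin hook: register nothing unless `paddleocr_enabled` is truthy."""
    settings = read_settings(**kwargs)
    if settings is None:
        return
    markitdown.register_converter(PlaceholderConverter(settings))
```

On the caller side this would presumably look like `MarkItDown(enable_plugins=True, paddleocr_enabled=True, paddleocr_lang="ch")`, with the extra kwargs forwarded to the plugin untouched.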

Local demo observations

Local demo notes are included in packages/markitdown-paddleocr/DEMO.md.

In the local samples used during testing:

  • built-in MarkItDown returned empty output for scanned Chinese pages
  • PaddleOCR recovered useful text from report-style pages
  • PaddleOCR also recovered useful text from table-heavy pages
  • textbook / handwritten pages improved as well, though with more OCR noise than report pages

This does not prove that the package solves every OCR scenario. It does suggest that it covers a real gap with a narrow plugin-first design.

Tests

This package currently includes unit tests that verify:

  • built-in PDF output is preserved when already available
  • OCR fallback only runs when built-in output is empty
  • plugin registration remains opt-in and receives kwargs correctly
  • OCR result normalization handles nested PaddleOCR-style outputs
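For the last point, the following sketch shows one way nested PaddleOCR-style output could be flattened into text lines. It assumes the classic result shape of per-page lists of `[box, (text, confidence)]` entries, with `None` for pages without detections; `extract_text_lines` is a hypothetical helper, not the function tested in this PR.

```python
from typing import Any, List


def extract_text_lines(result: Any) -> List[str]:
    """Collect recognized text from arbitrarily nested OCR-style results."""
    lines: List[str] = []

    def walk(node: Any) -> None:
        if not isinstance(node, (list, tuple)):
            return  # None, numbers, and other leaves carry no text
        # A (text, confidence) pair is the leaf we are looking for.
        if (
            len(node) == 2
            and isinstance(node[0], str)
            and isinstance(node[1], (int, float))
        ):
            text = node[0].strip()
            if text:
                lines.append(text)
            return
        # Otherwise descend: pages, entries, and box coordinates are all lists.
        for child in node:
            walk(child)

    walk(result)
    return lines
```

Walking the structure recursively, instead of indexing a fixed depth, keeps the helper tolerant of an extra or missing level of page nesting.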

Why open this as a draft

I wanted to surface the full implementation for review while still signaling that I am open to feedback on repository fit.

If maintainers feel that:

  • this should stay as an external plugin only,
  • this should be split further, or
  • only part of this belongs upstream,

I am happy to adjust the scope.

Backward compatibility

This PR does not change the behavior of the default MarkItDown installation.

Users only get this functionality if they install the plugin package and opt in to it.

Related context
