Document revised OCR processing by JorjMcKie · Pull Request #4940 · pymupdf/PyMuPDF

JorjMcKie · 2026-03-14T16:38:35Z

No description provided.

Copilot

Pull request overview

Adds/updates PyMuPDF4LLM documentation to describe the revised OCR plugin system and clarify OCR-related API behavior in the generated docs.

Changes:

Adds a new documentation page describing default OCR plugins, selection order, hybrid OCR workflow, and how to provide custom OCR functions.
Updates the PyMuPDF4LLM feature list to mention automatic OCR-benefit page detection and multiple OCR engines.
Updates API docs for force_ocr, ocr_dpi, and ocr_function to reflect default OCR engine/plugin behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File	Description
docs/pymupdf4llm/ocr-plugins.rst	New page documenting OCR plugin options, selection logic, and customization.
docs/pymupdf4llm/index.rst	Adds OCR-related capability to the feature list.
docs/pymupdf4llm/api.rst	Clarifies OCR-related parameter documentation and defaults.


docs/pymupdf4llm/ocr-plugins.rst

+Default OCR Functions
+======================
+
+PyMuPDF4LLM supports default OCR functions. They come in the form of plugins in its `ocr` subpackage and are currently based on three popular OCR engines: Tesseract OCR, RapidOCR, and PaddleOCR. Some engines can be combined to make use of their strengths and mitigate their weaknesses. For example, Tesseract OCR is very good at **recognizing** text, while RapidOCR is better at **detecting** text bounding boxes in images with complex backgrounds. By combining the two engines, we can achieve better overall OCR results while at the same time reducing the overall OCR processing time.
+
+Here is an overview of the available default plugins:
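
To illustrate the hybrid approach described in the diff excerpt above (one engine detects text bounding boxes, another recognizes the text inside each box), here is a minimal, library-free sketch. Everything in it is an assumption made for illustration: `Box`, `Span`, `hybrid_ocr`, `toy_detect`, and `toy_recognize` are hypothetical names, not part of PyMuPDF4LLM's plugin API, and the "image" is just a list of strings standing in for pixel data. It does not call Tesseract OCR or RapidOCR.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Toy "image": a list of text rows. A real pipeline would pass pixel data.
Image = Sequence[str]


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box: a half-open column range on one row."""
    row: int
    start: int
    stop: int


@dataclass(frozen=True)
class Span:
    """Recognized text together with the box it came from."""
    box: Box
    text: str


# The two roles of a hybrid workflow, expressed as plain callables.
Detector = Callable[[Image], List[Box]]
Recognizer = Callable[[Image, Box], str]


def hybrid_ocr(image: Image, detect: Detector, recognize: Recognizer) -> List[Span]:
    """Run detection once, then recognition only inside each detected box."""
    spans = []
    for box in detect(image):
        text = recognize(image, box).strip()
        if text:  # skip boxes where the recognizer found nothing
            spans.append(Span(box, text))
    return spans


def toy_detect(image: Image) -> List[Box]:
    """Stand-in detector: one box per maximal run of non-space characters."""
    boxes = []
    for row, line in enumerate(image):
        start = None
        # The trailing space closes a run that reaches the end of the line.
        for col, ch in enumerate(line + " "):
            if ch != " " and start is None:
                start = col
            elif ch == " " and start is not None:
                boxes.append(Box(row, start, col))
                start = None
    return boxes


def toy_recognize(image: Image, box: Box) -> str:
    """Stand-in recognizer: read the characters covered by the box."""
    return image[box.row][box.start:box.stop]


page = ["Hello  OCR", "  world"]
result = hybrid_ocr(page, toy_detect, toy_recognize)
for span in result:
    print(span.box, repr(span.text))
```

Keeping the detector and recognizer as separate callables means either side can be swapped for a different engine without touching the loop. This mirrors the division of labor the paragraph describes, although the actual plugins may be structured differently.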


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Document revised OCR processing

189a43e

JorjMcKie requested a review from Copilot March 14, 2026 16:38

Copilot started reviewing on behalf of JorjMcKie March 14, 2026 16:39 View session

Copilot AI reviewed Mar 14, 2026

View reviewed changes

typos

ceef80b

JorjMcKie requested a review from jamie-lemon March 14, 2026 16:55

JorjMcKie and others added 2 commits March 14, 2026 12:58

Potential fix for pull request finding

af65c4f

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

update comments

74ac78e

JorjMcKie wants to merge 4 commits into main from ocr-doc-updates
