From 189a43e67d7517dbfff367db107a06202c002a5c Mon Sep 17 00:00:00 2001 From: "Jorj X. McKie" Date: Sat, 14 Mar 2026 12:37:48 -0400 Subject: [PATCH 1/4] Document revised OCR processing --- docs/pymupdf4llm/api.rst | 10 +-- docs/pymupdf4llm/index.rst | 1 + docs/pymupdf4llm/ocr-plugins.rst | 142 +++++++++++++++++++++++++++++++ 3 files changed, 147 insertions(+), 6 deletions(-) create mode 100644 docs/pymupdf4llm/ocr-plugins.rst diff --git a/docs/pymupdf4llm/api.rst b/docs/pymupdf4llm/api.rst index db46ea408..c7b72f8a1 100644 --- a/docs/pymupdf4llm/api.rst +++ b/docs/pymupdf4llm/api.rst @@ -85,13 +85,11 @@ The PyMuPDF4LLM API :arg bool footer: |PyMuPDFLayoutMode_Valid| boolean to switch on/off page footer content. This parameter controls whether to include or omit footer text from all the document pages. Useful if the document has repetitive footer content which doesn't add any value to the overall extraction data. Default is `True` meaning that footer content will be considered. :arg bool force_ocr: |PyMuPDFLayoutMode_Valid| if `True`, OCR will be applied to all pages regardless of their content. - - Only works when `ocr_function` is specified. - + This may be useful for documents which are known to be image-based and thus profit from OCR, but which do not meet the default criteria for applying OCR. Default is `False` meaning that OCR will only be applied to pages which meet the default criteria. .. warning:: - Requires `ocr_function` to be specified otherwise an exception will be raised. + Requires that either one of the default supported OCR engines is installed or `ocr_function` specifies a callable OCR function. Otherwise, an exception will be raised. :arg bool force_text: generate text output even when overlapping images / graphics. This text then appears after the respective image. @@ -121,9 +119,9 @@ The PyMuPDF4LLM API * `(top, bottom)` yields `(0, top, 0, bottom)`. * To always read full pages **(default)**, use `margins=0`. - :arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Large values may increase the OCR precision but increase memory requirements and processing time. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high. + :arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Larger values do not usually increase the OCR precision. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high - in many cases you should see satisfactory results already with values of 150 or 200. Be aware that this value has a O(n²) impact on processing time and memory requirements. - :arg callable ocr_function: |PyMuPDFLayoutMode_Valid| if you want to provide your own :ref:`OCR function `, specify it here. If omitted (`None`), the built-in Tesseract OCR engine will be used. + :arg callable ocr_function: |PyMuPDFLayoutMode_Valid| if you want to provide your own :ref:`OCR function `, specify it here. If omitted (`None`), one of the available built-in OCR engines will be used. :arg str ocr_language: |PyMuPDFLayoutMode_Valid| specify the language to be used by the Tesseract OCR engine. Default is "eng" (English). Make sure that the respective language data files are installed. Remember to use correct Tesseract language codes. Multiple languages can be specified by concatenating the respective codes with a plus sign "+", for example "eng+deu" for English and German. diff --git a/docs/pymupdf4llm/index.rst b/docs/pymupdf4llm/index.rst index 62ae2a007..83af4697a 100644 --- a/docs/pymupdf4llm/index.rst +++ b/docs/pymupdf4llm/index.rst @@ -32,6 +32,7 @@ Features - Support for image and vector graphics extraction. - Layout analysis for better semantic understanding of document structure. - Support for page chunking output. + - Automatic detection of pages which profit from OCR and support for various OCR engines. - Integration with :ref:`LlamaIndex ` & :ref:`LangChain `. API diff --git a/docs/pymupdf4llm/ocr-plugins.rst b/docs/pymupdf4llm/ocr-plugins.rst new file mode 100644 index 000000000..b2926b6ba --- /dev/null +++ b/docs/pymupdf4llm/ocr-plugins.rst @@ -0,0 +1,142 @@ +Default OCR Functions +====================== + +PyMuPDF4LLM supports default OCR functions. They come in the form of plugins that are present in its `ocr` subpackage. They are based on currently 3 popular OCR engines, Tesseract OCR, RapidOCR and PaddleOCR. Some engines can be combined to make use of their strengths and mitigate their weaknesses. For example, Tesseract OCR is very good at **recognizing** text, while RapidOCR is better at **detecting** text bounding boxes in images with complex backgrounds. By combining the two engines, we can achieve better overall OCR results while at the samne time also reducing the overall OCR processing time. + +Here is an overview of the available default plugins: + +============== ========================= ================================================================================= +Plugin Name Engines Description +============== ========================= ================================================================================= +rapidocr_api RapidOCR Uses RapidOCR for both text **detection** and text **recognition** +paddleocr_api PaddleOCR Uses PaddleOCR for both text **detection** and text **recognition** +tesseract_api Tesseract OCR Uses Tesseract OCR for both text **detection** and text **recognition** +rapidtess_api RapidOCR + Tesseract OCR Uses RapidOCR for text **detection** and Tesseract OCR for text **recognition** +paddletess_api PaddleOCR + Tesseract OCR Uses PaddleOCR for text **detection** and Tesseract OCR for text **recognition** +============== ========================= ================================================================================= + +If not explicitely selected via the `ocr_function` parameter, PyMuPDF4LLM will check the availability of the three OCR engines and pick one of the above plugins in the following order of preference: + +1. `rapidtess_api` (if both RapidOCR and Tesseract OCR are available) +2. `paddletess_api` (if both PaddleOCR and Tesseract OCR are available) +3. `rapidocr_api` (if RapidOCR is available, but not Tesseract OCR) +4. `paddleocr_api` (if PaddleOCR is available, but not Tesseract OCR) +5. `tesseract_api` (if Tesseract OCR is available, but neither RapidOCR nor PaddleOCR are available) + +If none of these engines is available (and no own plugin is provided), no OCR will be performed at all. If the `force_ocr` parameter is ``True``, an error will be raised. Otherwise, the document will be processed without OCR and a warning will be displayed. + +The chosen plugin is displayed as an information message. + +How Default Plugins Work +------------------------ + +The provided default plugins use the following **"hybrid"** OCR approach: + +1. Each page is cleaned from any existing standard text content. +2. The remaining page is rendered as an image and passed to the OCR engine for text detection and recognition. +3. Only the detected text is inserted back into the original page as standard text content. + +In this way, all original content (text and other elements) is preserved and only **augmented** with the newly recognized text. This allows for a more accurate and complete text extraction while also preserving the original document structure and formatting as much as possible. It also allows for a more efficient OCR processing since only the non-extractable text is processed by the OCR engine. This can significantly reduce the overall processing time. + +It also increases the chances for a successful layout detection, because other original content like vectors remains intact and is not rendered to pixels. + +Forcing the Choice of a Default Plugin +--------------------------------------- +The default plugins are designed to be used as is, without any need for configuration. However, if you want to use a specific plugin, you can do so by using the following approach (which enforces for instance using RapidOCR and skipping above selection process). Please note that all plugins have a function named `exec_ocr` that does the actual OCR:: + + import pymupdf4llm + from pymupdf4llm.ocr import rapidocr_api + + my_ocr_function = rapidocr_api.exec_ocr + + # Use my_ocr_function as the OCR function in PyMuPDF4LLM + md_text = pymupdf4llm.to_markdown("input.pdf", ocr_function=my_ocr_function) + + +Providing your Own Plugin +------------------------- + +If you want to use your own OCR function, you can do so as follows:: + + import pymupdf4llm + + def my_ocr_function(page, pixmap=None, dpi=300, language="eng"): + # Your OCR implementation here + return None + + # Use my_ocr_function as the OCR function in PyMuPDF4LLM + md_text = pymupdf4llm.to_markdown("input.pdf", ocr_function=my_ocr_function) + +Your plugin must accept at least the ``page`` parameter which is a PyMuPDF Page object. The other parameters are optional. The plugin must create (or extend) the text of the passed-in page object by simply inserting text (using any of PyMuPDF's text insertion methods). No return values expected. + +Be prepared to accept ``None`` or a PyMuPDF Pixmap object as the `pixmap` parameter, which is the rendered image of the page if provided. Parameters ``dpi`` and ``language`` are passed thru from the respective function parameters. + + +Selecting Pages for OCR +------------------------ + +Usually in document processing, the vast majority of pages contain extractable text and do not require OCR. PyMuPDF4LLM contains logic that analyzes the content based on a number of criteria including (but not restricted to) the following: + +* Presence of extractable and legible (!) text +* Presence of images that appear to contain text +* Presence of vector graphics that simulate text +* Presence of text generated by previous OCR activities + +The OCR decision is internally based on the results of the following function:: + + from pymupdf4llm.helpers.utils import analyze_page + + analysis = analyze_page(page) + +The result ``analysis`` is a dictionary with the following keys and values. The area-related float values are computed as fractions of the total covered area. + +* "covered": pymupdf.Rect, page area covered by content +* "img_joins": float, fraction of area of the joined images +* "img_area": float, fraction of **sum** of image area sizes +* "txt_joins": float, fraction of area of the joined text spans +* "txt_area": float, fraction of **sum** of text span bbox area sizes +* "vec_joins": float, fraction of area of the joined vector characters +* "vec_area": float, fraction of **sum** of vector character area sizes +* "chars_total": int, count of visible characters +* "chars_bad": int, count of Replacement Unicode characters +* "ocr_spans": int, count: text spans with ignored text (render mode 3) +* "img_var": float, area-weighted image variance +* "img_edges": float, area-weighted image edge energy +* "vec_suspicious": int, minimum number of suspected vector-based glyphs +* "reason": str, reason for the OCR decision, else ``None`` +* "needs_ocr": bool, OCR decision (recommendation) + +The reason is one of the following values: +* "chars_bad": more than 10% of all characters are illegible (i.e. Replacement Unicode characters) +* "ocr_spans": there exist text spans created from previous OCR executions (render mode 3) +* "vec_text": there exist suspected vector-based glyphs +* "img_text": there exist images which probably contains recognizable text + +Based on this analysis, PyMuPDF4LLM will decide whether to invoke or skip OCR for a page. This is done to optimize processing time and resource usage by only performing OCR when it is likely to yield additional text content that cannot be extracted by other means. + +You can override this logic in the following ways: + +1. By setting `force_ocr=True` in the output functions (`to_markdown`, `to_text`, `to_json`). All pages will then be OCRed with the selected or provided OCR function regardless of their content. This will obviously have a massive impact on your execution time: expect several seconds duration per each page. + +2. Do as before, but add your own selection logic to the OCR plugin:: + + import pymupdf4llm + from pymupdf4llm.ocr import rapidocr_api + from pymupdf4llm.helpers.utils import analyze_page + + def my_ocr_function(page, pixmap=None, dpi=300, language="eng"): + # analyze the page content and perform OCR only if necessary + analysis = analyze_page(page) + if not analysis["needs_ocr"]: + return None + + # if OCR is recommended, you can decide differently based on your own insights, e.g. + if analysis["reason"] == "ocr_spans": + # we might want to accept previous OCR: + return None + + rapidocr_api.exec_ocr(page, pixmap=pixmap, dpi=dpi, language=language) + return None + + md_text = pymupdf4llm.to_markdown("input.pdf", force_ocr=True, ocr_function=my_ocr_function, ...) + From ceef80bc0c803827cfe571240ee6012b69417226 Mon Sep 17 00:00:00 2001 From: "Jorj X. McKie" Date: Sat, 14 Mar 2026 12:55:05 -0400 Subject: [PATCH 2/4] typos --- docs/pymupdf4llm/api.rst | 2 +- docs/pymupdf4llm/ocr-plugins.rst | 16 ++++++++++------ 2 files changed, 11 insertions(+), 7 deletions(-) diff --git a/docs/pymupdf4llm/api.rst b/docs/pymupdf4llm/api.rst index c7b72f8a1..974d80a3d 100644 --- a/docs/pymupdf4llm/api.rst +++ b/docs/pymupdf4llm/api.rst @@ -119,7 +119,7 @@ The PyMuPDF4LLM API * `(top, bottom)` yields `(0, top, 0, bottom)`. * To always read full pages **(default)**, use `margins=0`. - :arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Larger values do not usually increase the OCR precision. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high - in many cases you should see satisfactory results already with values of 150 or 200. Be aware that this value has a O(n²) impact on processing time and memory requirements. + :arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Larger values do not usually increase the OCR precision. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high - in many cases you should see satisfactory results already with values of 150 or 200. Be aware that this value has a quadratic impact on processing time and memory requirements. :arg callable ocr_function: |PyMuPDFLayoutMode_Valid| if you want to provide your own :ref:`OCR function `, specify it here. If omitted (`None`), one of the available built-in OCR engines will be used. diff --git a/docs/pymupdf4llm/ocr-plugins.rst b/docs/pymupdf4llm/ocr-plugins.rst index b2926b6ba..7e179dc01 100644 --- a/docs/pymupdf4llm/ocr-plugins.rst +++ b/docs/pymupdf4llm/ocr-plugins.rst @@ -1,7 +1,10 @@ +.. include:: ../header.rst + + Default OCR Functions ====================== -PyMuPDF4LLM supports default OCR functions. They come in the form of plugins that are present in its `ocr` subpackage. They are based on currently 3 popular OCR engines, Tesseract OCR, RapidOCR and PaddleOCR. Some engines can be combined to make use of their strengths and mitigate their weaknesses. For example, Tesseract OCR is very good at **recognizing** text, while RapidOCR is better at **detecting** text bounding boxes in images with complex backgrounds. By combining the two engines, we can achieve better overall OCR results while at the samne time also reducing the overall OCR processing time. +PyMuPDF4LLM supports default OCR functions. They come in the form of plugins that are present in its `ocr` subpackage. They are based on currently 3 popular OCR engines, Tesseract OCR, RapidOCR and PaddleOCR. Some engines can be combined to make use of their strengths and mitigate their weaknesses. For example, Tesseract OCR is very good at **recognizing** text, while RapidOCR is better at **detecting** text bounding boxes in images with complex backgrounds. By combining the two engines, we can achieve better overall OCR results while at the same time also reducing the overall OCR processing time. Here is an overview of the available default plugins: @@ -15,7 +18,7 @@ rapidtess_api RapidOCR + Tesseract OCR Uses RapidOCR for text **detection** an paddletess_api PaddleOCR + Tesseract OCR Uses PaddleOCR for text **detection** and Tesseract OCR for text **recognition** ============== ========================= ================================================================================= -If not explicitely selected via the `ocr_function` parameter, PyMuPDF4LLM will check the availability of the three OCR engines and pick one of the above plugins in the following order of preference: +If not explicitly selected via the `ocr_function` parameter, PyMuPDF4LLM will check the availability of the three OCR engines and pick one of the above plugins in the following order of preference: 1. `rapidtess_api` (if both RapidOCR and Tesseract OCR are available) 2. `paddletess_api` (if both PaddleOCR and Tesseract OCR are available) @@ -38,7 +41,7 @@ The provided default plugins use the following **"hybrid"** OCR approach: In this way, all original content (text and other elements) is preserved and only **augmented** with the newly recognized text. This allows for a more accurate and complete text extraction while also preserving the original document structure and formatting as much as possible. It also allows for a more efficient OCR processing since only the non-extractable text is processed by the OCR engine. This can significantly reduce the overall processing time. -It also increases the chances for a successful layout detection, because other original content like vectors remains intact and is not rendered to pixels. +It also increases the chances for a successful layout detection, because other original content like vectors remain intact and will not be rendered to pixels. Forcing the Choice of a Default Plugin --------------------------------------- @@ -69,7 +72,7 @@ If you want to use your own OCR function, you can do so as follows:: Your plugin must accept at least the ``page`` parameter which is a PyMuPDF Page object. The other parameters are optional. The plugin must create (or extend) the text of the passed-in page object by simply inserting text (using any of PyMuPDF's text insertion methods). No return values expected. -Be prepared to accept ``None`` or a PyMuPDF Pixmap object as the `pixmap` parameter, which is the rendered image of the page if provided. Parameters ``dpi`` and ``language`` are passed thru from the respective function parameters. +Be prepared to accept ``None`` or a PyMuPDF Pixmap object as the `pixmap` parameter, which is the rendered image of the page if provided. Parameters ``dpi`` and ``language`` are passed through from the respective function parameters. Selecting Pages for OCR @@ -110,7 +113,7 @@ The reason is one of the following values: * "chars_bad": more than 10% of all characters are illegible (i.e. Replacement Unicode characters) * "ocr_spans": there exist text spans created from previous OCR executions (render mode 3) * "vec_text": there exist suspected vector-based glyphs -* "img_text": there exist images which probably contains recognizable text +* "img_text": there exist images which (probably) contain recognizable text Based on this analysis, PyMuPDF4LLM will decide whether to invoke or skip OCR for a page. This is done to optimize processing time and resource usage by only performing OCR when it is likely to yield additional text content that cannot be extracted by other means. @@ -129,7 +132,7 @@ You can override this logic in the following ways: analysis = analyze_page(page) if not analysis["needs_ocr"]: return None - + # if OCR is recommended, you can decide differently based on your own insights, e.g. if analysis["reason"] == "ocr_spans": # we might want to accept previous OCR: @@ -140,3 +143,4 @@ You can override this logic in the following ways: md_text = pymupdf4llm.to_markdown("input.pdf", force_ocr=True, ocr_function=my_ocr_function, ...) +.. include:: ../footer.rst From af65c4f1642586fdc6ad6c6972f27a5ab59b4083 Mon Sep 17 00:00:00 2001 From: "Jorj X. McKie" Date: Sat, 14 Mar 2026 12:58:15 -0400 Subject: [PATCH 3/4] Potential fix for pull request finding Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> --- docs/pymupdf4llm/api.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/pymupdf4llm/api.rst b/docs/pymupdf4llm/api.rst index 974d80a3d..afa88bede 100644 --- a/docs/pymupdf4llm/api.rst +++ b/docs/pymupdf4llm/api.rst @@ -119,7 +119,7 @@ The PyMuPDF4LLM API * `(top, bottom)` yields `(0, top, 0, bottom)`. * To always read full pages **(default)**, use `margins=0`. - :arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Larger values do not usually increase the OCR precision. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high - in many cases you should see satisfactory results already with values of 150 or 200. Be aware that this value has a quadratic impact on processing time and memory requirements. + :arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Larger values do not usually increase the OCR precision. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high - in many cases you should see satisfactory results already with values of 150 or 200. Be aware that processing time and memory requirements grow quadratically with this value (an O(ocr_dpi²) impact). :arg callable ocr_function: |PyMuPDFLayoutMode_Valid| if you want to provide your own :ref:`OCR function `, specify it here. If omitted (`None`), one of the available built-in OCR engines will be used. From 74ac78e14c557d740559f57c63a8195aaeffa1c5 Mon Sep 17 00:00:00 2001 From: "Jorj X. McKie" Date: Sun, 15 Mar 2026 09:14:05 -0400 Subject: [PATCH 4/4] update comments --- docs/pymupdf4llm/ocr-plugins.rst | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/docs/pymupdf4llm/ocr-plugins.rst b/docs/pymupdf4llm/ocr-plugins.rst index 7e179dc01..87b77060a 100644 --- a/docs/pymupdf4llm/ocr-plugins.rst +++ b/docs/pymupdf4llm/ocr-plugins.rst @@ -130,14 +130,21 @@ You can override this logic in the following ways: def my_ocr_function(page, pixmap=None, dpi=300, language="eng"): # analyze the page content and perform OCR only if necessary analysis = analyze_page(page) + + # inspect the items of the analysis dictionary to make your own + # decision about whether to perform OCR or not, e.g.: if not analysis["needs_ocr"]: + # accept decision NOT to perform OCR: return None - # if OCR is recommended, you can decide differently based on your own insights, e.g. + # if OCR is recommended, you can decide differently based on + # your own insights, e.g. we might want to accept previous OCR + # results and skip OCR if there are already text spans created + # from previous OCR executions (render mode 3): if analysis["reason"] == "ocr_spans": - # we might want to accept previous OCR: return None + # execute desired OCR engine rapidocr_api.exec_ocr(page, pixmap=pixmap, dpi=dpi, language=language) return None