feat: add parallelization for parsing #216

Merged
PeterStaar-IBM merged 18 commits into main from dev/add-parallelization-for-parsing
Mar 4, 2026

@PeterStaar-IBM PeterStaar-IBM commented Feb 14, 2026

Python code

Parse pages from one or more PDFs in parallel using a thread pool with backpressure:

from docling_parse.pdf_parser import (
    DoclingThreadedPdfParser,
    ThreadedPdfParserConfig,
)
from docling_parse.pdf_parsers import DecodePageConfig  # type: ignore[import]

parser_config = ThreadedPdfParserConfig(
    loglevel="fatal",
    threads=4,                # worker threads
    max_concurrent_results=32 # cap buffered results to limit memory
)
decode_config = DecodePageConfig()

parser = DoclingThreadedPdfParser(
    parser_config=parser_config,
    decode_config=decode_config,
)

# load one or more documents
for source in ["doc_a.pdf", "doc_b.pdf"]:
    parser.load(source)

# consume decoded pages as they become available
while parser.has_tasks():
    task = parser.get_task()

    if task.success:
        page_decoder, timings = task.get()
        print(f"{task.doc_key} p{task.page_number}: "
              f"{len(list(page_decoder.get_word_cells()))} words")
    else:
        print(f"error on {task.doc_key} p{task.page_number}: {task.error()}")
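
For readers unfamiliar with the term, "thread pool with backpressure" can be illustrated with the standard library alone. The sketch below is not how docling-parse implements it; `parse_page` and `run_with_backpressure` are hypothetical names, and the bounded `queue.Queue` merely plays the role that `max_concurrent_results` plays in the example above: workers block once the result buffer is full and resume as the consumer drains it.

```python
import queue
import threading


def parse_page(doc_key: str, page_number: int) -> str:
    # Hypothetical stand-in for real page decoding.
    return f"{doc_key} p{page_number}"


def run_with_backpressure(pages, threads=4, max_concurrent_results=8):
    tasks = queue.Queue()
    # Bounded buffer: at most max_concurrent_results finished pages wait here.
    results = queue.Queue(maxsize=max_concurrent_results)
    for page in pages:
        tasks.put(page)

    def worker():
        while True:
            try:
                doc_key, page_number = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            # put() blocks while the buffer is full, so a slow consumer
            # pauses the workers instead of letting results pile up.
            results.put(parse_page(doc_key, page_number))

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()

    # Consume exactly one result per submitted page, in completion order.
    collected = [results.get() for _ in range(len(pages))]
    for t in pool:
        t.join()
    return collected


pages = [("doc_a.pdf", n) for n in range(1, 21)]
out = run_with_backpressure(pages, threads=4, max_concurrent_results=2)
```

Results arrive in completion order rather than page order, which is why the example above reports `task.doc_key` and `task.page_number` with each result.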

Testing results (to be updated)

The sequential path now sets do_thread_safe = False. This skips the per-page QPDF document creation
and shares the parent document's QPDF instance instead. That makes the sequential run a fairer baseline,
since the per-page creation cost is only needed for parallel parsing.

uv run python ./perf/perf_scaling.py /Users/taa/Documents/projects/_data/bo767/pdf -r --threads 1,2,4,8 --limit 100

mode        threads      wall_time (s)  vs sequential    vs threaded(1)
----------  ---------  ---------------  ---------------  ----------------
sequential  -                   43.721  1.00x            1.35x
threaded    1                   59.193  0.74x            1.00x
threaded    2                   30.979  1.41x            1.91x
threaded    4                   15.736  2.78x            3.76x
threaded    6                   11.581  3.78x            5.11x
threaded    8                    8.734  5.01x            6.78x
threaded    10                   7.576  5.77x            7.81x
threaded    12                   6.909  6.33x            8.57x
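
The two ratio columns follow directly from the wall times: each is the baseline wall time divided by the row's wall time. A small sketch that recomputes them from the numbers in the table is given below. The parallel-efficiency figure (speedup over threaded(1) divided by the thread count) is an extra derived value that is not part of the reported table.

```python
# Wall times in seconds, copied from the table above.
sequential = 43.721
threaded = {1: 59.193, 2: 30.979, 4: 15.736, 6: 11.581,
            8: 8.734, 10: 7.576, 12: 6.909}


def speedup(baseline: float, wall_time: float) -> float:
    # How many times faster a run is than the chosen baseline.
    return baseline / wall_time


rows = []
for n_threads, wall_time in threaded.items():
    vs_sequential = speedup(sequential, wall_time)
    vs_threaded_1 = speedup(threaded[1], wall_time)
    # Derived metric, not in the original table: 1.0 means perfect scaling.
    efficiency = vs_threaded_1 / n_threads
    rows.append((n_threads, vs_sequential, vs_threaded_1, efficiency))

for n_threads, vs_seq, vs_t1, eff in rows:
    print(f"{n_threads:>2} threads: {vs_seq:.2f}x vs sequential, "
          f"{vs_t1:.2f}x vs threaded(1), efficiency {eff:.2f}")
```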

mode        threads      pages/sec
----------  ---------  -----------
sequential  -                169.1
threaded    1                124.9
threaded    2                238.7
threaded    4                469.9
threaded    6                638.4
threaded    8                846.6
threaded    10               975.9
threaded    12              1070.3
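
As a consistency check, multiplying each throughput by the matching wall time should give the same page count for every run. The sketch below does this with the reported figures. The page total is not stated in the results, so the value it prints is only an estimate implied by the rounded numbers.

```python
# (wall_time in seconds, pages/sec) per run, copied from the two tables above.
runs = {
    "sequential": (43.721, 169.1),
    "threaded-1": (59.193, 124.9),
    "threaded-2": (30.979, 238.7),
    "threaded-4": (15.736, 469.9),
    "threaded-6": (11.581, 638.4),
    "threaded-8": (8.734, 846.6),
    "threaded-10": (7.576, 975.9),
    "threaded-12": (6.909, 1070.3),
}

# pages = throughput * wall time; every run parsed the same inputs,
# so the products should agree up to rounding of the reported values.
implied_pages = {name: wall * rate for name, (wall, rate) in runs.items()}
spread = max(implied_pages.values()) - min(implied_pages.values())

for name, pages in implied_pages.items():
    print(f"{name:<12} ~{pages:.0f} pages")
print(f"spread across runs: {spread:.1f} pages")
```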

Signed-off-by: Peter Staar <taa@zurich.ibm.com>

mergify bot commented Feb 14, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

github-actions bot commented Feb 14, 2026

DCO Check Passed

Thanks @PeterStaar-IBM, all your commits are properly signed off. 🎉

@dolfim-ibm dolfim-ibm left a comment

lgtm

@PeterStaar-IBM PeterStaar-IBM merged commit ae66f6d into main Mar 4, 2026
34 checks passed
@PeterStaar-IBM PeterStaar-IBM deleted the dev/add-parallelization-for-parsing branch March 4, 2026 09:42