feat: infer hierarchical heading levels (H1-H6) for PDFs (#4204) #4325

Open
statxc wants to merge 2 commits into Unstructured-IO:main from statxc:statxc/feat-pdf-heading-hierarchy

statxc commented Apr 7, 2026

Summary

  • Add two-strategy heading level inference for PDF Title elements via category_depth metadata
    • Outline extraction (primary): walks PDF bookmark tree and matches entries to Title elements by page number + text similarity
    • Font-size analysis (fallback): clusters distinct font sizes from pdfminer LTChar data, ranks largest-first to assign depth 0-5
  • Integrates as a post-processing step in partition_pdf_or_image() and works with all strategies (fast, hi_res, ocr_only)
  • Correctly skipped for image partitioning
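
The two strategies above can be illustrated with a minimal sketch. This is not the code from this PR: the function names, the `tolerance` and `min_ratio` values, and the tuple-based inputs are hypothetical, and `difflib.SequenceMatcher` stands in for whatever text-similarity measure the PR actually uses. How the PR combines the two results is also not shown here.

```python
from difflib import SequenceMatcher

MAX_DEPTH = 5  # depths 0-5 correspond to H1-H6


def depth_from_outline(titles, outline, min_ratio=0.8):
    """Match Title elements to outline (bookmark) entries.

    titles:  list of (page_number, text) for each Title element
    outline: list of (page_number, bookmark_title, level), level 0 = top
    Returns one depth per title, or None where no entry matches.
    (Hypothetical helper; min_ratio is an assumed threshold.)
    """
    depths = []
    for page, text in titles:
        best_level, best_ratio = None, min_ratio
        for o_page, o_title, o_level in outline:
            if o_page != page:
                continue  # only compare entries on the same page
            ratio = SequenceMatcher(
                None, text.strip().lower(), o_title.strip().lower()
            ).ratio()
            if ratio >= best_ratio:
                best_level, best_ratio = o_level, ratio
        depths.append(None if best_level is None else min(best_level, MAX_DEPTH))
    return depths


def depth_from_font_size(sizes, tolerance=0.5):
    """Rank font sizes largest-first and assign depth 0..MAX_DEPTH.

    Sizes within `tolerance` points of a cluster's largest size share
    that cluster's depth. (Hypothetical helper; tolerance is assumed.)
    """
    size_to_depth = {}
    cluster_top, depth = None, -1
    for size in sorted(set(sizes), reverse=True):
        if cluster_top is None or cluster_top - size > tolerance:
            cluster_top = size  # start a new, smaller cluster
            depth += 1
        size_to_depth[size] = min(depth, MAX_DEPTH)
    return [size_to_depth[s] for s in sizes]


titles = [(1, "1. Introduction"), (2, "Background"), (2, "Random text")]
outline = [(1, "1. Introduction", 0), (2, "background ", 1)]
outline_depths = depth_from_outline(titles, outline)
font_depths = depth_from_font_size([24.0, 18.0, 18.2, 12.0, 24.0])
```

In this sketch a title with no matching bookmark gets `None` from the outline pass, which is where a font-size fallback would take over.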

Test plan

  • Existing test_document_to_element_list_sets_category_depth_titles passes unchanged
  • 154 existing PDF tests pass (1 pre-existing OCR language failure is unrelated to this change)

Closes #4204

statxc commented Apr 10, 2026

Hi @PastelStorm @cragwolfe,
Could you review my PR, please?
Please let me know if anything else needs to be updated.
Thanks.

@codebymikey

There are currently some merge conflicts that need resolving.

statxc commented Apr 14, 2026

@codebymikey @PastelStorm
I fixed the merge conflicts. Could you review again, please? I'd appreciate any feedback. Thanks.

Development

Successfully merging this pull request may close these issues.

feat/Infer the hierarchical heading/title levels such as H1, H2, H3, H4 for PDFs