Fix compatibility for PDFs with embedded outlines by shaoqing404 · Pull Request #212 · VectifyAI/PageIndex

shaoqing404 · 2026-04-02T16:02:57Z

Problem

PageIndex currently assumes that PDF structure should be derived primarily from the text-based TOC path.

That works for many ordinary reports, but it breaks down on a class of operational/manual PDFs that already contain a rich embedded PDF outline/bookmark tree. In those files:

the document may contain both an effective-page list and a semantic TOC
pagination is often section-local rather than document-global
TOC page values are not reliable physical-page candidates
the built-in PDF outline is already a better structural source than text reconstruction

A concrete example is the early pdf-exploration analysis around the operation control manual. That investigation showed that the current TOC-offset workflow is a weak fit for this document class, and can fail or produce unstable results even though the PDF itself already exposes usable structure metadata.

Approach

This PR adds an outline-first compatibility path for PDFs that already contain a sufficiently usable embedded outline.

The change is intentionally small:

try to read the embedded PDF outline/bookmarks before running the normal TOC parser
convert that outline into the existing PageIndex tree shape with title, start_index, end_index, and nested nodes
infer missing parent start pages from children when possible
derive end_index from the next sibling / subtree boundary
fall back to the existing tree_parser flow when the outline is absent, empty, or too sparse to be trustworthy

So this does not replace the current TOC-based pipeline. It only bypasses it when the PDF already provides a strong native outline.
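The conversion steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `build_outline_tree`, the `min_entries` threshold, and the input shape (a pre-order list of `(level, title, page)` tuples with 1-based pages, similar in spirit to what PyMuPDF's `Document.get_toc()` yields) are all assumptions made for the example. It also assumes a section ends on the page before its next sibling starts, which may not match the project's actual boundary rule.

```python
def build_outline_tree(entries, total_pages, min_entries=3):
    """Turn a flat pre-order outline into nested title/start_index/end_index/nodes dicts.

    Returns None when the outline looks unusable, so the caller can fall back
    to the regular TOC-based parser. All names here are hypothetical.
    """
    if len(entries) < min_entries:
        return None  # too sparse to be trustworthy (threshold is an assumption)

    # 1. Nest entries by level using a stack of (level, node) pairs.
    roots, stack = [], []
    for level, title, page in entries:
        node = {"title": title, "start_index": page, "end_index": None, "nodes": []}
        while stack and stack[-1][0] >= level:
            stack.pop()
        siblings = stack[-1][1]["nodes"] if stack else roots
        siblings.append(node)
        stack.append((level, node))

    # 2. Infer a missing parent start page from its children.
    def fill_start(node):
        for child in node["nodes"]:
            if not fill_start(child):
                return False
        if node["start_index"] is None:
            if not node["nodes"]:
                return False  # a leaf without a page cannot be placed
            node["start_index"] = min(c["start_index"] for c in node["nodes"])
        return True

    if not all(fill_start(root) for root in roots):
        return None

    # 3. Derive end_index from the next sibling, or from the enclosing boundary.
    def fill_end(siblings, boundary):
        for i, node in enumerate(siblings):
            if i + 1 < len(siblings):
                end = siblings[i + 1]["start_index"] - 1
            else:
                end = boundary
            # Clamp so a section never ends before it starts.
            node["end_index"] = max(node["start_index"], end)
            fill_end(node["nodes"], node["end_index"])

    fill_end(roots, total_pages)
    return roots


outline = [
    (1, "Chapter 1", None),   # parent without a page of its own
    (2, "1.1 Scope", 3),
    (2, "1.2 Terms", 5),
    (1, "Chapter 2", 8),
    (2, "2.1 Procedures", 8),
]
tree = build_outline_tree(outline, total_pages=12)
```

In this example "Chapter 1" inherits start page 3 from its first child and ends on page 7, the page before "Chapter 2" begins; the last node in each subtree inherits its parent's end page.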

Why this helps

This makes the service compatible with manuals and similar structured PDFs where the old TOC inference path is the wrong abstraction.

It also keeps the fix general:

no document-specific special casing
no hardcoded handling for the operation manual
no behavior change for PDFs without a usable outline

Validation

added a regression test for get_pdf_outline_tree using a repository sample PDF with an embedded outline
added a test that confirms page_index_main prefers the outline path and does not enter tree_parser when a usable outline is available
manually verified end-to-end parsing on an outline-bearing sample PDF in the repo
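The "prefers the outline path and does not enter tree_parser" check can be illustrated with a small stand-alone sketch. The dispatcher `build_structure` and its parameters are hypothetical stand-ins; the real `page_index_main`, `get_pdf_outline_tree`, and `tree_parser` signatures are not shown in this PR description, so nothing below should be read as their actual interface.

```python
from unittest.mock import Mock


def build_structure(pdf_path, read_outline, toc_parser):
    """Prefer an embedded outline; fall back to the TOC parser otherwise.

    read_outline and toc_parser are injected callables (hypothetical names)
    so the dispatch logic can be tested without a real PDF.
    """
    try:
        tree = read_outline(pdf_path)
    except Exception:
        tree = None  # an unreadable outline is treated the same as a missing one
    if tree:
        return {"source": "outline", "structure": tree}
    return {"source": "toc", "structure": toc_parser(pdf_path)}


# Case 1: a usable outline exists, so the TOC parser must not be called.
outline_reader = Mock(return_value=[
    {"title": "Chapter 1", "start_index": 1, "end_index": 4, "nodes": []}
])
toc_parser = Mock(return_value=[])
preferred = build_structure("sample.pdf", outline_reader, toc_parser)
toc_parser.assert_not_called()

# Case 2: no outline, so the existing TOC flow runs exactly once.
empty_reader = Mock(return_value=None)
fallback_parser = Mock(return_value=[
    {"title": "Intro", "start_index": 1, "end_index": 2, "nodes": []}
])
fallback = build_structure("sample.pdf", empty_reader, fallback_parser)
fallback_parser.assert_called_once_with("sample.pdf")
```

Injecting the two callables keeps the test independent of any PDF library; a test against the real entry point would more likely patch the module-level functions instead.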

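====================================== Language divider ========================================
Thanks to the project team for the excellent thinking behind this project.

We tested the project and ran a small-scale validation on currently public government-affairs PDFs: the index stage reached a build success rate of 103/120, and the recall-accuracy and answer-accuracy stages reached 93.51%. The project's design is very inspiring, and we are excited about its potential in real-world document scenarios.

This PR mainly adds a more robust fallback optimization alongside the existing TOC inference path:

When a PDF already carries a complete and usable embedded outline / bookmarks, read its native outline structure first and convert it into the structure-tree format PageIndex currently uses (title / start_index / end_index / nodes); fall back to the existing TOC inference flow only when the outline is missing, too sparse, or unusable.

The reason is that, in our tests, some government-affairs and manual-style PDFs already contain a high-quality native outline, yet their front matter often also includes structures such as a "list of effective pages", sectioned page numbers, and section-local page-numbering schemes. Such documents make the text-based TOC inference path more prone to unstable page mapping and failed structure recovery, and can directly lower the index build success rate. For these files, preferring the PDF's own outline is usually more reliable than re-inferring the table of contents from the text.

This change is kept minimally invasive in behavior:

it does not replace the existing TOC inference logic
it does not hardcode handling for specific documents
it enables the outline-first path only when the PDF already provides sufficiently reliable native structure information
all other cases keep the existing flow unchanged

On our test set, this optimization lets documents with complete outline / bookmarks build their structure tree directly, raising the document build success rate from 103/120 to 114/120; on the same test set, we observed no noticeable change in answer accuracy.

If this direction fits the project's expectations, we would be glad to keep helping to make this compatibility path more stable.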


The loop variable `list_index = page_index - start_index` was overwriting the outer `list_index = incorrect_item['list_index']`, causing results to be written back to wrong index positions. Rename the loop variable to `page_list_idx` to avoid shadowing. Closes #66


* Integrate litellm for multi-provider LLM support
* recover the default config yaml
* Use litellm.acompletion for native async support
* fix tob
* Rename llm_complete/allm_complete to llm_completion/llm_acompletion, remove unused llm_complete_stream
* Pin litellm to version 1.82.0
* resolve comments
* args from cli is used to overrides config.yaml
* Fix get_page_tokens hardcoded model default
  Pass opt.model to get_page_tokens so tokenization respects the configured model instead of always using gpt-4o-2024-11-20.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Remove explicit openai dependency from requirements.txt
  openai is no longer directly imported; it comes in as a transitive dependency of litellm. Pinning it explicitly risks version conflicts.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Restore openai==1.101.0 pin in requirements.txt
  litellm==1.82.0 and openai-agents have conflicting openai version requirements, but openai==1.101.0 works at runtime for both. The pin is necessary to prevent litellm from pulling in openai>=2.x which would break openai-agents.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Remove explicit openai dependency from requirements.txt
  openai is not directly used; it comes in as a transitive dependency of litellm. No openai-agents in this branch so no pin needed.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix an litellm error log
* resolve comments
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…125)
* Add PageIndexClient with retrieve, streaming support and litellm integration
* Add OpenAI agents demo example
* Update README with example agent demo section
* Support separate retrieve_model configuration for index and retrieve


* Consolidate tests/ into examples/documents/
* Add line_count and reorder structure keys
* Lazy-load documents with _meta.json index
* Update demo script and add pre-shipped workspace
* Extract shared helpers for JSON reading and meta entry building


claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

rejojer and others added 30 commits April 9, 2025 19:37

fix pdf name

633ba17

Update README.md

3c7f35b

Update README.md

b45f54c

Update README.md

266ee54

Update README.md

6c2bb5a

Update README.md

68c2306

fix physical index

455bc4a

Update README.md

5a0b38a

add async. various fixes.

e653fa4

fix index range

6c0f324

Merge pull request #10 from rejojer/working

446a4bf

Working

Update README.md

b1bcb68

fix arg parsing

22e21f8

Update README.md

dda2ba8

Update README.md

5ffeb48

Update CHANGELOG.md

8d371e6

Update README.md

42108ae

Update README.md

18fa2f4

Update README.md

4cb2252

Update README.md

4571f83

Update README.md

167a62d

Update README.md

935a410

Update README.md

a2af4c9

Update README.md

b832bd4

Update README.md

a10ef58

Update README.md

0e0820d

Update README.md

59f4da2

Update README.md

f84fdd2

Update README.md

6fb26f5

Update README.md

cd29c81

BukeLy and others added 27 commits March 2, 2026 18:31

Merge pull request #132 from VectifyAI/fix/backfill-dedupe-pagination

1644b81

Fix backfill-dedupe pagination: replace gh issue list with gh api

Allow github-actions bot to trigger claude-code-action

c551fac

Backfill workflow triggers issue-dedupe via gh workflow run, which makes the actor github-actions. Add it to allowed_bots so claude-code-action accepts the trigger.

Merge pull request #133 from VectifyAI/fix/allow-bot-trigger

7786a05

Allow github-actions bot to trigger claude-code-action

Allow all users to trigger issue dedup via claude-code-action

700bc05

Issues are opened by external users who don't have write permissions. Add allowed_non_write_users: "*" so claude-code-action runs for all issue authors, not just repo collaborators.

Merge pull request #142 from VectifyAI/fix/allow-all-users-dedupe

a4f19e0

Allow all users to trigger issue dedup

Merge pull request #63 from luojiyin1987/fix/api-error-return

212da0a

fix: make ChatGPT_API_with_finish_reason return consistent tuple

Merge pull request #65 from luojiyin1987/fix/extract-toc-infinite-loop

7633853

fix: prevent infinite loop in extract_toc_content

Merge pull request #167 from VectifyAI/fix/list-index-shadowing

7770ad9

Fix list_index variable shadowing in fix_incorrect_toc

Update demo example paper and polish README

ddd171f

Add agentic vectorless RAG example to README highlights

eea8e89

Update README

b10fb24

Simplify root directory

58e1c0a

Update README

f2599b6

Merge pull request #184 from VectifyAI/cleanup/simplify-root-directory

0e2ad53

Simplify root directory

Rename demo script and update README wording

2accef7

Simplify agentic vectorless RAG demo (#191)

d6a24ff

* Simplify and fix agentic RAG demo
* Show labeled reasoning output in RAG demo
* Comment out reasoning model settings by default

Disable agent tracing and auto-add litellm/ prefix for retrieve_model

9d3f97d

* Disable agent tracing and auto-add litellm/ prefix for retrieve_model
* Preserve supported retrieve_model prefixes
* Remove temporary retrieve_model tests
* Limit tracing disablement to demo execution

Polish demo docstring and migrate to pathlib

266d7ea

Merge pull request #197 from VectifyAI/polish/demo-docstring-and-pathlib

b95ea6f

Polish demo docstring and migrate to pathlib

Polish agent system prompt wording

27ae416

Update developer links

6c0e086

Update README

2f245ba

Fix compatibility for PDFs with embedded outlines

25861f4

claude Bot reviewed Apr 2, 2026

shaoqing404 force-pushed the fix/main-compat branch from cea0c1e to 25861f4 on May 13, 2026 18:12

shaoqing404 closed this by deleting the head repository on May 15, 2026

shaoqing404 wants to merge 272 commits into VectifyAI:main from shaoqing404:fix/main-compat


10 participants
