Fix compatibility for PDFs with embedded outlines #212

Closed
shaoqing404 wants to merge 272 commits into VectifyAI:main from shaoqing404:fix/main-compat

Conversation

@shaoqing404

Problem

PageIndex currently assumes that PDF structure should be derived primarily from the text-based TOC path.

That works for many ordinary reports, but it breaks down on a class of operational/manual PDFs that already contain a rich embedded PDF outline/bookmark tree. In those files:

  • the document may contain both an effective-page list and a semantic TOC
  • pagination is often section-local rather than document-global
  • TOC page values are not reliable physical-page candidates
  • the built-in PDF outline is already a better structural source than text reconstruction

A concrete example is the early pdf-exploration analysis around the operation control manual. That investigation showed that the current TOC-offset workflow is a weak fit for this document class, and can fail or produce unstable results even though the PDF itself already exposes usable structure metadata.

Approach

This PR adds an outline-first compatibility path for PDFs that already contain a sufficiently usable embedded outline.

The change is intentionally small:

  • try to read the embedded PDF outline/bookmarks before running the normal TOC parser
  • convert that outline into the existing PageIndex tree shape with title, start_index, end_index, and nested nodes
  • infer missing parent start pages from children when possible
  • derive end_index from the next sibling / subtree boundary
  • fall back to the existing tree_parser flow when the outline is absent, empty, or too sparse to be trustworthy

So this does not replace the current TOC-based pipeline. It only bypasses it when the PDF already provides a strong native outline.
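
The conversion steps listed above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `get_pdf_outline_tree` implementation: it assumes the outline has already been read into a flat list of `(level, title, page)` tuples in document order, with 1-based physical page numbers and `None` where a bookmark has no resolvable page. The helper names (`build_outline_tree`, `_fill_starts`, `_fill_ends`) and the rule that a node ends on the page before its next sibling starts are assumptions made for this example.

```python
def build_outline_tree(entries, page_count):
    """Turn flat (level, title, page) outline entries into nested nodes.

    Hypothetical helper: entries are in document order, levels start at 1,
    and page is a 1-based physical page number or None if unresolved.
    """
    root = {"title": None, "start_index": None, "end_index": None, "nodes": []}
    stack = [(0, root)]  # (level, node) path from the root to the current node
    for level, title, page in entries:
        node = {"title": title, "start_index": page, "end_index": None, "nodes": []}
        # Pop until the top of the stack is a strictly shallower node (the parent).
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["nodes"].append(node)
        stack.append((level, node))
    _fill_starts(root["nodes"])
    _fill_ends(root["nodes"], page_count)
    return root["nodes"]


def _fill_starts(nodes):
    """Infer a missing parent start page from its earliest known child."""
    for node in nodes:
        _fill_starts(node["nodes"])
        if node["start_index"] is None:
            known = [c["start_index"] for c in node["nodes"] if c["start_index"] is not None]
            if known:
                node["start_index"] = min(known)


def _fill_ends(nodes, boundary):
    """Derive end_index from the next sibling, bounded by the enclosing subtree.

    Assumption: a node ends on the page before the next sibling starts,
    but never before its own start page.
    """
    for i, node in enumerate(nodes):
        next_start = next(
            (s["start_index"] for s in nodes[i + 1:] if s["start_index"] is not None),
            None,
        )
        end = boundary if next_start is None else min(boundary, next_start - 1)
        if node["start_index"] is not None:
            end = max(end, node["start_index"])
            node["end_index"] = end
        _fill_ends(node["nodes"], end)


# Example: "Ch 1" has no page of its own, so it inherits page 3 from "1.1".
example_entries = [(1, "Ch 1", None), (2, "1.1", 3), (2, "1.2", 5), (1, "Ch 2", 8)]
tree = build_outline_tree(example_entries, page_count=10)
for top in tree:
    print(top["title"], top["start_index"], top["end_index"])
```

With these example entries, "Ch 1" spans pages 3 to 7 and "Ch 2" spans pages 8 to 10. Whether a section should end on the page before the next one starts, or share that page, is a design choice the PR description does not spell out.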

Why this helps

This makes the service compatible with manuals and similar structured PDFs where the old TOC inference path is the wrong abstraction.

It also keeps the fix general:

  • no document-specific special casing
  • no hardcoded handling for the operation manual
  • no behavior change for PDFs without a usable outline
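
One way to read "no behavior change for PDFs without a usable outline" is as a simple gate in front of the two paths. The sketch below is an assumption about how such a gate could look, not the PR's code: the function names and both thresholds (`min_entries`, `min_resolved_ratio`) are hypothetical, and the description only says the outline must not be absent, empty, or too sparse.

```python
def is_usable_outline(entries, page_count, min_entries=3, min_resolved_ratio=0.8):
    """Heuristic gate with hypothetical thresholds.

    entries: flat (level, title, page) tuples; page is 1-based or None.
    An outline counts as usable when it has enough entries and most of
    them point at a valid physical page.
    """
    if not entries or len(entries) < min_entries:
        return False
    resolved = [p for _, _, p in entries if p is not None and 1 <= p <= page_count]
    return len(resolved) / len(entries) >= min_resolved_ratio


def choose_structure_source(entries, page_count):
    """Return which path would run: the outline path or the existing TOC parser."""
    return "outline" if is_usable_outline(entries, page_count) else "toc_parser"


rich = [(1, "Intro", 1), (1, "Operations", 4), (2, "Checklists", 6), (1, "Appendix", 9)]
sparse = [(1, "Cover", None), (1, "Body", None), (1, "End", 2)]

print(choose_structure_source(rich, page_count=12))    # outline
print(choose_structure_source(sparse, page_count=12))  # toc_parser
print(choose_structure_source([], page_count=12))      # toc_parser
```

Because the gate only ever adds an earlier exit, documents that fail it take exactly the path they took before.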

Validation

  • added a regression test for get_pdf_outline_tree using a repository sample PDF with embedded outline
  • added a test that confirms page_index_main prefers the outline path and does not enter tree_parser when a usable outline is available
  • manually verified end-to-end parsing on an outline-bearing sample PDF in the repo
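
The second test described above (outline path preferred, `tree_parser` never entered) could be written along these lines. This is only a sketch under stated assumptions: `index_document` is a hypothetical stand-in for the dispatch inside `page_index_main`, whose real signature is not shown in this PR description, and the entry-count threshold is invented for the example.

```python
from unittest.mock import Mock


def index_document(outline_entries, outline_builder, toc_parser, min_entries=3):
    """Hypothetical stand-in for the outline-first dispatch."""
    if outline_entries and len(outline_entries) >= min_entries:
        return outline_builder(outline_entries)
    return toc_parser()


def test_prefers_outline_when_usable():
    outline_builder = Mock(return_value=[{"title": "Chapter 1"}])
    toc_parser = Mock(return_value=[])
    entries = [(1, "Chapter 1", 1), (1, "Chapter 2", 5), (1, "Chapter 3", 9)]

    result = index_document(entries, outline_builder, toc_parser)

    outline_builder.assert_called_once_with(entries)
    toc_parser.assert_not_called()  # the TOC path must not run at all
    assert result == [{"title": "Chapter 1"}]


def test_falls_back_when_outline_missing():
    outline_builder = Mock()
    toc_parser = Mock(return_value=[{"title": "From TOC"}])

    result = index_document([], outline_builder, toc_parser)

    toc_parser.assert_called_once_with()
    outline_builder.assert_not_called()
    assert result == [{"title": "From TOC"}]
```

Passing the two paths in as callables keeps the test independent of any real PDF or LLM call; the actual tests in the PR run against a repository sample PDF instead.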

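======================================Language divider========================================
Thanks to the project team for the brilliant thinking behind this project.

We tested the project and ran a small-scale validation on currently public government-affairs PDFs: the index stage reached a build success rate of 103/120, and the retrieval-accuracy and answer-accuracy stages reached 93.51%. The project's design is very inspiring, and we are excited about its potential in real-world document scenarios.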

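This PR mainly adds a more robust fallback optimization alongside the existing TOC inference path:

When the PDF itself already carries a complete and usable embedded outline / bookmarks, that native structure is read directly first and converted into the structure-tree format PageIndex currently uses (title / start_index / end_index / nodes); only when the outline is missing, too sparse, or unusable does processing fall back to the existing TOC inference flow.

The reason is that, in our tests, some government-affairs and manual-style PDFs already contain a high-quality native outline, yet their front matter often also contains structures such as an "effective-page list", sectioned page numbers, and local pagination schemes. Such documents make the text-based TOC inference path more prone to unstable page mapping and failed structure recovery, which can directly lower the index build success rate. For these files, using the PDF's own outline first is usually more reliable than re-inferring the table of contents from the text.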

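The change is minimally invasive in behavior:

  • it does not replace the existing TOC inference logic
  • it does not hardcode anything for specific documents
  • the outline-first path is enabled only when the PDF already provides sufficiently reliable native structure information
  • all other cases keep the existing flow unchanged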

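On our test set, this optimization lets documents with a complete outline / bookmarks build their structure tree directly, raising the document build success rate from 103/120 to 114/120 (roughly 85.8% to 95.0%); on the same test set, we observed no noticeable change in answer accuracy.

If this direction matches the project's expectations, we would be glad to keep helping make this compatibility path more stable.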

BukeLy and others added 27 commits March 2, 2026 18:31
Fix backfill-dedupe pagination: replace gh issue list with gh api
Backfill workflow triggers issue-dedupe via gh workflow run, which
makes the actor github-actions. Add it to allowed_bots so
claude-code-action accepts the trigger.
Allow github-actions bot to trigger claude-code-action
Issues are opened by external users who don't have write permissions.
Add allowed_non_write_users: "*" so claude-code-action runs for all
issue authors, not just repo collaborators.
fix: make ChatGPT_API_with_finish_reason return consistent tuple
fix: prevent infinite loop in extract_toc_content
The loop variable `list_index = page_index - start_index` was
overwriting the outer `list_index = incorrect_item['list_index']`,
causing results to be written back to wrong index positions.

Rename the loop variable to `page_list_idx` to avoid shadowing.

Closes #66
Fix list_index variable shadowing in fix_incorrect_toc
* Integrate litellm for multi-provider LLM support

* recover the default config yaml

* Use litellm.acompletion for native async support

* fix tob

* Rename llm_complete/allm_complete to llm_completion/llm_acompletion, remove unused llm_complete_stream

* Pin litellm to version 1.82.0

* resolve comments

* args from cli is used to overrides config.yaml

* Fix get_page_tokens hardcoded model default

Pass opt.model to get_page_tokens so tokenization respects the
configured model instead of always using gpt-4o-2024-11-20.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove explicit openai dependency from requirements.txt

openai is no longer directly imported; it comes in as a transitive
dependency of litellm. Pinning it explicitly risks version conflicts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Restore openai==1.101.0 pin in requirements.txt

litellm==1.82.0 and openai-agents have conflicting openai version
requirements, but openai==1.101.0 works at runtime for both.
The pin is necessary to prevent litellm from pulling in openai>=2.x
which would break openai-agents.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Remove explicit openai dependency from requirements.txt

openai is not directly used; it comes in as a transitive dependency
of litellm. No openai-agents in this branch so no pin needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix an litellm error log

* resolve comments

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…125)

* Add PageIndexClient with retrieve, streaming support and litellm integration
* Add OpenAI agents demo example
* Update README with example agent demo section
* Support separate retrieve_model configuration for index and retrieve
* Consolidate tests/ into examples/documents/

* Add line_count and reorder structure keys

* Lazy-load documents with _meta.json index

* Update demo script and add pre-shipped workspace

* Extract shared helpers for JSON reading and meta entry building
* Simplify and fix agentic RAG demo

* Show labeled reasoning output in RAG demo

* Comment out reasoning model settings by default
* Disable agent tracing and auto-add litellm/ prefix for retrieve_model

* Preserve supported retrieve_model prefixes

* Remove temporary retrieve_model tests

* Limit tracing disablement to demo execution
Polish demo docstring and migrate to pathlib

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@shaoqing404 shaoqing404 closed this by deleting the head repository May 15, 2026

10 participants