fix: merge mixed-font symbols into line cells #230
serge-medvedev wants to merge 2 commits into docling-project:main
Use the order-independent line merge path when building textline cells and disable same-font enforcement for line-cell merges. This prevents fallback-font symbols (for example arrows) from being emitted as standalone line cells when PDF content-stream order differs from visual order. Also make v2 line merging handle reverse adjacency symmetrically and add a targeted synthetic regression fixture/test that asserts the exact textline-cell output.
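The idea of an order-independent merge with symmetric adjacency can be sketched as below. This is a minimal illustration, not the actual `contract_cells_into_lines_v2` implementation: the `Cell` struct, `follows`, and `merge_lines` are hypothetical names, and geometry is reduced to a single baseline plus a horizontal extent.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified cell: baseline y, horizontal extent, text, font name.
struct Cell {
    double x0, x1, y;
    std::string text;
    std::string font;
};

// True if `right` starts where `left` ends (within `tol`) on the same baseline.
bool follows(const Cell& left, const Cell& right, double tol, bool same_font) {
    if (same_font && left.font != right.font) return false;
    return std::abs(left.y - right.y) <= tol &&
           std::abs(right.x0 - left.x1) <= tol;
}

// Order-independent merge: rescan all ordered pairs until nothing merges.
// Visiting both (i, j) and (j, i) handles forward and reverse adjacency alike.
std::vector<Cell> merge_lines(std::vector<Cell> cells, double tol, bool same_font) {
    assert(tol >= 0.0);
    bool merged = true;
    while (merged) {
        merged = false;
        for (std::size_t i = 0; i < cells.size() && !merged; ++i) {
            for (std::size_t j = 0; j < cells.size() && !merged; ++j) {
                if (i == j) continue;
                if (follows(cells[i], cells[j], tol, same_font)) {
                    // Absorb the right-hand cell; the merged cell keeps the left font.
                    cells[i].x1 = cells[j].x1;
                    cells[i].text += cells[j].text;
                    cells.erase(cells.begin() + static_cast<std::ptrdiff_t>(j));
                    merged = true;  // restart the scan on the shrunken vector
                }
            }
        }
    }
    return cells;
}
```

With cells supplied in the order `b`, `->`, `a` (visual order `a`, `->`, `b`) and `same_font` set to `false`, this sketch yields a single line cell; with `same_font` set to `true`, the fallback-font arrow stays detached.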
✅ DCO Check Passed. Thanks @serge-medvedev, all your commits are properly signed off. 🎉
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
I, Serge Medvedev <thismoment.main@gmail.com>, hereby add my Signed-off-by to this commit: 81dbadb Signed-off-by: Serge Medvedev <thismoment.main@gmail.com>
// Use the order-independent merge path for line construction and do not require font equality.
contract_cells_into_lines_v2(line_cells,
                             config.horizontal_cell_tolerance,
                             false,
Why would you hard-code it if we have a config parameter for this?
Due to lack of understanding of the code base.
The bug surfaced in docling-serve and it took a bit of triage to dig this deep.
What about adding a new config option, e.g. decode_page_config.enforce_same_font_for_line_cells (as the existing one decode_page_config.enforce_same_font is too coarse)?
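A rough sketch of what such a split option could look like follows. Only `enforce_same_font` and `horizontal_cell_tolerance` appear in this thread; the struct name, the helper `same_font_required`, the proposed field, and all default values are assumptions for illustration.

```cpp
// Hypothetical stand-in for the real decode-page configuration.
struct decode_page_config_sketch {
    double horizontal_cell_tolerance = 1.0;         // default value assumed
    bool enforce_same_font = true;                  // existing flag, default assumed
    bool enforce_same_font_for_line_cells = false;  // proposed, line cells only
};

// Pick the flag that applies to the kind of cell being built, so that
// relaxing the line-cell rule leaves word-cell behavior unchanged.
bool same_font_required(const decode_page_config_sketch& config, bool building_lines) {
    return building_lines ? config.enforce_same_font_for_line_cells
                          : config.enforce_same_font;
}
```

The call site would then pass `same_font_required(config, true)` in place of the hard-coded `false`.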
@serge-medvedev I would not mind doing that, but I first need to understand the problem better. If you currently use enforce_same_font=false, what does not work for you?
- The issue is not only font mismatch.
  contract_cells_into_lines_v1 is order-sensitive and break-driven. With PDF content-stream order != visual order, symbols can still end up detached even if same-font enforcement is disabled.
- enforce_same_font is shared for word and line construction.
  decode_page_config.enforce_same_font applies to both word-cells and line-cells. Flipping it globally to fix line behavior also changes word-cell behavior, which is a broader semantic change.
- enforce_same_font=false does not address the ordering/break behavior that caused my arrow misplacement case.
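To illustrate why an order-sensitive, break-driven pass detaches symbols regardless of the font rule, here is a simplified sketch. It is not the real `contract_cells_into_lines_v1`; the `Fragment` struct and `merge_in_stream_order` are hypothetical, and font checks are omitted entirely to show that ordering alone is enough to cause the split.

```cpp
#include <cmath>
#include <string>
#include <vector>

// Hypothetical, simplified text fragment: baseline y, horizontal extent, text.
struct Fragment {
    double x0, x1, y;
    std::string text;
};

// Break-driven pass: walk fragments in content-stream order and extend the
// current line only when the next fragment starts where the line ends.
// Any other fragment forces a break and starts a new line.
std::vector<Fragment> merge_in_stream_order(const std::vector<Fragment>& fragments,
                                            double tol) {
    std::vector<Fragment> lines;
    for (const Fragment& f : fragments) {
        const bool continues = !lines.empty() &&
                               std::abs(lines.back().y - f.y) <= tol &&
                               std::abs(f.x0 - lines.back().x1) <= tol;
        if (continues) {
            lines.back().x1 = f.x1;
            lines.back().text += f.text;
        } else {
            lines.push_back(f);  // break: fragment becomes its own line
        }
    }
    return lines;
}
```

If the stream emits `a`, `b`, and then the arrow that visually sits between them, this pass produces three separate lines; the same fragments in visual order merge into one.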
@serge-medvedev I think this PR might just fix all your issue(s): #234