fix(matcher): recover orphan pdf cells that match a row band but no column band#159
Open
scottf007 wants to merge 1 commit into
Open
fix(matcher): recover orphan pdf cells that match a row band but no column band#159scottf007 wants to merge 1 commit into
scottf007 wants to merge 1 commit into
Conversation
…olumn band `MatchingPostProcessor._pick_orphan_cells` iterates pdf cells whose y-position falls inside a row band but whose x-position lies outside every predicted column band. The existing code wraps the entire assignment block in `if pdf_cell_id in used_col_pdf_ids:` with no `else` branch. PDF cells that made it into `orphan_rows_pdf_ids` but never landed in a column band are therefore silently dropped — no log, no warning, no fallback. This shows up in real PDFs as missing data in the output. A representative case: a totals subtable rendered at the bottom of a larger table region where the predicted columns are sized to the upper region's geometry; the right-aligned totals values fall outside every predicted column band and disappear from `table_cells` with no observable signal. The fix: * Cache each column's X-centroid (`(col_x1 + col_x2) / 2`) once, during the per-column orphan loop, in a `col_centroids` dict keyed by column id. * Add an `else` branch in the orphan-rows loop: when a pdf cell has a row band but no col band, snap to the column whose centroid is nearest the pdf cell's X-centroid. Confidence is encoded as `-int(round(distance)) - 1` so it is always negative for fallback matches and downstream consumers can distinguish them from regular high-confidence matches if they care. * Emit a `WARNING`-level log so the recovery is observable in production output. (The function only logs at DEBUG today — operators reasonably expect a WARNING when the matcher resorts to a fallback path.) The cell-create / reuse / `new_matches[...]` block is hoisted out of the `if` so both paths share it. No behavioural change to the happy path (in-band orphans). This addresses one half of docling-project#28 (post-processing improvements). It does not fix the orthogonal row-band variant where a pdf cell's y-position is just outside every row band — that requires a separate change to the row-band inclusion logic or a complementary fallback for cells inside the table prov bbox but outside all row bands. Documented for follow-up. Tests: new `tests/test_matching_post_processor.py` with synthetic-input unit tests covering both the recovery path and the unchanged in-band fast path. Existing test suite unaffected. Signed-off-by: scott <scott@fletchcorp.com>
Contributor
|
✅ DCO Check Passed Thanks @scottf007, all your commits are properly signed off. 🎉 |
Contributor
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesWaiting for
This rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
`MatchingPostProcessor._pick_orphan_cells` iterates pdf cells whose y-position falls inside a row band but whose x-position lies outside every predicted column band. The existing code wraps the entire assignment block in `if pdf_cell_id in used_col_pdf_ids:` with no `else` branch. PDF cells that made it into `orphan_rows_pdf_ids` but never landed in a column band are therefore silently dropped — no log, no warning, no fallback.
This shows up in real PDFs as missing data in the output. A representative case: a totals subtable rendered at the bottom of a larger table region where the predicted columns are sized to the upper region's geometry; the right-aligned totals values fall outside every predicted column band and disappear from `table_cells` with no observable signal.
Fix
What this does not fix
There is an orthogonal silent-drop pathway where a pdf cell's y-position is just outside every row band (the row-band detection at lines 828-838 has the same inclusive-boundary issue). I encountered this on real-world Computershare dividend statements where a totals data row sits ~3 page units below the predicted row band's bottom edge, so the values never make it into `orphan_rows_pdf_ids` for my fix to recover. Addressing that requires either a complementary fallback for cells inside the table prov bbox but outside all row bands, or a relaxation of the row-band inclusion bounds. Tracking as follow-up.
Test plan
Issues addressed
Closes #28 (partially) — "Post processing improvements". The reporter asked for orphan tokens to be assigned rather than dropped. This addresses the column-band-miss case. The row-band-miss case is documented as follow-up above.
Notes