Skip to content

fix(matcher): recover orphan pdf cells that match a row band but no column band#159

Open
scottf007 wants to merge 1 commit into
docling-project:mainfrom
scottf007:fix/orphan-pdf-cell-silent-drop
Open

fix(matcher): recover orphan pdf cells that match a row band but no column band#159
scottf007 wants to merge 1 commit into
docling-project:mainfrom
scottf007:fix/orphan-pdf-cell-silent-drop

Conversation

@scottf007
Copy link
Copy Markdown

Summary

`MatchingPostProcessor._pick_orphan_cells` iterates pdf cells whose y-position falls inside a row band but whose x-position lies outside every predicted column band. The existing code wraps the entire assignment block in `if pdf_cell_id in used_col_pdf_ids:` with no `else` branch. PDF cells that made it into `orphan_rows_pdf_ids` but never landed in a column band are therefore silently dropped — no log, no warning, no fallback.

This shows up in real PDFs as missing data in the output. A representative case: a totals subtable rendered at the bottom of a larger table region where the predicted columns are sized to the upper region's geometry; the right-aligned totals values fall outside every predicted column band and disappear from `table_cells` with no observable signal.

Fix

  • Cache each column's X-centroid (`(col_x1 + col_x2) / 2`) once during the per-column orphan loop in a new `col_centroids: dict[int, float]` keyed by column id.
  • Add an `else` branch in the orphan-rows loop: when a pdf cell has a row-band match but no column-band match, snap to the column whose centroid is nearest the pdf cell's X-centroid.
  • Confidence is encoded as `-int(round(distance)) - 1` so fallback matches are always negative; downstream consumers can distinguish them from band matches if they care.
  • Emit a `WARNING`-level log so the recovery is observable. The function only logs at DEBUG today — operators reasonably expect a WARNING when the matcher resorts to a fallback path.
  • Hoist the cell-create / reuse / `new_matches[...]` block out of the `if` so both paths share it. No behavioural change to the happy path (in-band orphans).

What this does not fix

There is an orthogonal silent-drop pathway where a pdf cell's y-position is just outside every row band (the row-band detection at lines 828-838 has the same inclusive-boundary issue). I encountered this on real-world Computershare dividend statements where a totals data row sits ~3 page units below the predicted row band's bottom edge, so the values never make it into `orphan_rows_pdf_ids` for my fix to recover. Addressing that requires either a complementary fallback for cells inside the table prov bbox but outside all row bands, or a relaxation of the row-band inclusion bounds. Tracking as follow-up.

Test plan

  • New `tests/test_matching_post_processor.py`:
    • `test_orphan_with_no_col_band_match_is_recovered_to_nearest_column` — synthetic input where a pdf cell sits in a row's y-band but x=560 (outside both column x-bands at 10..90 and 110..200). Pre-fix the orphan is dropped from `new_matches`; post-fix it is assigned to the nearest column (col 1, centroid 155, dist 405) and confidence < 0.
    • `test_band_matched_orphans_use_normal_path` — sanity check that in-band orphans take the unchanged happy path with confidence >= 0.
  • `pre-commit run --files ` — Black, isort, MyPy all green after autoformat.
  • End-to-end smoke test on a real Computershare dividend PDF with a totals subtable: WARNING fires for the orphan pdf cell, the recovered cell appears in `table_cells` for row 21 of the table.
  • Existing test suite unaffected: light tests pass; integration tests (`test_tf_predictor.py`, `test_tableformer_v2.py`) not run locally — they require model weights — but no behaviour change in the in-band fast path means they should remain green.

Issues addressed

Closes #28 (partially) — "Post processing improvements". The reporter asked for orphan tokens to be assigned rather than dropped. This addresses the column-band-miss case. The row-band-miss case is documented as follow-up above.

Notes

  • `tab_cols`, `pdf_cells`, etc. are the existing function parameters; no signature change.
  • No new dependencies.
  • The change is purely additive to behaviour: orphans previously dropped now land in some column. The change cannot remove cells that were previously emitted, only add cells. Worst case is over-eager attachment of a stray pdf token to a column, which the WARNING log makes immediately visible in operator output.

…olumn band

`MatchingPostProcessor._pick_orphan_cells` iterates pdf cells whose y-position
falls inside a row band but whose x-position lies outside every predicted
column band. The existing code wraps the entire assignment block in
`if pdf_cell_id in used_col_pdf_ids:` with no `else` branch. PDF cells that
made it into `orphan_rows_pdf_ids` but never landed in a column band are
therefore silently dropped — no log, no warning, no fallback.

This shows up in real PDFs as missing data in the output. A representative
case: a totals subtable rendered at the bottom of a larger table region
where the predicted columns are sized to the upper region's geometry; the
right-aligned totals values fall outside every predicted column band and
disappear from `table_cells` with no observable signal.

The fix:

* Cache each column's X-centroid (`(col_x1 + col_x2) / 2`) once, during the
  per-column orphan loop, in a `col_centroids` dict keyed by column id.
* Add an `else` branch in the orphan-rows loop: when a pdf cell has a row
  band but no col band, snap to the column whose centroid is nearest the
  pdf cell's X-centroid. Confidence is encoded as `-int(round(distance)) - 1`
  so it is always negative for fallback matches and downstream consumers
  can distinguish them from regular high-confidence matches if they care.
* Emit a `WARNING`-level log so the recovery is observable in production
  output. (The function only logs at DEBUG today — operators reasonably
  expect a WARNING when the matcher resorts to a fallback path.)

The cell-create / reuse / `new_matches[...]` block is hoisted out of the
`if` so both paths share it. No behavioural change to the happy path
(in-band orphans).

This addresses one half of docling-project#28 (post-processing improvements). It does not
fix the orthogonal row-band variant where a pdf cell's y-position is just
outside every row band — that requires a separate change to the row-band
inclusion logic or a complementary fallback for cells inside the table
prov bbox but outside all row bands. Documented for follow-up.

Tests: new `tests/test_matching_post_processor.py` with synthetic-input
unit tests covering both the recovery path and the unchanged in-band
fast path. Existing test suite unaffected.

Signed-off-by: scott <scott@fletchcorp.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

DCO Check Passed

Thanks @scottf007, all your commits are properly signed off. 🎉

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 4, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Post processing improvements

1 participant