Skip to content

fix: Align start, end, and probability lists in ExtractiveReader#11347

Merged
bogdankostic merged 4 commits into
mainfrom
extractive_reader_bug
May 20, 2026
Merged

fix: Align start, end, and probability lists in ExtractiveReader#11347
bogdankostic merged 4 commits into
mainfrom
extractive_reader_bug

Conversation

@bogdankostic
Copy link
Copy Markdown
Contributor

Related Issues

ExtractiveReader.run was raising ValueError: zip() argument 3 is longer than arguments 1-2 whenever the number of valid answer spans for a sequence was smaller than answers_per_seq (e.g. short documents, or any case where the default answers_per_seq=20 exceeded the number of non-masked upper-triangular (start, end) token pairs).

Proposed Changes:

In _postprocess, filter each sequence's candidates_values row by the same valid mask and return a list[torch.Tensor] (one 1D tensor per sequence, variable length) instead of a 2D tensor. The return-type annotation and the _nest_answers probabilities parameter annotation were updated to match.

Root cause

_postprocess filtered the per-sequence start/end token-index lists by valid = candidates_values[i] > 0, producing variable-length lists, but returned candidates_values as the unfiltered 2D tensor of shape (batch_size, answers_per_seq).

How did you test it?

Added two unit tests in test/components/readers/test_extractive.py:

  • test_postprocess_filters_probs_when_answers_per_seq_exceeds_valid_spans:
    constructs a batch where answers_per_seq=5 exceeds the 3 valid upper-triangular spans, asserts all three returned per-sequence structures have matching length 3 and that retained probabilities are all > 0.
  • test_nest_answers_accepts_variable_length_probability_rows:
    calls _nest_answers with probabilities as a list[Tensor] containing rows of length 2 and 1 (the new contract) and
    verifies scores flow through correctly.

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct.
  • I have updated the related issue with new insights and changes.
  • I have added unit tests and updated the docstrings.
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I have documented my code.
  • I have added a release note file, following the contributors guidelines.
  • I have run pre-commit hooks and fixed any issue.

@bogdankostic bogdankostic requested a review from a team as a code owner May 20, 2026 12:29
@bogdankostic bogdankostic requested review from anakin87 and removed request for a team May 20, 2026 12:29
@vercel
Copy link
Copy Markdown

vercel Bot commented May 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
haystack-docs Ignored Ignored Preview May 20, 2026 2:22pm

Request Review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

Coverage report

Click to see where and how coverage changed

FileStatementsMissingCoverageCoverage
(new stmts)
Lines missing
  haystack/components/readers
  extractive.py
Project Total  

This report was generated by python-coverage-comment-action

@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@anakin87
Copy link
Copy Markdown
Member

while i < len(answers_without_query) and query_ids[i // answers_per_seq] == query_id:

(i // answers_per_seq) assumes that every sequence produced the same number of answers, which is no longer true in general. This still works because the reader uses a single query. I'd add a short comment here to explain this aspect.

@bogdankostic
Copy link
Copy Markdown
Contributor Author

while i < len(answers_without_query) and query_ids[i // answers_per_seq] == query_id:

(i // answers_per_seq) assumes that every sequence produced the same number of answers, which is no longer true in general. This still works because the reader uses a single query. I'd add a short comment here to explain this aspect.

Good catch! I added a comment.

Comment thread test/components/readers/test_extractive.py Outdated
Copy link
Copy Markdown
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good improvement.

My impression is that the implementation has always been wrong but then zip(..., strict=True) introduced in #10640 made the error surface.

I just left a minor comment to improve a test.

Co-authored-by: Stefano Fiorucci <stefanofiorucci@gmail.com>
@bogdankostic bogdankostic merged commit 435daf3 into main May 20, 2026
28 of 30 checks passed
@bogdankostic bogdankostic deleted the extractive_reader_bug branch May 20, 2026 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants