fix(knesset_committee_decisions): paginate OData responses #242

Open
wilfoa wants to merge 1 commit into OpenBudget:master from wilfoa:fix/knesset-committee-decisions-pagination

Contributor

@wilfoa wilfoa commented Apr 25, 2026

Summary

pipelines/knesset/knesset_committee_decisions.py queries
KNS_DocumentCommitteeSession once at the top of flow() and ignores
@odata.nextLink. OData v4 on knesset.gov.il paginates at 100 rows,
so the pipeline only ever sees the first 100 documents. Sorted by
Id ascending, those are all from 2016 (Knesset 18/20), and the output
index.csv has been frozen at 7 records from 2016 ever since.

The live API today reports 12.5M+ matching documents, and the most recent
GroupTypeID=106 PDF is dated 2026-04-21. The data is there; the
pipeline just isn't reading it.

What this PR does

  • Adds an _odata_paged() helper that walks every @odata.nextLink.
  • Replaces the global KNS_DocumentCommitteeSession fetch with a
    per-session paginated query (CommitteeSessionID eq … and
    GroupTypeID eq 106). Each per-session response is small (typically
    0–3 PDFs) and bounded.
  • Also paginates KNS_Committee and KNS_CommitteeSession for
    consistency. The committee and session counts are small enough
    today to fit in 100 rows, but paginating now prevents the same
    silent truncation once they grow past that limit.
  • Renames the inner document loop variable to pdf_resp so it no
    longer shadows the outer dict. The previous code reassigned
    document = requests.get(...) mid-loop, which would have broken the
    next document['ApplicationDesc'] access if that path had been
    reached more than once for the same dict.
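
The shadowing problem described in the last bullet can be reproduced in isolation. The sketch below is not the pipeline's code: FakeResponse, fake_get, the "PDF" value, the FilePath key, and the retry loop are assumptions made only to reach the same branch twice for one dict without any network access.

```python
class FakeResponse:
    """Stand-in for the object returned by requests.get (demo assumption)."""
    content = b"%PDF-"


def fake_get(url):
    return FakeResponse()


def download_shadowed(documents, attempts=2):
    """Old pattern: the HTTP response is bound to the same name as the dict."""
    blobs = []
    for document in documents:
        for _ in range(attempts):
            # On the second pass `document` is a FakeResponse, not a dict,
            # so the subscript below raises TypeError.
            if document["ApplicationDesc"] == "PDF":
                document = fake_get(document["FilePath"])
                blobs.append(document.content)
    return blobs


def download_renamed(documents, attempts=2):
    """Fixed pattern: the response gets its own name, the dict stays intact."""
    blobs = []
    for document in documents:
        for _ in range(attempts):
            if document["ApplicationDesc"] == "PDF":
                pdf_resp = fake_get(document["FilePath"])
                blobs.append(pdf_resp.content)
    return blobs


docs = [{"ApplicationDesc": "PDF", "FilePath": "https://example.invalid/a.pdf"}]

try:
    download_shadowed(docs)
    shadow_error = None
except TypeError as exc:
    shadow_error = exc

renamed_blobs = download_renamed(docs)
```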

Smoke-test against live OData (no dpp run needed)

# Per-session filter works:
session=2242368 Id=12594486 date=2026-04-21T09:06:13.71+03:00
session=2201184 Id=12577930 date=2026-04-20T11:04:11.54+03:00
session=2201184 Id=12577921 date=2026-04-20T11:02:13.193+03:00

# Pagination walks correctly:
committee 2024: walked 62 sessions across multiple pages

The first-page-only behavior of the existing code can be reproduced by
curling the OData URL: the response is capped at 100 rows and includes
@odata.nextLink, and with $orderby=Id ascending only documents
334231–341207 (all 2016) appear in that first page.

Test plan

  • Live OData smoke test (above) confirming that pagination and the
    per-session filter work and find current data.
  • Run dpp locally and check that index.csv has thousands of
    rows spanning 2016 → 2026 instead of 7 from 2016.
  • After merge + pipeline rebuild, confirm
    next.obudget.org/datapackages/knesset/knesset_committee_decisions/datapackage.json
    reports count_of_rows > 7.
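
The last check could be scripted along these lines. This is only a sketch: it assumes count_of_rows is a top-level key of datapackage.json, and the helper name row_count_recovered and the sample JSON strings are made up for the example.

```python
import json


def row_count_recovered(datapackage_text: str, stale_count: int = 7) -> bool:
    """Return True when the datapackage reports more rows than the stale count."""
    package = json.loads(datapackage_text)
    # Assumption: the row count lives at the top level under "count_of_rows".
    # A missing key is treated as 0 rows, i.e. not recovered.
    return package.get("count_of_rows", 0) > stale_count


# Illustrative inputs only; these are not real datapackage contents.
stale = row_count_recovered('{"name": "demo", "count_of_rows": 7}')
fresh = row_count_recovered('{"name": "demo", "count_of_rows": 4200}')
missing = row_count_recovered('{"name": "demo"}')
```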

Why this matters downstream

whiletrue-industries/rebuilding-bots consumes
next.obudget.org/datapackages/knesset/knesset_committee_decisions/index.csv
to power one of the unified Parlibot's contexts. With the bug, the bot
has been answering committee-decision questions from a 7-row, 2016-only
corpus for the past several months. We just shipped a Layer-1 safety
rail there (refuses to overwrite local extraction CSVs when upstream
is empty / stale, alarms via SNS). This PR is Layer 2 — the actual
upstream fix.

🤖 Generated with Claude Code

The pipeline fetched the global KNS_DocumentCommitteeSession query in
one shot and ignored the @odata.nextLink. OData v4 on knesset.gov.il
returns at most 100 rows per response, so the pipeline was only ever
looking at the first 100 documents — sorted by Id ascending, those are
all from 2016 (Knesset 18/20). The output index.csv has been frozen
at 7 rows from 2016 ever since.

Smoke-test against the live API today shows 12.5M+ matching documents
in the dataset and the most recent GroupTypeID=106 PDF is dated
2026-04-21. Walking @odata.nextLink and pushing the
'CommitteeSessionID eq …' predicate into OData (instead of an
in-memory scan over the stale 100-row global list) recovers all of it.
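
For illustration, pushing the predicate into the query string might look like the sketch below. The base URL is a placeholder (the real service root is not shown here), and session_documents_url is a hypothetical helper, not a function from this commit.

```python
from urllib.parse import quote, urlencode


def session_documents_url(base_url: str, session_id: int, group_type_id: int = 106) -> str:
    """Build a KNS_DocumentCommitteeSession query filtered server-side."""
    # int() guards against anything but a number being spliced into the filter.
    predicate = (
        f"CommitteeSessionID eq {int(session_id)} "
        f"and GroupTypeID eq {int(group_type_id)}"
    )
    # quote (rather than the default quote_plus) encodes spaces as %20;
    # safe="$" keeps the OData system query option name "$filter" readable.
    query = urlencode({"$filter": predicate}, quote_via=quote, safe="$")
    return f"{base_url}/KNS_DocumentCommitteeSession?{query}"


# Placeholder service root, used only for the example.
url = session_documents_url("https://example.invalid/odata", 2242368)
```

The server then returns only the rows for that session, and any @odata.nextLink in the response still has to be followed as for the other queries.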

Other changes:
- _odata_paged helper for re-use across the three pagination sites
  (committees, sessions, per-session documents).
- Renamed the inner 'document' loop variable to 'pdf_resp' so it no
  longer shadows the outer dict.