fix(knesset_committee_decisions): paginate OData responses by wilfoa · Pull Request #242 · OpenBudget/budgetkey-data-pipelines

wilfoa · 2026-04-25T13:53:35Z

Summary

pipelines/knesset/knesset_committee_decisions.py queries
KNS_DocumentCommitteeSession once at the top of flow() and ignores
@odata.nextLink. OData v4 on knesset.gov.il paginates at 100 rows,
so the pipeline only ever sees the first 100 documents. Sorted by
Id ascending, those are all from 2016 (Knesset 18/20), and the output
index.csv has been frozen at 7 records from 2016 ever since.

Live API today reports 12.5M+ matching documents and the most recent
GroupTypeID=106 PDF is dated 2026-04-21. The data is there; the
pipeline just isn't reading it.

What this PR does

- Adds an _odata_paged() helper that walks every @odata.nextLink.
- Replaces the global KNS_DocumentCommitteeSession fetch with a
  per-session paginated query (CommitteeSessionID eq … and
  GroupTypeID eq 106). Each per-session response is small (typically
  0–3 PDFs) and bounded.
- Also paginates KNS_Committee and KNS_CommitteeSession for
  consistency: the current dataset's committee/session counts are
  small enough today that they fit in 100 rows, but this prevents the
  same silent regression next year.
- Renames the inner document loop variable so it no longer shadows
  the outer dict. The previous code reassigned
  document = requests.get(...) mid-loop, which would have broken the
  next iteration's document['ApplicationDesc'] access if that path had
  been reached more than once for the same dict.
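As a rough illustration of the pagination and per-session filter described above, here is a minimal sketch. It is not the code from this PR: the names odata_paged, fetch_json and session_documents_url, the `base` parameter, and the use of urllib instead of requests are all assumptions made so the example is self-contained.

```python
import json
import urllib.request
from typing import Any, Callable, Dict, Iterator
from urllib.parse import quote


def fetch_json(url: str) -> Dict[str, Any]:
    """Default fetcher (assumption): GET the URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)


def odata_paged(
    url: str, fetch: Callable[[str], Dict[str, Any]] = fetch_json
) -> Iterator[Dict[str, Any]]:
    """Yield every row of an OData v4 collection by following @odata.nextLink."""
    while url:
        page = fetch(url)
        yield from page.get("value", [])
        # The last page carries no @odata.nextLink, which ends the loop.
        url = page.get("@odata.nextLink")


def session_documents_url(base: str, session_id: int) -> str:
    """Build the per-session query; `base` is a hypothetical service root."""
    flt = f"CommitteeSessionID eq {session_id} and GroupTypeID eq 106"
    return f"{base}/KNS_DocumentCommitteeSession?$filter={quote(flt)}"


# Usage with an in-memory fake instead of the live API:
fake_pages = {
    "page1": {"value": [{"Id": 1}, {"Id": 2}], "@odata.nextLink": "page2"},
    "page2": {"value": [{"Id": 3}]},
}
all_ids = [row["Id"] for row in odata_paged("page1", fake_pages.__getitem__)]
```

Passing the fetcher as a parameter keeps the paging loop testable without network access; the real helper may well be structured differently.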

Smoke-test against live OData (no `dpp run` needed)

# Per-session filter works:
session=2242368 Id=12594486 date=2026-04-21T09:06:13.71+03:00
session=2201184 Id=12577930 date=2026-04-20T11:04:11.54+03:00
session=2201184 Id=12577921 date=2026-04-20T11:02:13.193+03:00

# Pagination walks correctly:
committee 2024: walked 62 sessions across multiple pages

The first-page-only behavior of the existing code is reproducible by
just curling the OData URL — $top defaults to 100, response includes
@odata.nextLink, only documents 334231–341207 (all 2016) appear in
that first page when $orderby=Id ascending.
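To make that check repeatable, a decoded first page can be summarised as in the sketch below. The function name first_page_summary is hypothetical, and the sample page is synthetic: it only reuses the two boundary Ids quoted above rather than a real API response.

```python
from typing import Any, Dict


def first_page_summary(page: Dict[str, Any]) -> Dict[str, Any]:
    """Summarise one decoded OData page: row count, Id range, more pages?"""
    rows = page.get("value", [])
    ids = [row["Id"] for row in rows]
    return {
        "rows": len(rows),
        "min_id": min(ids) if ids else None,
        "max_id": max(ids) if ids else None,
        # A nextLink means the server truncated the result set.
        "has_more": "@odata.nextLink" in page,
    }


# Synthetic page for illustration only (not captured from the live service).
sample_page = {
    "value": [{"Id": 334231}, {"Id": 341207}],
    "@odata.nextLink": "https://example.invalid/next",
}
summary = first_page_summary(sample_page)
```

A page with has_more set to True that is treated as the full dataset is exactly the failure mode described here.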

Test plan

- Live OData smoke test (above) confirming that pagination and the
  per-session filter work and find current data.
- Run dpp locally and check that index.csv has thousands of
  rows spanning 2016 → 2026 instead of 7 from 2016.
- After merge + pipeline rebuild, confirm that
  next.obudget.org/datapackages/knesset/knesset_committee_decisions/datapackage.json
  reports count_of_rows > 7.
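The post-merge check could be scripted along these lines. This is only a sketch: it assumes count_of_rows appears at the top level of the decoded datapackage.json (falling back to per-resource counts if not), and the function names are hypothetical.

```python
from typing import Any, Dict


def reported_row_count(datapackage: Dict[str, Any]) -> int:
    """Return the row count a datapackage descriptor reports, or 0 if absent."""
    # Assumption: count_of_rows is a top-level key; otherwise sum the
    # per-resource counts, if any are present.
    if "count_of_rows" in datapackage:
        return int(datapackage["count_of_rows"])
    resources = datapackage.get("resources", [])
    return sum(int(r.get("count_of_rows", 0)) for r in resources)


def looks_fixed(datapackage: Dict[str, Any], stale_rows: int = 7) -> bool:
    """True once the descriptor reports more rows than the frozen 7."""
    return reported_row_count(datapackage) > stale_rows
```

The descriptor would be loaded with json.load from the published datapackage.json before being passed in.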

Why this matters downstream

whiletrue-industries/rebuilding-bots consumes
next.obudget.org/datapackages/knesset/knesset_committee_decisions/index.csv
to power one of the unified Parlibot's contexts. With the bug, the bot
has been answering committee-decision questions from a 7-row, 2016-only
corpus for the past several months. We just shipped a Layer-1 safety
rail there (refuses to overwrite local extraction CSVs when upstream
is empty / stale, alarms via SNS). This PR is Layer 2 — the actual
upstream fix.

🤖 Generated with Claude Code

The pipeline fetched the global KNS_DocumentCommitteeSession query in one shot and ignored the @odata.nextLink. OData v4 on knesset.gov.il returns at most 100 rows per response, so the pipeline was only ever looking at the first 100 documents. Sorted by Id ascending, those are all from 2016 (Knesset 18/20). The output index.csv has been frozen at 7 rows from 2016 ever since.

Smoke-test against the live API today shows 12.5M+ matching documents in the dataset and the most recent GroupTypeID=106 PDF is dated 2026-04-21. Walking @odata.nextLink and pushing the 'CommitteeSessionID eq …' predicate into OData (instead of an in-memory scan over the stale 100-row global list) recovers all of it.

Other changes:
- _odata_paged helper for re-use across the three pagination sites (committees, sessions, per-session documents).
- Renamed the inner 'document' loop variable to 'pdf_resp' so it no longer shadows the outer dict.
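The shadowing problem behind the 'pdf_resp' rename can be reproduced in isolation. The sketch below is not the pipeline's code: the "urls" key, the FakeResp class and both function names are hypothetical, chosen only to show why rebinding 'document' inside the loop breaks the next dict access.

```python
class FakeResp:
    """Stand-in for an HTTP response object (hypothetical)."""

    def __init__(self, content: bytes):
        self.content = content


def collect_buggy(documents, fetch):
    rows = []
    for document in documents:
        for url in document["urls"]:
            # Second pass: 'document' is now a response, not a dict.
            kind = document["ApplicationDesc"]
            document = fetch(url)  # rebinds the outer loop variable
            rows.append((kind, document.content))
    return rows


def collect_fixed(documents, fetch):
    rows = []
    for document in documents:
        for url in document["urls"]:
            kind = document["ApplicationDesc"]
            pdf_resp = fetch(url)  # distinct name keeps the dict intact
            rows.append((kind, pdf_resp.content))
    return rows


docs = [{"ApplicationDesc": "PDF", "urls": ["a", "b"]}]


def fake_fetch(url: str) -> FakeResp:
    return FakeResp(url.encode())


fixed_rows = collect_fixed(docs, fake_fetch)
```

With two URLs on one dict, collect_buggy raises TypeError on the second pass because a FakeResp is not subscriptable, while collect_fixed handles both.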

wilfoa mentioned this pull request Apr 25, 2026

knesset ethics_committee_decisions: empty since 2025-09 — same Radware blocker as #243 #244

Open

wilfoa wants to merge 1 commit into OpenBudget:master from wilfoa:fix/knesset-committee-decisions-pagination
