fix(knesset_committee_decisions): paginate OData responses #242
Open
wilfoa wants to merge 1 commit into
The pipeline fetched the global KNS_DocumentCommitteeSession query in one shot and ignored the @odata.nextLink. OData v4 on knesset.gov.il returns at most 100 rows per response, so the pipeline was only ever looking at the first 100 documents; sorted by Id ascending, those are all from 2016 (Knesset 18/20). The output index.csv has been frozen at 7 rows from 2016 ever since.

A smoke test against the live API today shows 12.5M+ matching documents in the dataset, and the most recent GroupTypeID=106 PDF is dated 2026-04-21. Walking @odata.nextLink and pushing the 'CommitteeSessionID eq …' predicate into OData (instead of an in-memory scan over the stale 100-row global list) recovers all of it.

Other changes:
- _odata_paged helper for re-use across the three pagination sites (committees, sessions, per-session documents).
- Renamed the inner 'document' loop variable to 'pdf_resp' so it no longer shadows the outer dict.
Summary
`pipelines/knesset/knesset_committee_decisions.py` calls `KNS_DocumentCommitteeSession` once at the top of `flow()` and ignores `@odata.nextLink`. OData v4 on `knesset.gov.il` paginates at 100 rows, so the pipeline only ever sees the first 100 documents; sorted by `Id` ascending, those are all from 2016 (Knesset 18/20). The output `index.csv` has been frozen at 7 records from 2016 ever since. The live API today reports 12.5M+ matching documents, and the most recent `GroupTypeID=106` PDF is dated 2026-04-21. The data is there; the pipeline just isn't reading it.
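To make the failure mode concrete, here is a minimal sketch contrasting first-page-only reading with following `@odata.nextLink`. It is not the PR's code: the function names `first_page_only` and `walk_next_links`, the injected `fetch` callable, and the `example.invalid` URLs are hypothetical, and the two fake pages stand in for the live API.

```python
from typing import Any, Callable, Dict, Iterator, List

Page = Dict[str, Any]
Fetch = Callable[[str], Page]


def first_page_only(url: str, fetch: Fetch) -> List[dict]:
    """Buggy pattern: read "value" once and never look at @odata.nextLink."""
    return fetch(url).get("value", [])


def walk_next_links(url: str, fetch: Fetch) -> Iterator[dict]:
    """Yield rows from every page, following @odata.nextLink until it is absent."""
    next_url = url
    while next_url:
        page = fetch(next_url)
        yield from page.get("value", [])
        # The last page carries no @odata.nextLink, which ends the loop.
        next_url = page.get("@odata.nextLink")


# Hypothetical two-page response used in place of the real service.
FAKE_PAGES: Dict[str, Page] = {
    "https://example.invalid/docs": {
        "value": [{"Id": 1}, {"Id": 2}],
        "@odata.nextLink": "https://example.invalid/docs?$skip=2",
    },
    "https://example.invalid/docs?$skip=2": {"value": [{"Id": 3}]},
}


def fake_fetch(url: str) -> Page:
    """Stand-in for an HTTP GET that returns parsed JSON."""
    return FAKE_PAGES[url]


if __name__ == "__main__":
    start = "https://example.invalid/docs"
    print(len(first_page_only(start, fake_fetch)))        # rows seen without paging
    print(len(list(walk_next_links(start, fake_fetch))))  # rows seen with paging
```

Against a live service, `fetch` could plausibly be something like `lambda u: requests.get(u).json()`; injecting it keeps the paging logic testable without network access.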
What this PR does
- Adds an `_odata_paged()` helper that walks every `@odata.nextLink`.
- Replaces the global `KNS_DocumentCommitteeSession` fetch with a per-session paginated query (`CommitteeSessionID eq … and GroupTypeID eq 106`). Each per-session response is small (typically 0–3 PDFs) and bounded.
- Paginates `KNS_Committee` and `KNS_CommitteeSession` for consistency. The current dataset's committee/session counts are small enough today that they fit in 100 rows, but this prevents the same silent regression next year.
- Renames the inner `document` loop variable to `pdf_resp` so it no longer shadows the outer dict. The previous code reassigned `document = requests.get(...)` mid-loop, which would have broken the next iteration's `document['ApplicationDesc']` access if that path had been reached more than once for the same dict.
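The per-session query described in the list above could be built roughly as follows. This is a sketch under assumptions, not the PR's implementation: `session_documents_url` and `GROUP_TYPE_ID` are hypothetical names, and the base URL is a placeholder supplied by the caller rather than the real service root.

```python
from urllib.parse import quote, urlencode

# The group type named in the PR description.
GROUP_TYPE_ID = 106


def session_documents_url(base_url: str, session_id: int) -> str:
    """Build a per-session KNS_DocumentCommitteeSession query URL (hypothetical helper)."""
    odata_filter = (
        f"CommitteeSessionID eq {int(session_id)} "
        f"and GroupTypeID eq {GROUP_TYPE_ID}"
    )
    # safe="$" keeps the "$filter" option name unescaped; spaces become %20.
    query = urlencode({"$filter": odata_filter}, safe="$", quote_via=quote)
    return f"{base_url.rstrip('/')}/KNS_DocumentCommitteeSession?{query}"


if __name__ == "__main__":
    # example.invalid is a placeholder host, not the real OData endpoint.
    print(session_documents_url("https://example.invalid/odata/", 12345))
```

Pushing the predicate into `$filter` means the server returns only the rows for one session, so each response stays small even though it is still walked page by page.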
Smoke-test against live OData (no `dpp run` needed)
The first-page-only behavior of the existing code is reproducible by just curling the OData URL: `$top` defaults to 100, the response includes `@odata.nextLink`, and only documents 334231–341207 (all 2016) appear in that first page when `$orderby=Id` ascending.
Test plan
- … filter work and find current data.
- `dpp run` and inspect that `index.csv` has thousands of rows spanning 2016 → 2026 instead of 7 from 2016.
- `next.obudget.org/datapackages/knesset/knesset_committee_decisions/datapackage.json` reports `count_of_rows > 7`.
Why this matters downstream
`whiletrue-industries/rebuilding-bots` consumes `next.obudget.org/datapackages/knesset/knesset_committee_decisions/index.csv` to power one of the unified Parlibot's contexts. With the bug, the bot has been answering committee-decision questions from a 7-row, 2016-only corpus for the past several months. We just shipped a Layer-1 safety rail there (it refuses to overwrite local extraction CSVs when upstream is empty or stale, and alarms via SNS). This PR is Layer 2: the actual upstream fix.
🤖 Generated with Claude Code