AP-664: LDC provider and corpus file fetcher DAG#36

Merged: anarchivist merged 9 commits into main from AP-664 on May 15, 2026

anarchivist (Member) commented Apr 15, 2026

The Linguistic Data Consortium (LDC) catalog provides downloads of linguistic corpora. Because LDC does not provide an API, the LDC hook creates a session that temporarily persists session cookies, then fetches HTML pages and parses them into structured data.
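
For readers unfamiliar with the pattern, here is a minimal standard-library sketch of a cookie-persisting session plus HTML-to-structured-data parsing. It is not the code in this PR: `make_session`, `DownloadLinkParser`, `parse_download_links`, and the `download` CSS class are hypothetical names chosen for illustration, and LDC's real markup will differ.

```python
from __future__ import annotations

from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, OpenerDirector, build_opener


def make_session() -> OpenerDirector:
    """Return an opener that keeps cookies in memory for its lifetime."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


class DownloadLinkParser(HTMLParser):
    """Collect href and link text for anchors carrying a marker class.

    The default "download" class is a placeholder, not LDC's real markup.
    """

    def __init__(self, marker_class: str = "download") -> None:
        super().__init__()
        self.marker_class = marker_class
        self.links: list[dict[str, str]] = []
        self._href: str | None = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and self.marker_class in classes and attr_map.get("href"):
            # Start capturing the text of a matching anchor.
            self._href = attr_map["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.links.append({"href": self._href, "name": name})
            self._href = None


def parse_download_links(html: str) -> list[dict[str, str]]:
    """Parse an HTML page into a list of {"href", "name"} records."""
    parser = DownloadLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

In this sketch the cookie jar lives only as long as the opener object, which matches the "temporarily persist" wording above; nothing is written to disk.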

  • Adds a new Airflow LDC provider under mokelumne/providers/ldc, with provider.yaml and get_provider_info.py.
  • Implements LDCHook for authenticated LDC catalog access, including session creation, refresh handling, corpora page retrieval, and download response streaming.
  • Adds LDC helper methods to mokelumne.util.ldc.
  • Implements a DAG to fetch files from LDC corpora.
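
As a rough illustration of what "download response streaming" can look like on the consuming side, the sketch below copies a file-like response to disk in fixed-size chunks so a large corpus archive never has to fit in memory. `save_stream` and the `.part` suffix are assumptions made for this example, not names taken from the PR.

```python
from __future__ import annotations

import io
import tempfile
from pathlib import Path
from typing import BinaryIO


def save_stream(source: BinaryIO, dest: Path, chunk_size: int = 64 * 1024) -> int:
    """Copy a readable binary stream to dest in chunks; return bytes written.

    Data is written to a sibling ".part" file first and renamed at the end,
    so an interrupted download does not leave a truncated file at dest.
    """
    written = 0
    partial = dest.with_name(dest.name + ".part")
    with partial.open("wb") as out:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:  # empty bytes means end of stream
                break
            out.write(chunk)
            written += len(chunk)
    partial.replace(dest)
    return written
```

Any object with a binary `read(n)` method works as `source`; in a test, `io.BytesIO(b"...")` stands in for a real HTTP response.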

anarchivist force-pushed the AP-664 branch 5 times, most recently from fed9866 to 386e98b (April 21, 2026 23:18)
anarchivist force-pushed the AP-664 branch 2 times, most recently from aa02404 to e725c05 (April 29, 2026 03:07)
anarchivist force-pushed the AP-664 branch 8 times, most recently from 0ee258c to 571b753 (May 14, 2026 01:15)
anarchivist force-pushed the AP-664 branch 6 times, most recently from 4c85e8a to 5033968 (May 15, 2026 07:00)
anarchivist marked this pull request as ready for review (May 15, 2026 07:02)
anarchivist changed the title from "AP-664: ldc corpus fetcher [WIP]" to "AP-664: LDC provider and corpus file fetcher DAG" (May 15, 2026)

jason-raitz (Contributor) left a comment

r+ with nits and comment discussions.
Most of the nits I have are pylint things. Sorry for the number.

Comment thread mokelumne/util/ldc.py
Comment thread mokelumne/dags/fetch_ldc_corpus_files.py Outdated
Comment thread mokelumne/dags/fetch_ldc_corpus_files.py
Comment thread mokelumne/dags/fetch_ldc_corpus_files.py
Comment thread mokelumne/providers/ldc/hooks/ldc.py
Comment thread test/unit/test_ldc_hook.py Outdated
Comment thread test/unit/test_ldc.py Outdated
Comment thread test/unit/test_ldc.py
Comment thread test/unit/test_ldc.py Outdated
Comment thread mokelumne/providers/ldc/get_provider_info.py
anarchivist requested a review from jason-raitz (May 15, 2026 17:29)
Comment thread mokelumne/providers/ldc/hooks/ldc.py Outdated

jason-raitz (Contributor) left a comment

r+ I think that covers it. Cool stuff!

anarchivist merged commit 8242824 into main (May 15, 2026); 5 checks passed
anarchivist deleted the AP-664 branch (May 15, 2026 18:41)