fix: Prevent infinite loop in legacy edX course export op#2212
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes a hang in the legacy Open edX course export op and improves resilience/observability so the daily courses/ S3 prefix is produced even when some exports fail or task-status records disappear.
Changes:
- Allow partial-failure responses from the Open edX export trigger endpoint (HTTP 400) to be returned to callers instead of immediately raising.
- Prevent infinite looping in the legacy export status polling by marking errored status checks as failed and skipping already-resolved courses.
- Add a 60-minute polling timeout safety net and log initially-failed course queue attempts.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/ol-orchestrate-lib/src/ol_orchestrate/resources/openedx.py | Adjusts export_courses HTTP handling to permit partial-failure (400) responses from the export trigger API. |
| dg_projects/legacy_openedx/legacy_openedx/ops/open_edx.py | Fixes the legacy op’s status-polling loop to avoid hanging forever and adds timeout + improved logging for failures. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Besides the comments from copilot, the code changes don't seem to address the 504 error on production https://pipelines.odl.mit.edu/runs/c1721bd6-3e37-43db-bf05-3890662dcbda, which happens on the initial POST to trigger the export:
httpx.HTTPStatusError: Server error '504 Gateway Time-out' for url 'https://studio.courses.learn.mit.edu/api/courses/v0/export/'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/504
The PP changes only fixes the infinite polling loop after the export has already been queued, but the 504 occurs earlier during the initial export trigger request itself.
The export_edx_courses op in the legacy IRx pipeline had two bugs that caused the courses/ folder to silently go missing in the S3 data packages (observed since ~2026-01-21 for mitx-etl-mitxonline-production): 1. Infinite loop in the status-polling while loop When check_course_export_status() raised httpx.HTTPStatusError (e.g., HTTP 404 from django-user-tasks after record cleanup, or a deleted course), the bare 'continue' swallowed the error without adding the course to failed_exports. The while-loop condition len(successful U failed) < len(tasks) was never satisfied, so the job hung indefinitely and the courses/ folder was never written. Fix: catch HTTPStatusError, log the status code and course id, and add the course to failed_exports so the loop can terminate. Also add a 60-minute hard timeout (matching the openedx asset-based code) and skip courses already resolved to avoid redundant API calls. 2. Abort on partial-failure from the export trigger API The ol_openedx_course_export plugin returns HTTP 400 when any course fails to queue, even if other courses succeeded. The unconditional raise_for_status() call caused export_courses() to raise immediately, aborting the entire op before any successfully-queued tarballs could be copied. Fix: allow HTTP 400 through so the caller receives the response body (upload_task_ids + failed_uploads) and can still process the courses that were queued successfully. Log the initially-failed courses as a warning. Fixes: mitodl/hq#11104 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f95c79c to
925728e
Compare
What are the relevant tickets?
Fixes https://github.com/mitodl/hq/issues/11104
Description (What does it do?)
The
export_edx_coursesop in the legacy IRx pipeline (which populatess3://mitx-etl-mitxonline-production/<date>/courses/) had two bugs that caused thecourses/folder to silently go missing since ~2026-01-21.Bug 1 — Infinite loop in the status-polling while loop (primary cause)
When
check_course_export_status()returned HTTP 404 (e.g. thedjango-user-tasksrecord was cleaned up after a retention window, or a course was deleted from Studio), the barecontinueinside theexcept httpx.HTTPStatusErrorblock swallowed the error without ever adding the course tofailed_exports. The while-loop conditionlen(successful ∪ failed) < len(tasks)was therefore never satisfied, so the job hung indefinitely. The Dagster run would eventually be killed and thecourses/folder was never written to S3.Fix: catch
HTTPStatusError, log the HTTP status code and course ID for observability, and add the course tofailed_exportsso the loop can make progress. A 60-minute hard timeout (matching the pattern in the newer asset-basedopenedxcode location) is also added as a safety net, and already-resolved courses are skipped to avoid redundant API calls.Bug 2 — Premature abort on partial-failure from the export trigger API
The
ol_openedx_course_exportplugin returns HTTP 400 (not 200) when any course fails to queue during the initial trigger phase — even when other courses were queued successfully. The unconditionalraise_for_status()inOpenEdxApiClient.export_courses()caused the method to raise immediately, aborting the entire op before any successfully-queued tarballs could be processed.Fix: allow HTTP 400 through without raising; log the initially-failed courses as a warning via
failed_uploadsin the response body. All other non-2xx codes still raise as before.How can this be tested?
mitxonline_edx_course_pipelineDagster job in staging/QA.courses/sub-folder appears in the dated S3 prefix (e.g.s3://mitx-etl-mitxonline-qa/<date>/courses/) with the expected course tarballs.check_course_export_statusto always raisehttpx.HTTPStatusErrorwith a 404 response and verify the op now terminates (previously it would loop forever).export_coursesto return a 400 with a mix ofupload_task_idsandfailed_uploads, and verify the op continues to process the successful tasks instead of aborting.Additional Context
COURSE_EXPORT_GET_TASKS_STATUS_TIMEOUT = timedelta(minutes=60)) already exists in the newer asset-basedopenedxcode location (dg_projects/openedx/openedx/assets/openedx.py); this PR aligns the legacy op with that approach.date_overridefor each missing date.