Skip to content

fix: Prevent infinite loop in legacy edX course export op#2212

Merged
quazi-h merged 1 commit into
mainfrom
fix/mitxonline-courses-export-infinite-loop-v2
May 19, 2026
Merged

fix: Prevent infinite loop in legacy edX course export op#2212
quazi-h merged 1 commit into
mainfrom
fix/mitxonline-courses-export-infinite-loop-v2

Conversation

@quazi-h

@quazi-h quazi-h commented May 8, 2026

Copy link
Copy Markdown
Contributor

What are the relevant tickets?

Fixes https://github.com/mitodl/hq/issues/11104

Description (What does it do?)

The export_edx_courses op in the legacy IRx pipeline (which populates s3://mitx-etl-mitxonline-production/<date>/courses/) had two bugs that caused the courses/ folder to silently go missing since ~2026-01-21.

Bug 1 — Infinite loop in the status-polling while loop (primary cause)

When check_course_export_status() returned HTTP 404 (e.g. the django-user-tasks record was cleaned up after a retention window, or a course was deleted from Studio), the bare continue inside the except httpx.HTTPStatusError block swallowed the error without ever adding the course to failed_exports. The while-loop condition len(successful ∪ failed) < len(tasks) was therefore never satisfied, so the job hung indefinitely. The Dagster run would eventually be killed and the courses/ folder was never written to S3.

Fix: catch HTTPStatusError, log the HTTP status code and course ID for observability, and add the course to failed_exports so the loop can make progress. A 60-minute hard timeout (matching the pattern in the newer asset-based openedx code location) is also added as a safety net, and already-resolved courses are skipped to avoid redundant API calls.

Bug 2 — Premature abort on partial-failure from the export trigger API

The ol_openedx_course_export plugin returns HTTP 400 (not 200) when any course fails to queue during the initial trigger phase — even when other courses were queued successfully. The unconditional raise_for_status() in OpenEdxApiClient.export_courses() caused the method to raise immediately, aborting the entire op before any successfully-queued tarballs could be processed.

Fix: allow HTTP 400 through without raising; log the initially-failed courses as a warning via failed_uploads in the response body. All other non-2xx codes still raise as before.

How can this be tested?

  1. Run the mitxonline_edx_course_pipeline Dagster job in staging/QA.
  2. Confirm the courses/ sub-folder appears in the dated S3 prefix (e.g. s3://mitx-etl-mitxonline-qa/<date>/courses/) with the expected course tarballs.
  3. To reproduce Bug 1 locally: mock check_course_export_status to always raise httpx.HTTPStatusError with a 404 response and verify the op now terminates (previously it would loop forever).
  4. To reproduce Bug 2 locally: mock export_courses to return a 400 with a mix of upload_task_ids and failed_uploads, and verify the op continues to process the successful tasks instead of aborting.

Additional Context

  • The same timeout pattern (COURSE_EXPORT_GET_TASKS_STATUS_TIMEOUT = timedelta(minutes=60)) already exists in the newer asset-based openedx code location (dg_projects/openedx/openedx/assets/openedx.py); this PR aligns the legacy op with that approach.
  • Once merged and deployed, historical data since 2026-01-21 will need to be backfilled by manually triggering the pipeline with a date_override for each missing date.

@quazi-h quazi-h marked this pull request as ready for review May 8, 2026 19:28
Copilot AI review requested due to automatic review settings May 8, 2026 19:28

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a hang in the legacy Open edX course export op and improves resilience/observability so the daily courses/ S3 prefix is produced even when some exports fail or task-status records disappear.

Changes:

  • Allow partial-failure responses from the Open edX export trigger endpoint (HTTP 400) to be returned to callers instead of immediately raising.
  • Prevent infinite looping in the legacy export status polling by marking errored status checks as failed and skipping already-resolved courses.
  • Add a 60-minute polling timeout safety net and log initially-failed course queue attempts.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
packages/ol-orchestrate-lib/src/ol_orchestrate/resources/openedx.py Adjusts export_courses HTTP handling to permit partial-failure (400) responses from the export trigger API.
dg_projects/legacy_openedx/legacy_openedx/ops/open_edx.py Fixes the legacy op’s status-polling loop to avoid hanging forever and adds timeout + improved logging for failures.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/ol-orchestrate-lib/src/ol_orchestrate/resources/openedx.py Outdated
Comment thread dg_projects/legacy_openedx/legacy_openedx/ops/open_edx.py Outdated
Comment thread dg_projects/legacy_openedx/legacy_openedx/ops/open_edx.py Outdated
@rachellougee rachellougee self-assigned this May 8, 2026

@rachellougee rachellougee left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the comments from copilot, the code changes don't seem to address the 504 error on production https://pipelines.odl.mit.edu/runs/c1721bd6-3e37-43db-bf05-3890662dcbda, which happens on the initial POST to trigger the export:

httpx.HTTPStatusError: Server error '504 Gateway Time-out' for url 'https://studio.courses.learn.mit.edu/api/courses/v0/export/'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/504

The PP changes only fixes the infinite polling loop after the export has already been queued, but the 504 occurs earlier during the initial export trigger request itself.

Comment thread dg_projects/legacy_openedx/legacy_openedx/ops/open_edx.py Outdated
@rachellougee rachellougee removed their assignment May 19, 2026
The export_edx_courses op in the legacy IRx pipeline had two bugs that
caused the courses/ folder to silently go missing in the S3 data packages
(observed since ~2026-01-21 for mitx-etl-mitxonline-production):

1. Infinite loop in the status-polling while loop
   When check_course_export_status() raised httpx.HTTPStatusError (e.g.,
   HTTP 404 from django-user-tasks after record cleanup, or a deleted
   course), the bare 'continue' swallowed the error without adding the
   course to failed_exports. The while-loop condition
   len(successful U failed) < len(tasks) was never satisfied, so the
   job hung indefinitely and the courses/ folder was never written.

   Fix: catch HTTPStatusError, log the status code and course id, and
   add the course to failed_exports so the loop can terminate. Also add
   a 60-minute hard timeout (matching the openedx asset-based code) and
   skip courses already resolved to avoid redundant API calls.

2. Abort on partial-failure from the export trigger API
   The ol_openedx_course_export plugin returns HTTP 400 when any course
   fails to queue, even if other courses succeeded. The unconditional
   raise_for_status() call caused export_courses() to raise immediately,
   aborting the entire op before any successfully-queued tarballs could
   be copied.

   Fix: allow HTTP 400 through so the caller receives the response body
   (upload_task_ids + failed_uploads) and can still process the courses
   that were queued successfully. Log the initially-failed courses as a
   warning.

Fixes: mitodl/hq#11104

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@quazi-h quazi-h force-pushed the fix/mitxonline-courses-export-infinite-loop-v2 branch from f95c79c to 925728e Compare May 19, 2026 19:33
@quazi-h quazi-h merged commit 444e0f2 into main May 19, 2026
6 checks passed
@quazi-h quazi-h deleted the fix/mitxonline-courses-export-infinite-loop-v2 branch May 19, 2026 19:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants