feat(vector_search): embeddings deadline resilience (retry + finalize tail) by mbertrand · Pull Request #3526 · mitodl/mit-learn

mbertrand · 2026-06-25T17:46:04Z

What are the relevant tickets?

Closes mitodl/hq#11987

Description

Makes content-file embedding resilient to Qdrant gRPC DEADLINE_EXCEEDED errors that occur when many runs embed at once (e.g. backpopulate_mitxonline_files), and stops the parent task from getting stuck in STARTED.

Run every chunk, then fail loud. generate_embeddings owns its retry (jittered exponential backoff, max_retries=3). On a keyed path (a failure_key is passed), exhausted or non-transient failures are recorded to a per-invocation counter in the shared redis cache and swallowed so the chain continues; a finalize_embeddings tail then fails the parent (→ FAILURE, never stuck) if any chunk failed. Generic callers (no failure_key) propagate exactly as before.
Cut per-chunk Qdrant load. create_qdrant_collections now runs once per worker process (was once per ~10-file chunk).
Defer indexing during bulk writes. Optimizer indexing_threshold ratio 0.4 → 0.8.
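
The retry-then-record behavior described in the first item can be sketched roughly as follows. This is a minimal, framework-free illustration under stated assumptions, not the PR's code: `run_with_retry`, `backoff_delay`, `TransientError`, `failure_counts`, and the `base`/`cap` delay parameters are hypothetical names, a plain dict stands in for the shared redis cache, and the real task runs under Celery rather than calling `time.sleep` directly.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable error such as DEADLINE_EXCEEDED."""


# Hypothetical stand-in for the shared redis cache: failure_key -> failed chunk count.
failure_counts: dict = {}


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2**attempt))


def run_with_retry(work, failure_key=None, max_retries=3, sleep=time.sleep):
    """Run work(); retry transient errors, then record or propagate the failure."""
    attempt = 0
    while True:
        try:
            return work()
        except Exception as exc:
            if isinstance(exc, TransientError) and attempt < max_retries:
                sleep(backoff_delay(attempt))
                attempt += 1
                continue
            if failure_key is None:
                raise  # generic callers: propagate exactly as before
            # Keyed path: count the failed chunk and swallow so the chain continues.
            failure_counts[failure_key] = failure_counts.get(failure_key, 0) + 1
            return None
```

With `max_retries=3`, a chunk that keeps hitting a transient error is attempted four times in total (one initial call plus three retries) before it is counted as failed. Non-transient errors skip the retries and go straight to the record-or-propagate step.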

Note

A related change that tries to reduce the DEADLINE_EXCEEDED errors (a dedicated, concurrency-bounded celery-embeddings worker) will land in an ol-infrastructure PR. This PR covers in-app resilience and load reduction: it makes deadlines non-fatal but does not by itself bound write concurrency.

How can this be tested?

Reproduce the original failure and confirm the new behavior on a worker that consumes the embeddings queue:

Embed a run's content files: embed_run_content_files.delay(<run_id>) (or trigger a content-file load).
Stuck-parent fix: force a chunk to fail, e.g. by temporarily raising an exception in embed_learning_resources. Confirm the remaining chunks still run, the parent task resolves to FAILURE (not stuck in STARTED) in Flower/the result backend, and the failure shows in Sentry.
Counter cleanup: check redis has no leftover embed_errors:<task_id> key after finalize_embeddings runs.
Per-process guard: with debug logging, confirm create_qdrant_collections is called once per worker process across many chunks, not per chunk.
Happy path: a clean run finishes SUCCESS with all points embedded.
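
The stuck-parent and counter-cleanup checks above both rest on the finalize step reading the failure counter, removing it, and raising if any chunk failed. A hedged sketch of that logic, with a plain dict standing in for redis; `finalize`, `EmbeddingChunksFailed`, and `cache` are hypothetical names, and the actual finalize_embeddings task may differ:

```python
class EmbeddingChunksFailed(Exception):
    """Hypothetical error raised when one or more chunks failed."""


def finalize(cache, failure_key):
    """Pop the per-invocation failure counter and fail loudly if it is non-zero."""
    # pop() removes the key whether or not anything failed, so nothing lingers.
    failed = int(cache.pop(failure_key, 0))
    if failed:
        raise EmbeddingChunksFailed(
            f"{failed} embedding chunk(s) failed ({failure_key})"
        )
    return failed
```

Deleting the key before raising is the detail that the counter-cleanup check verifies: the key should be gone on both the failing and the clean path.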


github-actions · 2026-06-25T17:49:41Z

OpenAPI Changes

No changes detected

View full changelog

Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

Copilot

Pull request overview

This PR improves the robustness of vector embedding Celery tasks against transient Qdrant gRPC DEADLINE_EXCEEDED failures, ensuring chunked embedding chains can complete and then deterministically fail the parent task (instead of leaving it stuck in STARTED). It also reduces repeated Qdrant collection setup work and adjusts optimizer settings to defer indexing during bulk writes.

Changes:

Add retry-with-jitter behavior to generate_embeddings, with an optional per-parent failure counter and a finalize_embeddings chain tail that fails the parent if any chunk failed.
Add a per-process guard (ensure_qdrant_collections) to avoid repeatedly calling create_qdrant_collections during repeated embed operations.
Increase Qdrant optimizer indexing threshold ratio from 0.4 to 0.8 to defer indexing under heavy write load.
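
A per-process guard of the kind described in the second item can be sketched as a module-level flag checked under a lock. This is an illustration only: `ensure_collections`, `reset_collections_guard`, and the `create_collections` callback are hypothetical names, and setting the flag only after a successful call is an assumption rather than something the PR confirms.

```python
import threading

# Process-global flag; each worker process gets its own copy of this module state.
_collections_ensured = False
_guard_lock = threading.Lock()


def ensure_collections(create_collections):
    """Call create_collections() at most once per process; return True if it ran."""
    global _collections_ensured
    with _guard_lock:
        if _collections_ensured:
            return False
        create_collections()
        # Assumption: set the flag only after success so a failed setup is retried.
        _collections_ensured = True
        return True


def reset_collections_guard():
    """Clear the flag, e.g. between tests so state does not leak across them."""
    global _collections_ensured
    _collections_ensured = False
```

Because the flag lives in module state, tests that exercise the embedding path need to reset it between cases; otherwise an earlier test can mask a missing setup call in a later one.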

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file

| File | Description |
| --- | --- |
| vector_search/utils.py | Adds a per-process cached guard to ensure Qdrant collections are created only once per worker process for embedding paths. |
| vector_search/utils_test.py | Adds tests validating the per-process collection guard and that embedding uses the guard rather than direct collection creation. |
| vector_search/tasks.py | Implements jittered retries in `generate_embeddings`, records keyed failures to Redis, and adds `finalize_embeddings` to fail the parent after all chunks run. |
| vector_search/tasks_test.py | Adds coverage for retry behavior, keyed failure recording, finalize behavior, and updated chain construction expectations. |
| vector_search/constants.py | Updates Qdrant optimizer indexing threshold ratio to defer indexing during bulk writes. |
| vector_search/conftest.py | Ensures cached collection guard state is cleared between tests to avoid cross-test leakage. |

shanbady

LGTM

mbertrand and others added 8 commits June 25, 2026 10:21

feat(vector_search): add embedding failure counter + finalize_embeddings tail (dde40cf)

feat(vector_search): manual jittered retry + gated skip-on-exhaustion for generate_embeddings (bd53968)

feat(vector_search): finalize tail + failure_key for run/new content-file embedding (6153bd3)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(vector_search): assert failure_key threading on new-content-file embedding path (7fcc43d)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

perf(vector_search): ensure Qdrant collections once per process (7c76423)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(vector_search): autouse-reset Qdrant collection guard between tests (6233f5e)
Prevents the process-global _collections_ensured flag from leaking across tests when an earlier test runs embed_learning_resources.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

perf(vector_search): raise Qdrant indexing_threshold ratio to 0.8 for bulk writes (0a2e24b)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Simplification (ca00cea)

Copilot AI review requested due to automatic review settings June 25, 2026 17:46

Copilot started reviewing on behalf of mbertrand June 25, 2026 17:46 View session

Copilot AI reviewed Jun 25, 2026

type hints

63618b1

mbertrand mentioned this pull request Jun 25, 2026

Reduce mit-learn embeddings worker max replicas 30 → 3 mitodl/ol-infrastructure#4827

Closed

shanbady self-requested a review June 25, 2026 18:18

shanbady approved these changes Jun 25, 2026

shanbady assigned mbertrand Jun 25, 2026

mbertrand merged commit 6e85dd5 into main Jun 25, 2026
13 checks passed

mbertrand deleted the mb/embeddings-deadline-resilience branch June 25, 2026 19:11

This was referenced Jun 25, 2026

Release 0.71.15 #3527

Closed

Release 0.71.16 #3531

Closed

Release 0.72.1 #3535

Merged

mbertrand merged 9 commits into main from mb/embeddings-deadline-resilience
