feat(vector_search): embeddings deadline resilience (retry + finalize tail) #3526
Merged
… for generate_embeddings
…file embedding Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… embedding path Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prevents the process-global _collections_ensured flag from leaking across tests when an earlier test runs embed_learning_resources. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… bulk writes Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
OpenAPI Changes
No changes detected. Unexpected changes? Ensure your branch is up-to-date with
Pull request overview
This PR improves the robustness of vector embedding Celery tasks against transient Qdrant gRPC DEADLINE_EXCEEDED failures, ensuring chunked embedding chains can complete and then deterministically fail the parent task (instead of leaving it stuck in STARTED). It also reduces repeated Qdrant collection setup work and adjusts optimizer settings to defer indexing during bulk writes.
Changes:
- Add retry-with-jitter behavior to `generate_embeddings`, with an optional per-parent failure counter and a `finalize_embeddings` chain tail that fails the parent if any chunk failed.
- Add a per-process guard (`ensure_qdrant_collections`) to avoid repeatedly calling `create_qdrant_collections` during repeated embed operations.
- Increase Qdrant optimizer indexing threshold ratio from `0.4` to `0.8` to defer indexing under heavy write load.
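The retry-with-jitter behavior in the first change can be illustrated with a minimal, framework-free sketch. This is not the PR's code: the task presumably relies on Celery's own retry machinery, whereas `backoff_delay`, `call_with_retries`, and the `base` and `cap` values below are hypothetical names and numbers chosen for illustration. Only `max_retries=3` comes from the PR description.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(retries: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff.

    Returns a delay drawn uniformly from [0, min(cap, base * 2**retries)].
    `base` and `cap` are illustrative assumptions, not values from the PR.
    """
    ceiling = min(cap, base * (2 ** retries))
    return random.uniform(0.0, ceiling)


def call_with_retries(
    func: Callable[[], T],
    is_transient: Callable[[Exception], bool],
    max_retries: int = 3,  # matches the max_retries=3 stated in the PR description
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying transient errors up to `max_retries` times."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception as exc:
            # Non-transient errors and exhausted retries propagate to the caller.
            if not is_transient(exc) or attempt >= max_retries:
                raise
            sleep(backoff_delay(attempt))
            attempt += 1
```

Randomizing the delay spreads out retries from many chunks that hit a deadline at the same moment, so they are less likely to hit the store again in lockstep.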
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| vector_search/utils.py | Adds a per-process cached guard to ensure Qdrant collections are created only once per worker process for embedding paths. |
| vector_search/utils_test.py | Adds tests validating the per-process collection guard and that embedding uses the guard rather than direct collection creation. |
| vector_search/tasks.py | Implements jittered retries in generate_embeddings, records keyed failures to Redis, and adds finalize_embeddings to fail the parent after all chunks run. |
| vector_search/tasks_test.py | Adds coverage for retry behavior, keyed failure recording, finalize behavior, and updated chain construction expectations. |
| vector_search/constants.py | Updates Qdrant optimizer indexing threshold ratio to defer indexing during bulk writes. |
| vector_search/conftest.py | Ensures cached collection guard state is cleared between tests to avoid cross-test leakage. |
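The per-process guard and its test reset (the `utils.py` and `conftest.py` rows above) might look roughly like the sketch below. The flag name `_collections_ensured` appears in one of the commit messages. Everything else is an assumption for illustration: the lock, the `ensure_collections` signature that takes the creation function as an argument, and the `reset_collections_guard` helper.

```python
import threading
from typing import Callable

# Process-global flag: each worker process has its own copy.
_collections_ensured = False
_ensure_lock = threading.Lock()


def ensure_collections(create: Callable[[], None]) -> bool:
    """Run `create` at most once per process.

    Returns True if `create` ran during this call, False if it was skipped.
    If `create` raises, the flag stays unset so a later call tries again.
    """
    global _collections_ensured
    if _collections_ensured:  # fast path, no locking once ensured
        return False
    with _ensure_lock:
        if _collections_ensured:  # re-check: another thread may have won
            return False
        create()
        _collections_ensured = True
        return True


def reset_collections_guard() -> None:
    """Clear the flag; intended for test setup so state does not leak across tests."""
    global _collections_ensured
    with _ensure_lock:
        _collections_ensured = False
```

In a test suite, a fixture that calls a reset like this before each test keeps an earlier test's embed call from hiding the collection-creation call that a later test expects.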
This was referenced Jun 25, 2026
Closed
Closed
Merged
What are the relevant tickets?
Closes mitodl/hq#11987
Description
Makes content-file embedding resilient to Qdrant gRPC `DEADLINE_EXCEEDED` errors that occur when many runs embed at once (e.g. `backpopulate_mitxonline_files`) and stops the parent task from getting stuck in STARTED.
- `generate_embeddings` owns its retry (jittered exponential backoff, `max_retries=3`). On a keyed path, exhausted/non-transient failures are recorded to a per-invocation counter in the shared `redis` cache and swallowed so the chain continues; a `finalize_embeddings` tail then fails the parent (→ FAILURE, never stuck) if any chunk failed. Generic callers (no `failure_key`) propagate exactly as before.
- `create_qdrant_collections` now runs once per worker process (was once per ~10-file chunk).
- `indexing_threshold` ratio 0.4 → 0.8.
Note
A related PR to try to reduce the `DEADLINE_EXCEEDED` errors (a dedicated, concurrency-bounded `celery-embeddings` worker) will be in an ol-infrastructure PR. This PR is the in-app resilience + load reduction; it makes deadlines non-fatal but does not by itself bound write concurrency.
How can this be tested?
Reproduce the original failure and confirm the new behavior on a worker with the `embeddings` queue:
1. Run `embed_run_content_files.delay(<run_id>)` (or trigger a content-file load).
2. Add a `raise` in `embed_learning_resources`. Confirm the remaining chunks still run, the parent task resolves to FAILURE (not stuck in STARTED) in Flower/the result backend, and the failure shows in Sentry.
3. `redis` has no leftover `embed_errors:<task_id>` key after `finalize_embeddings` runs.
4. `create_qdrant_collections` is called once per worker process across many chunks, not per chunk.
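The keyed failure path that these steps exercise can be sketched without Celery or Redis. This is an illustration under stated assumptions, not the PR's implementation: `InMemoryCounter` stands in for the shared `redis` cache, and `run_chunk`, `finalize`, and `ChunkFailures` are hypothetical names. The `embed_errors:<task_id>` key shape and the "record, swallow, then fail in the tail" behavior come from the description.

```python
from typing import Callable, Dict, Optional


class ChunkFailures(Exception):
    """Raised by the tail step when at least one chunk failed."""


class InMemoryCounter:
    """Hypothetical stand-in for the shared redis cache used by the PR."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]

    def pop(self, key: str) -> int:
        # Read and delete in one step so no key is left behind.
        return self._counts.pop(key, 0)

    def has(self, key: str) -> bool:
        return key in self._counts


def run_chunk(
    chunk: str,
    embed: Callable[[str], None],
    counter: InMemoryCounter,
    failure_key: Optional[str] = None,
) -> None:
    """Embed one chunk; on a keyed path, record the failure and keep going."""
    try:
        embed(chunk)
    except Exception:
        if failure_key is None:
            raise  # generic callers: propagate exactly as before
        counter.incr(failure_key)  # keyed path: record and swallow


def finalize(counter: InMemoryCounter, failure_key: str) -> None:
    """Chain tail: fail once, after every chunk has had its chance to run."""
    failed = counter.pop(failure_key)
    if failed:
        raise ChunkFailures(f"{failed} chunk(s) failed for {failure_key}")
```

Running three chunks where the second one raises would still process all three, after which `finalize` raises `ChunkFailures` and the counter no longer holds the key. That corresponds to the "remaining chunks still run", "parent resolves to FAILURE", and "no leftover key" checks above.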