
feat(vector_search): embeddings deadline resilience (retry + finalize tail) #3526

Merged
mbertrand merged 9 commits into main from mb/embeddings-deadline-resilience
Jun 25, 2026


@mbertrand mbertrand commented Jun 25, 2026


What are the relevant tickets?

Closes mitodl/hq#11987

Description

Makes content-file embedding resilient to Qdrant gRPC DEADLINE_EXCEEDED errors that occur when many runs embed at once (e.g. backpopulate_mitxonline_files), and stops the parent task from getting stuck in STARTED.

  • Run every chunk, then fail loud. generate_embeddings owns its retry (jittered exponential backoff, max_retries=3). On a keyed path (a failure_key is passed), exhausted or non-transient failures are recorded in a per-invocation counter in the shared redis cache and swallowed so the chain continues; a finalize_embeddings tail then fails the parent (→ FAILURE, never stuck) if any chunk failed. Generic callers (no failure_key) propagate errors exactly as before.
  • Cut per-chunk Qdrant load. create_qdrant_collections now runs once per worker process (was once per ~10-file chunk).
  • Defer indexing during bulk writes. Optimizer indexing_threshold ratio 0.4 → 0.8.
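For readers unfamiliar with the retry shape named in the first bullet, here is a minimal plain-Python sketch of jittered exponential backoff. It is an illustration only, not the code in this PR: `TransientError`, `backoff_delay`, `call_with_retries`, and the `base`/`cap` values are hypothetical names and numbers, and the real task presumably goes through Celery's own retry machinery rather than a loop like this.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable error such as DEADLINE_EXCEEDED."""


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(func, max_retries=3, sleep=time.sleep, rng=random):
    """Call func, retrying TransientError up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: the caller decides what happens next
            # Randomised delay so concurrent workers do not retry in lockstep.
            sleep(backoff_delay(attempt, rng=rng))
```

The jitter is the important part when many runs embed at once: without it, every chunk that hit the same deadline would retry at the same moment and recreate the load spike.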

Note

A related ol-infrastructure PR will try to reduce the DEADLINE_EXCEEDED errors themselves by adding a dedicated, concurrency-bounded celery-embeddings worker. This PR covers in-app resilience and load reduction: it makes deadlines non-fatal but does not by itself bound write concurrency.

How can this be tested?

Reproduce the original failure and confirm the new behavior on a worker that consumes the embeddings queue:

  1. Embed a run's content files: embed_run_content_files.delay(<run_id>) (or trigger a content-file load).
  2. Stuck-parent fix: force a chunk to fail by temporarily raising an exception in embed_learning_resources. Confirm the remaining chunks still run, the parent task resolves to FAILURE (not stuck in STARTED) in Flower/the result backend, and the failure shows in Sentry.
  3. Counter cleanup: check redis has no leftover embed_errors:<task_id> key after finalize_embeddings runs.
  4. Per-process guard: with debug logging, confirm create_qdrant_collections is called once per worker process across many chunks, not per chunk.
  5. Happy path: a clean run finishes SUCCESS with all points embedded.
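The behavior that steps 2 and 3 check can also be reasoned about without a worker. The sketch below simulates "run every chunk, then fail loud" with a plain dict standing in for the shared redis cache. Everything here is hypothetical (`ChunkFailures`, `run_chunk`, `finalize`, `run_all`, and the `embed_errors:demo-task` key value); the real tasks run as a Celery chain, and a real counter would need an atomic increment rather than a read-modify-write.

```python
class ChunkFailures(Exception):
    """Raised by the finalize step when any chunk failed."""


def run_chunk(chunk, embed, cache, failure_key=None):
    """Embed one chunk; on a keyed path, count the failure and continue."""
    try:
        embed(chunk)
    except Exception:
        if failure_key is None:
            raise  # generic callers see the error exactly as before
        # Dict stand-in for the cache; not atomic, unlike a redis increment.
        cache[failure_key] = cache.get(failure_key, 0) + 1


def finalize(cache, failure_key):
    """Read and delete the counter, then fail loudly if it is non-zero."""
    failures = cache.pop(failure_key, 0)
    if failures:
        raise ChunkFailures(f"{failures} chunk(s) failed")


def run_all(chunks, embed, cache, failure_key):
    """Run every chunk in order, then let finalize decide the outcome."""
    for chunk in chunks:
        run_chunk(chunk, embed, cache, failure_key)
    finalize(cache, failure_key)
```

With one failing chunk out of three, `run_all` still embeds the other two, raises `ChunkFailures` at the end, and leaves no counter key behind, which mirrors what steps 2 and 3 look for in Flower and redis.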

mbertrand and others added 8 commits June 25, 2026 10:21
…file embedding

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… embedding path

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Prevents the process-global _collections_ensured flag from leaking across
tests when an earlier test runs embed_learning_resources.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… bulk writes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 25, 2026 17:46
github-actions Bot commented Jun 25, 2026


OpenAPI Changes

No changes detected


Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

Copilot AI left a comment


Pull request overview

This PR improves the robustness of vector embedding Celery tasks against transient Qdrant gRPC DEADLINE_EXCEEDED failures, ensuring chunked embedding chains can complete and then deterministically fail the parent task (instead of leaving it stuck in STARTED). It also reduces repeated Qdrant collection setup work and adjusts optimizer settings to defer indexing during bulk writes.

Changes:

  • Add retry-with-jitter behavior to generate_embeddings, with an optional per-parent failure counter and a finalize_embeddings chain tail that fails the parent if any chunk failed.
  • Add a per-process guard (ensure_qdrant_collections) to avoid repeatedly calling create_qdrant_collections during repeated embed operations.
  • Increase Qdrant optimizer indexing threshold ratio from 0.4 to 0.8 to defer indexing under heavy write load.
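The per-process guard in the second bullet can be pictured as a module-level flag around the expensive call. This is a sketch under assumptions, not the contents of vector_search/utils.py: the `_collections_ensured` name comes from the commit message above, while `ensure_collections`, `reset_collections_guard`, the lock, and the `create` callback are hypothetical.

```python
import threading

_collections_ensured = False  # process-global: each worker process has its own copy
_guard_lock = threading.Lock()


def ensure_collections(create):
    """Run create() at most once per process; return True only when it ran."""
    global _collections_ensured
    if _collections_ensured:
        return False  # fast path: already done in this process
    with _guard_lock:
        if _collections_ensured:
            return False
        create()
        # Set the flag only after create() succeeds, so a failed
        # attempt is retried on the next call instead of being skipped.
        _collections_ensured = True
        return True


def reset_collections_guard():
    """Clear the flag, e.g. from a test fixture, so state does not leak."""
    global _collections_ensured
    _collections_ensured = False
```

Because the flag lives in module state, it survives across tasks within one worker process, and tests that exercise the embedding path need a reset hook like the one above to stay independent.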

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| vector_search/utils.py | Adds a per-process cached guard to ensure Qdrant collections are created only once per worker process for embedding paths. |
| vector_search/utils_test.py | Adds tests validating the per-process collection guard and that embedding uses the guard rather than direct collection creation. |
| vector_search/tasks.py | Implements jittered retries in generate_embeddings, records keyed failures to Redis, and adds finalize_embeddings to fail the parent after all chunks run. |
| vector_search/tasks_test.py | Adds coverage for retry behavior, keyed failure recording, finalize behavior, and updated chain construction expectations. |
| vector_search/constants.py | Updates Qdrant optimizer indexing threshold ratio to defer indexing during bulk writes. |
| vector_search/conftest.py | Ensures cached collection guard state is cleared between tests to avoid cross-test leakage. |

@shanbady shanbady left a comment


LGTM

@mbertrand mbertrand merged commit 6e85dd5 into main Jun 25, 2026
13 checks passed
@mbertrand mbertrand deleted the mb/embeddings-deadline-resilience branch June 25, 2026 19:11
This was referenced Jun 25, 2026