Retry embeddings tasks on Qdrant gRPC DEADLINE_EXCEEDED#3520
Merged
Conversation
…dly on non-retry errors
OpenAPI ChangesNo changes detected Unexpected changes? Ensure your branch is up-to-date with |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes silent “successful” completion of Qdrant embedding Celery tasks by ensuring gRPC timeouts (DEADLINE_EXCEEDED) are converted into RetryError (so Celery autoretry engages) and by letting other exceptions propagate (so tasks correctly record FAILURE instead of returning an error string).
Changes:
- Add explicit
grpc.RpcErrorhandling ingenerate_embeddingsandremove_embeddingsto retry only onDEADLINE_EXCEEDED. - Remove bare
except: ... return errorbehavior so failures aren’t swallowed and misreported as success. - Add unit tests covering retry-on-deadline and “don’t swallow exceptions” behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
vector_search/tasks.py |
Convert gRPC DEADLINE_EXCEEDED into RetryError for Celery autoretry; stop swallowing exceptions for embedding tasks. |
vector_search/tasks_test.py |
Add tests validating retry behavior and regression coverage for propagating errors. |
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
shanbady
approved these changes
Jun 24, 2026
shanbady
left a comment
Contributor
There was a problem hiding this comment.
looks good. seems like any general error will still get caught via sentry
This was referenced Jun 24, 2026
Closed
Closed
Closed
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What are the relevant tickets?
mitodl/hq#11987
Description (What does it do?)
The
generate_embeddingsandremove_embeddingsCelery tasks talk to Qdrant over gRPC (prefer_grpc=True). A Qdrant timeout surfaces as agrpc.RpcErrorwith codeDEADLINE_EXCEEDED, which was being caught by a bareexceptthat logged it and returned an error string — so Celery marked the task SUCCESS, the embeddings were silently never written, and nothing retried (the data loss seen in Sentry).This change:
DEADLINE_EXCEEDEDgRPC error intoRetryError, so the existingautoretry_for=(RetryError,)+retry_backoffretries it. Other gRPC codes re-raise.except ... return errorfrom both tasks. Unhandled exceptions now propagate, so the task records FAILURE (still captured by Sentry) instead of a false success.How can this be tested?
Unit tests (handler + propagation for both tasks):
Manual — confirm a real Qdrant timeout is the exception the handler keys on (what the mocked tests can't prove). In
manage.py shell, force a sub-millisecond gRPC deadline against a populated collection:That
(grpc.RpcError, code() == DEADLINE_EXCEEDED)is exactly what the new handler matches before raisingRetryError.End-to-end (retry/backoff in worker logs): set
QDRANT_CLIENT_TIMEOUT=0.001and enqueue a bulk embed against a loaded/remote Qdrant — the worker logs showRetry ... in Ns(backoff) instead of a SUCCESS-with-error-string. Note: against a local single-node Qdrant, per-resource calls finish in under a millisecond and won't reliably trip the deadline; only heavy reads (like the scroll above) do.