Skip to content

Retry embeddings tasks on Qdrant gRPC DEADLINE_EXCEEDED#3520

Merged
mbertrand merged 2 commits into
mainfrom
mb/retry_or_err_deadline_exceeded
Jun 24, 2026
Merged

Retry embeddings tasks on Qdrant gRPC DEADLINE_EXCEEDED#3520
mbertrand merged 2 commits into
mainfrom
mb/retry_or_err_deadline_exceeded

Conversation

@mbertrand

@mbertrand mbertrand commented Jun 24, 2026

Copy link
Copy Markdown
Member

What are the relevant tickets?

mitodl/hq#11987

Description (What does it do?)

The generate_embeddings and remove_embeddings Celery tasks talk to Qdrant over gRPC (prefer_grpc=True). A Qdrant timeout surfaces as a grpc.RpcError with code DEADLINE_EXCEEDED, which was being caught by a bare except that logged it and returned an error string — so Celery marked the task SUCCESS, the embeddings were silently never written, and nothing retried (the data loss seen in Sentry).

This change:

  • Converts a DEADLINE_EXCEEDED gRPC error into RetryError, so the existing autoretry_for=(RetryError,) + retry_backoff retries it. Other gRPC codes re-raise.
  • Removes the bare except ... return error from both tasks. Unhandled exceptions now propagate, so the task records FAILURE (still captured by Sentry) instead of a false success.

How can this be tested?

Unit tests (handler + propagation for both tasks):

docker compose run --rm web uv run pytest vector_search/tasks_test.py -k embeddings

Manual — confirm a real Qdrant timeout is the exception the handler keys on (what the mocked tests can't prove). In manage.py shell, force a sub-millisecond gRPC deadline against a populated collection:

import grpc
from django.conf import settings
from qdrant_client import QdrantClient
c = QdrantClient(url=settings.QDRANT_HOST, api_key=settings.QDRANT_API_KEY,
                 grpc_port=6334, prefer_grpc=True, timeout=0.001)
try:
    c.scroll("resource_embeddings.content_files", limit=20000,
             with_payload=True, with_vectors=True)
except grpc.RpcError as e:
    print(e.code(), e.code() == grpc.StatusCode.DEADLINE_EXCEEDED)
# -> StatusCode.DEADLINE_EXCEEDED True

That (grpc.RpcError, code() == DEADLINE_EXCEEDED) is exactly what the new handler matches before raising RetryError.

End-to-end (retry/backoff in worker logs): set QDRANT_CLIENT_TIMEOUT=0.001 and enqueue a bulk embed against a loaded/remote Qdrant — the worker logs show Retry ... in Ns (backoff) instead of a SUCCESS-with-error-string. Note: against a local single-node Qdrant, per-resource calls finish in under a millisecond and won't reliably trip the deadline; only heavy reads (like the scroll above) do.

Copilot AI review requested due to automatic review settings June 24, 2026 18:47
@github-actions

github-actions Bot commented Jun 24, 2026

Copy link
Copy Markdown

OpenAPI Changes

No changes detected

View full changelog

Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes silent “successful” completion of Qdrant embedding Celery tasks by ensuring gRPC timeouts (DEADLINE_EXCEEDED) are converted into RetryError (so Celery autoretry engages) and by letting other exceptions propagate (so tasks correctly record FAILURE instead of returning an error string).

Changes:

  • Add explicit grpc.RpcError handling in generate_embeddings and remove_embeddings to retry only on DEADLINE_EXCEEDED.
  • Remove bare except: ... return error behavior so failures aren’t swallowed and misreported as success.
  • Add unit tests covering retry-on-deadline and “don’t swallow exceptions” behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
vector_search/tasks.py Convert gRPC DEADLINE_EXCEEDED into RetryError for Celery autoretry; stop swallowing exceptions for embedding tasks.
vector_search/tasks_test.py Add tests validating retry behavior and regression coverage for propagating errors.

Comment thread vector_search/tasks_test.py
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mbertrand mbertrand added the Needs Review An open Pull Request that is ready for review label Jun 24, 2026
@shanbady shanbady self-requested a review June 24, 2026 18:55

@shanbady shanbady left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. seems like any general error will still get caught via sentry

@shanbady shanbady added Waiting on author and removed Needs Review An open Pull Request that is ready for review labels Jun 24, 2026
@mbertrand mbertrand merged commit e7e2016 into main Jun 24, 2026
13 checks passed
@mbertrand mbertrand deleted the mb/retry_or_err_deadline_exceeded branch June 24, 2026 19:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants