[codex] fix celery worker stability and workflow tracking#1473

Merged
earayu merged 1 commit into main from codex/fix-celery-worker-stability on Apr 20, 2026
@earayu (Collaborator) commented Apr 20, 2026

What Changed

  • disable LiteLLM async logging callbacks during Celery worker initialization to avoid cross-event-loop crashes in worker processes
  • follow the real child chord workflow when reporting Celery workflow status instead of reporting success as soon as fan-out is dispatched
  • treat skipped index tasks as skipped in workflow aggregation instead of misclassifying them as failures
  • initialize the evaluation system API key state safely when no key exists yet
  • remove a stale Celery task route that points at a task name no longer present in the repo
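The second item, following the child chord rather than the outer trigger task, can be illustrated with a minimal sketch. This is not the code from this PR: `ResultNode`, `terminal_result`, and `workflow_state` are hypothetical names, and the rule of descending into the most recently dispatched child is an assumption made for illustration. The `state` and `children` fields only loosely mirror the attributes of the same names on a Celery `AsyncResult`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResultNode:
    """Hypothetical stand-in for a task result that may have dispatched children."""
    task_id: str
    state: str
    children: List["ResultNode"] = field(default_factory=list)


def terminal_result(node: ResultNode) -> ResultNode:
    """Descend through dispatched children until reaching a node with none.

    Assumption: the last child is the fan-out (chord) whose result matters.
    """
    while node.children:
        node = node.children[-1]
    return node


def workflow_state(root: ResultNode) -> str:
    """Report the state of the terminal node, not of the trigger task."""
    return terminal_result(root).state


# The trigger task finished as soon as it dispatched the chord,
# but the chord itself is still running.
chord = ResultNode("chord-1", "PENDING")
trigger = ResultNode("trigger-1", "SUCCESS", children=[chord])
print(trigger.state)            # state of the outer trigger task
print(workflow_state(trigger))  # state of the work that was fanned out
```

Reading `trigger.state` alone would report success at dispatch time, which is the behavior the description says was replaced.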

Why

The Celery review found both correctness issues and a likely worker-stability issue.

On the live cluster, the most recent celeryworker restart exited with code 133, and the logs from the previous worker showed LiteLLM "Queue ... is bound to a different event loop" errors before the process terminated. The worker currently runs Celery tasks in threads, while some task paths create fresh event loops via asyncio.run() or manual loop wrappers. LiteLLM's process-global async logging worker is not safe under that pattern, so this change disables that async callback path inside Celery worker processes.

Separately, workflow status tracking was reporting the outer trigger task rather than the terminal chord result, and skipped tasks could be misreported as failures.
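To make the skipped-versus-failed distinction concrete, here is a minimal aggregation sketch. The status strings (`success`, `failed`, `skipped`) and the `summarize_workflow` helper are assumptions for illustration, not the names used in the repository.

```python
from collections import Counter
from typing import Dict, Iterable, Union


def summarize_workflow(task_statuses: Iterable[str]) -> Dict[str, Union[str, int]]:
    """Aggregate per-task statuses, counting skipped tasks separately from failures."""
    counts = Counter(status.lower() for status in task_statuses)
    succeeded = counts.get("success", 0)
    failed = counts.get("failed", 0)
    skipped = counts.get("skipped", 0)
    # Anything that is not in a known final state is treated as still in flight.
    unfinished = sum(counts.values()) - succeeded - failed - skipped

    if unfinished:
        overall = "running"
    elif failed:
        overall = "failed"
    else:
        overall = "success"  # skipped tasks do not turn the workflow into a failure

    return {
        "status": overall,
        "succeeded": succeeded,
        "failed": failed,
        "skipped": skipped,
        "unfinished": unfinished,
    }


print(summarize_workflow(["success", "skipped", "success"]))
```

Keeping `skipped` as its own bucket lets the summary show that some index tasks did no work while the overall status still reflects only real failures.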

Impact

  • reduces the chance of Celery worker crashes caused by LiteLLM async logging worker loop affinity issues
  • makes workflow status and workflow result summaries match real execution much more closely
  • fixes a latent evaluation API key bootstrap bug
  • removes stale config that could mislead future Celery routing/debugging work

Validation

  • python3 -m py_compile config/celery.py config/celery_tasks.py aperag/tasks/scheduler.py aperag/service/evaluation_service.py aperag/llm/litellm_logging.py
  • inspected live Kubernetes pod state and previous celeryworker logs to confirm the restart signature and LiteLLM event-loop errors

@apecloud-bot added the size/L label (Denotes a PR that changes 100-499 lines.) on Apr 20, 2026

@apecloud-bot (Collaborator) commented:

This branch name is not following the standards: feature/|bugfix/|release/|hotfix/|support/|releasing/|dependabot/

@earayu merged commit 9225c9e into main on Apr 20, 2026
9 of 10 checks passed
@earayu deleted the codex/fix-celery-worker-stability branch on April 20, 2026 05:28