fix: handle agent-bridge worker restart gracefully (no more 'unknown run' UI errors)#989

Open
paulocavallari wants to merge 1 commit into
EKKOLearnAI:mainfrom
paulocavallari:fix/bridge-unknown-run-recovery

@paulocavallari
Contributor

Problem

When the agent-bridge Python worker restarts (deploy via systemctl restart, OOM kill, crash recovery), all in-flight runs are lost because run state is kept only in memory in hermes_bridge.py:

self._runs: dict[str, _RunRecord] = {}  # in-process only

The next get_output poll returns KeyError: unknown run: <id>, which leaks straight to the user as:

Error: 'unknown run: 4c8bcf97879348b6bf41e164aa190283'

This freezes the session with no useful feedback: the user's message disappears from the view but remains persisted in the DB.

Reproduction

  1. Start a conversation via the Web UI (/api/chat-run socket).
  2. In another terminal, run systemctl restart hermes-web-ui while the run is in progress (or pkill -TERM -f hermes_bridge.py).
  3. The client receives Error: 'unknown run: <hex>' and the session hangs.

Logs from bridge.log:

{"level":40, "response":{"ok":false,"error":"'unknown run: 4c8bcf...'"}, "msg":"[agent-bridge-client] request rejected"}

Fix

  1. agent-bridge/client.ts → streamOutput: detects the unknown run rejection during polling and converts it into a typed BRIDGE_RUN_LOST exception instead of looping or propagating the raw KeyError.

  2. run-chat/handle-bridge-run.ts: the catch block recognises the condition via isBridgeUnknownRunError and emits a friendly message in pt-BR explaining that the connection was restarted and that resending is enough, preserving the session state.

  3. Structured logging: bridgeLogger.warn records sessionId, runId, and the original error so operators can correlate failures with restarts.

The user's original message is already in the DB (persisted by _prepersist_user_message before the bridge's chat() call), so a single resend continues the conversation.
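
As a rough illustration of step 1, the sketch below shows one way a polling loop could translate the rejection into a typed error. It is not the PR's actual code: the `BridgeResponse` shape, the `BridgeRunLostError` class, the `toRunLostError` helper, and the `poll` callback are assumptions made for this example, and the body of `isBridgeUnknownRunError` beyond its first line is a guess at the matching logic.

```typescript
// Hypothetical response shape; the real bridge protocol may carry other fields.
interface BridgeResponse {
  ok: boolean
  error?: string
  output?: string
  done?: boolean
}

// Typed error carrying the code named in the PR description.
class BridgeRunLostError extends Error {
  readonly code = 'BRIDGE_RUN_LOST'
  readonly runId: string

  constructor(runId: string) {
    super(`bridge worker restarted; run ${runId} no longer tracked`)
    this.name = 'BridgeRunLostError'
    this.runId = runId
  }
}

// Assumed matching logic: accept either the typed code or the raw bridge text.
function isBridgeUnknownRunError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err ?? '')
  const code = (err as { code?: unknown } | null | undefined)?.code
  return code === 'BRIDGE_RUN_LOST' || /unknown run/i.test(message)
}

// Returns a typed error when a rejected poll mentions "unknown run".
function toRunLostError(runId: string, response: BridgeResponse): BridgeRunLostError | undefined {
  if (response.ok) return undefined
  return /unknown run/i.test(response.error ?? '') ? new BridgeRunLostError(runId) : undefined
}

// Polling loop: stops with a typed error instead of retrying a lost run forever.
async function* streamOutput(
  runId: string,
  poll: (runId: string) => Promise<BridgeResponse>,
): AsyncGenerator<string> {
  for (;;) {
    const response = await poll(runId)
    const lost = toRunLostError(runId, response)
    if (lost) throw lost
    if (!response.ok) throw new Error(response.error ?? 'bridge request rejected')
    if (response.output) yield response.output
    if (response.done) return
  }
}
```

A dedicated error class avoids assigning `code` onto an `any`-typed `Error`, and checking the `code` first keeps the classifier working even if the bridge's error text changes.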

Test plan

  • Rebuild and restart hermes-web-ui.service while a run is active in the mobile/desktop Web UI.
  • Confirm that the UI now shows the friendly message instead of unknown run: <hex>.
  • Confirm that resending the same message works.
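
The manual checks above could also be approximated without restarting a real service. The sketch below is only an illustration: `FakeBridge` is a hypothetical in-memory stand-in for the Python worker's `_runs` dict, and `FRIENDLY_MESSAGE` and `userFacingError` are invented for this example rather than taken from the PR.

```typescript
// Hypothetical stand-in for the worker: run state lives only in this Map.
class FakeBridge {
  private runs = new Map<string, string[]>()

  start(runId: string): void {
    this.runs.set(runId, [])
  }

  // Simulates a worker restart: every in-flight run is forgotten.
  restart(): void {
    this.runs.clear()
  }

  getOutput(runId: string): { ok: boolean; error?: string; output?: string[] } {
    const chunks = this.runs.get(runId)
    if (!chunks) return { ok: false, error: `'unknown run: ${runId}'` }
    return { ok: true, output: chunks }
  }
}

// Placeholder text for this example only.
const FRIENDLY_MESSAGE =
  'The connection to the agent was restarted. Your message was saved; send it again to continue.'

// Maps a rejected poll to what the UI should show; other errors pass through.
function userFacingError(response: { ok: boolean; error?: string }): string | undefined {
  if (response.ok) return undefined
  const raw = response.error ?? 'unknown bridge error'
  return /unknown run/i.test(raw) ? FRIENDLY_MESSAGE : raw
}

const bridge = new FakeBridge()
bridge.start('run-1')
const beforeRestart = userFacingError(bridge.getOutput('run-1'))
bridge.restart()
const afterRestart = userFacingError(bridge.getOutput('run-1'))
```

Here `beforeRestart` is `undefined` because the run is still tracked, while `afterRestart` holds the friendly text instead of the raw `unknown run` string.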

Notes

  • Does not change the Python bridge's behavior: it keeps state in memory by design (low latency, no disk).
  • Only turns the failure from a "raw KeyError leak" into a "soft, recoverable failure".
  • A durable solution would be to persist _runs in SQLite in the bridge, but that is a larger change and does not block this fix.

When the agent-bridge Python worker restarts (deploy, OOM, crash) all
in-flight runs are lost because run state is held purely in-memory in
`hermes_bridge.py` (`self._runs` dict). The next `get_output` poll then
returns a raw `KeyError: unknown run: <id>` which leaks straight to the
user as `Error: 'unknown run: 4c8bcf...'` and freezes the session.

This change:

1. `streamOutput` in the bridge client detects the `unknown run` reject
   response and translates it into a typed `BRIDGE_RUN_LOST` error
   instead of looping forever or surfacing the raw KeyError.
2. `handle-bridge-run` recognises the lost-run condition and presents a
   user-facing message in pt-BR explaining the bridge restarted and the
   message can simply be resent — preserving the session state.
3. Logs the original error with structured context (`sessionId`, `runId`)
   so operators can still correlate restarts with run failures.

The user's original message remains in DB history (already persisted by
`_prepersist_user_message`), so a single retry continues the conversation.
Copilot AI review requested due to automatic review settings May 24, 2026 19:30

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Improve resiliency when the Python agent bridge worker restarts mid-run by surfacing a typed error to callers and translating it into a user-friendly message in the chat run handler.

Changes:

  • Add detection + UI-friendly messaging/logging for “unknown run” / bridge restart scenarios in handleBridgeRun.
  • Wrap AgentBridgeClient.getOutput() polling to translate “unknown run” into a typed error (code = BRIDGE_RUN_LOST).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| packages/server/src/services/hermes/run-chat/handle-bridge-run.ts | Detect bridge restart conditions and present a friendly message + warning log during run failure handling. |
| packages/server/src/services/hermes/agent-bridge/client.ts | Convert “unknown run” bridge errors into a typed error to stop polling and enable graceful recovery. |

* the run as cleanly terminated so the UI can recover.
*/
export function isBridgeUnknownRunError(err: unknown): boolean {
const message = err instanceof Error ? err.message : String(err ?? '')
if (state.activeRunMarker !== runMarker) return
if (!state.isWorking) return
const queueLen = state.queue?.length ?? 0
const bridgeRestarted = isBridgeUnknownRunError(err)
Comment on lines +443 to +446
const restartErr: any = new Error(`bridge worker restarted; run ${runId} no longer tracked`)
restartErr.code = 'BRIDGE_RUN_LOST'
restartErr.runId = runId
throw restartErr
}

const BRIDGE_RESTART_USER_MESSAGE =
'A conexão com o agente foi reiniciada (geralmente por um deploy ou reinicialização do serviço). Sua mensagem foi salva — basta enviar de novo para continuar.'
@@ -318,7 +335,15 @@ export async function handleBridgeRun(
state.bridgePendingToolCallMarkup = undefined
flushBridgePendingToDb(state, session_id)
Comment on lines +338 to +344
const rawMessage = err instanceof Error ? err.message : String(err)
const message = bridgeRestarted ? BRIDGE_RESTART_USER_MESSAGE : rawMessage
if (bridgeRestarted) {
bridgeLogger.warn({
sessionId: session_id,
runId: state.runId,
rawError: rawMessage,
@EKKOLearnAI
Owner

Thanks for the fix. Two requests before this can move forward:

  1. Please do not hard-code the user-facing recovery message in pt-BR. This project should not emit a Portuguese-only server error to all users. Please change it to the project default language (English) or route it through the existing localization/error-message pattern if one applies here.

  2. Please disable Copilot code review for this repository on your side and avoid requesting Copilot review on future PRs here. I checked the PR timeline and this review was requested by your account, not by our repository ruleset. The Copilot review noise makes the PR harder to review.
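
For the first request, one possible shape is an English default with optional per-locale overrides. This is a sketch under assumptions: the project's actual localization pattern is not shown in this thread, so the `Locale` type, the `BRIDGE_RESTART_MESSAGES` map, and `bridgeRestartMessage` are hypothetical, and the pt-BR entry is a shortened form of the PR's string.

```typescript
// Hypothetical set of supported locales for this example.
type Locale = 'en' | 'pt-BR'

// English is the default; the pt-BR text is shortened from the PR's constant.
const BRIDGE_RESTART_MESSAGES: Record<Locale, string> = {
  en: 'The connection to the agent was restarted (usually because of a deploy or service restart). Your message was saved; just send it again to continue.',
  'pt-BR':
    'A conexão com o agente foi reiniciada. Sua mensagem foi salva; basta enviar de novo para continuar.',
}

// Falls back to English when the locale is missing or not in the map.
function bridgeRestartMessage(locale?: string): string {
  const known =
    locale !== undefined &&
    Object.prototype.hasOwnProperty.call(BRIDGE_RESTART_MESSAGES, locale)
  return known ? BRIDGE_RESTART_MESSAGES[locale as Locale] : BRIDGE_RESTART_MESSAGES.en
}
```

The own-property check keeps inherited keys such as `toString` from being treated as locales, so any unrecognised value resolves to the English text.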
