fix: handle agent-bridge worker restart gracefully (no more 'unknown run' UI errors) #989
When the agent-bridge Python worker restarts (deploy, OOM, crash), all in-flight runs are lost because run state is held purely in memory in `hermes_bridge.py` (the `self._runs` dict). The next `get_output` poll then returns a raw `KeyError: unknown run: <id>`, which leaks straight to the user as `Error: 'unknown run: 4c8bcf...'` and freezes the session.

This change:

1. `streamOutput` in the bridge client detects the `unknown run` reject response and translates it into a typed `BRIDGE_RUN_LOST` error instead of looping forever or surfacing the raw KeyError.
2. `handle-bridge-run` recognises the lost-run condition and presents a user-facing message in pt-BR explaining that the bridge restarted and that the message can simply be resent, preserving the session state.
3. The original error is logged with structured context (`sessionId`, `runId`) so operators can still correlate restarts with run failures.

The user's original message remains in DB history (already persisted by `_prepersist_user_message`), so a single retry continues the conversation.
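The client-side translation in item 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `BridgeRunLostError`, `translatePollError`, `pollUntilDone`, and the `PollFn` shape are hypothetical names, and matching on the substring `unknown run` is an assumption about how the rejection is recognised.

```typescript
// Hypothetical typed error; the PR itself attaches `code` and `runId` to a plain Error.
class BridgeRunLostError extends Error {
  readonly code = 'BRIDGE_RUN_LOST'
  readonly runId: string
  constructor(runId: string) {
    super(`bridge worker restarted; run ${runId} no longer tracked`)
    this.name = 'BridgeRunLostError'
    this.runId = runId
  }
}

type PollResult = { done: boolean; chunk: string }
type PollFn = (runId: string) => Promise<PollResult>

// Maps a poll rejection to the typed error when it mentions "unknown run"
// (assumed matching rule); any other error is returned unchanged.
function translatePollError(err: unknown, runId: string): unknown {
  const message = err instanceof Error ? err.message : String(err ?? '')
  return message.includes('unknown run') ? new BridgeRunLostError(runId) : err
}

// Polls until the run reports completion. A lost run rejects with the typed
// error, which also ends the loop instead of polling forever.
async function pollUntilDone(runId: string, poll: PollFn): Promise<string> {
  let output = ''
  for (;;) {
    const result = await poll(runId).catch((err: unknown) => {
      throw translatePollError(err, runId)
    })
    output += result.chunk
    if (result.done) return output
  }
}
```

Using a dedicated error class rather than an `any`-typed `Error` lets callers narrow with `instanceof` while still exposing the same `code` and `runId` fields.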
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Improve resiliency when the Python agent bridge worker restarts mid-run by surfacing a typed error to callers and translating it into a user-friendly message in the chat run handler.
Changes:
- Add detection + UI-friendly messaging/logging for “unknown run” / bridge restart scenarios in `handleBridgeRun`.
- Wrap `AgentBridgeClient.getOutput()` polling to translate “unknown run” into a typed error (`code = BRIDGE_RUN_LOST`).
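As a rough illustration of the detection step, the sketch below checks for the typed code first and falls back to the raw message. `looksLikeLostRun` is a hypothetical stand-in: only the first lines of the PR's own `isBridgeUnknownRunError` are visible in this review, so the matching rules here are assumptions.

```typescript
// Hypothetical helper; not the PR's isBridgeUnknownRunError implementation.
function looksLikeLostRun(err: unknown): boolean {
  // Prefer the typed code set by the client when it is present.
  if (typeof err === 'object' && err !== null && 'code' in err) {
    if ((err as { code?: unknown }).code === 'BRIDGE_RUN_LOST') return true
  }
  // Fall back to the raw message, matched case-insensitively (assumed rule).
  const message = err instanceof Error ? err.message : String(err ?? '')
  return /unknown run/i.test(message)
}
```

Checking the code before the message keeps detection working even if the wording of the bridge error changes later.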
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| packages/server/src/services/hermes/run-chat/handle-bridge-run.ts | Detect bridge restart conditions and present a friendly message + warning log during run failure handling. |
| packages/server/src/services/hermes/agent-bridge/client.ts | Convert “unknown run” bridge errors into a typed error to stop polling and enable graceful recovery. |
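The handler-side behaviour described in the table can be sketched as below, under stated assumptions: `describeRunFailure`, `WarnLogger`, and `RunContext` are hypothetical names, the logger signature (fields object, then message) is assumed, and the restart message is passed in as a parameter rather than copied from the PR.

```typescript
// Assumed logger shape: a fields object followed by a message string.
type WarnLogger = { warn: (fields: Record<string, unknown>, msg: string) => void }
type RunContext = { sessionId: string; runId?: string }

// Returns the text to show the user. For a lost run it logs the raw error
// with session/run context and substitutes the friendly restart message.
function describeRunFailure(
  err: unknown,
  ctx: RunContext,
  logger: WarnLogger,
  restartMessage: string
): string {
  const rawMessage = err instanceof Error ? err.message : String(err)
  const code = (err as { code?: unknown } | null | undefined)?.code
  const bridgeRestarted = code === 'BRIDGE_RUN_LOST' || rawMessage.includes('unknown run')
  if (!bridgeRestarted) return rawMessage
  logger.warn(
    { sessionId: ctx.sessionId, runId: ctx.runId, rawError: rawMessage },
    'bridge run lost after worker restart'
  )
  return restartMessage
}
```

Keeping the raw error in the log while showing only the friendly text to the user preserves the operator's ability to correlate failures with restarts.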
 * the run as cleanly terminated so the UI can recover.
 */
export function isBridgeUnknownRunError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err ?? '')
if (state.activeRunMarker !== runMarker) return
if (!state.isWorking) return
const queueLen = state.queue?.length ?? 0
const bridgeRestarted = isBridgeUnknownRunError(err)
const restartErr: any = new Error(`bridge worker restarted; run ${runId} no longer tracked`)
restartErr.code = 'BRIDGE_RUN_LOST'
restartErr.runId = runId
throw restartErr
}
const BRIDGE_RESTART_USER_MESSAGE =
  'A conexão com o agente foi reiniciada (geralmente por um deploy ou reinicialização do serviço). Sua mensagem foi salva — basta enviar de novo para continuar.'
@@ -318,7 +335,15 @@ export async function handleBridgeRun(
state.bridgePendingToolCallMarkup = undefined
flushBridgePendingToDb(state, session_id)
const rawMessage = err instanceof Error ? err.message : String(err)
const message = bridgeRestarted ? BRIDGE_RESTART_USER_MESSAGE : rawMessage
if (bridgeRestarted) {
  bridgeLogger.warn({
    sessionId: session_id,
    runId: state.runId,
    rawError: rawMessage,
Thanks for the fix. Two requests before this can move forward:
Problem
When the `agent-bridge` Python worker restarts (deploy via `systemctl restart`, OOM kill, crash recovery), all in-flight runs are lost because run state is kept only in memory in `hermes_bridge.py`. The next `get_output` poll returns `KeyError: unknown run: <id>`, which leaks straight to the user as `Error: 'unknown run: <hex>'`. This freezes the session with no useful feedback: the user's message disappears visually but remains persisted in the DB.

Reproduction
1. … (`/api/chat-run` socket).
2. `systemctl restart hermes-web-ui` while the run is in progress (or `pkill -TERM -f hermes_bridge.py`).
3. … `Error: 'unknown run: <hex>'` and the session stays stuck.

Logs from `bridge.log`: …

Fix
- `agent-bridge/client.ts` → `streamOutput`: detects the `unknown run` rejection on poll and converts it into a typed `BRIDGE_RUN_LOST` exception instead of looping or propagating the raw `KeyError`.
- `run-chat/handle-bridge-run.ts`: the catch recognises the condition via `isBridgeUnknownRunError` and emits a friendly pt-BR message explaining that the connection was restarted and that resending is enough, preserving the session state.
- Structured logging: `bridgeLogger.warn` records `sessionId` / `runId` / the original error so the operator can correlate failures with restarts.

The user's original message is already in the DB (persisted by `_prepersist_user_message` before the bridge's `chat()`), so a single resend continues the conversation.

Test plan
- … `hermes-web-ui.service` while a run is active in the mobile/desktop Web UI.
- … `unknown run: <hex>`.

Notes
- … `_runs` in SQLite in the bridge, but that is a larger change and not blocking for this fix.