fix: handle agent-bridge worker restart gracefully (no more 'unknown run' UI errors) by paulocavallari · Pull Request #989 · EKKOLearnAI/hermes-web-ui

paulocavallari · 2026-05-24T19:30:32Z

Problem

When the agent-bridge Python worker restarts (deploy via systemctl restart, OOM kill, crash recovery), all in-flight runs are lost because run state is kept only in memory in hermes_bridge.py:

self._runs: dict[str, _RunRecord] = {}  # in-process only

The next get_output poll returns KeyError: unknown run: <id>, which leaks straight to the user as:

Error: 'unknown run: 4c8bcf97879348b6bf41e164aa190283'

This freezes the session without useful feedback: the user's message disappears from view but remains persisted in the DB.

Reproduction

Start a conversation via the Web UI (/api/chat-run socket).
In another terminal, run systemctl restart hermes-web-ui while the run is in progress (or pkill -TERM -f hermes_bridge.py).
The client receives Error: 'unknown run: <hex>' and the session hangs.

Logs from bridge.log:

{"level":40, "response":{"ok":false,"error":"'unknown run: 4c8bcf...'"}, "msg":"[agent-bridge-client] request rejected"}

Fix

agent-bridge/client.ts → streamOutput: detects the unknown run rejection during polling and converts it into a typed BRIDGE_RUN_LOST exception instead of looping or propagating the raw KeyError.
run-chat/handle-bridge-run.ts: the catch recognises the condition via isBridgeUnknownRunError and emits a friendly pt-BR message explaining that the connection was restarted and that resending is enough, preserving the session state.
Structured logging: bridgeLogger.warn records sessionId/runId/original error so the operator can correlate failures with restarts.

The user's original message is already in the DB (persisted by _prepersist_user_message before the bridge's chat() call), so a single resend continues the conversation.
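
The client-side part of the fix can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `BridgeResponse`, `pollUntilDone`, `looksLikeUnknownRun`, and the `getOutput`/`onChunk` parameters are hypothetical names, and the reply shape is an assumption. Only the `BRIDGE_RUN_LOST` code, the `runId` field, and the error text come from the diff excerpts in this PR.

```typescript
// Hypothetical reply shape; the real client types are not shown in this PR.
type BridgeResponse =
  | { ok: true; done: boolean; chunk: string }
  | { ok: false; error: string }

// Typed error carrying the code and runId that the PR attaches to the failure.
class BridgeRunLostError extends Error {
  readonly code = 'BRIDGE_RUN_LOST'
  readonly runId: string
  constructor(runId: string) {
    super(`bridge worker restarted; run ${runId} no longer tracked`)
    this.name = 'BridgeRunLostError'
    this.runId = runId
  }
}

// Assumed matcher: the PR shows only the first line of isBridgeUnknownRunError,
// so the code check and the regex below are guesses at its intent.
function looksLikeUnknownRun(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err ?? '')
  const code = (err as { code?: unknown } | null | undefined)?.code
  return code === 'BRIDGE_RUN_LOST' || /unknown run/i.test(message)
}

// Polls until the run reports completion. A rejected poll ends the loop:
// "unknown run" becomes the typed error, anything else is rethrown as-is.
// A real loop would also wait between polls; that is omitted here.
async function pollUntilDone(
  runId: string,
  getOutput: (runId: string) => Promise<BridgeResponse>,
  onChunk: (chunk: string) => void,
): Promise<void> {
  for (;;) {
    const res = await getOutput(runId)
    if (!res.ok) {
      if (looksLikeUnknownRun(res.error)) throw new BridgeRunLostError(runId)
      throw new Error(res.error)
    }
    if (res.chunk) onChunk(res.chunk)
    if (res.done) return
  }
}
```

Throwing on the first rejected poll is what keeps the loop from spinning against a worker that no longer knows the run.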

Test plan

Rebuild and restart hermes-web-ui.service while a run is active in the Web UI (mobile/desktop).
Confirm that the UI now shows the friendly message instead of unknown run: <hex>.
Confirm that resending the same message works.

Notes

Does not change the behaviour of the Python bridge, which keeps state in memory by design (low latency, no disk).
It only turns the failure from a "raw KeyError leak" into a "soft fail recoverable".
A durable solution would be to persist _runs in SQLite in the bridge, but that is a larger change and does not block this fix.

When the agent-bridge Python worker restarts (deploy, OOM, crash) all in-flight runs are lost because run state is held purely in-memory in `hermes_bridge.py` (`self._runs` dict). The next `get_output` poll then returns a raw `KeyError: unknown run: <id>` which leaks straight to the user as `Error: 'unknown run: 4c8bcf...'` and freezes the session.

This change:

1. `streamOutput` in the bridge client detects the `unknown run` reject response and translates it into a typed `BRIDGE_RUN_LOST` error instead of looping forever or surfacing the raw KeyError.
2. `handle-bridge-run` recognises the lost-run condition and presents a user-facing message in pt-BR explaining the bridge restarted and the message can simply be resent, preserving the session state.
3. Logs the original error with structured context (`sessionId`, `runId`) so operators can still correlate restarts with run failures.

The user's original message remains in DB history (already persisted by `_prepersist_user_message`), so a single retry continues the conversation.
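
On the handler side, the choice between the friendly message and the raw error, plus the structured warning, might look like the sketch below. `describeRunFailure`, `WarnLogger`, `isLostRun`, and the English `RESTART_MESSAGE` text are illustrative assumptions; the PR does this inline in the catch path of `handleBridgeRun` with `bridgeLogger.warn` and its own message constant.

```typescript
// Placeholder wording, assumed for illustration only.
const RESTART_MESSAGE =
  'The connection to the agent was restarted. Your message was saved; send it again to continue.'

// Minimal logger surface assumed here; the real bridgeLogger type is not shown in the PR.
interface WarnLogger {
  warn(fields: Record<string, unknown>, msg: string): void
}

// Assumed detection rule, standing in for the PR's isBridgeUnknownRunError.
function isLostRun(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err ?? '')
  const code = (err as { code?: unknown } | null | undefined)?.code
  return code === 'BRIDGE_RUN_LOST' || /unknown run/i.test(message)
}

// Returns the text to show the user. Logs the raw error only for lost runs,
// so operators can correlate the warning with a worker restart.
function describeRunFailure(
  err: unknown,
  ctx: { sessionId: string; runId?: string },
  logger: WarnLogger,
): string {
  const rawMessage = err instanceof Error ? err.message : String(err)
  if (!isLostRun(err)) return rawMessage
  logger.warn(
    { sessionId: ctx.sessionId, runId: ctx.runId, rawError: rawMessage },
    'bridge run lost after worker restart',
  )
  return RESTART_MESSAGE
}
```

Keeping the raw error in the log while hiding it from the UI preserves the operator's diagnostic trail without exposing the `KeyError` text to users.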

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Improve resiliency when the Python agent bridge worker restarts mid-run by surfacing a typed error to callers and translating it into a user-friendly message in the chat run handler.

Changes:

Add detection + UI-friendly messaging/logging for “unknown run” / bridge restart scenarios in handleBridgeRun.
Wrap AgentBridgeClient.getOutput() polling to translate “unknown run” into a typed error (code = BRIDGE_RUN_LOST).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
packages/server/src/services/hermes/run-chat/handle-bridge-run.ts	Detect bridge restart conditions and present a friendly message + warning log during run failure handling.
packages/server/src/services/hermes/agent-bridge/client.ts	Convert “unknown run” bridge errors into a typed error to stop polling and enable graceful recovery.


+ * the run as cleanly terminated so the UI can recover.
+ */
+export function isBridgeUnknownRunError(err: unknown): boolean {
+  const message = err instanceof Error ? err.message : String(err ?? '')


    if (state.activeRunMarker !== runMarker) return
    if (!state.isWorking) return
    const queueLen = state.queue?.length ?? 0
+    const bridgeRestarted = isBridgeUnknownRunError(err)


+          const restartErr: any = new Error(`bridge worker restarted; run ${runId} no longer tracked`)
+          restartErr.code = 'BRIDGE_RUN_LOST'
+          restartErr.runId = runId
+          throw restartErr


+}
+
+const BRIDGE_RESTART_USER_MESSAGE =
+  'A conexão com o agente foi reiniciada (geralmente por um deploy ou reinicialização do serviço). Sua mensagem foi salva — basta enviar de novo para continuar.'


@@ -318,7 +335,15 @@ export async function handleBridgeRun(
    state.bridgePendingToolCallMarkup = undefined
    flushBridgePendingToDb(state, session_id)


+    const rawMessage = err instanceof Error ? err.message : String(err)
+    const message = bridgeRestarted ? BRIDGE_RESTART_USER_MESSAGE : rawMessage
+    if (bridgeRestarted) {
+      bridgeLogger.warn({
+        sessionId: session_id,
+        runId: state.runId,
+        rawError: rawMessage,


EKKOLearnAI · 2026-05-25T02:06:40Z

Thanks for the fix. Two requests before this can move forward:

Please do not hard-code the user-facing recovery message in pt-BR. This project should not emit a Portuguese-only server error to all users. Please change it to the project default language (English) or route it through the existing localization/error-message pattern if one applies here.
Please disable Copilot code review for this repository on your side and avoid requesting Copilot review on future PRs here. I checked the PR timeline and this review was requested by your account, not by our repository ruleset. The Copilot review noise makes the PR harder to review.
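
One way to address the first request is an English default with optional per-locale overrides, sketched below. Everything here is hypothetical: the project's actual localization or error-message pattern is not shown in this thread, and `BRIDGE_RESTART_MESSAGES` and `bridgeRestartMessage` are invented names. The pt-BR text is adapted from the string in the PR.

```typescript
// Hypothetical message catalog; English is the default the maintainer asked for.
const BRIDGE_RESTART_MESSAGES = {
  en: 'The connection to the agent was restarted (usually by a deploy or service restart). Your message was saved; send it again to continue.',
  'pt-BR':
    'A conexão com o agente foi reiniciada (geralmente por um deploy ou reinicialização do serviço). Sua mensagem foi salva. Basta enviar de novo para continuar.',
} as const

type SupportedLocale = keyof typeof BRIDGE_RESTART_MESSAGES

// Returns the message for a known locale and falls back to English otherwise.
// hasOwnProperty avoids matching inherited keys such as "toString".
function bridgeRestartMessage(locale?: string): string {
  if (locale && Object.prototype.hasOwnProperty.call(BRIDGE_RESTART_MESSAGES, locale)) {
    return BRIDGE_RESTART_MESSAGES[locale as SupportedLocale]
  }
  return BRIDGE_RESTART_MESSAGES.en
}
```

If the project already routes server errors through a shared message table, emitting a stable code such as `BRIDGE_RUN_LOST` and letting the client pick the text would likely fit better than a server-side lookup like this one.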

Copilot AI review requested due to automatic review settings May 24, 2026 19:30

paulocavallari wants to merge 1 commit into EKKOLearnAI:main from paulocavallari:fix/bridge-unknown-run-recovery


3 participants
