RunningHub executor: infinite retry loop on terminal errors (TASK_NOT_FOUND), unkillable by Ctrl+C on Windows #3

@jonathanzhan1975

Summary

When a RunningHub task is cancelled or removed on the server side, comfyui/runninghub_executor.py::_wait_for_task_completion enters an infinite retry loop that the user cannot escape with Ctrl+C in PowerShell / Windows Terminal. The process must be killed via Task Manager.

The bug is two layers of indiscriminate retry stacked on top of each other.

Version: comfykit 0.1.12
Platform: Windows 11 / PowerShell, Python 3.11

Reproduction

  1. Submit a workflow via RunningHubExecutor and let it start polling status.
  2. Open the RunningHub web UI and cancel the task.
  3. The local process now logs RunningHub API error: APIKEY_TASK_NOT_FOUND repeatedly forever.
  4. Ctrl+C in PowerShell does not stop the process.

Log evidence

16:02:36 WARNING runninghub_client.py:115 _make_request - Request failed (attempt 1/4): RunningHub API error: APIKEY_TASK_NOT_FOUND. Retrying in 1s...
16:02:37 WARNING runninghub_client.py:115 _make_request - Request failed (attempt 2/4): ... Retrying in 2s...
16:02:40 WARNING runninghub_client.py:115 _make_request - Request failed (attempt 3/4): ... Retrying in 4s...
16:02:44 ERROR   runninghub_client.py:118 _make_request - Request failed after 4 attempts: APIKEY_TASK_NOT_FOUND
16:02:44 ERROR   runninghub_client.py:293 query_task_status - Failed to query task status...
16:02:44 ERROR   runninghub_executor.py:371 _wait_for_task_completion - Error checking task status ...
16:02:46 WARNING runninghub_client.py:115 _make_request - Request failed (attempt 1/4): ...   ← outer while True loops back
16:02:48 WARNING runninghub_client.py:115 _make_request - Request failed (attempt 2/4): ...
... (forever)

Each outer cycle is ~9 seconds (1+2+4 inner backoff + 2s outer sleep), and never terminates.
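The ~9 second figure can be checked with a short calculation. This is only a sketch: retry_count = 3 and check_interval = 2 are assumptions inferred from the log lines above ("attempt N/4" and the 2s outer sleep), not values read from the comfykit source.

```python
# Assumed values, inferred from the log output rather than from the source code.
retry_count = 3      # "attempt N/4" implies retry_count + 1 == 4
check_interval = 2   # outer sleep between polling rounds, in seconds

# A backoff sleep of 2 ** attempt happens only while attempt < retry_count,
# i.e. for attempts 0, 1 and 2.
inner_backoff = sum(2 ** attempt for attempt in range(retry_count))
cycle_seconds = inner_backoff + check_interval

print(inner_backoff, cycle_seconds)  # prints "7 9"
```

The timestamps in the log show slightly longer cycles, which is consistent with request latency on top of the sleeps.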

Root cause #1 — outer infinite loop (primary)

comfyui/runninghub_executor.py L321-373:

# If both are None, no timeout limit (wait indefinitely)
if max_wait_time is None:
    max_wait_time = self.timeout
...
while True:
    elapsed_time = time.time() - start_time
    if max_wait_time is not None and elapsed_time >= max_wait_time:
        break
    try:
        status_info = await self.client.query_task_status(task_id)
        ...
    except Exception as e:
        logger.error(f"Error checking task status {task_id}: {e}", exc_info=True)
        await asyncio.sleep(check_interval)
        continue   # ← swallows ALL errors, including terminal ones, and loops forever

TASK_NOT_FOUND / APIKEY_TASK_NOT_FOUND is a terminal state: the task is gone from the server and will never come back. It must not be retried; the executor should return an ExecuteResult(status="error") immediately.

When max_wait_time is None and self.timeout is also None (the documented "wait indefinitely" default), the elapsed-time check never fires and this becomes a true infinite loop with no exit condition.

Root cause #2 — inner retry over business errors (amplifier)

comfyui/runninghub_client.py L82-120 in _make_request:

except Exception as e:
    last_exception = e
    ...
    if attempt < self.retry_count:
        wait_time = 2 ** attempt
        logger.warning(f"Request failed (attempt {attempt + 1}/{self.retry_count + 1}): {e}. Retrying in {wait_time}s...")
        await asyncio.sleep(wait_time)

The retry handler treats every Exception the same, including the business error raised at L98:

raise Exception(f"RunningHub API error: {result.get('msg', 'Unknown error')}")

Permanent business errors like APIKEY_TASK_NOT_FOUND, APIKEY_INVALID, and WORKFLOW_NOT_FOUND are attempted 4 times (3 retries) for no benefit, wasting ~7 seconds of backoff and API quota per outer cycle.

Root cause #3 — Ctrl+C cannot break the loop on Windows

The combination of:

  • broad except Exception handlers (these do not catch KeyboardInterrupt, so they are not the cause on their own)
  • asyncio.sleep inside the inner loop
  • aiohttp session.request and Windows asyncio's known signal-delivery issues

means SIGINT delivery is unreliable while the executor is sleeping/awaiting inside the nested loops. Practically, Ctrl+C in PowerShell does nothing and the user must kill the process from Task Manager. Even if Python eventually delivers KeyboardInterrupt, the long backoff windows make the program feel unresponsive.

Suggested fix

A. Classify terminal errors in the executor

TERMINAL_ERROR_TOKENS = (
    "TASK_NOT_FOUND",
    "APIKEY_TASK_NOT_FOUND",
    "APIKEY_INVALID",
    "TASK_CANCELLED",
    "WORKFLOW_NOT_FOUND",
)

# In _wait_for_task_completion, replacing the existing except block:
except Exception as e:
    err_str = str(e)
    if any(tok in err_str for tok in TERMINAL_ERROR_TOKENS):
        logger.error(f"Task {task_id} terminated remotely: {e}")
        return ExecuteResult(
            status="error",
            prompt_id=task_id,
            msg=f"Task cancelled or not found: {e}",
        )
    # Transient: keep polling
    logger.warning(f"Transient error checking task status {task_id}: {e}")
    await asyncio.sleep(check_interval)
    continue

A cleaner long-term solution: define typed exceptions in runninghub_client.py (e.g. RunningHubTerminalError, RunningHubTransientError) instead of raising bare Exception, and have the executor catch them separately.
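A minimal sketch of what the typed-exception approach could look like. Everything here is illustrative: the exception classes do not exist in comfykit today, wait_for_task and fake_query are hypothetical stand-ins for _wait_for_task_completion and query_task_status, the "SUCCESS" / "completed" status strings are placeholders, and a plain dict is used in place of ExecuteResult.

```python
import asyncio

class RunningHubError(Exception):
    """Hypothetical base class for errors reported by the RunningHub API."""

class RunningHubTerminalError(RunningHubError):
    """The request can never succeed (e.g. task not found); do not retry."""

class RunningHubTransientError(RunningHubError):
    """The request may succeed later (e.g. network hiccup); retrying is fine."""

async def wait_for_task(query_status, task_id, check_interval=0.01, max_polls=10):
    """Poll until done; stop on terminal errors, keep polling on transient ones."""
    for _ in range(max_polls):
        try:
            status = await query_status(task_id)
        except RunningHubTerminalError as e:
            # Terminal: report the failure immediately instead of looping.
            return {"status": "error", "prompt_id": task_id, "msg": str(e)}
        except RunningHubTransientError:
            await asyncio.sleep(check_interval)
            continue
        if status == "SUCCESS":
            return {"status": "completed", "prompt_id": task_id, "msg": ""}
        await asyncio.sleep(check_interval)
    return {"status": "error", "prompt_id": task_id, "msg": "polling limit reached"}

async def _demo():
    calls = 0

    async def fake_query(task_id):
        # First call fails transiently, second call reports a terminal error.
        nonlocal calls
        calls += 1
        if calls == 1:
            raise RunningHubTransientError("connection reset")
        raise RunningHubTerminalError("APIKEY_TASK_NOT_FOUND")

    return await wait_for_task(fake_query, "task-123"), calls

result, calls = asyncio.run(_demo())
print(result["status"], calls)  # prints "error 2"
```

With separate exception types the executor no longer needs to match substrings in error messages, so a change in message wording on the server side would not silently turn a terminal error back into a retried one.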

B. Don't retry permanent business errors in _make_request

Inspect result.get('code') / result.get('msg') before raising; if it's a known permanent error, raise a RunningHubTerminalError and have _make_request re-raise it without retrying.
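One possible shape for this, as a sketch only. Assumptions: a response is treated as successful when result["code"] == 0, the permanent error names are the ones listed in this issue, and make_request / raise_for_result / fake_send are hypothetical names rather than the existing client API. The 404 code in the demo is a made-up value.

```python
import asyncio

# Assumption: permanent error names taken from this issue; the real set may differ.
PERMANENT_MSGS = {"APIKEY_TASK_NOT_FOUND", "APIKEY_INVALID", "WORKFLOW_NOT_FOUND"}

class RunningHubTerminalError(Exception):
    """Hypothetical: a business error that retrying cannot fix."""

def raise_for_result(result):
    """Raise a typed error for a failed API result (response shape is assumed)."""
    if result.get("code") == 0:
        return
    msg = result.get("msg", "Unknown error")
    if msg in PERMANENT_MSGS:
        raise RunningHubTerminalError(f"RunningHub API error: {msg}")
    raise Exception(f"RunningHub API error: {msg}")

async def make_request(send, retry_count=3, base_delay=1.0):
    """Retry `send` with exponential backoff, except on terminal errors."""
    last_exception = None
    for attempt in range(retry_count + 1):
        try:
            result = await send()
            raise_for_result(result)
            return result
        except RunningHubTerminalError:
            raise  # permanent: propagate immediately, no backoff
        except Exception as e:
            last_exception = e
            if attempt < retry_count:
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise last_exception

calls = []  # records each request attempt made by the demo

async def fake_send():
    calls.append(1)
    return {"code": 404, "msg": "APIKEY_TASK_NOT_FOUND"}  # made-up error code

try:
    asyncio.run(make_request(fake_send, base_delay=0.001))
    outcome = "ok"
except RunningHubTerminalError:
    outcome = "terminal"

print(outcome, len(calls))  # prints "terminal 1"
```

The terminal error surfaces after a single request instead of four, which removes the ~7 seconds of backoff per polling round described above.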

C. Add a hard ceiling on consecutive identical errors

Even with the above, defensively break out of _wait_for_task_completion if the same error message has been seen N times in a row (e.g. 5). This protects against unknown future error codes.
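A sketch of such a ceiling, assuming the limit is applied to the stringified exception message. The function name poll_with_error_ceiling, the max_identical_errors parameter, the status strings, the dict return value (standing in for ExecuteResult), and the SOME_FUTURE_CODE message are all hypothetical.

```python
import asyncio

async def poll_with_error_ceiling(query_status, task_id, check_interval=0.01,
                                  max_identical_errors=5):
    """Poll a task, giving up once the same error repeats N times in a row."""
    last_error = None
    repeat_count = 0
    while True:
        try:
            status = await query_status(task_id)
        except Exception as e:
            err = str(e)
            # Extend the streak if the message is unchanged, otherwise restart it.
            repeat_count = repeat_count + 1 if err == last_error else 1
            last_error = err
            if repeat_count >= max_identical_errors:
                return {"status": "error", "prompt_id": task_id,
                        "msg": f"Gave up after {repeat_count} identical errors: {err}"}
            await asyncio.sleep(check_interval)
            continue
        last_error, repeat_count = None, 0  # a successful poll resets the streak
        if status == "SUCCESS":
            return {"status": "completed", "prompt_id": task_id, "msg": ""}
        await asyncio.sleep(check_interval)

calls = []  # records each status query made by the demo

async def always_failing(task_id):
    calls.append(task_id)
    raise Exception("RunningHub API error: SOME_FUTURE_CODE")

result = asyncio.run(poll_with_error_ceiling(always_failing, "task-123"))
print(result["status"], len(calls))  # prints "error 5"
```

Comparing whole messages is deliberately crude: it needs no knowledge of which codes are terminal, at the cost of also giving up on a genuinely transient error that happens to repeat N times.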

D. Cooperative cancellation / signal handling

Document that callers should use asyncio.run with a SIGINT handler that cancels the running task, or wrap the executor call with a cancellable task. Current behavior on Windows + PowerShell is effectively unkillable by Ctrl+C, which is a major UX issue.
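The "cancellable task" option can be illustrated as follows. This is a sketch under stated assumptions: poll_forever is a stand-in for the executor's polling loop, run_cancellable is a hypothetical wrapper, and the cancellation is triggered by a timer here rather than by a real Ctrl+C. It shows only that task cancellation gets through an except Exception handler; it does not by itself demonstrate a fix for SIGINT delivery on Windows.

```python
import asyncio

async def poll_forever(check_interval=0.01):
    """Stand-in for the polling loop: every error is swallowed and retried."""
    while True:
        try:
            raise RuntimeError("RunningHub API error: APIKEY_TASK_NOT_FOUND")
        except Exception:
            # asyncio.CancelledError derives from BaseException (Python 3.8+),
            # so a cancellation raised inside this sleep is not swallowed.
            await asyncio.sleep(check_interval)

async def run_cancellable(coro, cancel_after):
    """Run `coro` as a task and cancel it after `cancel_after` seconds."""
    task = asyncio.create_task(coro)
    await asyncio.sleep(cancel_after)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return "cancelled"
    return "finished"

outcome = asyncio.run(run_cancellable(poll_forever(), cancel_after=0.05))
print(outcome)  # prints "cancelled"
```

In a real caller the task.cancel() call would be driven by whatever mechanism receives the interrupt, so short sleep intervals inside the loop matter as much as the wrapper: cancellation can only take effect at an await point.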

Why this matters

Any user who cancels a task from the RunningHub web UI (a totally normal action) ends up with a hung local process that fills the log and cannot be stopped without Task Manager. Discovered while running the Pixelle-Video project, which depends on comfykit.

Happy to send a PR if the maintainer agrees with the direction above.
