fix(soul): keep agent loop alive while background tasks are running #1802
Conversation
The agent loop exited prematurely when the LLM returned a text-only response (no tool calls) while background tasks were still running. This caused all pending background agents to be killed on session teardown, losing their work. Add a wait mechanism in _agent_loop() that detects active background tasks before exiting and waits for the next completion event instead. Each completion triggers one new step so the LLM can process notifications promptly. The wait polls every 2s to check for user steers, ensuring interactive responsiveness is preserved. Only applies to stop_reason="no_tool_calls" — other stop reasons like "tool_rejected" exit the turn immediately as before.
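The wait mechanism described above can be sketched as a minimal asyncio loop. All names and the shortened poll interval are illustrative assumptions; the real implementation lives in `KimiSoul._agent_loop()`:

```python
import asyncio

async def wait_for_next_completion(event, has_active_tasks, poll=0.05):
    """Sketch of the wait: instead of exiting the agent loop, wait for
    the next background completion, waking every poll interval so
    pending user steers could be checked (the real 2s poll is
    shortened here for the demo)."""
    while has_active_tasks():
        try:
            await asyncio.wait_for(event.wait(), timeout=poll)
        except asyncio.TimeoutError:
            continue  # poll tick: a real loop would check for steers here
        event.clear()
        return True   # one completion consumed: run exactly one new step
    return False      # nothing left to wait for: the turn may end

async def demo():
    event = asyncio.Event()
    state = {"active": 1}

    async def background_task():
        await asyncio.sleep(0.07)
        state["active"] = 0
        event.set()  # notify the waiting agent loop

    asyncio.create_task(background_task())
    return await wait_for_next_completion(event, lambda: state["active"] > 0)

print(asyncio.run(demo()))  # prints True
```

Returning after a single `event.clear()` is what gives the per-completion semantics: each completion wakes the loop once and triggers exactly one new step.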
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa8cb768db
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
src/kimi_cli/soul/kimisoul.py
Outdated
```python
event.clear()
# Re-check after clearing to avoid a race where a task
# completed between has_active_tasks() and clear().
if self._runtime.background_tasks.has_active_tasks():
```
Avoid clearing a valid completion signal before waiting
In _agent_loop, completion_event is cleared unconditionally before the wait starts. If one background task has just completed (so the event is already set) while another task is still active, this drops the completion signal and forces the loop to wait for a later completion or steer. In that case, the notification for the already-completed task stays pending and the LLM won’t process it promptly (or at all if the remaining task is long-lived), which breaks the intended per-completion progression.
Pull request overview
This PR prevents the root agent loop from exiting when the LLM returns a text-only response while background tasks are still running, by waiting for background task completion signals and continuing the loop.
Changes:
- Add background-task-aware waiting logic to `KimiSoul._agent_loop()` for `stop_reason="no_tool_calls"`.
- Introduce `BackgroundTaskManager.has_active_tasks()` to detect non-terminal tasks.
- Add a focused test suite covering background-wait behavior, cancellation, steers, and stop-reason bypass.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/kimi_cli/soul/kimisoul.py | Keeps the agent loop alive by waiting for background completion events when appropriate. |
| src/kimi_cli/background/manager.py | Adds a helper to detect whether any background tasks are still active. |
| tests/core/test_kimisoul_background_wait.py | Adds regression and edge-case tests for the new background wait behavior. |
src/kimi_cli/soul/kimisoul.py
Outdated
```python
event = self._runtime.background_tasks.completion_event
event.clear()
# Re-check after clearing to avoid a race where a task
# completed between has_active_tasks() and clear().
if self._runtime.background_tasks.has_active_tasks():
```
Potential missed background completion signal: this code clears completion_event before checking whether it’s already set (or whether there are pending LLM notifications). If a background task completes and the notification pump sets the event while the LLM step is running, event may already be set when we reach here; clearing it can drop the signal and then the loop may wait until the next completion (or a steer), delaying/never processing the already-pending notification. Consider first checking event.is_set() and/or self._runtime.notifications.has_pending_for_sink("llm") (similar to the shell watcher) and immediately continuing to the next step in that case, only entering the timed wait when there’s truly nothing pending yet.
Signed-off-by: Kai <me@kaiyi.cool>
💡 Codex Review
Reviewed commit: 0f06bef46a
src/kimi_cli/soul/kimisoul.py
Outdated
```python
await asyncio.wait_for(event.wait(), timeout=2.0)
event.clear()
break  # A task completed — let next step handle it
```
Consume steers before handling a completion wake-up
In the background-wait loop, a successful event.wait() path clears the event and breaks immediately, but it does not check _consume_pending_steers() first. If a user steer arrives during the same wait window as a completion signal, the next iteration runs _step() for the completion-triggered follow-up before injecting the steer, which defeats the intended steer priority and can execute one extra LLM/tool step the user was trying to interrupt.
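The ordering this comment asks for can be sketched as follows. `handle_wakeup`, the steer queue, and the action labels are all hypothetical stand-ins for the real `_consume_pending_steers()` path, and the wait window is shortened for the demo:

```python
import asyncio
from collections import deque

async def handle_wakeup(event, steers, poll=0.05):
    """Sketch of the suggested ordering: after a wake-up, consume
    pending steers before acting on the completion signal, so a user
    interrupt is injected ahead of the completion-triggered step."""
    actions = []
    try:
        await asyncio.wait_for(event.wait(), timeout=poll)
    except asyncio.TimeoutError:
        pass
    while steers:  # steers take priority over the completion follow-up
        actions.append(("steer", steers.popleft()))
    if event.is_set():
        event.clear()
        actions.append(("completion_step", None))
    return actions

async def demo():
    event = asyncio.Event()
    steers = deque(["please stop"])
    event.set()  # completion and steer arrive in the same wait window
    return await handle_wakeup(event, steers)

print(asyncio.run(demo()))  # prints [('steer', 'please stop'), ('completion_step', None)]
```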
If a background task completes during _step() and the notification pump sets completion_event before the wait logic runs, the event was being cleared unconditionally — dropping the signal and forcing a 2s timeout before the notification could be processed. Check event.is_set() first and immediately continue to the next step when a completion is already pending, preserving per-completion progression.
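A minimal sketch of this fixed ordering, assuming a plain `asyncio.Event` and a shortened timeout (the string return labels are illustrative, not the real API): checking `is_set()` before any `clear()` means a completion that landed during `_step()` is consumed immediately rather than dropped.

```python
import asyncio

async def wake_on_completion(event, poll=0.05):
    """Check is_set() before waiting, so a completion that arrived
    while the previous step was running is consumed immediately
    instead of being cleared and lost."""
    if event.is_set():
        event.clear()
        return "immediate"   # pending completion: proceed to next step now
    try:
        await asyncio.wait_for(event.wait(), timeout=poll)
    except asyncio.TimeoutError:
        return "timeout"     # no completion yet: a chance to check steers
    event.clear()
    return "waited"          # completion arrived during the wait

async def demo():
    event = asyncio.Event()
    event.set()  # a background task completed during _step()
    return await wake_on_completion(event)

print(asyncio.run(demo()))  # prints immediate
```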
💡 Codex Review
Reviewed commit: 21097db0ef
src/kimi_cli/soul/kimisoul.py
Outdated
```python
step_outcome.stop_reason == "no_tool_calls"
and self._runtime.role == "root"
and self._runtime.background_tasks.has_active_tasks()
):
```
Restrict background wait to turns that launched work
This condition makes every no_tool_calls turn block whenever any session background task is active, including long-running tasks started in earlier turns that are unrelated to the current user request. Because shell run_soul(...) only returns after _agent_loop() returns, the prompt never comes back and users cannot continue normal interaction while those jobs run, effectively turning background tasks into foreground blocking work; please gate this path to contexts where process exit would kill tasks (e.g., one-shot modes) or to tasks started in the current turn.
Signed-off-by: Kai <me@kaiyi.cool>
💡 Codex Review
Reviewed commit: a6a4e5dba2
src/kimi_cli/soul/kimisoul.py
Outdated
```python
step_outcome.stop_reason == "no_tool_calls"
and self._runtime.role == "root"
and self._runtime.background_tasks.has_active_tasks()
):
```
Scope background wait to pending completion notifications
This condition waits on every no_tool_calls step whenever any non-terminal background task exists, even if that task was started in a prior turn and has no new completion to process. In practice, a long-lived background task (for example one started with run_in_background=true) keeps has_active_tasks() true, so later text-only turns can stay in the wait loop indefinitely and run_soul() does not return except via steer/cancel; that blocks normal interaction flow (queued inputs are only drained after run_soul returns in src/kimi_cli/ui/shell/__init__.py).
Signed-off-by: Kai <me@kaiyi.cool>
The prior fix added a wait loop inside KimiSoul._agent_loop() that
blocked every no_tool_calls turn whenever any background task was
active, even tasks unrelated to the current turn. Review feedback
flagged this as turning background work into foreground blocking.
Move the wait logic to Print.run() instead, where it only runs in
one-shot text mode (the only context where process exit would kill
still-running background agents). stream-json multi-turn mode is
unaffected so prompts never block each other.
The loop drives reconcile() itself (the notification pump inside
run_soul is no longer running after it returns), re-enters the soul
whenever has_pending_for_sink("llm") is True — even with active
sibling tasks so per-task progress is not blocked — and does a final
reconcile before break to close the race where a worker writes
terminal status between the top-of-loop reconcile and the
has_active_tasks check. Ctrl+C during the wait now raises
RunCancelled so the process exits with failure instead of success.
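The loop described above can be sketched with stub managers. Every class and method name in this sketch is an assumption standing in for the real `BackgroundTaskManager` and `NotificationManager`, and the poll interval is shortened for the demo:

```python
import asyncio

class StubTasks:
    """Stand-in for the real task/notification managers."""
    def __init__(self):
        self.completion_event = asyncio.Event()
        self._active = 1
        self._pending_llm = False

    async def reconcile(self):
        # Fold terminal statuses written by workers into local state.
        if self._active and self.completion_event.is_set():
            self.completion_event.clear()
            self._active = 0
            self._pending_llm = True

    def has_active_tasks(self):
        return self._active > 0

    def has_pending_for_sink(self, sink):
        return self._pending_llm

    def consume(self, sink):
        self._pending_llm = False

async def print_wait_loop(tasks, run_step, poll=0.05):
    while True:
        await tasks.reconcile()            # the loop drives reconcile itself
        if tasks.has_pending_for_sink("llm"):
            tasks.consume("llm")
            await run_step()               # re-enter the soul to process it
            continue
        if not tasks.has_active_tasks():
            await tasks.reconcile()        # final reconcile closes the race window
            break
        try:
            await asyncio.wait_for(tasks.completion_event.wait(), poll)
        except asyncio.TimeoutError:
            pass                           # poll tick (Ctrl+C would cancel here)

async def demo():
    tasks = StubTasks()
    steps = []

    async def run_step():
        steps.append("notified step")

    async def worker():
        await asyncio.sleep(0.02)
        tasks.completion_event.set()

    asyncio.create_task(worker())
    await print_wait_loop(tasks, run_step)
    return steps

print(asyncio.run(demo()))  # prints ['notified step']
```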
Tests: 8 unit tests (mocked) + 4 integration tests using real
BackgroundTaskManager + NotificationManager + file-backed stores,
including a race-window regression guard.
…ei-v1

# Conflicts:
#	CHANGELOG.md
#	docs/en/release-notes/changelog.md
#	docs/zh/release-notes/changelog.md
💡 Codex Review
Reviewed commit: 60870da825
src/kimi_cli/ui/print/__init__.py
Outdated
```python
#
# stream-json mode is multi-turn: background tasks
# from one command must not block the next command.
if runtime and runtime.role == "root" and self.input_format == "text":
```
Skip print background wait when keep_alive_on_exit is enabled
The new post-run_soul wait loop runs for every root/text print turn, but it ignores runtime.config.background.keep_alive_on_exit. That flag is already honored by KimiCLI.shutdown_background_tasks (src/kimi_cli/app.py), which means exiting would not kill active tasks in this mode; with the current condition, one-shot --print can still block until all background tasks finish (or forever for long-lived jobs), which defeats the configured keep-alive behavior.
Review feedback: the new Print wait loop blocked one-shot runs whenever any background task was active, even when background.keep_alive_on_exit is configured. That flag opts into "fire and forget" semantics where background subprocesses are meant to outlive the CLI process, so blocking defeats the configured behavior. Skip the wait loop entirely when keep_alive_on_exit is true. Any pending completion notifications remain in the notification store and will be surfaced in the next interactive session.
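The resulting gate can be sketched as a small predicate. The helper name and config shape below are assumptions that mirror the conditions described in the commit message, not the project's actual API:

```python
from dataclasses import dataclass

@dataclass
class BackgroundConfig:
    keep_alive_on_exit: bool = False

def should_wait_for_background(role, input_format, cfg):
    """Only root one-shot text mode waits for background tasks;
    keep_alive_on_exit opts into fire-and-forget semantics where
    background subprocesses are meant to outlive the CLI process."""
    return (
        role == "root"
        and input_format == "text"
        and not cfg.keep_alive_on_exit
    )

print(should_wait_for_background("root", "text", BackgroundConfig()))         # True
print(should_wait_for_background("root", "text",
                                 BackgroundConfig(keep_alive_on_exit=True)))  # False
print(should_wait_for_background("root", "stream-json", BackgroundConfig())) # False
```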
Summary
- Add a wait in `_agent_loop()` that waits for the next background task completion event instead of exiting, with per-completion semantics (each completion triggers one new step).
- Only applies to `stop_reason="no_tool_calls"`; `tool_rejected` and other stop reasons exit immediately as before.
Test plan