Feat/engine#2
Conversation
…-format Add sample compressed traces fixture for use by halo-engine e2e tests. Made-with: Cursor
Reverting the removal of the child_depth > maximum_depth check inside guarded_invoke. The structural guard in _child_tools_for_depth (only exposing call_subagent when depth < maximum_depth) makes this path unreachable in normal flow, but the runtime check is cheap insurance against a future refactor that bypasses the structural enforcement. Comment now records that intent. New unit test constructs the tool directly via _build_subagent_as_tool with child_depth=maximum_depth+1 to exercise the branch, addressing the original "unreachable" / "dead code" complaint by making the guard testable and tested.
| compaction_model=ModelConfig(name=E2E_MODEL), | ||
| # Force compaction to trigger even on small conversations | ||
| text_message_compaction_keep_last_messages=1, | ||
| tool_call_compaction_keep_last_messages=1, |
There was a problem hiding this comment.
Stale field name in plan's integration test code
Medium Severity
Task 11.4's integration test snippet uses tool_call_compaction_keep_last_messages=1, but EngineConfig defines the field as tool_call_compaction_keep_last_turns (with extra="forbid"). When an engineer or agentic worker follows the plan to create test_engine_compaction.py, the test will immediately fail with a Pydantic ValidationError on construction rather than testing compaction behavior.
Reviewed by Cursor Bugbot for commit f8bde0f. Configure here.
| self.items = list(items) | ||
| self.compaction_model = compaction_model | ||
| self.text_message_compaction_keep_last_messages = text_message_compaction_keep_last_messages | ||
| self.tool_call_compaction_keep_last_turns = tool_call_compaction_keep_last_turns |
There was a problem hiding this comment.
Dead compaction_model field on AgentContext
Low Severity
AgentContext.__init__ accepts and stores compaction_model as a required field, but no method on AgentContext ever reads self.compaction_model. Compaction actually happens through the compactor callable injected into compact_old_items, which is produced by build_openai_compactor_factory(engine_config) — AgentContext has no role in resolving the model. Every call site (including subagent_tool_factory) must pass the field for no effect, and the TODO comment in openai_agent_runner.py acknowledges the current two-path design.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit f8bde0f. Configure here.
…nups Wire agent_context into ToolContext + drop dead subagent code
ModelConfig.name was a Literal that included Claude variants but the compactor and synthesis tool always built AsyncOpenAI() and called chat.completions.create — a Claude name would 4xx at runtime. Targeting the OpenAI-compatible chat-completions surface (de facto 2025 standard exposed by OpenAI, OpenRouter, Anthropic compat, vLLM, Together, Groq, Ollama, LM Studio, Azure, etc.) covers the broad-compatibility goal without adding a dependency: - Drop Literal on ModelConfig.name (now plain str so any provider's model id is accepted) - New ModelProviderConfig (base_url, api_key) on EngineConfig; each field independently falls back to the matching env var when None - Thread it into the AsyncOpenAI client used by the compactor and synthesis tool, and into a new configure_default_sdk_client in openai_agent_runner so the Agents SDK honors the same endpoint - Rename openai_compactor.py -> compactor.py and build_openai_compactor_factory -> build_compactor_factory now that the helper is provider-agnostic
Primary fix (cursor[bot] 3150208823, Medium):
- _drive caught BaseException and routed it through output_bus.fail()
without re-raising, so CancelledError / KeyboardInterrupt / SystemExit
were converted into normal task completions. Structured cancellation
broke: an outer task.cancel() left the inner task ended-normally with
a fail signal queued for a consumer that may have already torn down.
- Narrow the silent-fail branch to except Exception; for BaseException
drain the bus then re-raise so the task transitions to cancelled.
Cleanups bundled (same file or trivially small):
- Rename inner runner -> agent_runner in _drive (3149854827) so it
doesn't shadow the outer runner: RunnerProtocol parameter.
- Add the {maximum_parallel_subagents} placeholder to the subagent
system prompt (3150184259); render_subagent_system_prompt accepted
the kwarg but the template had no slot, so str.format silently
dropped it.
- Anchor /data/ in .gitignore (3150106227) so the rule only matches
the top-level data/ directory rather than any data/ at any depth.
…leanups Let _drive cancel cleanly + small bug-bot cleanups
set_default_openai_client defaults use_for_tracing=True, which routes the Agents SDK tracing exporter through the same custom client used for model calls. Non-OpenAI providers (vLLM, Ollama, OpenRouter, etc.) don't speak the tracing API, so tracing POSTs would either error or silently drop — defeating a key goal of the PR. Pass use_for_tracing=False so the tracing exporter stays on its default OpenAI path while model calls go through the configured endpoint.
| "--", | ||
| "/venv/bin/python", | ||
| str(script_path), | ||
| ] |
There was a problem hiding this comment.
Hardcoded sandbox mount paths duplicated across modules
Medium Severity
The in-sandbox mount paths /mnt/trace/traces.jsonl, /mnt/trace/traces.jsonl.engine-index.jsonl, /venv, /venv/bin/python, and /workspace/bootstrap.py are hardcoded as string literals in both platform_commands.py and sandbox_runner.py. They must stay perfectly synchronized — if one is changed without the other, the sandbox silently breaks because the bootstrap script tries to load files at paths the bind mount no longer provides. Extracting them into shared constants would eliminate this drift risk.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 58a6ec6. Configure here.
Route LLM calls through configurable OpenAI-compatible endpoint
cursor[bot] 3150754673, High: a single asyncio.Semaphore was shared across every depth, and guarded_invoke held its slot for the full child execution including the parent's wait on its grandchildren. With max_parallel_subagents parents holding every slot at depth N, any depth-(N+1) grandchild blocks forever waiting on a slot its parent is holding — and the parent stays parked inside runner.run waiting for that grandchild's tool result. Reachable on default config (max_depth=2, max_parallel=4) and at max_parallel=1 with even a single grandchild spawn. Replace the single semaphore with one per spawnable depth, built once in build_root_sdk_agent. Each depth has its own pool so parent-holding-slot can no longer block its own grandchildren — they contend on different semaphores. Worst-case concurrent agents at default config: P at each depth, total P × D = 8. Hard cap, honest semantics. Includes a regression test that pre-acquires the depth-1 slot externally then drives a depth-2 tool via _build_subagent_as_tool, verifying it completes within timeout. Verified the test fails (TimeoutError) when guarded_invoke is reverted to a global slot.
Fix nested-subagent deadlock with per-depth semaphores
|
|
||
| if [[ -x "$ENGINE_HOOK" ]]; then | ||
| "$ENGINE_HOOK" | ||
| fi |
There was a problem hiding this comment.
Pre-commit hook silently no-ops when engine hook missing
Low Severity
Both hooks silently skip when $ENGINE_HOOK is not executable or doesn't exist. A developer who hasn't run task setup-git-hooks, or whose checkout has the file but without the executable bit, will get zero feedback that pre-commit/pre-push validation is being bypassed entirely. The PR discussion explicitly flagged "dont like fallbacks like this lets use one way of setting paths" — this silent fallback is the kind of behavior that change was meant to remove.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 4be7579. Configure here.
| ], | ||
| temperature=engine_config.compaction_model.temperature or 0.0, | ||
| ) | ||
| return (response.choices[0].message.content or "").strip() |
There was a problem hiding this comment.
Compactor lazy client reads unset model_provider
Low Severity
The factory closure mutates a shared openai_client via nonlocal, lazily constructing one AsyncOpenAI per factory but reusing it across executions. Two issues: (1) under concurrent subagent compaction the lazy if openai_client is None check is racy, possibly creating multiple clients; (2) the inline TODO already calls out this design — building one client per run and injecting it removes the branch entirely. As written, this is overly complex relative to the simpler single-client construction the TODO describes.
Reviewed by Cursor Bugbot for commit 4be7579. Configure here.
cursor[bot] 3151031551 (Medium): release workflow ran ``uv version --bump … --no-sync`` then committed both pyproject.toml and uv.lock — but ``--no-sync`` also skips the lock refresh, so the release would push a uv.lock that lagged the bumped pyproject.toml. Add an explicit ``uv lock`` step after the bump. cursor[bot] 3151031562 (Low): when the consumer of stream_engine_async cancels iteration, the outer ``except`` calls task.cancel() but never awaits the inner _drive task. asyncio then warns "Task was destroyed but it is pending" and any exception _drive raises during cancellation cleanup is silently dropped. Add a try/except-BaseException that drains the task before re-raising the consumer's original exception.
| chunk = await stream.read(65536) | ||
| if not chunk: | ||
| break | ||
| return bytes(buf) |
There was a problem hiding this comment.
Sandbox stdout/stderr drained twice loses output
Medium Severity
In _read_capped, after the first loop fills buf exactly to cap, the second while True loop continues to read and discard the remaining stream. However, if the first loop exits because len(buf) == cap while the next chunk would have fit (the boundary case), or if the process emits less than cap total bytes, the second loop still runs an extra await stream.read(65536) after EOF — harmless but wasteful. More importantly, when output exceeds cap, bytes between cap and the next chunk boundary are silently discarded with no truncation marker, so users see partial output with no indication that it was capped.
Reviewed by Cursor Bugbot for commit 83cea0a. Configure here.
|
|
||
| if [[ -x "$ENGINE_HOOK" ]]; then | ||
| "$ENGINE_HOOK" "$REMOTE" "$URL" < "$REFS_FILE" | ||
| fi |
There was a problem hiding this comment.
Pre-push hook silently swallows refs when engine hook missing
Low Severity
The wrapper unconditionally consumes stdin via cat > "$REFS_FILE" before checking whether the engine hook exists. If the engine hook is absent or non-executable, the captured refs file is discarded on cleanup and the push proceeds with no validation. That's the intended fallback, but it also means any failure to install the engine hook (typo in path, lost executable bit) becomes a silent no-op rather than an obvious error — defeating the purpose of having a pre-push hook at all. Consider logging a warning when the engine hook is not found.
Reviewed by Cursor Bugbot for commit 83cea0a. Configure here.
cursor[bot] 3151054134 (Low): _map_tool_call and _map_tool_output
both used ``item_id = str(getattr(raw, "id", "") or call_id)``. When
the SDK's raw item lacks an ``id``, both fall back to the same
``call_id``, so the two AgentContextItems collide on item_id.
AgentContext._index is keyed by item_id, so the second append()
silently overwrites the first — get_context_item then returns the
wrong item even though both still live in self.items.
Namespace the fallback: tool-call-{call_id} for the assistant call
and tool-result-{call_id} for the tool result. Items with intact
raw.id are unaffected. Includes a regression test that constructs
a tool_call/tool_output pair without raw.id and asserts distinct
item_ids; verified the test fails when the fix is reverted.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 14 total unresolved issues (including 13 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 1ad22ea. Configure here.
| input=input, | ||
| context=context, | ||
| max_turns=engine_config.root_agent.maximum_turns, | ||
| ) |
There was a problem hiding this comment.
_run_streamed hardcodes root agent's max turns
Low Severity
The _run_streamed closure ignores its agent kwarg and always passes engine_config.root_agent.maximum_turns to Runner.run_streamed. The runner is only bound to the root execution today, so this works, but if OpenAiAgentRunner is ever reused for a non-root agent (or subagent is configured with a different maximum_turns), the wrong cap will be silently applied. Reading max_turns from the agent that was passed in would remove this latent coupling.
Reviewed by Cursor Bugbot for commit 1ad22ea. Configure here.
… new finds Driven by a second sub-agent run that surfaced two new bugs and gave specific framework feedback. This commit lands the framework changes, promotes one probe to an example, and leaves one probe in place as a bug repro per the new lifecycle rules. probe_kit.py — three new helpers: - make_checker(): bundles the _check / _FAILURES / main() boilerplate that every probe duplicates inline. Returns (check, failures); call failures.report_and_exit_code() at the end of main(). - make_tool_context(state, *, agent_context, agent_execution): builds a ToolContext from an EngineRunState. Removes 5-line boilerplate from every unit-style tool probe. - check_raises(callable_or_coro, expected_type): asserts an expected exception is raised, returns the caught exception. Distinguishes legitimate try/except (asserting expected raises) from the forbidden use (hiding unexpected failures). README — significant rewrite around probe lifecycle, driven by user direction: - Probes that PASS are deleted by the agent. Probes that FAIL stay in place as the canonical bug repro; the user evaluates / deletes once fixed. Replaced 'probes are ephemeral' with explicit lifecycle rules in Hard rule 7 plus a top-level summary in the intro. - New 'Section 3 — Suggested example promotions' in the report format. The bar is novel TESTING TECHNIQUE, not code surface coverage. Defines what does and does not count as novel. - Process step 1 expanded with concrete module pointers (added agent_context_items, engine_output_bus, runner_protocol, tool_protocol, trace_store) per second-agent feedback that the prior list felt thin. - 'Areas worth probing' shortened to one-line teasers — the curated sub-bullets had been functioning as a checklist that pulled the agent toward listed items. Now genuinely a seed list. - Try/except clarification: hiding unexpected failures forbidden; asserting expected raises is fine — use check_raises. - Optional 'Repro:' line added to FAIL findings (only when probe file is left behind). - 'How to write a probe' skeleton updated to use make_checker / make_tool_context / check_raises. example_compaction.py — promoted from probe_compaction.py. Demonstrates two genuinely novel patterns the existing examples don't show: - Direct AgentContext probing without run_with_fake at all - _RecordingCompactor class (a real callable that records, not a mock) Surfaced bug #7: compacted plain-text assistant messages mis-rendered as 'Compacted tool calls' in agent_context.py:138-142. probe_max_turns_forwarding.py — kept as ephemeral probe (no novel technique; just runner.calls inspection that example_circuit_breaker already shows). Surfaces bug #6: AgentConfig.maximum_turns never forwarded to Runner.run_streamed at either root or subagent call sites. File stays per lifecycle rule #2 — it's the bug repro until the bug is fixed.


Note
High Risk
High risk due to adding a large new execution runtime (LLM calls, tool streaming/compaction, sandboxed code execution) and new release/CI pipelines that can affect publishing and developer workflows.
Overview
Adds a new
engine/uv package implementing the HALO agent runtime, including async streaming entrypoints (stream_engine_async/run_engine_async), shared output bus sequencing across root/subagents, context compaction that preserves valid tool-call turns, and an OpenAI Agents SDK runner with retriable-error classification and a circuit breaker.Introduces a sandboxed
run_codetool (bubblewrap on Linux /sandbox-execon macOS) with capped output and timeouts, plus supporting config/models and trace indexing/store wiring.Adds repo tooling and delivery: git hook shims in
.githooks/that delegate toengine/scripts/git-hooks, anengine/Taskfile.ymlfor setup/checks, new GitHub Actions workflows for engine lint/unit/integration/e2e and a multi-stage release workflow (version bump, build, PyPI publish), and a separatecli/package (halo-cli) exposing ahalo-engineTyper command for streaming runs.Reviewed by Cursor Bugbot for commit 1ad22ea. Bugbot is set up for automated code reviews on this repo. Configure here.