Skip to content

Feat/engine#2

Merged
francescov1 merged 122 commits into
mainfrom
feat/engine
Apr 28, 2026
Merged

Feat/engine#2
francescov1 merged 122 commits into
mainfrom
feat/engine

Conversation

@francescov1
Copy link
Copy Markdown
Contributor

@francescov1 francescov1 commented Apr 26, 2026

Note

High Risk
High risk due to adding a large new execution runtime (LLM calls, tool streaming/compaction, sandboxed code execution) and new release/CI pipelines that can affect publishing and developer workflows.

Overview
Adds a new engine/ uv package implementing the HALO agent runtime, including async streaming entrypoints (stream_engine_async/run_engine_async), shared output bus sequencing across root/subagents, context compaction that preserves valid tool-call turns, and an OpenAI Agents SDK runner with retriable-error classification and a circuit breaker.

Introduces a sandboxed run_code tool (bubblewrap on Linux / sandbox-exec on macOS) with capped output and timeouts, plus supporting config/models and trace indexing/store wiring.

Adds repo tooling and delivery: git hook shims in .githooks/ that delegate to engine/scripts/git-hooks, an engine/Taskfile.yml for setup/checks, new GitHub Actions workflows for engine lint/unit/integration/e2e and a multi-stage release workflow (version bump, build, PyPI publish), and a separate cli/ package (halo-cli) exposing a halo-engine Typer command for streaming runs.

Reviewed by Cursor Bugbot for commit 1ad22ea. Bugbot is set up for automated code reviews on this repo. Configure here.

…-format

Add sample compressed traces fixture for use by halo-engine e2e tests.

Made-with: Cursor
francescov1 and others added 2 commits April 27, 2026 13:52
Reverting the removal of the child_depth > maximum_depth check inside
guarded_invoke. The structural guard in _child_tools_for_depth (only
exposing call_subagent when depth < maximum_depth) makes this path
unreachable in normal flow, but the runtime check is cheap insurance
against a future refactor that bypasses the structural enforcement.

Comment now records that intent. New unit test constructs the tool
directly via _build_subagent_as_tool with child_depth=maximum_depth+1
to exercise the branch, addressing the original "unreachable" /
"dead code" complaint by making the guard testable and tested.
compaction_model=ModelConfig(name=E2E_MODEL),
# Force compaction to trigger even on small conversations
text_message_compaction_keep_last_messages=1,
tool_call_compaction_keep_last_messages=1,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale field name in plan's integration test code

Medium Severity

Task 11.4's integration test snippet uses tool_call_compaction_keep_last_messages=1, but EngineConfig defines the field as tool_call_compaction_keep_last_turns (with extra="forbid"). When an engineer or agentic worker follows the plan to create test_engine_compaction.py, the test will immediately fail with a Pydantic ValidationError on construction rather than testing compaction behavior.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f8bde0f. Configure here.

Comment thread engine/engine/agents/prompt_templates.py
self.items = list(items)
self.compaction_model = compaction_model
self.text_message_compaction_keep_last_messages = text_message_compaction_keep_last_messages
self.tool_call_compaction_keep_last_turns = tool_call_compaction_keep_last_turns
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead compaction_model field on AgentContext

Low Severity

AgentContext.__init__ accepts and stores compaction_model as a required field, but no method on AgentContext ever reads self.compaction_model. Compaction actually happens through the compactor callable injected into compact_old_items, which is produced by build_openai_compactor_factory(engine_config)AgentContext has no role in resolving the model. Every call site (including subagent_tool_factory) must pass the field for no effect, and the TODO comment in openai_agent_runner.py acknowledges the current two-path design.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit f8bde0f. Configure here.

Comment thread engine/engine/main.py
…nups

Wire agent_context into ToolContext + drop dead subagent code
ModelConfig.name was a Literal that included Claude variants but the
compactor and synthesis tool always built AsyncOpenAI() and called
chat.completions.create — a Claude name would 4xx at runtime.

Targeting the OpenAI-compatible chat-completions surface (de facto 2025
standard exposed by OpenAI, OpenRouter, Anthropic compat, vLLM, Together,
Groq, Ollama, LM Studio, Azure, etc.) covers the broad-compatibility
goal without adding a dependency:

- Drop Literal on ModelConfig.name (now plain str so any provider's
  model id is accepted)
- New ModelProviderConfig (base_url, api_key) on EngineConfig; each
  field independently falls back to the matching env var when None
- Thread it into the AsyncOpenAI client used by the compactor and
  synthesis tool, and into a new configure_default_sdk_client in
  openai_agent_runner so the Agents SDK honors the same endpoint
- Rename openai_compactor.py -> compactor.py and
  build_openai_compactor_factory -> build_compactor_factory now that
  the helper is provider-agnostic
Primary fix (cursor[bot] 3150208823, Medium):
- _drive caught BaseException and routed it through output_bus.fail()
  without re-raising, so CancelledError / KeyboardInterrupt / SystemExit
  were converted into normal task completions. Structured cancellation
  broke: an outer task.cancel() left the inner task ended-normally with
  a fail signal queued for a consumer that may have already torn down.
- Narrow the silent-fail branch to except Exception; for BaseException
  drain the bus then re-raise so the task transitions to cancelled.

Cleanups bundled (same file or trivially small):
- Rename inner runner -> agent_runner in _drive (3149854827) so it
  doesn't shadow the outer runner: RunnerProtocol parameter.
- Add the {maximum_parallel_subagents} placeholder to the subagent
  system prompt (3150184259); render_subagent_system_prompt accepted
  the kwarg but the template had no slot, so str.format silently
  dropped it.
- Anchor /data/ in .gitignore (3150106227) so the rule only matches
  the top-level data/ directory rather than any data/ at any depth.
…leanups

Let _drive cancel cleanly + small bug-bot cleanups
set_default_openai_client defaults use_for_tracing=True, which routes
the Agents SDK tracing exporter through the same custom client used
for model calls. Non-OpenAI providers (vLLM, Ollama, OpenRouter, etc.)
don't speak the tracing API, so tracing POSTs would either error or
silently drop — defeating a key goal of the PR.

Pass use_for_tracing=False so the tracing exporter stays on its
default OpenAI path while model calls go through the configured
endpoint.
"--",
"/venv/bin/python",
str(script_path),
]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded sandbox mount paths duplicated across modules

Medium Severity

The in-sandbox mount paths /mnt/trace/traces.jsonl, /mnt/trace/traces.jsonl.engine-index.jsonl, /venv, /venv/bin/python, and /workspace/bootstrap.py are hardcoded as string literals in both platform_commands.py and sandbox_runner.py. They must stay perfectly synchronized — if one is changed without the other, the sandbox silently breaks because the bootstrap script tries to load files at paths the bind mount no longer provides. Extracting them into shared constants would eliminate this drift risk.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 58a6ec6. Configure here.

Route LLM calls through configurable OpenAI-compatible endpoint
Comment thread engine/engine/tools/subagent_tool_factory.py
cursor[bot] 3150754673, High: a single asyncio.Semaphore was shared
across every depth, and guarded_invoke held its slot for the full
child execution including the parent's wait on its grandchildren.
With max_parallel_subagents parents holding every slot at depth N,
any depth-(N+1) grandchild blocks forever waiting on a slot its
parent is holding — and the parent stays parked inside runner.run
waiting for that grandchild's tool result. Reachable on default
config (max_depth=2, max_parallel=4) and at max_parallel=1 with even
a single grandchild spawn.

Replace the single semaphore with one per spawnable depth, built
once in build_root_sdk_agent. Each depth has its own pool so
parent-holding-slot can no longer block its own grandchildren —
they contend on different semaphores.

Worst-case concurrent agents at default config: P at each depth,
total P × D = 8. Hard cap, honest semantics.

Includes a regression test that pre-acquires the depth-1 slot
externally then drives a depth-2 tool via _build_subagent_as_tool,
verifying it completes within timeout. Verified the test fails
(TimeoutError) when guarded_invoke is reverted to a global slot.
Fix nested-subagent deadlock with per-depth semaphores
Comment thread .github/workflows/engine--release.yml
Comment thread .githooks/pre-commit

if [[ -x "$ENGINE_HOOK" ]]; then
"$ENGINE_HOOK"
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-commit hook silently no-ops when engine hook missing

Low Severity

Both hooks silently skip when $ENGINE_HOOK is not executable or doesn't exist. A developer who hasn't run task setup-git-hooks, or whose checkout has the file but without the executable bit, will get zero feedback that pre-commit/pre-push validation is being bypassed entirely. The PR discussion explicitly flagged "dont like fallbacks like this lets use one way of setting paths" — this silent fallback is the kind of behavior that change was meant to remove.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4be7579. Configure here.

Comment thread engine/engine/main.py
],
temperature=engine_config.compaction_model.temperature or 0.0,
)
return (response.choices[0].message.content or "").strip()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compactor lazy client reads unset model_provider

Low Severity

The factory closure mutates a shared openai_client via nonlocal, lazily constructing one AsyncOpenAI per factory but reusing it across executions. Two issues: (1) under concurrent subagent compaction the lazy if openai_client is None check is racy, possibly creating multiple clients; (2) the inline TODO already calls out this design — building one client per run and injecting it removes the branch entirely. As written, this is overly complex relative to the simpler single-client construction the TODO describes.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4be7579. Configure here.

cursor[bot] 3151031551 (Medium): release workflow ran
``uv version --bump … --no-sync`` then committed both pyproject.toml
and uv.lock — but ``--no-sync`` also skips the lock refresh, so the
release would push a uv.lock that lagged the bumped pyproject.toml.
Add an explicit ``uv lock`` step after the bump.

cursor[bot] 3151031562 (Low): when the consumer of
stream_engine_async cancels iteration, the outer ``except`` calls
task.cancel() but never awaits the inner _drive task. asyncio then
warns "Task was destroyed but it is pending" and any exception
_drive raises during cancellation cleanup is silently dropped. Add
a try/except-BaseException that drains the task before re-raising
the consumer's original exception.
chunk = await stream.read(65536)
if not chunk:
break
return bytes(buf)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sandbox stdout/stderr drained twice loses output

Medium Severity

In _read_capped, after the first loop fills buf exactly to cap, the second while True loop continues to read and discard the remaining stream. However, if the first loop exits because len(buf) == cap while the next chunk would have fit (the boundary case), or if the process emits less than cap total bytes, the second loop still runs an extra await stream.read(65536) after EOF — harmless but wasteful. More importantly, when output exceeds cap, bytes between cap and the next chunk boundary are silently discarded with no truncation marker, so users see partial output with no indication that it was capped.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 83cea0a. Configure here.

Comment thread .githooks/pre-push

if [[ -x "$ENGINE_HOOK" ]]; then
"$ENGINE_HOOK" "$REMOTE" "$URL" < "$REFS_FILE"
fi
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-push hook silently swallows refs when engine hook missing

Low Severity

The wrapper unconditionally consumes stdin via cat > "$REFS_FILE" before checking whether the engine hook exists. If the engine hook is absent or non-executable, the captured refs file is discarded on cleanup and the push proceeds with no validation. That's the intended fallback, but it also means any failure to install the engine hook (typo in path, lost executable bit) becomes a silent no-op rather than an obvious error — defeating the purpose of having a pre-push hook at all. Consider logging a warning when the engine hook is not found.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 83cea0a. Configure here.

Comment thread engine/engine/agents/openai_event_mapper.py Outdated
cursor[bot] 3151054134 (Low): _map_tool_call and _map_tool_output
both used ``item_id = str(getattr(raw, "id", "") or call_id)``. When
the SDK's raw item lacks an ``id``, both fall back to the same
``call_id``, so the two AgentContextItems collide on item_id.
AgentContext._index is keyed by item_id, so the second append()
silently overwrites the first — get_context_item then returns the
wrong item even though both still live in self.items.

Namespace the fallback: tool-call-{call_id} for the assistant call
and tool-result-{call_id} for the tool result. Items with intact
raw.id are unaffected. Includes a regression test that constructs
a tool_call/tool_output pair without raw.id and asserts distinct
item_ids; verified the test fails when the fix is reverted.
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 14 total unresolved issues (including 13 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 1ad22ea. Configure here.

Comment thread engine/engine/main.py
input=input,
context=context,
max_turns=engine_config.root_agent.maximum_turns,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_run_streamed hardcodes root agent's max turns

Low Severity

The _run_streamed closure ignores its agent kwarg and always passes engine_config.root_agent.maximum_turns to Runner.run_streamed. The runner is only bound to the root execution today, so this works, but if OpenAiAgentRunner is ever reused for a non-root agent (or subagent is configured with a different maximum_turns), the wrong cap will be silently applied. Reading max_turns from the agent that was passed in would remove this latent coupling.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 1ad22ea. Configure here.

@francescov1 francescov1 merged commit 31c844e into main Apr 28, 2026
3 checks passed
francescov1 pushed a commit that referenced this pull request Apr 29, 2026
… new finds

Driven by a second sub-agent run that surfaced two new bugs and gave
specific framework feedback. This commit lands the framework changes,
promotes one probe to an example, and leaves one probe in place as a
bug repro per the new lifecycle rules.

probe_kit.py — three new helpers:
- make_checker(): bundles the _check / _FAILURES / main() boilerplate
  that every probe duplicates inline. Returns (check, failures); call
  failures.report_and_exit_code() at the end of main().
- make_tool_context(state, *, agent_context, agent_execution): builds
  a ToolContext from an EngineRunState. Removes 5-line boilerplate from
  every unit-style tool probe.
- check_raises(callable_or_coro, expected_type): asserts an expected
  exception is raised, returns the caught exception. Distinguishes
  legitimate try/except (asserting expected raises) from the forbidden
  use (hiding unexpected failures).

README — significant rewrite around probe lifecycle, driven by user
direction:
- Probes that PASS are deleted by the agent. Probes that FAIL stay in
  place as the canonical bug repro; the user evaluates / deletes once
  fixed. Replaced 'probes are ephemeral' with explicit lifecycle rules
  in Hard rule 7 plus a top-level summary in the intro.
- New 'Section 3 — Suggested example promotions' in the report format.
  The bar is novel TESTING TECHNIQUE, not code surface coverage.
  Defines what does and does not count as novel.
- Process step 1 expanded with concrete module pointers (added
  agent_context_items, engine_output_bus, runner_protocol, tool_protocol,
  trace_store) per second-agent feedback that the prior list felt thin.
- 'Areas worth probing' shortened to one-line teasers — the curated
  sub-bullets had been functioning as a checklist that pulled the agent
  toward listed items. Now genuinely a seed list.
- Try/except clarification: hiding unexpected failures forbidden;
  asserting expected raises is fine — use check_raises.
- Optional 'Repro:' line added to FAIL findings (only when probe file
  is left behind).
- 'How to write a probe' skeleton updated to use make_checker /
  make_tool_context / check_raises.

example_compaction.py — promoted from probe_compaction.py. Demonstrates
two genuinely novel patterns the existing examples don't show:
- Direct AgentContext probing without run_with_fake at all
- _RecordingCompactor class (a real callable that records, not a mock)
Surfaced bug #7: compacted plain-text assistant messages mis-rendered
as 'Compacted tool calls' in agent_context.py:138-142.

probe_max_turns_forwarding.py — kept as ephemeral probe (no novel
technique; just runner.calls inspection that example_circuit_breaker
already shows). Surfaces bug #6: AgentConfig.maximum_turns never
forwarded to Runner.run_streamed at either root or subagent call sites.
File stays per lifecycle rule #2 — it's the bug repro until the bug
is fixed.
francescov1 added a commit that referenced this pull request Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants