feat: add _FailureCounter utility for session-scoped circuit breakers (Closes #571) #6

Closed
Da-Mikey wants to merge 657 commits into main from evolution/issue-571-failure-counter-lexus

Conversation

@Da-Mikey

Owner

Cherry-pick of the _FailureCounter from the original PR onto current Lexus2016 main. Clean cherry-pick — 2 files, 276 insertions, 20/20 tests green.

alelpoan and others added 30 commits June 21, 2026 12:52
…o README

Follow-up on salvaged #50347: the event surface table was missing the
billing.step_up.verification switch case, and the File map omitted
lib/perfPane.tsx.
…atch_tool (#50352)

Answers a recurring plugin-author question: how to read the active
profile and drive Hermes from inside a hook callback when ctx._cli_ref
is None (gateway, hermes chat -q, and kanban-spawned worker sessions).

- Adds a 'Act from inside a hook' section to the plugin guide covering
  ctx.profile_name and ctx.dispatch_tool as the session-agnostic APIs,
  with a kanban_task_blocked example, and notes there is no in-process
  slash-command bridge for headless workers (shell out via the terminal
  tool instead).
- Adds the three kanban lifecycle hooks to the hook reference table with
  their process semantics.
- Pins the contract with a regression test: ctx.dispatch_tool invokes a
  tool handler with _cli_ref=None (worker/hook context).

Requested by @Smithangshu on Discord.
… launch

The Windows update path can leave tracked ui-tui/ files deleted in the
working tree (HEAD intact). The guard now self-heals: when ui-tui/ is
missing in a git checkout, run `git restore -- ui-tui` and continue,
falling back to the printed manual-recovery steps only when git can't
recover it (no checkout / restore failed).

Builds on konsisumer's missing-workspace guard.
… ui-tui

Root cause of #49145: the Windows ZIP-update path did rmtree(dst) then
copytree(src, dst). If the copy failed partway — common on that path,
which only runs because file I/O is already flaky on the machine — the
directory was left deleted with nothing copied back. ui-tui/ vanishing
is what broke 'hermes --tui' (WinError 267), but the bug hit every
top-level directory.

_atomic_replace_dir stages the new copy into a sibling temp dir and only
swaps it in on full success, restoring the original on failure. A failed
update now leaves the live tree untouched instead of half-deleted.
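
The stage-then-swap idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual `_atomic_replace_dir`: the function name, the temp-dir prefixes, and the exact rollback order are assumptions.

```python
import os
import shutil
import tempfile


def atomic_replace_dir(src: str, dst: str) -> None:
    """Replace dst with a copy of src; leave dst untouched if anything fails."""
    parent = os.path.dirname(os.path.abspath(dst))
    # Stage next to dst so the final swap is a rename on the same volume.
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    backup = None
    try:
        # A failure here leaves the live tree exactly as it was.
        shutil.copytree(src, staging, dirs_exist_ok=True)
        if os.path.exists(dst):
            backup = tempfile.mkdtemp(prefix=".backup-", dir=parent)
            os.rmdir(backup)        # keep only the unique name
            os.rename(dst, backup)  # move the live tree aside
        try:
            os.rename(staging, dst)  # swap the staged copy in
        except OSError:
            if backup is not None:
                os.rename(backup, dst)  # restore the original tree
                backup = None
            raise
    finally:
        # Clean up whatever temporary directories still exist.
        shutil.rmtree(staging, ignore_errors=True)
        if backup is not None:
            shutil.rmtree(backup, ignore_errors=True)
```

The key property is that the destructive step (moving the live tree aside) only happens after the full copy has succeeded.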
In Telegram streaming, the typing indicator persisted through the slow
final rich-text/MarkdownV2 finalize edit, so the '...typing' bubble
lingered for seconds after the last streamed token. Add a one-shot
on_before_finalize hook to GatewayStreamConsumer, fired once when the
stream transitions into its finalization path, and wire it on both
Telegram streaming call sites to call pause_typing_for_chat() before
the final edit. Cover hook ordering and once-only behavior in tests.

Fixes #49712
…d restart loop (#50381)

When a Windows user relaunches Hermes while an in-app update is still
running (the desktop vanished with no progress and looks crashed), the
fresh instance spawns its own dashboard backend. That backend re-locks
the venv shim, the updater's straggler cleanup (force_kill_other_hermes
-> taskkill /F /T /IM hermes.exe) kills it, the launch dies with the 45s
"backend didn't come up" timeout, and the user relaunches into the same
trap -- an infinite respawn/kill loop (#50238).

Root cause: no mutual exclusion between an applying update and a fresh
desktop spawning its own local backend.

Fix: the updater publishes a HERMES_HOME/.hermes-update-in-progress
marker (pid + start time) for the whole run via an RAII drop-guard that
removes it on every exit path (success, early return, panic). A
freshly-launched desktop checks the marker before spawning its local
backend and PARKS until the update finishes -- then brings the backend
up itself (it is the surviving instance; the updater's own relaunch hits
the single-instance lock and quits). A stale marker (dead pid or past a
20-minute ceiling) is pruned so a crashed updater can never strand
future launches. No rogue backend spawns mid-update, so
force_kill_other_hermes has nothing legitimate to kill.
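
The staleness rule (dead pid or older than the 20-minute ceiling) might look like the sketch below. The real logic lives in update-marker.cjs and the Rust guard; this Python version assumes a JSON marker with hypothetical `pid` and `started_at` fields and takes the liveness check as an injected callable.

```python
import json

STALE_CEILING_SECONDS = 20 * 60  # the 20-minute ceiling from the commit message


def marker_is_stale(raw: str, now: float, pid_alive) -> bool:
    """Return True when the update marker should be pruned."""
    try:
        data = json.loads(raw)
        pid = int(data["pid"])
        started_at = float(data["started_at"])
    except (ValueError, KeyError, TypeError):
        # Assumption: an unreadable marker is treated as stale.
        return True
    if not pid_alive(pid):
        return True  # the updater died without removing its marker
    return now - started_at > STALE_CEILING_SECONDS
```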

Marker parse/staleness logic is extracted to update-marker.cjs and
unit-tested; the Rust guard has unit tests; the Rust-write <-> JS-read
contract is E2E-verified.
Baileys' jidDecode crashes ("Cannot destructure property 'user' of
jidDecode(...) as it is undefined") when handed a bare phone number, so
sending a WhatsApp message to +50766715226 / 50766715226 returned HTTP
500 and never delivered (#8637).

Add to_whatsapp_jid() to gateway/whatsapp_identity.py — the outbound
inverse of normalize_whatsapp_identifier: it builds the JID a send must
use (bare phone -> <digits>@s.whatsapp.net) and passes through already
qualified JIDs (@g.us, @lid, status@broadcast, @newsletter) unchanged.
Wire it at every outbound bridge call site in the WhatsApp adapter
(send, edit, media, typing, get_chat_info, and the standalone cron /
send_message sender).
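
A minimal sketch of the outbound conversion, assuming that anything containing `@` is already a qualified JID and that an input with no digits should raise (both are illustrative choices, not confirmed details of `to_whatsapp_jid()`):

```python
import re


def to_whatsapp_jid(target: str) -> str:
    """Build the JID a send must use from a phone number or an existing JID."""
    target = target.strip()
    if "@" in target:
        # Already qualified (@g.us, @lid, status@broadcast, @newsletter, ...).
        return target
    digits = re.sub(r"\D", "", target)  # drop "+", spaces, dashes
    if not digits:
        raise ValueError(f"not a phone number or JID: {target!r}")
    return f"{digits}@s.whatsapp.net"
```

With this shape, `+50766715226` and `50766715226` both map to `50766715226@s.whatsapp.net`, while a group JID passes through untouched.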

Co-authored-by: Hermes Agent <noreply@nousresearch.com>
Per @egilewski's audit on this PR (#15544), the original fix was
correct but the file has refactored since: the four endpoint-local
empty-peer checks have been consolidated into _ws_client_is_allowed
and _ws_client_reason, but the helpers were left fail-open ('no peer
host known means allow' / 'no reason to block').

On a loopback-bound dashboard with auth disabled, an ASGI server
behind a misconfigured proxy or a unix-socket transport can deliver
ws.client == None or ws.client.host == ''. The helpers were treating
that as 'allowed', so the loopback-only peer gate could be bypassed
by anything that suppressed the client tuple in transit. All four
WebSocket endpoints (/api/pty, /api/ws, /api/pub, /api/events) route
through _ws_request_is_allowed -> _ws_client_is_allowed, so the gap
applied uniformly.

Fix:

* _ws_client_is_allowed: return False when client_host is empty
  instead of True. Only reached on loopback bind with auth disabled
  (auth_required=True and explicit non-loopback binds short-circuit
  earlier), so the fail-closed behavior is scoped to the surface
  that needs it.

* _ws_client_reason: return a 'missing_or_empty_peer bound=...'
  block reason instead of None, so the dispatcher's existing
  reason-based rejection path picks it up and the close gets logged
  with a machine-parseable token for diagnosability.

Behavior unchanged for:

* gated mode (auth_required=True) — early-returns True before the
  empty-peer check runs. The OAuth ticket is the auth at that point.
* explicit non-loopback bind (--host 0.0.0.0/::, or a specific LAN
  address, always with --insecure) — early-returns True before the
  empty-peer check runs. DNS-rebinding is still blocked by the
  Host/Origin guard in _ws_host_origin_is_allowed.
* legitimate loopback peers (client_host == '127.0.0.1' / '::1') —
  not affected by the empty-peer branch.
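
The decision order described above can be summarized in a small sketch. The parameter names (`auth_required`, `bound_to_loopback`) are hypothetical stand-ins for however the real `_ws_client_is_allowed` receives its state.

```python
_LOOPBACK_HOSTS = {"127.0.0.1", "::1"}


def ws_client_is_allowed(client_host, *, auth_required: bool, bound_to_loopback: bool) -> bool:
    """Peer gate: fail closed on a missing peer only in loopback/no-auth mode."""
    if auth_required:
        return True  # gated mode: the OAuth ticket is the auth
    if not bound_to_loopback:
        return True  # explicit non-loopback bind: Host/Origin guard applies instead
    if not client_host:
        return False  # None or '' peer: fail closed
    return client_host in _LOOPBACK_HOSTS
```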

Regression tests added in tests/hermes_cli/test_dashboard_auth_ws_auth.py:

* test_empty_client_host_rejected_in_loopback_mode
* test_missing_client_object_rejected_in_loopback_mode
* test_empty_client_host_reason_is_block

Plus two regression guards to ensure the fix does not over-reach:

* test_empty_client_host_still_allowed_in_insecure_public_mode
* test_empty_client_host_still_allowed_in_gated_mode

All three new fail-closed tests fail without this patch (the helpers
return True / None for an empty peer) and pass with it. The 45
pre-existing tests in test_dashboard_auth_ws_auth.py continue to pass.
… session (#50375)

When a /model switch resolves a valid model but the in-place agent swap
fails mid-conversation (expired key, unreachable base_url), the agent
rolls itself back to the old working model+client and re-raises. The
callers caught that re-raise, logged a warning, then committed the broken
switch anyway: wrote the failed model to the session DB, set
_session_model_overrides to the broken model/provider/key, and (gateway
direct path) evicted the working cached agent. The next message then
rebuilt a dead agent from the broken override -> permanently unusable
conversation (#50163).

Fix the whole caller class so a failed swap aborts the commit entirely:

- gateway/slash_commands.py (picker + direct /model paths): on swap
  failure, early-return an error message; skip DB persist, session
  override, cache eviction, and config write.
- cli.py (both /model handlers): snapshot CLI-level credential/runtime
  fields before mutating, restore them on swap failure, and abort the
  note + success print.
- tui_gateway/server.py: wrap the previously-unguarded swap; on failure
  raise a clean error and skip worker restart, runtime persist, switch
  marker, session model_override, and config persist.

The no-cached-agent path (apply-on-next-session) is unaffected.

Adds a gateway regression test that fails on the pre-fix behavior.
… (#50373)

On Windows, _pause_windows_gateways_for_update() force-kills every running
gateway before mutating the venv. Gateways mapped to a profile (via
profile.path/gateway.pid) were respawned afterward, but gateways with NO
profile mapping — e.g. a Windows Scheduled Task running
"pythonw.exe -m hermes_cli.main gateway run" — were force-killed and only
told to restart manually. After an auto-update/bootstrap the Telegram bot
stayed dead until manual intervention.

Now we snapshot each unmapped gateway's argv (psutil, guarded by
looks_like_gateway_command_line) before the kill and replay it through the
same detached watcher used for profile gateways, so unmapped gateways come
back automatically too.

Co-authored-by: Hermes Agent <agent@nousresearch.com>
…ers (#50385)

A bare custom provider configured via `model.api_base` (the intuitive name
OpenAI-SDK / LiteLLM users reach for) was silently ignored: `hermes config set`
accepts any dotted key, so `model.api_base` got written and confirmed, but the
runtime resolver reads only `model.base_url`. Requests fell back to OpenRouter
with an empty key -> 401, zero hits to the custom endpoint (issue #8919).

Now api_base is migrated to base_url at load time (fixes existing broken
configs) and at set time (with a notice), never overriding an explicit
base_url. Closes #8919.
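
As a rough sketch of the migration rule (the function name is hypothetical, and dropping the legacy key after migration is an assumption):

```python
def migrate_api_base(model_cfg: dict) -> dict:
    """Move model.api_base to model.base_url without overriding an explicit base_url."""
    cfg = dict(model_cfg)  # do not mutate the caller's config
    api_base = cfg.pop("api_base", None)
    if api_base and not cfg.get("base_url"):
        cfg["base_url"] = api_base
    return cfg
```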
A dangerous-command gateway approval blocks the agent's execution thread
inside _await_gateway_decision() on threading.Event.wait() until the user
responds or the 5-minute approval timeout fires. The poll loop never checked
is_interrupted(), so /stop (which flags the agent's execution thread via
AIAgent.interrupt()) was silently ignored — the session stayed wedged until
timeout, even though /stop reported the session unlocked.

Check is_interrupted() at the top of the poll loop. The wait runs on the
agent's execution thread, the exact thread interrupt() flags, so the check
sees the signal and resolves the pending approval as deny — the agent loop
receives a normal denial and unwinds cleanly. Covers /stop, /new, and the
gateway inactivity-timeout interrupt through the single shared wait loop used
by both the terminal and execute_code guards.
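
A simplified version of an interrupt-aware wait loop is sketched below. The function name, the poll interval, and the string return values are illustrative; the real `_await_gateway_decision()` may differ.

```python
import threading
import time


def await_decision(event: threading.Event, is_interrupted, timeout: float = 300.0, poll: float = 0.1) -> str:
    """Wait for an approval decision, resolving as deny if the thread is interrupted."""
    deadline = time.monotonic() + timeout
    while True:
        # Check the interrupt flag at the top of every poll iteration.
        if is_interrupted():
            return "deny"
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "timeout"
        # Short waits keep the loop responsive to interrupts.
        if event.wait(min(poll, remaining)):
            return "decided"
```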
- Add thread-scoped regression test: interrupt on the waiting thread resolves
  the approval as deny well under the 300s timeout; a foreign-thread interrupt
  does NOT release the wait (interrupts are per-thread).
- Add panghuer023 to AUTHOR_MAP for the salvaged #37994 fix.
…nnecting

The email adapter read address/host purely from env vars and never stripped
them, so a missing or whitespace-padded EMAIL_IMAP_HOST reached
imaplib.IMAP4_SSL("") and surfaced as the misleading
"[Errno 8] nodename nor servname provided, or not known" — sending users down a
DNS rabbit hole when the real problem was an empty/dirty host string. A
config.yaml-only setup also left the host empty because __init__ ignored
PlatformConfig.extra, even though the "connected" check, the send helper, and
`hermes config show` already read address/imap_host/smtp_host from it.

Resolve address/imap_host/smtp_host from the env var first, then fall back to
config.extra, and strip surrounding whitespace — matching the send helper's
existing pattern. Validate the required settings at the start of connect() and
return False with an actionable message instead of attempting a connection with
an empty host.
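
The resolution order (env var first, then config.extra, always stripped) could be expressed as below. The helper name and the injectable `environ` argument are assumptions made to keep the sketch testable.

```python
import os


def resolve_setting(env_name: str, extra: dict, extra_key: str, environ=None) -> str:
    """Return the stripped env value, falling back to config.extra; '' if neither is set."""
    environ = os.environ if environ is None else environ
    value = (environ.get(env_name) or "").strip()
    if not value:
        value = str(extra.get(extra_key) or "").strip()
    return value
```

A caller would then refuse to connect when the result is empty, instead of handing `""` to `imaplib.IMAP4_SSL`.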

Adds regression tests for whitespace stripping, config.extra fallback, and the
no-IMAP-attempt-on-missing-host path.
…ars (#40715)

Fold in the #40715 blank-env OOM fix on top of the host-resolution change:
- connect() now sets a non-retryable fatal error when required settings are
  missing, so the gateway stops reconnecting against an empty host instead of
  looping forever and leaking memory until the host OOM-kills.
- check_email_requirements() treats blank/whitespace-only EMAIL_* values as
  missing, so an abandoned setup with empty keys no longer enables the platform.

Credits the parallel fixes by zerone0x (#40745) and liuhao1024 (#40829).
…e delivery

When a streamed Telegram reply finalizes, the stream consumer could take
the fresh-final path (send a new sendRichMessage + best-effort delete the
preview) purely because the time-based _should_send_fresh_final()
threshold elapsed — even though Telegram's prefers_fresh_final_streaming
returns False. The fresh Rich Message then overlapped the legacy
MarkdownV2 preview already on screen, leaving both visible (the #47048
table + bullet double-render).

Honor the adapter's decision: when prefers_fresh_final_streaming exists
on the adapter (checked on the class + instance __dict__ so MagicMock
auto-attrs don't false-positive) and declines, the time threshold no
longer overrides it. Adapters without the hook keep the time-based
fresh-final for backward compat.

Fixes #47048
…bypass

ipaddress.ip_address() raises ValueError on IPv6 addresses with scope
IDs (e.g. 'fe80::1%eth0'). Both is_always_blocked_url() and is_safe_url()
silently skipped these via `except ValueError: continue`.

If ALL resolved addresses for a hostname carry scope IDs, every address
is skipped and the URL passes all safety checks — a potential SSRF
bypass vector against link-local or metadata endpoints.

Fix:
- Strip the scope ID (%eth0) before parsing in both functions
- is_safe_url(): fail closed (return False) with a warning log if still
  unparseable after stripping
- is_always_blocked_url(): use continue (not return False) to preserve
  multi-address scanning, with a warning log

Affected: tools/url_safety.py — is_always_blocked_url(), is_safe_url()
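
A minimal sketch of the strip-then-fail-closed behavior for a single address. The function names and the exact set of blocked address classes are assumptions; the real checks in tools/url_safety.py operate on resolved hostnames.

```python
import ipaddress


def parse_ip(addr: str):
    """Parse an address after dropping any IPv6 scope ID such as '%eth0'."""
    host = addr.split("%", 1)[0]
    try:
        return ipaddress.ip_address(host)
    except ValueError:
        return None


def is_safe_address(addr: str) -> bool:
    ip = parse_ip(addr)
    if ip is None:
        return False  # fail closed when the address is still unparseable
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified
    )
```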
Follow-up to the salvaged #25961 fix: regression tests asserting that
scope-bearing IPv6 addresses (fe80::1%eth0, ::1%lo) are blocked by
is_safe_url after the scope is stripped, that a still-unparseable address
fails closed, and that a scoped IPv4-mapped IMDS address is caught by the
always-blocked floor.
…Closes Lexus2016#432) (Lexus2016#436)

Add mutating/idempotent tool-aware thresholding to the loop guard,
so mutating tools (terminal, write_file, execute_code, etc.) trigger
spiral detection at half the threshold of read-only tools.

=== Changes ===

agent/loop_guard.py:
- Add _MUTATING_TOOLS and _IDEMPOTENT_TOOLS frozensets with category
  threshold constants: mutating repeat=4/fail=2/escalate=8 vs
  idempotent repeat=8/fail=4/escalate=15
- Add _tool_category() and _tool_spiral_score() helper functions
- Update maybe_nudge() to auto-select thresholds based on tool type
- Add ESCALATED INTERRUPT level: when a spiral exceeds the escalate
  threshold, the nudge becomes a directive requiring the agent to
  summarize progress before continuing
- Include spiral-intensity score in high-count nudges so the model
  sees the evidence of fixation
- Unknown tools (MCP, plugins) default to the safer mutating thresholds
- Fix _EXIT_CODE_RE regex to correctly use \s (whitespace) and \d (digit)
  character classes
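
The category-based threshold selection can be sketched as follows. The threshold numbers and the three mutating tool names come from the list above; `read_file` as an idempotent tool, the `nudge_level` helper, and the inclusive comparisons are assumptions.

```python
_MUTATING_TOOLS = frozenset({"terminal", "write_file", "execute_code"})
_IDEMPOTENT_TOOLS = frozenset({"read_file"})  # hypothetical membership

_THRESHOLDS = {
    "mutating": {"repeat": 4, "fail": 2, "escalate": 8},
    "idempotent": {"repeat": 8, "fail": 4, "escalate": 15},
}


def tool_category(name: str) -> str:
    # Unknown tools (MCP, plugins) fall back to the stricter mutating thresholds.
    return "idempotent" if name in _IDEMPOTENT_TOOLS else "mutating"


def nudge_level(name: str, repeat_count: int):
    """Return None, 'nudge', or 'escalated_interrupt' for a repeated tool call."""
    limits = _THRESHOLDS[tool_category(name)]
    if repeat_count >= limits["escalate"]:
        return "escalated_interrupt"
    if repeat_count >= limits["repeat"]:
        return "nudge"
    return None
```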

tests/agent/test_loop_guard.py:
- Split mutating/idempotent threshold coverage across both tool types
- Add TestEscalatedInterrupt class (8 tests): verify escalated interrupt
  fires at correct thresholds for mutating, idempotent, and unknown tools
- Test spiral-intensity annotations appear at high counts
- Update existing tests to reflect new mutating thresholds
- Add 11 new test cases (15 -> 26 total)

agent/conversation_loop.py:
- Log a warning when an ESCALATED INTERRUPT fires, so operators and
  log aggregators can detect deep spiral patterns

=== Testing ===
26/26 loop_guard tests pass, 9/9 guardrail runtime tests pass (35 total)

Co-authored-by: Hermes Evolution <evolution@hermes-agent.nousresearch.com>
Secret redaction only matched `Authorization: Bearer <token>`. Other auth
headers passed through verbatim into logs, tool output, and transcripts:

- `Authorization: Basic <base64>` — leaks base64(user:password)
- `Authorization: token <pat>` / any non-Bearer scheme
- `Proxy-Authorization: ...`
- `x-api-key: <key>` (Anthropic and many providers) and `api-key`,
  `x-goog-api-key`, `x-auth-token`, `x-access-token`, ... — opaque values with
  no known vendor prefix were caught by nothing

A logged request or an echoed `curl -H "x-api-key: ..."` command therefore
leaked live credentials.

Generalize the Authorization rule to mask the credential for any scheme (and
Proxy-Authorization) while preserving the header name and scheme word for
debuggability, and add an api-key header rule for the single-opaque-value
headers. Bearer behavior is unchanged; plain prose containing the word
"authorization" (no colon-delimited value) is left untouched.

Adds regression tests for Basic/token/Proxy auth and the x-api-key/api-key
headers, including inside a curl command.
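
The two rules might be approximated with regular expressions like these. The patterns, the `***` mask, and the header list are illustrative assumptions, not the project's actual redaction rules.

```python
import re

# Any Authorization / Proxy-Authorization scheme: keep header and scheme, mask the credential.
_AUTH_RE = re.compile(
    r"(?i)\b((?:proxy-)?authorization\s*:\s*)([A-Za-z][\w-]*)(\s+)([^\s\"']+)"
)
# Single-opaque-value headers: keep the header name, mask the value.
_API_KEY_RE = re.compile(
    r"(?i)\b((?:x-api-key|api-key|x-goog-api-key|x-auth-token|x-access-token)\s*:\s*)([^\s\"']+)"
)


def redact_headers(text: str) -> str:
    text = _AUTH_RE.sub(r"\1\2\3***", text)
    return _API_KEY_RE.sub(r"\1***", text)
```

Because both patterns require a colon after the header name, plain prose that merely mentions "authorization" is left alone.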
fix(windows): prefer cmd npm shim on PATH fallback
Automated security fix generated by Orbis Security AI
…n _ensure_loaded

Addresses PR #9560 review comments: applies the CWE-22 fix to current main
(post-PR Lexus2016#458 rebase) and adds the requested regression tests.

- SessionEntry.from_dict now raises ValueError for session_key or session_id
  containing '..' or starting with '/' or '\' (directory traversal guard)
- SessionStore._ensure_loaded moves per-entry validation inside the loop so
  one malicious/corrupt entry is skipped with a warning instead of aborting
  the entire sessions.json load
- Adds TestSessionEntryFromDictTraversalValidation (5 cases) and
  TestEnsureLoadedSkipsInvalidEntries covering the skip-not-abort behavior

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion

Extends the CWE-22 path traversal guard to cover Windows absolute paths
of the form C:/... and D:\... — previously only leading / and \ were
checked, which missed drive-letter prefixes. Replaces the inline
startswith check with a compiled module-level regex (_TRAVERSAL_RE) that
covers all three attack patterns: .., leading /\, and leading X: drives.
Adds two regression tests for C:/windows/system32 and D:\\path\\to\\file.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…den non-leading separators

Follow-up to the salvaged #9560 fix:
- Replace the _TRAVERSAL_RE regex with an explicit _is_path_unsafe() helper
  (drops the now-unused `import re`); catches a path separator ANYWHERE,
  not just leading, so a non-leading Windows backslash can't slip through.
- Switch the per-entry skip in _ensure_loaded_locked from print() to
  logger.warning to match the module's logging conventions.
- Add AUTHOR_MAP entry for the contributor.
- Add regression tests for the non-leading-separator case.
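
Taken together, the three traversal commits describe a check along these lines. This is a sketch of the rules as stated (`..`, any path separator, leading drive letter), not the actual `_is_path_unsafe()` body.

```python
def is_path_unsafe(value: str) -> bool:
    """Return True if a session key/id could escape its directory when used in a path."""
    if ".." in value:
        return True  # parent-directory traversal
    if "/" in value or "\\" in value:
        return True  # a separator anywhere, not just leading
    if len(value) >= 2 and value[0].isalpha() and value[1] == ":":
        return True  # Windows drive prefix such as C: or D:
    return False
```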
Lexus2016 and others added 24 commits June 24, 2026 12:22
…rt budget on LOW_SELECTION_EFFICIENCY (Lexus2016#507)

Prod metrics show selection_efficiency = merged/selected = 12% (window:
selected=57, merged=7): evolution-analysis selects up to Max total effort 3.0
per cycle BLINDLY, ignoring the pipeline's real land-rate (~0.7 merges/cycle).
That is poor self-capability calibration — picking more than it can land — and
trips the watchdog flag LOW_SELECTION_EFFICIENCY.

Fix (prompt-level, in evolution-analysis/SKILL.md only): before final
selection, read the existing evolution-health.txt sidecar ([evolution-metrics]
line). When LOW_SELECTION_EFFICIENCY is flagged, throttle max_total_effort from
3.0 to 1.5 (select ~half as much) and spend the smaller budget on the
highest-land-confidence issues. Missing/OK signal → default 3.0.

Infra already existed (not recreated): evolution_metrics.py computes the metric
and emits the flag (<0.34); evolution_funnel.py writes it to the sidecar. This
mirrors the established read-sidecar-signal precedent (research step 0 reads
funnel-summary.txt; analysis step 6c reads realized-impact.txt).

Scoring formula, weights, decomposition gate, split rule, and anti-starvation
slot are UNCHANGED — only the budget SIZE is now adaptive.

Tested: test_evolution_skill_integrity (skill-lint, no dead wiring),
test_evolution_metrics, test_evolution_funnel all pass; E2E confirmed the
funnel writes LOW_SELECTION_EFFICIENCY into evolution-health.txt.
…mit/auth cooldowns (Lexus2016#510)

So 429/auth failover paths record Retry-After / default cooldowns for
the failed provider and skip it when advancing the fallback chain.

Closes Lexus2016#478

Co-authored-by: Hermes Evolution <evolution@hermes.ai>
…ache (Lexus2016#509)

Recurring cron jobs build their session_id as cron_<job_id>_<timestamp>,
so each fire gets a fresh session_id. The Codex/Responses transport uses
session_id as prompt_cache_key (and the cache-scope routing headers),
meaning every cron fire pays a cold-start LLM cache — the static prefix
(agent identity + tool guidance + job prompt) is recomputed in full on
each run. Interactive sessions are unaffected (one conversation keeps one
stable session_id).

Thread an optional cache_key through AIAgent -> init_agent -> the
transport build_kwargs, defaulting to session_id when absent so the
interactive path is byte-identical. Cron passes the constant
cron_<job_id>, so repeated fires share one warm cache key while
session_id still rotates per run for transcript isolation.

The cache split is deliberate: prompt_cache_key and the Codex cache-scope
headers (session_id / x-client-request-id) follow the stable cache_key,
but xAI's x-grok-conv-id stays on the per-run session_id so distinct
fires aren't merged into one conversation. cache_key is threaded into all
three transport build_kwargs call sites (Responses, profile, legacy), so
the warm key applies regardless of which transport the cron agent selects;
non-Codex transports ignore the unknown param via **params.

prompt_cache_key is a routing hint, never a correctness boundary — a
stale or wrong key only causes a cache miss, never a wrong result.

Reimplements the idea from PR Lexus2016#488 (by @Da-Mikey) cleanly on current
main; that PR was unmergeable — branched ~298 commits behind and would
have reverted unrelated work across gateway/desktop/docs.

- agent/agent_init.py: +cache_key param + agent.cache_key assignment
- run_agent.py: thread cache_key through AIAgent
- agent/chat_completion_helpers.py: pass cache_key to build_kwargs (x3)
- agent/transports/codex.py: cache_key drives prompt_cache_key + Codex
  cache headers + xAI extra_body; x-grok-conv-id stays on session_id
- cron/scheduler.py: pass cache_key=f"cron_{job_id}"
- tests: 4 new cases (override, fallback, codex headers, xAI body/conv split)
Implements the cron.thinking config option, defaulting thinking mode off for cron sessions.

Follow-up fixes to make CI green:
- cron/evolution/hydra.yaml: dropped the stage skills: block (Hydra is a pure delegator dispatching via delegate_task; listing script-running skills under a [file, delegation] toolset was dead wiring per evolution_skill_lint).
- scripts/release.py: mapped the Hermes Evolution agent author email in AUTHOR_MAP.
Adds a per-session circuit breaker around Honcho dialectic queries so a
failing backend stops burning API credits after 5 consecutive failures.

- Trip after 5 consecutive exceptions from the Honcho SDK/backend.
- 120s cooldown; half-open probe resets on success or re-trips on failure.
- Empty (but exception-free) results do NOT increment the failure counter,
  so sparse profiles don't disable memory.
- Replace the silent logger.warning with logger.exception so failures are
  observable in logs.
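
A compact sketch of the breaker behavior listed above (trip at 5 consecutive failures, 120s cooldown, half-open probe). The class and method names are hypothetical, and the injectable clock exists only to make the sketch testable.

```python
import time


class CircuitBreaker:
    def __init__(self, threshold: int = 5, cooldown: float = 120.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        """Closed: always allow. Open: allow a probe only after the cooldown."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        # Called for any exception-free result, including an empty one.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        # Called only when the backend raised.
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()  # trip, or re-trip after a failed probe
```

Callers would check `allow()` before each query and report the outcome afterwards, so a sparse profile that returns nothing never counts against the backend.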

Closes Lexus2016#463

Co-authored-by: Hermes Evolution <evolution@hermes.ai>
…16#515, Closes Lexus2016#516)

Add structured size fields to memory tool limit errors and expose an explicit compact action so the agent can shorten entries before a write.
…/3.0, analysis copies verbatim (Lexus2016#519)

The watchdog re-fires LOW_SELECTION_EFFICIENCY (selection_efficiency=12%,
window selected=60 merged=7) the morning after Lexus2016#507's prompt-level throttle
landed. Root cause in the 2026-06-24 analysis cycle: it set
max_total_effort=2.0 — neither the 3.0 default nor the 1.5 throttle Lexus2016#507
prescribed. A prompt-level "you decide 1.5 vs 3.0 from the flag" instruction is
unenforced, so the LLM drifted to an arbitrary middle, under-throttling and
keeping the funnel over-selected.

Fix: move the decision out of the prompt into the deterministic metric script.

- evolution_metrics.compute_health() now emits effort_budget (1.5 when
  LOW_SELECTION_EFFICIENCY is flagged, else 3.0 — the only two legal values).
- format_health() renders `effort_budget=X` in the BODY of the sidecar line,
  before the final `| <tail>`, so evolution_watchdog's `.endswith("| healthy")`
  flag parsing is untouched (verified E2E on a prod-shaped flagged line).
- evolution-analysis/SKILL.md: steps 5a/6 now COPY effort_budget=X verbatim as
  max_total_effort instead of deriving it; middle values like 2.0 are explicitly
  illegal. Scoring formula, weights, decomposition/split rules, and the
  anti-starvation slot are UNCHANGED.

Tested: test_evolution_metrics (+3 cases: throttle-only-when-flagged,
default-on-insufficient-signal, budget-in-body-not-watchdog-tail),
test_evolution_watchdog, test_evolution_funnel, test_evolution_skill_integrity
all pass (73); ruff + ty clean.
…oses Lexus2016#514) (Lexus2016#527)

Adds explicit, structured logging around provider failover so cron
introspection can classify rate-limit and billing events without parsing
free-text logs. Emits two events:

- provider_rate_limit_failover: when the primary provider is put on
  cooldown, including provider/model, Retry-After header, computed cooldown,
  and fallback chain position.
- provider_fallback_activated: when the runtime actually switches to a
  fallback provider/model, confirming the new active backend.

Closes Lexus2016#514

Co-authored-by: Hermes Evolution <evolution@hermes.ai>
… agent jobs without skills (Lexus2016#534)

_normalize_skills/_normalize_toolsets return None (not []) when a stage YAML
omits skills:/toolsets:. The reconcile path for an ALREADY-REGISTERED job did
`list(skills)` / `list(toolsets)` unconditionally, so the moment those jobs
existed every re-run raised `TypeError: 'NoneType' object is not iterable` and
aborted. register_evolution_cron IS the integration stage's self-update step
(refresh HERMES_HOME no_agent scripts + sync skills), so this silently froze
script/skill deployment on the server — e.g. PR Lexus2016#519's funnel-side effort_budget
never reached HERMES_HOME until the registrar was run by hand.

Fix: only reconcile skills/toolsets when the YAML explicitly specifies them
(`is not None`). None means "leave the registered value as-is", NOT "clear it",
so this also prevents a no-skills YAML from clobbering a job's registered skills
to []. Helper-script install (runs before reconcile) was already fine; only the
existing-job reconcile branch was affected.
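
The `is not None` guard can be illustrated with a small sketch; the function name and the dict-shaped job are assumptions.

```python
def reconcile_job(job: dict, skills, toolsets) -> dict:
    """Update a registered job; None means 'leave the registered value as-is'."""
    updated = dict(job)
    if skills is not None:
        updated["skills"] = list(skills)
    if toolsets is not None:
        updated["toolsets"] = list(toolsets)
    return updated
```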

Regression test: an existing agent job whose YAML omits skills reconciles its
schedule without crashing and without clobbering skills.

Tested: test_register_evolution_cron 23 pass (uv run); ruff + ty clean.
…overspent analysis budgets (Lexus2016#535)

PR Lexus2016#519 made the effort budget deterministic at the source (the metric script
prescribes 1.5/3.0; analysis SKILL.md tells the agent to copy it verbatim). But a
prompt instruction is still not enforced: the 2026-06-24 cycle wrote
max_total_effort=2.0 — neither legal value — and under-throttled. This adds the
deterministic teeth.

- scripts/evolution_analysis_audit.py: pure audit_analysis(report) returns
  BUDGET_ILLEGAL (max_total_effort not in {1.5, 3.0}) and BUDGET_OVERSPENT
  (total_effort_selected > max_total_effort). Missing/non-numeric fields are
  skipped (no false alarm on partial/legacy/idle reports). audit_latest() reads
  the most recent YYYY-MM-DD.json under analysis/ (ignores issues_*/prs_*).
- evolution_watchdog.py: new check_analysis_integrity wired into the morning run
  so a drift surfaces to the owner — the same deterministic-verdict + alert
  pattern as the realized-impact / regression gates. Silent when clean.
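
A sketch of the two budget checks, using the field names and flag names given above. The numeric-coercion helper is an assumption about how "missing/non-numeric fields are skipped" is implemented.

```python
LEGAL_BUDGETS = {1.5, 3.0}  # the only two legal max_total_effort values


def _number(value):
    """Return value as float, or None when it is missing or not numeric."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)


def audit_analysis(report: dict) -> list:
    flags = []
    budget = _number(report.get("max_total_effort"))
    spent = _number(report.get("total_effort_selected"))
    if budget is not None and budget not in LEGAL_BUDGETS:
        flags.append("BUDGET_ILLEGAL")
    if budget is not None and spent is not None and spent > budget:
        flags.append("BUDGET_OVERSPENT")
    return flags
```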

Read+flag only: the analysis stage merges nothing, so a bad selection
self-corrects next cycle; an alert is the right enforcement teeth for this stage.

Tested: test_evolution_analysis_audit (18 cases) + test_evolution_watchdog +
skill-integrity — 62 pass; ruff + ty clean. E2E: flags the real 2026-06-24
max_total_effort=2.0 (BUDGET_ILLEGAL).
…ts rejections (Lexus2016#83 class) (Lexus2016#536)

Gate #2 of the enforcement campaign. The analysis stage can CLOSE an issue
claiming the feature already exists, citing a repo path — but the prompt-level
"cite an exact path you verified" rule is unenforced. The real Lexus2016#83 was closed
citing scripts/evolution_watchdog.sh; the actual script is .py, so a wanted idea
was killed on fabricated evidence.

- evolution_analysis_audit.audit_rejections(report, repo_root): for each
  already-exists rejection, extract cited repo paths from the prose reason and
  flag FABRICATED_REJECTION when it cites concrete paths and NONE exist. A single
  missing path among existing ones is treated as a typo / secondary reference
  (not fabrication) → no false positive. Needs the repo; silent without it.
- audit_latest now also runs the rejection check when a repo_root is given.
- evolution_watchdog.check_analysis_integrity passes _resolve_repo_dir() so the
  morning run verifies the latest cycle's closes against the actual checkout.

Read+flag only (alerts the owner; the close already happened — the value is
catching the pattern so the idea can be re-opened).

Tested: test_evolution_analysis_audit (+8 rejection cases) + watchdog — 61 pass;
ruff + ty clean. E2E: 5 real prod already-exists rejections → 0 false positives
(all cited paths exist); a fabricated scripts/evolution_watchdog.sh → flagged.
…alive (Lexus2016#538)

The approval hooks (register/unregister_gateway_notify) are monkeypatched
globally, so concurrent daemon build threads from sibling session.create tests in
the same xdist shard append THEIR session keys to closed_workers /
unregistered_keys. This test even nudges a late sibling's build thread into its
own replaced-cleanup by clearing server._sessions. Comparing the lists to [] then
fails on that sibling noise (observed failing key 20260625_..._955d26 — a sibling
session, not this test's), blocking unrelated PRs (Lexus2016#534, Lexus2016#536) until a flaky
rerun. With evolution-integration's require_all_checks_green, it also randomly
blocks the autonomous pipeline's own merges.

Fix: scope the assertions to THIS session's own key (session["session_key"])
instead of asserting the global lists are empty. The own-key check is immune to
sibling noise while still catching this build thread over-cleaning its own
session — the regression the test guards.

Test-only change.
…e (gate #3) (Lexus2016#539)

The integration stage self-merges its OWN evolution/issue-* PRs and auto-deploys
— a bad self-merge lands on every install. Branch protection already requires
green CI; the remaining gaps are unenforced prompt-level checks. This adds the
deterministic operational-safety gate, designed with a cross-AI review (Gemini):

- scripts/evolution_merge_gate.py — pure check_merge_policy(files, max_lines):
  * DIFF_TOO_LARGE: > 200 changed lines (cap, EVOLUTION_MERGE_MAX_LINES) — a large
    autonomous change is the agent spiraling / a refactor that needs a human.
  * HIGH_RISK_PATH: never self-merge a PR touching CI workflows, dependency
    lockfiles/manifests (supply-chain), container/infra, secrets, or the
    pipeline's OWN approval / merge-gate / cron-registrar machinery.
  The --merge mode then merges ATOMICALLY via `gh api PUT .../merge -f sha=<head>`
  so a push landing between review and merge returns 409 and aborts instead of
  merging unreviewed code — closing the TOCTOU race the prompt-level
  branch-integrity check had.
- evolution-integration/SKILL.md: the merge step now goes through the gate
  instead of a manual commit-set check + raw `gh pr merge`.
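
A rough sketch of what a pure policy check of this shape could look like. The pattern list and the shape of `files` (dicts with `path`, `additions`, `deletions`) are assumptions for illustration; the real gate's inputs and high-risk list are not shown in this message.

```python
import fnmatch
import os

DEFAULT_MAX_LINES = 200  # cap quoted in the commit message

# Illustrative patterns only; the real high-risk list is not shown in the source.
HIGH_RISK_PATTERNS = (
    ".github/workflows/*",
    "uv.lock",
    "pyproject.toml",
    "Dockerfile",
    "scripts/evolution_merge_gate.py",
)


def check_merge_policy(files, max_lines=None):
    """Return a list of violation codes; an empty list means self-merge is allowed.

    `files` is assumed to be a list of dicts with "path", "additions", "deletions".
    """
    if max_lines is None:
        max_lines = int(os.environ.get("EVOLUTION_MERGE_MAX_LINES", DEFAULT_MAX_LINES))
    violations = []
    changed = sum(f["additions"] + f["deletions"] for f in files)
    if changed > max_lines:
        violations.append("DIFF_TOO_LARGE")
    if any(fnmatch.fnmatch(f["path"], p) for f in files for p in HIGH_RISK_PATTERNS):
        violations.append("HIGH_RISK_PATH")
    return violations
```

Keeping the function pure (no git or network calls) is what makes the policy cases cheap to unit-test; the atomic merge call lives outside it.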

Dead-code detection is deliberately NOT a blocking gate (AST across Python/JS is
false-positive-prone and would stall the pipeline on reflection/decorators/route
registries) — it belongs as a future detect+alert, per the same review.

Tested: test_evolution_merge_gate (14 policy cases) + skill-integrity — 25 pass;
ruff + ty clean. E2E: typical small PR OK; uv.lock → HIGH_RISK; 520-line → DIFF_TOO_LARGE.
…g flag (Closes Lexus2016#517) (Lexus2016#537)

- MemoryStore accepts allow_batch_override (default False).
- apply_batch(target='memory', memory_char_limit=N) honours N only when
  allow_batch_override is True; user target and system-prompt snapshot keep
  the configured limit, preserving the prefix cache.
- load_on_disk_store() reads memory.allow_batch_memory_char_limit_override.
- memory_tool schema exposes the optional memory_char_limit parameter.
- Tests cover enabled/disabled, user-target guard, and snapshot stability.
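
The override rule in the bullets above can be sketched as follows. This is a simplified stand-in, not the real MemoryStore: the `effective_limit` method and the `user_char_limit` attribute are hypothetical names introduced for illustration.

```python
class MemoryStore:
    """Simplified stand-in showing only the limit-resolution rule."""

    def __init__(self, memory_char_limit, user_char_limit, allow_batch_override=False):
        self.memory_char_limit = memory_char_limit
        self.user_char_limit = user_char_limit
        self.allow_batch_override = allow_batch_override

    def effective_limit(self, target, memory_char_limit=None):
        # Hypothetical helper: a per-batch limit is honoured only for the
        # 'memory' target and only when the config flag is on.
        if target == "memory":
            if memory_char_limit is not None and self.allow_batch_override:
                return memory_char_limit
            return self.memory_char_limit
        # The user target always keeps its configured limit.
        return self.user_char_limit
```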

Closes Lexus2016#517

Co-authored-by: Hermes Evolution <evolution@hermes.ai>
…2016#542)

Adds aux_health_ping() in agent/auxiliary_client.py: a tiny, best-effort
chat-completion probe of the auxiliary provider layer. It resolves the
configured provider/model for a task and returns provider:model on
success, or None (with a warning) on failure.
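
As a rough illustration of the best-effort contract (return `provider:model` on success, `None` plus a warning on failure), the sketch below injects the two moving parts. The `resolve` and `probe` parameters are assumptions made for testability; the real function's signature is not shown here.

```python
import logging

logger = logging.getLogger(__name__)


def aux_health_ping(resolve, probe, task="compression"):
    """Best-effort probe: never raises, returns 'provider:model' or None.

    `resolve(task)` is assumed to return a (provider, model) pair and
    `probe(provider, model)` to raise on any failure.
    """
    try:
        provider, model = resolve(task)
        probe(provider, model)
        return f"{provider}:{model}"
    except Exception as exc:
        logger.warning("auxiliary health ping failed for %s: %s", task, exc)
        return None
```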

The gateway calls it once whenever a new session is created, so auxiliary
problems (compression, memory, title-gen sidecars) surface immediately
instead of waiting for the first background task.

Co-authored-by: Hermes Evolution <evolution@hermes.ai>
…r, model, retries (Closes Lexus2016#521) (Lexus2016#541)

When a cron job fails because of a provider-layer timeout, the failure
record now includes:

- provider / model (resolved from the job's model or HERMES_MODEL)
- failure_category (via agent.error_classifier.classify_api_error)
- retry_count (best-effort parse from the error text)

The chat delivery summary also appends the failure category so operators
see e.g. 'provider timeout [timeout]' instead of a generic message.

Co-authored-by: Hermes Evolution <evolution@hermes.ai>
…me (Lexus2016#537) (Lexus2016#563)

The required meta-check "All required checks pass" was RED on main because
tests/run_agent/test_percentage_clamp.py::TestSourceLinesAreClamped::
test_memory_tool_clamped failed (count 1, expected >= 2), blocking ALL
merges (PR Lexus2016#561 was BLOCKED, the evolution pipeline's autonomous merges,
and PRs Lexus2016#542/Lexus2016#546/Lexus2016#552).

Root cause (Case B — stale guard, NOT a correctness regression):
The guard counted the literal substring `min(100, int((current / limit)`
and required >= 2. Commit 59502d3 (Lexus2016#537, per-apply_batch
memory_char_limit override) renamed the local `limit` -> `effective_limit`
inside _success_response, so its clamp line became
`min(100, int((current / effective_limit) * 100))`. The clamp is fully
preserved — only the variable name changed — but the brittle literal no
longer matched, dropping the count from 2 to 1.

All three percentage sites in tools/memory_tool.py remain correctly
clamped (verified): L847 _compact (new_total/limit), L1099
_success_response (current/effective_limit), L1128 _render_block
(current/limit). No display path can emit >100%; there is no unclamped
percentage. tools/memory_tool.py is intentionally left untouched.

Fix: replace the variable-specific literal count with a refactor-resilient
guard that asserts the REAL invariant — every `int(... * 100)` percentage
expression is immediately wrapped in `min(100, ...)`. This is strictly
stronger than the old line-count: it actively detects an unclamped
regression (verified with negative tests for both the double-paren and the
standard single-paren `int(current / limit * 100)` forms) and survives
future variable renames. A secondary `min(100, int((` >= 2 site-count
sanity check is retained. A cross-AI (Gemini) review confirmed the new
guard is at least as strong as the old one and prompted broadening the
regex to also cover the single-paren cast form.
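
One way such a refactor-resilient guard might be written is sketched below. The regex and the helper name `unclamped_percentages` are illustrative assumptions, not the test's actual code.

```python
import re

# Matches an int(...) cast whose argument ends in "* 100", covering both
# int((a / b) * 100) and int(a / b * 100). \b keeps "print(" from matching.
PERCENT_CAST = re.compile(r"\bint\([^\n]*?\*\s*100\s*\)")
CLAMP_PREFIX = "min(100, "


def unclamped_percentages(source):
    """Return every percentage cast that is not directly wrapped in min(100, ...)."""
    bad = []
    for match in PERCENT_CAST.finditer(source):
        start = match.start()
        prefix = source[max(0, start - len(CLAMP_PREFIX)):start]
        if prefix != CLAMP_PREFIX:
            bad.append(match.group(0))
    return bad
```

Because the check keys on the shape of the expression rather than on a variable name, a rename such as `limit` to `effective_limit` leaves it green, while dropping the `min(100, ...)` wrapper turns it red.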
…(client-wide) (Lexus2016#561)

Symptom: every onboarded client fired a daily watchdog alert claiming
"fork is ~13000 commits behind upstream/main".

Root cause: scripts/install.sh (lines 1214/1219) and install.ps1 clone
Hermes with `git clone --depth 1`, so every fresh client install is a
SHALLOW repo. upgrade.sh then adds the `upstream` remote and registers
the evolution-watchdog cron. On a shallow clone HEAD shares no ancestry
with upstream/main, so check_upstream_lag()'s
`git rev-list --count HEAD..upstream/main` counts ~ALL upstream history
(~13031) instead of the true distance, tripping the threshold every day.

The shallow guard already exists in hermes_cli/banner.py
(_check_via_local_git) and hermes_cli/main.py (update-check), but
check_upstream_lag was the one update-check site that was missed.

Fix: add _upstream_lag_unmeasurable(runner, repo) — returns True when the
repo is shallow (`git rev-parse --is-shallow-repository` == "true") OR
HEAD/upstream have no shared history (`git merge-base` non-zero with empty
stdout). When True, check_upstream_lag returns [] SILENTLY (no alert).
Shallow is the INTENDED client default and upstream-lag is the fork
maintainer's concern — the evolution server is a full clone and still gets
the real count. The helper is fail-open: any inconclusive probe falls
through to the existing exact rev-list/threshold path, so it can never make
the check worse than today.
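
A minimal sketch of the helper's decision logic, assuming the runner seam is a callable `runner(args, repo)` that returns an object with `returncode` and `stdout`. That shape and the `Result` type are assumptions; only the two git probes come from the message above.

```python
from collections import namedtuple

# Hypothetical result shape for the injectable runner.
Result = namedtuple("Result", "returncode stdout")


def upstream_lag_unmeasurable(runner, repo):
    """True when HEAD..upstream/main cannot be counted meaningfully."""
    try:
        shallow = runner(["git", "rev-parse", "--is-shallow-repository"], repo)
        if shallow.returncode == 0 and shallow.stdout.strip() == "true":
            return True
        base = runner(["git", "merge-base", "HEAD", "upstream/main"], repo)
        if base.returncode != 0 and not base.stdout.strip():
            return True  # no shared history (grafted / unrelated)
    except Exception:
        # Fail-open: an inconclusive probe falls through to the exact count.
        return False
    return False
```

In production the runner could wrap `subprocess.run(args, cwd=repo, capture_output=True, text=True)`, whose return value already exposes `returncode` and `stdout`.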

Consolidates two competing local fixes: silent-skip RESPONSE (mirrors the
banner.py/main.py precedent) + broader DETECTION (also covers the
grafted/no-common-ancestor case).

Tests: 3 new TestUpstreamLag cases — shallow -> silent (rev-list not
consulted), unresolved merge-base -> silent, full clone behind>80 -> still
alerts (regression guard). 38/38 watchdog tests pass.
…at fatigue (Lexus2016#564)

The evolution watchdog re-emitted the SAME pipeline-health alert
(LOW_SELECTION_EFFICIENCY, LOW_SUCCESS, REALIZED_* …) on EVERY cron run while
a known, already-throttled condition persisted (e.g. selection_efficiency=11%,
self-corrected by PR Lexus2016#519's deterministic effort_budget). The owner got the
identical alert every single day — pure fatigue, zero new information.

Add a state-aware edge-trigger layer around the HEALTH alerts only. We now
alert on TRANSITIONS, not steady state:
  • a NEW flag/condition appears  -> emit
  • a condition WORSENS (new/harsher flag, or an embedded counter like
    MERGED_ZERO x3 -> x5 — both change the flag tail) -> emit
  • a condition CLEARS -> emit one recovery line
  • a condition unchanged for >= EDGE_COOLDOWN_DAYS (7) -> emit one
    "still unresolved" nudge, so it is never silent forever
Only the verbatim repeat of an already-reported, non-worsening condition within
the cooldown is suppressed.

Signature = the sorted set of flag TAILS (text after the final "|"), ignoring
the metrics body whose per-run counts drift even when the condition is
unchanged. State persists in <evolution_dir>/watchdog-alert-state.json using the
same EVOLUTION_PROFILE_DIR resolution the health checks already use.
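
The signature and transition rules above can be sketched like this. The state-dict keys (`signature`, `last_emitted`) and the function names are assumptions for illustration; the real state-file layout is not shown in this message.

```python
import time

EDGE_COOLDOWN_DAYS = 7


def signature(alerts):
    """Sorted set of flag tails: the text after the final '|' of each alert."""
    return sorted({a.rsplit("|", 1)[-1].strip() for a in alerts})


def should_emit(alerts, state, now=None):
    """Decide whether to emit, given the previous state (None/empty = unknown)."""
    now = time.time() if now is None else now
    if not state:
        # Unknown previous state: fail-open, behave as without the layer.
        return bool(alerts)
    if signature(alerts) != state.get("signature"):
        # New, worsened, or cleared condition.
        return True
    age_days = (now - state.get("last_emitted", 0)) / 86400
    # Unchanged condition: only the cooldown nudge gets through.
    return bool(alerts) and age_days >= EDGE_COOLDOWN_DAYS
```

Two alerts whose metrics bodies differ but whose flag tails match produce the same signature, so per-run count drift alone never re-triggers the alert.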

NO-MASK SAFETY PROPERTY: suppression keys on the condition signature, so any new
fault, any worsening, any new distinct flag, and any post-recovery recurrence
change the signature and emit immediately. Operational alerts (stage reports,
jobs, gh, and the Lexus2016#561 upstream-lag guard) bypass the layer entirely and fire
every run.

FAIL-OPEN: every state read/write is best-effort — a missing/unreadable/corrupt
state file means "unknown previous state" and we emit exactly as today; a write
failure is swallowed. Edge-triggering can only ever reduce noise, never mask a
fault.

TDD: 13 new tests cover steady-suppress, new-flag, worsening, counter-growth,
recovery, recovery-then-recurrence, cooldown re-reminder, the three fail-open
paths, signature stability, and a main() wiring test asserting upstream-lag/gh
are never edge-suppressed. Watchdog module fully green (51 passed).
…M] issue (Lexus2016#566)

Two robustness improvements to the evolution watchdog that close the gaps
behind the frozen nightly self-update and the manually-opened upstream issue.

Feature 1 — silent-freeze detection (check_runtime_divergence):
The runtime checkout self-updates with `git pull --ff-only`. When the
evolution pipeline (or a contributor) leaves LOCAL commits on the tracking
branch that later squash-merge upstream under a DIFFERENT SHA, local HEAD
diverges from origin/main, ff-only can no longer fast-forward, and the
nightly update silently no-ops — freezing the box on an old revision with
no signal (root cause seen on osoba: the runtime checkout commits in-place
as "Hermes Evolution" and checks out main in the same tree). The new check
alerts on the high-confidence DIVERGED signal only:
  rev-list --count origin/main..HEAD > 0  AND
  merge-base --is-ancestor HEAD origin/main is FALSE.
The is-ancestor probe is authoritative for "can ff-only advance", so a
behind-but-fast-forwardable box (healthy, just hasn't pulled yet) is NOT
alerted — avoids a daily false-positive storm. DETECT + ALERT only; never
auto-reset (that would risk losing the local commits). Routed through the
existing Lexus2016#564 edge-trigger HEALTH layer so a steady divergence emits once
and suppresses the verbatim daily repeat (cooldown nudge backstop intact).
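
A sketch of the DIVERGED probe as stated above, using the same assumed runner shape (`runner(args, repo)` returning `returncode` and `stdout`). The function name and `Result` type are hypothetical. `git merge-base --is-ancestor` exits 0 when the first commit is an ancestor of the second and 1 when it is not; other exit codes indicate an error.

```python
from collections import namedtuple

# Hypothetical result shape for the injectable runner.
Result = namedtuple("Result", "returncode stdout")


def runtime_diverged(runner, repo):
    """True only on the high-confidence DIVERGED signal; fail-open otherwise."""
    try:
        ahead = runner(["git", "rev-list", "--count", "origin/main..HEAD"], repo)
        if ahead.returncode != 0 or int(ahead.stdout.strip()) <= 0:
            return False  # no local-only commits (or probe failed)
        ancestor = runner(
            ["git", "merge-base", "--is-ancestor", "HEAD", "origin/main"], repo
        )
        # Exit 1 means "not an ancestor"; anything else is success or an error.
        return ancestor.returncode == 1
    except Exception:
        return False  # spawn error or unparseable output: no alert
```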

Feature 2 — idempotent GitHub [UPSTREAM] tracking issue (ensure_upstream_issue):
On a REAL upstream escalation (full clone, behind > threshold — the Lexus2016#561
shallow case never reaches here) check_upstream_lag now ensures the
[UPSTREAM] tracking issue exists instead of leaving the owner to open it by
hand (as with Lexus2016#562). Idempotency key = an OPEN issue whose title starts with
"[UPSTREAM]"; if present, no duplicate is created. All gh interaction goes
through the injectable runner seam (unit-testable, no network in tests).
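
The idempotency rule might look roughly like the sketch below. The return values, the issue title text, and the exact `gh` invocations are assumptions; the message above only fixes the key (an OPEN issue whose title starts with "[UPSTREAM]") and the runner seam.

```python
import json
from collections import namedtuple

# Hypothetical result shape for the injectable runner.
Result = namedtuple("Result", "returncode stdout")


def ensure_upstream_issue(runner, repo, behind):
    """Return 'exists', 'created', or 'skipped' (fail-open, never raises)."""
    try:
        found = runner(
            ["gh", "issue", "list", "--state", "open", "--json", "title", "--limit", "200"],
            repo,
        )
        if found.returncode != 0:
            return "skipped"
        issues = json.loads(found.stdout or "[]")
        if any(i.get("title", "").startswith("[UPSTREAM]") for i in issues):
            return "exists"  # idempotency key matched: do not create a duplicate
        created = runner(
            ["gh", "issue", "create",
             "--title", f"[UPSTREAM] fork is {behind} commits behind upstream/main",
             "--body", "Filed automatically by the evolution watchdog."],
            repo,
        )
        return "created" if created.returncode == 0 else "skipped"
    except Exception:
        return "skipped"
```

Because every `gh` call goes through `runner`, tests can substitute a fake that returns canned JSON and records which subcommands were invoked.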

Fail-open everywhere: repo unresolved, any git/gh spawn error, failed
search, or unparseable output → no alert / no-op, never a crash and never a
false alarm. Gated by WATCHDOG_FILE_UPSTREAM_ISSUE (default on).

TDD: 18 new tests (divergence detection + edge-trigger routing, idempotent
issue create/no-duplicate/fail-open, Lexus2016#561 shallow regression still silent).
…exus2016#568)

Reduce the Hydra system prompt from ~3678 chars to ~1876 chars (49% smaller)
while preserving all stage definitions, safety rules, and dispatch output
requirements. Replaces verbose prose and repeated paths with compact
variable abbreviations and one-line stage toolset/goal summaries.

- cron/evolution/hydra.yaml: condensed prompt, no functional behavior change.
- tests/cron/test_hydra_prompt.py: regression tests that pin prompt size
  (<2500 chars), stage coverage, safety rules, and the absence of terminal
  from Hydra's own toolsets.

Closes Lexus2016#549

Co-authored-by: Hermes Evolution <evolution@hermes.ai>
…failure (Lexus2016#570)

When Telegram (or any platform) delivery fails, the undelivered content
is now preserved at a well-known deterministic path:

  {HERMES_HOME}/cron/output/delivery_fallback/{target}_{job_id}_{timestamp}.md

This allows the user or a secondary channel to retrieve cron job results
even when the primary messaging channel is unreachable.
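
For illustration, the path template above could be built as follows. The function name and the timestamp format are assumptions; only the directory layout and the `{target}_{job_id}_{timestamp}.md` naming come from the message.

```python
from datetime import datetime, timezone
from pathlib import Path


def delivery_fallback_path(hermes_home, target, job_id, when=None):
    """Build the deterministic fallback path for an undelivered cron result."""
    when = when or datetime.now(timezone.utc)
    stamp = when.strftime("%Y%m%dT%H%M%SZ")  # timestamp format is an assumption
    return (
        Path(hermes_home) / "cron" / "output" / "delivery_fallback"
        / f"{target}_{job_id}_{stamp}.md"
    )
```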

Closes Lexus2016#525

Co-authored-by: Hermes Evolution <evolution@hermes-agent.nousresearch.com>
Co-authored-by: Hermes Evolution <evolution@hermes.ai>
…forced guard) (Lexus2016#569)

* feat(agent): learn from user corrections — lean Phase 1 (per-user)

Principle: a real user correcting a real agent on a real task is the
highest-signal feedback an agent gets. Today Hermes captures some of it
(the post-turn LLM background_review writes preferences to per-profile
memory/skills) but misses the loudest signals — interrupted and denied
turns are skipped by the `not interrupted` guard in turn_finalizer.

Lean Phase 1 scope (per-user Fast Loop only):

1. Deterministic correction detection (agent/correction_learning.py
   detect_correction): classifies a completed turn as INTERRUPT / DENY /
   STEER from runtime markers only — no fuzzy text regex, no LLM. Returns
   a small CorrectionRecord {kind, signature, context, session_id, ts}.

2. Stop skipping corrections: turn_finalizer now triggers the background
   review when the turn IS a structured correction, even when interrupted
   or denied. Non-correction interrupted turns and normal turns keep their
   exact prior behavior. The detected correction is passed to the reviewer
   as a tier-aware hint.

3. Generalization guard (the load-bearing safety piece): a captured
   correction is TRANSIENT by default. It promotes to DURABLE (a write to
   the per-profile memory store, which re-injects next session) ONLY on
   evidence — the same signature recurs across >= 2 DISTINCT sessions, or
   an explicit "remember this". A minimal recurrence tracker (signature ->
   distinct sessions) lives in a fail-open per-profile JSON store; a broken
   store degrades to transient-only and never disturbs a turn. Promotion is
   idempotent (no duplicate ledger/memory writes on later sightings). The
   correction-triggered LLM review prompt is tier-aware: a first-sighting
   transient correction is explicitly NOT persisted durably by the reviewer,
   so the deterministic guard is the single durable gate (closes a one-off
   leak found in independent review).

4. Provenance + unlearn: every durable item is tagged with origin signal
   kind, session, signature, timestamps, and promotion reason in a ledger.
   unlearn(provenance_id) removes the durable item from the memory store AND
   resets the signature's recurrence evidence (so unlearn is not silently
   undone by the next sighting). Reversible by construction.
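
The generalization guard and unlearn behaviour in items 3 and 4 can be sketched in memory as below. The class and method names are hypothetical, and the real tracker persists to a per-profile JSON store with a provenance ledger, which this sketch omits.

```python
class RecurrenceTracker:
    """In-memory sketch: signature -> distinct sessions, promote on evidence."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self._sessions = {}    # signature -> set of session ids
        self._promoted = set() # signatures already made durable

    def record(self, signature, session_id, remember=False):
        """Return True exactly once, when the signature first becomes durable."""
        seen = self._sessions.setdefault(signature, set())
        seen.add(session_id)
        if signature in self._promoted:
            return False  # idempotent: later sightings do not re-promote
        if remember or len(seen) >= self.threshold:
            self._promoted.add(signature)
            return True
        return False

    def unlearn(self, signature):
        """Drop the durable mark AND the recurrence evidence."""
        self._promoted.discard(signature)
        self._sessions.pop(signature, None)
```

Resetting the session set in `unlearn` is what prevents the very next sighting from silently re-promoting the correction.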

Deferred to later phases (explicitly NOT built): the multi-dimensional
evidence vector, the fleet/global consensus path, calibration / positive-
negative controls, the adversarial counter-reviewer, TTL / model-version
tagging, config-over-prompt routing, and the parallel codex_runtime guard.

TDD: tests written first (tests/agent/test_correction_learning.py,
test_correction_learning_wiring.py, test_turn_finalizer_correction_review.py)
covering the acceptance test (transient on first sighting -> durable on a
second distinct session; explicit remember promotes immediately), the
negative control (one-off never injects, neither deterministically nor via
the LLM review), idempotent promotion, unlearn-resets-recurrence,
regression (non-correction turns unchanged), and provenance. 37 tests green;
background_review regression suite (52) green. No auto-merge — fleet-
affecting behavior change requires owner review.

* fix(agent): enforce durable-write gate + discriminate user denials (X1, X2)

Two serious defects in the lean Phase-1 "learn from user corrections" slice.

X1 — the deterministic guard was NOT the single durable gate it claimed to
be. The correction-triggered review forced review_memory=True on every
detected correction and spawned the background-review LLM fork holding the
memory/skill WRITE toolset (shared _memory_store). For a TRANSIENT one-off
correction, only an advisory preamble — not enforcement — stopped the LLM
from writing it durably. This widened an ungated durable-write path.

Now ENFORCED: turn_finalizer computes block_durable_writes (transient
correction that is the SOLE reason the review spawned — i.e. the legacy nudge
path would not have fired and the recurrence guard has not promoted it) and
threads it through _spawn_background_review -> spawn_background_review_thread
-> _run_review_in_thread. The fork's runtime tool whitelist
(_review_tool_whitelist) then strips the durable writers (memory,
skill_manage), so get_pre_tool_call_block_message denies any durable write at
dispatch. Durable persistence for the correction path now happens ONLY via the
deterministic CorrectionLearner promotion (recurrence>=2 or explicit
remember). Pre-existing nudge-driven review behavior is untouched.
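
A simplified sketch of the enforcement idea: strip the durable writers from the fork's whitelist, then deny anything outside the whitelist at dispatch. The function names here are illustrative stand-ins for `_review_tool_whitelist` and `get_pre_tool_call_block_message`, whose real signatures are not shown.

```python
# Tool names taken from the message above.
DURABLE_WRITERS = frozenset({"memory", "skill_manage"})


def review_tool_whitelist(base_tools, block_durable_writes):
    """Return the tools the review fork may call."""
    tools = set(base_tools)
    return tools - DURABLE_WRITERS if block_durable_writes else tools


def pre_tool_call_block_message(tool_name, whitelist):
    """Return None when the call is allowed, else a denial message."""
    if tool_name in whitelist:
        return None
    return f"tool '{tool_name}' is not permitted in this review"
```

The denial happens at dispatch, so it holds regardless of what the review prompt says or how the model responds to it.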

X2 — _detect_deny keyed on status=="blocked", which automatic terminal blocks
(dangerous-command denial and workdir shell-injection validation) also emit
with no user involvement, minting false "user corrections". A genuine user
denial is now stamped with an explicit user_denied marker at the approval
sites (tools/approval.py), propagated into the terminal tool result only for
real denials (tools/terminal_tool.py), and _detect_deny fires solely on that
marker. Automatic blocks are excluded.

Also: wired CorrectionLearner.unlearn to a real CLI surface
(hermes corrections list / unlearn <id>) so reversibility is not paper-only.

Tests: repaired the misleading "negative control" (it only exercised the
deterministic learner, never the LLM leak path) and added real enforcement
tests — the fork whitelist excludes durable writers for a transient
correction; an automatic dangerous-command/workdir block is NOT detected as a
denial; the unlearn CLI reverses a durable item end-to-end.

* test(agent): prove X1 enforcement at the dispatch gate

Add a deterministic end-to-end test: under the transient-correction whitelist,
get_pre_tool_call_block_message (the gate the review fork's tool dispatch
consults) denies memory and skill_manage while allowing read-only skill_view.
Proves the LLM fork cannot persist a one-off correction durably — denied at
dispatch, not merely discouraged by a prompt.

* fix(agent): close 5 audit defects in learn-from-corrections Phase 1

Adversarial audit (6/10) flagged five issues; all now fixed surgically.

1. CI red: register "corrections" in _BUILTIN_SUBCOMMANDS so
   test_builtin_set_covers_every_registered_subcommand passes. The
   subcommand was wired in main() but missing from the frozenset, so the
   startup-gating parity check (a required CI check) failed.

2. Codex parity: extract correction detection + recurrence recording +
   the spawn/block decision into a shared seam, agent/correction_review.py
   (decide_correction_review). Both turn_finalizer.finalize_turn and the
   Codex finalizer (codex_runtime.run_codex_app_server_turn) now route
   through it, so they cannot drift. Previously the Codex path carried an
   unmodified nudge-only gate and silently never learned from a correction.

3. X1 co-occurrence hole closed universally: block_durable_writes =
   (correction present AND not durable), dropping the prior
   `and not _healthy_review` term. A transient correction co-occurring
   with a nudge can no longer ride the nudge's durable write into the
   store; the nudge's own durable write is deferred to the next interval.

4. No wasted aux-model spend: a pure-transient correction with no nudge is
   recorded deterministically but no longer spawns the (write-blocked) LLM
   review fork. Spawn = (nudge fired) OR (correction promoted to durable).

5. Promote atomicity: in CorrectionLearner._promote write the ledger entry
   FIRST, then the durable memory line, so a failure between them leaves a
   cleanable ledger entry (injected:false) rather than an un-unlearnable
   orphaned MEMORY.md line. Still fail-open.
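
The spawn/block contract from items 3 and 4 reduces to two boolean expressions. The sketch below uses a hypothetical `decide` function and `ReviewDecision` type; the real seam is `decide_correction_review`, whose signature is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewDecision:
    spawn_review: bool
    block_durable_writes: bool


def decide(correction_present, correction_durable, nudge_fired):
    """Apply the spawn/block rules stated in items 3 and 4."""
    promoted = correction_present and correction_durable
    return ReviewDecision(
        # Spawn = (nudge fired) OR (correction promoted to durable).
        spawn_review=nudge_fired or promoted,
        # Block = (correction present AND not durable), nudge or no nudge.
        block_durable_writes=correction_present and not correction_durable,
    )
```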

Tests: rewrote the turn-finalizer correction-review tests for the new
spawn/block contract; added tests/agent/test_codex_runtime_correction_review.py
proving the Codex path detects+records and obeys the same spawn/block rules.

* fix(agent): revive INTERRUPT corrections on default runtime (capture-before-clear)

DEFECT 1 — capture-before-clear. finalize_turn called clear_interrupt()
(nulls _interrupt_message) ~46 lines before the correction detector read it,
so the INTERRUPT branch of detect_correction was dead on the default runtime.
Capture _interrupt_message into a local BEFORE the clear and feed that local
to detection and result["interrupt_message"].
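
The ordering fix can be shown with a toy agent. The classes below are simplified stand-ins, not the real `finalize_turn`, and the `correction` key is a hypothetical placeholder for the detector's output.

```python
class Agent:
    """Toy agent whose clear_interrupt mirrors production by nulling the message."""

    def __init__(self, interrupt_message=None):
        self._interrupt_message = interrupt_message

    def clear_interrupt(self):
        self._interrupt_message = None


def finalize_turn(agent):
    # Capture BEFORE the clear; reading the attribute afterwards yields None.
    interrupt_message = agent._interrupt_message
    agent.clear_interrupt()
    return {
        "interrupt_message": interrupt_message,
        "correction": "INTERRUPT" if interrupt_message else None,
    }
```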

DEFECT 2 — un-mask the tests. The _StubAgent/_Agent.clear_interrupt() stubs
were no-ops that never nulled _interrupt_message, so INTERRUPT tests passed
against broken production code. Mirror production (null the message + request
flag). Add an explicit capture-before-clear test asserting, through the REAL
finalize_turn ordering, that clear_interrupt ran (attribute is None) yet the
INTERRUPT correction was still detected with its exact redirect text. The stub
fix turns the existing INTERRUPT tests RED pre-fix; DEFECT 1 turns them GREEN.

DEFECT 3 — explicit "remember this" durable promotion is unreachable in
production (no caller threads remember=True). Downgrade the docstrings/comments:
cross-session recurrence (>=2 distinct sessions) is the SOLE Phase-1 production
durable trigger; explicit-remember is deferred. Keep the remember param as the
tested seam, documented as not-yet-wired.

DEFECT 4 — codex INTERRUPT honesty. On the codex runtime user interrupts are
never propagated into the session (request_interrupt has no production callers;
codex interrupted is only a deadline-timeout), so codex INTERRUPT stays inert
even after DEFECT 1 — a pre-existing platform gap, deferred. Scope the code
comments honestly: default runtime = INTERRUPT/DENY/STEER all live; codex =
DENY/STEER live, INTERRUPT blocked.

No behavior change beyond reviving the INTERRUPT branch. DO NOT MERGE.

Extracts a reusable _FailureCounter from the fallback-failure-tracking
work (PR #3) so that multiple callers (Telegram adapter, auxiliary
clients, search providers) can share the same session-scoped failure
threshold + disable logic without duplicating the pattern.
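
To make the pattern concrete, a session-scoped failure counter could look like the sketch below. This is NOT the PR's actual `_FailureCounter`: the class name, method names, default threshold, and the choice that a success resets the count are all assumptions for illustration.

```python
class FailureCounter:
    """Hypothetical sketch: count failures per key, disable at a threshold."""

    def __init__(self, threshold=3):
        if threshold < 1:
            raise ValueError("threshold must be >= 1")
        self.threshold = threshold
        self._counts = {}      # key -> failures seen this session
        self._disabled = set() # keys that tripped the breaker

    def record_failure(self, key):
        """Count one failure; return True if the key is now disabled."""
        self._counts[key] = self._counts.get(key, 0) + 1
        if self._counts[key] >= self.threshold:
            self._disabled.add(key)
        return key in self._disabled

    def record_success(self, key):
        """Assumed behaviour: a success resets the count unless already disabled."""
        if key not in self._disabled:
            self._counts.pop(key, None)

    def is_disabled(self, key):
        return key in self._disabled
```

A caller would create one instance per session and check `is_disabled(key)` before attempting the operation, so a repeatedly failing provider is skipped for the rest of that session without affecting other keys.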

Closes Lexus2016#571

Co-Authored-By: Hermes Evolution <evolution@hermes.ai>
@Da-Mikey

Owner Author

Closing as redundant — upstream PR Lexus2016#574 subsumes the _FailureCounter utility (issue Lexus2016#571) that this PR introduced.

@Da-Mikey Da-Mikey closed this Jun 27, 2026