
feat(llm): add LLM call outcome classifier / state machine #13

Open
GrigoryEvko wants to merge 16 commits into
FusionBrainLab:main from
GrigoryEvko:fix/llm-call-outcome-classifier

Conversation

@GrigoryEvko

The bandit and router stack today catches LLM call exceptions with bare try / except ladders that each subsystem rolls separately. There is no shared vocabulary for distinguishing a RATE_LIMITED upstream from a PARSE_FAILED arm output, no shared dispatch for "fail the pull on the bandit" versus "wait for the mutation outcome", and no shared guard against surprising failure modes: a langchain_core.exceptions.OutputParserException whose __cause__ is openai.RateLimitError should classify as the cause, not the wrapper; a Retry-After: inf header should not propagate as math.inf into a sleep budget; and a metaclass with a non-str __module__ or a raising __mro__ property should not break classifier totality.

This PR adds gigaevo/llm/call_outcome.py, a pure total function classify_call_result(exc, *, model_name=None) that maps any BaseException | None to a frozen LLMCallResult carrying a 12-variant LLMCallOutcome. The classifier walks the __cause__ / __context__ chain with cycle protection and collapses outcomes through a priority ordering so wrapped errors classify by root cause. Rule tables (_CONTEXT_OVERFLOW_MRO_NAMES, _OUTPUT_TRUNCATED_MRO_NAMES, _CONTENT_FILTER_MRO_NAMES, _TIMEOUT_MRO_NAMES, _NETWORK_MRO_NAMES, _PARSE_MRO_NAMES) match against the full MRO chain and are asserted pairwise-disjoint at module load. OUTCOME_ACTION is asserted closed over LLMCallOutcome. Class names were verified against installed openai 2.36.0, httpx 0.28.1, langchain-openai 1.2.1, langchain-core 1.4.0, and litellm 1.x (which contributes ContextWindowExceededError and ContentPolicyViolationError, both BadRequestError subclasses that would otherwise lose their semantics to BAD_REQUEST).
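The cycle-protected cause-chain walk described above can be sketched as follows. walk_causes, Wrapper, and Root are illustrative names, not the module's actual internals; the real classifier additionally collapses outcomes through the priority ordering.

```python
def walk_causes(exc, max_depth=32):
    """Yield exc and its __cause__/__context__ chain, guarding against cycles."""
    seen = set()
    while exc is not None and id(exc) not in seen and len(seen) < max_depth:
        seen.add(id(exc))
        yield exc
        # PEP 3134: prefer the explicit cause, fall back to the implicit context.
        exc = exc.__cause__ if exc.__cause__ is not None else exc.__context__

class Wrapper(Exception): ...
class Root(Exception): ...

root = Root("rate limited")
outer = Wrapper("parser saw bad output")
outer.__cause__ = root
chain = tuple(type(e).__name__ for e in walk_causes(outer))
assert chain == ("Wrapper", "Root")

# A self-referential chain terminates instead of looping forever.
root.__cause__ = outer
assert len(list(walk_causes(outer))) == 2
```

Classifying each yielded exception and keeping the highest-priority match is what lets a RateLimitError root win over an OutputParserException wrapper.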

import httpx
import openai
from langchain_core.exceptions import OutputParserException

from gigaevo.llm.call_outcome import LLMCallOutcome, classify_call_result

# Build a parser failure whose root cause is a 429 from the provider.
req = httpx.Request("POST", "https://api.openai.com/v1/x")
inner = openai.RateLimitError("rate", response=httpx.Response(429, request=req), body=None)
outer = OutputParserException("parser saw bad output")
outer.__cause__ = inner

# The classifier walks the cause chain, so the wrapper does not mask the root.
result = classify_call_result(outer)
assert result.outcome is LLMCallOutcome.RATE_LIMITED
assert result.http_status == 429
assert result.cause_chain == ("OutputParserException", "RateLimitError")

Defensive guards are kept tight against confirmed defects, not as theater. Lone UTF-16 surrogates in LLMCallResult.message are replaced with U+FFFD so downstream UTF-8 encoders and JSON serializers do not crash. Retry-After is rejected when non-finite (inf, nan, numbers that overflow float); any finite non-negative value passes through, and the consumer is the arbiter of a maximum acceptable sleep. type(exc).__module__ and __name__ are coerced through str so a metaclass setting a non-str module no longer breaks totality. __cause__ / __context__ / __mro__ access is funnelled through _safe_getattr so a property override raising non-AttributeError cannot propagate; a non-BaseException value returned from such a property terminates the cause walk, and a raising __mro__ collapses to an empty fingerprint set. Eighty-nine unit tests run against real provider exceptions, plus regression guards for every fixed defect. No live caller consumes the classifier yet; the bandit rewire that operationalizes OUTCOME_ACTION is a separate follow-up.
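The lone-surrogate guard can be sketched with a regex-based scrub; this is an assumption about the approach, not the PR's actual helper. Lone surrogates (U+D800..U+DFFF) are invalid in UTF-8 and crash strict encoders such as orjson, while real astral characters are single code points in a Python str and pass through untouched.

```python
import re

# Match lone UTF-16 surrogate code points on str leaves.
_LONE_SURROGATE = re.compile("[\ud800-\udfff]")

def scrub_surrogates(text: str) -> str:
    """Replace every lone surrogate with U+FFFD; well-formed text is unchanged."""
    return _LONE_SURROGATE.sub("\ufffd", text)

bad = "tokenizer emitted \ud83d and stopped"        # high surrogate, no partner
clean = scrub_surrogates(bad)
assert clean == "tokenizer emitted \ufffd and stopped"
clean.encode("utf-8")                               # now encodes without error
assert scrub_surrogates("ok \U0001f642") == "ok \U0001f642"   # emoji preserved
```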

A pure total function classify_call_result(exc, *, model_name=None)
that maps any BaseException | None to a frozen
LLMCallResult carrying a 12-variant LLMCallOutcome (SUCCESS
plus eleven failure variants verified against installed sources
for openai 2.36.0, httpx 0.28.1, langchain-openai 1.2.1,
langchain-core 1.4.0). Cause-chain walk with cycle protection plus
priority ranking so a wrapper exception cannot mask a more
informative root (a langchain_core.exceptions.OutputParserException
whose __cause__ is openai.RateLimitError lands on
RATE_LIMITED). Pairwise-disjoint rule tables for MRO-name
matching, asserted at module load. Action mapping that the bandit
will consume in a follow-up PR (SUCCESS defers to
on_mutation_outcome; every failure outcome maps to
INJECT_ZERO_REWARD).
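The closed-over-enum assertion can be illustrated with an abbreviated sketch. The real LLMCallOutcome has twelve variants; the action name DEFER_TO_MUTATION_OUTCOME below is a hypothetical stand-in for however the PR spells "SUCCESS defers to on_mutation_outcome", while OUTCOME_ACTION and INJECT_ZERO_REWARD are names taken from the PR text.

```python
from enum import Enum, auto

class LLMCallOutcome(Enum):              # abbreviated: the real enum has 12 variants
    SUCCESS = auto()
    RATE_LIMITED = auto()
    PARSE_FAILED = auto()

class BanditAction(Enum):
    DEFER_TO_MUTATION_OUTCOME = auto()   # SUCCESS: reward arrives later
    INJECT_ZERO_REWARD = auto()          # every failure variant

OUTCOME_ACTION = {
    LLMCallOutcome.SUCCESS: BanditAction.DEFER_TO_MUTATION_OUTCOME,
    LLMCallOutcome.RATE_LIMITED: BanditAction.INJECT_ZERO_REWARD,
    LLMCallOutcome.PARSE_FAILED: BanditAction.INJECT_ZERO_REWARD,
}

# Module-load assertion: adding an outcome without an action fails fast at import.
assert set(OUTCOME_ACTION) == set(LLMCallOutcome)
```

The same fail-fast-at-import pattern covers the pairwise-disjointness check on the MRO-name rule tables.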

Adds four audit-discovered guards inside the classifier so it can
be merged without external sanitizer dependency: lone UTF-16
surrogates in the result message are replaced with U+FFFD so
downstream UTF-8 encoders and JSON serializers do not crash;
Retry-After headers with non-finite or excessive values
(inf, nan, overflow strings, values past 24 hours) are
rejected as None rather than propagating poisonous floats into
the bandit's sleep budget; type(exc).__module__ and
__name__ are coerced through str defensively so a metaclass
that sets a non-str module no longer breaks classifier totality;
_safe_getattr suppresses non-AttributeError exceptions raised
by hostile property/descriptor implementations.

Eighty unit tests against real provider exceptions plus the four
new regression guards. No live caller consumes the classifier yet
(bandit rewire is a separate follow-up); this is module-only with
no behavior change.

Combined audit pass against the classifier added in the previous
commit, addressing both totality and coverage gaps:

Totality contract guards. __cause__ / __context__ /
__mro__ access is funnelled through _safe_getattr so a
metaclass or subclass that overrides any of them with a property
raising non-AttributeError cannot break classifier totality. A
non-BaseException value returned from a property-override path
terminates the cause walk rather than poisoning cycle detection;
a raising __mro__ collapses to an empty fingerprint set so the
MRO branches naturally skip and the exception falls through to the
status-code / isinstance / OTHER paths.

Coverage of litellm exception classes that subclass openai
BadRequestError but carry semantics that map to a more informative
outcome: ContextWindowExceededError now lands on CONTEXT_OVERFLOW
(was BAD_REQUEST) and ContentPolicyViolationError on
CONTENT_FILTERED (was BAD_REQUEST). Both verified against installed
litellm 1.x classes.

Documentation. LLMCallResult field docstrings expanded to spell out
the cause_chain ordering (index 0 is the surface exception, walking
toward roots), the retry_after_seconds tri-state semantics that PR-B
will branch on (None means no hint, 0.0 means retry immediately,
positive means sleep before retry), and the PR-B integration
contract.

Tests grow from 80 to 89: hostile cause / context / mro property
guards, BaseException subclass totality, litellm class fingerprints,
cause_chain ordering invariant, retry-after tri-state, per-outcome
action lookup, JSON round-trip mirroring the PR-B telemetry path.

The classifier's job is to parse Retry-After into a finite
non-negative float and surface it on LLMCallResult. The upper bound
was a policy choice without operational evidence — no provider in
the wild sends Retry-After past one hour even at abuse-protection
limits, and a legitimate two-hour backoff was being silently
mapped to None (no hint) instead of being passed through. The
consumer (a future bandit retry adapter) is the right arbiter of
a maximum acceptable sleep; the classifier just needs to keep
non-finite garbage out of the field.
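A minimal sketch of that parsing contract, assuming the numeric delay-seconds form of Retry-After (the HTTP-date form is out of scope here); parse_retry_after is an illustrative name, not the classifier's actual helper.

```python
import math

def parse_retry_after(raw):
    """Return a finite, non-negative float, else None (no hint)."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(value) or value < 0.0:
        return None
    return value

assert parse_retry_after("30") == 30.0
assert parse_retry_after("0") == 0.0          # retry immediately
assert parse_retry_after("7200") == 7200.0    # two hours now passes through
assert parse_retry_after("inf") is None
assert parse_retry_after("nan") is None
assert parse_retry_after("1e400") is None     # overflows float to inf -> rejected
assert parse_retry_after("-5") is None
assert parse_retry_after(None) is None
```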

BanditModelRouter overrides invoke/ainvoke to wrap the LLM dispatch
in try/except. On failure the new _inject_failure_reward helper
classifies the exception via classify_call_result and looks up the
BanditAction; INJECT_ZERO_REWARD outcomes (every non-SUCCESS variant)
normalize a zero reward and append it to the arm's window before the
original exception re-raises. Without this, _select's pull recording
already happened but the reward window never got a matching entry,
so the UCB1 confidence term shrank for the failing arm and the
bandit underexplored flaky models.
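The pull/reward bookkeeping problem can be shown with a toy sketch; TinyBandit and dispatch are illustrative stand-ins for the router internals, not the actual BanditModelRouter code.

```python
class TinyBandit:
    """Toy ledger: a pull count and a reward window per arm."""
    def __init__(self):
        self.pulls = {}
        self.rewards = {}
    def record_pull(self, arm):
        self.pulls[arm] = self.pulls.get(arm, 0) + 1
    def record_reward(self, arm, reward):
        self.rewards.setdefault(arm, []).append(reward)

def dispatch(bandit, arm, call):
    bandit.record_pull(arm)          # _select records the pull up front
    try:
        return call()
    except BaseException:
        bandit.record_reward(arm, 0.0)   # inject zero reward, keep ledger in step
        raise                            # original exception still propagates

b = TinyBandit()
def failing():
    raise TimeoutError("upstream timed out")

try:
    dispatch(b, "gpt-x", failing)
except TimeoutError:
    pass
assert b.pulls["gpt-x"] == 1 and b.rewards["gpt-x"] == [0.0]
```

Without the except branch, pulls would outrun the reward window and the UCB1 confidence term would shrink for the failing arm exactly as described above.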

_StructuredOutputRouter grows a failure_hook callback (orthogonal to
the existing select_override) so bandit-wrapped structured-output
dispatches go through the same path. BanditModelRouter.with_structured_output
wires self._inject_failure_reward through.

Six new tests cover sync failure, async failure, repeated failures
keeping pulls and rewards in step, success-path-does-not-inject,
structured-output failure routing through the same hook, and a
direct test of the unknown-arm defensive skip in
_inject_failure_reward.

@GrigoryEvko

GrigoryEvko commented May 15, 2026

Part B (commits 4f90bf96398a82, 12 commits on top of the classifier). The classifier is now consumed end-to-end and the wiring has been audit-hardened across two parallel three-agent passes.

BanditModelRouter.invoke / ainvoke / stream / astream wrap the LLM dispatch in try / except. On failure the new _inject_failure_reward helper calls classify_call_result(exc, model_name=arm_name), looks up the BanditAction, and for every INJECT_ZERO_REWARD outcome normalizes a zero reward and appends it to the arm's window before the original exception re-raises. _select already recorded the pull, so without this hook the reward window drifted behind total_pulls and the UCB1 confidence term shrank for the failing arm. _StructuredOutputRouter grows a failure_hook callback (orthogonal to the existing select_override) so bandit-wrapped structured-output dispatches go through the same path. BanditModelRouter.with_structured_output wires self._inject_failure_reward through.

Audit findings that surfaced during integration probes and got fixed in this PR: hook-side errors no longer mask the LLM exception (_safe_inject_failure_reward + _maybe_fire_failure_hook log at warning level); token-tracker errors on the success path no longer break the deferred-reward contract (_safe_track); the bandit's _select now sets the _selected_model_var ContextVar so get_selected_model() returns the right arm; _StructuredOutputRouter._process runs inside the try so token-tracker / parser exceptions after a successful model.invoke fire the failure hook; _process now raises parsing_error when langchain's include_raw=True flow surfaces a schema-validation failure as {parsed=None, parsing_error=<exc>} instead of raising (FU12); StageError.{type, message, stage, traceback} scrub UTF-16 surrogates; MigrantEnvelope.to_stream_fields recursively scrubs program_data; TokenUsage.from_response isinstance-guards every nested layer and coerces token counts via _coerce_int. Pre-_select exceptions verified clean (no pull, no reward). 32-way concurrent mixed-dispatch stress confirms total_pulls == 32 and window_size == failures. No double-injection across the agent/operator boundary (evolution engine drops failed mutations without persisting, so on_mutation_outcome never fires for the same call that already got an immediate zero reward).

Agent-layer integration tests (tests/llm/test_agent_bandit_integration.py) exercise the wiring against real MutationAgent / InsightsAgent / ScoringAgent / LineageAgent instances dispatching through a BanditModelRouter. Full target sweep (tests/llm tests/evolution/test_bandit.py tests/database tests/utils tests/dag tests/test_program.py tests/stages tests/prompts tests/trackers tests/infra tests/evolution/bus): 2532 tests pass.

One ambiguity is queued rather than decided unilaterally. The classifier currently maps every non-SUCCESS outcome to INJECT_ZERO_REWARD. For arm-fault outcomes (PARSE_FAILED, CONTENT_FILTERED, OUTPUT_TRUNCATED) this is correct. For operator/infra-fault outcomes (AUTH_FAILED, RATE_LIMITED, SERVER_5XX, NETWORK_ERROR, TIMEOUT, CONTEXT_OVERFLOW) the random arm that drew the storm is penalized even though it is blameless. The docstring acknowledges this and claims mean-reward self-corrects; in practice failures correlate with specific arms (one base_url, one model with a smaller window) so the assumption is violated. A future PR could extend BanditAction with a SKIP_LEDGER variant and remap operator/infra outcomes to it; that is a policy choice the maintainer should weigh. Queued as a follow-up rather than included here.

…rocess-side hook

* ``_StructuredOutputRouter.{,a}invoke`` now executes ``_process`` *inside*
  the try/except so a token-tracker error or a malformed structured response
  fires the failure_hook. Previously a transport-side success followed by a
  ``_process`` raise left the bandit ledger inflated for that arm with no
  matching window entry — exactly the desync the wiring was supposed to
  prevent. Documented inline.

* Adds tests covering the remaining audit items:
  - ``stream`` / ``astream`` failure dispatch (sync + async)
  - failure-hook error containment on ``invoke`` / ``ainvoke``
  - cause-chain / traceback preservation through the bandit re-raise
  - classifier-internal error containment
  - ``_StructuredOutputRouter._process`` failure firing the hook
  - tracker-side telemetry failure on the success path (sync + async)
  - ``_remember_selected_model`` propagation for bandit selection and for
    the structured-output ``_bandit_select`` override
  - end-to-end structured-output ledger symmetry (failure + success)
  - concurrent ``ainvoke`` failure ledger consistency

``_StructuredOutputRouter._maybe_fire_failure_hook`` used to swallow
hook exceptions silently (``except Exception: pass``). The hook is
observability-only, so re-raising would mask the real LLM failure —
but a *silent* swallow loses telemetry whenever the hook itself has a
bug, hiding bandit-side regressions. Replace ``pass`` with
``logger.warning(...)`` so the original exception still propagates,
yet a broken hook is visible in operator logs.

Audit item FusionBrainLab#2 from the PR FusionBrainLab#13 bug hunt.
If ``_bandit.select()`` raises before ``record_pull`` runs, no pull is
recorded and the try/except inside ``invoke``/``ainvoke`` never engages
to inject a zero reward — the ledger invariant ("pulls and rewards
stay in step") therefore holds vacuously. Codify this so a future
refactor that moves ``record_pull`` outside ``_select`` (or hoists the
try/except above ``_select``) doesn't break the invariant silently.

Audit item FusionBrainLab#3 from the PR FusionBrainLab#13 bug hunt — verification, no code change.

When the bandit-wrapped LLM call inside an agent fails, the mutation
operator catches the exception and constructs a StageError via
StageError.from_exception(). Provider exception text occasionally carries
lone UTF-16 surrogate code points (typically tokenizer artefacts that
emit half of an astral pair without its partner). They survive str(exc)
unchanged into StageError.message / StageError.traceback, then crash
the downstream Redis-write path with TypeError: str is not valid UTF-8:
surrogates not allowed when gigaevo.utils.json.dumps (= orjson) tries
to encode the program blob.

Add idempotent field validators that replace lone surrogates with U+FFFD
at the StageError boundary so every downstream consumer (orjson, asyncpg
TEXT, migration-bus payload) is safe regardless of where the StageError
travels next.

The fix is local to the StageError schema and does not modify the
exception-classifier behavior or any LLM-side code.

Coverage: 7 tests under tests/test_program.py — positive case (high and
low surrogates scrubbed, downstream dumps succeeds), negative cases
(real astral emoji preserved, validator idempotent on revalidation),
end-to-end via StageError.from_exception and Program.to_dict.

Adds two integration-level test classes that pin invariants the
classifier wiring depends on, neither of which was covered by the
existing per-method tests:

* TestBanditNoDoubleInjectionOnFailure verifies the boundary between
  the failure-path _inject_failure_reward (fires inside ainvoke) and
  the deferred on_mutation_outcome (fires only for persisted
  programs). A failed mutation never reaches storage, so
  on_program_ingested never runs, so the bandit window receives
  exactly one entry per failed call. A future refactor that
  persists a placeholder program for failed mutations would
  silently introduce double-injection without these tests.

* TestBanditConcurrentMixedDispatch stresses ainvoke under 32-way
  concurrent dispatch with a deterministic success/failure mix and
  asserts the ledger invariant total_pulls == N and
  window_size == failure_count. Catches regressions in the task-id
  keyed _task_model_map that would leak one task's outcome into
  another's reward window, plus pure all-fail and all-succeed
  shapes for boundary coverage.
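The concurrent ledger invariant can be demonstrated with a toy asyncio sketch; Ledger and dispatch are illustrative stand-ins, not the test module's actual fixtures.

```python
import asyncio

class Ledger:
    """Toy stand-in for the bandit ledger: total pulls vs. reward-window size."""
    def __init__(self):
        self.total_pulls = 0
        self.window = []

ledger = Ledger()

async def dispatch(should_fail):
    ledger.total_pulls += 1          # the pull is recorded up front
    try:
        if should_fail:
            raise TimeoutError("arm failed")
        return "ok"
    except TimeoutError:
        ledger.window.append(0.0)    # failure injects a matching zero reward
        raise

plan = [i % 2 == 0 for i in range(32)]   # deterministic success/failure mix

async def run_all():
    return await asyncio.gather(*(dispatch(f) for f in plan),
                                return_exceptions=True)

results = asyncio.run(run_all())
assert ledger.total_pulls == 32
assert len(ledger.window) == sum(plan)   # window_size == failure_count
```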

The migration bus serializes program_data via stdlib json.dumps, which
escapes lone UTF-16 surrogate code points as \\uD800 JSON literals. The
receiver json.loads them back into Python str form and eventually
attempts to persist the migrated Program through
gigaevo.utils.json.dumps (= orjson). orjson rejects surrogates with
TypeError: str is not valid UTF-8: surrogates not allowed, which would
crash the migration handler mid-restore.

When a Program migrates from process A (whose LLM call produced an
exception text with a tokenizer-artefact surrogate that propagated into
StageError.message via Program.to_dict, or carries an arm name with a
surrogate, or a code blob with one) to process B, the surrogate must be
sanitized at the publish boundary so every consumer downstream sees an
orjson-safe payload.

Add _scrub_surrogates that walks the JSON-shaped program_data tree and
replaces every lone surrogate with U+FFFD on str leaves. Also scrub
source_run_id and program_id directly. The transform is idempotent so
forward/retry paths that re-envelope a consumed message are safe.

Coverage: 6 tests under tests/evolution/bus/test_transport.py — top-level
program_data leaf, nested metadata + stage_results + tag list,
source_run_id/program_id, surrogate-free identity, real astral emoji
preserved, idempotency on republish.

Add an integration-level test module that exercises the agent layer
(`MutationAgent`, `InsightsAgent`, `LineageAgent`, `ScoringAgent`) on top
of a real `BanditModelRouter`. Existing coverage in
`tests/evolution/test_bandit.py` only checks the router in isolation
with synthetic mocks; the agent <-> router contract had no end-to-end
test.

The new module verifies that:

* On transport failure the bandit's failure hook fires exactly once
  even though `MutationAgent.acall_llm` swallows the exception into
  `state["error"]`, so the ledger stays in step
  (`total_pulls == window_size`).
* Repeated failures keep the ledger in step across many calls.
* On the success path the bandit defers the reward to
  `on_mutation_outcome` (window stays empty) and no double-injection
  happens in the agent layer.
* `InsightsAgent`/`LineageAgent`/`ScoringAgent` propagate failures out
  of `agent.arun` (they inherit `base.acall_llm`, which does not catch
  exceptions) while the bandit hook still fires before the exception
  escapes.

A regression that adds an internal retry/fallback wrapper to any agent
(e.g. `Runnable.with_retry` or `with_fallbacks`) would surface here as
a `window_size` mismatch.

The module also documents an out-of-slice gap (FU12): when
`_StructuredOutputRouter._process` receives the langchain
`include_raw=True` response shape with `parsed=None` and a non-None
`parsing_error`, it returns `None` without firing the failure hook,
leaving `total_pulls` incremented but `window_size` unchanged. The
fix belongs in `gigaevo/llm/models.py` and is queued for a follow-up.
The regression-lock test pins the current (broken) behaviour so the
follow-up can flip the assertion when the gap closes.

The extractor assumed every layer of the provider usage payload is a
dict with int counts:

  - response.response_metadata is a dict
  - response_metadata["token_usage"] (or "usage") is a dict
  - usage["completion_tokens_details"] is a dict
  - every *_tokens count is an int

A hostile or malformed provider that returns a string in place of any
nested dict crashed the extractor with AttributeError ("str has no get").
The bandit's _safe_track wrapper already swallows the failure, but
direct callers (MultiModelRouter.invoke in gigaevo/llm/models.py) have
no second-level guard, so an exception there propagates to the LLM call
site and loses the response.

Harden the extractor itself: isinstance-check every container layer
before .get(), and coerce every count through _coerce_int (which accepts
int/bool/float/str and defaults to 0 on anything else or on parse
failure). Token telemetry is observability-only and must always return
either a valid TokenUsage or None — never raise.

Coverage: 7 tests under tests/llm/test_llm_routing.py — string in place
of completion_tokens_details, string token_usage, string
response_metadata, string-encoded counts, None counts, float counts,
garbage string counts.

@GrigoryEvko GrigoryEvko changed the title feat(llm): add LLM call outcome classifier feat(llm): add LLM call outcome classifier / state machine May 15, 2026
BanditModelRouter.with_structured_output passes include_raw=True so
the underlying langchain wrapper returns {raw, parsed, parsing_error}.
On schema-validation failure langchain populates parsing_error and
sets parsed=None instead of raising. _StructuredOutputRouter._process
returned response.get('parsed') silently and the bandit's failure_hook
never fired — the pull was recorded by _select but the reward window
never got a matching entry, so the UCB1 confidence term shrank for
the failing arm without the bandit knowing why.

_process now detects parsed=None plus a non-None parsing_error and
re-raises the parsing_error so the existing try/except in
invoke/ainvoke routes through _maybe_fire_failure_hook normally.
The degenerate case where both parsed and parsing_error are None
passes through unchanged (caller handles the None).

The previous regression-lock test (which pinned the broken behavior
for future detection) is flipped to assert the fixed behavior. Two
additional tests cover the success path (no hook fire) and the
empty-result degenerate case.
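The detection logic fits in a few lines; process_structured is an illustrative stand-in for _StructuredOutputRouter._process, and the response dict mirrors the {raw, parsed, parsing_error} shape langchain returns with include_raw=True.

```python
def process_structured(response):
    """Raise the parsing_error when schema validation failed silently;
    pass the degenerate all-None case through unchanged."""
    parsed = response.get("parsed")
    parsing_error = response.get("parsing_error")
    if parsed is None and parsing_error is not None:
        raise parsing_error     # route through the existing failure path
    return parsed               # may be None in the degenerate case

assert process_structured({"parsed": {"x": 1}, "parsing_error": None}) == {"x": 1}
assert process_structured({"parsed": None, "parsing_error": None}) is None

raised = False
try:
    process_structured({"parsed": None, "parsing_error": ValueError("bad schema")})
except ValueError:
    raised = True
assert raised   # the hook-firing try/except in invoke/ainvoke now engages
```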

…m_fields"

This reverts commit 7fbf680.

The defence is now redundant against PR FusionBrainLab#10
(fix/llm-output-sanitization), which introduces
``gigaevo.utils.text_sanitize.deep_sanitize_for_json`` and applies it
at the same migration-bus boundary with broader coverage (ANSI / BIDI
/ control characters, not only surrogates).  Keeping the local
``_scrub_str`` / ``_scrub_surrogates`` helpers here forces a merge
conflict with FusionBrainLab#10 regardless of merge order; removing them lets either
PR merge independently against ``main``.

If FusionBrainLab#10 is rejected, this commit should be reverted (or the original
content cherry-picked back) to restore the defence on the bus.
… fields"

This reverts commit a25678f.

The defence is now redundant against PR FusionBrainLab#10
(fix/llm-output-sanitization), which adds ``sanitize_for_log`` field
validators on ``StageError`` (covering ANSI / BIDI / control
characters / surrogates).  Keeping the local ``_scrub_surrogates``
validators here forces a merge conflict with FusionBrainLab#10 regardless of merge
order; removing them lets either PR merge independently against
``main``.

If FusionBrainLab#10 is rejected, this commit should be reverted (or the original
content cherry-picked back) to restore the defence on
``StageError``.

@GrigoryEvko

Conflict map vs PR #10 (fix/llm-output-sanitization)

After two pre-emptive reverts on this branch (commits 0f1a0d2, 9926967), the conflict footprint against PR #10 is now:

File: gigaevo/llm/bandit.py
Hunks: 1 (import block at L25)
Type: disjoint imports — clean union

Resolution: take the union of both import sets:

from gigaevo.llm.call_outcome import BanditAction, classify_call_result
from gigaevo.llm.models import (
    MultiModelRouter,
    _remember_selected_model,
    _StructuredOutputRouter,
)
from gigaevo.utils.text_sanitize import sanitize_for_log

PR #10 contributes the last import; this PR contributes the first two. No structural conflict — merge order is interchangeable.

The two reverts on this branch undo local _scrub_str / _scrub_surrogates helpers in transport.py + _scrub_surrogates validators in core_types.py that PR #10 supersedes with its gigaevo.utils.text_sanitize centralised helpers. If PR #10 is rejected, both reverts should be reverted back (or the original surrogate-scrubbing content cherry-picked).
