
feat(llm): add LLM call outcome classifier / state machine #13

Open
GrigoryEvko wants to merge 16 commits into
FusionBrainLab:main from
GrigoryEvko:fix/llm-call-outcome-classifier

Conversation

@GrigoryEvko

The bandit and router stack today catches LLM call exceptions with bare try / except ladders that each subsystem rolls separately. There is no shared vocabulary for distinguishing a RATE_LIMITED upstream from a PARSE_FAILED arm output, no shared dispatch for "fail the pull on the bandit" versus "wait for the mutation outcome", and no shared guard against surprising failure modes: a langchain_core.exceptions.OutputParserException whose __cause__ is openai.RateLimitError should classify as the cause, not the wrapper; a Retry-After: inf header should not propagate as math.inf into a sleep budget; and a metaclass with a non-str __module__ or a raising __mro__ property should not break classifier totality.

This PR adds gigaevo/llm/call_outcome.py, a pure total function classify_call_result(exc, *, model_name=None) that maps any BaseException | None to a frozen LLMCallResult carrying a 12-variant LLMCallOutcome. The classifier walks the __cause__ / __context__ chain with cycle protection and collapses outcomes through a priority ordering so wrapped errors classify by root cause. Rule tables (_CONTEXT_OVERFLOW_MRO_NAMES, _OUTPUT_TRUNCATED_MRO_NAMES, _CONTENT_FILTER_MRO_NAMES, _TIMEOUT_MRO_NAMES, _NETWORK_MRO_NAMES, _PARSE_MRO_NAMES) match against the full MRO chain and are asserted pairwise-disjoint at module load. OUTCOME_ACTION is asserted closed over LLMCallOutcome. Class names were verified against installed openai 2.36.0, httpx 0.28.1, langchain-openai 1.2.1, langchain-core 1.4.0, and litellm 1.x (which contributes ContextWindowExceededError and ContentPolicyViolationError, both BadRequestError subclasses that would otherwise lose their semantics to BAD_REQUEST).
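The cycle-protected cause-chain walk described above can be sketched as follows. walk_causes, Wrapper, and Root are illustrative names, not the module's actual internals; the real classifier additionally collapses outcomes through the priority ordering.

```python
def walk_causes(exc, max_depth=32):
    """Yield exc and its __cause__/__context__ chain, guarding against cycles."""
    seen = set()
    while exc is not None and id(exc) not in seen and len(seen) < max_depth:
        seen.add(id(exc))
        yield exc
        # PEP 3134: prefer the explicit cause, fall back to the implicit context.
        exc = exc.__cause__ if exc.__cause__ is not None else exc.__context__

class Wrapper(Exception): ...
class Root(Exception): ...

root = Root("rate limited")
outer = Wrapper("parser saw bad output")
outer.__cause__ = root
chain = tuple(type(e).__name__ for e in walk_causes(outer))
assert chain == ("Wrapper", "Root")

# A self-referential chain terminates instead of looping forever.
root.__cause__ = outer
assert len(list(walk_causes(outer))) == 2
```

Classifying each yielded exception and keeping the highest-priority match is what lets a RateLimitError root win over an OutputParserException wrapper.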

import httpx
import openai
from langchain_core.exceptions import OutputParserException

from gigaevo.llm.call_outcome import LLMCallOutcome, classify_call_result

# Build a parser failure whose root cause is a 429 from the provider.
req = httpx.Request("POST", "https://api.openai.com/v1/x")
inner = openai.RateLimitError("rate", response=httpx.Response(429, request=req), body=None)
outer = OutputParserException("parser saw bad output")
outer.__cause__ = inner

# The classifier walks the cause chain, so the wrapper does not mask the root.
result = classify_call_result(outer)
assert result.outcome is LLMCallOutcome.RATE_LIMITED
assert result.http_status == 429
assert result.cause_chain == ("OutputParserException", "RateLimitError")

Defensive guards are kept tight against confirmed defects, not as theater. Lone UTF-16 surrogates in LLMCallResult.message are replaced with U+FFFD so downstream UTF-8 encoders and JSON serializers do not crash. Retry-After is rejected when non-finite (inf, nan, numbers that overflow float); any finite non-negative value passes through, and the consumer is the arbiter of a maximum acceptable sleep. type(exc).__module__ and __name__ are coerced through str so a metaclass setting a non-str module no longer breaks totality. __cause__ / __context__ / __mro__ access is funnelled through _safe_getattr so a property override raising non-AttributeError cannot propagate; a non-BaseException value returned from such a property terminates the cause walk, and a raising __mro__ collapses to an empty fingerprint set. Eighty-nine unit tests run against real provider exceptions, plus regression guards for every fixed defect. No live caller consumes the classifier yet; the bandit rewire that operationalizes OUTCOME_ACTION is a separate follow-up.
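The lone-surrogate guard can be sketched with a regex-based scrub; this is an assumption about the approach, not the PR's actual helper. Lone surrogates (U+D800..U+DFFF) are invalid in UTF-8 and crash strict encoders such as orjson, while real astral characters are single code points in a Python str and pass through untouched.

```python
import re

# Match lone UTF-16 surrogate code points on str leaves.
_LONE_SURROGATE = re.compile("[\ud800-\udfff]")

def scrub_surrogates(text: str) -> str:
    """Replace every lone surrogate with U+FFFD; well-formed text is unchanged."""
    return _LONE_SURROGATE.sub("\ufffd", text)

bad = "tokenizer emitted \ud83d and stopped"        # high surrogate, no partner
clean = scrub_surrogates(bad)
assert clean == "tokenizer emitted \ufffd and stopped"
clean.encode("utf-8")                               # now encodes without error
assert scrub_surrogates("ok \U0001f642") == "ok \U0001f642"   # emoji preserved
```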

A pure total function classify_call_result(exc, *, model_name=None)
that maps any BaseException | None to a frozen
LLMCallResult carrying a 12-variant LLMCallOutcome (SUCCESS
plus eleven failure variants verified against installed sources
for openai 2.36.0, httpx 0.28.1, langchain-openai 1.2.1,
langchain-core 1.4.0). Cause-chain walk with cycle protection plus
priority ranking so a wrapper exception cannot mask a more
informative root (a langchain_core.exceptions.OutputParserException
whose __cause__ is openai.RateLimitError lands on
RATE_LIMITED). Pairwise-disjoint rule tables for MRO-name
matching, asserted at module load. Action mapping that the bandit
will consume in a follow-up PR (SUCCESS defers to
on_mutation_outcome; every failure outcome maps to
INJECT_ZERO_REWARD).
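The closed-over-enum assertion can be illustrated with an abbreviated sketch. The real LLMCallOutcome has twelve variants; the action name DEFER_TO_MUTATION_OUTCOME below is a hypothetical stand-in for however the PR spells "SUCCESS defers to on_mutation_outcome", while OUTCOME_ACTION and INJECT_ZERO_REWARD are names taken from the PR text.

```python
from enum import Enum, auto

class LLMCallOutcome(Enum):              # abbreviated: the real enum has 12 variants
    SUCCESS = auto()
    RATE_LIMITED = auto()
    PARSE_FAILED = auto()

class BanditAction(Enum):
    DEFER_TO_MUTATION_OUTCOME = auto()   # SUCCESS: reward arrives later
    INJECT_ZERO_REWARD = auto()          # every failure variant

OUTCOME_ACTION = {
    LLMCallOutcome.SUCCESS: BanditAction.DEFER_TO_MUTATION_OUTCOME,
    LLMCallOutcome.RATE_LIMITED: BanditAction.INJECT_ZERO_REWARD,
    LLMCallOutcome.PARSE_FAILED: BanditAction.INJECT_ZERO_REWARD,
}

# Module-load assertion: adding an outcome without an action fails fast at import.
assert set(OUTCOME_ACTION) == set(LLMCallOutcome)
```

The same fail-fast-at-import pattern covers the pairwise-disjointness check on the MRO-name rule tables.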

Adds four audit-discovered guards inside the classifier so it can
be merged without external sanitizer dependency: lone UTF-16
surrogates in the result message are replaced with U+FFFD so
downstream UTF-8 encoders and JSON serializers do not crash;
Retry-After headers with non-finite or excessive values
(inf, nan, overflow strings, values past 24 hours) are
rejected as None rather than propagating poisonous floats into
the bandit's sleep budget; type(exc).__module__ and
__name__ are coerced through str defensively so a metaclass
that sets a non-str module no longer breaks classifier totality;
_safe_getattr suppresses non-AttributeError exceptions raised
by hostile property/descriptor implementations.

Eighty unit tests against real provider exceptions plus the four
new regression guards. No live caller consumes the classifier yet
(bandit rewire is a separate follow-up); this is module-only with
no behavior change.

Combined audit pass against the classifier added in the previous
commit, addressing both totality and coverage gaps:

Totality contract guards. __cause__ / __context__ /
__mro__ access is funnelled through _safe_getattr so a
metaclass or subclass that overrides any of them with a property
raising non-AttributeError cannot break classifier totality. A
non-BaseException value returned from a property-override path
terminates the cause walk rather than poisoning cycle detection;
a raising __mro__ collapses to an empty fingerprint set so the
MRO branches naturally skip and the exception falls through to the
status-code / isinstance / OTHER paths.

Coverage of litellm exception classes that subclass openai
BadRequestError but carry semantics that map to a more informative
outcome: ContextWindowExceededError now lands on CONTEXT_OVERFLOW
(was BAD_REQUEST) and ContentPolicyViolationError on
CONTENT_FILTERED (was BAD_REQUEST). Both verified against installed
litellm 1.x classes.

Documentation. LLMCallResult field docstrings expanded to spell out
the cause_chain ordering (index 0 is the surface exception, walking
toward roots), the retry_after_seconds tri-state semantics that PR-B
will branch on (None means no hint, 0.0 means retry immediately,
positive means sleep before retry), and the PR-B integration
contract.

Tests grow from 80 to 89: hostile cause / context / mro property
guards, BaseException subclass totality, litellm class fingerprints,
cause_chain ordering invariant, retry-after tri-state, per-outcome
action lookup, JSON round-trip mirroring the PR-B telemetry path.

The classifier's job is to parse Retry-After into a finite
non-negative float and surface it on LLMCallResult. The upper bound
was a policy choice without operational evidence — no provider in
the wild sends Retry-After past one hour even at abuse-protection
limits, and a legitimate two-hour backoff was being silently
mapped to None (no hint) instead of being passed through. The
consumer (a future bandit retry adapter) is the right arbiter of
a maximum acceptable sleep; the classifier just needs to keep
non-finite garbage out of the field.
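A minimal sketch of that parsing contract, assuming the numeric delay-seconds form of Retry-After (the HTTP-date form is out of scope here); parse_retry_after is an illustrative name, not the classifier's actual helper.

```python
import math

def parse_retry_after(raw):
    """Return a finite, non-negative float, else None (no hint)."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(value) or value < 0.0:
        return None
    return value

assert parse_retry_after("30") == 30.0
assert parse_retry_after("0") == 0.0          # retry immediately
assert parse_retry_after("7200") == 7200.0    # two hours now passes through
assert parse_retry_after("inf") is None
assert parse_retry_after("nan") is None
assert parse_retry_after("1e400") is None     # overflows float to inf -> rejected
assert parse_retry_after("-5") is None
assert parse_retry_after(None) is None
```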

BanditModelRouter overrides invoke/ainvoke to wrap the LLM dispatch
in try/except. On failure the new _inject_failure_reward helper
classifies the exception via classify_call_result and looks up the
BanditAction; INJECT_ZERO_REWARD outcomes (every non-SUCCESS variant)
normalize a zero reward and append it to the arm's window before the
original exception re-raises. Without this, _select's pull recording
already happened but the reward window never got a matching entry,
so the UCB1 confidence term shrank for the failing arm and the
bandit underexplored flaky models.
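The pull/reward bookkeeping problem can be shown with a toy sketch; TinyBandit and dispatch are illustrative stand-ins for the router internals, not the actual BanditModelRouter code.

```python
class TinyBandit:
    """Toy ledger: a pull count and a reward window per arm."""
    def __init__(self):
        self.pulls = {}
        self.rewards = {}
    def record_pull(self, arm):
        self.pulls[arm] = self.pulls.get(arm, 0) + 1
    def record_reward(self, arm, reward):
        self.rewards.setdefault(arm, []).append(reward)

def dispatch(bandit, arm, call):
    bandit.record_pull(arm)          # _select records the pull up front
    try:
        return call()
    except BaseException:
        bandit.record_reward(arm, 0.0)   # inject zero reward, keep ledger in step
        raise                            # original exception still propagates

b = TinyBandit()
def failing():
    raise TimeoutError("upstream timed out")

try:
    dispatch(b, "gpt-x", failing)
except TimeoutError:
    pass
assert b.pulls["gpt-x"] == 1 and b.rewards["gpt-x"] == [0.0]
```

Without the except branch, pulls would outrun the reward window and the UCB1 confidence term would shrink for the failing arm exactly as described above.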

_StructuredOutputRouter grows a failure_hook callback (orthogonal to
the existing select_override) so bandit-wrapped structured-output
dispatches go through the same path. BanditModelRouter.with_structured_output
wires self._inject_failure_reward through.

Six new tests cover sync failure, async failure, repeated failures
keeping pulls and rewards in step, success-path-does-not-inject,
structured-output failure routing through the same hook, and a
direct test of the unknown-arm defensive skip in
_inject_failure_reward.

@GrigoryEvko

GrigoryEvko commented May 15, 2026

Part B (commits 4f90bf96398a82, 12 commits on top of the classifier). The classifier is now consumed end-to-end and the wiring has been audit-hardened across two parallel three-agent passes.

BanditModelRouter.invoke / ainvoke / stream / astream wrap the LLM dispatch in try / except. On failure the new _inject_failure_reward helper calls classify_call_result(exc, model_name=arm_name), looks up the BanditAction, and for every INJECT_ZERO_REWARD outcome normalizes a zero reward and appends it to the arm's window before the original exception re-raises. _select already recorded the pull, so without this hook the reward window drifted behind total_pulls and the UCB1 confidence term shrank for the failing arm. _StructuredOutputRouter grows a failure_hook callback (orthogonal to the existing select_override) so bandit-wrapped structured-output dispatches go through the same path. BanditModelRouter.with_structured_output wires self._inject_failure_reward through.

Audit findings that surfaced during integration probes and got fixed in this PR: hook-side errors no longer mask the LLM exception (_safe_inject_failure_reward + _maybe_fire_failure_hook log at warning level); token-tracker errors on the success path no longer break the deferred-reward contract (_safe_track); the bandit's _select now sets the _selected_model_var ContextVar so get_selected_model() returns the right arm; _StructuredOutputRouter._process runs inside the try so token-tracker / parser exceptions after a successful model.invoke fire the failure hook; _process now raises parsing_error when langchain's include_raw=True flow surfaces a schema-validation failure as {parsed=None, parsing_error=<exc>} instead of raising (FU12); StageError.{type, message, stage, traceback} scrub UTF-16 surrogates; MigrantEnvelope.to_stream_fields recursively scrubs program_data; TokenUsage.from_response isinstance-guards every nested layer and coerces token counts via _coerce_int. Pre-_select exceptions verified clean (no pull, no reward). 32-way concurrent mixed-dispatch stress confirms total_pulls == 32 and window_size == failures. No double-injection across the agent/operator boundary (evolution engine drops failed mutations without persisting, so on_mutation_outcome never fires for the same call that already got an immediate zero reward).

Agent-layer integration tests (tests/llm/test_agent_bandit_integration.py) exercise the wiring against real MutationAgent / InsightsAgent / ScoringAgent / LineageAgent instances dispatching through a BanditModelRouter. Full target sweep (tests/llm tests/evolution/test_bandit.py tests/database tests/utils tests/dag tests/test_program.py tests/stages tests/prompts tests/trackers tests/infra tests/evolution/bus): 2532 tests pass.

One ambiguity is queued rather than decided unilaterally. The classifier currently maps every non-SUCCESS outcome to INJECT_ZERO_REWARD. For arm-fault outcomes (PARSE_FAILED, CONTENT_FILTERED, OUTPUT_TRUNCATED) this is correct. For operator/infra-fault outcomes (AUTH_FAILED, RATE_LIMITED, SERVER_5XX, NETWORK_ERROR, TIMEOUT, CONTEXT_OVERFLOW) the random arm that drew the storm is penalized even though it is blameless. The docstring acknowledges this and claims mean-reward self-corrects; in practice failures correlate with specific arms (one base_url, one model with a smaller window) so the assumption is violated. A future PR could extend BanditAction with a SKIP_LEDGER variant and remap operator/infra outcomes to it; that is a policy choice the maintainer should weigh. Queued as a follow-up rather than included here.

…rocess-side hook

* ``_StructuredOutputRouter.{,a}invoke`` now executes ``_process`` *inside*
  the try/except so a token-tracker error or a malformed structured response
  fires the failure_hook. Previously a transport-side success followed by a
  ``_process`` raise left the bandit ledger inflated for that arm with no
  matching window entry — exactly the desync the wiring was supposed to
  prevent. Documented inline.

* Adds tests covering the remaining audit items:
  - ``stream`` / ``astream`` failure dispatch (sync + async)
  - failure-hook error containment on ``invoke`` / ``ainvoke``
  - cause-chain / traceback preservation through the bandit re-raise
  - classifier-internal error containment
  - ``_StructuredOutputRouter._process`` failure firing the hook
  - tracker-side telemetry failure on the success path (sync + async)
  - ``_remember_selected_model`` propagation for bandit selection and for
    the structured-output ``_bandit_select`` override
  - end-to-end structured-output ledger symmetry (failure + success)
  - concurrent ``ainvoke`` failure ledger consistency

``_StructuredOutputRouter._maybe_fire_failure_hook`` used to swallow
hook exceptions silently (``except Exception: pass``). The hook is
observability-only, so re-raising would mask the real LLM failure —
but a *silent* swallow loses telemetry whenever the hook itself has a
bug, hiding bandit-side regressions. Replace ``pass`` with
``logger.warning(...)`` so the original exception still propagates,
yet a broken hook is visible in operator logs.

Audit item FusionBrainLab#2 from the PR FusionBrainLab#13 bug hunt.
If ``_bandit.select()`` raises before ``record_pull`` runs, no pull is
recorded and the try/except inside ``invoke``/``ainvoke`` never engages
to inject a zero reward — the ledger invariant ("pulls and rewards
stay in step") therefore holds vacuously. Codify this so a future
refactor that moves ``record_pull`` outside ``_select`` (or hoists the
try/except above ``_select``) doesn't break the invariant silently.

Audit item FusionBrainLab#3 from the PR FusionBrainLab#13 bug hunt — verification, no code change.

When the bandit-wrapped LLM call inside an agent fails, the mutation
operator catches the exception and constructs a StageError via
StageError.from_exception(). Provider exception text occasionally carries
lone UTF-16 surrogate code points (typically tokenizer artefacts that
emit half of an astral pair without its partner). They survive str(exc)
unchanged into StageError.message / StageError.traceback, then crash
the downstream Redis-write path with TypeError: str is not valid UTF-8:
surrogates not allowed when gigaevo.utils.json.dumps (= orjson) tries
to encode the program blob.

Add idempotent field validators that replace lone surrogates with U+FFFD
at the StageError boundary so every downstream consumer (orjson, asyncpg
TEXT, migration-bus payload) is safe regardless of where the StageError
travels next.

The fix is local to the StageError schema and does not modify the
exception-classifier behavior or any LLM-side code.

Coverage: 7 tests under tests/test_program.py — positive case (high and
low surrogates scrubbed, downstream dumps succeeds), negative cases
(real astral emoji preserved, validator idempotent on revalidation),
end-to-end via StageError.from_exception and Program.to_dict.

Adds two integration-level test classes that pin invariants the
classifier wiring depends on, neither of which was covered by the
existing per-method tests:

* TestBanditNoDoubleInjectionOnFailure verifies the boundary between
  the failure-path _inject_failure_reward (fires inside ainvoke) and
  the deferred on_mutation_outcome (fires only for persisted
  programs). A failed mutation never reaches storage, so
  on_program_ingested never runs, so the bandit window receives
  exactly one entry per failed call. A future refactor that
  persists a placeholder program for failed mutations would
  silently introduce double-injection without these tests.

* TestBanditConcurrentMixedDispatch stresses ainvoke under 32-way
  concurrent dispatch with a deterministic success/failure mix and
  asserts the ledger invariant total_pulls == N and
  window_size == failure_count. Catches regressions in the task-id
  keyed _task_model_map that would leak one task's outcome into
  another's reward window, plus pure all-fail and all-succeed
  shapes for boundary coverage.
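The concurrent ledger invariant can be demonstrated with a toy asyncio sketch; Ledger and dispatch are illustrative stand-ins, not the test module's actual fixtures.

```python
import asyncio

class Ledger:
    """Toy stand-in for the bandit ledger: total pulls vs. reward-window size."""
    def __init__(self):
        self.total_pulls = 0
        self.window = []

ledger = Ledger()

async def dispatch(should_fail):
    ledger.total_pulls += 1          # the pull is recorded up front
    try:
        if should_fail:
            raise TimeoutError("arm failed")
        return "ok"
    except TimeoutError:
        ledger.window.append(0.0)    # failure injects a matching zero reward
        raise

plan = [i % 2 == 0 for i in range(32)]   # deterministic success/failure mix

async def run_all():
    return await asyncio.gather(*(dispatch(f) for f in plan),
                                return_exceptions=True)

results = asyncio.run(run_all())
assert ledger.total_pulls == 32
assert len(ledger.window) == sum(plan)   # window_size == failure_count
```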

The migration bus serializes program_data via stdlib json.dumps, which
escapes lone UTF-16 surrogate code points as \\uD800 JSON literals. The
receiver json.loads them back into Python str form and eventually
attempts to persist the migrated Program through
gigaevo.utils.json.dumps (= orjson). orjson rejects surrogates with
TypeError: str is not valid UTF-8: surrogates not allowed, which would
crash the migration handler mid-restore.

When a Program migrates from process A (whose LLM call produced an
exception text with a tokenizer-artefact surrogate that propagated into
StageError.message via Program.to_dict, or carries an arm name with a
surrogate, or a code blob with one) to process B, the surrogate must be
sanitized at the publish boundary so every consumer downstream sees an
orjson-safe payload.

Add _scrub_surrogates that walks the JSON-shaped program_data tree and
replaces every lone surrogate with U+FFFD on str leaves. Also scrub
source_run_id and program_id directly. The transform is idempotent so
forward/retry paths that re-envelope a consumed message are safe.

Coverage: 6 tests under tests/evolution/bus/test_transport.py — top-level
program_data leaf, nested metadata + stage_results + tag list,
source_run_id/program_id, surrogate-free identity, real astral emoji
preserved, idempotency on republish.

Add an integration-level test module that exercises the agent layer
(`MutationAgent`, `InsightsAgent`, `LineageAgent`, `ScoringAgent`) on top
of a real `BanditModelRouter`. Existing coverage in
`tests/evolution/test_bandit.py` only checks the router in isolation
with synthetic mocks; the agent <-> router contract had no end-to-end
test.

The new module verifies that:

* On transport failure the bandit's failure hook fires exactly once
  even though `MutationAgent.acall_llm` swallows the exception into
  `state["error"]`, so the ledger stays in step
  (`total_pulls == window_size`).
* Repeated failures keep the ledger in step across many calls.
* On the success path the bandit defers the reward to
  `on_mutation_outcome` (window stays empty) and no double-injection
  happens in the agent layer.
* `InsightsAgent`/`LineageAgent`/`ScoringAgent` propagate failures out
  of `agent.arun` (they inherit `base.acall_llm`, which does not catch
  exceptions) while the bandit hook still fires before the exception
  escapes.

A regression that adds an internal retry/fallback wrapper to any agent
(e.g. `Runnable.with_retry` or `with_fallbacks`) would surface here as
a `window_size` mismatch.

The module also documents an out-of-slice gap (FU12): when
`_StructuredOutputRouter._process` receives the langchain
`include_raw=True` response shape with `parsed=None` and a non-None
`parsing_error`, it returns `None` without firing the failure hook,
leaving `total_pulls` incremented but `window_size` unchanged. The
fix belongs in `gigaevo/llm/models.py` and is queued for a follow-up.
The regression-lock test pins the current (broken) behaviour so the
follow-up can flip the assertion when the gap closes.

The extractor assumed every layer of the provider usage payload is a
dict with int counts:

  - response.response_metadata is a dict
  - response_metadata["token_usage"] (or "usage") is a dict
  - usage["completion_tokens_details"] is a dict
  - every *_tokens count is an int

A hostile or malformed provider that returns a string in place of any
nested dict crashed the extractor with AttributeError ("str has no get").
The bandit's _safe_track wrapper already swallows the failure, but
direct callers (MultiModelRouter.invoke in gigaevo/llm/models.py) have
no second-level guard, so an exception there propagates to the LLM call
site and loses the response.

Harden the extractor itself: isinstance-check every container layer
before .get(), and coerce every count through _coerce_int (which accepts
int/bool/float/str and defaults to 0 on anything else or on parse
failure). Token telemetry is observability-only and must always return
either a valid TokenUsage or None — never raise.

Coverage: 7 tests under tests/llm/test_llm_routing.py — string in place
of completion_tokens_details, string token_usage, string
response_metadata, string-encoded counts, None counts, float counts,
garbage string counts.

@GrigoryEvko GrigoryEvko changed the title feat(llm): add LLM call outcome classifier feat(llm): add LLM call outcome classifier / state machine May 15, 2026
BanditModelRouter.with_structured_output passes include_raw=True so
the underlying langchain wrapper returns {raw, parsed, parsing_error}.
On schema-validation failure langchain populates parsing_error and
sets parsed=None instead of raising. _StructuredOutputRouter._process
returned response.get('parsed') silently and the bandit's failure_hook
never fired — the pull was recorded by _select but the reward window
never got a matching entry, so the UCB1 confidence term shrank for
the failing arm without the bandit knowing why.

_process now detects parsed=None plus a non-None parsing_error and
re-raises the parsing_error so the existing try/except in
invoke/ainvoke routes through _maybe_fire_failure_hook normally.
The degenerate case where both parsed and parsing_error are None
passes through unchanged (caller handles the None).

The previous regression-lock test (which pinned the broken behavior
for future detection) is flipped to assert the fixed behavior. Two
additional tests cover the success path (no hook fire) and the
empty-result degenerate case.
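The detection logic fits in a few lines; process_structured is an illustrative stand-in for _StructuredOutputRouter._process, and the response dict mirrors the {raw, parsed, parsing_error} shape langchain returns with include_raw=True.

```python
def process_structured(response):
    """Raise the parsing_error when schema validation failed silently;
    pass the degenerate all-None case through unchanged."""
    parsed = response.get("parsed")
    parsing_error = response.get("parsing_error")
    if parsed is None and parsing_error is not None:
        raise parsing_error     # route through the existing failure path
    return parsed               # may be None in the degenerate case

assert process_structured({"parsed": {"x": 1}, "parsing_error": None}) == {"x": 1}
assert process_structured({"parsed": None, "parsing_error": None}) is None

raised = False
try:
    process_structured({"parsed": None, "parsing_error": ValueError("bad schema")})
except ValueError:
    raised = True
assert raised   # the hook-firing try/except in invoke/ainvoke now engages
```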

…m_fields"

This reverts commit 7fbf680.

The defence is now redundant against PR FusionBrainLab#10
(fix/llm-output-sanitization), which introduces
``gigaevo.utils.text_sanitize.deep_sanitize_for_json`` and applies it
at the same migration-bus boundary with broader coverage (ANSI / BIDI
/ control characters, not only surrogates).  Keeping the local
``_scrub_str`` / ``_scrub_surrogates`` helpers here forces a merge
conflict with FusionBrainLab#10 regardless of merge order; removing them lets either
PR merge independently against ``main``.

If FusionBrainLab#10 is rejected, this commit should be reverted (or the original
content cherry-picked back) to restore the defence on the bus.
… fields"

This reverts commit a25678f.

The defence is now redundant against PR FusionBrainLab#10
(fix/llm-output-sanitization), which adds ``sanitize_for_log`` field
validators on ``StageError`` (covering ANSI / BIDI / control
characters / surrogates).  Keeping the local ``_scrub_surrogates``
validators here forces a merge conflict with FusionBrainLab#10 regardless of merge
order; removing them lets either PR merge independently against
``main``.

If FusionBrainLab#10 is rejected, this commit should be reverted (or the original
content cherry-picked back) to restore the defence on
``StageError``.

@GrigoryEvko

Conflict map vs PR #10 (fix/llm-output-sanitization)

After two pre-emptive reverts on this branch (commits 0f1a0d2, 9926967), the conflict footprint against PR #10 is now:

File: gigaevo/llm/bandit.py
Hunks: 1 (import block at L25)
Type: disjoint imports — clean union

Resolution: take the union of both import sets:

from gigaevo.llm.call_outcome import BanditAction, classify_call_result
from gigaevo.llm.models import (
    MultiModelRouter,
    _remember_selected_model,
    _StructuredOutputRouter,
)
from gigaevo.utils.text_sanitize import sanitize_for_log

PR #10 contributes the last import; this PR contributes the first two. No structural conflict — merge order is interchangeable.

The two reverts on this branch undo local _scrub_str / _scrub_surrogates helpers in transport.py + _scrub_surrogates validators in core_types.py that PR #10 supersedes with its gigaevo.utils.text_sanitize centralised helpers. If PR #10 is rejected, both reverts should be reverted back (or the original surrogate-scrubbing content cherry-picked).
