feat(llm): add LLM call outcome classifier / state machine #13
GrigoryEvko wants to merge 14 commits into
Conversation
A pure total function classify_call_result(exc, *, model_name=None) that maps any BaseException | None to a frozen LLMCallResult carrying a 12-variant LLMCallOutcome (SUCCESS plus eleven failure variants verified against installed sources for openai 2.36.0, httpx 0.28.1, langchain-openai 1.2.1, langchain-core 1.4.0). Cause-chain walk with cycle protection plus priority ranking so a wrapper exception cannot mask a more informative root (a langchain_core.exceptions.OutputParserException whose __cause__ is openai.RateLimitError lands on RATE_LIMITED). Pairwise-disjoint rule tables for MRO-name matching, asserted at module load. Action mapping that the bandit will consume in a follow-up PR (SUCCESS defers to on_mutation_outcome; every failure outcome maps to INJECT_ZERO_REWARD). Adds four audit-discovered guards inside the classifier so it can be merged without external sanitizer dependency: lone UTF-16 surrogates in the result message are replaced with U+FFFD so downstream UTF-8 encoders and JSON serializers do not crash; Retry-After headers with non-finite or excessive values (inf, nan, overflow strings, values past 24 hours) are rejected as None rather than propagating poisonous floats into the bandit's sleep budget; type(exc).__module__ and __name__ are coerced through str defensively so a metaclass that sets a non-str module no longer breaks classifier totality; _safe_getattr suppresses non-AttributeError exceptions raised by hostile property/descriptor implementations. Eighty unit tests against real provider exceptions plus the four new regression guards. No live caller consumes the classifier yet (bandit rewire is a separate follow-up); this is module-only with no behavior change.
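The cause-chain walk with cycle protection and priority ranking described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `Outcome`, `walk_cause_chain`, and `classify` are hypothetical stand-ins for `LLMCallOutcome` and the internals of `classify_call_result`, and the rule table here keys on bare class names rather than the PR's MRO fingerprints.

```python
from enum import IntEnum

class Outcome(IntEnum):
    """Illustrative stand-in for LLMCallOutcome; a lower value means a
    higher priority, so a root-cause RATE_LIMITED beats a wrapper
    PARSE_FAILED."""
    RATE_LIMITED = 0
    PARSE_FAILED = 1
    OTHER = 2

def walk_cause_chain(exc):
    """Yield exc and its __cause__/__context__ chain, surface exception
    first, with identity-based cycle protection so self-referential
    chains terminate."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        yield exc
        exc = exc.__cause__ if exc.__cause__ is not None else exc.__context__

def classify(exc, rules):
    """Classify every link in the chain and keep the highest-priority
    outcome, so a wrapper exception cannot mask a more informative root."""
    best = Outcome.OTHER
    for link in walk_cause_chain(exc):
        best = min(best, rules.get(type(link).__name__, Outcome.OTHER))
    return best
```

With a rule table mapping, say, `OutputParserException` to `PARSE_FAILED` and `RateLimitError` to `RATE_LIMITED`, a parser exception whose `__cause__` is a rate-limit error lands on `RATE_LIMITED` because the priority fold prefers the root.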
Combined audit pass against the classifier added in the previous commit, addressing both totality and coverage gaps: Totality contract guards. __cause__ / __context__ / __mro__ access is funnelled through _safe_getattr so a metaclass or subclass that overrides any of them with a property raising non-AttributeError cannot break classifier totality. A non-BaseException value returned from a property-override path terminates the cause walk rather than poisoning cycle detection; a raising __mro__ collapses to an empty fingerprint set so the MRO branches naturally skip and the exception falls through to the status-code / isinstance / OTHER paths. Coverage of litellm exception classes that subclass openai BadRequestError but carry semantics that map to a more informative outcome: ContextWindowExceededError now lands on CONTEXT_OVERFLOW (was BAD_REQUEST) and ContentPolicyViolationError on CONTENT_FILTERED (was BAD_REQUEST). Both verified against installed litellm 1.x classes. Documentation. LLMCallResult field docstrings expanded to spell out the cause_chain ordering (index 0 is the surface exception, walking toward roots), the retry_after_seconds tri-state semantics that PR-B will branch on (None means no hint, 0.0 means retry immediately, positive means sleep before retry), and the PR-B integration contract. Tests grow from 80 to 89: hostile cause / context / mro property guards, BaseException subclass totality, litellm class fingerprints, cause_chain ordering invariant, retry-after tri-state, per-outcome action lookup, JSON round-trip mirroring the PR-B telemetry path.
The classifier's job is to parse Retry-After into a finite non-negative float and surface it on LLMCallResult. The upper bound was a policy choice without operational evidence — no provider in the wild sends Retry-After past one hour even at abuse-protection limits, and a legitimate two-hour backoff was being silently mapped to None (no hint) instead of being passed through. The consumer (a future bandit retry adapter) is the right arbiter of a maximum acceptable sleep; the classifier just needs to keep non-finite garbage out of the field.
BanditModelRouter overrides invoke/ainvoke to wrap the LLM dispatch in try/except. On failure the new _inject_failure_reward helper classifies the exception via classify_call_result and looks up the BanditAction; INJECT_ZERO_REWARD outcomes (every non-SUCCESS variant) normalize a zero reward and append it to the arm's window before the original exception re-raises. Without this, _select's pull recording already happened but the reward window never got a matching entry, so the UCB1 confidence term shrank for the failing arm and the bandit underexplored flaky models. _StructuredOutputRouter grows a failure_hook callback (orthogonal to the existing select_override) so bandit-wrapped structured-output dispatches go through the same path. BanditModelRouter.with_structured_output wires self._inject_failure_reward through. Six new tests cover sync failure, async failure, repeated failures keeping pulls and rewards in step, success-path-does-not-inject, structured-output failure routing through the same hook, and a direct test of the unknown-arm defensive skip in _inject_failure_reward.
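The pulls-and-rewards bookkeeping this fixes can be shown with a toy model. `TinyBandit` and `inject_failure_reward` below are illustrative stand-ins for the real bandit and `_inject_failure_reward`, under the assumption (stated in the PR) that the pull is recorded at selection time and the failure path appends a zero reward before re-raising:

```python
class _Arm:
    def __init__(self):
        self.pulls = 0
        self.rewards = []  # reward window

class TinyBandit:
    """Minimal stand-in for the real bandit ledger."""
    def __init__(self, arm_names):
        self.arms = {name: _Arm() for name in arm_names}

    def record_pull(self, name):
        self.arms[name].pulls += 1

def inject_failure_reward(bandit, arm_name):
    """On a failed LLM call, normalize a zero reward so the earlier
    record_pull has a matching window entry. Unknown arms are skipped
    defensively (selection map and bandit out of sync) rather than
    raising a second error on the failure path."""
    arm = bandit.arms.get(arm_name)
    if arm is None:
        return
    arm.rewards.append(0.0)
```

The invoke wrapper would call `inject_failure_reward` inside its `except` block and then re-raise the original exception, keeping `pulls == len(rewards)` for failing arms.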
Part B (commits
Audit findings that surfaced during integration probes and got fixed in this PR: hook-side errors no longer mask the LLM exception, and agent-layer integration tests were added. One ambiguity is queued rather than decided unilaterally: the classifier currently maps every non-SUCCESS outcome to INJECT_ZERO_REWARD.
…rocess-side hook
* ``_StructuredOutputRouter.{,a}invoke`` now executes ``_process`` *inside*
the try/except so a token-tracker error or a malformed structured response
fires the failure_hook. Previously a transport-side success followed by a
``_process`` raise left the bandit ledger inflated for that arm with no
matching window entry — exactly the desync the wiring was supposed to
prevent. Documented inline.
* Adds tests covering the remaining audit items:
- ``stream`` / ``astream`` failure dispatch (sync + async)
- failure-hook error containment on ``invoke`` / ``ainvoke``
- cause-chain / traceback preservation through the bandit re-raise
- classifier-internal error containment
- ``_StructuredOutputRouter._process`` failure firing the hook
- tracker-side telemetry failure on the success path (sync + async)
- ``_remember_selected_model`` propagation for bandit selection and for
the structured-output ``_bandit_select`` override
- end-to-end structured-output ledger symmetry (failure + success)
- concurrent ``ainvoke`` failure ledger consistency
``_StructuredOutputRouter._maybe_fire_failure_hook`` used to swallow hook exceptions silently (``except Exception: pass``). The hook is observability-only, so re-raising would mask the real LLM failure — but a *silent* swallow loses telemetry whenever the hook itself has a bug, hiding bandit-side regressions. Replace ``pass`` with ``logger.warning(...)`` so the original exception still propagates, yet a broken hook is visible in operator logs. Audit item FusionBrainLab#2 from the PR FusionBrainLab#13 bug hunt.
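The containment shape described above — swallow but log, never mask the real failure — can be sketched as follows. This is an illustration of the pattern, not the PR's exact code:

```python
import logging

logger = logging.getLogger(__name__)

def maybe_fire_failure_hook(hook, exc):
    """Fire the observability-only failure hook. A broken hook must not
    mask the real LLM failure (so we never re-raise from here), but it
    must be visible in operator logs rather than silently swallowed."""
    if hook is None:
        return
    try:
        hook(exc)
    except Exception:
        logger.warning(
            "failure_hook raised; original LLM error still propagates",
            exc_info=True,
        )
```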
If ``_bandit.select()`` raises before ``record_pull`` runs, no pull is
recorded and the try/except inside ``invoke``/``ainvoke`` never engages
to inject a zero reward — the ledger invariant ("pulls and rewards
stay in step") therefore holds vacuously. Codify this so a future
refactor that moves ``record_pull`` outside ``_select`` (or hoists the
try/except above ``_select``) doesn't break the invariant silently.
Audit item FusionBrainLab#3 from the PR FusionBrainLab#13 bug hunt — verification, no code change.
When the bandit-wrapped LLM call inside an agent fails, the mutation operator catches the exception and constructs a StageError via StageError.from_exception(). Provider exception text occasionally carries lone UTF-16 surrogate code points (typically tokenizer artefacts that emit half of an astral pair without its partner). They survive str(exc) unchanged into StageError.message / StageError.traceback, then crash the downstream Redis-write path with TypeError: str is not valid UTF-8: surrogates not allowed when gigaevo.utils.json.dumps (= orjson) tries to encode the program blob. Add idempotent field validators that replace lone surrogates with U+FFFD at the StageError boundary so every downstream consumer (orjson, asyncpg TEXT, migration-bus payload) is safe regardless of where the StageError travels next. The fix is local to the StageError schema and does not modify the exception-classifier behavior or any LLM-side code. Coverage: 7 tests under tests/test_program.py — positive case (high and low surrogates scrubbed, downstream dumps succeeds), negative cases (real astral emoji preserved, validator idempotent on revalidation), end-to-end via StageError.from_exception and Program.to_dict.
Adds two integration-level test classes that pin invariants the classifier wiring depends on, neither of which was covered by the existing per-method tests:

* `TestBanditNoDoubleInjectionOnFailure` verifies the boundary between the failure-path `_inject_failure_reward` (fires inside `ainvoke`) and the deferred `on_mutation_outcome` (fires only for persisted programs). A failed mutation never reaches storage, so `on_program_ingested` never runs, so the bandit window receives exactly one entry per failed call. A future refactor that persists a placeholder program for failed mutations would silently introduce double-injection without these tests.
* `TestBanditConcurrentMixedDispatch` stresses `ainvoke` under 32-way concurrent dispatch with a deterministic success/failure mix and asserts the ledger invariant `total_pulls == N` and `window_size == failure_count`. Catches regressions in the task-id keyed `_task_model_map` that would leak one task's outcome into another's reward window, plus pure all-fail and all-succeed shapes for boundary coverage.
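The concurrent ledger invariant can be demonstrated in miniature. `Ledger`, `dispatch`, and `run_mixed` below are illustrative stand-ins, not the test module's code; the assumption (from the PR) is that the pull is recorded at selection time and only failures inject a window entry:

```python
import asyncio

class Ledger:
    """Minimal stand-in for the bandit ledger: one pull per dispatch,
    one zero-reward window entry per failure (success rewards defer to
    on_mutation_outcome and never land here)."""
    def __init__(self):
        self.total_pulls = 0
        self.window = []

async def dispatch(ledger, fail):
    ledger.total_pulls += 1  # _select records the pull before the call
    try:
        if fail:
            raise RuntimeError("transport failure")
        return "ok"
    except RuntimeError:
        ledger.window.append(0.0)  # failure path injects zero reward
        raise

async def run_mixed(n=32):
    ledger = Ledger()
    flags = [i % 3 == 0 for i in range(n)]  # deterministic mix
    await asyncio.gather(*(dispatch(ledger, f) for f in flags),
                         return_exceptions=True)
    # ledger invariant under concurrent dispatch
    assert ledger.total_pulls == n
    assert len(ledger.window) == sum(flags)
    return ledger

asyncio.run(run_mixed())
```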
The migration bus serializes program_data via stdlib json.dumps, which escapes lone UTF-16 surrogate code points as \\uD800 JSON literals. The receiver json.loads them back into Python str form and eventually attempts to persist the migrated Program through gigaevo.utils.json.dumps (= orjson). orjson rejects surrogates with TypeError: str is not valid UTF-8: surrogates not allowed, which would crash the migration handler mid-restore. When a Program migrates from process A (whose LLM call produced an exception text with a tokenizer-artefact surrogate that propagated into StageError.message via Program.to_dict, or carries an arm name with a surrogate, or a code blob with one) to process B, the surrogate must be sanitized at the publish boundary so every consumer downstream sees an orjson-safe payload. Add _scrub_surrogates that walks the JSON-shaped program_data tree and replaces every lone surrogate with U+FFFD on str leaves. Also scrub source_run_id and program_id directly. The transform is idempotent so forward/retry paths that re-envelope a consumed message are safe. Coverage: 6 tests under tests/evolution/bus/test_transport.py — top-level program_data leaf, nested metadata + stage_results + tag list, source_run_id/program_id, surrogate-free identity, real astral emoji preserved, idempotency on republish.
Add an integration-level test module that exercises the agent layer (`MutationAgent`, `InsightsAgent`, `LineageAgent`, `ScoringAgent`) on top of a real `BanditModelRouter`. Existing coverage in `tests/evolution/test_bandit.py` only checks the router in isolation with synthetic mocks; the agent <-> router contract had no end-to-end test. The new module verifies that:

* On transport failure the bandit's failure hook fires exactly once even though `MutationAgent.acall_llm` swallows the exception into `state["error"]`, so the ledger stays in step (`total_pulls == window_size`).
* Repeated failures keep the ledger in step across many calls.
* On the success path the bandit defers the reward to `on_mutation_outcome` (window stays empty) and no double-injection happens in the agent layer.
* `InsightsAgent`/`LineageAgent`/`ScoringAgent` propagate failures out of `agent.arun` (they inherit `base.acall_llm`, which does not catch exceptions) while the bandit hook still fires before the exception escapes.

A regression that adds an internal retry/fallback wrapper to any agent (e.g. `Runnable.with_retry` or `with_fallbacks`) would surface here as a `window_size` mismatch. The module also documents an out-of-slice gap (FU12): when `_StructuredOutputRouter._process` receives the langchain `include_raw=True` response shape with `parsed=None` and a non-None `parsing_error`, it returns `None` without firing the failure hook, leaving `total_pulls` incremented but `window_size` unchanged. The fix belongs in `gigaevo/llm/models.py` and is queued for a follow-up. The regression-lock test pins the current (broken) behaviour so the follow-up can flip the assertion when the gap closes.
The extractor assumed every layer of the provider usage payload is a
dict with int counts:
- response.response_metadata is a dict
- response_metadata["token_usage"] (or "usage") is a dict
- usage["completion_tokens_details"] is a dict
- every *_tokens count is an int
A hostile or malformed provider that returns a string in place of any
nested dict crashed the extractor with AttributeError ("str has no get").
The bandit's _safe_track wrapper already swallows the failure, but
direct callers (MultiModelRouter.invoke in gigaevo/llm/models.py) have
no second-level guard, so an exception there propagates to the LLM call
site and loses the response.
Harden the extractor itself: isinstance-check every container layer
before .get(), and coerce every count through _coerce_int (which accepts
int/bool/float/str and defaults to 0 on anything else or on parse
failure). Token telemetry is observability-only and must always return
either a valid TokenUsage or None — never raise.
Coverage: 7 tests under tests/llm/test_llm_routing.py — string in place
of completion_tokens_details, string token_usage, string
response_metadata, string-encoded counts, None counts, float counts,
garbage string counts.
BanditModelRouter.with_structured_output passes include_raw=True so
the underlying langchain wrapper returns {raw, parsed, parsing_error}.
On schema-validation failure langchain populates parsing_error and
sets parsed=None instead of raising. _StructuredOutputRouter._process
returned response.get('parsed') silently and the bandit's failure_hook
never fired — the pull was recorded by _select but the reward window
never got a matching entry, so the UCB1 confidence term shrank for
the failing arm without the bandit knowing why.
_process now detects parsed=None plus a non-None parsing_error and
re-raises the parsing_error so the existing try/except in
invoke/ainvoke routes through _maybe_fire_failure_hook normally.
The degenerate case where both parsed and parsing_error are None
passes through unchanged (caller handles the None).
The previous regression-lock test (which pinned the broken behavior
for future detection) is flipped to assert the fixed behavior. Two
additional tests cover the success path (no hook fire) and the
empty-result degenerate case.
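The fixed `_process` logic reduces to the following sketch (illustrative; `process_structured` is a hypothetical standalone name, and langchain's `include_raw=True` shape `{raw, parsed, parsing_error}` is as described above):

```python
def process_structured(response):
    """With include_raw=True, langchain returns
    {'raw': ..., 'parsed': ..., 'parsing_error': ...}. On
    schema-validation failure parsed is None and parsing_error holds
    the exception; re-raise it so the caller's try/except routes
    through the failure hook. The degenerate both-None case passes
    through unchanged (caller handles the None)."""
    if not isinstance(response, dict):
        return response
    parsed = response.get("parsed")
    parsing_error = response.get("parsing_error")
    if parsed is None and parsing_error is not None:
        raise parsing_error  # assumed to be a BaseException instance
    return parsed
```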
The bandit and router stack catches LLM call exceptions today with bare try/except ladders that each subsystem rolls separately. There is no shared vocabulary for distinguishing a `RATE_LIMITED` upstream from a `PARSE_FAILED` arm output, no shared dispatch for "fail the pull on the bandit" versus "wait for the mutation outcome", and no shared guard against surprising failure modes — a `langchain_core.exceptions.OutputParserException` whose `__cause__` is `openai.RateLimitError` should classify as the cause, not the wrapper; a `Retry-After: inf` header should not propagate as `math.inf` into a sleep budget; a metaclass with a non-str `__module__` or a raising `__mro__` property should not break classifier totality.

This PR adds `gigaevo/llm/call_outcome.py`, a pure total function `classify_call_result(exc, *, model_name=None)` that maps any `BaseException | None` to a frozen `LLMCallResult` carrying a 12-variant `LLMCallOutcome`. The classifier walks the `__cause__`/`__context__` chain with cycle protection and collapses outcomes through a priority ordering so wrapped errors classify by root cause. Rule tables (`_CONTEXT_OVERFLOW_MRO_NAMES`, `_OUTPUT_TRUNCATED_MRO_NAMES`, `_CONTENT_FILTER_MRO_NAMES`, `_TIMEOUT_MRO_NAMES`, `_NETWORK_MRO_NAMES`, `_PARSE_MRO_NAMES`) match against the full MRO chain and are asserted pairwise-disjoint at module load. `OUTCOME_ACTION` is asserted closed over `LLMCallOutcome`. Class names were verified against installed openai 2.36.0, httpx 0.28.1, langchain-openai 1.2.1, langchain-core 1.4.0, and litellm 1.x (which contributes `ContextWindowExceededError` and `ContentPolicyViolationError`, both `BadRequestError` subclasses that would otherwise lose their semantics to `BAD_REQUEST`).

Defensive guards are kept tight against confirmed defects, not as theater. Lone UTF-16 surrogates in `LLMCallResult.message` are replaced with U+FFFD so downstream UTF-8 encoders and JSON serializers do not crash. `Retry-After` is rejected when non-finite (`inf`, `nan`, numbers that overflow `float`); any finite non-negative value passes through, and the consumer is the arbiter of a maximum acceptable sleep. `type(exc).__module__` and `__name__` are coerced through `str` so a metaclass setting a non-str module no longer breaks totality. `__cause__`/`__context__`/`__mro__` access is funnelled through `_safe_getattr` so a property override raising non-`AttributeError` cannot propagate; a non-`BaseException` value returned from such a property terminates the cause walk, and a raising `__mro__` collapses to an empty fingerprint set. Eighty-nine unit tests against real provider exceptions plus regression guards for every fixed defect. No live caller consumes the classifier yet; the bandit rewire that operationalizes `OUTCOME_ACTION` is a separate follow-up.