feat(proxy): add disable_daily_spend_aggregation config flag to prevent Redis OOM#28802

Draft
yassin-berriai wants to merge 18 commits into litellm_oss_branch from litellm_fix/disable-daily-spend-aggregation

@yassin-berriai
Contributor

Summary

Adds a new disable_daily_spend_aggregation: true flag to general_settings that completely suppresses writes to the daily spend aggregation tables (DailyUser/Team/Tag/Org/EndUser/Agent) and prevents the associated Redis buffer keys from growing.

Root cause (LIT-3332): disable_spend_logs: true only suppresses per-request LiteLLM_SpendLogs rows. The daily aggregation path runs unconditionally — even for customers who never use the Usage dashboard. On high-throughput deployments the buffer keys (litellm_daily_tag_spend_update_buffer, litellm_daily_team_spend_update_buffer, etc.) grow until Redis OOMs.

What changes:

  • New _enqueue_daily_spend_updates() helper on DBSpendUpdateWriter consolidates the six daily-queue enqueue calls and returns immediately when the flag is set — nothing reaches the Redis buffers.
  • The dedicated update_daily_tag_spend_job APScheduler job is skipped at startup when the flag is set.
  • Key/user/team balance updates (LiteLLM_KeyTable, UserTable, TeamTable, rate-limiting) are unaffected — they go through a completely separate path.
  • ConfigGeneralSettings Pydantic model updated so the flag is documented and validated.
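
The early-return guard described in the first bullet can be sketched as follows. This is a minimal illustration, not the code in the PR: `DailySpendWriterSketch`, its in-memory queues, and `update_database` are hypothetical stand-ins for `DBSpendUpdateWriter` and its Redis-backed buffers; only the flag name and the helper name `_enqueue_daily_spend_updates` come from the description above.

```python
import asyncio
from typing import Any, Dict, List


class DailySpendWriterSketch:
    """Hypothetical stand-in for DBSpendUpdateWriter (not the real class)."""

    def __init__(self, general_settings: Dict[str, Any]) -> None:
        self.general_settings = general_settings
        # In-memory stand-ins for the six daily queues named in the summary.
        self.daily_queues: Dict[str, List[dict]] = {
            name: [] for name in ("user", "team", "tag", "org", "end_user", "agent")
        }
        # Stand-in for the separate key/user/team balance update path.
        self.balance_updates: List[dict] = []

    def _daily_aggregation_disabled(self) -> bool:
        # Only an explicit `true` disables aggregation; absent or false keeps it on.
        return self.general_settings.get("disable_daily_spend_aggregation") is True

    async def _enqueue_daily_spend_updates(self, payload: dict) -> None:
        # Return before touching any queue so nothing reaches the buffers.
        if self._daily_aggregation_disabled():
            return
        for queue in self.daily_queues.values():
            queue.append(payload)

    async def update_database(self, payload: dict) -> None:
        # Balance updates always run; only the daily path is guarded.
        self.balance_updates.append(payload)
        await self._enqueue_daily_spend_updates(payload)


writer = DailySpendWriterSketch({"disable_daily_spend_aggregation": True})
asyncio.run(writer.update_database({"spend": 0.1}))
```

Keeping the guard inside a single helper means a call site cannot enqueue to one of the six queues while skipping the check.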

Usage:

general_settings:
  disable_daily_spend_aggregation: true   # prevents Redis OOM from daily buffer keys
  # disable_spend_logs: true              # optional: also suppress per-request SpendLogs

Behavioral test matrix

| Scenario | Daily queues enqueued? | Balance updates (key/user/team) still run? |
|---|---|---|
| disable_daily_spend_aggregation absent (default) | ✅ yes | ✅ yes |
| disable_daily_spend_aggregation: false (explicit) | ✅ yes | ✅ yes |
| disable_daily_spend_aggregation: true | ❌ no | ✅ yes |
| ProxyUpdateSpend.disable_daily_spend_aggregation(): flag absent | returns False | |
| ProxyUpdateSpend.disable_daily_spend_aggregation(): flag True | returns True | |
| ProxyUpdateSpend.disable_daily_spend_aggregation(): flag False | returns False | |
| Scheduler: flag absent → update_daily_tag_spend_job registered | ✅ registered | |
| Scheduler: flag True → job skipped | ❌ not registered | |
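
The flag-reading and scheduler rows of the matrix can be expressed as a small sketch. The function names and the `other_job` placeholder below are hypothetical; the real check lives in `ProxyUpdateSpend.disable_daily_spend_aggregation()` and the real registration happens in `proxy_server.py` through APScheduler, neither of which is reproduced here.

```python
from typing import Any, Dict, List, Optional


def daily_spend_aggregation_disabled(
    general_settings: Optional[Dict[str, Any]],
) -> bool:
    """Return True only when the flag is explicitly set to true."""
    if not general_settings:
        return False
    return general_settings.get("disable_daily_spend_aggregation") is True


def jobs_to_register(general_settings: Optional[Dict[str, Any]]) -> List[str]:
    """List the job names a startup routine would register (illustrative)."""
    jobs = ["other_job"]  # hypothetical placeholder for unrelated jobs
    if not daily_spend_aggregation_disabled(general_settings):
        jobs.append("update_daily_tag_spend_job")
    return jobs


print(jobs_to_register({"disable_daily_spend_aggregation": True}))
```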

All 7 unit tests pass (tests/proxy_unit_tests/test_disable_daily_spend_aggregation.py).
All 14 pre-existing spend tests pass unchanged.
Ruff lint: no errors.

Files changed

| File | Change |
|---|---|
| litellm/proxy/db/db_spend_update_writer.py | Extract _enqueue_daily_spend_updates() helper; guard with flag |
| litellm/proxy/utils.py | Add ProxyUpdateSpend.disable_daily_spend_aggregation() static method |
| litellm/proxy/proxy_server.py | Skip update_daily_tag_spend_job registration when flag is set |
| litellm/proxy/_types.py | Add field to ConfigGeneralSettings |
| tests/proxy_unit_tests/test_disable_daily_spend_aggregation.py | New test file (7 tests) |

Resolves LIT-3332

https://claude.ai/code/session_01VSh9vug6wHMkBWZ76MjdTt


Generated by Claude Code

mateo-berri and others added 15 commits May 23, 2026 12:15
… model catalog) (#28223)

* fix(opentelemetry): JSON-serialize dict metadata fields for OTEL span attributes (#27451) (#27455)

Squash-merged by litellm-agent from Anai-Guo's PR.

* feat(dashscope): add embeddings and reranks(qwen3-rerank) support via OpenAI-compatible endpoint (#27508)

Squash-merged by litellm-agent from yimao's PR.

* fix(vertex_ai/gemini): raise BadRequestError when image_url or url fi… (#24550)

Squash-merged by litellm-agent from krisxia0506's PR.

* fix(vertex_ai): raise error on mid-stream 429/error chunks instead of silently swallowing (#23711)

Squash-merged by litellm-agent from krisxia0506's PR.

* fix: raise BadRequestError for file content blocks missing 'file' sub… (#24503)

Squash-merged by litellm-agent from krisxia0506's PR.

* Fix Gemini MIME detection for extensionless GCS URIs (#27278)

Squash-merged by litellm-agent from krisxia0506's PR.

* fix(vertex_ai/partner_models): drop unused vertexai SDK gate from count_tokens (closes #28084) (#28107)

Squash-merged by litellm-agent from voidborne-d's PR.

* feat(chart): add support for autoscaling behavior in HPA (#27990)

Squash-merged by litellm-agent from FabrizioCafolla's PR.

* feat(proxy): add blocked flag to models for pause/resume from the UI (#27927)

Squash-merged by litellm-agent from Cyberfilo's PR.

* fix: pass socket timeouts to Redis cluster clients (#27920)

Squash-merged by litellm-agent from tomdee's PR.

* Fix/cache token (#28009)

Squash-merged by litellm-agent from escon1004's PR.

* fix(deepseek): forward reasoning_content in multi-turn thinking mode conversations (#28080)

Squash-merged by litellm-agent from Divyansh8321's PR.

* fix(guardrails): return HTTP 400 instead of 500 for blocked requests (#27617)

* fix: reset org and tag budgets (#27326)

* reset org budgets

* reset tag budgets

---------

Co-authored-by: Michael Riad Zaky <michaelr@Mac.localdomain>

* fix(ui): omit allowed_routes from key edit save when unchanged (#27553)

* fix(ui): omit allowed_routes from key edit save when unchanged

When a team admin opens Edit Settings on a key with key_type=AI APIs and
saves without changing anything, the UI re-sends the existing allowed_routes
value, which the backend's _check_allowed_routes_caller_permission gate
rejects for non-proxy-admins (LIT-2681).

Strip allowed_routes from the patch in handleSubmit when it deep-equals the
original keyData.allowed_routes. The backend treats absence as "leave alone,"
so no-op saves now succeed for non-admins. Admins explicitly editing the
field still send the new value.

* fix(ui): order-insensitive allowed_routes diff + cover null-original case

Address Greptile review:

- Switch the "is allowed_routes unchanged" check to a Set-based comparison so
  a server-side reorder of the array doesn't register as a user edit and
  re-trigger LIT-2681.
- Add two regression tests: (1) keyData.allowed_routes is null and the form
  is untouched — patch should strip the field; (2) server returned routes in
  a different order than the user originally entered — patch should still
  recognize the value as unchanged.
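
The order-insensitive check can be illustrated in Python, although the actual fix is in the TypeScript UI. `strip_unchanged_allowed_routes` is a hypothetical name, and treating a null original as an empty list is an assumption made for this sketch.

```python
from typing import Any, Dict, List, Optional


def strip_unchanged_allowed_routes(
    patch: Dict[str, Any], original: Optional[List[str]]
) -> Dict[str, Any]:
    """Drop allowed_routes from the patch when it matches the original as a set."""
    if "allowed_routes" not in patch:
        return patch
    submitted = set(patch["allowed_routes"] or [])
    # Assumption: a null original is equivalent to "no routes configured".
    if submitted == set(original or []):
        return {k: v for k, v in patch.items() if k != "allowed_routes"}
    return patch


patch = {"key_alias": "demo", "allowed_routes": ["/chat/completions", "/embeddings"]}
print(strip_unchanged_allowed_routes(patch, ["/embeddings", "/chat/completions"]))
```

A set comparison also ignores duplicate entries, which is acceptable here because a repeated route grants nothing new.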

* chore(ui): strip ticket refs and tighten comments in key edit fix

- Remove internal-tracker references from in-code comments
- Tighten the WHY comment in handleSubmit to two lines
- Drop redundant test-block comments — test names already describe the case

* fix(ui): annotate Set<string> generic in allowed_routes diff to fix tsc

* fix(guardrails): return HTTP 400 instead of 500 for guardrail-blocked requests

GuardrailRaisedException and BlockedPiiEntityError both lacked a
status_code attribute.  When these exceptions reached the proxy
exception handler (getattr(e, 'status_code', 500)), the fallback
defaulted to HTTP 500 — making intentional guardrail blocks
indistinguishable from server errors and causing unnecessary client
retries.

Changes:
- Add status_code=400 (keyword-only) to GuardrailRaisedException
- Add status_code=400 (keyword-only) to BlockedPiiEntityError
- Update _is_guardrail_intervention() to recognize both exceptions
  so downstream loggers record 'guardrail_intervened' instead of
  'guardrail_failed_to_respond'
- Add 6 unit tests for default/custom status codes and getattr pattern
- Strengthen existing blocked-action test with status_code assertion
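
The `getattr` fallback described above is easy to reproduce in isolation. The sketch below uses a hypothetical `GuardrailBlockedSketch` class rather than the real `GuardrailRaisedException`, and `resolve_status_code` stands in for the proxy exception handler.

```python
class GuardrailBlockedSketch(Exception):
    """Hypothetical exception carrying a keyword-only status_code."""

    def __init__(self, message: str, *, status_code: int = 400) -> None:
        super().__init__(message)
        self.status_code = status_code


def resolve_status_code(exc: Exception) -> int:
    # Same pattern as the handler: exceptions without the attribute become 500.
    return getattr(exc, "status_code", 500)


print(resolve_status_code(GuardrailBlockedSketch("blocked by policy")))
print(resolve_status_code(ValueError("no status_code attribute")))
```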

Fixes #24348

---------

Co-authored-by: Michael-RZ-Berri <michael@berri.ai>
Co-authored-by: Michael Riad Zaky <michaelr@Mac.localdomain>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Krrish Dholakia <krrish+github@berri.ai>

* fix(router/proxy): address Greptile P1+P2 review comments on PR #28161

- router: raise ServiceUnavailableError (503) instead of RouterRateLimitErrorBasic (429)
  when a specifically-addressed deployment is administratively blocked; 429 misleads
  retry-enabled clients into spinning forever against a paused model
- proxy_server: compute get_fully_blocked_model_names() once before both branches in
  model_list() instead of duplicating the call in each branch
- deepseek: upgrade silent debug log to warning when injecting placeholder
  reasoning_content so callers are clearly notified of degraded multi-turn quality
- tests: update two blocked-deployment assertions to expect ServiceUnavailableError

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix: address bug detection findings (cache token order, mutable defaults)

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix: address bugs in async pass-through, anthropic cache token detection, rerank tests

- async_get_available_deployment_for_pass_through: enforce blocked check on specific deployments
- cost_calculator: detect anthropic-style usage by attribute presence (not truthiness) to avoid mixing OpenAI cached_tokens into anthropic normalization when read=0
- dashscope rerank tests: pass request to httpx.Response constructions for consistency

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix code qa

* fix(vertex_ai/gemini): strip MIME parameters from GCS contentType

GCS object metadata's contentType field can include parameters such as
'text/html; charset=utf-8'. Strip them in _apply_gemini_mime_type_aliases
so downstream get_file_extension_from_mime_type sees a bare MIME type.
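
A minimal sketch of the parameter stripping, assuming it amounts to cutting the value at the first semicolon. `strip_mime_parameters` is a hypothetical helper name; the real logic lives in `_apply_gemini_mime_type_aliases`.

```python
def strip_mime_parameters(content_type: str) -> str:
    """Return the bare MIME type, dropping parameters such as charset."""
    return content_type.split(";", 1)[0].strip()


print(strip_mime_parameters("text/html; charset=utf-8"))
```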

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(vertex_ai/gemini): clarify mime-type error message string concatenation

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* feat(oci): add embeddings, fix streaming/reasoning, expand model catalog

- Add OCIEmbedConfig with full Cohere embed support (7 models, batch up to 96)
- Fix sync streaming: split SSE events on \n\n before JSON parsing
- Fix reasoning models (Gemini 2.5, xAI Grok): make completionTokens and message
  optional in OCIResponseChoice to handle max_tokens exhausted on reasoning
- Fix compartment_id resolution in chat transform to use resolve_oci_credentials
- Fix tool call id: make OCIToolCall.id optional, generate UUID fallback for
  providers (Google via OCI) that omit it
- Add OCI_KEY env var support for inline PEM keys
- Fix datetime.utcnow() deprecation in request signing
- Expand model catalog: 29 OCI models including Llama 4, Gemini 2.5, xAI Grok,
  Cohere Command A, and all Cohere embed variants
- Add 37 live integration tests: sync/async completions for Meta/Google/xAI/Cohere,
  sync/async embeddings, tool use across all vendors, streaming, env var auth
- Add 23 embed unit tests covering all transform and validation paths

* fix(oci): remove dead OCI elif branch in utils.py, align async split_chunks with sync version

* test(oci): add unit tests for split_chunks fix and no-duplicate-OCI-branch guard

* fix(oci): address remaining bugs from issue #25082 — streaming signed body, Cohere stop sequences, hardcoded defaults

- Bug 1: sync and async streaming paths now use signed_json_body when provided
  instead of re-serializing data with json.dumps() — the OCI RSA-SHA256 signature
  covers the exact request body bytes, so re-serializing produces an invalid sig
- Bug 3: Cohere stop sequences now map to 'stopSequences' (was incorrectly 'stop')
- Bug 4: removed hardcoded Cohere defaults (maxTokens=600, temperature=1, topK=0,
  topP=0.75, frequencyPenalty=0) that silently overrode user intent on every call
- Added 6 unit tests covering all three fixes
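
Why re-serializing breaks the signature (Bug 1) can be shown with the standard library alone. The payload and the two serialization styles below are illustrative assumptions; the point is only that two valid JSON encodings of the same data are different bytes and therefore hash differently.

```python
import base64
import hashlib
import json


def body_digest(body: bytes) -> str:
    """Base64-encoded SHA-256 of the exact request bytes."""
    return base64.b64encode(hashlib.sha256(body).digest()).decode("utf-8")


payload = {"compartmentId": "example", "isStream": True}  # sample data only

# Bytes that were hashed when the request was signed.
signed_json_body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
# A later json.dumps() call with default separators yields different bytes.
reserialized_body = json.dumps(payload).encode("utf-8")

print(body_digest(signed_json_body) == body_digest(reserialized_body))  # False
```

Sending `signed_json_body` unchanged is the only way to guarantee the body matches the digest covered by the signature.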

* fix(oci): comprehensive code quality pass — bugs, tests, schema accuracy

- Fix Cohere tool call IDs (was always call_0; now UUID per call)
- Fix TOOL_CALL finish reason mapping in both sync and streaming paths
- Fix Cohere stop parameter mapping (stop → stopSequences)
- Remove hardcoded Cohere defaults (maxTokens/topK/topP/frequencyPenalty)
- Fix content[0] safety guard against empty content arrays
- Fix streaming signed body used consistently (not re-serialized)
- Raise OCIError (not bare Exception/ValueError) throughout
- Centralize OCI_API_VERSION constant; import uuid at module level
- Fix embed get_complete_url to strip trailing slashes from api_base
- Fix OCIEmbedResponse schema: add inputTextTokenCounts (actual OCI field)
- Fix embed usage computed from inputTextTokenCounts (sum of per-input counts)
- Fix Cohere toolCallId included in tool result messages
- Add OCIToolCall.id as Optional (absent in Google/xAI streaming chunks)
- Update tests to reflect correct behavior (no hardcoded defaults, UUID ids,
  deferred credential validation, OCIError vs ValueError, real response schema)

* test(oci): move integration tests to tests/llm_translation/

Addresses greptile P1: tests/test_litellm/ is for mock-only unit tests
(make test-unit target). Real-network OCI tests now live in the correct
location alongside other provider integration tests.

* fix(oci): align types and transformation with official OCI SDK

- Remove OCIVendors.GEMINI — apiFormat="GEMINI" is invalid; all non-Cohere
  models use apiFormat="GENERIC"
- Add toolChoice, logitBias, logProbs to OCIChatRequestPayload so params
  present in the mapping are no longer silently dropped by Pydantic
- Exclude n→numGenerations from Cohere param map (not a Cohere API field)
- Fix CohereToolResult: change callId/result to call/outputs matching
  the OCI SDK's CohereToolResult structure
- Fix CohereToolMessage: replace non-existent toolCallId with toolResults
  list; update adapt_messages_to_cohere_standard to build proper tool-result
  history entries by resolving tool call name+params from preceding assistant
  messages
- Map generic-model stream finish reasons to OpenAI convention
  (COMPLETE→stop, MAX_TOKENS→length, TOOL_CALLS→tool_calls), consistent
  with the existing Cohere streaming path
- Add optional id field to OCIEmbedResponse so valid API responses
  carrying an id are not rejected by the Pydantic model

* fix(oci): use 'output' key in Cohere tool result outputs (matches reference impl)

* fix(oci): port schema/type utilities from langchain-oracle reference impl

- Add resolve_oci_schema_refs: inline $ref/$defs — OCI rejects JSON Schema refs
- Add resolve_oci_schema_anyof: flatten Optional[T] anyOf (Pydantic v2 emits these)
- Add sanitize_oci_schema: strip title, normalise null types, ensure array items
- Add OCI_JSON_TO_PYTHON_TYPES: Cohere expects Python type names (str/int/float),
  not JSON Schema names (string/integer/number)
- Add enrich_cohere_param_description: embed enum/format/range/pattern constraints
  into description since CohereParameterDefinition has no dedicated fields
- Apply all of the above in adapt_tool_definitions_to_cohere_standard and
  adapt_tool_definition_to_oci_standard
- Fix toolChoice conversion: map OpenAI string ('auto','none','required') to OCI
  dict form ({"type":"AUTO"} etc.) — the API rejects plain strings
- Update unit test expectations to match correct Python type names and enriched
  descriptions
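
One of the utilities above, flattening the `anyOf` that Pydantic v2 emits for `Optional[T]`, can be sketched as below. `flatten_optional_anyof` is a hypothetical name and simplified behavior; the real `resolve_oci_schema_anyof` may handle more cases.

```python
from typing import Any


def flatten_optional_anyof(schema: Any) -> Any:
    """Replace anyOf[T, null] with T, keeping sibling keys such as default."""
    if isinstance(schema, list):
        return [flatten_optional_anyof(item) for item in schema]
    if not isinstance(schema, dict):
        return schema
    out = {key: flatten_optional_anyof(value) for key, value in schema.items()}
    variants = out.get("anyOf")
    if isinstance(variants, list):
        non_null = [
            v for v in variants
            if not (isinstance(v, dict) and v.get("type") == "null")
        ]
        # Only the Optional[T] shape is flattened: exactly one real type plus null.
        if len(variants) == 2 and len(non_null) == 1 and isinstance(non_null[0], dict):
            siblings = {k: v for k, v in out.items() if k != "anyOf"}
            return {**non_null[0], **siblings}
    return out


schema = {
    "type": "object",
    "properties": {
        "city": {"anyOf": [{"type": "string"}, {"type": "null"}], "default": None}
    },
}
print(flatten_optional_anyof(schema))
```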

* refactor(oci): split transformation.py into cohere.py and generic.py

transformation.py was 1 243 lines doing too many jobs. Split along the
same boundaries as the langchain-oracle reference (providers/cohere.py,
providers/generic.py):

  chat/cohere.py   — Cohere message/tool building, response + stream parsing
  chat/generic.py  — Generic message/tool building, response + stream parsing
  transformation.py — thin OCIChatConfig orchestrator + OCIStreamWrapper

Public symbols (OCIChatConfig, OCIStreamWrapper, adapt_messages_to_*,
OCIRequestWrapper, version, …) remain importable from transformation.py
for backward compatibility. OCIStreamWrapper gains delegating shims for
_handle_cohere_stream_chunk and _handle_generic_stream_chunk so existing
test call sites keep working unchanged.

transformation.py: 1 243 → 620 lines

* refactor(oci): principal-level code quality pass

- Remove _extract_text_content duplication — single definition in cohere.py,
  imported where needed; instance method on OCIChatConfig eliminated
- Move cryptography imports to module level with _CRYPTOGRAPHY_AVAILABLE flag
  and _require_cryptography() guard; no more re-import on every signing call
- Move litellm version import to module level via litellm._version; remove
  inline import inside validate_oci_environment
- sign_with_manual_credentials now returns Tuple[dict, bytes] matching
  sign_with_oci_signer — asymmetry eliminated, Optional[bytes] guards removed
  throughout stream wrappers (signed_json_body: bytes = b"")
- Rename _openai_to_oci_cohere_param_map → openai_to_oci_cohere_param_map
  for consistency with openai_to_oci_generic_param_map
- Remove double-key bug in map_openai_params where responseFormat was stored
  under both OCI and OpenAI key names simultaneously
- Remove delegating shims (adapt_messages_to_cohere_standard,
  adapt_tool_definitions_to_cohere_standard, _handle_generic_stream_chunk)
  from OCIChatConfig/OCIStreamWrapper; tests now import directly from
  cohere.py and generic.py where symbols live
- Trim __all__ to 7 genuine public symbols; remove the 13-symbol list that
  existed only to support test imports
- Collapse per-model integration test classes into pytest.mark.parametrize;
  CHAT_MODELS list is the single source of truth for model-specific config
- Black + Ruff clean across all OCI files

* fix(oci): address PR review findings

- types/llms/oci.py: add "TOOL_CALL" to CohereChatResponse.finishReason
  Literal so Pydantic does not raise ValidationError on non-streaming
  Cohere tool-use calls (Greptile P1)
- test_oci_cohere_tool_calls.py: add test covering TOOL_CALL finish reason
- model_prices_and_context_window.json: remove 6 duplicate oci/cohere.embed-*
  keys that were silently overridden by the more complete entries already
  present in the file (Greptile P1)
- common_utils.py: move OCI_API_VERSION here from chat/transformation.py
  so embed/transformation.py does not need to import chat/transformation;
  change Protocol stub body from ... to pass (CodeQL "statement no effect");
  add comment to sha256_base64 clarifying it implements OCI HTTP signing
  spec, not password hashing (CodeQL false positive)
- chat/transformation.py: import CustomStreamWrapper from
  litellm_core_utils.streaming_handler instead of litellm.utils to reduce
  import cycle depth (CodeQL cyclic import)
- chat/cohere.py, chat/generic.py: import Usage and
  ChatCompletionMessageToolCall from litellm.types.utils instead of
  litellm.utils for the same reason
- embed/transformation.py: import OCI_API_VERSION from common_utils
  instead of chat/transformation (removes the embed→chat import edge)

* test(oci): add unit tests to improve patch coverage

- test_oci_common_utils.py (new): covers sha256_base64, build_signature_string,
  OCIRequestWrapper.path_url, resolve_oci_credentials, get_oci_base_url,
  validate_oci_environment, sign_with_oci_signer error paths, sign_oci_request
  routing, load_private_key_from_file error paths, resolve_oci_schema_refs
  (including circular ref and external $ref), resolve_oci_schema_anyof,
  sanitize_oci_schema (all branches), enrich_cohere_param_description
- test_oci_generic_chat.py (new): covers content-message error paths (non-dict
  item, unsupported type, non-string text, invalid image_url), tool-call
  validation error paths, adapt_messages_to_generic_oci_standard error paths,
  handle_generic_response (None message, text content, tool calls),
  handle_generic_stream_chunk (finish reasons, streaming tool calls),
  OCIStreamWrapper non-string chunk error
- test_oci_chat_transformation.py: add error paths for validate_environment
  (empty messages), transform_request (missing compartment_id, Cohere without
  user messages), transform_response (error key), map_openai_params
  (unsupported param with and without drop_params), tool_choice string mapping
- test_oci_cohere_tool_calls.py: add edge cases for stream chunk finish
  reasons (TOOL_CALL, MAX_TOKENS, unknown), _extract_text_content with
  non-dict list items and non-string input,
  adapt_messages_to_cohere_standard with malformed JSON tool arguments

* fix(oci): rename supports_streaming to supports_native_streaming in model prices

The JSON schema for model_prices_and_context_window.json uses
`supports_native_streaming` (not `supports_streaming`) and has
`additionalProperties: false`. Rename the field across all OCI
entries to pass the schema validation test.

* test(oci): add 67 tests targeting uncovered happy paths for coverage

Boost patch coverage on the four lowest-coverage OCI files:
- common_utils.py: sign_with_manual_credentials (oci_key / oci_key_file
  paths), sign_oci_request routing, _require_cryptography
- generic.py: adapt_messages_to_generic_oci_standard (all roles),
  adapt_tool_definition_to_oci_standard, adapt_tools_to_openai_standard,
  handle_generic_stream_chunk text/finish-reason paths
- cohere.py: _extract_text_content, adapt_messages_to_cohere_standard
  (all roles including tool results), handle_cohere_response /
  handle_cohere_stream_chunk all finish-reason branches
- transformation.py: get_vendor_from_model, OCIChatConfig._get_optional_params
  (toolChoice string→dict, responseFormat, tools for both vendors),
  transform_request for GENERIC model, get_sync/async_custom_stream_wrapper
  with mocked HTTP, OCIStreamWrapper.chunk_creator happy paths

* fix(oci): suppress CodeQL false positive on sha256_base64 (OCI HTTP signing, not password hashing)

* fix(oci): remove 6 duplicate model price entries and reconcile conflicting values

Six OCI chat model keys appeared twice in model_prices_and_context_window.json
with conflicting pricing/context data (JSON parsers silently discard the first).
Remove the first-occurrence entries and update the surviving entries:
- meta.llama-4-maverick / llama-4-scout: keep updated entries (free preview
  pricing, larger context windows, vision support)
- meta.llama-3.1-70b: keep original pricing, restore supports_native_streaming
- google.gemini-2.5-{flash,pro,flash-lite}: keep OCI pricing page values,
  restore supports_native_streaming
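
The silent-override behavior is reproducible with Python's `json` module, which keeps the last value for a repeated key. The sketch also shows one way to surface duplicates with `object_pairs_hook`; the model key and values are made up, and `reject_duplicate_keys` is a hypothetical helper, not something this PR adds.

```python
import json
from typing import Any, Dict, List, Tuple


def reject_duplicate_keys(pairs: List[Tuple[str, Any]]) -> Dict[str, Any]:
    """object_pairs_hook that raises instead of silently overriding a key."""
    seen: Dict[str, Any] = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen[key] = value
    return seen


raw = '{"oci/example-model": {"max_tokens": 1}, "oci/example-model": {"max_tokens": 2}}'

# Default behavior: the second entry wins and the first disappears without warning.
print(json.loads(raw)["oci/example-model"]["max_tokens"])

try:
    json.loads(raw, object_pairs_hook=reject_duplicate_keys)
except ValueError as err:
    print(err)
```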

* fix(oci): route GPT-5 family to maxCompletionTokens

GPT-5 / GPT-5-mini / GPT-5-nano / GPT-5.5 on OCI reject "maxTokens"
with HTTP 400:

  Invalid 'maxTokens': Unsupported parameter: 'maxTokens' is not
  supported with this model. Use 'maxCompletionTokens' instead.

(Same convention as OpenAI's reasoning-API contract.)

Add a model-aware rename in OCIChatConfig._get_optional_params so the
request payload uses maxCompletionTokens when the model id starts with
openai.gpt-5. Regular Llama / Cohere / Gemini / GPT-4.x continue to use
maxTokens unchanged.

Also widen OCIChatRequestPayload to carry the new optional field so it
survives Pydantic serialization.

Verified live against OCI us-chicago-1:
- openai.gpt-5, gpt-5-mini, gpt-5-nano, gpt-5.5 all return 200
- Full feature sweep on gpt-5.5 (basic, system, multi-turn, streaming,
  tools, usage) all green
- meta.llama-3.3-70b-instruct still uses maxTokens (no regression)

4 new unit tests cover the helper, the routing in both pre- and
post-translation states, and Pydantic serialization.
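
A minimal sketch of the prefix-based rename described in this commit. `route_max_tokens` is a hypothetical function; the real change is in `OCIChatConfig._get_optional_params`, and a later commit replaces the prefix test with a catalog lookup.

```python
from typing import Any, Dict


def route_max_tokens(model: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Rename maxTokens to maxCompletionTokens for the GPT-5 family."""
    out = dict(params)  # copy so the caller's dict is not mutated
    if model.startswith("openai.gpt-5") and "maxTokens" in out:
        out["maxCompletionTokens"] = out.pop("maxTokens")
    return out


print(route_max_tokens("openai.gpt-5-mini", {"maxTokens": 256}))
print(route_max_tokens("meta.llama-3.3-70b-instruct", {"maxTokens": 256}))
```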

* ci(oci): fix CI failures — black formatting + recursive_detector ignore

- Run black on litellm/llms/oci/common_utils.py + 3 OCI test files
  that drifted out of black-compliance during the rebase.
- Add the three bounded recursive functions in oci/common_utils.py
  (`_resolve`, `resolve_oci_schema_anyof`, `sanitize_oci_schema`) to
  the recursive_detector IGNORE_FUNCTIONS list. All three are bounded:
  `_resolve` uses a `resolving_stack` cycle guard; the other two are
  bounded by JSON-schema tree depth (no cycles in well-formed input),
  matching the pattern of the existing OCI/Vertex schema walkers
  already on the list.

* fix(oci): silence MyPy errors in cohere.py — typed-dict access

Two errors flagged by `lint` CI:

  llms/oci/chat/cohere.py:73:  "object" has no attribute "__iter__"
  llms/oci/chat/cohere.py:119: No overload variant of "get" of "dict"
                               matches argument types "object", "CohereToolCall"

Both stem from `msg.get("tool_calls")` / `msg.get("tool_call_id")`
returning `object` per the AllMessageValues TypedDict union. Bind to
`Any` locally for the iteration and coerce the lookup key with `str()`,
removing the now-unused `# type: ignore` on those lines.

No behaviour change — pure type-narrowing for the type checker.

* fix(oci): silence CodeQL py/weak-sensitive-data-hashing on sha256_base64

CodeQL's taint analysis traces request bodies back to environment-loaded
secrets and flags `hashlib.sha256(body).digest()` as
`py/weak-sensitive-data-hashing` — even though SHA-256 is the algorithm
mandated by the OCI HTTP request signing spec for the
`x-content-sha256` header (not a password/secret hash).

The previous suppression used legacy `# lgtm[...]` syntax which the
modern CodeQL action ignores. Switch to Python's standard
`hashlib.sha256(..., usedforsecurity=False)` (Python 3.9+) which CodeQL
honours as a non-security declaration. Behaviour unchanged.
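
A sketch of what such a helper could look like, assuming it returns the base64 of the raw digest as needed for the `x-content-sha256` header. The body below is an illustration, not a copy of the function in `common_utils.py`.

```python
import base64
import hashlib


def sha256_base64(body: bytes) -> str:
    """Base64 SHA-256 of a request body, flagged as a non-security use."""
    # usedforsecurity is keyword-only and available from Python 3.9.
    digest = hashlib.sha256(body, usedforsecurity=False).digest()
    return base64.b64encode(digest).decode("utf-8")


print(sha256_base64(b'{"hello": "world"}'))
```

The flag changes only how the call is declared; the digest is identical to the one produced without it.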

* feat(oci): add reasoning_effort passthrough — only true missing primitive

OCI's GenericChatRequest exposes a reasoningEffort field
(NONE/MINIMAL/LOW/MEDIUM/HIGH) that's the single biggest cost knob for
reasoning-capable models on the service:

  - GPT-5 family
  - Gemini 2.5
  - Grok reasoning variants (3-mini, 4-fast, 4.20)
  - Cohere Command-A-Reasoning

Setting reasoning_effort=LOW typically cuts reasoning-token spend 5-10×
vs the default. Without exposing this, litellm users had no way to tune
cost-vs-quality on these models.

The other GenericChatRequest fields (verbosity, parallel_tool_calls,
logit_bias, n, metadata, web_search_options, prediction) are not
exposed because they are not missing primitives — they either duplicate
prompt-engineering, framework-level controls, or are too niche to
justify the maintenance surface. We only ship what users genuinely
can't accomplish another way.

Excluded from the Cohere v1 param map: CohereChatRequest has no
reasoningEffort field, and Cohere reasoning models
(cohere.command-a-reasoning) use COHEREV2 which is a separate request
type not covered by this PR.

Verified live: GPT-5.5 + reasoning_effort="HIGH" sends
{"reasoningEffort": "HIGH"} on the wire and OCI accepts the request.

* feat(oci): reasoning_effort + reasoning_tokens for OCI GenAI

Three small additions for OCI reasoning models, requested by users
testing the PR in production fork builds:

1. **reasoning_effort param mapping (GENERIC vendors).** OCI expects
   uppercase levels ("LOW"/"MEDIUM"/"HIGH"/"NONE") on `reasoningEffort`,
   but OpenAI-compatible clients send lowercase. Mapped + uppercased in
   `_get_optional_params`. Marked unsupported on Cohere V1/V2 since OCI
   Cohere has no reasoning models (avoids Pydantic validation failure
   on CohereChatRequest).

2. **"disable" → "NONE" mapping.** OpenAI uses "disable" to turn off
   reasoning; OCI uses "NONE". Without this, callers get a 400.

3. **reasoning_tokens propagated to Usage.** OCI returns
   `completionTokensDetails.reasoningTokens` but it wasn't being passed
   to LiteLLM's Usage object. Now flows through to
   `Usage.completion_tokens_details.reasoning_tokens` so callers can
   track reasoning token consumption for cost/observability.

Tests: 7 new unit tests in TestOCIReasoningEffort covering upper/lower
case, "disable"→"NONE", Cohere drop/raise paths, and reasoning_tokens
extraction (with and without completionTokensDetails). 5 new live
integration tests against xai.grok-3-mini in us-chicago-1 verifying the
full request/response loop end-to-end. Existing
test_transform_response_simple_text assertion that
completion_tokens_details was None has been updated to assert
reasoning_tokens flows through.

Verified live on xai.grok-3-mini: reasoning_effort=low → OCI accepts
"LOW", returns reasoningTokens=316 in usage. reasoning_effort=disable
→ OCI accepts "NONE". Full suite: 370/370 unit + 51/51 integration.
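
Points 1 and 2 reduce to a small normalization step. `to_oci_reasoning_effort` is a hypothetical helper written for this illustration; the real mapping sits inside `_get_optional_params`.

```python
def to_oci_reasoning_effort(value: str) -> str:
    """Map an OpenAI-style reasoning_effort to the uppercase OCI level."""
    if value.lower() == "disable":
        return "NONE"  # OCI spells "reasoning off" as NONE
    return value.upper()


for level in ("low", "HIGH", "disable"):
    print(level, "->", to_oci_reasoning_effort(level))
```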

* fix(codeql): re-scope py/weak-sensitive-data-hashing exclusion to OCI signing file

CodeQL's taint analysis re-fires the `py/weak-sensitive-data-hashing`
alert at `litellm/llms/oci/common_utils.py:103` whenever upstream code
paths into the OCI signing module change (touching `transformation.py`
opens new flow paths that CodeQL re-evaluates from scratch). The
`hashlib.sha256(..., usedforsecurity=False)` declaration silences the
direct-call form of the query but not the taint-flow form.

SHA-256 here is mandated by the OCI HTTP signing specification for the
x-content-sha256 content-integrity header — not for password storage:
https://docs.oracle.com/en-us/iaas/Content/API/Concepts/signingrequests.htm

CodeQL has no per-query path filter and GitHub Code Scanning ignores
inline lgtm/codeql comments, so path-ignoring this single ~560-line
signing utility file is the narrowest available suppression. All other
files retain full coverage of py/weak-sensitive-data-hashing — including
litellm/proxy/utils.py where the rule legitimately applies.

This restores the NEUTRAL CodeQL state the PR had on prior commits
(see `2111c98af7` for the same approach on the previous branch
evolution that the cherry-pick was rebased onto a different baseline).

* fix(oci): drop duplicate text on Cohere streaming terminal chunk

OCI Cohere's terminal SSE event re-sends the full assembled response in
`text` alongside a populated `chatHistory`. Emitting that text as another
delta concatenates the entire response onto the already-streamed output
(e.g. "How can I help?How can I help?").

Use `chatHistory is not None` as the discriminator for the consolidated
terminal event — `finishReason` is a weaker signal that could in principle
appear on a non-consolidated chunk. The two coincide today; this preserves
correctness if OCI ever ships finishReason on an incremental chunk.

Adds a live-OCI integration regression test that compares streamed vs
non-streamed length and asserts the response prefix appears only once.
Verified to fail under the previous code with the exact reported
reproduction: 'Hello! How can I help you today?Hello! How can I help you today?'.
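
The discriminator can be sketched with plain dicts standing in for parsed SSE events. `delta_text_from_cohere_chunk` is a hypothetical function and the chunk shapes are simplified assumptions based on the field names in this commit message.

```python
from typing import Any, Dict, List


def delta_text_from_cohere_chunk(chunk: Dict[str, Any]) -> str:
    """Return the text to emit for one chunk, skipping the consolidated event."""
    # A populated chatHistory marks the terminal event that repeats the full text.
    if chunk.get("chatHistory") is not None:
        return ""
    return chunk.get("text") or ""


chunks: List[Dict[str, Any]] = [
    {"text": "How can "},
    {"text": "I help?"},
    {"text": "How can I help?", "chatHistory": [], "finishReason": "COMPLETE"},
]
print("".join(delta_text_from_cohere_chunk(c) for c in chunks))
```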

Reported by @gotsysdba on PR #25177.

* fix(oci): buffer SSE stream across HTTP read boundaries

The old split_chunks helper split each individual HTTP read on "\n\n",
which assumed SSE event boundaries always aligned with read boundaries.
In practice the OCI streaming endpoint delivers events that may:

  - straddle two reads (chunk_creator gets a truncated JSON and crashes)
  - arrive separated by a single "\n" instead of "\n\n"
  - share a read with multiple complete events

Replace the inline split with module-level helpers _iter_sse_events
(sync) / _aiter_sse_events (async) that maintain a buffer across reads,
split on any newline, and yield only complete "data:" lines.

Add 25 regression tests covering event-split-across-reads, tiny-chunk
reads, single-newline separators, keepalive/comment lines, trailing
partial events flushed at EOF, "\r\n" line endings, and an end-to-end
smoke test that feeds an awkwardly-chopped payload through the splitter
into OCIStreamWrapper.chunk_creator.
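
The buffering approach can be sketched as a generator over string reads. This is a simplified illustration under the rules stated above (buffer across reads, split on any newline, yield only complete `data:` lines, flush a trailing partial line at EOF); `iter_sse_data_lines` is a hypothetical name and is not the `_iter_sse_events` helper itself.

```python
from typing import Iterable, Iterator


def iter_sse_data_lines(reads: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each complete data: line across read boundaries."""
    buffer = ""
    for read in reads:
        buffer += read
        # Everything before the last newline is complete; the tail may be partial.
        *complete, buffer = buffer.split("\n")
        for line in complete:
            line = line.rstrip("\r")  # tolerate "\r\n" line endings
            if line.startswith("data:"):
                yield line[len("data:"):].strip()
    # Flush a trailing event that arrived without a final newline.
    tail = buffer.rstrip("\r")
    if tail.startswith("data:"):
        yield tail[len("data:"):].strip()


reads = ['data: {"a"', ': 1}\n\ndata: {"b": 2}\n', ": keepalive\r\n", 'data: {"c": 3}']
print(list(iter_sse_data_lines(reads)))
```

Blank separator lines and comment lines fall through the `startswith` check, so single-newline and double-newline separators are handled the same way.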

Reported by John Lathouwers.

* test(oci): repoint TestOCIKeyNormalization to sign_with_manual_credentials

The signing helper moved from OCIChatConfig._sign_with_manual_credentials
to a module-level sign_with_manual_credentials in common_utils.py. Four
tests in TestOCIKeyNormalization still called the old method:

  - 2 failed outright with AttributeError
  - 2 passed by accident because they used pytest.raises(Exception),
    which happily caught the AttributeError instead of exercising the
    intended OCIError path

Repoint all four to the new module-level function so they exercise the
actual oci_key type-validation branch.

* fix(oci): validate oci_region before URL interpolation to prevent SSRF

Anchor oci_region to ^[a-z][a-z0-9-]{0,30}[a-z0-9]$ inside get_oci_base_url
so user-supplied regions that would redirect the signed request to an
attacker-controlled host (e.g. 'evil.com/#') fail with HTTP 400 before
the URL or signature is built. Empty string still falls back to the
us-ashburn-1 default, so existing callers are unaffected.
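
A rough sketch of the region check, assuming a standalone helper: the name build_oci_base_url and the use of ValueError are illustrative only (the commit describes an HTTP 400 error raised inside get_oci_base_url). The pattern and the default region are taken from the message above.

```python
import re

# Pattern quoted in the commit message; fullmatch is used here so a trailing
# newline cannot slip past the "$" anchor.
_OCI_REGION_RE = re.compile(r"[a-z][a-z0-9-]{0,30}[a-z0-9]")
_DEFAULT_REGION = "us-ashburn-1"


def build_oci_base_url(oci_region: str = "") -> str:
    """Hypothetical helper: validate the region before interpolating it."""
    region = oci_region or _DEFAULT_REGION  # empty string keeps the default
    if not _OCI_REGION_RE.fullmatch(region):
        # Reject before any URL or signature is built.
        raise ValueError(f"Invalid oci_region: {region!r}")
    return f"https://inference.generativeai.{region}.oci.oraclecloud.com"
```

With this shape, a value such as 'evil.com/#' is rejected because '.', '/' and '#' fall outside the allowed character set.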

* test(audio): skip when gpt-4o-audio-preview is unavailable upstream

OpenAI retired `gpt-4o-audio-preview` (404 model_not_found in CI as of
2026-05-19), and the existing try/except in these tests only re-raised
on 'openai-internal' errors. Other exceptions were silently swallowed,
so the next line ran with an unbound `response`/`completion` and
failed with an unrelated UnboundLocalError that masked the real cause.

Extend the skip condition to also cover model_not_found / 'does not exist'
so the suite reports the upstream outage cleanly, matching the pattern
used in ce87c41 for the realtime and nvidia_nim rerank tests.
Re-raise unknown exceptions instead of falling through.

* fix(oci/router): catalog-driven maxCompletionTokens; generic blocked-deployment message

- Drive OCI maxCompletionTokens via supports_reasoning from the model
  catalog instead of a hardcoded openai.gpt-5 prefix. Add OCI GPT-5 family
  entries (gpt-5, gpt-5-mini, gpt-5-nano) with supports_reasoning: true.
  Gate the override to non-Cohere vendor so Cohere reasoning models keep
  maxTokens (Cohere endpoint does not accept maxCompletionTokens).
- Replace proxy-specific 'Contact your proxy admin' phrasing in the four
  Router blocked-deployment ServiceUnavailableError messages with neutral
  SDK-appropriate text.

* fix(oci/cohere): guard handle_cohere_response against missing usage

* fix(oci): address bug review findings in chat transformation

- Cohere param map: keep tool_choice/n as False (not omitted) so unsupported
  params are dropped or rejected rather than silently passed through.
- get_complete_url: when an explicit api_base/litellm.api_base is provided,
  use it as-is instead of unconditionally appending /20231130/actions/chat
  (mirrors the embed config behavior).
- Cohere stream: require both chatHistory and finishReason to be present to
  identify a terminal consolidation chunk, avoiding silent text suppression
  if chatHistory ever appears on a non-terminal chunk.
- Generic usage: use 'is not None' for reasoningTokens so a legitimate value
  of 0 is preserved instead of being treated as absent.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/cohere): emit tool calls in streaming and null content when text empty

handle_cohere_response now sets message.content to None when the Cohere
response text is empty, matching the OpenAI convention for tool-call-only
responses.

handle_cohere_stream_chunk now extracts toolCalls — both directly from
the chunk and from the terminal chunk's chatHistory CHATBOT message —
and emits them in the delta. Previously, CohereStreamChunk lacked a
toolCalls field, so any tool calls in the stream were silently dropped.

* fix(oci): preserve tool results, embed URL path, and generic finish reason

- Use SerializeAsAny on CohereChatRequest.chatHistory so subclass-specific
  fields like CohereToolMessage.toolResults are not dropped during Pydantic
  v2 serialization.
- Make OCIEmbedConfig.get_complete_url append the /20231130/actions/embedText
  action path consistently with chat, so setting litellm.api_base to the
  region inference base URL no longer posts to the bare hostname.
- Map OCI finishReason (COMPLETE / MAX_TOKENS / TOOL_CALLS) to OpenAI
  finish_reason values in handle_generic_response, mirroring the streaming
  handler and the Cohere non-streaming handler.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/generic): silence mypy assignment error on dynamic finish_reason

* fix(oci/embed): always set usage on embedding response

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/chat): append /20231130/actions/chat to explicit api_base

Restore the embed-style behavior so OCIChatConfig.get_complete_url always
appends the OCI GenAI chat path. Routing through get_oci_base_url ensures the
optional explicit api_base has its trailing slash stripped before the suffix is
joined, matching the embed config and the test_respects_explicit_api_base
expectation.

* fix(oci/cohere): mark logprobs/logit_bias unsupported and normalize unknown stream finish reasons

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/cohere): preserve trailing tool result in chatHistory

When the last message in the OpenAI-format input is a tool result (the
standard agentic continuation pattern), the prior messages[:-1] slice
silently dropped that tool result from chatHistory and the model never
saw it. Excluding the last user message by index instead keeps tool
results that trail the last user turn intact.
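
The difference between slicing and index-based exclusion can be illustrated with a small sketch. The helper name split_last_user_message and the plain-dict message shape are assumptions for the example, not the adapter's real signature.

```python
from typing import Any, Dict, List, Optional, Tuple

Message = Dict[str, Any]


def split_last_user_message(
    messages: List[Message],
) -> Tuple[List[Message], Optional[Message]]:
    """Hypothetical sketch: drop only the last user turn from the history."""
    last_user = max(
        (i for i, m in enumerate(messages) if m.get("role") == "user"),
        default=None,
    )
    # Everything except the last user message stays in the history,
    # including tool results that come after it.
    history = [m for i, m in enumerate(messages) if i != last_user]
    current = messages[last_user] if last_user is not None else None
    return history, current
```

With a conversation ending in a tool result, messages[:-1] would discard that result, while the index-based split keeps it in the history.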

* fix(main): remove dead OCI embedding elif block

The earlier elif at line 5119 already routes OCI embeddings through the
base HTTP handler with the headers None-guard, so the later identical
block was unreachable dead code.

* test(oci): move integration tests out of llm_translation mock-only folder

Greptile flags tests/llm_translation/ as mock-only via a project-specific
rule; relocate the live-network OCI integration suite to tests/integration/
and adjust the in-file sys.path / run instructions accordingly.

* fix(oci/cohere): suppress tool calls on stream terminal consolidation chunk

The terminal SSE event re-sends the full assembled response in both
`text` and `chatHistory`. The existing logic already suppresses
`text` to avoid double-emit, but tool calls extracted from the
terminal chunk (via `typed_chunk.toolCalls` or the `chatHistory`
CHATBOT fallback) would still be re-emitted with fresh uuid4 IDs.
If OCI Cohere ever streams tool calls progressively in intermediate
chunks (now possible since CohereStreamChunk has a toolCalls field),
this would cause downstream agentic frameworks to execute each tool
call twice.

Suppress tool calls on the terminal consolidation chunk for the same
reason `text` is suppressed.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci,httpx): normalize finish_reason, preserve response_format, fix sync embed JSON content-type

- cohere.py / generic.py: normalize unknown OCI finishReason values (ERROR,
  ERROR_TOXIC, CONTENT_FILTERED, USER_CANCEL, ...) to 'stop' in non-streaming
  and streaming generic handlers, matching the streaming Cohere handler so
  downstream consumers switching on finish_reason aren't broken by raw OCI
  values.
- transformation.py: restore the dual-key alias so optional_params still
  carries the original 'response_format' key alongside the OCI-mapped
  'responseFormat'. Downstream litellm framework code (json_mode detection,
  logging) inspects 'response_format' after map_openai_params runs.
- llm_http_handler.py: make the sync embedding path mirror the async path —
  when sign_request returns no signed_body, send via json=data (which sets
  Content-Type: application/json) instead of data=json.dumps(data) which
  doesn't. Removes a sync/async behavioural asymmetry for non-OCI providers
  that adopt the sign_request pattern.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): clean up OCIChatConfig init, normalize generic stream finish reasons, correct embed sign_request return type

- Replace fragile setattr(self.__class__, ...) pattern in OCIChatConfig.__init__ with a @property for has_custom_stream_wrapper, matching the pattern used by other providers.
- Normalize unknown OCI finish reasons (e.g. ERROR, ERROR_TOXIC, USER_CANCEL) to 'stop' in handle_generic_stream_chunk, matching the existing Cohere stream handler behaviour.
- Tighten OCIEmbedConfig.sign_request return type from Tuple[dict, Optional[bytes]] to Tuple[dict, bytes] — sign_oci_request never returns None for the body, and this matches OCIChatConfig.sign_request.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): strip trailing action path in get_oci_base_url to avoid URL doubling

A fully-formed OCI endpoint URL (e.g. https://inference.generativeai.us-chicago-1.oci.oraclecloud.com/20231130/actions/chat) passed via api_base previously had the action path appended a second time by get_complete_url in both chat and embed configs, yielding a 404. get_oci_base_url now strips a trailing /20231130/actions/<name> so callers can always append the action path safely.
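
A minimal sketch of the suffix stripping, assuming a regex-based approach. The helper name strip_oci_action_path is hypothetical; the real logic lives in get_oci_base_url and may differ in detail.

```python
import re

# Matches a trailing "/20231130/actions/<name>" segment.
_ACTION_SUFFIX_RE = re.compile(r"/20231130/actions/[^/]+$")


def strip_oci_action_path(api_base: str) -> str:
    """Hypothetical helper: return the bare base URL so the caller can
    append the action path exactly once."""
    return _ACTION_SUFFIX_RE.sub("", api_base.rstrip("/"))
```

Both a bare regional host and a fully-formed chat endpoint reduce to the same base, so appending /20231130/actions/chat afterwards cannot double the path.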

* fix(httpx): preserve sync embed data= kwarg to avoid breaking mock-based tests

The earlier sync_httpx_client.post() call passed data=json.dumps(data),
which downstream embedding tests assert on (e.g. tests for hosted_vllm,
jina_ai, watsonx). Switching to json=data changed the kwarg name and broke
those tests. The OCI signed_body path keeps using data=signed_body and is
unaffected.

* fix(oci): stable tool-call ids across stream chunks; lenient Cohere finishReason

- Replace random uuid4 per chunk with a deterministic content-derived
  digest for synthetic tool-call ids in both Cohere and Generic OCI
  handlers. Previously, when OCI omitted 'id' (always for Cohere, often
  for Generic streaming deltas), every chunk for the same logical tool
  call received a new uuid, causing downstream stream-mergers (which key
  off id) to treat each fragment as a distinct call.

- Relax CohereChatResponse.finishReason from a strict Literal[...] to
  Optional[str], matching CohereStreamChunk.finishReason. The
  handle_cohere_response 'elif oci_finish_reason is not None' fallback
  was previously unreachable because Pydantic raised ValidationError on
  any unknown value before the fallback executed. Now non-streaming
  responses degrade unknown reasons to 'stop' just like the streaming
  path.
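
One way to derive a stable synthetic id is sketched below. Which fields feed the digest is an assumption here (function name plus tool-call index); the commit only states that the id is deterministic and content-derived. The function name and the call_ prefix are likewise illustrative.

```python
import hashlib


def synthetic_tool_call_id(name: str, index: int = 0) -> str:
    """Hypothetical sketch: same logical tool call -> same id on every chunk."""
    # sha256 is used purely as a stable fingerprint, not for security.
    digest = hashlib.sha256(f"{index}:{name}".encode("utf-8")).hexdigest()
    return f"call_{digest[:24]}"
```

Because the id depends only on its inputs, a stream-merger that keys off id sees one call across fragments, whereas a fresh uuid4 per chunk would produce a new id each time.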

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/embed): validate OCI credentials in validate_environment

Mirror OCIChatConfig.validate_environment so embedding requests fail
fast with a clear error when oci_user/oci_fingerprint/oci_tenancy/
oci_compartment_id or an oci_key/oci_key_file is missing, instead of
deferring the failure until sign_request.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(oci/embed): expect OCIError from validate_environment when credentials are missing

OCIEmbedConfig.validate_environment now raises eagerly (mirroring OCIChatConfig)
when oci_user/oci_fingerprint/oci_tenancy/oci_compartment_id or oci_key/oci_key_file
is missing. Update the test to match.

* fix(oci): polish stream chunk handling and signed body default

- cohere stream terminal consolidation now emits content=None instead of ""
- drop redundant index truthiness check (None is already replaced with 0)
- accept both "TOOL_CALL" and "TOOL_CALLS" finish reasons in cohere
- signed_json_body defaults to None and uses an explicit None check, so an
  explicitly empty bytes body is not silently re-serialized

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/chat): catch pydantic ValidationError when parsing OCI responses

Pydantic v2 raises ValidationError (not TypeError) when field validation
fails, so malformed OCI completion responses or stream chunks would
propagate unhandled out of handle_generic_response,
handle_generic_stream_chunk, and handle_cohere_stream_chunk. Widen the
except clauses to also catch ValidationError so callers get a clean
OCIError.

* fix(oci/catalog): real prices for Llama 4, drop zero-cost OCI OpenAI entries

Zero-cost catalog entries (input_cost_per_token=0, output_cost_per_token=0)
make proxy spend tracking silently report $0 for these paid OCI models, so
any caller can drive them without decrementing a budget.

For Llama 4 Maverick and Scout, OCI charges the same character-based rate
as Llama 3.3 70B ($0.0018 per 10,000 characters), so use the same per-token
price as the existing oci/meta.llama-3.3-70b-instruct entry (7.2e-07 in/out).
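
The conversion from the character-based rate to the per-token figure can be reproduced as follows. The factor of 4 characters per token is inferred from the two numbers quoted above and is an assumption, not something the commit states.

```python
import math


def per_token_price(price_per_10k_chars: float, chars_per_token: float = 4.0) -> float:
    """Convert a USD price per 10,000 characters into a USD price per token.

    chars_per_token=4 is an assumed conversion factor, chosen because it
    reproduces the 7.2e-07 catalog value from the $0.0018 rate.
    """
    return price_per_10k_chars / 10_000 * chars_per_token


llama_price = per_token_price(0.0018)
# 0.0018 / 10,000 = 1.8e-07 per character; times 4 gives about 7.2e-07 per token.
assert math.isclose(llama_price, 7.2e-07)
```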

For oci/openai.gpt-5, gpt-5-mini, gpt-5-nano, gpt-oss-120b, and gpt-oss-20b,
no public per-token pricing is available; drop the entries so operators must
register them with explicit custom pricing. The existing GPT-5 reasoning test
fixture already injects synthetic entries when the catalog omits them, so the
chat transformation's supports_reasoning lookup keeps working in tests.

* fix(oci/chat): wrap CohereChatResult construction in try/except

Match the handle_generic_response pattern: surface OCIError with the
upstream status code instead of letting a raw pydantic.ValidationError
propagate when the Cohere response payload is malformed.

* fix(oci): harden Cohere stream/finish-reason and dedupe maxTokens param mapping

- Cohere stream: track per-stream tool-call emission and only suppress the
  terminal consolidation chunk's tool calls once they've been seen earlier.
  Prevents silent drop if tool calls are delivered exclusively on the
  terminal chunk.
- Cohere stream: emit content=None (not "") on non-terminal text-free
  chunks (e.g. tool-call-only / keep-alive) so downstream consumers that
  distinguish missing vs explicitly-empty deltas behave correctly.
- Generic handlers: accept singular TOOL_CALL finish reason in addition to
  TOOL_CALLS, matching the Cohere handlers.
- _get_optional_params: when both max_tokens and max_completion_tokens are
  provided, explicitly prefer max_completion_tokens instead of relying on
  dict iteration order.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): emit content=None instead of empty string for text-free generic stream chunks

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* test(oci): expect content=None for text-free generic stream chunks

handle_generic_stream_chunk now emits content=None instead of empty
string when a chunk carries no text parts. Update the corresponding
no-message test to match.

* codeql: narrow OCI sha256 suppression to query-filter, not whole file

paths-ignore was suppressing every CodeQL query on
litellm/llms/oci/common_utils.py, hiding all future findings in a
security-critical file (private key loading, credential resolution,
URL construction, RSA signing). Move the suppression for
py/weak-sensitive-data-hashing into query-filters so common_utils.py
remains fully analyzed by every other query.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): use locale-independent RFC 7231 date for manual signing

email.utils.formatdate(usegmt=True) emits canonical English weekday/
month abbreviations regardless of system locale, so signature
verification doesn't break on non-en_US deployments.
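
A short sketch of the date header generation using the standard library call named above. The wrapper name rfc7231_date is hypothetical.

```python
from email.utils import formatdate
from typing import Optional


def rfc7231_date(timestamp: Optional[float] = None) -> str:
    """Return an HTTP date such as 'Thu, 01 Jan 1970 00:00:00 GMT'.

    formatdate builds the weekday and month names from fixed English
    tables, so the result does not depend on the process locale.
    """
    return formatdate(timeval=timestamp, localtime=False, usegmt=True)
```

By contrast, time.strftime with %a and %b follows the active locale, which is the kind of dependency this commit avoids.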

* fix(oci): strip 'oci/' prefix in get_vendor_from_model

Previously, get_vendor_from_model split on '.' without stripping the
optional 'oci/' provider prefix, so 'oci/cohere.command-a-03-2025' was
routed through the GENERIC pipeline instead of COHERE.
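
The prefix handling can be sketched as below. The function name get_vendor and the string return values are simplifications for illustration; the real get_vendor_from_model may return an enum and recognise more vendors.

```python
def get_vendor(model: str) -> str:
    """Hypothetical sketch: strip the optional 'oci/' prefix, then read the
    vendor from the part before the first '.'."""
    name = model[len("oci/"):] if model.startswith("oci/") else model
    vendor = name.split(".", 1)[0].lower()
    return "COHERE" if vendor == "cohere" else "GENERIC"
```

Without the prefix strip, 'oci/cohere.command-a-03-2025' yields the vendor token 'oci/cohere', which does not equal 'cohere' and so falls through to GENERIC.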

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* codeql: scope OCI sha256 suppression to common_utils.py via filter-sarif

Replace the global query-filters exclude for py/weak-sensitive-data-hashing
with a SARIF post-filter that only drops the alert when it originates from
litellm/llms/oci/common_utils.py, keeping the rule active on every other
SHA-256 callsite in the repository.

* Fix OCI chat bugs: tool_calls None key, dead max_tokens dedup, single-event stream text suppression

- handle_cohere_response: omit tool_calls key from message dict when None,
  matching the generic handler's behaviour and avoiding tripping consumers
  that key off 'tool_calls' in message.
- _get_optional_params: remove dead prefer_max_completion branch. By the
  time this helper runs, map_openai_params has already collapsed
  max_tokens/max_completion_tokens onto the OCI alias, so the OpenAI-key
  membership check is unreachable.
- handle_cohere_stream_chunk: add prior_text_emitted parameter mirroring
  prior_tool_calls_emitted. The terminal consolidation chunk's text is
  only suppressed when prior deltas already emitted text — otherwise
  (degenerate single-event stream) the text passes through so the
  response content isn't silently lost. OCIStreamWrapper now tracks
  emitted text alongside emitted tool calls.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): preserve all text parts in generic response and emit SYSTEM role for Cohere

- handle_generic_response: iterate all content parts and concatenate text
  (matches the streaming handler) so non-leading text parts are not lost
  and a leading non-text part does not suppress trailing text.
- adapt_messages_to_cohere_standard: emit CohereSystemMessage for system
  messages so direct callers do not silently drop them. The Cohere
  request builder filters system messages before calling this helper to
  avoid duplicating preambleOverride content into chatHistory.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): normalise dict-format tool_choice to OCI flat uppercase shape

The OCI Generative AI API only accepts toolChoice values of the form
{"type": "AUTO"|"NONE"|"REQUIRED"} or {"type": "FUNCTION",
"name": "<fn>"}. The previous conversion only handled string
tool_choice values, so OpenAI's standard dict shape
{"type": "function", "function": {"name": "<fn>"}} passed
through unchanged and was rejected by OCI with a 400.

Normalise the dict shape by uppercasing the discriminator and hoisting
the function name to the top level. Also accept dict variants of the
non-function selectors (e.g. {"type": "auto"}).
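
A rough sketch of the normalisation, assuming a standalone function. The name normalize_tool_choice and the use of ValueError for unrecognised input are illustrative and not taken from this commit.

```python
from typing import Any, Dict, Optional, Union

_SIMPLE_SELECTORS = {"AUTO", "NONE", "REQUIRED"}


def normalize_tool_choice(tool_choice: Union[str, Dict[str, Any]]) -> Dict[str, str]:
    """Hypothetical sketch: convert OpenAI tool_choice to the flat OCI shape."""
    name: Optional[str] = None
    if isinstance(tool_choice, str):
        kind = tool_choice.upper()
    elif isinstance(tool_choice, dict):
        # Uppercase the discriminator and hoist the nested function name.
        kind = str(tool_choice.get("type", "")).upper()
        function = tool_choice.get("function")
        if isinstance(function, dict):
            name = function.get("name")
        name = name or tool_choice.get("name")
    else:
        raise ValueError(f"Unsupported tool_choice type: {type(tool_choice).__name__}")

    if kind in _SIMPLE_SELECTORS:
        return {"type": kind}
    if kind == "FUNCTION" and name:
        return {"type": "FUNCTION", "name": name}
    raise ValueError(f"Unsupported tool_choice value: {tool_choice!r}")
```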

* test(oci): exercise system-message filtering at transform_request boundary

adapt_messages_to_cohere_standard now emits SYSTEM-role entries by design
so direct callers don't silently drop system content. The Cohere request
builder filters system messages before calling the helper and routes them
into preambleOverride, so the user-visible 'no SYSTEM in chatHistory'
guarantee holds at the transform_request boundary, where the test should
live.

* fix(oci/chat): extract tool_choice/response_format helpers to satisfy PLR0915

_get_optional_params exceeded ruff's 50-statement cap. The toolChoice and
responseFormat normalisation blocks are self-contained mutations, so move
them to module-level helpers.

* fix(oci): normalize None finishReason in generic non-streaming handler; drop dead Cohere system-role branch

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/generic): silence mypy assignment error on cleared finish_reason

* fix(docker): install libatomic in builder for prisma nodeenv binary

The prebuilt node binary that prisma-python's nodeenv downloads links
against libatomic.so.1, which Wolfi does not pull in via gcc/nodejs.
Without this, fresh Docker builds (no GHA cache hit) fail at
`prisma generate` with:
  node: error while loading shared libraries: libatomic.so.1

* fix(oci): raise on invalid tool_choice instead of silently passing OpenAI shape

_normalize_tool_choice previously left an OpenAI-format dict in selected_params['toolChoice'] when the type was unrecognized or when 'FUNCTION' was given with a missing/empty name. OCI would then reject the request with a non-obvious error. Raise ValueError with a clear message in these cases.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): raise OCIError instead of ValueError in _normalize_tool_choice

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/generic): declare non-security intent on sha256 for synthetic tool-call id

* fix(oci): simplify _get_optional_params and reject invalid tool_choice types

- Collapse the two-loop _get_optional_params into a single pass with
  clear precedence (OpenAI key wins over OCI alias; first OpenAI key
  reaching a given OCI target wins). Removes the redundant maxTokens
  special-case in the second loop and makes the map_openai_params /
  transform_request handoff easier to reason about.
- Raise OCIError when _normalize_tool_choice sees an unexpected type
  (list, bool, int, ...) instead of silently letting it through to the
  OCI API where it would produce an opaque server-side error.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* Remove no-op data['stream'] deletion in OCI stream wrappers

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): always send Cohere isStream field explicitly

Match OCIChatRequestPayload by defaulting CohereChatRequest.isStream to
False instead of None so model_dump(exclude_none=True) does not silently
omit the field on non-streaming requests.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): revert Cohere isStream to Optional[bool]=None to preserve omission semantics

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/generic): raise OCIError on empty choices instead of IndexError

Pydantic accepts an empty choices list when validating OCICompletionResponse, so accessing chatResponse.choices[0] could raise an unhandled IndexError. Surface it as OCIError so the response error path is consistent with the existing (TypeError, ValidationError) guard.

* fix(oci/cohere): map top_k -> topK so Cohere topK param is settable

The Cohere param map (derived from the GENERIC map) had no entry for
topK. Since the simplified _get_optional_params only iterates over
param_map entries, callers had no way to pass topK to CohereChatRequest
(neither via an OpenAI-style key nor via the OCI alias).

Add 'top_k': 'topK' to the Cohere map only — OCIChatRequestPayload
(GENERIC) has no topK field. _get_optional_params accepts both the
OpenAI key (top_k) and the OCI alias (topK) in optional_params, so this
covers both calling conventions.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): tighten cohere stream dedup flags and forward stream args in embed signing

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci/chat): reorder dict guard and wrap stream chunk json.loads

- Move isinstance(response_json, dict) check before .get("error") so
  the guard runs before the attribute access it is supposed to protect.
- Wrap json.loads in OCIStreamWrapper.chunk_creator with try/except so
  malformed SSE payloads surface as OCIError instead of a raw
  JSONDecodeError propagating out of the stream loop.

* fix(oci/cohere stream): only flag text emitted on non-empty content

An intermediate Cohere SSE chunk carrying text="" was flipping
_cohere_text_emitted via the "is not None" check, which then caused
the terminal consolidation chunk to drop its real text as a duplicate.
Use a truthy check so only actual content marks the stream as having
emitted text.

* test(oci): end-to-end proxy integration test against real OCI GenAI

Spins up the litellm proxy via the console-script entrypoint with a
minimal OCI-only config and drives real OpenAI-shaped HTTP requests
through it against OCI GenAI. Covers non-streaming chat, streaming
chat, embeddings, and /v1/models for Cohere, Llama, Gemini, and Grok.

Skips automatically when ~/.oci/config is absent or when the active
profile uses session-token auth (the OCI provider currently only
consumes OCI_* env vars; session tokens would need an in-process
signer). API-key profiles work out of the box.

* test(oci): move proxy integration test to tests/integration/

tests/llm_translation/ is mock-only; the OCI proxy integration test
spawns a real proxy subprocess and makes live HTTP calls, so move
it (and the companion config) to tests/integration/ alongside the
existing test_oci_integration.py.

* fix(oci): dedupe finish-reason mapping and batch Cohere tool results

- Extract _normalize_oci_finish_reason helper so the four chat handlers
  (Cohere/GENERIC, sync/stream) share one OCI->OpenAI mapping instead of
  four near-identical if/elif chains.
- Merge consecutive OpenAI tool-role messages into a single
  CohereToolMessage with multiple toolResults entries, matching the OCI
  Cohere API's expectation for parallel tool calls in one assistant turn.
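
A shared finish-reason mapping along the lines of the first bullet might look like the sketch below. The concrete OpenAI values (stop, length, tool_calls) and the pass-through of None are assumptions; the commits in this series only state that COMPLETE, MAX_TOKENS and TOOL_CALL(S) are mapped and that unknown values degrade to 'stop'.

```python
from typing import Optional

# Assumed OCI -> OpenAI mapping, for illustration only.
_OCI_TO_OPENAI_FINISH_REASON = {
    "COMPLETE": "stop",
    "MAX_TOKENS": "length",
    "TOOL_CALL": "tool_calls",
    "TOOL_CALLS": "tool_calls",
}


def normalize_finish_reason(oci_reason: Optional[str]) -> Optional[str]:
    """Hypothetical sketch of a single helper shared by all chat handlers."""
    if oci_reason is None:
        # Assumption: no reason yet (e.g. a mid-stream chunk) stays None.
        return None
    # Unknown values such as ERROR or USER_CANCEL fall back to 'stop'.
    return _OCI_TO_OPENAI_FINISH_REASON.get(oci_reason.upper(), "stop")
```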

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(oci): drop dead Cohere toolChoice field and emit GENERIC tool-call dicts inline

- Remove the unreachable toolChoice field from CohereChatRequest. The
  Cohere param map explicitly marks tool_choice as unsupported, so the
  field can never be populated through the normal optional_params flow
  and only confused the public model surface.
- Build GENERIC stream tool-call dicts inline (id/type/function shape)
  instead of round-tripping through ChatCompletionMessageToolCall and
  model_dump(). Matches handle_cohere_stream_chunk so downstream
  stream-mergers see the same minimal payload regardless of which
  vendor produced the chunk.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(docker): drop redundant libatomic from non_root builder

litellm_internal_staging already fixes the prisma `nodeenv` build
failure at the root cause by restoring `npm` to the builder (#28519):
with npm on PATH, prisma-python uses the system Node and never downloads
the nodeenv binary that links against libatomic.so.1. After merging
internal_staging the libatomic line is dead weight, so remove it.

https://claude.ai/code/session_01SwKzxRxgUhLFyyEf4UV812

* fix(oci/catalog): add openai.gpt-5{,-mini,-nano} entries with supports_reasoning

Without these catalog entries, supports_reasoning(model='openai.gpt-5*',
custom_llm_provider='oci') returned False, so _model_uses_max_completion_tokens
fell back to the default and OCI rejected the request with HTTP 400
('Use maxCompletionTokens instead.'). Add the three entries so the catalog-driven
maxCompletionTokens routing works against a stock LiteLLM install.

Also reword the test fixture docstring — the bundled backup now actually ships
these entries, so the fixture is only a fallback for environments that loaded
their cost map from a stale remote source.

---------

Co-authored-by: Tai An <antai12232931@outlook.com>
Co-authored-by: Vincent <yimao1231@gmail.com>
Co-authored-by: Kris Xia <xiajiayi0506@gmail.com>
Co-authored-by: d 🔹 <liusway405@gmail.com>
Co-authored-by: Fabrizio Cafolla <developer@fabriziocafolla.com>
Co-authored-by: Filippo Menghi <113345637+Cyberfilo@users.noreply.github.com>
Co-authored-by: Tom Denham <tom@tomdee.co.uk>
Co-authored-by: escon1004 <70471150+escon1004@users.noreply.github.com>
Co-authored-by: Divyansh Singhal <97736786+Divyansh8321@users.noreply.github.com>
Co-authored-by: robin-fiddler <robin@fiddler.ai>
Co-authored-by: Michael-RZ-Berri <michael@berri.ai>
Co-authored-by: Michael Riad Zaky <michaelr@Mac.localdomain>
Co-authored-by: ryan-crabbe-berri <ryan@berri.ai>
Co-authored-by: Krrish Dholakia <krrish+github@berri.ai>
Co-authored-by: Sameer Kankute <sameer@berri.ai>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Federico Kamelhar <federico.kamelhar@oracle.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Michael Riad Zaky <michaelr@Michaels-MacBook-Air.local>
Co-authored-by: Yuneng Jiang <yuneng@berri.ai>
…aming hot paths (#28289)

* perf: reduce per-request and per-chunk overhead across Anthropic streaming hot paths

- Introduce pure-text fast-path in `_build_complete_streaming_response` that collapses O(N) `content_block_delta` events into a single equivalent SSE event before conversion, eliminating per-output-token Pydantic `ModelResponseStream` construction; non-text streams (tool_use, thinking, citations) fall back to the unchanged legacy path
- Skip agentic streaming wrapper entirely when no callback overrides `async_should_run_agentic_loop`; the wrapper buffered every chunk and rebuilt the SSE response only to call hooks that all return `(False, {})` — a pure no-op for the default config
- Serialize request body once (`json.dumps`) for both the pre-call log input and the wire, instead of twice; avoids a full O(payload) scan per request, significant for long-context Claude Code histories
- Add fast path in `async_streaming_data_generator` that bypasses the per-chunk `async_post_call_streaming_hook` coroutine await, response-string materialization, and cost-injection call when no callback/guardrail/cost-injection is active (the default config)
- Resolve `_DD_STREAMING_TRACE_ENABLED` once at import time; eliminate per-chunk `NullSpan` context manager allocation when Datadog tracing is disabled (the default)
- Memoize `get_type_hints(AnthropicMessagesRequestOptionalParams)` with `@lru_cache(maxsize=1)` — resolves once per process instead of once per `/v1/messages` request (~80µs each)
- Hoist `cost_injection_active` out of the per-chunk loop in `chunk_processor`; eliminates repeated `getattr` + endpoint-type checks on every streamed byte chunk
- Extract `_build_passthrough_logging_result` from `_route_streaming_logging_to_handler` as a standalone static method to facilitate future off-loop dispatch
- Convert `async_sse_data_generator` from an `async for: yield` trampoline to a direct return of the underlying generator, removing one async-generator layer per streamed chunk
- Skip redundant `strip_empty_text_blocks_from_anthropic_messages` scan in `anthropic_messages_handler` when the async wrapper already sanitized (signalled via `_litellm_messages_presanitized` sentinel, popped before reaching provider params)
- Gate debug log `f-string` evaluation behind `isEnabledFor(DEBUG)` in both the streaming generator and the transformation layer to avoid serializing entire message payloads on every request at non-debug log levels
- Add benchmark script (`scripts/benchmark_anthropic_messages_perf.py`) with a local mock Anthropic SSE provider for reproducible TTFT and TPM measurement across commits/branches
- Add parity tests asserting fast-path and legacy-path produce byte-identical logged/billed payloads, plus unit tests for agentic hook detection, pre-serialized body reuse, and memoized key resolution
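
The memoization bullet above can be illustrated with a small stand-alone sketch. ExampleOptionalParams and optional_param_keys are stand-ins invented for the example; the real target is AnthropicMessagesRequestOptionalParams.

```python
from functools import lru_cache
from typing import FrozenSet, Optional, TypedDict, get_type_hints


class ExampleOptionalParams(TypedDict, total=False):
    """Stand-in TypedDict; not the real request params type."""

    max_tokens: int
    temperature: Optional[float]


@lru_cache(maxsize=1)
def optional_param_keys() -> FrozenSet[str]:
    # get_type_hints resolves annotations on every call, so cache the
    # result once per process instead of once per request.
    return frozenset(get_type_hints(ExampleOptionalParams))
```

Returning a frozenset keeps the cached value immutable, so callers cannot accidentally mutate shared state.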

* perf: address greptile review for anthropic streaming hot path

- Bail to legacy in `_collapse_pure_text_chunks` when content_block_delta
  events from different block indexes are observed without an intervening
  flush. Anthropic sends blocks strictly sequentially, but defensive bail
  prevents silent text-merging if the protocol ever interleaves.
- Replace leaf-class `__dict__` check for `async_post_call_streaming_hook`
  in `_callback_capabilities` with a function-identity comparison that
  walks the MRO. A vendor base class can carry the override and the
  registered class can add nothing else; before this PR the hook was
  unconditionally invoked, so an inherited-override miss would silently
  drop the hook on the streaming path.
- Add unit tests for both behaviors.
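
The difference between the leaf-class `__dict__` check and the MRO-aware identity check can be shown with a small sketch. All class names and the overrides_hook helper are hypothetical; only the hook name comes from the commit.

```python
class BaseLogger:
    async def async_post_call_streaming_hook(self, chunk):
        return chunk  # default no-op


class VendorBase(BaseLogger):
    async def async_post_call_streaming_hook(self, chunk):
        return chunk.upper()  # the real override lives on a base class


class RegisteredLogger(VendorBase):
    pass  # adds nothing, but inherits the override


class PlainLogger(BaseLogger):
    pass  # inherits only the default


def overrides_hook(callback: object, base: type, name: str) -> bool:
    # getattr on the class walks the MRO and returns the plain function,
    # so identity against the base implementation detects any override.
    return getattr(type(callback), name, None) is not getattr(base, name, None)
```

Here `"async_post_call_streaming_hook" in RegisteredLogger.__dict__` is False, so a leaf-only check would skip the hook, while overrides_hook reports True for RegisteredLogger and False for PlainLogger.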

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mypy): narrow model_name to str in cost-injection branch

The hoisted cost_injection_active flag in chunk_processor encodes the
`bool(model_name)` requirement but mypy can't track that invariant
through the local, so the per-chunk `_process_chunk_with_cost_injection(
chunk, model_name)` calls flagged Optional[str] vs str. Pin a typed
non-None local inside the cost-injection branch so mypy narrows
correctly without changing runtime behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Yassin Kortam <yassinkortam@g.ucla.edu>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: oss-agent-shin <279349115+oss-agent-shin@users.noreply.github.com>
Co-authored-by: ishaan-berri <ishaan-berri@users.noreply.github.com>
… management endpoints (#28681)

* test(proxy): phase-4 payload behavior pinning for tier-2/3 key + team management endpoints

Extends the Phase 1–3 behavior-pin suite at tests/proxy_behavior/management/
with a second axis: payload-shape pinning. Phase 1–3 held payload minimal
and pinned (actor, target) → status across 37 routes; Phase 4 holds the
caller fixed at an authorized actor, varies the payload shape, and asserts
the observable DB effect (on accept) or the named guard / row-unchanged
(on reject). Faithfulness contract from Phase 1–3 is unchanged.

Six families + one gap-closer (59 new scenarios, 620 → 679 total):

  * F1 — key budget / rate-limit (test_key_budget_limits.py, 18)
  * F2 — key↔team reassignment   (test_key_team_change.py, 6)
  * F3 — team budget / rate-limit (test_team_budget_limits.py, 15)
  * F4 — member-info validation   (test_team_member_info_validation.py, 5)
  * F5 — permission batching      (test_team_permissions_bulk_update.py, 6)
  * F6 — org-scoped team access   (+2 detail-string pins in existing files)
  * F7 — coverage gap-closer      (test_f7_coverage_closeout.py, 7)

Harness extensions in conftest.py (additive only):
  * create_scratch_org() seeder with its own scratch-prefixed budget row
  * budget / limit fields on create_scratch_team()
  * scratch teardown also sweeps litellm_organizationtable

Coverage telemetry (behavior-suite-only):
  * key_management_endpoints.py  60 % → 65 % (+82 lines)
  * team_endpoints.py            62 % → 72 % (+137 lines, crosses 70 % stretch)

Key lands under 70 % per plan §7 escape hatch — the gap is dominated by
routes outside F1–F6 scope (key list/info v2 internals) and structurally
dead org-budget guards (call sites at lines 889 + 2310 + 985 + 1751 load
the org without include_budget_table=True, so org.litellm_budget_table is
None at guard time and the aggregate guard no-ops). Pinned as observed
no-op behavior so a future fix that flips the flag turns these into reds.

Zero source-code changes; pyproject.toml diff is empty;
test_route_coverage.py stays green untouched; G3 grep guards still green;
local wall-time 14 s for the full suite (no coverage), 22 s with coverage.

G4 regression-replay protocol executed against three representative
fix-PR parents (410ce76, 0bd49ec, 8bbc61e): all Phase 4 tests
PASS at pre-fix SHAs — confirming the F1–F7 layer is a helper-body pin,
not a regression-replay layer for those specific historical bypass
shapes. Targeted RED-bait scenarios for each fix are left for a
follow-up PR.

* test(proxy): push key_management_endpoints.py past the 70% stretch (F7-extension)

Adds 24 more payload-pin scenarios in test_f7_key_coverage_push.py
following the same accepted-effect / rejected-guard pattern. Each
scenario cites the file:line range it pins; same anti-snapshot rules
apply.

Target ranges (all reachable via HTTP-boundary payload variation):
  * 5942-6063  /key/health with metadata.logging → test_key_logging body
  * 4565-4692  /key/reset_spend happy + 404 + non-admin gate + value validation
  * 4421-4533  /key/regenerate ghost-404 + happy + new_key + grace_period
  * 4168-4202  _insert_deprecated_key body via grace_period
  * 6118-6133  _enforce_unique_key_alias duplicate-alias rejection
  * 6148-6169  validate_model_max_budget malformed-payload rejection
  * 4708-4789  validate_key_list_check user/team/org/key_hash branches
  * 2622-2733  /key/bulk_update mixed success/failure + admin gate + size limits
  * 2797-2950  /team/key/bulk_update all-keys path + explicit-keys dedupe + 404
  * 5108-5207  /key/aliases admin + scoped + search-filter branches
  * 3253-3303  /key/info ghost + explicit-key + no-key-uses-auth-header
  * 3427-3436  generate_key_helper_fn budget_limits initialization
  * 1794-1815  prepare_key_update_data duration + budget_duration paths
  * 5280-5388  _build_filter_conditions across include_created_by_keys/team/sort/alias

Coverage telemetry — full PR4 dataset:

  key_management_endpoints.py: 60 % → 71 %  (+11 pts, +194 lines)
  team_endpoints.py:           62 % → 72 %  (+10 pts, +137 lines)

Both files now over the plan §7 PR4.M4 70 % stretch as a side effect of
pinning real payload behavior. 721 tests pass in 19 s local (full suite,
no coverage); 27 s with coverage. Zero source-code changes; pyproject.toml
diff still empty; test_route_coverage.py + G3 grep guards still green.

Honest finding (kept from the prior commit's body): four structurally-dead
org-budget guards remain pinned as observed no-op behavior — they fire
only when get_org_object is called with include_budget_table=True, which
none of the four management-endpoint call sites currently do. Pinned so
a future change that flips the flag turns these into reds.

Two helper guards are honest-ceiling: _validate_reset_spend_value's
isinstance check at line 4568 is unreachable from HTTP because Pydantic
422s non-float before the helper runs; same shape for /team/key/bulk_update's
missing team_id / no-selector pre-handler guards.

* test(proxy): address PR review — try/finally cleanup + loosen 500 envelope pins + Optional annotations

Greptile review feedback on PR #28681:

1. Wrap manual budget-row cleanup in try/finally so an assertion failure
   doesn't leave non-scratch-prefixed budget rows orphaned across CI re-runs
   (test_team_new_with_team_member_budget_creates_budget_row and
   test_team_update_team_member_budget_upserts).
2. Loosen the two 500-status pins to `in (400, 422, 500)`; the named-guard
   substring is the real pin; the outer ValueError-wrap envelope is an
   implementation detail that a future improvement should be free to fix
   to a proper 400/422 without flipping these tests red.
3. Add missing Optional annotations on _seed_token's max_budget / metadata
   / team_id keyword args (they default to None).
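
A rough sketch of review points 1 and 2, under stated assumptions: the in-memory `budget_table`, the `create_team_with_member_budget` function, and the guard message are all hypothetical stand-ins, not the suite's real fixtures or the proxy's real error text.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for the budget table the real tests write to.
budget_table: Dict[str, float] = {}


def create_team_with_member_budget(budget_id: str, max_budget: float) -> Tuple[int, str]:
    # Pretend endpoint: a named guard rejects negative budgets.
    if max_budget < 0:
        return 500, "max_budget cannot be negative"
    budget_table[budget_id] = max_budget
    return 200, "ok"


def test_member_budget_creates_row() -> None:
    budget_id = "non-scratch-budget"
    try:
        status, _ = create_team_with_member_budget(budget_id, 10.0)
        assert status == 200
        assert budget_table[budget_id] == 10.0
    finally:
        # Runs even when an assertion above fails, so no row is orphaned.
        budget_table.pop(budget_id, None)


def test_negative_budget_is_rejected() -> None:
    status, detail = create_team_with_member_budget("rejected-budget", -1.0)
    # The status envelope is pinned loosely so it can later become 400/422.
    assert status in (400, 422, 500)
    # The named-guard substring is the real pin.
    assert "cannot be negative" in detail
```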

Greptile's typo flag on 'read-world' in the conftest comment is declined —
'read-world' is the project's established term for the immutable seeded
world fixture (see other usages in conftest.py and actors.py).

721 tests still pass in 17 s.
…) (#28378)

* feat(prometheus): emit per-token-type detail metrics (LIT-3220) (#28372)

Adds five sparse counter metrics that break out the token detail
fields providers already report in `usage.prompt_tokens_details` and
`usage.completion_tokens_details`:

  - litellm_input_cached_tokens_metric            (provider prompt-cache reads)
  - litellm_input_cache_creation_tokens_metric    (Anthropic prompt-cache writes)
  - litellm_input_audio_tokens_metric             (audio input tokens)
  - litellm_output_reasoning_tokens_metric        (reasoning tokens)
  - litellm_output_audio_tokens_metric            (audio output tokens)

These are additive — existing input/output/total counters are
unchanged, so no dashboards break. Each new counter is only
incremented when the underlying detail is populated and > 0, keeping
scrape output sparse for providers that don't report a given field.

Data is read from the canonical Usage dict that
`get_standard_logging_object_payload` already attaches at
`standard_logging_payload["metadata"]["usage_object"]`, so no new
plumbing through the logging pipeline is required.
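
The sparse-increment rule can be illustrated with a small sketch. It uses a plain `collections.Counter` in place of real Prometheus counters, and the mapping from metric name to usage field is an assumption for the example (in particular the `cache_creation_tokens` field path); only the metric names and the `metadata.usage_object` location come from the commit message.

```python
from collections import Counter
from typing import Any, Dict, Tuple

# Metric names are from the commit message; the (section, field) paths
# are assumed for illustration and may differ from the real Usage object.
DETAIL_FIELDS: Dict[str, Tuple[str, str]] = {
    "litellm_input_cached_tokens_metric": ("prompt_tokens_details", "cached_tokens"),
    "litellm_input_cache_creation_tokens_metric": ("prompt_tokens_details", "cache_creation_tokens"),
    "litellm_input_audio_tokens_metric": ("prompt_tokens_details", "audio_tokens"),
    "litellm_output_reasoning_tokens_metric": ("completion_tokens_details", "reasoning_tokens"),
    "litellm_output_audio_tokens_metric": ("completion_tokens_details", "audio_tokens"),
}


def record_token_details(counters: Counter, payload: Dict[str, Any]) -> None:
    """Increment a detail counter only when its value is populated and > 0."""
    metadata = payload.get("metadata")
    if not isinstance(metadata, dict):
        return  # no-metadata path: nothing to do
    usage = metadata.get("usage_object")
    if not isinstance(usage, dict):
        return  # no-usage_object path: nothing to do
    for metric, (section, field) in DETAIL_FIELDS.items():
        details = usage.get(section)
        if not isinstance(details, dict):
            continue
        value = details.get(field)
        # Skip None, zero, and negative values so scrape output stays sparse.
        if isinstance(value, int) and value > 0:
            counters[metric] += value
```

Because zero and missing values never touch a counter, a provider that reports none of these fields adds no new series in this sketch.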

Tests: 10 new unit tests covering registration, label-set parity,
all-types increment, zero/None/negative skip behaviour, and the
no-metadata/no-usage_object no-op paths.

Closes LIT-3220

Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Krrish Dholakia <krrishdholakia@berri.ai>
Co-authored-by: Claude <noreply@anthropic.com>

* chore: remove proof folder image

---------

Co-authored-by: oss-agent-shin <ext-agent-shin@berri.ai>
Co-authored-by: shin-berri <shin-laptop@berri.ai>
Co-authored-by: yuneng-jiang <yuneng@berri.ai>
Co-authored-by: Krrish Dholakia <krrishdholakia@berri.ai>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Ishaan Jaffer <ishaanjaffer0324@gmail.com>
…8405)

* fix(otel): stamp http.response.status_code on all error responses

httpx.HTTPStatusError exposes status under .response.status_code, not as a
top-level attr, so unified-endpoint 5xx failures left the SERVER span without
a status. The admin hooks only wrote a child span and never stamped or ended
the parent at all, so admin 4xx/5xx (and success) responses were invisible
to dashboards. Adds a fallback to .response.status_code in get_error_information,
and ends the parent SERVER span in async_management_endpoint_{success,failure}_hook
with the same _record_exception_on_span helper the unified path uses.
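
The status-code fallback can be sketched as below. This is an illustration only: `FakeHTTPStatusError` mimics the shape of `httpx.HTTPStatusError` (status available only under `.response.status_code`) so the example runs without httpx, and `extract_status_code` is a hypothetical name, not the body of `get_error_information`.

```python
from typing import Any, Optional


class FakeResponse:
    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


class FakeHTTPStatusError(Exception):
    """Mimics httpx.HTTPStatusError: the status lives on .response only."""

    def __init__(self, message: str, response: FakeResponse) -> None:
        super().__init__(message)
        self.response = response


def _as_int(value: Any) -> Optional[int]:
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def extract_status_code(exc: BaseException) -> Optional[int]:
    # First try the top-level attributes many exception types expose.
    for attr in ("status_code", "code"):
        status = _as_int(getattr(exc, attr, None))
        if status is not None:
            return status
    # Fallback: look under .response.status_code.
    response = getattr(exc, "response", None)
    return _as_int(getattr(response, "status_code", None))
```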

Resolves LIT-3193

* test(otel): exercise httpx.HTTPStatusError through admin path

Pins the contract that get_error_information's response.status_code fallback
is reachable from any entry point — without this, a future refactor that
bypasses _record_exception_on_span in the admin hooks could regress for
httpx-wrapped exceptions while the unified suite still passes.

* chore(otel): trim verbose comments in LIT-3193 changes

Tighten docstrings and remove redundant section dividers/inline narration.
Behavior is unchanged.

* fix(otel): set span.status on management hook parent SERVER span

Mirror the unified failure path: stamp StatusCode.ERROR on the parent
SERVER span before recording the exception, and StatusCode.OK before
ending it on success. Without this, OTEL backends filtering on span
status (the idiomatic primitive) miss admin-endpoint failures even
though the http.response.status_code attribute is correct.

Extend assert_server_span_attrs to assert span.status.status_code
matches the expected outcome so the gap can't regress.

* fix(otel): close SERVER span on body-validation and unhandled errors

Stash the SERVER span on request.state in auth so FastAPI exception
handlers can finish it for failures that occur after auth but before
the route handler (e.g. /model/new TypeError, /key/generate
RequestValidationError). Without this, those requests left dangling
spans missing http.response.status_code.

Resolves LIT-3193

* fix(otel): generic 500 body, log exception details server-side

Don't leak str(exc) and type(exc).__name__ to clients on uncaught
exceptions. The full traceback is logged via verbose_proxy_logger and
the SERVER span still gets http.response.status_code=500.

Resolves LIT-3193

* fix(otel): stamp http.response.status_code on every SERVER span path

Closes three remaining gaps where the proxy SERVER span ended without
the http.response.status_code attribute:

1. ProxyException raised from _read_request_body (e.g. invalid JSON
   body) bubbled out of user_api_key_auth before the SERVER span was
   created, so the FastAPI handler had nothing to close and the trace
   never reached the backend. Hoist the span creation to a new
   idempotent _ensure_parent_otel_span_on_request_state helper called
   at the top of user_api_key_auth; wire openai_exception_handler to
   close the dangling span. Covers /v1/chat/completions, /v1/messages,
   /v1/responses (shared handler).

2. /v1/responses success — _handle_success ends the proxy span before
   async_post_call_success_hook fires on this path, so the hook's
   set_response_status_code_attribute(200) silently no-op'd against an
   ended span. Stamp 200 + set OK status at the close site in
   _handle_success / _end_proxy_span_from_kwargs via a shared
   _close_proxy_span_ok helper, so the attribute lands regardless of
   which success hook runs first.

3. Failure path for exceptions without code/status_code (e.g. a bare
   TypeError surfacing through _handle_llm_api_exception) — empty
   error_information.error_code → _record_exception_on_span skips the
   stamp → the hook ends the span. Default to 500 in
   async_post_call_failure_hook so the attribute is always set.
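
Two of the ideas above, an idempotent "ensure the span exists" helper and defaulting to 500 on the failure path, can be sketched as follows. Everything here is a simplified assumption: `FakeSpan` stands in for an OpenTelemetry span, and `parent_otel_span` as the attribute name on `request.state` is hypothetical.

```python
from typing import Any, Dict, Optional


class FakeSpan:
    """Minimal span stand-in: attribute writes after end() are ignored."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.attributes: Dict[str, Any] = {}
        self.ended = False

    def set_attribute(self, key: str, value: Any) -> None:
        if not self.ended:
            self.attributes[key] = value

    def end(self) -> None:
        self.ended = True


def ensure_parent_span(request: Any) -> FakeSpan:
    # Idempotent: the first call creates and stashes the span, later calls
    # return the same object. The attribute name is hypothetical.
    span = getattr(request.state, "parent_otel_span", None)
    if span is None:
        span = FakeSpan("SERVER")
        request.state.parent_otel_span = span
    return span


def close_span_on_failure(span: FakeSpan, status_code: Optional[int]) -> None:
    if span.ended:
        return
    # Default to 500 when the exception carried no usable status code.
    resolved = status_code if status_code is not None else 500
    span.set_attribute("http.response.status_code", resolved)
    span.end()
```

Stamping the attribute at the close site, before `end()`, is what keeps a later hook's write from being lost against an already-ended span.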

Resolves LIT-3193

* fix(helm): drop main- prefix from default image tag

The default image tag in the deployment + migrations-job templates was
`main-{{ .Chart.AppVersion }}`. The current release pipeline publishes
content tags without the `main-` prefix (e.g. `v1.85.1` / `1.85.1`,
`v1.86.0-rc.1` / `1.86.0-rc.1`), so the rendered ref points at a tag
that does not exist on GHCR or DockerHub and installs fail with
ImagePullBackOff.

- templates/deployment.yaml, templates/migrations-job.yaml: render
  `.Chart.AppVersion` directly instead of `main-<AppVersion>`.
- Chart.yaml: bump stale `appVersion: v1.80.12` (not on either
  registry) to `v1.85.1` so local-checkout installs also resolve.
- values.yaml: update the commented tag-override hint to match.

* fix(helm): use :latest in tag override example, not pinned version

Per review: ghcr.io/berriai/litellm-database:latest is a floating
alias for the most recent stable (same digest as :main-stable),
maintained by the release pipeline's UPDATE_LATEST advance step.
Better example than a pinned version that goes stale.

The schema in test_aaamodel_prices_and_context_window_json_is_valid uses
additionalProperties: false. The azure/speech/azure-stt entry added in
#27482 introduced an audio_transcription_config field that the schema
did not whitelist, so the test fails on every branch built on top of
staging.

Add the field as a string property.
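
To show why an unlisted field fails under `additionalProperties: false`, here is a minimal hand-rolled check. It is a sketch only: it does not use a JSON Schema library, and `unknown_fields`, the sample schema, and the sample entry values are all illustrative assumptions rather than the project's real schema.

```python
from typing import Any, Dict, List


def unknown_fields(entry: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return keys rejected because the schema forbids additional properties."""
    if schema.get("additionalProperties", True) is not False:
        return []  # extra keys are allowed
    allowed = set(schema.get("properties", {}))
    return sorted(set(entry) - allowed)


# Illustrative schema and entry; real field lists are much longer.
SAMPLE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "litellm_provider": {"type": "string"},
        "mode": {"type": "string"},
    },
    "additionalProperties": False,
}

SAMPLE_ENTRY: Dict[str, Any] = {
    "litellm_provider": "azure",
    "mode": "audio_transcription",
    "audio_transcription_config": "example-value",
}
```

With this sample, `unknown_fields(SAMPLE_ENTRY, SAMPLE_SCHEMA)` reports `audio_transcription_config` until a matching `{"type": "string"}` property is added to the schema.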
…8683)

* fix(team): refresh team cache on team_model_add/delete (LIT-3244)

team_model_add and team_model_delete wrote to the DB but did not
invalidate the in-memory LiteLLM_TeamTableCachedObj used by
common_checks. After the v1.83.14 common_checks centralization made
team.models authoritative on /v1/files and /v1/vector_stores/*,
adding a Team-BYOK model silently failed to grant the new public
model name to team members until the cache TTL expired (and a
removed model kept working until then on the symmetric path).

Extract the cache-refresh snippet from update_team into a small
helper and apply it consistently at all three team-write sites.
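
The write-then-refresh pattern can be sketched with in-memory dictionaries. This is an assumption-heavy illustration: `db` and `team_cache` stand in for Prisma and the in-memory team cache, the `Team` dataclass is hypothetical, and only the function names `team_model_add`, `team_model_delete`, and `_refresh_cached_team` are taken from the commit messages.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class Team:
    team_id: str
    models: List[str] = field(default_factory=list)


db: Dict[str, Team] = {}          # stand-in for the database
team_cache: Dict[str, Team] = {}  # stand-in for the in-memory team cache


def _refresh_cached_team(team: Team) -> None:
    # Overwrite the cached row with the post-write row.
    team_cache[team.team_id] = team


def team_model_add(team_id: str, model: str) -> Team:
    current = db[team_id]
    updated = replace(current, models=[*current.models, model])
    db[team_id] = updated
    _refresh_cached_team(updated)  # without this, readers see stale models
    return updated


def team_model_delete(team_id: str, model: str) -> Team:
    current = db[team_id]
    updated = replace(current, models=[m for m in current.models if m != model])
    db[team_id] = updated
    _refresh_cached_team(updated)
    return updated
```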

* test: also assert updated models in team-cache-refresh pin

Strengthens the LIT-3244 regression test to also assert
`call_kwargs["team_table"].models` matches the updated row,
not just `team_id`. Both `existing_team` and `updated_team`
share `team_id` in the test setup, so the previous assertion
would have passed even if the implementation accidentally cached
the pre-mutation row.

Greptile review feedback.

* fix(team): hydrate object_permission on cache-refreshing team updates

The Prisma update calls in update_team, team_model_add, and
team_model_delete returned a team row with object_permission_id set
but object_permission=None (the relation was not requested via
include=). _refresh_cached_team then wrote that to the in-memory
LiteLLM_TeamTableCachedObj, and the cache-hit path in get_team_object
returns the cached object without re-hydrating. Downstream consumers
(validate_key_search_tools_against_team, the MCP/agent authz paths)
treat a missing object_permission as no team-level restriction, so
a team-write op silently dropped object-permission enforcement until
the cache TTL expired or a DB-fetch path re-hydrated it.

Add include={"object_permission": True} to all three updates so the
refresh writes a complete cached team. Extend the LIT-3244 regression
test to pin both the cached object_permission and the include shape
on the Prisma call.

Surfaced in PR review of LIT-3244.
… Anthropic (#28723)

`getProviderModels()` matched a model into a provider's dropdown when the
model's `litellm_provider` string *contained* the provider key as a
substring. The intent was to admit suffix variants (e.g. `anthropic_text`,
`bedrock_converse`), but the substring check is too loose: it also pulls in
unrelated providers whose name happens to contain the key, most visibly
`vertex_ai-anthropic_models` matching `anthropic` and `vertex_ai-openai_models`
matching `openai`.

Replace `.includes()` with separator-anchored prefix matching
(`startsWith(provider + "_")` / `startsWith(provider + "-")`). All legitimate
variants in `model_prices_and_context_window.json` still match
(`anthropic_text`, `azure_text`, `azure_ai`, `bedrock_converse`,
`bedrock_mantle`, `cohere_chat`, `fireworks_ai-embedding-models`,
`vertex_ai-*`, `vertex_ai_beta`), and the cross-provider leak is closed.
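
The matching rule can be sketched in a few lines. The original `getProviderModels()` lives in the UI code and uses `startsWith`; this Python version is only an illustration of the same rule, and treating an exact match as a hit is an assumption of the sketch.

```python
def provider_matches(litellm_provider: str, provider_key: str) -> bool:
    """Separator-anchored prefix match instead of a substring check."""
    return (
        litellm_provider == provider_key  # exact match (assumed to be kept)
        or litellm_provider.startswith(provider_key + "_")
        or litellm_provider.startswith(provider_key + "-")
    )


def substring_matches(litellm_provider: str, provider_key: str) -> bool:
    """The looser behavior being replaced, shown for comparison."""
    return provider_key in litellm_provider
```

For example, `vertex_ai-anthropic_models` matches the key `anthropic` under the substring check but not under the anchored check, while `anthropic_text` matches under both.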

Tests: update one assertion that pinned the buggy substring behavior
(`custom_openai_endpoint` matching `openai` — not a real provider value);
add 6 new tests covering the leak regressions and the variant-preservation
contract for vertex_ai/bedrock/fireworks.

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: ryan-crabbe-berri <ryan-crabbe-berri@users.noreply.github.com>
…rs and signed request body (#27526)

* Fix Bedrock KB pass-through SigV4 headers and signed body

Coerce botocore HeadersDict to a dict for pass-through routes. When
forward_headers is true, drop request headers that collide case-insensitively
with signed headers so client Bearer auth does not shadow AWS SigV4.
Send prepped.body as raw content so the outbound payload matches the
signature after logging hooks mutate the parsed dict.
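
The header-collision rule can be sketched as follows. `merge_forwarded_headers` is a hypothetical helper written for this example, not the function in the PR; it assumes both inputs are mapping-like objects with string keys and values.

```python
from typing import Dict, Mapping


def merge_forwarded_headers(
    client_headers: Mapping[str, str], signed_headers: Mapping[str, str]
) -> Dict[str, str]:
    # Coerce the signed headers (possibly a dict-like type) to a plain dict.
    signed = dict(signed_headers)
    signed_names = {name.lower() for name in signed}
    # Drop client headers that collide case-insensitively with signed ones,
    # so a client "authorization" cannot shadow the SigV4 "Authorization".
    merged = {
        name: value
        for name, value in client_headers.items()
        if name.lower() not in signed_names
    }
    merged.update(signed)
    return merged
```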

Co-authored-by: Cursor <cursoragent@cursor.com>

* Simplify pass-through raw body handling

Read the SigV4-signed bytes directly from request.state inside
pass_through_request instead of threading a custom_raw_body argument
through three functions. Helper methods are restored to their original
signatures, and the new branch lives in one place at each httpx call site.

Co-authored-by: Cursor <cursoragent@cursor.com>

* Harden pass-through raw body read from request.state

Guard missing request.state (test fixtures) and ignore non-bytes/str
values so MagicMock does not trigger the SigV4 raw-body path.
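
A guarded read of that kind might look like the sketch below. The helper name and the `signed_raw_body` attribute name are assumptions made for illustration; the commit message does not state the real attribute name.

```python
from typing import Any, Optional, Union


def read_signed_raw_body(
    request: Any, attr: str = "signed_raw_body"
) -> Optional[Union[bytes, str]]:
    # Some test fixtures have no .state at all.
    state = getattr(request, "state", None)
    if state is None:
        return None
    value = getattr(state, attr, None)
    # Only real bytes/str count; a mock's auto-created attribute is truthy
    # but is neither, so it must not trigger the raw-body path.
    if isinstance(value, (bytes, str)):
        return value
    return None
```

A caller would send the returned value as raw content when it is not `None`, and fall back to the normal JSON path otherwise.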

Co-authored-by: Cursor <cursoragent@cursor.com>

* Test pass_through_request state_raw_body uses httpx content=

Cover non-streaming (async_client.request) and streaming (build_request)
paths so SigV4 bytes on request.state are not replaced by json= of a
hook-mutated dict.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>

Adds a new `disable_daily_spend_aggregation: true` config flag to
`general_settings` that completely skips writing to the
DailyUser/Team/Tag/Org/EndUser/Agent spend aggregation tables.

When this flag is set:
- `_enqueue_daily_spend_updates` returns early without touching any of the
  six daily spend queues, so the Redis buffer keys
  (`litellm_daily_*_spend_update_buffer`) never grow.
- The dedicated `update_daily_tag_spend_job` APScheduler job is not
  registered at startup.
- Key/user/team balance updates (LiteLLM_KeyTable, UserTable, TeamTable)
  continue to work normally.
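
The early-return behavior can be sketched as below. This is a simplified model, not the real `DBSpendUpdateWriter`: the class name, the in-memory queues, the queue labels, and the `update_database` method are assumptions for the example; only `_enqueue_daily_spend_updates` and the flag name come from the PR text.

```python
from collections import deque
from typing import Any, Deque, Dict, List

# Labels for the six daily queues; the names are illustrative.
DAILY_QUEUE_NAMES = ("user", "team", "tag", "org", "end_user", "agent")


class SpendUpdateWriterSketch:
    def __init__(self, general_settings: Dict[str, Any]) -> None:
        self.general_settings = general_settings
        self.daily_queues: Dict[str, Deque[Dict[str, Any]]] = {
            name: deque() for name in DAILY_QUEUE_NAMES
        }
        self.balance_updates: List[Dict[str, Any]] = []

    def _enqueue_daily_spend_updates(self, payload: Dict[str, Any]) -> None:
        # Return before touching any queue when the flag is set, so nothing
        # downstream (for example a Redis buffer) can grow.
        if self.general_settings.get("disable_daily_spend_aggregation") is True:
            return
        for queue in self.daily_queues.values():
            queue.append(payload)

    def update_database(self, payload: Dict[str, Any]) -> None:
        # Balance updates take a separate path and always run.
        self.balance_updates.append(payload)
        self._enqueue_daily_spend_updates(payload)
```

Consolidating the six enqueue calls behind one helper means the flag is checked in a single place rather than at each call site.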

This resolves the customer OOM scenario (LIT-3332) where
`disable_spend_logs: true` suppresses per-request spend log rows but the
daily aggregation path still runs unconditionally, filling Redis buffers
until the instance OOMs.

Resolves LIT-3332

https://claude.ai/code/session_01VSh9vug6wHMkBWZ76MjdTt

CLAassistant commented May 25, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
5 out of 8 committers have signed the CLA.

✅ mateo-berri
✅ Michael-RZ-Berri
✅ yuneng-berri
✅ ryan-crabbe-berri
✅ milan-berri
❌ yassin-berriai
❌ ishaan-berri
❌ claude

@yassin-berriai yassin-berriai changed the base branch from main to litellm_oss_branch May 25, 2026 19:14

codspeed-hq Bot commented May 25, 2026

Congrats! CodSpeed is installed 🎉

🆕 16 new benchmarks were detected.

You will start to see performance impacts in the reports once the benchmarks are run from your default branch.


claude added 2 commits May 25, 2026 19:19
Required by the assert-shard-coverage guard — every test_*.py under
tests/proxy_unit_tests/ must appear in a named matrix shard.

https://claude.ai/code/session_01VSh9vug6wHMkBWZ76MjdTt

The uv.lock was inadvertently updated by a local `uv pip install` run
that installed test dependencies. This caused the CI `uv lock --check`
step to fail because the lockfile no longer matched what pyproject.toml
+ uv 0.10.9 (the pinned CI version) would generate.

Restoring uv.lock to the state it was in at the feature commit baseline.

https://claude.ai/code/session_01VSh9vug6wHMkBWZ76MjdTt

9 participants