Commit de41845

feat(openai_agents): pull cached tokens through into metrics (#364)
## Summary

- Walk `*_tokens_details` sub-objects in `_usage_to_metrics` so the OpenAI Agents SDK integration picks up cached / reasoning / audio token counts (e.g. `input_tokens_details.cached_tokens` → `prompt_cached_tokens`). Mirrors the JS fix in [braintrust-sdk-javascript#1186](braintrustdata/braintrust-sdk-javascript@a05dc4d).
- Route `_response_log_data` through `_usage_to_metrics` instead of hardcoding the three `total/input/output` fields, so the Responses API path benefits from the same extraction.
- `_task_log_data` and `_turn_log_data` already delegated to `_usage_to_metrics`, so they inherit the fix.

## Why

A customer reported that cached tokens are not showing up in the Python `BraintrustTracingProcessor`. The narrow 3-field extraction in `_response_log_data` (Responses API) and `_usage_to_metrics` (chat-completions / Generation spans) drops `input_tokens_details.cached_tokens` even though the OpenAI wrapper (`braintrust/oai.py`'s `_parse_metrics_from_usage`) already handles it correctly. The JS SDK was patched in December but the Python equivalent was never written.

## Test plan

- [x] `test_response_span_extracts_cached_tokens_from_usage`: Response span sees `prompt_cached_tokens`
- [x] `test_response_span_handles_zero_cached_tokens`: zero is preserved, not dropped
- [x] `test_response_span_handles_missing_cached_tokens`: no `prompt_cached_tokens` key when details absent
- [x] `test_generation_span_extracts_cached_tokens_from_usage`: Generation span path
- [x] Existing non-VCR processor tests still pass

---------

Co-authored-by: Abhijeet Prasad <abhijeet@braintrustdata.com>
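
The extraction described in the summary can be illustrated in isolation. The snippet below is a minimal sketch, not the SDK's code: `details_to_metrics` and `TOKEN_PREFIX_MAP` are hypothetical names, the sample usage dicts are made up, and only the `*_tokens_details` walk is covered (top-level token counts are left out). The three calls correspond to the "present", "zero", and "missing" cases in the test plan.

```python
from typing import Any

# Assumption: same prefix mapping as the one this commit describes.
TOKEN_PREFIX_MAP = {"input": "prompt", "output": "completion"}
SUFFIX = "_tokens_details"


def details_to_metrics(usage: dict[str, Any]) -> dict[str, Any]:
    """Flatten `*_tokens_details` sub-dicts into prefixed metric names."""
    metrics: dict[str, Any] = {}
    for key, value in usage.items():
        if not key.endswith(SUFFIX) or not isinstance(value, dict):
            continue
        raw_prefix = key[: -len(SUFFIX)]
        # Unknown prefixes pass through unchanged.
        prefix = TOKEN_PREFIX_MAP.get(raw_prefix, raw_prefix)
        for sub_key, sub_value in value.items():
            # bool is a subclass of int in Python, so it is rejected explicitly.
            if isinstance(sub_value, bool) or not isinstance(sub_value, (int, float)):
                continue
            metrics[f"{prefix}_{sub_key}"] = sub_value
    return metrics


# Made-up sample payloads for the three cases in the test plan.
cached = details_to_metrics({"input_tokens": 120, "input_tokens_details": {"cached_tokens": 64}})
zero = details_to_metrics({"input_tokens_details": {"cached_tokens": 0}})
missing = details_to_metrics({"input_tokens": 120, "output_tokens": 8})
print(cached, zero, missing)
```

Because the inner loop filters on type rather than truthiness, a count of `0` is kept, and a usage dict without any details object produces no `prompt_cached_tokens` key at all.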
1 parent aff7b6a commit de41845

2 files changed

Lines changed: 29 additions & 3 deletions

py/src/braintrust/integrations/openai_agents/test_openai_agents.py

Lines changed: 6 additions & 0 deletions
@@ -125,6 +125,12 @@ async def test_openai_agents_integration_setup_creates_spans(memory_logger):

     llm_spans = [span for span in spans if span.get("span_attributes", {}).get("type") == "llm"]
     assert llm_spans
+    llm_metrics = [span.get("metrics", {}) for span in llm_spans]
+    assert any(metrics.get("prompt_tokens") is not None for metrics in llm_metrics)
+    assert any(metrics.get("completion_tokens") is not None for metrics in llm_metrics)
+    assert any(metrics.get("tokens") is not None for metrics in llm_metrics)
+    assert any(metrics.get("prompt_cached_tokens") == 0 for metrics in llm_metrics)
+    assert any(metrics.get("completion_reasoning_tokens") == 0 for metrics in llm_metrics)


 @pytest.mark.asyncio

py/src/braintrust/integrations/openai_agents/tracing.py

Lines changed: 23 additions & 3 deletions
@@ -69,6 +69,14 @@ def _maybe_timestamp_elapsed(end: str | None, start: str | None) -> float | None
     return (datetime.datetime.fromisoformat(end) - datetime.datetime.fromisoformat(start)).total_seconds()


+# Maps the prefix of an OpenAI usage `*_tokens_details` field to the Braintrust
+# metric prefix (e.g. `input_tokens_details.cached_tokens` → `prompt_cached_tokens`).
+_TOKEN_PREFIX_MAP = {
+    "input": "prompt",
+    "output": "completion",
+}
+
+
 def _usage_to_metrics(usage: dict[str, Any]) -> dict[str, Any]:
     """Convert an OpenAI-style usage dict to Braintrust metrics."""
     metrics: dict[str, Any] = {}
@@ -86,6 +94,19 @@ def _usage_to_metrics(usage: dict[str, Any]) -> dict[str, Any]:
         metrics["tokens"] = usage["total_tokens"]
     elif "input_tokens" in usage and "output_tokens" in usage:
         metrics["tokens"] = usage["input_tokens"] + usage["output_tokens"]
+
+    # Walk *_tokens_details sub-objects so we capture cached / reasoning / audio
+    # token counts (e.g. input_tokens_details.cached_tokens → prompt_cached_tokens).
+    for key, value in usage.items():
+        if not key.endswith("_tokens_details") or not isinstance(value, dict):
+            continue
+        raw_prefix = key[: -len("_tokens_details")]
+        prefix = _TOKEN_PREFIX_MAP.get(raw_prefix, raw_prefix)
+        for sub_key, sub_value in value.items():
+            if isinstance(sub_value, bool) or not isinstance(sub_value, (int, float)):
+                continue
+            metrics[f"{prefix}_{sub_key}"] = sub_value
+
     return metrics


@@ -166,9 +187,8 @@ def _response_log_data(self, span: tracing.Span[tracing.ResponseSpanData]) -> di
         if ttft is not None:
             data["metrics"]["time_to_first_token"] = ttft
         if span.span_data.response is not None and span.span_data.response.usage is not None:
-            data["metrics"]["tokens"] = span.span_data.response.usage.total_tokens
-            data["metrics"]["prompt_tokens"] = span.span_data.response.usage.input_tokens
-            data["metrics"]["completion_tokens"] = span.span_data.response.usage.output_tokens
+            usage_dict = span.span_data.response.usage.model_dump()
+            data["metrics"].update(_usage_to_metrics(usage_dict))

         return data
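
The `_response_log_data` change replaces three attribute reads with a dump-to-dict step followed by `dict.update`. The snippet below is a rough sketch of that flow under stated assumptions: `FakeUsage` and `FakeInputDetails` are hypothetical dataclass stand-ins for the SDK's usage model (their `model_dump` is emulated with `dataclasses.asdict`), `usage_to_metrics` is a simplified converter rather than the real `_usage_to_metrics`, and all numbers are invented.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class FakeInputDetails:
    """Hypothetical stand-in for the nested input token details."""
    cached_tokens: int = 0


@dataclass
class FakeUsage:
    """Hypothetical stand-in for the response usage model."""
    input_tokens: int
    output_tokens: int
    total_tokens: int
    input_tokens_details: FakeInputDetails = field(default_factory=FakeInputDetails)

    def model_dump(self) -> dict[str, Any]:
        # asdict converts nested dataclasses into nested dicts.
        return asdict(self)


def usage_to_metrics(usage: dict[str, Any]) -> dict[str, Any]:
    """Simplified converter: three top-level counts plus cached input tokens."""
    metrics: dict[str, Any] = {}
    if "input_tokens" in usage:
        metrics["prompt_tokens"] = usage["input_tokens"]
    if "output_tokens" in usage:
        metrics["completion_tokens"] = usage["output_tokens"]
    if "total_tokens" in usage:
        metrics["tokens"] = usage["total_tokens"]
    details = usage.get("input_tokens_details")
    if isinstance(details, dict) and "cached_tokens" in details:
        metrics["prompt_cached_tokens"] = details["cached_tokens"]
    return metrics


# Metrics recorded before usage is processed must survive the merge.
data: dict[str, Any] = {"metrics": {"time_to_first_token": 0.42}}
usage = FakeUsage(input_tokens=120, output_tokens=8, total_tokens=128,
                  input_tokens_details=FakeInputDetails(cached_tokens=64))
data["metrics"].update(usage_to_metrics(usage.model_dump()))
print(data["metrics"])
```

Using `update` rather than assigning a new dict keeps `time_to_first_token` in place while adding the token metrics, which is why the converter can be shared with the other span types.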
