fix(instrumentation): read provider token usage metadata by Mr-Dark-debug · Pull Request #4318 · traceloop/openllmetry

Mr-Dark-debug · 2026-06-22T11:00:18Z

What changed

This PR now covers both token-usage paths reported in the issue thread:

LangChain + ChatDatabricks: token usage may be exposed on AIMessage.response_metadata, either directly as prompt_tokens, completion_tokens, and total_tokens, or nested under usage.
LlamaIndex + VertexAI/Gemini: token usage may be exposed on response.raw.usage_metadata / response.raw.usageMetadata using Google fields such as prompt_token_count, candidates_token_count, and total_token_count.

The instrumentation previously missed those provider-specific metadata shapes, so traces could be exported successfully while the GenAI token usage attributes were absent.

This change adds narrow extraction fallbacks for those response payloads and continues to emit the existing GenAI semantic convention attributes:

gen_ai.usage.input_tokens
gen_ai.usage.output_tokens
gen_ai.usage.total_tokens
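
As a minimal sketch of the LangChain case described above, the snippet below assumes `response_metadata` is a plain dict. The helper name `usage_attributes` is hypothetical and is not taken from the PR; it only illustrates how the flat and nested Databricks-style shapes can resolve to the same three attributes.

```python
from collections.abc import Mapping
from typing import Any, Optional

# Attribute names as listed in the PR description.
INPUT_TOKENS = "gen_ai.usage.input_tokens"
OUTPUT_TOKENS = "gen_ai.usage.output_tokens"
TOTAL_TOKENS = "gen_ai.usage.total_tokens"


def usage_attributes(response_metadata: Mapping) -> Optional[dict]:
    """Hypothetical helper: map Databricks-style metadata to span attributes."""
    # Prefer the nested "usage" dict; otherwise fall back to top-level keys.
    nested: Any = response_metadata.get("usage")
    usage = nested if isinstance(nested, Mapping) else response_metadata
    if "prompt_tokens" not in usage and "completion_tokens" not in usage:
        return None  # no recognizable token usage in this payload
    prompt = int(usage.get("prompt_tokens", 0))
    completion = int(usage.get("completion_tokens", 0))
    # Assumption: derive the total when the provider omits it.
    total = int(usage.get("total_tokens", prompt + completion))
    return {INPUT_TOKENS: prompt, OUTPUT_TOKENS: completion, TOTAL_TOKENS: total}


flat = {"prompt_tokens": 10, "completion_tokens": 16, "total_tokens": 26}
nested = {"usage": {"prompt_tokens": 10, "completion_tokens": 16, "total_tokens": 26}}
assert usage_attributes(flat) == usage_attributes(nested)
```

Both payloads yield identical attributes, which is the behavior the regression tests in this PR are described as checking.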

Root cause

The existing token extraction logic only handled common usage / usage_metadata shapes for some providers:

LangChain chat response usage extraction read message.usage_metadata, but not Databricks-style message.response_metadata usage.
LlamaIndex raw response extraction read raw.usage and raw.meta, but not Google VertexAI/Gemini raw.usage_metadata.
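
The LlamaIndex half of the root cause can be illustrated with a simplified stand-in. This is a sketch under assumptions: `find_usage_before` and `find_usage_after` are hypothetical names, and the "before" probe only approximates the pre-fix behavior described above (reading `raw.usage` and `raw.meta`).

```python
from types import SimpleNamespace
from typing import Any, Optional


def read_attr_or_key(obj: Any, name: str) -> Optional[Any]:
    """Read `name` from a dict or an object, returning None when absent."""
    if isinstance(obj, dict):
        return obj.get(name)
    return getattr(obj, name, None)


def find_usage_before(raw: Any) -> Optional[Any]:
    # Approximates the described pre-fix probe: only `usage` and `meta`.
    return read_attr_or_key(raw, "usage") or read_attr_or_key(raw, "meta")


def find_usage_after(raw: Any) -> Optional[Any]:
    # Adds the Google-style locations named in the PR description.
    return (
        find_usage_before(raw)
        or read_attr_or_key(raw, "usage_metadata")
        or read_attr_or_key(raw, "usageMetadata")
    )


# Stand-in for a VertexAI/Gemini raw response (shape taken from the PR text).
raw = SimpleNamespace(
    usage_metadata={
        "prompt_token_count": 10,
        "candidates_token_count": 20,
        "total_token_count": 30,
    }
)

assert find_usage_before(raw) is None  # the span would carry no token attributes
assert find_usage_after(raw)["total_token_count"] == 30
```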

Tests

Added LangChain regression tests for Databricks-style response_metadata token usage.
Added LlamaIndex regression tests for VertexAI/Gemini usage_metadata and usageMetadata token usage.
Ran `uv run ruff check .` in `packages/opentelemetry-instrumentation-langchain`.
Ran `uv run pytest tests/test_token_usage.py tests/test_finish_reasons.py tests/test_generation_role_extraction.py --record-mode=none` in `packages/opentelemetry-instrumentation-langchain`.
Ran `uv run pytest tests/ --record-mode=none -k "not anthropic"` in `packages/opentelemetry-instrumentation-langchain`.
Ran `uv run ruff check .` in `packages/opentelemetry-instrumentation-llamaindex`.
Ran `uv run --group test pytest tests/ --record-mode=none` in `packages/opentelemetry-instrumentation-llamaindex`.

Notes

The full LangChain run (`uv run pytest tests/ --record-mode=none`) currently has 3 Anthropic cassette failures in my local environment: requests are sent to api.z.ai, while the recorded cassettes target api.anthropic.com. The non-Anthropic LangChain suite passes.

I do not have Databricks/Dynatrace/VertexAI credentials in this environment, so the fix is covered with regression tests using the documented response payload shapes.

Checklist

I have added tests that cover my changes.
If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
- Not included: I do not have Databricks/Dynatrace/VertexAI credentials in this environment. Regression tests cover the documented response payload shapes.
PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
(If applicable) I have updated the documentation accordingly.
- Not applicable; this fixes extraction behavior without changing public configuration.

Summary by CodeRabbit

Release Notes

Improvements
- Improved token usage tracking for LangChain by extracting input/output/total tokens from additional response metadata shapes and token-key variants, with safer defaults and total fallback (input + output).
- Extended LlamaIndex token usage extraction to support Google/VertexAI-style usage_metadata / usageMetadata, including prompt/candidates/total fields and camelCase variants.
Tests
- Added unit coverage for Databricks-style LangChain token metadata and multiple Google/VertexAI usage metadata payload formats in LlamaIndex.

coderabbitai · 2026-06-22T11:00:28Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59479150-6c56-4b1a-a109-7dc502fcd8bb

📥 Commits

Reviewing files that changed from the base of the PR and between 40e3c11 and 83b66e5.

📒 Files selected for processing (4)

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py
packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py
packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py
packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py

🚧 Files skipped from review as they are similar to previous changes (4)

packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py
packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py
packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py

📝 Walkthrough

Two private helper functions are added to LangChain's span_utils.py to extract token counts from either usage_metadata or response_metadata (supporting multiple key-name variants), and set_chat_response_usage is updated to use them with fallback total-token derivation. Separately, LlamaIndex's extract_token_usage is extended to recognize Google/VertexAI usage_metadata formats using a new extraction helper and refactored _get_int utility. Both packages add parametrized tests validating their respective token extraction paths.

Changes

LangChain Databricks Token Usage Extraction

| Layer / File(s) | Summary |
| --- | --- |
| Token extraction helpers and `set_chat_response_usage` refactor<br>`packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py` | Adds `_get_token_count(usage, *keys)` to extract integer token counts by trying multiple key names with a default of `0`. Adds `_extract_message_usage(message)` to return `usage_metadata` if present, otherwise probes `response_metadata["usage"]` or top-level token keys. Refactors `set_chat_response_usage` to use these helpers, derive total tokens as input+output when an explicit total is missing, and preserve cache-token accumulation. |
| Parametrized test for Databricks `response_metadata` schema<br>`packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py` | New test module adds a `_mock_span()` helper and a parametrized `response_metadata` fixture supplying `prompt_tokens`, `completion_tokens`, and `total_tokens`. The test builds an `AIMessage` with that metadata, calls `set_chat_response_usage` with a Databricks model name, and asserts `GEN_AI_USAGE_INPUT_TOKENS`, `GEN_AI_USAGE_OUTPUT_TOKENS`, and `GEN_AI_USAGE_TOTAL_TOKENS` span attributes are correct. |
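
The multi-key lookup and total fallback summarized in the table could look roughly like the sketch below. It is modeled on the summary text only, not on the actual `span_utils.py` code; `get_token_count` and `summarize` are illustrative names, and the `output_token_count` variant is an assumption that does not appear in this thread.

```python
from collections.abc import Mapping
from typing import Any, Tuple


def get_token_count(usage: Mapping, *keys: str) -> int:
    """Return the first integer found under any of `keys`, else 0."""
    for key in keys:
        value: Any = usage.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            return value
    return 0


def summarize(usage: Mapping) -> Tuple[int, int, int]:
    input_tokens = get_token_count(
        usage, "input_tokens", "prompt_tokens", "input_token_count"
    )
    output_tokens = get_token_count(
        usage, "output_tokens", "completion_tokens", "output_token_count"  # last key assumed
    )
    total_tokens = get_token_count(usage, "total_tokens", "total_token_count")
    # With a default of 0, a missing total is indistinguishable from an
    # explicit zero, so the fallback fires in both cases.
    if total_tokens == 0:
        total_tokens = input_tokens + output_tokens
    return input_tokens, output_tokens, total_tokens


print(summarize({"prompt_tokens": 10, "completion_tokens": 16}))  # (10, 16, 26)
```

Defaulting to `0` keeps the arithmetic simple, at the cost of not being able to tell "absent" from "zero".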

LlamaIndex Google/VertexAI Token Support

| Layer / File(s) | Summary |
| --- | --- |
| Google/VertexAI usage metadata extraction<br>`packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py` | Updates `extract_token_usage` docstring to document VertexAI support. Adds `_extract_google_usage_metadata` to extract `input_tokens`, `output_tokens`, and `total_tokens` from `usage_metadata` using multiple key variants (`prompt_token_count` / `promptTokenCount`, `candidates_token_count` / `candidatesTokenCount`, `total_token_count` / `totalTokenCount`), with `total_tokens` computed as `total_token_count` or derived as input+output when missing. Refactors `_get_int(obj, *keys)` to accept multiple candidate keys and return the first matching integer value from object attributes or dict entries. |
| Parametrized tests for VertexAI token formats<br>`packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py` | Adds `test_google_vertexai_usage_metadata` covering dict-style, camelCase `usageMetadata`, and `SimpleNamespace` VertexAI input representations, asserting extraction of `prompt_token_count` / `promptTokenCount` as input tokens, `candidates_token_count` / `candidatesTokenCount` as output tokens, and `total_token_count` / `totalTokenCount` as total tokens into `TokenUsage` objects. |
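
A lookup that accepts several candidate keys and works on both dicts and attribute-style objects might be sketched as follows. The names `get_int` and `google_usage` are hypothetical stand-ins for the helpers named in the table, and the returned dict is a simplification of the `TokenUsage` object; total derivation is deliberately left out of this sketch.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_int(obj: Any, *keys: str) -> Optional[int]:
    """Return the first integer found under `keys` as a dict entry or attribute."""
    for key in keys:
        value = obj.get(key) if isinstance(obj, dict) else getattr(obj, key, None)
        # Exclude bool, which is a subclass of int.
        if isinstance(value, int) and not isinstance(value, bool):
            return value
    return None


def google_usage(usage_metadata: Any) -> dict:
    # Key variants are the ones listed in the walkthrough table.
    return {
        "input_tokens": get_int(
            usage_metadata, "prompt_token_count", "promptTokenCount"
        ),
        "output_tokens": get_int(
            usage_metadata, "candidates_token_count", "candidatesTokenCount"
        ),
        "total_tokens": get_int(
            usage_metadata, "total_token_count", "totalTokenCount"
        ),
    }


snake = {"prompt_token_count": 10, "candidates_token_count": 20, "total_token_count": 30}
camel = {"promptTokenCount": 10, "candidatesTokenCount": 20, "totalTokenCount": 30}
obj = SimpleNamespace(**snake)

assert google_usage(snake) == google_usage(camel) == google_usage(obj)
```

Returning `None` rather than `0` for a missing key keeps "absent" distinguishable from an explicit zero.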

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

traceloop/openllmetry#4261: Directly modifies set_chat_response_usage in LangChain's span_utils.py to adjust cache-token handling and accumulation, overlapping with the same code path refactored here.

Suggested reviewers

nina-kollman

Poem

🐇 Hop through the metadata forest so deep,
Where Databricks hides its token heap,
Google's snake_case, camelCase too—
I find each key variant, just for you!
No count forgotten, every trace shines true! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change - fixing instrumentation to read provider-specific token usage metadata from LangChain and LlamaIndex responses. |
| Linked Issues check | ✅ Passed | The PR addresses issue `#2661` by implementing token usage extraction from LangChain ChatDatabricks response_metadata and extends the fix to LlamaIndex VertexAI usage_metadata, with comprehensive tests covering both scenarios. |
| Out of Scope Changes check | ✅ Passed | All changes are directly within scope - helper functions for token extraction, updated token aggregation logic, and corresponding test coverage for both LangChain and LlamaIndex implementations. |

CLAassistant · 2026-06-22T11:00:52Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

coderabbitai

🧹 Nitpick comments (2)

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (1)
446-452: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider adding alternative key for total_tokens extraction.

_get_token_count for input/output tokens checks multiple key variants (e.g., input_tokens, prompt_tokens, input_token_count), but total_tokens only checks the single key "total_tokens". Some providers may use "total_token_count" or similar.

The fallback at lines 456-458 mitigates this by computing input + output when total_tokens is 0, so this is low risk.
♻️ Optional: Add key variant for consistency
-                    generation_total_tokens = _get_token_count(usage, "total_tokens")
+                    generation_total_tokens = _get_token_count(
+                        usage, "total_tokens", "total_token_count"
+                    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py`
around lines 446 - 452, The _get_token_count call for generation_total_tokens
only checks the single key "total_tokens" while the calls for
generation_input_tokens and generation_output_tokens check multiple key variants
for consistency across different providers. Update the generation_total_tokens
_get_token_count call to include alternative key variants (such as
"total_token_count") in addition to "total_tokens" to match the pattern used for
input and output token extraction and improve provider compatibility.
packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py (1)
23-65: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider adding test case for total_tokens fallback computation.

The test covers both response_metadata shapes but all cases include explicit total_tokens. The code at lines 456-458 in span_utils.py computes total_tokens as input + output when the explicit value is missing/zero. Adding a test case without total_tokens would verify this fallback.
🧪 Suggested additional test case
@pytest.mark.parametrize(
    "response_metadata",
    [
        # ... existing cases ...
        {
            "usage": {
                "prompt_tokens": 10,
                "completion_tokens": 16,
                # total_tokens omitted - should be computed as 26
            }
        },
    ],
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py`
around lines 23 - 65, Add a new test case to the parametrize decorator of the
function test_chat_response_usage_reads_databricks_response_metadata that covers
the fallback computation of total_tokens. Include a response_metadata dictionary
(in either the nested usage structure or flat structure) that omits the
total_tokens field or sets it to zero, so that the set_chat_response_usage
function must compute total_tokens as the sum of prompt_tokens and
completion_tokens. The test assertions should remain the same, verifying that
the computed total_tokens equals 26 (10 + 16).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff96f0ca-7884-40ab-9be3-ca9f451dd00a

📥 Commits

Reviewing files that changed from the base of the PR and between fb292d0 and bda0490.

📒 Files selected for processing (2)

packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py
packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py

Mr-Dark-debug · 2026-06-22T11:11:17Z

@nirga quick update: I pushed another commit to this PR and broadened the fix to cover both reports in #2661.

What is fixed now:

LangChain + ChatDatabricks: token usage is read from AIMessage.response_metadata when Databricks returns prompt_tokens, completion_tokens, and total_tokens there instead of usage_metadata.
LlamaIndex + VertexAI/Gemini: token usage is read from response.raw.usage_metadata / usageMetadata with Google fields like prompt_token_count, candidates_token_count, and total_token_count.

I added regression tests for both provider response shapes. Validation passed for the focused LangChain tests and the full LlamaIndex package suite. One caveat: I can’t add Dynatrace screenshots from this environment because I don’t have Databricks/Dynatrace/VertexAI credentials here, but the tests cover the documented payload shapes that were being missed.

Would appreciate your review when you get a chance.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py (1)
127-156: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add edge-case coverage for total_token_count=0 and missing total derivation.

Current params only cover fully populated non-zero payloads. Please add cases where total_token_count is explicitly 0, and where it is absent (expect derived input+output) to lock in fallback semantics.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py`
around lines 127 - 156, The test_google_vertexai_usage_metadata method in the
parametrize decorator needs additional test cases to cover edge cases. Add two
more parameter sets to the "raw" parametrize list: one where total_token_count
is explicitly set to 0 (while keeping prompt_token_count=10 and
candidates_token_count=20), and another where the total_token_count field is
completely absent from all three payload formats (both snake_case and camelCase
dictionary versions and SimpleNamespace version), expecting the result to derive
total_tokens as 30 from input_tokens + output_tokens. These new cases will
ensure the extract_token_usage function handles zero totals and missing totals
correctly.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py`:
- Around line 167-171: The TokenUsage initialization at line 170 uses the `or`
operator to handle the total_tokens parameter, which incorrectly treats an
explicit value of 0 as falsy and replaces it with the calculated sum. Replace
the `total_tokens or _safe_sum(input_tokens, output_tokens)` logic with an
explicit None check instead, so that explicit zero values for total_tokens are
preserved and only fall back to calculating the sum when total_tokens is
actually None or not provided.
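
The difference between the two fallbacks is easy to see in isolation. The sketch below assumes a `safe_sum` with the behavior shown; the real `_safe_sum` is only named in the comment above, so its exact semantics are an assumption here, and both `total_with_*` functions are illustrative.

```python
from typing import Optional


def safe_sum(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Sum two optional counts; None when both are missing (assumed behavior)."""
    if a is None and b is None:
        return None
    return (a or 0) + (b or 0)


def total_with_or(total, input_tokens, output_tokens):
    # Flagged variant: 0 is falsy, so an explicit zero is replaced by the sum.
    return total or safe_sum(input_tokens, output_tokens)


def total_with_none_check(total, input_tokens, output_tokens):
    # Only fall back when the provider sent no total at all.
    return total if total is not None else safe_sum(input_tokens, output_tokens)


assert total_with_or(0, 10, 20) == 30          # explicit zero is lost
assert total_with_none_check(0, 10, 20) == 0   # explicit zero is preserved
assert total_with_none_check(None, 10, 20) == 30
```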

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1dc8bb53-0de6-47e6-a004-61557042d727

📥 Commits

Reviewing files that changed from the base of the PR and between bda0490 and 40e3c11.

📒 Files selected for processing (2)

packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py
packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py

Mr-Dark-debug · 2026-06-22T11:30:23Z

Implemented the CodeRabbit review suggestions in 83b66e546.

Changes made:

Added total_token_count as a LangChain total-token key variant.
Added LangChain coverage for missing explicit totals so input + output fallback is verified.
Preserved explicit 0 totals for LlamaIndex VertexAI/Gemini usage metadata instead of treating zero as missing.
Added LlamaIndex coverage for populated totals, explicit zero totals, and missing totals across usage_metadata, object-style metadata, and usageMetadata camelCase payloads.

Validation run:

`uv run ruff check .` in `packages/opentelemetry-instrumentation-langchain`
`uv run --group test pytest tests/test_token_usage.py tests/test_finish_reasons.py tests/test_generation_role_extraction.py --record-mode=none` in `packages/opentelemetry-instrumentation-langchain` (29 passed)
`uv run ruff check .` in `packages/opentelemetry-instrumentation-llamaindex`
`uv run --group test pytest tests/ --record-mode=none` in `packages/opentelemetry-instrumentation-llamaindex` (280 passed)

fix(langchain): read chat token usage from response metadata

bda0490

Mr-Dark-debug marked this pull request as ready for review June 22, 2026 11:01

Mr-Dark-debug mentioned this pull request Jun 22, 2026

🐛 Bug Report: LLM token counts not updated in traces sent to dynatrace #2661

Open

1 task

coderabbitai Bot reviewed Jun 22, 2026

fix(llamaindex): read VertexAI token usage metadata

40e3c11

Mr-Dark-debug changed the title ~~fix(langchain): read chat token usage from response metadata~~ fix(instrumentation): read provider token usage metadata Jun 22, 2026

coderabbitai Bot reviewed Jun 22, 2026

Comment thread ...metry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py

fix(instrumentation): address token usage review feedback

83b66e5

fix(instrumentation): read provider token usage metadata #4318
Mr-Dark-debug wants to merge 3 commits into traceloop:main from Mr-Dark-debug:codex/langchain-response-metadata-token-usage
