
fix(instrumentation): read provider token usage metadata #4318

Open
Mr-Dark-debug wants to merge 3 commits into traceloop:main from Mr-Dark-debug:codex/langchain-response-metadata-token-usage

Conversation

@Mr-Dark-debug

@Mr-Dark-debug Mr-Dark-debug commented Jun 22, 2026

What changed

Fixes #2661.

This PR now covers both token-usage paths reported in the issue thread:

  • LangChain + ChatDatabricks: token usage may be exposed on AIMessage.response_metadata, either directly as prompt_tokens, completion_tokens, and total_tokens, or nested under usage.
  • LlamaIndex + VertexAI/Gemini: token usage may be exposed on response.raw.usage_metadata / response.raw.usageMetadata using Google fields such as prompt_token_count, candidates_token_count, and total_token_count.

The instrumentation previously missed those provider-specific metadata shapes, so traces could be exported successfully while the GenAI token usage attributes were absent.

This change adds narrow extraction fallbacks for those response payloads and continues to emit the existing GenAI semantic convention attributes:

  • gen_ai.usage.input_tokens
  • gen_ai.usage.output_tokens
  • gen_ai.usage.total_tokens
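
The LangChain fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `find_usage_payload` is a hypothetical helper name, and a plain `SimpleNamespace` stands in for `AIMessage`. It assumes only what the description states, namely that usage may live on `usage_metadata`, nested under `response_metadata["usage"]`, or flat on `response_metadata`.

```python
from types import SimpleNamespace

# Token keys the description lists for the Databricks-style payload.
TOKEN_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def find_usage_payload(message):
    """Return the dict that carries token counts, or None if nothing is found.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    # Preferred path: the standard usage_metadata attribute.
    usage_metadata = getattr(message, "usage_metadata", None)
    if usage_metadata:
        return usage_metadata

    response_metadata = getattr(message, "response_metadata", None) or {}

    # Nested shape: response_metadata["usage"] = {...}
    nested_usage = response_metadata.get("usage")
    if isinstance(nested_usage, dict) and nested_usage:
        return nested_usage

    # Flat shape: token keys sit directly on response_metadata.
    if any(key in response_metadata for key in TOKEN_KEYS):
        return response_metadata

    return None


# Stand-ins for AIMessage objects (assumption: only these two attributes matter here).
counts = {"prompt_tokens": 10, "completion_tokens": 16, "total_tokens": 26}
flat = SimpleNamespace(usage_metadata=None, response_metadata=dict(counts))
nested = SimpleNamespace(usage_metadata=None, response_metadata={"usage": dict(counts)})
empty = SimpleNamespace(usage_metadata=None, response_metadata={})

for message in (flat, nested, empty):
    print(find_usage_payload(message))
```

Checking `usage_metadata` first keeps the existing behavior for providers that already populate it; the `response_metadata` probes only run when that path yields nothing.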

Root cause

The existing token extraction logic only handled common usage / usage_metadata shapes for some providers:

  • LangChain chat response usage extraction read message.usage_metadata, but not Databricks-style message.response_metadata usage.
  • LlamaIndex raw response extraction read raw.usage and raw.meta, but not Google VertexAI/Gemini raw.usage_metadata.
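
To illustrate the LlamaIndex gap, here is a hedged sketch of locating Google-style usage metadata on a raw response. `find_google_usage_metadata` is a hypothetical name, and the raw response is modeled as either a dict or a simple object; the only facts taken from the PR are the two spellings `usage_metadata` and `usageMetadata`.

```python
from types import SimpleNamespace


def find_google_usage_metadata(raw):
    """Return the usage metadata found under either spelling, or None.

    Hypothetical helper for illustration; handles dict and object-style raw responses.
    """
    for name in ("usage_metadata", "usageMetadata"):
        if isinstance(raw, dict):
            value = raw.get(name)
        else:
            value = getattr(raw, name, None)
        if value is not None:
            return value
    return None


# Illustrative payloads (values are made up for the example).
raw_camel = {"usageMetadata": {"promptTokenCount": 10, "candidatesTokenCount": 20}}
raw_object = SimpleNamespace(usage_metadata=SimpleNamespace(prompt_token_count=10))
raw_other = {"usage": {"prompt_tokens": 10}}  # not a Google-style payload

print(find_google_usage_metadata(raw_camel))
print(find_google_usage_metadata(raw_object))
print(find_google_usage_metadata(raw_other))
```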

Tests

  • Added LangChain regression tests for Databricks-style response_metadata token usage.
  • Added LlamaIndex regression tests for VertexAI/Gemini usage_metadata and usageMetadata token usage.
  • Ran uv run ruff check . in packages/opentelemetry-instrumentation-langchain.
  • Ran uv run pytest tests/test_token_usage.py tests/test_finish_reasons.py tests/test_generation_role_extraction.py --record-mode=none in packages/opentelemetry-instrumentation-langchain.
  • Ran uv run pytest tests/ --record-mode=none -k "not anthropic" in packages/opentelemetry-instrumentation-langchain.
  • Ran uv run ruff check . in packages/opentelemetry-instrumentation-llamaindex.
  • Ran uv run --group test pytest tests/ --record-mode=none in packages/opentelemetry-instrumentation-llamaindex.

Notes

The full LangChain suite (uv run pytest tests/ --record-mode=none) currently has 3 Anthropic cassette failures in my local environment because requests are sent to api.z.ai while the recorded cassettes target api.anthropic.com. The non-Anthropic LangChain suite passes.

I do not have Databricks/Dynatrace/VertexAI credentials in this environment, so the fix is covered with regression tests using the documented response payload shapes.

Checklist

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
    • Not included: I do not have Databricks/Dynatrace/VertexAI credentials in this environment. Regression tests cover the documented response payload shapes.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.
    • Not applicable; this fixes extraction behavior without changing public configuration.

Summary by CodeRabbit

Release Notes

  • Improvements
    • Improved token usage tracking for LangChain by extracting input/output/total tokens from additional response metadata shapes and token-key variants, with safer defaults and total fallback (input + output).
    • Extended LlamaIndex token usage extraction to support Google/VertexAI-style usage_metadata / usageMetadata, including prompt/candidates/total fields and camelCase variants.
  • Tests
    • Added unit coverage for Databricks-style LangChain token metadata and multiple Google/VertexAI usage metadata payload formats in LlamaIndex.

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 59479150-6c56-4b1a-a109-7dc502fcd8bb

📥 Commits

Reviewing files that changed from the base of the PR and between 40e3c11 and 83b66e5.

📒 Files selected for processing (4)
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py
  • packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py
  • packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py
  • packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py
  • packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py
  • packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py

📝 Walkthrough

Two private helper functions are added to LangChain's span_utils.py to extract token counts from either usage_metadata or response_metadata (supporting multiple key-name variants), and set_chat_response_usage is updated to use them with fallback total-token derivation. Separately, LlamaIndex's extract_token_usage is extended to recognize Google/VertexAI usage_metadata formats using a new extraction helper and refactored _get_int utility. Both packages add parametrized tests validating their respective token extraction paths.

Changes

LangChain Databricks Token Usage Extraction

| Layer / File(s) | Summary |
| --- | --- |
| Token extraction helpers and set_chat_response_usage refactor (packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py) | Adds _get_token_count(usage, *keys) to extract integer token counts by trying multiple key names with a default of 0. Adds _extract_message_usage(message) to return usage_metadata if present, otherwise probes response_metadata["usage"] or top-level token keys. Refactors set_chat_response_usage to use these helpers, derive total tokens as input+output when an explicit total is missing, and preserve cache-token accumulation. |
| Parametrized test for Databricks response_metadata schema (packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py) | New test module adds a _mock_span() helper and a parametrized response_metadata fixture supplying prompt_tokens, completion_tokens, and total_tokens. The test builds an AIMessage with that metadata, calls set_chat_response_usage with a Databricks model name, and asserts GEN_AI_USAGE_INPUT_TOKENS, GEN_AI_USAGE_OUTPUT_TOKENS, and GEN_AI_USAGE_TOTAL_TOKENS span attributes are correct. |
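
The multi-key lookup and total fallback summarized above might look roughly like this. It is a sketch under stated assumptions, not the code in span_utils.py: the function bodies are guesses consistent with the summary, `summarize_usage` is a hypothetical wrapper, and the `output_token_count` key variant is assumed by analogy with the input-side keys mentioned in the review.

```python
def get_token_count(usage, *keys):
    """Return the first integer found under any of `keys`, defaulting to 0.

    Illustrative stand-in for the _get_token_count(usage, *keys) helper.
    """
    for key in keys:
        value = usage.get(key)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            return value
    return 0


def summarize_usage(usage):
    """Hypothetical wrapper: return (input, output, total) token counts."""
    input_tokens = get_token_count(
        usage, "input_tokens", "prompt_tokens", "input_token_count"
    )
    output_tokens = get_token_count(
        usage, "output_tokens", "completion_tokens", "output_token_count"  # last key is assumed
    )
    total_tokens = get_token_count(usage, "total_tokens", "total_token_count")
    # With a default of 0, a missing total and an explicit 0 look the same here,
    # so both fall back to input + output.
    if total_tokens == 0:
        total_tokens = input_tokens + output_tokens
    return input_tokens, output_tokens, total_tokens


print(summarize_usage({"prompt_tokens": 10, "completion_tokens": 16}))
print(summarize_usage({"input_tokens": 10, "output_tokens": 16, "total_tokens": 30}))
```

Defaulting to 0 keeps the span attributes numeric, at the cost of not distinguishing "absent" from "zero" for the total.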

LlamaIndex Google/VertexAI Token Support

| Layer / File(s) | Summary |
| --- | --- |
| Google/VertexAI usage metadata extraction (packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py) | Updates extract_token_usage docstring to document VertexAI support. Adds _extract_google_usage_metadata to extract input_tokens, output_tokens, and total_tokens from usage_metadata using multiple key variants (prompt_token_count / promptTokenCount, candidates_token_count / candidatesTokenCount, total_token_count / totalTokenCount), with total_tokens computed as total_token_count or derived as input+output when missing. Refactors _get_int(obj, *keys) to accept multiple candidate keys and return the first matching integer value from object attributes or dict entries. |
| Parametrized tests for VertexAI token formats (packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py) | Adds test_google_vertexai_usage_metadata covering dict-style, camelCase usageMetadata, and SimpleNamespace VertexAI input representations, asserting extraction of prompt_token_count / promptTokenCount as input tokens, candidates_token_count / candidatesTokenCount as output tokens, and total_token_count / totalTokenCount as total tokens into TokenUsage objects. |
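
A rough sketch of the attribute-or-dict lookup described for _get_int, applied to the Google field names listed above. The bodies are assumptions written to match the summary, `extract_google_usage` is a hypothetical name, and a plain dict is returned instead of the package's TokenUsage type.

```python
from types import SimpleNamespace


def get_int(obj, *keys):
    """Return the first integer found under any key, or None if none match.

    Illustrative stand-in for _get_int(obj, *keys); reads dict entries or attributes.
    """
    for key in keys:
        value = obj.get(key) if isinstance(obj, dict) else getattr(obj, key, None)
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            return value
    return None


def extract_google_usage(usage_metadata):
    """Hypothetical helper: map Google-style fields onto generic token names."""
    return {
        "input_tokens": get_int(usage_metadata, "prompt_token_count", "promptTokenCount"),
        "output_tokens": get_int(
            usage_metadata, "candidates_token_count", "candidatesTokenCount"
        ),
        "total_tokens": get_int(usage_metadata, "total_token_count", "totalTokenCount"),
    }


snake = {"prompt_token_count": 10, "candidates_token_count": 20, "total_token_count": 30}
camel = {"promptTokenCount": 10, "candidatesTokenCount": 20, "totalTokenCount": 30}
as_object = SimpleNamespace(**snake)

for payload in (snake, camel, as_object):
    print(extract_google_usage(payload))
```

Returning None rather than 0 for a missing field lets the caller tell "absent" apart from an explicit zero.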

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • traceloop/openllmetry#4261: Directly modifies set_chat_response_usage in LangChain's span_utils.py to adjust cache-token handling and accumulation, overlapping with the same code path refactored here.

Suggested reviewers

  • nina-kollman

Poem

🐇 Hop through the metadata forest so deep,
Where Databricks hides its token heap,
Google's snake_case, camelCase too—
I find each key variant, just for you!
No count forgotten, every trace shines true! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 35.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change - fixing instrumentation to read provider-specific token usage metadata from LangChain and LlamaIndex responses. |
| Linked Issues check | ✅ Passed | The PR addresses issue #2661 by implementing token usage extraction from LangChain ChatDatabricks response_metadata and extends the fix to LlamaIndex VertexAI usage_metadata, with comprehensive tests covering both scenarios. |
| Out of Scope Changes check | ✅ Passed | All changes are directly within scope - helper functions for token extraction, updated token aggregation logic, and corresponding test coverage for both LangChain and LlamaIndex implementations. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (1)

446-452: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider adding alternative key for total_tokens extraction.

_get_token_count for input/output tokens checks multiple key variants (e.g., input_tokens, prompt_tokens, input_token_count), but total_tokens only checks the single key "total_tokens". Some providers may use "total_token_count" or similar.

The fallback at lines 456-458 mitigates this by computing input + output when total_tokens is 0, so this is low risk.

♻️ Optional: Add key variant for consistency
-                    generation_total_tokens = _get_token_count(usage, "total_tokens")
+                    generation_total_tokens = _get_token_count(
+                        usage, "total_tokens", "total_token_count"
+                    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py`
around lines 446 - 452, The _get_token_count call for generation_total_tokens
only checks the single key "total_tokens" while the calls for
generation_input_tokens and generation_output_tokens check multiple key variants
for consistency across different providers. Update the generation_total_tokens
_get_token_count call to include alternative key variants (such as
"total_token_count") in addition to "total_tokens" to match the pattern used for
input and output token extraction and improve provider compatibility.
packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py (1)

23-65: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Consider adding test case for total_tokens fallback computation.

The test covers both response_metadata shapes but all cases include explicit total_tokens. The code at lines 456-458 in span_utils.py computes total_tokens as input + output when the explicit value is missing/zero. Adding a test case without total_tokens would verify this fallback.

🧪 Suggested additional test case
@pytest.mark.parametrize(
    "response_metadata",
    [
        # ... existing cases ...
        {
            "usage": {
                "prompt_tokens": 10,
                "completion_tokens": 16,
                # total_tokens omitted - should be computed as 26
            }
        },
    ],
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py`
around lines 23 - 65, Add a new test case to the parametrize decorator of the
function test_chat_response_usage_reads_databricks_response_metadata that covers
the fallback computation of total_tokens. Include a response_metadata dictionary
(in either the nested usage structure or flat structure) that omits the
total_tokens field or sets it to zero, so that the set_chat_response_usage
function must compute total_tokens as the sum of prompt_tokens and
completion_tokens. The test assertions should remain the same, verifying that
the computed total_tokens equals 26 (10 + 16).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ff96f0ca-7884-40ab-9be3-ca9f451dd00a

📥 Commits

Reviewing files that changed from the base of the PR and between fb292d0 and bda0490.

📒 Files selected for processing (2)
  • packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py
  • packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py

@Mr-Dark-debug Mr-Dark-debug changed the title from "fix(langchain): read chat token usage from response metadata" to "fix(instrumentation): read provider token usage metadata" on Jun 22, 2026
@Mr-Dark-debug

Author

@nirga quick update: I pushed another commit to this PR and broadened the fix to cover both reports in #2661.

What is fixed now:

  • LangChain + ChatDatabricks: token usage is read from AIMessage.response_metadata when Databricks returns prompt_tokens, completion_tokens, and total_tokens there instead of usage_metadata.
  • LlamaIndex + VertexAI/Gemini: token usage is read from response.raw.usage_metadata / usageMetadata with Google fields like prompt_token_count, candidates_token_count, and total_token_count.

I added regression tests for both provider response shapes. Validation passed for LangChain focused tests and the full LlamaIndex package suite. Only caveat: I can’t add Dynatrace screenshots from this environment because I don’t have Databricks/Dynatrace/VertexAI credentials here, but the tests cover the documented payload shapes that were being missed.

Would appreciate your review when you get a chance.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py (1)

127-156: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add edge-case coverage for total_token_count=0 and missing total derivation.

Current params only cover fully populated non-zero payloads. Please add cases where total_token_count is explicitly 0, and where it is absent (expect derived input+output) to lock in fallback semantics.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py`
around lines 127 - 156, The test_google_vertexai_usage_metadata method in the
parametrize decorator needs additional test cases to cover edge cases. Add two
more parameter sets to the "raw" parametrize list: one where total_token_count
is explicitly set to 0 (while keeping prompt_token_count=10 and
candidates_token_count=20), and another where the total_token_count field is
completely absent from all three payload formats (both snake_case and camelCase
dictionary versions and SimpleNamespace version), expecting the result to derive
total_tokens as 30 from input_tokens + output_tokens. These new cases will
ensure the extract_token_usage function handles zero totals and missing totals
correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py`:
- Around line 167-171: The TokenUsage initialization at line 170 uses the `or`
operator to handle the total_tokens parameter, which incorrectly treats an
explicit value of 0 as falsy and replaces it with the calculated sum. Replace
the `total_tokens or _safe_sum(input_tokens, output_tokens)` logic with an
explicit None check instead, so that explicit zero values for total_tokens are
preserved and only fall back to calculating the sum when total_tokens is
actually None or not provided.
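
The difference between the `or` fallback and an explicit None check can be shown with a small sketch. `_safe_sum` is named in the comment above, but its body here is an assumption, and the two `total_*` functions are hypothetical names used only for comparison.

```python
def safe_sum(input_tokens, output_tokens):
    """Assumed behavior for _safe_sum: add the counts, treating None as 0."""
    if input_tokens is None and output_tokens is None:
        return None
    return (input_tokens or 0) + (output_tokens or 0)


def total_with_or(total_tokens, input_tokens, output_tokens):
    # Truthiness check: an explicit 0 is falsy, so it is replaced by the sum.
    return total_tokens or safe_sum(input_tokens, output_tokens)


def total_with_none_check(total_tokens, input_tokens, output_tokens):
    # Explicit None check: only a missing total triggers the fallback.
    if total_tokens is not None:
        return total_tokens
    return safe_sum(input_tokens, output_tokens)


print(total_with_or(0, 10, 20))             # explicit zero is overwritten with 30
print(total_with_none_check(0, 10, 20))     # explicit zero is preserved
print(total_with_none_check(None, 10, 20))  # missing total falls back to 30
```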

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1dc8bb53-0de6-47e6-a004-61557042d727

📥 Commits

Reviewing files that changed from the base of the PR and between bda0490 and 40e3c11.

📒 Files selected for processing (2)
  • packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py
  • packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py

@Mr-Dark-debug

Author

Implemented the CodeRabbit review suggestions in 83b66e546.

Changes made:

  • Added total_token_count as a LangChain total-token key variant.
  • Added LangChain coverage for missing explicit totals so input + output fallback is verified.
  • Preserved explicit 0 totals for LlamaIndex VertexAI/Gemini usage metadata instead of treating zero as missing.
  • Added LlamaIndex coverage for populated totals, explicit zero totals, and missing totals across usage_metadata, object-style metadata, and usageMetadata camelCase payloads.

Validation run:

  • uv run ruff check . in packages/opentelemetry-instrumentation-langchain
  • uv run --group test pytest tests/test_token_usage.py tests/test_finish_reasons.py tests/test_generation_role_extraction.py --record-mode=none in packages/opentelemetry-instrumentation-langchain (29 passed)
  • uv run ruff check . in packages/opentelemetry-instrumentation-llamaindex
  • uv run --group test pytest tests/ --record-mode=none in packages/opentelemetry-instrumentation-llamaindex (280 passed)

Development

Successfully merging this pull request may close these issues.

🐛 Bug Report: LLM token counts not updated in traces sent to dynatrace

2 participants