fix(instrumentation): read provider token usage metadata#4318
fix(instrumentation): read provider token usage metadata#4318Mr-Dark-debug wants to merge 3 commits into
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (4)
📝 WalkthroughWalkthroughTwo private helper functions are added to LangChain's ChangesLangChain Databricks Token Usage Extraction
LlamaIndex Google/VertexAI Token Support
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py (1)
446-452: 🧹 Nitpick | 🔵 Trivial | 💤 Low valueConsider adding alternative key for
total_tokensextraction.
_get_token_countfor input/output tokens checks multiple key variants (e.g.,input_tokens,prompt_tokens,input_token_count), buttotal_tokensonly checks the single key"total_tokens". Some providers may use"total_token_count"or similar.The fallback at lines 456-458 mitigates this by computing
input + outputwhentotal_tokensis 0, so this is low risk.♻️ Optional: Add key variant for consistency
- generation_total_tokens = _get_token_count(usage, "total_tokens") + generation_total_tokens = _get_token_count( + usage, "total_tokens", "total_token_count" + )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py` around lines 446 - 452, The _get_token_count call for generation_total_tokens only checks the single key "total_tokens" while the calls for generation_input_tokens and generation_output_tokens check multiple key variants for consistency across different providers. Update the generation_total_tokens _get_token_count call to include alternative key variants (such as "total_token_count") in addition to "total_tokens" to match the pattern used for input and output token extraction and improve provider compatibility.packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py (1)
23-65: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winConsider adding test case for
total_tokensfallback computation.The test covers both
response_metadatashapes but all cases include explicittotal_tokens. The code at lines 456-458 inspan_utils.pycomputestotal_tokensasinput + outputwhen the explicit value is missing/zero. Adding a test case withouttotal_tokenswould verify this fallback.🧪 Suggested additional test case
`@pytest.mark.parametrize`( "response_metadata", [ # ... existing cases ... { "usage": { "prompt_tokens": 10, "completion_tokens": 16, # total_tokens omitted - should be computed as 26 } }, ], )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py` around lines 23 - 65, Add a new test case to the parametrize decorator of the function test_chat_response_usage_reads_databricks_response_metadata that covers the fallback computation of total_tokens. Include a response_metadata dictionary (in either the nested usage structure or flat structure) that omits the total_tokens field or sets it to zero, so that the set_chat_response_usage function must compute total_tokens as the sum of prompt_tokens and completion_tokens. The test assertions should remain the same, verifying that the computed total_tokens equals 26 (10 + 16).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.py`:
- Around line 446-452: The _get_token_count call for generation_total_tokens
only checks the single key "total_tokens" while the calls for
generation_input_tokens and generation_output_tokens check multiple key variants
for consistency across different providers. Update the generation_total_tokens
_get_token_count call to include alternative key variants (such as
"total_token_count") in addition to "total_tokens" to match the pattern used for
input and output token extraction and improve provider compatibility.
In `@packages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py`:
- Around line 23-65: Add a new test case to the parametrize decorator of the
function test_chat_response_usage_reads_databricks_response_metadata that covers
the fallback computation of total_tokens. Include a response_metadata dictionary
(in either the nested usage structure or flat structure) that omits the
total_tokens field or sets it to zero, so that the set_chat_response_usage
function must compute total_tokens as the sum of prompt_tokens and
completion_tokens. The test assertions should remain the same, verifying that
the computed total_tokens equals 26 (10 + 16).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: ff96f0ca-7884-40ab-9be3-ca9f451dd00a
📒 Files selected for processing (2)
packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/span_utils.pypackages/opentelemetry-instrumentation-langchain/tests/test_token_usage.py
|
@nirga quick update: I pushed another commit to this PR and broadened the fix to cover both reports in #2661. What is fixed now:
I added regression tests for both provider response shapes. Validation passed for LangChain focused tests and the full LlamaIndex package suite. Only caveat: I can’t add Dynatrace screenshots from this environment because I don’t have Databricks/Dynatrace/VertexAI credentials here, but the tests cover the documented payload shapes that were being missed. Would appreciate your review when you get a chance. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py (1)
127-156: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winAdd edge-case coverage for
total_token_count=0and missing total derivation.Current params only cover fully populated non-zero payloads. Please add cases where
total_token_countis explicitly0, and where it is absent (expect derivedinput+output) to lock in fallback semantics.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py` around lines 127 - 156, The test_google_vertexai_usage_metadata method in the parametrize decorator needs additional test cases to cover edge cases. Add two more parameter sets to the "raw" parametrize list: one where total_token_count is explicitly set to 0 (while keeping prompt_token_count=10 and candidates_token_count=20), and another where the total_token_count field is completely absent from all three payload formats (both snake_case and camelCase dictionary versions and SimpleNamespace version), expecting the result to derive total_tokens as 30 from input_tokens + output_tokens. These new cases will ensure the extract_token_usage function handles zero totals and missing totals correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.py`:
- Around line 167-171: The TokenUsage initialization at line 170 uses the `or`
operator to handle the total_tokens parameter, which incorrectly treats an
explicit value of 0 as falsy and replaces it with the calculated sum. Replace
the `total_tokens or _safe_sum(input_tokens, output_tokens)` logic with an
explicit None check instead, so that explicit zero values for total_tokens are
preserved and only fall back to calculating the sum when total_tokens is
actually None or not provided.
---
Nitpick comments:
In
`@packages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py`:
- Around line 127-156: The test_google_vertexai_usage_metadata method in the
parametrize decorator needs additional test cases to cover edge cases. Add two
more parameter sets to the "raw" parametrize list: one where total_token_count
is explicitly set to 0 (while keeping prompt_token_count=10 and
candidates_token_count=20), and another where the total_token_count field is
completely absent from all three payload formats (both snake_case and camelCase
dictionary versions and SimpleNamespace version), expecting the result to derive
total_tokens as 30 from input_tokens + output_tokens. These new cases will
ensure the extract_token_usage function handles zero totals and missing totals
correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1dc8bb53-0de6-47e6-a004-61557042d727
📒 Files selected for processing (2)
packages/opentelemetry-instrumentation-llamaindex/opentelemetry/instrumentation/llamaindex/_response_utils.pypackages/opentelemetry-instrumentation-llamaindex/tests/test_response_utils.py
|
Implemented the CodeRabbit review suggestions in Changes made:
Validation run:
|
What changed
Fixes #2661.
This PR now covers both token-usage paths reported in the issue thread:
AIMessage.response_metadata, either directly asprompt_tokens,completion_tokens, andtotal_tokens, or nested underusage.response.raw.usage_metadata/response.raw.usageMetadatausing Google fields such asprompt_token_count,candidates_token_count, andtotal_token_count.The instrumentation previously missed those provider-specific metadata shapes, so traces could be exported successfully while the GenAI token usage attributes were absent.
This change adds narrow extraction fallbacks for those response payloads and continues to emit the existing GenAI semantic convention attributes:
gen_ai.usage.input_tokensgen_ai.usage.output_tokensgen_ai.usage.total_tokensRoot cause
The existing token extraction logic only handled common
usage/usage_metadatashapes for some providers:message.usage_metadata, but not Databricks-stylemessage.response_metadatausage.raw.usageandraw.meta, but not Google VertexAI/Geminiraw.usage_metadata.Tests
response_metadatatoken usage.usage_metadataandusageMetadatatoken usage.uv run ruff check .inpackages/opentelemetry-instrumentation-langchain.uv run pytest tests/test_token_usage.py tests/test_finish_reasons.py tests/test_generation_role_extraction.py --record-mode=noneinpackages/opentelemetry-instrumentation-langchain.uv run pytest tests/ --record-mode=none -k "not anthropic"inpackages/opentelemetry-instrumentation-langchain.uv run ruff check .inpackages/opentelemetry-instrumentation-llamaindex.uv run --group test pytest tests/ --record-mode=noneinpackages/opentelemetry-instrumentation-llamaindex.Notes
Full LangChain
uv run pytest tests/ --record-mode=nonecurrently has 3 Anthropic cassette failures in my local environment because requests are sent toapi.z.aiwhile the recorded cassettes targetapi.anthropic.com. The non-Anthropic LangChain suite passes.I do not have Databricks/Dynatrace/VertexAI credentials in this environment, so the fix is covered with regression tests using the documented response payload shapes.
Checklist
feat(instrumentation): ...orfix(instrumentation): ....Summary by CodeRabbit
Release Notes
usage_metadata/usageMetadata, including prompt/candidates/total fields and camelCase variants.