feat: Send GenAI spans as V2 envelope items #6079
code-review: Found 7 issues (2 high, 4 medium, 1 low)
High
Span assertions never execute in test_multiple_providers due to missing span capture - `tests/integrations/litellm/test_litellm.py:1020-1023`
The sync `test_multiple_providers` function calls `capture_items("transaction")` at line 945, but the new code at line 1020 filters for `item.type == "span"`. Since span items are never captured, the `spans` list is always empty, so the for-loop never executes and the `SPANDATA.GEN_AI_SYSTEM` assertion is silently skipped. The async version was correctly updated to `capture_items("transaction", "span")` at line 1040, but the sync version was missed.
Also found at:
tests/integrations/litellm/test_litellm.py:945
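A minimal sketch of the failure mode, using a hypothetical `Item` stand-in rather than the real test fixture: filtering for an item type that was never captured yields an empty list, and a loop over an empty list passes vacuously.

```python
from collections import namedtuple

# Hypothetical stand-in for captured envelope items (not the real fixture type).
Item = namedtuple("Item", ["type"])

# capture_items("transaction") records only transaction items...
captured = [Item(type="transaction")]

# ...so filtering for spans yields an empty list,
spans = [item for item in captured if item.type == "span"]
assert spans == []

# and the loop body, with all its assertions, is silently skipped.
for span in spans:
    raise AssertionError("never reached")

# Capturing both item types, as the async variant does, restores coverage:
captured = [Item(type="transaction"), Item(type="span")]
spans = [item for item in captured if item.type == "span"]
assert len(spans) == 1
```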
test_message_history mixes incompatible span formats causing test to silently pass without validation - `tests/integrations/pydantic_ai/test_pydantic_ai.py:830-840`
In `test_message_history`, spans are extracted from `second_transaction["spans"]` (line 830, old format with an `op` field) but then filtered using `s["attributes"].get("sentry.op", "")` (lines 831-833, new V2 format). Since spans from `transaction["spans"]` carry `span["op"]`, not `span["attributes"]["sentry.op"]`, the filter never matches and `chat_spans` stays empty. The subsequent assertions are wrapped in `if chat_spans:` and `if "gen_ai.request.messages" in chat_span["attributes"]:`, so the test silently passes without verifying anything.
Also found at:
tests/integrations/openai_agents/test_openai_agents.py:3560-3562
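For illustration, a sketch of the format mismatch with hypothetical span payloads (the exact dict shapes are assumptions, not the real transaction structure):

```python
# Hypothetical span payloads illustrating the two formats.
old_format_span = {"op": "gen_ai.chat", "attributes": {}}       # V1: op at top level
new_format_span = {"attributes": {"sentry.op": "gen_ai.chat"}}  # V2: op inside attributes

# What second_transaction["spans"] actually contains is the old format.
spans = [old_format_span]

# The V2-style filter never matches V1 spans...
chat_spans = [s for s in spans if s["attributes"].get("sentry.op", "") == "gen_ai.chat"]
assert chat_spans == []  # ...so every assertion guarded by `if chat_spans:` is skipped.

# Filtering on the field the payload actually uses does match:
chat_spans = [s for s in spans if s.get("op", "") == "gen_ai.chat"]
assert len(chat_spans) == 1
```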
Medium
New span conversion functions lack test coverage - `sentry_sdk/client.py:102-298`
The new functions `_serialized_v1_attribute_to_serialized_v2_attribute`, `_serialized_v1_span_to_serialized_v2_span`, and `_split_gen_ai_spans` are not covered by dedicated unit tests. While the integration tests in the PR may exercise these code paths indirectly, direct unit tests would help verify edge cases such as `None` values, empty lists, heterogeneous arrays, and invalid timestamp formats.
Also found at:
sentry_sdk/client.py:1130
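A sketch of the kind of direct unit test the review suggests. The converter below is a simplified stand-in, not the real `_serialized_v1_attribute_to_serialized_v2_attribute`; the typed-attribute shape and the `None` handling are assumptions for illustration only.

```python
def to_v2_attribute(value):
    """Toy V1 -> V2 attribute converter; the typed-envelope shape is assumed."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"type": "boolean", "value": value}
    if isinstance(value, int):
        return {"type": "integer", "value": value}
    if isinstance(value, float):
        return {"type": "double", "value": value}
    return {"type": "string", "value": str(value)}  # toy fallback, incl. None

# Edge cases a dedicated unit test could pin down explicitly:
assert to_v2_attribute(True) == {"type": "boolean", "value": True}
assert to_v2_attribute(3) == {"type": "integer", "value": 3}
assert to_v2_attribute(None) == {"type": "string", "value": "None"}
```

The value of such tests is that each edge case (bool-vs-int ordering, `None`, mixed-type lists, bad timestamps) fails loudly on its own line instead of hiding behind an integration test.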
Test uses incorrect key 'attributes' instead of 'data' for inline_data - `tests/integrations/google_genai/test_google_genai.py:2153`
The test data structure uses `{"inline_data": {"attributes": b"binary_data", ...}}`, but Google GenAI's `inline_data` format uses `data`, not `attributes`. The production code in `sentry_sdk/ai/utils.py:286` retrieves `inline_data.get("data", "")`. Other tests in this same file (lines 1765, 1806) correctly use `{"inline_data": {"data": ..., "mime_type": ...}}`. As written, this test never exercises the code path for extracting blob content.
Also found at:
tests/integrations/google_genai/test_google_genai.py:330
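A sketch of why the misspelled key goes unnoticed; `extract_blob` below is a hypothetical helper mirroring the `inline_data.get("data", "")` lookup, not the actual function in `sentry_sdk/ai/utils.py`:

```python
# Hypothetical helper mirroring the production lookup.
def extract_blob(part):
    inline_data = part.get("inline_data", {})
    return inline_data.get("data", "")  # same lookup as in sentry_sdk/ai/utils.py

# The misspelled key silently falls back to the default...
assert extract_blob({"inline_data": {"attributes": b"binary_data"}}) == ""

# ...while the correct key actually exercises the extraction path.
part = {"inline_data": {"data": b"binary_data", "mime_type": "image/png"}}
assert extract_blob(part) == b"binary_data"
```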
Hardcoded SDK version will cause test failure when version changes - `tests/integrations/huggingface_hub/test_huggingface_hub.py:523`
The test `test_text_generation` hardcodes `"sentry.sdk.version": "2.58.0"` in `expected_data`, while the other tests in this file (`test_text_generation_streaming`, `test_chat_completion`) correctly use `mock.ANY`. The hardcoded value will break the test on the next SDK version bump, introducing test fragility.
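The `mock.ANY` sentinel compares equal to any value, which is why the sibling tests survive version bumps:

```python
from unittest import mock

# mock.ANY matches any value in an equality comparison, so the same
# expectation keeps passing whatever version the SDK reports:
assert {"sentry.sdk.version": "2.58.0"} == {"sentry.sdk.version": mock.ANY}
assert {"sentry.sdk.version": "2.99.0"} == {"sentry.sdk.version": mock.ANY}
```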
Unused list comprehension results in no actual test validation - `tests/integrations/langchain/test_langchain.py:1840-1844`
The list comprehension on lines 1840-1844 builds a list of error events but never assigns it to a variable, making it dead code with no effect. The original code had the same flaw (a bare `[e for e in events...]` whose result was never used), and this refactoring carries the bug forward. As a result, `test_langchain_embeddings_error_handling` does not actually validate that errors are captured.
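A sketch of the fix, with hypothetical event dicts standing in for the real fixture output:

```python
# Hypothetical captured events; the real list comes from the test fixture.
events = [
    {"type": "transaction"},
    {"level": "error", "message": "embeddings failed"},
]

# A bare list comprehension computes a value and discards it:
[e for e in events if e.get("level") == "error"]  # no effect, nothing validated

# Assigning the result and asserting on it makes the test actually check capture:
error_events = [e for e in events if e.get("level") == "error"]
assert len(error_events) == 1
```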
Low
Test assertions may be silently skipped if no chat_spans are found - `tests/integrations/pydantic_ai/test_pydantic_ai.py:993-995`
In `test_include_prompts_requires_pii`, the test loops over `chat_spans` to verify assertions, but doesn't verify that `chat_spans` has at least one element. If no chat spans are generated due to a bug, the test would pass without actually testing anything. Consider adding `assert len(chat_spans) >= 1` before the loop to ensure the test is actually validating the expected behavior.
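As a sketch, a guard assertion before the loop turns a vacuous pass into a hard failure; the span shape here is hypothetical:

```python
def run_assertions(chat_spans):
    # Guard first so an empty list cannot silently skip the loop body.
    assert len(chat_spans) >= 1, "expected at least one chat span"
    for span in chat_spans:
        assert "sentry.op" in span["attributes"]

# With spans present, the per-span assertions run as intended:
run_assertions([{"attributes": {"sentry.op": "gen_ai.chat"}}])
```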
Duration: 20m 35s · Tokens: 13.5M in / 167.6k out · Cost: $20.22 (+extraction: $0.01, +merge: $0.00, +fix_gate: $0.01)