feat(data-collection): create DataCollection option in client#6702
feat(data-collection): create DataCollection option in client#6702ericapisani wants to merge 16 commits into
10 issues
Medium
`_map_from_send_default_pii` gates `graphql.document` on `send_default_pii`, contradicting its own docstring - `sentry_sdk/client.py:26`
The docstring of _map_from_send_default_pii states that graphql.variables and database.query_params follow send_default_pii while graphql.document should stay True (the document itself is not treated as PII). The implementation instead sets "graphql": {"document": send_default_pii, "variables": send_default_pii}, so with the default send_default_pii=False, graphql.document resolves to False and GraphQL documents are silently suppressed. Consider {"document": True, "variables": send_default_pii} to match the documented behavior.
Also found at:
sentry_sdk/data_collection.py:110tests/test_data_collection.py:131-140
`should_send_default_pii()` diverges from `data_collection["user_info"]` when only `data_collection` is set - `sentry_sdk/client.py:357`
The EventScrubber at line 357 now correctly uses data_collection["user_info"], but _Client.should_send_default_pii() (line 747) still reads directly from options["send_default_pii"]. Any user who sets data_collection={"user_info": True} without also setting send_default_pii=True will get a scrubber that allows PII through, while every integration calling should_send_default_pii() (fastapi, sanic, asgi, strawberry, rq, huey, etc.) will still return False and suppress PII collection — making the new option effectively broken for those integrations.
Also found at:
sentry_sdk/client.py:752-753
Low
`BaseClient.data_collection` returns a shared mutable module-level dict - `sentry_sdk/client.py:440-441`
BaseClient.data_collection (client.py:441) returns the module-level _DISABLED_DATA_COLLECTION_CONFIG dict directly, without copying. This dict contains nested mutable objects (a list for http_bodies, dicts for cookies/http_headers/query_params/etc.). It is exposed publicly via get_data_collection() / Scope.get_client().data_collection. If any future consumer mutates a nested field (e.g. client.data_collection["http_bodies"].append(...)), it would silently corrupt the shared constant for all callers of the non-recording/pre-init client path. No current caller mutates the returned structure, so this is a defensive hardening concern rather than an active bug. Consider returning a shallow/deep copy or an immutable structure.
Deprecation warning stacklevel too low — points to SDK internals, not user code - `sentry_sdk/consts.py:1437`
The DeprecationWarning warning that send_default_pii is deprecated/ignored when data_collection is set uses stacklevel=2 in resolve_data_collection() (sentry_sdk/data_collection.py). Because the warning is emitted directly inside resolve_data_collection, stacklevel=2 attributes the warning to its immediate caller _get_options() in client.py, rather than the user's sentry_sdk.init() call, making the warning harder to act on.
http_bodies accepts a bare string and silently converts it to individual characters - `sentry_sdk/_types.py:148`
In _resolve_explicit (sentry_sdk/data_collection.py), a user-supplied http_bodies value is passed directly to list(...) without checking it is a list. A string such as "incoming_request" is iterated character-by-character, producing ['i','n','c',...] instead of raising an error. Unlike key-value modes (validated via _kvcb_from_value against a whitelist), body type strings are never validated against ALL_HTTP_BODY_TYPES, so these bogus single-character entries pass through silently, effectively misconfiguring body collection without any diagnostic to the user.
Also found at:
sentry_sdk/data_collection.py:167-168sentry_sdk/data_collection.py:185-194
DeprecationWarning for `send_default_pii` + `data_collection` conflict points to internal SDK code - `sentry_sdk/client.py:29`
In resolve_data_collection, the warnings.warn(..., stacklevel=2) for the send_default_pii/data_collection conflict uses stacklevel=2, which attributes the warning to the immediate caller of resolve_data_collection. That caller is _get_options() in client.py (the rv["data_collection"] = resolve_data_collection(rv) line), i.e. internal SDK code, not the user's sentry_sdk.init() call site. As a result the DeprecationWarning is unhelpfully attributed to SDK internals, making it harder for users to locate the offending configuration in their own code.
Also found at:
sentry_sdk/data_collection.py:249-254
BaseClient.data_collection returns a mutable shared singleton dict - `sentry_sdk/client.py:440-442`
The property returns _DISABLED_DATA_COLLECTION_CONFIG directly by reference; any caller that mutates it (e.g., client.data_collection["http_bodies"].pop()) would corrupt the module-level singleton for all subsequent callers. Return a shallow copy or freeze the structure to prevent accidental mutation.
Also found at:
sentry_sdk/client.py:391-397
Spotlight re-derivation uses `is not False`, storing `stack_frame_variables=True` for a `None`-valued `include_local_variables` (inconsistent with normal resolution) - `sentry_sdk/client.py:640-643`
In DSN-less spotlight mode, _Client._setup_once re-derives data_collection with include_local_variables=self.options["include_local_variables"] is not False (and the same for include_source_context). When the option is explicitly set to None (permitted by the Optional[bool] typing in consts.py:1312-1313), None is not False evaluates to True, so the re-derived dict stores stack_frame_variables=True/frame_context_lines=5. The normal path resolve_data_collection instead passes the stored None straight through (options.get("include_local_variables", True) returns None), yielding stack_frame_variables=None (falsy). This is opposite to how None is treated everywhere else (utils.py:615 uses a truthy if include_local_variables: check, i.e. None disables capture). Impact is limited: the resolved stack_frame_variables/frame_context_lines values are not yet consumed for actual stacktrace capture (client.py:850 and utils.py read include_local_variables directly), so this is currently an internal-consistency defect in the resolved data_collection dict rather than a behavior change.
Direct key access on `options["data_collection"]` can raise KeyError when unpickling cross-version client state - `sentry_sdk/client.py:761`
The data_collection property returns self.options["data_collection"] with no fallback. The "data_collection" key is only injected by resolve_data_collection() inside _get_options(), which runs exclusively during __init__. When a _Client is unpickled, __setstate__ restores self.options = state["options"] and calls _init_impl() — which never calls _get_options(). Normal same-version pickle round-trips are safe because __getstate__ serializes self.options after it already contains the key. However, unpickling state produced by an older SDK version (serialized before this key existed) leaves self.options without "data_collection", causing a KeyError here (and also at line 637 in _init_impl under DSN-less spotlight mode). Prefer self.options.get("data_collection", _DISABLED_DATA_COLLECTION_CONFIG) to match the defensive pattern used by the adjacent should_send_default_pii() (self.options.get("send_default_pii") or False) and NonRecordingClient.data_collection.
Docstring for _map_from_send_default_pii incorrectly states graphql.document stays True - `sentry_sdk/data_collection.py:86-99`
The docstring (line 93) states that graphql.document stays True regardless of send_default_pii, but the implementation (line 110) sets graphql.document to the value of send_default_pii, so it becomes False when PII collection is off. The docstring misdescribes the actual PII-gating behavior.
Also found at:
tests/test_data_collection.py:196-204
4 skills analyzed
| Skill | Findings | Duration | Cost |
|---|---|---|---|
| security-review | 0 | 51.0s | $0.12 |
| code-review | 4 | 9m 12s | $4.04 |
| find-bugs | 6 | 20m 18s | $7.86 |
| skill-scanner | 0 | 12m 23s | $0.04 |
⏱ 42m 44s · 10.6M in / 321.7k out · $12.06