feat(eap): Add v2 co-occurring attributes storage with count, last_seen, and per-type attribute keys #7801
phacops wants to merge 10 commits into
Conversation
Add a new SummingMergeTree-based storage for co-occurring attributes that includes a count column for proper deduplication via key_hash. The v2 storage is gated behind a `use_co_occurring_attrs_v2` feature flag. Also simplify result row parsing in the attribute names endpoint. Co-Authored-By: Claude <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/yM8dAMnfR-nHQ6Z7BKDQd12ih3FsVPMAzgudpbFlskw
This PR has a migration; here is the generated SQL for
-- start migrations
-- forward migration events_analytics_platform : 0062_add_count_to_co_occurring_attrs
Local op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array_string Array(String), attributes_array_int Array(String), attributes_array_float Array(String), attributes_array_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) ENGINE ReplicatedSummingMergeTree('/clickhouse/tables/events_analytics_platform/{shard}/default/eap_item_co_occurring_attrs_2_local', '{replica}') PRIMARY KEY (organization_id, project_id, date, item_type, key_hash) ORDER BY (organization_id, project_id, date, item_type, key_hash, retention_days) PARTITION BY (retention_days, toMonday(date)) TTL date + toIntervalDay(retention_days);
Distributed op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array_string Array(String), attributes_array_int Array(String), attributes_array_float Array(String), attributes_array_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) ENGINE Distributed(`cluster_one_sh`, default, eap_item_co_occurring_attrs_2_local);
Local op: ALTER TABLE eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' ADD INDEX IF NOT EXISTS bf_attribute_keys_hash attribute_keys_hash TYPE bloom_filter GRANULARITY 1;
Local op: CREATE MATERIALIZED VIEW IF NOT EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' TO eap_item_co_occurring_attrs_2_local (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array_string Array(String), attributes_array_int Array(String), attributes_array_float Array(String), attributes_array_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) AS
SELECT
organization_id AS organization_id,
project_id AS project_id,
item_type as item_type,
toMonday(timestamp) AS date,
retention_days as retention_days,
arrayConcat(mapKeys(attributes_string_0), mapKeys(attributes_string_1), mapKeys(attributes_string_2), mapKeys(attributes_string_3), mapKeys(attributes_string_4), mapKeys(attributes_string_5), mapKeys(attributes_string_6), mapKeys(attributes_string_7), mapKeys(attributes_string_8), mapKeys(attributes_string_9), mapKeys(attributes_string_10), mapKeys(attributes_string_11), mapKeys(attributes_string_12), mapKeys(attributes_string_13), mapKeys(attributes_string_14), mapKeys(attributes_string_15), mapKeys(attributes_string_16), mapKeys(attributes_string_17), mapKeys(attributes_string_18), mapKeys(attributes_string_19), mapKeys(attributes_string_20), mapKeys(attributes_string_21), mapKeys(attributes_string_22), mapKeys(attributes_string_23), mapKeys(attributes_string_24), mapKeys(attributes_string_25), mapKeys(attributes_string_26), mapKeys(attributes_string_27), mapKeys(attributes_string_28), mapKeys(attributes_string_29), mapKeys(attributes_string_30), mapKeys(attributes_string_31), mapKeys(attributes_string_32), mapKeys(attributes_string_33), mapKeys(attributes_string_34), mapKeys(attributes_string_35), mapKeys(attributes_string_36), mapKeys(attributes_string_37), mapKeys(attributes_string_38), mapKeys(attributes_string_39)) AS attributes_string,
arrayConcat(mapKeys(attributes_float_0), mapKeys(attributes_float_1), mapKeys(attributes_float_2), mapKeys(attributes_float_3), mapKeys(attributes_float_4), mapKeys(attributes_float_5), mapKeys(attributes_float_6), mapKeys(attributes_float_7), mapKeys(attributes_float_8), mapKeys(attributes_float_9), mapKeys(attributes_float_10), mapKeys(attributes_float_11), mapKeys(attributes_float_12), mapKeys(attributes_float_13), mapKeys(attributes_float_14), mapKeys(attributes_float_15), mapKeys(attributes_float_16), mapKeys(attributes_float_17), mapKeys(attributes_float_18), mapKeys(attributes_float_19), mapKeys(attributes_float_20), mapKeys(attributes_float_21), mapKeys(attributes_float_22), mapKeys(attributes_float_23), mapKeys(attributes_float_24), mapKeys(attributes_float_25), mapKeys(attributes_float_26), mapKeys(attributes_float_27), mapKeys(attributes_float_28), mapKeys(attributes_float_29), mapKeys(attributes_float_30), mapKeys(attributes_float_31), mapKeys(attributes_float_32), mapKeys(attributes_float_33), mapKeys(attributes_float_34), mapKeys(attributes_float_35), mapKeys(attributes_float_36), mapKeys(attributes_float_37), mapKeys(attributes_float_38), mapKeys(attributes_float_39)) AS attributes_float,
mapKeys(attributes_int) AS attributes_int,
mapKeys(attributes_bool) AS attributes_bool,
mapKeys(attributes_array_string) AS attributes_array_string,
mapKeys(attributes_array_int) AS attributes_array_int,
mapKeys(attributes_array_float) AS attributes_array_float,
mapKeys(attributes_array_bool) AS attributes_array_bool,
1 AS count,
timestamp AS last_seen
FROM eap_items_1_local
;
-- end forward migration events_analytics_platform : 0062_add_count_to_co_occurring_attrs
-- backward migration events_analytics_platform : 0062_add_count_to_co_occurring_attrs
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' SYNC;
Distributed op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' SYNC;
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' SYNC;
-- end backward migration events_analytics_platform : 0062_add_count_to_co_occurring_attrs
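Read as a per-item projection, the `SELECT` in the materialized view above does roughly the following. This is only an illustrative Python sketch: the `item` dict layout and the helper names (`to_monday`, `bucketed_keys`, `project_item`) are assumptions made for the example, not code from the PR.

```python
from datetime import date, datetime, timedelta

BUCKETS = 40  # attributes_string_0..39 and attributes_float_0..39 in the SQL above

PLAIN_MAP_COLUMNS = (
    "attributes_int",
    "attributes_bool",
    "attributes_array_string",
    "attributes_array_int",
    "attributes_array_float",
    "attributes_array_bool",
)


def to_monday(ts: datetime) -> date:
    """Round a timestamp down to the Monday of its week, like toMonday()."""
    day = ts.date()
    return day - timedelta(days=day.weekday())


def bucketed_keys(item: dict, prefix: str) -> list[str]:
    """Concatenate the keys of prefix_0 .. prefix_39 (arrayConcat of mapKeys)."""
    keys: list[str] = []
    for i in range(BUCKETS):
        keys.extend(item.get(f"{prefix}_{i}", {}))
    return keys


def project_item(item: dict) -> dict:
    """Build the row the view would emit for one item (hypothetical dict layout)."""
    row = {
        "organization_id": item["organization_id"],
        "project_id": item["project_id"],
        "item_type": item["item_type"],
        "date": to_monday(item["timestamp"]),
        "retention_days": item["retention_days"],
        "attributes_string": bucketed_keys(item, "attributes_string"),
        "attributes_float": bucketed_keys(item, "attributes_float"),
        "count": 1,  # every item contributes one occurrence; summed on merge
        "last_seen": item["timestamp"],
    }
    for column in PLAIN_MAP_COLUMNS:
        row[column] = list(item.get(column, {}))  # mapKeys(column)
    return row
```

Each inserted item therefore yields exactly one row with `count = 1`; any collapsing happens later, in the target table's merges.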
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 75193f8.
Master picked up 0054_fix_bools_in_autocomplete; bump this one to 0055 to resolve the duplicate migration number. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Agent transcript: https://claudescope.sentry.dev/share/3bKJJo4cpTu-irMjftAcw6rYLjZEJsxUtHC2hucYt6s
Bring the branch up to date with master and narrow it to just the new co-occurring attributes storage:
- Renumber the migration 0055 -> 0059 (0055-0058 are now taken on master).
- Drop the endpoint changes (the `use_co_occurring_attrs_v2` flag and the storage switch).
The v2 SummingMergeTree table with the `count` column is landed as groundwork only; the attribute-names endpoint continues to read the existing storage. Wiring the endpoint to read v2 (and sort by sum(count)) will be a follow-up.
Refs EAP-432
…on number Resolve the conflict from merging master into the co-occurring attrs v2 work by renumbering the migration from 0059 to 0061 (0059 and 0060 are now taken on master), which keeps migration numbers strictly increasing. Add a `last_seen` column to the v2 co-occurring attributes storage so we can track the most recent time a set of attributes was seen. It is a SimpleAggregateFunction(max, DateTime), which the SummingMergeTree engine collapses with `max` during merges, and the materialized view populates it from the item `timestamp`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax
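The merge behaviour described in this commit message can be pictured with a small Python simulation. It is a sketch of the semantics only, assuming rows are plain dicts and that `sort_key` stands in for the table's full `ORDER BY` tuple; `merge_rows` is a hypothetical helper, not part of snuba or ClickHouse.

```python
from datetime import datetime


def merge_rows(rows: list[dict]) -> list[dict]:
    """Collapse rows sharing a sort key: sum `count`, keep the max `last_seen`."""
    merged: dict = {}
    for row in rows:
        key = row["sort_key"]
        if key not in merged:
            merged[key] = dict(row)  # copy so the input rows are not mutated
            continue
        merged[key]["count"] += row["count"]
        merged[key]["last_seen"] = max(merged[key]["last_seen"], row["last_seen"])
    return list(merged.values())


rows = [
    {"sort_key": (1, 2, "k1"), "count": 1, "last_seen": datetime(2024, 5, 13, 9, 0)},
    {"sort_key": (1, 2, "k1"), "count": 1, "last_seen": datetime(2024, 5, 14, 9, 0)},
    {"sort_key": (1, 2, "k2"), "count": 1, "last_seen": datetime(2024, 5, 13, 8, 0)},
]
collapsed = merge_rows(rows)
```

Because ClickHouse merges parts in the background at unspecified times, a reader cannot assume rows are already collapsed, which is why the description asks consumers to aggregate with `sum(count)` and `max(last_seen)` themselves.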
The v2 co-occurring attributes table only captured string, float, and bool
attribute keys. Add the remaining attribute types so every attribute can be
surfaced with its type:
- `attributes_int`: keys of the `attributes_int` map (AttributeKey TYPE_INT).
- `attributes_array`: keys of all array-valued attribute maps
(`attributes_array_{string,int,float,bool}`), which all map to a single
AttributeKey TYPE_ARRAY.
Both new key arrays are folded into `attribute_keys_hash` (the bloom-filter
index) and `key_hash` (the dedup/sort key) via a shared `_all_attribute_keys`
expression, so dedup and lookups cover every attribute key regardless of type.
The materialized view populates the new columns from the corresponding
`eap_items_1_local` maps, and the storage config exposes them for reads.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax
…8562) Removes `last_received` from the `TraceItemAttributeContext` and `TraceItemAttributeValueContext` tables. We're going to retrieve `last_received` from ClickHouse instead (getsentry/snuba#7801), so there's no need to store it. These tables are completely empty, so the column can be dropped without any data concerns. Fixes [BROWSE-587](https://linear.app/getsentry/issue/BROWSE-587/remove-last-received-from-attribute-context-tables)
onewland left a comment
Seems fine. You might want to verify the last_seen behavior that you mention here: https://github.com/getsentry/snuba/pull/7801/changes#diff-5f6187dc49efbb83ada5486a8becebd8b791c3adbbe38a7e78f027deac398a17R65-R68
…ntry-protos Resolve the master merge by renumbering the co-occurring migration to 0062 (master added 0061_add_ai_conversation_id, so 0061 was taken again). Replace the single `attributes_array` key column with one column per array element type — `attributes_array_string`, `attributes_array_int`, `attributes_array_float`, `attributes_array_bool` — so each can be surfaced with its specific AttributeKey type (TYPE_ARRAY_STRING / TYPE_ARRAY_INT / TYPE_ARRAY_DOUBLE / TYPE_ARRAY_BOOL; float arrays map to TYPE_ARRAY_DOUBLE). The materialized view populates each from the corresponding `attributes_array_*` map on eap_items, and all four are folded into the shared `_all_attribute_keys` expression backing `key_hash` and the bloom-filter `attribute_keys_hash`. Bump sentry-protos to >=0.35.0, which introduces the TYPE_ARRAY_* enum values the split array types map to. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax
Yes, this is correct. It will select the timestamp and then the aggregation column will just |

Add a new `SummingMergeTree`-based storage (`eap_item_co_occurring_attrs_v2`) for co-occurring attributes. Compared to the existing `ReplacingMergeTree` approach (`eap_item_co_occurring_attrs`), the v2 table:

- includes a `count` column that is summed on merge, giving an occurrence count per set of co-occurring attributes;
- uses a materialized `key_hash` (a hash of the sorted, distinct attribute keys) in the sort key so rows with the same attribute set are deduplicated/collapsed during merges;
- adds a `last_seen` column (`SimpleAggregateFunction(max, DateTime)`) tracking the most recent timestamp at which a set of attributes was seen. Because it is a `SimpleAggregateFunction(max)`, the `SummingMergeTree` applies `max` on merge, so the stored value converges to the maximum item `timestamp` across the collapsed rows (consumers aggregate it with `max()` the same way `count` is summed);
- represents every attribute type, mirroring the typed maps on `eap_items`, with one key array per type so each attribute can be surfaced with its `AttributeKey` type:

| Column | `AttributeKey` type | Source on `eap_items` |
| --- | --- | --- |
| `attributes_string` | `TYPE_STRING` | `attributes_string_*` |
| `attributes_float` | `TYPE_FLOAT`/`TYPE_DOUBLE` | `attributes_float_*` |
| `attributes_int` | `TYPE_INT` | `attributes_int` |
| `attributes_bool` | `TYPE_BOOLEAN` | `attributes_bool` |
| `attributes_array_string` | `TYPE_ARRAY_STRING` | `attributes_array_string` |
| `attributes_array_int` | `TYPE_ARRAY_INT` | `attributes_array_int` |
| `attributes_array_float` | `TYPE_ARRAY_DOUBLE` | `attributes_array_float` |
| `attributes_array_bool` | `TYPE_ARRAY_BOOL` | `attributes_array_bool` |

Both `key_hash` and the bloom-filter `attribute_keys_hash` are derived from a single `arrayConcat(...)` of all the key arrays, so dedup and key lookups cover every attribute key regardless of type.
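To illustrate how the two derived columns relate to the per-type key arrays, here is a minimal Python sketch. It assumes a row is a dict of key lists, and it uses `hashlib.blake2b` as a stand-in 64-bit hash; ClickHouse uses `cityHash64`, so the actual values will differ. The function names are hypothetical.

```python
import hashlib

KEY_COLUMNS = (
    "attributes_string",
    "attributes_float",
    "attributes_int",
    "attributes_bool",
    "attributes_array_string",
    "attributes_array_int",
    "attributes_array_float",
    "attributes_array_bool",
)


def h64(value: str) -> int:
    """Stand-in 64-bit hash (NOT cityHash64; used only for illustration)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def all_attribute_keys(row: dict) -> list[str]:
    """arrayConcat(...) of every per-type key array, in column order."""
    keys: list[str] = []
    for column in KEY_COLUMNS:
        keys.extend(row.get(column, []))
    return keys


def attribute_keys_hash(row: dict) -> list[int]:
    """One hash per distinct key: the values the bloom-filter index covers."""
    distinct = list(dict.fromkeys(all_attribute_keys(row)))  # arrayDistinct
    return [h64(key) for key in distinct]


def key_hash(row: dict) -> int:
    """Single hash of the sorted, distinct key names: the dedup/sort key."""
    ordered = sorted(set(all_attribute_keys(row)))  # arraySort(arrayDistinct(...))
    return h64("\x00".join(ordered))
```

Sorting before hashing makes `key_hash` independent of the order in which keys arrive. As the SQL is written, the hash covers key names only, so two rows carrying the same names under different types would share a `key_hash`.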
Migration
`0062_add_count_to_co_occurring_attrs.py` creates the local/dist tables, the `bf_attribute_keys_hash` bloom-filter index, and the materialized view from `eap_items_1_local`. The MV populates the per-type key arrays via `mapKeys(...)`, `count` with `1` (summed on merge), and `last_seen` with the item `timestamp`.

Dependencies
Bumps `sentry-protos` to `>=0.35.0`, which adds the `TYPE_ARRAY_STRING` / `TYPE_ARRAY_INT` / `TYPE_ARRAY_DOUBLE` / `TYPE_ARRAY_BOOL` enum values the split array columns map to.
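As a rough picture of how a reader of the v2 table could tag each key with its type, the column-to-type mapping from the description can be written as a lookup. This sketch uses plain strings rather than the real `sentry-protos` enum, and `COLUMN_TO_TYPE` and `typed_keys` are hypothetical names. The description lists both `TYPE_FLOAT` and `TYPE_DOUBLE` for `attributes_float`; choosing `TYPE_DOUBLE` here is an assumption.

```python
# Column name -> AttributeKey type name, following the table in the description.
COLUMN_TO_TYPE = {
    "attributes_string": "TYPE_STRING",
    "attributes_float": "TYPE_DOUBLE",  # assumption: description lists TYPE_FLOAT/TYPE_DOUBLE
    "attributes_int": "TYPE_INT",
    "attributes_bool": "TYPE_BOOLEAN",
    "attributes_array_string": "TYPE_ARRAY_STRING",
    "attributes_array_int": "TYPE_ARRAY_INT",
    "attributes_array_float": "TYPE_ARRAY_DOUBLE",
    "attributes_array_bool": "TYPE_ARRAY_BOOL",
}


def typed_keys(row: dict) -> list[tuple[str, str]]:
    """Return (attribute name, type name) pairs for one v2 row (a plain dict)."""
    pairs: list[tuple[str, str]] = []
    for column, type_name in COLUMN_TO_TYPE.items():
        for name in row.get(column, []):
            pairs.append((name, type_name))
    return pairs
```

Keeping one array per element type is what makes this lookup possible; with the earlier single `attributes_array` column, the element type could not be recovered from the row.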
Validation (Python 3.13)
- `EventsAnalyticsPlatformLoader` loads all EAP migrations with no duplicate/gap errors (latest is `0062`).
- … per-type key arrays, `count UInt64`, and `last_seen SimpleAggregateFunction(max, DateTime)`.
- `snuba/validate_configs.py` reports all configs valid; `ruff check`/`format` pass.

Note
The attribute-names RPC (`endpoint_trace_item_attribute_names`) still reads the v1 storage, so it does not yet surface `count`, `last_seen`, or the int/array key arrays. Switching it to v2 (behind a rollout flag) and tagging the new array types is a follow-up.
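One way the follow-up could rank attribute names once it reads v2 is to total `count` and take the latest `last_seen` per name across every attribute set that contains it. The Python sketch below shows that aggregation over in-memory rows; it is an assumption about the eventual read path, and `rank_attribute_names` is a hypothetical helper rather than anything in this PR.

```python
from collections import defaultdict
from datetime import datetime


def rank_attribute_names(
    rows: list[dict], column: str
) -> list[tuple[str, int, datetime]]:
    """Per attribute name: sum(count) and max(last_seen), most frequent first."""
    counts: dict = defaultdict(int)
    last_seen: dict = {}
    for row in rows:
        for name in set(row.get(column, [])):  # count each name once per row
            counts[name] += row["count"]
            seen = row["last_seen"]
            if name not in last_seen or seen > last_seen[name]:
                last_seen[name] = seen
    # Sort by descending total count, then by name for a stable order.
    ranked = sorted(counts, key=lambda name: (-counts[name], name))
    return [(name, counts[name], last_seen[name]) for name in ranked]
```

Aggregating at read time gives the same totals whether or not background merges have already collapsed the rows.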
Agent transcript: https://claudescope.sentry.dev/share/jjGnsb7JWH13GyrGe-wbHapP5rwLIJPOJyGwWJKv-70