
feat(eap): Add v2 co-occurring attributes storage with count, last_seen, and per-type attribute keys #7801

Open: phacops wants to merge 10 commits into master from phacops/eap-co-occurring-attrs-v2

@phacops phacops commented Mar 5, 2026

Contributor

Add a new SummingMergeTree-based storage (eap_item_co_occurring_attrs_v2) for
co-occurring attributes. Compared to the existing ReplacingMergeTree approach
(eap_item_co_occurring_attrs), the v2 table:

  • includes a count column that is summed on merge, giving an occurrence count per
    set of co-occurring attributes;

  • uses a materialized key_hash (a hash of the sorted, distinct attribute keys) in
    the sort key so rows with the same attribute set are deduplicated/collapsed during
    merges;

  • adds a last_seen column (SimpleAggregateFunction(max, DateTime)) tracking the most
    recent timestamp at which a set of attributes was seen. Because it is a
    SimpleAggregateFunction(max), the SummingMergeTree applies max on merge, so the
    stored value converges to the maximum item timestamp across the collapsed rows
    (consumers aggregate it with max() the same way count is summed);

  • represents every attribute type, mirroring the typed maps on eap_items, with one
    key array per type so each attribute can be surfaced with its AttributeKey type:

    | column | AttributeKey type | source map on eap_items |
    | --- | --- | --- |
    | attributes_string | TYPE_STRING | attributes_string_* |
    | attributes_float | TYPE_FLOAT / TYPE_DOUBLE | attributes_float_* |
    | attributes_int | TYPE_INT | attributes_int |
    | attributes_bool | TYPE_BOOLEAN | attributes_bool |
    | attributes_array_string | TYPE_ARRAY_STRING | attributes_array_string |
    | attributes_array_int | TYPE_ARRAY_INT | attributes_array_int |
    | attributes_array_float | TYPE_ARRAY_DOUBLE | attributes_array_float |
    | attributes_array_bool | TYPE_ARRAY_BOOL | attributes_array_bool |

    Both key_hash and the bloom-filter attribute_keys_hash are derived from a single
    arrayConcat(...) of all the key arrays, so dedup and key lookups cover every
    attribute key regardless of type.
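
To make the collapse behavior concrete, here is a minimal Python sketch that simulates what a merge does to rows sharing the same sort key. It is not Snuba or ClickHouse code: `key_hash` uses sha256 as a stand-in for cityHash64, `merge` is a hypothetical helper, and the sample rows are made up. It only assumes the semantics described above (count is summed, last_seen takes the max, and the hash is computed over the sorted, distinct keys of all per-type arrays).

```python
import hashlib
from datetime import datetime
from itertools import chain

# Per-type key columns, named as in the description above.
KEY_COLUMNS = (
    "attributes_string", "attributes_float", "attributes_int", "attributes_bool",
    "attributes_array_string", "attributes_array_int",
    "attributes_array_float", "attributes_array_bool",
)


def key_hash(row: dict) -> int:
    """Hash the sorted, distinct keys across all per-type arrays.

    Stand-in for cityHash64(arraySort(arrayDistinct(arrayConcat(...)))):
    sha256 is used only because cityHash64 is not in the standard library.
    """
    keys = sorted(set(chain.from_iterable(row.get(col, []) for col in KEY_COLUMNS)))
    digest = hashlib.sha256("\x00".join(keys).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def merge(rows: list[dict]) -> list[dict]:
    """Collapse rows that share the ORDER BY key: sum count, max last_seen."""
    merged: dict[tuple, dict] = {}
    for row in rows:
        sort_key = (
            row["organization_id"], row["project_id"], row["date"],
            row["item_type"], key_hash(row), row["retention_days"],
        )
        acc = merged.get(sort_key)
        if acc is None:
            merged[sort_key] = dict(row)  # first row seen supplies the other columns
        else:
            acc["count"] += row["count"]
            acc["last_seen"] = max(acc["last_seen"], row["last_seen"])
    return list(merged.values())


# Made-up rows: the first two carry the same key set in a different order.
base = {"organization_id": 1, "project_id": 2, "date": "2026-06-22",
        "item_type": 1, "retention_days": 90, "count": 1}
rows = [
    {**base, "attributes_string": ["a", "b"], "attributes_int": ["n"],
     "last_seen": datetime(2026, 6, 23, 10, 0)},
    {**base, "attributes_string": ["b", "a"], "attributes_int": ["n"],
     "last_seen": datetime(2026, 6, 24, 9, 0)},
    {**base, "attributes_string": ["a"],
     "last_seen": datetime(2026, 6, 24, 8, 0)},
]
collapsed = merge(rows)
```

Because merges happen in the background at unspecified times, a reader cannot assume rows are already collapsed, which is why consumers still aggregate with sum(count) and max(last_seen) at query time.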

Migration

0062_add_count_to_co_occurring_attrs.py creates the local/dist tables, the
bf_attribute_keys_hash bloom-filter index, and the materialized view from
eap_items_1_local. The MV populates the per-type key arrays via mapKeys(...),
count with 1 (summed on merge), and last_seen with the item timestamp.
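
As a rough illustration of the projection the materialized view performs, the sketch below builds the v2 row for a single item in plain Python. The function names (`project_item`, `to_monday`) and the sample item are hypothetical; the 40 buckets for the string and float maps are taken from the generated SQL in this PR, and the real work is done by ClickHouse, not by code like this.

```python
from datetime import date, datetime, timedelta

NUM_BUCKETS = 40  # attributes_string_0..39 and attributes_float_0..39 in the generated SQL
SINGLE_MAPS = (
    "attributes_int", "attributes_bool", "attributes_array_string",
    "attributes_array_int", "attributes_array_float", "attributes_array_bool",
)


def to_monday(ts: datetime) -> date:
    """Equivalent of ClickHouse toMonday(): the Monday on or before ts."""
    return ts.date() - timedelta(days=ts.weekday())


def project_item(item: dict) -> dict:
    """Build the row the materialized view would emit for one eap_items row."""

    def bucketed_keys(prefix: str) -> list[str]:
        # Mirrors arrayConcat(mapKeys(prefix_0), ..., mapKeys(prefix_39)).
        keys: list[str] = []
        for i in range(NUM_BUCKETS):
            keys.extend(item.get(f"{prefix}_{i}", {}))
        return keys

    row = {
        "organization_id": item["organization_id"],
        "project_id": item["project_id"],
        "item_type": item["item_type"],
        "date": to_monday(item["timestamp"]),
        "retention_days": item["retention_days"],
        "attributes_string": bucketed_keys("attributes_string"),
        "attributes_float": bucketed_keys("attributes_float"),
        "count": 1,                      # summed on merge
        "last_seen": item["timestamp"],  # max on merge
    }
    for column in SINGLE_MAPS:
        row[column] = list(item.get(column, {}))  # mirrors mapKeys(column)
    return row


# Made-up item for illustration only.
item = {
    "organization_id": 1, "project_id": 2, "item_type": 1, "retention_days": 90,
    "timestamp": datetime(2026, 6, 24, 12, 0),
    "attributes_string_3": {"http.method": "GET"},
    "attributes_string_17": {"env": "prod"},
    "attributes_float_0": {"duration": 1.5},
    "attributes_int": {"retries": 2},
    "attributes_array_string": {"tags": ["a", "b"]},
}
row = project_item(item)
```

Only the map keys are kept; attribute values never reach the v2 table.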

Dependencies

Bumps sentry-protos to >=0.35.0, which adds the TYPE_ARRAY_STRING /
TYPE_ARRAY_INT / TYPE_ARRAY_DOUBLE / TYPE_ARRAY_BOOL enum values the split array
columns map to.

Validation (Python 3.13)

  • EventsAnalyticsPlatformLoader loads all EAP migrations with no duplicate/gap errors
    (latest is 0062).
  • The migration renders valid ClickHouse DDL — the table and MV include all eight
    per-type key arrays, count UInt64, and last_seen SimpleAggregateFunction(max, DateTime).
  • snuba/validate_configs.py reports all configs valid; ruff check/format pass.
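
For readers unfamiliar with the duplicate/gap check, this is a small standalone sketch of the idea. It is not the EventsAnalyticsPlatformLoader implementation; `check_migration_sequence` is a hypothetical helper, and it assumes only that migration names start with a four-digit number followed by an underscore.

```python
import re

MIGRATION_NAME = re.compile(r"^(\d{4})_\w+$")


def check_migration_sequence(names: list[str]) -> list[str]:
    """Return a list of problems: unparseable names, duplicates, or gaps."""
    errors: list[str] = []
    numbers: list[int] = []
    for name in names:
        match = MIGRATION_NAME.match(name)
        if match is None:
            errors.append(f"unparseable migration name: {name}")
        else:
            numbers.append(int(match.group(1)))
    numbers.sort()
    # Compare each number with its predecessor in sorted order.
    for prev, cur in zip(numbers, numbers[1:]):
        if cur == prev:
            errors.append(f"duplicate migration number: {cur:04d}")
        elif cur != prev + 1:
            errors.append(f"gap between {prev:04d} and {cur:04d}")
    return errors
```

Running it on a list where two migrations both claim 0061 reports the duplicate, which is the situation the renumbering commits in this PR resolve.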

Note

The attribute-names RPC (endpoint_trace_item_attribute_names) still reads the v1
storage, so it does not yet surface count, last_seen, or the int/array key arrays.
Switching it to v2 (behind a rollout flag) and tagging the new array types is a
follow-up.
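
One possible shape for that follow-up read path, sketched in Python rather than as a ClickHouse query: group by attribute key and type, sum count, take the max of last_seen, and order by total count. `rank_attribute_keys` and the sample rows are hypothetical, and labeling `attributes_float` as TYPE_DOUBLE is an assumption (the table above lists TYPE_FLOAT / TYPE_DOUBLE).

```python
from collections import defaultdict
from datetime import datetime

# Column -> AttributeKey type label, following the table in the description.
# TYPE_DOUBLE for attributes_float is an assumption made for this sketch.
COLUMN_TYPES = {
    "attributes_string": "TYPE_STRING",
    "attributes_float": "TYPE_DOUBLE",
    "attributes_int": "TYPE_INT",
    "attributes_bool": "TYPE_BOOLEAN",
    "attributes_array_string": "TYPE_ARRAY_STRING",
    "attributes_array_int": "TYPE_ARRAY_INT",
    "attributes_array_float": "TYPE_ARRAY_DOUBLE",
    "attributes_array_bool": "TYPE_ARRAY_BOOL",
}


def rank_attribute_keys(rows: list[dict]) -> list[tuple]:
    """Return (key, type, total_count, last_seen) sorted by total_count descending."""
    totals = defaultdict(lambda: [0, datetime.min])
    for row in rows:
        for column, type_name in COLUMN_TYPES.items():
            for key in set(row.get(column, [])):
                entry = totals[(key, type_name)]
                entry[0] += row["count"]                         # sum(count)
                entry[1] = max(entry[1], row["last_seen"])        # max(last_seen)
    ranked = [
        (key, type_name, count, last_seen)
        for (key, type_name), (count, last_seen) in totals.items()
    ]
    # Highest count first; ties broken by key and type for a stable order.
    ranked.sort(key=lambda r: (-r[2], r[0], r[1]))
    return ranked


# Made-up v2 rows (possibly not yet merged).
sample_rows = [
    {"attributes_string": ["env", "release"], "attributes_int": ["retries"],
     "count": 5, "last_seen": datetime(2026, 6, 24)},
    {"attributes_string": ["env"], "count": 3, "last_seen": datetime(2026, 6, 25)},
]
ranked = rank_attribute_keys(sample_rows)
```

Aggregating at read time gives the same totals whether or not the background merges have run yet.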

Agent transcript: https://claudescope.sentry.dev/share/jjGnsb7JWH13GyrGe-wbHapP5rwLIJPOJyGwWJKv-70

Add a new SummingMergeTree-based storage for co-occurring attributes
that includes a count column for proper deduplication via key_hash.
The v2 storage is gated behind a `use_co_occurring_attrs_v2` feature
flag. Also simplify result row parsing in the attribute names endpoint.

Co-Authored-By: Claude <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/yM8dAMnfR-nHQ6Z7BKDQd12ih3FsVPMAzgudpbFlskw

github-actions Bot commented Mar 5, 2026


This PR has a migration; here is the generated SQL for ./snuba/migrations/groups.py ()

-- start migrations

-- forward migration events_analytics_platform : 0062_add_count_to_co_occurring_attrs
Local op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array_string Array(String), attributes_array_int Array(String), attributes_array_float Array(String), attributes_array_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) ENGINE ReplicatedSummingMergeTree('/clickhouse/tables/events_analytics_platform/{shard}/default/eap_item_co_occurring_attrs_2_local', '{replica}') PRIMARY KEY (organization_id, project_id, date, item_type, key_hash) ORDER BY (organization_id, project_id, date, item_type, key_hash, retention_days) PARTITION BY (retention_days, toMonday(date)) TTL date + toIntervalDay(retention_days);
Distributed op: CREATE TABLE IF NOT EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array_string Array(String), attributes_array_int Array(String), attributes_array_float Array(String), attributes_array_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) ENGINE Distributed(`cluster_one_sh`, default, eap_item_co_occurring_attrs_2_local);
Local op: ALTER TABLE eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' ADD INDEX IF NOT EXISTS bf_attribute_keys_hash attribute_keys_hash TYPE bloom_filter GRANULARITY 1;
Local op: CREATE MATERIALIZED VIEW IF NOT EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' TO eap_item_co_occurring_attrs_2_local (organization_id UInt64, project_id UInt64, item_type UInt8, date Date CODEC (DoubleDelta, ZSTD(1)), retention_days UInt16, attribute_keys_hash Array(UInt64) MATERIALIZED arrayMap(k -> cityHash64(k), arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool))), attributes_string Array(String), attributes_float Array(String), attributes_int Array(String), attributes_bool Array(String), attributes_array_string Array(String), attributes_array_int Array(String), attributes_array_float Array(String), attributes_array_bool Array(String), key_hash UInt64 MATERIALIZED cityHash64(arraySort(arrayDistinct(arrayConcat(attributes_string, attributes_float, attributes_int, attributes_bool, attributes_array_string, attributes_array_int, attributes_array_float, attributes_array_bool)))), count UInt64, last_seen SimpleAggregateFunction(max, DateTime)) AS 
SELECT
    organization_id AS organization_id,
    project_id AS project_id,
    item_type as item_type,
    toMonday(timestamp) AS date,
    retention_days as retention_days,
    arrayConcat(mapKeys(attributes_string_0), mapKeys(attributes_string_1), mapKeys(attributes_string_2), mapKeys(attributes_string_3), mapKeys(attributes_string_4), mapKeys(attributes_string_5), mapKeys(attributes_string_6), mapKeys(attributes_string_7), mapKeys(attributes_string_8), mapKeys(attributes_string_9), mapKeys(attributes_string_10), mapKeys(attributes_string_11), mapKeys(attributes_string_12), mapKeys(attributes_string_13), mapKeys(attributes_string_14), mapKeys(attributes_string_15), mapKeys(attributes_string_16), mapKeys(attributes_string_17), mapKeys(attributes_string_18), mapKeys(attributes_string_19), mapKeys(attributes_string_20), mapKeys(attributes_string_21), mapKeys(attributes_string_22), mapKeys(attributes_string_23), mapKeys(attributes_string_24), mapKeys(attributes_string_25), mapKeys(attributes_string_26), mapKeys(attributes_string_27), mapKeys(attributes_string_28), mapKeys(attributes_string_29), mapKeys(attributes_string_30), mapKeys(attributes_string_31), mapKeys(attributes_string_32), mapKeys(attributes_string_33), mapKeys(attributes_string_34), mapKeys(attributes_string_35), mapKeys(attributes_string_36), mapKeys(attributes_string_37), mapKeys(attributes_string_38), mapKeys(attributes_string_39)) AS attributes_string,
    arrayConcat(mapKeys(attributes_float_0), mapKeys(attributes_float_1), mapKeys(attributes_float_2), mapKeys(attributes_float_3), mapKeys(attributes_float_4), mapKeys(attributes_float_5), mapKeys(attributes_float_6), mapKeys(attributes_float_7), mapKeys(attributes_float_8), mapKeys(attributes_float_9), mapKeys(attributes_float_10), mapKeys(attributes_float_11), mapKeys(attributes_float_12), mapKeys(attributes_float_13), mapKeys(attributes_float_14), mapKeys(attributes_float_15), mapKeys(attributes_float_16), mapKeys(attributes_float_17), mapKeys(attributes_float_18), mapKeys(attributes_float_19), mapKeys(attributes_float_20), mapKeys(attributes_float_21), mapKeys(attributes_float_22), mapKeys(attributes_float_23), mapKeys(attributes_float_24), mapKeys(attributes_float_25), mapKeys(attributes_float_26), mapKeys(attributes_float_27), mapKeys(attributes_float_28), mapKeys(attributes_float_29), mapKeys(attributes_float_30), mapKeys(attributes_float_31), mapKeys(attributes_float_32), mapKeys(attributes_float_33), mapKeys(attributes_float_34), mapKeys(attributes_float_35), mapKeys(attributes_float_36), mapKeys(attributes_float_37), mapKeys(attributes_float_38), mapKeys(attributes_float_39)) AS attributes_float,
    mapKeys(attributes_int) AS attributes_int,
    mapKeys(attributes_bool) AS attributes_bool,
    mapKeys(attributes_array_string) AS attributes_array_string,
    mapKeys(attributes_array_int) AS attributes_array_int,
    mapKeys(attributes_array_float) AS attributes_array_float,
    mapKeys(attributes_array_bool) AS attributes_array_bool,
    1 AS count,
    timestamp AS last_seen
FROM eap_items_1_local
;
-- end forward migration events_analytics_platform : 0062_add_count_to_co_occurring_attrs




-- backward migration events_analytics_platform : 0062_add_count_to_co_occurring_attrs
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_3_mv ON CLUSTER 'cluster_one_sh' SYNC;
Distributed op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_dist ON CLUSTER 'cluster_one_sh' SYNC;
Local op: DROP TABLE IF EXISTS eap_item_co_occurring_attrs_2_local ON CLUSTER 'cluster_one_sh' SYNC;
-- end backward migration events_analytics_platform : 0062_add_count_to_co_occurring_attrs

@phacops phacops marked this pull request as ready for review May 25, 2026 22:39
@phacops phacops requested review from a team as code owners May 25, 2026 22:39

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 75193f8. Configure here.

phacops and others added 3 commits May 29, 2026 19:12
Master picked up 0054_fix_bools_in_autocomplete; bump this one to 0055
to resolve the duplicate migration number.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Agent transcript: https://claudescope.sentry.dev/share/3bKJJo4cpTu-irMjftAcw6rYLjZEJsxUtHC2hucYt6s
Bring the branch up to date with master and narrow it to just the new
co-occurring attributes storage:

- Renumber the migration 0055 -> 0059 (0055-0058 are now taken on master).
- Drop the endpoint changes (the `use_co_occurring_attrs_v2` flag and the
  storage switch). The v2 SummingMergeTree table with the `count` column is
  landed as groundwork only; the attribute-names endpoint continues to read
  the existing storage. Wiring the endpoint to read v2 (and sort by
  sum(count)) will be a follow-up.

Refs EAP-432
claude added 2 commits June 23, 2026 18:48
…on number

Resolve the conflict from merging master into the co-occurring attrs v2
work by renumbering the migration from 0059 to 0061 (0059 and 0060 are
now taken on master), which keeps migration numbers strictly increasing.

Add a `last_seen` column to the v2 co-occurring attributes storage so we
can track the most recent time a set of attributes was seen. It is a
SimpleAggregateFunction(max, DateTime), which the SummingMergeTree engine
collapses with `max` during merges, and the materialized view populates it
from the item `timestamp`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax

linear-code Bot commented Jun 23, 2026


EAP-573

@phacops phacops changed the title from "feat(eap): Add v2 co-occurring attributes storage with count column" to "feat(eap): Add v2 co-occurring attributes storage with count and last_seen columns" Jun 23, 2026
The v2 co-occurring attributes table only captured string, float, and bool
attribute keys. Add the remaining attribute types so every attribute can be
surfaced with its type:

- `attributes_int`: keys of the `attributes_int` map (AttributeKey TYPE_INT).
- `attributes_array`: keys of all array-valued attribute maps
  (`attributes_array_{string,int,float,bool}`), which all map to a single
  AttributeKey TYPE_ARRAY.

Both new key arrays are folded into `attribute_keys_hash` (the bloom-filter
index) and `key_hash` (the dedup/sort key) via a shared `_all_attribute_keys`
expression, so dedup and lookups cover every attribute key regardless of type.
The materialized view populates the new columns from the corresponding
`eap_items_1_local` maps, and the storage config exposes them for reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax
DominikB2014 added a commit to getsentry/sentry that referenced this pull request Jun 26, 2026
…8562)

Removes `last_received` from the `TraceItemAttributeContext` and
`TraceItemAttributeValueContext` tables. We're going to retrieve
`last_received` from ClickHouse instead (getsentry/snuba#7801), so
there's no need to store it.

These tables are completely empty, so the column can be dropped without
any data concerns.

Fixes
[BROWSE-587](https://linear.app/getsentry/issue/BROWSE-587/remove-last-received-from-attribute-context-tables)

@onewland onewland left a comment

Contributor


claude added 2 commits June 26, 2026 23:20
…ntry-protos

Resolve the master merge by renumbering the co-occurring migration to 0062
(master added 0061_add_ai_conversation_id, so 0061 was taken again).

Replace the single `attributes_array` key column with one column per array
element type — `attributes_array_string`, `attributes_array_int`,
`attributes_array_float`, `attributes_array_bool` — so each can be surfaced with
its specific AttributeKey type (TYPE_ARRAY_STRING / TYPE_ARRAY_INT /
TYPE_ARRAY_DOUBLE / TYPE_ARRAY_BOOL; float arrays map to TYPE_ARRAY_DOUBLE). The
materialized view populates each from the corresponding `attributes_array_*` map
on eap_items, and all four are folded into the shared `_all_attribute_keys`
expression backing `key_hash` and the bloom-filter `attribute_keys_hash`.

Bump sentry-protos to >=0.35.0, which introduces the TYPE_ARRAY_* enum values
the split array types map to.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01SQHFWAZS2wQBJ2GTCGCoax
@phacops phacops changed the title from "feat(eap): Add v2 co-occurring attributes storage with count and last_seen columns" to "feat(eap): Add v2 co-occurring attributes storage with count, last_seen, and per-type attribute keys" Jun 26, 2026

phacops commented Jun 26, 2026

Contributor Author

> Seems fine. You might want to verify the last_seen behavior that you mention here: https://github.com/getsentry/snuba/pull/7801/changes#diff-5f6187dc49efbb83ada5486a8becebd8b791c3adbbe38a7e78f027deac398a17R65-R68

Yes, this is correct. It will select the timestamp and then the aggregation column will just max it: https://github.com/getsentry/snuba/pull/7801/changes#diff-f1cfe315bbb96650580649e6fa4a1f8b7e672b6f860cdf4a7417a12c574a9b5dR73

shayna-ch pushed a commit to getsentry/sentry that referenced this pull request Jun 30, 2026
…8562)
