fix(weave): shard call_parts by id so call_start/call_end co-locate #6997

Draft
gtarpenning wants to merge 1 commit into gtarpenning/wb-34905-fix-mv-double-create from gtarpenning/wb-34906-shard-call-parts-by-id

gtarpenning commented May 28, 2026

WB-34906

Summary

  • call_parts was using the default rand() sharding key, so the call_start and call_end rows of the same call could land on different shards. Sharding by trace_id is not an option because call_end rows don't carry trace_id (only call_start does).
  • Without co-location, the partial states never merged into calls_complete, and parent_id IS NULL filters matched the call_end row of every child call instead of only trace roots.
  • Add "call_parts": "id" to ID_SHARDED_TABLES so all parts of a single call hash to the same shard via sipHash64(id).
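
A minimal sketch of how the sharding-key resolution described above might look. Only `ID_SHARDED_TABLES`, the `"call_parts": "id"` entry, `rand()`, and `sipHash64(id)` come from this PR; the function name `sharding_expr` and the lookup logic are assumptions for illustration, not the migrator's actual code.

```python
# Hypothetical sketch: the real migrator may resolve sharding keys differently.
# Tables listed here are sharded by a hash of the named column; everything
# else falls back to the default rand() key.
ID_SHARDED_TABLES: dict[str, str] = {
    "call_parts": "id",  # the entry this PR adds
}


def sharding_expr(table: str) -> str:
    """Return the ClickHouse sharding expression for a Distributed table."""
    column = ID_SHARDED_TABLES.get(table)
    if column is None:
        return "rand()"
    return f"sipHash64({column})"


print(sharding_expr("call_parts"))
print(sharding_expr("calls_merged"))
```

Because the expression depends only on `id`, every part of a given call resolves to the same shard regardless of which columns that part carries.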

Testing

Caught by the new 2s2r nightly tests; existing migrator unit tests cover the sharding-key resolution.

gtarpenning commented May 28, 2026

codecov Bot commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

call_parts was using the default rand() sharding key, so call_start and
call_end for the same call could land on different shards. Once split,
the partial-state rows can never merge in calls_merged_local (OPTIMIZE
runs per-shard), and queries that filter on an aggregated column see
inconsistent state.

Concretely: call_end doesn't carry parent_id, so its row defaults to
NULL. Filters like trace_roots_only (`parent_id IS NULL`) then match
the call_end row of every child call as if it were a root, inflating
counts.
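
The inflation described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not the real pipeline: the two-shard setup, the row shape, and the helper names are hypothetical, `blake2b` stands in for `sipHash64`, and the per-shard merge is modeled as "keep any non-null parent_id per id".

```python
import hashlib
import random
from collections import defaultdict

NUM_SHARDS = 2  # assumed cluster size for the illustration


def make_parts(num_children: int) -> list[dict]:
    """One root call plus children; each call emits a start and an end part."""
    parts = []
    for i in range(num_children + 1):
        call_id = f"call-{i}"
        parent_id = None if i == 0 else "call-0"
        parts.append({"id": call_id, "parent_id": parent_id})  # call_start
        parts.append({"id": call_id, "parent_id": None})  # call_end: no parent_id
    return parts


def shard_by_id(part: dict) -> int:
    """Deterministic hash of id (blake2b as a stand-in for sipHash64)."""
    digest = hashlib.blake2b(part["id"].encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % NUM_SHARDS


def count_roots(parts: list[dict], pick_shard) -> int:
    """Merge parts per shard by id, then count rows with parent_id IS NULL."""
    shards: dict[int, dict[str, str | None]] = defaultdict(dict)
    for part in parts:
        merged = shards[pick_shard(part)]
        previous = merged.get(part["id"])
        # Keep the first non-null parent_id seen for this id on this shard.
        merged[part["id"]] = previous if previous is not None else part["parent_id"]
    return sum(
        parent is None for merged in shards.values() for parent in merged.values()
    )


parts = make_parts(num_children=200)
rng = random.Random(0)
roots_rand = count_roots(parts, lambda part: rng.randrange(NUM_SHARDS))
roots_by_id = count_roots(parts, shard_by_id)
print("rand() sharding:", roots_rand, "apparent roots")
print("id sharding:", roots_by_id, "apparent roots")
```

With id-based sharding the single real root is the only row whose merged parent_id is null; with random placement, every child whose call_end lands on a different shard than its call_start adds a spurious root.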

Shard by `id` instead of `wf_clickhouse_calls_shard_key()` (which
defaults to trace_id). trace_id is Nullable on call_end, so sipHash64
returns Nullable, which ClickHouse rejects as a sharding expression
(TYPE_MISMATCH). `id` is non-null on every call_parts row and uniquely
identifies a call, so all parts of one call land together.

The calls_merged Distributed table is intentionally left on rand():
its only writes come from the MV, which fires on the local
source/target pair and never goes through the Distributed wrapper.
@gtarpenning gtarpenning force-pushed the gtarpenning/wb-34905-fix-mv-double-create branch from dbe0478 to 7943d1a Compare May 28, 2026 21:07
@gtarpenning gtarpenning force-pushed the gtarpenning/wb-34906-shard-call-parts-by-id branch from c25f1ba to 42c2c39 Compare May 28, 2026 21:07