Skip to content

fix(target-allocator): sort labels in processTargetGroups to prevent hash collisions#4967

Merged
swiatekm merged 9 commits into
open-telemetry:mainfrom
gracewehner:fix/ta-sorted-labels
Jun 18, 2026
Merged

fix(target-allocator): sort labels in processTargetGroups to prevent hash collisions#4967
swiatekm merged 9 commits into
open-telemetry:mainfrom
gracewehner:fix/ta-sorted-labels

Conversation

@gracewehner

@gracewehner gracewehner commented Apr 15, 2026

Copy link
Copy Markdown
Contributor

Description

When processTargetGroups builds labels for each target, it uses a struct-copy of the group ScratchBuilder and appends target labels. The resulting label set has two independently sorted sublists (group labels, then target labels) but is not globally sorted.

labels.Labels.Get() (in the stringlabels build, the default) uses a linear scan with sorted-order early termination: if it encounters a label name whose first byte is greater than the search key, it stops immediately and returns empty. When a group label like vendor (first byte v=118) precedes __address__ (first byte _=95), Get("__address__") sees 118 > 95 and breaks early — returning "".

This causes relabel_configs that read source_labels: [__address__] to produce identical output for all targets in the group, creating hash collisions in the allocator's map[ItemHash]*Item. The last target silently overwrites all others.

Fix

Replace the struct-copy pattern with a merge-sort that interleaves the sorted group labels and sorted target labels into globally sorted output. Uses ScratchBuilder.Overwrite() to reuse internal buffers, achieving ~0% overhead vs the buggy baseline in benchmarks.

Benchmarks

BenchmarkProcessTargetsWithRelabelConfig (exercises the affected code path), 10 runs, benchstat:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator
cpu: Intel(R) Core(TM) Ultra 7 165H
                                                             │ /tmp/bench-before-relabel.txt │     /tmp/bench-after-relabel.txt     │
                                                             │            sec/op             │    sec/op     vs base                │
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-22                      13.42m ± 25%   13.32m ± 24%        ~ (p=0.971 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-22                                14.34m ± 31%   12.66m ± 22%        ~ (p=0.280 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-22                          12.75m ± 51%   13.10m ± 38%        ~ (p=0.579 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-22                     192.5m ± 40%   104.8m ± 16%  -45.54% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-22                               200.1m ± 78%   110.3m ± 14%  -44.90% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-22                         136.5m ± 21%   119.9m ± 13%        ~ (p=0.190 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-22                             1191.7m ± 26%   959.7m ± 25%  -19.47% (p=0.009 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-22                         1.180 ± 14%    1.106 ±  7%        ~ (p=0.143 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-22                     1.292 ± 20%    1.085 ± 18%        ~ (p=0.105 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-22                     9.989 ± 11%    5.164 ±  6%  -48.31% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-22                               9.982 ± 11%    5.176 ±  4%  -48.14% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-22                         9.305 ± 10%    6.051 ± 17%  -34.97% (p=0.000 n=10)
geomean                                                                         408.8m         301.8m        -26.17%

                                                             │ /tmp/bench-before-relabel.txt │    /tmp/bench-after-relabel.txt     │
                                                             │             B/op              │     B/op      vs base               │
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-22                      3.110Mi ± 0%   3.122Mi ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-22                                3.110Mi ± 0%   3.122Mi ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-22                          3.110Mi ± 0%   3.122Mi ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-22                     31.13Mi ± 0%   31.21Mi ± 0%  +0.24% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-22                               31.13Mi ± 0%   31.22Mi ± 0%  +0.29% (p=0.001 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-22                         31.10Mi ± 0%   31.21Mi ± 0%  +0.36% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-22                              314.5Mi ± 1%   313.5Mi ± 1%       ~ (p=0.315 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-22                        314.5Mi ± 1%   315.8Mi ± 1%  +0.39% (p=0.009 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-22                    314.5Mi ± 1%   313.5Mi ± 1%       ~ (p=0.912 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-22                    2.457Gi ± 0%   2.467Gi ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-22                              2.457Gi ± 0%   2.467Gi ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-22                        2.457Gi ± 0%   2.467Gi ± 0%  +0.39% (p=0.000 n=10)
geomean                                                                         93.55Mi        93.79Mi       +0.25%

                                                             │ /tmp/bench-before-relabel.txt │    /tmp/bench-after-relabel.txt    │
                                                             │           allocs/op           │  allocs/op   vs base               │
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-22                       2.640k ± 0%   2.650k ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-22                                 2.640k ± 0%   2.650k ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-22                           2.640k ± 0%   2.650k ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-22                      26.24k ± 0%   26.34k ± 0%  +0.35% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-22                                26.24k ± 0%   26.34k ± 0%  +0.36% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-22                          26.24k ± 0%   26.34k ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-22                               261.8k ± 0%   262.5k ± 0%  +0.27% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-22                         261.8k ± 0%   262.8k ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-22                     261.8k ± 0%   262.6k ± 0%  +0.29% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-22                     2.093M ± 0%   2.101M ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-22                               2.093M ± 0%   2.101M ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-22                         2.093M ± 0%   2.101M ± 0%  +0.38% (p=0.000 n=10)
geomean                                                                          78.50k        78.78k       +0.36%

Summary: Memory overhead is +0.25% geomean (groupSlice + groupLabels variables per group). Time is -26% geomean (likely due to ScratchBuilder.Overwrite() reusing buffers vs the old struct-copy + Labels() allocation). Allocation count increases by +0.36% geomean.

Testing

  • Added regression test TestSortedLabelsBlackboxRelabeling that reproduces the exact customer scenario (blackbox exporter with vendor group label + 2 targets)
  • Added workaround validation tests confirming which customer workarounds are effective
  • All existing tests pass (make precommit clean — fmt, vet, lint, 1955 tests, ensure-update-is-noop)

Customer Impact

This bug silently drops scrape targets when:

  1. static_configs has group-level labels AND multiple targets
  2. relabel_configs reads __address__ (which most configs do)
  3. The group label name sorts alphabetically after __address__ (any label starting with a-z)

Workaround: Remove group labels from static_configs and add them via metric_relabel_configs instead.

Link to tracking issue

Fixes open-telemetry/opentelemetry-operator#XXXX

gracewehner and others added 4 commits April 10, 2026 17:26
… fix silent target loss

processTargetGroups used a ScratchBuilder struct-copy optimization that
produced labels with two independently sorted sublists (group labels +
target labels) but not globally sorted. labels.Labels.Get() uses binary
search assuming globally sorted data. When group labels sort after target
labels (e.g. 'vendor' > '__address__'), Get('__address__') silently returns
empty string, causing relabel rules reading source_labels: [__address__] to
produce empty results.

Both targets then get identical post-relabel labels, creating hash collisions
in the allocator's target map which silently drops one target.

The fix materializes group labels once, then creates a fresh ScratchBuilder
per target and calls Sort() after all labels are added, ensuring globally
sorted Labels that work correctly with binary search.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Rename groupBuilder to targetBuilder via pointer alias in the per-target
loop to clarify that the builder is being reused for target label merging.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…hash collisions

ScratchBuilder.Labels() serializes labels in insertion order without
sorting. When group labels (e.g. vendor=nginx) sort after target labels
(e.g. __address__), Labels.Get() hits a sorted-order early-termination
check and returns empty for __address__. This causes relabel rules using
source_labels: [__address__] to produce identical output for all targets
in the group, creating hash collisions that silently drop targets in the
allocator's buildTargetMap.

Fix: replace the struct-copy pattern with a merge-sort that interleaves
sorted group labels and sorted target labels, keeping output sorted.
Uses ScratchBuilder.Overwrite for zero-allocation after the first
iteration (~0.3% benchmark overhead vs buggy baseline).

Add regression test and workaround validation tests confirming:
- The fix produces sorted labels and unique hashes
- The buggy code collapses targets via hash collision
- Workaround: removing group labels avoids the collision
- One-target-per-group does NOT avoid the collision

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gracewehner gracewehner marked this pull request as ready for review June 11, 2026 17:40
@gracewehner gracewehner requested a review from a team as a code owner June 11, 2026 17:40
@gracewehner

Copy link
Copy Markdown
Contributor Author

Hi @swiatekm this changes some optimizations from: #4587, would you be able to take a look? Thanks!

@swiatekm swiatekm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good to me, but I'd like to cut down on the overly verbose comments and move things around a bit.

Comment thread cmd/otel-allocator/internal/prehook/sorted_labels_regression_test.go Outdated
Comment thread cmd/otel-allocator/internal/prehook/sorted_labels_regression_test.go Outdated
Comment thread cmd/otel-allocator/internal/target/discovery.go Outdated
Comment thread cmd/otel-allocator/internal/prehook/sorted_labels_regression_test.go Outdated
Comment thread cmd/otel-allocator/internal/target/discovery.go Outdated
Comment thread cmd/otel-allocator/internal/target/discovery.go Outdated
@linux-foundation-easycla

linux-foundation-easycla Bot commented Jun 15, 2026

Copy link
Copy Markdown

CLA Signed
The committers listed above are authorized under a signed CLA.

@gracewehner gracewehner force-pushed the fix/ta-sorted-labels branch from 99d3224 to 221766d Compare June 15, 2026 19:32
@gracewehner

Copy link
Copy Markdown
Contributor Author

The change looks good to me, but I'd like to cut down on the overly verbose comments and move things around a bit.

Thanks @swiatekm ! Cleaned up the comments and debug tests, using model.LabelName now instead of string and moved the merge sort to a helper function to add unit tests.

@swiatekm swiatekm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now, minus one nitpick about the merge sort function arguments. It confused me initially, so I think it's worthwhile to address before we merge.

Comment thread cmd/otel-allocator/internal/target/discovery.go Outdated
@swiatekm swiatekm self-requested a review June 17, 2026 20:46

@swiatekm swiatekm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@swiatekm swiatekm merged commit 4fd3379 into open-telemetry:main Jun 18, 2026
96 of 99 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants