fix(target-allocator): sort labels in processTargetGroups to prevent hash collisions by gracewehner · Pull Request #4967 · open-telemetry/opentelemetry-operator

gracewehner · 2026-04-15T23:12:13Z

Description

When processTargetGroups builds labels for each target, it uses a struct-copy of the group ScratchBuilder and appends target labels. The resulting label set has two independently sorted sublists (group labels, then target labels) but is not globally sorted.

labels.Labels.Get() (in the stringlabels build, the default) uses a linear scan with sorted-order early termination: if it encounters a label name whose first byte is greater than the search key, it stops immediately and returns empty. When a group label like vendor (first byte v=118) precedes __address__ (first byte _=95), Get("__address__") sees 118 > 95 and breaks early — returning "".

This causes relabel_configs that read source_labels: [__address__] to produce identical output for all targets in the group, creating hash collisions in the allocator's map[ItemHash]*Item. The last target silently overwrites all others.

Fix

Replace the struct-copy pattern with a merge-sort that interleaves the sorted group labels and sorted target labels into globally sorted output. Uses ScratchBuilder.Overwrite() to reuse internal buffers, achieving ~0% overhead vs the buggy baseline in benchmarks.

Benchmarks

BenchmarkProcessTargetsWithRelabelConfig (exercises the affected code path), 10 runs, benchstat:

goos: linux
goarch: amd64
pkg: github.com/open-telemetry/opentelemetry-operator/cmd/otel-allocator
cpu: Intel(R) Core(TM) Ultra 7 165H
                                                             │ /tmp/bench-before-relabel.txt │     /tmp/bench-after-relabel.txt     │
                                                             │            sec/op             │    sec/op     vs base                │
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-22                      13.42m ± 25%   13.32m ± 24%        ~ (p=0.971 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-22                                14.34m ± 31%   12.66m ± 22%        ~ (p=0.280 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-22                          12.75m ± 51%   13.10m ± 38%        ~ (p=0.579 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-22                     192.5m ± 40%   104.8m ± 16%  -45.54% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-22                               200.1m ± 78%   110.3m ± 14%  -44.90% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-22                         136.5m ± 21%   119.9m ± 13%        ~ (p=0.190 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-22                             1191.7m ± 26%   959.7m ± 25%  -19.47% (p=0.009 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-22                         1.180 ± 14%    1.106 ±  7%        ~ (p=0.143 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-22                     1.292 ± 20%    1.085 ± 18%        ~ (p=0.105 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-22                     9.989 ± 11%    5.164 ±  6%  -48.31% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-22                               9.982 ± 11%    5.176 ±  4%  -48.14% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-22                         9.305 ± 10%    6.051 ± 17%  -34.97% (p=0.000 n=10)
geomean                                                                         408.8m         301.8m        -26.17%

                                                             │ /tmp/bench-before-relabel.txt │    /tmp/bench-after-relabel.txt     │
                                                             │             B/op              │     B/op      vs base               │
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-22                      3.110Mi ± 0%   3.122Mi ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-22                                3.110Mi ± 0%   3.122Mi ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-22                          3.110Mi ± 0%   3.122Mi ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-22                     31.13Mi ± 0%   31.21Mi ± 0%  +0.24% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-22                               31.13Mi ± 0%   31.22Mi ± 0%  +0.29% (p=0.001 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-22                         31.10Mi ± 0%   31.21Mi ± 0%  +0.36% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-22                              314.5Mi ± 1%   313.5Mi ± 1%       ~ (p=0.315 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-22                        314.5Mi ± 1%   315.8Mi ± 1%  +0.39% (p=0.009 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-22                    314.5Mi ± 1%   313.5Mi ± 1%       ~ (p=0.912 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-22                    2.457Gi ± 0%   2.467Gi ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-22                              2.457Gi ± 0%   2.467Gi ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-22                        2.457Gi ± 0%   2.467Gi ± 0%  +0.39% (p=0.000 n=10)
geomean                                                                         93.55Mi        93.79Mi       +0.25%

                                                             │ /tmp/bench-before-relabel.txt │    /tmp/bench-after-relabel.txt    │
                                                             │           allocs/op           │  allocs/op   vs base               │
ProcessTargetsWithRelabelConfig/consistent-hashing/1000-22                       2.640k ± 0%   2.650k ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/1000-22                                 2.640k ± 0%   2.650k ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/1000-22                           2.640k ± 0%   2.650k ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/10000-22                      26.24k ± 0%   26.34k ± 0%  +0.35% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/10000-22                                26.24k ± 0%   26.34k ± 0%  +0.36% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/10000-22                          26.24k ± 0%   26.34k ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/100000-22                               261.8k ± 0%   262.5k ± 0%  +0.27% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/100000-22                         261.8k ± 0%   262.8k ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/100000-22                     261.8k ± 0%   262.6k ± 0%  +0.29% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/consistent-hashing/800000-22                     2.093M ± 0%   2.101M ± 0%  +0.38% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/per-node/800000-22                               2.093M ± 0%   2.101M ± 0%  +0.39% (p=0.000 n=10)
ProcessTargetsWithRelabelConfig/least-weighted/800000-22                         2.093M ± 0%   2.101M ± 0%  +0.38% (p=0.000 n=10)
geomean                                                                          78.50k        78.78k       +0.36%

Summary: Memory overhead is +0.25% geomean (groupSlice + groupLabels variables per group). Time is -26% geomean (likely due to ScratchBuilder.Overwrite() reusing buffers vs the old struct-copy + Labels() allocation). Allocation count increases by +0.36% geomean.

Testing

Added regression test TestSortedLabelsBlackboxRelabeling that reproduces the exact customer scenario (blackbox exporter with vendor group label + 2 targets)
Added workaround validation tests confirming which customer workarounds are effective
All existing tests pass (make precommit clean — fmt, vet, lint, 1955 tests, ensure-update-is-noop)

Customer Impact

This bug silently drops scrape targets when:

static_configs has group-level labels AND multiple targets
relabel_configs reads __address__ (which most configs do)
The group label name sorts alphabetically after __address__ (any label starting with a-z)

Workaround: Remove group labels from static_configs and add them via metric_relabel_configs instead.

Link to tracking issue

Fixes open-telemetry/opentelemetry-operator#XXXX

… fix silent target loss processTargetGroups used a ScratchBuilder struct-copy optimization that produced labels with two independently sorted sublists (group labels + target labels) but not globally sorted. labels.Labels.Get() uses binary search assuming globally sorted data. When group labels sort after target labels (e.g. 'vendor' > '__address__'), Get('__address__') silently returns empty string, causing relabel rules reading source_labels: [__address__] to produce empty results. Both targets then get identical post-relabel labels, creating hash collisions in the allocator's target map which silently drops one target. The fix materializes group labels once, then creates a fresh ScratchBuilder per target and calls Sort() after all labels are added, ensuring globally sorted Labels that work correctly with binary search. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Rename groupBuilder to targetBuilder via pointer alias in the per-target loop to clarify that the builder is being reused for target label merging. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…hash collisions ScratchBuilder.Labels() serializes labels in insertion order without sorting. When group labels (e.g. vendor=nginx) sort after target labels (e.g. __address__), Labels.Get() hits a sorted-order early-termination check and returns empty for __address__. This causes relabel rules using source_labels: [__address__] to produce identical output for all targets in the group, creating hash collisions that silently drop targets in the allocator's buildTargetMap. Fix: replace the struct-copy pattern with a merge-sort that interleaves sorted group labels and sorted target labels, keeping output sorted. Uses ScratchBuilder.Overwrite for zero-allocation after the first iteration (~0.3% benchmark overhead vs buggy baseline). Add regression test and workaround validation tests confirming: - The fix produces sorted labels and unique hashes - The buggy code collapses targets via hash collision - Workaround: removing group labels avoids the collision - One-target-per-group does NOT avoid the collision Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

gracewehner · 2026-06-11T17:41:47Z

Hi @swiatekm this changes some optimizations from: #4587, would you be able to take a look? Thanks!

swiatekm

The change looks good to me, but I'd like to cut down on the overly verbose comments and move things around a bit.

linux-foundation-easycla · 2026-06-15T19:30:30Z

The committers listed above are authorized under a signed CLA.

✅ login: Copilot / name: Copilot (46c6411, 8848d5d, dd1ef4c, e5e1e66)
✅ login: gracewehner / name: Grace Wehner (221766d, 2c384df, 46c6411, 8848d5d, dd1ef4c, e5e1e66)

gracewehner · 2026-06-15T23:49:54Z

The change looks good to me, but I'd like to cut down on the overly verbose comments and move things around a bit.

Thanks @swiatekm ! Cleaned up the comments and debug tests, using model.LabelName now instead of string and moved the merge sort to a helper function to add unit tests.

swiatekm

Looks good to me now, minus one nitpick about the merge sort function arguments. It confused me initially, so I think it's worthwhile to address before we merge.

swiatekm

👍

gracewehner and others added 4 commits April 10, 2026 17:26

refactor: use pointer alias targetBuilder for clarity

8848d5d

Rename groupBuilder to targetBuilder via pointer alias in the per-target loop to clarify that the builder is being reused for target label merging. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add changelog entry for target allocator label sorting fix

dd1ef4c

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

gracewehner marked this pull request as ready for review June 11, 2026 17:40

gracewehner requested a review from a team as a code owner June 11, 2026 17:40

Merge branch 'main' into fix/ta-sorted-labels

2c384df

gracewehner requested review from frzifus and pavolloffay June 11, 2026 17:40

swiatekm reviewed Jun 13, 2026

View reviewed changes

feedback

221766d

gracewehner force-pushed the fix/ta-sorted-labels branch from 99d3224 to 221766d Compare June 15, 2026 19:32

gracewehner added 2 commits June 15, 2026 16:45

cleanup

bfb9e7a

Merge branch 'main' into fix/ta-sorted-labels

ea29cef

swiatekm reviewed Jun 17, 2026

View reviewed changes

Comment thread cmd/otel-allocator/internal/target/discovery.go Outdated

move sort into merge helper function

100081e

swiatekm self-requested a review June 17, 2026 20:46

swiatekm approved these changes Jun 18, 2026

View reviewed changes

swiatekm merged commit 4fd3379 into open-telemetry:main Jun 18, 2026
96 of 99 checks passed

azure-monitor-assistant Bot mentioned this pull request Jul 3, 2026

build(deps): Upgrade otelcollector to v0.154.0 Azure/prometheus-collector#1597

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(target-allocator): sort labels in processTargetGroups to prevent hash collisions#4967

fix(target-allocator): sort labels in processTargetGroups to prevent hash collisions#4967
swiatekm merged 9 commits into
open-telemetry:mainfrom
gracewehner:fix/ta-sorted-labels

gracewehner commented Apr 15, 2026 •

edited

Loading

Uh oh!

gracewehner commented Jun 11, 2026

Uh oh!

swiatekm left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

linux-foundation-easycla Bot commented Jun 15, 2026 •

edited

Loading

Uh oh!

gracewehner commented Jun 15, 2026

Uh oh!

swiatekm left a comment

Uh oh!

Uh oh!

swiatekm left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

gracewehner commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Fix

Benchmarks

Testing

Customer Impact

Link to tracking issue

Uh oh!

gracewehner commented Jun 11, 2026

Uh oh!

swiatekm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

linux-foundation-easycla Bot commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gracewehner commented Jun 15, 2026

Uh oh!

swiatekm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

swiatekm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gracewehner commented Apr 15, 2026 •

edited

Loading

linux-foundation-easycla Bot commented Jun 15, 2026 •

edited

Loading