Commit 05f3138
Fix kafka flakiness (#8629)
## Summary of changes
Fix the `DataStreamsMonitoringKafkaTests.HandlesBatchProcessing` (and
related) flake by making the test sample wait for Kafka metadata to
propagate after topic creation, and by hardening the partition-picker in
the producer helper against a transient `0` partition count.
## Reason for change
The `HandlesBatchProcessing` test failed in CI with a Verify mismatch:
every `kafka_produce` backlog entry collapsed onto `partition:0`, so the
snapshot diff reported these expected entries as missing:
```
- partition:1, topic:data-streams-batch-processing-1-..., type:kafka_produce
- partition:1, topic:data-streams-batch-processing-2-..., type:kafka_produce
- partition:2, topic:data-streams-batch-processing-1-..., type:kafka_produce
- partition:2, topic:data-streams-batch-processing-2-..., type:kafka_produce
```
The four missing entries are all producer-side backlogs for partitions 1
and 2. Consumer-side `kafka_commit` backlogs are present for all
partitions, and the per-topic in/out DSM edges are present too — only
the producer collapsed onto partition 0.
**Direct confirmation in the sample's stdout** — every produce went to
partition `[[0]]`:
```
Produced message to: data-streams-batch-processing-1-HandlesBatchProcessing [[0]] @0
Produced message to: data-streams-batch-processing-1-HandlesBatchProcessing [[0]] @1
Produced message to: data-streams-batch-processing-1-HandlesBatchProcessing [[0]] @2
...
Produced message to: data-streams-batch-processing-2-HandlesBatchProcessing [[0]] @0
Produced message to: data-streams-batch-processing-2-HandlesBatchProcessing [[0]] @1
Produced message to: data-streams-batch-processing-2-HandlesBatchProcessing [[0]] @2
```
…instead of the expected `[[0]]`, `[[1]]`, `[[2]]` distribution. Topic
creation took 4.2s (`Finished creating topics: 0:00:04.2190968`) and
producing started ~1s later — well within the metadata-propagation
window where `AdminClient.GetMetadata` can return the topic with zero
partitions.
**Root cause** is a race in the test sample:
1. `TopicHelpers.TryCreateTopic` returns as soon as Kafka acks the
`CreateTopics` request, but broker metadata propagation can lag.
2. `Producer.GetPartition` calls `AdminClient.GetMetadata(topic, 5s)` to
learn the partition count. If metadata hasn't propagated yet, the topic
comes back with zero partitions and `GetTopicPartitionCount` returns
`0`.
3. That `0` gets cached in a static `ConcurrentDictionary<string,int>`
for the lifetime of the sample process via `GetOrAdd`.
4. With `numPartitions == 0`, the `partition >= numPartitions` guard is
always true, so every subsequent produce is pinned to partition 0 — for
the entire run.
This is timing-dependent (flaky, not always failing), and once it loses
the race the whole run is doomed.
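The caching behavior in steps 3 and 4 can be reproduced without a broker. The sketch below is an illustration, not the sample's actual code: `RoundRobinPicker`, `Pick`, and the `lookup` delegate are hypothetical names, and the round-robin counter is an assumed shape for the picker. Only the `GetOrAdd` caching and the `partition >= numPartitions` guard come from the description above.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical reproduction of the race's effect; not the Samples.Kafka code.
public sealed class RoundRobinPicker
{
    private readonly ConcurrentDictionary<string, int> _counts =
        new ConcurrentDictionary<string, int>();

    // Assumed round-robin cursor (single-threaded use only in this sketch).
    private int _next;

    // lookup stands in for the metadata-based partition count query.
    public int Pick(string topic, Func<string, int> lookup)
    {
        // GetOrAdd stores whatever the first lookup returns,
        // including a transient 0, and never asks again.
        var numPartitions = _counts.GetOrAdd(topic, lookup);

        var partition = _next++;
        if (partition >= numPartitions)
        {
            // With numPartitions == 0 this branch is taken on every call.
            partition = 0;
            _next = 1;
        }

        return partition;
    }
}
```

If the first `lookup` call returns `0` and later calls would return `3`, `Pick` still returns `0` every time, because the delegate is never invoked a second time for that topic.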
This is the same family of flake addressed in #7211 ("Alternative
approach to fixing flaky DSM Kafka tests"), but the partition-count race
in the sample wasn't covered there.
## Implementation details
Two changes, both in the test sample only — no tracer code touched.
**`Samples.Kafka/TopicHelpers.cs` — fix the root cause.** After
`CreateTopicsAsync` succeeds (or returns `TopicAlreadyExists`), poll
`GetMetadata` until the topic is visible with the expected partition
count, with a 30s bounded timeout. If metadata never propagates, throw a
descriptive exception instead of silently returning. The original
throw-on-exhausted-retries behavior is preserved.
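A minimal sketch of the bounded wait described above, under stated assumptions: `MetadataWait`, `WaitForPartitions`, the `getPartitionCount` delegate, and the poll interval are hypothetical and stand in for whatever the sample uses around `GetMetadata`. The delegate keeps the sketch runnable without a broker.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper; not the actual TopicHelpers implementation.
public static class MetadataWait
{
    // getPartitionCount stands in for a GetMetadata-based lookup that
    // returns 0 while the topic is not yet visible.
    public static void WaitForPartitions(
        string topic,
        int expected,
        Func<string, int> getPartitionCount,
        TimeSpan timeout,
        TimeSpan pollInterval)
    {
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            var observed = getPartitionCount(topic);
            if (observed == expected)
            {
                return; // metadata has propagated
            }

            if (stopwatch.Elapsed >= timeout)
            {
                // Fail loudly instead of returning with stale metadata.
                throw new TimeoutException(
                    $"Topic '{topic}' reported {observed} partitions, " +
                    $"expected {expected}, after {timeout.TotalSeconds:F0}s.");
            }

            Thread.Sleep(pollInterval);
        }
    }
}
```

A caller would invoke this right after topic creation, for example with a 30s timeout as in the description; the exact-match check on the partition count is this sketch's choice.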
**`Samples.Kafka/Producer.cs` — defense in depth.** `GetPartition` no
longer caches a `0` partition count. If `GetTopicPartitionCount` returns
0 (shouldn't happen anymore after the TopicHelpers fix, but in case
anyone calls this without going through `TryCreateTopic`), we fall back
to partition 0 for that single call and re-query on the next produce
instead of pinning the whole run.
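The "never cache a zero" rule might look roughly like the following. This is a sketch under assumptions, not the actual `Producer.cs` change: `SafePartitionPicker`, `Pick`, the `lookup` delegate, and the round-robin cursor are hypothetical, and the class is not thread-safe.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical illustration of the defensive picker; not the Samples.Kafka code.
public sealed class SafePartitionPicker
{
    private readonly ConcurrentDictionary<string, int> _counts =
        new ConcurrentDictionary<string, int>();

    // Assumed round-robin cursor (single-threaded use only in this sketch).
    private int _next;

    // lookup stands in for the metadata-based partition count query.
    public int Pick(string topic, Func<string, int> lookup)
    {
        if (!_counts.TryGetValue(topic, out var numPartitions))
        {
            numPartitions = lookup(topic);
            if (numPartitions <= 0)
            {
                // Metadata not ready: use partition 0 for this call only
                // and cache nothing, so the next call queries again.
                return 0;
            }

            _counts[topic] = numPartitions;
        }

        var partition = _next++;
        if (partition >= numPartitions)
        {
            partition = 0;
            _next = 1;
        }

        return partition;
    }
}
```

Here a transient `0` costs one produce on partition 0; once the lookup reports a positive count, that count is cached and picks spread across partitions again.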
## Test coverage
Existing `DataStreamsMonitoringKafkaTests` cover the scenario; the
failure mode this PR fixes is exactly the one observed in the failing
build. No new tests added — this is a flake fix in a sample app, not a
behavior change in the tracer.
## Other details
- Test sample code only; no tracer / production code changes.
- `Samples.Kafka/TopicHelpers.cs` is used by every Kafka-producing
sample test, so it broadly hardens that family of tests against the
metadata-propagation race.

1 parent 6bb7fc9 commit 05f3138
2 files changed
Lines changed: 63 additions & 4 deletions
Lines changed: 16 additions & 1 deletion
Lines changed: 47 additions & 3 deletions