Fix kafka flakiness #8629
Benchmarks

Benchmark execution time: 2026-05-14 14:12:43

Comparing candidate commit 8fc8077 in PR branch

Some scenarios are present only in baseline or only in candidate runs. If you didn't create or remove some scenarios in your branch, this may be a sign of crashed benchmarks 💥💥💥

Scenarios present only in baseline:
Found 4 performance improvements and 4 performance regressions! Performance is the same for 47 metrics, 17 unstable metrics, 85 known flaky benchmarks, 41 flaky benchmarks without significant changes.
    // see a topic with zero partitions.
    if (!await WaitForTopicMetadata(topicName, numPartitions, config, TimeSpan.FromSeconds(30)))
    {
        throw new Exception($"Topic {topicName} metadata did not propagate with {numPartitions} partitions");
If this is thrown, it's an infra flake, right? So should we actually catch this in the sample and exit with our special "flake" exit code? What do you think? 🤔
Nice catch! Thanks! Added!
Execution-Time Benchmarks Report ⏱️

Execution-time results for samples comparing This PR (8629) and master.

✅ No regressions detected - check the details below

Full Metrics Comparison: FakeDbCommand, HttpMessageHandler

Comparison explanation

Execution-time benchmarks measure the whole time it takes to execute a program, and are intended to measure the one-off costs. Cases where the execution time results for the PR are worse than latest master results are highlighted in **red**. The following thresholds were used for comparing the execution times:

Note that these results are based on a single point-in-time result for each branch. For full results, see the dashboard. Graphs show the p99 interval based on the mean and StdDev of the test run, as well as the mean value of the run (shown as a diamond below the graph).

Duration charts

Execution time (ms) per sample, runtime, and scenario. Each cell gives the mean followed by the plotted interval.

| Sample | Runtime | Scenario | This PR (8629) | master |
| --- | --- | --- | --- | --- |
| FakeDbCommand | .NET Framework 4.8 | Baseline | 75 (71, 79) | 73 (70, 75) |
| FakeDbCommand | .NET Framework 4.8 | Bailout | 78 (74, 81) | 77 (75, 79) |
| FakeDbCommand | .NET Framework 4.8 | CallTarget+Inlining+NGEN | 1,100 (1045, 1154) | 1,095 (1055, 1136) |
| FakeDbCommand | .NET Core 3.1 | Baseline | 116 (109, 122) | 113 (110, 116) |
| FakeDbCommand | .NET Core 3.1 | Bailout | 115 (112, 117) | 117 (111, 122) |
| FakeDbCommand | .NET Core 3.1 | CallTarget+Inlining+NGEN | 784 (758, 809) | 787 (764, 811) |
| FakeDbCommand | .NET 6 | Baseline | 101 (98, 105) | 101 (98, 105) |
| FakeDbCommand | .NET 6 | Bailout | 102 (99, 105) | 102 (100, 105) |
| FakeDbCommand | .NET 6 | CallTarget+Inlining+NGEN | 951 (901, 1000) | 952 (908, 995) |
| FakeDbCommand | .NET 8 | Baseline | 103 (97, 110) | 102 (96, 109) |
| FakeDbCommand | .NET 8 | Bailout | 102 (98, 106) | 101 (98, 103) |
| FakeDbCommand | .NET 8 | CallTarget+Inlining+NGEN | 824 (789, 858) | 821 (781, 861) |
| HttpMessageHandler | .NET Framework 4.8 | Baseline | 200 (193, 206) | 199 (192, 206) |
| HttpMessageHandler | .NET Framework 4.8 | Bailout | 203 (198, 208) | 204 (199, 209) |
| HttpMessageHandler | .NET Framework 4.8 | CallTarget+Inlining+NGEN | 1,201 (1162, 1240) | 1,204 (1158, 1250) |
| HttpMessageHandler | .NET Core 3.1 | Baseline | 287 (279, 294) | 288 (279, 297) |
| HttpMessageHandler | .NET Core 3.1 | Bailout | 288 (280, 296) | 289 (281, 298) |
| HttpMessageHandler | .NET Core 3.1 | CallTarget+Inlining+NGEN | 969 (949, 989) | 966 (938, 994) |
| HttpMessageHandler | .NET 6 | Baseline | 284 (277, 291) | 282 (276, 288) |
| HttpMessageHandler | .NET 6 | Bailout | 283 (278, 289) | 281 (277, 286) |
| HttpMessageHandler | .NET 6 | CallTarget+Inlining+NGEN | 1,166 (1120, 1211) | 1,157 (1116, 1198) |
| HttpMessageHandler | .NET 8 | Baseline | 281 (271, 290) | 277 (267, 286) |
| HttpMessageHandler | .NET 8 | Bailout | 281 (275, 287) | 278 (268, 288) |
| HttpMessageHandler | .NET 8 | CallTarget+Inlining+NGEN | 1,039 (996, 1083) | 1,039 (981, 1097) |
Summary of changes

Fix the `DataStreamsMonitoringKafkaTests.HandlesBatchProcessing` (and related) flake by making the test sample wait for Kafka metadata to propagate after topic creation, and by hardening the partition picker in the producer helper against a transient `0` partition count.

Reason for change
The `HandlesBatchProcessing` test failed in CI with a Verify mismatch where every `kafka_produce` backlog entry collapsed onto `partition:0`.

The four missing entries are all producer-side backlogs for partitions 1 and 2. Consumer-side `kafka_commit` backlogs are present for all partitions, and the per-topic in/out DSM edges are present too; only the producer collapsed onto partition 0.

Direct confirmation in the sample's stdout: every produce went to partition `[[0]]`:

…

instead of the expected `[[0]]`, `[[1]]`, `[[2]]` distribution. Topic creation took 4.2s (`Finished creating topics: 0:00:04.2190968`) and producing started ~1s later, well within the metadata-propagation window where `AdminClient.GetMetadata` can return the topic with zero partitions.

Root cause is a race in the test sample:

1. `TopicHelpers.TryCreateTopic` returns as soon as Kafka acks the `CreateTopics` request, but broker metadata propagation can lag.
2. `Producer.GetPartition` calls `AdminClient.GetMetadata(topic, 5s)` to learn the partition count. If metadata hasn't propagated yet, the topic comes back with zero partitions and `GetTopicPartitionCount` returns `0`.
3. That `0` gets cached in a static `ConcurrentDictionary<string,int>` for the lifetime of the sample process via `GetOrAdd`.
4. With `numPartitions == 0`, the `partition >= numPartitions` guard is always true, so every subsequent produce is pinned to partition 0 for the entire run.

This is timing-dependent (flaky, not always failing), and once the sample loses the race the whole run is doomed.

This is the same family of flake addressed in #7211 ("Alternative approach to fixing flaky DSM Kafka tests"), but the partition-count race in the sample wasn't covered there.
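The caching step of the race can be reproduced without a broker. The following is a minimal sketch, not the actual `Samples.Kafka` code: `PartitionCacheRaceSketch`, `Pick`, and the `getPartitionCount` delegate are hypothetical names, and the delegate stands in for the metadata lookup so that its first answer can be forced to `0`.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for the sample's partition picker (illustration only).
public static class PartitionCacheRaceSketch
{
    private static readonly ConcurrentDictionary<string, int> Cache =
        new ConcurrentDictionary<string, int>();

    // Caches whatever the first lookup returns, including 0, for the
    // lifetime of the process. This mirrors the failure mode described above.
    public static int Pick(string topic, int requested, Func<string, int> getPartitionCount)
    {
        var numPartitions = Cache.GetOrAdd(topic, getPartitionCount);

        // With numPartitions == 0 this guard is true for every requested
        // partition, so every call returns partition 0.
        return requested >= numPartitions ? 0 : requested;
    }
}
```

If the delegate returns `0` on its first call and `3` afterwards, `Pick("topic", 1, ...)` and `Pick("topic", 2, ...)` both return `0`, because the second lookup never happens: the cached `0` is reused.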
Implementation details

Two changes, both in the test sample only; no tracer code touched.

- `Samples.Kafka/TopicHelpers.cs`: fix the root cause. After `CreateTopicsAsync` succeeds (or returns `TopicAlreadyExists`), poll `GetMetadata` until the topic is visible with the expected partition count, with a 30s bounded timeout. If metadata never propagates, throw a descriptive exception instead of silently returning. The original throw-on-exhausted-retries behavior is preserved.
- `Samples.Kafka/Producer.cs`: defense in depth. `GetPartition` no longer caches a `0` partition count. If `GetTopicPartitionCount` returns 0 (shouldn't happen anymore after the TopicHelpers fix, but in case anyone calls this without going through `TryCreateTopic`), we fall back to partition 0 for that single call and re-query on the next produce instead of pinning the whole run.

Test coverage

Existing `DataStreamsMonitoringKafkaTests` cover the scenario; the failure mode this PR fixes is exactly the one observed in the failing build. No new tests added: this is a flake fix in a sample app, not a behavior change in the tracer.

Other details

`Samples.Kafka/TopicHelpers.cs` is used by every Kafka-producing sample test, so this change broadly hardens that family of tests against the metadata-propagation race.
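For reference, the two changes listed under Implementation details could look roughly like the sketch below. It is an illustration under assumptions, not the code in this PR: `KafkaSampleFixSketch`, `WaitForPartitionsAsync`, `Pick`, and the delegates are hypothetical names, the delegates stand in for the `GetMetadata` lookup so the sketch runs without a broker, and the poll interval is an arbitrary choice.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical sketch of the two hardening steps (illustration only).
public static class KafkaSampleFixSketch
{
    private static readonly ConcurrentDictionary<string, int> Cache =
        new ConcurrentDictionary<string, int>();

    // Polls until the lookup reports the expected partition count,
    // or returns false once the bounded timeout has elapsed.
    public static async Task<bool> WaitForPartitionsAsync(
        Func<int> getPartitionCount,
        int expectedPartitions,
        TimeSpan timeout,
        TimeSpan pollInterval)
    {
        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            if (getPartitionCount() == expectedPartitions)
            {
                return true;
            }

            if (stopwatch.Elapsed >= timeout)
            {
                return false;
            }

            await Task.Delay(pollInterval);
        }
    }

    // Caches only positive partition counts. A 0 falls back to partition 0
    // for this call and triggers a fresh lookup on the next call.
    public static int Pick(string topic, int requested, Func<string, int> getPartitionCount)
    {
        if (!Cache.TryGetValue(topic, out var numPartitions))
        {
            numPartitions = getPartitionCount(topic);
            if (numPartitions > 0)
            {
                Cache.TryAdd(topic, numPartitions);
            }
        }

        return requested >= numPartitions ? 0 : requested;
    }
}
```

A caller would treat a `false` result from `WaitForPartitionsAsync` as the point at which to throw the descriptive exception. With a lookup that returns `0` once and `3` afterwards, `Pick` returns `0` for the first call only and then honors the requested partition.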