Commit 8d80a09
NCBC-4160: AppTelemetry reports significantly lower metrics than it should
Motivation
----------
- Histogram bins are not thread-safe. Two threads can read `currentValue` at the same time, each compute count + 1, and each write back the same value, so one increment is silently lost.
- Race condition in TryExportMetricsAndReset: An increment thread obtains a reference to metricSetA via GetOrAdd (inside lock), then the export thread takes a snapshot and clears the dictionary.
The increment thread then writes to metricSetA, but metricSetA has already been snapshotted and will be discarded after export, so the increment is permanently lost.
Alternatively, a new entry metricSetB is added to the dictionary AFTER the snapshot ToArray() but BEFORE Clear(), so it ends up in neither the snapshot nor the post-clear dictionary.
- A WebSocket send failure permanently loses exported metrics: TryExportMetricsAndReset clears the dictionary before the WebSocket send, so if SendAsync fails (connection drop, timeout), those metrics are lost.
- Backoff never resets after a successful connection, so the delay stays at its maximum (up to 1h) after every disconnect.
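
The first bullet describes a classic lost update. As a minimal sketch of the problem and of the usual remedy (a lock around the read-modify-write), the two classes below contrast an unsynchronized bin with a locked one. `UnsafeBin` and `LockedBin` are hypothetical stand-ins written for illustration, not the SDK's `AppTelemetryHistogramBins`; only the method name `IncrementCountAndSum` is taken from this commit message.

```csharp
// Hypothetical stand-in for one histogram bin; not the SDK's actual type.
public sealed class UnsafeBin
{
    private long _count;
    private double _sum;

    public long Count => _count;

    // Unsynchronized read-modify-write: two threads can read the same
    // _count, each add 1, and each store the same result, losing one update.
    public void IncrementCountAndSum(double value)
    {
        _count = _count + 1;
        _sum = _sum + value;
    }
}

// Same bin, with the read-modify-write guarded by a private lock object.
public sealed class LockedBin
{
    private readonly object _lock = new object();
    private long _count;
    private double _sum;

    public long Count { get { lock (_lock) { return _count; } } }
    public double Sum { get { lock (_lock) { return _sum; } } }

    // Count and sum are updated together while holding the lock, so
    // concurrent callers cannot overwrite each other's increments.
    public void IncrementCountAndSum(double value)
    {
        lock (_lock)
        {
            _count += 1;
            _sum += value;
        }
    }
}
```

Calling `IncrementCountAndSum(1.0)` on a `LockedBin` from many threads (for example via `Parallel.For`) should leave `Count` equal to the number of calls, whereas `UnsafeBin` may end up lower under contention.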
Changes
-------
- `AppTelemetryHistogramBins.cs`: Added a lock around the read-modify-write in IncrementCountAndSum() to prevent concurrent threads from overwriting each other's updates to bin counts and sums.
- `AppTelemetryCollector.cs`: Replaced the snapshot-then-clear pattern with an atomic dictionary swap under _metricsLock.
The old dictionary is swapped out and exported while new increments go to the fresh dictionary, removing the window where metrics could be lost between snapshot and clear.
Also fixed Disable() to use the same swap pattern under the metrics lock.
- `WebSocketClientHandler.cs`: Added a _pendingMetrics cache. If SendAsync fails after export, the serialized metrics are saved and retried on the next telemetry request instead of being lost.
- `WebSocketClientHandler.cs`: Reset _attempt to 0 once a WebSocket connection opens successfully, so the backoff restarts at 100ms instead of staying at the max (up to 1h) after a connection terminates.
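
The dictionary swap described for `AppTelemetryCollector.cs` can be sketched roughly as follows. This is an illustration under stated assumptions, not the SDK's code: `SwapCollector`, `Increment`, and `ExportAndReset` are hypothetical names, the metric value is simplified to a `long` counter, and the whole increment is performed under the lock, which may differ from the real collector. Only `_metricsLock` is a name taken from this commit message.

```csharp
using System.Collections.Generic;

// Hypothetical, simplified collector illustrating swap-under-lock.
public sealed class SwapCollector
{
    private readonly object _metricsLock = new object();
    private Dictionary<string, long> _metrics = new Dictionary<string, long>();

    // Assumption: the entire increment happens under the lock, so no writer
    // can hold a reference to a dictionary that has already been swapped out.
    public void Increment(string key)
    {
        lock (_metricsLock)
        {
            _metrics.TryGetValue(key, out var current); // 0 when key is absent
            _metrics[key] = current + 1;
        }
    }

    // Swap instead of snapshot-then-clear: the old dictionary is handed to
    // the exporter and a fresh one takes its place in a single locked step,
    // so there is no window between "snapshot" and "clear".
    public Dictionary<string, long> ExportAndReset()
    {
        lock (_metricsLock)
        {
            var exported = _metrics;
            _metrics = new Dictionary<string, long>();
            return exported;
        }
    }
}
```

Every increment lands either in the exported dictionary or in the fresh one, never in neither. The caller owns the returned dictionary exclusively, so it can be serialized, or kept for a retry after a failed send, without holding the lock.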
Change-Id: I1a96ddbd0c908b1ec90a20868aed264ca499b607
Reviewed-on: https://review.couchbase.org/c/couchbase-net-client/+/241592
Tested-by: Build Bot <build@couchbase.com>
Reviewed-by: David Kelly <davidmichaelkelly@gmail.com>

1 parent 180f500 commit 8d80a09
3 files changed
Lines changed: 61 additions & 27 deletions
File tree
- src/Couchbase/Core/Diagnostics/Metrics/AppTelemetry
Lines changed: 14 additions & 13 deletions
Lines changed: 17 additions & 8 deletions
Lines changed: 30 additions & 6 deletions
0 commit comments