Commit b286c02

NoahStapp and tadjik1 authored
DRIVERS-3239 - Token buckets disabled by default (#1902)
Co-authored-by: Sergey Zelenov <mail@zelenov.su>
1 parent 7ef1b81 commit b286c02

5 files changed

Lines changed: 173 additions & 37 deletions

source/client-backpressure/client-backpressure.md

Lines changed: 51 additions & 36 deletions
@@ -110,16 +110,10 @@ overload error, including those not eligible for retry under the
110110
updateMany, create collection, getMore, and generic runCommand. The new command execution method obeys the following
111111
rules:
112112

113-
1. `attempt` is the execution attempt number (starting with 0). Note that `attempt` includes retries for errors that
114-
are not overload errors (this might include attempts under other retry policies, see
113+
1. `attempt` is the execution attempt number (starting with 0). Note that `attempt` includes retries for errors that are
114+
not overload errors (this might include attempts under other retry policies, see
115115
[Interactions with Other Retry Policies](./client-backpressure.md#interaction-with-other-retry-policies)).
116-
2. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.
117-
- The value is 0.1 and non-configurable.
118-
3. If the command succeeds on a retry attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE`+1 tokens.
119-
4. If a retry attempt fails with an error that is not an overload error, drivers MUST deposit 1 token.
120-
- An error that does not contain the `SystemOverloadedError` error label indicates that the server is healthy enough
121-
to handle requests. For the purposes of retry budget tracking, this counts as a success.
122-
5. A retry attempt will only be permitted if:
116+
2. A retry attempt will only be permitted if:
123117
1. The error is a retryable overload error.
124118
2. We have not reached `MAX_RETRIES`.
125119
- The value of `MAX_RETRIES` is 5 and non-configurable.
@@ -128,43 +122,58 @@ rules:
128122
3. (CSOT-only): There is still time for a retry attempt according to the
129123
[Client Side Operations Timeout](../client-side-operations-timeout/client-side-operations-timeout.md)
130124
specification.
131-
4. A token can be consumed from the token bucket.
132-
5. The command is a write and [retryWrites](../retryable-writes/retryable-writes.md#retrywrites) is enabled or the
125+
4. The command is a write and [retryWrites](../retryable-writes/retryable-writes.md#retrywrites) is enabled or the
133126
command is a read and [retryReads](../retryable-reads/retryable-reads.md#retryreads) is enabled.
134127
- To retry `runCommand`, both [retryWrites](../retryable-writes/retryable-writes.md#retrywrites) and
135-
[retryReads](../retryable-reads/retryable-reads.md#retryreads) must be enabled. See
128+
[retryReads](../retryable-reads/retryable-reads.md#retryreads) MUST be enabled. See
136129
[Why must both `retryWrites` and `retryReads` be enabled to retry runCommand?](client-backpressure.md#why-must-both-retrywrites-and-retryreads-be-enabled-to-retry-runcommand)
137-
6. A retry attempt consumes 1 token from the token bucket.
138-
7. If the request is eligible for retry (as outlined in step 5), the client MUST apply exponential backoff according to
139-
the following formula: `backoff = jitter * min(MAX_BACKOFF, BASE_BACKOFF * 2^(attempt - 1))`
130+
3. If the request is eligible for retry (as outlined in step 2 above and step 4 in the
131+
[adaptive retry requirements](client-backpressure.md#adaptive-retry-requirements) below), the client MUST apply
132+
exponential backoff according to the following formula:
133+
`backoff = jitter * min(MAX_BACKOFF, BASE_BACKOFF * 2^(attempt - 1))`
140134
- `jitter` is a random jitter value between 0 and 1.
141135
- `BASE_BACKOFF` is constant 100ms.
142136
- `MAX_BACKOFF` is 10000ms.
143137
- This results in delays of 100ms, 200ms, 400ms, 800ms, and 1600ms before accounting for jitter.
144-
8. If the request is eligible for retry (as outlined in step 5), the client MUST add the previously used server's
145-
address to the list of deprioritized server addresses for
138+
4. If the request is eligible for retry (as outlined in step 2 above and step 4 in the
139+
[adaptive retry requirements](client-backpressure.md#adaptive-retry-requirements) below), the client MUST add the
140+
previously used server's address to the list of deprioritized server addresses for
146141
[server selection](../server-selection/server-selection.md).
147-
9. If the request is eligible for retry (as outlined in step 5) and is a retryable write:
142+
5. If the request is eligible for retry (as outlined in step 2 above and step 4 in the
143+
[adaptive retry requirements](client-backpressure.md#adaptive-retry-requirements) below) and is a retryable write:
148144
1. If the command is a part of a transaction, the instructions for command modification on retry for commands in
149145
transactions MUST be followed, as outlined in the
150146
[transactions](../transactions/transactions.md#interaction-with-retryable-writes) specification.
151147
2. If the command is not a part of a transaction, the instructions for command modification on retry for retryable
152148
writes MUST be followed, as outlined in the [retryable writes](../retryable-writes/retryable-writes.md)
153149
specification.
154-
10. If the request is not eligible for any retries, then the client MUST propagate errors following the behaviors
150+
6. If the request is not eligible for any retries, then the client MUST propagate errors following the behaviors
155151
described in the [retryable reads](../retryable-reads/retryable-reads.md),
156-
[retryable writes](../retryable-writes/retryable-writes.md) and the
157-
[transactions](../transactions/transactions.md) specifications.
152+
[retryable writes](../retryable-writes/retryable-writes.md) and the [transactions](../transactions/transactions.md)
153+
specifications.
158154
- For the purposes of error propagation, `runCommand` is considered a write.
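
The backoff formula in the rules above can be illustrated with a minimal Python sketch. The function name `backoff_ms` and the optional `jitter` parameter are assumptions made for this illustration and are not part of the specification; only the constants and the formula come from the text. Here `attempt` is 1 for the first retry, which reproduces the listed delays.

```python
import random
from typing import Optional

# Constants taken from the rules above (values in milliseconds).
BASE_BACKOFF_MS = 100
MAX_BACKOFF_MS = 10_000
MAX_RETRIES = 5


def backoff_ms(attempt: int, jitter: Optional[float] = None) -> float:
    """Return the delay before retry number `attempt` (1 for the first retry).

    `jitter` can be passed explicitly to make the result deterministic;
    by default a random value in [0, 1) is drawn. This helper is
    hypothetical and only mirrors the formula in the text.
    """
    if attempt < 1:
        raise ValueError("backoff only applies to retry attempts (attempt >= 1)")
    if jitter is None:
        jitter = random.random()
    return jitter * min(MAX_BACKOFF_MS, BASE_BACKOFF_MS * 2 ** (attempt - 1))


# Upper bounds of the delay (jitter fixed at 1) for each permitted retry.
upper_bounds = [backoff_ms(a, jitter=1.0) for a in range(1, MAX_RETRIES + 1)]
```

With `MAX_RETRIES` at 5 the exponential term never reaches `MAX_BACKOFF`; the cap would only matter if the retry limit were raised.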
159155

156+
##### Adaptive retry requirements
157+
158+
If adaptive retries are enabled, the following rules MUST also be obeyed:
159+
160+
1. If the command succeeds on the first attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE` tokens.
161+
- The value is 0.1 and non-configurable.
162+
2. If the command succeeds on a retry attempt, drivers MUST deposit `RETRY_TOKEN_RETURN_RATE`+1 tokens.
163+
3. If a retry attempt fails with an error that is not an overload error, drivers MUST deposit 1 token.
164+
- An error that does not contain the `SystemOverloadedError` error label indicates that the server is healthy enough
165+
to handle requests. For the purposes of retry budget tracking, this counts as a success.
166+
4. A retry attempt will only be permitted if a token can be consumed from the token bucket.
167+
5. A retry attempt consumes 1 token from the token bucket.
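
One possible shape for a bucket that satisfies the adaptive retry rules above is sketched below. The class name `TokenBucket` and its method names are assumptions chosen to match the calls in the pseudocode of this diff (`deposit`, `consume`); the specification's own token bucket pseudocode is not reproduced here and may differ.

```python
import threading

# Constants named in the specification text and tests.
RETRY_TOKEN_RETURN_RATE = 0.1
DEFAULT_RETRY_TOKEN_CAPACITY = 1000


class TokenBucket:
    """Hypothetical thread-safe token bucket that starts at full capacity."""

    def __init__(self, capacity: float = DEFAULT_RETRY_TOKEN_CAPACITY):
        self._lock = threading.Lock()
        self._capacity = float(capacity)
        self._tokens = float(capacity)

    def deposit(self, amount: float) -> None:
        # Deposits never push the bucket above its capacity.
        with self._lock:
            self._tokens = min(self._capacity, self._tokens + amount)

    def consume(self, amount: float = 1.0) -> bool:
        # Returns False, leaving the bucket unchanged, if too few tokens remain.
        with self._lock:
            if self._tokens < amount:
                return False
            self._tokens -= amount
            return True

    @property
    def tokens(self) -> float:
        with self._lock:
            return self._tokens
```

A success on the first attempt would call `deposit(RETRY_TOKEN_RETURN_RATE)`, a success on a retry would call `deposit(RETRY_TOKEN_RETURN_RATE + 1)`, and each permitted retry would first call `consume(1)`.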
168+
160169
#### Interaction with Other Retry Policies
161170

162171
The retry policy in this specification is separate from the other retry policies defined in the
163172
[retryable reads](../retryable-reads/retryable-reads.md) and [retryable writes](../retryable-writes/retryable-writes.md)
164173
specifications. Drivers MUST ensure:
165174

166175
- Only overload errors consume tokens from the token bucket before retrying.
167-
- When a failed attempt is retried, backoff must be applied if and only if the error is an overload error.
176+
- When a failed attempt is retried, backoff MUST be applied if and only if the error is an overload error.
168177
- If an overload error is encountered:
169178
- Regardless of whether CSOT is enabled or not, the maximum number of retries for any retry policy becomes
170179
`MAX_RETRIES`.
@@ -196,11 +205,12 @@ def execute_command_retryable(command, ...):
196205
server = select_server(deprioritized_servers)
197206
connection = server.getConnection()
198207
res = execute_command(connection, command)
199-
# Deposit tokens into the bucket on success.
200-
tokens = RETRY_TOKEN_RETURN_RATE
201-
if attempt > 0:
202-
tokens += 1
203-
token_bucket.deposit(tokens)
208+
if adaptive_retry:
209+
# Deposit tokens into the bucket on success.
210+
tokens = RETRY_TOKEN_RETURN_RATE
211+
if attempt > 0:
212+
tokens += 1
213+
token_bucket.deposit(tokens)
204214
return res
205215
except PyMongoError as exc:
206216
is_retryable = (is_retryable_write(command, exc)
@@ -209,7 +219,7 @@ def execute_command_retryable(command, ...):
209219
is_overload = exc.contains_error_label("SystemOverloadedError")
210220

211221
# if a retry fails with an error which is not an overload error, deposit 1 token
212-
if attempt > 0 and not is_overload:
222+
if adaptive_retry and attempt > 0 and not is_overload:
213223
token_bucket.deposit(1)
214224

215225
# Raise if the error is non-retryable.
@@ -234,24 +244,27 @@ def execute_command_retryable(command, ...):
234244
if time.monotonic() + backoff > _csot.get_deadline():
235245
raise
236246

237-
if not token_bucket.consume(1):
247+
if adaptive_retry and not token_bucket.consume(1):
238248
raise
239249

240250
time.sleep(backoff)
241251
```
242252

243253
### Token Bucket
244254

245-
The overload retry policy introduces a per-client [token bucket](https://en.wikipedia.org/wiki/Token_bucket) to limit
246-
overload error retry attempts. Although the server rejects excess commands as quickly as possible, doing so costs CPU
247-
and creates extra contention on the connection pool which can eventually negatively affect goodput. To reduce this risk,
248-
the token bucket will limit retry attempts during a prolonged overload.
255+
The overload retry policy introduces an opt-in per-client [token bucket](https://en.wikipedia.org/wiki/Token_bucket) to
256+
limit overload error retry attempts. Although the server rejects excess commands as quickly as possible, doing so costs
257+
CPU and creates extra contention on the connection pool which can eventually negatively affect goodput. To reduce this
258+
risk, the token bucket will limit retry attempts during a prolonged overload.
259+
260+
The token bucket MUST be disabled by default and can be enabled through the
261+
[adaptiveRetries=True](../uri-options/uri-options.md) connection and client options.
249262

250263
The token bucket starts at its maximum capacity of 1000 for consistency with the server.
251264

252-
Each MongoClient instance MUST have its own token bucket. The token bucket MUST be created when the MongoClient is
253-
initialized and exist for the lifetime of the MongoClient. Drivers MUST ensure the token bucket implementation is
254-
thread-safe as it may be accessed concurrently by multiple operations.
265+
Each MongoClient instance MUST have its own token bucket. When adaptive retries are enabled, the token bucket MUST be
266+
created when the MongoClient is initialized and exist for the lifetime of the MongoClient. Drivers MUST ensure the token
267+
bucket implementation is thread-safe as it may be accessed concurrently by multiple operations.
255268

256269
#### Pseudocode
257270

@@ -449,4 +462,6 @@ retrying a write command when only `retryReads` is enabled.
449462

450463
## Changelog
451464

465+
- 2026-02-20: Disable token buckets by default.
466+
452467
- 2026-01-09: Initial version.

source/client-backpressure/tests/README.md

Lines changed: 59 additions & 1 deletion
@@ -62,9 +62,67 @@ Drivers should test that retries do not occur immediately when a SystemOverloade
6262

6363
Drivers should test that retry token buckets are created at their maximum capacity and that that capacity is enforced.
6464

65-
1. Let `client` be a `MongoClient`.
65+
1. Let `client` be a `MongoClient` with `adaptiveRetries=True`.
6666
2. Assert that the client's retry token bucket is at full capacity and that the capacity is
6767
`DEFAULT_RETRY_TOKEN_CAPACITY`.
6868
3. Using `client`, execute a successful `ping` command.
6969
4. Assert that the successful command did not increase the number of tokens in the bucket above
7070
`DEFAULT_RETRY_TOKEN_CAPACITY`.
71+
72+
#### Test 3: Overload Errors are Retried a Maximum of MAX_RETRIES times
73+
74+
Drivers should test that without adaptive retries enabled, overload errors are retried a maximum of five times.
75+
76+
1. Let `client` be a `MongoClient` with command event monitoring enabled.
77+
78+
2. Let `coll` be a collection.
79+
80+
3. Configure the following failpoint:
81+
82+
```javascript
83+
{
84+
configureFailPoint: 'failCommand',
85+
mode: 'alwaysOn',
86+
data: {
87+
failCommands: ['find'],
88+
errorCode: 462, // IngressRequestRateLimitExceeded
89+
errorLabels: ['SystemOverloadedError', 'RetryableError']
90+
}
91+
}
92+
```
93+
94+
4. Perform a find operation with `coll` that fails.
95+
96+
5. Assert that the raised error contains both the `RetryableError` and `SystemOverloadedError` error labels.
97+
98+
6. Assert that the total number of started commands is MAX_RETRIES + 1 (6).
99+
100+
#### Test 4: Adaptive Retries are Limited by Token Bucket Tokens
101+
102+
Drivers should test that when enabled, adaptive retries are limited by the number of tokens in the bucket.
103+
104+
1. Let `client` be a `MongoClient` with `adaptiveRetries=True` and command event monitoring enabled.
105+
106+
2. Set `client`'s retry token bucket to have 2 tokens.
107+
108+
3. Let `coll` be a collection.
109+
110+
4. Configure the following failpoint:
111+
112+
```javascript
113+
{
114+
configureFailPoint: 'failCommand',
115+
mode: {times: 3},
116+
data: {
117+
failCommands: ['find'],
118+
errorCode: 462, // IngressRequestRateLimitExceeded
119+
errorLabels: ['SystemOverloadedError', 'RetryableError']
120+
}
121+
}
122+
```
123+
124+
5. Perform a find operation with `coll` that fails.
125+
126+
6. Assert that the raised error contains both the `RetryableError` and `SystemOverloadedError` error labels.
127+
128+
7. Assert that the total number of started commands is 3: one for the initial attempt and two for the retries.
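
The expected command counts in Tests 3 and 4 can be checked with a small simulation that needs no server. This is a sketch under simplifying assumptions: every failure is treated as a retryable overload error, backoff and CSOT checks are omitted, and the function `count_started_commands` and its parameters are hypothetical names, not driver APIs.

```python
from typing import Optional, Tuple

MAX_RETRIES = 5


def count_started_commands(
    fail_times: Optional[int], tokens: Optional[int] = None
) -> Tuple[int, bool]:
    """Simulate the retry loop and return (started commands, succeeded).

    fail_times: number of initial attempts that fail with an overload
        error; None models a failpoint in 'alwaysOn' mode.
    tokens: tokens in the retry bucket, or None when adaptive retries
        are disabled.
    """
    started = 0
    retries = 0
    while True:
        started += 1
        failed = fail_times is None or started <= fail_times
        if not failed:
            return started, True
        # A retry is only permitted below MAX_RETRIES ...
        if retries >= MAX_RETRIES:
            return started, False
        # ... and, with adaptive retries, only if a token can be consumed.
        if tokens is not None:
            if tokens < 1:
                return started, False
            tokens -= 1
        retries += 1


test3 = count_started_commands(fail_times=None)           # alwaysOn failpoint
test4 = count_started_commands(fail_times=3, tokens=2)    # {times: 3}, 2 tokens
```

In the Test 4 scenario the third failure finds the bucket empty, so the error is raised after three started commands even though `MAX_RETRIES` has not been reached.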

source/uri-options/tests/client-backpressure-options.json

Lines changed: 35 additions & 0 deletions
Some generated files are not rendered by default.
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
tests:
2+
-
3+
description: "adaptiveRetries=true is parsed correctly"
4+
uri: "mongodb://example.com/?adaptiveRetries=true"
5+
valid: true
6+
warning: false
7+
hosts: ~
8+
auth: ~
9+
options:
10+
adaptiveRetries: true
11+
-
12+
description: "adaptiveRetries=false is parsed correctly"
13+
uri: "mongodb://example.com/?adaptiveRetries=false"
14+
valid: true
15+
warning: false
16+
hosts: ~
17+
auth: ~
18+
options:
19+
adaptiveRetries: false
20+
-
21+
description: "adaptiveRetries with invalid value causes a warning"
22+
uri: "mongodb://example.com/?adaptiveRetries=invalid"
23+
valid: true
24+
warning: true
25+
hosts: ~
26+
auth: ~
27+
options: ~
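
The three cases above describe how a URI parser is expected to treat `adaptiveRetries`. The following is a minimal sketch of that behavior using only the Python standard library; `parse_adaptive_retries` is a hypothetical helper, not a driver API, and it simplifies real URI option handling (for example, it matches the option name case-sensitively and ignores all other options).

```python
import warnings
from urllib.parse import parse_qs, urlsplit


def parse_adaptive_retries(uri: str) -> dict:
    """Return {'adaptiveRetries': bool} if the option is present and valid.

    An invalid value produces a warning and the option is omitted,
    mirroring the 'valid: true, warning: true' case in the test file.
    """
    values = parse_qs(urlsplit(uri).query).get("adaptiveRetries")
    if not values:
        return {}
    value = values[-1]
    if value == "true":
        return {"adaptiveRetries": True}
    if value == "false":
        return {"adaptiveRetries": False}
    warnings.warn(f"ignoring invalid adaptiveRetries value: {value!r}")
    return {}


enabled = parse_adaptive_retries("mongodb://example.com/?adaptiveRetries=true")
disabled = parse_adaptive_retries("mongodb://example.com/?adaptiveRetries=false")
```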
