Commit 638fa6d
session: fix pool renewal race causing double statement execution
When two or more nodes are bootstrapped concurrently, the Python driver
can execute the same CQL statement twice, causing spurious "already
exists" errors in the caller. This has been observed as flaky test
failures across the ScyllaDB test suite for the past two years, and
worked around by using idempotent DDL forms (IF NOT EXISTS / IF EXISTS)
in dozens of tests.
Root cause
----------
The race unfolds as follows:
1. Two on_add notifications arrive at roughly the same time, one for
   each new node. Each one calls session.add_or_renew_pool(), which
   submits run_add_or_renew_pool() to the thread pool and returns.
   Both submissions are in-flight concurrently.
2. The first add_or_renew_pool() finishes and calls _finalize_add(),
   which notifies load-balancing policies and then calls
   session.update_created_pools() for every live session.
3. update_created_pools() iterates all known hosts. For the second
   host, whose run_add_or_renew_pool() has not yet completed,
   self._pools.get(host) returns None (or a shut-down pool), so it
   submits *another* run_add_or_renew_pool() for that host.
4. Now two tasks are connecting to the same host. The first one
   finishes and installs pool-A in self._pools. A statement
   (e.g. CREATE ROLE) is then sent and is in-flight on pool-A.
5. The second task finishes, reads the stale
   `previous = self._pools.get(host)` value (captured *before* the
   lock was taken, which is a second bug), installs pool-B and then
   shuts down pool-A. The in-flight CREATE ROLE request is orphaned;
   the driver retries it on pool-B. The server executes it a second
   time and returns "Role ... already exists".
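The duplicate submission in step 3 can be reproduced in a toy model. This is only a sketch of the pre-fix behaviour: ToySession, its attributes and the host names are hypothetical stand-ins, not the driver's classes, and the thread pool is replaced by a plain list of queued hosts so the interleaving is deterministic.

```python
# Toy model of the pre-fix behaviour. ToySession and all of its
# members are hypothetical stand-ins, not the driver's real classes.
class ToySession:
    def __init__(self):
        self.pools = {}      # host -> pool name (stands in for self._pools)
        self.submitted = []  # hosts for which a creation task was queued

    def add_or_renew_pool(self, host):
        # Pre-fix: every call queues a new creation task.
        self.submitted.append(host)

    def update_created_pools(self, known_hosts):
        # Any host that has no pool yet gets another creation task,
        # even if one is already in flight for it.
        for host in known_hosts:
            if self.pools.get(host) is None:
                self.add_or_renew_pool(host)


session = ToySession()
session.add_or_renew_pool("host-1")   # on_add for the first node
session.add_or_renew_pool("host-2")   # on_add for the second node
session.pools["host-1"] = "pool-1"    # the first task completes...
session.update_created_pools(["host-1", "host-2"])  # ...and finalizes
duplicate_count = session.submitted.count("host-2")
```

Here duplicate_count ends up as 2: host-2 now has two creation tasks, which is the precondition for steps 4 and 5.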
Fix
---
Three coordinated changes to cassandra/cluster.py:
* Session.__init__: add self._pending_pool_futures = {}, a dict mapping
  host -> Future for any in-flight pool creation, guarded by _lock.
* add_or_renew_pool: before submitting run_add_or_renew_pool(), check
  _pending_pool_futures under _lock. If an in-flight future already
  exists for the host, return it immediately. This is the primary fix
  that prevents the duplicate submission from update_created_pools.
  Additionally, move the `previous = self._pools.get(host)` read inside
  the lock so the live-pool check is atomic with the installation of the
  new pool: if a concurrent creation has already installed a live pool
  by the time we finish connecting, discard our new pool instead of
  replacing the live one (defense-in-depth). In all exit paths, remove
  the host from _pending_pool_futures once the future is done.
* remove_pool: clear _pending_pool_futures[host] under _lock so that
  if a host is removed and immediately re-added, add_or_renew_pool
  submits a fresh creation rather than reusing a stale done future.
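The three changes can be sketched in miniature as follows. This is not the code from cassandra/cluster.py: SketchSession, SketchPool, DeferredExecutor, _create and the connect callback are hypothetical names introduced for illustration, and the executor runs tasks only on demand so the example is deterministic.

```python
import threading
from concurrent.futures import Future


class SketchPool:
    """Hypothetical stand-in for a connection pool."""
    def __init__(self, name):
        self.name = name
        self.is_shutdown = False

    def shutdown(self):
        self.is_shutdown = True


class DeferredExecutor:
    """Queues tasks and runs them only when run_all() is called."""
    def __init__(self):
        self.tasks = []

    def submit(self, fn):
        future = Future()
        self.tasks.append((future, fn))
        return future

    def run_all(self):
        while self.tasks:
            future, fn = self.tasks.pop(0)
            future.set_result(fn())


class SketchSession:
    """Miniature of the fixed logic as described above (an assumption)."""
    def __init__(self, submit):
        self._lock = threading.Lock()
        self._pools = {}                 # host -> pool
        self._pending_pool_futures = {}  # host -> in-flight Future
        self._submit = submit            # callable(fn) -> Future

    def add_or_renew_pool(self, host, connect):
        with self._lock:
            pending = self._pending_pool_futures.get(host)
            if pending is not None and not pending.done():
                return pending           # reuse the in-flight creation
            future = self._submit(lambda: self._create(host, connect))
            self._pending_pool_futures[host] = future
            return future

    def _create(self, host, connect):
        try:
            new_pool = connect(host)     # slow work, done outside the lock
            with self._lock:
                previous = self._pools.get(host)  # read under the lock
                if previous is not None and not previous.is_shutdown:
                    new_pool.shutdown()  # a live pool exists: discard ours
                    return False
                self._pools[host] = new_pool
            return True
        finally:
            with self._lock:             # every exit path clears the entry
                self._pending_pool_futures.pop(host, None)

    def remove_pool(self, host):
        with self._lock:
            self._pending_pool_futures.pop(host, None)
            return self._pools.pop(host, None)


def connect(host):
    return SketchPool("pool-for-" + host)


executor = DeferredExecutor()
session = SketchSession(executor.submit)
first = session.add_or_renew_pool("host-2", connect)
second = session.add_or_renew_pool("host-2", connect)  # duplicate request
queued = len(executor.tasks)
executor.run_all()
```

With this sketch, the second call returns the same future as the first and only one task is queued, so a live pool is never replaced by a duplicate.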
Tests
-----
Three new unit tests are added in PoolRenewalRaceTest
(tests/unit/test_cluster.py). They exercise the new code paths without
requiring a real cluster connection by constructing a minimal Session
via object.__new__ and mocking the executor and profile manager:
* test_add_or_renew_pool_reuses_inflight_future: places a pending
  Future in _pending_pool_futures and verifies that add_or_renew_pool
  returns it without submitting a new task to the executor.
* test_add_or_renew_pool_discards_duplicate_when_live_pool_exists:
  directly exercises the critical section that must discard a newly
  connected pool when a live pool is already present.
* test_remove_pool_clears_pending_future: verifies that remove_pool
  clears _pending_pool_futures so the next add_or_renew_pool call
  submits a fresh task.
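The object.__new__ technique used by these tests can be illustrated on a stand-in class. The sketch below does not reproduce the tests in tests/unit/test_cluster.py: TinySession, make_bare_session and check_reuses_inflight_future are hypothetical names, and only the executor is mocked here.

```python
import threading
from concurrent.futures import Future
from unittest.mock import Mock


class TinySession:
    """Hypothetical stand-in for the driver's Session (not the real class)."""

    def __init__(self):
        # Mimics a constructor that cannot run without a cluster.
        raise RuntimeError("would need a real cluster connection")

    def add_or_renew_pool(self, host):
        with self._lock:
            pending = self._pending_pool_futures.get(host)
            if pending is not None and not pending.done():
                return pending
            future = self.executor.submit(self._run_add_or_renew_pool, host)
            self._pending_pool_futures[host] = future
            return future

    def _run_add_or_renew_pool(self, host):
        raise NotImplementedError("never reached in this sketch")


def make_bare_session():
    # object.__new__ allocates the instance without running __init__,
    # so only the attributes the code under test touches must exist.
    session = object.__new__(TinySession)
    session._lock = threading.Lock()
    session._pending_pool_futures = {}
    session.executor = Mock()
    return session


def check_reuses_inflight_future():
    session = make_bare_session()
    inflight = Future()  # never completed, so it counts as in flight
    session._pending_pool_futures["host-1"] = inflight
    returned = session.add_or_renew_pool("host-1")
    return returned is inflight, session.executor.submit.called


reused, submitted = check_reuses_inflight_future()
```

In this sketch reused is True and submitted is False: the pending future is handed back and the mocked executor is never asked to run a new task.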
Fixes: #317
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
1 parent cd9f525 commit 638fa6d
2 files changed
Lines changed: 170 additions & 3 deletions