Commit 28ba65e
Hold per-DB checkpoint locks until all general-BGSAVE per-DB checkpoints complete (#1796)
* Hold per-DB checkpoint locks until all general-BGSAVE per-DB checkpoints complete
Fixes a residual race in PR #1767 that caused MultiDatabaseSaveInProgressTest to flake
in CI Release builds. The general BGSAVE path synchronously paused all per-DB
checkpoint locks before returning 'Background saving started', but the per-DB checkpoint
helper released each per-DB lock as soon as that single DB's checkpoint completed - not
when the entire general save finished. If DB 0's checkpoint completed before the test's
'BGSAVE 0' arrived over the wire, BGSAVE 0 would succeed instead of failing with
'ERR checkpoint already in progress'. Locally the test takes 6-7s and never loses the
race; in CI Release it ran in 1s and failed reliably.
See https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662.
Fix:
- RunPausedCheckpointsAndReleaseLocksAsync (used by both general and per-DB BGSAVE)
  resumes ALL pre-paused DBs in its outer finally, after Task.WhenAll. So per-DB locks
  are held until ALL per-DB checkpoints complete, not just each individual one. A
  per-DB BGSAVE issued mid-flight reliably observes the in-progress checkpoint.
- The per-DB checkpoint inner work is now a local async function TakeOneCheckpointAsync
  that performs only (TakeCheckpointAsync + UpdateLastSaveData) without resuming.
- Pre-fill checkpointTasks[] with Task.CompletedTask so the catch path can safely
  double-await even if the synchronous task-creation loop throws partway through. The
  double-await ensures we never resume a per-DB lock while its checkpoint is still
  running.
- Remove the handedOffCount partial-resume bookkeeping that's no longer needed.
- The previously-shared RunPausedCheckpointAsync helper is removed - its only other
  caller (TaskCheckpointBasedOnAofSizeLimitAsync) now inlines the same try/checkpoint/
  update/finally/resume sequence so its single-DB pause-resume lifecycle is visible
  in one place.
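The lock-lifetime change in the list above can be sketched in Python with asyncio. This is an illustrative model only, not Garnet's C# code: `run_paused_checkpoints_and_release_locks`, `take_one_checkpoint`, `resume`, and `demo` are hypothetical names, and `asyncio.gather` stands in for Task.WhenAll. It shows the pre-filled completed slots, the second await on the error path, and the single resume point in the outer finally.

```python
import asyncio


async def run_paused_checkpoints_and_release_locks(paused_dbs, take_one_checkpoint, resume):
    """Checkpoint every pre-paused DB, then resume all of them together."""
    loop = asyncio.get_running_loop()
    # Pre-fill every slot with an already-completed future so the error path
    # can await all slots even if task creation fails partway through.
    tasks = []
    for _ in paused_dbs:
        completed = loop.create_future()
        completed.set_result(None)
        tasks.append(completed)
    try:
        try:
            for i, db in enumerate(paused_dbs):
                tasks[i] = asyncio.ensure_future(take_one_checkpoint(db))
            await asyncio.gather(*tasks)
        except Exception:
            # gather() raises on the first failure while siblings may still be
            # running. Await everything again so no lock is resumed mid-checkpoint.
            await asyncio.gather(*tasks, return_exceptions=True)
            raise
    finally:
        # Locks are released only here, after every checkpoint has finished.
        for db in paused_dbs:
            resume(db)


async def demo():
    paused = {0, 1}  # hypothetical stand-in for the held per-DB checkpoint locks
    db0_done = asyncio.Event()
    let_db1_finish = asyncio.Event()

    async def take_one_checkpoint(db):
        if db == 0:
            db0_done.set()               # DB 0's checkpoint finishes first
        else:
            await let_db1_finish.wait()  # DB 1's checkpoint is still running

    async def probe():
        # Plays the role of a 'BGSAVE 0' arriving after DB 0's checkpoint is done.
        await db0_done.wait()
        still_locked = 0 in paused
        let_db1_finish.set()
        return still_locked

    _, still_locked = await asyncio.gather(
        run_paused_checkpoints_and_release_locks([0, 1], take_one_checkpoint, paused.discard),
        probe(),
    )
    return still_locked, paused


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this model the probe still sees DB 0 as locked after DB 0's own checkpoint has completed, which is the property the fix relies on.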
Net effect:
- General BGSAVE: per-DB locks held until ALL per-DB checkpoints complete, so any
  per-DB BGSAVE issued mid-flight reliably fails with 'checkpoint already in progress'.
- Per-DB BGSAVE alone (single-DB path through RunPausedCheckpointsAndReleaseLocksAsync
  with pausedCount=1): unchanged - that single per-DB lock is still released exactly
  when that single checkpoint completes.
- AOF-size-driven checkpoint: behaviorally unchanged (lock cleanup inlined).
- Other legal scenarios (per-DB then per-DB on different DB, per-DB then general,
  general blocks general) preserved.
Verification: 10/10 Release-config runs of MultiDatabaseSaveInProgressTest +
MultiDatabaseGeneralSaveBlocksGeneralSaveTest pass, and the full MultiDatabaseTests
suite (31/31) passes locally.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Pipeline BGSAVE + BGSAVE 0 in MultiDatabaseSaveInProgressTest to eliminate roundtrip race
Even with the previous server-side fix that holds all per-DB checkpoint locks until
the entire general save completes, the test still flaked in CI Release builds because
the actual checkpoint of ~1MB of in-memory data completes in microseconds-to-milliseconds.
That window is comparable to (and sometimes shorter than) the client→server roundtrip
between issuing the general BGSAVE and the follow-up BGSAVE 0:
    Failed Garnet.test.MultiDatabaseTests.MultiDatabaseSaveInProgressTest [1 s]
    Error Message:
      ERR checkpoint already in progress
    Assert.That(caughtException, expression)
      Expected: <StackExchange.Redis.RedisServerException>
      But was: null
(see https://github.com/microsoft/garnet/actions/runs/25773997865)
Replace the two sequential SE.Redis Execute calls with a single LightClient pipelined
send of 'BGSAVE\r\nBGSAVE 0\r\n'. Both commands arrive at the server in the same
network packet, so the server processes BGSAVE 0 immediately after the general BGSAVE's
synchronous setup completes - while DB 0's per-DB checkpoint lock is still held. This
makes the assertion deterministic regardless of how fast the actual checkpoint runs.
CountResponseType.Bytes is used because RESP token-counting in LightClient only treats
'-' as an error marker at position 0, so a pipelined response containing two RESP
tokens '+...\r\n-...\r\n' would never trigger CompletePendingRequests under the
default Tokens mode.
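A small Python sketch of why the byte-counted wait suits this pipelined exchange. It is a toy model under stated assumptions: LightClient's actual counting logic is not shown here, and `split_simple_replies`, `first_byte_is_error`, and `received_all_bytes` are hypothetical helpers. The reply texts are the ones quoted in this commit message; the framing (`+` for simple strings, `-` for errors, CRLF terminators) is standard RESP.

```python
# Both commands go out in one send, as inline commands terminated by CRLF.
PIPELINED_REQUEST = b"BGSAVE\r\nBGSAVE 0\r\n"

# The two replies the test expects back-to-back in the receive buffer.
EXPECTED_RESPONSE = (
    b"+Background saving started\r\n"
    b"-ERR checkpoint already in progress\r\n"
)


def split_simple_replies(buf):
    """Split a buffer of RESP simple strings/errors into complete replies."""
    parts = buf.split(b"\r\n")
    return parts[:-1]  # the last piece is empty or an incomplete reply


def first_byte_is_error(buf):
    """Toy version of an error check that only looks at position 0."""
    return buf[:1] == b"-"


def received_all_bytes(buf, expected_len):
    """Byte-counted completion: done once the expected number of bytes arrived."""
    return len(buf) >= expected_len


if __name__ == "__main__":
    replies = split_simple_replies(EXPECTED_RESPONSE)
    print(replies)
    # The error is the second reply, so a position-0 check does not see it.
    print(first_byte_is_error(EXPECTED_RESPONSE))
    print(received_all_bytes(EXPECTED_RESPONSE, len(EXPECTED_RESPONSE)))
```

Because the error reply follows a `+` reply in the same buffer, a check anchored at the first byte misses it, whereas waiting for the full expected byte length does not depend on where the `-` lands.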
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add [CancelAfter(180_000)] to MultipleReplicasWithVectorSetsAsync
This test has been timing out in CI. Set an explicit 180s cancellation timeout
so the shared ClusterTestContext.cts is configured accordingly and polling
loops (BackOff(cts.Token)) can exit cleanly instead of hanging until the test
runner kills the process.
Matches the convention applied in PR #1767 for MultipleReplicasWithVectorSetsAndDeletesAsync.
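The intent of the timeout can be illustrated with a rough Python analogue. This is not NUnit or Garnet code: `poll_until` and `run_with_cancel_after` are hypothetical names, `asyncio.wait_for` plays the role of the cancellation timeout, and the `asyncio.sleep` inside the loop plays the role of a cancellable back-off. The point is that a polling loop with a cancellation point exits when the deadline passes instead of spinning until something external kills it.

```python
import asyncio


async def poll_until(condition, delay=0.01):
    """Poll until condition() is true; each sleep is a cancellation point."""
    while not condition():
        await asyncio.sleep(delay)


async def run_with_cancel_after(awaitable, timeout_s):
    """Run awaitable, cancelling it if it exceeds timeout_s seconds."""
    try:
        await asyncio.wait_for(awaitable, timeout=timeout_s)
        return "completed"
    except asyncio.TimeoutError:
        # wait_for has already cancelled the polling loop at this point.
        return "cancelled"


if __name__ == "__main__":
    # A condition that never becomes true is cut off at the deadline.
    print(asyncio.run(run_with_cancel_after(poll_until(lambda: False), 0.05)))
    # A condition that is already true finishes normally.
    print(asyncio.run(run_with_cancel_after(poll_until(lambda: True), 0.05)))
```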
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: badrishc <badrishc@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 4cc403f commit 28ba65e
3 files changed
Lines changed: 49 additions & 33 deletions
File tree
- libs/server/Databases
- test
- Garnet.test.cluster/VectorSets
- Garnet.test
Diff bodies were not preserved; only the changed line ranges survive (files in file-tree order):
- File 1: original lines 295-301 and 967-1035 (new lines 295-309 and 975-1042), 33 additions & 26 deletions
- File 2: original lines 464-469 (new lines 464-470), 1 addition & 0 deletions
- File 3: original lines 1377-1389 (new lines 1377-1397), 15 additions & 7 deletions