Hold per-DB checkpoint locks until all general-BGSAVE per-DB checkpoints complete (#1796)

badrishc · Copilot · web-flow · commit 28ba65e44938 · 2026-05-13T13:54:43.000-07:00
* Hold per-DB checkpoint locks until all general-BGSAVE per-DB checkpoints complete

Fixes a residual race in PR #1767 that caused MultiDatabaseSaveInProgressTest to flake in CI Release builds. The general BGSAVE path synchronously paused all per-DB checkpoint locks before returning 'Background saving started', but the per-DB checkpoint helper released each per-DB lock as soon as that single DB's checkpoint completed - not when the entire general save finished. If DB 0's checkpoint completed before the test's 'BGSAVE 0' arrived over the wire, BGSAVE 0 would succeed instead of failing with 'ERR checkpoint already in progress'. Locally the test takes 6-7s and the race never loses; in CI Release it ran in 1s and reliably failed. See https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662.

Fix:
- RunPausedCheckpointsAndReleaseLocksAsync (used by both general and per-DB BGSAVE) resumes ALL pre-paused DBs in its outer finally, after Task.WhenAll. So per-DB locks are held until ALL per-DB checkpoints complete, not just each individual one. A per-DB BGSAVE issued mid-flight reliably observes the in-progress checkpoint.
- The per-DB checkpoint inner work is now a local async function TakeOneCheckpointAsync that performs only (TakeCheckpointAsync + UpdateLastSaveData) without resuming.
- Pre-fill checkpointTasks[] with Task.CompletedTask so the catch path can safely double-await even if the synchronous task-creation loop throws partway through. The double-await ensures we never resume a per-DB lock while its checkpoint is still running.
- Remove the handedOffCount partial-resume bookkeeping that's no longer needed.
- The previously-shared RunPausedCheckpointAsync helper is removed - its only other caller (TaskCheckpointBasedOnAofSizeLimitAsync) now inlines the same try/checkpoint/update/finally/resume sequence so its single-DB pause-resume lifecycle is visible in one place.

Net effect:
- General BGSAVE: per-DB locks held until ALL per-DB checkpoints complete, so any per-DB BGSAVE issued mid-flight reliably fails with 'checkpoint already in progress'.
- Per-DB BGSAVE alone (single-DB path through RunPausedCheckpointsAndReleaseLocksAsync with pausedCount=1): unchanged - that single per-DB lock is still released exactly when that single checkpoint completes.
- AOF-size-driven checkpoint: behaviorally unchanged (lock cleanup inlined).
- Other legal scenarios (per-DB then per-DB on different DB, per-DB then general, general blocks general) preserved.

Verification: 10/10 runs in Release config of MultiDatabaseSaveInProgressTest + MultiDatabaseGeneralSaveBlocksGeneralSaveTest, full MultiDatabaseTests suite (31/31) passes locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Pipeline BGSAVE + BGSAVE 0 in MultiDatabaseSaveInProgressTest to eliminate roundtrip race

Even with the previous server-side fix that holds all per-DB checkpoint locks until the entire general save completes, the test still flaked in CI Release builds because the actual checkpoint of ~1MB of in-memory data completes in microseconds-to-milliseconds. That window is comparable to (and sometimes shorter than) the client→server roundtrip between issuing the general BGSAVE and the follow-up BGSAVE 0:

    Failed Garnet.test.MultiDatabaseTests.MultiDatabaseSaveInProgressTest [1 s]
    Error Message:
      ERR checkpoint already in progress
      Assert.That(caughtException, expression)
      Expected: <StackExchange.Redis.RedisServerException>
      But was: null

(see https://github.com/microsoft/garnet/actions/runs/25773997865)

Replace the two sequential SE.Redis Execute calls with a single LightClient pipelined send of 'BGSAVE\r\nBGSAVE 0\r\n'. Both commands arrive at the server in the same network packet, so the server processes BGSAVE 0 immediately after the general BGSAVE's synchronous setup completes - while DB 0's per-DB checkpoint lock is still held. This makes the assertion deterministic regardless of how fast the actual checkpoint runs.

CountResponseType.Bytes is used because RESP token-counting in LightClient only treats '-' as an error marker at position 0, so a pipelined response containing two RESP tokens '+...\r\n-...\r\n' would never trigger CompletePendingRequests under the default Tokens mode.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Add [CancelAfter(180_000)] to MultipleReplicasWithVectorSetsAsync

This test has been timing out in CI. Set an explicit 180s cancellation timeout so the shared ClusterTestContext.cts is configured accordingly and polling loops (BackOff(cts.Token)) can exit cleanly instead of hanging until the test runner kills the process. Matches the convention applied in PR #1767 for MultipleReplicasWithVectorSetsAndDeletesAsync.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: badrishc <badrishc@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
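
The lock-lifetime change described in the first commit can be illustrated in isolation. The following is a minimal sketch, not Garnet's code: `PerDbLocks`, `HoldAllLocksSketch`, `RunAllAsync` and `DemoAsync` are hypothetical names, and the per-DB lock is assumed to behave like a simple try-pause flag. It shows only the ordering: every lock is resumed in one `finally`, after `Task.WhenAll` has observed all checkpoint tasks.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the per-DB checkpoint locks (not Garnet's actual type).
public sealed class PerDbLocks
{
    private readonly int[] paused; // 0 = free, 1 = paused

    public PerDbLocks(int dbCount) => paused = new int[dbCount];

    // Returns false when dbId is already paused, i.e. "checkpoint already in progress".
    public bool TryPause(int dbId) => Interlocked.CompareExchange(ref paused[dbId], 1, 0) == 0;

    public void Resume(int dbId) => Volatile.Write(ref paused[dbId], 0);
}

public static class HoldAllLocksSketch
{
    // Runs one checkpoint per pre-paused DB and resumes every lock only after all have finished.
    public static async Task RunAllAsync(PerDbLocks locks, int[] pausedDbIds, Func<int, Task> checkpoint)
    {
        // Pre-fill so the array can always be awaited, even if task creation throws partway.
        var tasks = Enumerable.Repeat(Task.CompletedTask, pausedDbIds.Length).ToArray();
        try
        {
            for (var i = 0; i < pausedDbIds.Length; i++)
                tasks[i] = checkpoint(pausedDbIds[i]);

            await Task.WhenAll(tasks).ConfigureAwait(false);
        }
        catch (Exception)
        {
            // Observe every task that was started before the finally block resumes any lock.
            try { await Task.WhenAll(tasks).ConfigureAwait(false); }
            catch { /* failure already handled by the outer catch */ }
        }
        finally
        {
            foreach (var dbId in pausedDbIds)
                locks.Resume(dbId);
        }
    }

    // DB 0 finishes immediately, DB 1 is held open; DB 0 must stay paused until DB 1 is done.
    public static async Task<(bool DuringSave, bool AfterSave)> DemoAsync()
    {
        var locks = new PerDbLocks(2);
        locks.TryPause(0);
        locks.TryPause(1); // the "general save" pauses every DB up front

        var slowDb1 = new TaskCompletionSource<bool>();
        var save = RunAllAsync(locks, new[] { 0, 1 },
            dbId => dbId == 0 ? Task.CompletedTask : slowDb1.Task);

        var duringSave = locks.TryPause(0); // a mid-flight "BGSAVE 0" attempt
        slowDb1.SetResult(true);
        await save.ConfigureAwait(false);
        var afterSave = locks.TryPause(0);  // the lock is free again once everything completed

        return (duringSave, afterSave);
    }
}
```

In this sketch `DemoAsync` returns `(false, true)`: the mid-flight pause of DB 0 is rejected even though DB 0's own checkpoint has already completed, which is the behavior the commit message describes for a per-DB BGSAVE issued during a general BGSAVE.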
diff --git a/libs/server/Databases/MultiDatabaseManager.cs b/libs/server/Databases/MultiDatabaseManager.cs
@@ -295,7 +295,15 @@ public override async Task TaskCheckpointBasedOnAofSizeLimitAsync(long aofSizeLi
 
                 if (pausedDbId < 0) return;
 
-                await RunPausedCheckpointAsync(databasesMapSnapshot[pausedDbId], pausedDbId, token, logger).ConfigureAwait(false);
+                try
+                {
+                    var storeTailAddress = await TakeCheckpointAsync(databasesMapSnapshot[pausedDbId], logger: logger, token: token).ConfigureAwait(false);
+                    UpdateLastSaveData(pausedDbId, storeTailAddress);
+                }
+                finally
+                {
+                    ResumeCheckpoints(pausedDbId);
+                }
             }
             finally
             {
@@ -967,69 +975,68 @@ private void CopyDatabases(IDatabaseManager src, bool enableAof)
         }
 
         /// <summary>
-        /// Run pre-paused per-DB checkpoints in parallel, then release the outer locks held by the caller.
+        /// Run pre-paused per-DB checkpoints in parallel, then resume the per-DB checkpoint locks
+        /// and release the outer locks held by the caller.
         /// Caller must hold <see cref="databasesContentLock"/> as a reader and must have synchronously
-        /// pause-locked every DB in <paramref name="pausedDbIds"/>[0..<paramref name="pausedCount"/>).
+        /// pause-locked the first <paramref name="pausedCount"/> entries of <paramref name="pausedDbIds"/>.
+        /// Per-DB checkpoint locks are held until ALL per-DB checkpoints complete (not just each
+        /// individual one) so a per-DB BGSAVE issued mid-flight during a general BGSAVE reliably
+        /// observes the in-progress checkpoint and fails with "checkpoint already in progress".
         /// </summary>
         private async Task<bool> RunPausedCheckpointsAndReleaseLocksAsync(int[] pausedDbIds, int pausedCount,
             bool multiDbLockHeld, CancellationToken token, ILogger logger)
         {
+            // Pre-fill with Task.CompletedTask so the catch path can safely await Task.WhenAll
+            // even if the synchronous task-creation loop below throws partway through.
+            var checkpointTasks = new Task[pausedCount];
+            for (var i = 0; i < pausedCount; i++)
+                checkpointTasks[i] = Task.CompletedTask;
+
             try
             {
                 // Force async so that the entry point can return synchronously to the caller.
                 await Task.Yield();
 
                 var databaseMapSnapshot = databases.Map;
-                var checkpointTasks = new Task[pausedCount];
-                var handedOffCount = 0;
 
                 try
                 {
                     for (var i = 0; i < pausedCount; i++)
-                    {
-                        checkpointTasks[i] = RunPausedCheckpointAsync(databaseMapSnapshot[pausedDbIds[i]], pausedDbIds[i], token, logger);
-                        handedOffCount = i + 1;
-                    }
+                        checkpointTasks[i] = TakeOneCheckpointAsync(databaseMapSnapshot[pausedDbIds[i]], pausedDbIds[i]);
 
                     await Task.WhenAll(checkpointTasks).ConfigureAwait(false);
                 }
                 catch (Exception ex)
                 {
                     logger?.LogError(ex, "Checkpointing threw exception");
 
-                    // Per-DB helpers that were started always resume their own dbId in finally.
-                    // Resume locks for any pre-paused DBs that didn't get handed off (very rare —
-                    // would require the synchronous tasks[] assignment loop above to throw).
-                    for (var i = handedOffCount; i < pausedCount; i++)
-                        ResumeCheckpoints(pausedDbIds[i]);
+                    // Make sure any tasks already started are observed before we resume the per-DB
+                    // locks in the outer finally (otherwise we could resume a lock while its
+                    // checkpoint is still running).
+                    try { await Task.WhenAll(checkpointTasks).ConfigureAwait(false); }
+                    catch { /* already logged above */ }
                 }
             }
             finally
             {
+                for (var i = 0; i < pausedCount; i++)
+                    ResumeCheckpoints(pausedDbIds[i]);
+
                 if (multiDbLockHeld)
                     multiDbCheckpointingLock.WriteUnlock();
 
                 databasesContentLock.ReadUnlock();
             }
 
             return true;
-        }
 
-        /// <summary>
-        /// Run a single pre-paused per-DB checkpoint and resume the per-DB checkpoint lock.
-        /// Caller must have already pause-locked <paramref name="dbId"/> via <see cref="TryPauseCheckpoints(int)"/>.
-        /// </summary>
-        private async Task RunPausedCheckpointAsync(GarnetDatabase db, int dbId, CancellationToken token, ILogger logger)
-        {
-            try
+            // Local function: take one per-DB checkpoint and update LASTSAVE. Does NOT resume the
+            // per-DB lock — the outer finally above resumes all paused DBs after WhenAll completes.
+            async Task TakeOneCheckpointAsync(GarnetDatabase db, int dbId)
             {
                 var storeTailAddress = await TakeCheckpointAsync(db, logger: logger, token: token).ConfigureAwait(false);
                 UpdateLastSaveData(dbId, storeTailAddress);
             }
-            finally
-            {
-                ResumeCheckpoints(dbId);
-            }
         }
 
         private void UpdateLastSaveData(int dbId, long? storeTailAddress)
diff --git a/test/Garnet.test.cluster/VectorSets/ClusterVectorSetTests.cs b/test/Garnet.test.cluster/VectorSets/ClusterVectorSetTests.cs
@@ -464,6 +464,7 @@ static void Incr(byte[] k)
         }
 
         [Test]
+        [CancelAfter(180_000)]
         public async Task MultipleReplicasWithVectorSetsAsync()
         {
             const int PrimaryIndex = 0;
diff --git a/test/Garnet.test/MultiDatabaseTests.cs b/test/Garnet.test/MultiDatabaseTests.cs
@@ -1377,13 +1377,21 @@ public void MultiDatabaseSaveInProgressTest()
                     db2.ListLeftPush($"k{i}o", new string('x', 256));
                 }
 
-                // Issue general background save
-                res = db1.Execute("BGSAVE");
-                ClassicAssert.AreEqual("Background saving started", res.ToString());
-
-                // Issue background save to DB 0 while general save is in progress - illegal
-                Assert.Throws<RedisServerException>(() => db1.Execute("BGSAVE", "0"),
-                    Encoding.ASCII.GetString(CmdStrings.RESP_ERR_CHECKPOINT_ALREADY_IN_PROGRESS));
+                // Issue a general BGSAVE and a per-DB BGSAVE on DB 0 as a single pipelined batch
+                // via LightClient. Pipelining eliminates the client→server roundtrip between the
+                // two commands so the per-DB BGSAVE arrives at the server while the general BGSAVE's
+                // synchronous setup is still holding DB 0's per-DB checkpoint lock — guaranteeing
+                // the "checkpoint already in progress" error regardless of how fast the actual
+                // checkpoint completes. Without pipelining, a fast in-memory checkpoint can finish
+                // before the per-DB BGSAVE arrives over the wire and the test flakes (CI Release).
+                using (var lcRequest = TestUtils.CreateRequest(countResponseType: CountResponseType.Bytes))
+                {
+                    var expectedResponse =
+                        "+Background saving started\r\n" +
+                        $"-{Encoding.ASCII.GetString(CmdStrings.RESP_ERR_CHECKPOINT_ALREADY_IN_PROGRESS)}\r\n";
+                    var response = lcRequest.Execute("BGSAVE", "BGSAVE 0", expectedResponse.Length);
+                    ClassicAssert.AreEqual(expectedResponse, response);
+                }
 
                 int lastsave_old = lastsave;
                 // Wait for save to complete
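
The byte-counted completion that the test change relies on can be sketched independently of the test harness. This is an illustrative helper only, not LightClient's implementation: `ByteCountedResponseSketch` and `ReadResponse` are hypothetical names, and any `Stream` stands in for the connection. The request and response strings are the ones quoted in the commit message.

```csharp
using System;
using System.IO;
using System.Text;

public static class ByteCountedResponseSketch
{
    // Two inline commands sent as one pipelined payload.
    public const string PipelinedRequest = "BGSAVE\r\nBGSAVE 0\r\n";

    // Expected reply: a simple string immediately followed by an error.
    public const string ExpectedResponse =
        "+Background saving started\r\n-ERR checkpoint already in progress\r\n";

    // Reads until exactly expectedBytes bytes have arrived, however the transport chunks them.
    public static string ReadResponse(Stream stream, int expectedBytes)
    {
        var buffer = new byte[expectedBytes];
        var received = 0;
        while (received < expectedBytes)
        {
            var n = stream.Read(buffer, received, expectedBytes - received);
            if (n == 0)
                throw new EndOfStreamException("Connection closed before the full response arrived.");
            received += n;
        }
        return Encoding.ASCII.GetString(buffer, 0, received);
    }
}
```

Completing on a byte count avoids having to classify each RESP token in a mixed `+...` / `-...` reply; the caller only needs the exact length of the expected response, which is why the test passes `expectedResponse.Length`.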
