Commit 3df8983
fix(raft): remove orphan .fsm on apply failure (PR #747 r4)
Round-3 review on commit 944ab86 from chatgpt-codex (P2):
> The streaming branch finalizes and renames the spool into
> fsmSnapDir before the message is handed to SendSnapshot's
> t.handle path, but FinalizeAsFSMFile clears the spool path so
> deferred spool.Close() cannot clean it up afterward. If
> t.handle returns an error (for example during transient
> engine/raft failures), SendSnapshot returns failure to the
> sender while leaving the newly written .fsm file behind with
> no corresponding applied snapshot, which can accumulate
> orphaned large files across retries with different snapshot
> indexes.
Confirmed:
- After receiveSnapshotStream succeeds, msg.Snapshot.Data is a
17-byte EKVT token and the .fsm file lives at
fsmSnapPath(fsmSnapDir, index).
- SendSnapshot then calls t.handle(ctx, msg). The engine's
applySnapshot is synchronous to t.handle, so a non-nil return
guarantees applied_index was NOT advanced — the .fsm file is
unreferenced.
- Same-index retries are safe (os.Rename atomically replaces),
but the leader can take a fresh snapshot at a higher index
before the apply finally succeeds, and each failed attempt at
a different index leaves an orphan .fsm.
- cleanupStaleFSMSnaps only runs at startup (prepareDataDirs), so
in a long-lived process orphans keep accumulating until they
cause disk-space pressure.
Fix: SendSnapshot calls a new removeOrphanedFSMSnapshot helper
on the apply-failure branch. The helper:
1. Decodes the EKVT token from msg.Snapshot.Data; bails if the
message used the legacy inline path (no .fsm file to clean).
2. Reads fsmSnapDir under t.mu; bails if it's unset (legacy
receivers also use the inline path so there's nothing on
disk).
3. os.Remove(fsmSnapPath(...)) — best-effort, IsNotExist is
tolerated, all other errors are slog.Warn'd. The original
apply error is the actionable signal returned to the sender;
a failed Remove is a secondary concern that startup cleanup
still picks up.
Test added (TestSendSnapshot_ApplyFailureRemovesFinalizedFSMFile):
- Wires SetFSMSnapDir + SetHandler-that-fails.
- Drives a real testStateMachine-framed payload through
SendSnapshot.
- Asserts SendSnapshot surfaces the apply error.
- Asserts fsmSnapPath(fsmSnapDir, index) does NOT exist after
the call — orphan cleanup fired.
Caller audit (semantic change requires it):
- SendSnapshot (this function): the only gRPC service handler, no
in-tree non-test callers. Modified.
- removeOrphanedFSMSnapshot: only called from SendSnapshot.
- t.handle (other callers): line 438 in Send (regular RPC for
non-snapshot messages, no .fsm involvement) is untouched.
- client.SendSnapshot (lines 267, 455, 481): leader-side client
invocations are a different function, unaffected.
The cleanup is gated by isSnapshotToken(msg.Snapshot.Data) so
the legacy in-memory fallback path (when fsmSnapDir is unset)
sees no behavior change — there's no .fsm file to remove on
that path because Bytes() materializes inline.
Test:
go test -race -count=1 -short ./internal/raftengine/etcd
-- 11.3s, all green.
golangci-lint run --enable-only cyclop ./internal/raftengine/etcd/...
-- 0 issues.
1 parent 944ab86 commit 3df8983
2 files changed
Lines changed: 113 additions & 0 deletions