Commit 658b722

NagyVikt and claude authored
fix(storage): raise SQLite busy_timeout + retry headroom for fleet load (#577)
The Storage constructor previously set busy_timeout=5000ms, and withBusyRetry defaulted to 5 attempts with per-attempt backoff capped at 250ms, tuned for the worker + MCP + CLI trio. Under codex-fleet load (~30 codex panes + active Claude sessions + worker = 30+ concurrent writers on ~/.colony/data.db), the 5s SQLite window plus the ~355ms Node retry tail was exhausted, surfacing SQLITE_BUSY: database is locked to callers.

Bump busy_timeout to 15000ms and widen the withBusyRetry defaults to 8 attempts with up to 1000ms per-attempt backoff (~3.85s total worst-case retry sleep). Happy-path callers are unaffected because no busy error means no retry sleep.

Update the corresponding regression test assertion.

Co-authored-by: NagyVikt <nagy.viktordp@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 53836ff commit 658b722

6 files changed

Lines changed: 58 additions & 18 deletions

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+---
+'colonyq': patch
+---
+
+Raise SQLite contention headroom in `@colony/storage` so the worker daemon,
+MCP server, CLI hooks, and codex-fleet panes can share `~/.colony/data.db`
+without surfacing `SQLITE_BUSY: database is locked` to callers. The
+`Storage` constructor now sets `busy_timeout=15000` (was 5000), and
+`withBusyRetry` defaults bump to 8 attempts with up-to-1s backoff (was 5
+attempts / 250ms cap). Happy-path callers are unaffected because no busy
+error still means no retry sleep; sustained contention from ~30+ concurrent
+writers (the codex-fleet shape that triggered this) now has ~3.85s of
+Node-level retry headroom on top of the SQLite busy_timeout window before raising.
Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+schema: spec-driven
+created: 2026-05-16
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+# agent-claude-colony-storage-busy-timeout-retry-under-2026-05-16-12-11 (minimal / T1)
+
+Branch: `agent/claude/colony-storage-busy-timeout-retry-under-2026-05-16-12-11`
+
+Raise SQLite contention headroom in `@colony/storage` so the worker daemon, MCP server, CLI hooks, and codex-fleet panes can share `~/.colony/data.db` without surfacing `SQLITE_BUSY: database is locked` to callers.
+
+- `busy_timeout` 5000 → 15000 ms (one connection-scoped pragma in the `Storage` constructor).
+- `withBusyRetry` defaults: `maxAttempts` 5 → 8, `baseDelayMs` 5 → 10, `maxDelayMs` 250 → 1000. New worst-case retry sleep ~3.85s vs old ~0.355s; happy-path callers stay sub-ms because no busy error means no retry sleep.
+- Updated `keeps busy_timeout set to N` assertion in `test/busy-retry.test.ts` from 5000 to 15000.
+
+Triggered by a real lock storm on 2026-05-16: ~30 codex-fleet panes + 4 live Claude sessions + 1 worker = roughly 35 concurrent writers, which exhausted the previous 5s window + 5-attempt retry tail. The fleet teardown is the immediate fix; this PR is the durable headroom.
+
+## Handoff
+
+- Handoff: change=`agent-claude-colony-storage-busy-timeout-retry-under-2026-05-16-12-11`; branch=`agent/claude/colony-storage-busy-timeout-retry-under-2026-05-16-12-11`; scope=`packages/storage src + test only`; action=`finish via PR to main`.
+
+## Cleanup
+
+- [ ] Run: `gx branch finish --branch agent/claude/colony-storage-busy-timeout-retry-under-2026-05-16-12-11 --base main --via-pr --wait-for-merge --cleanup`
+- [ ] Record PR URL + `MERGED` state in the completion handoff.
+- [ ] Confirm sandbox worktree is gone (`git worktree list`, `git branch -a`).
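
The ~0.355s and ~3.85s figures can be checked with a little arithmetic. The sketch below is not part of the commit: `backoffSchedule` is a hypothetical helper that assumes the documented formula (`base * 4^(attempt-1)`, capped at `maxDelayMs`) and assumes no sleep follows the final attempt. The totals cover Node-level sleeps only; each attempt can presumably also block inside SQLite for up to `busy_timeout`, so elapsed time under sustained contention may be considerably longer.

```typescript
// Hypothetical helper (not from the commit): lists the sleep taken before each
// retry, assuming delay = baseDelayMs * 4^(attempt-1), capped at maxDelayMs.
function backoffSchedule(maxAttempts: number, baseDelayMs: number, maxDelayMs: number): number[] {
  const delays: number[] = [];
  // Assumption: the last attempt is not followed by a sleep,
  // so there are maxAttempts - 1 delays in total.
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(Math.min(baseDelayMs * 4 ** (attempt - 1), maxDelayMs));
  }
  return delays;
}

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

const oldDelays = backoffSchedule(5, 5, 250);   // [5, 20, 80, 250]
const newDelays = backoffSchedule(8, 10, 1000); // [10, 40, 160, 640, 1000, 1000, 1000]

console.log(sum(oldDelays)); // 355
console.log(sum(newDelays)); // 3850
```

The fifth and later delays would be 2560ms and above uncapped, which is why the new schedule flattens at 1000ms.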

packages/storage/src/busy-retry.ts

Lines changed: 15 additions & 14 deletions
@@ -1,26 +1,27 @@
 // Synchronous SQLITE_BUSY retry wrapper for better-sqlite3 writes.
 //
-// Background: the Storage constructor sets busy_timeout=5000 and
+// Background: the Storage constructor sets busy_timeout=15000 and
 // journal_mode=WAL, which together absorb the overwhelming majority of
 // contention between the worker daemon, MCP server, and CLI hooks. The
 // remaining tail comes from edge cases: a long checkpoint, a migration
 // that holds a write transaction, or a misbehaving hook that opens its
-// own connection. In those cases SQLite still raises SQLITE_BUSY after
-// the busy_timeout window expires, which the 168h gain telemetry saw
-// surface once on task_claim_quota_release_expired.
+// own connection, plus sustained pressure from the codex-fleet shape
+// (~30+ concurrent writers). In those cases SQLite still raises
+// SQLITE_BUSY after the busy_timeout window expires.
 //
 // This helper gives callers a defensive Node-level retry on top of
-// SQLite's own busy_timeout. Five attempts with backoff 5/20/80/250ms
-// cap total wait at ~355ms, small enough that the caller is not
-// noticeably slower, large enough that a transient checkpoint or
-// short-held write transaction has time to clear.
+// SQLite's own busy_timeout. Eight attempts with backoff
+// 10/40/160/640/1000/1000/1000ms cap total wait at ~3.85s, bounded
+// enough to keep CLI hooks under the 150ms p95 budget on the happy
+// path while giving a transient checkpoint or fleet-burst window time
+// to clear.

 export interface BusyRetryOptions {
-  /** Maximum number of attempts (including the first). Defaults to 5. */
+  /** Maximum number of attempts (including the first). Defaults to 8. */
   maxAttempts?: number;
-  /** Base delay in milliseconds; backoff is base * 4^(attempt-1) capped at 250ms. Defaults to 5. */
+  /** Base delay in milliseconds; backoff is base * 4^(attempt-1) capped at maxDelayMs. Defaults to 10. */
   baseDelayMs?: number;
-  /** Maximum per-attempt delay in milliseconds. Defaults to 250. */
+  /** Maximum per-attempt delay in milliseconds. Defaults to 1000. */
   maxDelayMs?: number;
 }

@@ -55,9 +56,9 @@ function sleepSync(ms: number): void {
  * `maxAttempts` retries.
  */
 export function withBusyRetry<T>(fn: () => T, opts: BusyRetryOptions = {}): T {
-  const maxAttempts = opts.maxAttempts ?? 5;
-  const baseDelayMs = opts.baseDelayMs ?? 5;
-  const maxDelayMs = opts.maxDelayMs ?? 250;
+  const maxAttempts = opts.maxAttempts ?? 8;
+  const baseDelayMs = opts.baseDelayMs ?? 10;
+  const maxDelayMs = opts.maxDelayMs ?? 1000;
   let attempt = 0;
   // The loop body always either returns or throws, so the linter is
   // happy with the `while (true)` shape.
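
The diff shows only the defaults, not the loop body. As a rough illustration of how a synchronous retry wrapper of this shape might behave, here is a minimal sketch under stated assumptions. It is not the repository's implementation. The names `withBusyRetrySketch`, `sleepSyncSketch`, and `isBusyError` are hypothetical, and busy detection assumes the thrown error carries `code === 'SQLITE_BUSY'`, as better-sqlite3's `SqliteError` does.

```typescript
// Illustrative only: names ending in "Sketch" are hypothetical, not from the commit.
interface BusyRetryOptionsSketch {
  maxAttempts?: number;
  baseDelayMs?: number;
  maxDelayMs?: number;
}

// Block the current thread for `ms` milliseconds without busy-spinning.
function sleepSyncSketch(ms: number): void {
  Atomics.wait(new Int32Array(new SharedArrayBuffer(4)), 0, 0, ms);
}

// Assumption: busy errors expose code "SQLITE_BUSY".
function isBusyError(err: unknown): boolean {
  return typeof err === "object" && err !== null &&
    (err as { code?: unknown }).code === "SQLITE_BUSY";
}

function withBusyRetrySketch<T>(fn: () => T, opts: BusyRetryOptionsSketch = {}): T {
  const maxAttempts = opts.maxAttempts ?? 8;
  const baseDelayMs = opts.baseDelayMs ?? 10;
  const maxDelayMs = opts.maxDelayMs ?? 1000;
  let attempt = 0;
  while (true) {
    attempt++;
    try {
      return fn(); // happy path: no error, so no sleep at all
    } catch (err) {
      // Rethrow non-busy errors immediately, and busy errors once attempts run out.
      if (!isBusyError(err) || attempt >= maxAttempts) throw err;
      sleepSyncSketch(Math.min(baseDelayMs * 4 ** (attempt - 1), maxDelayMs));
    }
  }
}

// Usage: a fake write that is busy twice, then succeeds on the third attempt.
let calls = 0;
const result = withBusyRetrySketch(
  () => {
    calls++;
    if (calls < 3) throw Object.assign(new Error("database is locked"), { code: "SQLITE_BUSY" });
    return "ok";
  },
  { baseDelayMs: 1, maxDelayMs: 4 }, // tiny delays so the example runs quickly
);
console.log(result, calls); // ok 3
```

Because the sleep sits only in the `catch` branch, a call that succeeds on its first attempt pays nothing beyond the `try` itself, which matches the commit's claim that happy-path callers are unaffected.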

packages/storage/src/storage.ts

Lines changed: 5 additions & 2 deletions
@@ -320,8 +320,11 @@ export class Storage {
     this.db = new Database(dbPath, opts.readonly ? { readonly: true } : {});
     // busy_timeout is connection-scoped. Multiple processes (worker daemon,
     // MCP server, CLI hooks) hit the same WAL file; without this they trip
-    // SQLITE_BUSY immediately on contention. 5s lets the kernel retry.
-    this.db.pragma('busy_timeout = 5000');
+    // SQLITE_BUSY immediately on contention. 15s absorbs sustained
+    // contention from ~30+ concurrent writers (the codex-fleet shape):
+    // 5s was tuned for the worker+MCP+CLI trio and got exhausted under
+    // fleet load on 2026-05-16.
+    this.db.pragma('busy_timeout = 15000');
     // WAL mode lets readers and a single writer coexist without blocking,
     // which matters because the worker daemon, MCP server, and CLI hooks
     // all hit the same DB file concurrently. Without WAL the default

packages/storage/test/busy-retry.test.ts

Lines changed: 2 additions & 2 deletions
@@ -25,11 +25,11 @@ describe('Storage WAL mode', () => {
     expect(mode[0]?.toLowerCase()).toBe('wal');
   });

-  it('keeps busy_timeout set to 5000', () => {
+  it('keeps busy_timeout set to 15000', () => {
     const timeout = (storage as unknown as { db: { pragma: (sql: string) => unknown[] } }).db
       .pragma('busy_timeout')
       .map((r: unknown) => (r as { timeout: number }).timeout);
-    expect(timeout[0]).toBe(5000);
+    expect(timeout[0]).toBe(15000);
   });
 });
