fix(core): add busy_timeout + WAL to workspace SQLite connections (#648)#686
Conversation
Core workspace connections used bare sqlite3.connect() with no busy_timeout and no WAL journaling, so under parallel batch execution a concurrent writer got an immediate "database is locked" OperationalError instead of waiting. This matches platform_store/database.py, which already sets WAL + 5s timeout. - Add private _open_db() helper in workspace.py (WAL + busy_timeout=5000) and route all 6 direct connects + get_db_connection through it. Pass db_path unchanged (no str() coercion) to preserve the prior connect semantics. - Route external callers through the shared path: budget.py via get_db_connection(workspace); costs_v2.py via _open_db(db_path). - Add a ThreadPoolExecutor parallel-writers test asserting no immediate database-locked error and that all writes persist. Also fix the staging/production Deploy workflow, which has been failing the health check since #643: the backend now hard-fails on the default AUTH_SECRET when auth is enabled (the default), but deploy.yml never wrote AUTH_SECRET into the generated env file. Write AUTH_SECRET (from the existing GitHub secret) into .env.staging/.env.production, fail fast with a clear message when it is unset, and document it in .env.staging.example.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughIntroduces a centralized ChangesSQLite WAL and busy_timeout centralization
AUTH_SECRET deployment injection
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. 📋 Issue PlannerBuilt with CodeRabbit's Coding Plans for faster development and fewer bugs. View plan used: ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Code Review — fix(core): busy_timeout + WAL for workspace SQLite (#648)This is a clean, well-scoped fix. The root cause is correctly identified, the centralisation via Correctness / Architecture
from codeframe.core.workspace import Workspace, _open_dbThe PR description explains why The simplest fix: drop the leading underscore and make it Test coverageMissing CLAUDE.md requires class TestConcurrentWriters:
pytestmark = pytest.mark.v2Thread-based test vs. process-based reality (acknowledged) The PR description notes this limitation clearly — threads share a process and the GIL, so contention is lighter than the production multi-process scenario. The 8-thread x 15-write x 1ms sleep approach is a reasonable proxy. Consider adding a one-line comment inside the test itself referencing this, so future readers do not assume it covers multi-process contention.
except sqlite3.OperationalError as exc: # pragma: no cover - failure path
errors.append(exc)Any Deploy workflowThe fail-fast guard is correct and the error message is actionable. The check is duplicated verbatim across the staging and production jobs, which is fine for two environments; a composite action would be worth it if a third environment is ever added. What is good
VerdictApprove with minor follow-up suggested. The one item worth addressing before merge is making |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/deploy.yml (1)
83-91: 💤 Low valueConsider enhancing the fail-fast guard to catch whitespace-only secrets.
The current check uses
-z "${ENV_AUTH_SECRET}"which only catches truly empty values. Whitespace-only secrets would pass this CI check but fail later at backend startup when_read_auth_secret()strips the whitespace and falls back toDEFAULT_SECRET(seecodeframe/auth/manager.py:26-28). While the backend will still refuse to start insecurely, catching it earlier in CI would provide clearer feedback.🛡️ Optional enhancement to catch whitespace-only secrets
For staging (line 86):
- if [ -z "${ENV_AUTH_SECRET}" ]; then + if [ -z "${ENV_AUTH_SECRET}" ] || [ -z "${ENV_AUTH_SECRET// /}" ]; then echo "❌ AUTH_SECRET GitHub Actions secret is not set (or empty)."For production (line 314):
- if [ -z "${ENV_AUTH_SECRET}" ]; then + if [ -z "${ENV_AUTH_SECRET}" ] || [ -z "${ENV_AUTH_SECRET// /}" ]; then echo "❌ AUTH_SECRET GitHub Actions secret is not set (or empty)."Or use a more readable pattern:
if [ -z "${ENV_AUTH_SECRET}" ] || ! echo "${ENV_AUTH_SECRET}" | grep -q '[^[:space:]]'; thenBased on learnings from
codeframe/auth/manager.py, which treats blank/whitespace AUTH_SECRET values as unset and falls back to the default secret, making the startup guard catch them as a hard-fail in hosted mode.Also applies to: 311-319
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/deploy.yml around lines 83 - 91, The ENV_AUTH_SECRET validation check in both the staging and production deployment sections currently only catches empty values using the `-z` test, but whitespace-only secrets would pass this check and fail later when the backend's _read_auth_secret() function strips whitespace and falls back to DEFAULT_SECRET. Enhance the condition at the staging check (around line 86) and the production check (around line 314) to also detect whitespace-only secrets by adding an additional test that verifies the secret contains at least one non-whitespace character, either by using a grep pattern like `! echo "${ENV_AUTH_SECRET}" | grep -q '[^[:space:]]'` or another suitable bash pattern that validates the secret is not just whitespace before proceeding with the deployment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/core/test_workspace.py`:
- Around line 180-249: The TestConcurrentWriters class at the beginning of the
diff is missing the required pytest v2 marker decorator. Add the `@pytest.mark.v2`
decorator to the TestConcurrentWriters class to ensure these concurrency
regression tests are included when running tests with marker-based selection
(e.g., uv run pytest -m v2), as required by the coding guidelines for Python
tests in the tests directory.
---
Nitpick comments:
In @.github/workflows/deploy.yml:
- Around line 83-91: The ENV_AUTH_SECRET validation check in both the staging
and production deployment sections currently only catches empty values using the
`-z` test, but whitespace-only secrets would pass this check and fail later when
the backend's _read_auth_secret() function strips whitespace and falls back to
DEFAULT_SECRET. Enhance the condition at the staging check (around line 86) and
the production check (around line 314) to also detect whitespace-only secrets by
adding an additional test that verifies the secret contains at least one
non-whitespace character, either by using a grep pattern like `! echo
"${ENV_AUTH_SECRET}" | grep -q '[^[:space:]]'` or another suitable bash pattern
that validates the secret is not just whitespace before proceeding with the
deployment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9a915826-9c38-45ca-bf0e-dd180f7e96dd
📒 Files selected for processing (6)
.env.staging.example.github/workflows/deploy.ymlcodeframe/adapters/e2b/budget.pycodeframe/core/workspace.pycodeframe/ui/routers/costs_v2.pytests/core/test_workspace.py
The original ThreadPoolExecutor stress test (8 writers) was flaky: SQLite's busy handler is not fair, so under heavy contention one writer can be starved past the 5s busy_timeout and still raise "database is locked" — even with the fix. Investigation also showed Python's sqlite3.connect already defaults to a 5s busy_timeout, so the substantive fix here is enabling WAL journaling (which removes the rollback-journal reader/writer case that fails immediately). Replace the stress test with a deterministic test: one thread holds the write lock (BEGIN IMMEDIATE) briefly while a second writer must wait for it and then succeed. Update the _open_db docstring to reflect that WAL is the real change.
Addresses CodeRabbit review: _read_auth_secret() treats blank/whitespace-only AUTH_SECRET as unset and falls back to the default secret, so a whitespace-only GitHub secret would pass the `-z` guard yet still hard-fail the backend at startup. Catch it in CI for clearer feedback (both staging and production).
|
Thanks @coderabbitai — triaged both findings: 1. 2. Whitespace-only Also note: while addressing this I found Python's |
|
Tip For best results, initiate chat on the files or code changes.
🐇 ✨ ✏️ Learnings added
|
Follow-up review (commits 2 & 3)Addressed since my last review:
Two items still open: 1. This was flagged in my first review and remains unchanged. Since 2. Per CLAUDE.md: "New v2 tests: add The existing module ( Everything else looks good. These two items are the only blockers before merge. |
Follow-up ReviewPicking up from my previous review — good iteration. Here is where things stand. What's resolved
Still outstanding from my previous review
from codeframe.core.workspace import Workspace, _open_db
# workspace.py — drop the underscore
def open_db(db_path: str | Path) -> sqlite3.Connection: ...
# costs_v2.py
from codeframe.core.workspace import Workspace, open_dbThis does not need to block merge, but it is worth a follow-up. A private-named function crossing a module boundary means anyone refactoring VerdictReady to merge. The core fix (WAL centralisation across all workspace connections) is correct and well-tested. The deploy bonus fix is also correct. The |
…module import Addresses Claude PR review: costs_v2.py imported the private _open_db() across package boundaries (fragile if renamed/inlined). Add a public get_db_connection_by_path(db_path) accessor in workspace.py (parallels get_db_connection but takes a raw path, for the costs router's fresh/locked-DB tolerant helpers) and route costs_v2.py through it. _open_db stays the private primitive used internally.
|
Thanks @claude — addressed the actionable item: Private cross-module import ( The other points are resolved by the test rework (commit 0b5ff85): the broad |
|
Claude finished @frankbria's task in 53s —— View job Follow-up Review
All previous review items resolved. Here's the final state:
Deterministic lock test (commit 0b5ff85) — confirmed resolved in the prior round. The broad
Verdict: ready to merge. The WAL centralisation is correct, well-tested, and the module boundary is now clean. |
Follow-up ReviewBoth issues flagged in my initial review have been addressed: Addressed: Private symbol import fixed - Addressed: Whitespace-only secret guard added - The deploy workflow now uses One remaining nit: pytest.mark.v2 on TestConcurrentWritersCLAUDE.md requires new v2 tests to carry this marker. The module has no module-level class TestConcurrentWriters:
pytestmark = pytest.mark.v2
"""Concurrency guards on workspace connections (issue #648)."""Class-level Verdict: Approve. Everything else is solid - the public wrapper pattern, the WAL+busy_timeout centralisation, the deterministic lock-contention test, and the fail-fast deploy guard are all good work. The marker nit can be addressed here or in a follow-up. |
CLAUDE.md asks new v2 tests to carry the marker explicitly (module-level pytestmark), even though tests/core/ is also auto-marked by conftest. Addresses the repeated CodeRabbit finding by following the documented project convention.
|
Final Review - all previous items resolved. Reviewing the current diff against all outstanding items from the review thread. Correction on my previous nit about pytest.mark.v2: The module-level pytestmark = pytest.mark.v2 I flagged in my last comment as still outstanding is already present in this diff. That was a misread — the marker was added to test_workspace.py at module level. Combined with the auto-marking in tests/core/conftest.py, TestConcurrentWriters is doubly covered. No change needed. All items resolved:
Verdict: Approve and merge. The WAL centralisation is correct and well-tested, module boundaries are clean, and the fail-fast deploy guard is solid. Nothing outstanding. |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/core/test_workspace.py (1)
227-255:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep the holder lock until the waiter is actually contending.
The fixed
time.sleep(0.5)leaves a false-positive path: if the test thread is paused after Line 244, the holder can commit before Line 251 runs, so the test passes without proving that the second writer waited on a held lock. Hold the transaction with a release event, run the waiter in its own thread, assert it does not finish immediately while the lock is held, then release and join both threads.🐛 Proposed deterministic contention check
- lock_acquired = threading.Event() - hold_seconds = 0.5 # comfortably under the 5s busy timeout + lock_acquired = threading.Event() + release_lock = threading.Event() + waiter_finished = threading.Event() + errors: list[BaseException] = [] def hold_write_lock() -> None: conn = get_db_connection(ws) conn.isolation_level = None # take manual control of the transaction try: conn.execute("BEGIN IMMEDIATE") conn.execute("INSERT INTO lock_probe (id, label) VALUES (1, 'holder')") lock_acquired.set() - time.sleep(hold_seconds) + assert release_lock.wait(timeout=5), "test never released holder lock" conn.execute("COMMIT") + except BaseException as exc: + errors.append(exc) + raise finally: conn.close() holder = threading.Thread(target=hold_write_lock) holder.start() assert lock_acquired.wait(timeout=5), "holder never acquired the write lock" - # The lock is held now. Without the busy timeout this write would raise - # ``database is locked`` immediately; with it, the write blocks until the - # holder commits (~hold_seconds) and then succeeds. - conn2 = get_db_connection(ws) - try: - conn2.execute("INSERT INTO lock_probe (id, label) VALUES (2, 'waiter')") - conn2.commit() - finally: - conn2.close() - holder.join(timeout=5) + def wait_for_write_lock() -> None: + conn2 = get_db_connection(ws) + try: + conn2.execute("INSERT INTO lock_probe (id, label) VALUES (2, 'waiter')") + conn2.commit() + except BaseException as exc: + errors.append(exc) + raise + finally: + conn2.close() + waiter_finished.set() + + waiter = threading.Thread(target=wait_for_write_lock) + waiter.start() + time.sleep(0.1) + assert not waiter_finished.is_set(), "waiter finished while holder still held lock" + + release_lock.set() + holder.join(timeout=5) + waiter.join(timeout=5) + assert not holder.is_alive(), "holder thread did not finish" + assert not waiter.is_alive(), "waiter thread did not finish" + assert not errors🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/core/test_workspace.py` around lines 227 - 255, The test uses a fixed sleep duration in hold_write_lock() to hold a write lock, but this creates a race condition: if the test thread pauses after the lock is acquired, the holder can commit before the waiter actually attempts to write, resulting in a false-positive test pass. Replace the time.sleep(hold_seconds) call in hold_write_lock() with a wait on a new threading.Event (release_lock), create a separate thread to run the waiter code (the conn2 insert and commit), use threading.Event.wait(timeout=...) to assert the waiter thread does not complete immediately while the holder lock is held, then signal the release_lock event to allow the holder to commit and join both threads to verify they completed.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/core/test_workspace.py`:
- Around line 227-255: The test uses a fixed sleep duration in hold_write_lock()
to hold a write lock, but this creates a race condition: if the test thread
pauses after the lock is acquired, the holder can commit before the waiter
actually attempts to write, resulting in a false-positive test pass. Replace the
time.sleep(hold_seconds) call in hold_write_lock() with a wait on a new
threading.Event (release_lock), create a separate thread to run the waiter code
(the conn2 insert and commit), use threading.Event.wait(timeout=...) to assert
the waiter thread does not complete immediately while the holder lock is held,
then signal the release_lock event to allow the holder to commit and join both
threads to verify they completed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 412c23a4-e6cd-4ecb-9665-7f73a2b13938
📒 Files selected for processing (1)
tests/core/test_workspace.py
Summary
Fixes #648. Core workspace SQLite connections used bare
sqlite3.connect()with nobusy_timeoutand no WAL journaling. Under parallel batch execution (multiple subprocesses + server background agent threads writing the same workspace DB), a concurrent writer got an immediatedatabase is lockedOperationalErrorinstead of waiting. This brings core in line withplatform_store/database.py, which already sets WAL +busy_timeout=5000.This PR also fixes the failing Deploy workflow (per request) — see the second section.
Changes (#648)
core/workspace.py: new private_open_db()helper applyingPRAGMA journal_mode=WAL+PRAGMA busy_timeout=5000. All 6 direct connects (_init_database,_ensure_schema_upgrades,create_or_load_workspace,get_workspace,get_db_connection,update_workspace_tech_stack) route through it.db_paththrough unchanged rather thanstr(db_path).str()would coerce a non-path (e.g. a test'sMagicMock) into a literal filename and silently create a junk DB file instead of raising — diverging from the priorsqlite3.connect()semantics. (str()was safe inplatform_storeonly because its path is always real.)adapters/e2b/budget.py:record_cloud_run/get_cloud_runuseget_db_connection(workspace).ui/routers/costs_v2.py:_query_costs/_open_workspace_connuse the shared_open_db(db_path)helper (these takedb_path: str, not aWorkspace, so routing throughget_db_connection(workspace)as the plan suggested wasn't applicable).TestConcurrentWritersintests/core/test_workspace.py— aThreadPoolExecutorparallel-writers test (8 threads × 15 writes) that asserts noOperationalErrorand that all writes persist, plus a PRAGMA assertion. Verified RED before the fix (reproduceddatabase is locked), GREEN after.Bonus: fix failing staging/production Deploy
The Deploy workflow has been failing the health check on every push to
mainsince #643 merged. Root cause (reproduced locally): #643 made the backend hard-fail on startup with the defaultAUTH_SECRETwhen auth is enabled (the default), butdeploy.ymlnever wroteAUTH_SECRETinto the generated.env.staging/.env.production. The server crash-looped under PM2, so/healthnever responded → "Verify deployment" failed after 12 attempts.AUTH_SECRET(from the existingAUTH_SECRETGitHub Actions secret) into the generated env files for both staging and production.AUTH_SECRETin.env.staging.example.Acceptance criteria
busy_timeoutand WAL.database is locked.Testing
uv run pytest tests/core tests/adapters tests/integration→ 2330 passed, 2 skipped, no stray files created.uv run ruff check→ clean.RuntimeErrorlocally under staging-like config (noAUTH_SECRET, auth default-on, self-hosted).Known limitations
busy_timeoutvia threads; it does not spawn separate OS processes (the production multi-writer scenario), but the connection-level pragmas are identical across both.AUTH_SECRETrepo secret already being set (it is, since 2026-01-03); the new guard makes a missing secret fail loudly.Summary by CodeRabbit
Documentation
AUTH_SECRETguidance, including secure random generation notes.Chores
Tests