Skip to content

fix: driver test suite & sqlite reliability#4559

Merged
MasterPtato merged 6 commits intomainfrom
04-05-wip_fix_whatever_is_going_on_with_envoys
Apr 8, 2026
Merged

fix: driver test suite & sqlite reliability#4559
MasterPtato merged 6 commits intomainfrom
04-05-wip_fix_whatever_is_going_on_with_envoys

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app bot commented Apr 5, 2026

🚅 Deployed to the rivet-pr-4559 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web Apr 8, 2026 at 9:08 pm
frontend-cloud 🕒 Building (View Logs) Web Apr 8, 2026 at 9:08 pm
frontend-inspector 🕒 Building (View Logs) Web Apr 8, 2026 at 9:08 pm
mcp-hub ✅ Success (View Logs) Web Apr 5, 2026 at 11:57 am
website ❌ Build Failed (View Logs) Web Apr 5, 2026 at 11:57 am
ladle ❌ Build Failed (View Logs) Web Apr 5, 2026 at 11:57 am

@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 5, 2026 11:57 Destroyed
@claude
Copy link
Copy Markdown

claude bot commented Apr 5, 2026

PR Review: fix driver test suite and sqlite reliability

Summary: This PR fixes several correctness bugs and refactors the SQLite/KV channel architecture for reliability. The changes are broadly well-structured. A few items warrant attention before merging.

Bugs Fixed (Confirmed Good)

  • pegboard-envoy/src/conn.rs: Correct fix. old_last_ping_ts_entry was reading from create_ts_key instead of last_ping_ts_key. Classic copy-paste bug.
  • envoy-client/src/tasks/envoy/index.ts: lostTimeout was never updated because handleConnMessage/handleConnClose returned void. Returning the updated timeout is the right fix.
  • actor/conn/mod.ts: Making disconnect() idempotent via disconnectPromise is the correct pattern for preventing double-cleanup races.

Concerns

  1. KV channel single-writer guarantee removed (high priority): engine/packages/pegboard-kv-channel/src/lib.rs removes the global KvChannelState struct and the per-actor Mutex locking mechanism entirely. CLAUDE.md states the KV channel enforces single-writer locks per actor. The PR description does not mention this guarantee being intentionally relaxed. If two connections from different processes open the same actor KV channel concurrently, there is no cross-connection exclusion anymore. Please confirm this is intentional and, if so, update the CLAUDE.md invariant accordingly.

  2. void actor.processMessage() lost error observability: actor/router-websocket-endpoints.ts now fires action messages with void actor.processMessage() to allow concurrency. Errors thrown by processMessage for actions will be silently swallowed. Consider wrapping with a catch that at minimum logs the error.

  3. Legacy KV read path removed without migration check: engine/packages/epoxy/src/ops/kv/read_value.rs removes the legacy v2-format value read case (case 3: v1 subspace with v2 key layout). If any live actors still have data written in this intermediate format, reads will silently return None. Confirm the migration window has fully closed before this ships to production.

  4. NativeSqliteDatabase retry count is a magic number: rivetkit-typescript/packages/rivetkit/src/db/native-adapter.ts retries up to 3 times on stale KV channel errors with an inlined check (attempt >= 3). Extract to a named constant MAX_RECONNECT_ATTEMPTS = 3 with a brief comment.

Minor notes: actor/instance/mod.ts span-end fix is subtle but correct; a comment would help future readers. Shared test state in driver-engine.test.ts needs full teardown between test cases to prevent state leakage. turbo.json for sqlite-native is a good addition; verify outputs covers all napi-rs cross-compilation targets. broadcastProcessEvent correctly re-throws non-ActorStopping errors and silently drops ActorStopping.

Overall the PR is headed in the right direction. The KV single-writer question and the legacy read-path removal are the two items most worth validating before merge.

@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 28d5a1f to 8af4472 Compare April 5, 2026 12:01
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 5, 2026 12:01 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 8af4472 to 930ee95 Compare April 5, 2026 12:11
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 5, 2026 12:11 Destroyed
@NathanFlurry NathanFlurry changed the base branch from main to graphite-base/4559 April 6, 2026 02:47
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 930ee95 to 6d9fd88 Compare April 6, 2026 02:47
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4559 to 04-05-fix_cache_make_in_memory_cache_global April 6, 2026 02:47
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 6, 2026 02:47 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 6d9fd88 to 2f73a59 Compare April 6, 2026 17:23
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 6, 2026 17:23 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 2f73a59 to 5b25ad0 Compare April 6, 2026 17:26
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 6, 2026 17:26 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 5b25ad0 to b8143ad Compare April 6, 2026 17:27
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 6, 2026 17:27 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from b8143ad to 33436b9 Compare April 6, 2026 17:33
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 6, 2026 17:33 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 33436b9 to 6e18bf9 Compare April 6, 2026 17:37
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 6, 2026 17:37 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 6e18bf9 to 7597a94 Compare April 6, 2026 17:45
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 6, 2026 17:45 Destroyed
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 7597a94 to c6d8ae5 Compare April 6, 2026 18:55
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 6, 2026 18:55 Destroyed
@MasterPtato MasterPtato force-pushed the 04-05-fix_cache_make_in_memory_cache_global branch from 9f06a14 to e815bd1 Compare April 6, 2026 19:54
@NathanFlurry NathanFlurry changed the base branch from 04-06-fix_misc_token_fixes to graphite-base/4559 April 7, 2026 17:07
@NathanFlurry NathanFlurry force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 8b3a33e to a112a4f Compare April 7, 2026 17:08
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 7, 2026 17:08 Destroyed
@NathanFlurry NathanFlurry changed the base branch from graphite-base/4559 to 04-06-fix_remove_serverless_token April 7, 2026 17:08
@MasterPtato MasterPtato force-pushed the 04-06-fix_remove_serverless_token branch 2 times, most recently from b49a74d to 4de0f10 Compare April 8, 2026 20:14
@MasterPtato MasterPtato force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from a112a4f to 67870e8 Compare April 8, 2026 20:30
@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4559 April 8, 2026 20:30 Destroyed
Copy link
Copy Markdown
Contributor

MasterPtato commented Apr 8, 2026

Merge activity

  • Apr 8, 8:52 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Apr 8, 9:08 PM UTC: Graphite rebased this pull request as part of a merge.
  • Apr 8, 9:09 PM UTC: @MasterPtato merged this pull request with Graphite.

@MasterPtato MasterPtato changed the base branch from 04-06-fix_remove_serverless_token to graphite-base/4559 April 8, 2026 21:04
@MasterPtato MasterPtato changed the base branch from graphite-base/4559 to main April 8, 2026 21:06
Reverts all file-system driver workarounds, test timing hacks,
client retry/polling logic, sleep timeout overrides, DB proxy
shutdown budgets, and connection-manager counters that were only
needed for FS driver tests. Keeps engine-relevant fixes: action
fire-and-forget in WS endpoints, conn single-flight disconnect,
DB unification, KV channel multiplexing, envoy bugfixes.
Normal (envoy-based) runner configs never had protocol_version set
by the metadata poller since that only runs for serverless configs.
This caused all actors on normal configs to be created with the v1
workflow, which uses the old runner architecture and never becomes
connectable through envoys.

The fix treats normal runner configs as v2 since they inherently
use the envoy protocol.
Normal (envoy-based) runner configs never had protocol_version set
by the metadata poller since that only runs for serverless configs.
This caused all actors on normal configs to be created with the v1
workflow, which uses the old runner architecture and never becomes
connectable through envoys.

The fix treats normal runner configs as v2 since they inherently
use the envoy protocol.
The cache module changes (global OnceLock for InMemoryDriver and
in_flight map) are not required for engine driver tests. Reverted
to main.

Added guidance to engine/CLAUDE.md about using concurrent containers
instead of Mutex<HashMap<...>>.
@MasterPtato MasterPtato force-pushed the 04-05-wip_fix_whatever_is_going_on_with_envoys branch from 67870e8 to 7cfcbcd Compare April 8, 2026 21:07
@MasterPtato MasterPtato mentioned this pull request Apr 9, 2026
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants