Skip to content

chore: move rivetkit to task model#4680

Draft
NathanFlurry wants to merge 1 commit into04-16-chore_rivetkit_to_rustfrom
04-19-chore_move_rivetkit_to_task_model
Draft

chore: move rivetkit to task model#4680
NathanFlurry wants to merge 1 commit into04-16-chore_rivetkit_to_rustfrom
04-19-chore_move_rivetkit_to_task_model

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented Apr 20, 2026

🚅 Deployed to the rivet-pr-4680 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web Apr 22, 2026 at 9:04 am
website 😴 Sleeping (View Logs) Web Apr 22, 2026 at 8:58 am
frontend-cloud ❌ Build Failed (View Logs) Web Apr 22, 2026 at 8:46 am
frontend-inspector ❌ Build Failed (View Logs) Web Apr 22, 2026 at 8:46 am
mcp-hub ✅ Success (View Logs) Web Apr 22, 2026 at 8:45 am
ladle ❌ Build Failed (View Logs) Web Apr 20, 2026 at 12:09 am

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 20, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry mentioned this pull request Apr 20, 2026
11 tasks
@claude
Copy link
Copy Markdown

claude Bot commented Apr 20, 2026

PR #4680: chore: move rivetkit to task model - Code Review (Updated)

Note: This is a large draft PR (~40k additions, ~11k deletions). The diff exceeds GitHub's API size limit; this review covers key architectural changes.


Overview

This PR replaces the previous SleepController-centric lifecycle model with an event-driven ActorTask model:

  • Introduces ActorTask owning lifecycle channels and driving the actor through a tokio::select! main loop.
  • Replaces scattered atomic counters with WorkRegistry backed by a new AsyncCounter utility.
  • Introduces LifecycleEvent as an internal event bus for cross-subsystem notifications (state mutations, inspector attach, sleep ticks).
  • Moves the multi-phase shutdown state machine (ShutdownPhase enum) into ActorTask with explicit boxed Future steps.
  • Migrates active_http_request_count in envoy-client from AtomicUsize to AsyncCounter so SleepController can register a zero-notify directly.

Architecture - Positive

  • Event-driven drain - replacing the 10ms polling loop in wait_for_sleep_idle_window with Notify wakeups eliminates the latency floor on graceful shutdown.
  • AsyncCounter design - clean RAII abstraction with wait_zero(deadline) and fan-out register_zero_notify. Well-tested with start_paused = true.
  • Explicit ShutdownPhase state machine - boxed Future steps (SendingFinalize -> AwaitingFinalizeReply -> DrainingBefore -> ...) are clearer than the old nested timeout logic.
  • ConnHandles iterator - RAII read-lock-backed with #[must_use]; the doc warning against holding across .await is correct.
  • Test coverage in tests/modules/task.rs (2,932 lines) - comprehensive across state saves, sleep grace, inspector serialization, shutdown drain, hibernation persistence, and alarms.

Issues and Concerns

1. Global test hooks - parallel test isolation risk

task.rs defines three #[cfg(test)] process-global statics (SHUTDOWN_CLEANUP_HOOK, LIFECYCLE_EVENT_HOOK, SHUTDOWN_REPLY_HOOK). Two concurrent tests can race: test A installs a hook, test B overwrites and clears it on drop, test A fires and finds nothing. The test_hook_lock async mutex in task.rs mitigates this within that file, but other test modules reaching these globals do not share that lock.

Recommendation: verify all hook-using tests hold test_hook_lock() for the full guard lifetime, or annotate with #[tokio::test(flavor = "current_thread")].

2. AsyncCounter::increment uses Ordering::Relaxed while load uses Ordering::Acquire

A concurrent load(Acquire) in wait_zero could observe stale zero after a Relaxed increment. The Notify-loop preserves eventual correctness, but the asymmetry is a footgun. Consider Ordering::Release on increment, or add a comment explaining why Relaxed is safe.

3. lookup_http_request_counter registers a zero-notify on every call

zero_observers is a Vec that grows with each registration. The retain in decrement prunes dead Weak entries, but live duplicates accumulate and send redundant wakeups. Make register_zero_notify idempotent, or assert it is called at most once per counter.

4. ConnHandles read lock - no compile-time enforcement of the await constraint

The #[must_use] doc warns against holding across .await, but for conn in ctx.conns() { ... .await } is syntactically valid and Rust won't catch it. Current callsites all use .collect() safely, but this is a trap. Consider keeping ConnHandles crate-private and exposing ctx.conns_snapshot() -> Vec<ConnHandle> for external callers.

5. #[allow(dead_code)] on WorkRegistry signals incomplete wiring

websocket_callback and shutdown_counter are declared but their RegionGuard/CountGuard callers are not visible in this diff. Confirm this is intentional scaffolding or wire it before landing.

6. Dual sleep-path for wired vs. legacy actors could race

context.rs says sleep() is a user-facing bridge and ActorTask owns the shutdown drain. sleep.rs says start_sleep_timer is a compatibility timer for contexts not wired to ActorTask. If a wired actor can still trigger start_sleep_timer, both timers race. Gate the legacy path with a debug_assert! or remove it for ActorTask-wired contexts.

7. HWS restore error path - new early-exit before running state lacks test

Hibernation restore now runs before "send running state". A failure sends ActorStateStopped(Error) without ever sending ActorStateRunning. Confirm the engine handles "stop without running" gracefully, and add a test for this error path.


Minor / Style

  • sleep_tick, state_save_tick, inspector_serialize_state_tick are three identical async fn(Option<Instant>) bodies - extract a shared helper.
  • wait_for_count/wait_for_state test helpers use a 50x10ms busy loop; consider start_paused = true + advance() for deterministic timing.
  • CountGuard type alias is unused per the dead_code suppression - remove until needed.

Summary

The architectural direction is correct and code quality is high. Main items before landing:

  1. Global test hook isolation - guard or document single-thread requirement.
  2. AsyncCounter::increment ordering - document Relaxed rationale or bump to Release.
  3. Dual sleep-path - gate or remove the legacy SleepController timer for ActorTask-owned contexts.
  4. WorkRegistry dead_code - confirm scaffolding intent or wire before landing.
  5. HWS restore error path test - add coverage for the new early-exit failure case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant