Add troubleshooting guide for actor connection issues #4862

Open
NathanFlurry wants to merge 1 commit into main from claude/update-troubleshooting-docs-8Rdwq
@NathanFlurry
Member

Description

This change adds a new troubleshooting section to the actors documentation that helps users diagnose and resolve WebSocket connection failures to actors. The guide includes:

  • Explanation of what happens when a client cannot connect to a running actor
  • How to interpret WebSocket close frames and error codes returned by the gateway
  • Instructions for using the ray_id identifier to report issues to Rivet engineers
  • HTTP ping command to test actor reachability without opening a WebSocket
  • Information about where connection errors are surfaced in the dashboard (close reason and provider pool warnings)

This documentation addition addresses a common troubleshooting scenario and provides users with concrete steps to diagnose connection problems.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

N/A - This is a documentation-only change.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

https://claude.ai/code/session_01AeY73t1oo9M2KTV4m4Y7sY

@railway-app

railway-app Bot commented May 1, 2026

🚅 Deployed to the rivet-pr-4862 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| frontend-cloud | ✅ Success (View Logs) | Web | May 7, 2026 at 2:48 pm |
| website | 😴 Sleeping (View Logs) | Web | May 1, 2026 at 7:06 am |
| ladle | ✅ Success (View Logs) | Web | May 1, 2026 at 6:55 am |
| frontend-inspector | ❌ Build Failed (View Logs) | Web | May 1, 2026 at 6:55 am |
| mcp-hub | ✅ Success (View Logs) | Web | May 1, 2026 at 6:54 am |
| kitchen-sink | ❌ Build Failed (View Logs) | Web | May 1, 2026 at 6:53 am |

@claude
Contributor

claude Bot commented May 1, 2026

Code Review

This PR adds a "Cannot connect to an actor" troubleshooting section to the actors docs. The content is focused and fills a genuine documentation gap. A few things to verify and one potential accuracy issue:


Accuracy: The curl ping command requires actor-side implementation

curl -v '{engine url}/gateway/{actor id}/ping'

Per engine/packages/guard/src/routing/actor_path.rs, the /gateway/{actor_id}/ping path is forwarded directly to the actor's HTTP handler — the /ping suffix is the stripped_path the actor receives. This means the command only works if the actor itself exposes a /ping HTTP route. Many actors are WebSocket-only and won't respond to this.

The guard does have a built-in endpoint (/health/ws), but that tests the guard's own WebSocket loop rather than actor reachability.

Suggestion: Either clarify that the actor must implement a /ping (or any HTTP) route, or note "replace /ping with any HTTP endpoint your actor responds to." Without this caveat, users may try the command with a WebSocket-only actor and get a confusing error that makes them think the actor is unreachable when it isn't.
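
To make the caveat concrete, here is a minimal sketch of the same check with the route left as a parameter instead of hard-coded to /ping. The helper names (`gatewayUrl`, `probeActor`, `HttpGet`), the example engine URL, and the actor ID are all hypothetical; only the `{engine url}/gateway/{actor id}/{path}` shape comes from the curl command above.

```typescript
// Builds "{engine url}/gateway/{actor id}/{path}", mirroring the curl command.
// `path` must be an HTTP route the actor itself serves; "/ping" only works
// if the actor implements it (assumption taken from the review note above).
function gatewayUrl(engineUrl: string, actorId: string, path: string): string {
  const base = engineUrl.replace(/\/+$/, ""); // drop trailing slashes
  const suffix = path.replace(/^\/+/, ""); // drop leading slashes
  return `${base}/gateway/${encodeURIComponent(actorId)}/${suffix}`;
}

// Minimal response shape so the probe can be exercised with a stub.
type HttpGet = (url: string) => Promise<{ status: number }>;

// Sends one GET through the gateway and returns the HTTP status code.
// It does not interpret the status: a non-2xx reply from a WebSocket-only
// actor does not by itself show that the actor is unreachable.
async function probeActor(
  get: HttpGet,
  engineUrl: string,
  actorId: string,
  path: string,
): Promise<number> {
  const res = await get(gatewayUrl(engineUrl, actorId, path));
  return res.status;
}

// Hypothetical engine URL and actor ID, for illustration only.
console.log(gatewayUrl("https://engine.example.com/", "actor-123", "/ping"));
```

In a runtime with a global `fetch`, the probe could be called as `probeActor((u) => fetch(u), engineUrl, actorId, "/ping")`, swapping "/ping" for whichever HTTP route the actor responds to.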


Verified accurate

  • Close code 1011: err_to_close_frame in engine/packages/guard-core/src/utils.rs uses CloseCode::Error for all non-normal closes, which maps to 1011. This is distinct from the 1008 convention in CLAUDE.md, which only applies to auth failures. ✓
  • Reason format guard.websocket_service_unavailable#<ray_id> — matches format!("{}.{}#{}", group, code, ray_id) in the same file (line 268). ✓
  • ~10 seconds timeout — matches ACTOR_READY_TIMEOUT: Duration::from_secs(10) in engine/packages/guard/src/routing/pegboard_gateway/mod.rs. ✓
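
As an illustration of the reason format verified above, the following sketch splits a close reason of the form `<group>.<code>#<ray_id>` into its parts on the client side. `parseCloseReason` and `GatewayCloseReason` are hypothetical names, not part of any Rivet API, and the ray_id in the example is made up.

```typescript
// Parts of a gateway close reason shaped like "<group>.<code>#<ray_id>".
interface GatewayCloseReason {
  group: string;
  code: string;
  rayId: string;
}

// Hypothetical helper: returns null when the reason does not match the
// "<group>.<code>#<ray_id>" shape instead of guessing at its parts.
function parseCloseReason(reason: string): GatewayCloseReason | null {
  const hash = reason.indexOf("#");
  if (hash < 0) return null;
  const head = reason.slice(0, hash); // "<group>.<code>"
  const rayId = reason.slice(hash + 1);
  const dot = head.indexOf(".");
  if (dot <= 0 || dot === head.length - 1 || rayId.length === 0) return null;
  return { group: head.slice(0, dot), code: head.slice(dot + 1), rayId };
}

// Example with a made-up ray_id.
const parsed = parseCloseReason("guard.websocket_service_unavailable#abc123");
console.log(parsed);
```

A client could call this from a WebSocket `close` handler with the event's `reason` string and include the resulting `rayId` when reporting the failure.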

Minor suggestions

  • "engine url" is under-defined. For managed Rivet this would be https://api.rivet.dev; for self-hosted it varies. A brief clarification or link to connection docs would help users know what to substitute.
  • Other close codes exist (websocket_service_timeout, actor_ready_timeout, upstream_error, etc.). Noting that guard.websocket_service_unavailable is just one example, and that the <group>.<code>#<ray_id> format applies broadly, would make the section more useful for debugging other failure modes.
  • The PR diff shows the context line "- If you need more warnings, set..." but the current file on main reads "- If you need more diagnostics, set...". The branch may need a rebase.
