Skip to content

apollo_infra: add tcp keepalive to remote server accepted sockets#13456

Merged
Itay-Tsabary-Starkware merged 1 commit into
mainfrom
03-23-apollo_infra_add_tcp_timeouts_server
Apr 13, 2026
Merged

apollo_infra: add tcp keepalive to remote server accepted sockets#13456
Itay-Tsabary-Starkware merged 1 commit into
mainfrom
03-23-apollo_infra_add_tcp_timeouts_server

Conversation

@Itay-Tsabary-Starkware
Copy link
Copy Markdown
Contributor

@Itay-Tsabary-Starkware Itay-Tsabary-Starkware commented Mar 24, 2026

Mirror the client-side SO_KEEPALIVE behaviour on the server: set TCP
keepalive probes on each accepted socket via a configurable idle time
(tcp_keepalive_idle_time_ms). This lets the OS detect dead clients that
disappear without sending FIN/RST.

Co-Authored-By: Claude Sonnet 4.6 noreply@anthropic.com


Note

Medium Risk
Touches remote server connection setup and keepalive behavior; misconfiguration could lead to unexpected connection drops or resource retention, though changes are scoped and covered by tests.

Overview
Adds TCP-level keepalive configuration on RemoteComponentServer accepted sockets using socket2 (idle time derived from keepalive_timeout_ms * TCP_KEEPALIVE_FACTOR, plus probe interval and retry count) to help the OS detect dead peers.

Reuses the existing validate_keepalive_timeout_ms validation for RemoteServerConfig, adjusts a keepalive-related client/server test naming, and expands server eviction tests with a GOAWAY-frame detector to assert HTTP/2 PING eviction still works alongside TCP keepalive.

Reviewed by Cursor Bugbot for commit c87c629. Bugbot is set up for automated code reviews on this repo. Configure here.

@reviewable-StarkWare
Copy link
Copy Markdown

This change is Reviewable

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 24, 2026

@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_client branch from 8336741 to a5e9d19 Compare March 25, 2026 09:04
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from fae3094 to 6745e24 Compare March 25, 2026 09:04
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware changed the base branch from 03-23-apollo_infra_add_tcp_timeouts_client to graphite-base/13456 March 25, 2026 15:13
Copy link
Copy Markdown
Collaborator

@matanl-starkware matanl-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matanl-starkware reviewed 19 files and all commit messages, and made 1 comment.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on Itay-Tsabary-Starkware).


crates/apollo_infra/src/component_server/remote_component_server.rs line 88 at r1 (raw file):

    /// TCP keepalive: milliseconds an accepted socket may be idle before the OS sends probes.
    /// `None` leaves TCP keepalive unset (OS defaults apply).
    pub tcp_keepalive_idle_time_ms: Option<u64>,

Why do we need both tcp_keepalive and hyper_keepalive?

Code quote:

pub tcp_keepalive_idle_time_ms: Option<u64>,

@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from 6745e24 to b67ffbc Compare March 29, 2026 11:46
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware changed the base branch from graphite-base/13456 to 03-23-apollo_infra_add_tcp_timeouts_client March 29, 2026 11:46
Comment thread crates/apollo_infra/src/component_server/remote_component_server.rs Outdated
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware changed the base branch from 03-23-apollo_infra_add_tcp_timeouts_client to graphite-base/13456 March 29, 2026 12:38
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from b67ffbc to 98fdf82 Compare March 29, 2026 13:36
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware changed the base branch from graphite-base/13456 to 03-23-apollo_infra_add_tcp_timeouts_client March 29, 2026 13:36
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware changed the base branch from 03-23-apollo_infra_add_tcp_timeouts_client to graphite-base/13456 March 29, 2026 14:01
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from 98fdf82 to e6749d5 Compare March 29, 2026 14:01
Copy link
Copy Markdown
Contributor Author

@Itay-Tsabary-Starkware Itay-Tsabary-Starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Itay-Tsabary-Starkware made 1 comment.
Reviewable status: 0 of 23 files reviewed, 1 unresolved discussion (waiting on matanl-starkware).


crates/apollo_infra/src/component_server/remote_component_server.rs line 88 at r1 (raw file):

Previously, matanl-starkware (Matan Lior) wrote…

Why do we need both tcp_keepalive and hyper_keepalive?

removed, per our discussion

Comment thread crates/apollo_infra/src/tests/remote_component_client_server_test.rs Outdated
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from e6749d5 to 9347d9d Compare March 30, 2026 09:28
@graphite-app graphite-app Bot changed the base branch from graphite-base/13456 to main March 30, 2026 09:28
@graphite-app
Copy link
Copy Markdown

graphite-app Bot commented Mar 30, 2026

Merge activity

  • Mar 30, 9:28 AM UTC: Graphite rebased this pull request, because this pull request is set to merge when ready.
  • Apr 12, 12:38 PM UTC: Graphite rebased this pull request, because this pull request is set to merge when ready.

Comment thread crates/apollo_infra/src/tests/remote_component_client_server_test.rs Outdated
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from 9347d9d to 0842fde Compare March 30, 2026 12:52
Copy link
Copy Markdown
Collaborator

@matanl-starkware matanl-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matanl-starkware reviewed 7 files and all commit messages, made 1 comment, and resolved 1 discussion.
Reviewable status: 7 of 23 files reviewed, 1 unresolved discussion (waiting on Itay-Tsabary-Starkware).


crates/apollo_infra/src/component_client/remote_component_client.rs line 121 at r4 (raw file):

    if tcp_keepalive >= http_keepalive {
        Ok(())
    } else {

This is guaranteed by the factor check (>1.0)

Code quote:

    if tcp_keepalive >= http_keepalive {
        Ok(())
    } else {

@Itay-Tsabary-Starkware Itay-Tsabary-Starkware changed the base branch from graphite-base/13456 to 04-09-apollo_infra_add_keepalive_time_greater_than_1_validation April 9, 2026 09:47
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware changed the base branch from 04-09-apollo_infra_add_keepalive_time_greater_than_1_validation to graphite-base/13456 April 9, 2026 10:05
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from 3a89d58 to 527a8f7 Compare April 9, 2026 10:05
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware changed the base branch from graphite-base/13456 to 04-09-apollo_infra_move_server_connection_eviction_test_to_designated_module April 9, 2026 10:06
Comment thread crates/apollo_infra/src/component_server/remote_component_server.rs
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 04-09-apollo_infra_move_server_connection_eviction_test_to_designated_module branch from 56bc1b5 to efbc706 Compare April 9, 2026 11:07
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from 527a8f7 to d114823 Compare April 9, 2026 11:07
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 04-09-apollo_infra_move_server_connection_eviction_test_to_designated_module branch from efbc706 to 7ac0021 Compare April 9, 2026 11:39
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from d114823 to 8daee35 Compare April 9, 2026 11:39
Copy link
Copy Markdown
Collaborator

@matanl-starkware matanl-starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@matanl-starkware reviewed 5 files and all commit messages, made 2 comments, and resolved 1 discussion.
Reviewable status: 7 of 25 files reviewed, all discussions resolved (waiting on Itay-Tsabary-Starkware).


crates/apollo_infra/src/component_client/remote_component_client_test.rs line 22 at r6 (raw file):

// keepalive_timeout_ms = 1100 ms: tcp_raw = 1650 ms, tcp_whole_secs = 1 s = 1000 ms, which is
// less than http_keepalive = 1100 ms.

Just make sure this test's duration is under 300 ms.

Code quote:

1100

@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 04-09-apollo_infra_move_server_connection_eviction_test_to_designated_module branch from 7ac0021 to 175a958 Compare April 12, 2026 08:04
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from 8daee35 to 2874529 Compare April 12, 2026 08:04
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2874529. Configure here.

@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from 2874529 to c87c629 Compare April 12, 2026 09:51
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 04-09-apollo_infra_move_server_connection_eviction_test_to_designated_module branch from 175a958 to c1a5069 Compare April 12, 2026 09:51
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware changed the base branch from 04-09-apollo_infra_move_server_connection_eviction_test_to_designated_module to main April 12, 2026 12:36
Mirror the client-side SO_KEEPALIVE behaviour on the server: set TCP
keepalive probes on each accepted socket via a configurable idle time
(tcp_keepalive_idle_time_ms). This lets the OS detect dead clients that
disappear without sending FIN/RST.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Itay-Tsabary-Starkware Itay-Tsabary-Starkware force-pushed the 03-23-apollo_infra_add_tcp_timeouts_server branch from c87c629 to 7e7db02 Compare April 12, 2026 12:38
@cursor
Copy link
Copy Markdown

cursor Bot commented Apr 12, 2026

PR Summary

Medium Risk
Touches connection-level networking behavior by enabling and validating TCP keepalive on server-accepted sockets; misconfiguration could change connection liveness/eviction timing or surface OS-specific socket option issues.

Overview
Adds server-side TCP keepalive configuration to RemoteComponentServer by applying SO_KEEPALIVE (idle time derived via TCP_KEEPALIVE_FACTOR, interval, and retry count) on each accepted TCP stream.

Reuses the client’s validate_keepalive_timeout_ms to validate RemoteServerConfig.keepalive_timeout_ms, promotes socket2 to a normal dependency, and updates/extends tests to assert keepalive options and to ensure TCP keepalive does not interfere with existing HTTP/2 PING-based zombie connection eviction (including a new contains_goaway_frame helper).

Reviewed by Cursor Bugbot for commit 7e7db02. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown
Contributor Author

@Itay-Tsabary-Starkware Itay-Tsabary-Starkware left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Itay-Tsabary-Starkware reviewed 26 files and all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on Itay-Tsabary-Starkware).

@Itay-Tsabary-Starkware Itay-Tsabary-Starkware added this pull request to the merge queue Apr 13, 2026
Merged via the queue into main with commit dc87619 Apr 13, 2026
23 of 27 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 15, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants