Commit 48418ca

DeviaVir and claude committed
daemon: add configurable max-age to recycle RPC connections
Long-lived daemon RPC connections stay pinned to a single backend for their whole lifetime. When electrs connects through a load balancer such as a Kubernetes ClusterSetIP (`*.clusterset.local`), a connection established before a node rotation keeps routing to the original backend via the existing TCP/conntrack flow, even after healthier/closer backends become available. The connection is only re-established on error, so a still-working-but-stale endpoint is never rebalanced.

Add a `--daemon-rpc-conn-max-age` option (seconds). When a connection exceeds the configured age it is proactively recycled before the next request, re-establishing the TCP connection so the load balancer can re-select a backend. Defaults to 0 = unlimited (never recycle), so behavior is unchanged unless explicitly enabled.

The age check is also applied to the per-thread connections used for parallel RPC requests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

daemon: make max-age connection recycling best-effort

Proactive recycling previously called the infinite-retry reconnect path while holding the daemon connection mutex, before sending the request. A transient "new connections fail" event at the load balancer could therefore block all requests on that connection instead of continuing to use the existing, still-healthy socket -- turning an LB hiccup into an electrs outage when --daemon-rpc-conn-max-age is enabled.

Split tcp_connect() into a single-attempt tcp_connect_once() (primary then fallback, no retry/backoff) and keep the looping tcp_connect() for startup and post-failure reconnects, where there is no usable socket to fall back to. Max-age recycling now uses try_reconnect_once(): on success the connection is swapped, on failure we log and keep the existing connection, retrying recycling on a later request. Real send/recv failures still go through the existing infinite reconnect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

daemon: address Copilot review nits

- config: parse --daemon-rpc-conn-max-age via value_t_or_exit! for consistent clap error handling instead of a manual parse + panic!.
- daemon: store the actually-connected address (primary or fallback) on Connection and log it when recycling, so diagnostics aren't misleading when connected to the fallback daemon.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

daemon: rate-limit failed recycles, add metric + tests

Address review feedback on proactive max-age recycling:

- Blocker: a failed recycle attempt kept the existing connection (good) but did not update any timestamp, so is_expired() stayed true and every subsequent request re-attempted the recycle first -- each failed attempt blocking up to DAEMON_CONNECTION_TIMEOUT under the connection mutex. During a sustained "new connections fail" event this turned every fast RPC into a request paying a full connect timeout. Now a failed attempt records last_recycle_attempt and a cooldown (DAEMON_CONN_RECYCLE_COOLDOWN, default 30s) gates retries, so the old socket keeps serving requests at full speed between attempts.
- Extract the recycle decision into a pure `recycle_due()` helper and cover it with unit tests (max-age boundary, None, and cooldown).
- Add a daemon_rpc_conn_recycled{result="ok|failed"} counter so recycle behavior is observable in prod.
- tcp_connect_once no longer warns per-attempt; it returns one descriptive error that callers log, avoiding double log lines on the recycle path. The startup/error loop logs that error + backoff.
- Document in --daemon-rpc-conn-max-age help that the reconnect is inline on the request path, so the value should be generous (minutes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
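The recycle decision and the best-effort swap described in the message can be sketched roughly as follows. This is an illustration only, not the code from this commit: the signature of `recycle_due()`, the `Conn` wrapper, `maybe_recycle()`, and the `RECYCLE_COOLDOWN` constant are all assumptions, and treating `age == max_age` as expired is a choice made here because the message does not state which side of the boundary the real helper takes.

```rust
use std::time::{Duration, Instant};

/// Assumed stand-in for DAEMON_CONN_RECYCLE_COOLDOWN (the message gives 30s).
const RECYCLE_COOLDOWN: Duration = Duration::from_secs(30);

/// Pure decision helper (hypothetical signature): is a recycle attempt due?
fn recycle_due(
    max_age: Option<Duration>,
    age: Duration,
    since_last_attempt: Option<Duration>,
    cooldown: Duration,
) -> bool {
    match max_age {
        None => false,                       // 0 / unlimited: never recycle
        Some(limit) if age < limit => false, // connection is still young
        Some(_) => match since_last_attempt {
            None => true,                         // no failed attempt on record
            Some(elapsed) => elapsed >= cooldown, // rate-limit retries
        },
    }
}

/// Hypothetical connection wrapper; `S` stands in for the real socket type.
struct Conn<S> {
    socket: S,
    established: Instant,
    last_recycle_attempt: Option<Instant>,
}

impl<S> Conn<S> {
    /// Best-effort recycle: swap the socket on success, keep it on failure.
    /// Returns `None` when no attempt was due, otherwise whether it succeeded.
    fn maybe_recycle<E>(
        &mut self,
        now: Instant,
        max_age: Option<Duration>,
        connect_once: impl FnOnce() -> Result<S, E>,
    ) -> Option<bool> {
        let age = now.duration_since(self.established);
        let since_attempt = self.last_recycle_attempt.map(|t| now.duration_since(t));
        if !recycle_due(max_age, age, since_attempt, RECYCLE_COOLDOWN) {
            return None;
        }
        match connect_once() {
            Ok(fresh) => {
                self.socket = fresh;
                self.established = now;
                self.last_recycle_attempt = None;
                Some(true)
            }
            Err(_) => {
                // Keep serving on the old socket; the cooldown gates the next try.
                self.last_recycle_attempt = Some(now);
                Some(false)
            }
        }
    }
}
```

Passing `now` and a `connect_once` closure keeps the sketch deterministic under test. In real use the closure would wrap a single-attempt connect such as the `tcp_connect_once()` named above.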
1 parent d7c2d33 commit 48418ca

5 files changed

Lines changed: 246 additions & 30 deletions

File tree

src/bin/electrs.rs

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@ fn run_server(config: Arc<Config>, salt_rwlock: Arc<RwLock<String>>) -> Result<(
         config.network_type,
         signal.clone(),
         &metrics,
+        config.daemon_conn_max_age,
     )?);
     info!("opening database at {}", config.db_path.display());
     let store = Arc::new(Store::open(&config, &metrics, true));

src/bin/tx-fingerprint-stats.rs

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ fn main() {
             config.network_type,
             signal,
             &metrics,
+            config.daemon_conn_max_age,
         )
         .unwrap(),
     );

src/config.rs

Lines changed: 16 additions & 0 deletions
@@ -5,6 +5,7 @@ use std::net::SocketAddr;
 use std::net::ToSocketAddrs;
 use std::path::{Path, PathBuf};
 use std::sync::Arc;
+use std::time::Duration;
 use stderrlog;
 
 use crate::chain::Network;
@@ -27,6 +28,7 @@ pub struct Config {
     pub daemon_rpc_addr: SocketAddr,
     pub daemon_rpc_fallback_addr: Option<SocketAddr>,
     pub daemon_parallelism: usize,
+    pub daemon_conn_max_age: Option<Duration>,
     pub cookie: Option<String>,
     pub electrum_rpc_addr: SocketAddr,
     pub http_addr: SocketAddr,
@@ -177,6 +179,13 @@ impl Config {
                     .help("Number of JSONRPC requests to send in parallel")
                     .default_value("4")
             )
+            .arg(
+                Arg::with_name("daemon_rpc_conn_max_age")
+                    .long("daemon-rpc-conn-max-age")
+                    .help("Max age (in seconds) of a daemon RPC TCP connection before it is proactively recycled. Recycling re-establishes the connection, letting a load balancer (e.g. a Kubernetes ClusterSetIP) re-select a backend after node rotations. The reconnect happens inline on the next request, so prefer a generous value (minutes, not seconds) to avoid periodic latency spikes. 0 = unlimited / never recycle (default)")
+                    .default_value("0")
+                    .takes_value(true),
+            )
             .arg(
                 Arg::with_name("monitoring_addr")
                     .long("monitoring-addr")
@@ -425,6 +434,12 @@ impl Config {
             .value_of("daemon_rpc_fallback_addr")
             .map(|e| str_to_socketaddr(e, "Bitcoin Fallback RPC"));
 
+        let daemon_conn_max_age: Option<Duration> =
+            match value_t_or_exit!(m, "daemon_rpc_conn_max_age", u64) {
+                0 => None, // 0 = unlimited / never recycle
+                secs => Some(Duration::from_secs(secs)),
+            };
+
         let electrum_rpc_addr: SocketAddr = str_to_socketaddr(
             m.value_of("electrum_rpc_addr")
                 .unwrap_or(&format!("127.0.0.1:{}", default_electrum_port)),
@@ -494,6 +509,7 @@ impl Config {
             daemon_rpc_addr,
             daemon_rpc_fallback_addr,
             daemon_parallelism: value_t_or_exit!(m, "daemon_parallelism", usize),
+            daemon_conn_max_age,
             cookie,
             utxos_limit: value_t_or_exit!(m, "utxos_limit", usize),
             electrum_rpc_addr,
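The config change maps the raw seconds value to `Option<Duration>`, with 0 as the "never recycle" sentinel. A minimal standalone sketch of that mapping, outside clap, could look like the block below. The helper names `max_age_from_secs` and `parse_max_age` are hypothetical and do not appear in the commit, which does the mapping inline with `value_t_or_exit!`.

```rust
use std::num::ParseIntError;
use std::time::Duration;

/// Hypothetical helper mirroring the `match` in the hunk above:
/// 0 means unlimited, any other value is a max age in seconds.
fn max_age_from_secs(secs: u64) -> Option<Duration> {
    match secs {
        0 => None,
        s => Some(Duration::from_secs(s)),
    }
}

/// Hypothetical string front end. The real code lets clap report the
/// parse error and exit; here the error is returned to the caller.
fn parse_max_age(raw: &str) -> Result<Option<Duration>, ParseIntError> {
    raw.trim().parse::<u64>().map(max_age_from_secs)
}
```

Representing "disabled" as `None` rather than a zero `Duration` means callers cannot mistake the sentinel for an age that has already expired.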
