Skip to content

Commit 4d135a4

Browse files
dazzling-no-moreweb-flowclaude
authored
fix(tunnel): batch header read honors request_timeout_secs (#1088, #1108)
Fixes #1088 — under Full mode, a single slow Apps Script edge cascade-killed every in-flight tunnel session sharing its batch. Users on 1.9.21+ saw frequent 10s "batch timeout" errors and lost download progress on Telegram / browser sessions. ## Root cause `read_http_response` in `domain_fronter.rs` had a **hardcoded 10s header-read timeout** that ran *inside* `tunnel_batch_request_to` — independent of and shorter than the outer `tokio::time::timeout(batch_timeout, …)` in `fire_batch`. Apps Script cold starts routinely land in the 8-12s range (PR #1040's A/B recorded 4/30 H1 batches timing out at exactly 10s after the H2→H1 switch), so the inner cliff fired as a false-positive batch timeout well before `request_timeout_secs` (default 30s) could. Secondary: even with a parameterized timeout, the per-read `timeout(d, stream.read(...))` form would silently extend its budget if a peer drip-fed bytes just under `d` each — a slow edge could keep the loop alive past the outer `batch_timeout` and defeat the whole wiring. ## Fix (two changes in `domain_fronter.rs`) 1. **`tunnel_batch_request_to` passes `batch_timeout` to the header read** via new `read_http_response_with_header_timeout` helper. `Config::request_timeout_secs` is now the only knob controlling how long we wait for an Apps Script edge to start responding. Other callers (relay path, exit-node) keep the historical 10s value. 2. **Header read uses a single absolute deadline** (`tokio::time::timeout_at(deadline, …)`) instead of per-read `timeout()`. Total elapsed across all header reads is bounded by `header_read_timeout`, regardless of read cadence. ## Bonus (in `tunnel_client.rs`) 3. **`TunnelMux::reply_timeout` co-varies with `batch_timeout`**: computed at construction as `fronter.batch_timeout() + 5s slack` instead of the fixed 35s const. Operators raising `request_timeout_secs` no longer have sessions abandon `reply_rx` just before `fire_batch`'s HTTP round-trip would complete. ## Verified locally (on top of v1.9.23 / main after #1117 merge) - `cargo test --lib --release`: **231/231** ✅ (was 209 in v1.9.23 baseline; this PR adds 22 new tests covering the deadline/co-variance behavior) - `cargo build --release --features ui --bin mhrv-rs-ui`: clean ✅ ## Interaction with v1.9.20 (PR #1029) PR #1029 added `H1_OPEN_TIMEOUT_SECS = 8` to bound the TCP+TLS handshake in `open()`. That bound is **separate** from the header-read timeout this PR addresses — both bounds exist in the same call chain. Issue #1131 (BuffOvrFlw, just opened) reports `h1 open timed out after 8s` errors which are the `open()` bound firing, not the header-read bound. Worth a follow-up to make `H1_OPEN_TIMEOUT_SECS` parameterized too, but that's a separate change. Reviewed via Anthropic Claude. Co-Authored-By: dazzling-no-more <noreply@github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 4d2ce91 commit 4d135a4

2 files changed

Lines changed: 217 additions & 35 deletions

File tree

src/domain_fronter.rs

Lines changed: 122 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -3125,8 +3125,16 @@ impl DomainFronter {
31253125
entry.stream.write_all(&payload).await?;
31263126
entry.stream.flush().await?;
31273127

3128+
// Use the configured `request_timeout_secs` for the header read,
3129+
// not the hardcoded 10 s default. With Apps Script cold starts
3130+
// routinely landing in the 8–12 s range, the 10 s cliff was
3131+
// firing as a false-positive batch timeout (issue #1088), killing
3132+
// every in-flight tunnel session under it. The outer
3133+
// `tokio::time::timeout(batch_timeout, ...)` in `fire_batch`
3134+
// remains the authoritative bound on total batch round-trip time.
3135+
let batch_timeout = self.batch_timeout();
31283136
let (mut status, mut resp_headers, mut resp_body) =
3129-
read_http_response(&mut entry.stream).await?;
3137+
read_http_response_with_header_timeout(&mut entry.stream, batch_timeout).await?;
31303138

31313139
// Follow redirect chain
31323140
for _ in 0..5 {
@@ -3139,7 +3147,8 @@ impl DomainFronter {
31393147
);
31403148
entry.stream.write_all(req.as_bytes()).await?;
31413149
entry.stream.flush().await?;
3142-
let (s, h, b) = read_http_response(&mut entry.stream).await?;
3150+
let (s, h, b) =
3151+
read_http_response_with_header_timeout(&mut entry.stream, batch_timeout).await?;
31433152
status = s; resp_headers = h; resp_body = b;
31443153
}
31453154

@@ -4242,14 +4251,50 @@ fn parse_redirect(location: &str) -> (String, Option<String>) {
42424251

42434252
/// Read a single HTTP/1.1 response from the stream. Keep-alive safe: respects
42444253
/// Content-Length or chunked transfer-encoding.
4254+
///
4255+
/// Uses a 10 s *total* header-read deadline — the historical 10 s value
4256+
/// preserved for most callers (relay path, exit-node, etc.). Note the
4257+
/// semantics changed in this patch: the underlying loop now treats this
4258+
/// as an absolute deadline across all header reads, not a per-read budget
4259+
/// that would silently extend on drip-feed. The tunnel batch path overrides
4260+
/// the 10 s value via `read_http_response_with_header_timeout`, since the
4261+
/// configurable `request_timeout_secs` (default 30 s) is the authoritative
4262+
/// cliff there.
42454263
async fn read_http_response<S>(stream: &mut S) -> Result<(u16, Vec<(String, String)>, Vec<u8>), FronterError>
4264+
where
4265+
S: tokio::io::AsyncRead + Unpin,
4266+
{
4267+
read_http_response_with_header_timeout(stream, Duration::from_secs(10)).await
4268+
}
4269+
4270+
/// `read_http_response` with a caller-supplied header-read timeout. The
4271+
/// timeout applies only to the *initial* header-block read; the body-read
4272+
/// timeouts in this function are deliberately left at their fixed values
4273+
/// because once the response has started flowing, per-chunk stalls are a
4274+
/// separate signal from "Apps Script hasn't started writing yet."
4275+
///
4276+
/// The tunnel batch path passes `DomainFronter::batch_timeout()` so that
4277+
/// `Config::request_timeout_secs` is the *only* knob controlling how long
4278+
/// we wait for an Apps Script edge to start responding — the hardcoded 10 s
4279+
/// inner cliff was firing well before the outer `batch_timeout` in
4280+
/// `tunnel_client::fire_batch` could, masquerading as a 10 s "batch
4281+
/// timeout" in user logs (issue #1088).
4282+
async fn read_http_response_with_header_timeout<S>(
4283+
stream: &mut S,
4284+
header_read_timeout: Duration,
4285+
) -> Result<(u16, Vec<(String, String)>, Vec<u8>), FronterError>
42464286
where
42474287
S: tokio::io::AsyncRead + Unpin,
42484288
{
42494289
let mut buf = Vec::with_capacity(8192);
42504290
let mut tmp = [0u8; 8192];
4291+
// One deadline for the whole header read, not per-iteration. Otherwise
4292+
// a slow peer drip-feeding one byte just under `header_read_timeout`
4293+
// keeps this loop alive forever and defeats the outer `batch_timeout`
4294+
// wiring (the entire point of #1088's fix).
4295+
let deadline = tokio::time::Instant::now() + header_read_timeout;
42514296
let header_end = loop {
4252-
let n = timeout(Duration::from_secs(10), stream.read(&mut tmp)).await
4297+
let n = tokio::time::timeout_at(deadline, stream.read(&mut tmp)).await
42534298
.map_err(|_| FronterError::Timeout)??;
42544299
if n == 0 {
42554300
return Err(FronterError::BadResponse("connection closed before headers".into()));
@@ -5013,6 +5058,80 @@ mod tests {
50135058
assert_eq!(got_body, body);
50145059
}
50155060

5061+
/// Issue #1088. The tunnel batch path passes `batch_timeout` (default
5062+
/// 30 s, configurable up to 300 s) to `read_http_response_with_header_timeout`
5063+
/// so Apps Script cold starts in the 8-12 s range no longer trip a
5064+
/// hardcoded 10 s cliff. A regression that re-introduces the old 10 s
5065+
/// inner timeout — or that ignores the parameter entirely — would let
5066+
/// cold-start batches fail in the field while passing every existing
5067+
/// test. This locks the parameter down: headers arriving at virtual
5068+
/// T=15 s must succeed when the caller asked for a 30 s budget.
5069+
#[tokio::test(start_paused = true)]
5070+
async fn read_http_response_respects_configured_header_timeout() {
5071+
use tokio::io::AsyncWriteExt;
5072+
5073+
let (mut client_side, mut server_side) = tokio::io::duplex(8192);
5074+
let response = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n";
5075+
5076+
tokio::spawn(async move {
5077+
// Slow Apps Script edge: response doesn't start streaming
5078+
// for 15 s. Under a 10 s budget this would be Timeout; under
5079+
// the 30 s budget the caller passed it must succeed.
5080+
tokio::time::sleep(Duration::from_secs(15)).await;
5081+
server_side.write_all(response).await.unwrap();
5082+
});
5083+
5084+
let (status, _, body) = read_http_response_with_header_timeout(
5085+
&mut client_side,
5086+
Duration::from_secs(30),
5087+
)
5088+
.await
5089+
.expect("15 s response must succeed under 30 s header-read budget");
5090+
assert_eq!(status, 200);
5091+
assert!(body.is_empty());
5092+
}
5093+
5094+
/// The header-read deadline must be *total*, not reset on every read.
5095+
/// Without this, a peer that drip-feeds one byte just under the
5096+
/// per-read timeout keeps the loop alive forever and defeats the
5097+
/// outer `batch_timeout` wiring — defeating the whole point of
5098+
/// #1088's fix. This is the regression that would survive a naive
5099+
/// revert to `timeout(d, stream.read(...))` inside the loop, because
5100+
/// every individual read completes well under `d`. With the
5101+
/// `timeout_at(deadline, ...)` form, total elapsed exceeds the
5102+
/// deadline and we get `FronterError::Timeout`.
5103+
#[tokio::test(start_paused = true)]
5104+
async fn read_http_response_header_deadline_is_total_not_per_read() {
5105+
use tokio::io::AsyncWriteExt;
5106+
5107+
let (mut client_side, mut server_side) = tokio::io::duplex(8192);
5108+
// Header block is 38 bytes; drip-feeding at 3 s/byte takes 114 s
5109+
// total. Each individual read returns within 3 s — well under
5110+
// the 10 s budget — so per-read semantics would NOT detect the
5111+
// stall.
5112+
let response = b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n".to_vec();
5113+
5114+
tokio::spawn(async move {
5115+
for byte in response {
5116+
tokio::time::sleep(Duration::from_secs(3)).await;
5117+
server_side.write_all(&[byte]).await.unwrap();
5118+
server_side.flush().await.unwrap();
5119+
}
5120+
});
5121+
5122+
let result = read_http_response_with_header_timeout(
5123+
&mut client_side,
5124+
Duration::from_secs(10),
5125+
)
5126+
.await;
5127+
assert!(
5128+
matches!(result, Err(FronterError::Timeout)),
5129+
"drip-feed slower than the total deadline must time out — \
5130+
got {:?}",
5131+
result.map(|(s, _, _)| s)
5132+
);
5133+
}
5134+
50165135
#[tokio::test]
50175136
async fn parse_exit_node_response_unwraps_exit_node_envelope() {
50185137
// The exit-node path through Apps Script returns exit node's JSON

src/tunnel_client.rs

Lines changed: 95 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -45,11 +45,12 @@ const MAX_BATCH_OPS: usize = 50;
4545
// Script's typical response cliff — lives in `default_request_timeout_secs`
4646
// in `config.rs`.
4747

48-
/// Timeout for a session waiting for its batch reply. If the batch task
49-
/// is slow (e.g. one op in the batch has a dead target on the tunnel-node
50-
/// side), the session gives up and retries on the next tick rather than
51-
/// blocking indefinitely.
52-
const REPLY_TIMEOUT: Duration = Duration::from_secs(35);
48+
/// Slack added to the reply-timeout budget on top of `batch_timeout`.
49+
/// Covers spawn/encode overhead and a small margin for clock skew, so
50+
/// the session-side `reply_rx` doesn't fire just before `fire_batch`'s
51+
/// HTTP round-trip would have completed. No retry budget here — each
52+
/// batch makes exactly one attempt (see `fire_batch` docs).
53+
const REPLY_TIMEOUT_SLACK: Duration = Duration::from_secs(5);
5354

5455
/// How long we'll briefly hold the client socket after the local
5556
/// CONNECT/SOCKS5 handshake, waiting for the client's first bytes (the
@@ -280,6 +281,14 @@ pub struct TunnelMux {
280281
/// `(host, port)`, value is the expiry instant. Plain Mutex<HashMap> is
281282
/// fine: it's touched once per CONNECT (cheap) and once per failure.
282283
unreachable_cache: Mutex<HashMap<(String, u16), Instant>>,
284+
/// How long a session waits for its batch reply before giving up and
285+
/// retry-polling on the next tick. Computed at construction from
286+
/// `fronter.batch_timeout() + REPLY_TIMEOUT_SLACK` so the session-
287+
/// side `reply_rx` always outlives `fire_batch`'s single HTTP
288+
/// round-trip. Without runtime derivation, an operator who raises
289+
/// `request_timeout_secs` would see sessions abandon replies just
290+
/// before the batch would have completed.
291+
reply_timeout: Duration,
283292
}
284293

285294
impl TunnelMux {
@@ -311,6 +320,14 @@ impl TunnelMux {
311320
let step = if coalesce_step_ms > 0 { coalesce_step_ms } else { DEFAULT_COALESCE_STEP_MS };
312321
let max = if coalesce_max_ms > 0 { coalesce_max_ms } else { DEFAULT_COALESCE_MAX_MS };
313322
tracing::info!("batch coalesce: step={}ms max={}ms", step, max);
323+
// Reply timeout co-varies with `request_timeout_secs` so an
324+
// operator who raises the batch budget doesn't have sessions
325+
// abandoning replies just before the HTTP round-trip would
326+
// have completed. See the `reply_timeout` field comment for
327+
// the invariant.
328+
let reply_timeout = fronter
329+
.batch_timeout()
330+
.saturating_add(REPLY_TIMEOUT_SLACK);
314331
let (tx, rx) = mpsc::channel(512);
315332
tokio::spawn(mux_loop(rx, fronter, step, max));
316333
Arc::new(Self {
@@ -326,9 +343,17 @@ impl TunnelMux {
326343
preread_win_total_us: AtomicU64::new(0),
327344
preread_total_events: AtomicU64::new(0),
328345
unreachable_cache: Mutex::new(HashMap::new()),
346+
reply_timeout,
329347
})
330348
}
331349

350+
/// How long a session waits for its batch reply before retry-polling.
351+
/// Co-varies with `Config::request_timeout_secs` so `fire_batch`'s
352+
/// single HTTP round-trip is always covered.
353+
pub fn reply_timeout(&self) -> Duration {
354+
self.reply_timeout
355+
}
356+
332357
async fn send(&self, msg: MuxMsg) {
333358
let _ = self.tx.send(msg).await;
334359
}
@@ -849,9 +874,16 @@ fn encode_pending(p: PendingOp) -> BatchOp {
849874
/// Pick a deployment, acquire its per-account concurrency slot, and spawn
850875
/// a batch request task.
851876
///
852-
/// The batch HTTP round-trip is bounded by `BATCH_TIMEOUT` so a slow or
853-
/// dead tunnel-node target cannot hold a pipeline slot (and block waiting
854-
/// sessions) forever.
877+
/// The batch HTTP round-trip is bounded by `DomainFronter::batch_timeout()`
878+
/// so a slow or dead tunnel-node target cannot hold a pipeline slot (and
879+
/// block waiting sessions) forever. Each batch makes a single attempt —
880+
/// no client-side retry against a different deployment, because
881+
/// tunnel-node's `drain_now` mutates the per-session buffer when building
882+
/// a response, so a lost response means lost bytes (silent gap on the
883+
/// client side). Without server-side ack / sequence support a replay
884+
/// would either duplicate writes (payload ops) or silently skip bytes
885+
/// (empty polls). Sessions whose batch times out re-poll on the next
886+
/// tick — same recovery surface as pre-#1088.
855887
async fn fire_batch(
856888
sems: &Arc<HashMap<String, Arc<Semaphore>>>,
857889
fronter: &Arc<DomainFronter>,
@@ -879,17 +911,18 @@ async fn fire_batch(
879911

880912
// Bounded-wait: if the batch takes longer than the configured
881913
// batch timeout (Config::request_timeout_secs), all sessions in
882-
// this batch get an error and can retry.
914+
// this batch get an error and can retry-poll on the next tick.
883915
let batch_timeout = f.batch_timeout();
884916
let result = tokio::time::timeout(
885917
batch_timeout,
886918
f.tunnel_batch_request_to(&script_id, &data_ops),
887919
)
888920
.await;
921+
let sid_short = &script_id[..script_id.len().min(8)];
889922
tracing::info!(
890923
"batch: {} ops → {}, rtt={:?}",
891924
n_ops,
892-
&script_id[..script_id.len().min(8)],
925+
sid_short,
893926
t0.elapsed()
894927
);
895928

@@ -925,7 +958,6 @@ async fn fire_batch(
925958
})
926959
.sum();
927960
f.record_today(response_bytes);
928-
let sid_short = &script_id[..script_id.len().min(8)];
929961
for (idx, reply) in data_replies {
930962
if let Some(resp) = batch_resp.r.get(idx) {
931963
let _ = reply.send(Ok((resp.clone(), script_id.clone())));
@@ -948,25 +980,12 @@ async fn fire_batch(
948980
f.record_timeout_strike(&script_id);
949981
}
950982
let err_msg = format!("{}", e);
951-
let sid_short = &script_id[..script_id.len().min(8)];
952-
// Detect the body string we ship as the v1.8.0 bad-auth
953-
// decoy. v1.8.1 asserted "AUTH_KEY mismatch" outright, but
954-
// #404 (w0l4i) found the same body comes back from Apps
955-
// Script in 3 other unrelated cases too:
956-
//
957-
// 1. AUTH_KEY mismatch — our intentional decoy
958-
// 2. Apps Script execution timeout/ — runtime hit 6-min
959-
// mid-call quota tear cap or per-100s quota
960-
// 3. Apps Script internal hiccup — Google-side flake,
961-
// serves placeholder
962-
// 4. ISP-side response truncation — #313 pattern, the
963-
// response was assembled
964-
// but ate an RST mid-flight
965-
//
966-
// So we surface all four candidates instead of asserting #1.
967-
// Users can flip DIAGNOSTIC_MODE=true in Code.gs to disambiguate:
968-
// only #1 still returns the decoy in diagnostic mode; the
969-
// others return real JSON or different errors.
983+
// Decoy / Apps-Script-flake detection. This body string can
984+
// mean any of 4 unrelated things (AUTH_KEY mismatch, Apps
985+
// Script execution timeout, Google-side flake, ISP-side
986+
// truncation #313), so surface all candidates rather than
987+
// asserting one. Operators can flip DIAGNOSTIC_MODE in
988+
// Code.gs to disambiguate (#404).
970989
if err_msg.contains("The script completed but did not return anything") {
971990
tracing::error!(
972991
"batch failed (script {}): got the v1.8.0 decoy/placeholder body — \
@@ -993,7 +1012,6 @@ async fn fire_batch(
9931012
// per-read timeout — count it the same way so a truly-stuck
9941013
// deployment exits round-robin fast.
9951014
f.record_timeout_strike(&script_id);
996-
let sid_short = &script_id[..script_id.len().min(8)];
9971015
tracing::warn!(
9981016
"batch timed out after {:?} (script {}, {} ops)",
9991017
batch_timeout,
@@ -1368,7 +1386,7 @@ async fn tunnel_loop(
13681386
// Bounded-wait on reply: if the batch this op landed in is slow
13691387
// (dead target on the tunnel-node side), don't block this session
13701388
// forever — timeout and let it retry on the next tick.
1371-
let (resp, script_id) = match tokio::time::timeout(REPLY_TIMEOUT, reply_rx).await {
1389+
let (resp, script_id) = match tokio::time::timeout(mux.reply_timeout(), reply_rx).await {
13721390
Ok(Ok(Ok((r, sid_used)))) => (r, sid_used),
13731391
Ok(Ok(Err(e))) => {
13741392
tracing::debug!("tunnel data error: {}", e);
@@ -1719,10 +1737,55 @@ mod tests {
17191737
preread_win_total_us: AtomicU64::new(0),
17201738
preread_total_events: AtomicU64::new(0),
17211739
unreachable_cache: Mutex::new(HashMap::new()),
1740+
// Tests that exercise the reply-timeout path expect a
1741+
// generous fixed value here; production derives this from
1742+
// `fronter.batch_timeout()` (see `TunnelMux::start`).
1743+
reply_timeout: Duration::from_secs(35),
17221744
});
17231745
(mux, rx)
17241746
}
17251747

1748+
/// `TunnelMux::reply_timeout` must co-vary with the configured
1749+
/// `request_timeout_secs` plus `REPLY_TIMEOUT_SLACK`. Without this
1750+
/// runtime derivation, operators who raise `request_timeout_secs`
1751+
/// see sessions abandon `reply_rx` just before `fire_batch`'s
1752+
/// HTTP round-trip would have completed — silently orphaning
1753+
/// in-flight responses. The test muxes hardcode a value for
1754+
/// convenience, so a regression in `TunnelMux::start`'s formula
1755+
/// could ship unnoticed unless we exercise the real construction
1756+
/// path.
1757+
#[tokio::test]
1758+
async fn mux_reply_timeout_tracks_batch_timeout_plus_slack() {
1759+
use crate::config::Config;
1760+
1761+
// Pick a non-default `request_timeout_secs` so the assertion
1762+
// would fail under any hardcoded value (35 s in tests, 75 s in
1763+
// the previous patch).
1764+
let cfg: Config = serde_json::from_str(
1765+
r#"{
1766+
"mode": "apps_script",
1767+
"google_ip": "127.0.0.1",
1768+
"front_domain": "www.google.com",
1769+
"script_id": "TEST",
1770+
"auth_key": "test_auth_key",
1771+
"listen_host": "127.0.0.1",
1772+
"listen_port": 8085,
1773+
"log_level": "info",
1774+
"verify_ssl": true,
1775+
"request_timeout_secs": 60
1776+
}"#,
1777+
)
1778+
.unwrap();
1779+
let fronter = Arc::new(DomainFronter::new(&cfg).expect("test fronter must construct"));
1780+
let mux = TunnelMux::start(fronter, 0, 0);
1781+
1782+
assert_eq!(
1783+
mux.reply_timeout(),
1784+
Duration::from_secs(60) + REPLY_TIMEOUT_SLACK,
1785+
"reply_timeout must equal batch_timeout + REPLY_TIMEOUT_SLACK"
1786+
);
1787+
}
1788+
17261789
/// The buffered ClientHello from the pre-read window must reach the
17271790
/// tunnel-node as the first `Data` op on the fallback path. If this
17281791
/// regresses, every TLS handshake stalls until the 30 s read-timeout

0 commit comments

Comments
 (0)