Commit 399ef9a
wireproto: per-handler idle watchdog to unblock H2 migration
Summary:
Adds a per-handler idle watchdog to the wireproto websocket-upgrade path so that handlers stuck on a wedged HTTP/2 stream get killed (and the Arcs / response buffers they hold get released) instead of accumulating indefinitely. The watchdog is gated behind `scm/mononoke:wireproto_idle_kill_seconds` (defaults to 0 = disabled until intentionally rolled out, per `fbcode/eden/.llms/rules/rust_unwrap_safety.md`).
This is intended to unblock re-enabling HTTP/2 between revproxy and mononoke without waiting on full wireproto deprecation.
## Why the leak existed (S530959)
Each wireproto session in `slapi_server/repo_listener` spawns a `fwd` task that drains stdout/stderr/keepalive into a `WireprotoSink<W>` where `W` is the writer half of the HTTP body returned from `hyper::upgrade::on(...)`. The session tear-down chain is strictly serial:
```
1. request_handler returns (forward(...) errors when its stdout mpsc::Sender errors)
2. keep_alive.abort()
3. join_handle.await (waits for fwd to drain + flush + close wr)
```
Step 1 fires only when the mpsc::Sender to `stdout` returns Err, which only happens once the receiver inside `fwd` is dropped, which only happens once `fwd` exits, which only happens once `wr.poll_*` returns Ready (either Ok or Err). In other words, step 1 depends on the same `fwd` exit that step 3 waits for, so a `wr` that never returns Ready stalls the whole chain.
Under HTTP/1.1, this is fine: a peer disconnect closes TCP, the SSL stream sees FIN, `wr.poll_*` returns Err on the next poll, and the chain unwinds in milliseconds.
Under HTTP/2 extended CONNECT (D71055412), one wireproto websocket is a single H2 stream multiplexed onto a long-lived H2 connection (with proxygen pool config `keep_alive_timeout_ms: 900000` / `max_pooled_conns_per_server: 10` post-D70000450). When a peer stops draining a stream without sending RST_STREAM — common when the connection is being kept open for other streams — `H2Upgraded::poll_ready` and `poll_flush` can pend on the H2 flow-control window indefinitely. None of the existing layers will ever return Err; `request_handler` keeps holding `Arc<Mononoke<Repo>>`, `RepoClient` (repo + push_redirector_args + session + logging), the boxed `HgProtoHandler::outstream`, and any in-flight response payload (`getbundle` responses can be megabytes per session).
This was the leak shape behind S530959 ("memory is being held in wireproto codepath" after H2 enablement in D70000450) and the reason both H2 *and* connection pooling for the mononoke backend pool have been kept off in revproxy ever since (see D86877171: "removes selective http/2 disablement for mononoke (no longer used as http/2 has been rolled back everywhere)"). The S530805 sapling-404 incident was a separate proxygen-side bug (D74514382 fixed it), but the memory pressure is what's actually keeping H2 disabled today.
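The pending behavior described above can be modeled with a toy send window. This is only a sketch: `StreamWindow` and its methods are hypothetical names, not the `h2` crate's or hyper's API, and the real `poll_*` methods also take a task `Context`, which is omitted here. It shows the one property that matters for the leak: with the window at zero and no RST_STREAM, every poll returns `Pending` and never `Err`.

```rust
use std::task::Poll;

/// Toy model of a per-stream HTTP/2 send window (hypothetical type,
/// not the `h2` crate's implementation).
pub struct StreamWindow {
    available: usize,
    reset: bool,
}

impl StreamWindow {
    pub fn new(initial: usize) -> Self {
        StreamWindow { available: initial, reset: false }
    }

    /// Ready(Ok) only when there is send capacity; Ready(Err) only after a
    /// reset; otherwise Pending, for as long as the peer sends nothing.
    pub fn poll_ready(&self) -> Poll<Result<(), &'static str>> {
        if self.reset {
            Poll::Ready(Err("stream reset"))
        } else if self.available == 0 {
            Poll::Pending
        } else {
            Poll::Ready(Ok(()))
        }
    }

    /// Consume window capacity; returns how many bytes were accepted.
    pub fn send(&mut self, len: usize) -> usize {
        let accepted = len.min(self.available);
        self.available -= accepted;
        accepted
    }

    /// Peer drained data and reopened the window (WINDOW_UPDATE).
    pub fn window_update(&mut self, increment: usize) {
        self.available += increment;
    }

    /// Peer reset the stream (RST_STREAM).
    pub fn rst_stream(&mut self) {
        self.reset = true;
    }
}
```

A peer that neither drains nor resets leaves the writer in the `Pending` branch on every poll, which is why nothing in the existing tear-down chain ever observes an error.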
## How this fixes it
`WireprotoSinkData::last_successful_io` is already updated on every successful poll on the sink (via `WireprotoSink::peek_io`). Until now nothing read it outside the failure-scuba path. This diff:
1. Shares `WireprotoSinkData` between the `fwd` task's `WireprotoSink` and a new `wireproto_idle_watchdog` future in `handle_wireproto`, via `Arc<Mutex<WireprotoSinkData>>`. The mutex is uncontended in steady state (one writer, one periodic reader) and is never held across an `.await`.
2. Races `request_handler` against the watchdog via `tokio::select!`. If the watchdog wins (no successful sink poll for `wireproto_idle_kill_seconds`):
   - `request_handler` is dropped, releasing the Arc chain immediately.
   - `join_handle.abort()` is called explicitly so the `await` below doesn't itself hang against the same wedged sink, and so `wr` (the upgraded H2 IO) is dropped promptly.
3. Bumps `STATS::wireproto_idle_killed` whenever the watchdog fires, so we have a leak-detection signal during the rollout.
The keepalive task writes an empty `Bytes` to the sink every 5 seconds, so any healthy session — even one whose protocol layer is doing slow CPU work — refreshes `last_successful_io` continuously. Only an actually-wedged sink fails to update.
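The shared-timestamp check can be sketched with the standard library alone. This is an illustrative model, not the code in this diff: `SinkData`, `record_io`, `watchdog_tick` and `Verdict` are hypothetical names, the clock is passed in explicitly so the logic is deterministic, and the real watchdog is an async future raced via `tokio::select!` rather than a plain function.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `WireprotoSinkData`; only the field this
/// sketch needs.
pub struct SinkData {
    pub last_successful_io: Instant,
}

/// What the watchdog decides on a single tick.
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    Healthy,
    Kill,
}

/// Called by the sink after every successful poll (the role `peek_io`
/// plays in the description above).
pub fn record_io(data: &Arc<Mutex<SinkData>>, now: Instant) {
    // The lock is held only for this assignment.
    let mut guard = data.lock().unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.last_successful_io = now;
}

/// One watchdog tick: compare idle time against the limit.
/// `None` means the watchdog is disabled and never kills.
pub fn watchdog_tick(
    data: &Arc<Mutex<SinkData>>,
    now: Instant,
    idle_limit: Option<Duration>,
) -> Verdict {
    let limit = match idle_limit {
        Some(limit) => limit,
        None => return Verdict::Healthy,
    };
    // Copy the timestamp out so the guard is dropped before any further work.
    let last = data
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
        .last_successful_io;
    if now.saturating_duration_since(last) >= limit {
        Verdict::Kill
    } else {
        Verdict::Healthy
    }
}
```

With a 5-second keepalive calling `record_io`, a healthy session never accumulates anywhere near 120 seconds of idle time, so only a sink whose polls stop succeeding reaches `Verdict::Kill`.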
## Why this only affects wireproto
`handle_wireproto` is reached exclusively from `MononokeHttpService::handle_websocket_request`, which is only called when `is_websocket_req(...)` returns true — i.e., the request is an HTTP/1.1 `Upgrade: websocket` or HTTP/2 extended CONNECT (`:protocol = websocket`). Every other path through `MononokeHttpService::handle` (SLAPI, EdenAPI, `/control`, `/health_check`, `/netspeedtest`) returns before reaching the wireproto upgrade and is unaffected by this change. There is no global timeout here that could trim long-running edenapi streams, and no new behavior on the H2 connection itself.
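The gating condition can be illustrated roughly as follows. This is an assumption-laden sketch: `RequestView` and `looks_like_websocket` are hypothetical, and the real `is_websocket_req` operates on hyper request types and may check additional headers.

```rust
/// Hypothetical, simplified view of a request (not hyper's or Mononoke's type).
pub struct RequestView<'a> {
    pub http2: bool,
    pub method: &'a str,
    /// Value of the `Upgrade` header, if present (HTTP/1.1).
    pub upgrade: Option<&'a str>,
    /// Value of the `:protocol` pseudo-header, if present
    /// (HTTP/2 extended CONNECT).
    pub protocol: Option<&'a str>,
}

/// True for the two request shapes named above; everything else takes
/// the non-wireproto paths and never reaches the watchdog.
pub fn looks_like_websocket(req: &RequestView<'_>) -> bool {
    if req.http2 {
        req.method == "CONNECT"
            && req
                .protocol
                .map_or(false, |p| p.eq_ignore_ascii_case("websocket"))
    } else {
        req.upgrade
            .map_or(false, |u| u.eq_ignore_ascii_case("websocket"))
    }
}
```

An ordinary HTTP/2 request such as a `GET` to `/health_check` fails both branches, which matches the claim that only the websocket-upgrade path is affected.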
## Configuration / rollout
- New JK: `scm/mononoke:wireproto_idle_kill_seconds` (i64). `<= 0` ⇒ watchdog inert. Suggested production value: 120 (wireproto is interactive; 2 minutes of zero successful sink polls means wedged).
- Default in `just_knobs.json` should be 0 so this is a no-op until intentionally enabled.
- Killswitch: setting the JK to 0 instantly disables the watchdog without a code push. The `tokio::select!` arm then can never fire (the watchdog stays in `continue` forever).
- Fail-safe on JK errors: if the JK system itself returns Err (e.g., during a config push), the watchdog logs once a minute and stays inert rather than killing handlers.
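The knob semantics listed above reduce to a small mapping, sketched here. `effective_idle_limit` is a hypothetical helper, the JK read is modeled as a plain `Result<i64, E>`, and the once-a-minute logging on errors is left out.

```rust
use std::time::Duration;

/// Map a raw knob read to an effective idle limit.
/// `None` means the watchdog stays inert.
pub fn effective_idle_limit<E>(knob: Result<i64, E>) -> Option<Duration> {
    match knob {
        // Positive value: watchdog armed with this many seconds.
        Ok(secs) if secs > 0 => Some(Duration::from_secs(secs as u64)),
        // Zero or negative: disabled (the default, and the killswitch).
        Ok(_) => None,
        // Knob read failed: fail safe by staying inert rather than killing.
        Err(_) => None,
    }
}
```

Re-evaluating this mapping on every watchdog tick is what would make the killswitch take effect without a code push.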
## Wireproto-only memory release
When the watchdog fires we drop:
- `request_handler` future ⇒ `RepoClient`, `HgProtoHandler::outstream`, all in-flight response Bytes
- `Stdio { stdin, stdout, stderr }` ⇒ `mpsc::Sender`s, framed reader/writer halves
- After `keep_alive.abort()` + `join_handle.abort()`: the `WireprotoSink<W>` and the `H2Upgraded` (or `SslStream`-backed) `W` it owns
So a handler that would have leaked indefinitely instead releases everything within (`POLL_INTERVAL` + `wireproto_idle_kill_seconds`) of going idle.
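The bound follows from the watchdog only observing idleness on its ticks. A small sketch of the arithmetic, assuming ticks at fixed multiples of the poll interval; `first_kill_tick` is a hypothetical helper, and the 10-second interval in the usage note is an illustrative value because the actual `POLL_INTERVAL` is not stated here.

```rust
use std::time::Duration;

/// Time (measured from watchdog start) of the first tick that observes
/// `idle >= limit`, with ticks at poll_interval, 2 * poll_interval, ...
/// `last_io` is the time of the last successful sink poll.
pub fn first_kill_tick(last_io: Duration, limit: Duration, poll_interval: Duration) -> Duration {
    assert!(!poll_interval.is_zero(), "poll interval must be non-zero");
    let deadline = last_io + limit;
    let p = poll_interval.as_nanos();
    // Ceiling division: number of whole intervals needed to reach the deadline.
    let ticks = ((deadline.as_nanos() + p - 1) / p).max(1);
    poll_interval * (ticks as u32)
}
```

For example, with a 120-second limit, an assumed 10-second interval, and the last successful IO at 7 seconds, the deadline is 127 seconds and the first tick at or after it is 130 seconds, so the handler is released 123 seconds after going idle, within the 130-second bound.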
Reviewed By: gustavoavena, YousefSalama
Differential Revision: D104390582
fbshipit-source-id: 5a177db7826f6b19c4220a6652fe2f4461d8cbfd
1 parent 8ffb844 commit 399ef9a
2 files changed
Lines changed: 212 additions & 42 deletions
File tree
- eden/mononoke/servers/slapi/slapi_server/repo_listener/src
Lines changed: 188 additions & 30 deletions
Lines changed: 24 additions & 12 deletions