swarm: `Connection::poll` regularly exceeds Tokio's 50 µs slow-poll threshold

## Context

`Connection::poll` (`swarm/src/connection.rs:274`) is a fixed-point loop
that re-polls every child sub-state-machine whenever any one returns
`Ready`, and only yields once *every* child returns `Pending` in the
same pass. With no upper bound on iterations per call, single
invocations regularly cross [Tokio's 50 µs slow-poll threshold][threshold],
holding a Tokio worker thread back from timers, the I/O reactor, and
colocated tasks for hundreds of microseconds at a time.

[threshold]: https://docs.rs/tokio-metrics/latest/tokio_metrics/struct.TaskMonitor.html#associatedconstant.DEFAULT_SLOW_POLL_THRESHOLD
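For readers unfamiliar with the loop shape, here is a minimal, simplified model of an unbounded fixed-point poll. It is not the code in `swarm/src/connection.rs`: `Child`, `ready_events`, and `poll_until_all_pending` are hypothetical names, and the real loop restarts on each `Ready` rather than finishing a full pass. The point is only that the work done in one call scales with how much the children have ready.

```rust
use std::task::Poll;

/// Hypothetical child state machine: returns `Ready` while it has work queued.
struct Child {
    ready_events: usize,
}

impl Child {
    fn poll(&mut self) -> Poll<()> {
        if self.ready_events > 0 {
            self.ready_events -= 1;
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

/// Simplified model of an unbounded fixed-point poll.
/// Returns the number of passes over the children made by a single call.
fn poll_until_all_pending(children: &mut [Child]) -> usize {
    let mut passes = 0;
    loop {
        passes += 1;
        let mut progressed = false;
        for child in children.iter_mut() {
            if child.poll().is_ready() {
                progressed = true;
            }
        }
        // Only stop once every child returned `Pending` in the same pass.
        if !progressed {
            return passes;
        }
    }
}
```

In this model a call with children holding 3, 1, and 0 queued events makes four passes before returning, and nothing caps that number as the queues grow.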

## Findings

120-cell sweep (`N ∈ {1,2,5,10,15} × RTT ∈ {0,5,25,50,100,200} ms ×
payload ∈ {4,16,64,256} KiB`) — real TCP, per-peer netns + `tc netem`,
single-worker tokio runtime, `taskset`-pinned per process:

- **106 / 120 cells** have ≥ 1 % of `Connection::poll` invocations
  cross 50 µs. The 14 cells under 1 % are all at 4 KiB / N ≤ 2.
- **Worst cells** (256 KiB / RTT=0 / N ≥ 2): **32–36 %** of polls slow,
  p99 = **350–375 µs** (~7× the threshold), and **~80 % of measurement
  wall-clock** is spent inside slow polls.
- **Light load isn't clean** either: 64 KiB / N=1 / RTT=0 = ~8 % slow ratio.
- **`Swarm::poll` is fine**: its slow ratio peaks at 2.1 % (mean 0.22 %)
  and stays under 1 % in 119/120 cells. The cost is inside
  `Connection::poll`, not the event loop above it.

## Reproducing

Full methodology, source, raw CSVs, and rendered heatmaps live in
[`latency-benchmark/`](https://github.com/sirandreww-starkware/rust-libp2p/tree/feat/poll-latency-coop-budget/latency-benchmark).
One-liner: `./latency-benchmark/sweep.sh` on Linux. `sudo` is only
required for `tc`/netns; RTT=0 cells skip both and run as a regular
user. ~50–60 min for the default grid.

## Possible direction

A tokio-`coop`-style work budget on the inner loop — bound iterations
per `Connection::poll` invocation, self-wake on saturation — would cap
per-poll duration without changing observable behaviour beyond one extra
scheduling round per saturated poll. Sketched in
[the bench README](https://github.com/sirandreww-starkware/rust-libp2p/blob/feat/poll-latency-coop-budget/latency-benchmark/README.md#plausible-fix).
The specific budget value is empirical; we don't have a strong preference.
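As a rough illustration of the idea (not the sketch from the bench README), the following uses only the standard library. `BudgetedConnection`, `pending_work`, `POLL_BUDGET`, and `drive` are hypothetical names, and the budget of 32 is an arbitrary placeholder rather than a recommended value.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical cap on inner-loop iterations per poll (placeholder value).
const POLL_BUDGET: usize = 32;

/// Stand-in for a connection whose children have `pending_work` units ready.
struct BudgetedConnection {
    pending_work: usize,
}

impl Future for BudgetedConnection {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut budget = POLL_BUDGET;
        loop {
            if self.pending_work == 0 {
                return Poll::Ready(());
            }
            if budget == 0 {
                // Budget exhausted with work remaining: self-wake so the
                // executor re-polls this task, then yield the worker thread.
                cx.waker().wake_by_ref();
                return Poll::Pending;
            }
            self.pending_work -= 1;
            budget -= 1;
        }
    }
}

/// Waker that only counts how often it was woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
    fn wake_by_ref(self: &Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Polls to completion; returns (number of polls, number of self-wakes).
fn drive(work: usize) -> (usize, usize) {
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut conn = BudgetedConnection { pending_work: work };
    let mut polls = 0;
    loop {
        polls += 1;
        if Pin::new(&mut conn).poll(&mut cx).is_ready() {
            return (polls, counter.0.load(Ordering::SeqCst));
        }
    }
}
```

With these placeholder numbers, `drive(100)` finishes in four polls with three self-wakes, while `drive(32)` finishes in a single poll with none, so an unsaturated poll pays nothing extra.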

Happy to refine the methodology, add measurements, or produce
before/after numbers once a fix shape lands.

