Commit 8dec6b2
fix(ci3): cache DNS on build instances to dodge link-local PPS throttling (#24105)
## Problem
CI DNS failures (`curl: (6) Could not resolve host …`, e.g. the
`chonk_inputs.sh` S3 download) are consistent with AWS's **link-local
PPS limit**: traffic to the Amazon resolver (the VPC `.2` address /
`169.254.169.253`) is capped at **~1024 packets/sec per ENI**; packets
beyond that are silently dropped (counted as `linklocal_allowance_exceeded` in
`ethtool -S`).
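
As a rough way to check that counter by hand, the sketch below pulls `linklocal_allowance_exceeded` out of `ethtool -S` output. The helper name `linklocal_drops` and the interface name `ens5` are illustrative assumptions, not part of this change.

```shell
# Print the linklocal_allowance_exceeded counter from `ethtool -S <iface>`
# output supplied on stdin. Exits non-zero if the counter is absent
# (for example, on a NIC driver that does not expose it).
# NOTE: `linklocal_drops` is a hypothetical helper name.
linklocal_drops() {
  awk '$1 == "linklocal_allowance_exceeded:" { print $2; found = 1 }
       END { exit !found }'
}

# Example usage (interface name is an assumption; adjust for the instance):
#   ethtool -S ens5 | linklocal_drops
```

A counter that keeps growing between two samples taken during a build would be consistent with the throttling described above.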
We confirmed the build's DNS path makes this likely: the devbox
container **and** nested docker-in-docker both get `nameserver
172.31.0.2` and query the VPC resolver directly — no caching. The host's
`systemd-resolved` *is* caching (~48% hit rate on host-only traffic) but
listens on loopback only (`127.0.0.53`), so containers can't use it.
With the build's parallelism (and the larger spot instances), the
aggregate DNS rate blows past 1024 pps.
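
To see which resolver a given container is actually using, something like the following can be run against its `/etc/resolv.conf`. This is a minimal sketch; the helper names `nameservers` and `is_loopback` are hypothetical.

```shell
# List the nameserver addresses found in resolv.conf content on stdin.
nameservers() {
  awk '$1 == "nameserver" { print $2 }'
}

# Succeed if the given address is a loopback address (127.0.0.0/8 or ::1),
# i.e. one that a container cannot use to reach a resolver on the host.
is_loopback() {
  case $1 in
    127.*|::1) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage inside a container:
#   nameservers < /etc/resolv.conf
```

In the situation described above, this would be expected to print `172.31.0.2` inside both the devbox container and nested dind, while the host itself points at `127.0.0.53`.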
## Fix
Route container DNS through the host's caching `systemd-resolved`:
- Expose its stub on the instance's **primary private IP** (derived from
`ip route get`, no IMDS dependency) via `DNSStubListenerExtra` — that's
the one address reachable from the devbox container *and* nested dind
(unlike the docker0 gateway).
- Point containers at it with `docker run --dns <priv_ip>`. Non-loopback
nameservers propagate through the nested dockerd, so dind inherits it.
Repeat lookups become cache hits and never reach the throttled resolver.
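
The two steps above can be sketched roughly as follows. This is not the code from `ci3/bootstrap_ec2`: the helper names, the `1.1.1.1` route target, and the drop-in file name are assumptions made for illustration. `DNSStubListenerExtra` itself is a real `resolved.conf` option.

```shell
# Extract the source address from `ip route get <addr>` output on stdin,
# e.g. "1.1.1.1 via 172.31.0.1 dev ens5 src 172.31.5.10 uid 1000".
# NOTE: `parse_src_ip` is a hypothetical helper name.
parse_src_ip() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# Render a systemd-resolved drop-in that adds a stub listener on the given IP.
# NOTE: `render_stub_dropin` is a hypothetical helper name.
render_stub_dropin() {
  printf '[Resolve]\nDNSStubListenerExtra=%s\n' "$1"
}

# Example usage on the host (route target and file name are assumptions):
#   priv_ip=$(ip route get 1.1.1.1 | parse_src_ip)
#   render_stub_dropin "$priv_ip" |
#     sudo tee /etc/systemd/resolved.conf.d/10-stub-extra.conf
#   sudo systemctl restart systemd-resolved
#   docker run --dns "$priv_ip" ...
```

Parsing the `src` field keeps the lookup local to the kernel routing table, which matches the stated goal of avoiding an IMDS dependency.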
## Safety
This can only help, never break resolution: if the IP can't be derived,
`systemd-resolved` isn't active, or the stub doesn't come up on the IP
(5×0.5s health check via `ss`), `priv_ip` is cleared and `--dns` is
omitted — leaving DNS exactly as today.
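
A minimal sketch of that fallback, under stated assumptions: the helper names are hypothetical, the `ss` flags are one plausible way to run the check, and `--dns $priv_ip` is an assumed reading of the expansion rather than the exact text in the script.

```shell
# Succeed if `ss -H -lun` output on stdin shows a socket bound to <ip>:53.
has_listener() {
  awk -v want="$1:53" '{ for (i = 1; i <= NF; i++) if ($i == want) found = 1 }
                       END { exit !found }'
}

# Poll up to 5 times, 0.5s apart, for the stub to appear on the given IP.
wait_for_stub() {
  for _ in 1 2 3 4 5; do
    ss -H -lun 2>/dev/null | has_listener "$1" && return 0
    sleep 0.5
  done
  return 1
}

# Emit "--dns <ip>" when priv_ip is non-empty, and nothing otherwise.
docker_dns_args() {
  priv_ip=$1
  printf '%s' "${priv_ip:+--dns $priv_ip}"
}

# Example usage (clear priv_ip on any failure so DNS is left untouched):
#   wait_for_stub "$priv_ip" || priv_ip=
#   docker run $(docker_dns_args "$priv_ip") ...
```

Because `${var:+word}` expands to nothing when the variable is empty or unset, a cleared `priv_ip` yields a `docker run` command line identical to the one used before this change.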
No counter instrumentation included — we're treating link-local
throttling as the known cause. (PR #379's `linklocal_allowance_exceeded`
logging can confirm before/after if desired.)
## Validation
- Ran `bash -n` on `ci3/bootstrap_ec2`; rendered the injected host-script
block and ran `bash -n` on it; verified the `ip route get` parse and the
`${priv_ip:+--dns …}` expansion locally.
- Full validation is the PR's own CI run: a build instance that resolves
through the cache and (ideally) flat `linklocal_allowance_exceeded`.

1 file changed
Lines changed: 31 additions & 1 deletion