Skip to content

Commit 82649a7

Browse files
teovlclaude
andcommitted
feat(send-file): bounded ACK wait, progress, throughput · docs rewrite
Fixes the "send-file hangs ~120s then EOFs" report originally filed in BUG-updater-version-skew.md. Two distinct things ship here: 1. cmdSendFile reliability (M0 of the reliable-file-transfer proposal): - --timeout flag (default 90s) bounds the ACK wait. Was unbounded before, hitting SO_KEEPALIVE around 120s with an opaque EOF. On expiry we close the conn (unblocks the read goroutine), then surface a clear hint pointing at pilotctl ping <peer>. - Progress line on stderr via startWaitProgress, gated on TTY + not --json so agents don't see control chars. - Result JSON now carries elapsed_ms and throughput_mbps. - Receiver "ERR …" ACK already errored; tightened the hint. - parseFlags split for --timeout: positional args come from `pos`. 2. Docs: - BUG-updater-version-skew.md rewritten end-to-end. The RSS-stale mechanism the original claimed is wrong (updater hits the GitHub API directly per updater.go:247); the real cause is that pilot-updater ships but is never started (no launchd/systemd unit, not embedded in daemon). Also documents the missing pilot-gateway binary in v1.11.0 (confirmed by `tar tzf`). - PROPOSAL-reliable-file-transfer.md is the new home for the real fix — TypeFileStream wire type, INIT/CHUNK/ACK/DONE/ABORT/RESUME state machine, sliding-window backpressure, end-to-end SHA-256, resume protocol, backward-compat fallback to TypeFile. Six milestones; M0 ships in this commit. No wire-format change. No new deps. Full pilotctl suite green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 57e1905 commit 82649a7

3 files changed

Lines changed: 459 additions & 21 deletions

File tree

cmd/pilotctl/main.go

Lines changed: 113 additions & 21 deletions
Original file line numberDiff line numberDiff line change
@@ -1176,10 +1176,31 @@ Flags:
11761176
11771177
Publish a message to a topic on a remote node.
11781178
`,
1179-
"send-file": `Usage: pilotctl send-file <address|hostname> <filepath>
1179+
"send-file": `Usage: pilotctl send-file <address|hostname> <filepath> [--timeout <dur>]
11801180
1181-
Send a file to a remote node via the reliable data-exchange stream.
1182-
The receiver gets the filename and contents; an ACK is printed on success.
1181+
Send a file to a remote node via the data-exchange stream. Files are
1182+
capped at 256 MiB (the data-exchange frame ceiling).
1183+
1184+
Flags:
1185+
--timeout <dur> give up if the receiver does not ACK within this
1186+
window (default 90s). Use a value comfortably
1187+
larger than (file size / expected throughput) +
1188+
receiver disk-flush time. On timeout the sender
1189+
exits with a non-zero code and a clear hint
1190+
instead of hanging until SO_KEEPALIVE trips
1191+
(~120s by default on the OS).
1192+
1193+
What you see during a transfer (TTY only):
1194+
sending <file> to <target>… <Ns> self-rewriting elapsed line
1195+
(--json suppresses it for agent consumption)
1196+
1197+
Reliability caveats (current implementation):
1198+
- File is transferred as a single atomic frame; on any error the
1199+
receiver may end up with no file or a partial one.
1200+
- No resume protocol — a dropped transfer means a full retry.
1201+
- End-to-end integrity is the tunnel's AEAD tag; there is no
1202+
application-level content hash. See
1203+
docs/PROPOSAL-reliable-file-transfer.md for the planned fix.
11831204
`,
11841205

11851206
// appstore: keep the help block here in lockstep with
@@ -3642,25 +3663,54 @@ func cmdDgram(args []string) {
36423663
}
36433664
}
36443665

3666+
// cmdSendFile transfers a file via the dataexchange overlay stream.
3667+
//
3668+
// Until the chunked streaming protocol lands (see
3669+
// docs/PROPOSAL-reliable-file-transfer.md), this M0 implementation gives
3670+
// us four reliability primitives without changing the wire format:
3671+
//
3672+
// 1. **Bounded ACK wait.** The original code blocked on `client.Recv()`
3673+
// forever; if the receiver crashed mid-write, the sender hung until
3674+
// SO_KEEPALIVE finally fired (~120s on most kernels). We now wrap the
3675+
// receive in a context with a configurable `--timeout` (default 90s)
3676+
// and close the connection on expiry so the underlying goroutine
3677+
// unblocks.
3678+
// 2. **Progress indicator.** While the transfer is in flight, an
3679+
// elapsed-time line is rewritten on stderr (TTY-only, hidden under
3680+
// --json) so the user knows the command is alive. Uses the existing
3681+
// startWaitProgress helper.
3682+
// 3. **Throughput in the result.** The JSON output now carries
3683+
// elapsed_ms and a megabits-per-second rate, so agents can see at a
3684+
// glance whether the transfer was abnormally slow.
3685+
// 4. **Sharper error messages.** Timeouts and receiver-side ERR ACKs
3686+
// get distinct exit codes and surface the next command to run
3687+
// ("pilotctl ping" / "pilotctl peers"), matching the house pattern.
36453688
func cmdSendFile(args []string) {
3646-
if len(args) < 2 {
3647-
fatalCode("invalid_argument", "usage: pilotctl send-file <address|hostname> <filepath>")
3689+
flags, pos := parseFlags(args)
3690+
if len(pos) < 2 {
3691+
fatalCode("invalid_argument", "usage: pilotctl send-file <address|hostname> <filepath> [--timeout <dur>]")
36483692
}
36493693

3694+
// Default 90s is comfortable for transfers up to a hundred MiB over
3695+
// a relay path; users with bigger files or slower peers should bump
3696+
// it explicitly. We intentionally do not derive timeout from file
3697+
// size — that hides the failure mode where the receiver hangs
3698+
// post-write (the actual symptom of the original bug).
3699+
timeout := flagDuration(flags, "timeout", 90*time.Second)
3700+
36503701
d := connectDriver()
36513702
defer d.Close()
36523703

3653-
target, err := parseAddrOrHostname(d, args[0])
3704+
target, err := parseAddrOrHostname(d, pos[0])
36543705
if err != nil {
36553706
fatalCode("invalid_argument", "%v", err)
36563707
}
36573708

36583709
// Auto-handshake to peers in the embedded trusted-agents list.
36593710
// Best-effort: warns on stderr and continues if handshake fails.
3660-
// (send-file uses positional args — no flag map; pass false.)
36613711
maybeAutoHandshake(d, target, false)
36623712

3663-
filePath := args[1]
3713+
filePath := pos[1]
36643714
data, err := os.ReadFile(filePath)
36653715
if err != nil {
36663716
if os.IsNotExist(err) {
@@ -3677,7 +3727,8 @@ func cmdSendFile(args []string) {
36773727
// streaming a quarter-gigabyte just to have the receiver close.
36783728
if len(data) > dataexchange.MaxFrameSize {
36793729
fatalCode("invalid_argument",
3680-
"file too large: %d bytes (max %d)", len(data), dataexchange.MaxFrameSize)
3730+
"file too large: %d bytes (max %d). The chunked-streaming protocol planned in docs/PROPOSAL-reliable-file-transfer.md will lift this; until then split the file or compress it.",
3731+
len(data), dataexchange.MaxFrameSize)
36813732
}
36823733

36833734
filename := filepath.Base(filePath)
@@ -3693,25 +3744,66 @@ func cmdSendFile(args []string) {
36933744
}
36943745
defer client.Close()
36953746

3747+
stop := startWaitProgress(fmt.Sprintf("sending %s to %s", filename, target))
3748+
start := time.Now()
3749+
36963750
if err := client.SendFile(filename, data); err != nil {
3751+
stop()
36973752
fatalCode("connection_failed", "send failed: %v", err)
36983753
}
36993754

3700-
// Read ACK
3701-
ack, err := client.Recv()
3702-
if err != nil {
3703-
// Sender wrote all bytes but never got the receiver's ACK back
3704-
// (likely receiver crashed or restarted mid-transfer). That's
3705-
// not a silent success — surface as a loud error so callers
3706-
// don't mistake it for full delivery.
3707-
fatalCode("connection_failed",
3708-
"send wrote all bytes but no ACK from receiver: %v", err)
3755+
// Wait for the ACK with a bounded deadline. The dataexchange.Client
3756+
// does not expose a Recv-with-context, so we run the read in a
3757+
// goroutine and race it against a timer. On timeout we close the
3758+
// connection — that unblocks the goroutine's ReadFrame with an
3759+
// error, which we then drop on the floor because we've already
3760+
// decided the transfer is a failure.
3761+
type ackResult struct {
3762+
frame *dataexchange.Frame
3763+
err error
3764+
}
3765+
ackCh := make(chan ackResult, 1)
3766+
go func() {
3767+
f, err := client.Recv()
3768+
ackCh <- ackResult{f, err}
3769+
}()
3770+
3771+
var ack *dataexchange.Frame
3772+
select {
3773+
case res := <-ackCh:
3774+
ack = res.frame
3775+
if res.err != nil {
3776+
stop()
3777+
// Sender wrote all bytes but never got the receiver's ACK
3778+
// back (likely receiver crashed or restarted mid-transfer).
3779+
fatalHint("connection_failed",
3780+
"the receiver may have crashed or restarted mid-transfer · check reachability: pilotctl ping "+target.String(),
3781+
"send wrote all bytes but no ACK from receiver: %v", res.err)
3782+
}
3783+
case <-time.After(timeout):
3784+
stop()
3785+
// Closing the conn lets the goroutine unwind. We deliberately
3786+
// don't wait for it here — we've already given the receiver its
3787+
// budget.
3788+
_ = client.Close()
3789+
fatalHint("timeout",
3790+
"the receiver did not ACK within "+timeout.String()+" · check reachability: pilotctl ping "+target.String()+" · for very large files try --timeout 5m",
3791+
"send-file timed out waiting for ACK from %s after %s", target, timeout)
3792+
}
3793+
3794+
stop()
3795+
elapsed := time.Since(start)
3796+
mbps := 0.0
3797+
if elapsed > 0 {
3798+
mbps = (float64(len(data)) * 8.0) / (1e6 * elapsed.Seconds())
37093799
}
37103800

37113801
result := map[string]interface{}{
3712-
"filename": filename,
3713-
"bytes": len(data),
3714-
"destination": target.String(),
3802+
"filename": filename,
3803+
"bytes": len(data),
3804+
"destination": target.String(),
3805+
"elapsed_ms": elapsed.Milliseconds(),
3806+
"throughput_mbps": mbps,
37153807
}
37163808
if ack != nil {
37173809
ackText := string(ack.Payload)

docs/BUG-updater-version-skew.md

Lines changed: 149 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,149 @@
1+
# BUG: pilot-updater stuck on v1.10.9 — three independent issues
2+
3+
**Reported:** 2026-06-14
4+
**Re-audited:** 2026-06-14 against current code on `main`
5+
**Severity:** medium (causes version skew); the `send-file` impact is a
6+
separate, larger reliability bug — see `docs/PROPOSAL-reliable-file-transfer.md`.
7+
**Components:** pilot-updater · release packaging · dataexchange / send-file
8+
9+
## Summary
10+
11+
The original bug report identified one phenomenon ("Node A pinned at v1.10.9
12+
while a fresh `install.sh` pulls v1.11.0") and proposed a single mechanism
13+
(stale RSS changelog feed). Live audit against the code confirms the
14+
phenomenon but disproves the mechanism. There are in fact **three
15+
independent issues** wearing the same symptom:
16+
17+
1. **The updater binary ships but is never started.** The install bundle
18+
contains `pilot-updater` next to `pilot-daemon` and `pilotctl`, but
19+
neither `install.sh` nor the daemon's normal startup launches it.
20+
2. **`install.sh` references `pilot-gateway`, which is no longer in the
21+
v1.11.0 bundle.** Causes a harmless-looking `cp: cannot stat …
22+
pilot-gateway` line but signals stale install logic.
23+
3. **`send-file` between any two nodes (skewed or not) has no reliability
24+
primitives** — no ACK timeout, no integrity hash, no resume, no
25+
progress. The 120s EOF in the original report is this bug, not version
26+
skew. The skew is incidental; the same failure reproduces between two
27+
v1.11.0 nodes.
28+
29+
This document covers (1) and (2). (3) is the larger fix and lives in
30+
`docs/PROPOSAL-reliable-file-transfer.md`.
31+
32+
## Issue 1 — Updater shipped, never started
33+
34+
### What the original doc said
35+
36+
> *"a long-running install does not auto-update to the latest version …
37+
> the changelog/update feed the updater reads looks stale"*
38+
39+
### What the code actually does
40+
41+
`updater/updater.go:244-249` builds the polling URL directly against the
42+
GitHub API:
43+
44+
```go
45+
if tag == "" {
46+
url = fmt.Sprintf("https://api.github.com/repos/%s/releases/latest", u.config.Repo)
47+
} else {
48+
url = fmt.Sprintf("https://api.github.com/repos/%s/releases/tags/%s", u.config.Repo, tag)
49+
}
50+
```
51+
52+
The RSS feed at `teoslayer.github.io/pilot-changelog` is **only** read by
53+
the user-facing `pilotctl updates` command (`updates.go`) — it has no role
54+
in the auto-update decision. The feed being stale is irrelevant to the
55+
updater.
56+
57+
### The real cause
58+
59+
On the affected machine (Node A, macOS arm64):
60+
61+
- `pilot-updater` binary is installed at `~/.pilot/bin/pilot-updater`
62+
(mtime 2026-06-08, i.e. unchanged since first install — as expected if
63+
it was never invoked).
64+
- `ps -axo command | grep pilot-updater` returns **nothing**.
65+
- The running `pilot-daemon` command line is
66+
`pilot-daemon -identity … -socket … -listen … -hostname … -log-level info`
67+
— no flag enables an in-process updater either.
68+
69+
So the updater binary is sitting on disk being garbage-collected by Time
70+
Machine, not polling GitHub at all. The install never set up a launchd
71+
service / systemd unit / cron / supervisor for it, and the daemon does
72+
not embed it.
73+
74+
### Fix
75+
76+
Three reasonable options, in increasing order of "proper":
77+
78+
1. **Document the deferral.** Add a clear note to `install.sh` output
79+
that the updater is provided but not auto-started; users opt in by
80+
running `pilotctl daemon start --enable-updater` (which already exists
81+
in `web4/cmd/pilotctl/main.go` per the `Daemon lifecycle` help).
82+
2. **Auto-start the updater on install** by writing a launchd plist /
83+
systemd unit alongside the daemon plist. `install.sh` already writes
84+
the daemon plist; the same mechanism, scoped to the updater binary,
85+
takes ~30 lines.
86+
3. **Embed the updater in the daemon** as an opt-in `--enable-updater`
87+
flag that spawns a goroutine running `updater.Service.Run`. This is
88+
the long-term right answer — one process, one PID, one log stream —
89+
but it changes the daemon contract.
90+
91+
(2) is shippable in one PR; (3) is the eventual direction.
92+
93+
## Issue 2 — `pilot-gateway` missing from v1.11.0 bundle
94+
95+
### Observed
96+
97+
```text
98+
$ curl -sL .../v1.11.0/pilot-linux-amd64.tar.gz | tar tzf -
99+
./
100+
./updater
101+
./daemon
102+
./pilotctl
103+
```
104+
105+
No `pilot-gateway`. `install.sh` line that calls
106+
`cp /tmp/pilot/pilot-gateway …` fails with the documented "cannot stat"
107+
error, then proceeds.
108+
109+
### Whose problem this is
110+
111+
Either:
112+
113+
- The release workflow (`release/`) dropped `pilot-gateway` from the
114+
artifact list and `install.sh` wasn't updated. Fix: re-add to the
115+
workflow or remove from `install.sh`, depending on intent.
116+
- `pilot-gateway` was deliberately retired and `install.sh` is stale. Fix:
117+
remove the line from `install.sh` and update operator docs.
118+
119+
Inspecting `release/` and `install.sh` will resolve which.
120+
121+
## Issue 3 — Send-file is unreliable by design
122+
123+
See `docs/PROPOSAL-reliable-file-transfer.md`. Quick summary: the
124+
"send-file hangs 120s then EOFs" failure in the original report is not
125+
caused by version skew. It reproduces between two v1.11.0 nodes. The
126+
underlying gaps are zero application-layer timeouts, no end-to-end
127+
integrity hash, no resume protocol, atomic 256 MiB frames with no
128+
streaming, and no progress reporting. The proposal lays out a
129+
backward-compatible streaming protocol that addresses all of these.
130+
131+
## Open questions to drive a fix PR
132+
133+
- Is the intent for the updater to auto-run after install on all
134+
platforms, or only when the operator opts in? (Answers whether option
135+
2 or 3 above is right.)
136+
- Is `pilot-gateway` retired? If not, why did the v1.11.0 release skip
137+
it?
138+
139+
## Environment (audit re-run)
140+
141+
| | Node A (laptop) | Node B (bench VM) |
142+
|---|---|---|
143+
| OS / arch | macOS arm64 | Ubuntu 24.04 amd64 |
144+
| pilot daemon | v1.10.9 | (fresh VM, no install yet) |
145+
| install date | 2026-06-08 | n/a |
146+
| `pilot-updater` process | **not running** | n/a |
147+
| `~/.pilot/bin/.pilot-version` | `v1.10.9` | n/a |
148+
| `~/.pilot/bin/pilot-updater` mtime | 2026-06-08 (unchanged) | n/a |
149+
| latest GitHub release tag | `v1.11.0` (published 2026-06-10) | same |

0 commit comments

Comments
 (0)