Commit 699d1d6

fix: subnet bootstrapping (#1545)
1 parent 7b49216 commit 699d1d6

13 files changed

Lines changed: 1585 additions & 335 deletions

.github/workflows/build.yaml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ on:
 jobs:
   build:
     name: Build IPC
-    runs-on: ubuntu-22.04
+    runs-on: ubuntu-24.04
     env:
       RUST_BACKTRACE: full
       RUSTFLAGS: -Dwarnings

fendermint/docker/runner.Dockerfile

Lines changed: 2 additions & 1 deletion
@@ -4,7 +4,8 @@
 # in the builder depending on whether we are building on CI or locally, but they are concatenated
 # just before the build.

-FROM debian:bookworm-slim
+# Use Ubuntu 24.04 to match glibc 2.39 required by binaries built on ubuntu-24.04 CI runner
+FROM ubuntu:24.04

 RUN apt-get update && \
     apt-get install -y libssl3 ca-certificates curl && \

scripts/ipc-subnet-manager/QUICKSTART.md

Lines changed: 15 additions & 2 deletions
@@ -49,7 +49,17 @@ If you have existing nodes running, check their health:
 ./ipc-subnet-manager.sh check
 ```

-## 5. Initialize Subnet
+## 5. Bootstrap Hosts (Fresh VMs Only)
+
+If your validator hosts are fresh (no IPC installed), run bootstrap first:
+
+```bash
+./ipc-subnet-manager.sh bootstrap --branch main
+```
+
+This installs Rust, Foundry, Node.js, and builds IPC on each host. Takes ~15-30 min per host.
+
+## 6. Initialize Subnet

 ⚠️ **WARNING**: This will destroy all existing data!

@@ -61,7 +71,7 @@ If you have existing nodes running, check their health:
 ./ipc-subnet-manager.sh init
 ```

-## 6. Monitor Progress
+## 7. Monitor Progress

 ```bash
 # Check health
@@ -74,6 +84,9 @@ If you have existing nodes running, check their health:
 ## Common Commands

 ```bash
+# Bootstrap fresh hosts (run first on new VMs)
+./ipc-subnet-manager.sh bootstrap --branch main
+
 # Initialize subnet from scratch
 ./ipc-subnet-manager.sh init

scripts/ipc-subnet-manager/README.md

Lines changed: 3 additions & 4 deletions
@@ -37,11 +37,10 @@ alias ipc-manager='/usr/local/bin/bash /path/to/ipc-subnet-manager.sh'

 ### Remote Validators
 - Ubuntu/Debian-based Linux
-- `ipc-cli` binary installed
-- `cometbft` binary in PATH
-- User with sudo access (default: `philip`)
-- IPC user (default: `ipc`)
+- User with passwordless sudo access (default: `philip`)
 - SSH key-based authentication configured
+- **Option A**: Run `bootstrap` command to install everything (Rust, Foundry, Node.js, IPC)
+- **Option B**: Manually install `ipc-cli` and ensure `cometbft` is in PATH

 ## Installation

Lines changed: 196 additions & 0 deletions
@@ -0,0 +1,196 @@
+# Systematic Troubleshooting: Port 26654 (IPLD Resolver) Not Listening
+
+## Diagnostic Results (from your run)
+
+| Check | Result |
+|-------|--------|
+| Config `listen_addr` |`/ip4/0.0.0.0/tcp/26654` |
+| Config `subnet_id` |`/r314159/t410fjzsmxroshdmvdq5bg4zwqxx5lznwxaga4h7zgqa` |
+| Config `[resolver] enabled` |`true` |
+| Start script | ✓ Has correct env vars |
+| Manual `env FM_... ipc-cli node start` | ✗ Port 26654 still not listening |
+| Logs: "IPLD Resolver disabled" or "starting..." |**Neither appears** |
+| Logs: "snapshots disabled" at node.rs | Line **142** (remote) vs **243** (current code) |
+
+**ROOT CAUSE:** The remote binary was built from a different branch (e.g. f3-lifecycle). Line numbers don't match current code; the resolver block may not exist or is structured differently in that binary. The config and env vars are correct—the binary simply doesn't have the resolver code.
+
+---
+
+## Fix
+
+Rebuild the binary on validators from the branch that has the resolver code:
+
+```bash
+./ipc-manager update-binaries --branch feature/subnet-bootstrapping
+./ipc-manager restart --yes
+```
+
+Then verify:
+
+```bash
+./ipc-manager check
+ssh philip@34.16.93.183 "ss -tuln | grep 26654"
+```
+
+---
+
+## Root Cause Logic (from fendermint)
+
+The resolver starts only when `resolver_enabled()` returns true:
+```rust
+// fendermint/app/settings/src/lib.rs:523-527
+pub fn resolver_enabled(&self) -> bool {
+    !self.resolver.connection.listen_addr.is_empty()
+        && self.ipc.subnet_id != *ipc_api::subnet_id::UNDEF
+}
+```
+
+**Both conditions must be true:**
+1. `resolver.connection.listen_addr` must be non-empty (e.g. `/ip4/0.0.0.0/tcp/26654`)
+2. `ipc.subnet_id` must not be UNDEF (root: 0, children: [])
+
+If disabled, logs show: `"IPLD Resolver disabled."`
+If enabled, logs show: `"starting the IPLD Resolver Service..."`
+
+---
+
+## Step 1: Check Config on Remote
+
+SSH to validator-1 and inspect the fendermint config:
+
+```bash
+ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/fendermint/config/default.toml"
+```
+
+**Look for:**
+- `[resolver]` or `[resolver.connection]` section
+- `listen_addr = "/ip4/0.0.0.0/tcp/26654"` (or similar)
+- `[ipc]` section with `subnet_id = "/r314159/t410fjzsmxroshdmvdq5bg4zwqxx5lznwxaga4h7zgqa"`
+
+**Grep for key sections:**
+```bash
+ssh philip@34.16.93.183 "sudo -u ipc grep -A5 '\[resolver\]' /home/ipc/.ipc-node/fendermint/config/default.toml"
+ssh philip@34.16.93.183 "sudo -u ipc grep -A2 '\[ipc\]' /home/ipc/.ipc-node/fendermint/config/default.toml"
+ssh philip@34.16.93.183 "sudo -u ipc grep listen_addr /home/ipc/.ipc-node/fendermint/config/default.toml"
+```
+
+---
+
+## Step 2: Check Logs for Resolver Decision (CRITICAL)
+
+```bash
+# Resolver decision
+ssh philip@34.16.93.183 "sudo -u ipc grep -E 'IPLD Resolver|resolver' /home/ipc/.ipc-node/logs/*.log 2>/dev/null | tail -20"
+
+# Also check startup logs
+ssh philip@34.16.93.183 "sudo -u ipc tail -100 /home/ipc/.ipc-node/logs/*.app.log 2>/dev/null | grep -E 'Resolver|resolver|listen|26654'"
+```
+
+**Interpretation:**
+- `"IPLD Resolver disabled."` → resolver_enabled() returned false (listen_addr empty and/or subnet_id UNDEF)
+- `"starting the IPLD Resolver Service..."` → resolver started (port issue may be elsewhere)
+
+**If logs show "disabled":** The binary is loading config but resolver_enabled() is false. Possible causes:
+- `validator.toml` or `local.toml` overrides and clears listen_addr
+- Config parsing bug (e.g. Multiaddr type)
+- Different binary (f3-lifecycle) with different logic
+
+**If logs show "starting...":** Resolver runs but port doesn't bind. Check for "IPLD Resolver Service failed" or bind errors.
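
The interpretation rules in Step 2 can be sketched as a small shell helper. This is a minimal sketch, assuming the two log messages quoted in the troubleshooting doc appear verbatim in the log file; the function name `classify_resolver_log` is hypothetical and is not part of `ipc-cli` or the manager script. If a log contains both messages (for example across restarts), the sketch reports `starting`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which resolver message, if any, a log file contains.
classify_resolver_log() {
  local log_file="$1"
  if grep -qF 'starting the IPLD Resolver Service' "$log_file"; then
    # resolver_enabled() returned true; look for bind errors next.
    echo "starting"
  elif grep -qF 'IPLD Resolver disabled' "$log_file"; then
    # resolver_enabled() returned false; check listen_addr and subnet_id.
    echo "disabled"
  else
    # Neither message: per the diagnostic table, suspect a binary without the resolver code.
    echo "neither"
  fi
}
```

Run it on the validator against each file matched by `/home/ipc/.ipc-node/logs/*.log`; a result of `neither` corresponds to the case recorded in the diagnostic table.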
+
+---
+
+## Step 3: Check Start Script (What Actually Runs)
+
+```bash
+ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/start-node.sh 2>/dev/null || echo 'File not found'"
+```
+
+**Verify:** Does it contain `export FM_RESOLVER__CONNECTION__LISTEN_ADDR` and `export FM_IPC__SUBNET_ID`?
+
+---
+
+## Step 4: Check How Node Is Currently Running
+
+```bash
+ssh philip@34.16.93.183 "ps aux | grep 'ipc-cli node start' | grep -v grep"
+```
+
+**Check:** Is the process started by start-node.sh or by a direct nohup command? (env vars only apply if set before the process starts)
+
+---
+
+## Step 5: Manual Test – Run With Explicit Env Vars
+
+Stop the node, then run manually with env vars to isolate whether config or env is the issue:
+
+```bash
+# On validator-1 (34.16.93.183)
+ssh philip@34.16.93.183
+
+# Stop existing node
+sudo pkill -f "ipc-cli node start" || true
+sleep 3
+
+# Run as ipc user with explicit env vars (no wrapper script)
+sudo -u ipc env \
+  FM_RESOLVER__CONNECTION__LISTEN_ADDR=/ip4/0.0.0.0/tcp/26654 \
+  FM_IPC__SUBNET_ID=/r314159/t410fjzsmxroshdmvdq5bg4zwqxx5lznwxaga4h7zgqa \
+  /home/ipc/ipc/target/release/ipc-cli node start --home /home/ipc/.ipc-node
+
+# Let it run 15-20 seconds, then Ctrl+C to stop
+# In another terminal, check port:
+# ssh philip@34.16.93.183 "ss -tuln | grep 26654"
+```
+
+**If port 26654 appears:** Env vars work; the wrapper script or how it's invoked is the problem.
+**If port 26654 does NOT appear:** Config or binary (e.g. f3-lifecycle branch) may disable the resolver.
+
+---
+
+## Step 6: Check for Override Configs
+
+Config load order: default.toml → validator.toml → local.toml → env. Later overrides can clear earlier values.
+
+```bash
+ssh philip@34.16.93.183 "sudo -u ipc ls -la /home/ipc/.ipc-node/fendermint/config/"
+ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/fendermint/config/validator.toml 2>/dev/null || echo 'No validator.toml'"
+ssh philip@34.16.93.183 "sudo -u ipc cat /home/ipc/.ipc-node/fendermint/config/local.toml 2>/dev/null || echo 'No local.toml'"
+```
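
The load order in Step 6 means the last layer that defines a key wins, and an empty value in a later layer clears an earlier one. The sketch below illustrates that rule only; it assumes each file sets `listen_addr` on a single `key = "value"` line and ignores TOML section nesting, so it approximates the outcome rather than reproducing how fendermint parses its config. The function name `effective_listen_addr` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the listen_addr that survives the layered load order.
# Pass the config files in load order: default.toml validator.toml local.toml.
effective_listen_addr() {
  local value="" file line
  for file in "$@"; do
    [ -f "$file" ] || continue
    # Take the last listen_addr assignment in this file, if any.
    line=$(grep -E '^[[:space:]]*listen_addr[[:space:]]*=' "$file" | tail -n 1)
    [ -n "$line" ] || continue
    # Keep the text between the double quotes; "" clears the value.
    value=$(printf '%s\n' "$line" | sed -E 's/^[^=]*=[[:space:]]*"([^"]*)".*$/\1/')
  done
  # The environment is the final layer and overrides every file when set.
  if [ -n "${FM_RESOLVER__CONNECTION__LISTEN_ADDR+x}" ]; then
    value="$FM_RESOLVER__CONNECTION__LISTEN_ADDR"
  fi
  printf '%s\n' "$value"
}
```

An empty result after passing all three files suggests that `validator.toml` or `local.toml` cleared the address, which is the first cause listed under "If logs show disabled".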
+
+## Step 7: Check Binary / Branch
+
+```bash
+# Fix safe.directory first, then check branch
+ssh philip@34.16.93.183 "sudo -u ipc git -C /home/ipc/ipc config --global --add safe.directory /home/ipc/ipc 2>/dev/null; sudo -u ipc bash -c 'cd /home/ipc/ipc && git branch -v && git log -1 --oneline'"
+```
+
+**Note:** If validators run `f3-lifecycle` (or another branch), resolver logic may differ from `feature/subnet-bootstrapping`.
+
+---
+
+## Step 8: Check Default Config Template
+
+If the node was initialized with a different node-init, the default.toml may have been generated without resolver settings:
+
+```bash
+ssh philip@34.16.93.183 "sudo -u ipc head -100 /home/ipc/.ipc-node/fendermint/config/default.toml"
+```
+
+---
+
+## Summary: Decision Tree
+
+| Config has listen_addr? | Config has subnet_id? | Log says "disabled"? | Likely cause |
+|-------------------------|----------------------|----------------------|--------------|
+| No / empty | - | Yes | Config missing resolver.connection.listen_addr |
+| Yes | No / UNDEF | Yes | Config missing ipc.subnet_id |
+| Yes | Yes | Yes | Env override not applied (script/quoting) or binary differs |
+| Yes | Yes | No ("starting...") | Resolver starts but port bind fails (e.g. permission, conflict) |
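
The decision tree can be written as a lookup function. This is a sketch under stated assumptions: the function name `likely_cause` and its `yes`/`no` and `disabled`/`starting`/`neither` arguments are hypothetical conventions, and the `neither` branch is not a row of the table but follows the ROOT CAUSE note at the top of the troubleshooting doc.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the three observations to the table's "Likely cause".
# Usage: likely_cause <has_listen_addr: yes|no> <has_subnet_id: yes|no> <log: disabled|starting|neither>
likely_cause() {
  local has_listen_addr="$1" has_subnet_id="$2" log_state="$3"
  if [ "$log_state" = "neither" ]; then
    # Not a table row: neither message logged points at the binary itself.
    echo "binary built from a branch without the resolver code"
  elif [ "$log_state" = "starting" ]; then
    echo "resolver starts but port bind fails"
  elif [ "$has_listen_addr" != "yes" ]; then
    echo "config missing resolver.connection.listen_addr"
  elif [ "$has_subnet_id" != "yes" ]; then
    echo "config missing ipc.subnet_id"
  else
    echo "env override not applied or binary differs"
  fi
}
```

For the run recorded in the diagnostic table, `likely_cause yes yes neither` selects the binary branch, which matches the fix of rebuilding from `feature/subnet-bootstrapping`.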
+
+---
+
+## After Finding Root Cause
+
+1. **If config is wrong:** Fix default.toml (or re-run node init with correct node-init.yml)
+2. **If env vars not applied:** Fix start script invocation (wrapper script, quoting, or use systemd with Environment=)
+3. **If binary/branch differs:** Build from feature/subnet-bootstrapping or adapt to that branch's config
