Skip to content

Commit f4cb598

Browse files
committed
feat(installer): add post-install verification
Replace direct sudo commands with run_as_target helper for consistency and add comprehensive error handling and verification to the installation process. This includes: - Postcondition assertions in finalize() to verify critical files - Bootstrap verification for container file transfers - Post-install verification for container-based installations - Improved error handling with specific warning messages These changes ensure installation failures are detected early and provide clearer feedback when components are missing.
1 parent edeff30 commit f4cb598

3 files changed

Lines changed: 230 additions & 5 deletions

File tree

Lines changed: 157 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,157 @@
1+
# Long-Duration Installer Design Principles
2+
3+
Research notes from RCA of ACFS installation issues in LXD containers.
4+
5+
## Problem Context
6+
7+
Long-running installers (like ACFS's multi-phase bash installer) have unique reliability challenges:
8+
- Network transience
9+
- Mid-execution failures
10+
- User context switching (root ↔ target user)
11+
- Resumability requirements
12+
- Silent failures masking root causes
13+
14+
This document captures design principles for building robust installers.
15+
16+
---
17+
18+
## 1. Fail-Fast with Explicit Precondition Checks
19+
20+
Every phase should validate its preconditions *before* executing any actions, not discover failures mid-execution.
21+
22+
```bash
23+
# Good: Explicit precondition check
24+
install_phase_finalize() {
25+
# Preconditions
26+
[[ -d "$ACFS_HOME" ]] || { log_error "ACFS_HOME does not exist"; return 1; }
27+
[[ -w "$ACFS_HOME" ]] || { log_error "ACFS_HOME not writable"; return 1; }
28+
29+
# Actions only after preconditions pass
30+
...
31+
}
32+
```
33+
34+
**Why**: Catching issues early produces clearer error messages and avoids partial state.
35+
36+
---
37+
38+
## 2. Postcondition Assertions
39+
40+
After each phase completes, assert that the expected artifacts exist. Don't trust that commands succeeded—verify the outputs.
41+
42+
```bash
43+
# After installing scripts
44+
assert_file_exists() {
45+
[[ -f "$1" ]] || { log_error "Missing expected file: $1"; return 1; }
46+
}
47+
48+
# Postcondition check at end of finalize phase
49+
assert_file_exists "$ACFS_HOME/scripts/lib/dashboard.sh"
50+
assert_file_exists "$ACFS_HOME/scripts/lib/info.sh"
51+
assert_file_exists "/usr/local/bin/acfs"
52+
```
53+
54+
**Why**: Commands can succeed (exit 0) but produce no output due to edge cases. Verification catches these.
55+
56+
---
57+
58+
## 3. Idempotent Operations
59+
60+
Every step should be safe to re-run:
61+
- Use `mkdir -p` (doesn't fail if exists)
62+
- Check existence before creating
63+
- Use `cp -f` (overwrites safely)
64+
- Use `ln -sf` (replaces symlinks)
65+
66+
**Why**: Enables reliable resume after failures without manual cleanup.
67+
68+
---
69+
70+
## 4. Consistent User Context
71+
72+
Never mix root operations with user operations in the same logical step. Be explicit about which context you're in.
73+
74+
```bash
75+
# Bad: Mixed contexts
76+
$SUDO mkdir -p "$USER_DIR" # Creates as root
77+
chown "$USER:$USER" "$USER_DIR" # Fix ownership after
78+
79+
# Good: Consistent user context from the start
80+
run_as_target mkdir -p "$USER_DIR" # Created with correct ownership
81+
```
82+
83+
**Why**: Mixed contexts lead to ownership issues and race conditions.
84+
85+
---
86+
87+
## 5. Atomic Multi-Step Transactions
88+
89+
When multiple files must all succeed together, use a staging area and atomic move.
90+
91+
```bash
92+
# Stage files to temp, then move all atomically
93+
stage_dir=$(mktemp -d)
94+
cp script1.sh "$stage_dir/"
95+
cp script2.sh "$stage_dir/"
96+
# All copies succeeded, now move atomically
97+
mv "$stage_dir"/* "$ACFS_HOME/scripts/lib/"
98+
rmdir "$stage_dir"
99+
```
100+
101+
**Why**: Prevents partial installations where some files exist but not others.
102+
103+
---
104+
105+
## 6. Explicit Error Propagation (No Silent Failures)
106+
107+
Never use `|| true` for operations that *must* succeed. Reserve it only for truly optional operations.
108+
109+
```bash
110+
# Bad: Silently swallows critical failure
111+
$SUDO chown "$TARGET_USER:$TARGET_USER" "$ACFS_STATE_FILE" || true
112+
113+
# Good: Explicit error handling with degraded mode
114+
if ! $SUDO chown "$TARGET_USER:$TARGET_USER" "$ACFS_STATE_FILE"; then
115+
log_warn "Could not fix state.json ownership—some features may require sudo"
116+
fi
117+
```
118+
119+
**Why**: Silent failures cascade into mysterious downstream errors.
120+
121+
---
122+
123+
## 7. Structured Phase Orchestration
124+
125+
Use a single orchestrator pattern for all phases, never inline phase logic in `main()`.
126+
127+
```bash
128+
# Single pattern for all phases
129+
for phase_id in "${ACFS_PHASE_IDS[@]}"; do
130+
run_phase "$phase_id" "phase_${phase_id}" || exit 1
131+
done
132+
```
133+
134+
**Why**: Consistent orchestration enables consistent error handling, timing, and resumability.
135+
136+
---
137+
138+
## Application to ACFS
139+
140+
ACFS already implements many of these principles:
141+
- `state.sh`: Atomic writes, phase tracking, resume capability
142+
- `error_tracking.sh`: `try_step`, error context, output capture
143+
- `run_as_target`: Proper user context switching
144+
145+
The RCA revealed these patterns were **inconsistently applied** in the `finalize()` phase, leading to:
146+
- Directories created as root instead of target user
147+
- Silent ownership failures (`|| true`)
148+
- No postcondition verification
149+
- No bootstrap verification for container deployment
150+
151+
---
152+
153+
## Related Files
154+
155+
- [install.sh](file:///home/deeog/Desktop/agentic-coding/install.sh) - Main installer
156+
- [scripts/lib/state.sh](file:///home/deeog/Desktop/agentic-coding/scripts/lib/state.sh) - State management
157+
- [scripts/lib/error_tracking.sh](file:///home/deeog/Desktop/agentic-coding/scripts/lib/error_tracking.sh) - Error tracking

install.sh

Lines changed: 35 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -4625,7 +4625,7 @@ finalize() {
46254625

46264626
# Install acfs scripts (for acfs CLI subcommands)
46274627
log_detail "Installing acfs scripts"
4628-
try_step "Creating ACFS scripts directory" $SUDO mkdir -p "$ACFS_HOME/scripts/lib" || return 1
4628+
try_step "Creating ACFS scripts directory" run_as_target mkdir -p "$ACFS_HOME/scripts/lib" || return 1
46294629

46304630
# Install script libraries
46314631
try_step "Installing logging.sh" install_asset "scripts/lib/logging.sh" "$ACFS_HOME/scripts/lib/logging.sh" || return 1
@@ -4663,7 +4663,7 @@ finalize() {
46634663
try_step "Installing newproj_screens.sh" install_asset "scripts/lib/newproj_screens.sh" "$ACFS_HOME/scripts/lib/newproj_screens.sh" || return 1
46644664
try_step "Installing newproj_tui.sh" install_asset "scripts/lib/newproj_tui.sh" "$ACFS_HOME/scripts/lib/newproj_tui.sh" || return 1
46654665

4666-
try_step "Creating newproj_screens directory" $SUDO mkdir -p "$ACFS_HOME/scripts/lib/newproj_screens" || return 1
4666+
try_step "Creating newproj_screens directory" run_as_target mkdir -p "$ACFS_HOME/scripts/lib/newproj_screens" || return 1
46674667

46684668
local screens=(
46694669
"screen_agents_preview.sh"
@@ -4710,7 +4710,9 @@ finalize() {
47104710
# Legacy state file (only if state.sh is unavailable)
47114711
if type -t state_load &>/dev/null; then
47124712
if [[ -f "$ACFS_STATE_FILE" ]]; then
4713-
$SUDO chown "$TARGET_USER:$TARGET_USER" "$ACFS_STATE_FILE" || true
4713+
if ! $SUDO chown "$TARGET_USER:$TARGET_USER" "$ACFS_STATE_FILE"; then
4714+
log_warn "Could not set ownership on state.json"
4715+
fi
47144716
fi
47154717
else
47164718
cat > "$ACFS_STATE_FILE" << EOF
@@ -4729,6 +4731,36 @@ EOF
47294731
$SUDO chown "$TARGET_USER:$TARGET_USER" "$ACFS_STATE_FILE"
47304732
fi
47314733

4734+
# ============================================================
4735+
# Postcondition Assertions
4736+
# Verify critical files were installed correctly
4737+
# ============================================================
4738+
local critical_files=(
4739+
"$ACFS_HOME/scripts/lib/dashboard.sh"
4740+
"$ACFS_HOME/scripts/lib/info.sh"
4741+
"$ACFS_HOME/scripts/lib/state.sh:optional"
4742+
"$ACFS_HOME/bin/acfs"
4743+
)
4744+
4745+
local missing_critical=0
4746+
for f_entry in "${critical_files[@]}"; do
4747+
local f="${f_entry%%:*}"
4748+
local optional="${f_entry#*:}"
4749+
if [[ ! -f "$f" ]]; then
4750+
if [[ "$optional" == "optional" ]]; then
4751+
log_warn "Optional file missing after finalize: $f"
4752+
else
4753+
log_error "Critical file missing after finalize: $f"
4754+
missing_critical=1
4755+
fi
4756+
fi
4757+
done
4758+
4759+
if [[ $missing_critical -eq 1 ]]; then
4760+
log_error "finalize phase failed postcondition checks"
4761+
return 1
4762+
fi
4763+
47324764
log_success "Installation complete!"
47334765
}
47344766

scripts/local/acfs_container.sh

Lines changed: 38 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -82,8 +82,17 @@ cmd_create() {
8282
# Copy installer and repo files to container via tar pipe
8383
# This bypasses "Forbidden" errors some LXD setups produce when trying to read host files directly.
8484
log_detail "Transferring ACFS repo to container..."
85-
tar -C "$REPO_ROOT" -cf - install.sh scripts acfs acfs.manifest.yaml checksums.yaml 2>/dev/null | \
86-
acfs_lxc exec "$ACFS_CONTAINER_NAME" -- tar -xf - -C /tmp/
85+
if ! tar -C "$REPO_ROOT" -cf - install.sh scripts acfs acfs.manifest.yaml checksums.yaml 2>/dev/null | \
86+
acfs_lxc exec "$ACFS_CONTAINER_NAME" -- tar -xf - -C /tmp/; then
87+
log_error "Failed to transfer ACFS repo to container"
88+
return 1
89+
fi
90+
91+
# Verify critical bootstrap files exist in container
92+
if ! acfs_lxc exec "$ACFS_CONTAINER_NAME" -- test -f /tmp/scripts/acfs-global 2>/dev/null; then
93+
log_error "Bootstrap verification failed: scripts/acfs-global not in container"
94+
return 1
95+
fi
8796

8897
# Run installer inside container
8998
# We set ACFS_BOOTSTRAP_DIR=/tmp so install.sh knows where to find its own scripts/lib
@@ -96,6 +105,33 @@ cmd_create() {
96105
bash /tmp/install.sh --local --yes --mode vibe --skip-ubuntu-upgrade
97106
"
98107

108+
# Post-install verification
109+
log_info "Verifying installation..."
110+
local verification_warnings=0
111+
112+
# Check user-local acfs command
113+
if ! acfs_sandbox_exec "test -x ~/.local/bin/acfs" 2>/dev/null; then
114+
log_warn "User acfs command not found at ~/.local/bin/acfs"
115+
verification_warnings=1
116+
fi
117+
118+
# Check global wrapper
119+
if ! acfs_sandbox_exec_root "test -x /usr/local/bin/acfs" 2>/dev/null; then
120+
log_warn "Global acfs wrapper not found at /usr/local/bin/acfs"
121+
verification_warnings=1
122+
fi
123+
124+
# Check critical library scripts
125+
if ! acfs_sandbox_exec "test -f ~/.acfs/scripts/lib/dashboard.sh" 2>/dev/null; then
126+
log_warn "dashboard.sh not found in ~/.acfs/scripts/lib/"
127+
verification_warnings=1
128+
fi
129+
130+
if [[ $verification_warnings -eq 1 ]]; then
131+
log_warn "Some ACFS components may not have installed correctly"
132+
log_info "Run 'acfs-local shell' then 'acfs doctor' to diagnose"
133+
fi
134+
99135
echo ""
100136
echo "╔═══════════════════════════════════════════════════════════════╗"
101137
echo "║ Installation Complete! ║"

0 commit comments

Comments
 (0)