Commit 06bb95f

fix: [#222] configure SSH port via cloud-init with reboot pattern
Implements custom SSH port configuration during VM provisioning using cloud-init's reboot pattern following Hetzner best practices.

Solution Overview:
- Cloud-init writes SSH config file and triggers system reboot
- Reboot ensures clean SSH restart with new port configuration
- Provision handler waits for configured port (not default port 22)
- Increased timeout from 60s to 120s for cloud-init + reboot time

Key Changes:
1. Cloud-init template: Added write_files + runcmd with reboot
2. Provision handler: Use configured SSH port in wait_for_readiness()
3. SSH adapter: Increased DEFAULT_MAX_RETRY_ATTEMPTS from 30 to 60
4. Documentation: Created ADR and updated issue spec

Technical Details:
- Cloud-init creates /etc/ssh/sshd_config.d/99-custom-port.conf
- System reboot guarantees SSH only on custom port (no port 22)
- Total timeout: 120 seconds (60 attempts × 2 second interval)
- Testing confirmed: SSH listens only on configured port after reboot

Why Reboot Approach:
- systemctl restart doesn't kill old SSH process when port changes
- bootcmd ineffective - systemd auto-restarts SSH after bootcmd
- Reboot is cleaner and follows Hetzner cloud-config tutorial

Files Modified:
- templates/tofu/common/cloud-init.yml.tera
- src/application/command_handlers/provision/handler.rs
- src/adapters/ssh/config.rs
- docs/decisions/cloud-init-ssh-port-reboot.md (new)
- docs/decisions/README.md
- docs/issues/222-configure-ssh-service-port.md
- project-words.txt

References:
- Hetzner cloud-config tutorial section 5.3
- Issue #222
1 parent d5d00ba commit 06bb95f

7 files changed

Lines changed: 300 additions & 30 deletions

docs/decisions/README.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ This directory contains architectural decision records for the Torrust Tracker D
66

77
| Status | Date | Decision | Summary |
88
| ------------- | ---------- | ----------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
9+
| ✅ Accepted | 2025-12-11 | [Cloud-Init SSH Port Configuration with Reboot](./cloud-init-ssh-port-reboot.md) | Use cloud-init with reboot pattern to configure custom SSH ports during VM provisioning |
910
| ✅ Accepted | 2025-12-10 | [Single Docker Image for Sequential E2E Command Testing](./single-docker-image-sequential-testing.md) | Use single Docker image with sequential command execution instead of multi-image phases |
1011
| ✅ Accepted | 2025-12-09 | [Register Command SSH Port Override](./register-ssh-port-override.md) | Add optional --ssh-port argument to register command for non-standard SSH ports |
1112
| ✅ Accepted | 2025-11-19 | [Disable MD060 Table Formatting Rule](./md060-table-formatting-disabled.md) | Disable MD060 to allow flexible table formatting and emoji usage |

docs/decisions/cloud-init-ssh-port-reboot.md

Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
1+
# Decision: Cloud-Init SSH Port Configuration with Reboot
2+
3+
## Status
4+
5+
Accepted
6+
7+
## Date
8+
9+
2025-12-11
10+
11+
## Context
12+
13+
The deployer needs to support custom SSH ports for security and flexibility. The SSH port configuration must be applied **during VM provisioning** (not later in the configure phase) because:
14+
15+
1. **Provision phase dependencies**: The `WaitForCloudInitStep` runs during provision and uses Ansible to wait for cloud-init completion. Ansible connects using the custom port from the inventory configuration.
16+
17+
2. **Timing requirement**: If SSH is not already listening on the custom port when `WaitForCloudInitStep` executes, the provision command fails with connection errors.
18+
19+
3. **Architectural correctness**: SSH port is infrastructure configuration, not application configuration. It should be set during infrastructure provisioning, not as a post-provisioning step.
20+
21+
The challenge was ensuring SSH service reliably restarts with the new port configuration during cloud-init execution. Multiple approaches were tested:
22+
23+
- **systemctl restart**: Does not kill the old SSH process when the port changes, so SSH ends up listening on both port 22 and the custom port
24+
- **pkill + systemctl start**: Works but is brittle and non-standard
25+
- **bootcmd disable + runcmd restart**: Ineffective because systemd automatically re-enables and starts SSH after bootcmd completes
26+
27+
## Decision
28+
29+
We configure the custom SSH port via cloud-init using the **`write_files` + `reboot` pattern**, following Hetzner's cloud-config best practices:
30+
31+
1. **Write SSH configuration file** using cloud-init's `write_files` directive:
32+
33+
   ```yaml
   write_files:
     - path: /etc/ssh/sshd_config.d/99-custom-port.conf
       content: |
         # Custom SSH port configuration
         Port {{ ssh_port }}
       permissions: "0644"
       owner: root:root
   ```
42+
43+
2. **Trigger system reboot** in cloud-init's `runcmd` phase:
44+
45+
   ```yaml
   runcmd:
     - reboot
   ```
49+
50+
The reboot ensures:
51+
52+
- SSH service cleanly restarts with the new configuration
53+
- No old SSH processes remain on port 22
54+
- All services start in a consistent state
55+
- Package updates are applied (if cloud-init installed packages)
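
To make the content of the drop-in file from step 1 concrete, here is a minimal Rust sketch of the text that ends up in `/etc/ssh/sshd_config.d/99-custom-port.conf`. The deployer itself renders this file through the Tera template. `render_sshd_port_dropin` is a hypothetical helper used only for illustration, and the port-0 check is an assumption about what input validation might look like.

```rust
/// Hypothetical helper (not part of the deployer): renders the content of
/// /etc/ssh/sshd_config.d/99-custom-port.conf for a given port.
/// Port 0 is rejected because sshd cannot listen on it.
pub fn render_sshd_port_dropin(port: u16) -> Result<String, String> {
    if port == 0 {
        return Err("SSH port must be in the range 1-65535".to_string());
    }
    // Same two lines the cloud-init `write_files` entry produces.
    Ok(format!("# Custom SSH port configuration\nPort {port}\n"))
}
```

Using `u16` for the port already rules out values above 65535, so only the zero case needs an explicit check.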
56+
57+
Additionally, we made two critical fixes to the provision handler:
58+
59+
1. **Use configured SSH port**: Changed `wait_for_readiness()` to use `SocketAddr::new(ip, ssh_port)` instead of `SshConfig::with_default_port()`, ensuring the provision handler waits for SSH on the correct custom port (not port 22).
60+
61+
2. **Increase SSH connectivity timeout**: Raised `DEFAULT_MAX_RETRY_ATTEMPTS` from 30 to 60 attempts (120 seconds total), accounting for the ~70-80 second cloud-init completion time plus reboot time.
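
The timeout figures can be checked with a little arithmetic. The sketch below assumes the 2-second retry interval stated in the commit message. `RETRY_INTERVAL`, `total_timeout`, and `has_headroom` are illustrative names, not identifiers from the codebase.

```rust
use std::time::Duration;

/// Assumed retry interval between SSH connectivity attempts (2 seconds).
pub const RETRY_INTERVAL: Duration = Duration::from_secs(2);

/// Total sleep time across all attempts: attempts multiplied by the interval.
pub fn total_timeout(max_attempts: u32, interval: Duration) -> Duration {
    interval * max_attempts
}

/// True when the timeout exceeds the expected cloud-init plus reboot time.
pub fn has_headroom(timeout: Duration, cloud_init: Duration, reboot: Duration) -> bool {
    timeout > cloud_init + reboot
}
```

With 60 attempts this gives 120 seconds, which covers the upper estimate of 80 seconds for cloud-init plus 20 seconds for the reboot. The previous 30 attempts (60 seconds) would not. Real elapsed time may be somewhat longer, since each connection attempt also takes time.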
62+
63+
## Consequences
64+
65+
### Positive
66+
67+
- **Clean SSH restart**: Reboot guarantees SSH only listens on the custom port, no lingering processes on port 22
68+
- **Industry best practice**: Follows Hetzner's documented cloud-config pattern for SSH port changes
69+
- **Simple and reliable**: Single `reboot` command is simpler than managing service lifecycle manually
70+
- **Correct architecture**: Infrastructure configuration happens during infrastructure provisioning
71+
- **No special cases**: Ansible can connect normally using the configured port without overrides or workarounds
72+
- **Correct readiness check**: Provision handler waits for the configured port instead of port 22, preventing connection failures at runtime
73+
74+
### Negative
75+
76+
- **Slower provisioning**: Reboot adds ~10-20 seconds to VM initialization time
77+
- **Additional wait time**: Provision handler must wait longer (120s instead of 60s) for cloud-init and reboot to complete
78+
- **Complexity**: Three separate changes required (cloud-init template, provision handler port usage, timeout increase)
79+
80+
### Risks
81+
82+
- **Reboot timing**: If the reboot takes longer than expected, the SSH connectivity check might time out (mitigated by the 120-second timeout)
83+
- **Cloud-init failure**: If reboot fails or cloud-init has errors, the provision will fail (acceptable - we want to catch infrastructure issues early)
84+
85+
## Alternatives Considered
86+
87+
### Alternative 1: Ansible Playbook in Configure Phase
88+
89+
**Approach**: Use an Ansible playbook during the `configure` phase to reconfigure SSH port after provisioning.
90+
91+
**Why Rejected**:
92+
93+
- **Timing problem**: `WaitForCloudInitStep` in provision already fails before reaching configure phase
94+
- **Architectural mismatch**: SSH port is infrastructure config, should be set during VM initialization
95+
- **Added complexity**: Requires special connection handling (connect on 22, reconfigure, reconnect on custom port)
96+
- **More failure points**: Port transition adds potential for connection issues
97+
98+
### Alternative 2: systemctl restart Without Reboot
99+
100+
**Approach**: Use cloud-init `runcmd` to execute `systemctl restart ssh` without full system reboot.
101+
102+
**Why Rejected**:
103+
104+
- **Doesn't kill old process**: `systemctl restart` doesn't terminate the existing SSH daemon when port changes
105+
- **Dual port listening**: Results in SSH listening on both port 22 (old) and custom port (new)
106+
- **Testing showed failure**: Multiple test attempts confirmed SSH remained on port 22 after cloud-init "completion"
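
The dual-port symptom described above can be detected from outside the VM with a plain TCP probe. The following is a sketch under assumptions: `listens_only_on` and `is_open` are hypothetical helpers, and a successful TCP connect only shows that something accepts connections on the port, not that it is sshd.

```rust
use std::net::{IpAddr, SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true if a TCP connection to `ip:port` succeeds within `timeout`.
fn is_open(ip: IpAddr, port: u16, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&SocketAddr::new(ip, port), timeout).is_ok()
}

/// Hypothetical check: the custom port accepts connections and the old
/// port (22 in the scenario above) does not.
pub fn listens_only_on(ip: IpAddr, custom_port: u16, old_port: u16, timeout: Duration) -> bool {
    is_open(ip, custom_port, timeout) && !is_open(ip, old_port, timeout)
}
```

In the failure mode of this alternative the check would return `false`, because port 22 still accepts connections after cloud-init reports completion.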
107+
108+
### Alternative 3: pkill + systemctl start
109+
110+
**Approach**: Kill SSH processes with `pkill -9 sshd`, then start fresh with `systemctl start ssh`.
111+
112+
**Why Rejected**:
113+
114+
- **Non-standard**: Violates best practices for service management
115+
- **Brittle**: Process killing is less reliable than clean reboot
116+
- **Not industry pattern**: No documentation or precedent for this approach
117+
118+
### Alternative 4: Wait for Port 22, Then Handle Port Change
119+
120+
**Approach**: Keep provision handler waiting for port 22, handle port transition separately.
121+
122+
**Why Rejected**:
123+
124+
- **Wrong abstraction**: Provision handler should use the configured port, not hardcode defaults
125+
- **Added complexity**: Would require special logic to detect port changes mid-provision
126+
- **Race conditions**: SSH might move to custom port at unpredictable times during cloud-init
127+
128+
## Related Decisions
129+
130+
- [Register Command SSH Port Override](./register-ssh-port-override.md) - Relates to SSH port handling in different commands
131+
- [Environment Variable Prefix](./environment-variable-prefix.md) - Relates to configuration management patterns
132+
133+
## References
134+
135+
- [Hetzner Cloud-Config Tutorial](https://community.hetzner.com/tutorials/basic-cloud-config) - Section 5.3 documents the reboot pattern for SSH configuration
136+
- [Cloud-Init Documentation](https://cloudinit.readthedocs.io/en/latest/) - Official cloud-init reference
137+
- [Issue #222: Configure SSH Service Port](../issues/222-configure-ssh-service-port.md) - Original issue specification
138+
- [OpenSSH sshd_config(5)](https://manpages.debian.org/bookworm/openssh-server/sshd_config.5.en.html#Include) - Documents the `Include` directive behind the `/etc/ssh/sshd_config.d/` drop-in directory used on Debian and Ubuntu

docs/issues/222-configure-ssh-service-port.md

Lines changed: 138 additions & 14 deletions
@@ -21,19 +21,20 @@ This creates a critical configuration mismatch:
2121

2222
**Result**: The `provision` command fails during the `WaitForCloudInitStep` because Ansible cannot connect to the instance on the configured custom port (the playbook uses the inventory which specifies the custom port, but SSH is still on port 22). The deployment cannot proceed beyond provisioning.
2323

24-
## Solution Implemented: Cloud-Init SSH Port Configuration
24+
## Solution Implemented: Cloud-Init SSH Port Configuration with Reboot
2525

26-
After analysis, we determined the best solution is to **configure the SSH port via cloud-init during VM initialization**, rather than trying to reconfigure it later in the configure phase. This approach:
26+
After extensive analysis and testing, we determined the best solution is to **configure the SSH port via cloud-init during VM initialization with a system reboot**, following Hetzner's cloud-config best practices. This approach:
2727

2828
- ✅ Configures SSH port BEFORE any SSH connections are attempted
29+
- ✅ Ensures clean SSH restart with no lingering processes on port 22
2930
- ✅ No special connection handling or port overrides needed
3031
- ✅ Works seamlessly with both `WaitForSSHConnectivityStep` and `WaitForCloudInitStep`
31-
- ✅ Simpler and more reliable than post-provisioning reconfiguration
32+
- ✅ Simpler and more reliable than post-provisioning reconfiguration or manual service management
3233

3334
### Implementation Overview
3435

3536
**Phase**: `provision` (VM initialization)
36-
**Mechanism**: Cloud-init `write_files` directive
37+
**Mechanism**: Cloud-init `write_files` + `runcmd` with reboot
3738
**Component**: `templates/tofu/common/cloud-init.yml.tera`
3839

3940
The SSH port is configured by:
@@ -48,18 +49,83 @@ The SSH port is configured by:
4849
2. **During VM Initialization** (first boot):
4950

5051
- Cloud-init creates `/etc/ssh/sshd_config.d/99-custom-port.conf` with the port setting
51-
- Cloud-init restarts the SSH service to apply the configuration
52+
- Cloud-init triggers system reboot via `runcmd: [reboot]`
53+
- System reboots, SSH service starts cleanly with new configuration
5254
- This happens BEFORE any Ansible connection attempts
5355

5456
3. **During SSH Connectivity Checks**:
55-
- Both `WaitForSSHConnectivityStep` and `WaitForCloudInitStep` connect using the configured custom port
56-
- SSH service is already listening on the correct port - connection succeeds immediately
57+
- Provision handler's `wait_for_readiness()` uses the configured custom port (not default port 22)
58+
- SSH connectivity timeout increased to 120 seconds to account for cloud-init + reboot time (~70-80s)
59+
- Both `WaitForSSHConnectivityStep` and `WaitForCloudInitStep` connect using the custom port
60+
- SSH service is listening only on the correct port - connection succeeds
61+
62+
### Critical Implementation Details
63+
64+
#### Cloud-Init Reboot Pattern
65+
66+
The cloud-init template uses the **reboot pattern** as documented in [Hetzner's cloud-config tutorial](https://community.hetzner.com/tutorials/basic-cloud-config):
67+
68+
```yaml
write_files:
  - path: /etc/ssh/sshd_config.d/99-custom-port.conf
    content: |
      # Custom SSH port configuration
      Port {{ ssh_port }}
    permissions: "0644"
    owner: root:root

runcmd:
  # Reboot to apply SSH port configuration
  # The reboot ensures SSH service fully restarts with the new port from write_files
  # This is the recommended approach per Hetzner cloud-config best practices
  - reboot
```
83+
84+
**Why reboot?** Three critical reasons (from Hetzner documentation):
85+
86+
1. Package updates may require reboot for patches to work properly
87+
2. Service configurations (like SSH port changes) are applied cleanly
88+
3. System starts in a consistent state with all configurations active
89+
90+
**Why not `systemctl restart ssh`?** Testing revealed multiple issues:
91+
92+
- `systemctl restart` doesn't kill the old SSH process when the port changes
93+
- Results in SSH listening on **both** port 22 (old PID) and custom port (new PID)
94+
- Cloud-init's runcmd execution of `systemctl restart ssh` often completes without actually restarting SSH
95+
- systemd automatically re-enables and starts SSH after bootcmd, making bootcmd-based approaches ineffective
96+
97+
#### Provision Handler Port Configuration
98+
99+
Two critical fixes were required in the provision handler:
100+
101+
1. **Use Configured Port** (`src/application/command_handlers/provision/handler.rs`):
102+
103+
   ```rust
   // Before: Always waited for default port 22
   let ssh_config = SshConfig::with_default_port(instance_ip);

   // After: Wait for configured custom port
   let ssh_port = environment.ssh_port();
   let ssh_config = SshConfig::new(SocketAddr::new(instance_ip, ssh_port));
   ```
113+
114+
2. **Increase Timeout** (`src/adapters/ssh/config.rs`):
115+
116+
   ```rust
   // Changed from 30 to 60 attempts (120 seconds total)
   // Accounts for cloud-init completion (~70-80s) + reboot time
   pub const DEFAULT_MAX_RETRY_ATTEMPTS: u32 = 60;
   ```
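
The two fixes combine into a retry loop against the configured port. The sketch below is not the deployer's `wait_for_readiness()` implementation. `wait_for_port` is a hypothetical stand-in that probes with a plain TCP connect from the standard library, which shows only that the port accepts connections, not that an SSH handshake succeeds.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::Duration;

/// Hypothetical sketch: polls `addr` until a TCP connection succeeds or the
/// attempts run out. Returns true as soon as one connection is accepted.
pub fn wait_for_port(addr: SocketAddr, max_attempts: u32, interval: Duration) -> bool {
    for attempt in 1..=max_attempts {
        // Bound each attempt so an unreachable host cannot stall the loop.
        if TcpStream::connect_timeout(&addr, interval).is_ok() {
            return true;
        }
        // Sleep between attempts, but not after the last one.
        if attempt < max_attempts {
            thread::sleep(interval);
        }
    }
    false
}
```

Called with `SocketAddr::new(instance_ip, ssh_port)`, 60 attempts, and a 2-second interval, the loop mirrors the values described above. During the reboot window the probe simply fails and is retried.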
57123

58124
### Files Modified
59125

60126
#### Template Files
61127

62-
- **`templates/tofu/common/cloud-init.yml.tera`**: Added `write_files` section to create SSH port configuration file, and `runcmd` to restart SSH service
128+
- **`templates/tofu/common/cloud-init.yml.tera`**: Added `write_files` section to create SSH port configuration file, and `runcmd: [reboot]` to trigger system reboot for clean SSH restart
63129

64130
#### Infrastructure Layer (DDD)
65131

@@ -69,11 +135,17 @@ The SSH port is configured by:
69135

70136
#### Application Layer (DDD)
71137

72-
- **`src/application/command_handlers/provision/handler.rs`**: Updated to pass `ssh_port` from environment to `TofuProjectGenerator`
138+
- **`src/application/command_handlers/provision/handler.rs`**: Updated to pass `ssh_port` from environment to `TofuProjectGenerator` and to `wait_for_readiness()`, changed to use `SocketAddr::new(ip, ssh_port)` instead of `SshConfig::with_default_port()`
139+
140+
#### Adapters Layer
141+
142+
- **`src/adapters/ssh/config.rs`**: Increased `DEFAULT_MAX_RETRY_ATTEMPTS` from 30 to 60 (120 seconds total timeout) to account for cloud-init completion and reboot time
73143

74-
### Why We Discarded the Ansible-Based Approach
144+
### Why We Discarded Alternative Approaches
75145

76-
**Initial Plan**: We initially considered using an Ansible playbook in the `configure` phase to reconfigure SSH port after provisioning.
146+
#### Alternative 1: Ansible-Based Approach (Configure Phase)
147+
148+
**Initial Plan**: Use an Ansible playbook in the `configure` phase to reconfigure SSH port after provisioning.
77149

78150
**Why It Was Discarded**:
79151

@@ -95,7 +167,38 @@ The SSH port is configured by:
95167
- Not during application setup (configure phase)
96168
- Cloud-init is the proper tool for initial system configuration
97169

98-
**Conclusion**: The cloud-init approach is cleaner, more reliable, and architecturally correct. It configures infrastructure settings during infrastructure provisioning, not as a post-provisioning step.
170+
#### Alternative 2: systemctl restart Without Reboot
171+
172+
**Approach**: Use cloud-init `runcmd` to execute `systemctl restart ssh` without full system reboot.
173+
174+
**Why It Was Discarded**:
175+
176+
- Testing revealed `systemctl restart ssh` doesn't kill the old SSH process when port changes
177+
- Results in SSH listening on **both** port 22 (old PID) and custom port (new PID)
178+
- Cloud-init runcmd execution often completes without SSH actually restarting
179+
- Multiple test attempts confirmed SSH remained on port 22 after cloud-init reported "completion"
180+
181+
#### Alternative 3: bootcmd disable + runcmd restart
182+
183+
**Approach**: Use `bootcmd` to disable SSH before it auto-starts, then use `runcmd` to restart it with new config.
184+
185+
**Why It Was Discarded**:
186+
187+
- systemd automatically re-enables and starts SSH approximately 3 seconds after bootcmd disables it
188+
- Testing showed SSH started at 19:19:51 despite bootcmd completing at 19:19:48
189+
- systemd service management overrides cloud-init's bootcmd attempts
190+
191+
#### Alternative 4: pkill + systemctl start
192+
193+
**Approach**: Kill SSH processes with `pkill -9 sshd`, then start fresh with `systemctl start ssh`.
194+
195+
**Why It Was Discarded**:
196+
197+
- Non-standard approach, violates best practices for service management
198+
- More brittle than clean system reboot
199+
- No industry precedent or documentation for this pattern
200+
201+
**Conclusion**: The cloud-init with reboot approach is the cleanest, most reliable, and follows industry best practices (Hetzner). It configures infrastructure settings during infrastructure provisioning with a guaranteed clean service restart.
99202

100203
## Acceptance Criteria
101204

@@ -132,22 +235,43 @@ The SSH port is configured by:
132235

133236
### Cloud-Init Configuration Format
134237

135-
The cloud-init template uses `write_files` to create a drop-in configuration file:
238+
The cloud-init template uses the `write_files` + `reboot` pattern, following Hetzner best practices:
136239

137240
```yaml
138241
write_files:
139242
- path: /etc/ssh/sshd_config.d/99-custom-port.conf
140243
content: |
244+
# Custom SSH port configuration
141245
Port {{ ssh_port }}
142246
permissions: "0644"
247+
owner: root:root
143248
144249
runcmd:
145-
- systemctl restart ssh
250+
# Reboot to apply SSH port configuration
251+
# The reboot ensures SSH service fully restarts with the new port from write_files
252+
# This is the recommended approach per Hetzner cloud-config best practices
253+
- reboot
146254
```
147255

148256
This approach:
149257

150258
- Uses Ubuntu's drop-in configuration directory pattern
259+
- Avoids modifying the main `/etc/ssh/sshd_config` file
260+
- Ensures clean SSH restart via system reboot (no lingering processes on port 22)
261+
- Follows industry best practices documented by Hetzner
262+
- Simpler than manual service lifecycle management
263+
264+
### Provision Handler Timeout Considerations
265+
266+
The provision handler waits up to **120 seconds** (60 attempts × 2 seconds) for SSH connectivity:
267+
268+
- Cloud-init completion takes approximately 70-80 seconds
269+
- System reboot adds approximately 10-20 seconds
270+
- Total time typically 80-100 seconds
271+
- 120-second timeout provides sufficient buffer
272+
273+
This timeout increase (from the previous 60 seconds) ensures reliable provisioning with custom SSH ports.
274+
151275
- Overrides the default port without modifying main config
152276
- Takes effect after the reboot triggered by cloud-init
153277
- Works on all Ubuntu versions with systemd
