fix: [#222] configure SSH port via cloud-init with reboot pattern
Implements custom SSH port configuration during VM provisioning using
cloud-init's reboot pattern following Hetzner best practices.
Solution Overview:
- Cloud-init writes SSH config file and triggers system reboot
- Reboot ensures clean SSH restart with new port configuration
- Provision handler waits for configured port (not default port 22)
- Increased timeout from 60s to 120s for cloud-init + reboot time
Key Changes:
1. Cloud-init template: Added write_files + runcmd with reboot
2. Provision handler: Use configured SSH port in wait_for_readiness()
3. SSH adapter: Increased DEFAULT_MAX_RETRY_ATTEMPTS from 30 to 60
4. Documentation: Created ADR and updated issue spec
Technical Details:
- Cloud-init creates /etc/ssh/sshd_config.d/99-custom-port.conf
- System reboot guarantees SSH only on custom port (no port 22)
- Total timeout: 120 seconds (60 attempts × 2 second interval)
- Testing confirmed: SSH listens only on configured port after reboot
Why Reboot Approach:
- systemctl restart doesn't kill old SSH process when port changes
- bootcmd ineffective - systemd auto-restarts SSH after bootcmd
- Reboot is cleaner and follows Hetzner cloud-config tutorial
Files Modified:
- templates/tofu/common/cloud-init.yml.tera
- src/application/command_handlers/provision/handler.rs
- src/adapters/ssh/config.rs
- docs/decisions/cloud-init-ssh-port-reboot.md (new)
- docs/decisions/README.md
- docs/issues/222-configure-ssh-service-port.md
- project-words.txt
References:
- Hetzner cloud-config tutorial section 5.3
- Issue #222
| ✅ Accepted | 2025-12-11 | [Cloud-Init SSH Port Configuration with Reboot](./cloud-init-ssh-port-reboot.md) | Use cloud-init with reboot pattern to configure custom SSH ports during VM provisioning |
| ✅ Accepted | 2025-12-10 | [Single Docker Image for Sequential E2E Command Testing](./single-docker-image-sequential-testing.md) | Use single Docker image with sequential command execution instead of multi-image phases |
| ✅ Accepted | 2025-12-09 | [Register Command SSH Port Override](./register-ssh-port-override.md) | Add optional --ssh-port argument to register command for non-standard SSH ports |
# Decision: Cloud-Init SSH Port Configuration with Reboot

## Status

Accepted

## Date

2025-12-11

## Context

The deployer needs to support custom SSH ports for security and flexibility. The SSH port configuration must be applied **during VM provisioning** (not later in the configure phase) because:

1. **Provision phase dependencies**: The `WaitForCloudInitStep` runs during provision and uses Ansible to wait for cloud-init completion. Ansible connects using the custom port from the inventory configuration.

2. **Timing requirement**: If SSH is not already listening on the custom port when `WaitForCloudInitStep` executes, the provision command fails with connection errors.

3. **Architectural correctness**: SSH port is infrastructure configuration, not application configuration. It should be set during infrastructure provisioning, not as a post-provisioning step.

The challenge was ensuring SSH service reliably restarts with the new port configuration during cloud-init execution. Multiple approaches were tested:

- **systemctl restart**: Does not kill the old SSH process when port changes, resulting in SSH listening on both ports 22 and the custom port
- **pkill + systemctl start**: Works but is brittle and non-standard
- **bootcmd disable + runcmd restart**: Ineffective because systemd automatically re-enables and starts SSH after bootcmd completes

## Decision

We configure the custom SSH port via cloud-init using the **`write_files` + `reboot` pattern**, following Hetzner's cloud-config best practices:

1. **Write SSH configuration file** using cloud-init's `write_files` directive:

2. **Trigger system reboot** in cloud-init's `runcmd` phase:

   ```yaml
   runcmd:
     - reboot
   ```

The reboot ensures:

- SSH service cleanly restarts with the new configuration
- No old SSH processes remain on port 22
- All services start in a consistent state
- Package updates are applied (if cloud-init installed packages)
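
To make the `write_files` + `reboot` pattern concrete, here is a minimal sketch in Rust that renders a cloud-config document of that shape. It is an illustration under stated assumptions, not the project's template: the project uses a Tera template, `render_cloud_config` is a hypothetical name, and the `permissions` value is assumed. Only the drop-in path and the `runcmd: reboot` step come from this decision record.

```rust
/// Hypothetical renderer for a cloud-config that writes an sshd drop-in
/// file and then reboots. The real project renders a Tera template instead.
pub fn render_cloud_config(ssh_port: u16) -> String {
    let lines = [
        "#cloud-config".to_string(),
        "write_files:".to_string(),
        // Drop-in path named in this decision record.
        "  - path: /etc/ssh/sshd_config.d/99-custom-port.conf".to_string(),
        // Assumed file mode; not taken from the project.
        "    permissions: '0644'".to_string(),
        "    content: |".to_string(),
        // `Port` is the sshd_config directive that selects the listen port.
        format!("      Port {}", ssh_port),
        "runcmd:".to_string(),
        "  - reboot".to_string(),
    ];
    lines.join("\n") + "\n"
}
```

Calling `render_cloud_config(2222)` yields a document whose drop-in file contains the single line `Port 2222`, followed by the reboot step.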

Additionally, we made two critical fixes to the provision handler:

1. **Use configured SSH port**: Changed `wait_for_readiness()` to use `SocketAddr::new(ip, ssh_port)` instead of `SshConfig::with_default_port()`, ensuring the provision handler waits for SSH on the correct custom port (not port 22).

2. **Increase SSH connectivity timeout**: Raised `DEFAULT_MAX_RETRY_ATTEMPTS` from 30 to 60 attempts (120 seconds total), accounting for the ~70-80 second cloud-init completion time plus reboot time.
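
The two handler fixes above can be sketched as follows. This is a minimal illustration, not the project's code: `wait_for_port`, `total_timeout`, and `RETRY_INTERVAL` are hypothetical names, the 2-second interval is taken from the commit message, and a plain TCP connect stands in for the real SSH connectivity check.

```rust
use std::net::{IpAddr, SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Retry budget from this decision: 60 attempts.
pub const DEFAULT_MAX_RETRY_ATTEMPTS: u32 = 60;
/// Assumed name for the 2-second interval between attempts.
pub const RETRY_INTERVAL: Duration = Duration::from_secs(2);

/// Worst-case wait implied by the retry budget (attempts x interval).
pub fn total_timeout(max_attempts: u32, interval: Duration) -> Duration {
    interval * max_attempts
}

/// Hypothetical helper: polls `ip:ssh_port` until a TCP connection succeeds
/// or the attempts run out. Returns the 1-based attempt that succeeded.
pub fn wait_for_port(
    ip: IpAddr,
    ssh_port: u16,
    max_attempts: u32,
    interval: Duration,
) -> Option<u32> {
    // Build the address from the configured port, not a hardcoded 22.
    let addr = SocketAddr::new(ip, ssh_port);
    for attempt in 1..=max_attempts {
        let started = Instant::now();
        if TcpStream::connect_timeout(&addr, interval).is_ok() {
            return Some(attempt);
        }
        // Sleep only for what is left of this attempt's interval.
        sleep(interval.saturating_sub(started.elapsed()));
    }
    None
}
```

With the defaults, `total_timeout(DEFAULT_MAX_RETRY_ATTEMPTS, RETRY_INTERVAL)` is 120 seconds, which leaves roughly 40 to 50 seconds for the reboot on top of the ~70-80 second cloud-init time quoted above.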

## Consequences

### Positive

- **Clean SSH restart**: Reboot guarantees SSH only listens on the custom port, no lingering processes on port 22
- **Industry best practice**: Follows Hetzner's documented cloud-config pattern for SSH port changes
- **Simple and reliable**: Single `reboot` command is simpler than managing service lifecycle manually
- **Correct architecture**: Infrastructure configuration happens during infrastructure provisioning
- **No special cases**: Ansible can connect normally using the configured port without overrides or workarounds
- **Correct readiness check**: The provision handler waits for the configured port rather than the default port 22, preventing connection failures

### Negative

- **Slower provisioning**: Reboot adds ~10-20 seconds to VM initialization time
- **Additional wait time**: Provision handler must wait longer (120s instead of 60s) for cloud-init and reboot to complete
- **Complexity**: Three separate changes required (cloud-init template, provision handler port usage, timeout increase)

### Risks

- **Reboot timing**: If reboot takes longer than expected, SSH connectivity check might timeout (mitigated by 120-second timeout)
- **Cloud-init failure**: If reboot fails or cloud-init has errors, the provision will fail (acceptable - we want to catch infrastructure issues early)

## Alternatives Considered

### Alternative 1: Ansible Playbook in Configure Phase

**Approach**: Use an Ansible playbook during the `configure` phase to reconfigure SSH port after provisioning.

**Why Rejected**:

- **Timing problem**: `WaitForCloudInitStep` in provision already fails before reaching configure phase
- **Architectural mismatch**: SSH port is infrastructure config, should be set during VM initialization
- **Added complexity**: Requires special connection handling (connect on 22, reconfigure, reconnect on custom port)
- **More failure points**: Port transition adds potential for connection issues

### Alternative 2: systemctl restart Without Reboot

**Approach**: Use cloud-init `runcmd` to execute `systemctl restart ssh` without full system reboot.

**Why Rejected**:

- **Doesn't kill old process**: `systemctl restart` doesn't terminate the existing SSH daemon when port changes
- **Dual port listening**: Results in SSH listening on both port 22 (old) and custom port (new)
- **Testing showed failure**: Multiple test attempts confirmed SSH remained on port 22 after cloud-init "completion"
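
The dual-listening symptom described above could be detected with a simple TCP probe of both ports. The sketch below is one possible way to check it, not the test procedure actually used: `SshPortState`, `is_listening`, `classify`, and `probe` are hypothetical names, and an open TCP port is assumed to mean that sshd is listening there.

```rust
use std::net::{IpAddr, SocketAddr, TcpStream};
use std::time::Duration;

/// Possible outcomes when probing the default and the custom SSH port.
#[derive(Debug, PartialEq, Eq)]
pub enum SshPortState {
    /// Desired end state: only the custom port accepts connections.
    CustomOnly,
    /// The failure seen with `systemctl restart`: both ports accept connections.
    DualListening,
    /// The port change did not take effect at all.
    DefaultOnly,
    /// Neither port is reachable (for example, mid-reboot).
    Unreachable,
}

/// Returns true if a TCP connection to `ip:port` succeeds within `timeout`.
pub fn is_listening(ip: IpAddr, port: u16, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&SocketAddr::new(ip, port), timeout).is_ok()
}

/// Maps the two probe results onto a state.
pub fn classify(default_open: bool, custom_open: bool) -> SshPortState {
    match (default_open, custom_open) {
        (false, true) => SshPortState::CustomOnly,
        (true, true) => SshPortState::DualListening,
        (true, false) => SshPortState::DefaultOnly,
        (false, false) => SshPortState::Unreachable,
    }
}

/// Probes both ports on `ip` and classifies the result.
pub fn probe(ip: IpAddr, default_port: u16, custom_port: u16, timeout: Duration) -> SshPortState {
    classify(
        is_listening(ip, default_port, timeout),
        is_listening(ip, custom_port, timeout),
    )
}
```

A check such as `probe(ip, 22, custom_port, timeout) == SshPortState::CustomOnly` would express the post-reboot expectation stated in this record.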

### Alternative 3: pkill + systemctl start

**Approach**: Kill SSH processes with `pkill -9 sshd`, then start fresh with `systemctl start ssh`.

**Why Rejected**:

- **Non-standard**: Violates best practices for service management
- **Brittle**: Process killing is less reliable than clean reboot
- **Not industry pattern**: No documentation or precedent for this approach

### Alternative 4: Wait for Port 22, Then Handle Port Change

**Approach**: Keep provision handler waiting for port 22, handle port transition separately.

**Why Rejected**:

- **Wrong abstraction**: Provision handler should use the configured port, not hardcode defaults
- **Added complexity**: Would require special logic to detect port changes mid-provision
- **Race conditions**: SSH might move to custom port at unpredictable times during cloud-init

## Related Decisions

- [Register Command SSH Port Override](./register-ssh-port-override.md) - Relates to SSH port handling in different commands
- [Environment Variable Prefix](./environment-variable-prefix.md) - Relates to configuration management patterns

## References

- [Hetzner Cloud-Config Tutorial](https://community.hetzner.com/tutorials/basic-cloud-config) - Section 5.3 documents the reboot pattern for SSH configuration
- [Cloud-Init Documentation](https://cloudinit.readthedocs.io/en/latest/) - Official cloud-init reference
- [Issue #222: Configure SSH Service Port](../issues/222-configure-ssh-service-port.md) - Original issue specification
@@ -21,19 +21,20 @@ This creates a critical configuration mismatch:
 
 **Result**: The `provision` command fails during the `WaitForCloudInitStep` because Ansible cannot connect to the instance on the configured custom port (the playbook uses the inventory which specifies the custom port, but SSH is still on port 22). The deployment cannot proceed beyond provisioning.
 
-## Solution Implemented: Cloud-Init SSH Port Configuration
+## Solution Implemented: Cloud-Init SSH Port Configuration with Reboot
 
-After analysis, we determined the best solution is to **configure the SSH port via cloud-init during VM initialization**, rather than trying to reconfigure it later in the configure phase. This approach:
+After extensive analysis and testing, we determined the best solution is to **configure the SSH port via cloud-init during VM initialization with a system reboot**, following Hetzner's cloud-config best practices. This approach:
 
 - ✅ Configures SSH port BEFORE any SSH connections are attempted
+- ✅ Ensures clean SSH restart with no lingering processes on port 22
 - ✅ No special connection handling or port overrides needed
 - ✅ Works seamlessly with both `WaitForSSHConnectivityStep` and `WaitForCloudInitStep`
-- ✅ Simpler and more reliable than post-provisioning reconfiguration
+- ✅ Simpler and more reliable than post-provisioning reconfiguration or manual service management
 
 ### Implementation Overview
 
 **Phase**: `provision` (VM initialization)
-**Mechanism**: Cloud-init `write_files` directive
+**Mechanism**: Cloud-init `write_files` + `runcmd` with reboot
@@ -48,18 +49,83 @@ The SSH port is configured by:
 2. **During VM Initialization** (first boot):
 
 - Cloud-init creates `/etc/ssh/sshd_config.d/99-custom-port.conf` with the port setting
-- Cloud-init restarts the SSH service to apply the configuration
+- Cloud-init triggers system reboot via `runcmd: [reboot]`
+- System reboots, SSH service starts cleanly with new configuration
 - This happens BEFORE any Ansible connection attempts
 
 3. **During SSH Connectivity Checks**:
-- Both `WaitForSSHConnectivityStep` and `WaitForCloudInitStep` connect using the configured custom port
-- SSH service is already listening on the correct port - connection succeeds immediately
+- Provision handler's `wait_for_readiness()` uses the configured custom port (not default port 22)
+- SSH connectivity timeout increased to 120 seconds to account for cloud-init + reboot time (~70-80s)
+- Both `WaitForSSHConnectivityStep` and `WaitForCloudInitStep` connect using the custom port
+- SSH service is listening only on the correct port - connection succeeds
+
+### Critical Implementation Details
+
+#### Cloud-Init Reboot Pattern
+
+The cloud-init template uses the **reboot pattern** as documented in [Hetzner's cloud-config tutorial](https://community.hetzner.com/tutorials/basic-cloud-config):
+// Changed from 30 to 60 attempts (120 seconds total)
+// Accounts for cloud-init completion (~70-80s) + reboot time
+pub const DEFAULT_MAX_RETRY_ATTEMPTS: u32 = 60;
+
+```
 
 ### Files Modified
 
 #### Template Files
 
-- **`templates/tofu/common/cloud-init.yml.tera`**: Added `write_files` section to create SSH port configuration file, and `runcmd` to restart SSH service
+- **`templates/tofu/common/cloud-init.yml.tera`**: Added `write_files` section to create SSH port configuration file, and `runcmd: [reboot]` to trigger system reboot for clean SSH restart
 
 #### Infrastructure Layer (DDD)
 
@@ -69,11 +135,17 @@ The SSH port is configured by:
 
 #### Application Layer (DDD)
 
-- **`src/application/command_handlers/provision/handler.rs`**: Updated to pass `ssh_port` from environment to `TofuProjectGenerator`
+- **`src/application/command_handlers/provision/handler.rs`**: Updated to pass `ssh_port` from environment to `TofuProjectGenerator` and to `wait_for_readiness()`, changed to use `SocketAddr::new(ip, ssh_port)` instead of `SshConfig::with_default_port()`
+
+#### Adapters Layer
+
+- **`src/adapters/ssh/config.rs`**: Increased `DEFAULT_MAX_RETRY_ATTEMPTS` from 30 to 60 (120 seconds total timeout) to account for cloud-init completion and reboot time
 
-### Why We Discarded the Ansible-Based Approach
+### Why We Discarded Alternative Approaches
 
-**Initial Plan**: We initially considered using an Ansible playbook in the `configure` phase to reconfigure SSH port after provisioning.
+#### Alternative 1: Ansible-Based Approach (Configure Phase)
+
+**Initial Plan**: Use an Ansible playbook in the `configure` phase to reconfigure SSH port after provisioning.
 
 **Why It Was Discarded**:
 
@@ -95,7 +167,38 @@ The SSH port is configured by:
 - Not during application setup (configure phase)
 - Cloud-init is the proper tool for initial system configuration
 
-**Conclusion**: The cloud-init approach is cleaner, more reliable, and architecturally correct. It configures infrastructure settings during infrastructure provisioning, not as a post-provisioning step.
+#### Alternative 2: systemctl restart Without Reboot
+
+**Approach**: Use cloud-init `runcmd` to execute `systemctl restart ssh` without full system reboot.
+
+**Why It Was Discarded**:
+
+- Testing revealed `systemctl restart ssh` doesn't kill the old SSH process when port changes
+- Results in SSH listening on **both** port 22 (old PID) and custom port (new PID)
+- Cloud-init runcmd execution often completes without SSH actually restarting
+- Multiple test attempts confirmed SSH remained on port 22 after cloud-init reported "completion"
+
+#### Alternative 3: bootcmd disable + runcmd restart
+
+**Approach**: Use `bootcmd` to disable SSH before it auto-starts, then use `runcmd` to restart it with new config.
+
+**Why It Was Discarded**:
+
+- systemd automatically re-enables and starts SSH approximately 3 seconds after bootcmd disables it
+- Testing showed SSH started at 19:19:51 despite bootcmd completing at 19:19:48
+- systemd service management overrides cloud-init's bootcmd attempts
+
+#### Alternative 4: pkill + systemctl start
+
+**Approach**: Kill SSH processes with `pkill -9 sshd`, then start fresh with `systemctl start ssh`.
+
+**Why It Was Discarded**:
+
+- Non-standard approach, violates best practices for service management
+- More brittle than clean system reboot
+- No industry precedent or documentation for this pattern
+
+**Conclusion**: The cloud-init with reboot approach is the cleanest, most reliable, and follows industry best practices (Hetzner). It configures infrastructure settings during infrastructure provisioning with a guaranteed clean service restart.
 
 ## Acceptance Criteria
 
@@ -132,22 +235,43 @@ The SSH port is configured by:
 
 ### Cloud-Init Configuration Format
 
-The cloud-init template uses `write_files` to create a drop-in configuration file:
+The cloud-init template uses `write_files` + `reboot` pattern following Hetzner best practices: