Skip to content

Commit b3ab657

Browse files
fix(agent): systematic failure recovery for agent tunnel enrollment (#1802)
## Summary Agent tunnel enrollment spans two non-transactional write phases — the agent's `up` (Rust: client cert/key + the fixed-name `gateway-ca.pem` + `agent.json`) and the MSI custom action (advertise subnets/domains + rollback bookkeeping). A failure in either phase left the machine in a partial state: orphaned cert files, a clobbered `gateway-ca.pem`, or a half-written `agent.json` that even a rollback couldn't recover from. This makes the whole enrollment **recoverable end to end** — every failure path leaves the machine exactly as it was before enroll. ## Agent (Rust) - `persist_enrollment_response` is now transactional: load/validate the config **before** any write (a corrupt `agent.json` fails before touching disk), back up the fixed-name `gateway-ca.pem`, and roll back partial cert/CA writes on any failure. - `save_config` creates its parent directory (fixes fresh standalone `agent.exe up` on a clean machine) and writes **atomically** (temp + rename) so a mid-write failure never truncates `agent.json`. ## Installer (C#) - `EnrollAgentTunnel` snapshots the pre-enrollment `Tunnel` section and `gateway-ca.pem` into a per-install **rollback marker** (`%TEMP%\{installId}-tunnel-rollback.json`), written atomically. The marker is **required**: if it can't be recorded, the enrollment is undone inline and the CA fails. - New marker-driven `RollbackEnrollAgentTunnel` (`Execute.rollback`) only cleans up / restores when *this* install recorded a marker — so it never touches pre-existing or partial state. It restores the original `Tunnel` section and `gateway-ca.pem`, and deletes the certs this install wrote. - All `agent.json` writes go through an atomic temp-replace helper, so the rollback can always re-parse it. - Drains stdout/stderr concurrently with `WaitForExit` (fixes a pipe-buffer deadlock that could kill a healthy `up`) and fails loudly when operator-supplied advertise subnets/domains can't be persisted (instead of silently dropping them). ## Failure-recovery matrix | Failure point | Recovery | |---|---| | Empty / pre-write failure | No marker → rollback no-op; nothing was written | | `up` fails mid-write (bad json / save error) | Rust self-rolls-back its partial writes → non-zero exit → no marker | | `up` ok, marker write fails | Inline cleanup + CA fails | | `up` ok, advertisements write fails | Marker present → rollback restores | | Later MSI action fails after success | Marker present → rollback restores | ## Test - `cargo check -p devolutions-agent` — clean. - `dotnet build DevolutionsAgent.csproj -c Debug` — 0 errors (8 pre-existing WiX CNDL warnings), MSI builds. Changelog: ignore
1 parent 5abe183 commit b3ab657

4 files changed

Lines changed: 617 additions & 70 deletions

File tree

devolutions-agent/src/config.rs

Lines changed: 19 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -237,7 +237,25 @@ impl ConfHandle {
237237
pub fn save_config(conf: &dto::ConfFile) -> anyhow::Result<()> {
238238
let conf_file_path = get_conf_file_path();
239239
let json = serde_json::to_string_pretty(conf).context("failed JSON serialization of configuration")?;
240-
std::fs::write(&conf_file_path, json).with_context(|| format!("failed to write file at {conf_file_path}"))?;
240+
241+
// Ensure the parent directory exists — a fresh machine (standalone `agent.exe up`, no MSI) may
242+
// not have the data directory yet, and a bare write would fail before any config is created.
243+
if let Some(parent) = conf_file_path.parent().filter(|parent| !parent.as_str().is_empty()) {
244+
std::fs::create_dir_all(parent)
245+
.with_context(|| format!("failed to create configuration directory: {parent}"))?;
246+
}
247+
248+
// Write atomically: serialize to a sibling temp file, then rename over the target. A failure
249+
// mid-write therefore never truncates or corrupts an existing agent.json — the original stays
250+
// intact until the rename atomically replaces it (std::fs::rename replaces on Windows too).
251+
let tmp_path = Utf8PathBuf::from(format!("{conf_file_path}.tmp"));
252+
std::fs::write(&tmp_path, json).with_context(|| format!("failed to write temporary config at {tmp_path}"))?;
253+
if let Err(e) = std::fs::rename(&tmp_path, &conf_file_path) {
254+
// Rename failed: don't leave the temp file lingering next to agent.json.
255+
let _ = std::fs::remove_file(&tmp_path);
256+
return Err(anyhow::Error::new(e).context(format!("failed to replace config at {conf_file_path}")));
257+
}
258+
241259
Ok(())
242260
}
243261

devolutions-agent/src/enrollment.rs

Lines changed: 86 additions & 39 deletions
Original file line numberDiff line numberDiff line change
@@ -181,59 +181,106 @@ fn persist_enrollment_response(
181181
.unwrap_or_else(|| Utf8PathBuf::from("."));
182182
let cert_dir = config_dir.join("certs");
183183

184+
// Load (and validate) the existing config BEFORE writing anything to disk. A corrupt
185+
// agent.json must fail the enrollment here — while nothing has been written yet — rather than
186+
// after we've already overwritten the fixed-name gateway-ca.pem and dropped the new cert/key.
187+
// This preserves other settings (Updater, Session, PEDM, etc.) configured by the MSI installer
188+
// or admin.
189+
let mut conf_file = config::load_conf_file_or_generate_new().context("failed to load existing configuration")?;
190+
184191
std::fs::create_dir_all(&cert_dir)
185192
.with_context(|| format!("failed to create certificate directory: {}", cert_dir))?;
186193

187194
let client_cert_path = cert_dir.join(format!("{agent_id}-cert.pem"));
188195
let client_key_path = cert_dir.join(format!("{agent_id}-key.pem"));
189196
let gateway_ca_path = cert_dir.join("gateway-ca.pem");
190197

191-
// Write the locally-generated private key first (before cert/CA from the network).
192-
std::fs::write(&client_key_path, key_pem)
193-
.with_context(|| format!("failed to write client private key: {client_key_path}"))?;
198+
// gateway-ca.pem is a fixed filename we are about to overwrite. Back up any existing copy so a
199+
// later failure can restore it. The {agent_id}-prefixed cert/key files are uniquely named, so
200+
// a rollback just removes them.
201+
let gateway_ca_backup = if gateway_ca_path.exists() {
202+
Some(
203+
std::fs::read(&gateway_ca_path)
204+
.with_context(|| format!("failed to back up existing gateway CA certificate: {gateway_ca_path}"))?,
205+
)
206+
} else {
207+
None
208+
};
194209

195-
std::fs::write(&client_cert_path, &client_cert_pem)
196-
.with_context(|| format!("failed to write client certificate: {client_cert_path}"))?;
210+
// All the destructive writes happen inside this single fallible step. `up` is otherwise
211+
// non-transactional: if it wrote the cert files (and clobbered gateway-ca.pem) and then failed
212+
// at config save, a non-zero exit would leave partial, orphaned state behind. On any failure we
213+
// roll that partial state back below so the machine is left exactly as it was before enroll.
214+
let persist = || -> Result<()> {
215+
// Write the locally-generated private key first (before cert/CA from the network).
216+
std::fs::write(&client_key_path, key_pem)
217+
.with_context(|| format!("failed to write client private key: {client_key_path}"))?;
218+
219+
std::fs::write(&client_cert_path, &client_cert_pem)
220+
.with_context(|| format!("failed to write client certificate: {client_cert_path}"))?;
221+
222+
std::fs::write(&gateway_ca_path, &gateway_ca_cert_pem)
223+
.with_context(|| format!("failed to write gateway CA certificate: {gateway_ca_path}"))?;
224+
225+
// Restrict permissions on cert/key files (owner-only on Unix).
226+
#[cfg(unix)]
227+
{
228+
use std::os::unix::fs::PermissionsExt as _;
229+
let restricted = std::fs::Permissions::from_mode(0o600);
230+
for path in [&client_cert_path, &client_key_path, &gateway_ca_path] {
231+
std::fs::set_permissions(path, restricted.clone())
232+
.with_context(|| format!("failed to set permissions on {path}"))?;
233+
}
234+
}
197235

198-
std::fs::write(&gateway_ca_path, &gateway_ca_cert_pem)
199-
.with_context(|| format!("failed to write gateway CA certificate: {gateway_ca_path}"))?;
236+
// Preserve existing domain config from previous enrollment/manual configuration.
237+
let existing_tunnel = conf_file.tunnel.as_ref();
200238

201-
// Restrict permissions on cert/key files (owner-only on Unix).
202-
#[cfg(unix)]
203-
{
204-
use std::os::unix::fs::PermissionsExt as _;
205-
let restricted = std::fs::Permissions::from_mode(0o600);
206-
for path in [&client_cert_path, &client_key_path, &gateway_ca_path] {
207-
std::fs::set_permissions(path, restricted.clone())
208-
.with_context(|| format!("failed to set permissions on {path}"))?;
209-
}
210-
}
239+
let tunnel_conf = config::dto::TunnelConf {
240+
enabled: true,
241+
gateway_endpoint: quic_endpoint.clone(),
242+
client_cert_path: Some(client_cert_path.clone()),
243+
client_key_path: Some(client_key_path.clone()),
244+
gateway_ca_cert_path: Some(gateway_ca_path.clone()),
245+
advertise_subnets,
246+
advertise_domains: existing_tunnel.map(|t| t.advertise_domains.clone()).unwrap_or_default(),
247+
auto_detect_domain: existing_tunnel.map(|t| t.auto_detect_domain).unwrap_or(true),
248+
heartbeat_interval_secs: Some(60),
249+
route_advertise_interval_secs: Some(30),
250+
server_spki_sha256: Some(server_spki_sha256),
251+
};
211252

212-
// Load existing config and update only the Tunnel section.
213-
// This preserves other settings (Updater, Session, PEDM, etc.) that may have been
214-
// configured by the MSI installer or admin.
215-
let mut conf_file = config::load_conf_file_or_generate_new().context("failed to load existing configuration")?;
253+
conf_file.tunnel = Some(tunnel_conf);
216254

217-
// Preserve existing domain config from previous enrollment/manual configuration.
218-
let existing_tunnel = conf_file.tunnel.as_ref();
219-
220-
let tunnel_conf = config::dto::TunnelConf {
221-
enabled: true,
222-
gateway_endpoint: quic_endpoint.clone(),
223-
client_cert_path: Some(client_cert_path.clone()),
224-
client_key_path: Some(client_key_path.clone()),
225-
gateway_ca_cert_path: Some(gateway_ca_path.clone()),
226-
advertise_subnets,
227-
advertise_domains: existing_tunnel.map(|t| t.advertise_domains.clone()).unwrap_or_default(),
228-
auto_detect_domain: existing_tunnel.map(|t| t.auto_detect_domain).unwrap_or(true),
229-
heartbeat_interval_secs: Some(60),
230-
route_advertise_interval_secs: Some(30),
231-
server_spki_sha256: Some(server_spki_sha256),
232-
};
255+
config::save_config(&conf_file)?;
233256

234-
conf_file.tunnel = Some(tunnel_conf);
257+
Ok(())
258+
};
235259

236-
config::save_config(&conf_file)?;
260+
if let Err(error) = persist() {
261+
// Roll back the partial writes: remove the uniquely-named cert/key, and restore (or remove)
262+
// the overwritten fixed-name gateway-ca.pem so a failed enroll never destroys or orphans state.
263+
let _ = std::fs::remove_file(&client_key_path);
264+
let _ = std::fs::remove_file(&client_cert_path);
265+
match &gateway_ca_backup {
266+
Some(original) => {
267+
// Atomic restore: write to a sibling temp then rename over the target so a failed
268+
// restore mid-write can't truncate the previous CA. Best-effort like the rest of
269+
// this rollback block; clean up the temp if either step fails.
270+
let tmp = Utf8PathBuf::from(format!("{gateway_ca_path}.tmp"));
271+
if std::fs::write(&tmp, original)
272+
.and_then(|_| std::fs::rename(&tmp, &gateway_ca_path))
273+
.is_err()
274+
{
275+
let _ = std::fs::remove_file(&tmp);
276+
}
277+
}
278+
None => {
279+
let _ = std::fs::remove_file(&gateway_ca_path);
280+
}
281+
}
282+
return Err(error);
283+
}
237284

238285
Ok(PersistedEnrollment {
239286
agent_id,

package/AgentWindowsManaged/Actions/AgentActions.cs

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -298,9 +298,38 @@ internal static class AgentActions
298298
AgentProperties.AgentTunnelAdvertiseSubnets,
299299
AgentProperties.AgentTunnelAdvertiseDomains,
300300
AgentProperties.InstallDir,
301+
// installId is a typed WixProperty<Guid>, so reference its string Id; the rollback CA
302+
// reads it to locate the per-install tunnel rollback marker.
303+
AgentProperties.installId.Id,
301304
}.Select(p => $"{p}=[{p}]")),
302305
};
303306

307+
/// <summary>
308+
/// Undo a successful agent tunnel enrollment if a later install action triggers an MSI rollback.
309+
/// </summary>
310+
/// <remarks>
311+
/// The enrollment writes cert/key files and a Tunnel section into agent.json that aren't
312+
/// MSI-managed components, so MSI's own rollback leaves them orphaned. The condition mirrors
313+
/// <see cref="enrollAgentTunnel"/> exactly (the feature being installed) so the rollback covers
314+
/// every path the forward action can run — including adding the feature during a maintenance or
315+
/// upgrade transition, not just first install. The rollback is marker-driven: it only deletes
316+
/// and restores when EnrollAgentTunnel recorded a per-install marker, so it never touches a
317+
/// pre-existing Tunnel section or certs left by a failed/partial enrollment.
318+
/// </remarks>
319+
private static readonly ElevatedManagedAction rollbackEnrollAgentTunnel = new(
320+
new Id($"CA.{nameof(rollbackEnrollAgentTunnel)}"),
321+
CustomActions.RollbackEnrollAgentTunnel,
322+
Return.ignore,
323+
When.Before, new Step(enrollAgentTunnel.Id),
324+
Features.AGENT_TUNNEL_FEATURE.BeingInstall(),
325+
Sequence.InstallExecuteSequence)
326+
{
327+
Execute = Execute.rollback,
328+
Impersonate = false,
329+
// installId locates the per-install tunnel rollback marker EnrollAgentTunnel writes.
330+
UsesProperties = UseProperties(new[] { AgentProperties.installId }),
331+
};
332+
304333
private static readonly ElevatedManagedAction registerExplorerCommand = new(
305334
CustomActions.RegisterExplorerCommand
306335
)
@@ -375,6 +404,7 @@ private static string UseProperties(IEnumerable<IWixProperty> properties)
375404
setFeaturesToConfigure,
376405
configureFeatures,
377406
enrollAgentTunnel,
407+
rollbackEnrollAgentTunnel,
378408
createProgramDataDirectory,
379409
setProgramDataDirectoryPermissions,
380410
createProgramDataPedmDirectories,

0 commit comments

Comments
 (0)