Skip to content

Commit d38f8f5

Browse files
committed
fix(kvm-clock): do not jump monotonic clock on restore
Firecracker has never advanced the clock on restore for any of the supported clocksources. Since Linux 5.16, the KVM_CLOCK_REALTIME has been passed to kvm-clock, causing the monotonic time in the guest to jump when using kvm-clock as clock source. Despite being unexpected and not what Firecracker should do, we recognize this may be a valid usecase so this patch adds a way to configure it, keeping the default to the expected documented behaviour. This patch adds a new API flag to LoadSnapshot, clock_realtime, that advances the clock on restore when set (default is False). Rather than the clock flags being decided at snapshot time, the restore path ignores those flags and decides what to do depending on the clock_realtime flag. This is because the other available flag (KVM_CLOCK_TSC_STABLE) cannot even be passed to `set_clock`, meaning the only valid flag is KVM_CLOCK_REALTIME. The name of the flag was kept generic as we may add this behaviour for the other clock sources in the future, if the need arises. Signed-off-by: Riccardo Mancini <mancio@amazon.com>
1 parent 7a68325 commit d38f8f5

File tree

13 files changed

+226
-19
lines changed

13 files changed

+226
-19
lines changed

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -50,6 +50,12 @@ and this project adheres to
5050
balloon statistics descriptor length to prevent a guest-controlled oversized
5151
descriptor from temporarily stalling the VMM event loop. Only affects microVMs
5252
with `stats_polling_interval_s > 0`.
53+
- [#5809](https://github.com/firecracker-microvm/firecracker/pull/5809): Fixed a
54+
bug on host Linux >= 5.16 for x86_64 guests using the `kvm-clock` clock source
55+
causing the monotonic clock to jump on restore by the wall-clock time elapsed
56+
since the snapshot was taken. Users using `kvm-clock` that want to explicitly
57+
advance the clock with `KVM_CLOCK_REALTIME` can opt back in using the new
58+
`clock_realtime` flag in `LoadSnapshot` API.
5359

5460
## [1.15.0]
5561

docs/snapshotting/snapshot-support.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -493,6 +493,11 @@ resumed with the guest OS wall-clock continuing from the moment of the snapshot
493493
creation. For this reason, the wall-clock should be updated to the current time,
494494
on the guest-side. More details on how you could do this can be found at a
495495
[related FAQ](../../FAQ.md#my-guest-wall-clock-is-drifting-how-can-i-fix-it).
496+
When using `kvm-clock` as clock source on `x86_64`, it's possible to optionally
497+
set the `clock_realtime: true` in the `LoadSnapshot` request to advance the
498+
clock on the guest at restore time (host Linux >= 5.16 is required to support
499+
this feature). Note that this may cause issues within the guest as the clock
500+
will appear to suddenly jump.
496501
497502
## Provisioning host disk space for snapshots
498503

src/firecracker/src/api_server/request/snapshot.rs

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -111,6 +111,7 @@ fn parse_put_snapshot_load(body: &Body) -> Result<ParsedRequest, RequestError> {
111111
resume_vm: snapshot_config.resume_vm,
112112
network_overrides: snapshot_config.network_overrides,
113113
vsock_override: snapshot_config.vsock_override,
114+
clock_realtime: snapshot_config.clock_realtime,
114115
};
115116

116117
// Construct the `ParsedRequest` object.
@@ -189,6 +190,7 @@ mod tests {
189190
resume_vm: false,
190191
network_overrides: vec![],
191192
vsock_override: None,
193+
clock_realtime: false,
192194
};
193195
let mut parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap();
194196
assert!(
@@ -220,6 +222,7 @@ mod tests {
220222
resume_vm: false,
221223
network_overrides: vec![],
222224
vsock_override: None,
225+
clock_realtime: false,
223226
};
224227
let mut parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap();
225228
assert!(
@@ -251,6 +254,7 @@ mod tests {
251254
resume_vm: true,
252255
network_overrides: vec![],
253256
vsock_override: None,
257+
clock_realtime: false,
254258
};
255259
let mut parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap();
256260
assert!(
@@ -291,6 +295,7 @@ mod tests {
291295
host_dev_name: String::from("vmtap2"),
292296
}],
293297
vsock_override: None,
298+
clock_realtime: false,
294299
};
295300
let mut parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap();
296301
assert!(
@@ -319,6 +324,7 @@ mod tests {
319324
resume_vm: true,
320325
network_overrides: vec![],
321326
vsock_override: None,
327+
clock_realtime: false,
322328
};
323329
let parsed_request = parse_put_snapshot(&Body::new(body), Some("load")).unwrap();
324330
assert_eq!(

src/firecracker/swagger/firecracker.yaml

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1650,6 +1650,14 @@ definitions:
16501650
for restoring a snapshot with a different socket path than the one used
16511651
when the snapshot was created. For example, when the original socket path
16521652
is no longer available or when deploying to a different environment.
1653+
clock_realtime:
1654+
type: boolean
1655+
description:
1656+
"[x86_64 only] When set to true, passes KVM_CLOCK_REALTIME to
1657+
KVM_SET_CLOCK on restore, advancing kvmclock by the wall-clock time
1658+
elapsed since the snapshot was taken. When false (default), kvmclock resumes
1659+
from where it was at snapshot time. This option may be extended to other clock
1660+
sources and CPU architectures in the future."
16531661

16541662

16551663
TokenBucket:

src/vmm/src/arch/x86_64/vm.rs

Lines changed: 68 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ use std::fmt;
55
use std::sync::{Arc, Mutex};
66

77
use kvm_bindings::{
8-
KVM_CLOCK_TSC_STABLE, KVM_IRQCHIP_IOAPIC, KVM_IRQCHIP_PIC_MASTER, KVM_IRQCHIP_PIC_SLAVE,
8+
KVM_CLOCK_REALTIME, KVM_IRQCHIP_IOAPIC, KVM_IRQCHIP_PIC_MASTER, KVM_IRQCHIP_PIC_SLAVE,
99
KVM_PIT_SPEAKER_DUMMY, MsrList, kvm_clock_data, kvm_irqchip, kvm_pit_config, kvm_pit_state2,
1010
};
1111
use kvm_ioctls::Cap;
@@ -30,6 +30,8 @@ pub enum ArchVmError {
3030
SetPit2(kvm_ioctls::Error),
3131
/// Set clock error: {0}
3232
SetClock(kvm_ioctls::Error),
33+
/// clock_realtime requested but not present in the snapshot state
34+
ClockRealtimeNotInState,
3335
/// Set IrqChipPicMaster error: {0}
3436
SetIrqChipPicMaster(kvm_ioctls::Error),
3537
/// Set IrqChipPicSlave error: {0}
@@ -127,13 +129,25 @@ impl ArchVm {
127129
/// - [`kvm_ioctls::VmFd::set_irqchip`] errors.
128130
/// - [`kvm_ioctls::VmFd::set_irqchip`] errors.
129131
/// - [`kvm_ioctls::VmFd::set_irqchip`] errors.
130-
pub fn restore_state(&mut self, state: &VmState) -> Result<(), ArchVmError> {
132+
pub fn restore_state(
133+
&mut self,
134+
state: &VmState,
135+
clock_realtime: bool,
136+
) -> Result<(), ArchVmError> {
131137
self.fd()
132138
.set_pit2(&state.pitstate)
133139
.map_err(ArchVmError::SetPit2)?;
134-
self.fd()
135-
.set_clock(&state.clock)
136-
.map_err(ArchVmError::SetClock)?;
140+
let mut clock = state.clock;
141+
clock.flags = if clock_realtime {
142+
// clock_realtime needs to be present in the snapshot
143+
if clock.flags & KVM_CLOCK_REALTIME == 0 {
144+
return Err(ArchVmError::ClockRealtimeNotInState);
145+
}
146+
KVM_CLOCK_REALTIME
147+
} else {
148+
0
149+
};
150+
self.fd().set_clock(&clock).map_err(ArchVmError::SetClock)?;
137151
self.fd()
138152
.set_irqchip(&state.pic_master)
139153
.map_err(ArchVmError::SetIrqChipPicMaster)?;
@@ -167,9 +181,7 @@ impl ArchVm {
167181
pub fn save_state(&self) -> Result<VmState, ArchVmError> {
168182
let pitstate = self.fd().get_pit2().map_err(ArchVmError::VmGetPit2)?;
169183

170-
let mut clock = self.fd().get_clock().map_err(ArchVmError::VmGetClock)?;
171-
// This bit is not accepted in SET_CLOCK, clear it.
172-
clock.flags &= !KVM_CLOCK_TSC_STABLE;
184+
let clock = self.fd().get_clock().map_err(ArchVmError::VmGetClock)?;
173185

174186
let mut pic_master = kvm_irqchip {
175187
chip_id: KVM_IRQCHIP_PIC_MASTER,
@@ -248,10 +260,13 @@ impl fmt::Debug for VmState {
248260
#[cfg(test)]
249261
mod tests {
250262
use kvm_bindings::{
251-
KVM_CLOCK_TSC_STABLE, KVM_IRQCHIP_IOAPIC, KVM_IRQCHIP_PIC_MASTER, KVM_IRQCHIP_PIC_SLAVE,
263+
KVM_CLOCK_REALTIME, KVM_IRQCHIP_IOAPIC, KVM_IRQCHIP_PIC_MASTER, KVM_IRQCHIP_PIC_SLAVE,
252264
KVM_PIT_SPEAKER_DUMMY,
253265
};
266+
use kvm_ioctls::Cap;
267+
use std::time::SystemTime;
254268

269+
use crate::arch::ArchVmError;
255270
use crate::vstate::vm::VmState;
256271
use crate::vstate::vm::tests::{setup_vm, setup_vm_with_memory};
257272

@@ -270,15 +285,53 @@ mod tests {
270285
vm_state.pitstate.flags | KVM_PIT_SPEAKER_DUMMY,
271286
KVM_PIT_SPEAKER_DUMMY
272287
);
273-
assert_eq!(vm_state.clock.flags & KVM_CLOCK_TSC_STABLE, 0);
274288
assert_eq!(vm_state.pic_master.chip_id, KVM_IRQCHIP_PIC_MASTER);
275289
assert_eq!(vm_state.pic_slave.chip_id, KVM_IRQCHIP_PIC_SLAVE);
276290
assert_eq!(vm_state.ioapic.chip_id, KVM_IRQCHIP_IOAPIC);
277291

278292
let (_, mut vm) = setup_vm_with_memory(0x1000);
279293
vm.setup_irqchip().unwrap();
280294

281-
vm.restore_state(&vm_state).unwrap();
295+
vm.restore_state(&vm_state, false).unwrap();
296+
}
297+
298+
#[cfg(target_arch = "x86_64")]
299+
#[test]
300+
fn test_vm_save_restore_state_kvm_clock_realtime() {
301+
let (kvm, vm) = setup_vm_with_memory(0x1000);
302+
vm.setup_irqchip().unwrap();
303+
304+
let clock_realtime_supported =
305+
kvm.fd.check_extension_int(Cap::AdjustClock).cast_unsigned() & KVM_CLOCK_REALTIME != 0;
306+
307+
// mock a state without realtime information
308+
let mut vm_state = vm.save_state().unwrap();
309+
vm_state.clock.flags &= !KVM_CLOCK_REALTIME;
310+
311+
let (_, mut vm) = setup_vm_with_memory(0x1000);
312+
vm.setup_irqchip().unwrap();
313+
314+
let res = vm.restore_state(&vm_state, true);
315+
assert!(res == Err(ArchVmError::ClockRealtimeNotInState));
316+
317+
// mock a state with realtime information
318+
vm_state.clock.flags |= KVM_CLOCK_REALTIME;
319+
vm_state.clock.realtime = SystemTime::now()
320+
.duration_since(SystemTime::UNIX_EPOCH)
321+
.unwrap()
322+
.as_nanos()
323+
.try_into()
324+
.unwrap();
325+
326+
let (_, mut vm) = setup_vm_with_memory(0x1000);
327+
vm.setup_irqchip().unwrap();
328+
329+
let res = vm.restore_state(&vm_state, true);
330+
if clock_realtime_supported {
331+
res.unwrap()
332+
} else {
333+
assert!(matches!(res, Err(ArchVmError::SetClock(err)) if err.errno() == libc::EINVAL))
334+
}
282335
}
283336

284337
#[cfg(target_arch = "x86_64")]
@@ -296,18 +349,18 @@ mod tests {
296349
// Try to restore an invalid PIC Master chip ID
297350
let orig_master_chip_id = vm_state.pic_master.chip_id;
298351
vm_state.pic_master.chip_id = KVM_NR_IRQCHIPS;
299-
vm.restore_state(&vm_state).unwrap_err();
352+
vm.restore_state(&vm_state, false).unwrap_err();
300353
vm_state.pic_master.chip_id = orig_master_chip_id;
301354

302355
// Try to restore an invalid PIC Slave chip ID
303356
let orig_slave_chip_id = vm_state.pic_slave.chip_id;
304357
vm_state.pic_slave.chip_id = KVM_NR_IRQCHIPS;
305-
vm.restore_state(&vm_state).unwrap_err();
358+
vm.restore_state(&vm_state, false).unwrap_err();
306359
vm_state.pic_slave.chip_id = orig_slave_chip_id;
307360

308361
// Try to restore an invalid IOPIC chip ID
309362
vm_state.ioapic.chip_id = KVM_NR_IRQCHIPS;
310-
vm.restore_state(&vm_state).unwrap_err();
363+
vm.restore_state(&vm_state, false).unwrap_err();
311364
}
312365

313366
#[cfg(target_arch = "x86_64")]
@@ -321,6 +374,6 @@ mod tests {
321374
let serialized_data = bitcode::serialize(&state).unwrap();
322375
let restored_state: VmState = bitcode::deserialize(&serialized_data).unwrap();
323376

324-
vm.restore_state(&restored_state).unwrap();
377+
vm.restore_state(&restored_state, false).unwrap();
325378
}
326379
}

src/vmm/src/builder.rs

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -423,6 +423,8 @@ pub enum BuildMicrovmFromSnapshotError {
423423
SeccompFiltersInternal(#[from] crate::seccomp::InstallationError),
424424
/// Failed to restore devices: {0}
425425
RestoreDevices(#[from] DeviceManagerPersistError),
426+
/// clock_realtime is not supported on aarch64.
427+
UnsupportedClockRealtime,
426428
}
427429

428430
/// Builds and starts a microVM based on the provided MicrovmState.
@@ -438,6 +440,7 @@ pub fn build_microvm_from_snapshot(
438440
uffd: Option<Uffd>,
439441
seccomp_filters: &BpfThreadMap,
440442
vm_resources: &mut VmResources,
443+
clock_realtime: bool,
441444
) -> Result<Arc<Mutex<Vmm>>, BuildMicrovmFromSnapshotError> {
442445
// Build Vmm.
443446
debug!("event_start: build microvm from snapshot");
@@ -479,14 +482,17 @@ pub fn build_microvm_from_snapshot(
479482

480483
#[cfg(target_arch = "aarch64")]
481484
{
485+
if clock_realtime {
486+
return Err(BuildMicrovmFromSnapshotError::UnsupportedClockRealtime);
487+
}
482488
let mpidrs = construct_kvm_mpidrs(&microvm_state.vcpu_states);
483489
// Restore kvm vm state.
484490
vm.restore_state(&mpidrs, &microvm_state.vm_state)?;
485491
}
486492

487493
// Restore kvm vm state.
488494
#[cfg(target_arch = "x86_64")]
489-
vm.restore_state(&microvm_state.vm_state)?;
495+
vm.restore_state(&microvm_state.vm_state, clock_realtime)?;
490496

491497
// Restore the boot source config paths.
492498
vm_resources.boot_source.config = microvm_state.vm_info.boot_source;

src/vmm/src/persist.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -466,6 +466,7 @@ pub fn restore_from_snapshot(
466466
uffd,
467467
seccomp_filters,
468468
vm_resources,
469+
params.clock_realtime,
469470
)
470471
.map_err(RestoreFromSnapshotError::Build)
471472
}

src/vmm/src/rpc_interface.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1286,6 +1286,7 @@ mod tests {
12861286
resume_vm: false,
12871287
network_overrides: vec![],
12881288
vsock_override: None,
1289+
clock_realtime: false,
12891290
},
12901291
)));
12911292
check_unsupported(runtime_request(VmmAction::SetEntropyDevice(

src/vmm/src/vmm_config/snapshot.rs

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -81,6 +81,10 @@ pub struct LoadSnapshotParams {
8181
pub network_overrides: Vec<NetworkOverride>,
8282
/// When set, the vsock backend UDS path will be overridden
8383
pub vsock_override: Option<VsockOverride>,
84+
/// [x86_64 only] When set to true, passes `KVM_CLOCK_REALTIME` to `KVM_SET_CLOCK` on restore,
85+
/// advancing kvmclock by the wall-clock time elapsed since the snapshot was taken. When false
86+
/// (default), kvmclock resumes from where it was at snapshot time.
87+
pub clock_realtime: bool,
8488
}
8589

8690
/// Stores the configuration for loading a snapshot that is provided by the user.
@@ -113,6 +117,9 @@ pub struct LoadSnapshotConfig {
113117
/// Whether or not to override the vsock backend UDS path.
114118
#[serde(skip_serializing_if = "Option::is_none")]
115119
pub vsock_override: Option<VsockOverride>,
120+
/// [x86_64 only] When set to true, passes `KVM_CLOCK_REALTIME` to `KVM_SET_CLOCK` on restore.
121+
#[serde(default)]
122+
pub clock_realtime: bool,
116123
}
117124

118125
/// Stores the configuration used for managing snapshot memory.

src/vmm/src/vstate/vm.rs

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -866,7 +866,7 @@ pub(crate) mod tests {
866866
let serialized_data = bitcode::serialize(&state).unwrap();
867867

868868
let restored_state: VmState = bitcode::deserialize(&serialized_data).unwrap();
869-
vm.restore_state(&restored_state).unwrap();
869+
vm.restore_state(&restored_state, false).unwrap();
870870

871871
let mut resource_allocator = vm.resource_allocator();
872872
let gsi_new = resource_allocator.allocate_gsi_msi(1).unwrap()[0];

0 commit comments

Comments
 (0)