Sync permanent state blob upon GET_STATEBLOB cmd#1122

Open
adrobvz wants to merge 1 commit into stefanberger:master from adrobvz:master

Conversation


@adrobvz adrobvz commented May 7, 2026

We've seen a live migration failure in production, with Win2k25 guest, with the following symptoms:

QEMU dest log:

qemu-kvm: tpm-emulator: Setting the stateblob (type 2) failed with a TPM error 0x800
qemu-kvm: error while loading state for instance 0x0 of device 'tpm-emulator'
qemu-kvm: load of migration failed: Input/output error

swtpm dest log:

main: Initializing TPM at Fri May  1 14:04:10 2026                              
mainLoop:                                                                       
SWTPM_NVRAM_Validate_Dir: Rooted state path /path/to/shared/storage             
SWTPM_NVRAM_Validate_Dir: Rooted state path /path/to/shared/storage             
SWTPM_NVRAM_LoadData: No such file /path/to/shared/storage/tpm2-00.permall                                                                                     
Data client disconnected 

The root cause is not perfectly clear: either the orchestration layer left the blob out of sync, or some kind of storage error occurred when the blob was initially written on the src node. Either way, I suggest an additional sync for the permanent state blob as a potential workaround for such errors, similar to how it's done with volatile state.

ctrlchannel_return_state() reads tpm2-00.permall from disk when handling
GET_STATEBLOB(PERMANENT).  At the same time libtpms holds the live
PERMANENT state in RAM and only flushes it via _plat__NvCommit().  If,
for whatever reason, the on-disk state blob is missing, empty or
out of sync relative to libtpms RAM at the moment live migration runs,
an empty or short PERMANENT blob is shipped to the destination, which
results in:

  qemu-kvm: tpm-emulator: Setting the stateblob (type 2) failed with a
                          TPM error 0x800

Regardless of how the desync arose (orchestrator-level cleanup of the state
dir, transient NvCommit failure on a prior receive, etc.), let's add
additional sync as a workaround for such cases.  Namely, add
SWTPM_NVRAM_Store_Permanent() that fetches PERMANENT from libtpms via
TPMLIB_GetState() and writes it through SWTPM_NVRAM_StoreData() -
similarly to how it's done for SWTPM_NVRAM_Store_Volatile().  That is to
make sure the disk is force-synced from RAM at the moment of migration.

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>

dlunev commented May 7, 2026

This state could easily be reached if we migrate a Win2025 guest mid-install, e.g. at an installer stage where the TPM has not yet been initialized for the very first time.

@stefanberger
Owner

We've seen a live migration failure in production, with Win2k25 guest, with the following symptoms:

A migration error should not occur due to the VM that is running, like Win2k25. Have you had such issues with any other workloads or do they all work fine? Does it always happen with Win2k25 or is it 'a live migration failure' that only occurred once or once in a while?

QEMU dest log:

qemu-kvm: tpm-emulator: Setting the stateblob (type 2) failed with a TPM error 0x800
qemu-kvm: error while loading state for instance 0x0 of device 'tpm-emulator'
qemu-kvm: load of migration failed: Input/output error

swtpm dest log:

main: Initializing TPM at Fri May  1 14:04:10 2026                              
mainLoop:                                                                       
SWTPM_NVRAM_Validate_Dir: Rooted state path /path/to/shared/storage             
SWTPM_NVRAM_Validate_Dir: Rooted state path /path/to/shared/storage             
SWTPM_NVRAM_LoadData: No such file /path/to/shared/storage/tpm2-00.permall                                                                                     
Data client disconnected 

The root cause is not perfectly clear: either the orchestration layer left the blob out of sync, or some kind of storage error occurred when the blob was initially written on the src node. Either way, I suggest an additional sync for the permanent state

To me it seems that it was an issue with the storage.

blob as a potential workaround for such errors, similar to how it's done with volatile state.

Comment thread src/swtpm/swtpm_nvstore.c
/*
 * Flush libtpms's in-memory PERMANENT state to disk so that the next
 * SWTPM_NVRAM_GetStateBlob(PERMANENT) returns a blob consistent with what
 * libtpms is actually running. Counterpart to SWTPM_NVRAM_Store_Volatile().
Owner


This makes sense and it shouldn't affect anything negatively. However, I was under the impression that the PERMANENT state was always stored whenever a command 'affected' it, e.g. when an NVRAM space was created or undefined etc., and therefore the latest state was always on disk. So I am a bit surprised that this is necessary and that nobody else complained about it so far.

Comment thread src/swtpm/swtpm_nvstore.c
 * libtpms is actually running. Counterpart to SWTPM_NVRAM_Store_Volatile().
 *
 * Without this, GET_STATEBLOB(PERMANENT) on the migration-out path reads a
 * file that may be missing, empty, or stale relative to libtpms RAM (the
Owner


You don't happen to have a reproducer for how we could generate a 'missing' or 'empty file' situation other than using rm intentionally, do you?

Author

adrobvz commented May 11, 2026

Hello Stefan,

I spent some more time trying to reproduce the issue by hand. We thought migrating a VM before the guest OS is able to send any TPM-related commands might be it. I tried Win11 during installation, then an empty VM booted into the UEFI shell - no luck. The reason is that the state is created by swtpm_setup (which is in turn launched by libvirt) before the VM is even launched.

So I still don't have a solid reproducer for this, except for doing a manual rm / truncate. However, as you mentioned above, the situation where the latest state is not on disk shouldn't occur in the first place. Shouldn't we consider having that additional sync anyway? Having it does help with the naive truncate reproduction.

Owner

stefanberger commented May 11, 2026

Shouldn't we consider having that additional sync? Having it does help with the naive truncate reproduction.

Thanks for running some more tests.

My point is that it shouldn't be necessary because whenever a command is sent to the TPM 2 that alters its permanent state then the permanent state should be written to disk automatically. If this wasn't the case then we would have to have it write its state to disk every time we shut it down as well...
