Sync permanent state blob upon GET_STATEBLOB cmd #1122
ctrlchannel_return_state() reads tpm2-00.permall from disk when handling
GET_STATEBLOB(PERMANENT). At the same time libtpms holds the live
PERMANENT state in RAM and only flushes it via _plat__NvCommit(). If,
for whatever reason, the on-disk state blob is missing, empty or
out of sync relative to libtpms RAM at the moment live migration runs,
an empty or short PERMANENT blob is shipped to the destination, which
results in:
qemu-kvm: tpm-emulator: Setting the stateblob (type 2) failed with a
TPM error 0x800
Regardless of how the desync arose (orchestrator-level cleanup of the state
dir, a transient NvCommit failure on a prior receive, etc.), let's add an
additional sync as a workaround for such cases. Namely, add
SWTPM_NVRAM_Store_Permanent() that fetches PERMANENT from libtpms via
TPMLIB_GetState() and writes it through SWTPM_NVRAM_StoreData() -
similarly to how it's done for SWTPM_NVRAM_Store_Volatile(). That is to
make sure disk is force-synced from RAM at the migration moment.
Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
This state can easily be reached if we migrate Win2025 while it is still being installed, e.g. at an installer stage where the TPM has not yet been initialized for the very first time.
A migration error should not occur because of the VM that is running, be it Win2k25 or anything else. Have you had such issues with any other workloads, or do they all work fine? Does it always happen with Win2k25, or is it a live migration failure that occurred only once or once in a while?
To me it seems that it was an issue with the storage.
    /*
     * Flush libtpms's in-memory PERMANENT state to disk so that the next
     * SWTPM_NVRAM_GetStateBlob(PERMANENT) returns a blob consistent with what
     * libtpms is actually running. Counterpart to SWTPM_NVRAM_Store_Volatile().
This makes sense and it shouldn't affect anything negatively. However, I was under the impression that the PERMANENT state was always stored whenever a command 'affected' it, e.g. when an NVRAM space was created or undefined etc., and therefore the latest state was always on disk. So I am a bit surprised that this is necessary and that nobody else complained about it so far.
     * libtpms is actually running. Counterpart to SWTPM_NVRAM_Store_Volatile().
     *
     * Without this, GET_STATEBLOB(PERMANENT) on the migration-out path reads a
     * file that may be missing, empty, or stale relative to libtpms RAM (the
You don't happen to have a reproducer for how we could generate a 'missing' or 'empty file' situation other than using rm intentionally, do you?
Hello Stefan, I spent some more time trying to reproduce the issue by hand. We thought migrating a VM before the guest OS is able to send any TPM-related commands might be it. I tried Win11 during installation, then an empty VM booted into the UEFI shell - no luck. The reason is that the state is created before the VM is even launched by … So I still don't have a solid reproducer for this, except for doing manual …
Thanks for running some more tests. My point is that it shouldn't be necessary, because whenever a command is sent to the TPM 2 that alters its permanent state, the permanent state should be written to disk automatically. If this weren't the case, then we would have to have it write its state to disk every time we shut it down as well...
We've seen a live migration failure in production, with Win2k25 guest, with the following symptoms:
QEMU dest log:
swtpm dest log:
The root cause is not perfectly clear: either the orchestration layer left the blob out of sync, or some kind of storage error occurred when the blob was initially written on the src node. Either way, I suggest an additional sync for the permanent state blob as a potential workaround for such errors, similar to how it's done for the volatile state.