initrd: fix TPM1 counter auth regression and defend lock cascade failure

tlaurion · tlaurion · commit c20cd28d61fc · 2026-05-15T21:03:45.000-04:00
Commit 5168be2494 (PR #2035) changed increment_tpm_counter to pass the TPM owner passphrase as the TPM1 counter's auth value (-pwdc), but check_tpm_counter was left using empty auth (-pwdc ''). Every counter increment therefore computed SHA1(owner_pass) while the counter had been created with SHA1(""), producing a persistent TPM_AUTHFAIL.

Per TCG TPM Main Spec Part 3, TPM_CreateCounter uses owner auth (-pwdo) but TPM_IncrementCounter uses the counter's own authData, not the owner password. The correct design for Heads' rollback counter is empty auth: rollback security comes from the signed /boot/kexec_rollback.txt and TPM sealing, not counter access control.

The repeated auth failures (3 per boot × ~5 boots) triggered TPM 1.2 dictionary-attack lockout (TPM_DEFEND_LOCK_RUNNING), which persisted through forceclear on some implementations, causing subsequent tpm takeown to fail and TPM reset to abort.

Changes:

- initrd/bin/tpmr.sh (_tpm_auth_retry, tpm2_counter_inc, tpm2_seal, tpm1_seal): add 'defend' to the auth detection grep patterns in all four functions, and add '0x98e|0x149' to _tpm_auth_retry and tpm2_seal, which lacked them, so defend lock and TPM2 RC codes are treated as retryable auth failures rather than fatal errors
- initrd/bin/tpmr.sh (tpm1_reset): detect "defend lock" after takeown failure and cycle physical presence to clear the lock state before retrying; a full AC power cycle remains the fallback if software presence is insufficient
- initrd/etc/functions.sh (check_tpm_counter): pass -pwdc '' (empty counter auth) instead of -pwdc "${tpm_passphrase:-}" so the counter is created with SHA1("") per TCG spec
- initrd/etc/functions.sh (increment_tpm_counter): try -pwdc '' first for TPM1 (correct behavior). If that fails on a readable counter (created by the buggy intermediate code), prompt for the owner passphrase and retry as a migration fallback
- initrd/etc/functions.sh (increment_tpm_counter): remove the TPM1-specific owner-passphrase prompt block added by the regression; it is no longer needed because new counters use empty auth
- doc/tpm.md: document TPM1 boot chain, tpmtotp tool selection, auth retry patterns, defend lock recovery, and physical presence

Signed-off-by: Thierry Laurion <insurgo@riseup.net>
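
Reviewer note: the effect of the widened grep pattern can be checked in isolation. The sketch below is illustrative only. `is_retryable_auth_error` is a hypothetical helper, not a function in tpmr.sh, and the "out of resources" sample string is an assumed stand-in for a non-auth failure. It reuses the pattern that `_tpm_auth_retry` carries after this commit to show that a defend-lock message is now classified as retryable.

```sh
#!/bin/sh
# Hypothetical helper (not part of tpmr.sh): returns 0 when captured TPM
# tool output matches the retryable-auth pattern used after this commit.
is_retryable_auth_error() {
	printf '%s\n' "$1" |
		grep -qiE 'authorization|auth|bad|permission|defend|0x98e|0x149'
}

# Message tpmtotp prints on defend lock (quoted in doc/tpm.md).
if is_retryable_auth_error "Error Defend lock running from TPM_TakeOwnership"; then
	echo "defend lock: retryable"
fi

# Assumed sample of a non-auth failure; the exact wording is illustrative.
if ! is_retryable_auth_error "out of resources (0x15)"; then
	echo "out of resources: not retryable"
fi
```

With the pre-commit pattern in `_tpm_auth_retry` (no `defend`), the first message would have fallen through to the non-auth branch.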
diff --git a/doc/tpm.md b/doc/tpm.md
@@ -10,8 +10,35 @@ See also: [architecture.md](architecture.md), [boot-process.md](boot-process.md)
 ## tpmr — unified TPM abstraction
 
 `initrd/bin/tpmr.sh` is a shell script wrapper that presents a single interface
-over both TPM 1.2 (`tpm` / `trousers`) and TPM 2.0 (`tpm2-tools`). All Heads
-scripts call `tpmr.sh` rather than invoking `tpm` or `tpm2` directly.
+over both TPM 1.2 and TPM 2.0. All Heads scripts call `tpmr.sh` rather than
+invoking TPM tools directly.
+
+### Boot chain and TPM tool selection
+
+```text
+initrd/init  (PID 1)
+  └─ CONFIG_BOOTSCRIPT → /bin/gui-init.sh          [board config]
+       ├─ source /etc/functions.sh                   [shared TPM helpers]
+       ├─ source /etc/gui_functions.sh               [whiptail wrappers]
+       └─ calls initrd/bin/tpmr.sh                   [TPM abstraction]
+            ├─ TPM1: calls `tpm` (tpmtotp util/tpm)  [CONFIG_TPM2_TOOLS != y]
+            │          modules/tpmtotp → output: totp hotp qrenc util/tpm
+            │
+            └─ TPM2: calls tpm2_* (tpm2-tools)       [CONFIG_TPM2_TOOLS=y]
+                       modules/tpm2-tss + modules/tpm2-tools
+```
+
+TPM1 support comes exclusively from the `tpmtotp` module (`modules/tpmtotp`),
+which builds `util/tpm` as part of its outputs. This binary is installed to
+the initrd as `tpm` and supports subcommands such as `physicalpresence`,
+`forceclear`, `takeown -pwdo`, `counter_create`, `counter_increment`, etc.
+
+TPM2 support comes from `modules/tpm2-tss` (TSS software stack) and
+`modules/tpm2-tools` (command-line tools like `tpm2_nvdefine`,
+`tpm2_getcap`, `tpm2_nvincrement`).
+
+Both TPM1 and TPM2 boards may also enable `CONFIG_TPMTOTP=y` for the
+`totp` and `hotp` utilities, which are independent of the TPM version.
 
 ### PCR sizes
 
@@ -398,3 +425,68 @@ To verify that a new board's coreboot config matches the expected RoT:
 | Auth sessions | Not used | Required for policy-based unseal |
 | `kexec_finalize` | No-op | Extends PCRs, then `tpm2 shutdown` |
 | `startsession` | No-op | Creates encryption session |
+
+### TPM auth retry and error detection
+
+`_tpm_auth_retry()` in `initrd/bin/tpmr.sh` provides shared retry logic for
+both TPM1 and TPM2 operations that need authorization. On auth failure
+(wrong passphrase), the passphrase cache is shredded and the user is
+re-prompted up to 3 times before giving up.
+
+Auth failure is detected by grepping the command output for known error
+patterns.  TPM1 (tpmtotp) errors go to stdout via `printf()` with
+`TPM_GetErrMsg()` strings.  TPM2 (tpm2-tools) errors go to stderr via
+`LOG_ERR()` and may include raw TPM response codes.
+
+| Pattern | Type | TPM version | Example error |
+| --- | --- | --- | --- |
+| `authorization\|auth\|bad\|permission` | English words | TPM1+TPM2 | `TPM_AUTHFAIL`, `bad passphrase` |
+| `defend` | English word | TPM1 | `Defend lock running` |
+| `0x98e\|0x149` | Hex codes | TPM2 | `TPM2_RC_AUTH_FAIL`, `TPM2_RC_NV_AUTHORIZATION` |
+
+### TPM1 reset defend lock
+
+`TPM_DEFEND_LOCK_RUNNING` (`tpm_error.h`: `TPM_BASE + TPM_NON_FATAL + 3`)
+is a standard TPM 1.2 error raised when the TPM's dictionary-attack
+protection is active. After too many failed authorization attempts, the
+TPM enters a time-out period and refuses all authorization operations —
+including `tpm takeown` even after a successful `tpm forceclear`
+(forceclear clears the owner but not the dictionary attack counter on
+some implementations).
+
+tpmtotp's `tpm takeown` outputs:
+```text
+Error Defend lock running from TPM_TakeOwnership
+```
+
+`tpm1_reset()` in `initrd/bin/tpmr.sh` detects "defend lock" in the
+`takeown` output and cycles physical presence (`physicaldisable` /
+`physicalenable` / `physicalpresence` / `physicalsetdeactivated`) to
+reset the TPM state machine and clear the lock on chips that honour
+software presence.  `TPM_ResetLockValue` (in tpmtotp's `util/resetlockvalue.c`)
+exists but requires owner auth — after forceclear there is no owner,
+so it cannot be used.
+
+If the cycling also fails, only a full AC power cycle (not just reboot)
+will clear the defend lock.  The timeout duration is chip-specific and
+not documented in the tpmtotp source.
+
+### TPM1 physical presence
+
+TPM1.2 forceclear requires physical presence to be asserted. The
+`tpm1_reset()` function does this with `tpm physicalpresence -s` (software
+presence). On some platforms (e.g., Dell OptiPlex, some Infineon TPMs),
+software physical presence may not work — the TPM firmware only accepts
+hardware-asserted presence (GPIO set by BIOS). In that case, `forceclear`
+returns success but may not fully reset the TPM, or `takeown` may fail
+with unexpected errors.
+
+When software physical presence fails, the LOG shows:
+```text
+tpm1_reset: unable to set physical presence
+```
+
+This is logged but not fatal — `tpm forceclear` is still attempted.
+If the TPM firmware ignores software physical presence, the reset fails
+and the user must use the platform's hardware TPM reset mechanism
+(typically a BIOS option or jumper).
diff --git a/initrd/bin/tpmr.sh b/initrd/bin/tpmr.sh
@@ -354,7 +354,7 @@ tpm2_counter_inc() {
 		rm -f "$tmp_err_file"
 		shred -n 10 -z -u /tmp/secret/tpm_owner_passphrase 2>/dev/null || true
 		DEBUG "tpm2_counter_inc attempt $attempt failed. Stderr: $tmp_err_content"
-		if ! echo "$tmp_err_content" | grep -qiE 'authorization|auth|bad|permission|0x98e|0x149'; then
+		if ! echo "$tmp_err_content" | grep -qiE 'authorization|auth|bad|permission|defend|0x98e|0x149'; then
 			DIE "Can't increment TPM counter for $index, access denied."
 		fi
 		WARN "Authentication failed, retrying..."
@@ -370,16 +370,26 @@ tpm2_counter_inc() {
 # Caching: prompt_tpm_owner_password reuses cached passphrase if available.
 # On auth failure the cache is shredded; next prompt will ask the user.
 #
+# Error stream selection:
+#   TPM1 (tpmtotp):   errors go to stdout via printf() — capture stdout+stderr
+#   TPM2 (tpm2-tools): errors go to stderr via LOG_ERR() — capture stderr only
+#
+# Auth detection grep patterns:
+#   English words  — TPM1 (TPM_GetErrMsg returns "Authentication failed...")
+#                  — TPM2 (tpm2-tools LOG_ERR returns "TPM2_RC_AUTH_FAIL...")
+#   defend         — TPM1 "Defend lock running" (TPM_DEFEND_LOCK_RUNNING)
+#   0x98e, 0x149  — TPM2 raw hex codes (TPM2_RC_AUTH_FAIL, TPM2_RC_NV_AUTHORIZATION)
+#
 # Usage: _tpm_auth_retry <label> <error_stream> <tpm_type> <pw_flag> <cmd...>
 #   <label>:        short name for debug (e.g. "counter_create")
-#   <error_stream>: "stdout" (TPM1) or "stderr" (TPM2)
+#   <error_stream>: "stdout" (TPM1: tpmtotp printf) or "stderr" (TPM2: tpm2-tools LOG_ERR)
 #   <tpm_type>:    "tpm1" or "tpm2"
 #   <pw_flag>:     passphrase flag for TPM1 (-pwdo or -pwdc), ignored for TPM2
 #   <cmd...>:       the tpm command and its non-auth arguments
 #
 # Exit codes:
 #   0: success
-#   1: non-auth error (e.g., "out of resources" 0x15) — caller should check
+#   1: non-auth error (e.g., TPM1 "out of resources" 0x15) — caller should check
 _tpm_auth_retry() {
 	local label="$1" error_stream="$2" tpm_type="$3" pw_flag="$4"
 	shift 4
@@ -417,7 +427,7 @@ _tpm_auth_retry() {
 		DEBUG "_tpm_auth_retry $label attempt $attempt failed: $out_content"
 		rm -f "$tmp_file"
 		shred -n 10 -z -u /tmp/secret/tpm_owner_passphrase 2>/dev/null || true
-		if echo "$out_content" | grep -qiE 'authorization|auth|bad|permission'; then
+		if echo "$out_content" | grep -qiE 'authorization|auth|bad|permission|defend|0x98e|0x149'; then
 			WARN "$label failed (bad passphrase?). Retrying..."
 		else
 			# Non-auth error (e.g., out of resources 0x15)
@@ -641,7 +651,7 @@ tpm2_seal() {
 		rm -f "$tmp_err_file"
 		DEBUG "Failed attempt $attempt to write sealed secret to NVRAM from tpm2_seal. Stderr: $tmp_err_content"
 		shred -n 10 -z -u /tmp/secret/tpm_owner_passphrase 2>/dev/null || true
-		if echo "$tmp_err_content" | grep -qiE 'authorization|auth|bad|permission'; then
+		if echo "$tmp_err_content" | grep -qiE 'authorization|auth|bad|permission|defend|0x98e|0x149'; then
 			if [ "$attempt" -ge 3 ]; then
 				DIE "Unable to write sealed secret to TPM NVRAM after 3 attempts. Reset the TPM and try again."
 			fi
@@ -759,7 +769,7 @@ tpm1_seal() {
 			rm -f "$tmp_def_out"
 			DEBUG "tpm1_seal nv_definespace failed (attempt $attempt): $def_out_content"
 			# If auth failure, retry after re-prompt; otherwise bail out.
-			if echo "$def_out_content" | grep -qiE 'authorization|auth|bad|permission'; then
+			if echo "$def_out_content" | grep -qiE 'authorization|auth|bad|permission|defend'; then
 				shred -n 10 -z -u /tmp/secret/tpm_owner_passphrase 2>/dev/null || true
 				WARN "nv_definespace failed (bad passphrase?). Retrying..."
 				continue
@@ -788,7 +798,7 @@ tpm1_seal() {
 		fi
 		DEBUG "tpm1_seal nv_writevalue(post-define) output: $tmp_out_content"
 		shred -n 10 -z -u /tmp/secret/tpm_owner_passphrase 2>/dev/null || true
-		if echo "$tmp_out_content" | grep -qiE 'authorization|auth|bad|permission'; then
+		if echo "$tmp_out_content" | grep -qiE 'authorization|auth|bad|permission|defend'; then
 			if [ "$attempt" -ge 3 ]; then
 				DIE "Unable to write sealed secret to TPM NVRAM after 3 attempts"
 			fi
@@ -1075,9 +1085,34 @@ tpm1_reset() {
 	DO_WITH_DEBUG tpm physicalenable >/dev/null 2>&1 || LOG "tpm1_reset: unable to physicalenable after clear"
 
 	# 3. Take ownership with the new TPM owner passphrase.
-	if ! DO_WITH_DEBUG --mask-position 3 tpm takeown -pwdo "$tpm_owner_passphrase" >/dev/null 2>&1; then
-		LOG "tpm1_reset: tpm takeown failed after forceclear"
-		return 1
+	# TPM_DEFEND_LOCK_RUNNING is a standard TPM 1.2 error raised after
+	# too many failed authorization attempts (see tpm_error.h).  The TPM
+	# enters a time-out period and refuses all authorization operations —
+	# including takeown, even after a successful forceclear (forceclear
+	# clears the owner but not the dictionary attack counter on some
+	# implementations).
+	# TPM_ResetLockValue requires owner auth, which does not exist after
+	# forceclear, so we cannot call it.  Cycle physical presence
+	# (physicaldisable + physicalenable) to reset the TPM state machine
+	# on chips that honour software presence.  If the lock persists,
+	# only a full AC power cycle (not just reboot) will clear it.
+	local takeown_rc takeown_out
+	takeown_out="$(DO_WITH_DEBUG --mask-position 3 tpm takeown -pwdo "$tpm_owner_passphrase" 2>&1)" && takeown_rc=0 || takeown_rc=$?
+	if [ "$takeown_rc" -ne 0 ]; then
+		if echo "$takeown_out" | grep -qi "defend lock"; then
+			LOG "tpm1_reset: defend lock detected after forceclear — cycling physical presence to clear"
+			DO_WITH_DEBUG tpm physicaldisable >/dev/null 2>&1 || true
+			DO_WITH_DEBUG tpm physicalenable >/dev/null 2>&1 || true
+			DO_WITH_DEBUG tpm physicalpresence -s >/dev/null 2>&1 || true
+			DO_WITH_DEBUG tpm physicalsetdeactivated -c >/dev/null 2>&1 || true
+			if ! DO_WITH_DEBUG --mask-position 3 tpm takeown -pwdo "$tpm_owner_passphrase" >/dev/null 2>&1; then
+				LOG "tpm1_reset: tpm takeown still failed after defend lock recovery"
+				return 1
+			fi
+		else
+			LOG "tpm1_reset: tpm takeown failed after forceclear"
+			return 1
+		fi
 	fi
 
 	# 4. Leave TPM enabled, present, and not deactivated.
diff --git a/initrd/etc/functions.sh b/initrd/etc/functions.sh
@@ -1872,7 +1872,7 @@ check_tpm_counter() {
 		(
 			set +e
 			tpmr.sh counter_create \
-				-pwdc "${tpm_passphrase:-}" \
+				-pwdc '' \
 				-la "$LABEL" \
 				>/tmp/counter 2> >(tee >(SINK_LOG "tpm counter_create stderr") >&2)
 			echo $? > /tmp/counter_create_rc
@@ -2051,23 +2051,13 @@ increment_tpm_counter() {
 	fi
 
 	# Prefer explicit passphrase, otherwise reuse cached TPM owner passphrase.
+	# TPM2 uses owner-auth fallback in tpm2_counter_inc; TPM1 uses empty counter
+	# auth (SHA1("")) per TCG spec — no owner passphrase needed for increment.
 	if [ -z "$tpm_passphrase" ] && [ -s /tmp/secret/tpm_owner_passphrase ]; then
 		tpm_passphrase="$(cat /tmp/secret/tpm_owner_passphrase)"
 		DEBUG "increment_tpm_counter: using cached TPM owner passphrase"
 	fi
 
-	# TPM1 counter_increment requires owner auth in practice on this path.
-	# origin/master typically reached this with cached owner passphrase already set,
-	# but the newer reseal/update flows can call this later in the session after
-	# that cache is absent. Prompt once and cache to avoid empty -pwdc failures.
-	if [ "$CONFIG_TPM2_TOOLS" != "y" ] && [ -z "$tpm_passphrase" ]; then
-		WARN "TPM Owner Passphrase is required to update rollback counter before signing updated boot hashes."
-		DEBUG "increment_tpm_counter: TPM1 path has no cached/provided owner passphrase; prompting now"
-		prompt_tpm_owner_password
-		tpm_passphrase="$tpm_owner_passphrase"
-		DEBUG "increment_tpm_counter: TPM1 owner passphrase obtained and cached"
-	fi
-
 	# Try to increment the counter.  We normally hide the verbose
 	# output of tpmr.sh commands to avoid overwhelming the console, but we
 	# must *not* swallow any interactive prompts.  The previous implementation
@@ -2094,7 +2084,11 @@ increment_tpm_counter() {
 			increment_ok="y"
 		fi
 	else
-		# TPM1 path uses owner auth in practice.
+		# TPM1 counter uses empty auth (SHA1 of "") per TCG spec.
+		# The counter's auth is separate from the owner passphrase.
+		# If empty auth fails on a readable counter, the counter was
+		# created by pre-fix code with owner-passphrase auth — prompt
+		# for owner passphrase and retry as migration fallback.
 		# NOTE: tpmtotp C code prints ALL output (success + errors) to stdout.
 		# We must capture stdout to detect failures properly.
 		# DO_WITH_DEBUG internally captures the command's stderr (tee /dev/stderr
@@ -2104,10 +2098,25 @@ increment_tpm_counter() {
 		if (
 			set -o pipefail
 			DO_WITH_DEBUG --mask-position 5 \
-				tpmr.sh counter_increment -ix "$counter_id" -pwdc "${tpm_passphrase:-}" \
+				tpmr.sh counter_increment -ix "$counter_id" -pwdc '' \
 					2>/dev/null | tee /tmp/counter-"$counter_id" >/dev/null
 		); then
 			increment_ok="y"
+		elif [ "$counter_present" = "y" ]; then
+			if [ -z "$tpm_passphrase" ]; then
+				WARN "TPM Owner Passphrase required to increment counter created by previous Heads version"
+				prompt_tpm_owner_password
+				tpm_passphrase="$tpm_owner_passphrase"
+			fi
+			if (
+				set -o pipefail
+				DO_WITH_DEBUG --mask-position 5 \
+					tpmr.sh counter_increment -ix "$counter_id" -pwdc "${tpm_passphrase}" \
+						2>/dev/null | tee /tmp/counter-"$counter_id" >/dev/null
+			); then
+				increment_ok="y"
+				WARN "TPM1 counter used owner-passphrase auth (pre-fix). Consider resetting TPM via GUI menu to switch to empty auth."
+			fi
 		fi
 	fi