windows-bootstrap — Pitfalls & Lessons

Sixteen pitfalls logged from the DevTower bring-up (2026-05-07, Dell XPS 15 9550/9560). Read this before re-running. Each pitfall documented here saves roughly an hour of debugging.


1. PowerShell 5.1 reads .ps1 files as cp1252 unless they have a UTF-8 BOM

Symptom: parser errors like Unexpected token ')' in expression or statement and The '<' operator is reserved for future use — even though the file looks syntactically clean to a human eye.

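Cause: files written via Claude Code's Write tool default to UTF-8 without a BOM. With no BOM, PS 5.1's parser falls back to the system ANSI code page (cp1252 on US-locale Windows). Em-dashes (—, U+2014), arrows (→), and other non-ASCII characters are multi-byte in UTF-8, so each byte is decoded as a separate cp1252 character and the result is garbled into invalid token sequences.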

Mitigation: write all .ps1 and .cmd files using ASCII only. No em-dashes, no smart quotes, no arrows. Use --, ', ->. After write, grep [^\x00-\x7F] and replace any hits.

Detection: Grep [^\x00-\x7F] -path <file.ps1> — non-empty result means trouble.
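The same check can be run from PowerShell itself. The following is a minimal sketch, not part of the skill; `Find-NonAsciiByte` is a hypothetical helper name. It scans raw bytes rather than decoded text, so the result does not depend on which encoding PowerShell guesses for the file.

```powershell
# Hypothetical helper: reports every byte above 0x7F in a file, with its line number.
function Find-NonAsciiByte {
    param([Parameter(Mandatory)][string]$Path)

    # Read raw bytes so the result does not depend on any text decoding.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $bytes = [System.IO.File]::ReadAllBytes($fullPath)

    $line = 1
    for ($i = 0; $i -lt $bytes.Length; $i++) {
        if ($bytes[$i] -eq 10) {
            $line++                      # 0x0A = line feed
        } elseif ($bytes[$i] -gt 127) {
            [pscustomobject]@{
                Line   = $line
                Offset = $i
                Byte   = ('0x{0:X2}' -f $bytes[$i])
            }
        }
    }
}
```

An empty result means the file is pure ASCII. Note that a UTF-8 BOM is itself three non-ASCII bytes, so a BOM-prefixed file reports hits at offsets 0 to 2.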


2. diskpart's format after create partition primary fails silently

Symptom: the diskpart script runs to completion, but the new partition ends up RAW (no filesystem). The drive letter shows up, but the volume is unformatted.

Cause: diskpart's create partition primary leaves focus on the partition, but format fs=fat32 operates on a selected volume. The volume doesn't exist yet (it's RAW). diskpart prints There is no volume selected. Please select a volume and try again. and continues.

Mitigation: don't use diskpart for formatting. Use PowerShell's Format-Volume after diskpart wipes:

# diskpart-only: clean, GPT, create partition, assign letter (NO format)
# Then in PowerShell:
Format-Volume -DriveLetter $usbLetter -FileSystem FAT32 -NewFileSystemLabel WIN11INST -Force -Confirm:$false

Better: skip diskpart entirely, use PowerShell storage cmdlets (Clear-Disk, Initialize-Disk, New-Partition, Format-Volume). Cleaner, more debuggable.


3. diskpart's convert gpt errors on a freshly-cleaned disk

Symptom: After clean, running convert gpt errors with The disk you specified is not MBR formatted. Please select an empty MBR disk to convert.

Cause: convert gpt is meant for MBR→GPT conversion. After clean, the disk has no partition table, and the diskpart command refuses to convert "nothing" into GPT.

Mitigation: use Initialize-Disk -PartitionStyle GPT instead. It accepts a RAW disk and writes a fresh GPT header.


4. Initialize-Disk errors with "already been initialized"

Symptom: After a previous run that initialized to GPT (or after Clear-Disk -RemoveData which leaves the partition style intact), Initialize-Disk -PartitionStyle GPT throws The disk has already been initialized.

Cause: idempotency-by-exception. When the disk is already initialized, the cmdlet throws instead of treating the call as a no-op.

Mitigation: pre-check the partition style instead of catch-and-match-error-message:

# Ok (like Die) is assumed to be one of the script's own helpers.
$diskInfo = Get-Disk -Number $DiskNumber
if ($diskInfo.PartitionStyle -eq 'GPT') {
  Ok "Disk $DiskNumber already GPT"                          # nothing to do
} elseif ($diskInfo.PartitionStyle -eq 'RAW') {
  Initialize-Disk -Number $DiskNumber -PartitionStyle GPT    # write a fresh GPT header
} else {
  Set-Disk -Number $DiskNumber -PartitionStyle GPT           # MBR->GPT conversion (ASCII arrow, see pitfall 1)
}

Originally I tried try/catch with -match 'already initialized' — but the actual error message is "already been initialized" (note the been). Regex didn't match, fell through to Die. Lesson: don't grep error messages, query state.


5. Robocopy /MIR exit code 11 is "files copied + extras + failures" — partial success

Symptom: Robocopy returns exit 11. My check if ($exit -ge 8) { Die } fired. Build aborted even though all boot-critical files copied successfully.

Cause: Robocopy exit codes are bit-flags, not severity levels:

  • 1 = files copied
  • 2 = extras detected (will be deleted with /MIR)
  • 4 = mismatches
  • 8 = some failures
  • 16 = fatal

11 = 1 | 2 | 8 = files copied + extras + failures. Some files failed (probably edge-case attributes the FAT32 USB couldn't preserve), but the install media itself was complete.
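To make the bit-flag reading concrete, here is a small sketch that decodes an exit code into the flags listed above. `ConvertFrom-RobocopyExitCode` is a hypothetical helper, not something the skill ships; the flag names are the ones used in this document.

```powershell
# Hypothetical helper: splits a Robocopy exit code into its individual bit flags.
function ConvertFrom-RobocopyExitCode {
    param([Parameter(Mandatory)][int]$ExitCode)

    $names = @{
        1  = 'files copied'
        2  = 'extras detected'
        4  = 'mismatches'
        8  = 'some failures'
        16 = 'fatal'
    }

    # Test each bit independently; several can be set at once.
    $set = @(foreach ($bit in 1, 2, 4, 8, 16) {
        if ($ExitCode -band $bit) { $names[$bit] }
    })

    [pscustomobject]@{
        ExitCode    = $ExitCode
        Flags       = $set
        HasFailures = [bool]($ExitCode -band 8)
        Fatal       = [bool]($ExitCode -band 16)
    }
}
```

Called with 11, this reports the flags "files copied", "extras detected", and "some failures", with Fatal false. One reasonable policy is to abort immediately only when Fatal is set and to fall back to a file-level check when HasFailures is set.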

Mitigation: don't gate on exit code alone. After robocopy, verify boot-critical files exist by name:

$critical = @("$usb\bootmgr.efi", "$usb\efi\boot\bootx64.efi", "$usb\sources\boot.wim")
$missing = $critical | Where-Object { -not (Test-Path $_) }
if ($missing) { Die "missing critical files: $($missing -join ', ')" }
# Else: accept the partial copy.

Also: prefer /E (all subdirs) over /MIR (mirror, deletes extras). For a fresh-formatted USB, there are no extras; /MIR is overkill.

Add /COPY:DAT /DCOPY:DAT /XJ for FAT32-friendly attribute set (no security/owner) and to skip junctions.


6. appraiserres.dll on USB has read-only attribute, WriteAllBytes fails

Symptom: After successfully copying ISO contents to USB, the script's [System.IO.File]::WriteAllBytes($appraiser, @()) step throws Access to the path 'D:\sources\appraiserres.dll' is denied.

Cause: Robocopy preserves file attributes. The ISO marks appraiserres.dll (and many other system files) as read-only. FAT32 honors the read-only attribute. WriteAllBytes refuses to overwrite.

Mitigation: clear the read-only attribute first (for example with Set-ItemProperty -Name IsReadOnly, as below), or delete and recreate the file:

$appraiser = "$usbDrive\sources\appraiserres.dll"
if (Test-Path $appraiser) {
  Set-ItemProperty -Path $appraiser -Name IsReadOnly -Value $false -ErrorAction SilentlyContinue
  [System.IO.File]::WriteAllBytes($appraiser, @())
}

Or wrap in try/catch and continue — the registry-based HW bypass in autounattend.xml is the load-bearing one. The appraiserres.dll truncation is belt-and-suspenders.

Order matters: this step is currently AFTER the slow DISM split. If it fails, the slow work is already done. Restructure to run autounattend.xml + $OEM$ copy before the appraiser truncation, so a late failure doesn't lose the install media.


7. Win 11 25H2 install.wim is 7+ GB — exceeds FAT32's 4 GB single-file limit

Symptom: Robocopy refuses to copy install.wim to FAT32 USB.

Cause: FAT32 max file size is 4 GB - 1 byte. Win 11 25H2's install.wim is ~7 GB.

Mitigation: split it with DISM:

dism /Split-Image /ImageFile:"$isoDrive\sources\install.wim" /SWMFile:"$usbDrive\sources\install.swm" /FileSize:3800

Produces install.swm, install2.swm, and further numbered parts as needed (two or three for an image of this size), each <4 GB. Windows Setup auto-detects the split files. No autounattend changes needed.

This is the slowest step on USB 2.0 (~25 min for 7 GB at 4-5 MB/s real-world). USB 3.0 cuts it to ~3-5 min.
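The arithmetic behind those numbers can be sketched as follows. `Get-SwmPlan` is a hypothetical helper; the 3800 MB part size matches the DISM command above, and the 4.5 MB/s default throughput is an assumption taken from the middle of the USB 2.0 range quoted here. The part count is only an estimate, since DISM decides the real boundaries.

```powershell
# Hypothetical helper: estimates whether a WIM needs splitting for FAT32,
# roughly how many parts result, and how long the write might take.
function Get-SwmPlan {
    param(
        [Parameter(Mandatory)][long]$WimBytes,
        [int]$PartSizeMB = 3800,          # same value as /FileSize above
        [double]$ThroughputMBps = 4.5     # assumed real-world USB 2.0 speed
    )

    $fat32MaxFileBytes = 4GB - 1          # 4,294,967,295 bytes
    $needsSplit = $WimBytes -gt $fat32MaxFileBytes

    $parts = 1
    if ($needsSplit) {
        $parts = [int][math]::Ceiling($WimBytes / ($PartSizeMB * 1MB))
    }

    # Size in MB divided by MB/s gives seconds; divide by 60 for minutes.
    $minutes = [math]::Round(($WimBytes / 1MB) / $ThroughputMBps / 60, 1)

    [pscustomobject]@{
        NeedsSplit       = $needsSplit
        EstimatedParts   = $parts
        EstimatedMinutes = $minutes
    }
}
```

For example, `Get-SwmPlan -WimBytes 7GB` reports that a split is needed, estimates two parts, and gives about 26.5 minutes at the assumed throughput, which is consistent with the ~25 min observed above.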


8. UAC on Win 11 default = secure desktop, blocks UI Automation

Symptom: When Start-Process -Verb RunAs triggers UAC, my Windows MCP Snapshot returns Error capturing desktop state: screen grab failed. Auto-clicking Yes via mcp__windows__Click doesn't work.

Cause: at the default UAC level ("Notify me only when apps try to make changes to my computer") and at the stricter "Always notify" level, Windows 11 shows the prompt on the secure desktop. The secure desktop is a separate, isolated desktop whose UI Automation tree is not accessible to medium-integrity processes.

Mitigation: accept that one user click is required per elevated invocation. Don't try to auto-click. Tell the user clearly: "UAC incoming, click Yes." Never trigger elevation silently: UAC prompts that go unanswered for 2 minutes get auto-canceled.

To bypass secure desktop: registry HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\PromptOnSecureDesktop = 0. Don't recommend — weakens security posture.


9. Start-Process -Verb RunAs -Wait sometimes orphans the elevated child

Symptom: Parent PowerShell exits with success (exit code 0), but the elevated child it spawned is still running. The bg-task notification fires "completed" prematurely. Polling for completion based on parent exit gives false signals.

Cause: With -Verb RunAs, the parent-child relationship breaks on UAC interaction (the new process is reparented to a system process). -Wait sometimes succeeds, sometimes returns immediately. Inconsistent.

Mitigation: don't trust -Wait. Use a done-flag file: have the elevated runner write a JSON completion marker when finished, and have the orchestrator poll for that file. Implementation in templates/build-usb-runner.ps1.tmpl.
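The done-flag idea can be sketched as below. This is an illustration only and not the code in the template; the function names, the JSON field names, and the timeout defaults are all assumptions. The writer creates the marker under a temporary name and renames it, so the poller never reads a half-written file.

```powershell
# Elevated side (hypothetical helper): call this as the very last step of the runner.
function Write-DoneFlag {
    param(
        [Parameter(Mandatory)][string]$Path,
        [int]$ExitCode = 0,
        [string]$Message = ''
    )
    $payload = [ordered]@{
        exitCode   = $ExitCode
        message    = $Message
        finishedAt = (Get-Date).ToString('o')
    }
    $tmp = "$Path.tmp"
    $payload | ConvertTo-Json | Set-Content -LiteralPath $tmp -Encoding ASCII
    Move-Item -LiteralPath $tmp -Destination $Path -Force   # rename = publish
}

# Orchestrator side (hypothetical helper): poll for the marker instead of trusting -Wait.
function Wait-DoneFlag {
    param(
        [Parameter(Mandatory)][string]$Path,
        [int]$TimeoutSeconds = 3600,
        [int]$PollSeconds = 5
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ((Get-Date) -lt $deadline) {
        if (Test-Path -LiteralPath $Path) {
            return Get-Content -LiteralPath $Path -Raw | ConvertFrom-Json
        }
        Start-Sleep -Seconds $PollSeconds
    }
    throw "Timed out after $TimeoutSeconds s waiting for $Path"
}
```

The orchestrator should delete any stale marker before launching the elevated runner, then treat the returned exitCode as the real result rather than the parent's exit code.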


10. Orphaned elevated PowerShell holds file locks indefinitely

Symptom: Remove-Item build-usb.log fails with an error ending in "because it is being used by another process", even though no recent script should hold the file.

Cause: When Start-Process -Verb RunAs -Wait orphans the elevated child (PITFALLS #9), the child keeps any file handles it had open (e.g. Start-Transcript's log file). Until the elevated child actually exits, the lock persists.

Mitigation:

  1. Use timestamped log filenames (e.g. build-20260507-005212.log) so each run has a fresh file regardless of orphans.
  2. Have the runner kill any prior orphaned powershell.exe processes at the start (the runner does this — see template).
  3. Killing an elevated process from medium integrity fails. Either re-elevate to kill, or wait for the orphan to exit on its own (eventually does, as it errors out).
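
For the first item, a timestamped log name in the same shape as the example can be generated like this. `New-BuildLogPath` is a hypothetical helper name; the `-Now` parameter exists only to make the function easy to test.

```powershell
# Hypothetical helper: returns a per-run log path such as build-20260507-005212.log.
function New-BuildLogPath {
    param(
        [Parameter(Mandatory)][string]$Directory,
        [datetime]$Now = (Get-Date)
    )
    Join-Path $Directory ('build-{0:yyyyMMdd-HHmmss}.log' -f $Now)
}
```

A runner could then call `Start-Transcript -Path (New-BuildLogPath -Directory $PSScriptRoot)`, so an orphan that still holds an older log never blocks the new run.
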

11. CDP Tailscale key generator's "Pre-approved" toggle has a fragile selector

Symptom: Generated key works, but the new node lands as "Needs approval" in https://login.tailscale.com/admin/machines. tailscale up --auth-key=... succeeds but the node can't be reached until manually approved.

Cause: Tailscale's modal for "Generate auth key" uses a stack of toggles (Reusable / Ephemeral / Pre-approved / Tags). The DOM uses generic <button role="switch"> elements without distinctive selectors. My script's heuristic aria-label + nearby-text matching is brittle.

Mitigation:

  1. Manually verify the modal ticked Pre-approved (or check the resulting node state) — if not, click approve in admin once. Fast.
  2. Future improvement: tighten the selector by reading Tailscale's actual DOM. The class names carry hash suffixes that are only stable within a build, but the data-component attributes may be stable across builds. Run the generator with a full DOM dump first.
  3. Or: use Tailscale's official OAuth flow with a stored client ID/secret — fully scriptable, no UI brittleness. Requires one-time OAuth client setup in Tailscale admin.
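
A rough sketch of the OAuth route follows, based on my understanding of Tailscale's public API; verify it against the current API documentation before relying on it. `New-TailscaleKeyRequestBody`, the environment variable names, and the tag `tag:devtower` are all hypothetical. Keys minted through an OAuth client must carry at least one tag that the client is allowed to assign.

```powershell
# Hypothetical helper: builds the JSON body for a one-off, pre-approved, tagged auth key.
function New-TailscaleKeyRequestBody {
    param(
        [Parameter(Mandatory)][string[]]$Tags,
        [int]$ExpirySeconds = 3600
    )
    @{
        capabilities = @{
            devices = @{
                create = @{
                    reusable      = $false
                    ephemeral     = $false
                    preauthorized = $true      # the "Pre-approved" toggle
                    tags          = $Tags
                }
            }
        }
        expirySeconds = $ExpirySeconds
    } | ConvertTo-Json -Depth 6
}

# Usage (needs network access and an OAuth client created in the admin console):
# $token = Invoke-RestMethod -Method Post -Uri 'https://api.tailscale.com/api/v2/oauth/token' `
#     -Body @{ client_id = $env:TS_OAUTH_CLIENT_ID; client_secret = $env:TS_OAUTH_CLIENT_SECRET }
# $key = Invoke-RestMethod -Method Post -Uri 'https://api.tailscale.com/api/v2/tailnet/-/keys' `
#     -Headers @{ Authorization = "Bearer $($token.access_token)" } `
#     -ContentType 'application/json' `
#     -Body (New-TailscaleKeyRequestBody -Tags 'tag:devtower')
# $key.key   # the auth key to pass to tailscale up
```

Because preauthorized is set explicitly in the request body, the result does not depend on any UI toggle being clicked.
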

12. <PlainText>true</PlainText> for autounattend passwords is unreliable

Symptom: User cannot log in to the local account created by autounattend. Account exists (visible on the lock screen) but the password as written into the autounattend doesn't work.

Cause (observed on Win 11 25H2 during the DevTower validation, 2026-05-07): Despite Microsoft's documentation claiming <PlainText>true</PlainText> should store the password verbatim, in practice Windows still applies the literal-suffix transformation in some code paths. Net effect: the account's actual password becomes <plain>Password instead of <plain>. Earlier guides that say "PlainText=true Just Works" were likely written for older Windows versions or a specific config combo.

Mitigation: always use <PlainText>false</PlainText> with a properly base64-encoded value. The skill's autounattend.xml.tmpl uses a {{PASSWORD_ENCODED}} placeholder; render it via:

$plain = 'DevTower2026'   # what the user types at the lock screen
$encoded = [Convert]::ToBase64String([Text.Encoding]::Unicode.GetBytes($plain + 'Password'))
# Substitute $encoded into both <Value> elements (LocalAccount and AutoLogon)

The suffix is the literal string Password for both LocalAccount and AutoLogon password slots. Not AutoLogonPassword, not AdministratorPassword -- those suffixes apply to other autounattend elements with different <Name> semantics.
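
Since a wrong encoding only shows up as a failed login after a full install, it can help to round-trip the value before rendering the template. The sketch below wraps the encoding shown above and adds a decoder; both function names are hypothetical.

```powershell
# Hypothetical helper: base64 of UTF-16LE (plain + suffix), as used with <PlainText>false</PlainText>.
function ConvertTo-UnattendPassword {
    param(
        [Parameter(Mandatory)][string]$Plain,
        [string]$Suffix = 'Password'
    )
    [Convert]::ToBase64String([Text.Encoding]::Unicode.GetBytes($Plain + $Suffix))
}

# Hypothetical helper: reverses the encoding and strips the suffix, for verification only.
function ConvertFrom-UnattendPassword {
    param(
        [Parameter(Mandatory)][string]$Encoded,
        [string]$Suffix = 'Password'
    )
    $decoded = [Text.Encoding]::Unicode.GetString([Convert]::FromBase64String($Encoded))
    if (-not $decoded.EndsWith($Suffix, [StringComparison]::Ordinal)) {
        throw "Decoded value does not end with '$Suffix'"
    }
    $decoded.Substring(0, $decoded.Length - $Suffix.Length)
}
```

Decoding the rendered `<Value>` with the expected suffix should return exactly what the user will type; if it throws, the value was encoded with a different suffix.
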

If a user is already locked out because they hit this on a previous build:

  1. At the lock screen press Shift + F10 to open admin cmd.
  2. net user <user> NewPass2026! /active:yes
  3. Log in with the new password. Then rotate the password again if needed, and handle any slmgr (activation) issues separately.

13. $OEM$ folder MUST be at <usb>\sources\$OEM$\, not at the USB root

Symptom: Setup completes Windows install successfully, but C:\Windows\Setup\Scripts\ is empty or missing on the new install. SetupComplete.cmd and post-install.ps1 are never copied or executed. Tailscale doesn't install. The node never announces on the tailnet.

Cause (observed on Win 11 25H2 during the DevTower validation, 2026-05-07): Microsoft's distribution-share / configuration-set documentation says $OEM$ folders can live in two places:

  • <media>\sources\$OEM$\ -- the default Setup-recognized location
  • <media>\$OEM$\ (root) -- only processed if the autounattend has <UseConfigurationSet>true</UseConfigurationSet> set in the windowsPE pass Microsoft-Windows-Setup component

If you put $OEM$ at the USB root WITHOUT <UseConfigurationSet>true</UseConfigurationSet>, Setup ignores it entirely. No error, no warning, just a silent skip. The whole post-install bootstrap chain breaks.

Mitigation: place the scripts under <usb>\sources\$OEM$\$$\Setup\Scripts\ (Setup copies the contents of $OEM$\$$ into C:\Windows, so they land at C:\Windows\Setup\Scripts\ post-install). The skill's build-usb.ps1.tmpl builds it there as of the v2 fix.
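
One practical trap when building this path in PowerShell: inside a double-quoted string, `$OEM` and `$$` are treated as variables and expanded, which silently produces the wrong folder names. A minimal sketch that avoids it by using single-quoted segments; `Get-OemScriptsPath` is a hypothetical helper, not the code in the template.

```powershell
# Hypothetical helper: returns <usb>\sources\$OEM$\$$\Setup\Scripts for a given USB root.
function Get-OemScriptsPath {
    param([Parameter(Mandatory)][string]$UsbRoot)

    # Single quotes keep '$OEM$' and '$$' literal; double quotes would expand them.
    [System.IO.Path]::Combine($UsbRoot, 'sources', '$OEM$', '$$', 'Setup', 'Scripts')
}

# Usage: pass the root with a trailing backslash ('E:\', not 'E:'),
# because Path.Combine treats a bare 'E:' as drive-relative.
# $scripts = Get-OemScriptsPath -UsbRoot 'E:\'
# New-Item -ItemType Directory -Path $scripts -Force | Out-Null
```
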

Diagnostic on the live target (if you suspect this bit you):

Test-Path 'C:\Windows\Setup\Scripts'             # should be $true
Get-ChildItem 'C:\Windows\Setup\Scripts'         # should list SetupComplete.cmd + post-install.ps1
Test-Path 'C:\Program Files\Tailscale'           # should be $true if post-install ran

If those are missing/empty, recover via the standalone recover.ps1 script (run on the target with $env:TS_AUTH_KEY set; one-line iwr ... | iex paste).


14. Always ship a standalone recover.ps1 for post-install bootstrap

Symptom: Any failure mode that prevents SetupComplete.cmd from running (PITFALLS #13, or a corrupted $OEM$ copy, or Setup running OOBE in a way that blocks SetupComplete, etc.) leaves the target with a working Windows install but no Tailscale, no OpenSSH, no remote channel for the operator. The user is stuck physically at the keyboard for whatever recovery is needed.

Mitigation: ship recover.ps1 at the repo root and document the one-liner in SKILL.md and CHECKLIST.md Phase 7. The script is idempotent: installs OpenSSH if missing, installs Tailscale if missing, brings the tailnet up with the supplied auth key. User runs once from elevated PS:

$env:TS_AUTH_KEY = 'tskey-auth-...'
iwr https://raw.githubusercontent.com/<owner>/agentic-windows-bootstrap/master/recover.ps1 -UseBasicParsing | iex

Once the target is on the tailnet, the operator picks up via tailscale ssh. This turns ANY post-install failure into "one paste plus 30 seconds" instead of "rebuild the USB and retry the whole 30-min install."

The unattended path is still the goal -- recover.ps1 is the safety net, not the primary flow.


15. autounattend <LocalAccount> block can silently fail to create the user

Symptom: After install, the lock screen shows the configured hostname but no logon-able user. net user shows only built-ins (Administrator disabled, Guest, DefaultAccount, WDAGUtilityAccount). The <LocalAccount> block in autounattend was processed without errors but the user just doesn't exist.

Cause (observed during DevTower validation, 2026-05-07): unclear. Combinations that have caused it:

  • <PlainText>true</PlainText> with the Password suffix quirk (PITFALLS #12) -- account created with mangled password, then maybe partial-rollback?
  • <Group>Administrators</Group> not applying when default-Users would have been the fallback -- some Win11 25H2 builds appear to refuse to create the LocalAccount if the Group element fails validation
  • $OEM$ placement bug (PITFALLS #13) cascading: Setup couldn't read its config-set, may have skipped LocalAccount processing entirely
  • Some interaction with SkipUserOOBE=true + no MS account: Setup gets to OOBE-skip and drops the LocalAccount block on the floor

Mitigation (v4): always also enable the built-in Administrator with a known password in autounattend:

<UserAccounts>
  <AdministratorPassword>
    <Value>{{ADMIN_PASSWORD_ENCODED}}</Value>
    <PlainText>false</PlainText>
  </AdministratorPassword>
  <LocalAccounts>
    <LocalAccount wcm:action="add">...</LocalAccount>
  </LocalAccounts>
</UserAccounts>

Encoding: base64 UTF-16LE of (plain + "AdministratorPassword"). Note: suffix is AdministratorPassword, not Password -- different from LocalAccount.

If LocalAccount creation fails silently, Administrator is still enabled and admin-capable. No utilman-recovery dance required.

Also added as FirstLogonCommand: net localgroup Administrators <user> /add -- belt-and-suspenders to ensure group membership stuck if the account did get created. Errors silently if no account.


16. Windows classifies the Tailscale adapter as Public, silently blocking inbound

Symptom: Tailscale joins the tailnet successfully (visible in tailscale status from other nodes, tailscale ping works). But TCP connections from any other tailnet node to any port on the Windows node time out. Firewall rules added by hand for ports 22 / 3389 don't help.

Cause: Tailscale on Windows creates a virtual network adapter. Windows' Network Location Awareness service classifies new adapters automatically (without prompting on Server SKUs / domain-joined / certain OOBE paths). The Tailscale adapter often lands in Public, where Windows Defender Firewall blocks inbound by default. Rules scoped only to the Private/Domain profiles do not apply to a Public adapter.

Mitigation (v4): post-install.ps1 now runs the following after tailscale up:

Get-NetConnectionProfile | Where-Object InterfaceAlias -match 'tailscale' | Set-NetConnectionProfile -NetworkCategory Private

Plus the firewall rules created by post-install use -Profile Any and -RemoteAddress 100.64.0.0/10 (CGNAT range = tailnet only) so they apply regardless of adapter classification.
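
A rule of that shape could look like the sketch below. It is an illustration, not the actual post-install.ps1 code; `New-TailnetRuleParams` and the display name are hypothetical, while the parameter names are those of New-NetFirewallRule. Building the parameters as a hashtable keeps the rule definition inspectable before it is applied.

```powershell
# Hypothetical helper: parameters for an inbound TCP rule limited to the tailnet range.
function New-TailnetRuleParams {
    param(
        [Parameter(Mandatory)][int]$Port,
        [Parameter(Mandatory)][string]$Name
    )
    @{
        DisplayName   = $Name
        Direction     = 'Inbound'
        Action        = 'Allow'
        Protocol      = 'TCP'
        LocalPort     = $Port
        RemoteAddress = '100.64.0.0/10'   # CGNAT range used by the tailnet
        Profile       = 'Any'             # applies even if the adapter is Public
    }
}

# Usage on the Windows target, from an elevated session:
# $params = New-TailnetRuleParams -Port 22 -Name 'SSH (tailnet only)'
# New-NetFirewallRule @params
```
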

Diagnostic if it bites you anyway:

Get-NetConnectionProfile | Format-Table InterfaceAlias, NetworkCategory
# Tailscale row should be Private. If Public, force it:
Get-NetConnectionProfile | Where-Object InterfaceAlias -match 'tailscale' | Set-NetConnectionProfile -NetworkCategory Private

Pattern: every pitfall has the same shape

Most of the above follow this pattern:

  1. PowerShell cmdlet or built-in tool errors in a way that's NOT obviously a bug.
  2. The error message mentions state ("already X" or "no Y") that the script thought it controlled.
  3. Catching the exception and parsing the message is brittle (string changes between versions).
  4. Pre-checking state via Get-* cmdlets, then conditionally acting, is more robust than try/catch.

This is the windows-bootstrap-equivalent of "always check return codes" — but for state-driven systems.