
Commit e21c8e4

Author: Kristopher Turner
Phase 2.8-2.10: Add Troubleshooting sections to Parts 4, 5, 6
Part 4 Cluster Deployment (12 files):
- phase-03: task-12 combined script, task-13 verification
- phase-05: monitor-deployment, monitor-validation
- phase-06: all 8 post-deployment tasks (SDN, quorum, security groups, SSH, storage, images, logical networks, verification)

Part 5 Operational Foundations (4 files):
- phase-04-security: defender, azure policy, security baselines, security logging

Part 6 Testing & Validation (5 files):
- infrastructure health, network RDMA, HA testing, security compliance, backup DR

All tables follow standard Issue | Cause | Resolution format with minimum 3 rows per task.
1 parent e4303d9 commit e21c8e4

21 files changed

Lines changed: 212 additions & 0 deletions

docs/implementation/04-cluster-deployment/phase-03-os-configuration/task-12-complete-combined-script-all-steps.mdx

Lines changed: 11 additions & 0 deletions
@@ -209,6 +209,17 @@ foreach ($node in $nodes) {
 
 ---
 
+## Troubleshooting
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Script fails on one node but succeeds on others | WinRM connectivity issue or credential mismatch | Verify `Enter-PSSession -ComputerName <node>` works; re-run `Enable-PSRemoting -Force` on the failing node |
+| Hostname not changed after reboot | `Rename-Computer` requires a restart to take effect | Confirm the script includes `-Restart` flag; manually reboot if needed: `Restart-Computer -Force` |
+| Static IP not applied | DHCP still enabled on the management adapter | Run `Set-NetIPInterface -InterfaceAlias "Management" -Dhcp Disabled` before assigning the static IP |
+| DNS resolution fails after configuration | DNS server addresses not set or wrong order | Verify with `Get-DnsClientServerAddress`; re-run the DNS configuration step with correct IPs from `variables.yml` |
+
+---
+
 ## Navigation
 
 [← Task 11: Clear Storage](./task-11-clear-previous-storage-configuration-conditional.mdx) · [↑ Phase 03](./index.mdx)
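
The WinRM check in the first row of the table above can be run against every node in one pass. The following is a minimal sketch and not part of the committed scripts: `Get-UnreachableNode` and its `-Probe` parameter are hypothetical names introduced here for illustration, and the node names in the usage comment are placeholders. It assumes `Test-WSMan` is available on the machine running the check.

```powershell
# Sketch: list the nodes that fail a WinRM probe.
# Get-UnreachableNode and -Probe are illustrative names, not part of the documented scripts.
function Get-UnreachableNode {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)]
        [string[]]$ComputerName,

        # Probe run once per node; the default sends a WinRM identify request.
        [scriptblock]$Probe = {
            param($Node)
            Test-WSMan -ComputerName $Node -ErrorAction Stop
        }
    )

    foreach ($node in $ComputerName) {
        try {
            # Discard probe output; only success or failure matters here.
            $null = & $Probe $node
        }
        catch {
            # Emit one record per failing node with the underlying error text.
            [pscustomobject]@{
                Node  = $node
                Error = $_.Exception.Message
            }
        }
    }
}

# Usage (placeholder node names):
# Get-UnreachableNode -ComputerName 'node01', 'node02' | Format-Table -AutoSize
```

Nodes that appear in the output are candidates for re-running `Enable-PSRemoting -Force`; an empty result means every node answered the probe.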

docs/implementation/04-cluster-deployment/phase-03-os-configuration/task-13-phase03-verification.mdx

Lines changed: 10 additions & 0 deletions
@@ -227,6 +227,16 @@ scripts/deploy/04-cluster-deployment/phase-03-os-configuration/
 
 ---
 
+## Troubleshooting
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Verification script reports group membership mismatch | GPO not yet applied or replication delay | Run `gpupdate /force` on the affected node, wait 60 seconds, then re-run verification |
+| PSRemoting test fails for one or more nodes | WinRM service not running or firewall blocking port 5985 | On the failing node: `Enable-PSRemoting -Force`; verify firewall rule: `Get-NetFirewallRule -Name WINRM-HTTP-In-TCP` |
+| Script cannot resolve node hostname | DNS record missing or stale | Verify DNS with `Resolve-DnsName <hostname>`; re-run DNS record creation from Phase 01 Task 03 |
+
+---
+
 ## Navigation
 
 [← Task 12: Combined Script](./task-12-complete-combined-script-all-steps.mdx) · [↑ Phase 03](./index.mdx) · [Phase 04: ARC Registration →](../phase-04-arc-registration/index.mdx)
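
The "run `gpupdate /force`, wait 60 seconds, then re-run verification" resolution above is a remediate-and-retry loop. The sketch below shows one way to express it, under stated assumptions: `Invoke-WithRetry` is a hypothetical helper, the `-Check` script block stands in for whatever verification step is being re-run (the diff does not name it), and the attempt count of 3 is an arbitrary default.

```powershell
# Sketch: re-run a check after a remediation step and a wait.
# Invoke-WithRetry is an illustrative helper, not part of the documented scripts.
function Invoke-WithRetry {
    [CmdletBinding()]
    param(
        # Returns $true when verification passes (stand-in for the real verification step).
        [Parameter(Mandatory)]
        [scriptblock]$Check,

        # Remediation run between attempts; defaults to a Group Policy refresh.
        [scriptblock]$Remediate = { gpupdate /force },

        [int]$MaxAttempts = 3,

        # 60 seconds matches the wait suggested in the troubleshooting table.
        [int]$DelaySeconds = 60
    )

    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        if (& $Check) {
            return $true
        }
        if ($attempt -lt $MaxAttempts) {
            # Discard remediation output so the function returns only a boolean.
            $null = & $Remediate
            Start-Sleep -Seconds $DelaySeconds
        }
    }
    return $false
}

# Usage (the group and account names are placeholders):
# Invoke-WithRetry -Check { (Get-LocalGroupMember -Group 'Administrators').Name -contains 'CONTOSO\ClusterAdmins' }
```

The function returns `$false` after the last failed attempt without remediating again, so a persistent mismatch surfaces quickly instead of looping.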

docs/implementation/04-cluster-deployment/phase-05-cluster-deployment/deployment-monitoring/monitor-deployment.mdx

Lines changed: 10 additions & 0 deletions
@@ -691,6 +691,16 @@ while ($true) {
 
 ---
 
+## Troubleshooting
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Monitoring script shows no deployment activity | Deployment not yet started or action plan ID is wrong | Verify deployment was triggered in Azure portal; check `Get-ActionPlanInstances` for the correct action plan ID |
+| Script cannot connect to cluster nodes | WinRM not enabled or firewall blocking port 5985 | Verify connectivity: `Test-WSMan -ComputerName <node>`; enable WinRM: `Enable-PSRemoting -Force` on each node |
+| Deployment stuck at a specific step for extended time | Resource provisioning delay or prerequisite failure | Check the specific step's log file on the seed node: `C:\CloudDeployment\Logs\`; review the action plan instance details for error messages |
+
+---
+
 **Version Control**
 - Created: 2026-03-09 by Azure Local Cloudnology Team
 - Last Updated: 2026-03-09 by Azure Local Cloudnology Team
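
For the "stuck at a specific step" row above, reviewing `C:\CloudDeployment\Logs\` by hand can be slow. The sketch below scans the most recently written files for likely error lines. It is only an illustration: `Find-RecentLogError` is a hypothetical name, the search pattern and the choice of five files are assumptions, and the actual log format is not described in this diff.

```powershell
# Sketch: search the newest log files for lines that look like errors.
# Find-RecentLogError, the pattern, and the file count are assumptions for illustration.
function Find-RecentLogError {
    [CmdletBinding()]
    param(
        # Log directory named in the troubleshooting table.
        [string]$Path = 'C:\CloudDeployment\Logs',

        # How many of the most recently written files to scan.
        [int]$NewestFiles = 5,

        # Case-insensitive regular expression; adjust to the real log format.
        [string]$Pattern = 'error|exception|failed'
    )

    Get-ChildItem -Path $Path -File -Recurse |
        Sort-Object LastWriteTime -Descending |
        Select-Object -First $NewestFiles |
        Select-String -Pattern $Pattern |
        Select-Object Path, LineNumber, Line
}

# Usage on the seed node:
# Find-RecentLogError | Format-List
```

Limiting the scan to the newest files keeps the output focused on the step that is currently running; raise `-NewestFiles` if the failing step started earlier.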

docs/implementation/04-cluster-deployment/phase-05-cluster-deployment/deployment-monitoring/monitor-validation.mdx

Lines changed: 10 additions & 0 deletions
@@ -683,6 +683,16 @@ while ($true) {
 
 ---
 
+## Troubleshooting
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Validation script reports `FAIL` on environment checks | Prerequisites not met before deployment started | Review the specific check that failed; address the prerequisite (e.g., DNS, NTP, AD connectivity) and re-run validation |
+| Script timeout waiting for validation completion | Deployment validation takes longer than expected | Increase `$RefreshInterval`; check Azure portal deployment status for progress; review `C:\CloudDeployment\Logs\` for errors |
+| All validation checks show `Unknown` status | Cluster nodes unreachable or validation service not running | Verify node connectivity: `Test-Connection <node-ip>`; check ECE agent status: `Get-Service LifeCycleManagementAgent` |
+
+---
+
 **Version Control**
 - Created: 2026-03-09 by Azure Local Cloudnology Team
 - Last Updated: 2026-03-09 by Azure Local Cloudnology Team

docs/implementation/04-cluster-deployment/phase-06-post-deployment/task-01-deploy-sdn.mdx

Lines changed: 11 additions & 0 deletions
@@ -1492,6 +1492,17 @@ All values are defined in the `#region CONFIGURATION` block at the top. No `vari
 
 ---
 
+## Troubleshooting
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| `Add-EceFeature` fails with action plan error | OS build mismatch or missing prerequisites | Verify OS build is `10.0.26100` with `systeminfo`; ensure all nodes are Arc-registered and cluster is healthy |
+| Network Controller cluster group shows `Offline` | NC VMs failed to start or quorum lost | Check NC VM status: `Get-ClusterGroup \| Where-Object Name -match "Network Controller"`; restart the group: `Start-ClusterGroup <NCGroupName>` |
+| DNS resolution fails for NC FQDN | DNS record not created or propagated | Create the DNS record manually: `Add-DnsServerResourceRecordA -Name "<SDNPrefix>-NC" -ZoneName "<domain>" -IPv4Address "<NC-IP>"` |
+| SDN feature not visible in Azure portal | Arc resource bridge not synced | Wait 10-15 minutes for sync; verify bridge health: `az arcappliance show --resource-group <rg> --name <appliance>` |
+
+---
+
 ## Navigation
 
 | Previous | Up | Next |

docs/implementation/04-cluster-deployment/phase-06-post-deployment/task-02-cluster-quorum-configuration.mdx

Lines changed: 10 additions & 0 deletions
@@ -566,6 +566,16 @@ All values are defined in the `#region CONFIGURATION` block at the top. Edit tho
 
 ---
 
+## Troubleshooting
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| `Set-ClusterQuorum` fails with access denied | Insufficient permissions on cluster or storage account | Run as cluster administrator; verify SPN has `Storage Account Key Operator Service Role` on the witness storage account |
+| Quorum state shows `NotConfigured` | Cloud witness storage account key is incorrect or rotated | Re-configure with fresh key: `Set-ClusterQuorum -CloudWitness -AccountName <name> -AccessKey <newkey>` |
+| Witness resource shows `Failed` | Storage account firewall blocking cluster node IPs | Add cluster node public IPs to the storage account firewall allow list or enable `Allow trusted Microsoft services` |
+
+---
+
 ## Navigation
 
 | Previous | Up | Next |

docs/implementation/04-cluster-deployment/phase-06-post-deployment/task-03-security-groups-applied-to-nodes.mdx

Lines changed: 10 additions & 0 deletions
@@ -525,6 +525,16 @@ All values are defined in the `#region CONFIGURATION` block at the top. Edit `$O
 
 ---
 
+## Troubleshooting
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Security group not listed in local group membership | GPO not applied or AD replication delay | Run `gpupdate /force` on the node; verify AD group exists: `Get-ADGroup -Identity <groupname>` |
+| PSRemoting test fails with access denied | User not in `Remote Management Users` group | Add the user's group to Remote Management Users: `Add-LocalGroupMember -Group "Remote Management Users" -Member "<domain>\<group>"` |
+| Group membership shows SID instead of name | Domain controller unreachable for name resolution | Verify DC connectivity: `Test-ComputerSecureChannel`; check DNS resolution to domain controllers |
+
+---
+
 ## Navigation
 
 | Previous | Up | Next |

docs/implementation/04-cluster-deployment/phase-06-post-deployment/task-04-ssh-connectivity-to-nodes.mdx

Lines changed: 10 additions & 0 deletions
@@ -767,6 +767,16 @@ HybridConnectivity tunnel — not who can authenticate to sshd itself.
 
 ---
 
+## Troubleshooting
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| SSH connection refused on port 22 | `sshd` service not running or Windows Firewall blocking | Start service: `Start-Service sshd; Set-Service sshd -StartupType Automatic`; verify firewall: `Get-NetFirewallRule -Name *ssh*` |
+| Arc SSH tunnel fails with RBAC error | User lacks `Virtual Machine Local User Login` role | Assign the role: `az role assignment create --role "Virtual Machine Local User Login" --assignee <objectId> --scope <resourceScope>` |
+| Authentication fails with valid credentials | `sshd_config` not configured for password or key auth | Edit `C:\ProgramData\ssh\sshd_config` to enable `PasswordAuthentication yes` or configure authorized keys; restart: `Restart-Service sshd` |
+
+---
+
 ## Navigation
 
 | Previous | Up | Next |

docs/implementation/04-cluster-deployment/phase-06-post-deployment/task-05-storage-configuration.mdx

Lines changed: 10 additions & 0 deletions
@@ -935,6 +935,16 @@ az stack-hci-vm storagepath list `
 
 ---
 
+## Troubleshooting
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Storage pool `HealthStatus` shows `Degraded` | One or more physical disks are unhealthy or missing | Check disk health: `Get-PhysicalDisk \| Where-Object HealthStatus -ne Healthy`; replace failed disks and initiate repair: `Repair-VirtualDisk` |
+| CSV volume shows `Offline` or `Redirected` | Node owning the volume is offline or network partition | Verify cluster node status: `Get-ClusterNode`; move CSV ownership: `Move-ClusterSharedVolume -Name <csv> -Node <healthy-node>` |
+| Storage path registration fails in Azure | Arc resource bridge unhealthy or missing permissions | Verify bridge: `az arcappliance show`; ensure the SPN has Contributor on the resource group |
+
+---
+
 ## Navigation
 
 | Previous | Up | Next |

docs/implementation/04-cluster-deployment/phase-06-post-deployment/task-06-image-downloads.mdx

Lines changed: 10 additions & 0 deletions
@@ -447,6 +447,16 @@ If an image stays in `Downloading` for more than 60 minutes or enters `Failed` s
 
 ---
 
+## Troubleshooting
+
+| Issue | Cause | Resolution |
+|-------|-------|------------|
+| Image stuck in `Downloading` for over 60 minutes | Insufficient CSV free space or blocked outbound connectivity | Check CSV free space: `Get-ClusterSharedVolume`; verify outbound to `*.blob.core.windows.net` on port 443 |
+| Image provisioning state is `Failed` | Invalid image source URL or Arc resource bridge issue | Delete and recreate: `az stack-hci-vm image delete --name <img> --resource-group <rg> --yes`; verify bridge health first |
+| Image not visible in Azure portal | Sync delay between cluster and Azure | Wait 10-15 minutes; if still missing, verify the custom location resource is healthy: `az customlocation show` |
+
+---
+
 ## Navigation
 
 | | |
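
`Get-ClusterSharedVolume` on its own does not print free space; the figures sit under each volume's `SharedVolumeInfo.Partition` property. The sketch below flattens them into a short report. `Get-SharedVolumeFreeSpace` is a hypothetical helper name, and the function only reads properties from the objects piped to it, so it makes no changes to the cluster.

```powershell
# Sketch: summarize free space for Cluster Shared Volumes piped in from Get-ClusterSharedVolume.
# Get-SharedVolumeFreeSpace is an illustrative helper, not part of the documented scripts.
function Get-SharedVolumeFreeSpace {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [object]$Volume
    )

    process {
        # A CSV can expose more than one partition entry; report each one.
        foreach ($info in $Volume.SharedVolumeInfo) {
            [pscustomobject]@{
                Name        = $Volume.Name
                Path        = $info.FriendlyVolumeName
                FreeGB      = [math]::Round([double]$info.Partition.FreeSpace / 1GB, 1)
                PercentFree = [math]::Round([double]$info.Partition.PercentFree, 1)
            }
        }
    }
}

# Usage on a cluster node (requires the FailoverClusters module):
# Get-ClusterSharedVolume | Get-SharedVolumeFreeSpace | Sort-Object PercentFree
```

Sorting by `PercentFree` puts the tightest volume first, which is usually the one to compare against the size of the image being downloaded.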
