| 1 | +--- |
| 2 | +title: "Lab Monitoring Plan" |
| 3 | +sidebar_label: "Lab Monitoring Plan" |
| 4 | +sidebar_position: 3 |
| 5 | +description: "Monitoring strategy and observability plan for the Azure Local lab environment" |
| 6 | +--- |
| 7 | + |
| 8 | +# Lab Monitoring Plan |
| 9 | + |
| 10 | +[](./index.mdx) |
| 11 | +[](https://learn.microsoft.com/en-us/azure/azure-local/) |
| 12 | + |
| 13 | +> **DOCUMENT CATEGORY**: Runbook |
| 14 | +> **SCOPE**: Lab environment monitoring and observability |
| 15 | +> **PURPOSE**: Define the monitoring strategy for the Azure Local lab environment, modelled after what a production deployment would use |
| 16 | +> **MASTER REFERENCE**: [Monitor Azure Local with Azure Monitor](https://learn.microsoft.com/en-us/azure/azure-local/manage/monitor-overview) |
| 17 | + |
| 18 | +**Status**: Example / Reference |
| 19 | + |
| 20 | +:::info Lab Reference Document |
| 21 | +This monitoring plan is a worked example showing how a real Azure Local deployment would be documented. Replace all `[EXAMPLE]` values with your environment-specific details before using it in production. |
| 22 | +::: |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## Monitoring Objectives |
| 27 | + |
| 28 | +The lab monitoring strategy delivers the same observability posture as a production deployment, enabling: |
| 29 | + |
| 30 | +- **Cluster health visibility** — Real-time and historical health across all nodes, storage, and network |
| 31 | +- **Workload performance** — Per-VM and per-service metrics for workloads running on the cluster |
| 32 | +- **Alerting** — Automated notification when critical thresholds are breached |
| 33 | +- **Audit trail** — Log retention sufficient for security review and troubleshooting |
| 34 | +- **Capacity planning** — Trend data for predicting growth and right-sizing |
| 35 | + |
| 36 | +--- |
| 37 | + |
| 38 | +## Environment Scope |
| 39 | + |
| 40 | +| Attribute | Value | |
| 41 | +|-----------|-------| |
| 42 | +| Environment Name | `[EXAMPLE] azlocal-lab-001` | |
| 43 | +| Cluster Node Count | `[EXAMPLE] 3` | |
| 44 | +| Azure Region | `[EXAMPLE] australiaeast` | |
| 45 | +| Azure Subscription | `[EXAMPLE] AzureLocal-Lab-Sub` | |
| 46 | +| Resource Group | `[EXAMPLE] rg-azlocal-lab-monitoring` | |
| 47 | +| Log Analytics Workspace | `[EXAMPLE] law-azlocal-lab-001` | |
| 48 | +| Retention Period | `[EXAMPLE] 30 days (hot) / 90 days (archive)` | |
| 49 | + |
| 50 | +--- |
| 51 | + |
| 52 | +## Monitoring Architecture |
| 53 | + |
| 54 | +The lab uses the same stack documented in the [Monitoring and Observability solution](../operations/monitoring-on-azure-local.md): |
| 55 | + |
| 56 | +``` |
| 57 | +Cluster Nodes (3x) |
| 58 | + │ |
| 59 | + ├──▶ Azure Monitor Agent (AMA) |
| 60 | + │ │ |
| 61 | + │ ├──▶ Log Analytics Workspace (law-azlocal-lab-001) |
| 62 | + │ │ └──▶ KQL Queries / Workbooks / Alerts |
| 63 | + │ └──▶ Azure Monitor Metrics |
| 64 | + │ |
| 65 | + ├──▶ Prometheus (workload metrics — deployed on AKS Arc) |
| 66 | + │ │ |
| 67 | + │ └──▶ Azure Managed Grafana (grafana-azlocal-lab-001) |
| 68 | + │ |
| 69 | + └──▶ Azure Monitor Alerts ──▶ Action Group (lab-ops-email) |
| 70 | +``` |
| 71 | + |
| 72 | +### Components |
| 73 | + |
| 74 | +| Component | Deployment | Notes | |
| 75 | +|-----------|-----------|-------| |
| 76 | +| Azure Monitor Agent | Arc-enabled VM extension | Replaces legacy MMA/OMS | |
| 77 | +| Log Analytics Workspace | Azure-hosted | Centralized for all nodes and workloads | |
| 78 | +| Prometheus | AKS Arc cluster | Scrapes node-exporter and kube-state-metrics | |
| 79 | +| Azure Managed Grafana | Azure-hosted | Pre-built Azure Local dashboards | |
| 80 | +| Azure Monitor Alerts | Azure-hosted | Threshold and anomaly-based rules | |
| 81 | + |
| 82 | +--- |
| 83 | + |
| 84 | +## Key Metrics and Thresholds |
| 85 | + |
| 86 | +### Cluster Infrastructure |
| 87 | + |
| 88 | +| Metric | Warning Threshold | Critical Threshold | Collection Interval | |
| 89 | +|--------|------------------|--------------------|---------------------| |
| 90 | +| Node CPU utilization | 70% | 90% | 60 seconds | |
| 91 | +| Node memory utilization | 75% | 90% | 60 seconds | |
| 92 | +| Storage pool capacity | 70% | 85% | 5 minutes | |
| 93 | +| Storage I/O latency (avg) | 5 ms | 20 ms | 60 seconds | |
| 94 | +| Network packet loss | 0.1% | 1% | 60 seconds | |
| 95 | +| Node heartbeat | — | No heartbeat > 5 min | 60 seconds | |
| 96 | + |
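The CPU thresholds above can be checked ad hoc with a query like the following. This is a minimal sketch, assuming the `Perf` table is populated with the Windows `Processor` counters through your data collection rule; the `WarningPct` and `CriticalPct` names are illustrative and simply mirror the table values.

```kusto
// Illustrative thresholds copied from the Cluster Infrastructure table
let WarningPct = 70.0;
let CriticalPct = 90.0;
Perf
| where TimeGenerated > ago(5m)
| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total"
// Average over the window so a single spike does not change the status
| summarize AvgCPU = avg(CounterValue) by Computer
| extend Status = case(AvgCPU > CriticalPct, "Critical", AvgCPU > WarningPct, "Warning", "OK")
| project Computer, AvgCPU = round(AvgCPU, 1), Status
| order by AvgCPU desc
```

The same pattern should carry over to the other percentage-based rows by swapping the counter filter and the two threshold values.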
| 97 | +### Workload Metrics |
| 98 | + |
| 99 | +| Metric | Warning Threshold | Critical Threshold | Collection Interval | |
| 100 | +|--------|------------------|--------------------|---------------------| |
| 101 | +| VM CPU utilization | 80% | 95% | 5 minutes | |
| 102 | +| VM memory utilization | 80% | 95% | 5 minutes | |
| 103 | +| VM disk read latency | 10 ms | 30 ms | 5 minutes | |
| 104 | +| VM disk write latency | 10 ms | 30 ms | 5 minutes | |
| 105 | +| AVD session count | 75% of capacity | 95% of capacity | 5 minutes | |
| 106 | + |
| 107 | +### Azure Local-Specific |
| 108 | + |
| 109 | +| Metric | Warning Threshold | Critical Threshold | |
| 110 | +|--------|------------------|--------------------| |
| 111 | +| Arc connectivity status | — | Disconnected > 15 min | |
| 112 | +| Storage Spaces Direct (S2D) health | Degraded | Failed | |
| 113 | +| Failover Cluster health | — | Any node offline | |
| 114 | + |
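One possible way to surface the "any node offline" condition from logs is to look for failover clustering events in the `Event` table. The sketch below assumes the System event log is collected at a level that includes these events, and that event ID 1135 (node removed from active cluster membership) is the signal you want; both are assumptions to validate against your own cluster.

```kusto
Event
| where TimeGenerated > ago(24h)
| where EventLog == "System"
// Assumption: event 1135 is used here as the "node offline" indicator
| where Source == "Microsoft-Windows-FailoverClustering" and EventID == 1135
| project TimeGenerated, Computer, EventID, EventLevelName, RenderedDescription
| order by TimeGenerated desc
```

An empty result only means no such event was ingested in the window. It does not prove the cluster is healthy if event collection is not configured.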
| 115 | +--- |
| 116 | + |
| 117 | +## Log Analytics Configuration |
| 118 | + |
| 119 | +### Data Sources |
| 120 | + |
| 121 | +| Source | Table | Retention | |
| 122 | +|--------|-------|-----------| |
| 123 | +| Windows Event Logs (System, Application) | `Event` | 30 days | |
| 124 | +| Performance Counters | `Perf` | 30 days | |
| 125 | +| Azure Activity Log | `AzureActivity` | 90 days | |
| 126 | +| Azure Monitor Agent heartbeat | `Heartbeat` | 30 days | |
| 127 | +| Custom Syslog (Linux nodes) | `Syslog` | 30 days | |
| 128 | +| Security Events | `SecurityEvent` | 90 days | |
| 129 | + |
| 130 | +### Key KQL Queries |
| 131 | + |
| 132 | +**Node availability (last 24 hours):** |
| 133 | +```kusto |
| 134 | +Heartbeat |
| 135 | +| where TimeGenerated > ago(24h) |
| 136 | +| summarize LastHeartbeat = max(TimeGenerated) by Computer |
| 137 | +| extend MinutesSince = datetime_diff('minute', now(), LastHeartbeat) |
| 138 | +| project Computer, LastHeartbeat, MinutesSince |
| 139 | +| order by MinutesSince desc |
| 140 | +``` |
| 141 | + |
| 142 | +**Top CPU consumers (last 1 hour):** |
| 143 | +```kusto |
| 144 | +Perf |
| 145 | +| where TimeGenerated > ago(1h) |
| 146 | +| where ObjectName == "Processor" and CounterName == "% Processor Time" and InstanceName == "_Total" |
| 147 | +| summarize AvgCPU = avg(CounterValue) by Computer |
| 148 | +| order by AvgCPU desc |
| 149 | +| take 10 |
| 150 | +``` |
| 151 | + |
| 152 | +**Storage read latency trend (last 6 hours):** |
| 153 | +```kusto |
| 154 | +Perf |
| 155 | +| where TimeGenerated > ago(6h) |
| 156 | +| where ObjectName == "LogicalDisk" and CounterName == "Avg. Disk sec/Read" |
| 157 | +| summarize AvgLatencyMs = avg(CounterValue * 1000) by bin(TimeGenerated, 5m), Computer |
| 158 | +| render timechart |
| 159 | +``` |
| 160 | + |
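The queries above cover heartbeat, CPU, and disk latency. A memory query in the same style might look like the sketch below. It assumes the `Memory` object's `% Committed Bytes In Use` counter is collected and is an acceptable proxy for "memory utilization" in this plan; substitute another counter if your data collection rule gathers one.

```kusto
Perf
| where TimeGenerated > ago(1h)
// Assumption: committed-bytes percentage stands in for memory utilization
| where ObjectName == "Memory" and CounterName == "% Committed Bytes In Use"
| summarize AvgMemPct = avg(CounterValue), MaxMemPct = max(CounterValue) by Computer
| order by AvgMemPct desc
```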
| 161 | +--- |
| 162 | + |
| 163 | +## Alert Configuration |
| 164 | + |
| 165 | +### Action Group |
| 166 | + |
| 167 | +| Setting | Value | |
| 168 | +|---------|-------| |
| 169 | +| Action Group Name | `[EXAMPLE] ag-azlocal-lab-ops` | |
| 170 | +| Email Recipients | `[EXAMPLE] lab-ops@contoso.com` | |
| 171 | +| Webhook (optional) | `[EXAMPLE] Teams channel webhook URL` | |
| 172 | +| SMS (optional) | `[EXAMPLE] N/A for lab` | |
| 173 | + |
| 174 | +### Alert Rules |
| 175 | + |
| 176 | +| Alert Name | Severity | Condition | Action | |
| 177 | +|-----------|----------|-----------|--------| |
| 178 | +| Node CPU Critical | Sev 1 | CPU > 90% for 5 min | Email + Teams | |
| 179 | +| Node Memory Critical | Sev 1 | Memory > 90% for 5 min | Email + Teams | |
| 180 | +| Node Heartbeat Lost | Sev 0 | No heartbeat > 5 min | Email + Teams | |
| 181 | +| Storage Pool Warning | Sev 2 | Capacity > 70% | Email | |
| 182 | +| Storage Pool Critical | Sev 1 | Capacity > 85% | Email + Teams | |
| 183 | +| Storage Latency Critical | Sev 1 | Avg latency > 20 ms | Email | |
| 184 | +| Arc Connectivity Lost | Sev 1 | Disconnected > 15 min | Email + Teams | |
| 185 | +| VM CPU Critical | Sev 2 | VM CPU > 95% for 10 min | Email | |
| 186 | + |
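As an illustration of how one of these rules could be expressed, the "Node Heartbeat Lost" condition maps naturally onto a log search over the `Heartbeat` table. The following is a sketch of a candidate alert query, not the rule definition itself; the one-hour lookback is an assumed value.

```kusto
Heartbeat
// Assumed lookback; a node silent for longer than this drops out of the result
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
// Keep only nodes whose most recent heartbeat is older than 5 minutes
| where LastHeartbeat < ago(5m)
| extend MinutesSince = datetime_diff('minute', now(), LastHeartbeat)
| project Computer, LastHeartbeat, MinutesSince
```

With a query shaped like this, an alert rule would typically fire when the result contains one or more rows. Widen the lookback if nodes may stay offline for longer than an hour before anyone investigates.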
| 187 | +--- |
| 188 | + |
| 189 | +## Grafana Dashboards |
| 190 | + |
| 191 | +| Dashboard | Source | Purpose | |
| 192 | +|-----------|--------|---------| |
| 193 | +| Azure Local Cluster Overview | [azurelocal-monitoring repo](https://github.com/AzureLocal/azurelocal-monitoring) | Node health, storage, network | |
| 194 | +| AVD Session Monitoring | [azurelocal-avd](https://azurelocal.github.io/azurelocal-avd/) | Session counts, hostpool health | |
| 195 | +| VM Performance | Community / Custom | Per-VM CPU, memory, disk | |
| 196 | +| Storage Spaces Direct | Microsoft | S2D health, capacity, latency | |
| 197 | + |
| 198 | +Access: `[EXAMPLE] https://grafana-azlocal-lab-001.australiaeast.grafana.azure.com` |
| 199 | + |
| 200 | +--- |
| 201 | + |
| 202 | +## Response Procedures |
| 203 | + |
| 204 | +### Severity 0 — Node Down |
| 205 | + |
| 206 | +1. Validate in Azure portal → Azure Arc → Servers |
| 207 | +2. Check physical hardware (iDRAC/iLO console) |
| 208 | +3. Attempt remote power cycle if hardware permits |
| 209 | +4. Escalate to on-site team if remote resolution fails |
| 210 | +5. Document incident in [Support Instructions](../operations/support-instructions.mdx) |
| 211 | + |
| 212 | +### Severity 1 — Critical Threshold Breached |
| 213 | + |
| 214 | +1. Acknowledge alert within 15 minutes |
| 215 | +2. Open Log Analytics and run relevant KQL query to confirm scope |
| 216 | +3. Check workload impact; live-migrate VMs off the node if it is saturated |
| 217 | +4. Review recent changes via Azure Activity Log |
| 218 | +5. Document findings in incident log |
| 219 | + |
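For the "review recent changes" step, a query along these lines could be a starting point. It is a sketch that assumes the Azure Activity Log is exported to the workspace, and it filters on the example resource group name from the Environment Scope table, which you would replace with the resource group that actually holds the cluster resources.

```kusto
AzureActivity
| where TimeGenerated > ago(24h)
// Placeholder: example resource group name from this plan
| where ResourceGroup =~ "rg-azlocal-lab-monitoring"
| where ActivityStatusValue == "Success"
// Limit to operations that changed or removed resources
| where OperationNameValue endswith "/write" or OperationNameValue endswith "/delete"
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup, _ResourceId
| order by TimeGenerated desc
```

Comparing the timestamps here with the alert's fired time can help separate change-induced incidents from organic load growth.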
| 220 | +### Severity 2 — Warning Threshold Breached |
| 221 | + |
| 222 | +1. Review trend over last 24 hours in Grafana |
| 223 | +2. Determine if threshold is trending toward critical |
| 224 | +3. Open capacity review ticket if storage warning persists > 48 hours |
| 225 | +4. No immediate escalation required unless trend worsens |
| 226 | + |
| 227 | +--- |
| 228 | + |
| 229 | +## Review Cadence |
| 230 | + |
| 231 | +| Review | Frequency | Owner | |
| 232 | +|--------|-----------|-------| |
| 233 | +| Alert rule tuning | Monthly | Lab Ops | |
| 234 | +| Log Analytics cost review | Monthly | Lab Ops | |
| 235 | +| Dashboard review | Quarterly | Lab Ops | |
| 236 | +| Retention policy review | Quarterly | Lab Ops | |
| 237 | +| Monitoring architecture review | Annually | Platform Team | |
| 238 | + |
| 239 | +--- |
| 240 | + |
| 241 | +## References |
| 242 | + |
| 243 | +- [Monitoring and Observability on Azure Local](../operations/monitoring-on-azure-local.md) — solution guide |
| 244 | +- [Monitor Azure Local with Azure Monitor](https://learn.microsoft.com/en-us/azure/azure-local/manage/monitor-cluster) — Microsoft Learn |
| 245 | +- [Azure Monitor Agent overview](https://learn.microsoft.com/en-us/azure/azure-monitor/agents/azure-monitor-agent-overview) |
| 246 | +- [azurelocal-monitoring repository](https://github.com/AzureLocal/azurelocal-monitoring) — deployment scripts and dashboards |
| 247 | +- [Azure Managed Grafana](https://learn.microsoft.com/en-us/azure/managed-grafana/overview) |