Commit 14f4942

docs: add lab monitoring plan example, clean up operations stubs, move training to top-level
- Add docs/lab-environment/lab-monitoring-plan.mdx — a worked example monitoring plan for the lab environment, modelling what a real deployment would document (metrics, thresholds, alert rules, KQL queries, response procedures)
- Update lab-environment/index.mdx to reference the new monitoring plan
- Remove docs/operations/as-built.mdx and monitoring-plan.mdx stubs (placeholder-only, not appropriate for community docs)
- Move docs/operations/training/ to docs/training/ — training spans the full platform, not just operations
- Update docs/operations/index.mdx to reflect removed files and updated training link
1 parent 43f4b89 commit 14f4942

9 files changed

Lines changed: 250 additions & 80 deletions

File tree

docs/lab-environment/index.mdx

Lines changed: 1 addition & 0 deletions
@@ -35,3 +35,4 @@ This section contains documentation for Azure Local lab environments used for te
 
 - [Lab Access](./access.mdx) - Access instructions and credentials
 - [Lab As-Built](./lab-as-built.mdx) - As-built configuration documentation
+- [Lab Monitoring Plan](./lab-monitoring-plan.mdx) - Monitoring strategy and observability plan

docs/lab-environment/lab-monitoring-plan.mdx

Lines changed: 247 additions & 0 deletions
@@ -0,0 +1,247 @@
---
title: "Lab Monitoring Plan"
sidebar_label: "Lab Monitoring Plan"
sidebar_position: 3
description: "Monitoring strategy and observability plan for the Azure Local lab environment"
---

# Lab Monitoring Plan

[![Runbook](https://img.shields.io/badge/Type-Runbook-blue?style=flat-square)](./index.mdx)
[![Azure](https://img.shields.io/badge/Platform-Azure_Local-0078D4?style=flat-square&logo=microsoftazure)](https://learn.microsoft.com/en-us/azure/azure-local/)

> **DOCUMENT CATEGORY**: Runbook
> **SCOPE**: Lab environment monitoring and observability
> **PURPOSE**: Define the monitoring strategy for the Azure Local lab environment, modelled after what a production deployment would use
> **MASTER REFERENCE**: [Monitor Azure Local with Azure Monitor](https://learn.microsoft.com/en-us/azure/azure-local/manage/monitor-overview)

**Status**: Example / Reference

:::info Lab Reference Document
This monitoring plan is a worked example showing how a real Azure Local deployment would be documented. Replace all `[EXAMPLE]` values with your environment-specific details before using in production.
:::

---

## Monitoring Objectives

The lab monitoring strategy delivers the same observability posture as a production deployment, enabling:

- **Cluster health visibility** — Real-time and historical health across all nodes, storage, and network
- **Workload performance** — Per-VM and per-service metrics for workloads running on the cluster
- **Alerting** — Automated notification when critical thresholds are breached
- **Audit trail** — Log retention sufficient for security review and troubleshooting
- **Capacity planning** — Trend data for predicting growth and right-sizing

---

## Environment Scope

| Attribute | Value |
|-----------|-------|
| Environment Name | `[EXAMPLE] azlocal-lab-001` |
| Cluster Node Count | `[EXAMPLE] 3` |
| Azure Region | `[EXAMPLE] australiaeast` |
| Azure Subscription | `[EXAMPLE] AzureLocal-Lab-Sub` |
| Resource Group | `[EXAMPLE] rg-azlocal-lab-monitoring` |
| Log Analytics Workspace | `[EXAMPLE] law-azlocal-lab-001` |
| Retention Period | `[EXAMPLE] 30 days (hot) / 90 days (archive)` |

---

## Monitoring Architecture

The lab uses the same stack documented in the [Monitoring and Observability solution](../operations/monitoring-on-azure-local.md):

```
Cluster Nodes (3x)

    ├──▶ Azure Monitor Agent (AMA)
    │         │
    │         ├──▶ Log Analytics Workspace (law-azlocal-lab-001)
    │         │         └──▶ KQL Queries / Workbooks / Alerts
    │         └──▶ Azure Monitor Metrics

    ├──▶ Prometheus (workload metrics — deployed on AKS Arc)
    │         │
    │         └──▶ Azure Managed Grafana (grafana-azlocal-lab-001)

    └──▶ Azure Monitor Alerts ──▶ Action Group (lab-ops-email)
```

### Components

| Component | Deployment | Notes |
|-----------|-----------|-------|
| Azure Monitor Agent | Arc-enabled VM extension | Replaces legacy MMA/OMS |
| Log Analytics Workspace | Azure-hosted | Centralized for all nodes and workloads |
| Prometheus | AKS Arc cluster | Scrapes node-exporter and kube-state-metrics |
| Azure Managed Grafana | Azure-hosted | Pre-built Azure Local dashboards |
| Azure Monitor Alerts | Azure-hosted | Threshold and anomaly-based rules |

---

## Key Metrics and Thresholds

### Cluster Infrastructure

| Metric | Warning Threshold | Critical Threshold | Collection Interval |
|--------|------------------|--------------------|---------------------|
| Node CPU utilization | 70% | 90% | 60 seconds |
| Node memory utilization | 75% | 90% | 60 seconds |
| Storage pool capacity | 70% | 85% | 5 minutes |
| Storage I/O latency (avg) | 5 ms | 20 ms | 60 seconds |
| Network packet loss | 0.1% | 1% | 60 seconds |
| Node heartbeat | N/A | No heartbeat > 5 min | 1 minute |
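
The warning and critical columns above can be applied directly in a query. The following is a minimal sketch, assuming node CPU is collected into the `Perf` table under the same counter used elsewhere in this plan. The `Status` column name and the 5-minute window are illustrative choices, not part of the plan.

```kusto
// Sketch: classify each node's 5-minute average CPU against the
// 70% (warning) and 90% (critical) thresholds from the table above.
// Assumes CPU counters land in Perf; Status is an illustrative column name.
Perf
| where TimeGenerated > ago(5m)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by Computer
| extend Status = case(
    AvgCPU > 90, "Critical",
    AvgCPU > 70, "Warning",
    "OK")
| order by AvgCPU desc
```

The same `case()` pattern should carry over to the other percentage-based rows by swapping the counter filter and the two threshold values.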

### Workload Metrics

| Metric | Warning Threshold | Critical Threshold | Collection Interval |
|--------|------------------|--------------------|---------------------|
| VM CPU utilization | 80% | 95% | 5 minutes |
| VM memory utilization | 80% | 95% | 5 minutes |
| VM disk read latency | 10 ms | 30 ms | 5 minutes |
| VM disk write latency | 10 ms | 30 ms | 5 minutes |
| AVD session count | 75% of capacity | 95% of capacity | 5 minutes |

### Azure Local-Specific

| Metric | Warning Threshold | Critical Threshold |
|--------|------------------|--------------------|
| Arc connectivity status | N/A | Disconnected > 15 min |
| Storage Spaces Direct (S2D) health | Degraded | Failed |
| Failover Cluster health | N/A | Any node offline |

---

## Log Analytics Configuration

### Data Sources

| Source | Table | Retention |
|--------|-------|-----------|
| Windows Event Logs (System, Application) | `Event` | 30 days |
| Performance Counters | `Perf` | 30 days |
| Azure Activity Log | `AzureActivity` | 90 days |
| Arc Agent Heartbeat | `Heartbeat` | 30 days |
| Custom Syslog (Linux nodes) | `Syslog` | 30 days |
| Security Events | `SecurityEvent` | 90 days |
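
After the data sources are configured, it helps to confirm that each table is actually receiving records. The query below is a sketch under the assumption that all six tables listed above live in the same workspace; `SourceTable`, `Records`, and `LatestRecord` are illustrative column names.

```kusto
// Sketch: count records per table over the last 24 hours.
// isfuzzy=true lets the query run even if a table does not exist in the workspace yet.
union withsource=SourceTable isfuzzy=true
    Event, Perf, AzureActivity, Heartbeat, Syslog, SecurityEvent
| where TimeGenerated > ago(24h)
| summarize Records = count(), LatestRecord = max(TimeGenerated) by SourceTable
| order by SourceTable asc
```

A table that is missing from the result returned no rows in the window, which usually points at a data collection gap rather than a query problem.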

### Key KQL Queries

**Node availability (last 24 hours):**
```kusto
// Most recent heartbeat per node and how many minutes ago it arrived
Heartbeat
| where TimeGenerated > ago(24h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| extend MinutesSince = datetime_diff('minute', now(), LastHeartbeat)
| project Computer, LastHeartbeat, MinutesSince
// Nodes that have been silent the longest appear first
| order by MinutesSince desc
```
141+
142+
**Top CPU consumers (last 1 hour):**
143+
```kusto
144+
Perf
145+
| where TimeGenerated > ago(1h)
146+
| where ObjectName == "Processor" and CounterName == "% Processor Time"
147+
| summarize AvgCPU = avg(CounterValue) by Computer
148+
| order by AvgCPU desc
149+
| take 10
150+
```
151+
152+
**Storage latency trend:**
153+
```kusto
154+
Perf
155+
| where TimeGenerated > ago(6h)
156+
| where ObjectName == "LogicalDisk" and CounterName == "Avg. Disk sec/Read"
157+
| summarize AvgLatencyMs = avg(CounterValue * 1000) by bin(TimeGenerated, 5m), Computer
158+
| render timechart
159+
```
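
The thresholds in this plan cover write latency as well as read latency. As a sketch, the storage query can be widened to both counters, assuming `Avg. Disk sec/Write` is collected alongside `Avg. Disk sec/Read` in the same data collection rule (the plan does not state this explicitly).

```kusto
// Sketch: read and write latency per node in 5-minute bins.
// Assumes both LogicalDisk counters are collected into Perf.
Perf
| where TimeGenerated > ago(6h)
| where ObjectName == "LogicalDisk"
| where CounterName in ("Avg. Disk sec/Read", "Avg. Disk sec/Write")
| summarize AvgLatencyMs = avg(CounterValue) * 1000 by bin(TimeGenerated, 5m), Computer, CounterName
| order by TimeGenerated asc
```

Grouping by `CounterName` keeps the two directions separate so that a write-only regression is not averaged away by healthy reads.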
160+
161+
---
162+
163+
## Alert Configuration
164+
165+
### Action Group
166+
167+
| Setting | Value |
168+
|---------|-------|
169+
| Action Group Name | `[EXAMPLE] ag-azlocal-lab-ops` |
170+
| Email Recipients | `[EXAMPLE] lab-ops@contoso.com` |
171+
| Webhook (optional) | `[EXAMPLE] Teams channel webhook URL` |
172+
| SMS (optional) | `[EXAMPLE] N/A for lab` |
173+
174+
### Alert Rules
175+
176+
| Alert Name | Severity | Condition | Action |
177+
|-----------|----------|-----------|--------|
178+
| Node CPU Critical | Sev 1 | CPU > 90% for 5 min | Email + Teams |
179+
| Node Memory Critical | Sev 1 | Memory > 90% for 5 min | Email + Teams |
180+
| Node Heartbeat Lost | Sev 0 | No heartbeat > 5 min | Email + Teams |
181+
| Storage Pool Warning | Sev 2 | Capacity > 70% | Email |
182+
| Storage Pool Critical | Sev 1 | Capacity > 85% | Email + Teams |
183+
| Storage Latency Critical | Sev 1 | Avg latency > 20ms | Email |
184+
| Arc Connectivity Lost | Sev 1 | Disconnected > 15 min | Email + Teams |
185+
| VM CPU Critical | Sev 2 | VM CPU > 95% for 10 min | Email |
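
Rules such as Node Heartbeat Lost can be backed by a log search query. The sketch below shows one possible shape, assuming the rule is implemented as a log search alert that fires when the query returns any rows. The 1-hour lookback is an assumed window, not a value from this plan.

```kusto
// Sketch: nodes whose most recent heartbeat is older than 5 minutes.
// Nodes silent for longer than the assumed 1-hour lookback drop out of the results.
Heartbeat
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(5m)
| project Computer, LastHeartbeat
```

Each returned row is a node that matches the "No heartbeat > 5 min" condition in the table above.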
186+
187+
---
188+
189+
## Grafana Dashboards
190+
191+
| Dashboard | Source | Purpose |
192+
|-----------|--------|---------|
193+
| Azure Local Cluster Overview | [azurelocal-monitoring repo](https://github.com/AzureLocal/azurelocal-monitoring) | Node health, storage, network |
194+
| AVD Session Monitoring | [azurelocal-avd](https://azurelocal.github.io/azurelocal-avd/) | Session counts, hostpool health |
195+
| VM Performance | Community / Custom | Per-VM CPU, memory, disk |
196+
| Storage Spaces Direct | Microsoft | S2D health, capacity, latency |
197+
198+
Access: `[EXAMPLE] https://grafana-azlocal-lab-001.australiaeast.grafana.azure.com`
199+
200+
---
201+
202+
## Response Procedures
203+
204+
### Severity 0 — Node Down
205+
206+
1. Validate in Azure portal → Azure Arc → Servers
207+
2. Check physical hardware (iDRAC/iLO console)
208+
3. Attempt remote power cycle if hardware permits
209+
4. Escalate to on-site team if remote resolution fails
210+
5. Document incident in [Support Instructions](../operations/support-instructions.mdx)
211+
212+
### Severity 1 — Critical Threshold Breached
213+
214+
1. Acknowledge alert within 15 minutes
215+
2. Open Log Analytics and run relevant KQL query to confirm scope
216+
3. Check workload impact — live migration VMs if node is saturated
217+
4. Review recent changes via Azure Activity Log
218+
5. Document findings in incident log
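
For step 4, recent changes can be pulled from the `AzureActivity` table listed under Data Sources. This is a minimal sketch; the 24-hour window is an assumption, and a resource group or resource ID filter would normally be added to narrow the results to the lab.

```kusto
// Sketch: successful control-plane operations in the last 24 hours, newest first.
// The time window is an assumed value; add a scope filter for your own environment.
AzureActivity
| where TimeGenerated > ago(24h)
| where ActivityStatusValue == "Success"
| project TimeGenerated, Caller, OperationNameValue, ResourceGroup, _ResourceId
| order by TimeGenerated desc
```

Comparing the timestamps here with the alert firing time is a quick way to see whether a configuration change preceded the threshold breach.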
219+
220+
### Severity 2 — Warning Threshold Breached
221+
222+
1. Review trend over last 24 hours in Grafana
223+
2. Determine if threshold is trending toward critical
224+
3. Open capacity review ticket if storage warning persists > 48 hours
225+
4. No immediate escalation required unless trend worsens
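
Step 2 can also be checked in Log Analytics instead of Grafana. The sketch below fits a straight line to each node's hourly average CPU over 24 hours, assuming CPU is the metric under review; the column names on the left of `series_fit_line` are illustrative labels for the values that function returns.

```kusto
// Sketch: per-node linear trend of hourly average CPU over the last 24 hours.
// A positive Slope means the hourly average is rising; RSquare indicates how well the line fits.
Perf
| where TimeGenerated > ago(24h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| make-series AvgCPU = avg(CounterValue) on TimeGenerated from ago(24h) to now() step 1h by Computer
| extend (RSquare, Slope, Variance, RVariance, Interception, LineFit) = series_fit_line(AvgCPU)
| project Computer, Slope, RSquare
| order by Slope desc
```

Hours with no data are filled with zero by `make-series`, which can distort the slope, so treat the result as a rough indicator rather than a forecast.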
226+
227+
---
228+
229+
## Review Cadence
230+
231+
| Review | Frequency | Owner |
232+
|--------|-----------|-------|
233+
| Alert rule tuning | Monthly | Lab Ops |
234+
| Log Analytics cost review | Monthly | Lab Ops |
235+
| Dashboard review | Quarterly | Lab Ops |
236+
| Retention policy review | Quarterly | Lab Ops |
237+
| Monitoring architecture review | Annually | Platform Team |
238+
239+
---
240+
241+
## References
242+
243+
- [Monitoring and Observability on Azure Local](../operations/monitoring-on-azure-local.md) — solution guide
244+
- [Monitor Azure Local with Azure Monitor](https://learn.microsoft.com/en-us/azure/azure-local/manage/monitor-cluster) — Microsoft Learn
245+
- [Azure Monitor Agent overview](https://learn.microsoft.com/en-us/azure/azure-monitor/agents/azure-monitor-agent-overview)
246+
- [azurelocal-monitoring repository](https://github.com/AzureLocal/azurelocal-monitoring) — deployment scripts and dashboards
247+
- [Azure Managed Grafana](https://learn.microsoft.com/en-us/azure/managed-grafana/overview)

docs/operations/as-built.mdx

Lines changed: 0 additions & 38 deletions
This file was deleted.

docs/operations/index.mdx

Lines changed: 1 addition & 3 deletions
@@ -40,8 +40,6 @@ For Azure-native services deployed on Azure Local (AVD, AKS, App Services, SQL M
 
 ## Runbooks and Procedures
 
-- [As-Built Documentation](./as-built.mdx)
-- [Monitoring Plan](./monitoring-plan.mdx)
 - [Support Instructions](./support-instructions.mdx)
-- [Training](./training/)
+- [Training](../training/)
 

docs/operations/monitoring-plan.mdx

Lines changed: 0 additions & 38 deletions
This file was deleted.
File renamed without changes.
File renamed without changes.
Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ description: "Training materials for Azure Local operations"
 
 # Training
 
-[![Reference](https://img.shields.io/badge/Type-Reference-purple?style=flat-square)](../index.mdx)
+[![Reference](https://img.shields.io/badge/Type-Reference-purple?style=flat-square)](./index.mdx)
 [![Azure](https://img.shields.io/badge/Platform-Azure_Local-0078D4?style=flat-square&logo=microsoftazure)](https://learn.microsoft.com/en-us/azure/azure-local/)
 
 > **DOCUMENT CATEGORY**: Reference
