feat: add AKS cost optimization report template (7 scenarios) #2067
Harshaa wants to merge 1 commit into microsoft:main
Add aks-cost-optimization-report.md template covering 8 scenarios: overprovisioned pods, missing requests/limits, idle workloads, namespace cost allocation, node pool rightsizing, cluster autoscaler, spot node pools, and reserved instances. Update workflow.md Step 1.7 to reference the new template.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds an AKS-specific cost optimization report template to the azure-cost skill so agents can generate a consistent, scenario-driven AKS cost savings report, and wires it into the existing cost-optimization workflow as an AKS reference.
Changes:
- Added a new `aks-cost-optimization-report.md` template covering 7 AKS cost scenarios plus prerequisites for the AKS Cost Analysis add-on.
- Updated the cost optimization workflow to reference the new AKS report template in the AKS-specific analysis step.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| plugin/skills/azure-cost/cost-optimization/workflow.md | Adds a new reference entry to load the AKS cost optimization report template when doing AKS-focused analysis. |
| plugin/skills/azure-cost/cost-optimization/aks-cost-optimization-report.md | Introduces a structured, fill-in report template with scenarios, commands, and a savings summary for AKS cost optimization. |
| **Reference files (load only what is needed for the request):** | ||
| - [Cost Analysis Add-on](./azure-aks-cost-addon.md) — enable namespace-level cost visibility | ||
| - [Anomaly Investigation](./azure-aks-anomalies.md) — cost spikes, scaling events, budget alerts | ||
| - [AKS Cost Optimization Report Template](./aks-cost-optimization-report.md) — use when generating an AKS cost report covering overprovisioned pods, node rightsizing, autoscaler, spot nodes, and reserved instances |
The workflow description says this AKS report template covers “reserved instances”, but the linked template doesn’t include a reserved instances scenario/section. Either remove “reserved instances” here or add a dedicated reserved instances recommendation section to the template so the reference text is accurate.
| **Cluster**: <CLUSTER_NAME> | **Resource Group**: <RESOURCE_GROUP> | ||
| **Location**: <LOCATION> | **Nodes**: <NODE_COUNT> x <VM_SIZE> | **Tier**: <TIER> | ||
| **Generated**: <TIMESTAMP> | ||
| **AKS Cost Analysis Add-on**: <ENABLED|DISABLED> |
Placeholder tokens are inconsistent within the template (e.g., the header uses <CLUSTER_NAME>/<RESOURCE_GROUP> but later sections use <CLUSTER>/<RG>, and the portal link uses <SUB_ID>). Standardize on one set of placeholder names throughout so users don’t have to translate between formats.
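One way to spot this kind of drift is to list the distinct placeholder tokens a template uses. The following is a minimal sketch, assuming placeholders are upper-case names wrapped in angle brackets; the `list_placeholders` function name is hypothetical and not part of this repo.

```bash
#!/usr/bin/env bash
# List each distinct <PLACEHOLDER> token read from stdin, one per line.
# Assumption: a placeholder is an upper-case letter followed by upper-case
# letters, digits, underscores, or '|', wrapped in angle brackets,
# e.g. <CLUSTER_NAME> or <ENABLED|DISABLED>.
list_placeholders() {
  grep -oE '<[A-Z][A-Z0-9_|]*>' | LC_ALL=C sort -u
}
```

Running `list_placeholders < aks-cost-optimization-report.md` would then show near-duplicates such as `<CLUSTER>` and `<CLUSTER_NAME>` side by side.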
| ## EXECUTIVE SUMMARY | ||
| - Current Monthly Node Cost: $<TOTAL_MONTHLY> (<NODE_COUNT> x $<NODE_PRICE>/mo) | ||
| - Scenarios Analyzed: 7 | ||
| - **Total Potential Savings: $<TOTAL_SAVINGS>/month ($<TOTAL_ANNUAL>/year)** |
The total-savings placeholders are inconsistent between sections (e.g., the Executive Summary uses <TOTAL_SAVINGS>/<TOTAL_ANNUAL> while the Total Savings Summary table uses <TOTAL>/<TOTAL_A>). Use the same placeholder names in both places to prevent confusion when filling in the template.
| - **Total Potential Savings: $<TOTAL_SAVINGS>/month ($<TOTAL_ANNUAL>/year)** | |
| - **Total Potential Savings: $<TOTAL>/month ($<TOTAL_A>/year)** |
| ```bash | ||
| az aks nodepool update \ | ||
| --cluster-name <CLUSTER> --resource-group <RG> \ | ||
| --name agentpool \ |
The autoscaler enablement example hard-codes the node pool name as agentpool. AKS clusters often have different pool names and multiple pools; use a placeholder (and/or note that this must be applied per node pool) to avoid users running the command against a non-existent or wrong pool.
| ```bash | |
| az aks nodepool update \ | |
| --cluster-name <CLUSTER> --resource-group <RG> \ | |
| --name agentpool \ | |
| > Apply this per node pool that should autoscale. Replace `<NODE_POOL_NAME>` with the target pool name. | |
| ```bash | |
| az aks nodepool update \ | |
| --cluster-name <CLUSTER> --resource-group <RG> \ | |
| --name <NODE_POOL_NAME> \ |
| --eviction-policy Delete \ | ||
| --spot-max-price -1 \ | ||
| --node-vm-size <VM_SKU> \ | ||
| --node-count <N> |
The Spot node pool creation command omits labels/taints and autoscaler settings that are used elsewhere in this repo’s AKS Spot guidance (e.g., labeling nodes enables nodeSelector/affinity targeting, and autoscaler min=0/max=N helps avoid paying for idle nodes). Consider aligning this command with plugin/skills/azure-kubernetes/references/azure-aks-spot.md so workloads can reliably target the Spot pool and the pool can scale cost-effectively.
| --node-count <N> | |
| --node-count <N> \ | |
| --labels kubernetes.azure.com/scalesetpriority=spot workload-type=spot \ | |
| --node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule \ | |
| --enable-cluster-autoscaler \ | |
| --min-count 0 \ | |
| --max-count <MAX_NODES> |
| Add toleration to eligible workloads: | ||
| ```yaml | ||
| tolerations: | ||
| - key: "kubernetes.azure.com/scalesetpriority" | ||
| operator: "Equal" | ||
| value: "spot" | ||
| effect: "NoSchedule" | ||
| ``` |
Only adding a toleration makes a workload eligible to run on Spot nodes, but doesn’t encourage/ensure it will actually land on the Spot pool (it may continue to schedule on regular nodes). Consider adding guidance to use a nodeSelector or node affinity in addition to the toleration (see the spot scheduling patterns in plugin/skills/azure-diagnostics/troubleshooting/aks/spot-and-zone-issues.md and plugin/skills/azure-kubernetes/references/azure-aks-spot.md).
| | <NS> | <PCT>% | <PCT>% | $<COST> | | ||
|
|
||
| View in Azure Portal: | ||
| ``` |
The Azure Portal URL is in a fenced code block without a language tag. Elsewhere in this directory (and in report-template.md) fenced blocks specify a language like text; consider switching this block to ```text for consistency and better rendering.
| ``` | |
| ```text |
jongio left a comment
The Copilot bot already caught the placeholder inconsistencies and reserved instances gap - I won't repeat those. Three additional items:
- Missing output path convention - `report-template.md` starts with a `create_file` instruction targeting `output/costoptimizereport<YYYYMMDD_HHMMSS>.md`. This template doesn't follow that pattern, so an agent won't know where to write the output.
- Duplication risk - Scenarios 4 and 7 duplicate content from `azure-aks-cost-addon.md` and `azure-aks-spot.md`. When those files get updated, this template will drift. Consider cross-referencing instead.
- Rightsizing heuristics - The fixed multipliers in Scenario 1 don't account for burst patterns. Details in inline comments.
| @@ -0,0 +1,238 @@ | |||
| # AKS Cost Optimization Report | |||
The sibling report-template.md starts with a create_file instruction specifying where agents should save the output (output/costoptimizereport<YYYYMMDD_HHMMSS>.md). This template doesn't include that pattern. Without it, an agent loading this template has no convention for where to write the generated report. Consider adding a similar create_file header or at minimum a note about the expected output path.
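For illustration only, the path convention quoted above can be expressed as a small shell helper. This is a sketch: the `report_path` name is hypothetical, and it assumes the AKS report would reuse the same directory and filename prefix as `report-template.md`, which the PR does not confirm.

```bash
#!/usr/bin/env bash
# Print a timestamped report path following the
# output/costoptimizereport<YYYYMMDD_HHMMSS>.md convention.
# Assumption: the AKS report shares the sibling template's directory and prefix.
report_path() {
  printf 'output/costoptimizereport%s.md\n' "$(date +%Y%m%d_%H%M%S)"
}
```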
|
|
||
| **Node impact**: <CURRENT_NODES> -> <TARGET_NODES> nodes | Saves **$<SAVINGS>/month** | ||
|
|
||
| Rightsizing guidelines: |
These multipliers assume steady-state workloads. For bursty services (startup probes, periodic batch processing, traffic spikes), setting requests to actual x1.5 based on recent averages can cause throttling or OOMKills during peaks. Worth adding a caveat, for example "For workloads with known burst patterns, use p99 or max metrics instead of averages", so the agent doesn't blindly apply these as universal rules.
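To make the gap between averages and peaks concrete, here is a minimal sketch of a nearest-rank percentile over usage samples. The `percentile` helper and the sample values are hypothetical; in practice the samples would come from whatever metrics source the agent queries.

```bash
#!/usr/bin/env bash
# Nearest-rank percentile of newline-separated numeric samples on stdin.
# Usage: percentile <P>   (P between 0 and 100, e.g. 99)
percentile() {
  local p="$1"
  sort -n | awk -v p="$p" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1              # no samples: signal failure
      exact = (p * NR) / 100
      rank = int(exact)
      if (rank < exact) rank++         # round up to the next whole rank
      if (rank < 1) rank = 1
      print v[rank]
    }'
}
```

For CPU samples of 100, 120, 110, 900, and 105 millicores, the average is 267 while `percentile 99` returns 900, so a request sized at 1.5x the average would sit well below the observed peak.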
| # Delete if no longer needed | ||
| kubectl delete deployment/<NAME> -n <NS> | ||
| ``` | ||
|
|
This scenario overlaps with azure-aks-cost-addon.md in this same directory, which already covers enabling the add-on and checking tier/status. Having the same steps in two places means two things to update when the API changes. Consider keeping just the cost allocation table structure here and referencing the existing file for the enable/check commands.
| > Skip this section if no eligible workloads identified. | ||
|
|
||
| Spot VMs offer up to 90% discount but can be evicted with 30s notice. | ||
| Suitable for: batch jobs, dev/test workloads, stateless tolerant services. |
This is a thinner version of the spot node pool guidance in plugin/skills/azure-kubernetes/references/azure-aks-spot.md, which includes workload suitability criteria, mixed pool patterns, PDB guidance, and eviction handling. The bot's comments about missing labels, taints, and nodeSelector all stem from the same root cause: this section duplicates content that exists in richer form elsewhere. Consider referencing that file for implementation details and keeping this scenario focused on the cost data (eligible workloads table + savings estimate).
Summary
Add a new AKS-specific cost optimization report template to the `azure-cost` skill, covering 7 impactful cost scenarios for AKS workloads.
Changes
New file: `plugin/skills/azure-cost/cost-optimization/aks-cost-optimization-report.md`
A structured report template (sibling to `report-template.md`) covering:
Includes a prerequisite section for the AKS Cost Analysis add-on (required for Scenario 4).
Modified: `plugin/skills/azure-cost/cost-optimization/workflow.md`
Added reference to `aks-cost-optimization-report.md` in Step 1.7 alongside existing AKS references.