feat: add AKS cost optimization report template (7 scenarios) #2067

Draft
Harshaa wants to merge 1 commit into microsoft:main from Harshaa:feat/aks-cost-optimization-report

Conversation

Harshaa (Contributor) commented Apr 27, 2026

Summary

Add a new AKS-specific cost optimization report template to the azure-cost skill, covering 7 impactful cost scenarios for AKS workloads.

Changes

New file: plugin/skills/azure-cost/cost-optimization/aks-cost-optimization-report.md

A structured report template (sibling to report-template.md) covering:

| # | Scenario |
|---|----------|
| 1 | Overprovisioned Pods |
| 2 | Missing Requests/Limits |
| 3 | Idle Workloads |
| 4 | Namespace Cost Allocation |
| 5 | Node Pool Rightsizing |
| 6 | Cluster Autoscaler Configuration |
| 7 | Spot Node Pool Adoption |

Includes a prerequisite section for the AKS Cost Analysis add-on (required for Scenario 4).

Modified: plugin/skills/azure-cost/cost-optimization/workflow.md

Added reference to aks-cost-optimization-report.md in Step 1.7 alongside existing AKS references.

Add aks-cost-optimization-report.md template covering 8 scenarios:
overprovisioned pods, missing requests/limits, idle workloads,
namespace cost allocation, node pool rightsizing, cluster autoscaler,
spot node pools, and reserved instances.

Update workflow.md Step 1.7 to reference the new template.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 27, 2026 21:36
@Harshaa Harshaa marked this pull request as draft April 27, 2026 21:37
Copilot AI (Contributor) left a comment

Pull request overview

Adds an AKS-specific cost optimization report template to the azure-cost skill so agents can generate a consistent, scenario-driven AKS cost savings report, and wires it into the existing cost-optimization workflow as an AKS reference.

Changes:

  • Added a new aks-cost-optimization-report.md template covering 7 AKS cost scenarios plus prerequisites for the AKS Cost Analysis add-on.
  • Updated the cost optimization workflow to reference the new AKS report template in the AKS-specific analysis step.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| plugin/skills/azure-cost/cost-optimization/workflow.md | Adds a new reference entry to load the AKS cost optimization report template when doing AKS-focused analysis. |
| plugin/skills/azure-cost/cost-optimization/aks-cost-optimization-report.md | Introduces a structured, fill-in report template with scenarios, commands, and a savings summary for AKS cost optimization. |

**Reference files (load only what is needed for the request):**
- [Cost Analysis Add-on](./azure-aks-cost-addon.md) — enable namespace-level cost visibility
- [Anomaly Investigation](./azure-aks-anomalies.md) — cost spikes, scaling events, budget alerts
- [AKS Cost Optimization Report Template](./aks-cost-optimization-report.md) — use when generating an AKS cost report covering overprovisioned pods, node rightsizing, autoscaler, spot nodes, and reserved instances

Copilot AI Apr 27, 2026

The workflow description says this AKS report template covers “reserved instances”, but the linked template doesn’t include a reserved instances scenario/section. Either remove “reserved instances” here or add a dedicated reserved instances recommendation section to the template so the reference text is accurate.

Copilot uses AI. Check for mistakes.
Comment on lines +2 to +5
**Cluster**: <CLUSTER_NAME> | **Resource Group**: <RESOURCE_GROUP>
**Location**: <LOCATION> | **Nodes**: <NODE_COUNT> x <VM_SIZE> | **Tier**: <TIER>
**Generated**: <TIMESTAMP>
**AKS Cost Analysis Add-on**: <ENABLED|DISABLED>

Copilot AI Apr 27, 2026

Placeholder tokens are inconsistent within the template (e.g., header uses <CLUSTER_NAME>/<RESOURCE_GROUP> but later sections use <CLUSTER>/<RG>, and the portal link uses <SUB_ID>). Standardize on one set of placeholder names throughout so users don’t have to translate between formats.
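
One way to catch this kind of drift before review is a small lint pass over the template. The sketch below is illustrative only: the `ALIASES` table and the `find_alias_usage` helper are hypothetical, and the choice of `<CLUSTER_NAME>`/`<RESOURCE_GROUP>` as the canonical names is an assumption based on the template header quoted above, with `<CLUSTER>` and `<RG>` taken from the command excerpts in this PR.

```python
import re
from collections import defaultdict

# Matches template tokens such as <CLUSTER_NAME> or <ENABLED|DISABLED>.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_|]*)>")

# Hypothetical alias table: short form -> canonical name to standardize on.
ALIASES = {"CLUSTER": "CLUSTER_NAME", "RG": "RESOURCE_GROUP"}


def find_alias_usage(text: str) -> dict[str, list[int]]:
    """Return {alias: [1-based line numbers]} for every non-canonical placeholder."""
    hits = defaultdict(list)
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name in PLACEHOLDER.findall(line):
            if name in ALIASES:
                hits[name].append(lineno)
    return dict(hits)


# Two lines modeled on the excerpts quoted in this review.
sample = (
    "**Cluster**: <CLUSTER_NAME> | **Resource Group**: <RESOURCE_GROUP>\n"
    "--cluster-name <CLUSTER> --resource-group <RG> \\\n"
)

for alias, lines in find_alias_usage(sample).items():
    print(f"<{alias}> on lines {lines}; prefer <{ALIASES[alias]}>")
```

Extending `ALIASES` with the other mismatched pairs would let the same check cover the whole template.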

## EXECUTIVE SUMMARY
- Current Monthly Node Cost: $<TOTAL_MONTHLY> (<NODE_COUNT> x $<NODE_PRICE>/mo)
- Scenarios Analyzed: 7
- **Total Potential Savings: $<TOTAL_SAVINGS>/month ($<TOTAL_ANNUAL>/year)**

Copilot AI Apr 27, 2026

The total-savings placeholders are inconsistent between sections (e.g., Executive Summary uses <TOTAL_SAVINGS>/<TOTAL_ANNUAL> while the Total Savings Summary table uses <TOTAL>/<TOTAL_A>). Use the same placeholder names in both places to prevent confusion when filling in the template.

Suggested change
- **Total Potential Savings: $<TOTAL_SAVINGS>/month ($<TOTAL_ANNUAL>/year)**
- **Total Potential Savings: $<TOTAL>/month ($<TOTAL_A>/year)**

Comment on lines +149 to +152
```bash
az aks nodepool update \
--cluster-name <CLUSTER> --resource-group <RG> \
--name agentpool \

Copilot AI Apr 27, 2026

The autoscaler enablement example hard-codes the node pool name as agentpool. AKS clusters often have different pool names and multiple pools; use a placeholder (and/or note that this must be applied per node pool) to avoid users running the command against a non-existent or wrong pool.

Suggested change
```bash
az aks nodepool update \
--cluster-name <CLUSTER> --resource-group <RG> \
--name agentpool \
> Apply this per node pool that should autoscale. Replace `<NODE_POOL_NAME>` with the target pool name.
```bash
az aks nodepool update \
--cluster-name <CLUSTER> --resource-group <RG> \
--name <NODE_POOL_NAME> \

--eviction-policy Delete \
--spot-max-price -1 \
--node-vm-size <VM_SKU> \
--node-count <N>

Copilot AI Apr 27, 2026

The Spot node pool creation command omits labels/taints and autoscaler settings that are used elsewhere in this repo’s AKS Spot guidance (e.g., labeling nodes enables nodeSelector/affinity targeting, and autoscaler min=0/max=N helps avoid paying for idle nodes). Consider aligning this command with plugin/skills/azure-kubernetes/references/azure-aks-spot.md so workloads can reliably target the Spot pool and the pool can scale cost-effectively.

Suggested change
--node-count <N>
--node-count <N> \
--labels kubernetes.azure.com/scalesetpriority=spot workload-type=spot \
--node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule \
--enable-cluster-autoscaler \
--min-count 0 \
--max-count <MAX_NODES>

Comment on lines +184 to +191
Add toleration to eligible workloads:
```yaml
tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
operator: "Equal"
value: "spot"
effect: "NoSchedule"
```

Copilot AI Apr 27, 2026

Only adding a toleration makes a workload eligible to run on Spot nodes, but doesn’t encourage/ensure it will actually land on the Spot pool (it may continue to schedule on regular nodes). Consider adding guidance to use a nodeSelector or node affinity in addition to the toleration (see the spot scheduling patterns in plugin/skills/azure-diagnostics/troubleshooting/aks/spot-and-zone-issues.md and plugin/skills/azure-kubernetes/references/azure-aks-spot.md).
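
To make the distinction concrete, here is a minimal sketch of pairing the toleration with a `nodeSelector`. It operates on a pod spec represented as a plain Python dict; the `target_spot_pool` helper and the sample container are hypothetical, and the label key and value are taken from the label suggested earlier in this review (`kubernetes.azure.com/scalesetpriority=spot`), so they are assumed to be present on the Spot nodes.

```python
import copy
import json

# Key used for both the Spot taint and the Spot node label in this review.
SPOT_KEY = "kubernetes.azure.com/scalesetpriority"


def target_spot_pool(pod_spec: dict) -> dict:
    """Return a copy of pod_spec that tolerates the Spot taint and selects Spot nodes."""
    spec = copy.deepcopy(pod_spec)

    # The toleration only *permits* scheduling onto tainted Spot nodes.
    toleration = {
        "key": SPOT_KEY,
        "operator": "Equal",
        "value": "spot",
        "effect": "NoSchedule",
    }
    tolerations = spec.setdefault("tolerations", [])
    if toleration not in tolerations:
        tolerations.append(toleration)

    # The nodeSelector *requires* a node carrying the Spot label.
    spec.setdefault("nodeSelector", {})[SPOT_KEY] = "spot"
    return spec


# Hypothetical workload used only for illustration.
spec = target_spot_pool({"containers": [{"name": "worker", "image": "example/worker:1"}]})
print(json.dumps(spec, indent=2))
```

A `nodeSelector` is a hard constraint, so pods stay Pending when no Spot node is available; a preferred node affinity is the softer alternative when falling back to regular nodes is acceptable.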

| <NS> | <PCT>% | <PCT>% | $<COST> |

View in Azure Portal:
```

Copilot AI Apr 27, 2026

The Azure Portal URL is in a fenced code block without a language tag. Elsewhere in this directory (and in report-template.md) fenced blocks specify a language like text; consider switching this block to ```text for consistency and better rendering.

Suggested change
```
```text

jongio (Collaborator) left a comment

The Copilot bot already caught the placeholder inconsistencies and reserved instances gap - I won't repeat those. Three additional items:

  1. Missing output path convention - report-template.md starts with a create_file instruction targeting output/costoptimizereport<YYYYMMDD_HHMMSS>.md. This template doesn't follow that pattern, so an agent won't know where to write the output.

  2. Duplication risk - Scenarios 4 and 7 duplicate content from azure-aks-cost-addon.md and azure-aks-spot.md. When those files get updated, this template will drift. Consider cross-referencing instead.

  3. Rightsizing heuristics - The fixed multipliers in Scenario 1 don't account for burst patterns. Details in inline comments.

@@ -0,0 +1,238 @@
# AKS Cost Optimization Report

The sibling report-template.md starts with a create_file instruction specifying where agents should save the output (output/costoptimizereport<YYYYMMDD_HHMMSS>.md). This template doesn't include that pattern. Without it, an agent loading this template has no convention for where to write the generated report. Consider adding a similar create_file header or at minimum a note about the expected output path.
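
For illustration, the naming pattern quoted above could be produced as follows. This is a sketch under assumptions: the `report_path` helper is hypothetical, and whether the AKS report should reuse the sibling template's `costoptimizereport` prefix or get its own is not settled in this PR.

```python
from datetime import datetime
from pathlib import Path


def report_path(now: datetime, output_dir: str = "output") -> Path:
    """Build output/costoptimizereport<YYYYMMDD_HHMMSS>.md for the given timestamp."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / f"costoptimizereport{stamp}.md"


# A fixed timestamp keeps the example deterministic.
print(report_path(datetime(2026, 4, 27, 21, 36, 0)).as_posix())
```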


**Node impact**: <CURRENT_NODES> -> <TARGET_NODES> nodes | Saves **$<SAVINGS>/month**

Rightsizing guidelines:

These multipliers assume steady-state workloads. For bursty services (startup probes, periodic batch processing, traffic spikes), setting requests to actual x1.5 based on recent averages can cause throttling or OOMKills during peaks. Worth adding a caveat such as "For workloads with known burst patterns, use p99 or max metrics instead of averages" so the agent doesn't blindly apply these as universal rules.
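
A small worked example shows how far the two approaches can diverge. The sketch below is not the template's method: `recommend_request` is a hypothetical helper, the 1.5 headroom factor is the multiplier mentioned in this comment, the nearest-rank p99 is one of several reasonable percentile definitions, and the usage samples are invented for illustration.

```python
import statistics


def recommend_request(samples: list[float], bursty: bool, headroom: float = 1.5) -> float:
    """Suggest a resource request from usage samples (e.g. CPU millicores)."""
    if not samples:
        raise ValueError("need at least one usage sample")
    if bursty:
        # Nearest-rank p99: smallest sample with at least 99% of samples at or below it.
        ordered = sorted(samples)
        rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
        return ordered[rank - 1]
    # Steady-state rule: average usage times a fixed headroom factor.
    return statistics.mean(samples) * headroom


# Invented profile: mostly 100m, with 5% of samples bursting to 400m.
usage = [100.0] * 95 + [400.0] * 5

print("average x1.5:", recommend_request(usage, bursty=False))
print("p99:", recommend_request(usage, bursty=True))
```

With this profile the average-based rule suggests 172.5m while the p99 is 400m, so a request sized from the average would sit well below the peaks the service actually reaches.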

# Delete if no longer needed
kubectl delete deployment/<NAME> -n <NS>
```

This scenario overlaps with azure-aks-cost-addon.md in this same directory, which already covers enabling the add-on and checking tier/status. Having the same steps in two places means two things to update when the API changes. Consider keeping just the cost allocation table structure here and referencing the existing file for the enable/check commands.

> Skip this section if no eligible workloads identified.

Spot VMs offer up to 90% discount but can be evicted with 30s notice.
Suitable for: batch jobs, dev/test workloads, stateless tolerant services.

This is a thinner version of the spot node pool guidance in plugin/skills/azure-kubernetes/references/azure-aks-spot.md, which includes workload suitability criteria, mixed pool patterns, PDB guidance, and eviction handling. The bot's comments about missing labels, taints, and nodeSelector all stem from the same root cause: this section duplicates content that exists in richer form elsewhere. Consider referencing that file for implementation details and keeping this scenario focused on the cost data (eligible workloads table + savings estimate).

3 participants