CleanCloud Rules

Complete reference for all 45 rules implemented by CleanCloud (30 hygiene + 15 AI/ML).


Design Principles

All CleanCloud rules follow these principles:

1. Read-Only Always

  • Uses read-only cloud APIs exclusively
  • No Delete*, Modify*, Tag*, or Update* operations
  • Safe for production environments

2. Conservative by Default

  • Multiple signals preferred over single indicators
  • Age-based thresholds prevent false positives on temporary resources
  • Prefer false negatives over false positives

3. Explicit Confidence Levels

Every finding includes a confidence level:

  • HIGH - Multiple strong signals, very likely orphaned
  • MEDIUM - Moderate signals, worth reviewing
  • LOW - Weak signals, informational only

4. Review-Only Recommendations

  • Findings are candidates for human review, not automated action
  • Clear reasoning provided for each finding
  • No rule should justify deletion on its own

Quick Reference

AWS:

Rule ID Cost Surface What It Detects
aws.ec2.instance.stopped Compute EC2 instances stopped 30+ days (EBS charges continue)
aws.ec2.security_group.unused Governance Security groups with no ENI associations
aws.ebs.unattached Storage EBS volumes not attached to any instance
aws.ebs.snapshot.old Storage Snapshots ≥ 90 days old
aws.ec2.ami.old Storage AMIs older than 180 days
aws.ec2.elastic_ip.unattached Network Elastic IPs not currently associated with any instance or network interface
aws.ec2.eni.detached Network Detached ENIs not currently attached
aws.ec2.nat_gateway.idle Network NAT Gateways with zero traffic 14+ days
aws.elbv2.alb.idle / aws.elbv2.nlb.idle / aws.elb.clb.idle Network Load balancers with zero traffic 14+ days
aws.rds.instance.idle Platform RDS instances with zero connections 14+ days
aws.rds.snapshot.old Storage Manual RDS snapshots older than 90 days
aws.cloudwatch.logs.infinite_retention Observability Log groups with no retention policy
aws.resource.untagged Governance EC2/S3/CloudWatch resources with zero tags
aws.sagemaker.endpoint.idle AI/ML Real-time SageMaker endpoints InService with no observed InvokeEndpoint traffic across billable production variants for 14+ days (opt-in: --category ai)
aws.sagemaker.notebook.idle AI/ML SageMaker Notebook Instances InService with stale control-plane timestamps for 14+ days (opt-in: --category ai)
aws.ec2.gpu.idle AI/ML EC2 GPU/accelerator instances (p/g/trn/inf/dl families) running with <5% GPU or <10% CPU utilisation over 7 days (opt-in: --category ai)
aws.bedrock.provisioned_throughput.idle AI/ML Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days — bills per MU per hour regardless of traffic (opt-in: --category ai)
aws.sagemaker.studio_app.idle AI/ML SageMaker Studio KernelGateway/JupyterLab/CodeEditor apps InService with no usable recent activity signal for 7+ days (opt-in: --category ai)
aws.sagemaker.training_job.long_running AI/ML SageMaker training jobs still InProgress beyond the configured threshold (default 24h), using TrainingStartTime when present else CreationTime (opt-in: --category ai)

Azure:

Rule ID Cost Surface What It Detects
azure.vm.stopped_not_deallocated Compute Stopped but not deallocated VMs (full charges)
azure.compute.disk.unattached Storage Managed disks not attached to any VM
azure.compute.snapshot.old Storage Snapshots older than 30–90 days
azure.network.public_ip.unused Network Public IPs not attached to any interface
azure.load_balancer.no_backends Network Standard LBs with zero backend members
azure.application_gateway.no_backends Network App Gateways with zero backend targets
azure.virtual_network_gateway.idle Network VPN/ExpressRoute Gateways with no connections
azure.app_service_plan.empty Platform Paid App Service Plans with zero apps
azure.app_service.idle Platform App Services with zero HTTP requests 14+ days
azure.sql.database.idle Platform Azure SQL databases with zero connections 14+ days
azure.container_registry.unused Platform Container registries with no pulls 90+ days
azure.resource.untagged Governance Disks and snapshots with zero tags
azure.aml.compute.idle AI/ML AML compute clusters with min_node_count > 0 and no active nodes 14+ days (opt-in: --category ai)
azure.ml.compute_instance.idle AI/ML Azure ML Compute Instances Running with no control-plane activity 14+ days (opt-in: --category ai)
azure.ml.online_endpoint.idle AI/ML Azure ML managed online endpoints in Succeeded provisioning state with zero scoring requests for 7+ days (opt-in: --category ai)
azure.ai_search.idle AI/ML Azure AI Search services (Standard tier+) with zero search queries for 30+ days (opt-in: --category ai)
azure.openai.provisioned_deployment.idle AI/ML Azure OpenAI provisioned deployments (PTUs) with zero API requests for 7+ days (default, configurable) (opt-in: --category ai)

GCP:

Rule ID Cost Surface What It Detects
gcp.compute.vm.stopped Compute TERMINATED VM instances stopped 30+ days (disk charges continue)
gcp.compute.disk.unattached Storage Persistent Disks in READY state with no attached VM
gcp.compute.snapshot.old Storage Disk snapshots older than 90 days
gcp.compute.ip.unused Network Reserved static IPs (regional and global) in RESERVED state
gcp.sql.instance.idle Platform Cloud SQL instances with zero connections for 14+ days
gcp.vertex.endpoint.idle AI/ML Vertex AI Online Prediction endpoints with dedicated capacity and zero predictions for 14+ days (--category ai)
gcp.vertex.workbench.idle AI/ML Vertex AI Workbench instances ACTIVE with no control-plane activity for 14+ days (--category ai)
gcp.vertex.training_job.long_running AI/ML Vertex AI CustomJobs and TrainingPipelines in RUNNING state beyond 24h threshold; GPU/TPU/expensive-CPU early warning at 90% of threshold — hung or runaway jobs on GPU-backed machines cost $4–$80+/hr per node (opt-in: --category ai)
gcp.tpu.idle AI/ML Cloud TPU nodes in READY state with near-zero utilization (duty_cycle ≤ 2%) for 7+ days — idle TPU v4 costs ~$12.88/hr, v5p can exceed $33/hr (opt-in: --category ai)
gcp.vertex.featurestore.idle AI/ML Vertex AI Feature Store online stores (legacy and new-gen) with zero ReadFeatureValues requests for 30+ days — Bigtable-backed stores bill ~$197/node/month regardless of utilization (opt-in: --category ai)

AWS Rules

Compute Waste

Stopped EC2 Instances

Rule ID: aws.ec2.instance.stopped

What it detects: EC2 instances in 'stopped' state for 30+ days

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Stop time from CloudTrail LookupEvents ≥ 30 days ago (deterministic timestamp)
  • Not flagged: no CloudTrail stop event found or stopped < 30 days ago

Risk: MEDIUM

Why this matters:

  • Stopped EC2 instances do not charge for compute — but every attached EBS volume continues to accrue storage charges at ~$0.10/GB-month, regardless of instance state
  • A 500 GB root + data volume on a forgotten stopped instance costs ~$50/month indefinitely
  • Any associated Elastic IPs continue to charge ~$0.005/hour while unattached
  • Stopped instances are the most common form of "I meant to clean that up" infrastructure debt

Detection logic:

for instance in describe_instances(state=stopped):
    stop_event = cloudtrail_lookup_events(EventName="StopInstances", instance_id=instance.id)
    # Uses latest StopInstances event after most recent StartInstances (restart-cycle aware)
    if stop_event and (now - stop_event.eventTime).days >= 30:
        confidence = "HIGH"  # Deterministic CloudTrail timestamp, not a heuristic
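
The age check above can be exercised as a small pure function. The name `classify_stopped_instance` and its shape are illustrative, not CleanCloud's actual code; `stop_event_time` stands in for the CloudTrail StopInstances event timestamp.

```python
from datetime import datetime, timedelta, timezone

def classify_stopped_instance(stop_event_time, now=None, threshold_days=30):
    """Return "HIGH" when the CloudTrail stop timestamp is old enough to flag,
    or None when the instance is not flagged (illustrative sketch)."""
    if stop_event_time is None:
        return None  # No StopInstances event found in CloudTrail -> not flagged
    now = now or datetime.now(timezone.utc)
    if (now - stop_event_time).days >= threshold_days:
        return "HIGH"  # Deterministic timestamp, not a heuristic
    return None

# Example: an instance stopped 45 days ago is flagged at HIGH confidence
ref = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(classify_stopped_instance(ref - timedelta(days=45), now=ref))  # HIGH
```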

Cost estimates:

  • Based on total attached EBS storage × $0.10/GB-month
  • Example: 2 × 100 GB volumes = ~$20/month in ongoing storage charges
  • Additional Elastic IP charges are tracked separately by the aws.ec2.elastic_ip.unattached rule

Common causes:

  • Test or dev instances left stopped after a project ended
  • Migration source instances never terminated after cutover
  • Incident response boxes started and never cleaned up
  • Autoscaling warm pools drained but not terminated

Required permissions:

  • ec2:DescribeInstances
  • ec2:DescribeVolumes
  • cloudtrail:LookupEvents

Unused Security Groups

Rule ID: aws.ec2.security_group.unused

What it detects: Security groups not associated with any network interface

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: No ENI associations found (service-managed groups may appear unused between deployments)

Risk: LOW

Why this matters:

  • Security groups with no ENI associations are pure governance debt
  • Each unused group widens the blast radius if a misconfiguration is later introduced
  • Compliance audits (SOC 2, ISO 27001, PCI DSS) flag unused security groups as a control failure
  • In accounts with hundreds of groups, unused ones obscure the real security posture and add friction to every access review
  • Cost is indirect but real: engineer time spent auditing and explaining phantom groups in compliance reviews

Detection logic:

in_use_sg_ids = {
    group["GroupId"]
    for eni in describe_network_interfaces()
    for group in eni["Groups"]
}
for sg in describe_security_groups():
    if sg.name != "default" and sg.id not in in_use_sg_ids:
        confidence = "MEDIUM"
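
A self-contained sketch of the same set logic, operating on plain dicts shaped like the DescribeSecurityGroups and DescribeNetworkInterfaces responses (the function name is illustrative):

```python
def find_unused_security_groups(security_groups, network_interfaces):
    """Return IDs of non-default security groups with no ENI association
    (illustrative re-implementation over plain dicts)."""
    in_use_sg_ids = {
        group["GroupId"]
        for eni in network_interfaces
        for group in eni["Groups"]
    }
    return [
        sg["GroupId"]
        for sg in security_groups
        if sg["GroupName"] != "default" and sg["GroupId"] not in in_use_sg_ids
    ]

sgs = [
    {"GroupId": "sg-1", "GroupName": "default"},
    {"GroupId": "sg-2", "GroupName": "web"},
    {"GroupId": "sg-3", "GroupName": "old-test"},
]
enis = [{"Groups": [{"GroupId": "sg-2"}]}]
print(find_unused_security_groups(sgs, enis))  # ['sg-3']
```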

Exclusions:

  • default security groups — AWS prevents deletion of the default group; flagging it is noise

Caveats:

  • A security group referenced only in another group's inbound rules (not attached to any ENI) will be flagged. This is intentional.
  • Service-managed groups (RDS, ELB, Lambda) may appear unused briefly between deployments. Review before deleting.

Common causes:

  • Leftover groups from deleted EC2 instances, RDS databases, or ELB deployments
  • Test stacks torn down without full cleanup
  • Groups created manually but never attached
  • CloudFormation stacks deleted leaving orphaned groups

Required permissions:

  • ec2:DescribeSecurityGroups
  • ec2:DescribeNetworkInterfaces

Storage Waste

Unattached EBS Volumes

Rule ID: aws.ebs.unattached

What it detects: EBS volumes not attached to any EC2 instance

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Volume in available state for ≥7 days (not attached to any instance)
  • Not flagged: < 7 days

Why this threshold:

  • Allows time for deployment cycles
  • Accounts for rollback windows
  • Reduces false positives from autoscaling

Common causes:

  • Volumes from terminated EC2 instances
  • Failed deployments or rollbacks
  • Autoscaling cleanup gaps

Required permission: ec2:DescribeVolumes


Old EBS Snapshots

Rule ID: aws.ebs.snapshot.old

What it detects: Snapshots ≥ 90 days old (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • LOW: Age ≥ 90 days (conservative — age alone is a weak signal)

Detection logic:

for snapshot in describe_snapshots(OwnerIds=["self"]):
    age_days = (now - snapshot.StartTime).days
    if age_days >= days_old:  # default 90
        confidence = "LOW"  # age alone is a weak signal
        risk = "LOW"

Limitations:

  • Snapshots linked to registered AMIs are excluded (avoids false positives)
  • Does NOT verify snapshot is unused (conservative approach)

Common causes:

  • Backup retention policies without lifecycle rules
  • Snapshots from deleted volumes
  • Over-retention without cleanup

Required permissions:

  • ec2:DescribeSnapshots
  • ec2:DescribeSnapshotAttribute

Old AMIs

Rule ID: aws.ec2.ami.old

What it detects: AMIs (Amazon Machine Images) older than 180 days (default threshold)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Age ≥ 180 days (AMI may still be actively used as template)

Why MEDIUM confidence:

  • Age alone is a moderate signal
  • AMI may be a golden image still used for launches
  • Cannot check if AMI is referenced by launch templates or Auto Scaling groups

Why this matters:

  • AMIs have associated EBS snapshots that incur storage costs
  • Old unused AMIs accumulate over time
  • Storage costs are ~$0.05/GB-month

Detection logic:

for ami in describe_images(Owners=["self"]):
    age_days = (now - ami.creation_date).days
    if age_days >= days_old and ami.state == "available":  # days_old defaults to 180
        confidence = "MEDIUM"  # flag as old AMI

What gets checked:

  • AMI creation date
  • AMI state (only "available" AMIs are flagged)
  • Associated snapshot sizes for cost estimation

Common causes:

  • AMIs from old deployments
  • Test/dev AMIs no longer needed
  • Superseded golden images
  • AMIs from terminated projects

Cost estimates:

  • Based on total EBS snapshot storage
  • ~$0.05/GB-month for snapshot storage
  • Example: 100 GB AMI = ~$5/month

Required permission: ec2:DescribeImages


Network Waste

Unattached Elastic IPs

Rule ID: aws.ec2.elastic_ip.unattached

What it detects: Elastic IPs currently not associated with any instance or network interface

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Currently not associated (all four AWS association fields absent per DescribeAddresses)

Why this matters:

  • Unattached Elastic IPs incur small hourly charges
  • State is deterministic (no AssociationId, InstanceId, NetworkInterfaceId, or PrivateIpAddress means not attached)
  • Clear cost optimization signal with zero ambiguity

Detection logic:

if not any([eip.get("AssociationId"), eip.get("InstanceId"),
            eip.get("NetworkInterfaceId"), eip.get("PrivateIpAddress")]):
    confidence = "HIGH"  # Deterministic state: not associated
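
The four-field check can be captured as a tiny helper over a DescribeAddresses entry (helper and constant names are illustrative):

```python
# The four association fields AWS returns for an attached Elastic IP
ASSOCIATION_FIELDS = ("AssociationId", "InstanceId",
                      "NetworkInterfaceId", "PrivateIpAddress")

def is_unattached_eip(address):
    """True when all four association fields are absent or empty in a
    DescribeAddresses entry (illustrative helper)."""
    return not any(address.get(field) for field in ASSOCIATION_FIELDS)
```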

Common causes:

  • Elastic IPs from terminated EC2 instances
  • Reserved IPs for DR that are no longer needed
  • Failed deployments leaving orphaned IPs
  • Manual allocation without attachment

Edge cases handled:

  • Classic EIPs without AllocationTime are annotated as is_classic: true in details
  • Detection is purely state-based — no age threshold is applied

Required permission: ec2:DescribeAddresses


Detached Network Interfaces (ENIs)

Rule ID: aws.ec2.eni.detached

What it detects: Elastic Network Interfaces (ENIs) currently not attached (Status=available)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Currently not attached — no temporal threshold; Status=available is the sole eligibility signal

Why this matters:

  • Detached ENIs incur small hourly charges
  • Often forgotten after failed deployments or incomplete teardowns
  • Clear signal with minimal ambiguity

Detection logic:

if eni['Status'] == 'available':  # Currently detached
    confidence = "HIGH"  # Deterministic state: not attached

What gets flagged:

  • User-created ENIs (InterfaceType='interface')
  • Lambda/ECS/RDS ENIs (RequesterManaged=true but YOUR resources!) - explicitly annotated in evidence and details
  • Detached ENIs from deleted services

Key insight: RequesterManaged=true means "AWS created this in YOUR VPC for YOUR resource" — these ARE your responsibility and are often waste. RequesterManaged ENIs are included in findings with an explicit evidence signal and requester_managed: true in details for downstream filtering.

Common causes:

  • Failed EC2 instance launches
  • Incomplete infrastructure teardown
  • Terminated instances with retained ENIs
  • Forgotten manual ENI creations

Required permission: ec2:DescribeNetworkInterfaces


Idle NAT Gateways

Rule ID: aws.ec2.nat_gateway.idle

What it detects: NAT Gateways with zero traffic for 14+ days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: No traffic detected for 14+ days (CloudWatch metrics checked, but seasonal patterns not verified)

Why MEDIUM confidence:

  • Zero traffic is a strong signal, but gateway may be for DR/standby
  • Cannot verify planned future usage or blue/green deployments
  • Seasonal traffic patterns not checked

Why this matters:

  • NAT Gateways cost $0.045/hour + $0.045/GB data processing ($32/month base)
  • Idle gateways are a clear cost optimization signal
  • Common after VPC restructuring or service migrations

Detection logic:

for gw in describe_nat_gateways():
    if gw.state == "available" and age >= idle_threshold_days:
        # All 5 metrics must return datapoints and all must be zero;
        # if any metric has no datapoints, the item is skipped
        for metric in required_metrics:
            value = get_metric(metric, period=idle_threshold_days)
            if value is None:
                skip  # Missing data is NOT treated as zero traffic
            if value > 0:
                skip  # Active traffic detected
        confidence = "MEDIUM"  # Zero traffic across all 5 metrics

CloudWatch metrics checked:

  • AWS/NATGateway -> BytesOutToDestination (daily sum)
  • AWS/NATGateway -> BytesInFromSource (daily sum)
  • AWS/NATGateway -> BytesInFromDestination (daily sum)
  • AWS/NATGateway -> BytesOutToSource (daily sum)
  • AWS/NATGateway -> ActiveConnectionCount (daily sum)

Note: If any metric has no data for the period (e.g. newly created gateway), the item is skipped — missing data is NOT treated as zero traffic.
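
Putting the skip rules together, a sketch of the decision over already-fetched metric sums. The metric names match the AWS/NATGateway namespace; the function itself is illustrative and assigns the MEDIUM confidence documented for this rule.

```python
REQUIRED_METRICS = (
    "BytesOutToDestination", "BytesInFromSource", "BytesInFromDestination",
    "BytesOutToSource", "ActiveConnectionCount",
)

def classify_nat_gateway(metric_sums):
    """metric_sums maps each AWS/NATGateway metric to its summed value over
    the idle window, or None when CloudWatch returned no datapoints.
    Returns the finding confidence or None (skipped). Illustrative sketch."""
    values = [metric_sums.get(m) for m in REQUIRED_METRICS]
    if any(v is None for v in values):
        return None  # Missing data is NOT treated as zero traffic
    if any(v > 0 for v in values):
        return None  # Active traffic detected
    return "MEDIUM"
```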

Common causes:

  • VPC restructuring leaving orphaned NAT Gateways
  • Service migrations to different subnets/VPCs
  • Dev/staging environments with no active workloads
  • DR standby gateways (intentional, but worth reviewing)

Cost estimates:

  • ~$32/month base cost per idle NAT Gateway
  • Additional $0.045/GB data processing when active

Required permissions:

  • ec2:DescribeNatGateways
  • cloudwatch:GetMetricStatistics

Idle Elastic Load Balancers (ALB/CLB/NLB)

Rule IDs:

  • aws.elbv2.alb.idle — Application Load Balancer
  • aws.elbv2.nlb.idle — Network Load Balancer
  • aws.elb.clb.idle — Classic Load Balancer

What it detects: Load balancers with zero traffic for 14+ days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Zero traffic AND no registered targets/instances
  • MEDIUM: Zero traffic only (targets/instances may still be registered)

Risk: MEDIUM

Why this matters:

  • ELBs incur base hourly charges regardless of traffic (~$16-22/month)
  • Idle load balancers are a clear cost optimization signal
  • Common after service migrations or decommissions

Detection logic:

# ALB/NLB (elbv2)
for lb in describe_load_balancers():
    if age >= idle_threshold_days:
        traffic = get_metric(RequestCount or NewFlowCount, period=idle_threshold_days)
        has_targets = check_target_groups(lb)
        if traffic == 0:
            confidence = "HIGH" if not has_targets else "MEDIUM"

# CLB (elb)
for lb in describe_load_balancers():
    if age >= idle_threshold_days:
        traffic = get_metric(RequestCount, period=idle_threshold_days)
        has_instances = len(lb.instances) > 0
        if traffic == 0:
            confidence = "HIGH" if not has_instances else "MEDIUM"
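
The confidence split can be sketched as a single helper shared by all three load balancer types (illustrative; the treatment of missing datapoints is an assumption, since the text does not spell that case out):

```python
def classify_load_balancer(traffic_sum, has_registered_targets):
    """traffic_sum: summed RequestCount (ALB/CLB) or NewFlowCount (NLB) over
    the idle window; None means CloudWatch returned no datapoints, which is
    treated here as "no usable signal" (an assumption). Returns the finding
    confidence or None (not flagged). Illustrative sketch."""
    if traffic_sum is None or traffic_sum > 0:
        return None  # No usable signal, or real traffic observed
    return "MEDIUM" if has_registered_targets else "HIGH"
```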

CloudWatch metrics checked:

  • AWS/ApplicationELB -> RequestCount (ALB, daily sum)
  • AWS/NetworkELB -> NewFlowCount (NLB, daily sum)
  • AWS/ELB -> RequestCount (CLB, daily sum)

Exclusions:

  • LBs younger than the idle threshold

Common causes:

  • Service migrations leaving orphaned load balancers
  • Dev/staging environments with no active workloads
  • Decommissioned applications with retained infrastructure
  • Blue/green deployments with stale LBs

Cost estimates:

  • ~$16-22/month base cost per idle load balancer (region dependent)

Required permissions:

  • elasticloadbalancing:DescribeLoadBalancers
  • elasticloadbalancing:DescribeTargetGroups
  • elasticloadbalancing:DescribeTargetHealth
  • cloudwatch:GetMetricStatistics

Platform Waste

Idle RDS Instances

Rule ID: aws.rds.instance.idle

What it detects: RDS instances with zero database connections for 14+ days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Zero connections for 14+ days (CloudWatch metrics checked, strong but not conclusive signal)

Why MEDIUM confidence:

  • Zero database connections is a strong signal of non-use, but cannot rule out Aurora-style architectures or scheduled workloads that connect infrequently
  • Connection pools and proxies (RDS Proxy, PgBouncer) can hide real usage while keeping observed client connection counts low or zero

Risk: MEDIUM

Why MEDIUM risk:

  • RDS instances are among the more expensive AWS resources, but zero connections alone does not confirm the instance is safe to delete

Why this matters:

  • RDS instances incur hourly charges regardless of usage
  • Idle instances with no connections are a clear cost optimization signal
  • Common after application migrations or decommissions

Detection logic:

for instance in describe_db_instances():
    if instance.status == "available" and age >= idle_threshold_days:
        if not instance.read_replica_source:  # Skip read replicas
            connections_max = get_metric(DatabaseConnections, statistic="Maximum", period=idle_threshold_days)
            if connections_max == 0:
                confidence = "MEDIUM"
                risk = "MEDIUM"

CloudWatch metrics checked:

  • AWS/RDS -> DatabaseConnections (Maximum statistic)

Exclusions:

  • Aurora cluster members (DBClusterIdentifier set) — Aurora instances are managed at cluster level and may show zero connections individually even when the cluster is active
  • Read replicas (ReadReplicaSourceDBInstanceIdentifier set)
  • Instances younger than the idle threshold

Common causes:

  • Applications migrated to different databases
  • Dev/staging instances left running
  • Decommissioned services with retained databases
  • Test databases no longer needed

Required permissions:

  • rds:DescribeDBInstances
  • cloudwatch:GetMetricStatistics

Old Manual RDS Snapshots

Rule ID: aws.rds.snapshot.old

What it detects: Manual RDS snapshots older than 90 days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • LOW: Snapshot age is known and exceeds threshold (age alone is a weak signal)

Risk: LOW

Why this matters:

  • Manual RDS snapshots are retained indefinitely until explicitly deleted
  • Storage charges accrue at ~$0.095/GB-month regardless of whether the source DB still exists
  • Snapshots older than 90 days are rarely needed for active recovery

Detection logic:

for snapshot in describe_db_snapshots(SnapshotType="manual"):
    if snapshot.status == "available":
        age_days = (now - snapshot.create_time).days
        if age_days >= days_old:
            confidence = "LOW"
            risk = "LOW"

Exclusions:

  • Automated snapshots (SnapshotType=automated) — managed by RDS retention policy, auto-deleted
  • Snapshots in non-available states

Common causes:

  • Pre-migration snapshots never cleaned up
  • Manual backups taken before schema changes and forgotten
  • Snapshots of deleted databases retained for compliance but past their useful life

Cost estimate: ~$0.095/GB-month based on AllocatedStorage (the provisioned DB size). RDS snapshots are incremental so actual storage used may be lower — treat this as a ceiling estimate, not an exact figure.
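
The ceiling estimate reduces to one multiplication (helper name is illustrative; the rate is the ~us-east-1 figure used above):

```python
def rds_snapshot_cost_ceiling(allocated_storage_gb, rate_per_gb_month=0.095):
    """Monthly ceiling estimate from AllocatedStorage. Snapshots are
    incremental, so actual billed storage can be much lower (illustrative)."""
    return round(allocated_storage_gb * rate_per_gb_month, 2)

# Example: a snapshot of a 200 GB instance is at most ~$19/month
print(rds_snapshot_cost_ceiling(200))  # 19.0
```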

Required permissions:

  • rds:DescribeDBSnapshots
  • rds:DescribeDBSnapshotAttributes

Observability Waste

CloudWatch Log Groups (Infinite Retention)

Rule ID: aws.cloudwatch.logs.infinite_retention

What it detects: Log groups with no retention policy (never expires)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: No retention policy configured (directly observable configuration fact)

Risk tiers:

  • HIGH: Log group has ≥1 GB stored bytes (significant ongoing cost)
  • MEDIUM: Log group has >0 stored bytes
  • LOW: Log group has 0 stored bytes (still flagged — retention should be set regardless)
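
The tiers map directly onto a log group's storedBytes (illustrative helper; treating the "1 GB" cutoff as 2**30 bytes is an assumption):

```python
GIB = 1024 ** 3  # "1 GB" cutoff interpreted as 2**30 bytes (assumption)

def log_group_risk(stored_bytes):
    """Map a log group's storedBytes to the rule's risk tier (illustrative)."""
    if stored_bytes >= GIB:
        return "HIGH"
    if stored_bytes > 0:
        return "MEDIUM"
    return "LOW"  # Still flagged: retention should be set regardless
```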

Why this matters:

  • Logs grow indefinitely without retention
  • Can reach GBs/TBs over months
  • Often forgotten after service decommission

Common causes:

  • Default CloudFormation behavior (no retention)
  • Manual log group creation
  • Missing lifecycle policies

Required permission: logs:DescribeLogGroups


Governance

Untagged Resources

Rule ID: aws.resource.untagged

What it detects: Resources with zero tags

Resources checked:

  • EBS volumes
  • S3 buckets
  • CloudWatch log groups

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Zero tags (directly observable fact from authoritative tag source)

Why this matters:

  • Ownership ambiguity
  • Compliance violations (SOC2, ISO27001)
  • Cleanup decision paralysis

Required permissions:

  • ec2:DescribeVolumes
  • s3:ListAllMyBuckets
  • s3:GetBucketTagging
  • logs:DescribeLogGroups
  • logs:ListTagsForResource

AI/ML Waste

Idle SageMaker Endpoints

Rule ID: aws.sagemaker.endpoint.idle

Category: ai

What it detects: Real-time SageMaker endpoints in InService state with no observed InvokeEndpoint traffic across billable production variants for 14+ days (default, configurable). Async endpoints are excluded. Serverless variants without current provisioned concurrency are not treated as continuous idle-cost candidates.

Confidence:

  • HIGH: All evaluated billable variants returned datapoints and zero summed invocations over the observation window
  • MEDIUM: At least one evaluated billable variant returned no CloudWatch datapoints, but no billable variant showed positive invocation traffic
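
A sketch of how the two confidence levels fall out of per-variant invocation sums (illustrative, not CleanCloud's actual code):

```python
def classify_endpoint(variant_invocations):
    """variant_invocations maps each billable production variant to its summed
    AWS/SageMaker Invocations over the window, or None when CloudWatch
    returned no datapoints. Returns the finding confidence or None (not
    idle). Illustrative sketch."""
    values = list(variant_invocations.values())
    if any(v is not None and v > 0 for v in values):
        return None  # Positive traffic on a billable variant -> not idle
    if all(v is not None for v in values):
        return "HIGH"  # Every variant confirmed at zero invocations
    return "MEDIUM"  # Some variant had no datapoints, none showed traffic
```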

Risk:

  • HIGH: Any billable variant is accelerator-backed (ml.g*, ml.p*, ml.inf*, ml.trn*)
  • MEDIUM: All billable variants are CPU-backed

Why this matters:

  • SageMaker endpoints accrue charges continuously while InService, regardless of traffic
  • Endpoints deployed for experiments or demos are frequently abandoned after initial testing
  • Multi-variant endpoints multiply the cost per variant

Detection signal:

  • Inventory comes from ListEndpoints(StatusEquals="InService")
  • Runtime variants come from DescribeEndpoint.ProductionVariants
  • Async inference is excluded via DescribeEndpointConfig.AsyncInferenceConfig
  • Activity is evaluated from AWS/SageMaker Invocations using EndpointName + VariantName
  • estimated_monthly_cost_usd is intentionally left unset by this rule

Required permissions:

  • sagemaker:ListEndpoints
  • sagemaker:DescribeEndpoint
  • sagemaker:DescribeEndpointConfig
  • cloudwatch:GetMetricStatistics

Not run by default. AI/ML rules are opt-in to avoid surprising users who don't use these services. Run with cleancloud scan --provider aws --category ai (or --category all to combine with hygiene rules). Validate access first with cleancloud doctor --provider aws --category ai. Attach security/aws/ai-readonly.json to your IAM role to enable this rule.


Idle SageMaker Notebook Instances

Rule ID: aws.sagemaker.notebook.idle

Category: ai

What it detects: SageMaker Notebook Instances in InService state whose CreationTime and LastModifiedTime are both at least 14 days old (default, configurable). This is a conservative stale control-plane heuristic, not a direct notebook-usage signal.

Detection signal — why LastModifiedTime: SageMaker Notebook Instances publish no native notebook-session activity metric that this rule could consume. LastModifiedTime is the only canonical control-plane timestamp available, but it is a weak signal: it does not directly indicate Jupyter usage, kernel execution, or user access. The rule therefore emits only MEDIUM-confidence review candidates.

Confidence:

  • MEDIUM: notebook age and stale control-plane age both meet or exceed the configured threshold

Risk:

  • HIGH: GPU/accelerator-backed instance (ml.g4dn.*, ml.g5.*, ml.p3.*, ml.p4d.*, ml.p4de.*, ml.p5.*, Inferentia, Trainium)
  • MEDIUM: CPU-backed instance

Why this matters:

  • Notebook Instances bill continuously while InService, regardless of whether any kernels are running
  • Notebooks are commonly left running after a sprint ends, a project is deprioritised, or a team member leaves
  • Unlike endpoints, notebooks have no auto-scaling — they remain billable until explicitly stopped

Important scope note:

  • Stopped notebook instances are intentionally out of scope for this rule
  • Their retained storage cost should be handled by a separate storage / cost-waste rule
  • estimated_monthly_cost_usd is intentionally left unset by this rule

Required permissions:

  • sagemaker:ListNotebookInstances

Not run by default. AI/ML rules are opt-in to avoid surprising users who don't use these services. Run with cleancloud scan --provider aws --category ai (or --category all to combine with hygiene rules). Validate access first with cleancloud doctor --provider aws --category ai. Attach security/aws/ai-readonly.json to your IAM role to enable this rule.


Idle EC2 GPU Instances

Rule ID: aws.ec2.gpu.idle

Category: ai

What it detects: EC2 GPU and accelerator instances (p2/p3/p4/p5, g4dn/g4ad/g5/g5g/g6/g6e/gr6, trn1/trn2, inf1/inf2, dl1/dl2q families) in running state with low utilisation over 7+ days (default, configurable). Unlike SageMaker rules which target managed services, this rule catches raw GPU instances spun up directly for training, inference, or experimentation and left running after the job completes.

Detection uses two tiers based on metric availability:

  • GPU utilisation (HIGH confidence): When the NVIDIA CloudWatch agent is installed, nvidia_smi_utilization_gpu is read from the CWAgent namespace. MAX statistic across all GPU indices is used — a single active GPU on a multi-GPU instance (e.g., p4d.24xlarge with 8 A100s) will not be masked by averaging.
  • CPU utilisation fallback (MEDIUM confidence): When the NVIDIA agent is not installed, CPUUtilization from AWS/EC2 is used as a proxy signal. Neuron instances (Trainium/Inferentia) always use this path by design — they use the AWS Neuron SDK, not NVIDIA CUDA.

Confidence levels:

  • HIGH: GPU metric available AND max GPU utilisation < 5% over 7 days
  • MEDIUM: GPU metric unavailable; avg CPU utilisation < 10% over 7 days
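
The two-tier selection can be sketched as a single function (illustrative; thresholds match the rule's configurable defaults):

```python
def classify_gpu_instance(gpu_util_max, cpu_util_avg,
                          gpu_threshold=5.0, cpu_threshold=10.0):
    """gpu_util_max: MAX of nvidia_smi_utilization_gpu across GPU indices, or
    None when the NVIDIA CloudWatch agent is absent (Neuron instances always
    take the CPU fallback). Returns the finding confidence or None (not
    flagged). Illustrative sketch."""
    if gpu_util_max is not None:
        # GPU metric available: decide on it alone, no CPU fallback
        return "HIGH" if gpu_util_max < gpu_threshold else None
    if cpu_util_avg is not None and cpu_util_avg < cpu_threshold:
        return "MEDIUM"  # CPU proxy signal only
    return None
```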

Risk levels:

  • CRITICAL: idle_ratio ≥ 2.0 (e.g. running for 14+ days at the 7-day threshold)
  • HIGH: GPU/accelerator instance with low utilisation (all other cases)

Cost estimates (us-east-1 on-demand):

Instance Est. monthly cost
g4dn.xlarge (T4) $379
g5.xlarge (A10G) $604
p3.2xlarge (V100) $2,234
p4d.24xlarge (8× A100 40GB) $23,374
p4de.24xlarge (8× A100 80GB) $32,074
g6e.48xlarge (8× L40S) $18,000
p5.48xlarge (8× H100) $98,318
trn2.48xlarge (Trainium2) $110,000

Configurable parameters:

Parameter Default Description
idle_days 7 Days of low utilisation before flagging
gpu_threshold 5.0 Max GPU utilisation % (HIGH confidence path)
cpu_threshold 10.0 Max CPU utilisation % (MEDIUM confidence fallback)

Required permissions:

  • ec2:DescribeInstances
  • cloudwatch:GetMetricStatistics
  • cloudwatch:ListMetrics

Not run by default. Run with cleancloud scan --provider aws --category ai. Attach security/aws/ai-readonly.json to your IAM role to enable this rule. The NVIDIA CloudWatch agent is not required — instances without it fall back to CPU utilisation at MEDIUM confidence.


Idle Bedrock Provisioned Throughput

Rule ID: aws.bedrock.provisioned_throughput.idle

Category: ai

What it detects: AWS Bedrock Provisioned Throughput reservations (Model Units) in InService state with zero invocations over 7+ days (default, configurable). Provisioned Throughput reserves dedicated model capacity and bills per Model Unit per hour regardless of whether any inference requests are made — up to ~$7,300/MU/month for Claude 3 Opus on no-commitment pricing. A zero-invocation reservation is paying for capacity delivering zero value.

Confidence:

  • HIGH: Zero invocations confirmed for the full idle window (deployment age ≥ idle_days)

Risk:

  • HIGH: All provisioned throughput reservations (significant always-on spend)

Why this matters:

  • Provisioned Throughput bills per Model Unit per hour while InService, regardless of invocation count
  • Claude 3 Opus: ~$7,300/MU/month; Claude 3 Sonnet / 3.5 Sonnet: ~$2,600/MU/month; Claude 3 Haiku: ~$600/MU/month (no-commitment pricing — reserved terms are 25–60% lower but still significant)
  • Abandoned proof-of-concept and experiment reservations are common — teams switch to on-demand after initial testing but forget to delete the provisioned throughput

Cost estimates (per Model Unit, us-east-1, no-commitment):

Model family Monthly cost per MU
Claude 3 Opus ~$7,300
Claude 3 Sonnet / 3.5 Sonnet ~$2,600
Claude 3 Haiku / 3.5 Haiku ~$600
Meta Llama 3 ~$1,000

Multiply by desiredModelUnits for total monthly idle cost.

Configurable parameters:

Parameter Default Description
idle_days 7 Days of zero invocations before flagging

Required permissions:

  • bedrock:ListProvisionedModelThroughputs
  • cloudwatch:GetMetricStatistics

Not run by default. Run with cleancloud scan --provider aws --category ai. Attach security/aws/ai-readonly.json alongside base-readonly.json to your IAM role to enable this rule.
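The per-MU multiplication and the zero-invocation gate can be sketched as follows. The dictionary keys are illustrative family labels (not real Bedrock model IDs), the costs are the approximate no-commitment figures from the table above, and the function is a sketch rather than CleanCloud's implementation:

```python
# Approximate monthly cost per Model Unit (us-east-1, no-commitment).
# Keys are illustrative labels, not actual Bedrock model identifiers.
MONTHLY_COST_PER_MU = {
    "claude-3-opus": 7300,
    "claude-3-sonnet": 2600,
    "claude-3-haiku": 600,
    "llama-3": 1000,
}

def idle_pt_finding(model_family, desired_model_units, invocations, age_days,
                    idle_days=7):
    """Flag an InService reservation with zero invocations.

    Returns (confidence, estimated_monthly_cost_usd) or None."""
    if invocations != 0 or age_days < idle_days:
        return None  # traffic seen, or idle window not yet fully covered
    cost = MONTHLY_COST_PER_MU.get(model_family, 0) * desired_model_units
    return "HIGH", cost
```

A 2-MU Opus reservation that is 30 days old with zero invocations would be flagged at ~$14,600/month of idle spend.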


Idle SageMaker Studio Apps

Rule ID: aws.sagemaker.studio_app.idle

Category: ai

What it detects: SageMaker Studio apps of type KernelGateway, JupyterLab, or CodeEditor in InService state with no usable recent activity signal for 7+ days (default, configurable). Other app types, including JupyterServer, are excluded from evaluation.

Detection signal: LastUserActivityTimestamp from sagemaker:DescribeApp, but only when it is usable. AWS documents that health checks can also update LastUserActivityTimestamp; if it exactly matches LastHealthCheckTimestamp, the app is skipped and not treated as idle.

Confidence:

  • HIGH: usable_activity_signal = true and the last usable activity timestamp is at least the configured threshold old

Risk:

  • HIGH: GPU/accelerator instance (ml.g*, ml.p*, ml.inf*, ml.trn*)
  • MEDIUM: CPU instance

GPU families: ml.g4dn, ml.g5, ml.p2, ml.p3, ml.p4d, ml.p4de, ml.p5, ml.trn1, ml.inf1, ml.inf2

Why this matters:

  • Studio apps remain InService (and billing) until explicitly deleted — there is no auto-stop by default
  • KernelGateway, JupyterLab, and CodeEditor apps each launch a separate compute instance per user session or space
  • Teams frequently leave apps running after finishing a sprint, switching to a new space, or abandoning a project
  • estimated_monthly_cost_usd is intentionally left unset by this rule

Configurable parameters:

Parameter Default Description
idle_days_threshold 7 Days since the last usable activity timestamp before flagging

Required permissions:

  • sagemaker:ListApps
  • sagemaker:DescribeApp

Not run by default. Run with cleancloud scan --provider aws --category ai. Validate access first with cleancloud doctor --provider aws --category ai. Attach security/aws/ai-readonly.json alongside base-readonly.json to your IAM role to enable this rule.
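The health-check disambiguation described above can be sketched as a pair of helpers — a hedged illustration (function names are invented) of skipping apps whose LastUserActivityTimestamp exactly matches LastHealthCheckTimestamp:

```python
from datetime import datetime, timedelta, timezone

def usable_activity_timestamp(last_user_activity, last_health_check):
    """Return the activity timestamp only when it is usable as an idle
    signal. Health checks can refresh LastUserActivityTimestamp, so an
    exact match with LastHealthCheckTimestamp means the app is skipped."""
    if last_user_activity is None:
        return None
    if last_health_check is not None and last_user_activity == last_health_check:
        return None  # likely health-check noise, not a real user
    return last_user_activity

def is_idle(last_user_activity, last_health_check, now, idle_days_threshold=7):
    ts = usable_activity_timestamp(last_user_activity, last_health_check)
    if ts is None:
        return False  # no usable signal -> never flagged
    return (now - ts) >= timedelta(days=idle_days_threshold)
```

An app whose two timestamps are identical is never treated as idle, however old the timestamp is.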


Long-Running SageMaker Training Jobs

Rule ID: aws.sagemaker.training_job.long_running

Category: ai

What it detects: SageMaker training jobs still in InProgress beyond the configured threshold (default 24 hours). Runtime is measured from TrainingStartTime when present, otherwise from CreationTime.

Detection signal: Inventory is built by fully paginating ListTrainingJobs without relying on StatusEquals for completeness, then filtering TrainingJobStatus client-side. DescribeTrainingJob is used to confirm the current status, resolve the runtime anchor, and read StoppingCondition, EnableManagedSpotTraining, ResourceConfig, and optional heterogeneous InstanceGroups.

Confidence:

  • HIGH: elapsed runtime exceeds the applicable SageMaker stopping-condition limit (MaxWaitTimeInSeconds for managed Spot when present, otherwise MaxRuntimeInSeconds when TrainingStartTime is present)
  • MEDIUM: elapsed runtime meets the threshold but no applicable stopping-condition limit was exceeded (or no such limit is configured)

Risk:

  • HIGH: GPU/accelerator instance (ml.g*, ml.p*, ml.inf*, ml.trn*)
  • MEDIUM: Non-GPU/accelerator instance

GPU/accelerator families: ml.g4dn, ml.g5, ml.g6, ml.g6e, ml.g7, ml.p2, ml.p3, ml.p4d, ml.p4de, ml.p5, ml.p5en, ml.p6, ml.trn1, ml.trn2, ml.inf1, ml.inf2

Managed spot training: EnableManagedSpotTraining=true changes the effective wall-clock stopping limit. MaxRuntimeInSeconds counts only active compute time (not spot wait time) and is not a reliable wall-clock signal. For spot jobs the rule uses MaxWaitTimeInSeconds as the stopping limit; the summary and signals explicitly label which limit was exceeded.

Heterogeneous clusters: When ResourceConfig.InstanceGroups is present, accelerator detection is evaluated across the groups rather than inferred from a single primary instance type.

Why this matters:

  • Long-running distributed training can keep all workers running and billing while producing limited or no useful progress
  • Training jobs are not automatically stopped just because they are unusually long
  • estimated_monthly_cost_usd is intentionally omitted — this is a transient runtime review rule, not a monthly-cost rule

Configurable parameters:

Parameter Default Description
long_running_hours_threshold 24 Hours before a training job is considered long-running

Required permissions:

  • sagemaker:ListTrainingJobs
  • sagemaker:DescribeTrainingJob

Not run by default. Run with cleancloud scan --provider aws --category ai. Validate access first with cleancloud doctor --provider aws --category ai. Attach security/aws/ai-readonly.json alongside base-readonly.json to your IAM role to enable this rule.


Idle Azure ML Compute Clusters

Rule ID: azure.aml.compute.idle

Category: ai

What it detects: Azure Machine Learning compute clusters (AmlCompute) with min_node_count > 0 and zero active nodes over 14+ days. Clusters configured with a non-zero minimum keep instances running continuously regardless of job activity — identical billing model to SageMaker InService endpoints. GPU clusters (NC/ND/NV series) cost $600–$15K/month at minimum node count.

Confidence:

  • HIGH: Zero active nodes for the full 14-day window (cluster age ≥ 14 days)
  • MEDIUM: Zero active nodes, but cluster age is only 7–13 days or cluster creation time is unavailable

Risk:

  • HIGH: GPU-backed VM size (Standard_NC*, Standard_ND*, Standard_NV*)
  • MEDIUM: CPU-backed VM size

Why this matters:

  • min_node_count > 0 means instances are always running, always billed — even with no jobs submitted
  • GPU clusters cost $600–$15K/month per node at minimum capacity
  • Clusters are frequently created for experiments or training runs and left with non-zero minimums for "warm-start convenience"

Metric strategy: Queries Azure Monitor Active Nodes metric (with ComputeName dimension filter). Falls back to NodeCount and CurrentNodeCount if the primary metric is unavailable. Only dimension-filtered metrics are used to confirm idle — workspace-level unfiltered queries cannot safely confirm individual cluster state.

Estimated monthly cost (per node at min_node_count):

  • Standard_NC6 — ~$648/month
  • Standard_NC12 — ~$1,296/month
  • Standard_NC6s_v3 — ~$2,203/month
  • Standard_ND40rs_v2 — ~$15,862/month
  • Standard_D4_v2 — ~$259/month

Required permissions:

  • Microsoft.MachineLearningServices/workspaces/read
  • Microsoft.MachineLearningServices/workspaces/computes/read
  • Microsoft.Insights/metrics/read

Not run by default. Run with cleancloud scan --provider azure --category ai (or --category all). Add Microsoft.MachineLearningServices/workspaces/read and Microsoft.MachineLearningServices/workspaces/computes/read to your custom role or use the built-in AzureML Data Scientist role in read-only mode.
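The metric strategy above — primary metric, two fallbacks, and the rule that only dimension-filtered series may confirm idle — can be sketched as pure logic. The data shape (`metric name -> (dimension_filtered, max_value)`) is an assumption for illustration:

```python
PRIMARY = "Active Nodes"
FALLBACKS = ("NodeCount", "CurrentNodeCount")

def cluster_confirmed_idle(metric_results):
    """metric_results maps metric name -> (dimension_filtered, max_value).

    Only a ComputeName-filtered series can confirm a single cluster's
    state; unfiltered workspace-level totals are ignored."""
    for name in (PRIMARY, *FALLBACKS):
        filtered, value = metric_results.get(name, (False, None))
        if not filtered or value is None:
            continue  # unusable: missing data or not scoped to this cluster
        return value == 0
    return False  # no usable signal -> conservative, no finding
```

Note the conservative default: a cluster with no usable metric at all is never flagged, matching the design principle of preferring false negatives.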


Idle Azure ML Compute Instances

Rule ID: azure.ml.compute_instance.idle

Category: ai

What it detects: Azure ML Compute Instances in Running state with no control-plane activity for 14+ days, detected via last_operation.operation_time. Compute Instances are single-VM interactive development environments (Jupyter, VS Code, RStudio) that bill continuously while Running — regardless of kernel activity. GPU instances (NC/ND/NV series) idle for 2× the threshold are escalated to CRITICAL.

Detection signal — why last_operation: Azure ML Compute Instances do not publish per-instance utilisation metrics to Azure Monitor by default. last_operation.operation_time is updated by the Azure ML control plane on Start, Stop, Restart, and Create operations. An instance with no recent operation has had no control-plane activity — the same approach used for SageMaker Notebook LastModifiedTime. Falls back to system_data.last_modified_at if last_operation is unavailable.

Confidence:

  • HIGH: last_operation.operation_time or last_modified_at signal ≥ 14 days ago AND instance age ≥ 14 days
  • MEDIUM: ≥ 75% of threshold on both signals, OR age-only fallback when neither last_operation nor last_modified_at is available (capped at MEDIUM because age alone is weak evidence of idleness)

Risk:

  • CRITICAL: GPU instance AND idle_ratio ≥ 2.0 (e.g. 28+ days at the default 14-day window)
  • HIGH: GPU instance (Standard_NC*, Standard_ND*, Standard_NV*)
  • MEDIUM: CPU instance

Why this matters:

  • Compute Instances bill at the full VM rate while Running — a stopped instance costs nothing
  • GPU instances cost $600–$15K+/month running continuously
  • Data scientists frequently leave instances Running after finishing a sprint, switching to a new instance, or during holidays

Estimated monthly cost:

  • Standard_DS3_v2 — ~$260/month
  • Standard_NC6s_v3 — ~$2,203/month
  • Standard_NC24s_v3 — ~$8,812/month
  • Standard_ND40rs_v2 — ~$15,862/month

Required permissions:

  • Microsoft.MachineLearningServices/workspaces/read
  • Microsoft.MachineLearningServices/workspaces/computes/read

Not run by default. Run with cleancloud scan --provider azure --category ai. Attach security/azure/ai-readonly-role.json to your service principal to enable this rule.
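The signal-fallback ladder can be sketched as follows. This is a hedged reading of the documented confidence rules — in particular, gating the age-only fallback on the instance age reaching the threshold is my interpretation, not a documented exact condition:

```python
def classify_compute_instance(days_since_last_op, days_since_modified, age_days,
                              threshold_days=14):
    """Return (confidence, signal) or None, following the documented ladder."""
    signal_days = days_since_last_op
    if signal_days is None:
        signal_days = days_since_modified  # system_data.last_modified_at fallback
    if signal_days is None:
        # Age-only fallback: age alone is weak evidence, so cap at MEDIUM.
        return ("MEDIUM", "age_only") if age_days >= threshold_days else None
    if signal_days >= threshold_days and age_days >= threshold_days:
        return ("HIGH", "activity")
    if signal_days >= 0.75 * threshold_days and age_days >= 0.75 * threshold_days:
        return ("MEDIUM", "activity")
    return None
```

An instance last operated on 11 days ago and created 12 days ago clears the 75% bar (10.5 days) on both signals and lands at MEDIUM.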


Idle Azure OpenAI Provisioned Deployment

Rule ID: azure.openai.provisioned_deployment.idle

Category: ai

What it detects: Azure OpenAI provisioned deployments (PTUs) with zero API requests for 7+ days (default, configurable). Provisioned Throughput Units reserve dedicated model capacity and bill continuously at ~$1,460/PTU/month on-demand regardless of traffic — a single idle 100-PTU GPT-4o deployment wastes ~$146,000/month.

Configurable parameters:

Parameter Default Description
idle_days 7 Days of zero requests before flagging

Detection signal:

Queries Azure Monitor AzureOpenAIRequests (falling back to ProcessedPromptTokens) with a ModelDeploymentName dimension filter to isolate per-deployment traffic. If the per-deployment dimension is unsupported in a region, falls back to account-level totals. Conservative: returns no finding on any API error.

Provisioned SKUs detected:

  • ProvisionedManaged — single-region reserved capacity
  • GlobalProvisionedManaged — multi-region reserved capacity
  • DataZoneProvisionedManaged — data-zone-scoped reserved capacity

Confidence:

  • HIGH: Per-deployment metric confirms zero requests AND deployment age ≥ idle_days
  • MEDIUM: Per-deployment zero confirmed but age < idle_days; OR account-level zero (per-deployment dimension unavailable in region)

Risk:

  • HIGH: ≥ 7 PTUs (~$10K+/month estimated)
  • MEDIUM: < 7 PTUs (still significant — PTU deployments have no cost-free tier)

Why this matters:

  • PTU deployments have no free tier — every hour of idle time is pure waste
  • Common abandonment pattern: PoC deployments left running after evaluation, dev/test deployments forgotten when team moves to production, traffic migrated to a new deployment without decommissioning the old one
  • Nobody else detects idle PTU deployments in CI — first-mover advantage

Estimated monthly cost:

  • 1 PTU — ~$1,460/month (on-demand)
  • 10 PTUs — ~$14,600/month
  • 100 PTUs — ~$146,000/month
  • Note: Monthly/annual reserved pricing is 30–50% lower; estimated cost shown is on-demand ceiling

Required permissions:

  • Microsoft.CognitiveServices/accounts/read
  • Microsoft.CognitiveServices/accounts/deployments/read
  • Microsoft.Insights/metrics/read

Not run by default. Run with cleancloud scan --provider azure --category ai (or --category all). Add the permissions above to your custom read-only role.


Idle Azure ML Online Endpoints

Rule ID: azure.ml.online_endpoint.idle

Category: ai

What it detects: Azure ML managed online endpoints in Succeeded provisioning state with zero scoring requests for 7+ days (default, configurable). These endpoints bill per-instance based on minimum replica count regardless of traffic — a GPU-backed endpoint with no scoring requests is paying for capacity delivering zero value.

Detection signal: Queries Azure Monitor RequestCount (falling back to ModelEndpointRequests) with an EndpointName dimension filter to isolate per-endpoint traffic. If the dimension is unsupported, falls back to workspace-level totals. Age-only fallback applies when metric data is unavailable and endpoint age ≥ 2× idle window (MEDIUM confidence).

Configurable parameters:

Parameter Default Description
idle_days 7 Days of zero scoring requests before flagging

Confidence:

  • HIGH: Per-endpoint metric confirms zero requests AND endpoint age ≥ idle_days
  • MEDIUM: Zero requests confirmed but age < idle_days; OR metric data unavailable and age ≥ 2× idle_days

Risk:

  • CRITICAL: GPU/accelerator instance AND idle_ratio ≥ 2.0 (idle for 2× the threshold)
  • HIGH: GPU/accelerator instance (Standard_NC*, Standard_ND*, Standard_NV*, T4/A100 families)
  • MEDIUM: CPU-backed instance

Why this matters:

  • Managed online endpoints bill per minimum replica continuously while in Succeeded state — even with zero traffic
  • GPU-backed endpoints cost $200–$2,600+/month at single minimum replica
  • Experiment and PoC endpoints are commonly abandoned after demos without being deleted or scaled to zero
  • Unlike batch endpoints, managed online endpoints have no auto-scale-to-zero by default

Estimated monthly cost:

  • Standard_NC6 (K80 GPU) — ~$657/month per replica
  • Standard_NC6s_v2 — ~$900/month per replica
  • Standard_NC12 — ~$1,300/month per replica
  • CPU-backed (fallback) — ~$200/month per replica

Required permissions:

  • Microsoft.MachineLearningServices/workspaces/read
  • Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read
  • Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments/read
  • Microsoft.Insights/metrics/read

Not run by default. Run with cleancloud scan --provider azure --category ai. Attach security/azure/ai-readonly-role.json to your service principal to enable this rule.


Idle Azure AI Search Services

Rule ID: azure.ai_search.idle

Category: ai

What it detects: Azure AI Search services on Standard tier or above with zero search queries over a 30-day window (default, configurable). Cost is computed per SKU × replica count × partition count — a Standard3 service with 3 replicas and 2 partitions idles at ~$6,282/month.

Detection signal: Queries Azure Monitor SearchQueriesPerSecond (Average), falling back to TotalSearchRequestCount (Sum). Service-level metrics only — no per-index dimension filtering needed. Age-only fallback applies when metric data is unavailable and service age ≥ 2× idle window (MEDIUM confidence).

Watched SKUs: standard, standard2, standard3, storage_optimized_l1, storage_optimized_l2 — Basic tier is excluded (low cost, no signal).

Configurable parameters:

Parameter Default Description
idle_days 30 Days of zero queries before flagging

Confidence:

  • HIGH: Zero average SearchQueriesPerSecond for the full idle window AND service age ≥ idle_days
  • MEDIUM: Zero confirmed but age < idle_days; OR metric data unavailable and age ≥ 2× idle_days

Risk:

  • HIGH: Estimated monthly cost ≥ $1,000 (e.g. Standard2+ or multi-replica/partition Standard)
  • MEDIUM: All other cases

Why this matters:

  • AI Search services bill continuously by SKU × replicas × partitions regardless of query volume
  • A Standard service with 1 replica and 1 partition costs ~$261/month idle — scale up to 2 replicas and the bill doubles
  • Services are commonly left running after a project ends, a search index is replaced, or a PoC is abandoned
  • Standard3 High-Density (HD) with 12 partitions can idle at ~$12,564/month

Estimated monthly cost per replica per partition:

SKU Monthly cost
Standard $261
Standard2 $523
Standard3 $1,047
Storage Optimized L1 $2,014
Storage Optimized L2 $4,028

Multiply by replica_count × partition_count for total monthly idle cost.
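As a worked example of that multiplication (figures from the table above; the helper itself is illustrative):

```python
# Per-replica-per-partition monthly cost, from the SKU table above.
SKU_MONTHLY_USD = {
    "standard": 261,
    "standard2": 523,
    "standard3": 1047,
    "storage_optimized_l1": 2014,
    "storage_optimized_l2": 4028,
}

def monthly_idle_cost(sku, replica_count, partition_count):
    return SKU_MONTHLY_USD[sku] * replica_count * partition_count
```

A Standard3 service with 3 replicas and 2 partitions gives 1,047 × 3 × 2 = $6,282/month, matching the example at the top of this rule.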

Required permissions:

  • Microsoft.Search/searchServices/read
  • Microsoft.Insights/metrics/read

Not run by default. Run with cleancloud scan --provider azure --category ai. Attach security/azure/ai-readonly-role.json to your service principal to enable this rule.


Azure Rules

Compute Waste

Stopped (Not Deallocated) VMs

Rule ID: azure.vm.stopped_not_deallocated

What it detects: VMs in 'Stopped' state (OS-level shutdown) that are not deallocated, still incurring full compute charges

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Power state is 'Stopped' (deterministic state check, zero false positives)

Risk: HIGH

Why HIGH risk:

  • Stopped-but-not-deallocated VMs incur full compute charges ($30-500+/month depending on SKU)
  • Users often believe their VM is "off" but are paying full price
  • Classic Azure cost trap with significant financial impact

Why this matters:

  • Azure distinguishes between 'Stopped' (OS shutdown) and 'Deallocated' (compute released)
  • Only deallocated VMs stop incurring compute charges
  • 100% deterministic state check with zero false positives

Detection logic:

for vm in virtual_machines.list_all():
    instance_view = virtual_machines.instance_view(resource_group, vm.name)
    power_state = get_power_state(instance_view.statuses)  # PowerState/* code
    if power_state == "PowerState/stopped":
        confidence = "HIGH"  # Deterministic: stopped but not deallocated
        risk = "HIGH"  # Full compute charges still applied

Power states:

  • PowerState/running — active, skip
  • PowerState/deallocated — properly stopped, skip
  • PowerState/stopped — FLAGGED (still incurring compute charges)
  • PowerState/starting, PowerState/stopping, PowerState/deallocating — transitional, skip

Common causes:

  • Shutting down the VM from inside the OS (instead of Azure portal/CLI)
  • Using Stop-AzVM with the -StayProvisioned switch (stops the VM without deallocating it)
  • RDP/SSH shutdown commands
  • Automated scripts that stop but don't deallocate

Required permission: Microsoft.Compute/virtualMachines/read


Storage Waste

Unattached Managed Disks

Rule ID: azure.compute.disk.unattached

What it detects: Managed disks not attached to any VM

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Unattached ≥ 7 days (conservative for all ages — unattached state is deterministic but attachment intent is not)
  • Not flagged: < 7 days

Detection logic:

for disk in disks.list():
    if disk.managed_by is not None:
        continue  # attached to a VM
    age_days = (now - disk.time_created).days
    if age_days >= 7:
        confidence = "MEDIUM"  # conservative regardless of age
    else:
        continue  # too new to flag

Common causes:

  • Disks from deleted VMs
  • Failed deployments
  • Autoscaling cleanup gaps

Required permission: Microsoft.Compute/disks/read


Old Managed Disk Snapshots

Rule ID: azure.compute.snapshot.old

What it detects: Snapshots older than configured thresholds

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Age ≥ 30 days (conservative for all ages — age alone is a moderate signal)
  • Not flagged: < 30 days

Detection logic:

for snapshot in snapshots.list():
    age_days = (now - snapshot.time_created).days
    if age_days >= 30:
        confidence = "MEDIUM"  # conservative even at high age
    else:
        continue  # too new to flag

Limitations:

  • Does NOT check if snapshot is referenced by images
  • Conservative to avoid false positives

Common causes:

  • Snapshots from backup jobs
  • Over-retention without lifecycle policies
  • Snapshots from deleted disks

Required permission: Microsoft.Compute/snapshots/read


Network Waste

Unused Public IP Addresses

Rule ID: azure.network.public_ip.unused

What it detects: Public IPs not attached to any network interface

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Not attached (deterministic state, but may be reserved intentionally)

Why this matters:

  • Public IPs incur charges even when unused
  • State is deterministic (no heuristics needed)

Detection logic:

if public_ip.ip_configuration is None:
    confidence = "MEDIUM"

Required permission: Microsoft.Network/publicIPAddresses/read


Standard Load Balancer with No Backend Members

Rule ID: azure.load_balancer.no_backends

What it detects: Standard Load Balancers where all backend pools have zero members

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Standard SKU with zero backend members across all pools (deterministic state)

Excluded:

  • Basic SKU load balancers are skipped (retired, no cost signal)

Why this matters:

  • Standard Load Balancers incur base charges (~$18/month) regardless of backends
  • Empty LBs are a clear cost optimization signal
  • Common after VM/VMSS teardowns or migrations

Detection logic:

if lb.sku.name == "Standard":
    pools = lb.backend_address_pools or []
    # Check both NIC-based and IP-based backend representations
    has_members = any(
        pool.backend_ip_configurations or pool.load_balancer_backend_addresses
        for pool in pools
    )
    if not has_members:
        confidence = "HIGH"  # Deterministic: zero members across all pools

Backend representations checked:

  • backend_ip_configurations — NIC-based backends (standard VMs)
  • load_balancer_backend_addresses — IP-based backends (Private Link, hybrid)

Common causes:

  • VMs or VMSS deleted but LB retained
  • Migration from Basic to Standard leaving empty LBs
  • Failed deployments or incomplete teardowns
  • Hub-spoke architecture cleanup gaps

Required permission: Microsoft.Network/loadBalancers/read


Application Gateway with No Backend Targets

Rule ID: azure.application_gateway.no_backends

What it detects: Application Gateways where all backend pools have zero targets (no IP addresses or FQDNs)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: All backend pools have zero targets (deterministic state)

Excluded:

  • Gateways with provisioning_state != "Succeeded" are skipped (in-progress)

Why this matters:

  • Application Gateways incur significant charges regardless of backends
  • Standard_v2 and WAF_v2 SKUs cost $150-300+/month
  • Empty gateways are a clear cost optimization signal

Detection logic:

for gw in application_gateways:
    pools = gw.backend_address_pools or []
    has_any_targets = any(
        (pool.backend_addresses and len(pool.backend_addresses) > 0) or
        (pool.backend_ip_configurations and len(pool.backend_ip_configurations) > 0)
        for pool in pools
    )
    if not has_any_targets:
        confidence = "HIGH"  # Deterministic: zero targets across all pools
        risk = "MEDIUM"  # Significant cost impact ($150-300+/month)

Backend targets checked:

  • backend_addresses array (IP addresses or FQDNs)
  • backend_ip_configurations array (NIC-based backend references)

Common causes:

  • Backend VMs or services deleted but gateway retained
  • Migration or transition leaving empty gateways
  • Failed deployments or incomplete teardowns
  • WAF-only setup without actual backends (rare)

Cost estimates by SKU:

  • Standard_v2, WAF_v2: $150-300+/month
  • Standard, WAF (v1): $20-50/month

Required permission: Microsoft.Network/applicationGateways/read


Idle VNet Gateways (VPN/ExpressRoute)

Rule ID: azure.virtual_network_gateway.idle

What it detects: VPN Gateways and ExpressRoute Gateways with no active connections

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: No active connections (connection state checked, but P2S clients not verified)

Why MEDIUM confidence:

  • We can verify Site-to-Site and ExpressRoute connections
  • Point-to-Site VPN client count requires additional API calls
  • Gateway may have P2S config but no way to check active clients without deeper inspection

Risk: HIGH

Why HIGH risk:

  • VNet Gateways are among the most expensive idle resources ($500-3,500+/month)
  • Cost impact is material even for a single idle gateway
  • Significantly higher than Load Balancers ($18/month) or App Gateways ($150-300/month)

Why this matters:

  • VNet Gateways incur significant charges regardless of connections
  • VPN Gateway SKUs: $27-3,500+/month depending on SKU
  • ExpressRoute Gateway SKUs: $125-1,100+/month
  • Idle gateways are a major cost optimization signal

Detection logic:

for gw in virtual_network_gateways:
    connections = list_connections(gw)
    active_connections = [c for c in connections if c.connection_status == "Connected"]
    has_p2s_config = gw.vpn_client_configuration is not None

    if gw.gateway_type == "Vpn":
        if len(active_connections) == 0 and not has_p2s_config:
            confidence = "MEDIUM"  # flag as idle
            risk = "HIGH"
    elif gw.gateway_type == "ExpressRoute":
        if len(active_connections) == 0:
            confidence = "MEDIUM"  # flag as idle
            risk = "HIGH"

Connection states checked:

  • Site-to-Site VPN connections (connection_status == "Connected")
  • ExpressRoute circuit connections
  • Point-to-Site VPN configuration (presence only, not active client count)

Common causes:

  • VPN tunnels torn down but gateway retained
  • ExpressRoute circuits decommissioned
  • Test/dev gateways left running
  • Migration or transition leaving orphaned gateways
  • DR standby gateways (intentional, but worth reviewing)

Cost estimates by SKU:

  • Basic: $27/month
  • VpnGw1/ErGw1AZ: $140-195/month
  • VpnGw2/ErGw2AZ: $360-505/month
  • VpnGw3/ErGw3AZ: $930-1,115/month
  • HighPerformance/UltraPerformance: $335-670/month

Required permissions:

  • Microsoft.Network/virtualNetworkGateways/read
  • Microsoft.Network/connections/read

Platform Waste

Empty App Service Plans

Rule ID: azure.app_service_plan.empty

What it detects: Paid App Service Plans with zero hosted apps (number_of_sites == 0)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Paid tier plan with 0 apps (deterministic state)

Excluded tiers:

  • Free and Shared tiers are skipped (no cost signal)

Why this matters:

  • Paid App Service Plans incur charges regardless of hosted apps
  • Empty plans are a clear cost optimization signal
  • Common after app deletions or failed deployments

Detection logic:

if plan.number_of_sites == 0:
    if plan.sku.tier not in ("Free", "Shared"):
        confidence = "HIGH"  # Deterministic: zero apps on paid plan

Common causes:

  • Apps deleted but plan retained
  • Failed deployments leaving empty plans
  • Scaling plans created but never used
  • Migration leaving old plans behind

Required permissions: Microsoft.Web/serverfarms/read, Microsoft.Web/serverfarms/sites/read


Idle Azure SQL Databases

Rule ID: azure.sql.database.idle

What it detects: Azure SQL databases with zero connections for 14+ days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Zero connections for 14+ days (Azure Monitor metrics checked, strong idle signal)

Risk: HIGH

Why HIGH risk:

  • Azure SQL databases in Standard/Premium tiers cost $15-$7,500+/month
  • Idle databases with no connections are a clear cost optimization signal

Why this matters:

  • Azure SQL databases incur charges regardless of usage
  • Standard and Premium tiers have significant hourly costs
  • Idle databases are a major cost optimization opportunity

Detection logic:

for server in sql_servers:
    for db in databases.list_by_server(rg, server.name):
        if db.name == "master":  # Skip system databases
            continue
        if db.sku.tier == "Basic":  # Skip Basic tier (< $5/month)
            continue
        connections = get_metric("connection_successful", period_days=14)
        if connections == 0:
            confidence = "HIGH"
            risk = "HIGH"

Azure Monitor metrics checked:

  • connection_successful (daily total over 14-day window)

Exclusions:

  • System databases (master)
  • Basic tier databases (< $5/month, not worth flagging)

Common causes:

  • Applications migrated to different databases
  • Dev/staging databases left running
  • Decommissioned services with retained databases
  • Test databases no longer needed

Cost estimates by SKU:

  • Standard S0: ~$15/month
  • Standard S3: ~$150/month
  • Premium P1: ~$465/month
  • Premium P6: ~$3,720/month
  • Premium P15: ~$7,446/month

Required permissions:

  • Microsoft.Sql/servers/read
  • Microsoft.Sql/servers/databases/read
  • Microsoft.Insights/metrics/read

Idle App Services

Rule ID: azure.app_service.idle

What it detects: Running App Service web apps with zero HTTP requests for 14+ days on paid plans

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Zero requests for 14+ days (Azure Monitor Requests metric, strong idle signal)

Risk: MEDIUM

Why this matters:

  • App Service Plans on paid tiers bill compute charges continuously regardless of traffic
  • An app with zero requests for 14+ days is a strong signal of abandonment
  • Common for dev/staging apps that were never decommissioned

Detection logic:

for app in web_apps.list():
    plan = app_service_plans.get(app.server_farm_id)  # tier lives on the plan, not the app
    if app.state == "Running" and plan.sku.tier not in ("Free", "Shared", "Dynamic"):
        requests = monitor.metrics("Requests", period=days_idle)
        if requests == 0:
            confidence = "HIGH"
            risk = "MEDIUM"

Excluded tiers:

  • Free, Shared, Dynamic (Consumption/serverless) — no meaningful idle cost

Common causes:

  • Dev or staging apps left running after project end
  • Feature branches deployed and never torn down
  • Apps migrated to containers but old App Service not removed

Cost estimates by tier (single instance):

  • Basic: ~$55/month
  • Standard: ~$73/month
  • Premium/PremiumV2/V3: ~$146/month
  • Isolated/IsolatedV2: ~$298/month

Cost assumes one instance. Scaled-out plans (multiple instances) will cost proportionally more — treat these as minimum estimates.

Not detected:

  • Non-HTTP workloads such as WebJobs or background services with no inbound HTTP traffic — these produce zero Requests metric data even when active. Review before deleting.

Required permissions:

  • Microsoft.Web/sites/read
  • Microsoft.Web/serverfarms/read
  • Microsoft.Insights/metrics/read

Unused Container Registries

Rule ID: azure.container_registry.unused

What it detects: Container registries with zero image pulls for 90+ days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Zero successful pulls AND zero successful pushes for 90+ days (Azure Monitor SuccessfulPullCount and SuccessfulPushCount metrics)

Risk: LOW

Why this matters:

  • Container registries accrue storage and per-operation charges regardless of usage
  • A registry with no pulls and no pushes for 90+ days signals complete abandonment
  • Common after workload migrations to other registries or container platforms

Detection logic:

for registry in registries.list():
    if registry.provisioning_state == "Succeeded":
        pulls = monitor.metrics("SuccessfulPullCount", period=days_unused)
        pushes = monitor.metrics("SuccessfulPushCount", period=days_unused)
        if pulls == 0 and pushes == 0:
            confidence = "HIGH"
            risk = "LOW"

Registries with active push activity (e.g. CI pipelines writing images) but zero pulls are not flagged — they are in active use.

Common causes:

  • Workloads migrated to another registry (e.g., Docker Hub → ACR → GHCR)
  • Projects retired without cleaning up the registry
  • Old build artifacts never consumed by any deployment

Cost estimates by SKU (base fee only):

  • Basic: ~$5/month + storage
  • Standard: ~$20/month + storage
  • Premium: ~$50/month + storage

These are floor estimates. ACR also charges per GB of stored images (~$0.003/GB-day). For registries with large image layers, storage can exceed the base fee — actual cost may be significantly higher.
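The floor-plus-storage arithmetic above can be sketched directly. The rates are this section's approximations, not live pricing:

```python
# Approximate rates from this section: base fee by SKU plus ~$0.003/GB-day storage.
BASE_FEE = {"Basic": 5.0, "Standard": 20.0, "Premium": 50.0}
STORAGE_PER_GB_DAY = 0.003

def acr_monthly_estimate(sku, stored_gb, days=30):
    # The base fee is a floor; storage scales with image size and can dominate.
    return BASE_FEE[sku] + stored_gb * STORAGE_PER_GB_DAY * days
```

A Basic registry holding 500 GB of layers works out to 5 + 500 × 0.003 × 30 = $50/month, so storage is nine times the base fee.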

Required permissions:

  • Microsoft.ContainerRegistry/registries/read
  • Microsoft.Insights/metrics/read

Governance

Untagged Resources

Rule ID: azure.resource.untagged

What it detects: Resources with zero tags

Resources checked:

  • Managed disks (7+ days old)
  • Snapshots

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Untagged disk that's also unattached
  • LOW: Untagged snapshot or attached disk

Required permissions:

  • Microsoft.Compute/disks/read
  • Microsoft.Compute/snapshots/read

GCP Rules

Compute Waste

Stopped VM Instances

Rule ID: gcp.compute.vm.stopped

What it detects: VM instances in TERMINATED state for 30+ days

Confidence:

  • HIGH: lastStopTimestamp present and ≥ 30 days ago (deterministic timestamp)
  • MEDIUM: lastStopTimestamp absent — instance is TERMINATED but stop time is unavailable
  • Not flagged: stopped < 30 days, or instance in any other state (RUNNING, STAGING, etc.)

Risk: LOW

Why this matters:

  • A TERMINATED GCP VM does not charge for vCPU or memory — but every attached Persistent Disk accrues storage charges at ~$0.04/GB-month (standard) or ~$0.17/GB-month (SSD), regardless of instance state
  • A 500 GB root disk on a forgotten stopped instance costs ~$20/month indefinitely
  • This is the GCP equivalent of a stopped EC2 instance — the compute is free, the storage is not

Detection logic:

for instance in instances_client.aggregated_list(project=project_id):
    if instance.status == "TERMINATED":
        if not instance.last_stop_timestamp:
            flag(instance, confidence="MEDIUM")  # stop time unavailable
        elif _parse_gcp_timestamp(instance.last_stop_timestamp) < cutoff:
            flag(instance, confidence="HIGH")    # stopped before the 30-day cutoff

Cost estimate: Sum of attached PERSISTENT disk sizes × $0.04/GB/month (SCRATCH disks excluded — they are ephemeral)

Required permissions:

  • compute.instances.list (included in roles/compute.viewer)

Storage Waste

Unattached Persistent Disks

Rule ID: gcp.compute.disk.unattached

What it detects: Persistent Disks in READY state with no attached VM (users == [])

Confidence:

  • HIGH: Disk is READY and has no users — unambiguous detachment

Risk: LOW

Why this matters:

  • GCP charges for Persistent Disks regardless of whether they are attached to a VM
  • pd-standard: ~$0.04/GB/month, pd-ssd: ~$0.17/GB/month, pd-balanced: ~$0.10/GB/month, pd-extreme: ~$0.12/GB/month
  • Unattached disks accumulate when VMs are deleted without deleting their disks — the most common source of GCP storage waste
  • A 500 GB pd-ssd left unattached costs ~$85/month

Detection logic:

for disk in disks_client.aggregated_list(project=project_id):
    if disk.status == "READY" and not disk.users:
        flag(disk)

Cost estimate by disk type:

Type Rate
pd-standard $0.04/GB/month
pd-balanced $0.10/GB/month
pd-ssd $0.17/GB/month
pd-extreme $0.12/GB/month
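The table above translates to a small lookup. Rates are this section's approximations; the Compute API reports a disk's type as a URL ending in the type name, which the sketch strips:

```python
# Approximate rates from the table above.
RATE_PER_GB_MONTH = {
    "pd-standard": 0.04,
    "pd-balanced": 0.10,
    "pd-ssd": 0.17,
    "pd-extreme": 0.12,
}

def disk_monthly_cost(disk_type_url, size_gb):
    # The API reports the type as a URL, e.g.
    # "projects/p/zones/us-central1-a/diskTypes/pd-ssd" -- keep the last segment.
    disk_type = disk_type_url.rsplit("/", 1)[-1]
    # Unknown types fall back to the pd-standard rate as a conservative floor.
    return size_gb * RATE_PER_GB_MONTH.get(disk_type, 0.04)
```

A 500 GB pd-ssd yields 500 × 0.17 = $85/month, matching the example above.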

Required permissions:

  • compute.disks.list (included in roles/compute.viewer)

Old Disk Snapshots

Rule ID: gcp.compute.snapshot.old

What it detects: Disk snapshots older than 90 days

Confidence:

  • HIGH: Source disk no longer exists (snapshot is orphaned — the source was deleted)
  • MEDIUM: Source disk still exists (might be intentional long-term backup or DR snapshot)

Risk: LOW

Why this matters:

  • GCP snapshots are billed at ~$0.026/GB/month compressed storage in Cloud Storage
  • Automated snapshot policies are frequently removed while their snapshots are left behind
  • One-off manual snapshots are rarely cleaned up — they persist indefinitely until explicitly deleted
  • Snapshots are global resources — they accumulate across all zones and appear in no specific region

Detection logic:

for snapshot in snapshots_client.list(project=project_id):
    if snapshot.status == "READY":
        if _parse_gcp_timestamp(snapshot.creation_timestamp) < cutoff:
            confidence = HIGH if not snapshot.source_disk else MEDIUM
            flag(snapshot)

Cost estimate: Uses storage_bytes (actual compressed size) when available; falls back to disk_size_gb × $0.026/GB/month
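The fallback described above, sketched with plain arguments (the rate is this section's ~$0.026/GB/month approximation):

```python
GIB = 1024 ** 3

def snapshot_monthly_cost(storage_bytes, disk_size_gb, rate=0.026):
    # Prefer the actual compressed size; fall back to the provisioned
    # source-disk size when storage_bytes is unreported (overestimates,
    # since snapshots are compressed).
    size_gb = storage_bytes / GIB if storage_bytes else disk_size_gb
    return size_gb * rate
```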

Note: region_filter is ignored for snapshots — GCP snapshots are global resources with no region attribute.

Required permissions:

  • compute.snapshots.list (included in roles/compute.viewer)

Network Waste

Unused Reserved Static IPs

Rule ID: gcp.compute.ip.unused

What it detects: Reserved static IP addresses (regional and global) in RESERVED status (not IN_USE)

Confidence:

  • HIGH: IP status is RESERVED — unambiguous, GCP itself confirms it is not attached

Risk: LOW

Why this matters:

  • GCP bills $0.01/hour ($7.20/month) for each static IP in RESERVED status under the PREMIUM network tier
  • Reserved IPs accumulate when VMs, load balancers, or NAT gateways are deleted without releasing their IPs
  • Unlike ephemeral IPs, reserved IPs persist independently — they must be explicitly released to stop billing

Detection logic:

# Regional IPs
for address in addresses_client.aggregated_list(project=project_id):
    if address.status == "RESERVED":
        flag(address, scope="regional")

# Global IPs (skipped if region_filter is set)
for address in global_addresses_client.list(project=project_id):
    if address.status == "RESERVED":
        flag(address, scope="global")

Graceful degradation: If compute.globalAddresses.list is denied but regional IPs succeed, the rule returns regional findings rather than failing entirely.
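The degradation path can be sketched with injected listing callables (hypothetical stand-ins for the google-cloud-compute client calls); a permission failure on the global call keeps the regional findings instead of failing the whole rule:

```python
def list_unused_ips(list_regional, list_global):
    """list_regional / list_global are hypothetical callables yielding
    address dicts with "name" and "status" keys (stand-ins for the
    compute_v1 AddressesClient / GlobalAddressesClient calls)."""
    findings = [
        {"name": a["name"], "scope": "regional"}
        for a in list_regional()
        if a["status"] == "RESERVED"
    ]
    try:
        findings += [
            {"name": a["name"], "scope": "global"}
            for a in list_global()
            if a["status"] == "RESERVED"
        ]
    except PermissionError:  # real code catches the API's 403 equivalent
        pass  # degrade gracefully: return regional findings only
    return findings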

Cost estimate: $7.20/month per unused IP (PREMIUM network tier default)

Required permissions:

  • compute.addresses.list (included in roles/compute.viewer)
  • compute.globalAddresses.list (included in roles/compute.viewer)

Platform Waste

Idle Cloud SQL Instances

Rule ID: gcp.sql.instance.idle

What it detects: Cloud SQL instances in RUNNABLE state with zero database connections for 14+ days

Confidence:

  • HIGH: Monitoring confirms zero connections for the full 14-day window

Risk: HIGH

Why this matters:

  • Cloud SQL bills continuously for vCPU and memory regardless of query load
  • A db-n1-standard-2 costs ~$93/month with zero queries
  • Dev and staging databases are frequently left running after feature branches merge or projects wind down
  • Cloud SQL is the highest-cost idle resource type in most GCP environments

Detection logic:

for instance in sql_admin_api.list(project_id):
    if instance.state == "RUNNABLE" and not is_read_replica(instance):
        if not has_connections(monitoring_client, project_id, instance.name, days=14):
            flag(instance)

Conservative monitoring fallback: If Cloud Monitoring is unavailable or permission-denied, the instance is assumed active — it is not flagged. This avoids false positives when monitoring data is temporarily unavailable.
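The fail-closed behaviour can be sketched as a wrapper around the metric query. Here query_connections is a hypothetical callable standing in for the monitoring.timeSeries.list call:

```python
def has_connections(query_connections):
    """Return True unless monitoring positively confirms zero connections.

    `query_connections` is a hypothetical callable returning the summed
    connection count over the window, raising on permission or
    availability errors.
    """
    try:
        total = query_connections()
    except Exception:  # broad by design: any monitoring failure fails closed
        return True    # monitoring unavailable: assume active, do not flag
    if total is None:
        return True    # one conservative reading: no data means unknown, not idle
    return total > 0
```

The instance is flagged only when this returns False, i.e. monitoring returned an explicit zero.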

Read replicas excluded: Read replicas have no independent billing basis — the primary instance cost is what matters.

Cost estimates by tier:

Tier ~Monthly cost
db-f1-micro $7.67
db-g1-small $25.22
db-n1-standard-1 $46.55
db-n1-standard-2 $93.10
db-n1-standard-4 $186.19
db-n1-highmem-2 $113.45
db-n1-highmem-4 $226.90

Costs are approximate for us-central1 with HA disabled.

Required permissions:

  • cloudsql.instances.list (included in roles/cloudsql.viewer)
  • monitoring.timeSeries.list (included in roles/monitoring.viewer)

AI/ML Waste (opt-in — --category ai)

Idle Vertex AI Online Prediction Endpoints

Rule ID: gcp.vertex.endpoint.idle

What it detects: Vertex AI Online Prediction endpoints with dedicatedResources.minReplicaCount > 0 and zero prediction requests for 14+ days

Confidence:

  • HIGH: Zero predictions for the full 14-day window (endpoint age ≥ 14 days)
  • MEDIUM: Zero predictions, endpoint age ≥ 75% of threshold (≥ 10 days), or age unknown

Risk: HIGH (GPU-backed endpoints: T4, V100, A100, L4, H100, TPU), MEDIUM (CPU-only)

Why this matters:

  • Vertex AI endpoints with minReplicaCount > 0 keep dedicated compute running 24/7 regardless of traffic
  • GPU endpoints (T4: $311/month per GPU, A100: $2,933/month, H100: $8,000/month) are especially costly when idle
  • Experiment and prototype endpoints are commonly abandoned after demos without being deleted or scaled to zero
  • Endpoints using automaticResources (which scale to zero) are excluded — only dedicatedResources incur idle cost

Detection logic:

for endpoint in vertex_ai_api.list(project_id, location="-"):  # all locations
    total_min_replicas = sum(
        m.dedicatedResources.minReplicaCount
        for m in endpoint.deployedModels
        if m.dedicatedResources  # skip automaticResources
    )
    if total_min_replicas > 0:
        if not has_predictions(monitoring_client, endpoint_id, days=14):
            flag(endpoint)

Conservative monitoring fallback: If Cloud Monitoring is unavailable or permission-denied, the endpoint is assumed active — it is not flagged.

Cost estimates by machine type (per node, us-central1):

Machine Type ~Monthly cost/node
n1-standard-4 $138
n1-standard-8 $277
n1-standard-4 + T4 GPU $449
n1-standard-4 + V100 GPU $1,523
a2-highgpu-1g (A100 40GB) $2,933
a2-highgpu-2g (2× A100) $5,866
a2-ultragpu-1g (A100 80GB) $5,103
g2-standard-8 (L4 GPU) $1,060

Costs are approximate for us-central1, on-demand. Multiply by minReplicaCount for total monthly idle cost.

Required permissions:

  • aiplatform.endpoints.list (included in roles/aiplatform.viewer)
  • monitoring.timeSeries.list (included in roles/monitoring.viewer)

Idle Vertex AI Workbench Instances

Rule ID: gcp.vertex.workbench.idle

What it detects: Vertex AI Workbench instances in ACTIVE state with no control-plane activity for 14+ days

Confidence:

  • HIGH: updateTime ≥ 14 days ago AND instance age ≥ 14 days
  • MEDIUM: updateTime ≥ 75% of threshold (≥ 10 days) and instance age ≥ 10 days, or updateTime unavailable (age-fallback, capped at MEDIUM)
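The 75%-of-threshold tiering used here (and by the endpoint rule above) can be sketched as follows. This is a simplification that folds the separate instance-age requirement into a single idle_days argument; the int truncation matches the "≥ 10 days" figure at the 14-day default:

```python
def idle_confidence(idle_days, threshold_days=14):
    # HIGH requires the full observation window; the tail of the window
    # (int(0.75 * threshold), i.e. 10 days at the default) or an
    # age-fallback with no updateTime caps at MEDIUM.
    if idle_days is None:
        return "MEDIUM"  # updateTime unavailable: capped at MEDIUM
    if idle_days >= threshold_days:
        return "HIGH"
    if idle_days >= int(threshold_days * 0.75):
        return "MEDIUM"
    return None          # below the early-warning zone: not flagged
```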

Risk: CRITICAL (GPU-backed, idle ≥ 2× threshold), HIGH (GPU-backed), MEDIUM (CPU-only)

Why this matters:

  • Workbench instances incur continuous compute charges while ACTIVE, even with no open notebooks or active kernels
  • GPU instances (T4: $311/month, A100: $2,933/month, H100: $8,000/month) are extremely costly when left idle
  • Data scientists commonly leave instances running after a sprint ends, a project is deprioritised, or when switching to a newer instance

Detection logic:

for instance in notebooks_api.list(project_id, location="-"):  # all locations
    if instance.state == "ACTIVE":
        idle_days = (now - instance.updateTime).days
        if idle_days >= 14:
            flag(instance)

updateTime is updated by the Notebooks API when the instance is started, stopped, restarted, or reconfigured. Instances with stale updateTime have had no control-plane activity. This mirrors LastModifiedTime (SageMaker) and last_modified_at (Azure ML).

Cost estimates (per instance, us-central1, on-demand):

Machine Type ~Monthly cost
n1-standard-4 $138
n1-standard-4 + T4 GPU $449
n1-standard-4 + V100 GPU $1,523
a2-highgpu-1g (A100 40GB) $2,933
g2-standard-8 (L4 GPU) $1,060

Required permissions:

  • notebooks.instances.list (included in roles/notebooks.viewer)

Long-Running Vertex AI Training Jobs

Rule ID: gcp.vertex.training_job.long_running

What it detects: Vertex AI CustomJobs (state=JOB_STATE_RUNNING) and TrainingPipelines (state=PIPELINE_STATE_RUNNING) that have been running longer than expected. The default threshold is 24 hours. GPU/TPU accelerator jobs and expensive CPU clusters raise an early warning at 90% of the threshold (21.6h at defaults) because high burn rates make runaway detection time-sensitive.

Most training jobs complete in minutes to a few hours. A job still running well past the threshold is likely hung, stalled, or runaway — waiting on data, deadlocked in distributed training, caught in an OOM loop, or simply forgotten after a project was cancelled.

GPU-backed training is especially costly: an A100 40GB node (a2-highgpu-1g) runs at ~$4/hour; an H100 node (a3-highgpu-8g) with 8 GPUs runs at ~$80/hour. Distributed multi-worker jobs multiply cost linearly.

Confidence:

  • HIGH: duration ≥ long_running_hours × 3 — clearly runaway for almost any single training run
  • MEDIUM: duration ≥ long_running_hours — worth reviewing; could be legitimate large-scale training
  • MEDIUM (early warning): GPU/TPU accelerator job, or CPU cluster with burn rate above expensive_hourly_threshold (default $20/hr), at 90–100% of threshold — not emitted for cheap CPU-only jobs below threshold

Risk:

Confidence GPU/Accelerator Risk
HIGH Yes CRITICAL
HIGH No or unknown HIGH
MEDIUM Any MEDIUM

Why this matters:

  • Vertex AI CustomJobs with GPU workers continue billing as long as they are in JOB_STATE_RUNNING
  • There is no automatic stop unless timeout is set in the job spec — jobs can run indefinitely if hung or if the stopping condition is never met
  • TrainingPipelines wrap CustomJobs and can also run indefinitely if the underlying job does not terminate

Detection logic:

# Queries both resource types across all locations via REST API
for job in vertex_ai.customJobs(project, locations="-", filter='state="JOB_STATE_RUNNING"'):
    duration = now - job.startTime  # fallback to createTime if absent
    is_accelerator = has_gpu_or_tpu(job.workerPoolSpecs)
    burn_rate = total_hourly_cost(job.workerPoolSpecs)
    if duration < threshold * 0.9:
        continue  # too young
    if duration < threshold and not (is_accelerator or burn_rate > 20):
        continue  # early-warning zone: skip cheap CPU-only jobs

for pipeline in vertex_ai.trainingPipelines(project, locations="-", filter='state="PIPELINE_STATE_RUNNING"'):
    ...  # same logic; hardware parsed from trainingTaskInputs when available

Hardware detection:

  • Accelerator classification uses workerPoolSpecs[].machineSpec.acceleratorType against a frozenset of known accelerator types (GPU families and TPU pod types), plus machine type prefixes that bundle accelerator cost (a2-*, a3-*, a4-*, a4x-*, g2-*, g4-*, ct4-*, ct5*, ct6*, tpu*)
  • TPU machines use tpuTopology (e.g. "2x4") to derive the physical host count — replicaCount is always 1 in the Vertex AI API regardless of pod size
  • TrainingPipelines embed hardware in opaque trainingTaskInputs — when specs cannot be parsed, cost uses a duration-tiered placeholder (>24h → $20/hr, 6–24h → $5/hr, <6h → $1/hr) and is_accelerator is False (unknown hardware does not imply GPU workload)
  • For bundled accelerator machines, co-scheduling is modeled: when acceleratorCount divides machine_gpu_count evenly, machine_gpu_count ÷ acceleratorCount replicas are assumed to share each VM, and the per-machine cost is split evenly across them
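The classification step might be sketched like this. The accelerator names and machine prefixes below are abbreviated, illustrative examples drawn from this section, not the rule's full frozenset:

```python
# Illustrative subset -- the real rule carries the complete frozenset of
# accelerator types and bundled-accelerator machine prefixes.
ACCELERATOR_TYPES = frozenset({
    "NVIDIA_TESLA_T4", "NVIDIA_TESLA_V100", "NVIDIA_TESLA_A100",
    "NVIDIA_L4", "NVIDIA_H100_80GB",
})
ACCELERATOR_PREFIXES = ("a2-", "a3-", "a4-", "a4x-", "g2-", "g4-",
                        "ct4-", "ct5", "ct6", "tpu")

def is_accelerator_spec(machine_type, accelerator_type=None):
    # A worker pool counts as accelerator-backed if it names an accelerator
    # explicitly, or if the machine type bundles one (a2-*, g2-*, ct5*, ...).
    if accelerator_type in ACCELERATOR_TYPES:
        return True
    return machine_type.startswith(ACCELERATOR_PREFIXES)
```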

Cost reported:

  • Accrued cost so far: duration_hours × hourly_burn_rate (sum across all worker pools); stored raw in details["accrued_cost_usd"] and capped at $1M in display text
  • estimated_monthly_cost_usd is intentionally None — training jobs are transient, not recurring monthly expenses; populating that field would corrupt monthly savings totals
  • Pricing is a static estimate (us-central1, on-demand); details["pricing_scope"] = "us-central1_reference" and details["pricing_note"] indicate the reference region and whether the job's actual region may differ significantly
  • details["pricing_confidence"] is "published" when all prices come from GCP pricing pages, or "partial_estimate" for newer machine families (a3-megagpu, a4-, g4-, ct5p-, ct6e-, tpu7x-*) where rates are estimated

Cost estimates (per node, us-central1, on-demand):

Machine Type ~Hourly cost Notes
n1-standard-8 + T4 ~$0.80/hr GPU cost additive
n1-standard-8 + V100 ~$2.27/hr GPU cost additive
a2-highgpu-1g (A100 40GB) ~$4.02/hr GPU bundled
a2-highgpu-8g (8× A100 40GB) ~$32.14/hr GPU bundled
a3-highgpu-8g (8× H100 80GB) ~$80.00/hr GPU bundled [est]
g2-standard-8 (L4) ~$1.45/hr GPU bundled
ct5lp-hightpu-8t (8× TPU v5e) ~$9.60/hr TPU bundled

What it does not check:

  • Intentional long-running distributed training (LLM pre-training, large fine-tunes)
  • Checkpoint saving — job may be making progress without visible status updates
  • Committed use discounts — actual cost may be significantly lower than on-demand estimate
  • Preemptible/Spot workers — cost and interruption semantics differ
  • Co-scheduling for g2-standard-32 — GPU count is ambiguous in GCP docs; that machine type uses full-price-per-replica as a conservative fallback

Required permissions:

  • aiplatform.customJobs.list (included in roles/aiplatform.viewer)
  • aiplatform.trainingPipelines.list (included in roles/aiplatform.viewer)

Idle Cloud TPU Nodes

Rule ID: gcp.tpu.idle

What it detects: Cloud TPU nodes in READY state with near-zero utilization for 7+ days. A READY TPU node incurs compute charges continuously, regardless of whether any workload is running. Forgotten TPU nodes left running after a training job completes are a common source of runaway cost.

Confidence:

  • HIGH: Cloud Monitoring reports max tpu.googleapis.com/node/accelerator/duty_cycle ≤ 2% across all workers over the idle window (7 days by default) — the TPU was genuinely not executing any workload
  • LOW: Monitoring data unavailable; node exists for ≥ idle_days with no observed activity — existence duration is not a reliable idle proxy (node may still be in active use)

Risk:

Confidence Hourly cost Risk
HIGH ≥ $10/hr CRITICAL
HIGH < $10/hr HIGH
LOW Any MEDIUM

Why this matters:

  • TPU nodes bill from the moment they reach READY state, regardless of utilization
  • Unlike GPU instances, Cloud TPU nodes have no automatic stop after a job completes — they must be explicitly deleted
  • An idle v4 node (4 chips, 2x2x1 topology) costs ~$12.88/hr; a v5p-8 costs ~$33.60/hr; a forgotten large pod runs up thousands per day

Detection logic:

# List all READY TPU nodes via Cloud TPU v2 REST API (locations/- wildcard)
for node in tpu.projects.locations.nodes.list(project, location="-"):
    if node.state != "READY":
        continue
    age = age_days(node.createTime)
    if age < idle_days:
        continue  # too young — enforce minimum observation window
    # Check Cloud Monitoring for near-zero duty_cycle
    duty_cycle = max_duty_cycle(node.id, window=idle_days)
    if duty_cycle is not None:
        idle = duty_cycle <= 0.02  # HIGH confidence
    else:
        idle = True  # LOW confidence — age-based heuristic, utilization unknown

Cost estimates (us-central1, on-demand):

TPU Type Chips ~Hourly cost Notes
v2-8 8 $12.00/hr $1.50/chip-hr, published
v3-8 8 $17.60/hr $2.20/chip-hr (device); v3 pod is $2.00/chip-hr
v4 (2x2x1) 4 $12.88/hr $3.22/chip-hr, published
v4 (2x2x2) 8 $25.76/hr $3.22/chip-hr, published
v5e (litepod-4) 4 $4.80/hr $1.20/chip-hr, published
v5e (litepod-8) 8 $9.60/hr $1.20/chip-hr, published
v5p-4 4 $16.80/hr $4.20/chip-hr, published
v5p-8 8 $33.60/hr $4.20/chip-hr, published
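The chip arithmetic behind this table is a product over the topology string times the per-chip rate:

```python
from math import prod

def tpu_hourly_cost(topology, per_chip_rate):
    # "2x2x2" -> 2 * 2 * 2 = 8 chips; cost scales linearly with chip count.
    chips = prod(int(d) for d in topology.split("x"))
    return chips * per_chip_rate
```

For v4 with a 2x2x2 topology at the published $3.22/chip-hr, 8 × 3.22 = $25.76/hr, matching the table row above.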

What it does not check:

  • Batch or scheduled jobs that run intermittently (the 7-day window may miss a recent burst)
  • Preemptible TPU nodes — may have been interrupted and not yet restarted intentionally
  • Committed use discounts — actual cost may be significantly lower
  • Nodes shared across teams where utilization is tracked externally

Required permissions:

  • tpu.nodes.list (included in roles/tpu.viewer)
  • monitoring.timeSeries.list (included in roles/monitoring.viewer) — optional; falls back to age-based detection if absent

Idle Vertex AI Feature Store Online Stores

Rule ID: gcp.vertex.featurestore.idle

What it detects: Vertex AI Feature Store online stores that have received zero online serving requests for 30+ days while remaining in STABLE state. This covers both legacy featurestores (with fixedNodeCount > 0, or autoscaled via scaling.minNodeCount) and new-generation featureOnlineStores (Bigtable-backed or Optimized). Legacy featurestores and Bigtable-backed online stores incur continuous Bigtable compute charges; Optimized stores incur storage and query compute charges. Feature stores are frequently left running after a model or recommendation system is retired.

Confidence:

  • HIGH: Cloud Monitoring confirms zero online_serving/request_count over the 30-day window — the store had no ReadFeatureValues (or equivalent) requests at all
  • LOW: Monitoring data unavailable; store has been in STABLE state for ≥ 30 days — heuristic: age only, request activity unknown

Risk:

Confidence Risk
HIGH HIGH
LOW MEDIUM

Why this matters:

  • Legacy featurestores with fixedNodeCount > 0 bill ~$0.27/node-hour (us-central1, SSD-backed Bigtable) continuously — a 1-node store costs ~$197/month, a 3-node HA store costs ~$591/month
  • New-generation featureOnlineStores (Bigtable-backed) have similar per-node costs via autoScaling.minNodeCount
  • Optimized (BigQuery-backed) featureOnlineStores have lower base cost but still incur storage and query charges
  • These stores are often provisioned during model development and forgotten after the serving layer is replaced

Detection logic:

# Legacy featurestores with online serving configured (fixed or autoscaled)
for store in vertex_ai.featurestores(project, locations="-"):
    config = store.onlineServingConfig
    if config.fixedNodeCount == 0 and config.scaling.minNodeCount == 0:
        continue  # no online serving cost
    requests = monitoring.sum("featurestore/online_serving/request_count", window=30d)
    if requests is not None:
        if requests == 0:
            flag()  # HIGH confidence
    elif age_days >= 30:
        flag()  # LOW confidence — age heuristic, request activity unknown

# New featureOnlineStores (Bigtable or Optimized)
for store in vertex_ai.featureOnlineStores(project, locations="-"):
    requests = monitoring.sum("featureonlinestore/online_serving/request_count", window=30d)
    if requests is not None:
        if requests == 0:
            flag()  # HIGH confidence
    elif age_days >= 30:
        flag()  # LOW confidence — age heuristic, request activity unknown

Cost estimates (us-central1, on-demand):

Store type Config ~Monthly cost
Legacy featurestore 1 Bigtable node ~$197/mo
Legacy featurestore 3 Bigtable nodes (HA) ~$591/mo
Feature Online Store 1 Bigtable node (min) ~$197/mo
Feature Online Store 3 Bigtable nodes (min) ~$591/mo
Feature Online Store Optimized (BigQuery) ~$100+/mo [est]

What it does not check:

  • Periodic or low-frequency batch workflows querying less often than the 30-day window
  • Feature stores used by scheduled pipelines (e.g. weekly batch inference)
  • Committed use discounts — actual cost may be lower
  • Stores intentionally kept warm for latency-sensitive cold-start mitigation

Required permissions:

  • aiplatform.featurestores.list (included in roles/aiplatform.viewer)
  • aiplatform.featureOnlineStores.list (included in roles/aiplatform.viewer)
  • monitoring.timeSeries.list (included in roles/monitoring.viewer) — optional; falls back to age-based detection if absent

Rule Stability Guarantee

Once a rule reaches production status:

  • Rule ID remains stable
  • Confidence semantics unchanged
  • Backwards compatibility preserved
  • Schema additions only (no breaking changes)

This guarantees trust for long-running CI/CD integrations.


Coming Soon

AI/ML (all providers):

  • Orphaned SageMaker training artifacts in S3 (AWS)

AWS:

  • S3 lifecycle gaps, Redshift idle, NAT Gateway routing waste

Azure:

  • Azure Firewall idle, AKS node pool idle, Azure Batch unused pools

GCP:

  • GKE node pool idle, BigQuery slot waste, GCS cold storage, Cloud Run idle revisions

Multi-Cloud:

  • Rule filtering (--rules flag)
  • Policy-as-code (cleancloud.yaml)

Next: AWS Setup → | Azure Setup → | GCP Setup → | CI/CD Integration →