CleanCloud Rules

Complete reference for all 45 rules implemented by CleanCloud (30 hygiene + 15 AI/ML).


Design Principles

All CleanCloud rules follow these principles:

1. Read-Only Always

  • Uses read-only cloud APIs exclusively
  • No Delete*, Modify*, Tag*, or Update* operations
  • Safe for production environments

2. Conservative by Default

  • Multiple signals preferred over single indicators
  • Age-based thresholds prevent false positives on temporary resources
  • Prefer false negatives over false positives

3. Explicit Confidence Levels

Every finding includes a confidence level:

  • HIGH - Multiple strong signals, very likely orphaned
  • MEDIUM - Moderate signals, worth reviewing
  • LOW - Weak signals, informational only

4. Review-Only Recommendations

  • Findings are candidates for human review, not automated action
  • Clear reasoning provided for each finding
  • No rule should justify deletion on its own

Quick Reference

AWS:

Rule ID Cost Surface What It Detects
aws.ec2.instance.stopped Compute EC2 instances stopped 30+ days (EBS charges continue)
aws.ec2.security_group.unused Governance Security groups with no ENI associations
aws.ebs.unattached Storage EBS volumes not attached to any instance
aws.ebs.snapshot.old Storage Snapshots ≥ 90 days old
aws.ec2.ami.old Storage AMIs older than 180 days
aws.ec2.elastic_ip.unattached Network Elastic IPs not currently associated with any instance or network interface
aws.ec2.eni.detached Network Detached ENIs not currently attached
aws.ec2.nat_gateway.idle Network NAT Gateways with zero traffic 14+ days
aws.elbv2.alb.idle / aws.elbv2.nlb.idle / aws.elb.clb.idle Network Load balancers with zero traffic 14+ days
aws.rds.instance.idle Platform RDS instances with zero connections 14+ days
aws.rds.snapshot.old Storage Manual RDS snapshots older than 90 days
aws.cloudwatch.logs.infinite_retention Observability Log groups with no retention policy
aws.resource.untagged Governance EC2/S3/CloudWatch resources with zero tags
aws.sagemaker.endpoint.idle AI/ML Real-time SageMaker endpoints InService with no observed InvokeEndpoint traffic across billable production variants for 14+ days (opt-in: --category ai)
aws.sagemaker.notebook.idle AI/ML SageMaker Notebook Instances InService with stale control-plane timestamps for 14+ days (opt-in: --category ai)
aws.ec2.gpu.idle AI/ML EC2 GPU/accelerator instances (p/g/trn/inf/dl families) running with <5% GPU or <10% CPU utilisation over 7 days (opt-in: --category ai)
aws.bedrock.provisioned_throughput.idle AI/ML Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days — bills per MU per hour regardless of traffic (opt-in: --category ai)
aws.sagemaker.studio_app.idle AI/ML SageMaker Studio KernelGateway/JupyterLab/CodeEditor apps InService with no usable recent activity signal for 7+ days (opt-in: --category ai)
aws.sagemaker.training_job.long_running AI/ML SageMaker training jobs still InProgress beyond the configured threshold (default 24h), using TrainingStartTime when present else CreationTime (opt-in: --category ai)

Azure:

Rule ID Cost Surface What It Detects
azure.vm.stopped_not_deallocated Compute Stopped but not deallocated VMs (full charges)
azure.compute.disk.unattached Storage Managed disks not attached to any VM
azure.compute.snapshot.old Storage Snapshots older than 30–90 days
azure.network.public_ip.unused Network Public IPs not attached to any interface
azure.load_balancer.no_backends Network Standard LBs with zero backend members
azure.application_gateway.no_backends Network App Gateways with zero backend targets
azure.virtual_network_gateway.idle Network VPN/ExpressRoute Gateways with no connections
azure.app_service_plan.empty Platform Paid App Service Plans with zero apps
azure.app_service.idle Platform App Services with zero HTTP requests 14+ days
azure.sql.database.idle Platform Azure SQL databases with zero connections 14+ days
azure.container_registry.unused Platform Container registries with no pulls 90+ days
azure.resource.untagged Governance Disks and snapshots with zero tags
azure.aml.compute.idle AI/ML AML compute clusters with min_node_count > 0 and no active nodes 14+ days (opt-in: --category ai)
azure.ml.compute_instance.idle AI/ML Azure ML Compute Instances Running with no control-plane activity 14+ days (opt-in: --category ai)
azure.ml.online_endpoint.idle AI/ML Azure ML managed online endpoints in Succeeded provisioning state with zero scoring requests for 7+ days (opt-in: --category ai)
azure.ai_search.idle AI/ML Azure AI Search services (Standard tier+) with zero search queries for 30+ days (opt-in: --category ai)
azure.openai.provisioned_deployment.idle AI/ML Azure OpenAI provisioned deployments (PTUs) with zero API requests for 7+ days (default, configurable) (opt-in: --category ai)

GCP:

Rule ID Cost Surface What It Detects
gcp.compute.vm.stopped Compute TERMINATED VM instances stopped 30+ days (disk charges continue)
gcp.compute.disk.unattached Storage Persistent Disks in READY state with no attached VM
gcp.compute.snapshot.old Storage Disk snapshots older than 90 days
gcp.compute.ip.unused Network Reserved static IPs (regional and global) in RESERVED state
gcp.sql.instance.idle Platform Cloud SQL instances with zero connections for 14+ days
gcp.vertex.endpoint.idle AI/ML Vertex AI Online Prediction endpoints with dedicated capacity and zero predictions for 14+ days (--category ai)
gcp.vertex.workbench.idle AI/ML Vertex AI Workbench instances ACTIVE with no control-plane activity for 14+ days (--category ai)
gcp.vertex.training_job.long_running AI/ML Vertex AI CustomJobs and TrainingPipelines in RUNNING state beyond 24h threshold; GPU/TPU/expensive-CPU early warning at 90% of threshold — hung or runaway jobs on GPU-backed machines cost $4–$80+/hr per node (opt-in: --category ai)
gcp.tpu.idle AI/ML Cloud TPU nodes in READY state with near-zero utilization (duty_cycle ≤ 2%) for 7+ days — idle TPU v4 costs ~$12.88/hr, v5p can exceed $33/hr (opt-in: --category ai)
gcp.vertex.featurestore.idle AI/ML Vertex AI Feature Store online stores (legacy and new-gen) with zero ReadFeatureValues requests for 30+ days — Bigtable-backed stores bill ~$197/node/month regardless of utilization (opt-in: --category ai)

AWS Rules

Compute Waste

Stopped EC2 Instances

Rule ID: aws.ec2.instance.stopped

What it detects: EC2 instances in 'stopped' state for 30+ days

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Stop time from CloudTrail LookupEvents ≥ 30 days ago (deterministic timestamp)
  • Not flagged: no CloudTrail stop event found or stopped < 30 days ago

Risk: MEDIUM

Why this matters:

  • Stopped EC2 instances do not charge for compute — but every attached EBS volume continues to accrue storage charges at ~$0.10/GB-month, regardless of instance state
  • A 500 GB root + data volume on a forgotten stopped instance costs ~$50/month indefinitely
  • Any associated Elastic IPs continue to charge ~$0.005/hour while unattached
  • Stopped instances are the most common form of "I meant to clean that up" infrastructure debt

Detection logic:

for instance in describe_instances(state=stopped):
    stop_event = cloudtrail_lookup_events(EventName="StopInstances", instance_id=instance.id)
    # Uses latest StopInstances event after most recent StartInstances (restart-cycle aware)
    if stop_event and (now - stop_event.eventTime).days >= 30:
        confidence = "HIGH"  # Deterministic CloudTrail timestamp, not a heuristic
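
The age check above can be exercised as a small pure function. The name `classify_stopped_instance` and its shape are illustrative, not CleanCloud's actual code; `stop_event_time` stands in for the CloudTrail StopInstances event timestamp.

```python
from datetime import datetime, timedelta, timezone

def classify_stopped_instance(stop_event_time, now=None, threshold_days=30):
    """Return "HIGH" when the CloudTrail stop timestamp is old enough to flag,
    or None when the instance is not flagged (illustrative sketch)."""
    if stop_event_time is None:
        return None  # No StopInstances event found in CloudTrail -> not flagged
    now = now or datetime.now(timezone.utc)
    if (now - stop_event_time).days >= threshold_days:
        return "HIGH"  # Deterministic timestamp, not a heuristic
    return None

# Example: an instance stopped 45 days ago is flagged at HIGH confidence
ref = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(classify_stopped_instance(ref - timedelta(days=45), now=ref))  # HIGH
```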

Cost estimates:

  • Based on total attached EBS storage × $0.10/GB-month
  • Example: 2 × 100 GB volumes = ~$20/month in ongoing storage charges
  • Additional Elastic IP charges are tracked separately by the aws.ec2.elastic_ip.unattached rule

Common causes:

  • Test or dev instances left stopped after a project ended
  • Migration source instances never terminated after cutover
  • Incident response boxes started and never cleaned up
  • Autoscaling warm pools drained but not terminated

Required permissions:

  • ec2:DescribeInstances
  • ec2:DescribeVolumes
  • cloudtrail:LookupEvents

Unused Security Groups

Rule ID: aws.ec2.security_group.unused

What it detects: Security groups not associated with any network interface

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: No ENI associations found (service-managed groups may appear unused between deployments)

Risk: LOW

Why this matters:

  • Security groups with no ENI associations are pure governance debt
  • Each unused group widens the blast radius if a misconfiguration is later introduced
  • Compliance audits (SOC 2, ISO 27001, PCI DSS) flag unused security groups as a control failure
  • In accounts with hundreds of groups, unused ones obscure the real security posture and add friction to every access review
  • Cost is indirect but real: engineer time spent auditing and explaining phantom groups in compliance reviews

Detection logic:

in_use_sg_ids = {
    group["GroupId"]
    for eni in describe_network_interfaces()
    for group in eni["Groups"]
}
for sg in describe_security_groups():
    if sg.name != "default" and sg.id not in in_use_sg_ids:
        confidence = "MEDIUM"
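
A self-contained sketch of the same set logic, operating on plain dicts shaped like the DescribeSecurityGroups and DescribeNetworkInterfaces responses (the function name is illustrative):

```python
def find_unused_security_groups(security_groups, network_interfaces):
    """Return IDs of non-default security groups with no ENI association
    (illustrative re-implementation over plain dicts)."""
    in_use_sg_ids = {
        group["GroupId"]
        for eni in network_interfaces
        for group in eni["Groups"]
    }
    return [
        sg["GroupId"]
        for sg in security_groups
        if sg["GroupName"] != "default" and sg["GroupId"] not in in_use_sg_ids
    ]

sgs = [
    {"GroupId": "sg-1", "GroupName": "default"},
    {"GroupId": "sg-2", "GroupName": "web"},
    {"GroupId": "sg-3", "GroupName": "old-test"},
]
enis = [{"Groups": [{"GroupId": "sg-2"}]}]
print(find_unused_security_groups(sgs, enis))  # ['sg-3']
```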

Exclusions:

  • default security groups — AWS prevents deletion of the default group; flagging it is noise

Caveats:

  • A security group referenced only in another group's inbound rules (not attached to any ENI) will be flagged. This is intentional.
  • Service-managed groups (RDS, ELB, Lambda) may appear unused briefly between deployments. Review before deleting.

Common causes:

  • Leftover groups from deleted EC2 instances, RDS databases, or ELB deployments
  • Test stacks torn down without full cleanup
  • Groups created manually but never attached
  • CloudFormation stacks deleted leaving orphaned groups

Required permissions:

  • ec2:DescribeSecurityGroups
  • ec2:DescribeNetworkInterfaces

Storage Waste

Unattached EBS Volumes

Rule ID: aws.ebs.unattached

What it detects: EBS volumes not attached to any EC2 instance

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Volume in available state for ≥7 days (not attached to any instance)
  • Not flagged: < 7 days

Why this threshold:

  • Allows time for deployment cycles
  • Accounts for rollback windows
  • Reduces false positives from autoscaling

Common causes:

  • Volumes from terminated EC2 instances
  • Failed deployments or rollbacks
  • Autoscaling cleanup gaps

Required permission: ec2:DescribeVolumes


Old EBS Snapshots

Rule ID: aws.ebs.snapshot.old

What it detects: Snapshots ≥ 90 days old (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • LOW: Age ≥ 90 days (conservative — age alone is a weak signal)

Detection logic:

for snapshot in describe_snapshots(OwnerIds=["self"]):
    age_days = (now - snapshot.StartTime).days
    if age_days >= days_old:  # default 90
        confidence = "LOW"  # age alone is a weak signal
        risk = "LOW"

Limitations:

  • Snapshots linked to registered AMIs are excluded (avoids false positives)
  • Does NOT verify snapshot is unused (conservative approach)

Common causes:

  • Backup retention policies without lifecycle rules
  • Snapshots from deleted volumes
  • Over-retention without cleanup

Required permissions:

  • ec2:DescribeSnapshots
  • ec2:DescribeSnapshotAttribute

Old AMIs

Rule ID: aws.ec2.ami.old

What it detects: AMIs (Amazon Machine Images) older than 180 days (default threshold)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Age ≥ 180 days (AMI may still be actively used as template)

Why MEDIUM confidence:

  • Age alone is a moderate signal
  • AMI may be a golden image still used for launches
  • Cannot check if AMI is referenced by launch templates or Auto Scaling groups

Why this matters:

  • AMIs have associated EBS snapshots that incur storage costs
  • Old unused AMIs accumulate over time
  • Storage costs are ~$0.05/GB-month

Detection logic:

for ami in describe_images(Owners=["self"]):
    age_days = (now - ami.creation_date).days
    if age_days >= days_old and ami.state == "available":  # days_old defaults to 180
        confidence = "MEDIUM"  # flag as old AMI

What gets checked:

  • AMI creation date
  • AMI state (only "available" AMIs are flagged)
  • Associated snapshot sizes for cost estimation

Common causes:

  • AMIs from old deployments
  • Test/dev AMIs no longer needed
  • Superseded golden images
  • AMIs from terminated projects

Cost estimates:

  • Based on total EBS snapshot storage
  • ~$0.05/GB-month for snapshot storage
  • Example: 100 GB AMI = ~$5/month

Required permission: ec2:DescribeImages


Network Waste

Unattached Elastic IPs

Rule ID: aws.ec2.elastic_ip.unattached

What it detects: Elastic IPs currently not associated with any instance or network interface

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Currently not associated (all four AWS association fields absent per DescribeAddresses)

Why this matters:

  • Unattached Elastic IPs incur small hourly charges
  • State is deterministic (no AssociationId, InstanceId, NetworkInterfaceId, or PrivateIpAddress means not attached)
  • Clear cost optimization signal with zero ambiguity

Detection logic:

if not any([eip.get("AssociationId"), eip.get("InstanceId"),
            eip.get("NetworkInterfaceId"), eip.get("PrivateIpAddress")]):
    confidence = "HIGH"  # Deterministic state: not associated
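
The four-field check can be captured as a tiny helper over a DescribeAddresses entry (helper and constant names are illustrative):

```python
# The four association fields AWS returns for an attached Elastic IP
ASSOCIATION_FIELDS = ("AssociationId", "InstanceId",
                      "NetworkInterfaceId", "PrivateIpAddress")

def is_unattached_eip(address):
    """True when all four association fields are absent or empty in a
    DescribeAddresses entry (illustrative helper)."""
    return not any(address.get(field) for field in ASSOCIATION_FIELDS)
```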

Common causes:

  • Elastic IPs from terminated EC2 instances
  • Reserved IPs for DR that are no longer needed
  • Failed deployments leaving orphaned IPs
  • Manual allocation without attachment

Edge cases handled:

  • Classic EIPs without AllocationTime are annotated as is_classic: true in details
  • Detection is purely state-based — no age threshold is applied

Required permission: ec2:DescribeAddresses


Detached Network Interfaces (ENIs)

Rule ID: aws.ec2.eni.detached

What it detects: Elastic Network Interfaces (ENIs) currently not attached (Status=available)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Currently not attached — no temporal threshold; Status=available is the sole eligibility signal

Why this matters:

  • Detached ENIs incur small hourly charges
  • Often forgotten after failed deployments or incomplete teardowns
  • Clear signal with minimal ambiguity

Detection logic:

if eni['Status'] == 'available':  # Currently detached
    confidence = "HIGH"  # Deterministic state: not attached

What gets flagged:

  • User-created ENIs (InterfaceType='interface')
  • Lambda/ECS/RDS ENIs (RequesterManaged=true but YOUR resources!) - explicitly annotated in evidence and details
  • Detached ENIs from deleted services

Key insight: RequesterManaged=true means "AWS created this in YOUR VPC for YOUR resource" — these ARE your responsibility and are often waste. RequesterManaged ENIs are included in findings with an explicit evidence signal and requester_managed: true in details for downstream filtering.

Common causes:

  • Failed EC2 instance launches
  • Incomplete infrastructure teardown
  • Terminated instances with retained ENIs
  • Forgotten manual ENI creations

Required permission: ec2:DescribeNetworkInterfaces


Idle NAT Gateways

Rule ID: aws.ec2.nat_gateway.idle

What it detects: NAT Gateways with zero traffic for 14+ days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: No traffic detected for 14+ days (CloudWatch metrics checked, but seasonal patterns not verified)

Why MEDIUM confidence:

  • Zero traffic is a strong signal, but gateway may be for DR/standby
  • Cannot verify planned future usage or blue/green deployments
  • Seasonal traffic patterns not checked

Why this matters:

  • NAT Gateways cost $0.045/hour + $0.045/GB data processing ($32/month base)
  • Idle gateways are a clear cost optimization signal
  • Common after VPC restructuring or service migrations

Detection logic:

for gw in describe_nat_gateways():
    if gw.state == "available" and age >= idle_threshold_days:
        # All 5 metrics must return datapoints and all must be zero;
        # if any metric has no datapoints, the item is skipped
        for metric in required_metrics:
            value = get_metric(metric, period=idle_threshold_days)
            if value is None:
                skip  # Missing data is NOT treated as zero traffic
            if value > 0:
                skip  # Active traffic detected
        confidence = "MEDIUM"  # Zero traffic across all 5 metrics

CloudWatch metrics checked:

  • AWS/NATGateway -> BytesOutToDestination (daily sum)
  • AWS/NATGateway -> BytesInFromSource (daily sum)
  • AWS/NATGateway -> BytesInFromDestination (daily sum)
  • AWS/NATGateway -> BytesOutToSource (daily sum)
  • AWS/NATGateway -> ActiveConnectionCount (daily sum)

Note: If any metric has no data for the period (e.g. newly created gateway), the item is skipped — missing data is NOT treated as zero traffic.
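
Putting the skip rules together, a sketch of the decision over already-fetched metric sums. The metric names match the AWS/NATGateway namespace; the function itself is illustrative and assigns the MEDIUM confidence documented for this rule.

```python
REQUIRED_METRICS = (
    "BytesOutToDestination", "BytesInFromSource", "BytesInFromDestination",
    "BytesOutToSource", "ActiveConnectionCount",
)

def classify_nat_gateway(metric_sums):
    """metric_sums maps each AWS/NATGateway metric to its summed value over
    the idle window, or None when CloudWatch returned no datapoints.
    Returns the finding confidence or None (skipped). Illustrative sketch."""
    values = [metric_sums.get(m) for m in REQUIRED_METRICS]
    if any(v is None for v in values):
        return None  # Missing data is NOT treated as zero traffic
    if any(v > 0 for v in values):
        return None  # Active traffic detected
    return "MEDIUM"
```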

Common causes:

  • VPC restructuring leaving orphaned NAT Gateways
  • Service migrations to different subnets/VPCs
  • Dev/staging environments with no active workloads
  • DR standby gateways (intentional, but worth reviewing)

Cost estimates:

  • ~$32/month base cost per idle NAT Gateway
  • Additional $0.045/GB data processing when active

Required permissions:

  • ec2:DescribeNatGateways
  • cloudwatch:GetMetricStatistics

Idle Elastic Load Balancers (ALB/CLB/NLB)

Rule IDs:

  • aws.elbv2.alb.idle — Application Load Balancer
  • aws.elbv2.nlb.idle — Network Load Balancer
  • aws.elb.clb.idle — Classic Load Balancer

What it detects: Load balancers with zero traffic for 14+ days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Zero traffic AND no registered targets/instances
  • MEDIUM: Zero traffic only (targets/instances may still be registered)

Risk: MEDIUM

Why this matters:

  • ELBs incur base hourly charges regardless of traffic (~$16-22/month)
  • Idle load balancers are a clear cost optimization signal
  • Common after service migrations or decommissions

Detection logic:

# ALB/NLB (elbv2)
for lb in describe_load_balancers():
    if age >= idle_threshold_days:
        traffic = get_metric(RequestCount or NewFlowCount, period=idle_threshold_days)
        has_targets = check_target_groups(lb)
        if traffic == 0:
            confidence = "HIGH" if not has_targets else "MEDIUM"

# CLB (elb)
for lb in describe_load_balancers():
    if age >= idle_threshold_days:
        traffic = get_metric(RequestCount, period=idle_threshold_days)
        has_instances = len(lb.instances) > 0
        if traffic == 0:
            confidence = "HIGH" if not has_instances else "MEDIUM"
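
The confidence split can be sketched as a single helper shared by all three load balancer types (illustrative; the treatment of missing datapoints is an assumption, since the text does not spell that case out):

```python
def classify_load_balancer(traffic_sum, has_registered_targets):
    """traffic_sum: summed RequestCount (ALB/CLB) or NewFlowCount (NLB) over
    the idle window; None means CloudWatch returned no datapoints, which is
    treated here as "no usable signal" (an assumption). Returns the finding
    confidence or None (not flagged). Illustrative sketch."""
    if traffic_sum is None or traffic_sum > 0:
        return None  # No usable signal, or real traffic observed
    return "MEDIUM" if has_registered_targets else "HIGH"
```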

CloudWatch metrics checked:

  • AWS/ApplicationELB -> RequestCount (ALB, daily sum)
  • AWS/NetworkELB -> NewFlowCount (NLB, daily sum)
  • AWS/ELB -> RequestCount (CLB, daily sum)

Exclusions:

  • LBs younger than the idle threshold

Common causes:

  • Service migrations leaving orphaned load balancers
  • Dev/staging environments with no active workloads
  • Decommissioned applications with retained infrastructure
  • Blue/green deployments with stale LBs

Cost estimates:

  • ~$16-22/month base cost per idle load balancer (region dependent)

Required permissions:

  • elasticloadbalancing:DescribeLoadBalancers
  • elasticloadbalancing:DescribeTargetGroups
  • elasticloadbalancing:DescribeTargetHealth
  • cloudwatch:GetMetricStatistics

Platform Waste

Idle RDS Instances

Rule ID: aws.rds.instance.idle

What it detects: RDS instances with zero database connections for 14+ days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Zero connections for 14+ days (CloudWatch metrics checked, strong but not conclusive signal)

Why MEDIUM confidence:

  • Zero database connections is a strong signal of non-use, but cannot rule out Aurora-style architectures or scheduled workloads that connect infrequently
  • Connection pools and proxies (RDS Proxy, PgBouncer) can hide real usage while keeping observed client connection counts low or zero

Risk: MEDIUM

Why MEDIUM risk:

  • RDS instances are among the more expensive AWS resources, but zero connections alone does not confirm the instance is safe to delete

Why this matters:

  • RDS instances incur hourly charges regardless of usage
  • Idle instances with no connections are a clear cost optimization signal
  • Common after application migrations or decommissions

Detection logic:

for instance in describe_db_instances():
    if instance.status == "available" and age >= idle_threshold_days:
        if not instance.read_replica_source:  # Skip read replicas
            connections_max = get_metric(DatabaseConnections, statistic="Maximum", period=idle_threshold_days)
            if connections_max == 0:
                confidence = "MEDIUM"
                risk = "MEDIUM"

CloudWatch metrics checked:

  • AWS/RDS -> DatabaseConnections (Maximum statistic)

Exclusions:

  • Aurora cluster members (DBClusterIdentifier set) — Aurora instances are managed at cluster level and may show zero connections individually even when the cluster is active
  • Read replicas (ReadReplicaSourceDBInstanceIdentifier set)
  • Instances younger than the idle threshold

Common causes:

  • Applications migrated to different databases
  • Dev/staging instances left running
  • Decommissioned services with retained databases
  • Test databases no longer needed

Required permissions:

  • rds:DescribeDBInstances
  • cloudwatch:GetMetricStatistics

Old Manual RDS Snapshots

Rule ID: aws.rds.snapshot.old

What it detects: Manual RDS snapshots older than 90 days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • LOW: Snapshot age is known and exceeds threshold (age alone is a weak signal)

Risk: LOW

Why this matters:

  • Manual RDS snapshots are retained indefinitely until explicitly deleted
  • Storage charges accrue at ~$0.095/GB-month regardless of whether the source DB still exists
  • Snapshots older than 90 days are rarely needed for active recovery

Detection logic:

for snapshot in describe_db_snapshots(SnapshotType="manual"):
    if snapshot.status == "available":
        age_days = (now - snapshot.create_time).days
        if age_days >= days_old:
            confidence = "LOW"
            risk = "LOW"

Exclusions:

  • Automated snapshots (SnapshotType=automated) — managed by RDS retention policy, auto-deleted
  • Snapshots in non-available states

Common causes:

  • Pre-migration snapshots never cleaned up
  • Manual backups taken before schema changes and forgotten
  • Snapshots of deleted databases retained for compliance but past their useful life

Cost estimate: ~$0.095/GB-month based on AllocatedStorage (the provisioned DB size). RDS snapshots are incremental so actual storage used may be lower — treat this as a ceiling estimate, not an exact figure.
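
The ceiling estimate reduces to one multiplication (helper name is illustrative; the rate is the ~us-east-1 figure used above):

```python
def rds_snapshot_cost_ceiling(allocated_storage_gb, rate_per_gb_month=0.095):
    """Monthly ceiling estimate from AllocatedStorage. Snapshots are
    incremental, so actual billed storage can be much lower (illustrative)."""
    return round(allocated_storage_gb * rate_per_gb_month, 2)

# Example: a snapshot of a 200 GB instance is at most ~$19/month
print(rds_snapshot_cost_ceiling(200))  # 19.0
```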

Required permissions:

  • rds:DescribeDBSnapshots
  • rds:DescribeDBSnapshotAttributes

Observability Waste

CloudWatch Log Groups (Infinite Retention)

Rule ID: aws.cloudwatch.logs.infinite_retention

What it detects: Log groups with no retention policy (never expires)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: No retention policy configured (directly observable configuration fact)

Risk tiers:

  • HIGH: Log group has ≥1 GB stored bytes (significant ongoing cost)
  • MEDIUM: Log group has >0 stored bytes
  • LOW: Log group has 0 stored bytes (still flagged — retention should be set regardless)
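
The tiers map directly onto a log group's storedBytes (illustrative helper; treating the "1 GB" cutoff as 2**30 bytes is an assumption):

```python
GIB = 1024 ** 3  # "1 GB" cutoff interpreted as 2**30 bytes (assumption)

def log_group_risk(stored_bytes):
    """Map a log group's storedBytes to the rule's risk tier (illustrative)."""
    if stored_bytes >= GIB:
        return "HIGH"
    if stored_bytes > 0:
        return "MEDIUM"
    return "LOW"  # Still flagged: retention should be set regardless
```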

Why this matters:

  • Logs grow indefinitely without retention
  • Can reach GBs/TBs over months
  • Often forgotten after service decommission

Common causes:

  • Default CloudFormation behavior (no retention)
  • Manual log group creation
  • Missing lifecycle policies

Required permission: logs:DescribeLogGroups


Governance

Untagged Resources

Rule ID: aws.resource.untagged

What it detects: Resources with zero tags

Resources checked:

  • EBS volumes
  • S3 buckets
  • CloudWatch log groups

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Zero tags (directly observable fact from authoritative tag source)

Why this matters:

  • Ownership ambiguity
  • Compliance violations (SOC2, ISO27001)
  • Cleanup decision paralysis

Required permissions:

  • ec2:DescribeVolumes
  • s3:ListAllMyBuckets
  • s3:GetBucketTagging
  • logs:DescribeLogGroups
  • logs:ListTagsForResource

AI/ML Waste

Idle SageMaker Endpoints

Rule ID: aws.sagemaker.endpoint.idle

Category: ai

What it detects: Real-time SageMaker endpoints in InService state with no observed InvokeEndpoint traffic across billable production variants for 14+ days (default, configurable). Async endpoints are excluded. Serverless variants without current provisioned concurrency are not treated as continuous idle-cost candidates.

Confidence:

  • HIGH: All evaluated billable variants returned datapoints and zero summed invocations over the observation window
  • MEDIUM: At least one evaluated billable variant returned no CloudWatch datapoints, but no billable variant showed positive invocation traffic
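
A sketch of how the two confidence levels fall out of per-variant invocation sums (illustrative, not CleanCloud's actual code):

```python
def classify_endpoint(variant_invocations):
    """variant_invocations maps each billable production variant to its summed
    AWS/SageMaker Invocations over the window, or None when CloudWatch
    returned no datapoints. Returns the finding confidence or None (not
    idle). Illustrative sketch."""
    values = list(variant_invocations.values())
    if any(v is not None and v > 0 for v in values):
        return None  # Positive traffic on a billable variant -> not idle
    if all(v is not None for v in values):
        return "HIGH"  # Every variant confirmed at zero invocations
    return "MEDIUM"  # Some variant had no datapoints, none showed traffic
```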

Risk:

  • HIGH: Any billable variant is accelerator-backed (ml.g*, ml.p*, ml.inf*, ml.trn*)
  • MEDIUM: All billable variants are CPU-backed

Why this matters:

  • SageMaker endpoints accrue charges continuously while InService, regardless of traffic
  • Endpoints deployed for experiments or demos are frequently abandoned after initial testing
  • Multi-variant endpoints multiply the cost per variant

Detection signal:

  • Inventory comes from ListEndpoints(StatusEquals="InService")
  • Runtime variants come from DescribeEndpoint.ProductionVariants
  • Async inference is excluded via DescribeEndpointConfig.AsyncInferenceConfig
  • Activity is evaluated from AWS/SageMaker Invocations using EndpointName + VariantName
  • estimated_monthly_cost_usd is intentionally left unset by this rule

Required permissions:

  • sagemaker:ListEndpoints
  • sagemaker:DescribeEndpoint
  • sagemaker:DescribeEndpointConfig
  • cloudwatch:GetMetricStatistics

Not run by default. AI/ML rules are opt-in to avoid surprising users who don't use these services. Run with cleancloud scan --provider aws --category ai (or --category all to combine with hygiene rules). Validate access first with cleancloud doctor --provider aws --category ai. Attach security/aws/ai-readonly.json to your IAM role to enable this rule.


Idle SageMaker Notebook Instances

Rule ID: aws.sagemaker.notebook.idle

Category: ai

What it detects: SageMaker Notebook Instances in InService state whose CreationTime and LastModifiedTime are both at least 14 days old (default, configurable). This is a conservative stale control-plane heuristic, not a direct notebook-usage signal.

Detection signal — why LastModifiedTime: SageMaker Notebook Instances publish no native notebook-session activity metric that this rule could consume. LastModifiedTime is the only canonical control-plane timestamp available, but it is a weak signal: it does not directly indicate Jupyter usage, kernel execution, or user access. The rule therefore emits only MEDIUM-confidence review candidates.

Confidence:

  • MEDIUM: notebook age and stale control-plane age both meet or exceed the configured threshold

Risk:

  • HIGH: GPU/accelerator-backed instance (ml.g4dn.*, ml.g5.*, ml.p3.*, ml.p4d.*, ml.p4de.*, ml.p5.*, Inferentia, Trainium)
  • MEDIUM: CPU-backed instance

Why this matters:

  • Notebook Instances bill continuously while InService, regardless of whether any kernels are running
  • Notebooks are commonly left running after a sprint ends, a project is deprioritised, or a team member leaves
  • Unlike endpoints, notebooks have no auto-scaling — they remain billable until explicitly stopped

Important scope note:

  • Stopped notebook instances are intentionally out of scope for this rule
  • Their retained storage cost should be handled by a separate storage / cost-waste rule
  • estimated_monthly_cost_usd is intentionally left unset by this rule

Required permissions:

  • sagemaker:ListNotebookInstances

Not run by default. AI/ML rules are opt-in to avoid surprising users who don't use these services. Run with cleancloud scan --provider aws --category ai (or --category all to combine with hygiene rules). Validate access first with cleancloud doctor --provider aws --category ai. Attach security/aws/ai-readonly.json to your IAM role to enable this rule.


Idle EC2 GPU Instances

Rule ID: aws.ec2.gpu.idle

Category: ai

What it detects: EC2 GPU and accelerator instances (p2/p3/p4/p5, g4dn/g4ad/g5/g5g/g6/g6e/gr6, trn1/trn2, inf1/inf2, dl1/dl2q families) in running state with low utilisation over 7+ days (default, configurable). Unlike SageMaker rules which target managed services, this rule catches raw GPU instances spun up directly for training, inference, or experimentation and left running after the job completes.

Detection uses two tiers based on metric availability:

  • GPU utilisation (HIGH confidence): When the NVIDIA CloudWatch agent is installed, nvidia_smi_utilization_gpu is read from the CWAgent namespace. MAX statistic across all GPU indices is used — a single active GPU on a multi-GPU instance (e.g., p4d.24xlarge with 8 A100s) will not be masked by averaging.
  • CPU utilisation fallback (MEDIUM confidence): When the NVIDIA agent is not installed, CPUUtilization from AWS/EC2 is used as a proxy signal. Neuron instances (Trainium/Inferentia) always use this path by design — they use the AWS Neuron SDK, not NVIDIA CUDA.

Confidence levels:

  • HIGH: GPU metric available AND max GPU utilisation < 5% over 7 days
  • MEDIUM: GPU metric unavailable; avg CPU utilisation < 10% over 7 days
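
The two-tier selection can be sketched as a single function (illustrative; thresholds match the rule's configurable defaults):

```python
def classify_gpu_instance(gpu_util_max, cpu_util_avg,
                          gpu_threshold=5.0, cpu_threshold=10.0):
    """gpu_util_max: MAX of nvidia_smi_utilization_gpu across GPU indices, or
    None when the NVIDIA CloudWatch agent is absent (Neuron instances always
    take the CPU fallback). Returns the finding confidence or None (not
    flagged). Illustrative sketch."""
    if gpu_util_max is not None:
        # GPU metric available: decide on it alone, no CPU fallback
        return "HIGH" if gpu_util_max < gpu_threshold else None
    if cpu_util_avg is not None and cpu_util_avg < cpu_threshold:
        return "MEDIUM"  # CPU proxy signal only
    return None
```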

Risk levels:

  • CRITICAL: idle_ratio ≥ 2.0 (e.g. running for 14+ days at the 7-day threshold)
  • HIGH: GPU/accelerator instance with low utilisation (all other cases)

Cost estimates (us-east-1 on-demand):

Instance Est. monthly cost
g4dn.xlarge (T4) $379
g5.xlarge (A10G) $604
p3.2xlarge (V100) $2,234
p4d.24xlarge (8× A100 40GB) $23,374
p4de.24xlarge (8× A100 80GB) $32,074
g6e.48xlarge (8× L40S) $18,000
p5.48xlarge (8× H100) $98,318
trn2.48xlarge (Trainium2) $110,000

Configurable parameters:

Parameter Default Description
idle_days 7 Days of low utilisation before flagging
gpu_threshold 5.0 Max GPU utilisation % (HIGH confidence path)
cpu_threshold 10.0 Max CPU utilisation % (MEDIUM confidence fallback)

Required permissions:

  • ec2:DescribeInstances
  • cloudwatch:GetMetricStatistics
  • cloudwatch:ListMetrics

Not run by default. Run with cleancloud scan --provider aws --category ai. Attach security/aws/ai-readonly.json to your IAM role to enable this rule. The NVIDIA CloudWatch agent is not required — instances without it fall back to CPU utilisation at MEDIUM confidence.


Idle Bedrock Provisioned Throughput

Rule ID: aws.bedrock.provisioned_throughput.idle

Category: ai

What it detects: AWS Bedrock Provisioned Throughput reservations (Model Units) in InService state with zero invocations over 7+ days (default, configurable). Provisioned Throughput reserves dedicated model capacity and bills per Model Unit per hour regardless of whether any inference requests are made — up to ~$7,300/MU/month for Claude 3 Opus on no-commitment pricing. A zero-invocation reservation is paying for capacity delivering zero value.

Confidence:

  • HIGH: Zero invocations confirmed for the full idle window (deployment age ≥ idle_days)

Risk:

  • HIGH: All provisioned throughput reservations (significant always-on spend)

Why this matters:

  • Provisioned Throughput bills per Model Unit per hour while InService, regardless of invocation count
  • Claude 3 Opus: ~$7,300/MU/month; Claude 3 Sonnet / 3.5 Sonnet: ~$2,600/MU/month; Claude 3 Haiku: ~$600/MU/month (no-commitment pricing — reserved terms are 25–60% lower but still significant)
  • Abandoned proof-of-concept and experiment reservations are common — teams switch to on-demand after initial testing but forget to delete the provisioned throughput

Cost estimates (per Model Unit, us-east-1, no-commitment):

Model family Monthly cost per MU
Claude 3 Opus ~$7,300
Claude 3 Sonnet / 3.5 Sonnet ~$2,600
Claude 3 Haiku / 3.5 Haiku ~$600
Meta Llama 3 ~$1,000

Multiply by desiredModelUnits for total monthly idle cost.

Configurable parameters:

Parameter Default Description
idle_days 7 Days of zero invocations before flagging

Required permissions:

  • bedrock:ListProvisionedModelThroughputs
  • cloudwatch:GetMetricStatistics

Not run by default. Run with cleancloud scan --provider aws --category ai. Attach security/aws/ai-readonly.json alongside base-readonly.json to your IAM role to enable this rule.
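The per-MU multiplication and the zero-invocation gate can be sketched as follows. The dictionary keys are illustrative family labels (not real Bedrock model IDs), the costs are the approximate no-commitment figures from the table above, and the function is a sketch rather than CleanCloud's implementation:

```python
# Approximate monthly cost per Model Unit (us-east-1, no-commitment).
# Keys are illustrative labels, not actual Bedrock model identifiers.
MONTHLY_COST_PER_MU = {
    "claude-3-opus": 7300,
    "claude-3-sonnet": 2600,
    "claude-3-haiku": 600,
    "llama-3": 1000,
}

def idle_pt_finding(model_family, desired_model_units, invocations, age_days,
                    idle_days=7):
    """Flag an InService reservation with zero invocations.

    Returns (confidence, estimated_monthly_cost_usd) or None."""
    if invocations != 0 or age_days < idle_days:
        return None  # traffic seen, or idle window not yet fully covered
    cost = MONTHLY_COST_PER_MU.get(model_family, 0) * desired_model_units
    return "HIGH", cost
```

A 2-MU Opus reservation that is 30 days old with zero invocations would be flagged at ~$14,600/month of idle spend.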


Idle SageMaker Studio Apps

Rule ID: aws.sagemaker.studio_app.idle

Category: ai

What it detects: SageMaker Studio apps of type KernelGateway, JupyterLab, or CodeEditor in InService state with no usable recent activity signal for 7+ days (default, configurable). Other app types, including JupyterServer, are excluded from evaluation.

Detection signal: LastUserActivityTimestamp from sagemaker:DescribeApp, but only when it is usable. AWS documents that health checks can also update LastUserActivityTimestamp; if it exactly matches LastHealthCheckTimestamp, the app is skipped and not treated as idle.

Confidence:

  • HIGH: usable_activity_signal = true and the last usable activity timestamp is at least the configured threshold old

Risk:

  • HIGH: GPU/accelerator instance (ml.g*, ml.p*, ml.inf*, ml.trn*)
  • MEDIUM: CPU instance

GPU families: ml.g4dn, ml.g5, ml.p2, ml.p3, ml.p4d, ml.p4de, ml.p5, ml.trn1, ml.inf1, ml.inf2

Why this matters:

  • Studio apps remain InService (and billing) until explicitly deleted — there is no auto-stop by default
  • KernelGateway, JupyterLab, and CodeEditor apps each launch a separate compute instance per user session or space
  • Teams frequently leave apps running after finishing a sprint, switching to a new space, or abandoning a project
  • estimated_monthly_cost_usd is intentionally left unset by this rule

Configurable parameters:

Parameter Default Description
idle_days_threshold 7 Days since the last usable activity timestamp before flagging

Required permissions:

  • sagemaker:ListApps
  • sagemaker:DescribeApp

Not run by default. Run with cleancloud scan --provider aws --category ai. Validate access first with cleancloud doctor --provider aws --category ai. Attach security/aws/ai-readonly.json alongside base-readonly.json to your IAM role to enable this rule.
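The health-check disambiguation described above can be sketched as a pair of helpers — a hedged illustration (function names are invented) of skipping apps whose LastUserActivityTimestamp exactly matches LastHealthCheckTimestamp:

```python
from datetime import datetime, timedelta, timezone

def usable_activity_timestamp(last_user_activity, last_health_check):
    """Return the activity timestamp only when it is usable as an idle
    signal. Health checks can refresh LastUserActivityTimestamp, so an
    exact match with LastHealthCheckTimestamp means the app is skipped."""
    if last_user_activity is None:
        return None
    if last_health_check is not None and last_user_activity == last_health_check:
        return None  # likely health-check noise, not a real user
    return last_user_activity

def is_idle(last_user_activity, last_health_check, now, idle_days_threshold=7):
    ts = usable_activity_timestamp(last_user_activity, last_health_check)
    if ts is None:
        return False  # no usable signal -> never flagged
    return (now - ts) >= timedelta(days=idle_days_threshold)
```

An app whose two timestamps are identical is never treated as idle, however old the timestamp is.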


Long-Running SageMaker Training Jobs

Rule ID: aws.sagemaker.training_job.long_running

Category: ai

What it detects: SageMaker training jobs still in InProgress beyond the configured threshold (default 24 hours). Runtime is measured from TrainingStartTime when present, otherwise from CreationTime.

Detection signal: Inventory is built by fully paginating ListTrainingJobs without relying on StatusEquals for completeness, then filtering TrainingJobStatus client-side. DescribeTrainingJob is used to confirm the current status, resolve the runtime anchor, and read StoppingCondition, EnableManagedSpotTraining, ResourceConfig, and optional heterogeneous InstanceGroups.

Confidence:

  • HIGH: elapsed runtime exceeds the applicable SageMaker stopping-condition limit (MaxWaitTimeInSeconds for managed Spot when present, otherwise MaxRuntimeInSeconds when TrainingStartTime is present)
  • MEDIUM: elapsed runtime meets the threshold but no applicable stopping-condition limit was exceeded (or no such limit is configured)

Risk:

  • HIGH: GPU/accelerator instance (ml.g*, ml.p*, ml.inf*, ml.trn*)
  • MEDIUM: Non-GPU/accelerator instance

GPU/accelerator families: ml.g4dn, ml.g5, ml.g6, ml.g6e, ml.g7, ml.p2, ml.p3, ml.p4d, ml.p4de, ml.p5, ml.p5en, ml.p6, ml.trn1, ml.trn2, ml.inf1, ml.inf2

Managed spot training: EnableManagedSpotTraining=true changes the effective wall-clock stopping limit. MaxRuntimeInSeconds counts only active compute time (not spot wait time) and is not a reliable wall-clock signal. For spot jobs the rule uses MaxWaitTimeInSeconds as the stopping limit; the summary and signals explicitly label which limit was exceeded.

Heterogeneous clusters: When ResourceConfig.InstanceGroups is present, accelerator detection is evaluated across the groups rather than inferred from a single primary instance type.

Why this matters:

  • Long-running distributed training can keep all workers running and billing while producing limited or no useful progress
  • Training jobs are not automatically stopped just because they are unusually long
  • estimated_monthly_cost_usd is intentionally omitted — this is a transient runtime review rule, not a monthly-cost rule

Configurable parameters:

Parameter Default Description
long_running_hours_threshold 24 Hours before a training job is considered long-running

Required permissions:

  • sagemaker:ListTrainingJobs
  • sagemaker:DescribeTrainingJob

Not run by default. Run with cleancloud scan --provider aws --category ai. Validate access first with cleancloud doctor --provider aws --category ai. Attach security/aws/ai-readonly.json alongside base-readonly.json to your IAM role to enable this rule.


Idle Azure ML Compute Clusters

Rule ID: azure.aml.compute.idle

Category: ai

What it detects: Azure Machine Learning compute clusters (AmlCompute) with min_node_count > 0 and zero active nodes over 14+ days. Clusters configured with a non-zero minimum keep instances running continuously regardless of job activity — identical billing model to SageMaker InService endpoints. GPU clusters (NC/ND/NV series) cost $600–$15K/month at minimum node count.

Confidence:

  • HIGH: Zero active nodes for the full 14-day window (cluster age ≥ 14 days)
  • MEDIUM: Zero active nodes, but cluster age is only 7–13 days or cluster creation time is unavailable

Risk:

  • HIGH: GPU-backed VM size (Standard_NC*, Standard_ND*, Standard_NV*)
  • MEDIUM: CPU-backed VM size

Why this matters:

  • min_node_count > 0 means instances are always running, always billed — even with no jobs submitted
  • GPU clusters cost $600–$15K/month per node at minimum capacity
  • Clusters are frequently created for experiments or training runs and left with non-zero minimums for "warm-start convenience"

Metric strategy: Queries Azure Monitor Active Nodes metric (with ComputeName dimension filter). Falls back to NodeCount and CurrentNodeCount if the primary metric is unavailable. Only dimension-filtered metrics are used to confirm idle — workspace-level unfiltered queries cannot safely confirm individual cluster state.

Estimated monthly cost (per node at min_node_count):

  • Standard_NC6 — ~$648/month
  • Standard_NC12 — ~$1,296/month
  • Standard_NC6s_v3 — ~$2,203/month
  • Standard_ND40rs_v2 — ~$15,862/month
  • Standard_D4_v2 — ~$259/month

Required permissions:

  • Microsoft.MachineLearningServices/workspaces/read
  • Microsoft.MachineLearningServices/workspaces/computes/read
  • Microsoft.Insights/metrics/read

Not run by default. Run with cleancloud scan --provider azure --category ai (or --category all). Add Microsoft.MachineLearningServices/workspaces/read and Microsoft.MachineLearningServices/workspaces/computes/read to your custom role or use the built-in AzureML Data Scientist role in read-only mode.
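The metric strategy above — primary metric, two fallbacks, and the rule that only dimension-filtered series may confirm idle — can be sketched as pure logic. The data shape (`metric name -> (dimension_filtered, max_value)`) is an assumption for illustration:

```python
PRIMARY = "Active Nodes"
FALLBACKS = ("NodeCount", "CurrentNodeCount")

def cluster_confirmed_idle(metric_results):
    """metric_results maps metric name -> (dimension_filtered, max_value).

    Only a ComputeName-filtered series can confirm a single cluster's
    state; unfiltered workspace-level totals are ignored."""
    for name in (PRIMARY, *FALLBACKS):
        filtered, value = metric_results.get(name, (False, None))
        if not filtered or value is None:
            continue  # unusable: missing data or not scoped to this cluster
        return value == 0
    return False  # no usable signal -> conservative, no finding
```

Note the conservative default: a cluster with no usable metric at all is never flagged, matching the design principle of preferring false negatives.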


Idle Azure ML Compute Instances

Rule ID: azure.ml.compute_instance.idle

Category: ai

What it detects: Azure ML Compute Instances in Running state with no control-plane activity for 14+ days, detected via last_operation.operation_time. Compute Instances are single-VM interactive development environments (Jupyter, VS Code, RStudio) that bill continuously while Running — regardless of kernel activity. GPU instances (NC/ND/NV series) idle for 2× the threshold are escalated to CRITICAL.

Detection signal — why last_operation: Azure ML Compute Instances do not publish per-instance utilisation metrics to Azure Monitor by default. last_operation.operation_time is updated by the Azure ML control plane on Start, Stop, Restart, and Create operations. An instance with no recent operation has had no control-plane activity — the same approach used for SageMaker Notebook LastModifiedTime. Falls back to system_data.last_modified_at if last_operation is unavailable.

Confidence:

  • HIGH: last_operation.operation_time or last_modified_at signal ≥ 14 days ago AND instance age ≥ 14 days
  • MEDIUM: ≥ 75% of threshold on both signals, OR age-only fallback when neither last_operation nor last_modified_at is available (capped at MEDIUM because age alone is weak evidence of idleness)

Risk:

  • CRITICAL: GPU instance AND idle_ratio ≥ 2.0 (e.g. 28+ days at the default 14-day window)
  • HIGH: GPU instance (Standard_NC*, Standard_ND*, Standard_NV*)
  • MEDIUM: CPU instance

Why this matters:

  • Compute Instances bill at the full VM rate while Running — a stopped instance costs nothing
  • GPU instances cost $600–$15K+/month running continuously
  • Data scientists frequently leave instances Running after finishing a sprint, switching to a new instance, or during holidays

Estimated monthly cost:

  • Standard_DS3_v2 — ~$260/month
  • Standard_NC6s_v3 — ~$2,203/month
  • Standard_NC24s_v3 — ~$8,812/month
  • Standard_ND40rs_v2 — ~$15,862/month

Required permissions:

  • Microsoft.MachineLearningServices/workspaces/read
  • Microsoft.MachineLearningServices/workspaces/computes/read

Not run by default. Run with cleancloud scan --provider azure --category ai. Attach security/azure/ai-readonly-role.json to your service principal to enable this rule.
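The signal-fallback ladder can be sketched as follows. This is a hedged reading of the documented confidence rules — in particular, gating the age-only fallback on the instance age reaching the threshold is my interpretation, not a documented exact condition:

```python
def classify_compute_instance(days_since_last_op, days_since_modified, age_days,
                              threshold_days=14):
    """Return (confidence, signal) or None, following the documented ladder."""
    signal_days = days_since_last_op
    if signal_days is None:
        signal_days = days_since_modified  # system_data.last_modified_at fallback
    if signal_days is None:
        # Age-only fallback: age alone is weak evidence, so cap at MEDIUM.
        return ("MEDIUM", "age_only") if age_days >= threshold_days else None
    if signal_days >= threshold_days and age_days >= threshold_days:
        return ("HIGH", "activity")
    if signal_days >= 0.75 * threshold_days and age_days >= 0.75 * threshold_days:
        return ("MEDIUM", "activity")
    return None
```

An instance last operated on 11 days ago and created 12 days ago clears the 75% bar (10.5 days) on both signals and lands at MEDIUM.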


Idle Azure OpenAI Provisioned Deployment

Rule ID: azure.openai.provisioned_deployment.idle

Category: ai

What it detects: Azure OpenAI provisioned deployments (PTUs) with zero API requests for 7+ days (default, configurable). Provisioned Throughput Units reserve dedicated model capacity and bill continuously at ~$1,460/PTU/month on-demand regardless of traffic — a single idle 100-PTU GPT-4o deployment wastes ~$146,000/month.

Configurable parameters:

Parameter Default Description
idle_days 7 Days of zero requests before flagging

Detection signal:

Queries Azure Monitor AzureOpenAIRequests (falling back to ProcessedPromptTokens) with a ModelDeploymentName dimension filter to isolate per-deployment traffic. If the per-deployment dimension is unsupported in a region, falls back to account-level totals. Conservative: returns no finding on any API error.

Provisioned SKUs detected:

  • ProvisionedManaged — single-region reserved capacity
  • GlobalProvisionedManaged — multi-region reserved capacity
  • DataZoneProvisionedManaged — data-zone-scoped reserved capacity

Confidence:

  • HIGH: Per-deployment metric confirms zero requests AND deployment age ≥ idle_days
  • MEDIUM: Per-deployment zero confirmed but age < idle_days; OR account-level zero (per-deployment dimension unavailable in region)

Risk:

  • HIGH: ≥ 7 PTUs (~$10K+/month estimated)
  • MEDIUM: < 7 PTUs (still significant — PTU deployments have no cost-free tier)

Why this matters:

  • PTU deployments have no free tier — every hour of idle time is pure waste
  • Common abandonment pattern: PoC deployments left running after evaluation, dev/test deployments forgotten when team moves to production, traffic migrated to a new deployment without decommissioning the old one
  • Nobody else detects idle PTU deployments in CI — first-mover advantage

Estimated monthly cost:

  • 1 PTU — ~$1,460/month (on-demand)
  • 10 PTUs — ~$14,600/month
  • 100 PTUs — ~$146,000/month
  • Note: Monthly/annual reserved pricing is 30–50% lower; estimated cost shown is on-demand ceiling

Required permissions:

  • Microsoft.CognitiveServices/accounts/read
  • Microsoft.CognitiveServices/accounts/deployments/read
  • Microsoft.Insights/metrics/read

Not run by default. Run with cleancloud scan --provider azure --category ai (or --category all). Add the permissions above to your custom read-only role.


Idle Azure ML Online Endpoints

Rule ID: azure.ml.online_endpoint.idle

Category: ai

What it detects: Azure ML managed online endpoints in Succeeded provisioning state with zero scoring requests for 7+ days (default, configurable). These endpoints bill per-instance based on minimum replica count regardless of traffic — a GPU-backed endpoint with no scoring requests is paying for capacity delivering zero value.

Detection signal: Queries Azure Monitor RequestCount (falling back to ModelEndpointRequests) with an EndpointName dimension filter to isolate per-endpoint traffic. If the dimension is unsupported, falls back to workspace-level totals. Age-only fallback applies when metric data is unavailable and endpoint age ≥ 2× idle window (MEDIUM confidence).

Configurable parameters:

Parameter Default Description
idle_days 7 Days of zero scoring requests before flagging

Confidence:

  • HIGH: Per-endpoint metric confirms zero requests AND endpoint age ≥ idle_days
  • MEDIUM: Zero requests confirmed but age < idle_days; OR metric data unavailable and age ≥ 2× idle_days

Risk:

  • CRITICAL: GPU/accelerator instance AND idle_ratio ≥ 2.0 (idle for 2× the threshold)
  • HIGH: GPU/accelerator instance (Standard_NC*, Standard_ND*, Standard_NV*, T4/A100 families)
  • MEDIUM: CPU-backed instance

Why this matters:

  • Managed online endpoints bill per minimum replica continuously while in Succeeded state — even with zero traffic
  • GPU-backed endpoints cost $200–$2,600+/month at single minimum replica
  • Experiment and PoC endpoints are commonly abandoned after demos without being deleted or scaled to zero
  • Unlike batch endpoints, managed online endpoints have no auto-scale-to-zero by default

Estimated monthly cost:

  • Standard_NC6 (K80 GPU) — ~$657/month per replica
  • Standard_NC6s_v2 — ~$900/month per replica
  • Standard_NC12 — ~$1,300/month per replica
  • CPU-backed (fallback) — ~$200/month per replica

Required permissions:

  • Microsoft.MachineLearningServices/workspaces/read
  • Microsoft.MachineLearningServices/workspaces/onlineEndpoints/read
  • Microsoft.MachineLearningServices/workspaces/onlineEndpoints/deployments/read
  • Microsoft.Insights/metrics/read

Not run by default. Run with cleancloud scan --provider azure --category ai. Attach security/azure/ai-readonly-role.json to your service principal to enable this rule.


Idle Azure AI Search Services

Rule ID: azure.ai_search.idle

Category: ai

What it detects: Azure AI Search services on Standard tier or above with zero search queries over a 30-day window (default, configurable). Cost is computed per SKU × replica count × partition count — a Standard3 service with 3 replicas and 2 partitions idles at ~$6,282/month.

Detection signal: Queries Azure Monitor SearchQueriesPerSecond (Average), falling back to TotalSearchRequestCount (Sum). Service-level metrics only — no per-index dimension filtering needed. Age-only fallback applies when metric data is unavailable and service age ≥ 2× idle window (MEDIUM confidence).

Watched SKUs: standard, standard2, standard3, storage_optimized_l1, storage_optimized_l2 — Basic tier is excluded (low cost, no signal).

Configurable parameters:

Parameter Default Description
idle_days 30 Days of zero queries before flagging

Confidence:

  • HIGH: Zero average SearchQueriesPerSecond for the full idle window AND service age ≥ idle_days
  • MEDIUM: Zero confirmed but age < idle_days; OR metric data unavailable and age ≥ 2× idle_days

Risk:

  • HIGH: Estimated monthly cost ≥ $1,000 (e.g. Standard2+ or multi-replica/partition Standard)
  • MEDIUM: All other cases

Why this matters:

  • AI Search services bill continuously by SKU × replicas × partitions regardless of query volume
  • A Standard service with 1 replica and 1 partition costs ~$261/month idle — scale up to 2 replicas and the bill doubles
  • Services are commonly left running after a project ends, a search index is replaced, or a PoC is abandoned
  • Standard3 High-Density (HD) with 12 partitions can idle at ~$12,564/month

Estimated monthly cost per replica per partition:

SKU Monthly cost
Standard $261
Standard2 $523
Standard3 $1,047
Storage Optimized L1 $2,014
Storage Optimized L2 $4,028

Multiply by replica_count × partition_count for total monthly idle cost.
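As a worked example of that multiplication (figures from the table above; the helper itself is illustrative):

```python
# Per-replica-per-partition monthly cost, from the SKU table above.
SKU_MONTHLY_USD = {
    "standard": 261,
    "standard2": 523,
    "standard3": 1047,
    "storage_optimized_l1": 2014,
    "storage_optimized_l2": 4028,
}

def monthly_idle_cost(sku, replica_count, partition_count):
    return SKU_MONTHLY_USD[sku] * replica_count * partition_count
```

A Standard3 service with 3 replicas and 2 partitions gives 1,047 × 3 × 2 = $6,282/month, matching the example at the top of this rule.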

Required permissions:

  • Microsoft.Search/searchServices/read
  • Microsoft.Insights/metrics/read

Not run by default. Run with cleancloud scan --provider azure --category ai. Attach security/azure/ai-readonly-role.json to your service principal to enable this rule.


Azure Rules

Compute Waste

Stopped (Not Deallocated) VMs

Rule ID: azure.vm.stopped_not_deallocated

What it detects: VMs in 'Stopped' state (OS-level shutdown) that are not deallocated, still incurring full compute charges

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Power state is 'Stopped' (deterministic state check, zero false positives)

Risk: HIGH

Why HIGH risk:

  • Stopped-but-not-deallocated VMs incur full compute charges ($30-500+/month depending on SKU)
  • Users often believe their VM is "off" but are paying full price
  • Classic Azure cost trap with significant financial impact

Why this matters:

  • Azure distinguishes between 'Stopped' (OS shutdown) and 'Deallocated' (compute released)
  • Only deallocated VMs stop incurring compute charges
  • 100% deterministic state check with zero false positives

Detection logic:

for vm in virtual_machines.list_all():
    instance_view = virtual_machines.instance_view(resource_group, vm.name)
    power_state = get_power_state(instance_view.statuses)  # PowerState/* code
    if power_state == "PowerState/stopped":
        confidence = "HIGH"  # Deterministic: stopped but not deallocated
        risk = "HIGH"  # Full compute charges still applied

Power states:

  • PowerState/running — active, skip
  • PowerState/deallocated — properly stopped, skip
  • PowerState/stopped — FLAGGED (still incurring compute charges)
  • PowerState/starting, PowerState/stopping, PowerState/deallocating — transitional, skip

Common causes:

  • Shutting down the VM from inside the OS (instead of Azure portal/CLI)
  • Using Stop-AzVM with the -StayProvisioned switch (stops the VM without deallocating it)
  • RDP/SSH shutdown commands
  • Automated scripts that stop but don't deallocate

Required permission: Microsoft.Compute/virtualMachines/read


Storage Waste

Unattached Managed Disks

Rule ID: azure.compute.disk.unattached

What it detects: Managed disks not attached to any VM

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Unattached ≥ 7 days (conservative for all ages — unattached state is deterministic but attachment intent is not)
  • Not flagged: < 7 days

Detection logic:

for disk in disks.list():
    if disk.managed_by is not None:
        continue  # attached to a VM
    age_days = (now - disk.time_created).days
    if age_days >= 7:
        confidence = "MEDIUM"  # conservative regardless of age
    else:
        continue  # too new to flag

Common causes:

  • Disks from deleted VMs
  • Failed deployments
  • Autoscaling cleanup gaps

Required permission: Microsoft.Compute/disks/read


Old Managed Disk Snapshots

Rule ID: azure.compute.snapshot.old

What it detects: Snapshots older than configured thresholds

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Age ≥ 30 days (conservative for all ages — age alone is a moderate signal)
  • Not flagged: < 30 days

Detection logic:

for snapshot in snapshots.list():
    age_days = (now - snapshot.time_created).days
    if age_days >= 30:
        confidence = "MEDIUM"  # conservative even at high age
    else:
        continue  # too new to flag

Limitations:

  • Does NOT check if snapshot is referenced by images
  • Conservative to avoid false positives

Common causes:

  • Snapshots from backup jobs
  • Over-retention without lifecycle policies
  • Snapshots from deleted disks

Required permission: Microsoft.Compute/snapshots/read


Network Waste

Unused Public IP Addresses

Rule ID: azure.network.public_ip.unused

What it detects: Public IPs not attached to any network interface

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Not attached (deterministic state, but may be reserved intentionally)

Why this matters:

  • Public IPs incur charges even when unused
  • State is deterministic (no heuristics needed)

Detection logic:

if public_ip.ip_configuration is None:
    confidence = "MEDIUM"

Required permission: Microsoft.Network/publicIPAddresses/read


Standard Load Balancer with No Backend Members

Rule ID: azure.load_balancer.no_backends

What it detects: Standard Load Balancers where all backend pools have zero members

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Standard SKU with zero backend members across all pools (deterministic state)

Excluded:

  • Basic SKU load balancers are skipped (retired, no cost signal)

Why this matters:

  • Standard Load Balancers incur base charges (~$18/month) regardless of backends
  • Empty LBs are a clear cost optimization signal
  • Common after VM/VMSS teardowns or migrations

Detection logic:

if lb.sku.name == "Standard":
    pools = lb.backend_address_pools or []
    # Check both NIC-based and IP-based backend representations
    has_members = any(
        pool.backend_ip_configurations or pool.load_balancer_backend_addresses
        for pool in pools
    )
    if not has_members:
        confidence = "HIGH"  # Deterministic: zero members across all pools

Backend representations checked:

  • backend_ip_configurations — NIC-based backends (standard VMs)
  • load_balancer_backend_addresses — IP-based backends (Private Link, hybrid)

Common causes:

  • VMs or VMSS deleted but LB retained
  • Migration from Basic to Standard leaving empty LBs
  • Failed deployments or incomplete teardowns
  • Hub-spoke architecture cleanup gaps

Required permission: Microsoft.Network/loadBalancers/read


Application Gateway with No Backend Targets

Rule ID: azure.application_gateway.no_backends

What it detects: Application Gateways where all backend pools have zero targets (no IP addresses or FQDNs)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: All backend pools have zero targets (deterministic state)

Excluded:

  • Gateways with provisioning_state != "Succeeded" are skipped (in-progress)

Why this matters:

  • Application Gateways incur significant charges regardless of backends
  • Standard_v2 and WAF_v2 SKUs cost $150-300+/month
  • Empty gateways are a clear cost optimization signal

Detection logic:

for gw in application_gateways:
    pools = gw.backend_address_pools or []
    has_any_targets = any(
        (pool.backend_addresses and len(pool.backend_addresses) > 0) or
        (pool.backend_ip_configurations and len(pool.backend_ip_configurations) > 0)
        for pool in pools
    )
    if not has_any_targets:
        confidence = "HIGH"  # Deterministic: zero targets across all pools
        risk = "MEDIUM"  # Significant cost impact ($150-300+/month)

Backend targets checked:

  • backend_addresses array (IP addresses or FQDNs)
  • backend_ip_configurations array (NIC-based backend references)

Common causes:

  • Backend VMs or services deleted but gateway retained
  • Migration or transition leaving empty gateways
  • Failed deployments or incomplete teardowns
  • WAF-only setup without actual backends (rare)

Cost estimates by SKU:

  • Standard_v2, WAF_v2: $150-300+/month
  • Standard, WAF (v1): $20-50/month

Required permission: Microsoft.Network/applicationGateways/read


Idle VNet Gateways (VPN/ExpressRoute)

Rule ID: azure.virtual_network_gateway.idle

What it detects: VPN Gateways and ExpressRoute Gateways with no active connections

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: No active connections (connection state checked, but P2S clients not verified)

Why MEDIUM confidence:

  • We can verify Site-to-Site and ExpressRoute connections
  • Point-to-Site VPN client count requires additional API calls
  • Gateway may have P2S config but no way to check active clients without deeper inspection

Risk: HIGH

Why HIGH risk:

  • VNet Gateways are among the most expensive idle resources ($500-3,500+/month)
  • Cost impact is material even for a single idle gateway
  • Significantly higher than Load Balancers ($18/month) or App Gateways ($150-300/month)

Why this matters:

  • VNet Gateways incur significant charges regardless of connections
  • VPN Gateway SKUs: $27-3,500+/month depending on SKU
  • ExpressRoute Gateway SKUs: $125-1,100+/month
  • Idle gateways are a major cost optimization signal

Detection logic:

for gw in virtual_network_gateways:
    connections = list_connections(gw)
    active_connections = [c for c in connections if c.connection_status == "Connected"]
    has_p2s_config = gw.vpn_client_configuration is not None

    if gw.gateway_type == "Vpn":
        if len(active_connections) == 0 and not has_p2s_config:
            confidence = "MEDIUM"  # flag as idle
            risk = "HIGH"
    elif gw.gateway_type == "ExpressRoute":
        if len(active_connections) == 0:
            confidence = "MEDIUM"  # flag as idle
            risk = "HIGH"

Connection states checked:

  • Site-to-Site VPN connections (connection_status == "Connected")
  • ExpressRoute circuit connections
  • Point-to-Site VPN configuration (presence only, not active client count)

Common causes:

  • VPN tunnels torn down but gateway retained
  • ExpressRoute circuits decommissioned
  • Test/dev gateways left running
  • Migration or transition leaving orphaned gateways
  • DR standby gateways (intentional, but worth reviewing)

Cost estimates by SKU:

  • Basic: $27/month
  • VpnGw1/ErGw1AZ: $140-195/month
  • VpnGw2/ErGw2AZ: $360-505/month
  • VpnGw3/ErGw3AZ: $930-1,115/month
  • HighPerformance/UltraPerformance: $335-670/month

Required permissions:

  • Microsoft.Network/virtualNetworkGateways/read
  • Microsoft.Network/connections/read

Platform Waste

Empty App Service Plans

Rule ID: azure.app_service_plan.empty

What it detects: Paid App Service Plans with zero hosted apps (number_of_sites == 0)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Paid tier plan with 0 apps (deterministic state)

Excluded tiers:

  • Free and Shared tiers are skipped (no cost signal)

Why this matters:

  • Paid App Service Plans incur charges regardless of hosted apps
  • Empty plans are a clear cost optimization signal
  • Common after app deletions or failed deployments

Detection logic:

if plan.number_of_sites == 0:
    if plan.sku.tier not in ("Free", "Shared"):
        confidence = "HIGH"  # Deterministic: zero apps on paid plan

Common causes:

  • Apps deleted but plan retained
  • Failed deployments leaving empty plans
  • Scaling plans created but never used
  • Migration leaving old plans behind

Required permissions: Microsoft.Web/serverfarms/read, Microsoft.Web/serverfarms/sites/read


Idle Azure SQL Databases

Rule ID: azure.sql.database.idle

What it detects: Azure SQL databases with zero connections for 14+ days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Zero connections for 14+ days (Azure Monitor metrics checked, strong idle signal)

Risk: HIGH

Why HIGH risk:

  • Azure SQL databases in Standard/Premium tiers cost $15-$7,500+/month
  • Idle databases with no connections are a clear cost optimization signal

Why this matters:

  • Azure SQL databases incur charges regardless of usage
  • Standard and Premium tiers have significant hourly costs
  • Idle databases are a major cost optimization opportunity

Detection logic:

for server in sql_servers:
    for db in databases.list_by_server(rg, server.name):
        if db.name == "master":  # Skip system databases
            continue
        if db.sku.tier == "Basic":  # Skip Basic tier (< $5/month)
            continue
        connections = get_metric("connection_successful", period_days=14)
        if connections == 0:
            confidence = "HIGH"
            risk = "HIGH"

Azure Monitor metrics checked:

  • connection_successful (daily total over 14-day window)

Exclusions:

  • System databases (master)
  • Basic tier databases (< $5/month, not worth flagging)

Common causes:

  • Applications migrated to different databases
  • Dev/staging databases left running
  • Decommissioned services with retained databases
  • Test databases no longer needed

Cost estimates by SKU:

  • Standard S0: ~$15/month
  • Standard S3: ~$150/month
  • Premium P1: ~$465/month
  • Premium P6: ~$3,720/month
  • Premium P15: ~$7,446/month

Required permissions:

  • Microsoft.Sql/servers/read
  • Microsoft.Sql/servers/databases/read
  • Microsoft.Insights/metrics/read

Idle App Services

Rule ID: azure.app_service.idle

What it detects: Running App Service web apps with zero HTTP requests for 14+ days on paid plans

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Zero requests for 14+ days (Azure Monitor Requests metric, strong idle signal)

Risk: MEDIUM

Why this matters:

  • App Service Plans on paid tiers bill compute charges continuously regardless of traffic
  • An app with zero requests for 14+ days is a strong signal of abandonment
  • Common for dev/staging apps that were never decommissioned

Detection logic:

for app in web_apps.list():
    plan = app_service_plans.get(app.server_farm_id)  # tier lives on the plan, not the app
    if app.state == "Running" and plan.sku.tier not in ("Free", "Shared", "Dynamic"):
        requests = monitor.metrics("Requests", period=days_idle)
        if requests == 0:
            confidence = "HIGH"
            risk = "MEDIUM"

Excluded tiers:

  • Free, Shared, Dynamic (Consumption/serverless) — no meaningful idle cost

Common causes:

  • Dev or staging apps left running after project end
  • Feature branches deployed and never torn down
  • Apps migrated to containers but old App Service not removed

Cost estimates by tier (single instance):

  • Basic: ~$55/month
  • Standard: ~$73/month
  • Premium/PremiumV2/V3: ~$146/month
  • Isolated/IsolatedV2: ~$298/month

Cost assumes one instance. Scaled-out plans (multiple instances) will cost proportionally more — treat these as minimum estimates.

Not detected:

  • Non-HTTP workloads such as WebJobs or background services with no inbound HTTP traffic — these produce zero Requests metric data even when active. Review before deleting.

Required permissions:

  • Microsoft.Web/sites/read
  • Microsoft.Web/serverfarms/read
  • Microsoft.Insights/metrics/read

Unused Container Registries

Rule ID: azure.container_registry.unused

What it detects: Container registries with zero image pulls for 90+ days (default, configurable)

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • HIGH: Zero successful pulls AND zero successful pushes for 90+ days (Azure Monitor SuccessfulPullCount and SuccessfulPushCount metrics)

Risk: LOW

Why this matters:

  • Container registries accrue storage and per-operation charges regardless of usage
  • A registry with no pulls and no pushes for 90+ days signals complete abandonment
  • Common after workload migrations to other registries or container platforms

Detection logic:

for registry in registries.list():
    if registry.provisioning_state == "Succeeded":
        pulls = monitor.metrics("SuccessfulPullCount", period=days_unused)
        pushes = monitor.metrics("SuccessfulPushCount", period=days_unused)
        if pulls == 0 and pushes == 0:
            confidence = "HIGH"
            risk = "LOW"

Registries with active push activity (e.g. CI pipelines writing images) but zero pulls are not flagged — they are in active use.

Common causes:

  • Workloads migrated to another registry (e.g., Docker Hub → ACR → GHCR)
  • Projects retired without cleaning up the registry
  • Old build artifacts never consumed by any deployment

Cost estimates by SKU (base fee only):

  • Basic: ~$5/month + storage
  • Standard: ~$20/month + storage
  • Premium: ~$50/month + storage

These are floor estimates. ACR also charges per GB of stored images (~$0.003/GB-day). For registries with large image layers, storage can exceed the base fee — actual cost may be significantly higher.
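The floor-plus-storage arithmetic above can be sketched directly. The rates are this section's approximations, not live pricing:

```python
# Approximate rates from this section: base fee by SKU plus ~$0.003/GB-day storage.
BASE_FEE = {"Basic": 5.0, "Standard": 20.0, "Premium": 50.0}
STORAGE_PER_GB_DAY = 0.003

def acr_monthly_estimate(sku, stored_gb, days=30):
    # The base fee is a floor; storage scales with image size and can dominate.
    return BASE_FEE[sku] + stored_gb * STORAGE_PER_GB_DAY * days
```

A Basic registry holding 500 GB of layers works out to 5 + 500 × 0.003 × 30 = $50/month, so storage is nine times the base fee.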

Required permissions:

  • Microsoft.ContainerRegistry/registries/read
  • Microsoft.Insights/metrics/read

Governance

Untagged Resources

Rule ID: azure.resource.untagged

What it detects: Resources with zero tags

Resources checked:

  • Managed disks (7+ days old)
  • Snapshots

Confidence:

Confidence thresholds and signal weighting are documented in confidence.md.

  • MEDIUM: Untagged disk that's also unattached
  • LOW: Untagged snapshot or attached disk

Required permissions:

  • Microsoft.Compute/disks/read
  • Microsoft.Compute/snapshots/read

GCP Rules

Compute Waste

Stopped VM Instances

Rule ID: gcp.compute.vm.stopped

What it detects: VM instances in TERMINATED state for 30+ days

Confidence:

  • HIGH: lastStopTimestamp present and ≥ 30 days ago (deterministic timestamp)
  • MEDIUM: lastStopTimestamp absent — instance is TERMINATED but stop time is unavailable
  • Not flagged: stopped < 30 days, or instance in any other state (RUNNING, STAGING, etc.)

Risk: LOW

Why this matters:

  • A TERMINATED GCP VM does not charge for vCPU or memory — but every attached Persistent Disk accrues storage charges at ~$0.04/GB-month (standard) or ~$0.17/GB-month (SSD), regardless of instance state
  • A 500 GB root disk on a forgotten stopped instance costs ~$20/month indefinitely
  • This is the GCP equivalent of a stopped EC2 instance — the compute is free, the storage is not

Detection logic:

for instance in instances_client.aggregated_list(project=project_id):
    if instance.status == "TERMINATED":
        if not instance.last_stop_timestamp:
            flag(instance, confidence="MEDIUM")  # stop time unavailable
        elif _parse_gcp_timestamp(instance.last_stop_timestamp) < cutoff:
            flag(instance, confidence="HIGH")    # stopped before the 30-day cutoff

Cost estimate: Sum of attached PERSISTENT disk sizes × $0.04/GB/month (SCRATCH disks excluded — they are ephemeral)

Required permissions:

  • compute.instances.list (included in roles/compute.viewer)

Storage Waste

Unattached Persistent Disks

Rule ID: gcp.compute.disk.unattached

What it detects: Persistent Disks in READY state with no attached VM (users == [])

Confidence:

  • HIGH: Disk is READY and has no users — unambiguous detachment

Risk: LOW

Why this matters:

  • GCP charges for Persistent Disks regardless of whether they are attached to a VM
  • pd-standard: ~$0.04/GB/month, pd-ssd: ~$0.17/GB/month, pd-balanced: ~$0.10/GB/month, pd-extreme: ~$0.12/GB/month
  • Unattached disks accumulate when VMs are deleted without deleting their disks — the most common source of GCP storage waste
  • A 500 GB pd-ssd left unattached costs ~$85/month

Detection logic:

for disk in disks_client.aggregated_list(project=project_id):
    if disk.status == "READY" and not disk.users:
        flag(disk)

Cost estimate by disk type:

Type Rate
pd-standard $0.04/GB/month
pd-balanced $0.10/GB/month
pd-ssd $0.17/GB/month
pd-extreme $0.12/GB/month
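The table above translates to a small lookup. Rates are this section's approximations; the Compute API reports a disk's type as a URL ending in the type name, which the sketch strips:

```python
# Approximate rates from the table above.
RATE_PER_GB_MONTH = {
    "pd-standard": 0.04,
    "pd-balanced": 0.10,
    "pd-ssd": 0.17,
    "pd-extreme": 0.12,
}

def disk_monthly_cost(disk_type_url, size_gb):
    # The API reports the type as a URL, e.g.
    # "projects/p/zones/us-central1-a/diskTypes/pd-ssd" -- keep the last segment.
    disk_type = disk_type_url.rsplit("/", 1)[-1]
    # Unknown types fall back to the pd-standard rate as a conservative floor.
    return size_gb * RATE_PER_GB_MONTH.get(disk_type, 0.04)
```

A 500 GB pd-ssd yields 500 × 0.17 = $85/month, matching the example above.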

Required permissions:

  • compute.disks.list (included in roles/compute.viewer)

Old Disk Snapshots

Rule ID: gcp.compute.snapshot.old

What it detects: Disk snapshots older than 90 days

Confidence:

  • HIGH: Source disk no longer exists (snapshot is orphaned — the source was deleted)
  • MEDIUM: Source disk still exists (might be intentional long-term backup or DR snapshot)

Risk: LOW

Why this matters:

  • GCP snapshots are billed at ~$0.026/GB/month compressed storage in Cloud Storage
  • Automated snapshot policies are frequently removed while their snapshots are left behind
  • One-off manual snapshots are rarely cleaned up — they persist indefinitely until explicitly deleted
  • Snapshots are global resources — they accumulate across all zones and appear in no specific region

Detection logic:

for snapshot in snapshots_client.list(project=project_id):
    if snapshot.status == "READY":
        if _parse_gcp_timestamp(snapshot.creation_timestamp) < cutoff:
            confidence = HIGH if not snapshot.source_disk else MEDIUM
            flag(snapshot)

Cost estimate: Uses storage_bytes (actual compressed size) when available; falls back to disk_size_gb × $0.026/GB/month
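The fallback described above, sketched with plain arguments (the rate is this section's ~$0.026/GB/month approximation):

```python
GIB = 1024 ** 3

def snapshot_monthly_cost(storage_bytes, disk_size_gb, rate=0.026):
    # Prefer the actual compressed size; fall back to the provisioned
    # source-disk size when storage_bytes is unreported (overestimates,
    # since snapshots are compressed).
    size_gb = storage_bytes / GIB if storage_bytes else disk_size_gb
    return size_gb * rate
```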

Note: region_filter is ignored for snapshots — GCP snapshots are global resources with no region attribute.

Required permissions:

  • compute.snapshots.list (included in roles/compute.viewer)

Network Waste

Unused Reserved Static IPs

Rule ID: gcp.compute.ip.unused

What it detects: Reserved static IP addresses (regional and global) in RESERVED status (not IN_USE)

Confidence:

  • HIGH: IP status is RESERVED — unambiguous, GCP itself confirms it is not attached

Risk: LOW

Why this matters:

  • GCP bills $0.01/hour ($7.20/month) for each static IP in RESERVED status under the PREMIUM network tier
  • Reserved IPs accumulate when VMs, load balancers, or NAT gateways are deleted without releasing their IPs
  • Unlike ephemeral IPs, reserved IPs persist independently — they must be explicitly released to stop billing

Detection logic:

# Regional IPs
for address in addresses_client.aggregated_list(project=project_id):
    if address.status == "RESERVED":
        flag(address, scope="regional")

# Global IPs (skipped if region_filter is set)
for address in global_addresses_client.list(project=project_id):
    if address.status == "RESERVED":
        flag(address, scope="global")

Graceful degradation: If compute.globalAddresses.list is denied but regional IPs succeed, the rule returns regional findings rather than failing entirely.
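The degradation path can be sketched with injected listing callables (hypothetical stand-ins for the google-cloud-compute client calls); a permission failure on the global call keeps the regional findings instead of failing the whole rule:

```python
def list_unused_ips(list_regional, list_global):
    """list_regional / list_global are hypothetical callables yielding
    address dicts with "name" and "status" keys (stand-ins for the
    compute_v1 AddressesClient / GlobalAddressesClient calls)."""
    findings = [
        {"name": a["name"], "scope": "regional"}
        for a in list_regional()
        if a["status"] == "RESERVED"
    ]
    try:
        findings += [
            {"name": a["name"], "scope": "global"}
            for a in list_global()
            if a["status"] == "RESERVED"
        ]
    except PermissionError:  # real code catches the API's 403 equivalent
        pass  # degrade gracefully: return regional findings only
    return findings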

Cost estimate: $7.20/month per unused IP (PREMIUM network tier default)

Required permissions:

  • compute.addresses.list (included in roles/compute.viewer)
  • compute.globalAddresses.list (included in roles/compute.viewer)

Platform Waste

Idle Cloud SQL Instances

Rule ID: gcp.sql.instance.idle

What it detects: Cloud SQL instances in RUNNABLE state with zero database connections for 14+ days

Confidence:

  • HIGH: Monitoring confirms zero connections for the full 14-day window

Risk: HIGH

Why this matters:

  • Cloud SQL bills continuously for vCPU and memory regardless of query load
  • A db-n1-standard-2 costs ~$93/month with zero queries
  • Dev and staging databases are frequently left running after feature branches merge or projects wind down
  • Cloud SQL is the highest-cost idle resource type in most GCP environments

Detection logic:

for instance in sql_admin_api.list(project_id):
    if instance.state == "RUNNABLE" and not is_read_replica(instance):
        if not has_connections(monitoring_client, project_id, instance.name, days=14):
            flag(instance)

Conservative monitoring fallback: If Cloud Monitoring is unavailable or permission-denied, the instance is assumed active — it is not flagged. This avoids false positives when monitoring data is temporarily unavailable.
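The fail-closed behaviour can be sketched as a wrapper around the metric query. Here query_connections is a hypothetical callable standing in for the monitoring.timeSeries.list call:

```python
def has_connections(query_connections):
    """Return True unless monitoring positively confirms zero connections.

    `query_connections` is a hypothetical callable returning the summed
    connection count over the window, raising on permission or
    availability errors.
    """
    try:
        total = query_connections()
    except Exception:  # broad by design: any monitoring failure fails closed
        return True    # monitoring unavailable: assume active, do not flag
    if total is None:
        return True    # one conservative reading: no data means unknown, not idle
    return total > 0
```

The instance is flagged only when this returns False, i.e. monitoring returned an explicit zero.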

Read replicas excluded: Read replicas have no independent billing basis — the primary instance cost is what matters.

Cost estimates by tier:

Tier ~Monthly cost
db-f1-micro $7.67
db-g1-small $25.22
db-n1-standard-1 $46.55
db-n1-standard-2 $93.10
db-n1-standard-4 $186.19
db-n1-highmem-2 $113.45
db-n1-highmem-4 $226.90

Costs are approximate for us-central1 with HA disabled.

Required permissions:

  • cloudsql.instances.list (included in roles/cloudsql.viewer)
  • monitoring.timeSeries.list (included in roles/monitoring.viewer)

AI/ML Waste (opt-in — --category ai)

Idle Vertex AI Online Prediction Endpoints

Rule ID: gcp.vertex.endpoint.idle

What it detects: Vertex AI Online Prediction endpoints with dedicatedResources.minReplicaCount > 0 and zero prediction requests for 14+ days

Confidence:

  • HIGH: Zero predictions for the full 14-day window (endpoint age ≥ 14 days)
  • MEDIUM: Zero predictions, endpoint age ≥ 75% of threshold (≥ 10 days), or age unknown

Risk: HIGH (GPU-backed endpoints: T4, V100, A100, L4, H100, TPU), MEDIUM (CPU-only)

Why this matters:

  • Vertex AI endpoints with minReplicaCount > 0 keep dedicated compute running 24/7 regardless of traffic
  • GPU endpoints (T4: $311/month per GPU, A100: $2,933/month, H100: $8,000/month) are especially costly when idle
  • Experiment and prototype endpoints are commonly abandoned after demos without being deleted or scaled to zero
  • Endpoints using automaticResources (which scale to zero) are excluded — only dedicatedResources incur idle cost

Detection logic:

for endpoint in vertex_ai_api.list(project_id, location="-"):  # all locations
    total_min_replicas = sum(
        m.dedicatedResources.minReplicaCount
        for m in endpoint.deployedModels
        if m.dedicatedResources  # skip automaticResources
    )
    if total_min_replicas > 0:
        if not has_predictions(monitoring_client, endpoint_id, days=14):
            flag(endpoint)

Conservative monitoring fallback: If Cloud Monitoring is unavailable or permission-denied, the endpoint is assumed active — it is not flagged.

Cost estimates by machine type (per node, us-central1):

Machine Type ~Monthly cost/node
n1-standard-4 $138
n1-standard-8 $277
n1-standard-4 + T4 GPU $449
n1-standard-4 + V100 GPU $1,523
a2-highgpu-1g (A100 40GB) $2,933
a2-highgpu-2g (2× A100) $5,866
a2-ultragpu-1g (A100 80GB) $5,103
g2-standard-8 (L4 GPU) $1,060

Costs are approximate for us-central1, on-demand. Multiply by minReplicaCount for total monthly idle cost.

Required permissions:

  • aiplatform.endpoints.list (included in roles/aiplatform.viewer)
  • monitoring.timeSeries.list (included in roles/monitoring.viewer)

Idle Vertex AI Workbench Instances

Rule ID: gcp.vertex.workbench.idle

What it detects: Vertex AI Workbench instances in ACTIVE state with no control-plane activity for 14+ days

Confidence:

  • HIGH: updateTime ≥ 14 days ago AND instance age ≥ 14 days
  • MEDIUM: updateTime ≥ 75% of threshold (≥ 10 days) and instance age ≥ 10 days, or updateTime unavailable (age-fallback, capped at MEDIUM)
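The 75%-of-threshold tiering used here (and by the endpoint rule above) can be sketched as follows. This is a simplification that folds the separate instance-age requirement into a single idle_days argument; the int truncation matches the "≥ 10 days" figure at the 14-day default:

```python
def idle_confidence(idle_days, threshold_days=14):
    # HIGH requires the full observation window; the tail of the window
    # (int(0.75 * threshold), i.e. 10 days at the default) or an
    # age-fallback with no updateTime caps at MEDIUM.
    if idle_days is None:
        return "MEDIUM"  # updateTime unavailable: capped at MEDIUM
    if idle_days >= threshold_days:
        return "HIGH"
    if idle_days >= int(threshold_days * 0.75):
        return "MEDIUM"
    return None          # below the early-warning zone: not flagged
```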

Risk: CRITICAL (GPU-backed, idle ≥ 2× threshold), HIGH (GPU-backed), MEDIUM (CPU-only)

Why this matters:

  • Workbench instances incur continuous compute charges while ACTIVE, even with no open notebooks or active kernels
  • GPU instances (T4: $311/month, A100: $2,933/month, H100: $8,000/month) are extremely costly when left idle
  • Data scientists commonly leave instances running after a sprint ends, a project is deprioritised, or when switching to a newer instance

Detection logic:

for instance in notebooks_api.list(project_id, location="-"):  # all locations
    if instance.state == "ACTIVE":
        idle_days = (now - instance.updateTime).days
        if idle_days >= 14:
            flag(instance)

updateTime is updated by the Notebooks API when the instance is started, stopped, restarted, or reconfigured. Instances with stale updateTime have had no control-plane activity. This mirrors LastModifiedTime (SageMaker) and last_modified_at (Azure ML).

Cost estimates (per instance, us-central1, on-demand):

Machine Type ~Monthly cost
n1-standard-4 $138
n1-standard-4 + T4 GPU $449
n1-standard-4 + V100 GPU $1,523
a2-highgpu-1g (A100 40GB) $2,933
g2-standard-8 (L4 GPU) $1,060

Required permissions:

  • notebooks.instances.list (included in roles/notebooks.viewer)

Long-Running Vertex AI Training Jobs

Rule ID: gcp.vertex.training_job.long_running

What it detects: Vertex AI CustomJobs (state=JOB_STATE_RUNNING) and TrainingPipelines (state=PIPELINE_STATE_RUNNING) that have been running longer than expected. The default threshold is 24 hours. GPU/TPU accelerator jobs and expensive CPU clusters raise an early warning at 90% of the threshold (21.6h at defaults) because high burn rates make runaway detection time-sensitive.

Most training jobs complete in minutes to a few hours. A job still running well past the threshold is likely hung, stalled, or runaway — waiting on data, deadlocked in distributed training, caught in an OOM loop, or simply forgotten after a project was cancelled.

GPU-backed training is especially costly: an A100 40GB node (a2-highgpu-1g) runs at ~$4/hour; an H100 node (a3-highgpu-8g) with 8 GPUs runs at ~$80/hour. Distributed multi-worker jobs multiply cost linearly.

Confidence:

  • HIGH: duration ≥ long_running_hours × 3 — clearly runaway for almost any single training run
  • MEDIUM: duration ≥ long_running_hours — worth reviewing; could be legitimate large-scale training
  • MEDIUM (early warning): GPU/TPU accelerator job, or CPU cluster with burn rate above expensive_hourly_threshold (default $20/hr), at 90–100% of threshold — not emitted for cheap CPU-only jobs below threshold

Risk:

Confidence GPU/Accelerator Risk
HIGH Yes CRITICAL
HIGH No or unknown HIGH
MEDIUM Any MEDIUM

Why this matters:

  • Vertex AI CustomJobs with GPU workers continue billing as long as they are in JOB_STATE_RUNNING
  • There is no automatic stop unless timeout is set in the job spec — jobs can run indefinitely if hung or if the stopping condition is never met
  • TrainingPipelines wrap CustomJobs and can also run indefinitely if the underlying job does not terminate

Detection logic:

# Queries both resource types across all locations via REST API
for job in vertex_ai.customJobs(project, locations="-", filter='state="JOB_STATE_RUNNING"'):
    duration = now - job.startTime  # fallback to createTime if absent
    is_accelerator = has_gpu_or_tpu(job.workerPoolSpecs)
    burn_rate = total_hourly_cost(job.workerPoolSpecs)
    if duration < threshold * 0.9:
        continue  # too young
    if duration < threshold and not (is_accelerator or burn_rate > 20):
        continue  # early-warning zone: skip cheap CPU-only jobs

for pipeline in vertex_ai.trainingPipelines(project, locations="-", filter='state="PIPELINE_STATE_RUNNING"'):
    ...  # same logic; hardware parsed from trainingTaskInputs when available

Hardware detection:

  • Accelerator classification uses workerPoolSpecs[].machineSpec.acceleratorType against a frozenset of known accelerator types (GPU families and TPU pod types), plus machine type prefixes that bundle accelerator cost (a2-*, a3-*, a4-*, a4x-*, g2-*, g4-*, ct4-*, ct5*, ct6*, tpu*)
  • TPU machines use tpuTopology (e.g. "2x4") to derive the physical host count — replicaCount is always 1 in the Vertex AI API regardless of pod size
  • TrainingPipelines embed hardware in opaque trainingTaskInputs — when specs cannot be parsed, cost uses a duration-tiered placeholder (>24h → $20/hr, 6–24h → $5/hr, <6h → $1/hr) and is_accelerator is False (unknown hardware does not imply GPU workload)
  • For bundled accelerator machines, co-scheduling is modeled: when acceleratorCount divides machine_gpu_count evenly, machine_gpu_count ÷ acceleratorCount replicas are assumed to share each VM, and the per-machine cost is split evenly across them
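The classification step might be sketched like this. The accelerator names and machine prefixes below are abbreviated, illustrative examples drawn from this section, not the rule's full frozenset:

```python
# Illustrative subset -- the real rule carries the complete frozenset of
# accelerator types and bundled-accelerator machine prefixes.
ACCELERATOR_TYPES = frozenset({
    "NVIDIA_TESLA_T4", "NVIDIA_TESLA_V100", "NVIDIA_TESLA_A100",
    "NVIDIA_L4", "NVIDIA_H100_80GB",
})
ACCELERATOR_PREFIXES = ("a2-", "a3-", "a4-", "a4x-", "g2-", "g4-",
                        "ct4-", "ct5", "ct6", "tpu")

def is_accelerator_spec(machine_type, accelerator_type=None):
    # A worker pool counts as accelerator-backed if it names an accelerator
    # explicitly, or if the machine type bundles one (a2-*, g2-*, ct5*, ...).
    if accelerator_type in ACCELERATOR_TYPES:
        return True
    return machine_type.startswith(ACCELERATOR_PREFIXES)
```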

Cost reported:

  • Accrued cost so far: duration_hours × hourly_burn_rate (sum across all worker pools); stored raw in details["accrued_cost_usd"] and capped at $1M in display text
  • estimated_monthly_cost_usd is intentionally None — training jobs are transient, not recurring monthly expenses; populating that field would corrupt monthly savings totals
  • Pricing is a static estimate (us-central1, on-demand); details["pricing_scope"] = "us-central1_reference" and details["pricing_note"] indicate the reference region and whether the job's actual region may differ significantly
  • details["pricing_confidence"] is "published" when all prices come from GCP pricing pages, or "partial_estimate" for newer machine families (a3-megagpu, a4-, g4-, ct5p-, ct6e-, tpu7x-*) where rates are estimated

Cost estimates (per node, us-central1, on-demand):

Machine Type ~Hourly cost Notes
n1-standard-8 + T4 ~$0.80/hr GPU cost additive
n1-standard-8 + V100 ~$2.27/hr GPU cost additive
a2-highgpu-1g (A100 40GB) ~$4.02/hr GPU bundled
a2-highgpu-8g (8× A100 40GB) ~$32.14/hr GPU bundled
a3-highgpu-8g (8× H100 80GB) ~$80.00/hr GPU bundled [est]
g2-standard-8 (L4) ~$1.45/hr GPU bundled
ct5lp-hightpu-8t (8× TPU v5e) ~$9.60/hr TPU bundled

What it does not check:

  • Intentional long-running distributed training (LLM pre-training, large fine-tunes)
  • Checkpoint saving — job may be making progress without visible status updates
  • Committed use discounts — actual cost may be significantly lower than on-demand estimate
  • Preemptible/Spot workers — cost and interruption semantics differ
  • Co-scheduling for g2-standard-32 — GPU count is ambiguous in GCP docs; that machine type uses full-price-per-replica as a conservative fallback

Required permissions:

  • aiplatform.customJobs.list (included in roles/aiplatform.viewer)
  • aiplatform.trainingPipelines.list (included in roles/aiplatform.viewer)

Idle Cloud TPU Nodes

Rule ID: gcp.tpu.idle

What it detects: Cloud TPU nodes in READY state with near-zero utilization for 7+ days. A READY TPU node incurs compute charges continuously, regardless of whether any workload is running. Forgotten TPU nodes left running after a training job completes are a common source of runaway cost.

Confidence:

  • HIGH: Cloud Monitoring reports max tpu.googleapis.com/node/accelerator/duty_cycle ≤ 2% across all workers over the idle window (7 days by default) — the TPU was genuinely not executing any workload
  • LOW: Monitoring data unavailable; node exists for ≥ idle_days with no observed activity — existence duration is not a reliable idle proxy (node may still be in active use)

Risk:

Confidence Hourly cost Risk
HIGH ≥ $10/hr CRITICAL
HIGH < $10/hr HIGH
LOW Any MEDIUM

Why this matters:

  • TPU nodes bill from the moment they reach READY state, regardless of utilization
  • Unlike GPU instances, Cloud TPU nodes have no automatic stop after a job completes — they must be explicitly deleted
  • An idle v4 node (4 chips, 2x2x1 topology) costs ~$12.88/hr; a v5p-8 costs ~$33.60/hr; a forgotten large pod runs up thousands per day

Detection logic:

# List all READY TPU nodes via Cloud TPU v2 REST API (locations/- wildcard)
for node in tpu.projects.locations.nodes.list(project, location="-"):
    if node.state != "READY":
        continue
    age = age_days(node.createTime)
    if age < idle_days:
        continue  # too young — enforce minimum observation window
    # Check Cloud Monitoring for near-zero duty_cycle
    duty_cycle = max_duty_cycle(node.id, window=idle_days)
    if duty_cycle is not None:
        idle = duty_cycle <= 0.02  # HIGH confidence
    else:
        idle = True  # LOW confidence — age-based heuristic, utilization unknown

Cost estimates (us-central1, on-demand):

TPU Type Chips ~Hourly cost Notes
v2-8 8 $12.00/hr $1.50/chip-hr, published
v3-8 8 $17.60/hr $2.20/chip-hr (device); v3 pod is $2.00/chip-hr
v4 (2x2x1) 4 $12.88/hr $3.22/chip-hr, published
v4 (2x2x2) 8 $25.76/hr $3.22/chip-hr, published
v5e (litepod-4) 4 $4.80/hr $1.20/chip-hr, published
v5e (litepod-8) 8 $9.60/hr $1.20/chip-hr, published
v5p-4 4 $16.80/hr $4.20/chip-hr, published
v5p-8 8 $33.60/hr $4.20/chip-hr, published
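The chip arithmetic behind this table is a product over the topology string times the per-chip rate:

```python
from math import prod

def tpu_hourly_cost(topology, per_chip_rate):
    # "2x2x2" -> 2 * 2 * 2 = 8 chips; cost scales linearly with chip count.
    chips = prod(int(d) for d in topology.split("x"))
    return chips * per_chip_rate
```

For v4 with a 2x2x2 topology at the published $3.22/chip-hr, 8 × 3.22 = $25.76/hr, matching the table row above.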

What it does not check:

  • Batch or scheduled jobs that run intermittently (the 7-day window may miss a recent burst)
  • Preemptible TPU nodes — may have been interrupted and not yet restarted intentionally
  • Committed use discounts — actual cost may be significantly lower
  • Nodes shared across teams where utilization is tracked externally

Required permissions:

  • tpu.nodes.list (included in roles/tpu.viewer)
  • monitoring.timeSeries.list (included in roles/monitoring.viewer) — optional; falls back to age-based detection if absent

Idle Vertex AI Feature Store Online Stores

Rule ID: gcp.vertex.featurestore.idle

What it detects: Vertex AI Feature Store online stores that have received zero online serving requests for 30+ days while remaining in STABLE state. This covers both legacy featurestores (with fixedNodeCount > 0, or autoscaled via scaling.minNodeCount) and new-generation featureOnlineStores (Bigtable-backed or Optimized). Legacy featurestores and Bigtable-backed online stores incur continuous Bigtable compute charges; Optimized stores incur storage and query compute charges. Feature stores are frequently left running after a model or recommendation system is retired.

Confidence:

  • HIGH: Cloud Monitoring confirms zero online_serving/request_count over the 30-day window — the store had no ReadFeatureValues (or equivalent) requests at all
  • LOW: Monitoring data unavailable; store has been in STABLE state for ≥ 30 days — heuristic: age only, request activity unknown

Risk:

Confidence Risk
HIGH HIGH
LOW MEDIUM

Why this matters:

  • Legacy featurestores with fixedNodeCount > 0 bill ~$0.27/node-hour (us-central1, SSD-backed Bigtable) continuously — a 1-node store costs ~$197/month, a 3-node HA store costs ~$591/month
  • New-generation featureOnlineStores (Bigtable-backed) have similar per-node costs via autoScaling.minNodeCount
  • Optimized (BigQuery-backed) featureOnlineStores have lower base cost but still incur storage and query charges
  • These stores are often provisioned during model development and forgotten after the serving layer is replaced

Detection logic:

# Legacy featurestores with online serving configured (fixed or autoscaled)
for store in vertex_ai.featurestores(project, locations="-"):
    config = store.onlineServingConfig
    if config.fixedNodeCount == 0 and config.scaling.minNodeCount == 0:
        continue  # no online serving cost
    requests = monitoring.sum("featurestore/online_serving/request_count", window=30d)
    if requests is not None:
        if requests == 0:
            flag()  # HIGH confidence
    elif age_days >= 30:
        flag()  # LOW confidence — age heuristic, request activity unknown

# New featureOnlineStores (Bigtable or Optimized)
for store in vertex_ai.featureOnlineStores(project, locations="-"):
    requests = monitoring.sum("featureonlinestore/online_serving/request_count", window=30d)
    if requests is not None:
        if requests == 0:
            flag()  # HIGH confidence
    elif age_days >= 30:
        flag()  # LOW confidence — age heuristic, request activity unknown

Cost estimates (us-central1, on-demand):

Store type Config ~Monthly cost
Legacy featurestore 1 Bigtable node ~$197/mo
Legacy featurestore 3 Bigtable nodes (HA) ~$591/mo
Feature Online Store 1 Bigtable node (min) ~$197/mo
Feature Online Store 3 Bigtable nodes (min) ~$591/mo
Feature Online Store Optimized (BigQuery) ~$100+/mo [est]

What it does not check:

  • Periodic or low-frequency batch workflows querying less often than the 30-day window
  • Feature stores used by scheduled pipelines (e.g. weekly batch inference)
  • Committed use discounts — actual cost may be lower
  • Stores intentionally kept warm for latency-sensitive cold-start mitigation

Required permissions:

  • aiplatform.featurestores.list (included in roles/aiplatform.viewer)
  • aiplatform.featureOnlineStores.list (included in roles/aiplatform.viewer)
  • monitoring.timeSeries.list (included in roles/monitoring.viewer) — optional; falls back to age-based detection if absent

Rule Stability Guarantee

Once a rule reaches production status:

  • Rule ID remains stable
  • Confidence semantics unchanged
  • Backwards compatibility preserved
  • Schema additions only (no breaking changes)

This guarantees trust for long-running CI/CD integrations.


Coming Soon

AI/ML (all providers):

  • Orphaned SageMaker training artifacts in S3 (AWS)

AWS:

  • S3 lifecycle gaps, Redshift idle, NAT Gateway routing waste

Azure:

  • Azure Firewall idle, AKS node pool idle, Azure Batch unused pools

GCP:

  • GKE node pool idle, BigQuery slot waste, GCS cold storage, Cloud Run idle revisions

Multi-Cloud:

  • Rule filtering (--rules flag)
  • Policy-as-code (cleancloud.yaml)

Next: AWS Setup → | Azure Setup → | GCP Setup → | CI/CD Integration →