Commit 03062fd

AI rule: azure.ml.compute_instance.idle (#132)
1 parent d8dc555 commit 03062fd

39 files changed

Lines changed: 1797 additions & 352 deletions

README.fr.md

Lines changed: 4 additions & 3 deletions
@@ -141,7 +141,7 @@ Pas encore de compte cloud ? `cleancloud demo` affiche un exemple de sortie sans
 - **Détection du gaspillage IA/ML sur les 3 clouds :** endpoints SageMaker, clusters AML Compute et endpoints Vertex AI inactifs, facturés 500–23 000 $/mois par ressource en silence. Ressources GPU flaggées risque HIGH. Les outils natifs montrent la facture — CleanCloud indique quel endpoint supprimer. Opt-in via `--category ai`
 - **Gouvernance policy-as-code :** `cleancloud.yaml` pour la configuration par règle, les exceptions avec dates d'expiration, les seuils de coût et de confiance, les exclusions par tag — versionné aux côtés de votre infrastructure. Chaque exception est une approbation auditée dans git.
 - **Application de politique (opt-in) :** `--fail-on-confidence HIGH` ou `--fail-on-cost 500` — appliquer des seuils de gaspillage en CI/CD sur un planning, géré par les équipes platform ou FinOps
-- **34 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
+- **35 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
 - **Scan multi-comptes (AWS) :** scannez des AWS Organizations entières en une exécution — fichier de config, IDs inline, ou auto-découverte via `--org`
 - **Scan multi-abonnements (Azure) :** scannez tous les abonnements Azure en parallèle — auto-découverte via Management Group, détail des coûts par abonnement inclus
 - **Scan multi-projets (GCP) :** scannez tous les projets GCP accessibles en parallèle — auto-découverte via Application Default Credentials, détail des coûts par projet inclus
@@ -221,6 +221,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 | Endpoint SageMaker (GPU) | 500 – 23 000 $ / mois |
 | Instance Notebook SageMaker (GPU) | 500 – 23 000+ $ / mois |
 | Cluster AML Compute Azure (GPU) | 600 – 15 000 $ / mois |
+| Instance de calcul Azure ML (GPU) | 600 – 15 000+ $ / mois |
 | Endpoint Vertex AI Online Prediction (GPU) | 449 – 23 000+ $ / mois |

 CleanCloud détecte les endpoints à zéro invocation / zéro prédiction et les instances de notebook inactives sur les 3 clouds et les signale risque HIGH. Les outils natifs montrent la facture — ils ne vous disent pas *quel endpoint* supprimer.
@@ -466,7 +467,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud

 ## Ce que CleanCloud détecte

-34 règles pour AWS, Azure et GCP — conservatives, haut signal, conçues pour éviter les faux positifs en environnements IaC.
+35 règles pour AWS, Azure et GCP — conservatives, haut signal, conçues pour éviter les faux positifs en environnements IaC.

 **AWS :**
 - Compute : instances arrêtées 30+ jours (charges EBS continuent)
@@ -483,7 +484,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 - Réseau : adresses IP publiques inutilisées, Load Balancers vides (HIGH), App Gateways vides (HIGH), VNet Gateways inactives
 - Plateforme : App Service Plans vides (HIGH), bases de données SQL inactives (HIGH), App Services inactifs, Container Registries inutilisés
 - Gouvernance : ressources sans tags
-- IA/ML *(opt-in : `--category ai`)* : clusters de calcul AML avec capacité baseline non nulle et aucune activité depuis 14+ jours — clusters GPU flaggés risque HIGH ($600–$15K/mois)
+- IA/ML *(opt-in : `--category ai`)* : clusters de calcul AML avec capacité baseline non nulle et aucune activité depuis 14+ jours — clusters GPU flaggés risque HIGH ($600–$15K/mois) ; instances de calcul Azure ML Running sans activité depuis 14+ jours — instances GPU flaggées risque CRITICAL ($600–$15K+/mois)

 **GCP :**
 - Compute : instances VM arrêtées 30+ jours (charges disque continuent) (HIGH)

README.md

Lines changed: 3 additions & 2 deletions
@@ -221,6 +221,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 | SageMaker endpoint (GPU) | $500 – $23,000 / month |
 | SageMaker Notebook Instance (GPU) | $500 – $23,000+ / month |
 | Azure AML compute cluster (GPU) | $600 – $15,000 / month |
+| Azure ML Compute Instance (GPU) | $600 – $15,000+ / month |
 | Vertex AI Online Prediction endpoint (GPU) | $449 – $23,000+ / month |

 CleanCloud detects zero-invocation / zero-prediction endpoints and idle notebook instances across all three clouds and flags them HIGH risk. Native cost tools show the bill — they don't tell you *which endpoint* to delete.
@@ -466,7 +467,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints

 ## What CleanCloud Detects

-34 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
+35 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.

 **AWS:**
 - Compute: stopped instances 30+ days (EBS charges continue)
@@ -483,7 +484,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 - Network: unused public IPs, empty load balancers (HIGH), empty App Gateways (HIGH), idle VNet Gateways
 - Platform: empty App Service Plans (HIGH), idle SQL databases (HIGH), idle App Services, unused Container Registries
 - Governance: untagged resources
-- AI/ML *(opt-in: `--category ai`)*: idle AML compute clusters with non-zero baseline capacity and no workload activity 14+ days — GPU clusters flagged HIGH risk ($600–$15K/month)
+- AI/ML *(opt-in: `--category ai`)*: idle AML compute clusters with non-zero baseline capacity and no workload activity 14+ days — GPU clusters flagged HIGH risk ($600–$15K/month); idle Compute Instances with no control-plane activity 14+ days — GPU instances CRITICAL risk ($600–$15K+/month)

 **GCP:**
 - Compute: stopped instances 30+ days (disk charges continue) (HIGH)

cleancloud/demo/findings.py

Lines changed: 56 additions & 1 deletion
@@ -422,7 +422,7 @@
                 "Disk status: READY",
                 "No VM users (users list empty)",
                 "Disk type: pd-ssd (~$0.17/GB/month storage)",
-                "Size: 500 GB ~$85.0/month (estimated, region-dependent)",
+                "Size: 500 GB -> ~$85.0/month (estimated, region-dependent)",
                 "Last detached: 1656h ago",
             ],
             signals_not_checked=[
@@ -737,6 +737,61 @@
         ),
         estimated_monthly_cost_usd=4406.0,
     ),
+    Finding(
+        provider="azure",
+        rule_id="azure.ml.compute_instance.idle",
+        resource_type="azure.ml.compute_instance",
+        resource_id=(
+            "/subscriptions/29d91ee0-922f-483a-a81f-1a5eff4ecfa2"
+            "/resourceGroups/rg-ml-platform"
+            "/providers/Microsoft.MachineLearningServices"
+            "/workspaces/ml-platform-prod"
+            "/computes/nlp-research-vm"
+        ),
+        region="East US",
+        title="Idle Azure ML Compute Instance (>14 Days Idle, 31 Days Since Activity)",
+        summary=(
+            "Azure ML Compute Instance 'nlp-research-vm' in workspace 'ml-platform-prod' "
+            "has had no control-plane activity for 31 days but remains Running, incurring "
+            "continuous charges (~$2,203/month)."
+        ),
+        reason="Azure ML Compute Instance has had no control-plane activity for 31 days",
+        risk=RiskLevel.CRITICAL,
+        confidence=ConfidenceLevel.HIGH,
+        detected_at=_NOW,
+        details={
+            "instance_name": "nlp-research-vm",
+            "workspace_name": "ml-platform-prod",
+            "resource_group": "rg-ml-platform",
+            "vm_size": "Standard_NC6s_v3",
+            "state": "Running",
+            "is_gpu": True,
+            "age_days": 31,
+            "idle_since_days": 31,
+            "idle_days_threshold": 14,
+            "idle_ratio": 2.21,
+            "estimated_monthly_cost": "~$2,203/month",
+            "cost_source": "approximate_East US",
+        },
+        evidence=Evidence(
+            signals_used=[
+                "Instance state: Running",
+                "Age: 31 days",
+                "Last control-plane operation: 31 days ago (last_operation.operation_time) — last op: Start",
+                "VM size: Standard_NC6s_v3",
+                "GPU instance — high hourly cost",
+            ],
+            signals_not_checked=[
+                "Active Jupyter kernel or notebook sessions",
+                "VS Code / RStudio remote connections",
+                "Scheduled jobs running on this instance",
+                "Assigned user's planned future use",
+                "Resource tags (e.g. keep_alive=true)",
+            ],
+            time_window="31 days",
+        ),
+        estimated_monthly_cost_usd=2203.0,
+    ),
 ]

 AWS_AI_FINDINGS: List[Finding] = [
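
The demo finding above reports `idle_since_days: 31`, `idle_days_threshold: 14`, and `idle_ratio: 2.21`. The sketch below shows one way those three fields could relate to each other. It is an illustration only: the helper name `idle_metrics`, the `is_idle` key, and the `>=` comparison (read from the "14+ days" wording in the README) are assumptions, not the rule's actual implementation.

```python
from datetime import datetime, timezone


def idle_metrics(last_operation_time, now, idle_days_threshold=14):
    """Hypothetical helper: derive idle fields from the last control-plane operation time."""
    idle_since_days = (now - last_operation_time).days
    return {
        "idle_since_days": idle_since_days,
        "idle_days_threshold": idle_days_threshold,
        # Ratio of observed idle time to the threshold, rounded to two decimals.
        "idle_ratio": round(idle_since_days / idle_days_threshold, 2),
        # Assumption: "14+ days" is treated as an inclusive threshold.
        "is_idle": idle_since_days >= idle_days_threshold,
    }


# Illustrative timestamps chosen to be exactly 31 days apart.
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
last_op = datetime(2024, 12, 31, tzinfo=timezone.utc)
metrics = idle_metrics(last_op, now)
print(metrics["idle_since_days"], metrics["idle_ratio"])  # 31 2.21
```

With a 31-day gap and a 14-day threshold, the ratio is 31 / 14 ≈ 2.21, matching the value in the demo finding.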

cleancloud/doctor/aws.py

Lines changed: 1 addition & 1 deletion
@@ -793,7 +793,7 @@ def run_aws_multi_account_doctor(
                 external_id=config.external_id,
             )
             assumed_identity = assumed.client("sts", config=BOTO_CONFIG).get_caller_identity()
-            success(f"{label} {assumed_identity['Arn']}")
+            success(f"{label} -> {assumed_identity['Arn']}")
             passed += 1
         except botocore.exceptions.ClientError as e:
             code = e.response["Error"]["Code"]

cleancloud/doctor/azure.py

Lines changed: 1 addition & 1 deletion
@@ -328,7 +328,7 @@ def run_azure_doctor() -> None:


 def run_azure_ai_doctor(subscription_id: str = None) -> None:
-    """Validate Azure permissions for --category ai (Azure ML compute rules)."""
+    """Validate Azure permissions for --category ai (Azure ML compute cluster and compute instance rules)."""
     info("")
     info("=" * 70)
     info("AZURE AI/ML PERMISSION VALIDATION")

cleancloud/doctor/gcp.py

Lines changed: 7 additions & 7 deletions
@@ -368,7 +368,7 @@ def run_gcp_doctor(project_id: Optional[str] = None) -> None:
         warn("compute.disks.list — MISSING (rule: disk_unattached will be skipped)")
     except NotFound:
         info("compute.disks.list — Compute Engine API not enabled (rule unavailable)")
-        info(" enable via: gcloud services enable compute.googleapis.com")
+        info(" -> enable via: gcloud services enable compute.googleapis.com")
     except ResourceExhausted:
         warn("compute.disks.list — API quota exceeded (retry later)")
     except Exception as e:
@@ -387,7 +387,7 @@ def run_gcp_doctor(project_id: Optional[str] = None) -> None:
         warn("compute.instances.list — MISSING (rule: vm_stopped will be skipped)")
     except NotFound:
         info("compute.instances.list — Compute Engine API not enabled (rule unavailable)")
-        info(" enable via: gcloud services enable compute.googleapis.com")
+        info(" -> enable via: gcloud services enable compute.googleapis.com")
     except ResourceExhausted:
         warn("compute.instances.list — API quota exceeded (retry later)")
     except Exception as e:
@@ -406,7 +406,7 @@ def run_gcp_doctor(project_id: Optional[str] = None) -> None:
         warn("compute.addresses.list — MISSING (rule: ip_unused regional IPs will be skipped)")
     except NotFound:
         info("compute.addresses.list — Compute Engine API not enabled (rule unavailable)")
-        info(" enable via: gcloud services enable compute.googleapis.com")
+        info(" -> enable via: gcloud services enable compute.googleapis.com")
     except ResourceExhausted:
         warn("compute.addresses.list — API quota exceeded (retry later)")
     except Exception as e:
@@ -427,7 +427,7 @@ def run_gcp_doctor(project_id: Optional[str] = None) -> None:
         )
     except NotFound:
         info("compute.globalAddresses.list — Compute Engine API not enabled (rule unavailable)")
-        info(" enable via: gcloud services enable compute.googleapis.com")
+        info(" -> enable via: gcloud services enable compute.googleapis.com")
     except ResourceExhausted:
         warn("compute.globalAddresses.list — API quota exceeded (retry later)")
     except Exception as e:
@@ -446,7 +446,7 @@ def run_gcp_doctor(project_id: Optional[str] = None) -> None:
         warn("compute.snapshots.list — MISSING (rule: snapshot_old will be skipped)")
     except NotFound:
         info("compute.snapshots.list — Compute Engine API not enabled (rule unavailable)")
-        info(" enable via: gcloud services enable compute.googleapis.com")
+        info(" -> enable via: gcloud services enable compute.googleapis.com")
     except ResourceExhausted:
         warn("compute.snapshots.list — API quota exceeded (retry later)")
     except Exception as e:
@@ -466,7 +466,7 @@ def run_gcp_doctor(project_id: Optional[str] = None) -> None:
         warn("cloudsql.instances.list — MISSING (rule: sql_instance_idle will be skipped)")
     elif resp.status_code == 404:
         info("cloudsql.instances.list — Cloud SQL API not enabled (rule unavailable)")
-        info(" enable via: gcloud services enable sqladmin.googleapis.com")
+        info(" -> enable via: gcloud services enable sqladmin.googleapis.com")
     else:
         permissions_tested.append("cloudsql.instances.list")
         success("cloudsql.instances.list")
@@ -509,7 +509,7 @@ def run_gcp_doctor(project_id: Optional[str] = None) -> None:
         )
     except NotFound:
         info("monitoring.timeSeries.list — Cloud Monitoring API not enabled (rule unavailable)")
-        info(" enable via: gcloud services enable monitoring.googleapis.com")
+        info(" -> enable via: gcloud services enable monitoring.googleapis.com")
     except ResourceExhausted:
         warn("monitoring.timeSeries.list — API quota exceeded (retry later)")
     except Exception as e:

cleancloud/filtering/rules.py

Lines changed: 2 additions & 2 deletions
@@ -98,12 +98,12 @@ def _validate_params(rule_id: str, params: Dict[str, Any], func: Callable) -> No
         if key not in valid:
             close = difflib.get_close_matches(key, valid.keys(), n=1, cutoff=0.6)
             hint = f" (did you mean '{close[0]}'?)" if close else f". Valid params: {sorted(valid)}"
-            raise ValueError(f"Invalid config: rules.{rule_id}.params.{key} unknown field{hint}")
+            raise ValueError(f"Invalid config: rules.{rule_id}.params.{key} -> unknown field{hint}")

         expected_type = valid[key]
         if expected_type is not None and not isinstance(value, expected_type):
             raise ValueError(
-                f"Invalid config: rules.{rule_id}.params.{key} "
+                f"Invalid config: rules.{rule_id}.params.{key} -> "
                 f"expected {expected_type.__name__}, got {type(value).__name__} ({value!r})"
             )
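
To see what the reworded error messages look like in practice, here is a minimal standalone sketch of the same validation pattern. It is not the project's `_validate_params`: the real function derives the valid parameters from `func`, whereas this hypothetical `validate_params` takes an explicit `valid` mapping of parameter name to expected type, and the rule and parameter names in the usage lines are illustrative.

```python
import difflib
from typing import Any, Dict, Optional


def validate_params(rule_id: str, params: Dict[str, Any], valid: Dict[str, Optional[type]]) -> None:
    """Hypothetical stand-in: reject unknown keys (with a typo hint) and wrong value types."""
    for key, value in params.items():
        if key not in valid:
            # Suggest the closest known parameter name, if one is similar enough.
            close = difflib.get_close_matches(key, list(valid), n=1, cutoff=0.6)
            hint = f" (did you mean '{close[0]}'?)" if close else f". Valid params: {sorted(valid)}"
            raise ValueError(f"Invalid config: rules.{rule_id}.params.{key} -> unknown field{hint}")

        expected_type = valid[key]
        if expected_type is not None and not isinstance(value, expected_type):
            raise ValueError(
                f"Invalid config: rules.{rule_id}.params.{key} -> "
                f"expected {expected_type.__name__}, got {type(value).__name__} ({value!r})"
            )


def error_for(params: Dict[str, Any]) -> str:
    """Return the validation error message for params, or an empty string if they pass."""
    try:
        validate_params("azure.ml.compute_instance.idle", params, {"idle_days_threshold": int})
    except ValueError as exc:
        return str(exc)
    return ""


typo_error = error_for({"idle_day_threshold": 14})
type_error = error_for({"idle_days_threshold": "14"})
ok = error_for({"idle_days_threshold": 14})
```

A misspelled key produces a "did you mean" hint, a wrong type produces the "expected ..., got ..." message, and a valid configuration passes silently.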

cleancloud/output/human.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ def print_human(findings: List[Finding]):
             else f.account_id
         )
         print(f" Account : {label} ({f.account_id})")
-        print(f" Resource : {f.resource_type} {f.resource_id}")
+        print(f" Resource : {f.resource_type} -> {f.resource_id}")
         if f.region:
             print(f" Region : {f.region}")

cleancloud/output/markdown.py

Lines changed: 21 additions & 6 deletions
@@ -45,14 +45,20 @@ def write_markdown(
     else:
         regions_str = str(regions)

-    subscriptions = summary.get("subscriptions_scanned", [])
     accounts_scanned = summary.get("accounts_scanned")
-
     projects_scanned = summary.get("projects_scanned", [])

+    # Use names from per_subscription when available (falls back to IDs)
+    per_sub_meta = summary.get("per_subscription", [])
+    sub_names = (
+        [r["name"] for r in per_sub_meta]
+        if per_sub_meta
+        else summary.get("subscriptions_scanned", [])
+    )
+
     lines.append(f"**Provider:** {provider} ")
-    if subscriptions:
-        lines.append(f"**Subscriptions:** {', '.join(subscriptions)} ")
+    if sub_names:
+        lines.append(f"**Subscriptions:** {', '.join(sub_names)} ")
     elif accounts_scanned is not None:
         lines.append(f"**Accounts:** {accounts_scanned} ")
     lines.append(f"**Regions:** {regions_str} ")
@@ -85,6 +91,15 @@ def write_markdown(

     lines.append("")

+    # Rules evaluated
+    rules_evaluated = summary.get("rules_evaluated", {})
+    if rules_evaluated:
+        lines.append(
+            f"**Rules evaluated ({len(rules_evaluated)}):** "
+            + ", ".join(f"`{r}`" for r in sorted(rules_evaluated.keys()))
+        )
+        lines.append("")
+
     # Confidence breakdown
     by_conf = summary.get("by_confidence", {})
     if by_conf:
@@ -134,9 +149,9 @@ def write_markdown(
             lines.append(f"| {label} ({rid}) | {r.get('findings', 0)}{status_str} | {cost_str} |")
         lines.append("")

-    # Azure multi-subscription breakdown
+    # Azure multi-subscription breakdown (only shown when more than one subscription)
     per_sub = summary.get("per_subscription")
-    if per_sub:
+    if per_sub and len(per_sub) > 1:
         lines.append("**Per-subscription breakdown:**")
         lines.append("")
         lines.append("| Subscription | Findings | Est. Monthly Cost |")
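
The three `markdown.py` changes can be read together as a small set of header rules: prefer subscription names over IDs, list the evaluated rules in sorted order, and show the per-subscription table only when there is more than one subscription. The sketch below isolates that logic in a hypothetical `header_lines` function; the `summary` keys come from the diff, while the function name and sample data are assumptions made for illustration.

```python
from typing import Any, Dict, List


def header_lines(summary: Dict[str, Any]) -> List[str]:
    """Hypothetical extract of the header logic changed in write_markdown."""
    lines: List[str] = []

    # Prefer names from per_subscription; fall back to raw IDs when it is absent or empty.
    per_sub_meta = summary.get("per_subscription", [])
    sub_names = (
        [r["name"] for r in per_sub_meta]
        if per_sub_meta
        else summary.get("subscriptions_scanned", [])
    )
    if sub_names:
        lines.append(f"**Subscriptions:** {', '.join(sub_names)}")

    # List every evaluated rule ID, sorted, as inline code.
    rules_evaluated = summary.get("rules_evaluated", {})
    if rules_evaluated:
        lines.append(
            f"**Rules evaluated ({len(rules_evaluated)}):** "
            + ", ".join(f"`{r}`" for r in sorted(rules_evaluated.keys()))
        )

    # The breakdown table only adds information when there are several subscriptions.
    if per_sub_meta and len(per_sub_meta) > 1:
        lines.append("**Per-subscription breakdown:**")
    return lines


single = header_lines({
    "per_subscription": [{"name": "prod"}],
    "rules_evaluated": {"b.rule": 1, "a.rule": 0},
})
multi = header_lines({"per_subscription": [{"name": "prod"}, {"name": "dev"}]})
ids_only = header_lines({"subscriptions_scanned": ["sub-id-1"]})
```

Note that the fallback to IDs happens only when `per_subscription` is missing or empty. An entry in `per_subscription` without a `"name"` key would raise `KeyError` in this sketch, as it would in the list comprehension shown in the diff.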
