Commit 1297d48

AI rule 1 for Azure: azure.aml.compute.idle (#123)
1 parent a904faf commit 1297d48

22 files changed

Lines changed: 1656 additions & 129 deletions

README.fr.md

Lines changed: 5 additions & 4 deletions
@@ -30,7 +30,7 @@ C'est CleanCloud. Scannez vos environnements AWS, Azure et GCP, obtenez des find
 | Hygiène multi-comptes / multi-abonnements / multi-projets ||||
 | Application planifiée et CI/CD (codes de sortie) ||||

-- **31 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe. Les règles IA/ML (SageMaker) sont opt-in via `--category ai`
+- **32 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe. Les règles IA/ML (SageMaker, Azure ML) sont opt-in via `--category ai`
 - **Gouvernance et application de politique (opt-in) :** `--fail-on-confidence HIGH` ou `--fail-on-cost 100` — appliquer des seuils de gaspillage sur un planning, géré par les équipes platform ou FinOps
 - **Scan multi-comptes (AWS) :** scannez des AWS Organizations entières en une exécution — fichier de config, IDs inline, ou auto-découverte via `--org`
 - **Scan multi-abonnements (Azure) :** scannez tous les abonnements Azure en parallèle — auto-découverte via Management Group, détail des coûts par abonnement inclus
@@ -171,7 +171,7 @@ Pas encore de compte cloud ? `cleancloud demo` affiche un exemple de sortie sans
 | Flag | Fonction |
 |---|---|
 | `--provider aws\|azure\|gcp` | Fournisseur cloud à scanner *(obligatoire)* |
-| `--category hygiene\|ai\|all` | Catégorie de règles : `hygiene` (défaut), `ai` (SageMaker, AWS uniquement) ou `all` (hygiene + IA) |
+| `--category hygiene\|ai\|all` | Catégorie de règles : `hygiene` (défaut), `ai` (SageMaker sur AWS, AML Compute sur Azure) ou `all` (hygiene + IA) |
 | `--region REGION` | Scanner une seule région |
 | `--all-regions` | Toutes les régions actives — AWS/Azure uniquement |
 | **AWS multi-comptes** | |
@@ -328,7 +328,7 @@ Pour des exemples de sortie complets incluant `doctor`, JSON, CSV et markdown :

 ## Ce que CleanCloud détecte

-30 règles pour AWS, Azure et GCP — conservatives, haut signal, conçues pour éviter les faux positifs en environnements IaC.
+32 règles pour AWS, Azure et GCP — conservatives, haut signal, conçues pour éviter les faux positifs en environnements IaC.

 **AWS :**
 - Compute : instances arrêtées 30+ jours (charges EBS continuent)
@@ -345,6 +345,7 @@ Pour des exemples de sortie complets incluant `doctor`, JSON, CSV et markdown :
 - Réseau : adresses IP publiques inutilisées, Load Balancers vides (HIGH), App Gateways vides (HIGH), VNet Gateways inactives
 - Plateforme : App Service Plans vides (HIGH), bases de données SQL inactives (HIGH), App Services inactifs, Container Registries inutilisés
 - Gouvernance : ressources sans tags
+- IA/ML *(opt-in : `--category ai`)* : clusters de calcul AML avec capacité baseline non nulle et aucune activité depuis 14+ jours — clusters GPU flaggés risque HIGH ($600–$15K/mois)

 **GCP :**
 - Compute : instances VM arrêtées 30+ jours (charges disque continuent) (HIGH)
@@ -583,7 +584,7 @@ Guide complet : [Configuration GCP →](docs/gcp.md)

 **Policy-as-code**`cleancloud.yaml` avec packs de règles, exceptions par équipe, et seuils de coût en config — la principale demande de gouvernance FinOps pour 2025/2026

-**Plus de règles IA/ML**clusters de calcul Azure ML inactifs, endpoints Vertex AI inactifs, instances de notebook SageMaker inutilisées, artefacts d'entraînement orphelins
+**Plus de règles IA/ML** — endpoints Vertex AI inactifs, instances de notebook SageMaker inutilisées, artefacts d'entraînement orphelins

 **Plus de règles AWS** — lacunes de cycle de vie S3, Redshift inactif, fuite de coût NAT Gateway (services internes routant via NAT au lieu de VPC endpoints — S3, DynamoDB, ECR, SSM), VPC endpoints inutilisés

README.md

Lines changed: 5 additions & 4 deletions
@@ -30,7 +30,7 @@ That's CleanCloud. Scan your AWS, Azure, and GCP environments, get specific acti
 | Multi-account / multi-subscription / multi-project ||||
 | CI/CD and scheduled enforcement (exit codes) ||||

-- **31 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate. AI/ML rules (SageMaker) are opt-in via `--category ai`
+- **32 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate. AI/ML rules (SageMaker, Azure ML) are opt-in via `--category ai`
 - **Governance enforcement (opt-in):** `--fail-on-confidence HIGH` or `--fail-on-cost 100` — enforce waste thresholds on a schedule, owned by platform or FinOps teams
 - **Multi-account scanning (AWS):** scan entire AWS Organizations in one run — config file, inline IDs, or auto-discovery via `--org`
 - **Multi-subscription scanning (Azure):** scan all Azure subscriptions in parallel — auto-discovery via Management Group, per-subscription cost breakdown included
@@ -217,7 +217,7 @@ Run:
 | Flag | What it does |
 |---|---|
 | `--provider aws\|azure\|gcp` | Cloud provider to scan *(required)* |
-| `--category hygiene\|ai\|all` | Rule category: `hygiene` (default), `ai` (SageMaker, AWS-only), or `all` (hygiene + AI) |
+| `--category hygiene\|ai\|all` | Rule category: `hygiene` (default), `ai` (SageMaker on AWS, AML Compute on Azure), or `all` (hygiene + AI) |
 | `--region REGION` | Scan a single region |
 | `--all-regions` | Scan all active regions — AWS/Azure only |
 | **AWS multi-account** | |
@@ -338,7 +338,7 @@ For full output examples including `doctor`, JSON, CSV, and markdown: [`docs/exa

 ## What CleanCloud Detects

-31 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
+32 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.

 **AWS:**
 - Compute: stopped instances 30+ days (EBS charges continue)
@@ -355,6 +355,7 @@ For full output examples including `doctor`, JSON, CSV, and markdown: [`docs/exa
 - Network: unused public IPs, empty load balancers (HIGH), empty App Gateways (HIGH), idle VNet Gateways
 - Platform: empty App Service Plans (HIGH), idle SQL databases (HIGH), idle App Services, unused Container Registries
 - Governance: untagged resources
+- AI/ML *(opt-in: `--category ai`)*: idle AML compute clusters with non-zero baseline capacity and no workload activity 14+ days — GPU clusters flagged HIGH risk ($600–$15K/month)

 **GCP:**
 - Compute: stopped instances 30+ days (disk charges continue) (HIGH)
@@ -593,7 +594,7 @@ Full setup guide: [GCP setup →](docs/gcp.md)

 **Policy-as-code**`cleancloud.yaml` with rule packs, per-team exceptions, and cost thresholds in config — the top FinOps governance ask for 2025/2026

-**More AI/ML waste rules**Azure ML compute clusters idle, Vertex AI endpoints idle, SageMaker notebook instances running unused, orphaned training artifacts
+**More AI/ML waste rules** — Vertex AI endpoints idle, SageMaker notebook instances running unused, orphaned training artifacts

 **More AWS rules** — S3 lifecycle gaps, Redshift idle, NAT Gateway cost leakage (internal services routing through NAT instead of VPC endpoints — S3, DynamoDB, ECR, SSM), unused VPC endpoints

cleancloud/demo/command.py

Lines changed: 15 additions & 5 deletions
@@ -7,6 +7,7 @@
     ALL_FINDINGS,
     AWS_AI_FINDINGS,
     AWS_FINDINGS,
+    AZURE_AI_FINDINGS,
     AZURE_FINDINGS,
     GCP_FINDINGS,
 )
@@ -25,7 +26,7 @@
     "--category",
     type=click.Choice(["hygiene", "ai"]),
     default="hygiene",
-    help="Rule category to demo: hygiene (default) or ai (SageMaker)",
+    help="Rule category to demo: hygiene (default) or ai (SageMaker on AWS, AML Compute on Azure)",
 )
 def demo(provider: Optional[str], category: str):
     """Show realistic sample findings without cloud credentials."""
@@ -36,9 +37,18 @@ def demo(provider: Optional[str], category: str):
     click.echo("=" * 60)

     if category == "ai":
-        findings = AWS_AI_FINDINGS
-        regions = ["us-east-1"]
-        region_mode = "explicit"
+        if provider == "aws":
+            findings = AWS_AI_FINDINGS
+            regions = ["us-east-1"]
+            region_mode = "explicit"
+        elif provider == "azure":
+            findings = AZURE_AI_FINDINGS
+            regions = ["East US"]
+            region_mode = "all"
+        else:
+            findings = AWS_AI_FINDINGS + AZURE_AI_FINDINGS
+            regions = ["us-east-1", "East US"]
+            region_mode = "all"
     elif provider == "aws":
         findings = AWS_FINDINGS
         regions = ["us-east-1", "us-west-2", "eu-west-1"]
@@ -60,7 +70,7 @@ def demo(provider: Optional[str], category: str):

     summary = build_summary(findings)
     summary["scanned_at"] = datetime.now(timezone.utc).isoformat()
-    summary["provider"] = provider or ("aws" if category == "ai" else "mixed")
+    summary["provider"] = provider or "mixed"
     summary["regions_scanned"] = regions

     _print_summary(summary, region_selection_mode=region_mode)
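
Reviewer sketch: the new `--category ai` branch picks findings, regions, and region mode from the provider. The same selection can be written as a lookup table. This is only an illustration, not the project's code: the string lists stand in for the real `Finding` objects, and `select_ai_demo` is a hypothetical helper name.

```python
from typing import List, Optional, Tuple

# Placeholder data: stand-ins for the real Finding lists in cleancloud/demo/findings.py.
AWS_AI_FINDINGS: List[str] = ["aws-ai-finding"]
AZURE_AI_FINDINGS: List[str] = ["azure-ai-finding"]

# provider -> (findings, regions, region_mode), mirroring the branches in the diff.
_AI_DEMO = {
    "aws": (AWS_AI_FINDINGS, ["us-east-1"], "explicit"),
    "azure": (AZURE_AI_FINDINGS, ["East US"], "all"),
}


def select_ai_demo(provider: Optional[str]) -> Tuple[List[str], List[str], str]:
    """Hypothetical helper: return (findings, regions, region_mode) for the AI demo."""
    if provider in _AI_DEMO:
        return _AI_DEMO[provider]
    # Any other value (None, "gcp") gets the combined view, like the diff's else branch.
    return (AWS_AI_FINDINGS + AZURE_AI_FINDINGS, ["us-east-1", "East US"], "all")
```

One consequence visible in the diff: with `--provider gcp --category ai` the combined AWS and Azure sample is shown, while `summary["provider"]` is set to `"gcp"`.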

cleancloud/demo/findings.py

Lines changed: 58 additions & 0 deletions
@@ -615,6 +615,64 @@

 ALL_FINDINGS: List[Finding] = AWS_FINDINGS + AZURE_FINDINGS + GCP_FINDINGS

+AZURE_AI_FINDINGS: List[Finding] = [
+    Finding(
+        provider="azure",
+        rule_id="azure.aml.compute.idle",
+        resource_type="azure.aml.compute",
+        resource_id=(
+            "/subscriptions/29d91ee0-922f-483a-a81f-1a5eff4ecfa2"
+            "/resourceGroups/rg-ml-platform"
+            "/providers/Microsoft.MachineLearningServices"
+            "/workspaces/ml-platform-prod"
+            "/computes/gpu-train-cluster"
+        ),
+        region="East US",
+        title="Idle Azure ML Compute Cluster (Baseline Capacity Waste for 21 Days)",
+        summary=(
+            "AML compute cluster 'gpu-train-cluster' in workspace 'ml-platform-prod' "
+            "is configured to keep 2 node(s) always running (min_node_count=2) but no "
+            "workload activity was observed for 21 days — baseline capacity waste."
+        ),
+        reason="AML compute cluster has min_node_count=2 with no workload activity for 21 days",
+        risk=RiskLevel.HIGH,
+        confidence=ConfidenceLevel.HIGH,
+        detected_at=_NOW,
+        details={
+            "cluster_name": "gpu-train-cluster",
+            "workspace_name": "ml-platform-prod",
+            "resource_group": "rg-ml-platform",
+            "vm_size": "Standard_NC6s_v3",
+            "min_node_count": 2,
+            "is_gpu": True,
+            "age_days": 21,
+            "idle_window_days": 21,
+            "idle_days_threshold": 14,
+            "estimated_monthly_cost": "~$4,406/month",
+            "cost_estimate_type": "mapped",
+        },
+        evidence=Evidence(
+            signals_used=[
+                "Cluster configured with non-zero baseline capacity but no workload observed for 21 days (Azure Monitor: Active Nodes)",
+                "Baseline cost driver: min_node_count=2 (always-on compute — billed continuously)",
+                "Compute type: AmlCompute",
+                "Cluster age: 21 days",
+                "VM size: Standard_NC6s_v3",
+                "GPU cluster with no workload — high-cost idle state",
+            ],
+            signals_not_checked=[
+                "Scheduled or periodic training jobs",
+                "Jobs submitted outside the observation window",
+                "Planned future usage",
+                "Cluster configured with min_node_count for warm-start latency",
+                "Cluster reserved for interactive development",
+            ],
+            time_window="21 days",
+        ),
+        estimated_monthly_cost_usd=4406.0,
+    ),
+]
+
 AWS_AI_FINDINGS: List[Finding] = [
     Finding(
         provider="aws",
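
Reviewer sketch: the sample finding reports `~$4,406/month` for `min_node_count=2` on `Standard_NC6s_v3` with `cost_estimate_type` set to `mapped`. The commit does not show the rate table, so the arithmetic below rests on two assumptions: an hourly rate of $3.06 per node and a 720-hour month. Those values happen to reproduce the figure; the function name is hypothetical.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month, not taken from the commit


def baseline_monthly_cost(hourly_rate_usd: float, min_node_count: int) -> float:
    """Cost of the nodes that stay up regardless of workload."""
    return hourly_rate_usd * HOURS_PER_MONTH * min_node_count


# Assumed on-demand rate for Standard_NC6s_v3; illustrative only.
estimate = baseline_monthly_cost(3.06, 2)
print(f"~${estimate:,.0f}/month")
```

Under these assumptions each always-on node costs about $2,203 per month, so a cluster with `min_node_count=0` would contribute nothing to this estimate.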

cleancloud/doctor/azure.py

Lines changed: 120 additions & 0 deletions
@@ -301,6 +301,9 @@ def run_azure_doctor() -> None:
     info(" - Microsoft.Insights/metrics/read")
     info(" - Microsoft.Resources/subscriptions/read")
     info(" - Microsoft.Resources/resources/read")
+    info(" AI/ML rules (opt-in via --category ai):")
+    info(" - Microsoft.MachineLearningServices/workspaces/read")
+    info(" - Microsoft.MachineLearningServices/workspaces/computes/read")

     # Summary
     info("")
@@ -317,5 +320,122 @@ def run_azure_doctor() -> None:

     info("")
     success("AZURE ENVIRONMENT READY FOR CLEANCLOUD")
+    info("")
+    info("Tip: To also validate AI/ML permissions (Azure ML rules), run:")
+    info(" cleancloud doctor --provider azure --category ai")
+    info("=" * 70)
+    info("")
+
+
+def run_azure_ai_doctor(subscription_id: str = None) -> None:
+    """Validate Azure permissions for --category ai (Azure ML compute rules)."""
+    info("")
+    info("=" * 70)
+    info("AZURE AI/ML PERMISSION VALIDATION")
+    info("=" * 70)
+    info("")
+    info("Validating permissions for: cleancloud scan --provider azure --category ai")
+    info("")
+
+    try:
+        from azure.mgmt.machinelearningservices import AzureMachineLearningWorkspaces
+
+        credential = DefaultAzureCredential()
+    except Exception as e:
+        fail(f"Azure authentication failed — configure credentials and re-run doctor: {e}")
+        return
+
+    # Resolve a subscription to test against
+    try:
+        sub_client = SubscriptionClient(credential)
+        subscriptions = list(sub_client.subscriptions.list())
+        if not subscriptions:
+            fail("No accessible Azure subscriptions found")
+            return
+        test_sub = subscription_id or subscriptions[0].subscription_id
+        success(f"Using subscription: {test_sub}")
+    except Exception as e:
+        fail(f"Failed to list subscriptions: {e}")
+        return
361+
info("")
362+
info("Permission Checks")
363+
info("-" * 70)
364+
365+
permissions_tested = []
366+
permissions_failed = []
367+
368+
# Check: Microsoft.MachineLearningServices/workspaces/read
369+
try:
370+
ml_client = AzureMachineLearningWorkspaces(
371+
credential=credential,
372+
subscription_id=test_sub,
373+
)
374+
workspaces = list(ml_client.workspaces.list_by_subscription())
375+
permissions_tested.append("Microsoft.MachineLearningServices/workspaces/read")
376+
success(
377+
f"Microsoft.MachineLearningServices/workspaces/read "
378+
f"({len(workspaces)} workspace(s) found)"
379+
)
380+
except Exception as e:
381+
permissions_failed.append(("Microsoft.MachineLearningServices/workspaces/read", str(e)))
382+
warn(f"Microsoft.MachineLearningServices/workspaces/read — {e}")
383+
workspaces = []
384+
385+
# Check: Microsoft.MachineLearningServices/workspaces/computes/read
386+
if workspaces:
387+
try:
388+
ws = workspaces[0]
389+
rg = ws.id.split("/")[ws.id.lower().split("/").index("resourcegroups") + 1]
390+
list(ml_client.compute.list(rg, ws.name))
391+
permissions_tested.append("Microsoft.MachineLearningServices/workspaces/computes/read")
392+
success("Microsoft.MachineLearningServices/workspaces/computes/read")
393+
except Exception as e:
394+
permissions_failed.append(
395+
("Microsoft.MachineLearningServices/workspaces/computes/read", str(e))
396+
)
397+
warn(f"Microsoft.MachineLearningServices/workspaces/computes/read — {e}")
398+
else:
399+
info(
400+
" Skipping computes/read check — no workspaces found to test against "
401+
"(permission may still be present)"
402+
)
403+
+    # Check: Microsoft.Insights/metrics/read (already required by hygiene rules)
+    try:
+        from azure.mgmt.monitor import MonitorManagementClient
+
+        monitor = MonitorManagementClient(credential=credential, subscription_id=test_sub)
+        # A lightweight call — list metric definitions for a subscription-level scope
+        monitor.metric_definitions.list(
+            f"/subscriptions/{test_sub}",
+        )
+        permissions_tested.append("Microsoft.Insights/metrics/read")
+        success("Microsoft.Insights/metrics/read")
+    except Exception as e:
+        permissions_failed.append(("Microsoft.Insights/metrics/read", str(e)))
+        warn(f"Microsoft.Insights/metrics/read — {e}")
+
+    info("")
+    info("=" * 70)
+    total = len(permissions_tested) + len(permissions_failed)
+    info(f"Permissions: {len(permissions_tested)}/{total} passed")
+
+    if permissions_failed:
+        info("")
+        for perm, _ in permissions_failed:
+            warn(f" missing: {perm}")
+        info("")
+        info("Assign the AI role to your service principal:")
+        info(" az role definition create --role-definition security/azure/ai-readonly-role.json")
+        info(' az role assignment create --assignee <APP_ID> --role "CleanCloudAIReadOnly" \\')
+        info(" --scope /subscriptions/<SUBSCRIPTION_ID>")
+        info("Then re-run: cleancloud doctor --provider azure --category ai")
+        info("")
+        warn("AZURE AI/ML PERMISSIONS INCOMPLETE")
+    else:
+        info("")
+        success("AZURE AI/ML PERMISSIONS READY")
+        info("Run: cleancloud scan --provider azure --category ai")
     info("=" * 70)
     info("")
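
Reviewer sketch: the `Microsoft.Insights/metrics/read` check calls `monitor.metric_definitions.list(...)` and never iterates the result, whereas the two Azure ML checks wrap their calls in `list(...)`. As far as I know, Azure SDK list operations return a lazy pager (`ItemPaged`), in which case no request is sent at that point and the check cannot fail on a missing permission. The snippet below illustrates that behavior with a plain generator standing in for the pager. No Azure SDK is involved, and the function name and scope string are hypothetical.

```python
from typing import Iterator, List

calls: List[str] = []  # records when the fake service is actually contacted


def list_metric_definitions(scope: str) -> Iterator[str]:
    """Stand-in for a lazy SDK pager: nothing runs until iteration starts."""
    calls.append(scope)  # simulates the HTTP request, where a 403 would surface
    yield "metric-definition"


pager = list_metric_definitions("/subscriptions/<SUBSCRIPTION_ID>")
sent_before = len(calls)  # creating the pager has not contacted the service

first = next(iter(pager), None)  # pulling one item triggers the request
sent_after = len(calls)
```

If the real pager behaves this way, pulling a single item with `next(iter(...), None)` would exercise the permission without fetching every page. Whether a bare subscription scope is a valid target for that call is not something the diff shows.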

cleancloud/doctor/runner.py

Lines changed: 8 additions & 2 deletions
@@ -2,7 +2,7 @@
 from typing import Optional

 from cleancloud.doctor.aws import run_aws_ai_doctor, run_aws_doctor
-from cleancloud.doctor.azure import run_azure_doctor
+from cleancloud.doctor.azure import run_azure_ai_doctor, run_azure_doctor
 from cleancloud.doctor.common import DoctorError, info, success
 from cleancloud.doctor.gcp import run_gcp_doctor

@@ -66,7 +66,13 @@ def run_doctor(
             info(" The --region parameter is only used for AWS provider")
             info("")

-        run_azure_doctor()
+        if category == "ai":
+            run_azure_ai_doctor()
+        elif category == "all":
+            run_azure_doctor()
+            run_azure_ai_doctor()
+        else:
+            run_azure_doctor()
         results[p] = {"status": "passed", "error": None}

     elif p == "gcp":
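
Reviewer sketch: the three-way branch on `category` maps each category to an ordered list of checks. A table-driven version of that mapping might look like the following. The two check functions are stand-ins that only record that they ran, and `run_azure_checks` is a hypothetical name, not part of the commit.

```python
from typing import Callable, Dict, List

executed: List[str] = []  # records which stand-in checks ran, in order


def run_azure_doctor() -> None:  # stand-in for the real hygiene check
    executed.append("hygiene")


def run_azure_ai_doctor() -> None:  # stand-in for the real AI/ML check
    executed.append("ai")


CHECKS: Dict[str, List[Callable[[], None]]] = {
    "hygiene": [run_azure_doctor],
    "ai": [run_azure_ai_doctor],
    "all": [run_azure_doctor, run_azure_ai_doctor],
}


def run_azure_checks(category: str) -> None:
    """Hypothetical dispatcher; unknown categories fall back to hygiene, like the else branch."""
    for check in CHECKS.get(category, CHECKS["hygiene"]):
        check()
```

Note that in the diff `results[p]` is set to `"passed"` after the branch regardless of which checks ran, and `run_azure_ai_doctor` reports missing permissions through `warn` rather than by raising.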

cleancloud/output/summary.py

Lines changed: 2 additions & 1 deletion
@@ -143,6 +143,7 @@ def _print_summary(summary: dict, region_selection_mode: str = None, multi_accou
         missing = skipped.get("missing_permissions", "")
         # Strip verbose prefix if present
         missing = missing.replace("Missing required IAM permissions: ", "")
+        missing = missing.replace("Missing required permissions: ", "")
         click.echo(f" - {rule_name}")
         if missing:
             click.echo(f" needs: {missing}")
@@ -162,7 +163,7 @@ def _print_summary(summary: dict, region_selection_mode: str = None, multi_accou
         )
     if has_azure:
         click.echo(
-            " Azure: https://github.com/cleancloud-io/cleancloud/blob/main/security/azure-readonly-role.json"
+            " Azure: https://github.com/cleancloud-io/cleancloud/tree/main/security/azure/"
         )
         click.echo(
             " Run 'cleancloud doctor --provider azure' to validate permissions after updating."

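Reviewer sketch: `summary.py` now strips two verbose prefixes with consecutive `str.replace` calls. Below is a small variant that keeps the prefixes in one tuple and removes only a leading match. This is an illustration under the assumption that the prefix always appears at the start of the message; `strip_verbose_prefix` and `_VERBOSE_PREFIXES` are hypothetical names.

```python
_VERBOSE_PREFIXES = (
    "Missing required IAM permissions: ",  # wording stripped before this commit
    "Missing required permissions: ",      # wording added in this commit
)


def strip_verbose_prefix(message: str) -> str:
    """Hypothetical helper: drop the first matching leading prefix, if any."""
    for prefix in _VERBOSE_PREFIXES:
        if message.startswith(prefix):
            return message[len(prefix):]
    return message
```

Unlike `str.replace`, this leaves the text untouched when the phrase occurs in the middle of a message, which may or may not be the behavior the project wants.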