Commit dbcdb5f

AI rule: Detect GCP Idle vertex Endpoint (#126)
1 parent 66e02cb commit dbcdb5f

27 files changed

Lines changed: 2780 additions & 224 deletions

README.fr.md

Lines changed: 14 additions & 7 deletions
@@ -15,7 +15,8 @@
1515

1616
```bash
1717
pipx install cleancloud
18-
cleancloud demo # visualisez des findings — aucun credential requis
18+
cleancloud demo # visualisez des findings — aucun credential requis
19+
cleancloud demo --category ai # findings IA/ML (SageMaker, AML, Vertex AI — endpoints/clusters GPU intensifs)
1920
```
2021

2122
Scannez votre cloud :
@@ -24,21 +25,23 @@ Scannez votre cloud :
2425
cleancloud scan --provider aws --all-regions
2526
cleancloud scan --provider azure
2627
cleancloud scan --provider gcp --all-projects
28+
cleancloud scan --provider aws --category ai # détectez les endpoints SageMaker inactifs
2729
```
2830

2931
---
3032

31-
**CleanCloud est le moteur d'hygiène cloud — la couche manquante entre la visibilité des coûts et le nettoyage.**
33+
**CleanCloud est le moteur d'hygiène cloud — détecte le gaspillage d'infrastructure inactive et de ressources IA/ML coûteuses sur AWS, Azure et GCP.**
3234

3335
**Supporte :** AWS · Azure · GCP
3436

35-
CleanCloud scanne vos environnements AWS, Azure et GCP et vous indique exactement ce qu'il faut nettoyer — avec des estimations de coût par ressource. Aucun agent. Pas de SaaS. Lecture seule. S'exécute entièrement dans votre environnement.
37+
CleanCloud scanne vos environnements AWS, Azure et GCP et vous indique exactement ce qu'il faut nettoyer — infrastructure inactive et ressources IA/ML coûteuses (endpoints SageMaker, clusters AML Compute, endpoints Vertex AI) — avec des estimations de coût par ressource. Aucun agent. Pas de SaaS. Lecture seule. S'exécute entièrement dans votre environnement.
3638

3739
| | Outils natifs AWS/Azure/GCP | Plateformes FinOps SaaS | **CleanCloud** |
3840
|---|:---:|:---:|:---:|
3941
| Affiche les tendances de coûts ||||
4042
| Nomme exactement les ressources à nettoyer || partiel ||
4143
| Estimation de coût déterministe par ressource ||||
44+
| Détecte le gaspillage IA/ML (SageMaker, AML, Vertex AI — dont les endpoints GPU) ||||
4245
| Lecture seule, aucun agent ||||
4346
| Fonctionne en environnements air-gapped / réglementés ||||
4447
| Aucun compte SaaS ni accès vendor requis ||||
@@ -141,7 +144,10 @@ Pas encore de compte cloud ? `cleancloud demo` affiche un exemple de sortie sans
141144

142145
## Fonctionnalités clés
143146

144-
- **32 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe. Les règles IA/ML (SageMaker, Azure ML) sont opt-in via `--category ai`
147+
- **Détection du gaspillage IA/ML sur les 3 clouds :** endpoints SageMaker inactifs (AWS), clusters AML Compute inactifs (Azure), et endpoints Vertex AI Online Prediction inactifs (GCP) — ressources GPU toujours provisionnées flaggées risque HIGH, avec un gaspillage typique de $449 à $23K+/mois. Opt-in via `--category ai` ou `--category all`
148+
149+
De nombreuses ressources IA/ML restent provisionnées en permanence (min replicas / baseline capacity) et continuent de facturer même sans trafic — CleanCloud détecte ces déploiements abandonnés ou sous-utilisés dès le début.
150+
- **33 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
145151
- **Gouvernance et application de politique (opt-in) :** `--fail-on-confidence HIGH` ou `--fail-on-cost 100` — appliquer des seuils de gaspillage sur un planning, géré par les équipes platform ou FinOps
146152
- **Scan multi-comptes (AWS) :** scannez des AWS Organizations entières en une exécution — fichier de config, IDs inline, ou auto-découverte via `--org`
147153
- **Scan multi-abonnements (Azure) :** scannez tous les abonnements Azure en parallèle — auto-découverte via Management Group, détail des coûts par abonnement inclus
@@ -227,7 +233,7 @@ Pas sûr que vos credentials aient les bonnes permissions ? Lancez d'abord `clea
227233
| Flag | Fonction |
228234
|---|---|
229235
| `--provider aws\|azure\|gcp` | Fournisseur cloud à scanner *(obligatoire)* |
230-
| `--category hygiene\|ai\|all` | Catégorie de règles : `hygiene` (défaut), `ai` (SageMaker sur AWS, AML Compute sur Azure) ou `all` (hygiene + IA) |
236+
| `--category hygiene\|ai\|all` | Catégorie de règles : `hygiene` (défaut), `ai` (SageMaker sur AWS, AML Compute sur Azure, Vertex AI sur GCP) ou `all` (hygiene + IA) |
231237
| `--region REGION` | Scanner une seule région |
232238
| `--all-regions` | Toutes les régions actives — AWS/Azure uniquement |
233239
| **AWS multi-comptes** | |
@@ -344,7 +350,7 @@ Pour des exemples de sortie complets incluant `doctor`, JSON, CSV et markdown :
344350

345351
## Ce que CleanCloud détecte
346352

347-
32 règles pour AWS, Azure et GCP — conservatives, haut signal, conçues pour éviter les faux positifs en environnements IaC.
353+
33 règles pour AWS, Azure et GCP — conservatives, haut signal, conçues pour éviter les faux positifs en environnements IaC.
348354

349355
**AWS :**
350356
- Compute : instances arrêtées 30+ jours (charges EBS continuent)
@@ -368,6 +374,7 @@ Pour des exemples de sortie complets incluant `doctor`, JSON, CSV et markdown :
368374
- Stockage : Persistent Disks non attachés (HIGH), anciens snapshots 90+ jours
369375
- Réseau : IPs statiques réservées — régionales et globales — en état RESERVED (HIGH)
370376
- Plateforme : instances Cloud SQL inactives avec zéro connexion 14+ jours (HIGH)
377+
- IA/ML *(opt-in : `--category ai`)* : endpoints Vertex AI Online Prediction inactifs avec zéro ou quasi-zéro prédiction depuis 14+ jours (les nœuds dédiés continuent de facturer quel que soit le trafic) — endpoints GPU flaggés risque HIGH ($449–$23K+/mois)
371378

372379
Les règles sans marqueur de confiance sont MEDIUM — elles utilisent des heuristiques temporelles ou des signaux multiples. Commencez par `--fail-on-confidence HIGH` pour les gaspillages évidents, puis resserrez au fil de la validation par votre équipe.
373380

@@ -603,7 +610,7 @@ Guide complet : [Configuration GCP →](docs/gcp.md)
603610

604611
**Policy-as-code** — `cleancloud.yaml` avec packs de règles, exceptions par équipe, et seuils de coût en config — la principale demande de gouvernance FinOps pour 2025/2026
605612

606-
**Plus de règles IA/ML** — endpoints Vertex AI inactifs, instances de notebook SageMaker inutilisées, artefacts d'entraînement orphelins
613+
**Plus de règles IA/ML** — instances de notebook SageMaker inutilisées, artefacts d'entraînement orphelins, instances de notebook Vertex AI inactives
607614

608615
**Plus de règles AWS** — lacunes de cycle de vie S3, Redshift inactif, fuite de coût NAT Gateway, VPC endpoints inutilisés
609616

README.md

Lines changed: 14 additions & 7 deletions
@@ -15,7 +15,8 @@
1515

1616
```bash
1717
pipx install cleancloud
18-
cleancloud demo # see sample findings — no credentials needed
18+
cleancloud demo # see sample findings — no credentials needed
19+
cleancloud demo --category ai # see AI/ML waste findings (SageMaker, AML, Vertex AI — GPU-heavy endpoints/clusters)
1920
```
2021

2122
Scan your cloud:
@@ -24,21 +25,23 @@ Scan your cloud:
2425
cleancloud scan --provider aws --all-regions
2526
cleancloud scan --provider azure
2627
cleancloud scan --provider gcp --all-projects
28+
cleancloud scan --provider aws --category ai # detect idle SageMaker endpoints
2729
```
2830

2931
---
3032

31-
**CleanCloud is the Cloud Hygiene Engine — the missing layer between cost visibility and cleanup.**
33+
**CleanCloud is the Cloud Hygiene Engine — detects idle infrastructure and high-cost AI/ML waste across AWS, Azure, and GCP.**
3234

3335
**Supports:** AWS · Azure · GCP
3436

35-
CleanCloud scans your AWS, Azure, and GCP environments and tells you exactly what to clean up — with per-resource cost estimates. No agents. No SaaS. Read-only. Runs entirely in your environment.
37+
CleanCloud scans your AWS, Azure, and GCP environments and tells you exactly what to clean up — idle infrastructure and high-cost AI/ML resources (SageMaker endpoints, AML compute clusters, Vertex AI endpoints) — with per-resource cost estimates. No agents. No SaaS. Read-only. Runs entirely in your environment.
3638

3739
| | AWS/Azure/GCP native cost tools | FinOps SaaS platforms | **CleanCloud** |
3840
|---|:---:|:---:|:---:|
3941
| Shows cost trends ||||
4042
| Names exactly which resources to clean up || partial ||
4143
| Deterministic cost estimate per resource ||||
44+
| Detects idle AI/ML waste (SageMaker, AML, Vertex AI — including GPU-backed endpoints) ||||
4245
| Read-only, no agents ||||
4346
| Runs in air-gapped / regulated environments ||||
4447
| No SaaS account or vendor access required ||||
@@ -141,7 +144,10 @@ No cloud account yet? `cleancloud demo` shows sample output without any credenti
141144

142145
## Key Features
143146

144-
- **32 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate. AI/ML rules (SageMaker, Azure ML) are opt-in via `--category ai`
147+
- **AI/ML waste detection across all 3 clouds:** idle SageMaker endpoints (AWS), idle AML compute clusters (Azure), and idle Vertex AI Online Prediction endpoints (GCP) — always-on GPU-backed resources flagged HIGH risk, with typical waste ranging from $449–$23K+/month. Opt-in via `--category ai` or `--category all`
148+
149+
Many AI/ML serving resources remain permanently provisioned (min replicas / baseline capacity) and continue billing even with zero traffic — CleanCloud detects these abandoned or underutilized deployments early.
150+
- **33 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate
145151
- **Governance enforcement (opt-in):** `--fail-on-confidence HIGH` or `--fail-on-cost 100` — enforce waste thresholds on a schedule, owned by platform or FinOps teams
146152
- **Multi-account scanning (AWS):** scan entire AWS Organizations in one run — config file, inline IDs, or auto-discovery via `--org`
147153
- **Multi-subscription scanning (Azure):** scan all Azure subscriptions in parallel — auto-discovery via Management Group, per-subscription cost breakdown included
@@ -229,7 +235,7 @@ Run `cleancloud doctor --provider aws`, `cleancloud doctor --provider azure`, or
229235
| Flag | What it does |
230236
|---|---|
231237
| `--provider aws\|azure\|gcp` | Cloud provider to scan *(required)* |
232-
| `--category hygiene\|ai\|all` | Rule category: `hygiene` (default), `ai` (SageMaker on AWS, AML Compute on Azure), or `all` (hygiene + AI) |
238+
| `--category hygiene\|ai\|all` | Rule category: `hygiene` (default), `ai` (SageMaker on AWS, AML Compute on Azure, Vertex AI on GCP), or `all` (hygiene + AI) |
233239
| `--region REGION` | Scan a single region |
234240
| `--all-regions` | Scan all active regions — AWS/Azure only |
235241
| **AWS multi-account** | |
@@ -346,7 +352,7 @@ For full output examples including `doctor`, JSON, CSV, and markdown: [`docs/exa
346352

347353
## What CleanCloud Detects
348354

349-
32 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
355+
33 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
350356

351357
**AWS:**
352358
- Compute: stopped instances 30+ days (EBS charges continue)
@@ -370,6 +376,7 @@ For full output examples including `doctor`, JSON, CSV, and markdown: [`docs/exa
370376
- Storage: unattached Persistent Disks (HIGH), old snapshots 90+ days
371377
- Network: unused reserved static IPs — regional and global (HIGH)
372378
- Platform: idle Cloud SQL instances with zero connections 14+ days (HIGH)
379+
- AI/ML *(opt-in: `--category ai`)*: idle Vertex AI Online Prediction endpoints with zero or near-zero predictions 14+ days (dedicated nodes continue billing regardless of traffic) — GPU-backed endpoints flagged HIGH risk ($449–$23K+/month)
373380

374381
Rules without a confidence marker are MEDIUM — they use time-based heuristics or multiple signals. Start with `--fail-on-confidence HIGH` to catch obvious waste, then tighten as your team validates.
375382

@@ -605,7 +612,7 @@ Full setup guide: [GCP setup →](docs/gcp.md)
605612

606613
**Policy-as-code** — `cleancloud.yaml` with rule packs, per-team exceptions, and cost thresholds in config — the top FinOps governance ask for 2025/2026
607614

608-
**More AI/ML waste rules** — Vertex AI endpoints idle, SageMaker notebook instances running unused, orphaned training artifacts
615+
**More AI/ML waste rules** — SageMaker notebook instances running unused, orphaned training artifacts, Vertex AI notebook instances idle
609616

610617
**More AWS rules** — S3 lifecycle gaps, Redshift idle, NAT Gateway cost leakage (internal services routing through NAT instead of VPC endpoints — S3, DynamoDB, ECR, SSM), unused VPC endpoints
611618

cleancloud/demo/command.py

Lines changed: 8 additions & 3 deletions
@@ -9,6 +9,7 @@
     AWS_FINDINGS,
     AZURE_AI_FINDINGS,
     AZURE_FINDINGS,
+    GCP_AI_FINDINGS,
     GCP_FINDINGS,
 )
 from cleancloud.output.human import print_human
@@ -26,7 +27,7 @@
     "--category",
     type=click.Choice(["hygiene", "ai"]),
     default="hygiene",
-    help="Rule category to demo: hygiene (default) or ai (SageMaker on AWS, AML Compute on Azure)",
+    help="Rule category to demo: hygiene (default) or ai (SageMaker/AWS, AML Compute/Azure, Vertex AI/GCP)",
 )
 def demo(provider: Optional[str], category: str):
     """Show realistic sample findings without cloud credentials."""
@@ -45,9 +46,13 @@ def demo(provider: Optional[str], category: str):
             findings = AZURE_AI_FINDINGS
             regions = ["East US"]
             region_mode = "all"
+        elif provider == "gcp":
+            findings = GCP_AI_FINDINGS
+            regions = ["us-central1"]
+            region_mode = "all"
         else:
-            findings = AWS_AI_FINDINGS + AZURE_AI_FINDINGS
-            regions = ["us-east-1", "East US"]
+            findings = AWS_AI_FINDINGS + AZURE_AI_FINDINGS + GCP_AI_FINDINGS
+            regions = ["us-east-1", "East US", "us-central1"]
             region_mode = "all"
     elif provider == "aws":
         findings = AWS_FINDINGS
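
The `--category ai` branch in this hunk picks a findings list and a region label per provider, and falls back to all three providers when none is given. A minimal sketch of the same selection written as a lookup table, under stated assumptions: the string lists stand in for the real `Finding` objects, `_AI_DEMO` and `select_ai_demo` are hypothetical names that do not appear in the commit, and the single-provider AWS region is inferred from the combined list rather than shown in the diff.

```python
from typing import Dict, List, Optional, Tuple

# Placeholder data standing in for the real Finding lists (hypothetical values).
AWS_AI_FINDINGS: List[str] = ["aws-sagemaker-endpoint-idle"]
AZURE_AI_FINDINGS: List[str] = ["azure-aml-compute-idle"]
GCP_AI_FINDINGS: List[str] = ["gcp-vertex-endpoint-idle"]

# Provider name -> (demo findings, region labels). Insertion order matters:
# it determines the order of the combined output below.
_AI_DEMO: Dict[str, Tuple[List[str], List[str]]] = {
    "aws": (AWS_AI_FINDINGS, ["us-east-1"]),  # AWS region assumed, see lead-in
    "azure": (AZURE_AI_FINDINGS, ["East US"]),
    "gcp": (GCP_AI_FINDINGS, ["us-central1"]),
}


def select_ai_demo(provider: Optional[str]) -> Tuple[List[str], List[str]]:
    """Return (findings, regions) for one provider, or for all when provider is None."""
    if provider in _AI_DEMO:
        findings, regions = _AI_DEMO[provider]
        return list(findings), list(regions)

    # No provider given: concatenate every provider's demo data in table order.
    all_findings: List[str] = []
    all_regions: List[str] = []
    for findings, regions in _AI_DEMO.values():
        all_findings.extend(findings)
        all_regions.extend(regions)
    return all_findings, all_regions
```

With a table like this, adding a fourth provider would touch one dictionary entry instead of two branches, which is the main reason to consider it over the `elif` chain.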

cleancloud/demo/findings.py

Lines changed: 70 additions & 4 deletions
@@ -551,8 +551,10 @@
                 "Status: READY",
                 "Disk size: 400 GB",
                 "Estimated cost: ~$10.4/month (disk size used as proxy)",
-                "Source disk reference missing — likely orphaned snapshot "
-                "(GCP clears sourceDisk when the backing disk is deleted)",
+                (
+                    "Source disk reference missing — likely orphaned snapshot "
+                    "(GCP clears sourceDisk when the backing disk is deleted)"
+                ),
             ],
             signals_not_checked=[
                 "Compliance or regulatory data retention requirements",
@@ -595,8 +597,10 @@
         evidence=Evidence(
             signals_used=[
                 "Instance state: RUNNABLE",
-                "Zero TCP connections observed via Cloud Monitoring over 14 days "
-                "(metric: cloudsql.googleapis.com/database/network/connections)",
+                (
+                    "Zero TCP connections observed via Cloud Monitoring over 14 days "
+                    "(metric: cloudsql.googleapis.com/database/network/connections)"
+                ),
                 "Database version: POSTGRES_14",
                 "Tier 'db-n1-standard-2' costs ~$93.10/month (compute only, no HA)",
                 "Storage: 100 GB (PD_SSD) — billed separately from compute",
@@ -613,6 +617,68 @@
     ),
 ]
 
+GCP_AI_FINDINGS: List[Finding] = [
+    Finding(
+        provider="gcp",
+        rule_id="gcp.vertex.endpoint.idle",
+        resource_type="gcp.vertex.endpoint",
+        resource_id="projects/my-project/locations/us-central1/endpoints/8842019374650589184",
+        region="us-central1",
+        title="Idle Vertex AI Endpoint (No Predictions for 21 Days)",
+        summary=(
+            "Vertex AI endpoint 'llm-serving-v2' in 'us-central1' has received zero predictions "
+            "for 21 days but keeps 1 dedicated node running continuously, incurring compute charges."
+        ),
+        reason=(
+            "Vertex AI endpoint has zero predictions for 21 days "
+            "with dedicated capacity (minReplicaCount=1)"
+        ),
+        risk=RiskLevel.HIGH,
+        confidence=ConfidenceLevel.HIGH,
+        detected_at=_NOW,
+        details={
+            "endpoint_id": "8842019374650589184",
+            "display_name": "llm-serving-v2",
+            "location": "us-central1",
+            "machine_type": "n1-standard-4",
+            "accelerator_type": "NVIDIA_TESLA_T4",
+            "accelerator_count": 1,
+            "is_gpu": True,
+            "min_replica_count": 1,
+            "age_days": 21,
+            "idle_window_days": 21,
+            "idle_days_threshold": 14,
+            "estimated_monthly_cost": "~$449/month",
+        },
+        evidence=Evidence(
+            signals_used=[
+                (
+                    "Zero prediction requests for 21 days "
+                    "(Cloud Monitoring: aiplatform.googleapis.com/prediction/online/request_count)"
+                ),
+                (
+                    "Dedicated capacity configured: minReplicaCount=1 "
+                    "(always-on compute — billed continuously regardless of traffic)"
+                ),
+                "Endpoint age: 21 days",
+                "Machine type: n1-standard-4",
+                "Accelerator: NVIDIA_TESLA_T4 × 1",
+                "GPU-backed endpoint — high continuous cost",
+                "Display name: llm-serving-v2",
+            ],
+            signals_not_checked=[
+                "Scheduled or batch prediction requests outside the observation window",
+                "Internal health-check or canary traffic not tracked by Cloud Monitoring",
+                "Planned future usage or upcoming model promotion",
+                "Shadow mode or A/B test routing with low traffic share",
+                "Endpoints kept warm for latency-sensitive production traffic",
+            ],
+            time_window="21 days",
+        ),
+        estimated_monthly_cost_usd=449.0,
+    ),
+]
+
 ALL_FINDINGS: List[Finding] = AWS_FINDINGS + AZURE_FINDINGS + GCP_FINDINGS
 
 AZURE_AI_FINDINGS: List[Finding] = [
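
The GCP demo finding in this file names the signals the rule relies on: zero prediction requests over the observation window, dedicated capacity (`minReplicaCount=1`), and a 14-day idle threshold. The sketch below shows one way those signals might combine into a decision. It is an illustration under assumptions, not the rule's actual implementation: `EndpointSnapshot`, `is_idle`, `is_gpu_backed`, and the `max_requests` knob for "near-zero" traffic are hypothetical names, and the request count is assumed to have been fetched from Cloud Monitoring already.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EndpointSnapshot:
    """Hypothetical pre-fetched view of one Vertex AI endpoint."""
    display_name: str
    min_replica_count: int   # dedicated nodes kept running regardless of traffic
    age_days: int            # days since the endpoint was created
    request_count: int       # prediction requests summed over the observation window
    accelerator_type: Optional[str] = None  # e.g. "NVIDIA_TESLA_T4", or None for CPU-only


def is_idle(ep: EndpointSnapshot, idle_days_threshold: int = 14, max_requests: int = 0) -> bool:
    """Flag an endpoint only if it has always-on capacity, is old enough, and saw no traffic."""
    if ep.min_replica_count < 1:
        return False  # no dedicated capacity configured, so nothing bills continuously
    if ep.age_days < idle_days_threshold:
        return False  # too young to have a full observation window
    return ep.request_count <= max_requests


def is_gpu_backed(ep: EndpointSnapshot) -> bool:
    """True when an accelerator is attached (the README says these are flagged HIGH risk)."""
    return ep.accelerator_type is not None
```

Applied to the demo values (`min_replica_count=1`, `age_days=21`, zero requests, one T4), both checks return `True`. The age guard is a deliberate choice in this sketch: without it, a freshly deployed endpoint that has not yet received traffic would be flagged.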

cleancloud/doctor/command.py

Lines changed: 2 additions & 5 deletions
@@ -58,11 +58,8 @@ def doctor(
     click.echo("Running CleanCloud doctor")
     click.echo()
 
-    if category == "ai" and provider not in (None, "aws"):
-        raise click.UsageError(
-            "--category ai is only supported with --provider aws (SageMaker rules). "
-            "AI/ML rules for Azure and GCP are on the roadmap."
-        )
+    if category == "ai" and provider not in (None, "aws", "azure", "gcp"):
+        raise click.UsageError("--category ai is only supported with --provider aws, azure, or gcp")
 
     if multi_account_file:
         if provider != "aws" and provider is not None:
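
The guard in this hunk now accepts all three providers for `--category ai`. A dependency-free sketch of the same check follows, using `ValueError` in place of `click.UsageError` so it runs without Click installed; `AI_PROVIDERS` and `validate_ai_category` are hypothetical names introduced for illustration.

```python
from typing import Optional

# Providers that have AI/ML rules after this commit.
AI_PROVIDERS = ("aws", "azure", "gcp")


def validate_ai_category(category: str, provider: Optional[str]) -> None:
    """Raise if --category ai is combined with a provider that has no AI rules."""
    if category == "ai" and provider not in (None, *AI_PROVIDERS):
        raise ValueError(
            "--category ai is only supported with --provider aws, azure, or gcp"
        )
```

If `--provider` is already restricted to `aws`, `azure`, and `gcp` at the option level (the README flag table suggests so, but the diff does not show it), this guard can no longer fire and serves mainly as a safeguard for providers added later.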
