Commit 6605c98

AI Rules: gcp.tpu.idle && gcp.vertex.featurestore.idle (#154)
1 parent 60478c0 commit 6605c98

15 files changed

Lines changed: 2603 additions & 36 deletions


README.fr.md

Lines changed: 6 additions & 4 deletions
@@ -194,7 +194,7 @@ Pas encore de compte cloud ? `cleancloud demo` affiche un exemple de sortie sans
 - **Détection du gaspillage IA/ML sur les 3 clouds :** points de terminaison SageMaker et notebooks, clusters AML Compute et instances ML, points de terminaison en ligne Azure ML et services Azure AI Search, points de terminaison Vertex AI et instances Workbench — facturés 500–23 000 $/mois par ressource en silence. Les ressources GPU sont signalées comme RISQUE ÉLEVÉ. Les outils natifs n'indiquent pas toujours quoi supprimer — CleanCloud le fait. Opt-in via `--category ai`
 - **Gouvernance policy-as-code :** `cleancloud.yaml` pour la configuration par règle, les exceptions avec dates d'expiration, les seuils de coût et de confiance, les exclusions par tag — versionné aux côtés de votre infrastructure. Chaque exception est une approbation auditée dans git.
 - **Application de politique (opt-in) :** `--fail-on-confidence HIGH` ou `--fail-on-cost 500` — appliquer des seuils de gaspillage en CI/CD sur un planning, géré par les équipes platform ou FinOps
-- **43 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
+- **45 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
 - **Scan multi-comptes (AWS) :** scannez des AWS Organizations entières en une exécution — fichier de config, IDs inline, ou auto-découverte via `--org`
 - **Scan multi-abonnements (Azure) :** scannez tous les abonnements Azure en parallèle — auto-découverte via Management Group, détail des coûts par abonnement inclus
 - **Scan multi-projets (GCP) :** scannez tous les projets GCP accessibles en parallèle — auto-découverte via Application Default Credentials, détail des coûts par projet inclus
@@ -283,13 +283,15 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 | Déploiement Azure OpenAI Provisionné (PTU) | 1 460+ $ / PTU / mois |
 | Endpoint Vertex AI Online Prediction (GPU) | 449 – 23 000+ $ / mois |
 | Instance Vertex AI Workbench (GPU) | 449 – 8 000+ $ / mois |
+| Nœud Cloud TPU (v4/v5p) | 188 – 750+ $ / jour |
+| Vertex AI Feature Store (Bigtable) | 197 – 591+ $ / mois |

 CleanCloud détecte les endpoints à zéro invocation / zéro prédiction et les instances de notebook inactives sur les 3 clouds et les signale risque HIGH. Les outils natifs montrent la facture — ils ne vous disent pas *quel endpoint* supprimer.

 ```bash
 cleancloud scan --provider aws --category ai # PTUs Bedrock + endpoints + notebooks + Studio apps SageMaker + training jobs SageMaker + EC2 GPU
 cleancloud scan --provider azure --category ai # clusters AML + instances ML + endpoints en ligne + AI Search + instances ML + PTUs OpenAI
-cleancloud scan --provider gcp --category ai # endpoints Vertex AI + Workbench + training jobs
+cleancloud scan --provider gcp --category ai # endpoints Vertex AI + Workbench + training jobs + Cloud TPU + Feature Stores
 cleancloud scan --provider aws --category all # hygiène + IA/ML ensemble
 ```

@@ -527,7 +529,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 ## Ce que CleanCloud détecte

-43 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC.
+45 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC.

 **AWS :**
 - Compute : instances arrêtées 30+ jours (charges EBS continuent)
@@ -551,7 +553,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 - Stockage : Persistent Disks non attachés (HIGH), anciens snapshots 90+ jours
 - Réseau : IPs statiques réservées — régionales et globales — en état RESERVED (HIGH)
 - Plateforme : instances Cloud SQL inactives avec zéro connexion 14+ jours (HIGH)
-- IA/ML *(opt-in : `--category ai`)* : endpoints Vertex AI Online Prediction inactifs avec zéro ou quasi-zéro prédiction depuis 14+ jours (les nœuds dédiés continuent de facturer quel que soit le trafic) — endpoints GPU flaggés risque HIGH ($449–$23K+/mois) ; instances Workbench sans activité depuis 14+ jours — instances GPU flaggées HIGH/CRITICAL ($449–$8K+/mois) ; training jobs Vertex AI (CustomJobs + TrainingPipelines) dépassant 24h — alerte précoce GPU à 75% du seuil, risque CRITICAL pour les jobs GPU à 3× le seuil ($4–$80+/h par nœud GPU)
+- IA/ML *(opt-in : `--category ai`)* : endpoints Vertex AI Online Prediction inactifs avec zéro ou quasi-zéro prédiction depuis 14+ jours (les nœuds dédiés continuent de facturer quel que soit le trafic) — endpoints GPU flaggés risque HIGH ($449–$23K+/mois) ; instances Workbench sans activité depuis 14+ jours — instances GPU flaggées HIGH/CRITICAL ($449–$8K+/mois) ; training jobs Vertex AI (CustomJobs + TrainingPipelines) dépassant 24h — alerte précoce GPU/TPU à 90% du seuil, risque CRITICAL pour les jobs GPU à 3× le seuil ($4–$80+/h par nœud GPU) ; nœuds Cloud TPU (v2–v6e) en état READY avec duty_cycle quasi-nul depuis 7+ jours — un v4 inactif coûte $12,88/h, un v5p-8 coûte $33,60/h ; Feature Stores Vertex AI avec zéro requête ReadFeatureValues depuis 30+ jours — les stores Bigtable facturent ~$197/nœud/mois quelle que soit l'activité

 Les règles sans marqueur de confiance sont MEDIUM — elles utilisent des heuristiques temporelles ou des signaux multiples. Commencez par `--fail-on-confidence HIGH` pour les gaspillages évidents, puis resserrez au fil de la validation par votre équipe.

README.md

Lines changed: 6 additions & 4 deletions
@@ -194,7 +194,7 @@ No cloud account yet? `cleancloud demo` shows sample output without any credenti
 - **AI/ML waste detection across all 3 clouds:** idle SageMaker endpoints and notebooks, AML compute clusters and instances, Azure ML online endpoints and AI Search services, Vertex AI endpoints and Workbench instances — silently billing $500–$23K/month per resource. GPU-backed resources flagged HIGH risk. Native cost tools don't surface these — CleanCloud does. Opt-in via `--category ai`
 - **Policy-as-code governance:** `cleancloud.yaml` for per-rule config, exceptions with expiry dates, cost and confidence thresholds, tag-based exclusions — version-controlled alongside your infrastructure. Every exception is a git-reviewable approval.
 - **Governance enforcement (opt-in):** `--fail-on-confidence HIGH` or `--fail-on-cost 500` — enforce waste thresholds in CI/CD on a schedule, owned by platform or FinOps teams
-- **43 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate
+- **45 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate
 - **Multi-account scanning (AWS):** scan entire AWS Organizations in one run — config file, inline IDs, or auto-discovery via `--org`
 - **Multi-subscription scanning (Azure):** scan all Azure subscriptions in parallel — auto-discovery via Management Group, per-subscription cost breakdown included
 - **Multi-project scanning (GCP):** scan all accessible GCP projects in parallel — auto-discovery via Application Default Credentials, per-project cost breakdown included
@@ -283,13 +283,15 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 | Azure OpenAI Provisioned Deployment (PTU) | $1,460+ / PTU / month |
 | Vertex AI Online Prediction endpoint (GPU) | $449 – $23,000+ / month |
 | Vertex AI Workbench instance (GPU) | $449 – $8,000+ / month |
+| Cloud TPU node (v4/v5p) | $188 – $750+ / day |
+| Vertex AI Feature Store (Bigtable-backed) | $197 – $591+ / month |

 CleanCloud detects zero-invocation / zero-prediction endpoints and idle notebook instances across all three clouds and flags them HIGH risk. Native cost tools show the bill — they don't tell you *which endpoint* to delete.

 ```bash
 cleancloud scan --provider aws --category ai # Bedrock PTUs + SageMaker endpoints + notebooks + Studio apps + training jobs + idle GPU EC2
 cleancloud scan --provider azure --category ai # AML compute + ML instances + online endpoints + AI Search + OpenAI PTUs
-cleancloud scan --provider gcp --category ai # Vertex AI endpoints + Workbench + training jobs
+cleancloud scan --provider gcp --category ai # Vertex AI endpoints + Workbench + training jobs + Cloud TPU + Feature Stores
 cleancloud scan --provider aws --category all # hygiene + AI/ML together
 ```

@@ -527,7 +529,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 ## What CleanCloud Detects

-43 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
+45 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.

 **AWS:**
 - Compute: stopped instances 30+ days (EBS charges continue)
@@ -551,7 +553,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 - Storage: unattached Persistent Disks (HIGH), old snapshots 90+ days
 - Network: unused reserved static IPs — regional and global (HIGH)
 - Platform: idle Cloud SQL instances with zero connections 14+ days (HIGH)
-- AI/ML *(opt-in: `--category ai`)*: idle Vertex AI Online Prediction endpoints with zero or near-zero predictions 14+ days (dedicated nodes continue billing regardless of traffic) — GPU-backed endpoints flagged HIGH risk ($449–$23K+/month); idle Workbench instances (v1 + v2) with no control-plane activity 14+ days — GPU instances flagged HIGH/CRITICAL ($449–$8K+/month); long-running Vertex AI training jobs (CustomJobs + TrainingPipelines) beyond 24h threshold — GPU early warning at 75% of threshold, CRITICAL risk for GPU jobs at 3× threshold ($4–$80+/hr per GPU node)
+- AI/ML *(opt-in: `--category ai`)*: idle Vertex AI Online Prediction endpoints with zero or near-zero predictions 14+ days (dedicated nodes continue billing regardless of traffic) — GPU-backed endpoints flagged HIGH risk ($449–$23K+/month); idle Workbench instances (v1 + v2) with no control-plane activity 14+ days — GPU instances flagged HIGH/CRITICAL ($449–$8K+/month); long-running Vertex AI training jobs (CustomJobs + TrainingPipelines) beyond 24h threshold — GPU/TPU early warning at 90% of threshold, CRITICAL risk for GPU jobs at 3× threshold ($4–$80+/hr per GPU node); idle Cloud TPU nodes (v2–v6e) in READY state with near-zero duty_cycle for 7+ days — idle v4 costs $12.88/hr, v5p-8 costs $33.60/hr; idle Vertex AI Feature Store online stores with zero ReadFeatureValues requests for 30+ days — Bigtable-backed stores bill ~$197/node/month regardless of activity

 Rules without a confidence marker are MEDIUM — they use time-based heuristics or multiple signals. Start with `--fail-on-confidence HIGH` to catch obvious waste, then tighten as your team validates.
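
The Feature Store range added to the cost table lines up with the per-node figure quoted in the detection list (~$197/node/month): one node gives the lower bound, and three nodes give $591. The sketch below only illustrates that arithmetic. It assumes a flat per-node monthly rate, and the name `featurestore_monthly_cost` is hypothetical, not part of CleanCloud.

```python
# Hypothetical helper, not part of CleanCloud: flat per-node monthly estimate.
BIGTABLE_NODE_USD_PER_MONTH = 197.0  # per-node figure quoted in the README diff


def featurestore_monthly_cost(
    nodes: int, per_node_usd: float = BIGTABLE_NODE_USD_PER_MONTH
) -> float:
    """Estimate the monthly cost of a Bigtable-backed online store."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * per_node_usd


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} node(s): ${featurestore_monthly_cost(n):,.0f} / month")
```

Actual billing depends on region and node count, so the result should be read as an estimate of the kind the README describes rather than an invoice figure.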

cleancloud.yaml

Lines changed: 8 additions & 0 deletions
@@ -108,6 +108,14 @@ rules:
   #   params:
   #     idle_days: 14  # default: 7 (days of zero requests before flagging)

+  # gcp.tpu.idle:
+  #   params:
+  #     idle_days: 14  # default: 7 (days of low duty-cycle before flagging)
+
+  # gcp.vertex.featurestore.idle:
+  #   params:
+  #     idle_days: 60  # default: 30 (days of zero online serving requests before flagging)
+
 # ── Categories ───────────────────────────────────────────────────────────────
 # Override the default category without using the --category CLI flag.
 # Equivalent to: cleancloud scan --provider aws --category all
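
The commented entries above document a per-rule `idle_days` override with defaults of 7 and 30 days. As a rough illustration of how such an override could fall back to its default, here is a minimal sketch that works on an already-parsed config dict. `DEFAULT_IDLE_DAYS` and `resolve_idle_days` are hypothetical names; the commit does not show CleanCloud's actual config loader.

```python
from typing import Any, Mapping

# Defaults as stated in the cleancloud.yaml comments in this diff.
DEFAULT_IDLE_DAYS = {
    "gcp.tpu.idle": 7,
    "gcp.vertex.featurestore.idle": 30,
}


def resolve_idle_days(config: Mapping[str, Any], rule_id: str) -> int:
    """Return the configured idle_days for a rule, or its documented default."""
    # Tolerate missing or empty sections (YAML parses an empty key as None).
    rule_cfg = (config.get("rules") or {}).get(rule_id) or {}
    params = rule_cfg.get("params") or {}
    return int(params.get("idle_days", DEFAULT_IDLE_DAYS[rule_id]))


if __name__ == "__main__":
    cfg = {"rules": {"gcp.tpu.idle": {"params": {"idle_days": 14}}}}
    print(resolve_idle_days(cfg, "gcp.tpu.idle"))                  # override
    print(resolve_idle_days(cfg, "gcp.vertex.featurestore.idle"))  # default
```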

cleancloud/doctor/gcp.py

Lines changed: 92 additions & 0 deletions
@@ -798,6 +798,60 @@ def run_gcp_ai_doctor(project_id: Optional[str] = None) -> None:
     else:
         info("aiplatform.trainingPipelines.list — skipped (no project specified)")

+    # --- tpu.nodes.list ---
+    if probe_project_id:
+        try:
+            session = AuthorizedSession(credentials)
+            resp = session.get(
+                f"https://tpu.googleapis.com/v2" f"/projects/{probe_project_id}/locations/-/nodes",
+                params={"pageSize": 1},
+            )
+            if resp.status_code == 403:
+                permissions_failed.append("tpu.nodes.list")
+                warn("tpu.nodes.list — MISSING " "(rule: gcp.tpu.idle will be skipped)")
+            elif resp.status_code == 404:
+                info(
+                    "tpu.nodes.list — Cloud TPU API not enabled in this project "
+                    "(enable via: gcloud services enable tpu.googleapis.com)"
+                )
+            else:
+                success("tpu.nodes.list")
+        except Exception as e:
+            info(f"tpu.nodes.list — check skipped ({type(e).__name__})")
+    else:
+        info("tpu.nodes.list — skipped (no project specified)")
+
+    # --- aiplatform.featurestores.list / aiplatform.featureOnlineStores.list ---
+    for resource, perm in (
+        ("featurestores", "aiplatform.featurestores.list"),
+        ("featureOnlineStores", "aiplatform.featureOnlineStores.list"),
+    ):
+        if probe_project_id:
+            try:
+                session = AuthorizedSession(credentials)
+                resp = session.get(
+                    f"https://aiplatform.googleapis.com/v1"
+                    f"/projects/{probe_project_id}/locations/us-central1/{resource}",
+                    params={"pageSize": 1},
+                )
+                if resp.status_code == 403:
+                    permissions_failed.append(perm)
+                    warn(
+                        f"{perm} — MISSING "
+                        f"(rule: gcp.vertex.featurestore.idle may be partially skipped)"
+                    )
+                elif resp.status_code == 404:
+                    info(
+                        f"{perm} — Vertex AI API not enabled in this project "
+                        f"(enable via: gcloud services enable aiplatform.googleapis.com)"
+                    )
+                else:
+                    success(f"{perm}")
+            except Exception as e:
+                info(f"{perm} — check skipped ({type(e).__name__})")
+        else:
+            info(f"{perm} — skipped (no project specified)")
+
     # -------------------------------------------------------------------------
     # Rule coverage summary
     # -------------------------------------------------------------------------
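
Both probes added in this hunk branch on the HTTP status in the same way: 403 is treated as a missing permission, 404 as the API not being enabled, and anything else as success. A minimal sketch of that decision, factored into a pure function for illustration, follows. `ProbeResult` and `classify_probe` are hypothetical names and do not appear in the commit.

```python
from enum import Enum


class ProbeResult(Enum):
    """Outcome of a list-permission probe (hypothetical type for illustration)."""

    MISSING_PERMISSION = "missing_permission"
    API_NOT_ENABLED = "api_not_enabled"
    OK = "ok"


def classify_probe(status_code: int) -> ProbeResult:
    """Mirror the branching in the diff: 403, then 404, then everything else."""
    if status_code == 403:
        return ProbeResult.MISSING_PERMISSION
    if status_code == 404:
        return ProbeResult.API_NOT_ENABLED
    # As in the diff, any other status (including 5xx) falls through to success.
    return ProbeResult.OK


if __name__ == "__main__":
    for code in (200, 403, 404, 500):
        print(code, classify_probe(code).value)
```

One consequence of this branching, visible in the sketch, is that a server error such as 500 is reported as a passing check; whether that is intended is not stated in the commit.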
@@ -836,6 +890,40 @@ def run_gcp_ai_doctor(project_id: Optional[str] = None) -> None:
             f"{', '.join(training_perms_missing)})"
         )

+    if "tpu.nodes.list" not in permissions_failed:
+        tpu_metric_note = (
+            "(partial: monitoring missing — age-based fallback active)"
+            if "monitoring.timeSeries.list" in permissions_failed
+            else "(enabled)"
+        )
+        if "monitoring.timeSeries.list" in permissions_failed:
+            warn(f" ~ gcp.tpu.idle {tpu_metric_note}")
+        else:
+            success(f" ✓ gcp.tpu.idle {tpu_metric_note}")
+    else:
+        warn(" ✗ gcp.tpu.idle (disabled: missing tpu.nodes.list)")
+
+    featurestore_perms_missing = [
+        p
+        for p in ("aiplatform.featurestores.list", "aiplatform.featureOnlineStores.list")
+        if p in permissions_failed
+    ]
+    if not featurestore_perms_missing:
+        fs_note = (
+            "(partial: monitoring missing — age-based fallback active)"
+            if "monitoring.timeSeries.list" in permissions_failed
+            else "(enabled)"
+        )
+        if "monitoring.timeSeries.list" in permissions_failed:
+            warn(f" ~ gcp.vertex.featurestore.idle {fs_note}")
+        else:
+            success(f" ✓ gcp.vertex.featurestore.idle {fs_note}")
+    else:
+        warn(
+            f" ✗ gcp.vertex.featurestore.idle (partial: missing "
+            f"{', '.join(featurestore_perms_missing)})"
+        )
+
     # -------------------------------------------------------------------------
     # Remediation guidance
     # -------------------------------------------------------------------------
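
The coverage summary in this hunk reduces to three states per rule: a required list permission is missing, only the monitoring permission is missing (the age-based fallback applies), or everything is available. A minimal sketch of that decision as a standalone function is shown here. `coverage_status` and the state labels are hypothetical; the commit inlines this logic with `warn` and `success` calls.

```python
from typing import Iterable, List, Tuple

MONITORING_PERM = "monitoring.timeSeries.list"  # permission name used in the diff


def coverage_status(
    required_perms: Iterable[str], permissions_failed: Iterable[str]
) -> Tuple[str, List[str]]:
    """Return a (state, missing_permissions) pair for one rule."""
    failed = set(permissions_failed)
    missing = [p for p in required_perms if p in failed]
    if missing:
        # A list permission the rule needs is missing.
        return "unavailable", missing
    if MONITORING_PERM in failed:
        # Resources can be listed, but metrics cannot be read.
        return "partial", [MONITORING_PERM]
    return "enabled", []


if __name__ == "__main__":
    print(coverage_status(["tpu.nodes.list"], []))
    print(coverage_status(["tpu.nodes.list"], [MONITORING_PERM]))
    print(coverage_status(["tpu.nodes.list"], ["tpu.nodes.list"]))
```

Returning the state alongside the missing permissions keeps message formatting separate from the decision, which would also avoid evaluating the monitoring condition twice as the inlined version does.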
@@ -852,11 +940,15 @@ def run_gcp_ai_doctor(project_id: Optional[str] = None) -> None:
             "aiplatform.endpoints.list",
             "aiplatform.customJobs.list",
             "aiplatform.trainingPipelines.list",
+            "aiplatform.featurestores.list",
+            "aiplatform.featureOnlineStores.list",
         )
     ):
         roles_needed.append("roles/aiplatform.viewer")
     if "notebooks.instances.list" in permissions_failed:
         roles_needed.append("roles/notebooks.viewer")
+    if "tpu.nodes.list" in permissions_failed:
+        roles_needed.append("roles/tpu.viewer")
     for role in roles_needed:
         info(f" gcloud projects add-iam-policy-binding {proj_hint} \\")
         info(f' --member="serviceAccount:{sa_hint}" \\')
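
The remediation block maps each failed permission to the viewer role that grants it. A table-driven sketch of the same mapping is given below, using only the permissions and roles visible in this hunk. `ROLE_FOR_PERMISSION` and `roles_for` are hypothetical names, and unlike the diff, which appends roles in a fixed order, this version orders them by first appearance in the input.

```python
from typing import Iterable, List

# Permission-to-role pairs visible in this hunk; the real tuple may list more.
ROLE_FOR_PERMISSION = {
    "aiplatform.endpoints.list": "roles/aiplatform.viewer",
    "aiplatform.customJobs.list": "roles/aiplatform.viewer",
    "aiplatform.trainingPipelines.list": "roles/aiplatform.viewer",
    "aiplatform.featurestores.list": "roles/aiplatform.viewer",
    "aiplatform.featureOnlineStores.list": "roles/aiplatform.viewer",
    "notebooks.instances.list": "roles/notebooks.viewer",
    "tpu.nodes.list": "roles/tpu.viewer",
}


def roles_for(permissions_failed: Iterable[str]) -> List[str]:
    """Return the de-duplicated roles that would grant the failed permissions."""
    roles: List[str] = []
    for perm in permissions_failed:
        role = ROLE_FOR_PERMISSION.get(perm)
        if role is not None and role not in roles:
            roles.append(role)
    return roles


if __name__ == "__main__":
    failed = ["tpu.nodes.list", "aiplatform.featurestores.list"]
    for role in roles_for(failed):
        print(role)
```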
