Commit c47032a

AI Rule: aws.sagemaker.studio_app.idle (#149)
1 parent 08a608b commit c47032a

15 files changed

Lines changed: 1273 additions & 38 deletions

README.fr.md

Lines changed: 5 additions & 4 deletions
@@ -194,7 +194,7 @@ Pas encore de compte cloud ? `cleancloud demo` affiche un exemple de sortie sans
 - **Détection du gaspillage IA/ML sur les 3 clouds :** endpoints SageMaker et notebooks, clusters AML Compute et instances ML, endpoints Vertex AI et instances Workbench — facturés 500–23 000 $/mois par ressource en silence. Ressources GPU flaggées risque HIGH. Les outils natifs montrent la facture — CleanCloud indique quoi supprimer. Opt-in via `--category ai`
 - **Gouvernance policy-as-code :** `cleancloud.yaml` pour la configuration par règle, les exceptions avec dates d'expiration, les seuils de coût et de confiance, les exclusions par tag — versionné aux côtés de votre infrastructure. Chaque exception est une approbation auditée dans git.
 - **Application de politique (opt-in) :** `--fail-on-confidence HIGH` ou `--fail-on-cost 500` — appliquer des seuils de gaspillage en CI/CD sur un planning, géré par les équipes platform ou FinOps
-- **39 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
+- **40 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
 - **Scan multi-comptes (AWS) :** scannez des AWS Organizations entières en une exécution — fichier de config, IDs inline, ou auto-découverte via `--org`
 - **Scan multi-abonnements (Azure) :** scannez tous les abonnements Azure en parallèle — auto-découverte via Management Group, détail des coûts par abonnement inclus
 - **Scan multi-projets (GCP) :** scannez tous les projets GCP accessibles en parallèle — auto-découverte via Application Default Credentials, détail des coûts par projet inclus
@@ -274,6 +274,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 | Bedrock Provisioned Throughput | 600 – 7 300+ $ / MU / mois |
 | Endpoint SageMaker (GPU) | 500 – 23 000 $ / mois |
 | Instance Notebook SageMaker (GPU) | 500 – 23 000+ $ / mois |
+| Studio Apps SageMaker (KernelGateway/JupyterLab/CodeEditor) | 42 – 1 600+ $ / mois |
 | Cluster AML Compute Azure (GPU) | 600 – 15 000 $ / mois |
 | Instance de calcul Azure ML (GPU) | 600 – 15 000+ $ / mois |
 | Déploiement Azure OpenAI Provisionné (PTU) | 1 460+ $ / PTU / mois |
@@ -283,7 +284,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 CleanCloud détecte les endpoints à zéro invocation / zéro prédiction et les instances de notebook inactives sur les 3 clouds et les signale risque HIGH. Les outils natifs montrent la facture — ils ne vous disent pas *quel endpoint* supprimer.

 ```bash
-cleancloud scan --provider aws --category ai # PTUs Bedrock + endpoints + notebooks SageMaker + EC2 GPU
+cleancloud scan --provider aws --category ai # PTUs Bedrock + endpoints + notebooks + Studio apps SageMaker + EC2 GPU
 cleancloud scan --provider azure --category ai # clusters AML + instances ML + PTUs OpenAI
 cleancloud scan --provider gcp --category ai # endpoints Vertex AI + Workbench
 cleancloud scan --provider aws --category all # hygiène + IA/ML ensemble
@@ -523,7 +524,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud

 ## Ce que CleanCloud détecte

-35 règles pour AWS, Azure et GCP — conservatives, haut signal, conçues pour éviter les faux positifs en environnements IaC.
+40 règles pour AWS, Azure et GCP — conservatives, haut signal, conçues pour éviter les faux positifs en environnements IaC.

 **AWS :**
 - Compute : instances arrêtées 30+ jours (charges EBS continuent)
@@ -532,7 +533,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 - Plateforme : instances RDS inactives (HIGH)
 - Observabilité : logs CloudWatch à rétention infinie
 - Gouvernance : ressources sans tags, security groups inutilisés
-- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocations depuis 7+ jours — facturés 600–7 300+$/MU/mois quel que soit le trafic ; endpoints SageMaker inactifs avec zéro invocations depuis 14+ jours — endpoints GPU flaggés risque HIGH ($500–$23K/mois) ; instances Notebook SageMaker sans activité depuis 14+ jours — notebooks GPU flaggés risque HIGH ($500–$23K+/mois)
+- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocations depuis 7+ jours — facturés 600–7 300+$/MU/mois quel que soit le trafic ; endpoints SageMaker inactifs avec zéro invocations depuis 14+ jours — endpoints GPU flaggés risque HIGH ($500–$23K/mois) ; instances Notebook SageMaker sans activité depuis 14+ jours — notebooks GPU flaggés risque HIGH ($500–$23K+/mois) ; Studio Apps SageMaker (KernelGateway/JupyterLab/CodeEditor) sans activité utilisateur depuis 7+ jours — apps GPU flaggées risque HIGH ($42–$1 600+/mois)

 **Azure :**
 - Compute : VMs arrêtées (non désallouées) (HIGH)

README.md

Lines changed: 5 additions & 4 deletions
@@ -194,7 +194,7 @@ No cloud account yet? `cleancloud demo` shows sample output without any credenti
 - **AI/ML waste detection across all 3 clouds:** idle SageMaker endpoints and notebooks, AML compute clusters and instances, Vertex AI endpoints and Workbench instances — silently billing $500–$23K/month per resource. GPU-backed resources flagged HIGH risk. Native cost tools don't surface these — CleanCloud does. Opt-in via `--category ai`
 - **Policy-as-code governance:** `cleancloud.yaml` for per-rule config, exceptions with expiry dates, cost and confidence thresholds, tag-based exclusions — version-controlled alongside your infrastructure. Every exception is a git-reviewable approval.
 - **Governance enforcement (opt-in):** `--fail-on-confidence HIGH` or `--fail-on-cost 500` — enforce waste thresholds in CI/CD on a schedule, owned by platform or FinOps teams
-- **39 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate
+- **40 curated, high-signal detection rules:** orphaned volumes, idle databases, stopped instances, unused registries, and more — designed to avoid false positives in IaC environments, each with a deterministic cost estimate
 - **Multi-account scanning (AWS):** scan entire AWS Organizations in one run — config file, inline IDs, or auto-discovery via `--org`
 - **Multi-subscription scanning (Azure):** scan all Azure subscriptions in parallel — auto-discovery via Management Group, per-subscription cost breakdown included
 - **Multi-project scanning (GCP):** scan all accessible GCP projects in parallel — auto-discovery via Application Default Credentials, per-project cost breakdown included
@@ -274,6 +274,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 | Bedrock Provisioned Throughput | $600 – $7,300+ / MU / month |
 | SageMaker endpoint (GPU) | $500 – $23,000 / month |
 | SageMaker Notebook Instance (GPU) | $500 – $23,000+ / month |
+| SageMaker Studio Apps (KernelGateway/JupyterLab/CodeEditor) | $42 – $1,600+ / month |
 | Azure AML compute cluster (GPU) | $600 – $15,000 / month |
 | Azure ML Compute Instance (GPU) | $600 – $15,000+ / month |
 | Azure OpenAI Provisioned Deployment (PTU) | $1,460+ / PTU / month |
@@ -283,7 +284,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 CleanCloud detects zero-invocation / zero-prediction endpoints and idle notebook instances across all three clouds and flags them HIGH risk. Native cost tools show the bill — they don't tell you *which endpoint* to delete.

 ```bash
-cleancloud scan --provider aws --category ai # Bedrock PTUs + SageMaker endpoints + notebooks + idle GPU EC2
+cleancloud scan --provider aws --category ai # Bedrock PTUs + SageMaker endpoints + notebooks + Studio apps + idle GPU EC2
 cleancloud scan --provider azure --category ai # AML compute clusters + ML instances + OpenAI PTUs
 cleancloud scan --provider gcp --category ai # Vertex AI endpoints + Workbench
 cleancloud scan --provider aws --category all # hygiene + AI/ML together
@@ -523,7 +524,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints

 ## What CleanCloud Detects

-35 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
+40 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.

 **AWS:**
 - Compute: stopped instances 30+ days (EBS charges continue)
@@ -532,7 +533,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 - Platform: idle RDS instances (HIGH)
 - Observability: infinite retention CloudWatch Logs
 - Governance: untagged resources, unused security groups
-- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days — bills $600–$7,300+/MU/month regardless of traffic; idle SageMaker endpoints with zero invocations 14+ days — GPU-backed endpoints flagged HIGH risk ($500–$23K/month); idle Notebook Instances with no activity 14+ days — GPU-backed notebooks flagged HIGH risk ($500–$23K+/month)
+- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days — bills $600–$7,300+/MU/month regardless of traffic; idle SageMaker endpoints with zero invocations 14+ days — GPU-backed endpoints flagged HIGH risk ($500–$23K/month); idle Notebook Instances with no activity 14+ days — GPU-backed notebooks flagged HIGH risk ($500–$23K+/month); idle Studio Apps (KernelGateway/JupyterLab/CodeEditor) with no user activity 7+ days — GPU-backed apps flagged HIGH risk ($42–$1,600+/month)

 **Azure:**
 - Compute: stopped (not deallocated) VMs (HIGH)

cleancloud.yaml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -92,6 +92,10 @@ rules:
   #   params:
   #     idle_days: 14  # default: 7 (days of zero invocations before flagging)

+  # aws.sagemaker.studio_app.idle:
+  #   params:
+  #     idle_days: 14  # default: 7 (days of no user activity before flagging)
+
   # azure.openai.provisioned_deployment.idle:
   #   params:
   #     idle_days: 14  # default: 7 (days of zero requests before flagging)
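
Review note: the commented entry documents a per-rule override of `idle_days` with a built-in default of 7. A minimal sketch of how such a lookup could behave is below. `resolve_idle_days` and the already-parsed dict layout are hypothetical and are not taken from CleanCloud's loader; only the rule ID, the `params.idle_days` key, and the default of 7 come from this diff.

```python
from typing import Any, Mapping

# Default stated in the cleancloud.yaml comment for this rule.
DEFAULT_IDLE_DAYS = 7


def resolve_idle_days(
    config: Mapping[str, Any],
    rule_id: str,
    default: int = DEFAULT_IDLE_DAYS,
) -> int:
    """Return the idle_days override for rule_id, or the default if unset.

    Hypothetical helper: assumes the YAML has already been parsed into
    nested dicts shaped like rules -> <rule id> -> params -> idle_days.
    """
    rules = config.get("rules") or {}
    params = (rules.get(rule_id) or {}).get("params") or {}
    value = params.get("idle_days", default)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        raise ValueError(f"{rule_id}: idle_days must be a positive integer, got {value!r}")
    return value


RULE_ID = "aws.sagemaker.studio_app.idle"
overridden = {"rules": {RULE_ID: {"params": {"idle_days": 14}}}}

print(resolve_idle_days(overridden, RULE_ID))     # override applies
print(resolve_idle_days({"rules": {}}, RULE_ID))  # falls back to the default
```

With the block left commented out, as in this commit, a lookup like this would fall through to the default of 7 days.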

cleancloud/doctor/aws.py

Lines changed: 49 additions & 0 deletions
@@ -688,6 +688,55 @@ def run_aws_ai_doctor(profile: Optional[str], region: Optional[str] = None) -> N
         permissions_failed.append(("sagemaker:DescribeNotebookInstance", str(e)))
         warn(f"sagemaker:DescribeNotebookInstance - {e}")

+    # --- sagemaker:ListApps (aws.sagemaker.studio_app.idle) ---
+    try:
+        sagemaker.list_apps(MaxResults=1)
+        permissions_tested.append("sagemaker:ListApps")
+        success("sagemaker:ListApps")
+    except Exception as e:
+        permissions_failed.append(("sagemaker:ListApps", str(e)))
+        warn(f"sagemaker:ListApps - {e}")
+
+    try:
+        # Paginate through all apps to find the first qualifying one — checking only
+        # the first page can leave describe_app untested in accounts with many apps.
+        _target_app = None
+        _paginator = sagemaker.get_paginator("list_apps")
+        for _page in _paginator.paginate(PaginationConfig={"PageSize": 20}):
+            for _a in _page.get("Apps", []):
+                if _a.get("Status") == "InService" and _a.get("AppType") in (
+                    "KernelGateway",
+                    "JupyterLab",
+                    "CodeEditor",
+                ):
+                    _target_app = _a
+                    break
+            if _target_app:
+                break
+
+        if _target_app:
+            describe_kwargs = {
+                "DomainId": _target_app["DomainId"],
+                "AppType": _target_app["AppType"],
+                "AppName": _target_app["AppName"],
+            }
+            if _target_app.get("SpaceName"):
+                describe_kwargs["SpaceName"] = _target_app["SpaceName"]
+            elif _target_app.get("UserProfileName"):
+                describe_kwargs["UserProfileName"] = _target_app["UserProfileName"]
+            sagemaker.describe_app(**describe_kwargs)
+            permissions_tested.append("sagemaker:DescribeApp")
+            success("sagemaker:DescribeApp")
+        else:
+            # No qualifying InService app exists in this account/region — describe_app
+            # cannot be exercised, so the permission remains untested.
+            info(
+                "sagemaker:DescribeApp - not tested (no InService KernelGateway/JupyterLab/CodeEditor app found to probe)"
+            )
+    except Exception as e:
+        permissions_failed.append(("sagemaker:DescribeApp", str(e)))
+        warn(f"sagemaker:DescribeApp - {e}")
+
     try:
         cloudwatch = session.client("cloudwatch", region_name=region)
         now = datetime.now(timezone.utc)
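
Review note: the probe-selection logic in this hunk can be exercised without AWS credentials if it is pulled into pure functions. The sketch below mirrors the diff's filter (status `InService`, app type in KernelGateway/JupyterLab/CodeEditor) and its `SpaceName`-before-`UserProfileName` precedence. The helper names `first_qualifying_app` and `build_describe_app_kwargs` and the sample pages are hypothetical and are not part of the commit.

```python
from typing import Any, Dict, Iterable, Mapping, Optional

# App types the doctor check treats as probe candidates (from the diff).
PROBE_APP_TYPES = ("KernelGateway", "JupyterLab", "CodeEditor")


def first_qualifying_app(
    pages: Iterable[Mapping[str, Any]],
) -> Optional[Mapping[str, Any]]:
    """Return the first InService app of a probe type across all pages."""
    for page in pages:
        for app in page.get("Apps", []):
            if app.get("Status") == "InService" and app.get("AppType") in PROBE_APP_TYPES:
                return app
    return None


def build_describe_app_kwargs(app: Mapping[str, Any]) -> Dict[str, str]:
    """Build describe_app arguments; SpaceName wins over UserProfileName."""
    kwargs = {
        "DomainId": app["DomainId"],
        "AppType": app["AppType"],
        "AppName": app["AppName"],
    }
    if app.get("SpaceName"):
        kwargs["SpaceName"] = app["SpaceName"]
    elif app.get("UserProfileName"):
        kwargs["UserProfileName"] = app["UserProfileName"]
    return kwargs


# Made-up list_apps pages: the only qualifying app sits on page two.
sample_pages = [
    {"Apps": [{"DomainId": "d-1", "AppType": "JupyterServer", "AppName": "default",
               "Status": "InService", "UserProfileName": "alice"}]},
    {"Apps": [{"DomainId": "d-1", "AppType": "JupyterLab", "AppName": "default",
               "Status": "InService", "SpaceName": "team-space"}]},
]

target = first_qualifying_app(sample_pages)
print(build_describe_app_kwargs(target) if target else "no qualifying app")
```

In the doctor check itself, the pages come from `sagemaker.get_paginator("list_apps").paginate(...)`, and the resulting kwargs are passed to `sagemaker.describe_app(**kwargs)`, as shown in the hunk.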
