Commit a53922d

AWS AI rule: aws.sagemaker.processing_job.long_running (#177)
1 parent 5dc6f28 commit a53922d

15 files changed

Lines changed: 1567 additions & 29 deletions

README.fr.md

Lines changed: 5 additions & 4 deletions
@@ -100,7 +100,7 @@ Gaspillage minimum estimé : ~$25 944/mois
 - Détecte le gaspillage IA/ML coûteux : SageMaker, AML, Vertex AI — ressources GPU signalées comme candidats à risque plus élevé (500–23 000 $/mois)
 - Fonctionne sur AWS, Azure et GCP en un seul outil
 - S'exécute entièrement dans votre environnement — aucun agent, pas de SaaS, aucun credential stocké
-- 47 règles de détection sélectives et haut signal, conçues pour éviter les faux positifs en environnements IaC
+- 48 règles de détection sélectives et haut signal, conçues pour éviter les faux positifs en environnements IaC
 - Prêt pour CI/CD — codes de sortie d'application + sorties JSON/CSV/markdown

 ### Ce que CleanCloud ne fait PAS
@@ -153,6 +153,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 | Studio Apps SageMaker (KernelGateway/JupyterLab/CodeEditor) | 42 – 1 600+ $ / mois |
 | Domaine SageMaker (stockage EFS inactif) | Charges EFS continues |
 | Training Job SageMaker (job GPU runaway/bloqué) | 670 – 2 360+ $ / jour |
+| Processing Job SageMaker (bloqué/stuck) | 670 – 2 360+ $ / jour |
 | Cluster AML Compute Azure (GPU) | 600 – 15 000 $ / mois |
 | Instance de calcul Azure ML (GPU) | 600 – 15 000+ $ / mois |
 | Endpoint en ligne Azure ML (GPU) | 200 – 2 600+ $ / mois |
@@ -166,7 +167,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 CleanCloud détecte les endpoints à zéro invocation / zéro prédiction, l'activité de contrôle inactive sur les notebooks et apps managés, ainsi que les training jobs managés anormalement longs sur les 3 clouds. Les outils natifs montrent la facture — ils ne nomment pas la ressource concrète à examiner.

 ```bash
-cleancloud scan --provider aws --category ai # PTUs Bedrock + endpoints + notebooks + domaines + Studio apps SageMaker + training jobs SageMaker + EC2 GPU
+cleancloud scan --provider aws --category ai # PTUs Bedrock + endpoints + notebooks + domaines + Studio apps SageMaker + training jobs + processing jobs SageMaker + EC2 GPU
 cleancloud scan --provider azure --category ai # clusters AML + instances ML + endpoints en ligne + AI Search + PTUs OpenAI
 cleancloud scan --provider gcp --category ai # endpoints Vertex AI + Workbench + training jobs + Cloud TPU + Feature Stores
 cleancloud scan --provider aws --category all # hygiène + IA/ML ensemble
@@ -433,7 +434,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud

 ## Ce que CleanCloud détecte

-47 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC.
+48 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC.

 **AWS :**
 - Compute : instances arrêtées 30+ jours (charges EBS continuent)
@@ -442,7 +443,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 - Plateforme : instances RDS inactives (HIGH)
 - Observabilité : logs CloudWatch à rétention infinie
 - Gouvernance : ressources sans tags, security groups inutilisés
-- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocation depuis 7+ jours ; endpoints SageMaker sans trafic `InvokeEndpoint` observé depuis 14+ jours ; instances Notebook SageMaker avec timestamps de contrôle inactifs depuis 14+ jours ; Domaines SageMaker sans apps en cours d'exécution sur tous les profils et espaces depuis 30+ jours (coût de stockage EFS continu) ; Studio Apps SageMaker (`KernelGateway`/`JupyterLab`/`CodeEditor`) sans signal d'activité récent exploitable depuis 7+ jours ; training jobs SageMaker toujours `InProgress` au-delà du seuil de 24h
+- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocation depuis 7+ jours ; endpoints SageMaker sans trafic `InvokeEndpoint` observé depuis 14+ jours ; instances Notebook SageMaker avec timestamps de contrôle inactifs depuis 14+ jours ; Domaines SageMaker sans apps en cours d'exécution sur tous les profils et espaces depuis 30+ jours (coût de stockage EFS continu) ; Studio Apps SageMaker (`KernelGateway`/`JupyterLab`/`CodeEditor`) sans signal d'activité récent exploitable depuis 7+ jours ; training jobs SageMaker toujours `InProgress` au-delà du seuil de 24h ; processing jobs SageMaker toujours `InProgress` au-delà du seuil de 24h

 **Azure :**
 - Compute : VMs arrêtées (non désallouées) (HIGH)

README.md

Lines changed: 5 additions & 4 deletions
@@ -100,7 +100,7 @@ Minimum estimated waste: ~$25,944/month
 - Catches expensive idle AI/ML waste: SageMaker, AML, Vertex AI — GPU-backed resources flagged as higher-risk review candidates ($500–$23K/month)
 - Works across AWS, Azure, and GCP in one tool
 - Runs entirely in your environment — no agents, no SaaS, no credentials stored
-- 46 curated, high-signal detection rules designed to avoid false positives in IaC environments
+- 48 curated, high-signal detection rules designed to avoid false positives in IaC environments
 - CI/CD-ready — enforcement exit codes + JSON/CSV/markdown output

 ### What CleanCloud does NOT do
@@ -153,6 +153,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 | SageMaker Studio Apps (KernelGateway/JupyterLab/CodeEditor) | $42 – $1,600+ / month |
 | SageMaker Domain (idle EFS storage) | Continuous EFS charges |
 | SageMaker Training Job (runaway/hung GPU job) | $670 – $2,360+ / day |
+| SageMaker Processing Job (hung/stuck) | $670 – $2,360+ / day |
 | Azure AML compute cluster (GPU) | $600 – $15,000 / month |
 | Azure ML Compute Instance (GPU) | $600 – $15,000+ / month |
 | Azure ML Online Endpoint (GPU-backed) | $200 – $2,600+ / month |
@@ -166,7 +167,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 CleanCloud detects zero-invocation / zero-prediction endpoints, stale managed notebook and app activity, and long-running managed training jobs across all three clouds. Native cost tools show the bill — they do not name the specific resource to review.

 ```bash
-cleancloud scan --provider aws --category ai # Bedrock PTUs + SageMaker endpoints + notebooks + domains + Studio apps + training jobs + idle GPU EC2
+cleancloud scan --provider aws --category ai # Bedrock PTUs + SageMaker endpoints + notebooks + domains + Studio apps + training jobs + processing jobs + idle GPU EC2
 cleancloud scan --provider azure --category ai # AML compute + ML instances + online endpoints + AI Search + OpenAI PTUs
 cleancloud scan --provider gcp --category ai # Vertex AI endpoints + Workbench + training jobs + Cloud TPU + Feature Stores
 cleancloud scan --provider aws --category all # hygiene + AI/ML together
@@ -433,7 +434,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints

 ## What CleanCloud Detects

-47 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
+48 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.

 **AWS:**
 - Compute: stopped instances 30+ days (EBS charges continue)
@@ -442,7 +443,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 - Platform: idle RDS instances (HIGH)
 - Observability: infinite retention CloudWatch Logs
 - Governance: untagged resources, unused security groups
-- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days; idle SageMaker endpoints with no observed `InvokeEndpoint` traffic 14+ days; SageMaker Notebook Instances with stale control-plane timestamps 14+ days; SageMaker Domains with no running apps across all user profiles and spaces 30+ days (continuous EFS storage cost); SageMaker Studio apps (`KernelGateway`/`JupyterLab`/`CodeEditor`) with no usable recent activity signal 7+ days; SageMaker training jobs still `InProgress` beyond the 24h threshold
+- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days; idle SageMaker endpoints with no observed `InvokeEndpoint` traffic 14+ days; SageMaker Notebook Instances with stale control-plane timestamps 14+ days; SageMaker Domains with no running apps across all user profiles and spaces 30+ days (continuous EFS storage cost); SageMaker Studio apps (`KernelGateway`/`JupyterLab`/`CodeEditor`) with no usable recent activity signal 7+ days; SageMaker training jobs still `InProgress` beyond the 24h threshold; SageMaker processing jobs still `InProgress` beyond the 24h threshold

 **Azure:**
 - Compute: stopped (not deallocated) VMs (HIGH)

cleancloud/doctor/aws.py

Lines changed: 29 additions & 0 deletions
@@ -856,6 +856,35 @@ def run_aws_ai_doctor(profile: Optional[str], region: Optional[str] = None) -> N
         permissions_failed.append(("sagemaker:DescribeTrainingJob", str(e)))
         warn(f"sagemaker:DescribeTrainingJob - {e}")

+    # --- sagemaker:ListProcessingJobs + sagemaker:DescribeProcessingJob (aws.sagemaker.processing_job.long_running) ---
+    try:
+        sagemaker.list_processing_jobs(MaxResults=1)
+        permissions_tested.append("sagemaker:ListProcessingJobs")
+        success("sagemaker:ListProcessingJobs")
+    except Exception as e:
+        permissions_failed.append(("sagemaker:ListProcessingJobs", str(e)))
+        warn(f"sagemaker:ListProcessingJobs - {e}")
+
+    try:
+        _pj_paginator = sagemaker.get_paginator("list_processing_jobs")
+        _target_job = None
+        for _pj_page in _pj_paginator.paginate(PaginationConfig={"PageSize": 20}):
+            for _pj in _pj_page.get("ProcessingJobSummaries", []):
+                _target_job = _pj
+                break
+            if _target_job:
+                break
+
+        if _target_job:
+            sagemaker.describe_processing_job(ProcessingJobName=_target_job["ProcessingJobName"])
+            permissions_tested.append("sagemaker:DescribeProcessingJob")
+            success("sagemaker:DescribeProcessingJob")
+        else:
+            info("sagemaker:DescribeProcessingJob - not tested (no processing job found to probe)")
+    except Exception as e:
+        permissions_failed.append(("sagemaker:DescribeProcessingJob", str(e)))
+        warn(f"sagemaker:DescribeProcessingJob - {e}")
+
     try:
         cloudwatch = session.client("cloudwatch", region_name=region)
         now = datetime.now(timezone.utc)
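
Reviewer note: the probe above walks paginator pages with a nested loop and two `break` statements to find the first processing job to describe. One possible alternative, shown only as a sketch, expresses the same "first item across pages" search with `next()` over a generator. The helper name `first_processing_job` and the sample pages are hypothetical; in the doctor code the pages would come from the `list_processing_jobs` paginator.

```python
from typing import Iterable, Optional


def first_processing_job(pages: Iterable[dict]) -> Optional[dict]:
    """Return the first job summary found in any page, or None if there is none."""
    return next(
        (
            job
            for page in pages
            for job in page.get("ProcessingJobSummaries", [])
        ),
        None,
    )


# Hand-written stand-ins for paginator pages; the first page is empty.
_pages = [
    {"ProcessingJobSummaries": []},
    {
        "ProcessingJobSummaries": [
            {"ProcessingJobName": "job-a"},
            {"ProcessingJobName": "job-b"},
        ]
    },
]
first = first_processing_job(_pages)
none_found = first_processing_job([{}])
```

Because the generator is lazy, iteration stops at the first summary, so no further pages are requested once a job is found. This mirrors the early exit in the committed loop.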

cleancloud/providers/aws/errors.py

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+"""Shared AWS error classification helpers."""
+
+from botocore.exceptions import ClientError
+
+# Error codes that indicate a permission/authorization failure.
+# Covers the common SageMaker, EC2, IAM, and service-specific variants.
+_PERMISSION_ERROR_CODES = frozenset(
+    {
+        "AccessDenied",
+        "AccessDeniedException",
+        "UnauthorizedOperation",
+        "UnauthorizedException",
+        "Client.UnauthorizedOperation",
+    }
+)
+
+
+def is_permission_error(exc: ClientError) -> bool:
+    """Return True when a ClientError represents an authorization failure."""
+    code = exc.response.get("Error", {}).get("Code", "")
+    return code in _PERMISSION_ERROR_CODES
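
Reviewer note: a short sketch of how a caller might use `is_permission_error` to separate authorization failures from other API errors. To stay runnable without botocore installed, it copies the helper from the diff and uses a hypothetical `FakeClientError` class that only mimics the `.response` attribute of `botocore.exceptions.ClientError`. The `classify` function is also an illustration, not part of the commit.

```python
# Copied from the diff above so the example is self-contained.
_PERMISSION_ERROR_CODES = frozenset(
    {
        "AccessDenied",
        "AccessDeniedException",
        "UnauthorizedOperation",
        "UnauthorizedException",
        "Client.UnauthorizedOperation",
    }
)


class FakeClientError(Exception):
    """Hypothetical stand-in exposing the same .response shape as ClientError."""

    def __init__(self, code: str, operation: str) -> None:
        super().__init__(f"{operation}: {code}")
        self.response = {"Error": {"Code": code}}


def is_permission_error(exc) -> bool:
    """Return True when the error code represents an authorization failure."""
    code = exc.response.get("Error", {}).get("Code", "")
    return code in _PERMISSION_ERROR_CODES


def classify(exc) -> str:
    """Illustrative caller: label an error as 'permission' or 'other'."""
    return "permission" if is_permission_error(exc) else "other"


denied = classify(FakeClientError("AccessDeniedException", "DescribeProcessingJob"))
throttled = classify(FakeClientError("ThrottlingException", "ListProcessingJobs"))
```

With the real library, the same call would sit inside an `except ClientError as exc:` block, for example to report a missing IAM permission separately from throttling or validation errors.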
