9 changes: 5 additions & 4 deletions README.fr.md
@@ -100,7 +100,7 @@ Gaspillage minimum estimé : ~$25 944/mois
- Détecte le gaspillage IA/ML coûteux : SageMaker, AML, Vertex AI — ressources GPU signalées comme candidats à risque plus élevé (500–23 000 $/mois)
- Fonctionne sur AWS, Azure et GCP en un seul outil
- S'exécute entièrement dans votre environnement — aucun agent, pas de SaaS, aucun credential stocké
-- 47 règles de détection sélectives et haut signal, conçues pour éviter les faux positifs en environnements IaC
+- 48 règles de détection sélectives et haut signal, conçues pour éviter les faux positifs en environnements IaC
- Prêt pour CI/CD — codes de sortie d'application + sorties JSON/CSV/markdown

### Ce que CleanCloud ne fait PAS
@@ -153,6 +153,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
| Studio Apps SageMaker (KernelGateway/JupyterLab/CodeEditor) | 42 – 1 600+ $ / mois |
| Domaine SageMaker (stockage EFS inactif) | Charges EFS continues |
| Training Job SageMaker (job GPU runaway/bloqué) | 670 – 2 360+ $ / jour |
+| Processing Job SageMaker (bloqué/stuck) | 670 – 2 360+ $ / jour |
| Cluster AML Compute Azure (GPU) | 600 – 15 000 $ / mois |
| Instance de calcul Azure ML (GPU) | 600 – 15 000+ $ / mois |
| Endpoint en ligne Azure ML (GPU) | 200 – 2 600+ $ / mois |
@@ -166,7 +167,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
CleanCloud détecte les endpoints à zéro invocation / zéro prédiction, l'activité de contrôle inactive sur les notebooks et apps managés, ainsi que les training jobs managés anormalement longs sur les 3 clouds. Les outils natifs montrent la facture — ils ne nomment pas la ressource concrète à examiner.

```bash
-cleancloud scan --provider aws --category ai # PTUs Bedrock + endpoints + notebooks + domaines + Studio apps SageMaker + training jobs SageMaker + EC2 GPU
+cleancloud scan --provider aws --category ai # PTUs Bedrock + endpoints + notebooks + domaines + Studio apps SageMaker + training jobs + processing jobs SageMaker + EC2 GPU
cleancloud scan --provider azure --category ai # clusters AML + instances ML + endpoints en ligne + AI Search + PTUs OpenAI
cleancloud scan --provider gcp --category ai # endpoints Vertex AI + Workbench + training jobs + Cloud TPU + Feature Stores
cleancloud scan --provider aws --category all # hygiène + IA/ML ensemble
@@ -433,7 +434,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud

## Ce que CleanCloud détecte

-47 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC.
+48 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC.

**AWS :**
- Compute : instances arrêtées 30+ jours (charges EBS continuent)
@@ -442,7 +443,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
- Plateforme : instances RDS inactives (HIGH)
- Observabilité : logs CloudWatch à rétention infinie
- Gouvernance : ressources sans tags, security groups inutilisés
-- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocation depuis 7+ jours ; endpoints SageMaker sans trafic `InvokeEndpoint` observé depuis 14+ jours ; instances Notebook SageMaker avec timestamps de contrôle inactifs depuis 14+ jours ; Domaines SageMaker sans apps en cours d'exécution sur tous les profils et espaces depuis 30+ jours (coût de stockage EFS continu) ; Studio Apps SageMaker (`KernelGateway`/`JupyterLab`/`CodeEditor`) sans signal d'activité récent exploitable depuis 7+ jours ; training jobs SageMaker toujours `InProgress` au-delà du seuil de 24h
+- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocation depuis 7+ jours ; endpoints SageMaker sans trafic `InvokeEndpoint` observé depuis 14+ jours ; instances Notebook SageMaker avec timestamps de contrôle inactifs depuis 14+ jours ; Domaines SageMaker sans apps en cours d'exécution sur tous les profils et espaces depuis 30+ jours (coût de stockage EFS continu) ; Studio Apps SageMaker (`KernelGateway`/`JupyterLab`/`CodeEditor`) sans signal d'activité récent exploitable depuis 7+ jours ; training jobs SageMaker toujours `InProgress` au-delà du seuil de 24h ; processing jobs SageMaker toujours `InProgress` au-delà du seuil de 24h

**Azure :**
- Compute : VMs arrêtées (non désallouées) (HIGH)
9 changes: 5 additions & 4 deletions README.md
@@ -100,7 +100,7 @@ Minimum estimated waste: ~$25,944/month
- Catches expensive idle AI/ML waste: SageMaker, AML, Vertex AI — GPU-backed resources flagged as higher-risk review candidates ($500–$23K/month)
- Works across AWS, Azure, and GCP in one tool
- Runs entirely in your environment — no agents, no SaaS, no credentials stored
-- 46 curated, high-signal detection rules designed to avoid false positives in IaC environments
+- 48 curated, high-signal detection rules designed to avoid false positives in IaC environments
- CI/CD-ready — enforcement exit codes + JSON/CSV/markdown output

### What CleanCloud does NOT do
@@ -153,6 +153,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
| SageMaker Studio Apps (KernelGateway/JupyterLab/CodeEditor) | $42 – $1,600+ / month |
| SageMaker Domain (idle EFS storage) | Continuous EFS charges |
| SageMaker Training Job (runaway/hung GPU job) | $670 – $2,360+ / day |
+| SageMaker Processing Job (hung/stuck) | $670 – $2,360+ / day |
| Azure AML compute cluster (GPU) | $600 – $15,000 / month |
| Azure ML Compute Instance (GPU) | $600 – $15,000+ / month |
| Azure ML Online Endpoint (GPU-backed) | $200 – $2,600+ / month |
@@ -166,7 +167,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
CleanCloud detects zero-invocation / zero-prediction endpoints, stale managed notebook and app activity, and long-running managed training jobs across all three clouds. Native cost tools show the bill — they do not name the specific resource to review.

```bash
-cleancloud scan --provider aws --category ai # Bedrock PTUs + SageMaker endpoints + notebooks + domains + Studio apps + training jobs + idle GPU EC2
+cleancloud scan --provider aws --category ai # Bedrock PTUs + SageMaker endpoints + notebooks + domains + Studio apps + training jobs + processing jobs + idle GPU EC2
cleancloud scan --provider azure --category ai # AML compute + ML instances + online endpoints + AI Search + OpenAI PTUs
cleancloud scan --provider gcp --category ai # Vertex AI endpoints + Workbench + training jobs + Cloud TPU + Feature Stores
cleancloud scan --provider aws --category all # hygiene + AI/ML together
@@ -433,7 +434,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints

## What CleanCloud Detects

-47 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
+48 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.

**AWS:**
- Compute: stopped instances 30+ days (EBS charges continue)
@@ -442,7 +443,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
- Platform: idle RDS instances (HIGH)
- Observability: infinite retention CloudWatch Logs
- Governance: untagged resources, unused security groups
-- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days; idle SageMaker endpoints with no observed `InvokeEndpoint` traffic 14+ days; SageMaker Notebook Instances with stale control-plane timestamps 14+ days; SageMaker Domains with no running apps across all user profiles and spaces 30+ days (continuous EFS storage cost); SageMaker Studio apps (`KernelGateway`/`JupyterLab`/`CodeEditor`) with no usable recent activity signal 7+ days; SageMaker training jobs still `InProgress` beyond the 24h threshold
+- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days; idle SageMaker endpoints with no observed `InvokeEndpoint` traffic 14+ days; SageMaker Notebook Instances with stale control-plane timestamps 14+ days; SageMaker Domains with no running apps across all user profiles and spaces 30+ days (continuous EFS storage cost); SageMaker Studio apps (`KernelGateway`/`JupyterLab`/`CodeEditor`) with no usable recent activity signal 7+ days; SageMaker training jobs still `InProgress` beyond the 24h threshold; SageMaker processing jobs still `InProgress` beyond the 24h threshold

**Azure:**
- Compute: stopped (not deallocated) VMs (HIGH)
29 changes: 29 additions & 0 deletions cleancloud/doctor/aws.py
@@ -856,6 +856,35 @@ def run_aws_ai_doctor(profile: Optional[str], region: Optional[str] = None) -> N
         permissions_failed.append(("sagemaker:DescribeTrainingJob", str(e)))
         warn(f"sagemaker:DescribeTrainingJob - {e}")

+    # --- sagemaker:ListProcessingJobs + sagemaker:DescribeProcessingJob (aws.sagemaker.processing_job.long_running) ---
+    try:
+        sagemaker.list_processing_jobs(MaxResults=1)
+        permissions_tested.append("sagemaker:ListProcessingJobs")
+        success("sagemaker:ListProcessingJobs")
+    except Exception as e:
+        permissions_failed.append(("sagemaker:ListProcessingJobs", str(e)))
+        warn(f"sagemaker:ListProcessingJobs - {e}")
+
+    try:
+        _pj_paginator = sagemaker.get_paginator("list_processing_jobs")
+        _target_job = None
+        for _pj_page in _pj_paginator.paginate(PaginationConfig={"PageSize": 20}):
+            for _pj in _pj_page.get("ProcessingJobSummaries", []):
+                _target_job = _pj
+                break
+            if _target_job:
+                break
+
+        if _target_job:
+            sagemaker.describe_processing_job(ProcessingJobName=_target_job["ProcessingJobName"])
+            permissions_tested.append("sagemaker:DescribeProcessingJob")
+            success("sagemaker:DescribeProcessingJob")
+        else:
+            info("sagemaker:DescribeProcessingJob - not tested (no processing job found to probe)")
+    except Exception as e:
+        permissions_failed.append(("sagemaker:DescribeProcessingJob", str(e)))
+        warn(f"sagemaker:DescribeProcessingJob - {e}")
+
     try:
         cloudwatch = session.client("cloudwatch", region_name=region)
         now = datetime.now(timezone.utc)
21 changes: 21 additions & 0 deletions cleancloud/providers/aws/errors.py
@@ -0,0 +1,21 @@
+"""Shared AWS error classification helpers."""
+
+from botocore.exceptions import ClientError
+
+# Error codes that indicate a permission/authorization failure.
+# Covers the common SageMaker, EC2, IAM, and service-specific variants.
+_PERMISSION_ERROR_CODES = frozenset(
+    {
+        "AccessDenied",
+        "AccessDeniedException",
+        "UnauthorizedOperation",
+        "UnauthorizedException",
+        "Client.UnauthorizedOperation",
+    }
+)
+
+
+def is_permission_error(exc: ClientError) -> bool:
+    """Return True when a ClientError represents an authorization failure."""
+    code = exc.response.get("Error", {}).get("Code", "")
+    return code in _PERMISSION_ERROR_CODES
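
The diff does not show where `is_permission_error` is called. One plausible use is letting a rule distinguish "missing IAM permission" from other API failures. The sketch below repeats the helper so it runs without `botocore` installed and swaps in a stand-in exception, `StubClientError`, that carries only the `.response` dict the helper reads. Both `StubClientError` and `classify` are hypothetical names for this example.

```python
# Copied from the new errors.py so the sketch is self-contained.
_PERMISSION_ERROR_CODES = frozenset(
    {
        "AccessDenied",
        "AccessDeniedException",
        "UnauthorizedOperation",
        "UnauthorizedException",
        "Client.UnauthorizedOperation",
    }
)


class StubClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError (assumption for this sketch).

    Only the `.response` attribute is modelled, since that is all the helper uses.
    """

    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


def is_permission_error(exc):
    """Return True when the error code indicates an authorization failure."""
    code = exc.response.get("Error", {}).get("Code", "")
    return code in _PERMISSION_ERROR_CODES


def classify(exc):
    """Hypothetical caller: map an API error to a coarse outcome label."""
    if is_permission_error(exc):
        return "skipped (missing permission)"
    return "failed"


print(classify(StubClientError("AccessDeniedException")))  # skipped (missing permission)
print(classify(StubClientError("ThrottlingException")))    # failed
```

Because the lookup is an exact match against a frozen set, a code outside the set (throttling, validation errors, or an empty code) falls through to the generic failure path.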