cleancloud-io
diff --git a/README.fr.md b/README.fr.md
Lines changed: 6 additions & 5 deletions
diff --git a/README.md b/README.md
Lines changed: 5 additions & 4 deletions
diff --git a/cleancloud/doctor/aws.py b/cleancloud/doctor/aws.py
Lines changed: 33 additions & 0 deletions
@@ -194,7 +194,7 @@ Pas encore de compte cloud ? `cleancloud demo` affiche un exemple de sortie sans
 - **Détection du gaspillage IA/ML sur les 3 clouds :** endpoints SageMaker et notebooks, clusters AML Compute et instances ML, endpoints Vertex AI et instances Workbench — facturés 500–23 000 $/mois par ressource en silence. Ressources GPU flaggées risque HIGH. Les outils natifs montrent la facture — CleanCloud indique quoi supprimer. Opt-in via `--category ai`
 - **Gouvernance policy-as-code :** `cleancloud.yaml` pour la configuration par règle, les exceptions avec dates d'expiration, les seuils de coût et de confiance, les exclusions par tag — versionné aux côtés de votre infrastructure. Chaque exception est une approbation auditée dans git.
 - **Application de politique (opt-in) :** `--fail-on-confidence HIGH` ou `--fail-on-cost 500` — appliquer des seuils de gaspillage en CI/CD sur un planning, géré par les équipes platform ou FinOps
-- **40 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
+- **41 règles de détection sélectives et haut signal :** volumes orphelins, bases de données inactives, instances arrêtées, registres inutilisés, et plus — conçues pour éviter les faux positifs en environnements IaC, chacune avec une estimation de coût déterministe
 - **Scan multi-comptes (AWS) :** scannez des AWS Organizations entières en une exécution — fichier de config, IDs inline, ou auto-découverte via `--org`
 - **Scan multi-abonnements (Azure) :** scannez tous les abonnements Azure en parallèle — auto-découverte via Management Group, détail des coûts par abonnement inclus
 - **Scan multi-projets (GCP) :** scannez tous les projets GCP accessibles en parallèle — auto-découverte via Application Default Credentials, détail des coûts par projet inclus
@@ -275,6 +275,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 | Endpoint SageMaker (GPU) | 500 – 23 000 $ / mois |
 | Instance Notebook SageMaker (GPU) | 500 – 23 000+ $ / mois |
 | Studio Apps SageMaker (KernelGateway/JupyterLab/CodeEditor) | 42 – 1 600+ $ / mois |
+| Training Job SageMaker (job GPU runaway/bloqué) | 670 – 2 360+ $ / jour |
 | Cluster AML Compute Azure (GPU) | 600 – 15 000 $ / mois |
 | Instance de calcul Azure ML (GPU) | 600 – 15 000+ $ / mois |
 | Déploiement Azure OpenAI Provisionné (PTU) | 1 460+ $ / PTU / mois |
@@ -284,7 +285,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 CleanCloud détecte les endpoints à zéro invocation / zéro prédiction et les instances de notebook inactives sur les 3 clouds et les signale risque HIGH. Les outils natifs montrent la facture — ils ne vous disent pas *quel endpoint* supprimer.
 
 ```bash
-cleancloud scan --provider aws --category ai          # PTUs Bedrock + endpoints + notebooks + Studio apps SageMaker + EC2 GPU
+cleancloud scan --provider aws --category ai          # PTUs Bedrock + endpoints + notebooks + Studio apps + training jobs SageMaker + EC2 GPU
 cleancloud scan --provider azure --category ai        # clusters AML + instances ML + PTUs OpenAI
 cleancloud scan --provider gcp --category ai          # endpoints Vertex AI + Workbench
 cleancloud scan --provider aws --category all         # hygiène + IA/ML ensemble
@@ -524,7 +525,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 
 ## Ce que CleanCloud détecte
 
-40 règles pour AWS, Azure et GCP — conservatives, haut signal, conçues pour éviter les faux positifs en environnements IaC.
+41 règles pour AWS, Azure et GCP — conservatives, haut signal, conçues pour éviter les faux positifs en environnements IaC.
 
 **AWS :**
 - Compute : instances arrêtées 30+ jours (charges EBS continuent)
@@ -533,7 +534,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 - Plateforme : instances RDS inactives (HIGH)
 - Observabilité : logs CloudWatch à rétention infinie
 - Gouvernance : ressources sans tags, security groups inutilisés
-- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocations depuis 7+ jours — facturés 600–7 300+$/MU/mois quel que soit le trafic ; endpoints SageMaker inactifs avec zéro invocations depuis 14+ jours — endpoints GPU flaggés risque HIGH ($500–$23K/mois) ; instances Notebook SageMaker sans activité depuis 14+ jours — notebooks GPU flaggés risque HIGH ($500–$23K+/mois) ; Studio Apps SageMaker (KernelGateway/JupyterLab/CodeEditor) sans activité utilisateur depuis 7+ jours — apps GPU flaggées risque HIGH ($42–$1 600+/mois)
+- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocations depuis 7+ jours — facturés 600–7 300+$/MU/mois quel que soit le trafic ; endpoints SageMaker inactifs avec zéro invocations depuis 14+ jours — endpoints GPU flaggés risque HIGH ($500–$23K/mois) ; instances Notebook SageMaker sans activité depuis 14+ jours — notebooks GPU flaggés risque HIGH ($500–$23K+/mois) ; Studio Apps SageMaker (KernelGateway/JupyterLab/CodeEditor) sans activité utilisateur depuis 7+ jours — apps GPU flaggées risque HIGH ($42–$1 600+/mois) ; training jobs SageMaker dépassant 24h — alerte précoce GPU à 75% du seuil, risque CRITICAL pour les jobs GPU ayant dépassé leur condition d'arrêt (670–2 360+$/jour pour instances p3/p4d/p5)
 
 **Azure :**
 - Compute : VMs arrêtées (non désallouées) (HIGH)
@@ -558,7 +559,7 @@ Les règles sans marqueur de confiance sont MEDIUM — elles utilisent des heuri
 
 ## Feuille de route
 
-**Plus de règles IA/ML** — SageMaker Training Jobs (runaway/bloqués), artefacts d'entraînement orphelins dans S3
+**Plus de règles IA/ML** — artefacts d'entraînement orphelins dans S3
 
 **Plus de règles AWS** — lacunes de cycle de vie S3, Redshift inactif, fuite de coût NAT Gateway, VPC endpoints inutilisés
 
 
@@ -275,6 +275,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 | SageMaker endpoint (GPU) | $500 – $23,000 / month |
 | SageMaker Notebook Instance (GPU) | $500 – $23,000+ / month |
 | SageMaker Studio Apps (KernelGateway/JupyterLab/CodeEditor) | $42 – $1,600+ / month |
+| SageMaker Training Job (runaway/hung GPU job) | $670 – $2,360+ / day |
 | Azure AML compute cluster (GPU) | $600 – $15,000 / month |
 | Azure ML Compute Instance (GPU) | $600 – $15,000+ / month |
 | Azure OpenAI Provisioned Deployment (PTU) | $1,460+ / PTU / month |
@@ -284,7 +285,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 CleanCloud detects zero-invocation / zero-prediction endpoints and idle notebook instances across all three clouds and flags them HIGH risk. Native cost tools show the bill — they don't tell you *which endpoint* to delete.
 
 ```bash
-cleancloud scan --provider aws --category ai          # Bedrock PTUs + SageMaker endpoints + notebooks + Studio apps + idle GPU EC2
+cleancloud scan --provider aws --category ai          # Bedrock PTUs + SageMaker endpoints + notebooks + Studio apps + training jobs + idle GPU EC2
 cleancloud scan --provider azure --category ai        # AML compute clusters + ML instances + OpenAI PTUs
 cleancloud scan --provider gcp --category ai          # Vertex AI endpoints + Workbench
 cleancloud scan --provider aws --category all         # hygiene + AI/ML together
@@ -524,7 +525,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 
 ## What CleanCloud Detects
 
-40 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
+41 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
 
 **AWS:**
 - Compute: stopped instances 30+ days (EBS charges continue)
@@ -533,7 +534,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 - Platform: idle RDS instances (HIGH)
 - Observability: infinite retention CloudWatch Logs
 - Governance: untagged resources, unused security groups
-- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days — bills $600–$7,300+/MU/month regardless of traffic; idle SageMaker endpoints with zero invocations 14+ days — GPU-backed endpoints flagged HIGH risk ($500–$23K/month); idle Notebook Instances with no activity 14+ days — GPU-backed notebooks flagged HIGH risk ($500–$23K+/month); idle Studio Apps (KernelGateway/JupyterLab/CodeEditor) with no user activity 7+ days — GPU-backed apps flagged HIGH risk ($42–$1,600+/month)
+- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days — bills $600–$7,300+/MU/month regardless of traffic; idle SageMaker endpoints with zero invocations 14+ days — GPU-backed endpoints flagged HIGH risk ($500–$23K/month); idle Notebook Instances with no activity 14+ days — GPU-backed notebooks flagged HIGH risk ($500–$23K+/month); idle Studio Apps (KernelGateway/JupyterLab/CodeEditor) with no user activity 7+ days — GPU-backed apps flagged HIGH risk ($42–$1,600+/month); long-running SageMaker training jobs beyond 24h threshold — GPU early warning at 75% of threshold, CRITICAL risk for GPU jobs that have outlived their stopping condition ($670–$2,360+/day for p3/p4d/p5 instances)
 
 **Azure:**
 - Compute: stopped (not deallocated) VMs (HIGH)
@@ -558,7 +559,7 @@ Rules without a confidence marker are MEDIUM — they use time-based heuristics
 
 ## Roadmap
 
-**More AI/ML waste rules** — SageMaker Training Jobs (runaway/hung), orphaned training artifacts in S3
+**More AI/ML waste rules** — orphaned training artifacts in S3
 
 **More AWS rules** — S3 lifecycle gaps, Redshift idle, NAT Gateway cost leakage (internal services routing through NAT instead of VPC endpoints — S3, DynamoDB, ECR, SSM), unused VPC endpoints
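
The long-running training job rule described in the AI/ML bullet above can be pictured with a minimal sketch. The 24h threshold, the 75% GPU early warning, and the CRITICAL level for GPU jobs past their stopping condition come from the README text; everything else is an assumption made for illustration: the function name `classify_training_job`, the `LONG_RUNNING` and `EARLY_WARNING` labels, the GPU prefix list, and the choice to check the stopping condition independently of the 24h threshold. It is not the rule's actual implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

LONG_RUNNING_THRESHOLD = timedelta(hours=24)  # threshold stated in the README
GPU_EARLY_WARNING_FRACTION = 0.75             # early-warning fraction stated in the README
# Assumption: GPU jobs are recognised by the instance families the README names.
GPU_PREFIXES = ("ml.p3", "ml.p4d", "ml.p5")


def classify_training_job(
    started_at: datetime,
    now: datetime,
    instance_type: str,
    max_runtime_seconds: Optional[int] = None,
) -> Optional[str]:
    """Return a hypothetical finding label for an InProgress job, or None."""
    runtime = now - started_at
    is_gpu = instance_type.startswith(GPU_PREFIXES)

    # GPU job that has outlived its stopping condition: CRITICAL per the README.
    if is_gpu and max_runtime_seconds is not None:
        if runtime > timedelta(seconds=max_runtime_seconds):
            return "CRITICAL"

    # Any job beyond the 24h threshold (label is hypothetical).
    if runtime >= LONG_RUNNING_THRESHOLD:
        return "LONG_RUNNING"

    # GPU-only early warning at 75% of the threshold, i.e. 18h (label is hypothetical).
    if is_gpu and runtime >= LONG_RUNNING_THRESHOLD * GPU_EARLY_WARNING_FRACTION:
        return "EARLY_WARNING"

    return None


now = datetime(2024, 1, 2, tzinfo=timezone.utc)
label = classify_training_job(now - timedelta(hours=19), now, "ml.p4d.24xlarge")
```

Here a GPU job that has run 19 hours falls into the early-warning window, while a CPU job with the same runtime would produce no finding until it crosses 24 hours.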
 
 
@@ -737,6 +737,39 @@ def run_aws_ai_doctor(profile: Optional[str], region: Optional[str] = None) -> N
         permissions_failed.append(("sagemaker:DescribeApp", str(e)))
         warn(f"sagemaker:DescribeApp - {e}")
 
+    # --- sagemaker:ListTrainingJobs + sagemaker:DescribeTrainingJob (aws.sagemaker.training_job.long_running) ---
+    try:
+        sagemaker.list_training_jobs(MaxResults=1, StatusEquals="InProgress")
+        permissions_tested.append("sagemaker:ListTrainingJobs")
+        success("sagemaker:ListTrainingJobs")
+    except Exception as e:
+        permissions_failed.append(("sagemaker:ListTrainingJobs", str(e)))
+        warn(f"sagemaker:ListTrainingJobs - {e}")
+
+    try:
+        _tj_paginator = sagemaker.get_paginator("list_training_jobs")
+        _target_job = None
+        for _tj_page in _tj_paginator.paginate(
+            StatusEquals="InProgress", PaginationConfig={"PageSize": 20}
+        ):
+            for _tj in _tj_page.get("TrainingJobSummaries", []):
+                _target_job = _tj
+                break
+            if _target_job:
+                break
+
+        if _target_job:
+            sagemaker.describe_training_job(TrainingJobName=_target_job["TrainingJobName"])
+            permissions_tested.append("sagemaker:DescribeTrainingJob")
+            success("sagemaker:DescribeTrainingJob")
+        else:
+            info(
+                "sagemaker:DescribeTrainingJob - not tested (no InProgress training job found to probe)"
+            )
+    except Exception as e:
+        permissions_failed.append(("sagemaker:DescribeTrainingJob", str(e)))
+        warn(f"sagemaker:DescribeTrainingJob - {e}")
+
     try:
         cloudwatch = session.client("cloudwatch", region_name=region)
         now = datetime.now(timezone.utc)
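
One possible variant of the training job probe added in this hunk, shown as a standalone sketch: it lists once, reuses that response to pick a job for `describe_training_job`, and keeps each failure attributed to the call that raised it (in the hunk, a failed listing inside the second `try` would be recorded under `sagemaker:DescribeTrainingJob`). The function name `probe_training_job_permissions` and its return shape are hypothetical; `list_training_jobs` and `describe_training_job` are the boto3 SageMaker client calls the hunk already uses.

```python
from typing import Any, List, Tuple


def probe_training_job_permissions(
    sagemaker: Any,
) -> Tuple[List[str], List[Tuple[str, str]], List[str]]:
    """Probe ListTrainingJobs and DescribeTrainingJob on a SageMaker client.

    Returns (tested, failed, skipped); this shape is an assumption for the sketch.
    """
    tested: List[str] = []
    failed: List[Tuple[str, str]] = []
    skipped: List[str] = []

    summaries = []
    try:
        resp = sagemaker.list_training_jobs(MaxResults=1, StatusEquals="InProgress")
        summaries = resp.get("TrainingJobSummaries", [])
        tested.append("sagemaker:ListTrainingJobs")
    except Exception as e:
        failed.append(("sagemaker:ListTrainingJobs", str(e)))

    if not summaries:
        # Listing failed or nothing is InProgress: Describe cannot be probed.
        skipped.append("sagemaker:DescribeTrainingJob")
        return tested, failed, skipped

    try:
        sagemaker.describe_training_job(TrainingJobName=summaries[0]["TrainingJobName"])
        tested.append("sagemaker:DescribeTrainingJob")
    except Exception as e:
        failed.append(("sagemaker:DescribeTrainingJob", str(e)))

    return tested, failed, skipped
```

With a real session this would be called as `probe_training_job_permissions(session.client("sagemaker", region_name=region))`; the caller can then map the three lists onto `success`, `warn`, and `info` as the doctor already does.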