cleancloud-io · javvaji-devops · Jun 12, 2026
@@ -100,7 +100,7 @@ Gaspillage minimum estimé : ~$25 944/mois
 - Détecte le gaspillage IA/ML coûteux : SageMaker, AML, Vertex AI — ressources GPU signalées comme candidats à risque plus élevé (500–23 000 $/mois)
 - Fonctionne sur AWS, Azure et GCP en un seul outil
 - S'exécute entièrement dans votre environnement — aucun agent, pas de SaaS, aucun credential stocké
-- 47 règles de détection sélectives et haut signal, conçues pour éviter les faux positifs en environnements IaC
+- 48 règles de détection sélectives et haut signal, conçues pour éviter les faux positifs en environnements IaC
 - Prêt pour CI/CD — codes de sortie d'application + sorties JSON/CSV/markdown
 
 ### Ce que CleanCloud ne fait PAS
@@ -153,6 +153,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 | Studio Apps SageMaker (KernelGateway/JupyterLab/CodeEditor) | 42 – 1 600+ $ / mois |
 | Domaine SageMaker (stockage EFS inactif) | Charges EFS continues |
 | Training Job SageMaker (job GPU runaway/bloqué) | 670 – 2 360+ $ / jour |
+| Processing Job SageMaker (bloqué/stuck) | 670 – 2 360+ $ / jour |
 | Cluster AML Compute Azure (GPU) | 600 – 15 000 $ / mois |
 | Instance de calcul Azure ML (GPU) | 600 – 15 000+ $ / mois |
 | Endpoint en ligne Azure ML (GPU) | 200 – 2 600+ $ / mois |
@@ -166,7 +167,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l
 CleanCloud détecte les endpoints à zéro invocation / zéro prédiction, l'activité de contrôle inactive sur les notebooks et apps managés, ainsi que les training jobs managés anormalement longs sur les 3 clouds. Les outils natifs montrent la facture — ils ne nomment pas la ressource concrète à examiner.
 
 ```bash
-cleancloud scan --provider aws --category ai          # PTUs Bedrock + endpoints + notebooks + domaines + Studio apps SageMaker + training jobs SageMaker + EC2 GPU
+cleancloud scan --provider aws --category ai          # PTUs Bedrock + endpoints + notebooks + domaines + Studio apps SageMaker + training jobs + processing jobs SageMaker + EC2 GPU
 cleancloud scan --provider azure --category ai        # clusters AML + instances ML + endpoints en ligne + AI Search + PTUs OpenAI
 cleancloud scan --provider gcp --category ai          # endpoints Vertex AI + Workbench + training jobs + Cloud TPU + Feature Stores
 cleancloud scan --provider aws --category all         # hygiène + IA/ML ensemble
@@ -433,7 +434,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 
 ## Ce que CleanCloud détecte
 
-47 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC.
+48 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC.
 
 **AWS :**
 - Compute : instances arrêtées 30+ jours (charges EBS continuent)
@@ -442,7 +443,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud
 - Plateforme : instances RDS inactives (HIGH)
 - Observabilité : logs CloudWatch à rétention infinie
 - Gouvernance : ressources sans tags, security groups inutilisés
-- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocation depuis 7+ jours ; endpoints SageMaker sans trafic `InvokeEndpoint` observé depuis 14+ jours ; instances Notebook SageMaker avec timestamps de contrôle inactifs depuis 14+ jours ; Domaines SageMaker sans apps en cours d'exécution sur tous les profils et espaces depuis 30+ jours (coût de stockage EFS continu) ; Studio Apps SageMaker (`KernelGateway`/`JupyterLab`/`CodeEditor`) sans signal d'activité récent exploitable depuis 7+ jours ; training jobs SageMaker toujours `InProgress` au-delà du seuil de 24h
+- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocation depuis 7+ jours ; endpoints SageMaker sans trafic `InvokeEndpoint` observé depuis 14+ jours ; instances Notebook SageMaker avec timestamps de contrôle inactifs depuis 14+ jours ; Domaines SageMaker sans apps en cours d'exécution sur tous les profils et espaces depuis 30+ jours (coût de stockage EFS continu) ; Studio Apps SageMaker (`KernelGateway`/`JupyterLab`/`CodeEditor`) sans signal d'activité récent exploitable depuis 7+ jours ; training jobs SageMaker toujours `InProgress` au-delà du seuil de 24h ; processing jobs SageMaker toujours `InProgress` au-delà du seuil de 24h
 
 **Azure :**
 - Compute : VMs arrêtées (non désallouées) (HIGH)

@@ -100,7 +100,7 @@ Minimum estimated waste: ~$25,944/month
 - Catches expensive idle AI/ML waste: SageMaker, AML, Vertex AI — GPU-backed resources flagged as higher-risk review candidates ($500–$23K/month)
 - Works across AWS, Azure, and GCP in one tool
 - Runs entirely in your environment — no agents, no SaaS, no credentials stored
-- 46 curated, high-signal detection rules designed to avoid false positives in IaC environments
+- 48 curated, high-signal detection rules designed to avoid false positives in IaC environments
 - CI/CD-ready — enforcement exit codes + JSON/CSV/markdown output
 
 ### What CleanCloud does NOT do
@@ -153,6 +153,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 | SageMaker Studio Apps (KernelGateway/JupyterLab/CodeEditor) | $42 – $1,600+ / month |
 | SageMaker Domain (idle EFS storage) | Continuous EFS charges |
 | SageMaker Training Job (runaway/hung GPU job) | $670 – $2,360+ / day |
+| SageMaker Processing Job (hung/stuck) | $670 – $2,360+ / day |
 | Azure AML compute cluster (GPU) | $600 – $15,000 / month |
 | Azure ML Compute Instance (GPU) | $600 – $15,000+ / month |
 | Azure ML Online Endpoint (GPU-backed) | $200 – $2,600+ / month |
@@ -166,7 +167,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend
 CleanCloud detects zero-invocation / zero-prediction endpoints, stale managed notebook and app activity, and long-running managed training jobs across all three clouds. Native cost tools show the bill — they do not name the specific resource to review.
 
 ```bash
-cleancloud scan --provider aws --category ai          # Bedrock PTUs + SageMaker endpoints + notebooks + domains + Studio apps + training jobs + idle GPU EC2
+cleancloud scan --provider aws --category ai          # Bedrock PTUs + SageMaker endpoints + notebooks + domains + Studio apps + training jobs + processing jobs + idle GPU EC2
 cleancloud scan --provider azure --category ai        # AML compute + ML instances + online endpoints + AI Search + OpenAI PTUs
 cleancloud scan --provider gcp --category ai          # Vertex AI endpoints + Workbench + training jobs + Cloud TPU + Feature Stores
 cleancloud scan --provider aws --category all         # hygiene + AI/ML together
@@ -433,7 +434,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 
 ## What CleanCloud Detects
 
-47 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
+48 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments.
 
 **AWS:**
 - Compute: stopped instances 30+ days (EBS charges continue)
@@ -442,7 +443,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints
 - Platform: idle RDS instances (HIGH)
 - Observability: infinite retention CloudWatch Logs
 - Governance: untagged resources, unused security groups
-- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days; idle SageMaker endpoints with no observed `InvokeEndpoint` traffic 14+ days; SageMaker Notebook Instances with stale control-plane timestamps 14+ days; SageMaker Domains with no running apps across all user profiles and spaces 30+ days (continuous EFS storage cost); SageMaker Studio apps (`KernelGateway`/`JupyterLab`/`CodeEditor`) with no usable recent activity signal 7+ days; SageMaker training jobs still `InProgress` beyond the 24h threshold
+- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days; idle SageMaker endpoints with no observed `InvokeEndpoint` traffic 14+ days; SageMaker Notebook Instances with stale control-plane timestamps 14+ days; SageMaker Domains with no running apps across all user profiles and spaces 30+ days (continuous EFS storage cost); SageMaker Studio apps (`KernelGateway`/`JupyterLab`/`CodeEditor`) with no usable recent activity signal 7+ days; SageMaker training jobs still `InProgress` beyond the 24h threshold; SageMaker processing jobs still `InProgress` beyond the 24h threshold
 
 **Azure:**
 - Compute: stopped (not deallocated) VMs (HIGH)
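
The long-running processing job rule described above can be illustrated with a small sketch. This is not CleanCloud's implementation: the function name `find_long_running_jobs`, the use of `CreationTime` as the age reference, and the sample data are all assumptions made for illustration. Only the `InProgress` status and the 24h threshold come from the README text, and the dictionary keys follow the shape of SageMaker's `ProcessingJobSummaries` entries.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Iterable, List, Mapping

# Assumption: mirrors the 24h threshold stated in the README.
LONG_RUNNING_THRESHOLD = timedelta(hours=24)


def find_long_running_jobs(
    summaries: Iterable[Mapping[str, Any]],
    now: datetime,
    threshold: timedelta = LONG_RUNNING_THRESHOLD,
) -> List[str]:
    """Return names of jobs still InProgress for longer than the threshold."""
    flagged = []
    for job in summaries:
        if job.get("ProcessingJobStatus") != "InProgress":
            continue  # finished, failed, or stopped jobs no longer accrue cost
        created = job.get("CreationTime")
        if created is None:
            continue  # no usable timestamp: stay conservative and do not flag
        if now - created > threshold:
            flagged.append(job["ProcessingJobName"])
    return flagged


# Hypothetical sample data shaped like ProcessingJobSummaries entries.
now = datetime(2026, 6, 12, 12, 0, tzinfo=timezone.utc)
sample = [
    {"ProcessingJobName": "stuck-job", "ProcessingJobStatus": "InProgress",
     "CreationTime": now - timedelta(hours=30)},
    {"ProcessingJobName": "fresh-job", "ProcessingJobStatus": "InProgress",
     "CreationTime": now - timedelta(hours=2)},
    {"ProcessingJobName": "done-job", "ProcessingJobStatus": "Completed",
     "CreationTime": now - timedelta(hours=48)},
]
flagged_jobs = find_long_running_jobs(sample, now)
print(flagged_jobs)
```

Skipping jobs without a timestamp keeps the sketch in line with the README's stated goal of avoiding false positives; a real rule might instead surface them at a lower confidence.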

@@ -856,6 +856,29 @@ def run_aws_ai_doctor(profile: Optional[str], region: Optional[str] = None) -> N
         permissions_failed.append(("sagemaker:DescribeTrainingJob", str(e)))
         warn(f"sagemaker:DescribeTrainingJob - {e}")
 
+    # --- sagemaker:ListProcessingJobs + sagemaker:DescribeProcessingJob (aws.sagemaker.processing_job.long_running) ---
+    # Probe List once and reuse its result, so a List denial is never misreported as a Describe failure.
+    _pj_summaries = None
+    try:
+        _pj_resp = sagemaker.list_processing_jobs(MaxResults=1)
+        _pj_summaries = _pj_resp.get("ProcessingJobSummaries", [])
+        permissions_tested.append("sagemaker:ListProcessingJobs")
+        success("sagemaker:ListProcessingJobs")
+    except Exception as e:
+        permissions_failed.append(("sagemaker:ListProcessingJobs", str(e)))
+        warn(f"sagemaker:ListProcessingJobs - {e}")
+
+    try:
+        if _pj_summaries:
+            sagemaker.describe_processing_job(ProcessingJobName=_pj_summaries[0]["ProcessingJobName"])
+            permissions_tested.append("sagemaker:DescribeProcessingJob")
+            success("sagemaker:DescribeProcessingJob")
+        else:
+            info("sagemaker:DescribeProcessingJob - not tested (no processing job available to probe)")
+    except Exception as e:
+        permissions_failed.append(("sagemaker:DescribeProcessingJob", str(e)))
+        warn(f"sagemaker:DescribeProcessingJob - {e}")
+
     try:
         cloudwatch = session.client("cloudwatch", region_name=region)
         now = datetime.now(timezone.utc)

@@ -0,0 +1,21 @@
+"""Shared AWS error classification helpers."""
+
+from botocore.exceptions import ClientError
+
+# Error codes that indicate a permission/authorization failure.
+# Covers the common SageMaker, EC2, IAM, and service-specific variants.
+_PERMISSION_ERROR_CODES = frozenset(
+    {
+        "AccessDenied",
+        "AccessDeniedException",
+        "UnauthorizedOperation",
+        "UnauthorizedException",
+        "Client.UnauthorizedOperation",
+    }
+)
+
+
+def is_permission_error(exc: ClientError) -> bool:
+    """Return True when a ClientError represents an authorization failure."""
+    code = exc.response.get("Error", {}).get("Code", "")
+    return code in _PERMISSION_ERROR_CODES
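
A short usage sketch may help show how a caller could separate authorization failures from other errors with the helper above. To keep it runnable without botocore installed, it uses a hypothetical stand-in class, `FakeClientError`, that only mimics the `.response` attribute of `botocore.exceptions.ClientError`. The `classify` function and its return strings are also assumptions for illustration, not part of this PR.

```python
# Same codes as the helper in the diff above.
_PERMISSION_ERROR_CODES = frozenset(
    {
        "AccessDenied",
        "AccessDeniedException",
        "UnauthorizedOperation",
        "UnauthorizedException",
        "Client.UnauthorizedOperation",
    }
)


class FakeClientError(Exception):
    """Hypothetical stand-in for botocore's ClientError (only .response is mimicked)."""

    def __init__(self, code: str, operation: str) -> None:
        super().__init__(f"{code} when calling {operation}")
        self.response = {"Error": {"Code": code}}


def is_permission_error(exc) -> bool:
    """Return True when the error code represents an authorization failure."""
    code = exc.response.get("Error", {}).get("Code", "")
    return code in _PERMISSION_ERROR_CODES


def classify(exc) -> str:
    """Hypothetical caller: decide how a scan rule reports the failure."""
    if is_permission_error(exc):
        return "skipped: missing permission"
    return "failed: unexpected error"


denied = FakeClientError("AccessDeniedException", "ListProcessingJobs")
throttled = FakeClientError("ThrottlingException", "ListProcessingJobs")
print(classify(denied))
print(classify(throttled))
```

Keeping the codes in one `frozenset` means a newly observed service-specific variant can be added in a single place rather than in every rule.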