Commit 55ccba4

GCP AI rule hardening - part 1 (#169)
1 parent 9e6f60f commit 55ccba4

12 files changed

Lines changed: 4814 additions & 2963 deletions

File tree

cleancloud/providers/gcp/rules/ai/featurestore_idle.py

Lines changed: 527 additions & 311 deletions
Large diffs are not rendered by default.

cleancloud/providers/gcp/rules/ai/tpu_idle.py

Lines changed: 346 additions & 409 deletions
Large diffs are not rendered by default.

cleancloud/providers/gcp/rules/ai/vertex_endpoint_idle.py

Lines changed: 592 additions & 583 deletions
Large diffs are not rendered by default.

docs/rules/gcp.md

Lines changed: 64 additions & 19 deletions
@@ -11,11 +11,11 @@
 | `gcp.compute.snapshot.old` | Storage | Disk snapshots older than 90 days |
 | `gcp.compute.ip.unused` | Network | Reserved static IPs in RESERVED state |
 | `gcp.sql.instance.idle` | Platform | Cloud SQL instances with zero connections 14+ days |
-| `gcp.vertex.endpoint.idle` | AI/ML | Vertex AI endpoints with dedicated capacity and zero predictions 14+ days |
+| `gcp.vertex.endpoint.idle` | AI/ML | Vertex AI endpoints with an always-deployed serving floor and zero observed request activity 14+ days |
 | `gcp.vertex.workbench.idle` | AI/ML | Vertex AI Workbench instances with no activity 14+ days |
 | `gcp.vertex.training_job.long_running` | AI/ML | Vertex AI jobs running beyond threshold |
-| `gcp.tpu.idle` | AI/ML | Cloud TPU nodes with near-zero utilization 7+ days |
-| `gcp.vertex.featurestore.idle` | AI/ML | Vertex AI Feature Stores with zero serving requests 30+ days |
+| `gcp.tpu.idle` | AI/ML | Standalone Cloud TPU nodes in READY state with monitoring-based idle detection; currently no findings emit until worker-to-node join is documented |
+| `gcp.vertex.featurestore.idle` | AI/ML | Vertex AI Feature Stores (legacy) and Bigtable-backed Feature Online Stores with zero serving requests 30+ days (Monitoring-confirmed only) |

 ---

@@ -151,17 +151,34 @@
 ## AI/ML *(opt-in: `--category ai`)*

 #### `gcp.vertex.endpoint.idle`
-**Detects:** Vertex AI Online Prediction endpoints with `dedicatedResources` and zero predictions for `idle_days`
+**Detects:** Vertex AI Online Prediction endpoints with an always-deployed serving floor (`dedicatedResources.minReplicaCount >= 1` or `automaticResources.minReplicaCount >= 1`) and no usable endpoint-scoped request-count datapoint above `0` across the full `idle_days` observation window, confirmed by Cloud Monitoring telemetry with proven gap-free coverage

-**Confidence / Risk:** HIGH (zero predictions confirmed + age ≥ `idle_days`); MEDIUM (zero predictions, age ≥ 75% of threshold or age unknown) / HIGH (GPU-backed: T4, V100, A100, L4, H100, TPU); MEDIUM (CPU-only)
+**Confidence / Risk:** HIGH (sole emit path: full-window zero request-count telemetry with no heuristic fallback; no MEDIUM tier) / HIGH (any in-scope dedicated model with nonzero accelerator count and recognized GPU/TPU type); MEDIUM (CPU-only or automatic-resources-only endpoints)
+
+**Cost:** `estimated_monthly_cost_usd = None` -- pricing varies by machine type, accelerator, region, and usage option; no flat estimate is appropriate

 **Permissions:** `aiplatform.endpoints.list` (roles/aiplatform.viewer), `monitoring.timeSeries.list` (roles/monitoring.viewer)

 **Params:** `idle_days` (default: 14)

-**Exclusions:** endpoints using `automaticResources` (scale-to-zero); only `dedicatedResources` with `minReplicaCount > 0`
-
-**Spec:**
+**Exclusions:**
+- endpoint name or location malformed or absent
+- location filter set and location does not exactly match
+- endpoint `createTime` absent, unparsable, or future
+- no in-scope deployed models; `provisioned_serving_floor < 1`
+- shared-resource-only endpoint (`sharedResources` only; spec 11.4)
+- any in-scope deployed model `createTime` absent, unparsable, or future
+- `capacity_floor_start > evaluation_window_start` (full window not coverable)
+- malformed `minReplicaCount` or unrecognized prediction-resource union on any deployed model
+- monitoring client creation failure -- all endpoints skip; no fallback
+- monitoring query failure for a location -- all endpoints in that location skip
+- telemetry coverage unresolved: no series, leading gap > `idle_days * 86400s / 2`, any interior gap > `idle_days * 86400s / 2`, or trailing gap > `idle_days * 86400s / 2`
+- any usable request-count datapoint > `0` in the observation window
+- `dedicatedResources.minReplicaCount == 0` (scale-to-zero preview; no always-deployed floor)
+- `automaticResources.minReplicaCount == 0` (scale-to-zero; no always-deployed floor)
+- near-idle, low-traffic, age-only, trafficSplit, or missing-telemetry-as-zero fallbacks are explicitly forbidden
+
+**Spec:** [docs/specs/gcp/ai/vertex_endpoint_idle.md](../specs/gcp/ai/vertex_endpoint_idle.md)
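
The telemetry-coverage and zero-activity conditions in the endpoint exclusion list can be sketched roughly as below. This is an illustration only, not the code in `vertex_endpoint_idle.py`: the function names, the `(timestamp, value)` tuple shape, and the reading of leading and trailing gaps as distances from the window edges are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def coverage_resolved(points, window_start, window_end, idle_days):
    """Return True when no leading, interior, or trailing gap exceeds half the window.

    `points` is a hypothetical list of (timestamp, request_count) tuples.
    """
    if not points:
        return False  # no series at all: coverage is unresolved
    max_gap = timedelta(seconds=idle_days * 86400 / 2)
    stamps = sorted(ts for ts, _ in points)
    # Bracketing the datapoints with the window edges lets one pass cover
    # the leading gap, every interior gap, and the trailing gap.
    edges = [window_start, *stamps, window_end]
    return all(later - earlier <= max_gap for earlier, later in zip(edges, edges[1:]))


def is_idle(points, window_start, window_end, idle_days):
    """Idle only if coverage is resolved and every datapoint is zero."""
    if not coverage_resolved(points, window_start, window_end, idle_days):
        return False  # missing telemetry is never treated as zero
    return all(value == 0 for _, value in points)
```

Under these assumptions an endpoint with a single zero datapoint late in a 14-day window is skipped rather than flagged, because the leading gap exceeds 7 days.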

 #### `gcp.vertex.workbench.idle`
 **Detects:** Vertex AI Workbench instances `ACTIVE` with no control-plane activity (`updateTime`) for `idle_days`
@@ -190,27 +207,55 @@
 **Spec:**

 #### `gcp.tpu.idle`
-**Detects:** Cloud TPU nodes in `READY` state with max `duty_cycle ≤ 2%` across all workers for `idle_days`
+**Detects:** Standalone Cloud TPU nodes in exact `READY` state where complete worker-joined duty-cycle telemetry (`tpu.googleapis.com/accelerator/duty_cycle` on `tpu.googleapis.com/GceTpuWorker`) confirms max observed duty cycle <= 2% across all joined workers and accelerators over the full buffered `idle_days` window; monitoring is required — no age-only, partial-join, or cadence-assumed fallback
+
+**Confidence / Risk:** HIGH / HIGH (when emitting — requires monitoring-confirmed complete join; no tiered fallback)
+
+**Current emission status:** No findings are emitted. The `GceTpuWorker` monitored resource labels (`resource_container`, `location`, `worker_id`) do not include a TPU Node name. No documented first-party Google Cloud surface maps `worker_id` to the owning TPU Node, so `telemetry_join_state` cannot be proven `complete` (spec 8.3). Emission requires `telemetry_join_state == complete` (spec 9, condition 7). The monitoring query is issued per zone to surface permission errors. When Google publishes a documented worker-to-node identity surface, implement the join in `_run_zone_diagnostic`.

-**Confidence / Risk:** HIGH (Cloud Monitoring confirms near-zero duty cycle); LOW (Monitoring unavailable — age-only heuristic) / CRITICAL (HIGH confidence + hourly cost ≥ $10/hr); HIGH (HIGH confidence + < $10/hr); MEDIUM (LOW confidence)
+**Cost:** `estimated_monthly_cost_usd = None` — pricing varies by TPU type, region, and usage option; no flat estimate is appropriate

-**Permissions:** `tpu.nodes.list` (roles/tpu.viewer), `monitoring.timeSeries.list` (roles/monitoring.viewer, optional — falls back to age-based)
+**Permissions:** `tpu.nodes.list` (roles/tpu.viewer), `monitoring.timeSeries.list` (roles/monitoring.viewer)

 **Params:** `idle_days` (default: 7)

-**Exclusions:** nodes not in `READY` state; nodes younger than `idle_days`
+**Exclusions (pre-checks applied before monitoring):**
+- node name malformed, node ID or zone absent/unresolvable
+- region filter set and derived region does not exactly match
+- state not exactly `READY`
+- `createTime` absent, unparsable, future, or node younger than full buffered window (`now - 180s - idle_days * 86400s`)
+- `queuedResource` non-empty string (queued-resource-managed node)
+- `multisliceNode == true` (multislice node)
+- malformed `queuedResource` (non-string/non-null) or `multisliceNode` (non-bool/non-null)
+- monitoring client creation failure (all nodes skip — no age-only fallback)
+- monitoring query failure for a node (that node skips, warning issued)
+- `telemetry_join_state` not `complete` — currently always the case (see above)

-**Spec:**
+**Spec:** [docs/specs/gcp/ai/tpu_idle.md](../specs/gcp/ai/tpu_idle.md)
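
A few of the TPU pre-checks listed above (state, buffered age window, `queuedResource`, `multisliceNode`) could look roughly like this. It is a sketch under assumptions, not the logic in `tpu_idle.py`: the function name, the dict-shaped `node`, the skip-reason strings, and the assumption that `createTime` has already been parsed into a timezone-aware `datetime` are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

BUFFER_SECONDS = 180  # trailing buffer, taken from the exclusion list above


def precheck_skip_reason(node, now, idle_days):
    """Return a skip-reason string, or None if the node passes these pre-checks.

    `node` is a hypothetical dict mirroring TPU Node fields; `createTime`
    is assumed to be a timezone-aware datetime, or None if it failed to parse.
    """
    if node.get("state") != "READY":
        return "state_not_ready"
    created = node.get("createTime")
    if not isinstance(created, datetime) or created > now:
        return "create_time_invalid"
    # The node must have existed for the whole buffered observation window.
    window_start = now - timedelta(seconds=BUFFER_SECONDS + idle_days * 86400)
    if created > window_start:
        return "younger_than_window"
    queued = node.get("queuedResource")
    if queued is not None and not isinstance(queued, str):
        return "queued_resource_malformed"
    if queued:
        return "queued_resource_managed"
    multislice = node.get("multisliceNode")
    if multislice is not None and not isinstance(multislice, bool):
        return "multislice_malformed"
    if multislice:
        return "multislice_node"
    return None
```

Passing these pre-checks does not by itself produce a finding: per the emission status above, `telemetry_join_state` is never `complete` today, so every node is still skipped at the final condition.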

 #### `gcp.vertex.featurestore.idle`
-**Detects:** Vertex AI Feature Stores (legacy and new-gen) with zero `online_serving/request_count` for `idle_days`; Bigtable-backed stores bill ~$197/node/month regardless of utilization
+**Detects:** Vertex AI Feature Stores (legacy) and Bigtable-backed Feature Online Stores with provisioned online-serving capacity and zero `online_serving/request_count` confirmed by Cloud Monitoring for `idle_days`; no age-only or monitoring-absent fallback

-**Confidence / Risk:** HIGH (Cloud Monitoring confirms zero requests); LOW (Monitoring unavailable — age-only) / HIGH (HIGH confidence); MEDIUM (LOW confidence)
+**Confidence / Risk:** HIGH (Cloud Monitoring confirms zero requests for full aligned window) / HIGH

-**Permissions:** `aiplatform.featurestores.list`, `aiplatform.featureOnlineStores.list` (roles/aiplatform.viewer), `monitoring.timeSeries.list` (roles/monitoring.viewer, optional)
+**Cost:** `estimated_monthly_cost_usd = None` — pricing varies by backing, region, node count, and commitment model; no flat estimate is appropriate

-**Params:** `idle_days` (default: 30)
+**Permissions:** `aiplatform.featurestores.list`, `aiplatform.featureOnlineStores.list` (roles/aiplatform.viewer), `monitoring.timeSeries.list` (roles/monitoring.viewer)

-**Exclusions:** legacy featurestores with `fixedNodeCount == 0` and `scaling.minNodeCount == 0`; stores not in `STABLE` state
+**Params:** `idle_days` (default: 30)

-**Spec:**
+**Exclusions:**
+- resource name malformed or store ID / region absent
+- region filter set and region does not exactly match
+- state not exactly `STABLE`
+- `reference_time` (`max(createTime, updateTime)`) absent, unparsable, or in the future
+- store younger than full `idle_days` observation window
+- legacy: `fixedNodeCount == 0` and `scaling.minNodeCount == 0` (no provisioned online-serving capacity)
+- legacy: both `fixedNodeCount > 0` and `scaling.minNodeCount > 0` simultaneously — invalid serving mode
+- FeatureOnlineStore: storage type not exactly Bigtable (`optimized` stores are out of scope)
+- FeatureOnlineStore: `bigtable.autoScaling` absent, or `maxNodeCount < minNodeCount`
+- monitoring client unavailable (no age-only fallback)
+- metric coverage unresolved — not exactly `idle_days` aligned daily buckets, query failure, future timestamp, or gap > 86 400 s between adjacent points
+- aggregate request count > 0 over the full window
+
+**Spec:** [docs/specs/gcp/ai/featurestore_idle.md](../specs/gcp/ai/featurestore_idle.md)
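
The legacy capacity rule and the metric-coverage rule from the Feature Store exclusion list might be expressed along these lines. This is a hedged sketch, not the code in `featurestore_idle.py`: both function names and the `(bucket_time, request_count)` tuple shape are assumptions, and bucket alignment to day boundaries is not checked here.

```python
from datetime import datetime, timedelta, timezone


def has_provisioned_capacity(fixed_node_count, min_node_count):
    """Legacy store: exactly one serving mode may be active.

    Both zero means no provisioned capacity; both positive is the
    invalid serving mode called out in the exclusion list.
    """
    return (fixed_node_count > 0) != (min_node_count > 0)


def featurestore_idle(buckets, now, idle_days):
    """Return True only for a fully covered window with zero requests.

    `buckets` is a hypothetical list of (bucket_time, request_count) tuples,
    one per daily bucket returned by the monitoring query.
    """
    if len(buckets) != idle_days:
        return False  # not exactly idle_days buckets: coverage unresolved
    ordered = sorted(buckets, key=lambda bucket: bucket[0])
    stamps = [ts for ts, _ in ordered]
    if any(ts > now for ts in stamps):
        return False  # future timestamp
    one_day = timedelta(seconds=86400)
    if any(later - earlier > one_day for earlier, later in zip(stamps, stamps[1:])):
        return False  # gap between adjacent points is too large
    return sum(count for _, count in ordered) == 0
```

A store with 29 of 30 daily buckets is treated as unresolved and skipped, which matches the no-fallback stance in the Detects line.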
