|
11 | 11 | | `gcp.compute.snapshot.old` | Storage | Disk snapshots older than 90 days | |
12 | 12 | | `gcp.compute.ip.unused` | Network | Reserved static IPs in RESERVED state | |
13 | 13 | | `gcp.sql.instance.idle` | Platform | Cloud SQL instances with zero connections 14+ days | |
14 | | -| `gcp.vertex.endpoint.idle` | AI/ML | Vertex AI endpoints with dedicated capacity and zero predictions 14+ days | |
| 14 | +| `gcp.vertex.endpoint.idle` | AI/ML | Vertex AI endpoints with an always-deployed serving floor and zero observed request activity 14+ days | |
15 | 15 | | `gcp.vertex.workbench.idle` | AI/ML | Vertex AI Workbench instances with no activity 14+ days | |
16 | 16 | | `gcp.vertex.training_job.long_running` | AI/ML | Vertex AI jobs running beyond threshold | |
17 | | -| `gcp.tpu.idle` | AI/ML | Cloud TPU nodes with near-zero utilization 7+ days | |
18 | | -| `gcp.vertex.featurestore.idle` | AI/ML | Vertex AI Feature Stores with zero serving requests 30+ days | |
| 17 | +| `gcp.tpu.idle` | AI/ML | Standalone Cloud TPU nodes in READY state with monitoring-based idle detection; no findings are emitted until a documented worker-to-node join exists | |
| 18 | +| `gcp.vertex.featurestore.idle` | AI/ML | Vertex AI Feature Stores (legacy) and Bigtable-backed Feature Online Stores with zero serving requests 30+ days (Monitoring-confirmed only) | |
19 | 19 |
|
20 | 20 | --- |
21 | 21 |
|
|
151 | 151 | ## AI/ML *(opt-in: `--category ai`)* |
152 | 152 |
|
153 | 153 | #### `gcp.vertex.endpoint.idle` |
154 | | -**Detects:** Vertex AI Online Prediction endpoints with `dedicatedResources` and zero predictions for `idle_days` |
| 154 | +**Detects:** Vertex AI Online Prediction endpoints with an always-deployed serving floor (`dedicatedResources.minReplicaCount >= 1` or `automaticResources.minReplicaCount >= 1`) and no usable endpoint-scoped request-count datapoint above `0` across the full `idle_days` observation window, confirmed by Cloud Monitoring telemetry with proven gap-free coverage |
155 | 155 |
|
156 | | -**Confidence / Risk:** HIGH (zero predictions confirmed + age ≥ `idle_days`); MEDIUM (zero predictions, age ≥ 75% of threshold or age unknown) / HIGH (GPU-backed: T4, V100, A100, L4, H100, TPU); MEDIUM (CPU-only) |
| 156 | +**Confidence / Risk:** HIGH (sole emit path: full-window zero request-count telemetry with no heuristic fallback; no MEDIUM tier) / HIGH (any in-scope dedicated model with nonzero accelerator count and recognized GPU/TPU type); MEDIUM (CPU-only or automatic-resources-only endpoints) |
| 157 | + |
| 158 | +**Cost:** `estimated_monthly_cost_usd = None` -- pricing varies by machine type, accelerator, region, and usage option; no flat estimate is appropriate |
157 | 159 |
|
158 | 160 | **Permissions:** `aiplatform.endpoints.list` (roles/aiplatform.viewer), `monitoring.timeSeries.list` (roles/monitoring.viewer) |
159 | 161 |
|
160 | 162 | **Params:** `idle_days` (default: 14) |
161 | 163 |
|
162 | | -**Exclusions:** endpoints using `automaticResources` (scale-to-zero); only `dedicatedResources` with `minReplicaCount > 0` |
163 | | - |
164 | | -**Spec:** — |
| 164 | +**Exclusions:** |
| 165 | +- endpoint name or location malformed or absent |
| 166 | +- location filter set and location does not exactly match |
| 167 | +- endpoint `createTime` absent, unparsable, or future |
| 168 | +- no in-scope deployed models; `provisioned_serving_floor < 1` |
| 169 | +- shared-resource-only endpoint (`sharedResources` only; spec 11.4) |
| 170 | +- any in-scope deployed model `createTime` absent, unparsable, or future |
| 171 | +- `capacity_floor_start > evaluation_window_start` (full window not coverable) |
| 172 | +- malformed `minReplicaCount` or unrecognized prediction-resource union on any deployed model |
| 173 | +- monitoring client creation failure -- all endpoints skip; no fallback |
| 174 | +- monitoring query failure for a location -- all endpoints in that location skip |
| 175 | +- telemetry coverage unresolved: no series, leading gap > `idle_days * 86400s / 2`, any interior gap > `idle_days * 86400s / 2`, or trailing gap > `idle_days * 86400s / 2` |
| 176 | +- any usable request-count datapoint > `0` in the observation window |
| 177 | +- `dedicatedResources.minReplicaCount == 0` (scale-to-zero preview; no always-deployed floor) |
| 178 | +- `automaticResources.minReplicaCount == 0` (scale-to-zero; no always-deployed floor) |
| 179 | +- near-idle, low-traffic, age-only, trafficSplit, or missing-telemetry-as-zero fallbacks are explicitly forbidden |
| 180 | + |
| 181 | +**Spec:** [docs/specs/gcp/ai/vertex_endpoint_idle.md](../specs/gcp/ai/vertex_endpoint_idle.md) |
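
The telemetry-coverage exclusion for `gcp.vertex.endpoint.idle` can be read as a gap check over the observation window. The sketch below is illustrative only: the function names `coverage_resolved` and `endpoint_is_idle` and the `(timestamp, request_count)` sample shape are hypothetical, and it assumes every sample already lies inside the window. It is not the rule's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def coverage_resolved(points, window_start, window_end, idle_days):
    """Return True when no leading, interior, or trailing gap exceeds half the window.

    `points` is a list of datapoint timestamps assumed to lie inside the window.
    """
    if not points:
        return False  # no series at all: coverage is unresolved
    max_gap = timedelta(seconds=idle_days * 86400 / 2)
    # Bracket the sorted points with the window edges so that the leading and
    # trailing gaps are checked by the same comparison as the interior gaps.
    edges = [window_start, *sorted(points), window_end]
    return all(later - earlier <= max_gap for earlier, later in zip(edges, edges[1:]))


def endpoint_is_idle(samples, window_start, window_end, idle_days):
    """`samples` is a list of (timestamp, request_count) pairs (hypothetical shape)."""
    timestamps = [ts for ts, _ in samples]
    if not coverage_resolved(timestamps, window_start, window_end, idle_days):
        return False  # missing telemetry is never treated as zero
    return all(count == 0 for _, count in samples)


window_end = datetime(2024, 1, 15, tzinfo=timezone.utc)
window_start = window_end - timedelta(days=14)
dense = [(window_start + timedelta(days=d), 0) for d in (3, 8, 12)]
sparse = [(window_start + timedelta(days=10), 0)]
busy = dense[:-1] + [(window_start + timedelta(days=12), 1)]
```

With `idle_days = 14` the largest tolerated gap is seven days, so `dense` qualifies as idle, `sparse` is skipped for its ten-day leading gap, and `busy` is skipped for its nonzero datapoint.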
165 | 182 |
|
166 | 183 | #### `gcp.vertex.workbench.idle` |
167 | 184 | **Detects:** Vertex AI Workbench instances `ACTIVE` with no control-plane activity (`updateTime`) for `idle_days` |
|
190 | 207 | **Spec:** — |
191 | 208 |
|
192 | 209 | #### `gcp.tpu.idle` |
193 | | -**Detects:** Cloud TPU nodes in `READY` state with max `duty_cycle ≤ 2%` across all workers for `idle_days` |
| 210 | +**Detects:** Standalone Cloud TPU nodes in exact `READY` state where complete worker-joined duty-cycle telemetry (`tpu.googleapis.com/accelerator/duty_cycle` on `tpu.googleapis.com/GceTpuWorker`) confirms max observed duty cycle <= 2% across all joined workers and accelerators over the full buffered `idle_days` window; monitoring is required — no age-only, partial-join, or cadence-assumed fallback |
| 211 | + |
| 212 | +**Confidence / Risk:** HIGH / HIGH (when emitting — requires monitoring-confirmed complete join; no tiered fallback) |
| 213 | + |
| 214 | +**Current emission status:** No findings are emitted. The `GceTpuWorker` monitored resource labels (`resource_container`, `location`, `worker_id`) do not include a TPU Node name. No documented first-party Google Cloud surface maps `worker_id` to the owning TPU Node, so `telemetry_join_state` cannot be proven `complete` (spec 8.3). Emission requires `telemetry_join_state == complete` (spec 9, condition 7). The monitoring query is issued per zone to surface permission errors. When Google publishes a documented worker-to-node identity surface, implement the join in `_run_zone_diagnostic`. |
194 | 215 |
|
195 | | -**Confidence / Risk:** HIGH (Cloud Monitoring confirms near-zero duty cycle); LOW (Monitoring unavailable — age-only heuristic) / CRITICAL (HIGH confidence + hourly cost ≥ $10/hr); HIGH (HIGH confidence + < $10/hr); MEDIUM (LOW confidence) |
| 216 | +**Cost:** `estimated_monthly_cost_usd = None` — pricing varies by TPU type, region, and usage option; no flat estimate is appropriate |
196 | 217 |
|
197 | | -**Permissions:** `tpu.nodes.list` (roles/tpu.viewer), `monitoring.timeSeries.list` (roles/monitoring.viewer, optional — falls back to age-based) |
| 218 | +**Permissions:** `tpu.nodes.list` (roles/tpu.viewer), `monitoring.timeSeries.list` (roles/monitoring.viewer) |
198 | 219 |
|
199 | 220 | **Params:** `idle_days` (default: 7) |
200 | 221 |
|
201 | | -**Exclusions:** nodes not in `READY` state; nodes younger than `idle_days` |
| 222 | +**Exclusions (pre-checks applied before monitoring):** |
| 223 | +- node name malformed, node ID or zone absent/unresolvable |
| 224 | +- region filter set and derived region does not exactly match |
| 225 | +- state not exactly `READY` |
| 226 | +- `createTime` absent, unparsable, future, or node younger than full buffered window (`now - 180s - idle_days * 86400s`) |
| 227 | +- `queuedResource` non-empty string (queued-resource-managed node) |
| 228 | +- `multisliceNode == true` (multislice node) |
| 229 | +- malformed `queuedResource` (non-string/non-null) or `multisliceNode` (non-bool/non-null) |
| 230 | +- monitoring client creation failure (all nodes skip — no age-only fallback) |
| 231 | +- monitoring query failure for a node (that node skips, warning issued) |
| 232 | +- `telemetry_join_state` not `complete` — currently always the case (see above) |
202 | 233 |
|
203 | | -**Spec:** — |
| 234 | +**Spec:** [docs/specs/gcp/ai/tpu_idle.md](../specs/gcp/ai/tpu_idle.md) |
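
The `createTime` pre-check for `gcp.tpu.idle` compares node age against the buffered window start, `now - 180s - idle_days * 86400s`. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and treating a node created exactly at the window start as old enough is an assumption, not something the text above states.

```python
from datetime import datetime, timedelta, timezone

BUFFER_SECONDS = 180  # buffer taken from the pre-check formula


def buffered_window_start(now, idle_days):
    """Compute now - 180s - idle_days * 86400s."""
    return now - timedelta(seconds=BUFFER_SECONDS) - timedelta(seconds=idle_days * 86400)


def covers_buffered_window(create_time, now, idle_days):
    """Return False for absent or future createTime, or a node younger than the window."""
    if create_time is None or create_time > now:
        return False
    # Assumption: creation exactly at the window start counts as covering it.
    return create_time <= buffered_window_start(now, idle_days)


now = datetime(2024, 1, 15, 0, 3, 0, tzinfo=timezone.utc)
start = buffered_window_start(now, idle_days=7)
```

With the default `idle_days = 7` and the `now` shown, the window starts at 2024-01-08 00:00:00 UTC, so a node created on 2024-01-10 is skipped before any monitoring query is considered.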
204 | 235 |
|
205 | 236 | #### `gcp.vertex.featurestore.idle` |
206 | | -**Detects:** Vertex AI Feature Stores (legacy and new-gen) with zero `online_serving/request_count` for `idle_days`; Bigtable-backed stores bill ~$197/node/month regardless of utilization |
| 237 | +**Detects:** Vertex AI Feature Stores (legacy) and Bigtable-backed Feature Online Stores with provisioned online-serving capacity and zero `online_serving/request_count` confirmed by Cloud Monitoring for `idle_days`; no age-only or monitoring-absent fallback |
207 | 238 |
|
208 | | -**Confidence / Risk:** HIGH (Cloud Monitoring confirms zero requests); LOW (Monitoring unavailable — age-only) / HIGH (HIGH confidence); MEDIUM (LOW confidence) |
| 239 | +**Confidence / Risk:** HIGH (Cloud Monitoring confirms zero requests for full aligned window) / HIGH |
209 | 240 |
|
210 | | -**Permissions:** `aiplatform.featurestores.list`, `aiplatform.featureOnlineStores.list` (roles/aiplatform.viewer), `monitoring.timeSeries.list` (roles/monitoring.viewer, optional) |
| 241 | +**Cost:** `estimated_monthly_cost_usd = None` — pricing varies by backing, region, node count, and commitment model; no flat estimate is appropriate |
211 | 242 |
|
212 | | -**Params:** `idle_days` (default: 30) |
| 243 | +**Permissions:** `aiplatform.featurestores.list`, `aiplatform.featureOnlineStores.list` (roles/aiplatform.viewer), `monitoring.timeSeries.list` (roles/monitoring.viewer) |
213 | 244 |
|
214 | | -**Exclusions:** legacy featurestores with `fixedNodeCount == 0` and `scaling.minNodeCount == 0`; stores not in `STABLE` state |
| 245 | +**Params:** `idle_days` (default: 30) |
215 | 246 |
|
216 | | -**Spec:** — |
| 247 | +**Exclusions:** |
| 248 | +- resource name malformed or store ID / region absent |
| 249 | +- region filter set and region does not exactly match |
| 250 | +- state not exactly `STABLE` |
| 251 | +- `reference_time` (`max(createTime, updateTime)`) absent, unparsable, or in the future |
| 252 | +- store younger than full `idle_days` observation window |
| 253 | +- legacy: `fixedNodeCount == 0` and `scaling.minNodeCount == 0` (no provisioned online-serving capacity) |
| 254 | +- legacy: both `fixedNodeCount > 0` and `scaling.minNodeCount > 0` simultaneously — invalid serving mode |
| 255 | +- FeatureOnlineStore: storage type not exactly Bigtable (`optimized` stores are out of scope) |
| 256 | +- FeatureOnlineStore: `bigtable.autoScaling` absent, or `maxNodeCount < minNodeCount` |
| 257 | +- monitoring client unavailable (no age-only fallback) |
| 258 | +- metric coverage unresolved: not exactly `idle_days` aligned daily buckets, query failure, future timestamp, or gap > `86400s` between adjacent points
| 259 | +- aggregate request count > 0 over the full window |
| 260 | + |
| 261 | +**Spec:** [docs/specs/gcp/ai/featurestore_idle.md](../specs/gcp/ai/featurestore_idle.md) |
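
The two legacy serving-mode exclusions for `gcp.vertex.featurestore.idle` amount to a small classification over `fixedNodeCount` and `scaling.minNodeCount`. The sketch below is a hedged illustration: the function name and the returned labels are hypothetical, and rejecting negative counts as invalid is an assumption added for completeness.

```python
def legacy_serving_mode(fixed_node_count, min_node_count):
    """Classify a legacy featurestore's online-serving configuration.

    Returns one of the hypothetical labels "none", "invalid", "fixed",
    or "autoscaled". Only "fixed" and "autoscaled" would stay in scope.
    """
    if fixed_node_count < 0 or min_node_count < 0:
        return "invalid"  # assumption: negative counts are malformed
    if fixed_node_count == 0 and min_node_count == 0:
        return "none"  # no provisioned online-serving capacity
    if fixed_node_count > 0 and min_node_count > 0:
        return "invalid"  # both modes set simultaneously
    return "fixed" if fixed_node_count > 0 else "autoscaled"


def legacy_store_in_scope(fixed_node_count, min_node_count):
    return legacy_serving_mode(fixed_node_count, min_node_count) in ("fixed", "autoscaled")
```

Keeping the classification separate from the in-scope test makes it easy to report which exclusion applied when a store is skipped.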