fix(materialize_window): empty-extraction sessions are not ok (#167)

caohy1988 · web-flow · commit d78a860f29dd · 2026-05-16T00:47:19.000-07:00
Closes the silent-failure mode #166's live deploy surfaced.

## The bug

When ``AI.GENERATE`` failed per-event (e.g., runtime SA missing ``roles/aiplatform.user``), the SDK swallowed the error and returned an empty graph for every session. The orchestrator's session loop then:

* Called ``materialize_with_status`` on the empty graph (didn't raise; the materializer just had nothing to insert).
* Marked the session as ``ok=True``.
* Reported ``sessions_materialized == sessions_discovered``, ``ok=true``, and an EMPTY ``rows_materialized`` dict.

Operators monitoring on ``jsonPayload.ok`` saw "all good" while the entity tables stayed empty. Hard-to-spot drift.

## The fix

After ``materialize_with_status`` succeeds, sum the per-table row counts. If the total is zero, treat as a per-session failure (matching exception semantics: break the loop, advance checkpoint only to the prior successful session):

* ``ok=False`` on the SessionResult.
* ``error_code="empty_extraction"``.
* ``error_detail`` documents the common causes: missing ``aiplatform.user`` IAM, transient AI.GENERATE rate limit, or no extractable ontology content in the session's events.

A legitimately empty session (no MAKO decision points) gets flagged too: better a false-positive that prompts the operator to look than a silent zero.

The empty-window heartbeat path (``sessions_discovered == 0``) is untouched: the empty-extraction guard is per-session and doesn't fire when no sessions were processed. ``ok=true`` remains the right shape for "the cron ran, nothing to do".

## Tests

Three new cases in ``TestEmptyExtractionNotOk``:

* ``test_all_sessions_zero_rows_reports_not_ok``: every session extracts to empty; first marked failed, loop breaks, ``ok=false``, ``sessions_materialized=0``, ``sessions_failed=1``, error_code matches.
* ``test_partial_extraction_partial_failure_conservative_checkpoint``: session 1 succeeds with real rows, session 2 empty, session 3 never reached; checkpoint advances ONLY to session 1's timestamp; materializer called exactly twice (BQ-quota safe).
* ``test_empty_window_remains_ok``: zero sessions discovered still produces ``ok=true``; the per-session guard doesn't flip the heartbeat case.

71/71 focused tests pass (68 prior + 3 new). Full suite 2930 pass.
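
The zero-rows guard and the conservative checkpoint rule can be read in isolation from the orchestrator. The following is a minimal sketch, not the SDK's code: `classify_zero_row_session` and `last_safe_checkpoint` are hypothetical helper names, and plain dicts and tuples stand in for the materializer result and the session objects. The error codes and the `rows_attempted` / `insert_status` keys are the ones that appear in the diff.

```python
from typing import Any, Mapping, Optional, Sequence, Tuple

# Hypothetical stand-in for a session: (completion_timestamp, row_counts, table_statuses).
Session = Tuple[str, Optional[Mapping[str, int]], Mapping[str, Mapping[str, Any]]]


def classify_zero_row_session(
    row_counts: Optional[Mapping[str, int]],
    table_statuses: Mapping[str, Mapping[str, Any]],
) -> Optional[str]:
  """Return an error code for a zero-row session, or None if rows landed."""
  total_rows = sum(int(v) for v in (row_counts or {}).values())
  if total_rows > 0:
    return None
  any_attempted = any(
      int(ts.get("rows_attempted", 0)) > 0 for ts in table_statuses.values()
  )
  any_insert_failed = any(
      ts.get("insert_status") == "insert_failed"
      for ts in table_statuses.values()
  )
  if any_attempted and any_insert_failed:
    # The graph had rows, but none of them were inserted.
    return "materialization_failed"
  # Nothing was attempted: extraction returned an empty graph.
  return "empty_extraction"


def last_safe_checkpoint(
    sessions: Sequence[Session],
) -> Tuple[Optional[str], Optional[str]]:
  """Walk sessions in order and stop at the first zero-row session.

  Returns (checkpoint, error_code). The checkpoint is the completion
  timestamp of the last session that materialized rows before the stop.
  """
  checkpoint: Optional[str] = None
  for completion_ts, row_counts, statuses in sessions:
    code = classify_zero_row_session(row_counts, statuses)
    if code is not None:
      return checkpoint, code  # later sessions are never reached
    checkpoint = completion_ts
  return checkpoint, None


failed_insert = {
    "DecisionExecution": {"rows_attempted": 3, "insert_status": "insert_failed"}
}
window = [
    ("t1", {"DecisionExecution": 2}, {}),
    ("t2", {}, {}),
    ("t3", {"DecisionExecution": 1}, {}),
]
print(last_safe_checkpoint(window))
print(classify_zero_row_session({}, failed_insert))
```

An empty window falls through the loop without a failure, which mirrors the heartbeat case where no sessions were discovered and the run stays `ok=true`.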
diff --git a/examples/migration_v5/periodic_materialization/README.md b/examples/migration_v5/periodic_materialization/README.md
@@ -397,23 +397,38 @@ ff1e956df8b8    2026-05-16 04:38:59    3 / 3 / 0                       true
 
 (Row 1 is the deploy script's `--smoke` execution, which ran
 BEFORE the verification added `roles/aiplatform.user` to the
-deploy. AI.GENERATE failed for every session there, but the
-orchestrator still reported `ok=true` with empty
-`rows_materialized` — see the known issue below.)
-
-### Known issue surfaced by this verification
-
-The orchestrator currently reports `sessions_materialized ==
-sessions_discovered` and `ok=true` even when every per-event
-`AI.GENERATE` call failed. The `rows_materialized` dict is
-empty in that case, but `sessions_materialized` doesn't
-reflect the underlying failure. Operators monitoring on
-`jsonPayload.ok` would miss the silent extraction failure.
-
-Tracked for SDK follow-up — out of scope for this example PR.
-Workaround: alert on
-`jsonPayload.rows_materialized == {}` in Cloud Logging /
-Monitoring as a second-line check.
+deploy. AI.GENERATE failed for every session there. At the time
+of #166, the orchestrator reported `ok=true` with empty
+`rows_materialized` — the silent-failure mode that PR #167
+fixed. Today, the same situation would produce `ok=false` with
+`failures[0].error_code = "empty_extraction"`.)
+
+### Failure-mode surface (post-#167)
+
+The orchestrator distinguishes two zero-row session outcomes
+that look identical from `rows_materialized` alone:
+
+* **`empty_extraction`** — extraction (AI.GENERATE or compiled
+  bundle) returned an empty graph; no inserts attempted.
+  Diagnose by checking the runtime SA's `roles/aiplatform.user`
+  grant, AI.GENERATE quotas, or whether the session's events
+  legitimately had any MAKO content.
+
+* **`materialization_failed`** — extraction produced rows but
+  every insert returned an error. The `failures[].error_detail`
+  names the specific tables (e.g.,
+  `DecisionExecution: rows_attempted=3, insert_status='insert_failed'`),
+  and the aggregate `table_statuses` carries the per-table
+  diagnostic at the top level of the report. Diagnose by
+  checking the SA's write permission on the graph dataset,
+  schema drift the binding-validate pre-flight missed, or
+  streaming-buffer pinning.
+
+In both cases: `ok=false`, CLI exit 1, the cron run shows up
+as a failed execution in Cloud Monitoring. Alert directly on
+`jsonPayload.ok=false` plus `jsonPayload.failures[].error_code`
+for the failure-mode breakdown — no second-line
+`rows_materialized == {}` check needed.
 
 ## Not in scope here
 
diff --git a/src/bigquery_agent_analytics/materialize_window.py b/src/bigquery_agent_analytics/materialize_window.py
@@ -921,6 +921,81 @@ def run_materialize_window(
             "insert_status": ts.insert_status,
             "idempotent": ts.idempotent,
         }
+      # Zero-rows guard. ``row_counts`` only includes tables
+      # whose insert SUCCEEDED (per ``ontology_materializer.py``
+      # — see the ``inserted_tables`` filter at the build site).
+      # So ``total_rows == 0`` collapses two distinct failure
+      # modes that operators must triage differently:
+      #
+      #   * **empty_extraction** — the graph itself was empty.
+      #     Extraction (AI.GENERATE or compiled bundle) returned
+      #     no rows; nothing to insert. ``table_statuses`` will
+      #     be empty or have ``rows_attempted == 0`` everywhere.
+      #     Common cause: missing ``roles/aiplatform.user`` on
+      #     the runtime SA (surfaced by #166 live deploy), or
+      #     the session's events legitimately had no MAKO
+      #     content.
+      #
+      #   * **materialization_failed** — the graph HAD rows but
+      #     every insert returned an error. ``table_statuses``
+      #     will have entries with ``rows_attempted > 0`` and
+      #     ``insert_status == "insert_failed"``. Common cause:
+      #     dataset write-perm regression, streaming-buffer
+      #     pinning that hits a delete-then-insert idempotency
+      #     boundary, or a schema mismatch the binding-validate
+      #     pre-flight didn't catch.
+      #
+      # Operators chasing the wrong failure mode is the failure
+      # mode this PR was meant to prevent. Classify here.
+      total_rows = sum(int(v) for v in (mat.row_counts or {}).values())
+      if total_rows == 0:
+        any_attempted = any(
+            int(ts.get("rows_attempted", 0)) > 0
+            for ts in table_statuses_dict.values()
+        )
+        any_insert_failed = any(
+            ts.get("insert_status") == "insert_failed"
+            for ts in table_statuses_dict.values()
+        )
+        if any_attempted and any_insert_failed:
+          error_code = "materialization_failed"
+          error_detail = (
+              "session extracted rows but every insert failed: "
+              + "; ".join(
+                  f"{name}: rows_attempted={ts.get('rows_attempted')}, "
+                  f"insert_status={ts.get('insert_status')!r}, "
+                  f"cleanup_status={ts.get('cleanup_status')!r}"
+                  for name, ts in sorted(table_statuses_dict.items())
+              )
+              + ". Common causes: dataset write-permission "
+              "regression on the runtime SA, schema mismatch the "
+              "binding-validate pre-flight didn't catch, or "
+              "streaming-buffer pinning blocking a delete-then-"
+              "insert cycle."
+          )
+        else:
+          error_code = "empty_extraction"
+          error_detail = (
+              "session materialized zero rows across every entity "
+              "table and no inserts were attempted; usually means "
+              "extraction (AI.GENERATE or compiled bundle) returned "
+              "an empty graph. Common causes: missing "
+              "roles/aiplatform.user on the runtime SA, transient "
+              "AI.GENERATE rate limit, or the session's events did "
+              "not contain any extractable ontology content."
+          )
+        session_results.append(
+            SessionResult(
+                session_id=session.session_id,
+                ok=False,
+                completion_timestamp=session.completion_timestamp,
+                rows_materialized=dict(mat.row_counts or {}),
+                table_statuses=table_statuses_dict,
+                error_code=error_code,
+                error_detail=error_detail,
+            )
+        )
+        break
       session_results.append(
           SessionResult(
               session_id=session.session_id,
@@ -1266,18 +1341,30 @@ def _build_result(
   # clean ``deleted``. Operators rely on the report as the
   # "did anything go wrong" signal — any delete failure must
   # bubble up.
+  # Aggregate ``table_statuses`` from EVERY session — including
+  # failed ones. The status surface ("did the delete succeed?
+  # did the insert succeed?") is operator-visible regardless of
+  # session-level success. Dropping failed sessions' statuses
+  # was the gap #167's reviewer flagged: when a session fails
+  # with ``materialization_failed``, the per-table diagnostic
+  # (rows_attempted=N, insert_status=insert_failed) belongs in
+  # the report so operators don't have to dig into log payloads
+  # to see which table broke.
+  #
+  # ``rows_materialized`` still aggregates only successful
+  # sessions — that's the "rows that actually landed" view.
   table_statuses_agg: dict[str, dict[str, Any]] = {}
   for r in session_results:
     if r.ok:
       for table, n in r.rows_materialized.items():
         rows_materialized[table] = rows_materialized.get(table, 0) + n
-      for table, ts in r.table_statuses.items():
-        if table in table_statuses_agg:
-          table_statuses_agg[table] = _merge_table_status(
-              table_statuses_agg[table], ts
-          )
-        else:
-          table_statuses_agg[table] = dict(ts)
+    for table, ts in r.table_statuses.items():
+      if table in table_statuses_agg:
+        table_statuses_agg[table] = _merge_table_status(
+            table_statuses_agg[table], ts
+        )
+      else:
+        table_statuses_agg[table] = dict(ts)
 
   failures = [
       {
diff --git a/tests/test_materialize_window.py b/tests/test_materialize_window.py