Commit d830398

examples(migration_v5): live Cloud Run deploy verification + IAM fixes (#166)
Follow-up to PR #165. Actually deployed the periodic-materialization Cloud Run Job + Cloud Scheduler trigger against ``test-project-0728-467323``, captured the evidence, and fixed four real-world issues the live run surfaced.

## Issues fixed by this PR (all discovered by live verification)

1. **IAM propagation race.** ``gcloud iam service-accounts create`` returns success when one IAM replica has the SA, but ``gcloud projects add-iam-policy-binding`` reads from a different replica that can lag several seconds, producing ``INVALID_ARGUMENT: Service account ... does not exist`` on the immediate next grant. Added a ``_retry_iam`` helper that retries IAM commands on the "does not exist" error class with backoff.
2. **``bq add-iam-policy-binding`` requires allowlisting on some projects.** Switched dataset-level IAM grants to a small Python heredoc using the ``google-cloud-bigquery`` client's ``AccessEntry`` API (already a dependency of the local dry-run install). Portable, idempotent, no project-allowlist requirement.
3. **Buildpacks needs ``web:`` in Procfile + no ``--command/--args`` override.** Without a Procfile, Buildpacks fails with "provide a main.py or app.py file or set an entrypoint". With a ``job:`` Procfile, it fails with "web process not found in Procfile". Even with a ``web:`` Procfile, ``--command python --args run_job.py`` skips Buildpacks' venv-activation wrapper, leading to "Application failed to start: container exited abnormally" with no Python output. Fix: ``web:`` Procfile + no ``--command/--args``. The ``web:`` label is a Buildpacks convention; it doesn't imply HTTP service.
4. **Runtime SA needs ``roles/aiplatform.user`` for ``AI.GENERATE``.** The MAKO demo's extraction path calls ``AI.GENERATE`` (Gemini-backed entity extraction). Without this grant, the AI call returns "user does not have the permission to access resources used by AI.GENERATE" and the orchestrator silently extracts an empty graph for every session (looks ``ok=true`` in the report). Added the grant.

## Evidence captured

Documented in a new "Verified Cloud Run deployment evidence" section in the periodic_materialization README:

* Cloud Build image digest (vendored SDK).
* Cloud Scheduler trigger YAML (HTTP target, OAuth identity, schedule).
* Dataset IAM policies (events READER, graph WRITER, zero WRITE/OWNER bindings on events; read-only contract holds).
* ``materialization complete`` JSON payload from Cloud Logging: all 11 entity/relationship tables populated, ``cleanup_status=deleted, insert_status=inserted, idempotent=true`` across the board.
* State-table audit log with 3 successful runs: the ``--smoke`` execution (pre-aiplatform fix, silent failure), a manual re-execution post-fix (full materialization), and a real Cloud Scheduler cron firing at 06:02 UTC (proves the trigger end-to-end without manual intervention).

## Known issue surfaced (out of scope, tracked for SDK follow-up)

The orchestrator reports ``sessions_materialized == sessions_discovered`` and ``ok=true`` even when every per-event ``AI.GENERATE`` call failed. ``rows_materialized`` is empty in that case, but ``ok`` doesn't reflect the silent failure. Documented as a known issue in the README; workaround is to alert on ``jsonPayload.rows_materialized == {}``.

## Not in scope

* Terraform.
* Compiled-bundle materialization path.
* Backfill mode.
* SDK-side fix for the silent AI.GENERATE failure: flagged, not fixed here.
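
The retry pattern from issue 1 can be sketched in Python. This is an illustration only: the deploy script's real `_retry_iam` is a shell helper whose body is not shown in this commit, and the names `retry_on_not_exist` and `gcloud_runner`, the attempt count, and the delays below are all assumptions.

```python
import subprocess
import time


def retry_on_not_exist(run, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call run() until it succeeds, retrying only on 'does not exist' errors.

    run must return (returncode, stderr_text). Hypothetical helper, not the
    deploy script's _retry_iam.
    """
    for attempt in range(attempts):
        code, stderr = run()
        if code == 0:
            return True
        # Only the replica-lag error class is retried; anything else is final.
        if "does not exist" not in stderr:
            return False
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))  # exponential backoff
    return False


def gcloud_runner(args):
    """Wrap a gcloud invocation as a zero-argument callable for the retry loop."""
    def run():
        proc = subprocess.run(["gcloud", *args], capture_output=True, text=True)
        return proc.returncode, proc.stderr
    return run
```

A grant would then be wrapped as `retry_on_not_exist(gcloud_runner(["projects", "add-iam-policy-binding", project, "--member", member, "--role", role]))`. Matching on the error text keeps genuine failures such as a permission error from being retried.
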
1 parent 48bd74f commit d830398

2 files changed

Lines changed: 313 additions & 26 deletions

examples/migration_v5/periodic_materialization/README.md

Lines changed: 145 additions & 0 deletions
@@ -56,6 +56,15 @@ full prose.

  tables and the state/checkpoint table all live here.
* `gcloud` authenticated with permissions to deploy Cloud Run
  Jobs, create scheduler triggers, and grant IAM bindings.
* `python3` on PATH. The deploy script uses Python to apply
  dataset-level IAM via the BigQuery client's `AccessEntry`
  API (since `bq add-iam-policy-binding` requires project
  allowlisting in some environments). If your `python3` doesn't
  have `google-cloud-bigquery` installed, the script
  transparently creates a one-shot temp venv with it (no
  manual install required). If it does (e.g., you ran
  `pip install -e .` from the repo root), the script reuses
  that directly.
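
A dataset-level grant through the `AccessEntry` API might look like the sketch below. The script's actual heredoc is not shown in this diff, so the function names `has_entry` and `grant_dataset_role` are hypothetical, and the use of legacy role names (`READER`, `WRITER`) is an assumption.

```python
def has_entry(entries, role, entity_id):
    """True if the access list already binds entity_id to role."""
    return any(e.role == role and e.entity_id == entity_id for e in entries)


def grant_dataset_role(project, dataset_id, sa_email, role):
    """Add a userByEmail access entry for sa_email unless one already exists.

    role is a legacy dataset role such as "READER" or "WRITER" (assumed here).
    Returns True if the dataset was updated, False if the call was a no-op.
    """
    # Imported lazily so has_entry stays usable without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    dataset = client.get_dataset(f"{project}.{dataset_id}")
    entries = list(dataset.access_entries)
    if has_entry(entries, role, sa_email):
        return False  # idempotent: nothing to change on a re-run
    entries.append(
        bigquery.AccessEntry(
            role=role, entity_type="userByEmail", entity_id=sa_email
        )
    )
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])
    return True
```

Checking for an existing entry before appending is what makes a re-run of the deploy script safe under this sketch.
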

## Local dry-run

@@ -127,6 +136,14 @@ The script:

3. **Grants narrow IAM** to the SA:
   * Project-level `roles/bigquery.jobUser`:
     `bigquery.jobs.create` only.
   * Project-level `roles/aiplatform.user`: required because
     the MAKO demo's extraction path calls BigQuery's
     `AI.GENERATE` function (Gemini-backed entity extraction).
     Without this grant, the AI call returns "user does not
     have the permission to access resources used by
     AI.GENERATE" and the orchestrator silently extracts an
     empty graph for every session. Surfaced by the live
     verification in PR #166.
   * Dataset-level `roles/bigquery.dataViewer` on
     **events**: read-only access. The events dataset stays
     effectively read-only per the contract above.

@@ -270,6 +287,134 @@ also fail and the scheduled fire will be reported as failed in

Cloud Monitoring. Set up an alert on
`logging.googleapis.com/log_entry_count` with severity `ERROR`.

## Verified Cloud Run deployment evidence

This section documents an end-to-end live verification of the
deploy path against `test-project-0728-467323` (the
canonical SDK test project). The verification was the work of
PR #166 (follow-up to #165) and surfaced four real issues, all
fixed in `deploy_cloud_run_job.sh` before the evidence below
was captured. See the PR description for the full discovery
log.

**Inputs:**

* Events dataset: `migration_v5_idem_43c51d05` (3 demo
  sessions, 115 events, pre-populated by `run_agent.py` in PR
  #164).
* Graph dataset: `migration_v5_graph_verify_500c9f`
  (auto-created by deploy script).
* Job name: `bqaa-periodic-verify-500c9f`.
* Schedule: `0 */6 * * *`.
* Region: `us-central1`.

**Build + deploy:**

* Cloud Build image:
  `us-central1-docker.pkg.dev/test-project-0728-467323/cloud-run-source-deploy/bqaa-periodic-verify-500c9f@sha256:d1cd008…`.
* Built from the vendored `./sdk_src` (PR #165 contract):
  `Building bigquery-agent-analytics @ file:///workspace/sdk_src`
  `Built bigquery-agent-analytics @ file:///workspace/sdk_src`.
* Build time: ~4 min (Cloud Build + Buildpacks).

**Cloud Scheduler trigger** (`gcloud scheduler jobs describe`):

```yaml
httpTarget:
  httpMethod: POST
  oauthToken:
    scope: https://www.googleapis.com/auth/cloud-platform
    serviceAccountEmail: bqaa-periodic-sa@test-project-0728-467323.iam.gserviceaccount.com
  uri: https://us-central1-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/test-project-0728-467323/jobs/bqaa-periodic-verify-500c9f:run
schedule: 0 */6 * * *
state: ENABLED
```

The OAuth identity matches the runtime SA: the same SA serves
both runtime and scheduler-caller as designed.

**IAM contract** verified post-deploy via the BigQuery client:

```
Events dataset IAM for SA:
  role=READER, entity_type=userByEmail, entity_id=bqaa-periodic-sa@…
  Number of WRITE/OWNER bindings for SA: 0

Graph dataset IAM for SA:
  role=WRITER, entity_type=userByEmail, entity_id=bqaa-periodic-sa@…
```

The SA can read events but cannot write, so the "events dataset
read-only" contract holds at the IAM layer.
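
A contract check along these lines could be scripted as in the sketch below. It is not the verification code used for the output above, which this diff does not show; `roles_for` and `check_contract` are hypothetical names, and the sketch assumes the SA appears only as a direct `userByEmail` entry with legacy role names, so access inherited through groups or project-level IAM is not covered.

```python
WRITE_ROLES = {"WRITER", "OWNER"}


def roles_for(entries, sa_email):
    """Return the set of dataset roles bound directly to sa_email."""
    return {e.role for e in entries if e.entity_id == sa_email}


def check_contract(events_entries, graph_entries, sa_email):
    """Return a list of violations; an empty list means the contract holds."""
    problems = []
    events_roles = roles_for(events_entries, sa_email)
    if "READER" not in events_roles:
        problems.append("SA cannot read the events dataset")
    if events_roles & WRITE_ROLES:
        problems.append("SA has write access to the events dataset")
    if "WRITER" not in roles_for(graph_entries, sa_email):
        problems.append("SA cannot write the graph dataset")
    return problems
```

Both entry lists would come from `client.get_dataset(...).access_entries` in the `google-cloud-bigquery` client.
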

**Successful execution** (`materialization complete` payload
from Cloud Logging, after the post-deploy `roles/aiplatform.user`
grant that the verification added to the deploy script):

```json
{
  "run_id": "2d52338e16db",
  "sessions_discovered": 3,
  "sessions_materialized": 3,
  "sessions_failed": 0,
  "rows_materialized": {
    "DecisionExecution": 3,
    "DecisionPoint": 3,
    "Candidate": 11,
    "SelectionOutcome": 3,
    "ContextSnapshot": 3,
    "evaluatesCandidate": 11,
    "selectedCandidate": 3,
    "rejectedCandidate": 5,
    "atContextSnapshot": 3,
    "executedAtDecisionPoint": 3,
    "hasSelectionOutcome": 3
  },
  "ok": true,
  "failures": []
}
```

All 11 entity/relationship tables populated.
`cleanup_status=deleted, insert_status=inserted,
idempotent=true` across the board.

**Scheduler trigger actually fires.** The cron-scheduled fire
at `2026-05-16T06:00:00Z` produced a third state-table row
(`run_id 4725ebd79060`) with the same shape, proving the
end-to-end path from Cloud Scheduler → Cloud Run Job →
materialization works without manual intervention.

**State table audit log** (`_bqaa_materialization_state` in
the graph dataset):

```
run_id        run_started_at       sessions_disc / mat / failed  ok
ff1e956df8b8  2026-05-16 04:38:59  3 / 3 / 0                     true
2d52338e16db  2026-05-16 04:48:45  3 / 3 / 0                     true
4725ebd79060  2026-05-16 06:02:51  3 / 3 / 0                     true
```

(Row 1 is the deploy script's `--smoke` execution, which ran
BEFORE the verification added `roles/aiplatform.user` to the
deploy. AI.GENERATE failed for every session there, but the
orchestrator still reported `ok=true` with empty
`rows_materialized`; see the known issue below.)

### Known issue surfaced by this verification

The orchestrator currently reports `sessions_materialized ==
sessions_discovered` and `ok=true` even when every per-event
`AI.GENERATE` call failed. The `rows_materialized` dict is
empty in that case, but `sessions_materialized` doesn't
reflect the underlying failure. Operators monitoring on
`jsonPayload.ok` would miss the silent extraction failure.

Tracked for SDK follow-up; out of scope for this example PR.
Workaround: alert on
`jsonPayload.rows_materialized == {}` in Cloud Logging /
Monitoring as a second-line check.
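
The same second-line check can be applied to a payload in code, for example in a post-run script. This is a minimal sketch: the field names come from the payload shown earlier, the function name `is_silent_failure` is hypothetical, and treating all-zero row counts the same as an empty dict is an assumption.

```python
def is_silent_failure(payload):
    """True when a run claims success but materialized no rows at all."""
    rows = payload.get("rows_materialized") or {}
    discovered = payload.get("sessions_discovered", 0)
    # ok=true with sessions discovered but zero rows written is the pattern
    # produced by the failed AI.GENERATE calls described above.
    return bool(payload.get("ok")) and discovered > 0 and sum(rows.values()) == 0
```

For the `--smoke` run described above (`ok=true`, 3 sessions discovered, empty `rows_materialized`) this returns `True`; for the successful payload it returns `False`. A run that discovers no sessions is not flagged, since empty output is expected there.
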

## Not in scope here

* **Terraform / Pulumi.** A scripted deploy is easier to read
