Commit c0501f1

docs(migration_v5): customer playbook for periodic materialization (#168)
Next-PR-in-sequence after #167. The deploy script + verification in #165/#166 are now real; this PR makes the customer-facing docs match. No code changes.

## What changes

### ``examples/migration_v5/README.md``

Replaced the brief "Run periodic materialization" section with a more direct "Run this every N hours in production" section. Walks the customer through the actual sequence: get events → dry-run → deploy → verify → alert. Links into the playbook for each step. Same one-command deploy snippet stays prominent.

### ``examples/migration_v5/periodic_materialization/README.md``

Restructured the operational README around a top-of-doc **customer playbook** index table (steps 0-9, each linked to a section). New / promoted sections, in playbook order:

* **Required APIs** — exact ``gcloud services enable`` for the five APIs the deploy touches (BigQuery, Run, Scheduler, Build, AI Platform).
* **Required IAM** — matrix of every role the runtime SA needs: project-level (``jobUser``, ``aiplatform.user``, ``run.invoker`` on the job) + dataset-level (``dataViewer`` on events, ``dataEditor`` on graph). Naming each role, scope, and reason.
* **Dataset roles (events vs graph)** — explicit table of the two datasets' purpose, lifecycle, and IAM. Codifies the read-only-events / read-write-graph contract that's enforced by the IAM grants. Notes the multi-app pattern (one job per app, isolated state_keys).
* **Recommended schedules** — concrete cron + lookback + overlap recommendations for 1-hour / 6-hour / daily latency targets, with the rationale for each. The 6-hour row is the deploy script's default.
* **Expected JSON log shape** — full example ``materialization complete`` payload as it appears in ``jsonPayload``. Documents every field a customer pivots alerts / queries on.
* **Cloud Monitoring alerts** — ``gcloud logging metrics create`` + ``gcloud alpha monitoring policies create`` for the primary ``ok=false`` alert. Drill-down queries for the two error codes (``empty_extraction``, ``materialization_failed``). Reflects #167's classifier.
* **Cleanup and redeploy** — redeploy idempotency notes + the three-resource teardown sequence in the right order (Scheduler → Cloud Run Job → optional dataset drop).

### What's NOT in this PR

Per the reviewer's plan:

* No ``doctor`` / ``--check-deploy-inputs`` command. Wait for customers to hit friction first; let that friction shape what the doctor checks.
* No Terraform / Pulumi templates.
* No SDK behavior changes (#167 was the last code change in this thread).

## Verification

* ``markdown`` renders cleanly in the playbook table-of-contents links (manual spot-check).
* No stale ``rows_materialized == {}`` workaround language — the only mention is the explicit "no second-line check needed" callout that documents the obsolete pattern.
* ``gcloud`` / ``bq`` command snippets are copy-pasteable.
1 parent d78a860 commit c0501f1

2 files changed

Lines changed: 289 additions & 28 deletions

examples/migration_v5/README.md

Lines changed: 18 additions & 25 deletions
Original file line number | Diff line number | Diff line change
@@ -168,39 +168,32 @@ A live end-to-end notebook run (`run_agent.py --sessions 3` + Beat 1–4 cells a
168168
- **Beat 3.6**: synthetic `ExtractedGraph` triggers all three `FallbackScope` failures (`NODE + FIELD + EDGE`).
169169
- **Beat 4**: concept index emitted + applied; `LabelSynonymResolver.resolve("DecisionExecution")` returns 1 candidate with a 12-hex `compile_id`; `GRAPH_TABLE` count over the user-authored property graph is non-zero. Hub-shape `(DecisionExecution)-[partOfSession]->(AgentSession)` returns at least one row per current session — the compiled extractor wired in Beat 3.5 synthesizes the envelope-side `AgentSession` + `partOfSession`.
170170

171-
## Run periodic materialization (Cloud Run Job + Cloud Scheduler)
171+
## Run this every N hours in production
172172

173-
The notebook materializes a graph once, ad hoc. Real deployments want the graph kept fresh on a cron. `examples/migration_v5/periodic_materialization/` packages `bqaa-materialize-window` as a Cloud Run Job + Cloud Scheduler trigger, using the v5 demo binding (re-targeted to the customer's project at runtime).
173+
The notebook walks through the four guarantees once, ad hoc. Real deployments want the graph kept fresh on a cron: events arrive continuously, and the materialized entity/relationship tables should follow within a chosen latency budget.
174174

175-
**Local dry-run** — exercise the path against your own BigQuery without deploying:
175+
[`periodic_materialization/`](./periodic_materialization/) is the production path: a packaged Cloud Run Job + Cloud Scheduler trigger that runs `bqaa-materialize-window` every N hours against your project, using the v5 demo binding (retargeted to your `(project, graph_dataset)` at deploy time).
176176

177-
```bash
178-
pip install -r examples/migration_v5/periodic_materialization/requirements.txt
179-
180-
BQAA_PROJECT_ID=your-project \
181-
BQAA_EVENTS_DATASET_ID=your_events_dataset \
182-
BQAA_GRAPH_DATASET_ID=your_graph_dataset \
183-
BQAA_LOOKBACK_HOURS=6 \
184-
python examples/migration_v5/periodic_materialization/run_job.py
185-
```
177+
The flow customers actually follow:
186178

187-
**Deploy** — one command, with `--smoke` to verify the deploy by running the job once and tailing logs:
179+
1. **Get events** — point at your existing `agent_events` table (the BQ AA plugin already writes here; if you don't have one yet, seed via `python examples/migration_v5/run_agent.py --sessions 3` against a scratch dataset).
180+
2. **Local dry-run** — `python periodic_materialization/run_job.py` with env vars. Same code path as the deployed job, no Cloud Run required. Verifies your IAM / dataset setup before paying for a deploy.
181+
3. **Deploy** — one command:
188182

189-
```bash
190-
./examples/migration_v5/periodic_materialization/deploy_cloud_run_job.sh \
191-
--project your-project \
192-
--region us-central1 \
193-
--events-dataset your_events_dataset \
194-
--graph-dataset your_graph_dataset \
195-
--schedule "0 */6 * * *" \
196-
--smoke
197-
```
183+
```bash
184+
./examples/migration_v5/periodic_materialization/deploy_cloud_run_job.sh \
185+
--project your-project --region us-central1 \
186+
--events-dataset your_events_dataset \
187+
--graph-dataset your_graph_dataset \
188+
--schedule "0 */6 * * *" --smoke
189+
```
198190

199-
The deploy script bundles `run_job.py` + the demo artifacts (`ontology.yaml`, `binding.yaml`, `table_ddl.sql`) + `requirements.txt` into a staging dir, deploys via `gcloud run jobs deploy --source`, creates a service account, grants `roles/run.invoker`, and wires the Cloud Scheduler HTTP trigger.
191+
Deploys the Cloud Run Job, creates the runtime service account with narrow IAM (events-READ, graph-WRITE), wires the Cloud Scheduler trigger, and, with `--smoke`, runs the job once to verify the deploy in one shot.
200192

201-
The job's JSON report (per run) lands in Cloud Logging as a structured entry. Filter on `resource.labels.job_name` for the materialization audit log. The state table at `<graph_dataset>._bqaa_materialization_state` is a queryable history.
193+
4. **Verify** — Cloud Logging shows the JSON report on every run (`jsonPayload.ok`, `sessions_materialized`, `rows_materialized`, per-table `table_statuses`). The state table at `<graph_dataset>._bqaa_materialization_state` is a queryable audit log.
194+
5. **Alert** — Cloud Monitoring on `severity=ERROR` OR `jsonPayload.ok=false`. The `jsonPayload.failures[].error_code` distinguishes `empty_extraction` (AI/IAM) from `materialization_failed` (schema/write-perm).
202195

203-
See [`periodic_materialization/README.md`](./periodic_materialization/README.md) for the full operational contract, env-var reference, and troubleshooting notes.
196+
See **[`periodic_materialization/README.md`](./periodic_materialization/README.md)** for the full customer playbook: required APIs, IAM matrix, recommended schedules per latency target, Cloud Monitoring alert queries, state-table SQL, troubleshooting, and live-deployment evidence captured against the canonical test project.
204197

205198
## What's NOT in this commit
206199

examples/migration_v5/periodic_materialization/README.md

Lines changed: 271 additions & 3 deletions
@@ -10,6 +10,28 @@ N hours via Cloud Scheduler, materializes the last N hours of
1010
events into your graph dataset, and emits a structured JSON
1111
report to Cloud Logging.
1212

13+
## Customer playbook (skim this first)
14+
15+
The customer-facing sequence — every section below corresponds
16+
to a numbered step here:
17+
18+
| # | Step | Where |
19+
|---|------|-------|
20+
| 0 | Enable required APIs + grant runtime IAM | [Prerequisites](#prerequisites) |
21+
| 1 | Decide your **events** vs **graph** dataset names | [Datasets](#dataset-roles-events-vs-graph) |
22+
| 2 | Local dry-run against your project (no Cloud Run) | [Local dry-run](#local-dry-run) |
23+
| 3 | Pick a schedule | [Recommended schedules](#recommended-schedules) |
24+
| 4 | Deploy with `--smoke` (verifies in one shot) | [Deploy to Cloud Run + Cloud Scheduler](#deploy-to-cloud-run-cloud-scheduler) |
25+
| 5 | Read the JSON log shape per run | [Expected JSON log shape](#expected-json-log-shape) |
26+
| 6 | Wire the Cloud Monitoring alert on `ok=false` | [Cloud Monitoring alerts](#cloud-monitoring-alerts) |
27+
| 7 | Query the state-table audit log | [Inspecting results](#inspecting-results) |
28+
| 8 | If something looks wrong | [Failure-mode surface](#failure-mode-surface-post-167) + [Troubleshooting](#troubleshooting) |
29+
| 9 | Tear down or redeploy | [Cleanup and redeploy](#cleanup-and-redeploy) |
30+
31+
The rest of the doc has design rationale, the exact IAM matrix,
32+
operational notes, and the captured evidence from the live
33+
deployment verification in PR #166.
34+
1335
## Production shape
1436

1537
```
@@ -41,8 +63,51 @@ full prose.
4163

4264
## Prerequisites
4365

66+
### Required APIs
67+
68+
Enable these in the target project before deploying. The
69+
deploy script enables Cloud Scheduler itself if missing; the
70+
others should already be on or operators need them anyway:
71+
72+
```bash
73+
gcloud services enable \
74+
bigquery.googleapis.com \
75+
run.googleapis.com \
76+
cloudscheduler.googleapis.com \
77+
cloudbuild.googleapis.com \
78+
aiplatform.googleapis.com \
79+
--project=your-project
80+
```
81+
82+
`aiplatform.googleapis.com` is required because the MAKO demo's
83+
extraction path calls `AI.GENERATE`. Without it, every session
84+
will fail with `error_code = "empty_extraction"` (surfaced as a
85+
hard `ok=false` since PR #167).
86+
87+
### Required IAM
88+
89+
The deploy script creates a single runtime service account
90+
(`bqaa-periodic-sa@PROJECT.iam.gserviceaccount.com`) and grants
91+
the minimum set of roles. Operators with appropriate write
92+
permissions can override or split the SAs later — the structure
93+
makes it a small edit.
94+
95+
| Scope | Role | Why |
96+
|---|---|---|
97+
| Project | `roles/bigquery.jobUser` | Run BQ jobs (DDL, discovery, state writes) |
98+
| Project | `roles/aiplatform.user` | Call `AI.GENERATE` for entity extraction |
99+
| Cloud Run Job | `roles/run.invoker` | Cloud Scheduler invokes the job. Granted on the specific job resource, not project-wide |
100+
| Events DS | `roles/bigquery.dataViewer` | Read `agent_events`; events stay read-only |
101+
| Graph DS | `roles/bigquery.dataEditor` | Write entity/relationship tables + `_bqaa_materialization_state` |
102+
103+
The deploy script handles every grant. For production
104+
hardening, split the runtime SA from the scheduler-caller SA;
105+
the script's structure makes that a small edit.
106+
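The matrix above can also be treated as data. The following is a minimal sketch, not part of the deploy script: it encodes the five rows as `(scope, role)` pairs and reports which ones are absent from a set of grants. The names `REQUIRED_GRANTS` and `missing_grants` and the scope labels are hypothetical, and how you obtain the granted set (for example by parsing IAM policy output) is left out.

```python
# Hypothetical helper: the required (scope, role) pairs copied from the table above.
REQUIRED_GRANTS = frozenset({
    ("project", "roles/bigquery.jobUser"),
    ("project", "roles/aiplatform.user"),
    ("cloud_run_job", "roles/run.invoker"),
    ("events_dataset", "roles/bigquery.dataViewer"),
    ("graph_dataset", "roles/bigquery.dataEditor"),
})


def missing_grants(granted):
    """Return the required (scope, role) pairs not present in `granted`, sorted."""
    return sorted(REQUIRED_GRANTS - set(granted))
```

An empty result means every row of the matrix is covered; anything returned names the scope and role still to be granted.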
107+
### General prerequisites
108+
44109
* GCP project with the BigQuery, Cloud Run, Cloud Scheduler, and
45-
Cloud Build APIs enabled.
110+
Cloud Build APIs enabled, plus AI Platform (per above).
46111
* **Events dataset** (`BQAA_EVENTS_DATASET_ID`) already exists
47112
with a populated `agent_events` table. The BQ AA plugin writes
48113
to this; if you've never run an agent against BQAA, seed one
@@ -66,6 +131,31 @@ full prose.
66131
`pip install -e .` from the repo root), the script reuses
67132
that directly.
68133

134+
## Dataset roles (events vs graph)
135+
136+
Two distinct datasets, two distinct lifecycles. The deploy
137+
script enforces this at the IAM layer (events READER, graph
138+
EDITOR) so a misconfigured run can't write to the events
139+
dataset.
140+
141+
| Dataset | Holds | Lifecycle | IAM for runtime SA |
142+
|---|---|---|---|
143+
| **Events** (`BQAA_EVENTS_DATASET_ID`) | `agent_events` table — raw event stream written by the BQ AA plugin. | Owned + written by the agent runtime, never the materialization job. Pre-existing. | `roles/bigquery.dataViewer` only (read) |
144+
| **Graph** (`BQAA_GRAPH_DATASET_ID`) | Entity tables (`decision_execution`, `candidate`, …), relationship tables (`evaluates_candidate`, …), and the `_bqaa_materialization_state` audit/checkpoint table. | Created by the deploy script's `bq mk` if missing. Owned by the materialization job. | `roles/bigquery.dataEditor` (read + write) |
145+
146+
The events dataset is **read-only** for the periodic job. The
147+
graph dataset is the only write target. The state table is
148+
co-located with the graph dataset — that decision means a
149+
predicate switch (e.g., swapping `--completion-event-type`)
150+
auto-invalidates the checkpoint because the `state_key` SHA
151+
changes; see the orchestrator design contract for the full
152+
treatment.
153+
154+
If you have multiple agent applications writing to one events
155+
dataset, run one materialization job per application with
156+
different graph datasets. State-keys won't collide; checkpoints
157+
stay isolated.
158+
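To illustrate why distinct graph datasets or a changed completion predicate lead to distinct checkpoints, here is a toy sketch. It assumes the state key is a SHA-256 digest over the job's identifying configuration; the function name, the field names, and the sample event-type strings are all hypothetical and are not the orchestrator's actual derivation, which is covered by the design contract.

```python
import hashlib
import json


def toy_state_key(project, graph_dataset, completion_event_type):
    """Hash a canonical JSON encoding of the (hypothetical) identifying fields."""
    identity = {
        "project": project,
        "graph_dataset": graph_dataset,
        "completion_event_type": completion_event_type,
    }
    # sort_keys + fixed separators make the encoding stable across runs.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two applications sharing one events dataset but writing to different graph datasets.
key_app_a = toy_state_key("your-project", "graph_app_a", "EXAMPLE_COMPLETED")
key_app_b = toy_state_key("your-project", "graph_app_b", "EXAMPLE_COMPLETED")
# Same application after swapping the completion predicate.
key_app_a_switched = toy_state_key("your-project", "graph_app_a", "EXAMPLE_OTHER")
```

Under this assumption the three keys differ from one another, so each configuration reads and writes its own checkpoint row.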
69159
## Local dry-run
70160

71161
Run the job once on your laptop against a real BigQuery project —
@@ -104,6 +194,34 @@ Exit codes mirror the SDK CLI:
104194
binding-validate detected schema drift against live BigQuery.
105195
* `2` — unexpected internal error (config missing, code bug).
106196

197+
## Recommended schedules
198+
199+
Pick a schedule + window based on how stale you can tolerate
200+
the graph being. The orchestrator's overlap-windowed re-scan
201+
catches late-arriving events; pair `overlap-minutes` with the
202+
ingestion lag you actually see on your `agent_events` stream
203+
(if your plugin writes are synchronous, 15 minutes is plenty; if
204+
events trickle in over hours, scale up).
205+
206+
| Latency target | Cron | `--lookback-hours` | `--overlap-minutes` | Notes |
207+
|---|---|---|---|---|
208+
| **~1 hour** | `0 * * * *` | `2` | `15` | Tight window + small overlap. Best for low-latency monitoring use cases. Higher BQ cost per day (24 runs). |
209+
| **~6 hours** | `0 */6 * * *` | `8` | `30` | Default in the deploy script. Good balance for dashboarding / reporting use cases. 4 runs per day. |
210+
| **Daily** | `0 2 * * *` | `30` | `60` | Catch-up window covers any late-arriving events from the prior day. 1 run per day. Pair with off-peak `02:00` to avoid contending with daytime BQ slots. |
211+
| **Backfill** | (manual) | depends | depends | For one-shot catch-up: `bqaa-materialize-window --lookback-hours $N` with N covering the gap. Defer to a future `--backfill --from/--to` mode once it ships. |
212+
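One way to sanity-check a row before deploying is to confirm that the lookback window is at least as long as the gap between runs plus the overlap, so that a window can still reach back past the previous run. This rule is an assumption drawn from the table rows above, not a check the deploy script performs, and `covers_interval` is a hypothetical name.

```python
def covers_interval(interval_hours, lookback_hours, overlap_minutes):
    """True if the lookback spans one scheduling interval plus the overlap."""
    return lookback_hours * 60 >= interval_hours * 60 + overlap_minutes


# The three scheduled rows from the table:
# (interval_hours, lookback_hours, overlap_minutes).
TABLE_ROWS = {
    "~1 hour": (1, 2, 15),
    "~6 hours": (6, 8, 30),
    "Daily": (24, 30, 60),
}

all_rows_ok = all(covers_interval(*row) for row in TABLE_ROWS.values())
```

A 6-hour cron paired with a 6-hour lookback and a 30-minute overlap would fail this check, which is one reason the table pads the lookback beyond the cron interval.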
213+
Rules of thumb:
214+
215+
* `lookback-hours` is an upper bound on history scanned, not
216+
the typical scan size. The orchestrator scans
217+
`[max(checkpoint - overlap, run_started - lookback), run_started)`,
218+
so steady-state runs scan only the new + overlap window.
219+
* `overlap-minutes` should cover your ingestion's worst-case
220+
lag. Conservative is fine — the materializer is idempotent
221+
on the session-id boundary.
222+
* `max-sessions` (optional) caps per-run cost. Useful for the
223+
first few runs against a large backlog.
224+
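The interval in the first rule can be written out directly. This is a minimal sketch of that formula only, assuming timezone-aware datetimes and assuming that a missing checkpoint (a first run) falls back to the full lookback; `scan_window` is a hypothetical name, not the orchestrator's API.

```python
from datetime import datetime, timedelta, timezone


def scan_window(run_started, checkpoint, lookback_hours, overlap_minutes):
    """Return (start, end) for
    [max(checkpoint - overlap, run_started - lookback), run_started)."""
    lookback_floor = run_started - timedelta(hours=lookback_hours)
    if checkpoint is None:
        # Assumption: with no checkpoint yet, scan the full lookback.
        return lookback_floor, run_started
    overlap_start = checkpoint - timedelta(minutes=overlap_minutes)
    return max(overlap_start, lookback_floor), run_started


# Steady state on the 6-hour row: the previous run checkpointed at 06:00 UTC.
run_started = datetime(2026, 5, 16, 12, 0, tzinfo=timezone.utc)
checkpoint = datetime(2026, 5, 16, 6, 0, tzinfo=timezone.utc)
start, end = scan_window(run_started, checkpoint, lookback_hours=8, overlap_minutes=30)
# start is 05:30 UTC (checkpoint minus overlap), later than the 04:00 UTC lookback floor.
```

With a checkpoint several days old, the same call returns the 04:00 UTC floor instead, which is what makes `lookback-hours` an upper bound on history scanned.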
107225
## Deploy to Cloud Run + Cloud Scheduler
108226

109227
One command:
@@ -199,6 +317,114 @@ Each entry includes:
199317
* `failures` — list of failed sessions with error codes.
200318
* `ok` — overall success boolean.
201319

320+
### Expected JSON log shape
321+
322+
A successful run looks like this in `jsonPayload`:
323+
324+
```json
325+
{
326+
"severity": "INFO",
327+
"message": "materialization complete",
328+
"run_id": "2d52338e16db",
329+
"state_key": "3bafe7195e806340bce25b565493d24de073518d2a1c299fb668dc4f86499e5c",
330+
"window_start": "2026-05-15T17:40:19.542872Z",
331+
"window_end": "2026-05-16T04:48:45.518518Z",
332+
"checkpoint_read": "2026-05-15T17:55:19.542872Z",
333+
"checkpoint_written": "2026-05-15T17:55:19.542872Z",
334+
"sessions_discovered": 3,
335+
"sessions_materialized": 3,
336+
"sessions_failed": 0,
337+
"rows_materialized": {
338+
"DecisionExecution": 3,
339+
"Candidate": 11,
340+
"...": "..."
341+
},
342+
"table_statuses": {
343+
"project.graph_ds.decision_execution": {
344+
"rows_attempted": 3,
345+
"rows_inserted": 3,
346+
"cleanup_status": "deleted",
347+
"insert_status": "inserted",
348+
"idempotent": true
349+
}
350+
},
351+
"compiled_outcomes": {
352+
"compiled_unchanged": 0,
353+
"compiled_filtered": 0,
354+
"fallback_for_event": 0
355+
},
356+
"failures": [],
357+
"ok": true
358+
}
359+
```
360+
361+
A failed run swaps `ok: true` for `ok: false` and populates
362+
`failures[].error_code` with either `empty_extraction` or
363+
`materialization_failed` — see
364+
[Failure-mode surface](#failure-mode-surface-post-167) for the
365+
distinction.
366+
367+
### Cloud Monitoring alerts
368+
369+
Wire a single log-based alert on `jsonPayload.ok=false`. With
370+
PR #167's classifier, this is the only signal needed —
371+
extraction failures and insert failures both surface here.
372+
373+
```bash
374+
# Create a log-based metric that counts failed runs.
375+
gcloud logging metrics create bqaa_periodic_failed_runs \
376+
--project=your-project \
377+
--description="Periodic materialization runs that reported ok=false." \
378+
--log-filter='resource.type="cloud_run_job"
379+
AND resource.labels.job_name="bqaa-periodic-materialization"
380+
AND jsonPayload.message="materialization complete"
381+
AND jsonPayload.ok=false'
382+
383+
# Then alert on the metric > 0 over a 1h window (any failed run
384+
# in the last hour fires the alert). The Cloud Monitoring UI is
385+
# the easier place to set the threshold; gcloud equivalent uses
386+
# the alpha command's ``--condition-filter`` + ``--if`` flags
387+
# (the older ``--threshold-value`` / ``--threshold-comparison``
388+
# pair was removed):
389+
gcloud alpha monitoring policies create \
390+
--project=your-project \
391+
--notification-channels=projects/your-project/notificationChannels/CHANNEL_ID \
392+
--display-name="BQAA periodic materialization failed" \
393+
--condition-display-name="ok=false runs in the last hour" \
394+
--condition-filter='metric.type="logging.googleapis.com/user/bqaa_periodic_failed_runs" AND resource.type="cloud_run_job"' \
395+
--aggregation='{"alignmentPeriod": "3600s", "perSeriesAligner": "ALIGN_SUM"}' \
396+
--if='> 0' \
397+
--duration=60s
398+
```
399+
400+
For drill-down on the failure mode, filter on the error code.
401+
``--freshness`` is the portable way to limit the time window
402+
(``date -u -v-1d`` is macOS-only; ``gcloud logging read`` accepts
403+
``--freshness=1d`` directly on every supported platform):
404+
405+
```bash
406+
# All AI / extraction failures in the last 24h.
407+
gcloud logging read \
408+
'resource.type="cloud_run_job"
409+
AND jsonPayload.failures.error_code="empty_extraction"' \
410+
--project=your-project \
411+
--freshness=1d \
412+
--limit=50
413+
414+
# All schema / write-perm failures in the last 24h.
415+
gcloud logging read \
416+
'resource.type="cloud_run_job"
417+
AND jsonPayload.failures.error_code="materialization_failed"' \
418+
--project=your-project \
419+
--freshness=1d \
420+
--limit=50
421+
```
422+
423+
Two distinct error codes → two distinct on-call runbooks. The
424+
`error_detail` field names the specific failing tables for
425+
`materialization_failed`, so the on-call doesn't have to
426+
correlate with separate logs to find what broke.
427+
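As a sketch of how an on-call tool might split a report by error code, the function below groups the `failures` entries of a `jsonPayload`. It assumes each entry is an object with an `error_code` key and, optionally, an `error_detail` key (the exact entry schema is not shown in this README). The runbook labels reuse the hints given elsewhere in these docs (AI/IAM vs schema/write-perm), and `triage` and `RUNBOOKS` are hypothetical names.

```python
from collections import defaultdict

# Labels mirror the documented hints for the two error codes.
RUNBOOKS = {
    "empty_extraction": "AI/IAM",
    "materialization_failed": "schema/write-perm",
}


def triage(payload):
    """Group failure details by runbook label; returns {} for an ok=true report."""
    if payload.get("ok", False):
        return {}
    grouped = defaultdict(list)
    for failure in payload.get("failures", []):
        label = RUNBOOKS.get(failure.get("error_code"), "unknown")
        # error_detail may be absent; None is recorded in that case.
        grouped[label].append(failure.get("error_detail"))
    return dict(grouped)
```

Any error code outside the two documented ones lands under `"unknown"`, so a new failure mode is visible rather than silently dropped.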
202428
**The state table.** Co-located with the graph dataset (NOT
203429
the events dataset — the events dataset stays read-only per
204430
the contract above). A real BQ table at
@@ -427,8 +653,50 @@ that look identical from `rows_materialized` alone:
427653
In both cases: `ok=false`, CLI exit 1, the cron run shows up
428654
as a failed execution in Cloud Monitoring. Alert directly on
429655
`jsonPayload.ok=false` plus `jsonPayload.failures[].error_code`
430-
for the failure-mode breakdown — no second-line
431-
`rows_materialized == {}` check needed.
656+
for the failure-mode breakdown — no second-line check needed.
657+
658+
## Cleanup and redeploy
659+
660+
### Redeploy (no resource churn)
661+
662+
Re-running the deploy script with the same flags is fully
663+
idempotent — same service account (`bqaa-periodic-sa`), same
664+
job name, same Scheduler trigger name, same graph dataset. The
665+
existing IAM bindings are detected and skipped (`already
666+
granted (READER)` etc.); only the container image gets rebuilt
667+
to reflect any source changes. Run after any code change to
668+
`run_job.py` or the demo artifacts.
669+
670+
### Tear down a deployment
671+
672+
Three resources to remove. Run in this order so the Scheduler
673+
doesn't try to invoke a deleted job between the two deletes:
674+
675+
```bash
676+
# 1. Stop the cron from firing.
677+
gcloud scheduler jobs delete bqaa-periodic-materialization-cron \
678+
--project=your-project --location=us-central1 --quiet
679+
680+
# 2. Delete the Cloud Run Job.
681+
gcloud run jobs delete bqaa-periodic-materialization \
682+
--project=your-project --region=us-central1 --quiet
683+
684+
# 3. (Optional) Drop the graph dataset — destroys ALL
685+
# materialized entity/relationship tables AND the state-table
686+
# audit log. Skip if you want to preserve history.
687+
bq --project_id=your-project rm -r -f your_graph_dataset
688+
```
689+
690+
The events dataset is never modified by the deploy and stays
691+
untouched. The runtime service account (`bqaa-periodic-sa`)
692+
persists across teardowns — drop it manually if you're
693+
permanently retiring the deployment:
694+
695+
```bash
696+
gcloud iam service-accounts delete \
697+
bqaa-periodic-sa@your-project.iam.gserviceaccount.com \
698+
--project=your-project --quiet
699+
```
432700

433701
## Not in scope here
434702
