docs(migration_v5): customer playbook for periodic materialization (#168)
Next-PR-in-sequence after #167. The deploy script + verification
in #165/#166 are now real; this PR makes the customer-facing
docs match. No code changes.
## What changes
### ``examples/migration_v5/README.md``
Replaced the brief "Run periodic materialization" section with
a more direct "Run this every N hours in production" section.
Walks the customer through the actual sequence: get events →
dry-run → deploy → verify → alert. Links into the playbook for
each step. Same one-command deploy snippet stays prominent.
### ``examples/migration_v5/periodic_materialization/README.md``
Restructured the operational README around a top-of-doc
**customer playbook** index table (steps 0-9, each linked to a
section). New / promoted sections, in playbook order:
* **Required APIs** — exact ``gcloud services enable`` for the
five APIs the deploy touches (BigQuery, Run, Scheduler,
Build, AI Platform).
* **Required IAM** — matrix of every role the runtime SA needs:
project-level (``jobUser``, ``aiplatform.user``,
``run.invoker`` on the job) + dataset-level (``dataViewer``
on events, ``dataEditor`` on graph). Names each role, its
scope, and the reason it is needed.
* **Dataset roles (events vs graph)** — explicit table of the
two datasets' purpose, lifecycle, and IAM. Codifies the
read-only-events / read-write-graph contract that's enforced
by the IAM grants. Notes the multi-app pattern (one job per
app, isolated state_keys).
* **Recommended schedules** — concrete cron + lookback +
overlap recommendations for 1-hour / 6-hour / daily latency
targets, with the rationale for each. The 6-hour row is the
deploy script's default.
* **Expected JSON log shape** — full example
``materialization complete`` payload as it appears in
``jsonPayload``. Documents every field a customer pivots
alerts / queries on.
* **Cloud Monitoring alerts** — ``gcloud logging metrics
create`` + ``gcloud alpha monitoring policies create`` for
the primary ``ok=false`` alert. Drill-down queries for the
two error codes (``empty_extraction``,
``materialization_failed``). Reflects #167's classifier.
* **Cleanup and redeploy** — redeploy idempotency notes + the
three-resource teardown sequence in the right order
(Scheduler → Cloud Run Job → optional dataset drop).
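The teardown order above can be sketched as a small helper that only builds the commands, so the ordering is easy to review before anything is deleted. This is a minimal sketch, not the playbook's actual snippet: the function name, the resource names, and the assumption that the Scheduler job lives in the same region as the Cloud Run Job are all hypothetical.

```python
from typing import List, Optional


def teardown_commands(
    project: str,
    region: str,
    scheduler_job: str,
    run_job: str,
    graph_dataset: Optional[str] = None,
) -> List[List[str]]:
    """Build teardown commands in order: Scheduler, Cloud Run Job, dataset.

    Assumption: the Scheduler job and the Cloud Run Job share one region.
    Nothing is executed here; the caller decides how to run the commands.
    """
    commands = [
        # 1. Delete the trigger first so no run fires during teardown.
        ["gcloud", "scheduler", "jobs", "delete", scheduler_job,
         f"--location={region}", f"--project={project}", "--quiet"],
        # 2. Delete the Cloud Run Job once nothing can invoke it.
        ["gcloud", "run", "jobs", "delete", run_job,
         f"--region={region}", f"--project={project}", "--quiet"],
    ]
    if graph_dataset is not None:
        # 3. Optional and destructive: drops the graph tables and the
        #    co-located state table along with the dataset.
        commands.append(
            ["bq", "rm", "-r", "-f", "-d", f"{project}:{graph_dataset}"]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for cmd in teardown_commands(
        "my-project", "us-central1", "bqaa-trigger", "bqaa-job"
    ):
        print(" ".join(cmd))
```

Leaving `graph_dataset` unset keeps the dataset, which matches the "optional dataset drop" wording; the events dataset is never touched.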
### What's NOT in this PR
Per the reviewer's plan:
* No ``doctor`` / ``--check-deploy-inputs`` command. Wait for
customers to hit friction first; let that friction shape what
the doctor checks.
* No Terraform / Pulumi templates.
* No SDK behavior changes (#167 was the last code change in
this thread).
## Verification
* Markdown renders cleanly, including the playbook
table-of-contents links (manual spot-check).
* No stale ``rows_materialized == {}`` workaround language —
the only mention is the explicit "no second-line check
needed" callout that documents the obsolete pattern.
* ``gcloud`` / ``bq`` command snippets are copy-pasteable.
examples/migration_v5/README.md (18 additions & 25 deletions)
@@ -168,39 +168,32 @@ A live end-to-end notebook run (`run_agent.py --sessions 3` + Beat 1–4 cells a
 - **Beat 3.6**: synthetic `ExtractedGraph` triggers all three `FallbackScope` failures (`NODE + FIELD + EDGE`).
 - **Beat 4**: concept index emitted + applied; `LabelSynonymResolver.resolve("DecisionExecution")` returns 1 candidate with a 12-hex `compile_id`; `GRAPH_TABLE` count over the user-authored property graph is non-zero. Hub-shape `(DecisionExecution)-[partOfSession]->(AgentSession)` returns at least one row per current session: the compiled extractor wired in Beat 3.5 synthesizes the envelope-side `AgentSession` + `partOfSession`.

-## Run periodic materialization (Cloud Run Job + Cloud Scheduler)
+## Run this every N hours in production

-The notebook materializes a graph once, ad hoc. Real deployments want the graph kept fresh on a cron. `examples/migration_v5/periodic_materialization/` packages `bqaa-materialize-window` as a Cloud Run Job + Cloud Scheduler trigger, using the v5 demo binding (re-targeted to the customer's project at runtime).
+The notebook walks through the four guarantees once, ad hoc. Real deployments want the graph kept fresh on a cron: events arrive continuously, the materialized entity/relationship tables should follow within a chosen latency budget.

-**Local dry-run**: exercise the path against your own BigQuery without deploying:
+[`periodic_materialization/`](./periodic_materialization/) is the production path: a packaged Cloud Run Job + Cloud Scheduler trigger that runs `bqaa-materialize-window` every N hours against your project, using the v5 demo binding (retargeted to your `(project, graph_dataset)` at deploy time).
**Deploy** — one command, with `--smoke` to verify the deploy by running the job once and tailing logs:
+1. **Get events**: point at your existing `agent_events` table (the BQ AA plugin already writes here; if you don't have one yet, seed via `python examples/migration_v5/run_agent.py --sessions 3` against a scratch dataset).
+2. **Local dry-run**: `python periodic_materialization/run_job.py` with env vars. Same code path as the deployed job, no Cloud Run required. Verifies your IAM / dataset setup before paying for a deploy.
The deploy script bundles `run_job.py` + the demo artifacts (`ontology.yaml`, `binding.yaml`, `table_ddl.sql`) + `requirements.txt` into a staging dir, deploys via `gcloud run jobs deploy --source`, creates a service account, grants `roles/run.invoker`, and wires the Cloud Scheduler HTTP trigger.
+Deploys the Cloud Run Job, creates the runtime service account with narrow IAM (events-READ, graph-WRITE), wires the Cloud Scheduler trigger, and runs `--smoke` to verify in one shot.

-The job's JSON report (per run) lands in Cloud Logging as a structured entry. Filter on `resource.labels.job_name` for the materialization audit log. The state table at `<graph_dataset>._bqaa_materialization_state` is a queryable history.
+4. **Verify**: Cloud Logging shows the JSON report on every run (`jsonPayload.ok`, `sessions_materialized`, `rows_materialized`, per-table `table_statuses`). The state table at `<graph_dataset>._bqaa_materialization_state` is a queryable audit log.
+5. **Alert**: Cloud Monitoring on `severity=ERROR` OR `jsonPayload.ok=false`. The `jsonPayload.failures[].error_code` distinguishes `empty_extraction` (AI/IAM) from `materialization_failed` (schema/write-perm).

-See [`periodic_materialization/README.md`](./periodic_materialization/README.md) for the full operational contract, env-var reference, and troubleshooting notes.
+See **[`periodic_materialization/README.md`](./periodic_materialization/README.md)** for the full customer playbook: required APIs, IAM matrix, recommended schedules per latency target, Cloud Monitoring alert queries, state-table SQL, troubleshooting, and live-deployment evidence captured against the canonical test project.
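The Verify and Alert steps in the README diff above name the fields a customer would pivot on (`ok`, `failures[].error_code`). A minimal triage sketch over one such payload might look like the following. The field names come from the README text; the sample payload, the `triage` helper, and the hint wording are hypothetical and do not reproduce the job's real log output.

```python
from typing import Any, Dict, List

# Hints paraphrased from the README's Alert step; illustrative only.
ERROR_HINTS = {
    "empty_extraction": "check AI access / IAM",
    "materialization_failed": "check schema / write permissions",
}


def triage(payload: Dict[str, Any]) -> List[str]:
    """Return one hint per failure, or an empty list for a healthy run."""
    if payload.get("ok", False):
        return []
    hints = []
    for failure in payload.get("failures", []):
        code = failure.get("error_code", "unknown")
        hint = ERROR_HINTS.get(code, "unrecognized error code")
        hints.append(f"{code}: {hint}")
    # ok=false with an empty failures list is still worth surfacing.
    return hints or ["ok=false with no failures listed"]


if __name__ == "__main__":
    # Hypothetical payload shaped like the fields the README names.
    sample = {
        "ok": False,
        "sessions_materialized": 0,
        "rows_materialized": {},
        "failures": [{"error_code": "empty_extraction"}],
    }
    for line in triage(sample):
        print(line)
```

Keying on `ok` first and on `error_code` second mirrors the alert layout in the README: one primary `ok=false` alert, with the error code used only for drill-down.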
 with a populated `agent_events` table. The BQ AA plugin writes
 to this; if you've never run an agent against BQAA, seed one
@@ -66,6 +131,31 @@ full prose.
 `pip install -e .` from the repo root), the script reuses
 that directly.

+## Dataset roles (events vs graph)
+
+Two distinct datasets, two distinct lifecycles. The deploy
+script enforces this at the IAM layer (events READER, graph
+EDITOR) so a misconfigured run can't write to the events
+dataset.
+
+| Dataset | Holds | Lifecycle | IAM for runtime SA |
+|---|---|---|---|
+| **Events** (`BQAA_EVENTS_DATASET_ID`) | `agent_events` table: raw event stream written by the BQ AA plugin. | Owned + written by the agent runtime, never the materialization job. Pre-existing. | `roles/bigquery.dataViewer` only (read) |
+| **Graph** (`BQAA_GRAPH_DATASET_ID`) | Entity tables (`decision_execution`, `candidate`, …), relationship tables (`evaluates_candidate`, …), and the `_bqaa_materialization_state` audit/checkpoint table. | Created by the deploy script's `bq mk` if missing. Owned by the materialization job. | `roles/bigquery.dataEditor` (read + write) |
+
+The events dataset is **read-only** for the periodic job. The
+graph dataset is the only write target. The state table is
+co-located with the graph dataset; that decision means a
|**~1 hour**|`0 * * * *`|`2`|`15`| Tight window + small overlap. Best for low-latency monitoring use cases. Higher BQ cost per day (24 runs). |
+|**~6 hours**|`0 */6 * * *`|`8`|`30`| Default in the deploy script. Good balance for dashboarding / reporting use cases. 4 runs per day. |
+|**Daily**|`0 2 * * *`|`30`|`60`| Catch-up window covers any late-arriving events from the prior day. 1 run per day. Pair with off-peak `02:00` to avoid contending with daytime BQ slots. |
+|**Backfill**| (manual) | depends | depends | For one-shot catch-up: `bqaa-materialize-window --lookback-hours $N` with N covering the gap. Defer to a future `--backfill --from/--to` mode once it ships. |
+
+Rules of thumb:
+
+* `lookback-hours` is an upper bound on history scanned, not
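The schedule rows in the diff above can be sanity-checked with a little arithmetic. The sketch below assumes the two numeric columns are lookback in hours and overlap in minutes (the table header is not visible in this excerpt), and that a sensible window should cover one cron interval plus the overlap. The `Schedule` class, the `interval_hours` values read off the cron strings, and the coverage rule itself are illustrative assumptions, not the deploy script's logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    target: str
    cron: str
    interval_hours: int   # assumption: derived by hand from the cron string
    lookback_hours: int   # assumption: third table column, in hours
    overlap_minutes: int  # assumption: fourth table column, in minutes

    @property
    def runs_per_day(self) -> int:
        return 24 // self.interval_hours

    @property
    def covers_interval(self) -> bool:
        # The lookback window should span one interval plus the overlap,
        # so consecutive runs leave no gap between their windows.
        needed = self.interval_hours * 60 + self.overlap_minutes
        return self.lookback_hours * 60 >= needed


# Values copied from the three scheduled rows of the table.
SCHEDULES = [
    Schedule("~1 hour", "0 * * * *", 1, 2, 15),
    Schedule("~6 hours", "0 */6 * * *", 6, 8, 30),
    Schedule("Daily", "0 2 * * *", 24, 30, 60),
]

if __name__ == "__main__":
    for s in SCHEDULES:
        print(s.target, s.cron, s.runs_per_day, s.covers_interval)
```

Under these assumptions each row leaves headroom beyond the cron interval (for example, 8 hours of lookback against a 6-hour interval plus 30 minutes of overlap), which is consistent with the table's rationale about late-arriving events.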