Skip to content

Commit b79a6db

Browse files
Data Cloud Agents Teamcopybara-github
authored andcommitted
feat: add instructions for Google-provided templates to gcp-dataflow skill
This CL updates the `gcp_dataflow` skill definition to add guidelines for configuring and running Google-provided templates (Classic and Flex): - **Google-Provided Templates**: Added instructions for parameter validation (e.g., `badRecordsOutputTable` and determining error schemas) and UDF/SSL configuration. - **Job Execution & Monitoring**: Added a universal execution workflow (with mandatory pre-launch confirmation) and monitoring guidelines. - **Formatting**: Refactored inline links to reference-style links and wrapped long lines. PiperOrigin-RevId: 906494995
1 parent cb3a6e8 commit b79a6db

1 file changed

Lines changed: 250 additions & 35 deletions

File tree

skills/gcp-dataflow/SKILL.md

Lines changed: 250 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -1,26 +1,76 @@
11
---
22
name: gcp-dataflow
3-
description: |
4-
Guides writing, packaging, executing, and troubleshooting Apache Beam pipelines on Dataflow. Use when creating new pipelines, configuring Flex Templates, or analyzing performance of Dataflow jobs. Capabilities include Java/Python/Go setup, Cloud Build integration, and deep diagnostic analysis of job health and autoscaling.
5-
Use when: - Creating an Apache Beam Dataflow pipeline. - Creating a Google Flex Template. - Debugging Dataflow pipeline - Troubleshooting Dataflow pipeline - Analyzing Performance of Dataflow pipeline.
6-
Key capabilities include: Project setup for Java/Python/Go, Flex Template configuration (with Cloud Build support), and in-depth diagnostics for streaming job health, bottlenecks, and autoscaling.
7-
Do NOT use for: - General GCP resource management unrelated to Dataflow. - Issues with other GCP services (e.g., GCE, GCS, BigQuery) unless directly
3+
description: >
4+
Guides writing, packaging, executing, and troubleshooting Apache Beam
5+
pipelines on Dataflow. Use when creating new pipelines, configuring Flex
6+
Templates, or analyzing performance of Dataflow jobs. Capabilities include
7+
Java/Python/Go setup, Cloud Build integration, and deep diagnostic analysis
8+
of job health and autoscaling.
9+
10+
Use when:
11+
- Creating an Apache Beam Dataflow pipeline.
12+
- Creating a Google Dataflow Flex Template.
13+
- Using an existing Google Dataflow Template.
14+
- Debugging Dataflow pipeline
15+
- Troubleshooting Dataflow pipeline
16+
- Analyzing Performance of Dataflow pipeline.
17+
18+
Key capabilities: Java/Python/Go project setup, Flex Templates (with
19+
Cloud Build), and diagnostics for streaming job health, bottlenecks,
20+
and autoscaling.
21+
22+
Do NOT use for:
23+
- General GCP resource management unrelated to Dataflow.
24+
- Issues with other GCP services (e.g., GCE, GCS, BigQuery) unless directly
825
impacting Dataflow pipeline execution.
926
- Pipeline technologies other than Apache Beam on Dataflow.
27+
1028
license: Apache-2.0
1129
metadata:
12-
version: v3
30+
version: v4
1331
publisher: google
1432
---
1533

1634
# Apache Beam Pipelines on Cloud Dataflow
1735

18-
Expert guidance for writing and packaging Apache Beam pipelines to run on Google
19-
Cloud Dataflow.
36+
## Pipeline authoring
37+
38+
Use this section when implementing Dataflow pipeline logic using Apache Beam.
39+
40+
### Check if existing Google Dataflow Template exists
41+
42+
Google provides a variety of pre-built, open source Dataflow templates that can
43+
be used for common scenarios. Before implementing a pipeline from scratch, you
44+
MUST follow the steps below to check whether a Dataflow template for the
45+
pipeline logic you need to implement already exists.
46+
47+
- **Step 1: Check for a matching Google Dataflow Template**
2048

21-
## Creating a new project
49+
- Identify the **source** and **sink** (e.g., GCS to BigQuery) from the
50+
user's request. *Note*: You *MUST NOT* proceed until the source and sink
51+
are clearly identified.
52+
- **Action**: List templates in the public `dataflow-templates` bucket:
53+
* For Classic templates, check `gs://dataflow-templates/latest`.
54+
* For Flex templates, check `gs://dataflow-templates/latest/flex`. Use
55+
`gcloud storage ls` to list the contents.
56+
- Match templates by name or description to the source and sink.
57+
- If no matching template is found, go to **Create a new pipeline from
58+
scratch**.
2259

23-
Use this section when creating a new project for a Dataflow pipeline.
60+
- **Step 2: Confirm template selection**
61+
62+
- Present the matched template(s) to the user with a brief explanation of
63+
why they match, and make a note of whether it is a Classic or Flex
64+
template.
65+
- **Action**: Ask the user for explicit confirmation to proceed with this
66+
template.
67+
- If the user rejects or prefers a custom solution, proceed to **Create a
68+
new pipeline from scratch**.
69+
70+
### Create a new pipeline from scratch
71+
72+
Use this section when creating a new project for a Dataflow pipeline from
73+
scratch.
2474

2575
- If the user doesn't say explicitly which language (Java, Python, Go) shall
2676
be used to write the pipeline, you MUST confirm the language.
@@ -47,39 +97,162 @@ Use this section when configuring a Dataflow Java pipeline project using gradle.
4797
`logback-classic`, etc.) to exactly match the major/minor version of the
4898
resolved `slf4j-api`.
4999

50-
### Structure the pipeline as a Dataflow Flex Template
100+
### Packaging a pipeline as a Flex Template
51101

52-
When creating new Dataflow pipeline projects, configure them as a Flex template.
53-
Flex Templates offer a hermetic and reproducible launch environment, and are
54-
easy to launch with `gcloud` or with orchestrators like Cloud Composer.
102+
Use this section to package pipeline code as a Flex template.
55103

56-
Follow the Flex Templates section below.
104+
Flex Templates offer a hermetic and reproducible launch environment for a
105+
pipeline. They are easy to launch with `gcloud` or with orchestrators like Cloud
106+
Composer. You **MUST** package the pipeline as a Flex Template when creating new
107+
Dataflow pipeline projects.
57108

58-
## Flex Templates
109+
Follow the steps below:
59110

60111
- **Provide Instructions**: Provide instructions on rebuilding and running
61112
Flex Templates to the user in walkthrough.
62113
- **Use Single Docker Image for Python pipelines**: For Python Flex Templates,
63114
it is better to use a single image for the template launcher image and for
64-
the worker runtime environment (`--sdk_container_image`). Whenever
65-
configuring or suggesting a Dataflow Flex Template for a Python pipeline
66-
that requires extra dependencies (e.g., using `--requirements_file`,
67-
`--setup_file`, or `--extra_package`), **YOU MUST recommend the Single
68-
Docker Image Configuration** as detailed in
69-
[python_flex_template_reference.md](references/python_flex_template_reference.md).
115+
the worker runtime environment (`--sdk_container_image`). Does the Python
116+
pipeline require extra dependencies (e.g., using `--requirements_file`,
117+
`--setup_file`, or `--extra_package`)? If so, **YOU MUST recommend the**
118+
**Single Docker Image Configuration** for the Flex Template. See
119+
[python_flex_template_reference.md][py-flex-ref] for details.
70120
- **Prefer Cloud Build over Local Docker**:
71121
- Do NOT assume local Docker availability on the workspace machine.
72122
- **Action**: Suggest and provide `cloudbuild.yaml` out-of-the-box for
73123
building and pushing images unless local setup is explicitly requested.
74124
- When building images with Cloud Build in the background you MUST provide
75125
the link where the user can monitor the long-running operation.
76-
77-
## Launching Apache Beam Pipelines with Dataflow Runner
126+
- **Providing SSL certificates and Secrets to Workers**:
127+
- If certificates or keys are stored in Secret Manager, **NEVER** bake
128+
them into the Docker image layers. Instead, retrieve them dynamically at
129+
runtime inside the Apache Beam `DoFn.setup()` lifecycle using the Secret
130+
Manager client library (writing them to ephemeral worker disk like
131+
`/tmp` only if physical file paths are strictly required). Ensure the
132+
Dataflow Worker Service Account has the
133+
`roles/secretmanager.secretAccessor` role.
134+
135+
## Configuring Google-provided templates
136+
137+
Use this section when the user has selected a Google-provided template (Classic
138+
or Flex) and you need to configure it.
139+
140+
- **Step 1: Get template metadata**
141+
142+
- Identify template type:
143+
* **Classic**: Metadata files are in `gs://dataflow-templates` and end
144+
with `_metadata` (e.g.,
145+
`gs://dataflow-templates/latest/Word_Count_metadata`).
146+
* **Flex**: Metadata are embedded in the template spec file under
147+
`gs://dataflow-templates/latest/flex` (e.g.
148+
`gs://dataflow-templates/latest/flex/Cloud_Datastream_to_BigQuery`).
149+
- Read the corresponding template metadata file to identify required
150+
parameters.
151+
- **Note**:
152+
* Make sure to run a recursive search over the bucket if needed to
153+
locate the metadata.
154+
* If the template parameters include UDF-related fields (e.g.,
155+
`javascriptTextTransformGcsPath`,
156+
`javascriptTextTransformFunctionName`), refer to the
157+
[UDF guide][udf-guide] to write and configure the UDF.
158+
* **Parameter-Based SSL / Secret Staging**: If the Google-provided
159+
template requires local SSL certificates or Secret Manager secrets,
160+
pass comma-separated GCS paths via the `extraFilesToStage`
161+
parameter. The runner will drop them into `/extra_files` on worker
162+
VMs. Refer to the [SSL certificates guide][ssl-cert-guide] for local
163+
referencing syntax (`/extra_files/...`).
164+
165+
- **Step 2: Get network configuration**
166+
167+
- **Action**: Run `gcloud` commands to list networks and subnetworks.
168+
- Confirm the network and subnetwork to use with the user.
169+
170+
- **Step 3: Identify required parameters and prepare resources**
171+
172+
- Extract required parameters from the template metadata.
173+
- > [!IMPORTANT]
174+
- > **Strict parameter validation**: Any parameter in the metadata JSON
175+
- > that does **NOT** explicitly have `"isOptional": true` is **strictly
176+
- > required** by the Dataflow API.
177+
- > This applies even if the description suggests it has a default value
178+
- > (e.g., `csvFormat` or `badRecordsOutputTable` in some templates).
179+
- > You must identify and supply all of them.
180+
- Identify which parameters are provided by the user and which need to be
181+
resolved or created by you.
182+
* **Action**: Present these parameters to the user using Markdown
183+
Key-Value (bullet points) for clarity and confirmation.
184+
- **Schema Handling**: If a schema JSON parameter (like `schemaJSONPath`
185+
or `JSONPath`) is required:
186+
* **Action**: Ask the user to provide the GCS path to an existing
187+
schema file or the JSON content.
188+
* If the user does not have a schema file, ask them to provide the
189+
field names and their types. Construct the schema JSON locally and
190+
present it to the user for validation.
191+
* Once confirmed by the user, write the schema JSON file locally and
192+
upload it to a GCS staging location, then supply this path to the
193+
parameter.
194+
- **Pre-create Target Sink (Best Practice)**: To ensure stability and
195+
avoid runtime creation schema mismatches:
196+
* **Action**: Clarify with the user whether the target sink (e.g.,
197+
BigQuery table, Spanner database/table) already exists. If the user
198+
confirms it exists, proceed to the remaining steps as-is.
199+
* If it does not exist, ask for permission to create it. If permitted,
200+
create it yourself. Otherwise, provide the exact creation commands
201+
to the user.
202+
* If the sink is BigQuery, refer to
203+
[Destination-specific prerequisites][dest-prereqs] for crucial table
204+
and error table setup.
205+
- Include additional parameters such as service accounts, network details,
206+
and other pipeline options.
207+
* **Specifying Options**: For Google-provided Flex Templates, refer to
208+
the [Specifying options for Flex Templates][flex-template-options]
209+
guide for how to pass parameters and additional experiments.
210+
211+
### Destination-specific prerequisites
212+
213+
Different templates might require specific resources to be prepared in the
214+
target sink before execution. Follow the instructions for your target sink
215+
below.
216+
217+
#### BigQuery
218+
219+
When running templates that write to BigQuery, you MUST ensure the following
220+
resources are prepared to prevent job failures:
221+
222+
- **Pre-create Target Table**: Create the target BigQuery table (e.g., using
223+
`bq mk`) before launching the job. Ensure the schema matches the template's
224+
expectations.
225+
- **Pre-create Error/Bad Records Table**: Many templates that write to
226+
BigQuery have a parameter for redirecting failed records (e.g.,
227+
`badRecordsOutputTable` or `outputDeadletterTable`). Some templates attempt
228+
to auto-create this table. However, pre-creating it is a best practice. This
229+
ensures correct schema and permissions.
230+
* **How to Determine the Error Schema**: Trace the schema definition in
231+
the public [DataflowTemplates GitHub repository][df-templates-repo]:
232+
1. Locate the source code or README for the template you are using
233+
(e.g., in `v1/` or `v2/` directories).
234+
2. Identify the parameter name used for the error table (e.g.,
235+
`badRecordsOutputTable` or `outputDeadletterTable`).
236+
3. Search the source code to see how the schema is defined or loaded
237+
for that parameter:
238+
* **Example (Code Reference)**: In `PubSubToBigQuery.java`, the
239+
schema is set using
240+
`ResourceUtils.getDeadletterTableSchemaJson()`. Tracing
241+
`ResourceUtils.java` shows it loads the schema from
242+
[streaming_source_deadletter_table_schema.json on GitHub][deadletter-schema].
243+
* **Example (Documentation Reference)**: For simpler templates,
244+
the schema might be listed in the official documentation, such
245+
as the `RawContent`/`ErrorMsg` schema shown in the
246+
[CSV to BigQuery DevSite Doc][csv-bq-doc].
247+
248+
## Configuring Custom Pipelines (Dataflow Runner)
249+
250+
Use this section when preparing to run a custom Apache Beam pipeline on
251+
Dataflow.
78252

79253
- When launching Python Pipelines without a Flex Template with
80254
`DataflowRunner`, you MUST scan the pipeline project directory for the
81255
following files:
82-
83256
- **`requirements.txt`**:
84257
- If found, you MUST include `--requirements_file` pipeline option.
85258
- **`setup.py`**:
@@ -90,8 +263,6 @@ Follow the Flex Templates section below.
90263
image is also the SDK Container image (Single Docker Image Configuration),
91264
then you MUST supply the image in the `sdk_container_image` parameter.
92265

93-
- Confirm the launch command with the user.
94-
95266
### Lookup environment resources instead of using placeholder values
96267

97268
- Avoid using generic placeholders (e.g., `your-gcp-project-id`) for GCP
@@ -100,6 +271,40 @@ Follow the Flex Templates section below.
100271
find active resources to pre-fill scripts for the user. Confirm the values
101272
with the user before proceeding.
102273

274+
## Job Execution
275+
276+
Use this section when configuration is complete and you are ready to launch any
277+
Dataflow job (Google-provided template, Custom Flex template, or standalone
278+
pipeline).
279+
280+
### Universal Execution Workflow
281+
282+
1. **Construct Launch Command**: Draft the full launch command based on the
283+
pipeline type (e.g., `gcloud dataflow flex-template run` or `python main.py
284+
--runner=DataflowRunner`). Ensure workers default to private IP
285+
configuration unless specified otherwise, and verify target project
286+
permissions.
287+
2. **Mandatory Pre-Launch Confirmation**: Present the *entire* drafted command
288+
to the user at once. Explain the purpose of all parameters (including
289+
experimental flags) and allow the user to review and correct the command as
290+
a batch instead of confirming piecemeal. **Do NOT proceed** with execution
291+
until explicitly approved.
292+
3. **Trigger Job**: Once approved, execute the command and note the resulting
293+
Job ID (displaying it to the user).
294+
4. **Display Console URL**: Construct and present the direct Cloud Console
295+
monitoring URL:
296+
https://console.cloud.google.com/dataflow/jobs/<region>/<job_id>?project=<project_id>
297+
298+
## Job Monitoring
299+
300+
Use this section to monitor the progress of a running Dataflow job.
301+
302+
- Check the status of the triggered Dataflow job using the job ID.
303+
- Run the check every 30 seconds for the first 2 minutes, then check every 3
304+
minutes, unless specified otherwise by the user.
305+
- **Note**: Do NOT perform data check queries on the sink until the job has
306+
reached a stable `RUNNING` or `DONE` state.
307+
103308
## Diagnostics & Troubleshooting
104309

105310
> [!IMPORTANT] YOU MUST use this section when the user asks about performance of
@@ -136,8 +341,7 @@ Follow the Flex Templates section below.
136341

137342
* Use Dataflow REST API to get High level Job Messages/Events that
138343
happened in the job.
139-
* Refer to
140-
[dataflow_diagnostics_reference.md](references/dataflow_diagnostics_reference.md) for
344+
* Refer to [dataflow_diagnostics_reference.md][diag-ref] for
141345
key metrics and logging query patterns based on Job Type.
142346
* Use Monitoring REST API to fetch metrics.
143347
* Use GCloud Logging command to fetch logs.
@@ -151,11 +355,10 @@ Follow the Flex Templates section below.
151355
[streaming_job_health](references/streaming_job_health.md) to analyze
152356
overall streaming job health.
153357
* Analyze Bottlenecks and Parallelism. YOU MUST refer to
154-
[bottlenecks_and_parallelism_context](references/bottlenecks_and_parallelism_context.md)
155-
and interpret the bottlenecks and parallelism metrics in that
156-
context.
358+
[bottlenecks_and_parallelism_context][bottlenecks-context] and
359+
interpret the bottlenecks and parallelism metrics in that context.
157360
* Analyze Autoscaling Behavior. YOU MUST refer to
158-
[streaming_horizontal_autoscaling_analysis.md](references/streaming_horizontal_autoscaling_analysis.md)
361+
[streaming_horizontal_autoscaling_analysis.md][autoscaling-analysis-link]
159362
* For Batch Jobs
160363
* Correlate metrics spikes/drops with log errors.
161364
* Identify Issues.
@@ -182,12 +385,12 @@ Follow the Flex Templates section below.
182385
`job/is_bottleneck` (interpreting `likely_cause` / `bottleneck_kind`)
183386
and key metrics `job/backlogged_keys` /
184387
`job/processing_parallelism_keys` interpreted in the context of
185-
[bottlenecks_and_parallelism_context](references/bottlenecks_and_parallelism_context.md).
388+
[bottlenecks_and_parallelism_context][bottlenecks-context].
186389
7. **Autoscaling Analysis**: Scaling trends using
187390
`job/horizontal_worker_scaling` (and label `rationale`), clamp limits
188391
(`job/max_worker_instances_limit` / `job/min_worker_instances_limit`),
189392
and utilization hints in the context of
190-
[streaming_horizontal_autoscaling_analysis](references/streaming_horizontal_autoscaling_analysis.md).
393+
[streaming_horizontal_autoscaling_analysis][autoscaling-analysis-link].
191394
8. **Recommendations**: Direct remediation plans (in-flight updates,
192395
client-side configurations, or code corrections linked via absolute
193396
`file:///` URIs).
@@ -201,3 +404,15 @@ Follow the Flex Templates section below.
201404
3. **Recommendations**: Direct remediation plans to future runs
202405
(client-side configurations, or code corrections linked via absolute
203406
`file:///` URIs).
407+
408+
[py-flex-ref]: references/python_flex_template_reference.md
409+
[udf-guide]: https://docs.cloud.google.com/dataflow/docs/guides/templates/create-template-udf
410+
[ssl-cert-guide]: https://docs.cloud.google.com/dataflow/docs/guides/templates/ssl-certificates
411+
[dest-prereqs]: #destination-specific-prerequisites
412+
[flex-template-options]: https://docs.cloud.google.com/dataflow/docs/guides/templates/run-flex-templates#specify-options
413+
[df-templates-repo]: https://github.com/GoogleCloudPlatform/DataflowTemplates
414+
[deadletter-schema]: https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/common/src/main/resources/schema/streaming_source_deadletter_table_schema.json
415+
[csv-bq-doc]: https://cloud.google.com/dataflow/docs/guides/templates/provided/cloud-storage-csv-to-bigquery#GcsCSVToBigQueryBadRecordsSchema
416+
[diag-ref]: references/dataflow_diagnostics_reference.md
417+
[bottlenecks-context]: references/bottlenecks_and_parallelism_context.md
418+
[autoscaling-analysis-link]: references/streaming_horizontal_autoscaling_analysis.md

0 commit comments

Comments
 (0)