11---
22name : gcp-dataflow
3- description : |
4- Guides writing, packaging, executing, and troubleshooting Apache Beam pipelines on Dataflow. Use when creating new pipelines, configuring Flex Templates, or analyzing performance of Dataflow jobs. Capabilities include Java/Python/Go setup, Cloud Build integration, and deep diagnostic analysis of job health and autoscaling.
5- Use when: - Creating an Apache Beam Dataflow pipeline. - Creating a Google Flex Template. - Debugging Dataflow pipeline - Troubleshooting Dataflow pipeline - Analyzing Performance of Dataflow pipeline.
6- Key capabilities include: Project setup for Java/Python/Go, Flex Template configuration (with Cloud Build support), and in-depth diagnostics for streaming job health, bottlenecks, and autoscaling.
7- Do NOT use for: - General GCP resource management unrelated to Dataflow. - Issues with other GCP services (e.g., GCE, GCS, BigQuery) unless directly
3+ description : >
4+ Guides writing, packaging, executing, and troubleshooting Apache Beam
5+ pipelines on Dataflow. Use when creating new pipelines, configuring Flex
6+ Templates, or analyzing performance of Dataflow jobs. Capabilities include
7+ Java/Python/Go setup, Cloud Build integration, and deep diagnostic analysis
8+ of job health and autoscaling.
9+
10+ Use when:
11+ - Creating an Apache Beam Dataflow pipeline.
12+ - Creating a Google Dataflow Flex Template.
13+ - Using an existing Google Dataflow Template.
14+ - Debugging a Dataflow pipeline.
15+ - Troubleshooting a Dataflow pipeline.
16+ - Analyzing the performance of a Dataflow pipeline.
17+
18+ Key capabilities: Java/Python/Go project setup, Flex Templates (with
19+ Cloud Build), and diagnostics for streaming job health, bottlenecks,
20+ and autoscaling.
21+
22+ Do NOT use for:
23+ - General GCP resource management unrelated to Dataflow.
24+ - Issues with other GCP services (e.g., GCE, GCS, BigQuery) unless directly
825 impacting Dataflow pipeline execution.
926 - Pipeline technologies other than Apache Beam on Dataflow.
27+
1028license : Apache-2.0
1129metadata :
12- version : v3
30+ version : v4
1331 publisher : google
1432---
1533
1634# Apache Beam Pipelines on Cloud Dataflow
1735
18- Expert guidance for writing and packaging Apache Beam pipelines to run on Google
19- Cloud Dataflow.
36+ ## Pipeline authoring
37+
38+ Use this section when implementing Dataflow pipeline logic using Apache Beam.
39+
40+ ### Check whether a Google Dataflow Template already exists
41+
42+ Google provides a variety of pre-built, open source Dataflow templates that can
43+ be used for common scenarios. Before implementing a pipeline from scratch, you
44+ MUST follow the steps below to check whether a Dataflow template for the
45+ pipeline logic you need to implement already exists.
46+
47+ - ** Step 1: Check for a matching Google Dataflow Template**
2048
21- ## Creating a new project
49+ - Identify the **source** and **sink** (e.g., GCS to BigQuery) from the
50+ user's request. *Note*: You *MUST NOT* proceed until the source and sink
51+ are clearly identified.
52+ - **Action**: Use `gcloud storage ls` to list templates in the public
53+ `dataflow-templates` bucket:
54+ * For Classic templates, check `gs://dataflow-templates/latest`.
55+ * For Flex templates, check `gs://dataflow-templates/latest/flex`.
56+ - Match templates by name or description to the source and sink.
57+ - If no matching template is found, go to **Create a new pipeline from
58+ scratch**.
2259
23- Use this section when creating a new project for a Dataflow pipeline.
60+ - ** Step 2: Confirm template selection**
61+
62+ - Present the matched template(s) to the user with a brief explanation of
63+ why they match, and make a note of whether it is a Classic or Flex
64+ template.
65+ - ** Action** : Ask the user for explicit confirmation to proceed with this
66+ template.
67+ - If the user rejects or prefers a custom solution, proceed to ** Create a
68+ new pipeline from scratch** .
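
A minimal Python sketch of the Step 1 lookup, assuming the agent shells out to `gcloud storage ls` and then filters the returned object names. Only the two bucket paths come from the steps above; the `match_templates` heuristic and the sample object names in the usage note are illustrative assumptions.

```python
import shlex

# Bucket locations named in Step 1.
CLASSIC_PREFIX = "gs://dataflow-templates/latest"
FLEX_PREFIX = "gs://dataflow-templates/latest/flex"


def list_command(prefix):
    """Return the argv for listing one template location."""
    return ["gcloud", "storage", "ls", prefix]


def match_templates(object_names, source, sink):
    """Keep object names whose last path component mentions both keywords.

    Hypothetical heuristic: case-insensitive substring match with
    underscores removed. Keyword order is not checked.
    """
    matches = []
    for name in object_names:
        leaf = name.rstrip("/").rsplit("/", 1)[-1]
        normalized = leaf.replace("_", "").lower()
        if source.lower() in normalized and sink.lower() in normalized:
            matches.append(name)
    return matches


if __name__ == "__main__":
    # Print the two listing commands instead of running them.
    for prefix in (CLASSIC_PREFIX, FLEX_PREFIX):
        print(shlex.join(list_command(prefix)))
```

For example, `match_templates(names, "gcs", "bigquery")` keeps a name such as `GCS_Text_to_BigQuery` and drops `Word_Count`. Because direction is not checked, the result is only a candidate list and still needs the review and user confirmation described in Step 2.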
69+
70+ ### Create a new pipeline from scratch
71+
72+ Use this section when creating a new project for a Dataflow pipeline from
73+ scratch.
2474
2575- If the user doesn't say explicitly which language (Java, Python, Go) shall
2676 be used to write the pipeline, you MUST confirm the language.
@@ -47,39 +97,162 @@ Use this section when configuring a Dataflow Java pipeline project using gradle.
4797 ` logback-classic ` , etc.) to exactly match the major/minor version of the
4898 resolved ` slf4j-api ` .
4999
50- ### Structure the pipeline as a Dataflow Flex Template
100+ ### Packaging a pipeline as a Flex Template
51101
52- When creating new Dataflow pipeline projects, configure them as a Flex template.
53- Flex Templates offer a hermetic and reproducible launch environment, and are
54- easy to launch with ` gcloud ` or with orchestrators like Cloud Composer.
102+ Use this section to package pipeline code as a Flex template.
55103
56- Follow the Flex Templates section below.
104+ Flex Templates offer a hermetic and reproducible launch environment for a
105+ pipeline. They are easy to launch with ` gcloud ` or with orchestrators like Cloud
106+ Composer. You ** MUST** package the pipeline as a Flex Template when creating new
107+ Dataflow pipeline projects.
57108
58- ## Flex Templates
109+ Follow the steps below:
59110
60111- **Provide Instructions**: Give the user a walkthrough for rebuilding and
61112 running Flex Templates.
62113- ** Use Single Docker Image for Python pipelines** : For Python Flex Templates,
63114 it is better to use a single image for the template launcher image and for
64- the worker runtime environment (` --sdk_container_image ` ). Whenever
65- configuring or suggesting a Dataflow Flex Template for a Python pipeline
66- that requires extra dependencies (e.g., using ` --requirements_file ` ,
67- ` --setup_file ` , or ` --extra_package ` ), ** YOU MUST recommend the Single
68- Docker Image Configuration** as detailed in
69- [ python_flex_template_reference.md] ( references/python_flex_template_reference.md ) .
115+ the worker runtime environment (` --sdk_container_image ` ). Does the Python
116+ pipeline require extra dependencies (e.g., using ` --requirements_file ` ,
117+ ` --setup_file ` , or ` --extra_package ` )? If so, ** YOU MUST recommend the**
118+ ** Single Docker Image Configuration** for the Flex Template. See
119+ [ python_flex_template_reference.md] [ py-flex-ref ] for details.
70120- ** Prefer Cloud Build over Local Docker** :
71121 - Do NOT assume local Docker availability on the workspace machine.
72122 - ** Action** : Suggest and provide ` cloudbuild.yaml ` out-of-the-box for
73123 building and pushing images unless local setup is explicitly requested.
74124 - When building images with Cloud Build in the background you MUST provide
75125 the link where the user can monitor the long-running operation.
76-
77- ## Launching Apache Beam Pipelines with Dataflow Runner
126+ - ** Providing SSL certificates and Secrets to Workers** :
127+ - If certificates or keys are stored in Secret Manager, ** NEVER** bake
128+ them into the Docker image layers. Instead, retrieve them dynamically at
129+ runtime inside the Apache Beam ` DoFn.setup() ` lifecycle using the Secret
130+ Manager client library (writing them to ephemeral worker disk like
131+ ` /tmp ` only if physical file paths are strictly required). Ensure the
132+ Dataflow Worker Service Account has the
133+ ` roles/secretmanager.secretAccessor ` role.
134+
135+ ## Configuring Google-provided templates
136+
137+ Use this section when the user has selected a Google-provided template (Classic
138+ or Flex) and you need to configure it.
139+
140+ - ** Step 1: Get template metadata**
141+
142+ - Identify template type:
143+ * ** Classic** : Metadata files are in ` gs://dataflow-templates ` and end
144+ with ` _metadata ` (e.g.,
145+ ` gs://dataflow-templates/latest/Word_Count_metadata ` ).
146+ * **Flex**: Metadata is embedded in the template spec file under
147+ `gs://dataflow-templates/latest/flex` (e.g.,
148+ `gs://dataflow-templates/latest/flex/Cloud_Datastream_to_BigQuery`).
149+ - Read the corresponding template metadata file to identify required
150+ parameters.
151+ - ** Note** :
152+ * Make sure to run a recursive search over the bucket if needed to
153+ locate the metadata.
154+ * If the template parameters include UDF-related fields (e.g.,
155+ ` javascriptTextTransformGcsPath ` ,
156+ ` javascriptTextTransformFunctionName ` ), refer to the
157+ [ UDF guide] [ udf-guide ] to write and configure the UDF.
158+ * ** Parameter-Based SSL / Secret Staging** : If the Google-provided
159+ template requires local SSL certificates or Secret Manager secrets,
160+ pass comma-separated GCS paths via the ` extraFilesToStage `
161+ parameter. The runner will drop them into ` /extra_files ` on worker
162+ VMs. Refer to the [ SSL certificates guide] [ ssl-cert-guide ] for local
163+ referencing syntax (` /extra_files/... ` ).
164+
165+ - ** Step 2: Get network configuration**
166+
167+ - ** Action** : Run ` gcloud ` commands to list networks and subnetworks.
168+ - Confirm the network and subnetwork to use with the user.
169+
170+ - ** Step 3: Identify required parameters and prepare resources**
171+
172+ - Extract required parameters from the template metadata.
173+ - > [ !IMPORTANT]
174+ - > ** Strict parameter validation** : Any parameter in the metadata JSON
175+ - > that does ** NOT** explicitly have ` "isOptional": true ` is ** strictly
176+ - > required** by the Dataflow API.
177+ - > This applies even if the description suggests it has a default value
178+ - > (e.g., ` csvFormat ` or ` badRecordsOutputTable ` in some templates).
179+ - > You must identify and supply all of them.
180+ - Identify which parameters are provided by the user and which need to be
181+ resolved or created by you.
182+ * ** Action** : Present these parameters to the user using Markdown
183+ Key-Value (bullet points) for clarity and confirmation.
184+ - ** Schema Handling** : If a schema JSON parameter (like ` schemaJSONPath `
185+ or ` JSONPath ` ) is required:
186+ * ** Action** : Ask the user to provide the GCS path to an existing
187+ schema file or the JSON content.
188+ * If the user does not have a schema file, ask them to provide the
189+ field names and their types. Construct the schema JSON locally and
190+ present it to the user for validation.
191+ * Once confirmed by the user, write the schema JSON file locally and
192+ upload it to a GCS staging location, then supply this path to the
193+ parameter.
194+ - **Pre-create Target Sink (Best Practice)**: To ensure stability and
195+ avoid schema mismatches caused by runtime table creation:
196+ * ** Action** : Clarify with the user whether the target sink (e.g.,
197+ BigQuery table, Spanner database/table) already exists. If the user
198+ confirms it exists, proceed to the remaining steps as-is.
199+ * If it does not exist, ask for permission to create it. If permitted,
200+ create it yourself. Otherwise, provide the exact creation commands
201+ to the user.
202+ * If the sink is BigQuery, refer to
203+ [ Destination-specific prerequisites] [ dest-prereqs ] for crucial table
204+ and error table setup.
205+ - Include additional parameters such as service accounts, network details,
206+ and other pipeline options.
207+ * ** Specifying Options** : For Google-provided Flex Templates, refer to
208+ the [ Specifying options for Flex Templates] [ flex-template-options ]
209+ guide for how to pass parameters and additional experiments.
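
The strict parameter validation rule in Step 3 can be sketched in Python as follows. This is an illustration under assumptions: the metadata is taken to be a JSON object with a `parameters` list whose entries carry `name` and an optional `isOptional` flag, and a Flex spec file is assumed to nest that object under a `metadata` key. The sample document and its parameter names are made up for the example.

```python
import json


def required_parameters(spec):
    """Names of parameters not explicitly marked `"isOptional": true`."""
    # Assumption: Flex spec files nest the metadata under "metadata";
    # Classic `_metadata` files are the metadata object itself.
    metadata = spec.get("metadata", spec)
    return [
        param["name"]
        for param in metadata.get("parameters", [])
        if param.get("isOptional") is not True
    ]


def missing_parameters(spec, supplied):
    """Required parameter names that are absent from `supplied`."""
    return [name for name in required_parameters(spec) if name not in supplied]


# Hypothetical metadata, for illustration only.
SAMPLE_METADATA = json.loads("""
{
  "name": "Example template",
  "parameters": [
    {"name": "inputFilePattern", "helpText": "Input files."},
    {"name": "csvFormat", "helpText": "Defaults to Default.", "isOptional": false},
    {"name": "delimiter", "helpText": "Column delimiter.", "isOptional": true}
  ]
}
""")

if __name__ == "__main__":
    print(required_parameters(SAMPLE_METADATA))
    print(missing_parameters(SAMPLE_METADATA, {"inputFilePattern": "gs://bucket/*.csv"}))
```

Note that `csvFormat` is treated as required even though its help text mentions a default, which mirrors the rule that only an explicit `"isOptional": true` makes a parameter optional.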
210+
211+ ### Destination-specific prerequisites
212+
213+ Different templates might require specific resources to be prepared in the
214+ target sink before execution. Follow the instructions for your target sink
215+ below.
216+
217+ #### BigQuery
218+
219+ When running templates that write to BigQuery, you MUST ensure the following
220+ resources are prepared to prevent job failures:
221+
222+ - ** Pre-create Target Table** : Create the target BigQuery table (e.g., using
223+ ` bq mk ` ) before launching the job. Ensure the schema matches the template's
224+ expectations.
225+ - ** Pre-create Error/Bad Records Table** : Many templates that write to
226+ BigQuery have a parameter for redirecting failed records (e.g.,
227+ ` badRecordsOutputTable ` or ` outputDeadletterTable ` ). Some templates attempt
228+ to auto-create this table. However, pre-creating it is a best practice. This
229+ ensures correct schema and permissions.
230+ * ** How to Determine the Error Schema** : Trace the schema definition in
231+ the public [ DataflowTemplates GitHub repository] [ df-templates-repo ] :
232+ 1 . Locate the source code or README for the template you are using
233+ (e.g., in ` v1/ ` or ` v2/ ` directories).
234+ 2 . Identify the parameter name used for the error table (e.g.,
235+ ` badRecordsOutputTable ` or ` outputDeadletterTable ` ).
236+ 3 . Search the source code to see how the schema is defined or loaded
237+ for that parameter:
238+ * ** Example (Code Reference)** : In ` PubSubToBigQuery.java ` , the
239+ schema is set using
240+ ` ResourceUtils.getDeadletterTableSchemaJson() ` . Tracing
241+ ` ResourceUtils.java ` shows it loads the schema from
242+ [ streaming_source_deadletter_table_schema.json on GitHub] [ deadletter-schema ] .
243+ * ** Example (Documentation Reference)** : For simpler templates,
244+ the schema might be listed in the official documentation, such
245+ as the ` RawContent ` /` ErrorMsg ` schema shown in the
246+ [ CSV to BigQuery DevSite Doc] [ csv-bq-doc ] .
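
A hedged Python sketch of pre-creating an error table with `bq mk`. The `RawContent`/`ErrorMsg` field names come from the text above; their `STRING` types, the `NULLABLE` mode, and the project, dataset, table, and file names are assumptions for illustration, so check the template's documented schema before using them.

```python
import json
import shlex


def bq_schema_json(fields):
    """Render (name, type) pairs as a bq schema file (a JSON array of fields)."""
    schema = [{"name": name, "type": bq_type, "mode": "NULLABLE"}
              for name, bq_type in fields]
    return json.dumps(schema, indent=2)


def bq_mk_table_command(project, dataset, table, schema_path):
    """Return the argv for `bq mk --table PROJECT:DATASET.TABLE SCHEMA_FILE`."""
    return ["bq", "mk", "--table", f"{project}:{dataset}.{table}", schema_path]


# Assumed types for the bad-records schema; verify against the template docs.
ERROR_FIELDS = [("RawContent", "STRING"), ("ErrorMsg", "STRING")]

if __name__ == "__main__":
    print(bq_schema_json(ERROR_FIELDS))
    # Hypothetical identifiers; look up real values and confirm with the user.
    print(shlex.join(bq_mk_table_command(
        "example-project", "example_dataset", "bad_records", "error_schema.json")))
```

Printing the command instead of running it keeps the creation step reviewable, which fits the instruction to ask for permission before creating the sink.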
247+
248+ ## Configuring Custom Pipelines (Dataflow Runner)
249+
250+ Use this section when preparing to run a custom Apache Beam pipeline on
251+ Dataflow.
78252
79253- When launching Python Pipelines without a Flex Template with
80254 ` DataflowRunner ` , you MUST scan the pipeline project directory for the
81255 following files:
82-
83256 - ** ` requirements.txt ` ** :
84257 - If found, you MUST include ` --requirements_file ` pipeline option.
85258 - ** ` setup.py ` ** :
@@ -90,8 +263,6 @@ Follow the Flex Templates section below.
90263 image is also the SDK Container image (Single Docker Image Configuration),
91264 then you MUST supply the image in the ` sdk_container_image ` parameter.
92265
93- - Confirm the launch command with the user.
94-
95266### Lookup environment resources instead of using placeholder values
96267
97268- Avoid using generic placeholders (e.g., ` your-gcp-project-id ` ) for GCP
@@ -100,6 +271,40 @@ Follow the Flex Templates section below.
100271 find active resources to pre-fill scripts for the user. Confirm the values
101272 with the user before proceeding.
102273
274+ ## Job Execution
275+
276+ Use this section when configuration is complete and you are ready to launch any
277+ Dataflow job (Google-provided template, Custom Flex template, or standalone
278+ pipeline).
279+
280+ ### Universal Execution Workflow
281+
282+ 1 . ** Construct Launch Command** : Draft the full launch command based on the
283+ pipeline type (e.g., ` gcloud dataflow flex-template run ` or `python main.py
284+ --runner=DataflowRunner`). Ensure workers default to private IP
285+ configuration unless specified otherwise, and verify target project
286+ permissions.
287+ 2 . ** Mandatory Pre-Launch Confirmation** : Present the * entire* drafted command
288+ to the user at once. Explain the purpose of all parameters (including
289+ experimental flags) and allow the user to review and correct the command as
290+ a batch instead of confirming piecemeal. ** Do NOT proceed** with execution
291+ until explicitly approved.
292+ 3 . ** Trigger Job** : Once approved, execute the command and note the resulting
293+ Job ID (displaying it to the user).
294+ 4 . ** Display Console URL** : Construct and present the direct Cloud Console
295+ monitoring URL:
296+ https://console.cloud.google.com/dataflow/jobs/ <region >/<job_id>?project=<project_id>
297+
298+ ## Job Monitoring
299+
300+ Use this section to monitor the progress of a running Dataflow job.
301+
302+ - Check the status of the triggered Dataflow job using the job ID.
303+ - Run the check every 30 seconds for the first 2 minutes, then check every 3
304+ minutes, unless specified otherwise by the user.
305+ - ** Note** : Do NOT perform data check queries on the sink until the job has
306+ reached a stable ` RUNNING ` or ` DONE ` state.
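
The polling cadence above can be expressed as a small Python generator. This is a sketch under assumptions: the helper names are hypothetical, and the job state is assumed to arrive either as `RUNNING`/`DONE` or with a `JOB_STATE_` prefix.

```python
import itertools


def poll_delays(fast_interval=30, fast_window=120, slow_interval=180):
    """Yield the seconds to wait before each status check, indefinitely.

    Checks run every `fast_interval` seconds until `fast_window` seconds
    have elapsed, then every `slow_interval` seconds.
    """
    elapsed = 0
    while elapsed < fast_window:
        yield fast_interval
        elapsed += fast_interval
    while True:
        yield slow_interval


def safe_to_query_sink(state):
    """True once the job has reached a stable RUNNING or DONE state."""
    prefix = "JOB_STATE_"  # assumed prefix on API-reported states
    if state.startswith(prefix):
        state = state[len(prefix):]
    return state in {"RUNNING", "DONE"}


if __name__ == "__main__":
    print(list(itertools.islice(poll_delays(), 6)))
```

A caller would sleep for each yielded delay, fetch the job status by job ID, and stop once the job reaches a terminal state. The default intervals are parameters so that a user-specified cadence can override them.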
307+
103308## Diagnostics & Troubleshooting
104309
105310> [ !IMPORTANT] YOU MUST use this section when the user asks about performance of
@@ -136,8 +341,7 @@ Follow the Flex Templates section below.
136341
137342 * Use Dataflow REST API to get High level Job Messages/Events that
138343 happened in the job.
139- * Refer to
140- [ dataflow_diagnostics_reference.md] ( references/dataflow_diagnostics_reference.md ) for
344+ * Refer to [ dataflow_diagnostics_reference.md] [ diag-ref ] for
141345 key metrics and logging query patterns based on Job Type.
142346 * Use Monitoring REST API to fetch metrics.
143347 * Use the `gcloud logging` command to fetch logs.
@@ -151,11 +355,10 @@ Follow the Flex Templates section below.
151355 [ streaming_job_health] ( references/streaming_job_health.md ) to analyze
152356 overall streaming job health.
153357 * Analyze Bottlenecks and Parallelism. YOU MUST refer to
154- [ bottlenecks_and_parallelism_context] ( references/bottlenecks_and_parallelism_context.md )
155- and interpret the bottlenecks and parallelism metrics in that
156- context.
358+ [ bottlenecks_and_parallelism_context] [ bottlenecks-context ] and
359+ interpret the bottlenecks and parallelism metrics in that context.
157360 * Analyze Autoscaling Behavior. YOU MUST refer to
158- [ streaming_horizontal_autoscaling_analysis.md] ( references/streaming_horizontal_autoscaling_analysis.md )
361+ [ streaming_horizontal_autoscaling_analysis.md] [ autoscaling-analysis-link ]
159362 * For Batch Jobs
160363 * Correlate metrics spikes/drops with log errors.
161364 * Identify Issues.
@@ -182,12 +385,12 @@ Follow the Flex Templates section below.
182385 ` job/is_bottleneck ` (interpreting ` likely_cause ` / ` bottleneck_kind ` )
183386 and key metrics ` job/backlogged_keys ` /
184387 ` job/processing_parallelism_keys ` interpreted in the context of
185- [ bottlenecks_and_parallelism_context] ( references/bottlenecks_and_parallelism_context.md ) .
388+ [ bottlenecks_and_parallelism_context] [ bottlenecks-context ] .
186389 7 . ** Autoscaling Analysis** : Scaling trends using
187390 ` job/horizontal_worker_scaling ` (and label ` rationale ` ), clamp limits
188391 (` job/max_worker_instances_limit ` / ` job/min_worker_instances_limit ` ),
189392 and utilization hints in the context of
190- [ streaming_horizontal_autoscaling_analysis] ( references/streaming_horizontal_autoscaling_analysis.md ) .
393+ [ streaming_horizontal_autoscaling_analysis] [ autoscaling-analysis-link ] .
191394 8 . ** Recommendations** : Direct remediation plans (in-flight updates,
192395 client-side configurations, or code corrections linked via absolute
193396 ` file:/// ` URIs).
@@ -201,3 +404,15 @@ Follow the Flex Templates section below.
201404 3. **Recommendations**: Direct remediation plans for future runs
202405 (client-side configurations, or code corrections linked via absolute
203406 ` file:/// ` URIs).
407+
408+ [py-flex-ref]: references/python_flex_template_reference.md
409+ [udf-guide]: https://docs.cloud.google.com/dataflow/docs/guides/templates/create-template-udf
410+ [ssl-cert-guide]: https://docs.cloud.google.com/dataflow/docs/guides/templates/ssl-certificates
411+ [dest-prereqs]: #destination-specific-prerequisites
412+ [flex-template-options]: https://docs.cloud.google.com/dataflow/docs/guides/templates/run-flex-templates#specify-options
413+ [df-templates-repo]: https://github.com/GoogleCloudPlatform/DataflowTemplates
414+ [deadletter-schema]: https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/v2/common/src/main/resources/schema/streaming_source_deadletter_table_schema.json
415+ [csv-bq-doc]: https://cloud.google.com/dataflow/docs/guides/templates/provided/cloud-storage-csv-to-bigquery#GcsCSVToBigQueryBadRecordsSchema
416+ [diag-ref]: references/dataflow_diagnostics_reference.md
417+ [bottlenecks-context]: references/bottlenecks_and_parallelism_context.md
418+ [autoscaling-analysis-link]: references/streaming_horizontal_autoscaling_analysis.md