Commit f3084fb

feat(yaml): Add Iceberg to AlloyDB YAML template
1 parent af9d406 commit f3084fb

5 files changed

Lines changed: 733 additions & 0 deletions

Lines changed: 269 additions & 0 deletions
@@ -0,0 +1,269 @@

Iceberg to AlloyDB (YAML) template
---
The Iceberg to AlloyDB template is a batch pipeline that reads data from an
Iceberg table and outputs the records to an AlloyDB database table.



:bulb: This documentation is generated from
[Metadata Annotations](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#metadata-annotations).
Do not change this file directly.

## Parameters

### Required parameters

* **table**: A fully-qualified table identifier. For example, `my_dataset.my_table`.
* **catalogName**: The name of the Iceberg catalog that contains the table. For example, `my_hadoop_catalog`.
* **catalogProperties**: A map of properties for setting up the Iceberg catalog. For example, `{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}`.
* **jdbcUrl**: The JDBC connection URL. For example, `jdbc:postgresql://your-host:5432/your-db`.

### Optional parameters

* **configProperties**: A map of properties to pass to the Hadoop Configuration. For example, `{"fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"}`.
* **drop**: A list of field names to drop. Mutually exclusive with 'keep' and 'only'. For example, `["field_to_drop_1", "field_to_drop_2"]`.
* **filter**: A filter expression to apply to records from the Iceberg table. For example, `age > 18`.
* **keep**: A list of field names to keep. Mutually exclusive with 'drop' and 'only'. For example, `["field_to_keep_1", "field_to_keep_2"]`.
* **username**: The database username. For example, `my_user`.
* **password**: The database password. For example, `my_secret_password`.
* **connectionProperties**: A semicolon-separated list of key-value pairs for the JDBC connection. For example, `key1=value1;key2=value2`.
* **alloydbTable**: The name of the database table. For example, `public.my_table`.
* **query**: The SQL query/statement to execute on the source/sink. For example, `SELECT * FROM my_table WHERE status = 'active'`.
* **batchSize**: The number of records to group together for each write. For example, `1000`. Defaults to: 1000.
* **autosharding**: If true, a dynamic number of shards will be used for writing. For example, `False`.
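
Several of these parameters take structured values: JSON maps, JSON lists, or semicolon-separated pairs. The following is a minimal sketch of setting them from a shell, using the example values from the descriptions above. `join_props` is a hypothetical helper introduced here for illustration; it is not part of the template or its tooling.

```shell
# Example values copied from the parameter descriptions; replace with your own.
# Single quotes stop the shell from interpreting the double quotes inside the JSON.
export CATALOG_PROPERTIES='{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}'
export DROP='["field_to_drop_1", "field_to_drop_2"]'

# join_props (hypothetical helper): joins its key=value arguments with
# semicolons, the format described for connectionProperties.
# The function body runs in a subshell so the IFS change does not leak out.
join_props() (
  IFS=';'
  printf '%s\n' "$*"
)

export CONNECTION_PROPERTIES="$(join_props key1=value1 key2=value2)"
echo "$CONNECTION_PROPERTIES"   # prints key1=value1;key2=value2
```

Quoting each value once at assignment time, and always expanding it inside double quotes afterwards, keeps the JSON intact when it is passed on to other commands.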



## Getting Started

### Requirements

* Java 17
* Maven
* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the
  following commands:
  * `gcloud auth login`
  * `gcloud auth application-default login`

:star2: Those dependencies are pre-installed if you use Google Cloud Shell!

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git&cloudshell_open_in_editor=yaml/src/main/java/com/google/cloud/teleport/templates/yaml/IcebergToAlloyDBYaml.java)

### Templates Plugin

This README provides instructions using
the [Templates Plugin](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#templates-plugin).

#### Validating the Template

This template has a validation command that is used to check code quality.

```shell
mvn clean install -PtemplatesValidate \
-DskipTests -am \
-pl yaml
```

### Building Template

This template is a Flex Template, meaning that the pipeline code will be
containerized and the container will be executed on Dataflow. Please
check [Use Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates)
and [Configure Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates)
for more information.

#### Staging the Template

If the plan is to just stage the template (i.e., make it available to use) by
the `gcloud` command or Dataflow "Create job from template" UI,
the `-PtemplatesStage` profile should be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export ARTIFACT_REGISTRY_REPO=<region>-docker.pkg.dev/$PROJECT/<repo>

mvn clean package -PtemplatesStage \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-DartifactRegistry="$ARTIFACT_REGISTRY_REPO" \
-DstagePrefix="templates" \
-DtemplateName="Iceberg_To_AlloyDB_Yaml" \
-f yaml
```

The `-DartifactRegistry` parameter can be specified to set the artifact registry repository of the Flex Templates image.
If not provided, it defaults to `gcr.io/<project>`.

The command should build and save the template to Google Cloud, and then print
the complete location on Cloud Storage:

```
Flex Template was staged! gs://<bucket-name>/templates/flex/Iceberg_To_AlloyDB_Yaml
```

The specific path should be copied as it will be used in the following steps.
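
As an alternative to copying the path by hand, it can be rebuilt from the values passed to the staging command. The sketch below assumes the layout `gs://<bucket>/<stagePrefix>/flex/<templateName>`, which is inferred only from the sample output above; the line printed by the staging command remains the authoritative source. `template_spec_path` and the bucket name `my-bucket` are hypothetical.

```shell
# template_spec_path BUCKET STAGE_PREFIX TEMPLATE_NAME (hypothetical helper).
# Assumed layout: gs://<bucket>/<stagePrefix>/flex/<templateName>, inferred
# from the sample "Flex Template was staged!" output.
template_spec_path() {
  printf 'gs://%s/%s/flex/%s\n' "$1" "$2" "$3"
}

export TEMPLATE_SPEC_GCSPATH="$(template_spec_path "my-bucket" "templates" "Iceberg_To_AlloyDB_Yaml")"
echo "$TEMPLATE_SPEC_GCSPATH"   # prints gs://my-bucket/templates/flex/Iceberg_To_AlloyDB_Yaml
```

If the gcloud CLI is available, `gcloud storage ls "$TEMPLATE_SPEC_GCSPATH"` is one way to confirm that the spec file exists before launching a job.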

#### Running the Template

**Using the staged template**:

You can use the path above to run the template (or share it with others for execution).

To start a job with the template at any time using `gcloud`, you are going to
need valid resources for the required parameters.
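
A small pre-flight check can catch missing values before a job is submitted. The sketch below assumes the environment variable names used in this README's commands (`TABLE`, `CATALOG_NAME`, `CATALOG_PROPERTIES`, `JDBC_URL`, `DROP`, `KEEP`) and requires bash; `preflight` is a hypothetical helper, and the template may perform additional validation of its own.

```shell
#!/usr/bin/env bash
# preflight (hypothetical helper): returns non-zero and prints a message when a
# required variable is empty, or when DROP and KEEP are both set.
preflight() {
  local status=0 name
  for name in TABLE CATALOG_NAME CATALOG_PROPERTIES JDBC_URL; do
    # ${!name:-} is bash indirect expansion: the value of the variable named by $name.
    if [ -z "${!name:-}" ]; then
      echo "missing required variable: $name" >&2
      status=1
    fi
  done
  # The parameter descriptions state that drop and keep are mutually exclusive.
  if [ -n "${DROP:-}" ] && [ -n "${KEEP:-}" ]; then
    echo "DROP and KEEP are mutually exclusive" >&2
    status=1
  fi
  return "$status"
}
```

Calling `preflight || exit 1` after the `export` lines and before the launch command stops the script early when something is missing.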

Provided that, the following command line can be used:

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1
export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/Iceberg_To_AlloyDB_Yaml"

### Required
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>
export JDBC_URL=<jdbcUrl>

### Optional
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export FILTER=<filter>
export KEEP=<keep>
export USERNAME=<username>
export PASSWORD=<password>
export CONNECTION_PROPERTIES=<connectionProperties>
export ALLOYDB_TABLE=<alloydbTable>
export QUERY=<query>
export BATCH_SIZE=1000
export AUTOSHARDING=<autosharding>

gcloud dataflow flex-template run "iceberg-to-alloydb-yaml-job" \
--project "$PROJECT" \
--region "$REGION" \
--template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
--parameters "table=$TABLE" \
--parameters "catalogName=$CATALOG_NAME" \
--parameters "catalogProperties=$CATALOG_PROPERTIES" \
--parameters "configProperties=$CONFIG_PROPERTIES" \
--parameters "drop=$DROP" \
--parameters "filter=$FILTER" \
--parameters "keep=$KEEP" \
--parameters "jdbcUrl=$JDBC_URL" \
--parameters "username=$USERNAME" \
--parameters "password=$PASSWORD" \
--parameters "connectionProperties=$CONNECTION_PROPERTIES" \
--parameters "alloydbTable=$ALLOYDB_TABLE" \
--parameters "query=$QUERY" \
--parameters "batchSize=$BATCH_SIZE" \
--parameters "autosharding=$AUTOSHARDING"
```

For more information about the command, please check:
https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run
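
The command above passes every optional parameter, including ones left empty, and JSON values such as the `catalogProperties` example contain commas, which gcloud normally treats as a separator inside a dictionary-style flag value. The following sketch shows one possible way to handle both cases in bash. It assumes that gcloud's alternate-delimiter prefix (described in `gcloud topic escaping`) applies to `--parameters`; `add_param` is a hypothetical helper, and only a subset of the parameters is shown.

```shell
#!/usr/bin/env bash
# Sketch: collect --parameters flags in a bash array, skipping empty values.
PARAMS=()

# add_param NAME VALUE (hypothetical helper, not part of gcloud).
add_param() {
  local name="$1" value="$2"
  # Skip optional parameters that are unset or empty.
  [ -n "$value" ] || return 0
  case "$value" in
    # Assumption: the "^~^" prefix switches the delimiter from "," to "~"
    # (see `gcloud topic escaping`), so the value itself must not contain "~".
    *,*) PARAMS+=(--parameters "^~^${name}=${value}") ;;
    *)   PARAMS+=(--parameters "${name}=${value}") ;;
  esac
}

add_param table "${TABLE:-}"
add_param catalogName "${CATALOG_NAME:-}"
add_param catalogProperties "${CATALOG_PROPERTIES:-}"
add_param jdbcUrl "${JDBC_URL:-}"
add_param drop "${DROP:-}"
add_param filter "${FILTER:-}"
add_param keep "${KEEP:-}"

# Launch with the collected flags (commented out so the sketch has no side effects):
# gcloud dataflow flex-template run "iceberg-to-alloydb-yaml-job" \
#   --project "$PROJECT" --region "$REGION" \
#   --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \
#   "${PARAMS[@]}"
```

Keeping the flags in an array rather than a single string preserves spaces and quotes inside values such as `filter` or `query` when the array is expanded as `"${PARAMS[@]}"`.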


**Using the plugin**:

Instead of just generating the template in the folder, it is possible to stage
and run the template in a single command. This may be useful for testing when
changing the templates.

```shell
export PROJECT=<my-project>
export BUCKET_NAME=<bucket-name>
export REGION=us-central1

### Required
export TABLE=<table>
export CATALOG_NAME=<catalogName>
export CATALOG_PROPERTIES=<catalogProperties>
export JDBC_URL=<jdbcUrl>

### Optional
export CONFIG_PROPERTIES=<configProperties>
export DROP=<drop>
export FILTER=<filter>
export KEEP=<keep>
export USERNAME=<username>
export PASSWORD=<password>
export CONNECTION_PROPERTIES=<connectionProperties>
export ALLOYDB_TABLE=<alloydbTable>
export QUERY=<query>
export BATCH_SIZE=1000
export AUTOSHARDING=<autosharding>

mvn clean package -PtemplatesRun \
-DskipTests \
-DprojectId="$PROJECT" \
-DbucketName="$BUCKET_NAME" \
-Dregion="$REGION" \
-DjobName="iceberg-to-alloydb-yaml-job" \
-DtemplateName="Iceberg_To_AlloyDB_Yaml" \
-Dparameters="table=$TABLE,catalogName=$CATALOG_NAME,catalogProperties=$CATALOG_PROPERTIES,configProperties=$CONFIG_PROPERTIES,drop=$DROP,filter=$FILTER,keep=$KEEP,jdbcUrl=$JDBC_URL,username=$USERNAME,password=$PASSWORD,connectionProperties=$CONNECTION_PROPERTIES,alloydbTable=$ALLOYDB_TABLE,query=$QUERY,batchSize=$BATCH_SIZE,autosharding=$AUTOSHARDING" \
-f yaml
```

## Terraform

Dataflow supports the utilization of Terraform to manage template jobs,
see [dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job).

Terraform modules have been generated for most templates in this repository. This includes the relevant parameters
specific to the template. If available, they may be used instead of
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly.

To use the autogenerated module, execute the standard
[terraform workflow](https://developer.hashicorp.com/terraform/intro/core-workflow):

```shell
cd yaml/terraform/Iceberg_To_AlloyDB_Yaml
terraform init
terraform apply
```

To use
[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job)
directly:

```terraform
provider "google-beta" {
  project = var.project
}
variable "project" {
  default = "<my-project>"
}
variable "region" {
  default = "us-central1"
}

resource "google_dataflow_flex_template_job" "iceberg_to_alloydb_yaml" {

  provider = google-beta
  container_spec_gcs_path = "gs://dataflow-templates-${var.region}/latest/flex/Iceberg_To_AlloyDB_Yaml"
  name = "iceberg-to-alloydb-yaml"
  region = var.region
  parameters = {
    table = "<table>"
    catalogName = "<catalogName>"
    catalogProperties = "<catalogProperties>"
    jdbcUrl = "<jdbcUrl>"
    # configProperties = "<configProperties>"
    # drop = "<drop>"
    # filter = "<filter>"
    # keep = "<keep>"
    # username = "<username>"
    # password = "<password>"
    # connectionProperties = "<connectionProperties>"
    # alloydbTable = "<alloydbTable>"
    # query = "<query>"
    # batchSize = "1000"
    # autosharding = "<autosharding>"
  }
}
```
