|
| 1 | + |
| 2 | +Iceberg to AlloyDB (YAML) template |
| 3 | +--- |
| 4 | +The Iceberg to AlloyDB template is a batch pipeline that reads data from an |
| 5 | +Iceberg table and outputs the records to an AlloyDB database table. |
| 6 | + |
| 7 | + |
| 8 | + |
| 9 | +:bulb: This documentation is generated from |
| 10 | +[Metadata Annotations](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#metadata-annotations). |
| 11 | +Do not change this file directly. |
| 12 | + |
| 13 | +## Parameters |
| 14 | + |
| 15 | +### Required parameters |
| 16 | + |
| 17 | +* **table**: A fully qualified table identifier. For example, `my_dataset.my_table`. |
| 18 | +* **catalogName**: The name of the Iceberg catalog that contains the table. For example, `my_hadoop_catalog`. |
| 19 | +* **catalogProperties**: A map of properties for setting up the Iceberg catalog. For example, `{"type": "hadoop", "warehouse": "gs://your-bucket/warehouse"}`. |
| 20 | +* **jdbcUrl**: The JDBC connection URL. For example, `jdbc:postgresql://your-host:5432/your-db`. |
| 21 | + |
| 22 | +### Optional parameters |
| 23 | + |
| 24 | +* **configProperties**: A map of properties to pass to the Hadoop Configuration. For example, `{"fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"}`. |
| 25 | +* **drop**: A list of field names to drop. Mutually exclusive with 'keep' and 'only'. For example, `["field_to_drop_1", "field_to_drop_2"]`. |
| 26 | +* **filter**: A filter expression to apply to records from the Iceberg table. For example, `age > 18`. |
| 27 | +* **keep**: A list of field names to keep. Mutually exclusive with 'drop' and 'only'. For example, `["field_to_keep_1", "field_to_keep_2"]`. |
| 28 | +* **username**: The database username. For example, `my_user`. |
| 29 | +* **password**: The database password. For example, `my_secret_password`. |
| 30 | +* **connectionProperties**: A semicolon-separated list of key-value pairs for the JDBC connection. For example, `key1=value1;key2=value2`. |
| 31 | +* **alloydbTable**: The name of the database table. For example, `public.my_table`. |
| 32 | +* **query**: The SQL query/statement to execute on the source/sink. For example, `SELECT * FROM my_table WHERE status = 'active'`. |
| 33 | +* **batchSize**: The number of records to group together for each write. For example, `1000`. Defaults to: 1000. |
| 34 | +* **autosharding**: If true, a dynamic number of shards will be used for writing. For example, `False`. |
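
The descriptions above state that `drop` and `keep` are mutually exclusive. A minimal pre-flight sketch in shell, assuming the values are held in the `DROP` and `KEEP` variables used later in this README; the function name `check_field_selection` is hypothetical and not part of the template:

```shell
# Hypothetical helper, not part of the template: fail early when both
# 'drop' and 'keep' are supplied, since they are mutually exclusive.
check_field_selection() {
  # $1 = value intended for 'drop', $2 = value intended for 'keep'
  if [ -n "$1" ] && [ -n "$2" ]; then
    echo "error: 'drop' and 'keep' are mutually exclusive; set only one" >&2
    return 1
  fi
  return 0
}

# Example: validate the current environment before launching a job.
check_field_selection "${DROP:-}" "${KEEP:-}" \
  || echo "fix the parameters before running the template" >&2
```

The check only looks at whether each value is empty; it does not validate the list syntax itself.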
| 35 | + |
| 36 | + |
| 37 | + |
| 38 | +## Getting Started |
| 39 | + |
| 40 | +### Requirements |
| 41 | + |
| 42 | +* Java 17 |
| 43 | +* Maven |
| 44 | +* [gcloud CLI](https://cloud.google.com/sdk/gcloud), and execution of the |
| 45 | + following commands: |
| 46 | + * `gcloud auth login` |
| 47 | + * `gcloud auth application-default login` |
| 48 | + |
| 49 | +:star2: Those dependencies are pre-installed if you use Google Cloud Shell! |
| 50 | + |
| 51 | +[Open in Cloud Shell](https://console.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2FDataflowTemplates.git&cloudshell_open_in_editor=yaml/src/main/java/com/google/cloud/teleport/templates/yaml/IcebergToAlloyDBYaml.java) |
| 52 | + |
| 53 | +### Templates Plugin |
| 54 | + |
| 55 | +This README provides instructions using |
| 56 | +the [Templates Plugin](https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/main/contributor-docs/code-contributions.md#templates-plugin). |
| 57 | + |
| 58 | +#### Validating the Template |
| 59 | + |
| 60 | +This template has a validation command that is used to check code quality. |
| 61 | + |
| 62 | +```shell |
| 63 | +mvn clean install -PtemplatesValidate \ |
| 64 | +-DskipTests -am \ |
| 65 | +-pl yaml |
| 66 | +``` |
| 67 | + |
| 68 | +### Building Template |
| 69 | + |
| 70 | +This template is a Flex Template, meaning that the pipeline code will be |
| 71 | +containerized and the container will be executed on Dataflow. Please |
| 72 | +check [Use Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/using-flex-templates) |
| 73 | +and [Configure Flex Templates](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates) |
| 74 | +for more information. |
| 75 | + |
| 76 | +#### Staging the Template |
| 77 | + |
| 78 | +To only stage the template (i.e., make it available for use by |
| 79 | +the `gcloud` command or the Dataflow "Create job from template" UI), |
| 80 | +use the `-PtemplatesStage` profile: |
| 81 | + |
| 82 | +```shell |
| 83 | +export PROJECT=<my-project> |
| 84 | +export BUCKET_NAME=<bucket-name> |
| 85 | +export ARTIFACT_REGISTRY_REPO=<region>-docker.pkg.dev/$PROJECT/<repo> |
| 86 | + |
| 87 | +mvn clean package -PtemplatesStage \ |
| 88 | +-DskipTests \ |
| 89 | +-DprojectId="$PROJECT" \ |
| 90 | +-DbucketName="$BUCKET_NAME" \ |
| 91 | +-DartifactRegistry="$ARTIFACT_REGISTRY_REPO" \ |
| 92 | +-DstagePrefix="templates" \ |
| 93 | +-DtemplateName="Iceberg_To_AlloyDB_Yaml" \ |
| 94 | +-f yaml |
| 95 | +``` |
| 96 | + |
| 97 | +The `-DartifactRegistry` parameter can be specified to set the artifact registry repository of the Flex Templates image. |
| 98 | +If not provided, it defaults to `gcr.io/<project>`. |
| 99 | + |
| 100 | +The command should build and save the template to Google Cloud, and then print |
| 101 | +the complete location on Cloud Storage: |
| 102 | + |
| 103 | +``` |
| 104 | +Flex Template was staged! gs://<bucket-name>/templates/flex/Iceberg_To_AlloyDB_Yaml |
| 105 | +``` |
| 106 | + |
| 107 | +The specific path should be copied as it will be used in the following steps. |
| 108 | + |
| 109 | +#### Running the Template |
| 110 | + |
| 111 | +**Using the staged template**: |
| 112 | + |
| 113 | +You can use the path above to run the template (or share it with others for execution). |
| 114 | + |
| 115 | +To start a job with the template using `gcloud`, you need valid resources for |
| 116 | +the required parameters. |
| 117 | + |
| 118 | +With those in place, run the following command: |
| 119 | + |
| 120 | +```shell |
| 121 | +export PROJECT=<my-project> |
| 122 | +export BUCKET_NAME=<bucket-name> |
| 123 | +export REGION=us-central1 |
| 124 | +export TEMPLATE_SPEC_GCSPATH="gs://$BUCKET_NAME/templates/flex/Iceberg_To_AlloyDB_Yaml" |
| 125 | + |
| 126 | +### Required |
| 127 | +export TABLE=<table> |
| 128 | +export CATALOG_NAME=<catalogName> |
| 129 | +export CATALOG_PROPERTIES=<catalogProperties> |
| 130 | +export JDBC_URL=<jdbcUrl> |
| 131 | + |
| 132 | +### Optional |
| 133 | +export CONFIG_PROPERTIES=<configProperties> |
| 134 | +export DROP=<drop> |
| 135 | +export FILTER=<filter> |
| 136 | +export KEEP=<keep> |
| 137 | +export USERNAME=<username> |
| 138 | +export PASSWORD=<password> |
| 139 | +export CONNECTION_PROPERTIES=<connectionProperties> |
| 140 | +export ALLOYDB_TABLE=<alloydbTable> |
| 141 | +export QUERY=<query> |
| 142 | +export BATCH_SIZE=1000 |
| 143 | +export AUTOSHARDING=<autosharding> |
| 144 | + |
| 145 | +gcloud dataflow flex-template run "iceberg-to-alloydb-yaml-job" \ |
| 146 | + --project "$PROJECT" \ |
| 147 | + --region "$REGION" \ |
| 148 | + --template-file-gcs-location "$TEMPLATE_SPEC_GCSPATH" \ |
| 149 | + --parameters "table=$TABLE" \ |
| 150 | + --parameters "catalogName=$CATALOG_NAME" \ |
| 151 | + --parameters "catalogProperties=$CATALOG_PROPERTIES" \ |
| 152 | + --parameters "configProperties=$CONFIG_PROPERTIES" \ |
| 153 | + --parameters "drop=$DROP" \ |
| 154 | + --parameters "filter=$FILTER" \ |
| 155 | + --parameters "keep=$KEEP" \ |
| 156 | + --parameters "jdbcUrl=$JDBC_URL" \ |
| 157 | + --parameters "username=$USERNAME" \ |
| 158 | + --parameters "password=$PASSWORD" \ |
| 159 | + --parameters "connectionProperties=$CONNECTION_PROPERTIES" \ |
| 160 | + --parameters "alloydbTable=$ALLOYDB_TABLE" \ |
| 161 | + --parameters "query=$QUERY" \ |
| 162 | + --parameters "batchSize=$BATCH_SIZE" \ |
| 163 | + --parameters "autosharding=$AUTOSHARDING" |
| 164 | +``` |
| 165 | + |
| 166 | +For more information about the command, please check: |
| 167 | +https://cloud.google.com/sdk/gcloud/reference/dataflow/flex-template/run |
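
The command above passes every optional parameter, including both `drop` and `keep`, which the parameter list describes as mutually exclusive. One way to pass only the values that are actually set is to assemble the parameter list first. The sketch below is an illustration, not part of the template: `add_param` and `PARAMS` are hypothetical names, only a subset of the parameters is shown, and it assumes no value contains a comma.

```shell
# Hypothetical helper: append "name=value" to PARAMS, skipping empty values.
PARAMS=""
add_param() {
  # $1 = template parameter name, $2 = value (may be empty)
  [ -n "$2" ] || return 0
  PARAMS="${PARAMS:+$PARAMS,}$1=$2"
}

# Add only the parameters whose environment variables are set.
add_param table "${TABLE:-}"
add_param catalogName "${CATALOG_NAME:-}"
add_param jdbcUrl "${JDBC_URL:-}"
add_param drop "${DROP:-}"
add_param keep "${KEEP:-}"
add_param batchSize "${BATCH_SIZE:-}"

# Dry run: print the command instead of executing it.
echo gcloud dataflow flex-template run "iceberg-to-alloydb-yaml-job" \
  --project "${PROJECT:-}" \
  --region "${REGION:-}" \
  --template-file-gcs-location "${TEMPLATE_SPEC_GCSPATH:-}" \
  --parameters "$PARAMS"
```

Removing `echo` runs the job. Values that contain commas, such as the `catalogProperties` JSON example, need different handling; `gcloud topic escaping` describes an alternate-delimiter syntax for such cases.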
| 168 | + |
| 169 | + |
| 170 | +**Using the plugin**: |
| 171 | + |
| 172 | +Instead of just generating the template in the folder, it is possible to stage |
| 173 | +and run the template in a single command. This may be useful for testing when |
| 174 | +changing the templates. |
| 175 | + |
| 176 | +```shell |
| 177 | +export PROJECT=<my-project> |
| 178 | +export BUCKET_NAME=<bucket-name> |
| 179 | +export REGION=us-central1 |
| 180 | + |
| 181 | +### Required |
| 182 | +export TABLE=<table> |
| 183 | +export CATALOG_NAME=<catalogName> |
| 184 | +export CATALOG_PROPERTIES=<catalogProperties> |
| 185 | +export JDBC_URL=<jdbcUrl> |
| 186 | + |
| 187 | +### Optional |
| 188 | +export CONFIG_PROPERTIES=<configProperties> |
| 189 | +export DROP=<drop> |
| 190 | +export FILTER=<filter> |
| 191 | +export KEEP=<keep> |
| 192 | +export USERNAME=<username> |
| 193 | +export PASSWORD=<password> |
| 194 | +export CONNECTION_PROPERTIES=<connectionProperties> |
| 195 | +export ALLOYDB_TABLE=<alloydbTable> |
| 196 | +export QUERY=<query> |
| 197 | +export BATCH_SIZE=1000 |
| 198 | +export AUTOSHARDING=<autosharding> |
| 199 | + |
| 200 | +mvn clean package -PtemplatesRun \ |
| 201 | +-DskipTests \ |
| 202 | +-DprojectId="$PROJECT" \ |
| 203 | +-DbucketName="$BUCKET_NAME" \ |
| 204 | +-Dregion="$REGION" \ |
| 205 | +-DjobName="iceberg-to-alloydb-yaml-job" \ |
| 206 | +-DtemplateName="Iceberg_To_AlloyDB_Yaml" \ |
| 207 | +-Dparameters="table=$TABLE,catalogName=$CATALOG_NAME,catalogProperties=$CATALOG_PROPERTIES,configProperties=$CONFIG_PROPERTIES,drop=$DROP,filter=$FILTER,keep=$KEEP,jdbcUrl=$JDBC_URL,username=$USERNAME,password=$PASSWORD,connectionProperties=$CONNECTION_PROPERTIES,alloydbTable=$ALLOYDB_TABLE,query=$QUERY,batchSize=$BATCH_SIZE,autosharding=$AUTOSHARDING" \ |
| 208 | +-f yaml |
| 209 | +``` |
| 210 | + |
| 211 | +## Terraform |
| 212 | + |
| 213 | +Dataflow supports using Terraform to manage template jobs; |
| 214 | +see [dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job). |
| 215 | + |
| 216 | +Terraform modules have been generated for most templates in this repository. This includes the relevant parameters |
| 217 | +specific to the template. If available, they may be used instead of |
| 218 | +[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job) |
| 219 | +directly. |
| 220 | + |
| 221 | +To use the autogenerated module, execute the standard |
| 222 | +[terraform workflow](https://developer.hashicorp.com/terraform/intro/core-workflow): |
| 223 | + |
| 224 | +```shell |
| 225 | +cd v2/yaml/terraform/Iceberg_To_AlloyDB_Yaml |
| 226 | +terraform init |
| 227 | +terraform apply |
| 228 | +``` |
| 229 | + |
| 230 | +To use |
| 231 | +[dataflow_flex_template_job](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/dataflow_flex_template_job) |
| 232 | +directly: |
| 233 | + |
| 234 | +```terraform |
| 235 | +provider "google-beta" { |
| 236 | + project = var.project |
| 237 | +} |
| 238 | +variable "project" { |
| 239 | + default = "<my-project>" |
| 240 | +} |
| 241 | +variable "region" { |
| 242 | + default = "us-central1" |
| 243 | +} |
| 244 | + |
| 245 | +resource "google_dataflow_flex_template_job" "iceberg_to_alloydb_yaml" { |
| 246 | + |
| 247 | + provider = google-beta |
| 248 | + container_spec_gcs_path = "gs://dataflow-templates-${var.region}/latest/flex/Iceberg_To_AlloyDB_Yaml" |
| 249 | + name = "iceberg-to-alloydb-yaml" |
| 250 | + region = var.region |
| 251 | + parameters = { |
| 252 | + table = "<table>" |
| 253 | + catalogName = "<catalogName>" |
| 254 | + catalogProperties = "<catalogProperties>" |
| 255 | + jdbcUrl = "<jdbcUrl>" |
| 256 | + # configProperties = "<configProperties>" |
| 257 | + # drop = "<drop>" |
| 258 | + # filter = "<filter>" |
| 259 | + # keep = "<keep>" |
| 260 | + # username = "<username>" |
| 261 | + # password = "<password>" |
| 262 | + # connectionProperties = "<connectionProperties>" |
| 263 | + # alloydbTable = "<alloydbTable>" |
| 264 | + # query = "<query>" |
| 265 | + # batchSize = "1000" |
| 266 | + # autosharding = "<autosharding>" |
| 267 | + } |
| 268 | +} |
| 269 | +``` |