diff --git a/website/www/site/content/en/documentation/io/managed-io.md b/website/www/site/content/en/documentation/io/managed-io.md
index 0e365eeb23e3..a42794a3af05 100644
--- a/website/www/site/content/en/documentation/io/managed-io.md
+++ b/website/www/site/content/en/documentation/io/managed-io.md
@@ -58,6 +58,34 @@ and Beam SQL is invoked via the Managed API under the hood.
-| table | str | Identifier of the Iceberg table. |
+| bootstrap_servers | str | A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form `host1:port1,host2:port2,...` |
-| catalog_name | str | Name of the catalog containing the table. |
-| catalog_properties | map[str, str] | Properties used to set up the Iceberg catalog. |
+| topic | str | n/a |
-| config_properties | map[str, str] | Properties passed to the Hadoop Configuration. |
+| allow_duplicates | boolean | If the Kafka read allows duplicates. |
-| drop | list[str] | A subset of column names to exclude from reading. If null or empty, all columns will be read. |
+| confluent_schema_registry_subject | str | n/a |
-| filter | str | SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html |
+| confluent_schema_registry_url | str | n/a |
-| keep | list[str] | A subset of column names to read exclusively. If null or empty, all columns will be read. |
+| consumer_config_updates | map[str, str] | A list of key-value pairs that act as configuration parameters for Kafka consumers. Most of these configurations will not be needed, but if you need to customize your Kafka consumer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html |
-| Configuration | Type | Description |
-|---|---|---|
-| table | str | A fully-qualified table identifier. You may also provide a template to write to multiple dynamic destinations, for example: `dataset.my_{col1}_{col2.nested}_table`. |
+| file_descriptor_path | str | The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization. |
-| catalog_name | str | Name of the catalog containing the table. |
+| format | str | The encoding format for the data stored in Kafka. Valid options are: RAW,STRING,AVRO,JSON,PROTO |
-| catalog_properties | map[str, str] | Properties used to set up the Iceberg catalog. |
+| message_name | str | The name of the Protocol Buffer message to be used for schema extraction and data conversion. |
-| config_properties | map[str, str] | Properties passed to the Hadoop Configuration. |
+| offset_deduplication | boolean | If the redistribute is using offset deduplication mode. |
-| direct_write_byte_limit | int32 | For a streaming pipeline, sets the limit for lifting bundles into the direct write path. |
+| redistribute_by_record_key | boolean | If the redistribute keys by the Kafka record key. |
-| drop | list[str] | A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'. |
+| redistribute_num_keys | int32 | The number of keys for redistributing Kafka inputs. |
-| keep | list[str] | A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'. |
+| redistributed | boolean | If the Kafka read should be redistributed. |
-| only | str | The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'. |
-| partition_fields | list[str] | Fields used to create a partition spec that is applied when tables are created. For a field 'foo', the available partition transforms are: `foo`, `truncate(foo, N)`, `bucket(foo, N)`, `hour(foo)`, `day(foo)`, `month(foo)`, `year(foo)`, `void(foo)`. For more information on partition transforms, please visit https://iceberg.apache.org/spec/#partition-transforms. |
-| table_properties | map[str, str] | Iceberg table properties to be set on the table when it is created. For more information on table properties, please visit https://iceberg.apache.org/docs/latest/configuration/#table-properties. |
-| triggering_frequency_seconds | int32 | For a streaming pipeline, sets the frequency at which snapshots are produced. |
+| schema | str | The schema in which the data is encoded in the Kafka topic. For AVRO data, this is a schema defined with AVRO schema syntax (https://avro.apache.org/docs/1.10.2/spec.html#schemas). For JSON data, this is a schema defined with JSON-schema syntax (https://json-schema.org/). If a URL to Confluent Schema Registry is provided, then this field is ignored, and the schema is fetched from Confluent Schema Registry. |
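The rows added above make up the configuration for the managed Kafka read. A minimal Python sketch of passing them through Beam's Managed API follows; the broker address, topic name, and JSON schema are placeholders, and running it needs an environment that can expand managed (Java-backed) transforms, such as Dataflow or a machine with Java available.

```python
import apache_beam as beam
from apache_beam.transforms import managed

# Placeholder JSON schema for the records on the topic.
RECORD_SCHEMA = '{"type": "object", "properties": {"id": {"type": "integer"}}}'

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "ReadKafka" >> managed.Read(
            managed.KAFKA,
            config={
                "bootstrap_servers": "localhost:9092",  # placeholder broker
                "topic": "my_topic",                     # placeholder topic
                "format": "JSON",
                "schema": RECORD_SCHEMA,
            })
        | "Print" >> beam.Map(print))
```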
-| table | str | Identifier of the Iceberg table. |
-| catalog_name | str | Name of the catalog containing the table. |
-| catalog_properties | map[str, str] | Properties used to set up the Iceberg catalog. |
-| config_properties | map[str, str] | Properties passed to the Hadoop Configuration. |
-| drop | list[str] | A subset of column names to exclude from reading. If null or empty, all columns will be read. |
+| bootstrap_servers | str | A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. | Format: host1:port1,host2:port2,... |
-| filter | str | SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html |
-| from_snapshot | int64 | Starts reading from this snapshot ID (inclusive). |
-| from_timestamp | int64 | Starts reading from the first snapshot (inclusive) that was created after this timestamp (in milliseconds). |
-| keep | list[str] | A subset of column names to read exclusively. If null or empty, all columns will be read. |
+| format | str | The encoding format for the data stored in Kafka. Valid options are: RAW,JSON,AVRO,PROTO |
-| poll_interval_seconds | int32 | The interval at which to poll for new snapshots. Defaults to 60 seconds. |
+| topic | str | n/a |
-| starting_strategy | str | The source's starting strategy. Valid options are: "earliest" or "latest". Can be overriden by setting a starting snapshot or timestamp. Defaults to earliest for batch, and latest for streaming. |
+| file_descriptor_path | str | The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization. |
-| streaming | boolean | Enables streaming reads, where source continuously polls for snapshots forever. |
+| message_name | str | The name of the Protocol Buffer message to be used for schema extraction and data conversion. |
-| to_snapshot | int64 | Reads up to this snapshot ID (inclusive). |
+| producer_config_updates | map[str, str] | A list of key-value pairs that act as configuration parameters for Kafka producers. Most of these configurations will not be needed, but if you need to customize your Kafka producer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html |
-| to_timestamp | int64 | Reads up to the latest snapshot (inclusive) created before this timestamp (in milliseconds). |
+| schema | str | n/a |
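The rows just added are the write-side counterpart for Kafka. A short, hedged sketch, again with a placeholder broker and topic:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Create" >> beam.Create([beam.Row(id=1, name="a"), beam.Row(id=2, name="b")])
        | "WriteKafka" >> managed.Write(
            managed.KAFKA,
            config={
                "bootstrap_servers": "localhost:9092",  # placeholder broker
                "topic": "my_topic",                     # placeholder topic
                "format": "JSON",
            }))
```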
-| bootstrap_servers | str | A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. | Format: host1:port1,host2:port2,... |
+| table | str | Identifier of the Iceberg table. |
-| format | str | The encoding format for the data stored in Kafka. Valid options are: RAW,JSON,AVRO,PROTO |
+| catalog_name | str | Name of the catalog containing the table. |
-| topic | str | n/a |
+| catalog_properties | map[str, str] | Properties used to set up the Iceberg catalog. |
-| file_descriptor_path | str | The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization. |
+| config_properties | map[str, str] | Properties passed to the Hadoop Configuration. |
-| message_name | str | The name of the Protocol Buffer message to be used for schema extraction and data conversion. |
+| drop | list[str] | A subset of column names to exclude from reading. If null or empty, all columns will be read. |
-| producer_config_updates | map[str, str] | A list of key-value pairs that act as configuration parameters for Kafka producers. Most of these configurations will not be needed, but if you need to customize your Kafka producer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html |
+| filter | str | SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html |
-| schema | str | n/a |
+| keep | list[str] | A subset of column names to read exclusively. If null or empty, all columns will be read. |
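The rows added here form the batch Iceberg read configuration. A hedged sketch, assuming a Hadoop-type catalog and placeholder table, catalog, and warehouse names:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "ReadIceberg" >> managed.Read(
            managed.ICEBERG,
            config={
                "table": "db.my_table",            # placeholder table identifier
                "catalog_name": "my_catalog",      # placeholder catalog
                "catalog_properties": {
                    "type": "hadoop",              # assumed Hadoop-type catalog
                    "warehouse": "gs://my-bucket/warehouse",
                },
                "filter": "id > 5 AND status = 'ACTIVE'",
                "keep": ["id", "status"],
            })
        | "Print" >> beam.Map(print))
```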
-| bootstrap_servers | str | A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form `host1:port1,host2:port2,...` |
+| table | str | A fully-qualified table identifier. You may also provide a template to write to multiple dynamic destinations, for example: `dataset.my_{col1}_{col2.nested}_table`. |
-| topic | str | n/a |
+| catalog_name | str | Name of the catalog containing the table. |
-| allow_duplicates | boolean | If the Kafka read allows duplicates. |
+| catalog_properties | map[str, str] | Properties used to set up the Iceberg catalog. |
-| confluent_schema_registry_subject | str | n/a |
+| config_properties | map[str, str] | Properties passed to the Hadoop Configuration. |
-| confluent_schema_registry_url | str | n/a |
+| direct_write_byte_limit | int32 | For a streaming pipeline, sets the limit for lifting bundles into the direct write path. |
-| consumer_config_updates | map[str, str] | A list of key-value pairs that act as configuration parameters for Kafka consumers. Most of these configurations will not be needed, but if you need to customize your Kafka consumer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html |
+| drop | list[str] | A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'. |
-| file_descriptor_path | str | The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization. |
+| keep | list[str] | A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'. |
-| format | str | The encoding format for the data stored in Kafka. Valid options are: RAW,STRING,AVRO,JSON,PROTO |
+| only | str | The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'. |
-| message_name | str | The name of the Protocol Buffer message to be used for schema extraction and data conversion. |
+| partition_fields | list[str] | Fields used to create a partition spec that is applied when tables are created. For a field 'foo', the available partition transforms are: `foo`, `truncate(foo, N)`, `bucket(foo, N)`, `hour(foo)`, `day(foo)`, `month(foo)`, `year(foo)`, `void(foo)`. For more information on partition transforms, please visit https://iceberg.apache.org/spec/#partition-transforms. |
-| offset_deduplication | boolean | If the redistribute is using offset deduplication mode. |
+| table_properties | map[str, str] | Iceberg table properties to be set on the table when it is created. For more information on table properties, please visit https://iceberg.apache.org/docs/latest/configuration/#table-properties. |
-| redistribute_by_record_key | boolean | If the redistribute keys by the Kafka record key. |
+| triggering_frequency_seconds | int32 | For a streaming pipeline, sets the frequency at which snapshots are produced. |
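For the Iceberg write rows above, a sketch with a dynamic-destination table template and placeholder catalog settings:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Create" >> beam.Create([beam.Row(col1="a", value=1), beam.Row(col1="b", value=2)])
        | "WriteIceberg" >> managed.Write(
            managed.ICEBERG,
            config={
                # Template resolves to one destination table per value of col1.
                "table": "dataset.my_{col1}_table",
                "catalog_name": "my_catalog",      # placeholder catalog
                "catalog_properties": {
                    "type": "hadoop",              # assumed Hadoop-type catalog
                    "warehouse": "gs://my-bucket/warehouse",
                },
                "partition_fields": ["col1"],
            }))
```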
+| Configuration | Type | Description |
+|---|---|---|
-| redistribute_num_keys | int32 | The number of keys for redistributing Kafka inputs. |
+| table | str | Identifier of the Iceberg table. |
-| redistributed | boolean | If the Kafka read should be redistributed. |
+| catalog_name | str | Name of the catalog containing the table. |
-| schema | str | The schema in which the data is encoded in the Kafka topic. For AVRO data, this is a schema defined with AVRO schema syntax (https://avro.apache.org/docs/1.10.2/spec.html#schemas). For JSON data, this is a schema defined with JSON-schema syntax (https://json-schema.org/). If a URL to Confluent Schema Registry is provided, then this field is ignored, and the schema is fetched from Confluent Schema Registry. |
+| catalog_properties | map[str, str] | Properties used to set up the Iceberg catalog. |
-| Configuration | Type | Description |
-|---|---|---|
+| config_properties | map[str, str] | Properties passed to the Hadoop Configuration. |
-| jdbc_url | str | Connection URL for the JDBC sink. |
+| drop | list[str] | A subset of column names to exclude from reading. If null or empty, all columns will be read. |
+| filter | str | SQL-like predicate to filter data at scan time. Example: "id > 5 AND status = 'ACTIVE'". Uses Apache Calcite syntax: https://calcite.apache.org/docs/reference.html |
-| autosharding | boolean | If true, enables using a dynamically determined number of shards to write. |
+| from_snapshot | int64 | Starts reading from this snapshot ID (inclusive). |
-| batch_size | int64 | n/a |
+| from_timestamp | int64 | Starts reading from the first snapshot (inclusive) that was created after this timestamp (in milliseconds). |
-| connection_properties | str | Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;". |
+| keep | list[str] | A subset of column names to read exclusively. If null or empty, all columns will be read. |
-| location | str | Name of the table to write to. |
+| poll_interval_seconds | int32 | The interval at which to poll for new snapshots. Defaults to 60 seconds. |
-| password | str | Password for the JDBC source. |
+| starting_strategy | str | The source's starting strategy. Valid options are: "earliest" or "latest". Can be overriden by setting a starting snapshot or timestamp. Defaults to earliest for batch, and latest for streaming. |
-| username | str | Username for the JDBC source. |
+| streaming | boolean | Enables streaming reads, where source continuously polls for snapshots forever. |
-| write_statement | str | SQL query used to insert records into the JDBC sink. |
+| to_snapshot | int64 | Reads up to this snapshot ID (inclusive). |
+| to_timestamp | int64 | Reads up to the latest snapshot (inclusive) created before this timestamp (in milliseconds). |
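The snapshot- and streaming-related rows above describe the incremental, optionally streaming Iceberg read. A sketch under the assumption that your Beam version exposes an `iceberg_cdc` managed identifier (the identifier name is an assumption here); all values are placeholders:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "ReadIcebergCdc" >> managed.Read(
            "iceberg_cdc",  # assumed identifier; check your Beam version
            config={
                "table": "db.my_table",        # placeholder
                "catalog_name": "my_catalog",  # placeholder
                "catalog_properties": {
                    "type": "hadoop",
                    "warehouse": "gs://my-bucket/warehouse",
                },
                "streaming": True,
                "poll_interval_seconds": 60,
                "starting_strategy": "latest",
            })
        | "Print" >> beam.Map(print))
```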
+| disable_auto_commit | boolean | Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true. |
 fetch_size
@@ -1112,7 +1023,7 @@ For more information on table properties, please visit https://iceberg.apache.or
-| kms_key | str | Use this Cloud KMS key to encrypt your data |
-| query | str | The SQL query to be executed to read from the BigQuery table. |
+| jdbc_url | str | Connection URL for the JDBC sink. |
-| row_restriction | str | Read only rows that match this filter, which must be compatible with Google standard SQL. This is not supported when reading via query. |
+| autosharding | boolean | If true, enables using a dynamically determined number of shards to write. |
-| fields | list[str] | Read only the specified fields (columns) from a BigQuery table. Fields may not be returned in the order specified. If no value is specified, then all fields are returned. Example: "col1, col2, col3" |
+| batch_size | int64 | n/a |
-| table | str | The fully-qualified name of the BigQuery table to read from. Format: [${PROJECT}:]${DATASET}.${TABLE} |
+| connection_properties | str | Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;". |
-| Configuration | Type | Description |
-|---|---|---|
-| table | str | The bigquery table to write to. Format: [${PROJECT}:]${DATASET}.${TABLE} |
-| drop | list[str] | A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'. |
-| keep | list[str] | A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'. |
+| location | str | Name of the table to write to. |
-| kms_key | str | Use this Cloud KMS key to encrypt your data |
+| password | str | Password for the JDBC source. |
-| only | str | The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'. |
+| username | str | Username for the JDBC source. |
-| triggering_frequency_seconds | int64 | Determines how often to 'commit' progress into BigQuery. Default is every 5 seconds. |
+| write_statement | str | SQL query used to insert records into the JDBC sink. |
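The JDBC write rows above do not map to a dedicated managed identifier in this excerpt, so the sketch below only shows a configuration dict with the documented keys and value formats; every value is a placeholder:

```python
# Hedged sketch of a JDBC sink configuration dict matching the table above.
# Only the key names and the "key1=value1;key2=value2;" connection_properties
# format come from the documentation; all values are placeholders.
jdbc_write_config = {
    "jdbc_url": "jdbc:postgresql://localhost:5432/mydb",
    "location": "my_table",
    "username": "beam_user",
    "password": "changeme",
    "connection_properties": "ssl=false;loginTimeout=10;",
    "autosharding": True,
    "batch_size": 1000,
}
```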
-| disable_auto_commit | boolean | Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true. |
 fetch_size
@@ -1390,7 +1245,7 @@ For more information on table properties, please visit https://iceberg.apache.or
+| Configuration | Type | Description |
+|---|---|---|
+| kms_key | str | Use this Cloud KMS key to encrypt your data |
+| query | str | The SQL query to be executed to read from the BigQuery table. |
+| row_restriction | str | Read only rows that match this filter, which must be compatible with Google standard SQL. This is not supported when reading via query. |
+| fields | list[str] | Read only the specified fields (columns) from a BigQuery table. Fields may not be returned in the order specified. If no value is specified, then all fields are returned. Example: "col1, col2, col3" |
+| table | str | The fully-qualified name of the BigQuery table to read from. Format: [${PROJECT}:]${DATASET}.${TABLE} |
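The added BigQuery read table can be exercised through the Managed API as well, assuming a Beam version that exposes the `bigquery` managed transform; project, dataset, and table names below are placeholders:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "ReadBigQuery" >> managed.Read(
            managed.BIGQUERY,
            config={
                "table": "my_project:my_dataset.my_table",  # placeholder
                "fields": ["col1", "col2", "col3"],
                "row_restriction": "col1 > 5",
            })
        | "Print" >> beam.Map(print))
```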
+| Configuration | Type | Description |
+|---|---|---|
+| table | str | The bigquery table to write to. Format: [${PROJECT}:]${DATASET}.${TABLE} |
+| drop | list[str] | A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'. |
+| keep | list[str] | A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'. |
+| kms_key | str | Use this Cloud KMS key to encrypt your data |
+| only | str | The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'. |
+| triggering_frequency_seconds | int64 | Determines how often to 'commit' progress into BigQuery. Default is every 5 seconds. |
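And the matching write-side sketch, with the same caveats and placeholder names:

```python
import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "Create" >> beam.Create([beam.Row(col1="a", col2=1), beam.Row(col1="b", col2=2)])
        | "WriteBigQuery" >> managed.Write(
            managed.BIGQUERY,
            config={
                "table": "my_project:my_dataset.my_table",  # placeholder
                "keep": ["col1", "col2"],
            }))
```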