
Commit ca80cdc

kbatuigas and Feediver1 authored and committed
[25.2] Iceberg - AWS Glue (#1208)
Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com>
1 parent b2654cc commit ca80cdc

5 files changed

Lines changed: 238 additions & 5 deletions


modules/ROOT/nav.adoc

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -172,9 +172,11 @@
172172
*** xref:manage:iceberg/about-iceberg-topics.adoc[About Iceberg Topics]
173173
*** xref:manage:iceberg/specify-iceberg-schema.adoc[Specify Iceberg Schema]
174174
*** xref:manage:iceberg/use-iceberg-catalogs.adoc[Use Iceberg Catalogs]
175+
*** xref:manage:iceberg/rest-catalog/index.adoc[Integrate with REST Catalogs]
176+
**** xref:manage:iceberg/iceberg-topics-aws-glue.adoc[AWS Glue]
177+
**** xref:manage:iceberg/iceberg-topics-databricks-unity.adoc[Databricks Unity Catalog]
178+
**** xref:manage:iceberg/redpanda-topics-iceberg-snowflake-catalog.adoc[Snowflake and Open Catalog]
175179
*** xref:manage:iceberg/query-iceberg-topics.adoc[Query Iceberg Topics]
176-
*** xref:manage:iceberg/iceberg-topics-databricks-unity.adoc[Query Iceberg Topics with Databricks Unity Catalog]
177-
*** xref:manage:iceberg/redpanda-topics-iceberg-snowflake-catalog.adoc[Query Iceberg Topics with Snowflake and Open Catalog]
178180
** xref:manage:schema-reg/index.adoc[Schema Registry]
179181
*** xref:manage:schema-reg/schema-reg-overview.adoc[Overview]
180182
*** xref:manage:schema-reg/manage-schema-reg.adoc[]

modules/get-started/pages/release-notes/redpanda.adoc

Lines changed: 12 additions & 0 deletions
@@ -7,6 +7,12 @@ This topic includes new content added in version {page-component-version}. For a
77
* xref:redpanda-cloud:get-started:whats-new-cloud.adoc[]
88
* xref:redpanda-cloud:get-started:cloud-overview.adoc#redpanda-cloud-vs-self-managed-feature-compatibility[Redpanda Cloud vs Self-Managed feature compatibility]
99
10+
== Iceberg topics with AWS Glue
11+
12+
A new xref:manage:iceberg/iceberg-topics-aws-glue.adoc[integration with AWS Glue Data Catalog] allows you to add Redpanda topics as Iceberg tables in your data lakehouse. The AWS Glue catalog integration is available in Redpanda version 25.1.7 and later.
13+
14+
See xref:manage:iceberg/rest-catalog/index.adoc[] for supported Iceberg REST catalog integrations.
15+
1016
== JSON Schema support for Iceberg topics
1117

1218
Redpanda now supports JSON Schema for Iceberg topics. This allows you to use all supported schema types (Protobuf, Avro, and JSON Schema) for Iceberg topics. For more information, see xref:manage:iceberg/specify-iceberg-schema.adoc[].
@@ -50,4 +56,10 @@ If you need to maintain the current HTTP Proxy functionality while transitioning
5056
- xref:reference:properties/broker-properties.adoc#scram_password[`scram_password`]: Password for SASL/SCRAM authentication
5157
- xref:reference:properties/broker-properties.adoc#sasl_mechanism[`sasl_mechanism`]: SASL mechanism (typically `SCRAM-SHA-256` or `SCRAM-SHA-512`)
5258

59+
== Cluster properties
60+
61+
The following cluster properties are new in this version:
62+
63+
=== Iceberg integration
5364

65+
* config_ref:iceberg_rest_catalog_base_location,true,properties/cluster-properties[`iceberg_rest_catalog_base_location`]: Specifies the base location for the Iceberg REST catalog. Required for AWS Glue Data Catalog.
Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
1+
= Query Iceberg Topics using AWS Glue
2+
:description: Add Redpanda topics as Iceberg tables that you can query from AWS Glue Data Catalog.
3+
:page-categories: Iceberg, Tiered Storage, Management, High Availability, Data Replication, Integration
4+
:page-beta: true
5+
ifdef::env-cloud[]
6+
:rpk-install-doc: manage:rpk/rpk-install.adoc
7+
endif::[]
8+
ifndef::env-cloud[]
9+
:rpk-install-doc: get-started:rpk-install.adoc
10+
endif::[]
11+
12+
13+
[NOTE]
14+
====
15+
include::shared:partial$enterprise-license.adoc[]
16+
====
17+
18+
// tag::single-source[]
19+
20+
This guide walks you through querying Redpanda topics as Iceberg tables stored in AWS S3, using a catalog integration with https://docs.aws.amazon.com/glue/latest/dg/components-overview.html#data-catalog-intro[AWS Glue^]. For general information about Iceberg catalog integrations in Redpanda, see xref:manage:iceberg/use-iceberg-catalogs.adoc[].
21+
22+
== Prerequisites
23+
24+
* Redpanda version 25.1.7 or later.
25+
* xref:{rpk-install-doc}[`rpk`] installed or updated to the latest version.
26+
ifdef::env-cloud[]
27+
** You can also use the Redpanda Cloud API to xref:manage:cluster-maintenance/config-cluster.adoc#set-cluster-configuration-properties[reference secrets in your cluster configuration].
28+
endif::[]
29+
ifndef::env-cloud[]
30+
* xref:manage:tiered-storage.adoc#configure-object-storage[Object storage configured] for your cluster and xref:manage:tiered-storage.adoc#enable-tiered-storage[Tiered Storage enabled] for the topics for which you want to generate Iceberg tables.
31+
+
32+
You also use the S3 bucket URI to set the base location for AWS Glue Data Catalog.
33+
endif::[]
34+
* Admin permissions to create IAM policies and roles in AWS.
35+
36+
== Limitations
37+
38+
=== Nested partition spec support
39+
40+
AWS Glue does not support partitioning on nested fields. If Redpanda detects that
41+
the default partitioning `(hour(redpanda.timestamp))` is in use, it instead applies an empty partition spec `()`, so the table is not partitioned.
42+
43+
If you want to use partitioning, you must specify a custom partition specification using your own partition columns (columns that are not nested).
44+
45+
== Authorize access to AWS Glue
46+
47+
You must allow Redpanda access to AWS Glue services in your AWS account. You can use the same access credentials that you configured for S3 (IAM role, access keys, and KMS key), as long as you have also added read and write access to AWS Glue Data Catalog.
48+
49+
For example, you could create a separate IAM policy that manages access to AWS Glue, and attach it to the IAM role that Redpanda also uses to access S3. Redpanda recommends allowing all AWS Glue API actions (`"glue:*"`) in the policy on the following resources:
50+
51+
- Root catalog (`catalog`)
52+
- All databases (`database/*`)
53+
- All tables (`table/\*/*`)
54+
55+
Your policy should include a statement similar to the following:
56+
57+
[,json]
58+
----
59+
{
60+
"Version": "2012-10-17",
61+
"Statement": [
62+
{
63+
"Effect": "Allow",
64+
"Action": [
65+
"glue:*"
66+
],
67+
"Resource": [
68+
"arn:aws:glue:<aws-region>:<aws-account-id>:catalog",
69+
"arn:aws:glue:<aws-region>:<aws-account-id>:database/*",
70+
"arn:aws:glue:<aws-region>:<aws-account-id>:table/*/*"
71+
]
72+
}
73+
]
74+
}
75+
----
76+
77+
For more information on configuring IAM permissions, see the https://docs.aws.amazon.com/glue/latest/dg/configure-iam-for-glue.html[AWS Glue documentation^].
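
As a rough sketch, the example policy above can be rendered for a specific region and account with a small shell helper. The function name `render_glue_policy` and the sample region and account ID are hypothetical; the JSON body mirrors the example policy statement.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of rpk or the AWS CLI): prints the example
# IAM policy with the region and account ID filled in.
render_glue_policy() {
  local region="$1" account_id="$2"
  local arn_prefix="arn:aws:glue:${region}:${account_id}"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["glue:*"],
      "Resource": [
        "${arn_prefix}:catalog",
        "${arn_prefix}:database/*",
        "${arn_prefix}:table/*/*"
      ]
    }
  ]
}
EOF
}

# Example with placeholder values; replace them with your own.
render_glue_policy "us-east-1" "123456789012"
```

You could redirect the output to a file and attach the resulting policy to the IAM role with whichever IAM tooling you already use.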
78+
79+
== Configure authentication and credentials
80+
81+
You can configure credentials for the AWS Glue Data Catalog integration in either of the following ways:
82+
83+
* Allow Redpanda to use the same `cloud_storage_*` credential properties configured for S3. If you do not configure the overrides listed below, Redpanda uses the same credentials for both S3 and AWS Glue. This is the recommended approach.
84+
* If you want to configure authentication to AWS Glue separately from authentication to S3, there are equivalent credential configuration properties named `iceberg_rest_catalog_aws_*` that override the object storage credentials. These properties only apply to REST catalog authentication, and never to S3 authentication:
85+
** config_ref:iceberg_rest_catalog_aws_access_key,true,properties/cluster-properties[`iceberg_rest_catalog_aws_access_key`] overrides config_ref:cloud_storage_access_key,true,properties/cluster-properties[`cloud_storage_access_key`]
86+
** config_ref:iceberg_rest_catalog_aws_secret_key,true,properties/cluster-properties[`iceberg_rest_catalog_aws_secret_key`] overrides config_ref:cloud_storage_secret_key,true,properties/cluster-properties[`cloud_storage_secret_key`]
87+
** config_ref:iceberg_rest_catalog_aws_region,true,properties/cluster-properties[`iceberg_rest_catalog_aws_region`] overrides config_ref:cloud_storage_region,true,properties/cluster-properties[`cloud_storage_region`]
88+
** config_ref:iceberg_rest_catalog_aws_credentials_source,true,properties/cluster-properties[`iceberg_rest_catalog_aws_credentials_source`] overrides config_ref:cloud_storage_credentials_source,true,properties/cluster-properties[`cloud_storage_credentials_source`]
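
The override behavior described above can be pictured as a simple fallback. The following is an illustrative sketch only, not Redpanda's implementation: `effective_glue_setting` is a hypothetical helper, and the region values are placeholders.

```bash
#!/usr/bin/env bash
# Illustrative only: models the documented precedence, where an
# iceberg_rest_catalog_aws_* value overrides its cloud_storage_* counterpart
# for AWS Glue requests. effective_glue_setting is a hypothetical helper,
# not an rpk command.
effective_glue_setting() {
  local override="$1" storage_value="$2"
  # Use the override when it is set and non-empty; otherwise fall back.
  printf '%s\n' "${override:-$storage_value}"
}

# No override configured: AWS Glue requests use the S3 region.
effective_glue_setting "" "us-east-1"          # prints us-east-1
# Override configured: AWS Glue requests use the override region.
effective_glue_setting "us-west-2" "us-east-1" # prints us-west-2
```

In both cases the S3 side keeps using the `cloud_storage_*` value, because the overrides apply only to REST catalog authentication.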
89+
90+
== Update cluster configuration
91+
92+
To configure your Redpanda cluster to enable Iceberg on a topic and integrate with the AWS Glue Data Catalog:
93+
94+
. Edit your cluster configuration to set the `iceberg_enabled` property to `true`, and set the catalog integration properties listed in the example below.
95+
ifndef::env-cloud[]
96+
+
97+
Run `rpk cluster config edit` to update these properties:
98+
+
99+
[,yaml]
100+
----
101+
iceberg_enabled: true
102+
iceberg_catalog_type: rest
103+
iceberg_rest_catalog_endpoint: https://glue.<aws-region>.amazonaws.com/iceberg
104+
iceberg_rest_catalog_authentication_mode: aws_sigv4
105+
iceberg_rest_catalog_base_location: s3://<bucket-name>/<warehouse-path>
106+
----
107+
endif::[]
108+
ifdef::env-cloud[]
109+
Use `rpk` as shown in the following example, or use the Cloud API to xref:manage:cluster-maintenance/config-cluster.adoc#set-cluster-configuration-properties[update these cluster properties]. The update might take several minutes to complete.
110+
+
111+
To reference a secret in a cluster property, you must first xref:manage:iceberg/use-iceberg-catalogs.adoc#store-a-secret-for-rest-catalog-authentication[store the secret value].
112+
+
113+
[,bash]
114+
----
115+
rpk cloud login
116+
117+
rpk profile create --from-cloud <cluster-id>
118+
119+
rpk cluster config set \
120+
iceberg_enabled=true \
121+
iceberg_catalog_type=rest \
122+
iceberg_rest_catalog_endpoint=https://glue.<aws-region>.amazonaws.com/iceberg \
123+
iceberg_rest_catalog_authentication_mode=aws_sigv4 \
124+
iceberg_rest_catalog_base_location=s3://<bucket-name>/<warehouse-path>
125+
126+
----
127+
endif::[]
128+
+
129+
Use your own values for the following placeholders:
130+
+
131+
--
132+
- `<aws-region>`: The AWS region where your Data Catalog is located. The region in the AWS Glue endpoint must match the region specified in either your config_ref:cloud_storage_region,true,properties/cluster-properties[`cloud_storage_region`] or config_ref:iceberg_rest_catalog_aws_region,true,properties/cluster-properties[`iceberg_rest_catalog_aws_region`] property.
133+
- `<bucket-name>` and `<warehouse-path>`: AWS Glue requires you to specify the base location where Redpanda stores Iceberg data and metadata files. You must use an S3 URI; for example, `s3://<bucket-name>/iceberg`. As a security best practice, Redpanda Data recommends specifying a subfolder (using prefixes) rather than the root of the bucket.
134+
--
135+
+
136+
[,bash,role=no-copy]
137+
----
138+
Successfully updated configuration. New configuration version is 2.
139+
----
140+
141+
ifndef::env-cloud[]
142+
. If you change the configuration for a running cluster, you must restart that cluster now.
143+
endif::[]
144+
145+
. Enable the integration for a topic by configuring the topic property `redpanda.iceberg.mode`. The following examples show how to use xref:{rpk-install-doc}[`rpk`] to either create a new topic or alter the configuration for an existing topic and set the Iceberg mode to `key_value`. The `key_value` mode creates a two-column Iceberg table for the topic, with one column for the record metadata including the key, and another binary column for the record's value. See xref:manage:iceberg/choose-iceberg-mode.adoc[] for more details on Iceberg modes.
146+
+
147+
.Create a new topic and set `redpanda.iceberg.mode`:
148+
[,bash]
149+
----
150+
rpk topic create <topic-name> --topic-config=redpanda.iceberg.mode=key_value
151+
----
152+
+
153+
.Set `redpanda.iceberg.mode` for an existing topic:
154+
[,bash]
155+
----
156+
rpk topic alter-config <topic-name> --set redpanda.iceberg.mode=key_value
157+
----
158+
159+
. Produce to the topic. For example:
160+
+
161+
[,bash]
162+
----
163+
printf 'hello world\nfoo bar\nbaz qux\n' | rpk topic produce <topic-name> --format='%k %v\n'
164+
----
165+
166+
You should see the topic as a table with data in AWS Glue Data Catalog. The data may take some time to become visible, depending on your config_ref:iceberg_target_lag_ms,true,properties/cluster-properties[`iceberg_target_lag_ms`] setting.
167+
168+
. In AWS Glue Studio, go to Databases.
169+
. Select the `redpanda` database. Redpanda automatically creates the `redpanda` database and the table within it. The table name is the same as the topic name.
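
If the table does not appear, one thing worth checking is the pair of values built from the `<aws-region>`, `<bucket-name>`, and `<warehouse-path>` placeholders in the cluster configuration step. The sketch below uses hypothetical shell helpers (`glue_endpoint` and `check_base_location`) that only inspect strings; they do not call `rpk` or AWS, and the bucket name is a placeholder.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; they do not call rpk or AWS.

# Build the AWS Glue Iceberg REST endpoint for a region.
glue_endpoint() {
  printf 'https://glue.%s.amazonaws.com/iceberg\n' "$1"
}

# Succeed only for an S3 URI that names a bucket and a non-empty prefix,
# following the recommendation to avoid the bucket root.
check_base_location() {
  case "$1" in
    s3://?*/?*) return 0 ;;
    *) return 1 ;;
  esac
}

glue_endpoint "us-east-1"
check_base_location "s3://my-bucket/iceberg" && echo "base location OK"
check_base_location "s3://my-bucket" || echo "bucket root: add a prefix"
```

The region passed to `glue_endpoint` should be the same region set in `cloud_storage_region` or `iceberg_rest_catalog_aws_region`.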
170+
171+
== Query Iceberg table
172+
173+
You can query the Iceberg table using different engines, such as Amazon Athena, PyIceberg, or Apache Spark. To query the table or view the table data in AWS Glue, ensure that your account has the necessary permissions to access the catalog, database, and table.
174+
175+
To query the table in Amazon Athena:
176+
177+
. On the list of tables in AWS Glue Studio, click "Table data" under the *View data* column.
178+
. Click "Proceed" to be redirected to the Athena query editor.
179+
. In the query editor, select AwsDataCatalog as the data source, and select the `redpanda` database.
180+
. The SQL query editor should be pre-populated with a query that selects 10 rows from the Iceberg table. Run the query to see a preview of the table data. The following example limits the result to a single row:
181+
+
182+
[,sql]
183+
----
184+
SELECT * FROM "AwsDataCatalog"."redpanda"."<table-name>" limit 1;
185+
----
186+
+
187+
Your query results should look like the following:
188+
+
189+
[,sql,role="no-copy no-wrap"]
190+
----
191+
+-----------------------------------------------------+----------------+
192+
| redpanda | value |
193+
+-----------------------------------------------------+----------------+
194+
| {partition=0, offset=0, timestamp=2025-07-21 | 77 6f 72 6c 64 |
195+
| 18:11:25.070000, headers=null, key=[B@1900af31} | |
196+
+-----------------------------------------------------+----------------+
197+
----
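
The `value` column in the sample output is binary, so it is displayed as space-separated hexadecimal bytes. As a minimal sketch, the following shell function decodes such a string back to text, assuming the value is printable ASCII. The name `hex_to_text` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: converts space-separated hex bytes to text.
hex_to_text() {
  local byte
  for byte in $1; do
    # Convert each hex byte to octal, then let printf emit that byte.
    printf "\\$(printf '%03o' "0x${byte}")"
  done
  printf '\n'
}

hex_to_text "77 6f 72 6c 64"   # prints world
```

The decoded text matches the value half of the first record produced earlier (`hello world`, with `hello` as the key).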
198+
199+
== See also
200+
201+
- xref:manage:iceberg/query-iceberg-topics.adoc[]
202+
203+
// end::single-source[]
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
1+
= Integrate with REST Catalogs
2+
:description: Integrate Redpanda topics with managed Iceberg REST Catalogs.
3+
:page-layout: index

modules/manage/partials/iceberg/use-iceberg-catalogs.adoc

Lines changed: 16 additions & 3 deletions
@@ -3,14 +3,26 @@ ifndef::env-cloud[:about-iceberg-doc: manage:iceberg/topic-iceberg-integration.a
33

44
To read from the Redpanda-generated xref:{about-iceberg-doc}[Iceberg table], your Iceberg-compatible client or tool needs access to the catalog to retrieve the table metadata and know the current state of the table. The catalog provides the current table metadata, which includes locations for all the table's data files. You can configure Redpanda to either connect to a REST-based catalog, or use a filesystem-based catalog.
55

6-
For production deployments, Redpanda recommends using an external REST catalog to manage Iceberg metadata. This enables built-in table maintenance, safely handles multiple engines and tools accessing tables at the same time, facilitates data governance, and maximizes data discovery. However, if it is not possible to use a REST catalog, you may use the filesystem-based catalog (`object_storage` catalog type), which does not require you to maintain a separate service to access the Iceberg data. In either case, you use the catalog to load, query, or refresh the Iceberg table as you produce to the Redpanda topic. See the documentation for your query engine or Iceberg-compatible tool for specific guidance on adding the Iceberg tables to your data warehouse or lakehouse using the catalog.
6+
For production deployments, Redpanda recommends <<rest,using an external REST catalog>> to manage Iceberg metadata. This enables built-in table maintenance, safely handles multiple engines and tools accessing tables at the same time, facilitates data governance, and maximizes data discovery. However, if it is not possible to use a REST catalog, you can <<object-storage,use the filesystem-based catalog>> (`object_storage` catalog type), which does not require you to maintain a separate service to access the Iceberg data.
7+
8+
In either case, you use the catalog to load, query, or refresh the Iceberg table as you produce to the Redpanda topic. See the documentation for your query engine or Iceberg-compatible tool for specific guidance on adding the Iceberg tables to your data warehouse or lakehouse using the catalog.
79

810
After you have selected a catalog type at the cluster level and xref:{about-iceberg-doc}#enable-iceberg-integration[enabled the Iceberg integration] for a topic, you cannot switch to another catalog type.
911

12+
[[rest]]
1013
== Connect to a REST catalog
1114

1215
Connect to an Iceberg REST catalog using the standard https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml[REST API^] supported by many catalog providers. Use this catalog integration type with REST-enabled Iceberg catalog services, such as https://docs.databricks.com/en/data-governance/unity-catalog/index.html[Databricks Unity^] and https://other-docs.snowflake.com/en/opencatalog/overview[Snowflake Open Catalog^].
1316

17+
[TIP]
18+
====
19+
This section provides general guidance on using REST catalogs with Redpanda. For instructions on integrating with specific REST catalog services, see the following:
20+
21+
* xref:manage:iceberg/iceberg-topics-aws-glue.adoc[AWS Glue Data Catalog]
22+
* xref:manage:iceberg/iceberg-topics-databricks-unity.adoc[Databricks Unity Catalog]
23+
* xref:manage:iceberg/redpanda-topics-iceberg-snowflake-catalog.adoc[Snowflake with Open Catalog]
24+
====
25+
1426
ifdef::env-cloud[]
1527
=== Prerequisites
1628

@@ -69,10 +81,10 @@ NOTE: You must set `iceberg_rest_catalog_endpoint` at the same time that you set
6981

7082
To authenticate with the REST catalog, set the following cluster properties:
7183

72-
* config_ref:iceberg_rest_catalog_authentication_mode,true,properties/cluster-properties[`iceberg_rest_catalog_authentication_mode`]: The authentication mode to use for the REST catalog. Choose from `oauth2`, `bearer`, or `none` (default).
84+
* config_ref:iceberg_rest_catalog_authentication_mode,true,properties/cluster-properties[`iceberg_rest_catalog_authentication_mode`]: The authentication mode to use for the REST catalog. Choose from `oauth2`, `aws_sigv4`, `bearer`, or `none` (default). You must use `aws_sigv4` for xref:manage:iceberg/iceberg-topics-aws-glue.adoc[AWS Glue Data Catalog].
7385
ifdef::env-cloud[]
7486
+
75-
Redpanda recommends using `oauth2`.
87+
Redpanda generally recommends using `oauth2` for REST catalogs.
7688

7789
endif::[]
7890
** For `oauth2`, also configure the following properties:
@@ -253,6 +265,7 @@ ifndef::env-cloud[]
253265
TIP: You may need to explicitly create a table for the Iceberg data in your query engine. For an example, see xref:manage:iceberg/redpanda-topics-iceberg-snowflake-catalog.adoc[].
254266
endif::[]
255267

268+
[[object-storage]]
256269
== Integrate filesystem-based catalog (`object_storage`)
257270

258271
By default, Iceberg topics use the filesystem-based catalog (config_ref:iceberg_catalog_type,true,properties/cluster-properties[`iceberg_catalog_type`] cluster property set to `object_storage`). Redpanda stores the table metadata in https://iceberg.apache.org/docs/latest/java-api-quickstart/#using-a-hadoop-catalog[HadoopCatalog^] format in the same object storage bucket or container as the data files.
