
Commit e7587f4

Query Iceberg topics with Databricks Unity (#1154)

Authored by kbatuigas and micheleRP
Co-authored-by: Michele Cyran <michele@redpanda.com>
1 parent d0decad commit e7587f4

5 files changed: 261 additions & 13 deletions


modules/ROOT/nav.adoc

Lines changed: 1 addition & 0 deletions

@@ -174,6 +174,7 @@
 *** xref:manage:iceberg/choose-iceberg-mode.adoc[Choose Iceberg Mode]
 *** xref:manage:iceberg/use-iceberg-catalogs.adoc[Use Iceberg Catalogs]
 *** xref:manage:iceberg/query-iceberg-topics.adoc[Query Iceberg Topics]
+*** xref:manage:iceberg/iceberg-topics-databricks-unity.adoc[Query Iceberg Topics with Databricks Unity Catalog]
 *** xref:manage:iceberg/redpanda-topics-iceberg-snowflake-catalog.adoc[Query Iceberg Topics with Snowflake]
 ** xref:manage:schema-reg/index.adoc[Schema Registry]
 *** xref:manage:schema-reg/schema-reg-overview.adoc[Overview]

modules/manage/pages/iceberg/choose-iceberg-mode.adoc

Lines changed: 20 additions & 10 deletions

@@ -175,22 +175,29 @@ Avro::
 | string | string
 | record | struct
 | array | list
-| maps | list
-| fixed | fixed
+| maps | map
+| fixed | fixed*
 | decimal | decimal
-| uuid | uuid
+| uuid | uuid*
 | date | date
-| time | time
+| time | time*
 | timestamp | timestamp
 |===

+*These types are not currently supported in Unity Catalog managed Iceberg tables.
+
+There are some cases where the Avro type does not map directly to an Iceberg type and Redpanda applies the following transformations:
+
 * Different flavors of time (such as `time-millis`) and timestamp (such as `timestamp-millis`) types are translated to the same Iceberg `time` and `timestamp` types, respectively.
 * Avro unions are flattened to Iceberg structs with optional fields. For example:
 ** The union `["int", "long", "float"]` is represented as an Iceberg struct `struct<0 INT NULLABLE, 1 LONG NULLABLE, 2 FLOAT NULLABLE>`.
 ** The union `["int", null, "float"]` is represented as an Iceberg struct `struct<0 INT NULLABLE, 1 FLOAT NULLABLE>`.
 * All fields are required by default. (Avro always sets a default in binary representation.)
-* The Avro duration logical type is ignored.
-* The Avro null type is ignored and not represented in the Iceberg schema.
+
+Some Avro types are not supported:
+
+* The Avro `duration` logical type is ignored.
+* The Avro `null` type is ignored and not represented in the Iceberg schema.
 * Recursive types are not supported.
 --
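The union-flattening rule above can be modeled with a short Python sketch. This is an illustration of the documented mapping only, not Redpanda's implementation:

```python
def flatten_union(union):
    """Model how an Avro union maps to an Iceberg struct: null branches
    are dropped, and each remaining branch becomes an optional (NULLABLE)
    struct field named by its position in the output."""
    fields = []
    for branch in union:
        if branch is None or branch == "null":
            continue  # the Avro null type is not represented in the Iceberg schema
        fields.append(f"{len(fields)} {branch.upper()} NULLABLE")
    return "struct<" + ", ".join(fields) + ">"

print(flatten_union(["int", "long", "float"]))
# struct<0 INT NULLABLE, 1 LONG NULLABLE, 2 FLOAT NULLABLE>
print(flatten_union(["int", None, "float"]))
# struct<0 INT NULLABLE, 1 FLOAT NULLABLE>
```

Note that field positions are assigned after the null branch is dropped, which is why `float` becomes field `1` in the second example.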
@@ -208,19 +215,22 @@ Protobuf::
 | int64 | long
 | sint64 | long
 | sfixed32 | int
-| sfixed64 | int
+| sfixed64 | long
 | string | string
 | bytes | binary
 | map | map
+| message | struct
 |===

+There are some cases where the Protobuf type does not map directly to an Iceberg type and Redpanda applies the following transformations:
+
 * Repeated values are translated into Iceberg `array` types.
 * Enums are translated into Iceberg `int` types based on the integer value of the enumerated type.
 * `uint32` and `fixed32` are translated into Iceberg `long` types as that is the existing semantic for unsigned 32-bit values in Iceberg.
 * `uint64` and `fixed64` values are translated into their Base-10 string representation.
-* The `timestamp` type in Protobuf is translated into `timestamp` in Iceberg.
-* Messages are converted into Iceberg structs.
-* Recursive types are not supported.
+* `google.protobuf.Timestamp` is translated into `timestamp` in Iceberg.
+
+Recursive types are not supported.
 --
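The `uint64`/`fixed64` rule exists because Iceberg's `long` is signed 64-bit, so the upper half of the unsigned range does not fit. A small sketch (illustrative only, not Redpanda's code) shows why a Base-10 string is used instead:

```python
def uint64_to_iceberg(value: int) -> str:
    """uint64/fixed64 values do not fit Iceberg's signed long, so they
    are carried as their Base-10 string representation."""
    if not 0 <= value < 2**64:
        raise ValueError("not a uint64")
    return str(value)

# The maximum uint64 exceeds Iceberg long's maximum (2**63 - 1),
# but survives round-tripping as a decimal string:
print(uint64_to_iceberg(2**64 - 1))  # 18446744073709551615
```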
 ======

modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc (new file)

Lines changed: 217 additions & 0 deletions

= Query Iceberg Topics using Databricks and Unity Catalog
:description: Add Redpanda topics as Iceberg tables that you can query in Databricks managed by Unity Catalog.
:page-categories: Iceberg, Tiered Storage, Management, High Availability, Data Replication, Integration

// tag::single-source[]

ifndef::env-cloud[]
[NOTE]
====
include::shared:partial$enterprise-license.adoc[]
====
endif::[]

This guide walks you through querying Redpanda topics as managed Iceberg tables in Databricks, with AWS S3 as object storage and a catalog integration using https://docs.databricks.com/aws/en/data-governance/unity-catalog[Unity Catalog^]. For general information about Iceberg catalog integrations in Redpanda, see xref:manage:iceberg/use-iceberg-catalogs.adoc[].

== Prerequisites

ifndef::env-cloud[]
* xref:manage:tiered-storage.adoc#configure-object-storage[Object storage configured] for your cluster and xref:manage:tiered-storage.adoc#enable-tiered-storage[Tiered Storage enabled] for the topics for which you want to generate Iceberg tables.
+
You need the AWS S3 bucket URI so you can configure it as an external location in Unity Catalog.
endif::[]
* A Databricks workspace in the same region as your S3 bucket. See the https://docs.databricks.com/aws/en/resources/supported-regions#supported-regions-list[list of supported AWS regions^].
* Unity Catalog enabled in your Databricks workspace. See the https://docs.databricks.com/aws/en/data-governance/unity-catalog/get-started[Databricks documentation^] to set up Unity Catalog for your workspace.
* https://docs.databricks.com/aws/en/optimizations/predictive-optimization#enable-predictive-optimization[Predictive optimization^] enabled for Unity Catalog.
* https://docs.databricks.com/aws/en/external-access/admin[External data access^] enabled in your metastore.
* Workspace admin privileges to complete the steps to create a Unity Catalog storage credential and external location that connects your cluster's Tiered Storage bucket to Databricks.

== Limitations

* Databricks managed Iceberg tables do not currently support partition evolution. If you define a xref:manage:iceberg/about-iceberg-topics.adoc#use-custom-partitioning[custom partitioning] scheme for an Iceberg topic, you must not change it later.
+
The cluster property config_ref:iceberg_default_partition_spec,true,properties/cluster-properties[`iceberg_default_partition_spec`] defines the cluster-wide partitioning scheme, configured to `(hour(redpanda.timestamp))` by default. To use a different cluster-wide default, configure this property before creating any Iceberg-enabled topics. Override the default partitioning scheme at the topic level only for new Iceberg topics that do not yet contain data.
* The following data types are not currently supported for managed Iceberg tables:
+
--
|===
| Iceberg type | Equivalent Avro type

| uuid | uuid
| fixed(L) | fixed
| time | time-millis, time-micros
|===
--
+
There are no limitations for Protobuf types.
== Create a Unity Catalog storage credential

A storage credential is a Databricks object that controls access to external object storage, in this case S3. You associate a storage credential with an AWS IAM role that defines what actions Unity Catalog can perform in the S3 bucket.

Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/storage-credentials[Databricks documentation^] to create an AWS IAM role that has the required permissions for the bucket. When you have completed these steps, you should have the following configured in AWS and Databricks:

* A self-assuming IAM role, meaning you've defined the role trust policy so the role trusts itself.
* Two IAM policies attached to the IAM role. The first policy grants Unity Catalog read and write access to the bucket. The second policy allows Unity Catalog to configure file events.
* A storage credential in Databricks associated with the IAM role, using the role's ARN. You also use the storage credential's external ID in the role's trust relationship policy to make the role self-assuming.
== Create a Unity Catalog external location

The external location stores the Unity Catalog-managed Iceberg metadata, and the Iceberg data written by Redpanda. You must use the same bucket configured for glossterm:Tiered Storage[] for your Redpanda cluster.

ifdef::env-cloud[]
For BYOC clusters, the bucket name is `redpanda-cloud-storage-<cluster-id>`, where `<cluster-id>` is the ID of your Redpanda cluster.
endif::[]

Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/external-locations[Databricks documentation^] to *manually* create an external location. You can create the external location in the Catalog Explorer or with SQL. You must create the external location manually because the location needs to be associated with the existing Tiered Storage bucket URL, `s3://<bucket-name>`.
== Create a new catalog

Follow the steps in the Databricks documentation to https://docs.databricks.com/aws/en/catalogs/create-catalog[create a standard catalog^]. When you create the catalog, specify the external location you created in the previous step as the storage location.

You use the catalog name when you set the Iceberg cluster configuration properties in Redpanda in a later step.
== Authorize access to Unity Catalog

Redpanda recommends using OAuth for service principals to grant Redpanda access to Unity Catalog.

. Follow the steps in the https://docs.databricks.com/aws/en/dev-tools/auth/oauth-m2m[Databricks documentation^] to create a service principal, and then generate an OAuth secret. You use the client ID and secret to set Iceberg cluster configuration properties in Redpanda in the next step.
. Open your catalog in the Catalog Explorer, then click *Permissions*.
. Click *Grant* to grant the service principal the following permissions on the catalog:
+
* `ALL PRIVILEGES`
* `EXTERNAL USE SCHEMA`

The Iceberg integration for Redpanda also supports using bearer tokens.
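Redpanda performs the OAuth2 client-credentials exchange for you once configured, but you can sanity-check the service principal's credentials yourself. The sketch below only assembles the token request against the Databricks token endpoint (the workspace host and credentials are placeholders); pass the result to any HTTP client:

```python
import base64

def build_token_request(workspace: str, client_id: str, client_secret: str):
    """Assemble the OAuth2 client-credentials request used to obtain a
    workspace access token. Sketch only; Redpanda performs this exchange
    internally when iceberg_rest_catalog_authentication_mode is oauth2."""
    credentials = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return {
        "url": f"https://{workspace}/oidc/v1/token",
        "headers": {"Authorization": f"Basic {credentials}"},
        "data": {"grant_type": "client_credentials", "scope": "all-apis"},
    }

req = build_token_request("cust-success.cloud.databricks.com",
                          "<service-principal-client-id>",
                          "<service-principal-client-secret>")
print(req["url"])  # https://cust-success.cloud.databricks.com/oidc/v1/token
```

The `/oidc/v1/token` path and `all-apis` scope match the values used in the cluster configuration in the next section.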
88+
== Update cluster configuration

To configure your Redpanda cluster to enable Iceberg on a topic and integrate with Unity Catalog:

. Edit your cluster configuration to set the `iceberg_enabled` property to `true`, and set the catalog integration properties listed in the example below.
ifndef::env-cloud[]
+
Run `rpk cluster config edit` to update these properties:
+
[,yaml]
----
iceberg_enabled: true
iceberg_catalog_type: rest
iceberg_rest_catalog_endpoint: https://<workspace-instance>/api/2.1/unity-catalog/iceberg-rest
iceberg_rest_catalog_authentication_mode: oauth2
iceberg_rest_catalog_oauth2_server_uri: https://<workspace-instance>/oidc/v1/token
iceberg_rest_catalog_oauth2_scope: all-apis
iceberg_rest_catalog_client_id: <service-principal-client-id>
iceberg_rest_catalog_client_secret: <service-principal-client-secret>
iceberg_rest_catalog_warehouse: <unity-catalog-name>
iceberg_disable_snapshot_tagging: true
----
endif::[]
ifdef::env-cloud[]
+
Use `rpk` as in the following example, or use the Cloud API to xref:manage:cluster-maintenance/config-cluster.adoc#set-cluster-configuration-properties[update these cluster properties]. The update might take several minutes to complete.
+
To reference a secret in a cluster property, you must first xref:manage:iceberg/use-iceberg-catalogs.adoc#store-a-secret-for-rest-catalog-authentication[store the secret value].
+
[,bash]
----
rpk cloud login

rpk profile create --from-cloud <cluster-id>

rpk cluster config set \
  iceberg_enabled=true \
  iceberg_catalog_type=rest \
  iceberg_rest_catalog_endpoint=https://<workspace-instance>/api/2.1/unity-catalog/iceberg-rest \
  iceberg_rest_catalog_authentication_mode=oauth2 \
  iceberg_rest_catalog_oauth2_server_uri=https://<workspace-instance>/oidc/v1/token \
  iceberg_rest_catalog_oauth2_scope=all-apis \
  iceberg_rest_catalog_client_id=<service-principal-client-id> \
  iceberg_rest_catalog_client_secret=${secrets.<service-principal-client-secret-name>} \
  iceberg_rest_catalog_warehouse=<unity-catalog-name> \
  iceberg_disable_snapshot_tagging=true
----
endif::[]
+
Use your own values for the following placeholders:
+
--
- `<workspace-instance>`: The URL of your https://docs.databricks.com/aws/en/workspace/workspace-details#workspace-instance-names-urls-and-ids[Databricks workspace instance^]; for example, `cust-success.cloud.databricks.com`.
- `<service-principal-client-id>`: The client ID of the service principal you created in an earlier step.
ifndef::env-cloud[]
- `<service-principal-client-secret>`: The client secret of the service principal you created in an earlier step.
endif::[]
ifdef::env-cloud[]
- `<service-principal-client-secret-name>`: The name of the client secret of the service principal you created in an earlier step.
endif::[]
- `<unity-catalog-name>`: The name of your catalog in Unity Catalog.
--
+
[,bash,role=no-copy]
----
Successfully updated configuration. New configuration version is 2.
----

ifndef::env-cloud[]
. You must restart your cluster if you change the configuration for a running cluster.
endif::[]

. Enable the integration for a topic by configuring the topic property `redpanda.iceberg.mode`. The following examples show how to use xref:get-started:rpk-install.adoc[`rpk`] to either create a new topic or alter the configuration for an existing topic and set the Iceberg mode to `key_value`. The `key_value` mode creates an Iceberg table for the topic consisting of two columns: one for the record metadata, including the key, and another binary column for the record's value. See xref:manage:iceberg/choose-iceberg-mode.adoc[] for more details on Iceberg modes.
+
.Create a new topic and set `redpanda.iceberg.mode`:
[,bash]
----
rpk topic create <topic-name> --topic-config=redpanda.iceberg.mode=key_value
----
+
.Set `redpanda.iceberg.mode` for an existing topic:
[,bash]
----
rpk topic alter-config <topic-name> --set redpanda.iceberg.mode=key_value
----

. Produce to the topic. For example:
+
[,bash]
----
echo "hello world\nfoo bar\nbaz qux" | rpk topic produce <topic-name> --format='%k %v\n'
----
+
You should see the topic as a table with data in Unity Catalog. The data may take some time to become visible, depending on your config_ref:iceberg_target_lag_ms,true,properties/cluster-properties[`iceberg_target_lag_ms`] setting.

. In Catalog Explorer, open your catalog. You should see a `redpanda` schema, in addition to `default` and `information_schema`. The `redpanda` schema and the table within it are added automatically, and the table name is the same as the topic name.
== Query Iceberg table using Databricks SQL

You can query the Iceberg table using different engines, such as Databricks SQL, PyIceberg, or Apache Spark. To query the table or view the table data in Catalog Explorer, ensure that your account has the necessary permissions to read the table. Review the Databricks documentation on https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/?language=SQL#grant-permissions-on-objects-in-a-unity-catalog-metastore[granting permissions to objects^] and https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/privileges[Unity Catalog privileges^] for details.
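If you query with PyIceberg instead of Databricks SQL, the catalog properties mirror the Redpanda cluster configuration set earlier. The snippet below only builds the property map; the property names follow PyIceberg's REST catalog options, and all bracketed values are placeholders, so check your PyIceberg version's documentation before relying on it:

```python
# Properties you would pass to pyiceberg.catalog.load_catalog("unity", **props).
# Placeholder values are assumptions; substitute your own.
workspace = "<workspace-instance>"
props = {
    "type": "rest",
    "uri": f"https://{workspace}/api/2.1/unity-catalog/iceberg-rest",
    "warehouse": "<unity-catalog-name>",
    "credential": "<service-principal-client-id>:<service-principal-client-secret>",
    "oauth2-server-uri": f"https://{workspace}/oidc/v1/token",
    "scope": "all-apis",
}

# A table generated from a topic is addressed as <schema>.<table>,
# where the schema is `redpanda` and the table name matches the topic name:
table_identifier = "redpanda.<topic-name>"
print(props["uri"])
```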

The following example shows how to query the Iceberg table using SQL in Databricks SQL.

. In the Databricks console, open *SQL Editor*.
. In the query editor, run:
+
[,sql]
----
-- Ensure that the catalog and table name are correctly parsed in case they contain special characters
SELECT * FROM `<catalog-name>`.redpanda.`<table-name>`;
----
+
Your query results should look like the following:
+
[,sql,role="no-copy no-wrap"]
----
-- Example for redpanda.iceberg.mode=key_value with 1 record produced to topic
+----------------------------------------------------------------------+------------+
| redpanda                                                             | value      |
+----------------------------------------------------------------------+------------+
| {"partition":0,"offset":"0","timestamp":"2025-04-02T18:57:11.127Z",  | 776f726c64 |
| "headers":null,"key":"68656c6c6f"}                                   |            |
+----------------------------------------------------------------------+------------+
----
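In `key_value` mode the record key and value are stored as binary columns, so they surface as hex strings in the results above. You can decode them with plain Python to confirm they match the record you produced:

```python
# Decode the hex-encoded key and value columns from the example output.
key = bytes.fromhex("68656c6c6f").decode()    # "key" field inside the redpanda struct
value = bytes.fromhex("776f726c64").decode()  # "value" column
print(key, value)  # hello world
```

This matches the first record (`hello world`) from the produce step earlier in this guide.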
== See also

- xref:manage:iceberg/query-iceberg-topics.adoc[]

// end::single-source[]

modules/manage/partials/iceberg/about-iceberg-topics.adoc

Lines changed: 22 additions & 2 deletions

@@ -63,9 +63,25 @@ ifdef::env-cloud[]
 To create an Iceberg table for a Redpanda topic, you must set the cluster configuration property config_ref:iceberg_enabled,true,properties/cluster-properties[`iceberg_enabled`] to `true`, and also configure the topic property `redpanda.iceberg.mode`. You can choose to provide a schema if you need the Iceberg table to be structured with defined columns.
 endif::[]

-. Set the `iceberg_enabled` configuration option on your cluster to `true`. You must restart your cluster if you change this configuration for a running cluster.
+. Set the `iceberg_enabled` configuration option on your cluster to `true`.
 ifdef::env-cloud[]
++
+[tabs]
+=====
+rpk::
++
+--
+[,bash]
+----
+rpk cloud login
+rpk profile create --from-cloud <CLUSTER ID>
+rpk cluster config set iceberg_enabled true
+----
+--
+
+Cloud API::
++
+--
 [,bash]
 ----
 # Store your cluster ID in a variable

@@ -85,8 +101,10 @@ curl -H "Authorization: Bearer ${RP_CLOUD_TOKEN}" -X PATCH \
 -H 'content-type: application/json' \
 -d '{"cluster_configuration":{"custom_properties": {"iceberg_enabled":true}}}'
 ----
-+
+
 The xref:api:ROOT:cloud-controlplane-api.adoc#patch-/v1/clusters/-cluster.id-[`PATCH /clusters/{cluster.id}`] request returns the ID of a long-running operation. The operation may take up to ten minutes to complete. You can check the status of the operation by polling the xref:api:ROOT:cloud-controlplane-api.adoc#get-/v1/operations/-id-[`GET /operations/\{id}`] endpoint.
+--
+=====
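The long-running operation returned by the `PATCH` request has to be polled until it finishes. A generic backoff helper like the following can wrap the `GET /operations/\{id}` call; the `fetch` callable and the operation states below are placeholders, since the exact response fields depend on the Control Plane API version:

```python
import time

def poll_until_done(fetch, is_done, interval=5.0, timeout=600.0):
    """Call fetch() until is_done(result) is true or timeout elapses.
    fetch: callable returning the operation resource (for example, the
    body of GET /v1/operations/{id}); is_done: predicate over it."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("operation did not complete in time")
        time.sleep(interval)

# Usage sketch with a stubbed fetch (replace with a real HTTP call);
# the state names are hypothetical:
states = iter(["STATE_IN_PROGRESS", "STATE_COMPLETED"])
op = poll_until_done(lambda: {"state": next(states)},
                     lambda r: r["state"] == "STATE_COMPLETED",
                     interval=0.01)
print(op["state"])  # STATE_COMPLETED
```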
 endif::[]
 ifndef::env-cloud[]
 +

@@ -99,6 +117,8 @@ rpk cluster config set iceberg_enabled true
 ----
 Successfully updated configuration. New configuration version is 2.
 ----
++
+You must restart your cluster if you change this configuration for a running cluster.
 endif::[]

 . (Optional) Create a new topic.

modules/manage/partials/iceberg/query-iceberg-topics.adoc

Lines changed: 1 addition & 1 deletion

@@ -63,7 +63,7 @@ NOTE: Redpanda automatically removes expired snapshots on a periodic basis. Snap
 == Query examples

 ifndef::env-cloud[]
-To follow along with the examples on this page, suppose you produce the same stream of events to a topic `ClickEvent`, which uses a schema, and another topic `ClickEvent_key_value`, which uses the key-value mode. The topics have glossterm:tiered-storage[] configured to an AWS S3 bucket. A sample record contains the following data:
+To follow along with the examples on this page, suppose you produce the same stream of events to a topic `ClickEvent`, which uses a schema, and another topic `ClickEvent_key_value`, which uses the key-value mode. The topics have glossterm:Tiered Storage[] configured to an AWS S3 bucket. A sample record contains the following data:
 endif::[]

 ifdef::env-cloud[]