Query Iceberg topics using Databricks and Unity Catalog #1050
= Query Iceberg Topics using Databricks and Unity Catalog
:description: Add Redpanda topics as Iceberg tables that you can query in Databricks managed by Unity Catalog.
:page-categories: Iceberg, Tiered Storage, Management, High Availability, Data Replication, Integration
:page-beta: true

[NOTE]
====
include::shared:partial$enterprise-license.adoc[]
====

This guide walks you through querying Redpanda topics as Iceberg tables in Databricks, with AWS S3 as object storage and a catalog integration using https://docs.databricks.com/aws/en/data-governance/unity-catalog[Unity Catalog^]. For general information about Iceberg catalog integrations in Redpanda, see xref:manage:iceberg/use-iceberg-catalogs.adoc[].

== Prerequisites

* xref:manage:tiered-storage.adoc#configure-object-storage[Object storage configured] for your cluster and xref:manage:tiered-storage.adoc#enable-tiered-storage[Tiered Storage enabled] for the topics for which you want to generate Iceberg tables.
+
You need the AWS S3 bucket URI, so you can configure it as an external location in Unity Catalog.
* A Databricks workspace in the same region as your S3 bucket. See the https://docs.databricks.com/aws/en/resources/supported-regions#supported-regions-list[list of supported AWS regions^].
* Unity Catalog enabled in your Databricks workspace. See the https://docs.databricks.com/aws/en/data-governance/unity-catalog/get-started[official Databricks documentation^] to set up Unity Catalog for your workspace.
* Predictive optimization enabled.
* External data access enabled in your metastore.
* Workspace admin privileges to complete the steps to create a Unity Catalog storage credential and external location that connects your Tiered Storage bucket to Databricks.
== Create a Unity Catalog storage credential

A storage credential is a Databricks object that controls access to external object storage, in this case S3. You associate a storage credential with an AWS IAM role that defines what actions Unity Catalog can perform in the S3 bucket.

Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/storage-credentials[official Databricks documentation^] to create an AWS IAM role that has the required permissions for the bucket. When you have completed these steps, you will have the following configured in AWS and Databricks:
Reviewer comment (Contributor): The spelling of 'Databricks' is correct, and the use of 'should' can be replaced with 'will' to make the instruction more definitive.
|
|
* A self-assuming IAM role, meaning you've defined the role trust policy so the role trusts itself.
* Two IAM policies attached to the IAM role. The first policy grants Unity Catalog read and write access to the bucket. The second policy allows Unity Catalog to configure file events.
|
Reviewer comment (Author): Do these policies need to be restricted down to the catalog root?
* A storage credential in Databricks associated with the IAM role, using the role's ARN. You also use the storage credential's external ID in the role's trust relationship policy to make the role self-assuming.
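+
To make the role self-assuming, the trust relationship policy must allow both the Databricks-provided Unity Catalog principal and the role itself to assume the role. The following is a minimal sketch of such a policy; the Databricks role ARN, your account ID, the role name, and the external ID are all placeholders that you obtain while following the Databricks storage credential setup:
+
[,json]
----
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "<databricks-unity-catalog-role-arn>",
          "arn:aws:iam::<your-account-id>:role/<this-role-name>"
        ]
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<storage-credential-external-id>"
        }
      }
    }
  ]
}
----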
|
|
== Create a Unity Catalog external location

The external location points to the location of the Iceberg data in your S3 bucket.

Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/external-locations[official Databricks documentation^] to *manually* create an external location. You can create the external location in Catalog Explorer, or by using SQL. You must create the external location manually because the location needs to be associated with the existing Tiered Storage bucket URL, `s3://<bucket-name>`.
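If you create the external location using SQL, the statement might look like the following sketch. The location and credential names (`redpanda_tiered_storage`, `<storage-credential-name>`) are hypothetical placeholders:

[,sql]
----
-- Associate the external location with the existing Tiered Storage bucket.
-- Names are hypothetical; replace them with your own.
CREATE EXTERNAL LOCATION IF NOT EXISTS `redpanda_tiered_storage`
URL 's3://<bucket-name>'
WITH (STORAGE CREDENTIAL `<storage-credential-name>`);
----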
|
Reviewer comment (Author): Is this location fine or does the catalog folder need to be used?
|
|
== Create a new catalog

Follow the steps in the official Databricks documentation to https://docs.databricks.com/aws/en/catalogs/create-catalog[create a standard catalog^]. When you create the catalog, specify the following as the storage location:

* The external location you created in the previous step.
* In the subpath field, enter `redpanda-iceberg-catalog`.

You use the catalog name when you set the Iceberg cluster configuration properties in Redpanda in a later step.
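If you create the catalog using SQL instead of the UI, the statement might look like the following sketch, where the managed location combines the external location URL with the `redpanda-iceberg-catalog` subpath:

[,sql]
----
-- <catalog-name> is a placeholder; the subpath matches the one entered above
CREATE CATALOG IF NOT EXISTS `<catalog-name>`
MANAGED LOCATION 's3://<bucket-name>/redpanda-iceberg-catalog';
----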
|
|
== Authorize access to Unity Catalog

Redpanda recommends using OAuth for service principals to grant Redpanda access to Unity Catalog.

. Follow the steps in the https://docs.databricks.com/aws/en/dev-tools/auth/oauth-m2m[official Databricks documentation^] to create a service principal, and then generate an OAuth secret. You use the client ID and secret to set Iceberg cluster configuration properties in Redpanda in the next step.
. Open your catalog in Catalog Explorer, then click *Permissions*.
. Click *Grant* to grant the service principal the following permissions on the catalog:
+
* `ALL PRIVILEGES`
* `EXTERNAL USE SCHEMA`
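+
As an alternative to Catalog Explorer, the grants can also be issued in SQL. This is a sketch that assumes `EXTERNAL USE SCHEMA` may be granted at the catalog level so that schemas inherit it; the grantee is the service principal's application ID:
+
[,sql]
----
-- Placeholders: <catalog-name> and <service-principal-application-id>
GRANT ALL PRIVILEGES ON CATALOG `<catalog-name>` TO `<service-principal-application-id>`;
GRANT EXTERNAL USE SCHEMA ON CATALOG `<catalog-name>` TO `<service-principal-application-id>`;
----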
|
|
The Iceberg integration for Redpanda also supports using bearer tokens.
|
Reviewer comment (Author): Is this fine to mention here? In which instances should we recommend the bearer token option?
|
|
== Update cluster configuration

To configure your Redpanda cluster to enable Iceberg on a topic and integrate with Unity Catalog:

. Edit your cluster configuration to set the `iceberg_enabled` property to `true`, and set the catalog integration properties listed in the following example. You can run `rpk cluster config edit` to update these properties:
+
[,yaml]
----
iceberg_enabled: true
iceberg_catalog_type: rest
iceberg_rest_catalog_endpoint: https://<workspace-instance>/api/2.1/unity-catalog/iceberg
iceberg_rest_catalog_authentication_mode: oauth2
iceberg_rest_catalog_oauth2_server_uri: https://<workspace-instance>/oidc/v1/token
iceberg_rest_catalog_oauth2_scope: all-apis
iceberg_rest_catalog_client_id: <service-principal-client-id>
iceberg_rest_catalog_client_secret: <service-principal-client-secret>
iceberg_rest_catalog_warehouse: <unity-catalog-name>
iceberg_disable_snapshot_tagging: true

# Optional
iceberg_translation_interval_ms_default: 1000
iceberg_catalog_commit_interval_ms: 1000
----
+
Use your own values for the following placeholders:
+
--
- `<workspace-instance>`: The URL of your https://docs.databricks.com/aws/en/workspace/workspace-details#workspace-instance-names-urls-and-ids[Databricks workspace instance^], for example, `cust-success.cloud.databricks.com`.
- `<service-principal-client-id>`: The client ID of the service principal you created in an earlier step.
- `<service-principal-client-secret>`: The client secret of the service principal you created in an earlier step.
- `<unity-catalog-name>`: The name of your catalog in Unity Catalog.
--
+
[,bash,role=no-copy]
----
Successfully updated configuration. New configuration version is 2.
----
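+
If you prefer not to open the editor, the same properties can be set individually with `rpk cluster config set`. A sketch, using the same placeholders as above:
+
[,bash]
----
rpk cluster config set iceberg_enabled true
rpk cluster config set iceberg_catalog_type rest
rpk cluster config set iceberg_rest_catalog_endpoint "https://<workspace-instance>/api/2.1/unity-catalog/iceberg"
rpk cluster config set iceberg_rest_catalog_authentication_mode oauth2
rpk cluster config set iceberg_rest_catalog_oauth2_server_uri "https://<workspace-instance>/oidc/v1/token"
rpk cluster config set iceberg_rest_catalog_oauth2_scope all-apis
rpk cluster config set iceberg_rest_catalog_client_id "<service-principal-client-id>"
rpk cluster config set iceberg_rest_catalog_client_secret "<service-principal-client-secret>"
rpk cluster config set iceberg_rest_catalog_warehouse "<unity-catalog-name>"
rpk cluster config set iceberg_disable_snapshot_tagging true
----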
|
|
. You must restart your cluster if you change the configuration for a running cluster.

. Enable the integration for a topic by configuring the topic property `redpanda.iceberg.mode`. The `key_value` mode creates an Iceberg table for the topic consisting of two columns: one for the record metadata, including the key, and a binary column for the record's value. See xref:manage:iceberg/topic-iceberg-integration.adoc#enable-iceberg-integration[Enable Iceberg integration] for more details on Iceberg modes. The following examples show how to use xref:get-started:rpk-install.adoc[`rpk`] to either create a new topic, or alter the configuration for an existing topic, to set the Iceberg mode to `key_value`.
+
.Create a new topic and set `redpanda.iceberg.mode`:
[,bash]
----
rpk topic create <topic-name> --topic-config=redpanda.iceberg.mode=key_value
----
+
.Set `redpanda.iceberg.mode` for an existing topic:
[,bash]
----
rpk topic alter-config <topic-name> --set redpanda.iceberg.mode=key_value
----
|
|
. Produce to the topic. For example:
+
[,bash]
----
echo -e "hello world\nfoo bar\nbaz qux" | rpk topic produce <topic-name> --format='%k %v\n'
----
+
You should see the topic as a table in Unity Catalog.

. In Catalog Explorer, open your catalog. You should see a `redpanda` schema, in addition to `default` and `information_schema`. The `redpanda` schema and the table within it are added automatically. The table name is the same as the topic name.
|
|
== Query Iceberg table using Databricks SQL

You can query the Iceberg table using different engines, such as Databricks SQL, PyIceberg, or Apache Spark. To query the table or view the table data in Catalog Explorer, ensure that your account has the necessary permissions to read the table. Review the official Databricks documentation on https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/?language=SQL#grant-permissions-on-objects-in-a-unity-catalog-metastore[granting permissions to objects^] and https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/privileges[Unity Catalog privileges^] for details.
|
|
The following example shows how to query the Iceberg table using SQL in Databricks SQL.

. In the Databricks console, open *SQL Editor*.
. In the query editor, run the following:
+
[,sql]
----
-- Quote the catalog and table names so they parse correctly if they contain special characters
SELECT * FROM `<catalog-name>`.redpanda.`<table-name>`;
----
+
Your query results should look like the following:
+
[,sql,role="no-copy no-wrap"]
----
-- Example for redpanda.iceberg.mode=key_value with 1 record produced to the topic
+--------------------------------------------------------------------------+------------+-------------------------+
| redpanda                                                                 | value      | redpanda.timestamp_hour |
+--------------------------------------------------------------------------+------------+-------------------------+
| {"partition":0,"offset":"0","timestamp":"2025-04-02T18:57:11.127Z",      | 776f726c64 | 2025-04-02-18           |
| "headers":null,"key":"68656c6c6f"}                                       |            |                         |
+--------------------------------------------------------------------------+------------+-------------------------+
----
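+
The key (inside the `redpanda` metadata column) and the `value` column hold the raw record bytes, displayed hex-encoded. You can decode them locally, for example with the standard `xxd` tool:
+
[,bash]
----
# Decode the hex-encoded key and value from the example row above
printf '68656c6c6f' | xxd -r -p; echo    # hello
printf '776f726c64' | xxd -r -p; echo    # world
----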
|
|
Reviewer comment (Author): Is there anything we should mention in this doc regarding metadata refresh? Does Unity automatically refresh table metadata?
== See also

- xref:manage:iceberg/query-iceberg-topics.adoc[]

Reviewer comment: Do we need to provide specific guidance regarding what level (e.g. account vs catalog vs schema) predictive optimization should be enabled?