@@ -3,6 +3,10 @@
:page-categories: Iceberg, Tiered Storage, Management, High Availability, Data Replication, Integration

// tag::single-source[]
:page-topic-type: how-to
:personas: platform_admin, data_engineer
:learning-objective-1: Configure a Unity Catalog integration for Redpanda Iceberg topics with AWS S3
:learning-objective-2: Query Redpanda topic data as Iceberg tables in Databricks SQL

ifndef::env-cloud[]
[NOTE]
@@ -13,6 +17,11 @@ endif::[]

This guide walks you through querying Redpanda topics as managed Iceberg tables in Databricks, with AWS S3 as object storage and a catalog integration using https://docs.databricks.com/aws/en/data-governance/unity-catalog[Unity Catalog^]. For general information about Iceberg catalog integrations in Redpanda, see xref:manage:iceberg/use-iceberg-catalogs.adoc[].

After reading this page, you will be able to:

* [ ] {learning-objective-1}
* [ ] {learning-objective-2}

== Prerequisites

ifndef::env-cloud[]
@@ -78,11 +87,38 @@ endif::[]

Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/external-locations[Databricks documentation] to *manually* create an external location. You can create the external location in the Catalog Explorer or with SQL. You must create the external location manually because the location needs to be associated with the existing Tiered Storage bucket URL, `s3://<bucket-name>`.
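As a rough sketch of the SQL route, the following statement creates an external location for the Tiered Storage bucket. The location name `redpanda_tiered_storage` and the `<storage-credential-name>` placeholder are assumptions for illustration, not values from this guide. The statement also assumes that a storage credential with access to the bucket already exists.

[,sql]
----
-- Hypothetical names: replace the location name and credential with your own.
-- The URL must match the existing Tiered Storage bucket.
CREATE EXTERNAL LOCATION IF NOT EXISTS redpanda_tiered_storage
  URL 's3://<bucket-name>'
  WITH (STORAGE CREDENTIAL `<storage-credential-name>`)
  COMMENT 'Tiered Storage bucket used by Redpanda Iceberg topics';
----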

== Create a new catalog
== Choose a catalog setup

Contributor

Suggested change
You can either create a new catalog dedicated to Redpanda topics or use an existing catalog. If you create a new catalog, Redpanda automatically creates the required schema for you. If you need to integrate with an existing catalog, you must manually create the schema in that catalog before Redpanda creates any Iceberg tables.
After you set up your catalog, the authorization and Redpanda configuration steps are the same for both options.

Would recommend an intro like this to help tighten the following subsections.

Also, does "manually create the schema before Redpanda creates any Iceberg tables" specifically mean that this step should be done before enabling Iceberg topics?

You can either create a new catalog dedicated to Redpanda topics or use an existing catalog. If you create a new catalog, Redpanda automatically creates the required schema for you. If you need to integrate with an existing catalog, you must manually create the schema in that catalog before Redpanda creates any Iceberg tables.

After you set up your catalog, the authorization and Redpanda configuration steps are the same for both options.

=== Option 1: Create a new catalog (recommended)

Follow the steps in the Databricks documentation to https://docs.databricks.com/aws/en/catalogs/create-catalog[create a standard catalog^]. When you create the catalog, specify the external location you created in the previous step as the storage location.

You use the catalog name when you set the Iceberg cluster configuration properties in Redpanda in a later step.
In this setup, Redpanda creates the default `redpanda` schema for you. You use the catalog name when you set the Iceberg cluster configuration properties in Redpanda in a later step.
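If you prefer SQL to the Catalog Explorer, a catalog for Option 1 might be created as sketched below. This assumes that the external location from the previous step covers `s3://<bucket-name>` and that you have the privileges to create catalogs. `<catalog-name>` is a placeholder.

[,sql]
----
-- Create a catalog whose managed storage is the Tiered Storage bucket.
-- The path must fall within the external location created earlier.
CREATE CATALOG IF NOT EXISTS `<catalog-name>`
  MANAGED LOCATION 's3://<bucket-name>';
----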

=== Option 2: Use an existing catalog with a pre-created schema

If you need to integrate Redpanda with an existing Unity Catalog catalog object, follow the steps to https://docs.databricks.com/aws/en/schemas/create-schema[create a schema^] in the catalog.

* By default, Redpanda creates tables in a schema named `redpanda`. If you want to use a different schema, set config_ref:iceberg_default_catalog_namespace,true,properties/cluster-properties[`iceberg_default_catalog_namespace`] before enabling Iceberg, then manually create that schema in the catalog.
* Set the schema's managed storage location to the same S3 bucket used for Redpanda Tiered Storage, using the external location you created in the previous step.

Unity Catalog resolves managed storage locations through a hierarchy of metastore > catalog > schema. If you assign the schema its own managed storage location, Redpanda can use the existing catalog while the schema stores its managed Iceberg data in the schema-specific location.

For example:
Contributor

Suggested change
For example:
Unity Catalog resolves managed storage locations through a hierarchy of metastore > catalog > schema. If you assign the schema its own managed storage location, Redpanda can use the existing catalog while the schema stores its managed Iceberg data in the schema-specific location.
For example:

I would move this paragraph here so that the next lines illustrate the concept. We can then provide the Databricks doc link afterwards for readers who want to learn more.


* Your existing Unity Catalog catalog stores managed data in `s3://<catalog-bucket-name>`.
ifdef::env-cloud[]
* You manually create a `redpanda` schema in that catalog and override its managed storage location, through the external location, to the S3 bucket that Redpanda uses for your cluster's object storage (`s3://redpanda-cloud-storage-<cluster-id>` for BYOC, or your customer-managed bucket for BYOVPC).
endif::[]
ifndef::env-cloud[]
* You manually create a `redpanda` schema in that catalog and override its managed storage location, through the external location, to `s3://<cluster-bucket-name>`, which matches the S3 bucket that Redpanda uses for Tiered Storage.
endif::[]

For more information, see the https://docs.databricks.com/aws/en/data-governance/unity-catalog/#managed-storage-location-hierarchy[Unity Catalog managed storage location hierarchy^] in the Databricks documentation.
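The schema override described in Option 2 can be sketched in SQL as follows. This is a minimal example, assuming the default `redpanda` namespace and an external location that already covers the cluster's bucket. `<catalog-name>` and `<cluster-bucket-name>` are placeholders: substitute your existing catalog and the bucket that Redpanda uses for object storage.

[,sql]
----
-- Create the schema in the existing catalog and point its managed storage
-- at the Redpanda bucket instead of the catalog's default location.
CREATE SCHEMA IF NOT EXISTS `<catalog-name>`.redpanda
  MANAGED LOCATION 's3://<cluster-bucket-name>';

-- Inspect the schema to confirm which managed location it resolves to.
DESCRIBE SCHEMA EXTENDED `<catalog-name>`.redpanda;
----

If you set `iceberg_default_catalog_namespace` to a custom namespace, use that name in place of `redpanda` in both statements.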

== Authorize access to Unity Catalog

@@ -118,6 +154,8 @@ iceberg_rest_catalog_client_id: <service-principal-client-id>
iceberg_rest_catalog_client_secret: <service-principal-client-secret>
iceberg_rest_catalog_warehouse: <unity-catalog-name>
iceberg_disable_snapshot_tagging: true
# Optional. Set a custom namespace only if you want to use a schema other than the default `redpanda`
# iceberg_default_catalog_namespace: ["<custom-namespace>"]
----
endif::[]
ifdef::env-cloud[]
@@ -142,6 +180,8 @@ rpk cluster config set \
iceberg_rest_catalog_client_secret='${secrets.<service-principal-client-secret-name>}' \
iceberg_rest_catalog_warehouse=<unity-catalog-name> \
iceberg_disable_snapshot_tagging=true
# Optional. Set a custom namespace only if you want to use a schema other than the default `redpanda`
# iceberg_default_catalog_namespace='["<custom-namespace>"]'
----
endif::[]
+
@@ -210,7 +250,11 @@ The following example shows how to query the Iceberg table using SQL in Databric
+
[,sql]
----
-- Ensure that the catalog and table name are correctly parsed in case they contain special characters
/* Ensure that the catalog and table name are correctly parsed in case they
contain special characters.

If you set iceberg_default_catalog_namespace to a custom namespace, replace
`redpanda` with that namespace in the query below. */
SELECT * FROM `<catalog-name>`.redpanda.`<table-name>` LIMIT 10;
----
+