iceberg: document dbx w/ existing catalog #1660
Open: nvartolomei wants to merge 3 commits into main from nv/dbx-existing-catalog
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -3,6 +3,10 @@ | |||||||||
| :page-categories: Iceberg, Tiered Storage, Management, High Availability, Data Replication, Integration | ||||||||||
|
|
||||||||||
| // tag::single-source[] | ||||||||||
| :page-topic-type: how-to | ||||||||||
| :personas: platform_admin, data_engineer | ||||||||||
| :learning-objective-1: Configure a Unity Catalog integration for Redpanda Iceberg topics with AWS S3 | ||||||||||
| :learning-objective-2: Query Redpanda topic data as Iceberg tables in Databricks SQL | ||||||||||
|
|
||||||||||
| ifndef::env-cloud[] | ||||||||||
| [NOTE] | ||||||||||
|
|
@@ -13,6 +17,11 @@ endif::[] | |||||||||
|
|
||||||||||
| This guide walks you through querying Redpanda topics as managed Iceberg tables in Databricks, with AWS S3 as object storage and a catalog integration using https://docs.databricks.com/aws/en/data-governance/unity-catalog[Unity Catalog^]. For general information about Iceberg catalog integrations in Redpanda, see xref:manage:iceberg/use-iceberg-catalogs.adoc[]. | ||||||||||
|
|
||||||||||
| After reading this page, you will be able to: | ||||||||||
|
|
||||||||||
| * [ ] {learning-objective-1} | ||||||||||
| * [ ] {learning-objective-2} | ||||||||||
|
|
||||||||||
| == Prerequisites | ||||||||||
|
|
||||||||||
| ifndef::env-cloud[] | ||||||||||
|
|
@@ -78,11 +87,38 @@ endif::[] | |||||||||
|
|
||||||||||
| Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/external-locations[Databricks documentation] to *manually* create an external location. You can create the external location in the Catalog Explorer or with SQL. You must create the external location manually because the location needs to be associated with the existing Tiered Storage bucket URL, `s3://<bucket-name>`. | ||||||||||
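A minimal sketch of the SQL route mentioned above, shown outside the diff itself. It assumes that a Unity Catalog storage credential with access to the bucket already exists; `<location-name>` and `<storage-credential-name>` are hypothetical placeholders, and `<bucket-name>` is the existing Tiered Storage bucket.

```sql
-- Sketch only: replace the placeholders with your own values.
-- Assumes the storage credential was created beforehand and grants
-- access to the Tiered Storage bucket.
CREATE EXTERNAL LOCATION IF NOT EXISTS `<location-name>`
  URL 's3://<bucket-name>'
  WITH (STORAGE CREDENTIAL `<storage-credential-name>`);
```

Backticks keep names that contain hyphens or other special characters from being misparsed, in the same way as the query example later in the page.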
|
|
||||||||||
| == Create a new catalog | ||||||||||
| == Choose a catalog setup | ||||||||||
|
|
||||||||||
| You can either create a new catalog dedicated to Redpanda topics or use an existing catalog. If you create a new catalog, Redpanda automatically creates the required schema for you. If you need to integrate with an existing catalog, you must manually create the schema in that catalog before Redpanda creates any Iceberg tables. | ||||||||||
|
|
||||||||||
| After you set up your catalog, the authorization and Redpanda configuration steps are the same for both options. | ||||||||||
|
|
||||||||||
| === Option 1: Create a new catalog (recommended) | ||||||||||
|
|
||||||||||
| Follow the steps in the Databricks documentation to https://docs.databricks.com/aws/en/catalogs/create-catalog[create a standard catalog^]. When you create the catalog, specify the external location you created in the previous step as the storage location. | ||||||||||
|
|
||||||||||
| You use the catalog name when you set the Iceberg cluster configuration properties in Redpanda in a later step. | ||||||||||
| In this setup, Redpanda creates the default `redpanda` schema for you. You use the catalog name when you set the Iceberg cluster configuration properties in Redpanda in a later step. | ||||||||||
|
|
||||||||||
| === Option 2: Use an existing catalog with a pre-created schema | ||||||||||
|
|
||||||||||
| If you need to integrate Redpanda with an existing Unity Catalog catalog object, follow the steps to https://docs.databricks.com/aws/en/schemas/create-schema[create a schema^] in the catalog. | ||||||||||
|
|
||||||||||
| * By default, Redpanda creates tables in a schema named `redpanda`. If you want to use a different schema, set config_ref:iceberg_default_catalog_namespace,true,properties/cluster-properties[`iceberg_default_catalog_namespace`] before enabling Iceberg, then manually create that schema in the catalog. | ||||||||||
| * Set the schema's managed storage location to the same S3 bucket used for Redpanda Tiered Storage, using the external location you created in the previous step. | ||||||||||
|
|
||||||||||
| Unity Catalog resolves managed storage locations through a hierarchy of metastore > catalog > schema. If you assign the schema its own managed storage location, Redpanda can use the existing catalog while the schema stores its managed Iceberg data in the schema-specific location. | ||||||||||
|
|
||||||||||
| For example: | ||||||||||
|
Contributor
Suggested change
I would move this paragraph here so that the next lines illustrate the concept, and then provide the Databricks doc link afterwards for readers who want to learn more.
||||||||||
|
|
||||||||||
| * Your existing Unity Catalog catalog stores managed data in `s3://<catalog-bucket-name>`. | ||||||||||
| ifdef::env-cloud[] | ||||||||||
| * You manually create a `redpanda` schema in that catalog and override its managed storage location, through the external location, to the S3 bucket that Redpanda uses for your cluster's object storage (`s3://redpanda-cloud-storage-<cluster-id>` for BYOC, or your customer-managed bucket for BYOVPC). | ||||||||||
| endif::[] | ||||||||||
| ifndef::env-cloud[] | ||||||||||
| * You manually create a `redpanda` schema in that catalog and override its managed storage location, through the external location, to `s3://<cluster-bucket-name>`, which matches the S3 bucket that Redpanda uses for Tiered Storage. | ||||||||||
| endif::[] | ||||||||||
|
|
||||||||||
| For more information, see the https://docs.databricks.com/aws/en/data-governance/unity-catalog/#managed-storage-location-hierarchy[Unity Catalog managed storage location hierarchy^] in the Databricks documentation. | ||||||||||
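As an illustration of Option 2, shown outside the diff itself, the schema could be created with Databricks SQL roughly as follows. This is a sketch under assumptions: `<catalog-name>` is the existing catalog, the default `redpanda` namespace is in use, and the managed location falls within the external location created earlier. Whether the bucket root or a subpath is appropriate is not stated in the text, so treat the path as a placeholder.

```sql
-- Sketch only: replace the placeholders with your own values.
-- Creates the schema Redpanda writes to and overrides its managed
-- storage location so that it points at the Redpanda bucket.
CREATE SCHEMA IF NOT EXISTS `<catalog-name>`.redpanda
  MANAGED LOCATION 's3://<bucket-name>';

-- Optional check: the extended description includes the schema's location.
DESCRIBE SCHEMA EXTENDED `<catalog-name>`.redpanda;
```

If `iceberg_default_catalog_namespace` is set to a custom namespace, the schema name in both statements would change to match it.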
|
|
||||||||||
| == Authorize access to Unity Catalog | ||||||||||
|
|
||||||||||
|
|
@@ -118,6 +154,8 @@ iceberg_rest_catalog_client_id: <service-principal-client-id> | |||||||||
| iceberg_rest_catalog_client_secret: <service-principal-client-secret> | ||||||||||
| iceberg_rest_catalog_warehouse: <unity-catalog-name> | ||||||||||
| iceberg_disable_snapshot_tagging: true | ||||||||||
| # Optional. Set a custom namespace only if you want to use a schema other than the default `redpanda` | ||||||||||
| # iceberg_default_catalog_namespace: ["<custom-namespace>"] | ||||||||||
| ---- | ||||||||||
| endif::[] | ||||||||||
| ifdef::env-cloud[] | ||||||||||
|
|
@@ -142,6 +180,8 @@ rpk cluster config set \ | |||||||||
| iceberg_rest_catalog_client_secret='${secrets.<service-principal-client-secret-name>}' \ | ||||||||||
| iceberg_rest_catalog_warehouse=<unity-catalog-name> \ | ||||||||||
| iceberg_disable_snapshot_tagging=true | ||||||||||
| # Optional. Set a custom namespace only if you want to use a schema other than the default `redpanda` | ||||||||||
| # iceberg_default_catalog_namespace='["<custom-namespace>"]' | ||||||||||
| ---- | ||||||||||
| endif::[] | ||||||||||
| + | ||||||||||
|
|
@@ -210,7 +250,11 @@ The following example shows how to query the Iceberg table using SQL in Databric | |||||||||
| + | ||||||||||
| [,sql] | ||||||||||
| ---- | ||||||||||
| -- Ensure that the catalog and table name are correctly parsed in case they contain special characters | ||||||||||
| /* Ensure that the catalog and table name are correctly parsed in case they | ||||||||||
| contain special characters. | ||||||||||
|
|
||||||||||
| If you set iceberg_default_catalog_namespace to a custom namespace, replace | ||||||||||
| `redpanda` with that namespace in the query below. */ | ||||||||||
| SELECT * FROM `<catalog-name>`.redpanda.`<table-name>` LIMIT 10; | ||||||||||
| ---- | ||||||||||
| + | ||||||||||
|
|
||||||||||
Would recommend an intro like this to help tighten the following subsections.
Also, does "manually create the schema before Redpanda creates any Iceberg tables" specifically mean that this step should be done before enabling Iceberg topics?