Closed
56 commits
a31d6a8
Configure beta release (#985)
JakeSCahill Feb 25, 2025
8b09d9a
Doc-850 SASL/PLAIN Authentication
Feediver1 Feb 26, 2025
76eb542
Merge branch 'main' into beta
Feediver1 Feb 26, 2025
cf6cdbc
Merge branch 'main' into beta
Feediver1 Feb 27, 2025
443c1b3
Revert "Doc-850 SASL/PLAIN Authentication"
JakeSCahill Feb 27, 2025
fbf0815
Doc-850 SASL/PLAIN Authentication
Feediver1 Feb 26, 2025
2a6f5a1
Review feedback
Feediver1 Feb 27, 2025
1f21cad
Revert "Review feedback"
JakeSCahill Feb 27, 2025
9e45cce
Revert "Doc-850 SASL/PLAIN Authentication"
JakeSCahill Feb 27, 2025
5f96dda
Support protobuf normalization in schema registry (#992)
kbatuigas Mar 3, 2025
9849e9b
DOC-850 SASL/PLAIN AuthN (#991)
JakeSCahill Mar 3, 2025
c923717
Addition to 25.1 beta What's New for SASL/PLAIN (#997)
Feediver1 Mar 3, 2025
b5caa8f
Fix typo in authentication.adoc (#999)
Feediver1 Mar 3, 2025
892b0d6
[25.1 beta] Iceberg updates (#989)
kbatuigas Mar 5, 2025
26c1435
Doc-918 first draft
Feediver1 Mar 6, 2025
b5dcac3
Update modules/manage/pages/cluster-maintenance/manage-throughput.adoc
Feediver1 Mar 6, 2025
c313aca
Update modules/manage/pages/cluster-maintenance/manage-throughput.adoc
Feediver1 Mar 7, 2025
212262d
Additional update
Feediver1 Mar 7, 2025
9bb3515
DOC-1093 Add new cluster properties for 25.1 beta (#1003)
JakeSCahill Mar 12, 2025
b2e1023
DOC-1025 Update Console docs for v3 beta (#994)
JakeSCahill Mar 12, 2025
f2e05ee
Update install-beta.adoc - add in certmanager (#1010)
david-yu Mar 13, 2025
22d80e5
Doc-851 Persistent Stacktrace Logging (#1011)
Feediver1 Mar 14, 2025
ce0f9f1
Rephrase lack of OIDC group support (#1012)
JakeSCahill Mar 18, 2025
86fa3e1
DOC-933 Document new consumer group lag metrics and configs (#1014)
JakeSCahill Mar 19, 2025
5af796c
Adds new properties added in 25.1.1-rc2 (#1018)
JakeSCahill Mar 19, 2025
c129523
First draft - TS Safe pause and resume
Feediver1 Mar 18, 2025
5e51d3c
Revert "First draft - TS Safe pause and resume"
JakeSCahill Mar 19, 2025
d4ca48b
Add a tool for converting Console v2 configs to v3 (#1016)
JakeSCahill Mar 21, 2025
4c06437
Add beta install docs for Redpanda Console (#1028)
JakeSCahill Mar 25, 2025
dea4b41
How to use rpk to analyze partitions and size clusters (#1034)
JakeSCahill Mar 28, 2025
f2c28b2
rpk support for protobuf well-known types (#1040)
kbatuigas Apr 1, 2025
f257d8a
Rolling restart Admin API (#1026)
kbatuigas Apr 1, 2025
1f5f9f6
New rpk commands (#1031)
kbatuigas Apr 2, 2025
cae145e
New metrics (#1039)
JakeSCahill Apr 2, 2025
4dd52a5
DOC-1175 Document the min.cleanable.dirty.ratio property and how to u…
JakeSCahill Apr 3, 2025
816545c
Clarify which metrics become available when property is enabled (#1049)
JakeSCahill Apr 3, 2025
87a649f
Add new properties and metrics to the what's new (#1048)
JakeSCahill Apr 3, 2025
3e050de
Start Databricks+Unity guide for Iceberg topics
kbatuigas Apr 3, 2025
63dd4a0
Fix broken link (#1051)
JakeSCahill Apr 4, 2025
ba9e36d
Provide upgrade instructions for Redpanda Console (#1046)
JakeSCahill Apr 4, 2025
bfd6add
Partition memory related changes for 25.1 (#1052)
StephanDollberg Apr 4, 2025
c06a250
Update whats-new.adoc
JakeSCahill Apr 4, 2025
978f814
Update whats-new.adoc
JakeSCahill Apr 4, 2025
29fd716
Fix link
JakeSCahill Apr 4, 2025
6519e2d
TS Safe pause and resume (#1017)
Feediver1 Apr 4, 2025
44b720c
Single source Iceberg (#1032)
kbatuigas Apr 4, 2025
458d4ba
Iceberg performance (#1042)
kbatuigas Apr 5, 2025
5e6e469
DOC-970 Documentation for Operator 2.4.x which defaults to Flux disab…
JakeSCahill Apr 6, 2025
707b72d
Merge branch 'beta' into DOC-1045-Iceberg-Databricks-integration
JakeSCahill Apr 6, 2025
afed685
Update whats-new.adoc for pause and resume (#1053)
Feediver1 Apr 6, 2025
d92c313
Rebase main onto beta (#1056)
JakeSCahill Apr 6, 2025
555b718
25.1 GA (#1059)
JakeSCahill Apr 7, 2025
17f5d49
Merge branch 'main' of https://github.com/redpanda-data/docs into beta
JakeSCahill Apr 7, 2025
be2027a
Apply suggestions from code review
JakeSCahill Apr 7, 2025
f3df080
Merge branch 'beta' into DOC-1045-Iceberg-Databricks-integration
JakeSCahill Apr 7, 2025
47375fc
Merge branch 'main' into DOC-1045-Iceberg-Databricks-integration
JakeSCahill Apr 7, 2025
1 change: 1 addition & 0 deletions modules/ROOT/nav.adoc
@@ -183,6 +183,7 @@
*** xref:manage:iceberg/topic-iceberg-integration.adoc[About Iceberg Topics]
*** xref:manage:iceberg/use-iceberg-catalogs.adoc[Use Iceberg Catalogs]
*** xref:manage:iceberg/query-iceberg-topics.adoc[Query Iceberg Topics]
*** xref:manage:iceberg/iceberg-topics-databricks-unity.adoc[Query Iceberg Topics with Databricks]
*** xref:manage:iceberg/redpanda-topics-iceberg-snowflake-catalog.adoc[Query Iceberg Topics with Snowflake]
** xref:manage:schema-reg/index.adoc[Schema Registry]
*** xref:manage:schema-reg/schema-reg-overview.adoc[Overview]
160 changes: 160 additions & 0 deletions modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc
@@ -0,0 +1,160 @@
= Query Iceberg Topics using Databricks and Unity Catalog
:description: Add Redpanda topics as Iceberg tables that you can query in Databricks managed by Unity Catalog.
:page-categories: Iceberg, Tiered Storage, Management, High Availability, Data Replication, Integration
:page-beta: true

[NOTE]
====
include::shared:partial$enterprise-license.adoc[]
====

This guide walks you through querying Redpanda topics as Iceberg tables in Databricks, with AWS S3 as object storage and a catalog integration using https://docs.databricks.com/aws/en/data-governance/unity-catalog[Unity Catalog^]. For general information about Iceberg catalog integrations in Redpanda, see xref:manage:iceberg/use-iceberg-catalogs.adoc[].

== Prerequisites

* xref:manage:tiered-storage.adoc#configure-object-storage[Object storage configured] for your cluster and xref:manage:tiered-storage.adoc#enable-tiered-storage[Tiered Storage enabled] for the topics for which you want to generate Iceberg tables.
+
You need the AWS S3 bucket URI, so you can configure it as an external location in Unity Catalog.
* A Databricks workspace in the same region as your S3 bucket. See the https://docs.databricks.com/aws/en/resources/supported-regions#supported-regions-list[list of supported AWS regions^].
* Unity Catalog enabled in your Databricks workspace. See the https://docs.databricks.com/aws/en/data-governance/unity-catalog/get-started[official Databricks documentation^] to set up Unity Catalog for your workspace.
* Predictive optimization enabled.
Review comment (author): Do we need to provide specific guidance regarding what level (for example, account vs. catalog vs. schema) predictive optimization should be enabled at?

* External data access enabled in your metastore.
* Workspace admin privileges to complete the steps to create a Unity Catalog storage credential and external location that connects your Tiered Storage bucket to Databricks.

== Create a Unity Catalog storage credential

A storage credential is a Databricks object that controls access to external object storage, in this case S3. You associate a storage credential with an AWS IAM role that defines what actions Unity Catalog can perform in the S3 bucket.

Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/storage-credentials[official Databricks documentation^] to create an AWS IAM role that has the required permissions for the bucket. When you have completed these steps, you should have the following configured in AWS and Databricks:
Review comment (contributor): Vale flagged two issues on the preceding line: a spelling query for 'Databricks' (a false positive; the spelling is correct) and a style warning suggesting 'will' or 'must' instead of 'should'. Proposed fix: replace "you should have the following configured" with "you will have the following configured" to make the statement more definitive.


* A self-assuming IAM role, meaning you've defined the role trust policy so the role trusts itself.
* Two IAM policies attached to the IAM role. The first policy grants Unity Catalog read and write access to the bucket. The second policy allows Unity Catalog to configure file events.
Review comment (author): Do these policies need to be restricted down to the catalog root `redpanda-iceberg-catalog`?
* A storage credential in Databricks associated with the IAM role, using the role's ARN. You also use the storage credential's external ID in the role's trust relationship policy to make the role self-assuming.
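
For illustration, the self-assuming trust policy can be sketched as follows. All ARNs and the external ID below are placeholders, not values from this guide; take the actual Unity Catalog role ARN and external ID from the Databricks documentation and your storage credential:

[,json]
----
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "<databricks-unity-catalog-role-arn>",
          "arn:aws:iam::<your-account-id>:role/<this-role-name>"
        ]
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<storage-credential-external-id>"
        }
      }
    }
  ]
}
----

Including the role's own ARN in the principal list is what makes the role self-assuming.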

== Create a Unity Catalog external location

The external location points to the location of the Iceberg data in your S3 bucket.

Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/external-locations[official Databricks documentation^] to *manually* create an external location. You can create the external location in the Catalog Explorer, or using SQL. You must create the external location manually because the location needs to be associated with the existing Tiered Storage bucket URL, `s3://<bucket-name>`.
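
For example, creating the external location in SQL might look like the following sketch. The location and credential names are placeholders; verify the exact syntax against the Databricks documentation linked above:

[,sql]
----
-- Associate the Tiered Storage bucket with the storage credential created earlier
CREATE EXTERNAL LOCATION redpanda_iceberg_location
URL 's3://<bucket-name>'
WITH (STORAGE CREDENTIAL <storage-credential-name>);
----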
Review comment (author): Is this location fine, or does the catalog folder `redpanda-iceberg-catalog` need to be specified here?


== Create a new catalog

Follow the steps in the official Databricks documentation to https://docs.databricks.com/aws/en/catalogs/create-catalog[create a standard catalog^]. When you create the catalog, specify the following as the storage location:

* The external location you created in the previous step.
* In the subpath field, enter `redpanda-iceberg-catalog`.

You use the catalog name when you set the Iceberg cluster configuration properties in Redpanda in a later step.
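
If you prefer SQL over the Catalog Explorer UI, the equivalent statement is along these lines. This is a sketch; the catalog and bucket names are placeholders:

[,sql]
----
-- The managed location combines the external location URL
-- with the redpanda-iceberg-catalog subpath
CREATE CATALOG <unity-catalog-name>
MANAGED LOCATION 's3://<bucket-name>/redpanda-iceberg-catalog';
----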

== Authorize access to Unity Catalog

Redpanda recommends using OAuth for service principals to grant Redpanda access to Unity Catalog.

. Follow the steps in the https://docs.databricks.com/aws/en/dev-tools/auth/oauth-m2m[official Databricks documentation^] to create a service principal, and then generate an OAuth secret. You use the client ID and secret to set Iceberg cluster configuration properties in Redpanda in the next step.
. Open your catalog in the Catalog Explorer, then click *Permissions*.
. Click *Grant* to grant the service principal the following permissions on the catalog:
+
* `ALL PRIVILEGES`
* `EXTERNAL USE SCHEMA`
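
The grants above can also be sketched in SQL. In Databricks SQL, a service principal is typically referenced by its application (client) ID in backquotes; verify the grant syntax against the Databricks privileges documentation:

[,sql]
----
GRANT ALL PRIVILEGES ON CATALOG <unity-catalog-name> TO `<service-principal-client-id>`;
GRANT EXTERNAL USE SCHEMA ON CATALOG <unity-catalog-name> TO `<service-principal-client-id>`;
----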

The Iceberg integration for Redpanda also supports using bearer tokens.
Review comment (author): Is this fine to mention here? In which instances should we recommend the bearer token option?


== Update cluster configuration

To configure your Redpanda cluster to enable Iceberg on a topic and integrate with Unity Catalog:

. Edit your cluster configuration to set the `iceberg_enabled` property to `true`, and set the catalog integration properties listed in the example below. You can run `rpk cluster config edit` to update these properties:
+
[,yaml]
----
iceberg_enabled: true
iceberg_catalog_type: rest
iceberg_rest_catalog_endpoint: https://<workspace-instance>/api/2.1/unity-catalog/iceberg
# Review comment: The Databricks docs specify /api/2.1/unity-catalog/iceberg, but our
# internal examples seem to use /api/2.1/unity-catalog/iceberg-rest. Can someone confirm
# the correct value?
iceberg_rest_catalog_authentication_mode: oauth2
iceberg_rest_catalog_oauth2_server_uri: https://<workspace-instance>/oidc/v1/token
iceberg_rest_catalog_oauth2_scope: all-apis
iceberg_rest_catalog_client_id: <service-principal-client-id>
iceberg_rest_catalog_client_secret: <service-principal-client-secret>
iceberg_rest_catalog_warehouse: <unity-catalog-name>
iceberg_disable_snapshot_tagging: true

# Optional
iceberg_translation_interval_ms_default: 1000
iceberg_catalog_commit_interval_ms: 1000
----
+
Use your own values for the following placeholders:
+
--
- `<workspace-instance>`: The URL of your https://docs.databricks.com/aws/en/workspace/workspace-details#workspace-instance-names-urls-and-ids[Databricks workspace instance^], for example, `cust-success.cloud.databricks.com`.
- `<service-principal-client-id>`: The client ID of the service principal you created in an earlier step.
- `<service-principal-client-secret>`: The client secret of the service principal you created in an earlier step.
- `<unity-catalog-name>`: The name of your catalog in Unity Catalog.
--
+
[,bash,role=no-copy]
----
Successfully updated configuration. New configuration version is 2.
----

. If you changed these properties on a running cluster, restart the cluster for the changes to take effect.

. Enable the integration for a topic by configuring the topic property `redpanda.iceberg.mode`. This mode creates an Iceberg table for the topic consisting of two columns, one for the record metadata including the key, and another binary column for the record's value. See xref:manage:iceberg/topic-iceberg-integration.adoc#enable-iceberg-integration[Enable Iceberg integration] for more details on Iceberg modes. The following examples show how to use xref:get-started:rpk-install.adoc[`rpk`] to either create a new topic, or alter the configuration for an existing topic, to set the Iceberg mode to `key_value`.
+
.Create a new topic and set `redpanda.iceberg.mode`:
[,bash]
----
rpk topic create <topic-name> --topic-config=redpanda.iceberg.mode=key_value
----
+
.Set `redpanda.iceberg.mode` for an existing topic:
[,bash]
----
rpk topic alter-config <topic-name> --set redpanda.iceberg.mode=key_value
----

. Produce to the topic. For example:
+
[,bash]
----
# Each input line is parsed as "<key> <value>", for example key "hello" and value "world"
printf 'hello world\nfoo bar\nbaz qux\n' | rpk topic produce <topic-name> --format='%k %v\n'
----

You should now see the topic as a table in Unity Catalog.

. In Catalog Explorer, open your catalog. You should see a `redpanda` schema, in addition to `default` and `information_schema`. The `redpanda` schema and the table within it are created automatically, and the table name matches the topic name.

== Query Iceberg table using Databricks SQL

You can query the Iceberg table using different engines, such as Databricks SQL, PyIceberg, or Apache Spark. To query the table or view the table data in Catalog Explorer, ensure that your account has the necessary permissions to read the table. Review the official Databricks documentation on https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/?language=SQL#grant-permissions-on-objects-in-a-unity-catalog-metastore[granting permissions to objects^] and https://docs.databricks.com/aws/en/data-governance/unity-catalog/manage-privileges/privileges[Unity Catalog privileges^] for details.

The following example shows how to query the Iceberg table using SQL in Databricks SQL.

. In the Databricks console, open *SQL Editor*.
. In the query editor, run the following:
+
[,sql]
----
-- Ensure that the catalog and table name are correctly parsed in case they contain special characters
SELECT * FROM `<catalog-name>`.redpanda.`<table-name>`;
----
+
Your query results should look like the following:
+
[,sql,role="no-copy no-wrap"]
----
-- Example for redpanda.iceberg.mode=key_value with 1 record produced to the topic
+--------------------------------------------------------------------------+------------+-------------------------+
| redpanda | value | redpanda.timestamp_hour |
-- Review comment: Is timestamp_hour a new addition to this schema?

+--------------------------------------------------------------------------+------------+-------------------------+
| {"partition":0,"offset":"0","timestamp":"2025-04-02T18:57:11.127Z", | 776f726c64 | 2025-04-02-18 |
| "headers":null,"key":"68656c6c6f"} | | |
+--------------------------------------------------------------------------+------------+-------------------------+
----
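
In `key_value` mode, the record key and value are stored as binary, so they surface as hex strings in the query output (for example, `68656c6c6f` is the key from the `redpanda` column above). A quick way to decode them locally with standard Python, no external dependencies:

[,python]
----
# Decode the hex-encoded key and value columns from the example output above
def decode_hex(hex_str: str) -> str:
    """Convert a hex string such as '68656c6c6f' back to UTF-8 text."""
    return bytes.fromhex(hex_str).decode("utf-8")

print(decode_hex("68656c6c6f"))  # key   -> hello
print(decode_hex("776f726c64"))  # value -> world
----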

Review comment (author): Is there anything we should mention in this doc regarding metadata refresh? Does Unity automatically refresh table metadata?

== See also

- xref:manage:iceberg/query-iceberg-topics.adoc[]