Query Iceberg topics using Databricks and Unity Catalog #1050

Closed
kbatuigas wants to merge 56 commits into main from DOC-1045-Iceberg-Databricks-integration

@kbatuigas kbatuigas commented Apr 3, 2025

Description

This pull request adds a new guide for querying Iceberg topics using Databricks and Unity Catalog to the documentation. The key changes include updating the navigation to include the new guide and adding detailed instructions for the integration process.

Documentation updates:

New guide:

  • modules/manage/pages/iceberg/iceberg-topics-databricks-unity.adoc: Added a comprehensive guide on how to query Redpanda topics as Iceberg tables in Databricks using Unity Catalog. This includes prerequisites, creating storage credentials and external locations, updating cluster configurations, and querying the Iceberg table using Databricks SQL.

Resolves https://redpandadata.atlassian.net/browse/
Review deadline: 4 April

Page previews

https://deploy-preview-1050--redpanda-docs-preview.netlify.app/25.1/manage/iceberg/iceberg-topics-databricks-unity/

Checks

  • New feature
  • Content gap
  • Support Follow-up
  • Small fix (typos, links, copyedits, etc)

JakeSCahill and others added 30 commits February 25, 2025 16:34
This reverts commit 2a6f5a1.
Co-authored-by: Joyce Fee <joyce@redpanda.com>
Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com>
Co-authored-by: Michele Cyran <michele@redpanda.com>
Co-authored-by: Kat Batuigas <36839689+kbatuigas@users.noreply.github.com>
Co-authored-by: Martin Schneppenheim <23424570+weeco@users.noreply.github.com>
Co-authored-by: hyperlint-ai[bot] <154288675+hyperlint-ai[bot]@users.noreply.github.com>
Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com>
Co-authored-by: Jake Cahill <45230295+JakeSCahill@users.noreply.github.com>
Co-authored-by: hyperlint-ai[bot] <154288675+hyperlint-ai[bot]@users.noreply.github.com>
Co-authored-by: Angela Simms <102690377+asimms41@users.noreply.github.com>
----
iceberg_enabled: true
iceberg_catalog_type: rest
iceberg_rest_catalog_endpoint: https://<workspace-instance>/api/2.1/unity-catalog/iceberg
Contributor Author

The Databricks docs specify /api/2.1/unity-catalog/iceberg but our internal examples seem to use /api/2.1/unity-catalog/iceberg-rest. Can someone confirm what the correct value is?
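
To make the two candidates in this thread easier to compare side by side, here is a minimal Python sketch that renders the three cluster properties from the snippet above for each path. It does not settle which path is correct. The function names and the workspace host `example.cloud.databricks.com` are hypothetical placeholders, not values from this PR.

```python
from urllib.parse import urlunsplit

# The two REST paths mentioned in this review thread. Which one is correct
# is the open question; this sketch only renders both for comparison.
CANDIDATE_PATHS = (
    "/api/2.1/unity-catalog/iceberg",
    "/api/2.1/unity-catalog/iceberg-rest",
)


def iceberg_rest_endpoint(workspace_instance: str, path: str) -> str:
    """Build an https URL from a workspace host name and a REST path."""
    return urlunsplit(("https", workspace_instance, path, "", ""))


def render_properties(workspace_instance: str, path: str) -> str:
    """Render the three properties from the snippet under review."""
    endpoint = iceberg_rest_endpoint(workspace_instance, path)
    return "\n".join(
        [
            "iceberg_enabled: true",
            "iceberg_catalog_type: rest",
            f"iceberg_rest_catalog_endpoint: {endpoint}",
        ]
    )


if __name__ == "__main__":
    # Hypothetical host name, used only for illustration.
    host = "example.cloud.databricks.com"
    for candidate in CANDIDATE_PATHS:
        print(render_properties(host, candidate))
        print()
```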

----
# Example for redpanda.iceberg.mode=key_value with 1 record produced to topic
+--------------------------------------------------------------------------+------------+-------------------------+
| redpanda | value | redpanda.timestamp_hour |
Contributor Author

Is timestamp_hour a new addition to this schema?

| "headers":null,"key":"68656c6c6f"} | | |
+--------------------------------------------------------------------------+------------+-------------------------+
----
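
For readers of the example output, the `key` value `68656c6c6f` appears to be the record key rendered as hex-encoded bytes. That reading is an assumption based on this excerpt, not something the PR states. Under that assumption, a minimal Python sketch for decoding it (the helper name `decode_hex_key` is hypothetical):

```python
def decode_hex_key(hex_key: str) -> str:
    """Decode a hex-encoded byte string into UTF-8 text.

    Assumes the key is shown as hex-encoded bytes and that those bytes
    are valid UTF-8; a binary key would need different handling.
    """
    return bytes.fromhex(hex_key).decode("utf-8")


if __name__ == "__main__":
    # Key value copied from the example table in this diff.
    print(decode_hex_key("68656c6c6f"))  # prints "hello"
```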

Contributor Author

Is there anything we should mention in this doc regarding metadata refresh? Does Unity automatically refresh table metadata?

JakeSCahill and others added 11 commits April 4, 2025 11:44
Co-authored-by: Jake Cahill <45230295+JakeSCahill@users.noreply.github.com>
Co-authored-by: Michele Cyran <michele@redpanda.com>
Co-authored-by: Michele Cyran <michele@redpanda.com>
Co-authored-by: Michele Cyran <michele@redpanda.com>
…led for new installs (#965)

Co-authored-by: Andrew Stucki <andrew.stucki@gmail.com>
Co-authored-by: David Yu <yongshinyu@gmail.com>
Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com>
Co-authored-by: Joyce Fee <joyce@redpanda.com>
Co-authored-by: hyperlint-ai[bot] <154288675+hyperlint-ai[bot]@users.noreply.github.com>
Co-authored-by: Chris Seto <chriskseto@gmail.com>
Co-authored-by: Angela Simms <102690377+asimms41@users.noreply.github.com>

A storage credential is a Databricks object that controls access to external object storage, in this case S3. You associate a storage credential with an AWS IAM role that defines what actions Unity Catalog can perform in the S3 bucket.

Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/storage-credentials[official Databricks documentation^] to create an AWS IAM role that has the required permissions for the bucket. When you have completed these steps, you should have the following configured in AWS and Databricks:
Contributor

Identified issues

  • Vale Style Guide - (Spelling-error) Did you really mean 'Databricks'?
  • Vale Style Guide - (CustomStyle.SwapWords-warning) Consider using 'will' or 'must' instead of 'should'

Proposed fix

Suggested change
- Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/storage-credentials[official Databricks documentation^] to create an AWS IAM role that has the required permissions for the bucket. When you have completed these steps, you should have the following configured in AWS and Databricks:
+ Follow the steps in the https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/storage-credentials[official Databricks documentation^] to create an AWS IAM role that has the required permissions for the bucket. When you have completed these steps, you will have the following configured in AWS and Databricks:

The spelling of 'Databricks' is correct, and I believe the use of 'should' can be replaced with 'will' to make the instruction more definitive.

Feediver1 and others added 6 commits April 6, 2025 11:09
Co-authored-by: Michele Cyran <michele@redpanda.com>
Co-authored-by: Joyce Fee <102751339+Feediver1@users.noreply.github.com>
Co-authored-by: Suslik Da-Rete <suslik_da-rete@protonmail.ch>
Co-authored-by: Rogger Vasquez <59714880+r-vasquez@users.noreply.github.com>
Co-authored-by: hyperlint-ai[bot] <154288675+hyperlint-ai[bot]@users.noreply.github.com>
Co-authored-by: Yaniv Ben Hemo <70286779+yanivbh1@users.noreply.github.com>
Co-authored-by: Kat Batuigas <36839689+kbatuigas@users.noreply.github.com>
Co-authored-by: Paulo Borges <paulohtb@hotmail.com>
Co-authored-by: Bill Chambers <wchambers@ischool.berkeley.edu>
Co-authored-by: David Yu <yongshinyu@gmail.com>
Co-authored-by: Angela Simms <102690377+asimms41@users.noreply.github.com>
Co-authored-by: Joyce Fee <joyce@redpanda.com>
Co-authored-by: Martin Schneppenheim <23424570+weeco@users.noreply.github.com>
Co-authored-by: Stephan Dollberg <stephan@redpanda.com>
Co-authored-by: Andrew Stucki <andrew.stucki@gmail.com>
Co-authored-by: Chris Seto <chriskseto@gmail.com>
@JakeSCahill JakeSCahill changed the base branch from beta to main April 7, 2025 13:20
@Feediver1
Contributor

@kbatuigas Please update/provide a status so folks know where this PR stands, what the delay is, etc. Thanks.

@kbatuigas
Contributor Author

@mattschumpert given our internal conversations with Databricks regarding the availability of managed tables as well as creating Iceberg tables using the metadata path, can we go ahead and close this PR for now, and reopen in June?

@mattschumpert

Whatever is most efficient. I'm sure we will publish this content.

@kbatuigas kbatuigas closed this Apr 16, 2025
6 participants