Commit be57e96

Data Cloud Agents Team authored and copybara-github committed
Project import generated by Copybara.
PiperOrigin-RevId: 932763649
1 parent 65a480a commit be57e96

1 file changed

Lines changed: 308 additions & 0 deletions

File tree

  • skills/federate-lakehouse-catalog
@@ -0,0 +1,308 @@
1+
---
2+
name: federate-lakehouse-catalog
3+
description: >-
4+
Sets up Google Cloud Lakehouse federated catalogs to remote Iceberg REST Catalogs. Currently supported catalogs: Databricks Unity, AWS Glue. Supported clouds hosting those catalogs: GCP, AWS. The primary use case is connecting to remote data to query it from GCP engines (BigQuery, Spark). Examples of when to use this: "federate my lakehouse catalog to databricks", "query data in databricks", "query data in s3", "connect to aws glue". Do NOT use for direct remote database SQL execution (e.g., Databricks SQL) or managing remote clusters and infrastructure (e.g., Databricks clusters, AWS Glue jobs).
5+
license: Apache-2.0
6+
metadata:
7+
version: v1
8+
publisher: google
9+
---
10+
11+
# Federate Lakehouse Catalog via Cross-cloud Lakehouse
12+
13+
This skill describes how to set up a federated catalog so that BigQuery can
14+
query data in AWS that is registered in a remote catalog, such as Databricks
15+
Unity Catalog or AWS Glue Data Catalog, over the public internet.
16+
17+
## Prerequisites
18+
19+
- For Databricks: Databricks Workspace URL and OAuth Service Principal (Client
20+
ID and Secret) with read access.
21+
- For AWS Glue: AWS Administrator access to create IAM roles and permissions
22+
policies.
23+
- Active Google Cloud project with administrative access to create lakehouse
24+
resources, and secrets in the case of Databricks.
25+
26+
## Procedure
27+
28+
### Step 1: Information Gathering and Region Selection
29+
30+
Before running any commands, the agent **MUST** collect the following
31+
information from the user:
32+
33+
1. Determine which catalog the user wants to federate to (e.g., Databricks
34+
Unity or AWS Glue) and verify it is supported.
35+
2. Determine where the remote data is located (the specific AWS region).
36+
3. Using the Region Pairing Best Practice in the Gotchas section, help the user
37+
pick the optimal GCP region to minimize latency.
38+
4. Collect the necessary configuration variables for the chosen flow (e.g.,
39+
Databricks credentials or AWS Account ID).
40+
41+
Only proceed to the next steps once this information is confirmed.
42+
43+
### Step 2: API Verification
44+
45+
Verify that the required Google Cloud APIs are enabled for the project:
46+
47+
```bash
48+
gcloud services list --enabled --filter="config.name:biglake.googleapis.com"
49+
```
50+
51+
If the API is not enabled, explicitly ask the user for permission to enable it.
52+
Do NOT proceed without their confirmation.
53+
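One way to turn this check into a yes/no gate is sketched below. It assumes `gcloud` is installed and authenticated; the helper name `api_is_enabled` is hypothetical. It only reports status and deliberately does not enable anything, so the confirmation step stays with the user.

```bash
# Return 0 if the API is listed among the project's enabled services,
# 1 if it is not listed, and 2 if the gcloud call itself fails.
api_is_enabled() {
  api="$1"
  project="$2"
  enabled="$(gcloud services list --enabled \
    --project="$project" \
    --filter="config.name:${api}" \
    --format="value(config.name)")" || return 2
  [ -n "$enabled" ]
}
```

A caller might run `api_is_enabled biglake.googleapis.com "<PROJECT_ID>"` and, on a non-zero status, ask the user before running `gcloud services enable`.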
54+
### Flow A: Databricks Unity Catalog
55+
56+
#### 1. Create a Regional Secret for Credentials
57+
58+
Store the Databricks client ID and secret in Secret Manager. Ensure the
59+
`secretmanager.googleapis.com` API is enabled. The secret **MUST** be in the
60+
same region as your Lakehouse catalog.
61+
62+
1. Create a JSON file named `credentials.json`:
63+
64+
```json
65+
{
66+
"client_id": "<CLIENT_ID>",
67+
"client_secret": "<CLIENT_SECRET>"
68+
}
69+
```
70+
71+
1. Set the Secret Manager API endpoint override for the region:
72+
73+
```bash
74+
gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.<REGION>.rep.googleapis.com/
75+
```
76+
77+
1. Create the secret:
78+
79+
```bash
80+
gcloud secrets create <SECRET_NAME> \
81+
--location="<REGION>" \
82+
--project="<PROJECT_ID>" \
83+
--data-file=credentials.json
84+
```
85+
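The same region value appears in the endpoint override, in `--location`, and later in the secret's resource name, so deriving all three from one variable helps avoid a mismatch. The helpers below are a minimal sketch with hypothetical names; they only format strings and call no Google Cloud API.

```bash
# Build the regional Secret Manager endpoint used for the override.
secret_manager_endpoint() {
  printf 'https://secretmanager.%s.rep.googleapis.com/\n' "$1"
}

# Build the full secret resource name: project, region, secret name.
secret_resource_name() {
  printf 'projects/%s/locations/%s/secrets/%s\n' "$1" "$2" "$3"
}
```

Note that `gcloud config set` persists the override in the active configuration. Once the regional secret work is done, `gcloud config unset api_endpoint_overrides/secretmanager` restores the default endpoint.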
86+
#### 2. Create the Federated Catalog
87+
88+
Create a BigLake Iceberg catalog of type `federated` pointing to Databricks.
89+
90+
```bash
91+
gcloud alpha biglake iceberg catalogs create <CATALOG_NAME> \
92+
--project="<PROJECT_ID>" \
93+
--primary-location="<REGION>" \
94+
--catalog-type="federated" \
95+
--federated-catalog-type="unity" \
96+
--secret-name="projects/<PROJECT_ID>/locations/<REGION>/secrets/<SECRET_NAME>" \
97+
--unity-instance-name="<UNITY_INSTANCE_NAME>" \
98+
--unity-catalog-name="<UNITY_CATALOG_NAME>" \
99+
--refresh-interval="300s"
100+
```
101+
102+
#### 3. Grant Catalog Access to the Secret
103+
104+
Grant the catalog's service account permission to read the secret.
105+
106+
1. Get the service account email by describing the catalog:
107+
108+
```bash
109+
gcloud alpha biglake iceberg catalogs describe <CATALOG_NAME> \
110+
--project="<PROJECT_ID>" \
111+
--location="<REGION>" \
112+
--format="value(biglake-service-account-id)"
113+
```
114+
115+
1. Grant access:
116+
117+
```bash
118+
gcloud secrets add-iam-policy-binding <SECRET_NAME> \
119+
--project="<PROJECT_ID>" \
120+
--location="<REGION>" \
121+
--member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
122+
--role="roles/secretmanager.secretAccessor"
123+
```
124+
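The two commands above can be chained so the service account email never has to be copied by hand. This is a sketch under the assumption that the `describe` output field is `biglake-service-account-id`, as used in this skill; the function name `grant_catalog_secret_access` is hypothetical.

```bash
# Look up the catalog's service account and grant it read access to the
# secret. Stops early if the describe call fails or returns nothing.
grant_catalog_secret_access() {
  catalog="$1"
  project="$2"
  region="$3"
  secret="$4"
  sa_email="$(gcloud alpha biglake iceberg catalogs describe "$catalog" \
    --project="$project" \
    --location="$region" \
    --format="value(biglake-service-account-id)")" || return 1
  if [ -z "$sa_email" ]; then
    echo "No service account returned for catalog $catalog" >&2
    return 1
  fi
  gcloud secrets add-iam-policy-binding "$secret" \
    --project="$project" \
    --location="$region" \
    --member="serviceAccount:${sa_email}" \
    --role="roles/secretmanager.secretAccessor"
}
```

Failing on an empty value avoids creating a binding for the malformed member `serviceAccount:`.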
125+
### Flow B: AWS Glue
126+
127+
#### 1. Create the AWS IAM role with a placeholder trust policy
128+
129+
Lakehouse provisions the Google service account ID only after the catalog is
130+
created, and catalog creation requires the role ARN. Create the AWS IAM role
with a placeholder trust policy first.
131+
132+
1. Create a file named `trust_policy.json`:
133+
134+
```json
135+
{
136+
"Version": "2012-10-17",
137+
"Statement": [
138+
{
139+
"Effect": "Allow",
140+
"Principal": {
141+
"Federated": "accounts.google.com"
142+
},
143+
"Action": "sts:AssumeRoleWithWebIdentity",
144+
"Condition": {
145+
"StringEquals": {
146+
"accounts.google.com:aud": ["PLACEHOLDER_VALUE"],
147+
"accounts.google.com:sub": ["PLACEHOLDER_VALUE"]
148+
}
149+
}
150+
}
151+
]
152+
}
153+
```
154+
155+
1. Run the AWS CLI command to create the role:
156+
157+
```bash
158+
aws iam create-role \
159+
--role-name <AWS_ROLE_NAME> \
160+
--assume-role-policy-document file://trust_policy.json \
161+
--max-session-duration 43200
162+
```
163+
164+
#### 2. Attach a permissions policy
165+
166+
Attach a policy that allows Lakehouse to read from Glue and S3.
167+
168+
> [!IMPORTANT] **Safe IAM Scoping**: The example below uses wildcard structures
169+
> for illustration. You **MUST** consult with the user to scope the `Resource`
170+
> ARNs to their specific catalog, database, and S3 buckets. Do NOT blindly apply
171+
> wildcard permissions.
172+
173+
```json
174+
{
175+
"Version": "2012-10-17",
176+
"Statement": [
177+
{
178+
"Sid": "GlueRead",
179+
"Effect": "Allow",
180+
"Action": [
181+
"glue:GetCatalog",
182+
"glue:GetDatabase",
183+
"glue:GetDatabases",
184+
"glue:GetTable",
185+
"glue:GetTables"
186+
],
187+
"Resource": "arn:aws:glue:<AWS_REGION>:<AWS_ACCOUNT_ID>:catalog"
188+
},
189+
{
190+
"Sid": "S3Read",
191+
"Effect": "Allow",
192+
"Action": [
193+
"s3:ListBucket",
194+
"s3:GetObject"
195+
],
196+
"Resource": [
197+
"arn:aws:s3:::<SPECIFIC_BUCKET>",
198+
"arn:aws:s3:::<SPECIFIC_BUCKET>/*"
199+
]
200+
}
201+
]
202+
}
203+
```
204+
205+
Attach this permissions policy to your IAM role.
206+
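One possible way to attach the policy is as an inline policy with `aws iam put-role-policy`. The sketch below assumes the JSON above was saved to a local file; the function name and the policy name passed to it are hypothetical. It refuses to run while the file still contains an unfilled `<PLACEHOLDER>` token, which is a cheap guard against applying the template as-is.

```bash
# Attach a permissions policy file to the role as an inline policy.
# Aborts if the file still contains a template token such as
# <SPECIFIC_BUCKET> or <AWS_ACCOUNT_ID>.
attach_permissions_policy() {
  role="$1"
  policy_name="$2"
  policy_file="$3"
  if grep -q '<[A-Z_]*>' "$policy_file"; then
    echo "Unfilled placeholder in $policy_file" >&2
    return 1
  fi
  aws iam put-role-policy \
    --role-name "$role" \
    --policy-name "$policy_name" \
    --policy-document "file://${policy_file}"
}
```

The placeholder check does not replace the scoping review described above; it only catches a file that was never edited.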
207+
#### 3. Create the Federated Catalog
208+
209+
When creating an AWS Glue federated catalog, the `--glue-warehouse` flag **MUST** be
210+
set to your 12-digit AWS Account ID string (not an S3 bucket URI). **Best
211+
Practice**: Initialize the catalog without specifying a refresh schedule to
212+
prevent premature metadata synchronization failures while AWS trust
213+
relationships are propagating.
214+
215+
```bash
216+
gcloud alpha biglake iceberg catalogs create <CATALOG_NAME> \
217+
--project="<PROJECT_ID>" \
218+
--primary-location="<REGION>" \
219+
--catalog-type="federated" \
220+
--federated-catalog-type="glue" \
221+
--glue-warehouse="<AWS_ACCOUNT_ID>" \
222+
--glue-aws-region="<AWS_REGION>" \
223+
--glue-aws-role-arn="arn:aws:iam::<AWS_ACCOUNT_ID>:role/<AWS_ROLE_NAME>"
224+
```
225+
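Because passing an S3 URI to `--glue-warehouse` is an easy mistake, a quick format check before running the command can help. This is a minimal sketch with a hypothetical function name; it only verifies that the value is exactly 12 digits, not that the account exists.

```bash
# Return 0 only if the argument is exactly 12 decimal digits.
is_aws_account_id() {
  case "$1" in
    "" | *[!0-9]*) return 1 ;;
  esac
  [ "${#1}" -eq 12 ]
}
```

For instance, `is_aws_account_id 123456789012` succeeds, while `is_aws_account_id s3://my-bucket` fails.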
226+
#### 4. Update the trust policy
227+
228+
Extract the `biglake-service-account-id` from the created catalog, and update
229+
your AWS IAM role's trust policy to replace `PLACEHOLDER_VALUE` in the `aud` and
230+
`sub` conditions with this service account ID.
231+
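A sketch of this substitution is shown below, assuming the placeholder file from step 1 is still available locally. The function name `fill_trust_policy` and the output filename in the usage comment are hypothetical. The ID is validated against a conservative character set before it is passed to `sed`, so an unexpected value cannot alter the replacement expression.

```bash
# Write a copy of the trust policy template with every PLACEHOLDER_VALUE
# replaced by the given service account ID.
fill_trust_policy() {
  sa_id="$1"
  template="$2"
  output="$3"
  case "$sa_id" in
    "" | *[!A-Za-z0-9@._-]*)
      echo "Unexpected service account ID: $sa_id" >&2
      return 1
      ;;
  esac
  sed "s/PLACEHOLDER_VALUE/${sa_id}/g" "$template" > "$output"
}

# Usage (not executed here):
#   fill_trust_policy "$SA_ID" trust_policy.json trust_policy.filled.json
#   aws iam update-assume-role-policy \
#     --role-name <AWS_ROLE_NAME> \
#     --policy-document file://trust_policy.filled.json
```

Writing to a separate output file keeps the original template intact in case the update has to be repeated.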
232+
#### 5. Enable background refresh
233+
234+
Update the catalog to activate background refresh once the trust policy is
235+
updated.
236+
237+
```bash
238+
gcloud alpha biglake iceberg catalogs update <CATALOG_NAME> \
239+
--project="<PROJECT_ID>" \
240+
--refresh-interval="300s"
241+
```
242+
243+
### Querying the Data
244+
245+
Once set up, you can query the tables via BigQuery.
246+
247+
```sql
248+
SELECT * FROM `<PROJECT_ID>.<CATALOG_NAME>.<NAMESPACE>.<TABLE_NAME>` LIMIT 10;
249+
```
250+
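From the command line, the same preview query might be run with the `bq` tool, passing the catalog's region as the global `--location` flag. The wrapper below is a sketch with a hypothetical function name; it assumes the table path contains no backticks.

```bash
# Preview ten rows of a federated table, running the job in the given
# region. The table path has the form project.catalog.namespace.table.
preview_table() {
  region="$1"
  table_path="$2"
  bq --location="$region" query --use_legacy_sql=false \
    "SELECT * FROM \`${table_path}\` LIMIT 10"
}
```

A call would look like `preview_table "<REGION>" "<PROJECT_ID>.<CATALOG_NAME>.<NAMESPACE>.<TABLE_NAME>"`.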
251+
## Gotchas and Pitfalls
252+
253+
> [!IMPORTANT] **Regional Isolation**: The Secret Manager secret and the
254+
> Lakehouse catalog **MUST** be created in the exact same region.
255+
256+
> [!TIP] **Region Pairing Best Practice**: When setting up the federated
257+
> catalog, choose GCP regions with "Low Latency Dedicated" or "Partner CCI" to
258+
> ensure optimal performance when federating large datasets across clouds.
259+
> Examples of optimal pairings:
260+
> - AWS `us-east-1` (N. Virginia) pairs best with GCP `us-east4` (Ashburn, VA)
261+
> - AWS `us-west-2` (Oregon) pairs best with GCP `us-west1` (The Dalles, OR)
262+
> - AWS `eu-west-2` (London) pairs best with GCP `europe-west2` (London)
263+
> - AWS `eu-central-1` (Frankfurt) pairs best with GCP `europe-west3` (Frankfurt)
264+
>
265+
> For the exhaustive list of mappings, read the full capabilities table at: https://docs.cloud.google.com/lakehouse/docs/regions-capabilities-cross-cloud-lakehouse
266+
267+
> [!IMPORTANT] **BigQuery Query Location**: When querying the federated catalog
268+
> via BigQuery, you **MUST** ensure the query runs in the same region as the
269+
> catalog (e.g., `us-east4`). If using the `bq` CLI, use the `--location` flag.
270+
271+
## Step 3: Validation and Next Steps
272+
273+
After completing the setup, the agent **MUST** validate that the federation is
274+
working and propose next steps to the user.
275+
276+
1. **Validate the Connection**:
277+
278+
- Attempt to list the namespaces or tables in the newly federated catalog
279+
using the `bq` CLI or BigQuery API. For example:
280+
281+
```bash
282+
bq ls --location="<REGION>" <PROJECT_ID>.<CATALOG_NAME>
283+
```
284+
285+
- If the command returns a list of namespaces/schemas, the federation is
286+
successful.
287+
288+
2. **Troubleshooting**:
289+
290+
- If the validation fails (e.g., permission errors, empty results,
291+
timeout), the agent should consult the Cross-Cloud Lakehouse
292+
Troubleshooting documentation:
293+
https://docs.cloud.google.com/lakehouse/docs/troubleshooting.
294+
- For AWS Glue, verify that the trust policy correctly references the
295+
`biglake-service-account-id` and that the GCP and AWS regions match your
296+
configuration.
297+
- For Databricks, verify that the secret exists in the correct region and
298+
the service account has `roles/secretmanager.secretAccessor`.
299+
300+
3. **Explore and Propose**:
301+
302+
- Assuming the federation is working, browse the available namespaces and
303+
a few key tables.
304+
- Summarize to the user what kind of data was found (e.g., "I see you have
305+
tables related to e-commerce transactions and customer profiles").
306+
- Propose a business or analytical question to the user that would result
307+
in a meaningful query of their data (e.g., "Would you like me to write a
308+
query to find the top 5 purchasing customers from last month?").
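
The validation check in item 1 and the first branch of the troubleshooting in item 2 can be sketched as a small wrapper around the `bq ls` command shown above. The function name and the exit-code convention are hypothetical: status 1 means the listing command failed (for example, a permission error) and status 2 means it succeeded but returned nothing.

```bash
# List the namespaces of a federated catalog and classify the outcome.
# Prints the listing on success; writes a diagnostic to stderr otherwise.
validate_federation() {
  region="$1"
  project="$2"
  catalog="$3"
  listing="$(bq ls --location="$region" "${project}.${catalog}" 2>&1)" || {
    echo "bq ls failed: $listing" >&2
    return 1
  }
  if [ -z "$listing" ]; then
    echo "Catalog is reachable but returned no namespaces" >&2
    return 2
  fi
  printf '%s\n' "$listing"
}
```

Distinguishing the two failure modes lets the agent go straight to the trust policy or secret checks on status 1, and to the remote catalog's contents or refresh state on status 2.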
