|
| 1 | +--- |
| 2 | +name: federate-lakehouse-catalog |
| 3 | +description: >- |
| 4 | + Sets up Google Cloud Lakehouse federated catalogs to remote Iceberg REST Catalogs. Currently supported catalogs: Databricks Unity, AWS Glue. Supported clouds hosting those catalogs: GCP, AWS. The primary use case is connecting to remote data to query it from GCP engines (BigQuery, Spark). Examples of when to use this: "federate my lakehouse catalog to databricks", "query data in databricks", "query data in s3", "connect to aws glue". Do NOT use for direct remote database SQL execution (e.g., Databricks SQL) or managing remote clusters and infrastructure (e.g., Databricks clusters, AWS Glue jobs). |
| 5 | +license: Apache-2.0 |
| 6 | +metadata: |
| 7 | + version: v1 |
| 8 | + publisher: google |
| 9 | +--- |
| 10 | + |
| 11 | +# Federate Lakehouse Catalog via Cross-cloud Lakehouse |
| 12 | + |
| 13 | +This skill describes how to set up a federated catalog in BigQuery so that GCP |
| 14 | +engines can query data held in remote catalogs in AWS, such as Databricks Unity |
| 15 | +Catalog or AWS Glue Data Catalog, over the public internet. |
| 16 | + |
| 17 | +## Prerequisites |
| 18 | + |
| 19 | +- For Databricks: Databricks Workspace URL and OAuth Service Principal (Client |
| 20 | + ID and Secret) with read access. |
| 21 | +- For AWS Glue: AWS Administrator access to create IAM roles and permissions |
| 22 | + policies. |
| 23 | +- Active Google Cloud project with administrative access to create lakehouse |
| 24 | + resources, and secrets in the case of Databricks. |
| 25 | + |
| 26 | +## Procedure |
| 27 | + |
| 28 | +### Step 1: Information Gathering and Region Selection |
| 29 | + |
| 30 | +Before running any commands, the agent **MUST** collect the following |
| 31 | +information from the user: |
| 32 | + |
| 33 | +1. Determine which catalog the user wants to federate to (e.g., Databricks |
| 34 | + Unity or AWS Glue) and verify it is supported. |
| 35 | +2. Determine where the remote data is located (the specific AWS region). |
| 36 | +3. Using the Region Pairing Best Practice in the Gotchas section, help the user |
| 37 | + pick the optimal GCP region to minimize latency. |
| 38 | +4. Collect the necessary configuration variables for the chosen flow (e.g., |
| 39 | + Databricks credentials or AWS Account ID). |
| 40 | + |
| 41 | +Only proceed to the next steps once this information is confirmed. |
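As a rough sketch of the region-pairing step, the four example pairings from the Gotchas section can be encoded as a small lookup. The function name `suggest_gcp_region` is hypothetical, and the lookup covers only those four examples; any other AWS region still needs the full capabilities table.

```bash
# Hypothetical helper: maps an AWS region to the GCP region it pairs best with.
# Covers only the example pairings listed in the Gotchas section.
suggest_gcp_region() {
  case "$1" in
    us-east-1)    echo "us-east4" ;;
    us-west-2)    echo "us-west1" ;;
    eu-west-2)    echo "europe-west2" ;;
    eu-central-1) echo "europe-west3" ;;
    *)
      # Unknown region: signal failure so the caller falls back to the docs.
      echo "No example pairing for '$1'; check the capabilities table." >&2
      return 1
      ;;
  esac
}
```

Calling `suggest_gcp_region us-east-1` prints `us-east4`. A non-zero exit status means the region is not among the examples and should be looked up manually.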
| 42 | + |
| 43 | +### Step 2: API Verification |
| 44 | + |
| 45 | +Verify that the required Google Cloud APIs are enabled for the project: |
| 46 | + |
| 47 | +```bash |
| 48 | +gcloud services check biglake.googleapis.com |
| 49 | +``` |
| 50 | + |
| 51 | +If the API is not enabled, explicitly ask the user for permission to enable it. |
| 52 | +Do NOT proceed without their confirmation. |
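For a terminal-driven run, the confirmation requirement could be sketched as a wrapper that calls `gcloud services enable` only after an explicit "y". The function name `enable_api_with_consent` and the prompt wording are assumptions made for illustration; an agent would normally ask through its own conversation channel instead of standard input.

```bash
# Hypothetical helper: enables an API only after an explicit "y" from the user.
enable_api_with_consent() {
  local api="$1" project="$2" answer
  # Prompt on stderr so stdout carries only the gcloud output.
  printf 'Enable %s on project %s? [y/N] ' "$api" "$project" >&2
  read -r answer
  case "$answer" in
    y|Y|yes|YES)
      gcloud services enable "$api" --project="$project"
      ;;
    *)
      # Anything other than an explicit yes is treated as a refusal.
      echo "Not enabling ${api}; stopping here." >&2
      return 1
      ;;
  esac
}
```

The default answer is "no", so an empty reply or a closed input stream leaves the project unchanged.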
| 53 | + |
| 54 | +### Flow A: Databricks Unity Catalog |
| 55 | + |
| 56 | +#### 1. Create a Regional Secret for Credentials |
| 57 | + |
| 58 | +Store the Databricks client ID and secret in Secret Manager. Ensure the |
| 59 | +`secretmanager.googleapis.com` API is enabled. The secret **MUST** be in the |
| 60 | +same region as your Lakehouse catalog. |
| 61 | + |
| 62 | +1. Create a JSON file named `credentials.json`: |
| 63 | + |
| 64 | +```json |
| 65 | +{ |
| 66 | + "client_id": "<CLIENT_ID>", |
| 67 | + "client_secret": "<CLIENT_SECRET>" |
| 68 | +} |
| 69 | +``` |
| 70 | + |
| 71 | +2. Set the Secret Manager API endpoint override for the region: |
| 72 | + |
| 73 | +```bash |
| 74 | +gcloud config set api_endpoint_overrides/secretmanager https://secretmanager.<REGION>.rep.googleapis.com/ |
| 75 | +``` |
| 76 | + |
| 77 | +3. Create the secret: |
| 78 | + |
| 79 | +```bash |
| 80 | +gcloud secrets create <SECRET_NAME> \ |
| 81 | + --location="<REGION>" \ |
| 82 | + --project="<PROJECT_ID>" \ |
| 83 | + --data-file=credentials.json |
| 84 | +``` |
| 85 | + |
| 86 | +#### 2. Create the Federated Catalog |
| 87 | + |
| 88 | +Create a BigLake Iceberg catalog of type `federated` pointing to Databricks. |
| 89 | + |
| 90 | +```bash |
| 91 | +gcloud alpha biglake iceberg catalogs create <CATALOG_NAME> \ |
| 92 | + --project="<PROJECT_ID>" \ |
| 93 | + --primary-location="<REGION>" \ |
| 94 | + --catalog-type="federated" \ |
| 95 | + --federated-catalog-type="unity" \ |
| 96 | + --secret-name="projects/<PROJECT_ID>/locations/<REGION>/secrets/<SECRET_NAME>" \ |
| 97 | + --unity-instance-name="<UNITY_INSTANCE_NAME>" \ |
| 98 | + --unity-catalog-name="<UNITY_CATALOG_NAME>" \ |
| 99 | + --refresh-interval="300s" |
| 100 | +``` |
| 101 | + |
| 102 | +#### 3. Grant Catalog Access to the Secret |
| 103 | + |
| 104 | +Grant the catalog's service account read access to the secret. |
| 105 | + |
| 106 | +1. Get the service account email by describing the catalog: |
| 107 | + |
| 108 | +```bash |
| 109 | +gcloud alpha biglake iceberg catalogs describe <CATALOG_NAME> \ |
| 110 | + --project="<PROJECT_ID>" \ |
| 111 | + --location="<REGION>" \ |
| 112 | + --format="value(biglake-service-account-id)" |
| 113 | +``` |
| 114 | + |
| 115 | +2. Grant access: |
| 116 | + |
| 117 | +```bash |
| 118 | +gcloud secrets add-iam-policy-binding <SECRET_NAME> \ |
| 119 | + --project="<PROJECT_ID>" \ |
| 120 | + --location="<REGION>" \ |
| 121 | + --member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \ |
| 122 | + --role="roles/secretmanager.secretAccessor" |
| 123 | +``` |
| 124 | + |
| 125 | +### Flow B: AWS Glue |
| 126 | + |
| 127 | +#### 1. Create the AWS IAM role with a placeholder trust policy |
| 128 | + |
| 129 | +Lakehouse provisions a Google service account ID only after the catalog is created, so |
| 130 | +create the AWS IAM role with a placeholder trust policy first and update it in step 4. |
| 131 | + |
| 132 | +1. Create a file named `trust_policy.json`: |
| 133 | + |
| 134 | +```json |
| 135 | +{ |
| 136 | + "Version": "2012-10-17", |
| 137 | + "Statement": [ |
| 138 | + { |
| 139 | + "Effect": "Allow", |
| 140 | + "Principal": { |
| 141 | + "Federated": "accounts.google.com" |
| 142 | + }, |
| 143 | + "Action": "sts:AssumeRoleWithWebIdentity", |
| 144 | + "Condition": { |
| 145 | + "StringEquals": { |
| 146 | + "accounts.google.com:aud": ["PLACEHOLDER_VALUE"], |
| 147 | + "accounts.google.com:sub": ["PLACEHOLDER_VALUE"] |
| 148 | + } |
| 149 | + } |
| 150 | + } |
| 151 | + ] |
| 152 | +} |
| 153 | +``` |
| 154 | + |
| 155 | +2. Run the AWS CLI command to create the role: |
| 156 | + |
| 157 | +```bash |
| 158 | +aws iam create-role \ |
| 159 | + --role-name <AWS_ROLE_NAME> \ |
| 160 | + --assume-role-policy-document file://trust_policy.json \ |
| 161 | + --max-session-duration 43200 |
| 162 | +``` |
| 163 | + |
| 164 | +#### 2. Attach a permissions policy |
| 165 | + |
| 166 | +Attach a policy that allows Lakehouse to read from Glue and S3. |
| 167 | + |
| 168 | +> [!IMPORTANT] **Safe IAM Scoping**: The example below uses wildcard structures |
| 169 | +> for illustration. You **MUST** consult with the user to scope the `Resource` |
| 170 | +> ARNs to their specific catalog, database, and S3 buckets. Do NOT blindly apply |
| 171 | +> wildcard permissions. |
| 172 | +
|
| 173 | +```json |
| 174 | +{ |
| 175 | + "Version": "2012-10-17", |
| 176 | + "Statement": [ |
| 177 | + { |
| 178 | + "Sid": "GlueRead", |
| 179 | + "Effect": "Allow", |
| 180 | + "Action": [ |
| 181 | + "glue:GetCatalog", |
| 182 | + "glue:GetDatabase", |
| 183 | + "glue:GetDatabases", |
| 184 | + "glue:GetTable", |
| 185 | + "glue:GetTables" |
| 186 | + ], |
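| 187 | + "Resource": ["arn:aws:glue:<AWS_REGION>:<AWS_ACCOUNT_ID>:catalog", "arn:aws:glue:<AWS_REGION>:<AWS_ACCOUNT_ID>:database/*", "arn:aws:glue:<AWS_REGION>:<AWS_ACCOUNT_ID>:table/*/*"] |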
| 188 | + }, |
| 189 | + { |
| 190 | + "Sid": "S3Read", |
| 191 | + "Effect": "Allow", |
| 192 | + "Action": [ |
| 193 | + "s3:ListBucket", |
| 194 | + "s3:GetObject" |
| 195 | + ], |
| 196 | + "Resource": [ |
| 197 | + "arn:aws:s3:::<SPECIFIC_BUCKET>", |
| 198 | + "arn:aws:s3:::<SPECIFIC_BUCKET>/*" |
| 199 | + ] |
| 200 | + } |
| 201 | + ] |
| 202 | +} |
| 203 | +``` |
| 204 | + |
| 205 | +Attach this permissions policy to your IAM role. |
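One way to attach it is as an inline policy with `aws iam put-role-policy`. The sketch below assumes the JSON above was saved as `permissions_policy.json`; that file name, the function name `attach_lakehouse_read_policy`, and the policy name `LakehouseGlueS3Read` are illustrative choices, not values required by Lakehouse.

```bash
# Hypothetical helper: attaches the permissions policy as an inline role policy.
# Arguments: <role-name> [policy-file], defaulting to permissions_policy.json.
attach_lakehouse_read_policy() {
  local role_name="$1"
  local policy_file="${2:-permissions_policy.json}"
  aws iam put-role-policy \
    --role-name "$role_name" \
    --policy-name "LakehouseGlueS3Read" \
    --policy-document "file://${policy_file}"
}
```

Running `attach_lakehouse_read_policy <AWS_ROLE_NAME>` applies the policy to the role created in step 1. Scope the `Resource` entries in the file with the user before calling it.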
| 206 | + |
| 207 | +#### 3. Create the Federated Catalog |
| 208 | + |
| 209 | +When creating an AWS Glue federated catalog, the `--glue-warehouse` **MUST** be |
| 210 | +set to your 12-digit AWS Account ID string (not an S3 bucket URI). **Best |
| 211 | +Practice**: Initialize the catalog without specifying a refresh schedule to |
| 212 | +prevent premature metadata synchronization failures while AWS trust |
| 213 | +relationships are propagating. |
| 214 | + |
| 215 | +```bash |
| 216 | +gcloud alpha biglake iceberg catalogs create <CATALOG_NAME> \ |
| 217 | + --project="<PROJECT_ID>" \ |
| 218 | + --primary-location="<REGION>" \ |
| 219 | + --catalog-type="federated" \ |
| 220 | + --federated-catalog-type="glue" \ |
| 221 | + --glue-warehouse="<AWS_ACCOUNT_ID>" \ |
| 222 | + --glue-aws-region="<AWS_REGION>" \ |
| 223 | + --glue-aws-role-arn="arn:aws:iam::<AWS_ACCOUNT_ID>:role/<AWS_ROLE_NAME>" |
| 224 | +``` |
| 225 | + |
| 226 | +#### 4. Update the trust policy |
| 227 | + |
| 228 | +Extract the `biglake-service-account-id` from the created catalog, then update |
| 229 | +your AWS IAM role's trust policy so that `PLACEHOLDER_VALUE` in both the `aud` and |
| 230 | +`sub` conditions is replaced with this Google service account ID. |
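A minimal sketch of this update, assuming the `trust_policy.json` file from step 1 is still on disk: substitute the placeholder with `sed`, then push the result with `aws iam update-assume-role-policy`. The function names and the output file `trust_policy_updated.json` are hypothetical.

```bash
# Hypothetical helper: prints the trust policy with every PLACEHOLDER_VALUE
# replaced by the given service account ID.
render_trust_policy() {
  local service_account_id="$1"
  local template="${2:-trust_policy.json}"
  sed "s|PLACEHOLDER_VALUE|${service_account_id}|g" "$template"
}

# Hypothetical helper: writes the rendered policy and applies it to the role.
update_role_trust_policy() {
  local role_name="$1" service_account_id="$2"
  render_trust_policy "$service_account_id" > trust_policy_updated.json
  aws iam update-assume-role-policy \
    --role-name "$role_name" \
    --policy-document file://trust_policy_updated.json
}
```

The ID can be read with the same `catalogs describe` command used in Flow A, then passed as the second argument to `update_role_trust_policy`.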
| 231 | + |
| 232 | +#### 5. Enable background refresh |
| 233 | + |
| 234 | +Update the catalog to activate background refresh once the trust policy is |
| 235 | +updated. |
| 236 | + |
| 237 | +```bash |
| 238 | +gcloud alpha biglake iceberg catalogs update <CATALOG_NAME> \ |
| 239 | + --project="<PROJECT_ID>" \ |
| 240 | + --refresh-interval="300s" |
| 241 | +``` |
| 242 | + |
| 243 | +### Querying the Data |
| 244 | + |
| 245 | +Once set up, you can query the tables via BigQuery. |
| 246 | + |
| 247 | +```sql |
| 248 | +SELECT * FROM `<PROJECT_ID>.<CATALOG_NAME>.<NAMESPACE>.<TABLE_NAME>` LIMIT 10; |
| 249 | +``` |
| 250 | + |
| 251 | +## Gotchas and Pitfalls |
| 252 | + |
| 253 | +> [!IMPORTANT] **Regional Isolation**: The Secret Manager secret and the |
| 254 | +> Lakehouse catalog **MUST** be created in the exact same region. |
| 255 | +
|
| 256 | +> [!TIP] **Region Pairing Best Practice**: When setting up the federated |
| 257 | +> catalog, choose GCP regions with "Low Latency Dedicated" or "Partner CCI" to |
| 258 | +> ensure optimal performance when federating large datasets across clouds. |
| 259 | +> For the exhaustive list of mappings, read the full capabilities table at: |
| 260 | +> https://docs.cloud.google.com/lakehouse/docs/regions-capabilities-cross-cloud-lakehouse |
| 261 | +> Examples of optimal pairings: |
| 262 | +> - AWS `us-east-1` (N. Virginia) pairs best with GCP `us-east4` (Ashburn, VA) |
| 263 | +> - AWS `us-west-2` (Oregon) pairs best with GCP `us-west1` (The Dalles, OR) |
| 264 | +> - AWS `eu-west-2` (London) pairs best with GCP `europe-west2` (London) |
| 265 | +> - AWS `eu-central-1` (Frankfurt) pairs best with GCP `europe-west3` (Frankfurt) |
| 266 | +
|
| 267 | +> [!IMPORTANT] **BigQuery Query Location**: When querying the federated catalog |
| 268 | +> via BigQuery, you **MUST** ensure the query runs in the same region as the |
| 269 | +> catalog (e.g., `us-east4`). If using the `bq` CLI, use the `--location` flag. |
| 270 | +
|
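The query-location rule can be sketched as a thin wrapper that always passes the catalog's region to `bq` through its global `--location` flag. The function name `query_in_catalog_region` is hypothetical.

```bash
# Hypothetical helper: runs a GoogleSQL query pinned to the catalog's region.
# Arguments: <region> <sql>
query_in_catalog_region() {
  local region="$1" sql="$2"
  bq --location="$region" query --use_legacy_sql=false "$sql"
}
```

Pass the same region that was used as the catalog's primary location (for example `us-east4`) together with the SQL text. Routing every query through one function makes it harder to run one in the wrong region by accident.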
| 271 | +## Step 3: Validation and Next Steps |
| 272 | + |
| 273 | +After completing the setup, the agent **MUST** validate that the federation is |
| 274 | +working and propose next steps to the user. |
| 275 | + |
| 276 | +1. **Validate the Connection**: |
| 277 | + |
| 278 | + - Attempt to list the namespaces or tables in the newly federated catalog |
| 279 | + using the `bq` CLI or BigQuery API. For example: |
| 280 | + |
| 281 | + ```bash |
| 282 | + bq ls --location="<REGION>" <PROJECT_ID>.<CATALOG_NAME> |
| 283 | + ``` |
| 284 | + |
| 285 | + - If the command returns a list of namespaces/schemas, the federation is |
| 286 | + successful. |
| 287 | + |
| 288 | +2. **Troubleshooting**: |
| 289 | + |
| 290 | + - If the validation fails (e.g., permission errors, empty results, |
| 291 | + timeout), the agent should consult the Cross-Cloud Lakehouse |
| 292 | + Troubleshooting documentation: |
| 293 | + https://docs.cloud.google.com/lakehouse/docs/troubleshooting. |
| 294 | + - For AWS Glue, verify that the trust policy correctly references the |
| 295 | + `biglake-service-account-id` and that the GCP and AWS regions match your |
| 296 | + configuration. |
| 297 | + - For Databricks, verify that the secret exists in the correct region and |
| 298 | + the service account has `roles/secretmanager.secretAccessor`. |
| 299 | + |
| 300 | +3. **Explore and Propose**: |
| 301 | + |
| 302 | + - Assuming the federation is working, browse the available namespaces and |
| 303 | + a few key tables. |
| 304 | + - Summarize to the user what kind of data was found (e.g., "I see you have |
| 305 | + tables related to e-commerce transactions and customer profiles"). |
| 306 | + - Propose a business or analytical question to the user that would result |
| 307 | + in a meaningful query of their data (e.g., "Would you like me to write a |
| 308 | + query to find the top 5 purchasing customers from last month?"). |