Commit e692eeb

committed
fix
1 parent ead6c56 commit e692eeb

80 files changed

Lines changed: 7123 additions & 4290 deletions

.github/workflows/python-integ-polaris.yml

Lines changed: 3 additions & 3 deletions
@@ -17,7 +17,7 @@ on:
1717
branches:
1818
- main
1919
paths:
20-
- python/src/lance_namespace_impls/iceberg.py
20+
- python/src/lance_namespace_impls/polaris.py
2121
- python/tests/test_polaris.py
2222
- docker/polaris/**
2323
- .github/workflows/python-integ-polaris.yml
@@ -28,7 +28,7 @@ on:
2828
- ready_for_review
2929
- reopened
3030
paths:
31-
- python/src/lance_namespace_impls/iceberg.py
31+
- python/src/lance_namespace_impls/polaris.py
3232
- python/tests/test_polaris.py
3333
- docker/polaris/**
3434
- .github/workflows/python-integ-polaris.yml
@@ -93,7 +93,7 @@ jobs:
9393
echo "Test catalog created/verified"
9494
- name: Install dependencies
9595
working-directory: python
96-
run: make install-iceberg
96+
run: make install-polaris
9797
- name: Run integration tests
9898
run: make python-integ-test-polaris
9999
- name: Collect logs on failure

.github/workflows/python-polaris.yml

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
# Licensed under the Apache License, Version 2.0 (the "License");
2+
# you may not use this file except in compliance with the License.
3+
# You may obtain a copy of the License at
4+
#
5+
# http://www.apache.org/licenses/LICENSE-2.0
6+
#
7+
# Unless required by applicable law or agreed to in writing, software
8+
# distributed under the License is distributed on an "AS IS" BASIS,
9+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10+
# See the License for the specific language governing permissions and
11+
# limitations under the License.
12+
13+
name: Python Polaris
14+
15+
on:
16+
push:
17+
branches:
18+
- main
19+
paths:
20+
- python/src/lance_namespace_impls/polaris.py
21+
- python/tests/test_polaris.py
22+
- python/pyproject.toml
23+
- .github/workflows/python-polaris.yml
24+
pull_request:
25+
types:
26+
- opened
27+
- synchronize
28+
- ready_for_review
29+
- reopened
30+
paths:
31+
- python/src/lance_namespace_impls/polaris.py
32+
- python/tests/test_polaris.py
33+
- python/pyproject.toml
34+
- .github/workflows/python-polaris.yml
35+
36+
concurrency:
37+
group: ${{ github.workflow }}-${{ github.ref }}
38+
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
39+
40+
jobs:
41+
test:
42+
runs-on: ubuntu-24.04
43+
timeout-minutes: 30
44+
strategy:
45+
matrix:
46+
python-version: ["3.10", "3.11", "3.12"]
47+
steps:
48+
- name: Checkout repo
49+
uses: actions/checkout@v4
50+
- name: Set up Python
51+
uses: actions/setup-python@v5
52+
with:
53+
python-version: ${{ matrix.python-version }}
54+
- name: Install uv
55+
uses: astral-sh/setup-uv@v4
56+
- name: Install dependencies
57+
working-directory: python
58+
run: make install-polaris
59+
- name: Lint
60+
working-directory: python
61+
run: make lint-polaris
62+
- name: Test
63+
working-directory: python
64+
run: make test-polaris

docs/src/biglake.md

Lines changed: 4 additions & 11 deletions
@@ -1,16 +1,9 @@
1-
# Lance BigLake Namespace
1+
# Google BigLake Metastore
22

3-
**Google BigLake Metastore** is a fully managed, unified metastore service for data lakes on Google Cloud.
3+
**[Google BigLake Metastore](https://docs.cloud.google.com/biglake/docs/about-blms)**
4+
is a fully managed, unified metastore service for data lakes on Google Cloud.
45

56
To use Google BigLake Metastore with Lance, you can leverage BigLake's [Iceberg REST Catalog](https://docs.cloud.google.com/biglake/docs/blms-rest-catalog),
67
which exposes an Apache Iceberg REST Catalog-compatible interface.
78

8-
## Configuration
9-
10-
Configure your Lance Iceberg namespace to connect to the BigLake Metastore endpoint:
11-
12-
- **endpoint**: `https://biglake.googleapis.com/iceberg/v1/restcatalog`
13-
- **warehouse**: Your BigLake catalog name in the format `projects/{project}/locations/{location}/catalogs/{catalog}`
14-
- **auth_token**: A valid Google Cloud OAuth2 access token
15-
16-
All the features and configurations of the [Lance Iceberg REST Catalog Namespace](iceberg.md) apply when using BigLake Metastore.
9+
See [Lance Namespace integration with Iceberg REST Catalog](iceberg.md) for more details.

docs/src/dataproc.md

Lines changed: 4 additions & 5 deletions
@@ -1,10 +1,9 @@
1-
# Lance Dataproc Namespace
1+
# Google Dataproc Metastore
22

3-
**Google Dataproc Metastore** is a fully managed,
3+
**[Google Dataproc Metastore](https://docs.cloud.google.com/dataproc-metastore/docs/overview)** is a fully managed,
44
highly available, autohealing, serverless metastore that runs on Google Cloud.
55

66
To use Google Dataproc Metastore with Lance, you can leverage Dataproc's [Hive metastore](https://cloud.google.com/dataproc-metastore/docs/hive-metastore),
7-
which exposes a Hive MetaStore-compatible interface.
7+
which exposes an Apache Hive MetaStore-compatible interface.
88

9-
Simply configure your Lance Hive namespace to connect to Dataproc's Hive MetaStore endpoint.
10-
All the features and configurations of the Lance Hive Namespace ([V2](hive2.md) or [V3](hive3.md)) apply when using Dataproc Metastore.
9+
See Lance Namespace integration with Hive metastore ([V2](hive2.md) or [V3](hive3.md)) for more details.

docs/src/glue.md

Lines changed: 214 additions & 12 deletions
@@ -1,10 +1,13 @@
1-
# Lance Glue Namespace Implementation Spec
1+
# AWS Glue Data Catalog Lance Namespace Implementation Spec
22

3-
This document describes how the AWS Glue Data Catalog implements the Lance Namespace client spec.
3+
This document describes how the AWS Glue Data Catalog
4+
implements the Lance Namespace client spec.
45

56
## Background
67

7-
AWS Glue Data Catalog is a fully managed metadata repository that stores structural and operational metadata for data assets. It is compatible with the Apache Hive Metastore API and can be used as a central metadata repository for data lakes. For details on AWS Glue, see the [AWS Glue Data Catalog Documentation](https://docs.aws.amazon.com/glue/).
8+
AWS Glue Data Catalog is a fully managed metadata repository that stores structural and operational metadata for data assets.
9+
It is compatible with the Apache Hive Metastore API and can be used as a central metadata repository for data lakes.
10+
For details on AWS Glue, see the [AWS Glue Data Catalog Documentation](https://docs.aws.amazon.com/glue/latest/dg/manage-catalog.html).
811

912
## Namespace Implementation Configuration Properties
1013

@@ -22,9 +25,15 @@ The **secret_access_key** property is optional and specifies the AWS secret acce
2225

2326
The **session_token** property is optional and specifies the AWS session token for temporary credentials.
2427

25-
The **root** property is optional and specifies the storage root location of the lakehouse on Glue catalog. Default value is the current working directory.
28+
The **assume_role_arn** property is optional and specifies the ARN of the IAM role to assume for Glue operations.
2629

27-
The **storage.*** prefix properties are optional and specify additional storage configurations to access tables (e.g., `storage.region=us-west-2`).
30+
The **assume_role_region** property is optional and specifies the AWS region for the STS client when assuming a role.
31+
32+
The **assume_role_external_id** property is optional and specifies the external ID for cross-account role assumption. For more details, see [AWS external ID documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user_externalid.html).
33+
34+
The **assume_role_session_name** property is optional and specifies the session name for the assumed role session. For more details, see [AWS role session name documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_iam-condition-keys.html#ck_rolesessionname).
35+
36+
The **assume_role_timeout_sec** property is optional and specifies the duration in seconds for which the assumed role session is valid (default: 3600). At the end of the timeout, a new set of role session credentials will be fetched through the STS client.
2837

2938
### Authentication
3039

@@ -33,24 +42,34 @@ The Glue namespace supports multiple authentication methods:
3342
1. **Default AWS credential provider chain**: When no explicit credentials are provided, the client uses the default AWS credential provider chain
3443
2. **Static credentials**: Set `access_key_id` and `secret_access_key` for basic AWS credentials
3544
3. **Session credentials**: Additionally provide `session_token` for temporary AWS credentials
45+
4. **Assume role credentials**: Set `assume_role_arn` to assume an IAM role. Optionally configure `assume_role_region`, `assume_role_external_id`, `assume_role_session_name`, and `assume_role_timeout_sec` to customize the role assumption behavior
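
As an illustration of the assume-role option, here is a minimal Python sketch of how the `assume_role_*` properties could be translated into arguments for the STS `AssumeRole` call. The helper name `build_assume_role_kwargs` and the fallback session name are hypothetical and not taken from the implementation; only the property names and the 3600-second default come from the text above.

```python
from typing import Any, Dict, Mapping


def build_assume_role_kwargs(props: Mapping[str, str]) -> Dict[str, Any]:
    """Translate namespace properties into STS AssumeRole arguments.

    Hypothetical helper: the real implementation may map properties differently.
    """
    kwargs: Dict[str, Any] = {
        "RoleArn": props["assume_role_arn"],
        # STS requires a session name; this fallback value is an assumption.
        "RoleSessionName": props.get(
            "assume_role_session_name", "lance-glue-namespace"
        ),
        # Documented default is 3600 seconds.
        "DurationSeconds": int(props.get("assume_role_timeout_sec", "3600")),
    }
    external_id = props.get("assume_role_external_id")
    if external_id:
        # Only sent when configured, e.g. for cross-account role assumption.
        kwargs["ExternalId"] = external_id
    return kwargs
```

With boto3, the result could be passed as `boto3.client("sts", region_name=props.get("assume_role_region")).assume_role(**kwargs)`, and the returned temporary credentials used to build the Glue client.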
3646

3747
## Object Mapping
3848

3949
### Namespace
4050

41-
The **root namespace** is represented by the AWS Glue Data Catalog itself.
51+
AWS Glue Data Catalog supports a recursive catalog structure through the [GetCatalog](https://docs.aws.amazon.com/glue/latest/webapi/API_GetCatalog.html) and [GetCatalogs](https://docs.aws.amazon.com/glue/latest/webapi/API_GetCatalogs.html) APIs.
52+
This allows for multi-level namespace hierarchies.
53+
54+
The **root namespace** is represented by the default AWS Glue Data Catalog, which has a catalog ID of `None` or equal to the caller's AWS account ID.
55+
56+
A **child catalog** within the root catalog forms a child namespace. The [GetCatalogs](https://docs.aws.amazon.com/glue/latest/webapi/API_GetCatalogs.html) API supports a `ParentCatalogId` parameter to traverse the catalog hierarchy.
4257

43-
A **child namespace** is a database in Glue, forming a 2-level namespace hierarchy.
58+
A **database** within a catalog represents the leaf namespace level. Databases are created within a specific catalog using the `CatalogId` parameter in the [CreateDatabase](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateDatabase.html) API.
4459

45-
The **namespace identifier** is the database name.
60+
The **namespace identifier** follows a hierarchical pattern:
61+
- For catalogs: the catalog name (e.g., `my_catalog`)
62+
- For databases: the catalog chain joined with database name using the `$` delimiter (e.g., `catalog$database` or `parent_catalog$child_catalog$database`)
4663

47-
**Namespace properties** are stored in the Glue Database object's parameters map.
64+
**Namespace properties** are stored in:
65+
- Catalog's `Parameters` map for catalog-level namespaces
66+
- Database's `Parameters` map for database-level namespaces
4867

4968
### Table
5069

5170
A **table** is represented as a [Table](https://docs.aws.amazon.com/glue/latest/webapi/API_Table.html) object in AWS Glue with `TableType` set to `EXTERNAL_TABLE`.
5271

53-
The **table identifier** is constructed by joining database and table name with the `$` delimiter (e.g., `database$table`).
72+
The **table identifier** is constructed by joining the full namespace path and table name with the `$` delimiter (e.g., `catalog$database$table`).
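
The `$`-delimited identifiers described above can be sketched in Python as follows. The names `split_identifier`, `parse_table_identifier`, and `TableId` are hypothetical, and requiring at least three parts for a table identifier is an assumption based on the `catalog$database$table` example. Note that a string such as `a$b` could name either a child catalog or a database, so a namespace's level cannot be decided from the string alone and presumably needs a Glue lookup.

```python
from typing import List, NamedTuple

DELIMITER = "$"


class TableId(NamedTuple):
    catalogs: List[str]  # catalog chain, outermost catalog first
    database: str
    table: str


def split_identifier(identifier: str) -> List[str]:
    """Split a `$`-delimited identifier, rejecting empty segments."""
    parts = identifier.split(DELIMITER)
    if any(not part for part in parts):
        raise ValueError(f"malformed identifier: {identifier!r}")
    return parts


def parse_table_identifier(identifier: str) -> TableId:
    """Interpret the last two segments as database and table name."""
    parts = split_identifier(identifier)
    if len(parts) < 3:
        # Assumption: a table identifier always carries at least one catalog.
        raise ValueError(f"expected catalog$database$table, got {identifier!r}")
    return TableId(catalogs=parts[:-2], database=parts[-2], table=parts[-1])
```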
5473

5574
The **table location** is stored in the [`StorageDescriptor.Location`](https://docs.aws.amazon.com/glue/latest/webapi/API_StorageDescriptor.html#Glue-Type-StorageDescriptor-Location) field, pointing to the root location of the Lance table.
5675

@@ -60,6 +79,189 @@ The **table location** is stored in the [`StorageDescriptor.Location`](https://d
6079

6180
A table in AWS Glue is identified as a Lance table when it meets the following criteria: the `TableType` is `EXTERNAL_TABLE`, and the `Parameters` map contains a key `table_type` with value `lance` (case insensitive). The `StorageDescriptor.Location` must point to a valid Lance table root directory.
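
The check above can be sketched in Python against the dictionary shape that a Glue `GetTable` response uses for its `Table` field. The function name `is_lance_table` is hypothetical. The sketch assumes that only the parameter value, not the key, is compared case-insensitively, and it does not verify that `StorageDescriptor.Location` actually holds a Lance table.

```python
from typing import Any, Mapping


def is_lance_table(table: Mapping[str, Any]) -> bool:
    """Return True when a Glue table record is marked as a Lance table."""
    if table.get("TableType") != "EXTERNAL_TABLE":
        return False
    parameters = table.get("Parameters") or {}
    # The value comparison is case insensitive: "lance", "LANCE", "Lance".
    return str(parameters.get("table_type", "")).lower() == "lance"
```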
6281

63-
## Optimistic Concurrency Control
82+
## Basic Operations
83+
84+
### CreateNamespace
85+
86+
Creates a new catalog or database in AWS Glue.
87+
88+
The implementation:
89+
90+
1. Parse the namespace identifier to determine if it is a catalog or database level
91+
2. For catalog-level namespace:
92+
- Construct a [CreateCatalog](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateCatalog.html) request with name and properties
93+
- Set the `Parameters` map with the provided namespace properties
94+
3. For database-level namespace:
95+
- Verify the parent catalog exists
96+
- Construct a [CreateDatabase](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateDatabase.html) request with database name and `CatalogId`
97+
- Set the `Parameters` map with the provided namespace properties
98+
4. Handle creation mode (CREATE, EXIST_OK, OVERWRITE) appropriately
99+
100+
**Error Handling:**
101+
102+
If the namespace already exists and mode is CREATE, return error code `2` (NamespaceAlreadyExists).
103+
104+
If the parent catalog does not exist, return error code `1` (NamespaceNotFound).
105+
106+
If access is denied, return error code `16` (Forbidden).
107+
108+
If the Glue service is unavailable, return error code `17` (ServiceUnavailable).
109+
110+
### ListNamespaces
111+
112+
Lists catalogs or databases in AWS Glue.
113+
114+
The implementation:
115+
116+
1. Parse the parent namespace identifier
117+
2. For root namespace (no parent):
118+
- Use [GetCatalogs](https://docs.aws.amazon.com/glue/latest/webapi/API_GetCatalogs.html) with `IncludeRoot=true` to list all catalogs
119+
- Use `ParentCatalogId` set to account ID and `Recursive=false` for direct children
120+
3. For catalog-level namespace:
121+
- Use [GetDatabases](https://docs.aws.amazon.com/glue/latest/webapi/API_GetDatabases.html) with the catalog's `CatalogId`
122+
- Additionally use [GetCatalogs](https://docs.aws.amazon.com/glue/latest/webapi/API_GetCatalogs.html) with `ParentCatalogId` to list child catalogs
123+
4. Sort the results and apply pagination using `NextToken`
124+
125+
**Error Handling:**
126+
127+
If the parent namespace does not exist, return error code `1` (NamespaceNotFound).
128+
129+
If access is denied, return error code `16` (Forbidden).
130+
131+
If the Glue service is unavailable, return error code `17` (ServiceUnavailable).
132+
133+
### DescribeNamespace
134+
135+
Retrieves properties and metadata for a catalog or database.
136+
137+
The implementation:
138+
139+
1. Parse the namespace identifier to determine the level
140+
2. For catalog-level namespace:
141+
- Use [GetCatalog](https://docs.aws.amazon.com/glue/latest/webapi/API_GetCatalog.html) with the catalog ID
142+
- Extract properties from the `Parameters` map
143+
3. For database-level namespace:
144+
- Use [GetDatabase](https://docs.aws.amazon.com/glue/latest/webapi/API_GetDatabase.html) with the database name and `CatalogId`
145+
- Extract properties from the Database's `Parameters` map
146+
147+
**Error Handling:**
148+
149+
If the namespace does not exist, return error code `1` (NamespaceNotFound).
150+
151+
If access is denied, return error code `16` (Forbidden).
152+
153+
If the Glue service is unavailable, return error code `17` (ServiceUnavailable).
154+
155+
### DropNamespace
156+
157+
Removes a catalog or database from AWS Glue. Only RESTRICT mode is supported; CASCADE mode is not implemented.
158+
159+
The implementation:
160+
161+
1. Parse the namespace identifier to determine the level
162+
2. Check if the namespace exists (handle SKIP mode if not)
163+
3. For catalog-level namespace:
164+
- Verify the catalog has no child catalogs or databases
165+
- Use [DeleteCatalog](https://docs.aws.amazon.com/glue/latest/webapi/API_DeleteCatalog.html) with the catalog ID
166+
4. For database-level namespace:
167+
- Verify the database is empty (no tables)
168+
- Use [DeleteDatabase](https://docs.aws.amazon.com/glue/latest/webapi/API_DeleteDatabase.html) with the database name and `CatalogId`
169+
170+
**Error Handling:**
171+
172+
If the namespace does not exist and mode is FAIL, return error code `1` (NamespaceNotFound).
173+
174+
If the namespace is not empty, return error code `3` (NamespaceNotEmpty).
175+
176+
If access is denied, return error code `16` (Forbidden).
177+
178+
If the Glue service is unavailable, return error code `17` (ServiceUnavailable).
179+
180+
### DeclareTable
181+
182+
Declares a new Lance table in AWS Glue without creating the underlying data.
183+
184+
The implementation:
185+
186+
1. Parse the table identifier to extract catalog, database, and table name
187+
2. Verify the parent namespace (database) exists using [GetDatabase](https://docs.aws.amazon.com/glue/latest/webapi/API_GetDatabase.html)
188+
3. Construct a [CreateTable](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html) request with:
189+
- `CatalogId`: the catalog ID from the namespace
190+
- `DatabaseName`: the database name
191+
- `TableInput.Name`: the table name
192+
- `TableInput.TableType`: `EXTERNAL_TABLE`
193+
- `TableInput.Parameters`: include `table_type=lance` and other properties
194+
- `TableInput.StorageDescriptor.Location`: the specified table location
195+
4. POST the CreateTable request to Glue
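
The request built in step 3 can be sketched in Python as a plain dictionary. The helper name `build_create_table_request` is hypothetical, and omitting `CatalogId` for the default catalog is an assumption; the field names follow the Glue `CreateTable` API.

```python
from typing import Any, Dict, Mapping, Optional


def build_create_table_request(
    catalog_id: Optional[str],
    database: str,
    table: str,
    location: str,
    properties: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Assemble CreateTable arguments for declaring a Lance table."""
    parameters = dict(properties or {})
    parameters["table_type"] = "lance"  # marks the table as a Lance table
    request: Dict[str, Any] = {
        "DatabaseName": database,
        "TableInput": {
            "Name": table,
            "TableType": "EXTERNAL_TABLE",
            "Parameters": parameters,
            "StorageDescriptor": {"Location": location},
        },
    }
    if catalog_id is not None:
        # Assumption: leaving CatalogId out targets the default catalog.
        request["CatalogId"] = catalog_id
    return request
```

With a boto3 Glue client, the result could be sent as `glue.create_table(**request)`.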
196+
197+
**Error Handling:**
198+
199+
If the parent namespace does not exist, return error code `1` (NamespaceNotFound).
200+
201+
If the table already exists, return error code `5` (TableAlreadyExists).
202+
203+
If access is denied, return error code `16` (Forbidden).
204+
205+
If the Glue service is unavailable, return error code `17` (ServiceUnavailable).
206+
207+
### ListTables
208+
209+
Lists all Lance tables in a database.
210+
211+
The implementation:
212+
213+
1. Parse the namespace identifier to extract catalog and database
214+
2. Verify the namespace exists using [GetDatabase](https://docs.aws.amazon.com/glue/latest/webapi/API_GetDatabase.html)
215+
3. Use [GetTables](https://docs.aws.amazon.com/glue/latest/webapi/API_GetTables.html) with `CatalogId` and `DatabaseName`
216+
4. Filter tables where `Parameters.table_type=lance` (case insensitive)
217+
5. Sort the results and apply pagination using `NextToken`
218+
219+
**Error Handling:**
220+
221+
If the namespace does not exist, return error code `1` (NamespaceNotFound).
222+
223+
If access is denied, return error code `16` (Forbidden).
224+
225+
If the Glue service is unavailable, return error code `17` (ServiceUnavailable).
226+
227+
### DescribeTable
228+
229+
Retrieves metadata for a Lance table. Only `load_detailed_metadata=false` is supported. When `load_detailed_metadata=false`, only the table location and storage_options are returned; other fields (version, table_uri, schema, stats) are null.
230+
231+
The implementation:
232+
233+
1. Parse the table identifier to extract catalog, database, and table name
234+
2. Use [GetTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetTable.html) with `CatalogId`, `DatabaseName`, and `Name`
235+
3. Validate that the table is a Lance table (check `Parameters.table_type=lance`)
236+
4. Return the table location from `StorageDescriptor.Location` and storage_options from `Parameters`
237+
238+
**Error Handling:**
239+
240+
If the table does not exist, return error code `4` (TableNotFound).
241+
242+
If the table is not a Lance table, return error code `13` (InvalidInput).
243+
244+
If access is denied, return error code `16` (Forbidden).
245+
246+
If the Glue service is unavailable, return error code `17` (ServiceUnavailable).
247+
248+
### DeregisterTable
249+
250+
Removes a Lance table registration from AWS Glue without deleting the underlying data.
251+
252+
The implementation:
253+
254+
1. Parse the table identifier to extract catalog, database, and table name
255+
2. Use [GetTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetTable.html) to retrieve and validate the table is a Lance table
256+
3. Use [DeleteTable](https://docs.aws.amazon.com/glue/latest/webapi/API_DeleteTable.html) with `CatalogId`, `DatabaseName`, and `Name`
257+
4. The underlying Lance table data at `StorageDescriptor.Location` is not deleted
258+
259+
**Error Handling:**
260+
261+
If the table does not exist, return error code `4` (TableNotFound).
262+
263+
If the table is not a Lance table, return error code `13` (InvalidInput).
264+
265+
If access is denied, return error code `16` (Forbidden).
64266

65-
Updates to Lance tables in AWS Glue should use the `VersionId` for conditional updates through the [UpdateTable](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateTable.html) API. If the `VersionId` does not match the expected version, the update fails to prevent concurrent modification conflicts.
267+
If the Glue service is unavailable, return error code `17` (ServiceUnavailable).
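
The error codes listed under each operation above can be gathered into one mapping. The following Python sketch is illustrative only: the function name `to_lance_error_code` is hypothetical, and the choice of which Glue exception corresponds to which Lance error code is an assumption, since the spec text names only the Lance codes.

```python
from typing import Optional

# Lance Namespace error codes as listed in the operations above.
_NOT_FOUND = {"namespace": 1, "table": 4}
_ALREADY_EXISTS = {"namespace": 2, "table": 5}
FORBIDDEN = 16
SERVICE_UNAVAILABLE = 17


def to_lance_error_code(glue_error_code: str, target: str) -> Optional[int]:
    """Map a Glue exception code to a Lance error code (assumed mapping).

    `target` says whether the failed call addressed a "namespace" or a "table".
    Returns None for Glue errors this sketch does not cover.
    """
    if target not in _NOT_FOUND:
        raise ValueError(f"unknown target: {target!r}")
    if glue_error_code == "EntityNotFoundException":
        return _NOT_FOUND[target]
    if glue_error_code == "AlreadyExistsException":
        return _ALREADY_EXISTS[target]
    if glue_error_code == "AccessDeniedException":
        return FORBIDDEN
    if glue_error_code in ("InternalServiceException", "OperationTimeoutException"):
        return SERVICE_UNAVAILABLE
    return None
```

When using boto3, the Glue exception code of a `botocore.exceptions.ClientError` is available as `err.response["Error"]["Code"]`.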
