spec: support multi-catalog concept by jackye1995 · Pull Request #11 · lance-format/lance-namespace

jackye1995 · 2025-04-11T22:45:53Z

Add a text specification that defines what a Lance catalog is, and how a directory of Lance tables relates to a collection of Lance tables accessed through the REST catalog protocol. The catalog concept is made recursive, which allows multi-level catalogs.

Change all namespace concepts to be catalog concepts in REST spec.

Rename catalog.yaml to rest-catalog.yaml to make it clear that the OpenAPI spec is for the REST catalog specifically.

Closes #9
Closes #58

github-actions · 2025-04-11T22:46:10Z

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

yanghua · 2025-04-17T05:28:13Z

+A Lance catalog is a centralized repository for discovering, organizing, and managing Lance tables.
+There are 2 types of Lance catalogs: storage directory and REST catalog.
+
+## Lance Directory


Should we call this Lance Directory Catalog to make it clearer? That lets users know it's a catalog, not a concept of the Lance format. WDYT?

yanghua · 2025-04-17T05:30:33Z

+Lance storage directory, or **Lance Directory**, is a lightweight and simple catalog
+for people to get started with creating and using Lance tables directly on top of any local or remote storage system.
+
+A directory in a storage can contain a list of Lance tables, 


Do we need to describe here how the catalog and namespace concepts correspond? E.g. a directory acts as a catalog, and a sub-directory could act as a namespace?

Tbh, I went back and forth on this.

It seems like so far the use pattern is just connecting to a directory, and then there is a list of tables in that directory.

I saw the ListingCatalog was added recently, but there is not yet an established pattern for that one extra level. This seems very similar to the HadoopCatalog in Iceberg, which has been very controversial and is now in a deprecated state, because (1) quite a few operations basically can never be supported with this directory setup compared to a normal catalog, and (2) the more features we added to it, the more inconsistent its behavior became across storage solutions.

My current feeling is that a directory with a list of tables is sufficient, and we could avoid the mistake in Iceberg HadoopCatalog by not having that one extra layer of directories. But I know you added the ListingCatalog. If that has some important use cases on your side, I think we should for sure add that support, but just document the limitations.

But I know you added the ListingCatalog. If that has some important use cases on your side

There are no important use cases. The background of ListingCatalog is that, at that time, we had two traits: Catalog and Database. It's a code-level abstraction.

My concern is mainly about:

users may get confused that a namespace is missing here, while our REST catalog has this concept;

end users may want to build an abstraction themselves so they can switch between the two catalogs easily, and a unified abstraction would make that more convenient.

There are a few alternative ways we could evaluate:

Option 1: 2-level directory

This is basically the ListingCatalog approach. My main concern is that this was a disapproved and deprecated use pattern in Iceberg, and it feels like we should not go down the same path again, but I might be wrong.

Option 2: make it an engine level feature

That was what I was trying to do in Spark, take a look at lance-format/lance-spark#13 (comment). To paste it here, user could construct the second level by themselves on engine side, using some notion like:

spark = SparkSession.builder()
    .appName("spark-lance-connector-test")
    .master("local")
    .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceCatalog")
    .config("spark.sql.catalog.lance.type", "dir")
    .config("spark.sql.catalog.lance.paths.ns1", dbPath1)
    .config("spark.sql.catalog.lance.paths.ns2", dbPath2)
    .getOrCreate();

This will map Spark namespace ns1 to dbPath1, ns2 to dbPath2, so user can run:

spark.sql("SELECT t1.c1, t2.c2 FROM ns1.t1 as t1, ns2.t2 as t2 WHERE t1.c3 = t2.c3")
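To make the Option 2 mapping concrete, here is a minimal Python sketch of how an engine could resolve `ns.table` to a storage location from keys shaped like the config above. It is not the lance-spark implementation. The helper names and the `<directory>/<table>.lance` layout are assumptions made for illustration.

```python
PATHS_PREFIX = "spark.sql.catalog.lance.paths."


def namespace_paths(conf):
    """Collect {namespace: directory} from keys that start with PATHS_PREFIX."""
    return {
        key[len(PATHS_PREFIX):]: value
        for key, value in conf.items()
        if key.startswith(PATHS_PREFIX)
    }


def resolve_table(conf, identifier):
    """Map 'namespace.table' to a location under that namespace's directory."""
    namespace, _, table = identifier.partition(".")
    if not namespace or not table:
        raise ValueError(f"expected 'namespace.table', got {identifier!r}")
    paths = namespace_paths(conf)
    if namespace not in paths:
        raise KeyError(f"namespace {namespace!r} is not mapped to a directory")
    # Assumption: each table lives at '<directory>/<table>.lance'.
    return paths[namespace].rstrip("/") + "/" + table + ".lance"


# Hypothetical directories standing in for dbPath1 and dbPath2.
conf = {
    "spark.sql.catalog.lance.type": "dir",
    "spark.sql.catalog.lance.paths.ns1": "/data/db1",
    "spark.sql.catalog.lance.paths.ns2": "/data/db2",
}
```

With this shape, the second level exists only in engine configuration, so the directory catalog itself stays a flat list of tables.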

Option 3: make the REST spec 1 level

This means instead of having GET /v1/namespaces/{ns}/tables/{table}, we could make it top level, like GET /v1/tables/{table}?namespace={ns}. This essentially makes the additional level of namespace optional.

It probably also achieves your goal of "end users may want to provide an abstraction themself, to switch between two catalogs simply".
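A small Python sketch of what Option 3 could look like from the client side: the table path is always top level, and the namespace only appears as a query parameter when one is given. The function name and the base URL are hypothetical. Only the `/v1/tables/{table}?namespace={ns}` shape comes from the proposal above.

```python
from urllib.parse import quote, urlencode


def table_url(base, table, namespace=None):
    """Build a one-level table URL; the namespace is an optional query parameter."""
    url = f"{base.rstrip('/')}/v1/tables/{quote(table, safe='')}"
    if namespace is not None:
        url += "?" + urlencode({"namespace": namespace})
    return url


# Hypothetical endpoint, used only to show the two URL shapes.
flat = table_url("https://catalog.example.com", "t1")
nested = table_url("https://catalog.example.com", "t1", namespace="ns1")
```

A directory-style catalog would only ever produce the first form, while a REST catalog with namespaces would produce the second, so a client could switch between the two without changing its call sites.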

Option 4: use a file system-based design that can actually do more.

There are ways to make the file system-based approach actually overcome the limitations of a 2 level directory setup as a multi-level catalog.

For example, my friends and I are developing a project, https://olympiaformat.org/, for exactly that purpose. But this would make the design very complicated, and I probably wouldn't recommend using it at its current stage.

For a simpler implementation, you could also imagine just having a catalog file, which holds a list of metadata information in the catalog. For example, this is just a JSON list of object name -> the JSON definition that is consistent with the REST spec.
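As a rough illustration of the catalog file idea, the Python sketch below reads a JSON object that maps object names to definitions. The field names (`type`, `location`) and the sample entries are hypothetical. The actual definitions would have to follow the REST spec, which this sketch does not attempt to reproduce.

```python
import json

# Hypothetical catalog file content: object name -> definition.
CATALOG_JSON = """
{
  "t1": {"type": "table", "location": "/data/db1/t1.lance"},
  "t2": {"type": "table", "location": "/data/db1/t2.lance"}
}
"""


def load_catalog(text):
    """Parse the catalog file and check that it is keyed by object name."""
    catalog = json.loads(text)
    if not isinstance(catalog, dict):
        raise ValueError("catalog file must hold a JSON object keyed by name")
    return catalog


def list_tables(catalog):
    """Return the names of all entries whose (hypothetical) type is 'table'."""
    return sorted(
        name for name, definition in catalog.items()
        if definition.get("type") == "table"
    )


catalog = load_catalog(CATALOG_JSON)
```

This leaves open how concurrent writers would update the file safely, which is the kind of limitation a plain directory setup runs into.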

My current take

I have clearly experimented with option 2 already, but tbh I am undecided on that option too. I am curious what you think!

Fine, let's take option 2 since we can achieve the final goal.

yanghua · 2025-04-17T05:33:50Z

+This mode ensures the catalog service is aware of all activities in the table,
+and can thus fully enforce any governance and management features for the table. 
+
+#### Storage Managed Table


Would Non Catalog Managed Table be a better name than Storage Managed Table? Then it is simply a boolean: one true, one false.

I thought about that, but I ended up feeling that a "type" enum might be better. This is also related to why I renamed "federated" to "storage managed": we might actually want to introduce a "federated" type in the future.

This is basically what Polaris is developing right now. Imagine 2 Lance REST catalogs, where catalog A has table ns1.t1. I could create a federated table in catalog B with name ns2.t2, and it would just be a pointer to catalogA.ns1.t1. Catalog B would basically just serve as a proxy (maybe plus some caching).

Another type we might want to consider is external. It has become increasingly irrelevant in the data lake world with transactional table formats, but when talking to data engineers, many still like to stick to the definition of a static external table. So there is also a chance that we would add that in the future.

These are probably a bit far away for Lance catalog, but using a type allows us to reserve the ability to add more types.

but using a type allows us to reserve the ability to add more types

sounds reasonable.
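The type-versus-boolean point in this thread can be sketched in Python as follows. The enum members mirror the two modes named in the spec text under review, and the wire strings are hypothetical. A boolean flag could only ever distinguish two cases, while an enum leaves room for the "federated" and "external" types mentioned above.

```python
from enum import Enum


class TableType(Enum):
    # Wire values are hypothetical; only the two mode names come from the thread.
    CATALOG_MANAGED = "catalog_managed"
    STORAGE_MANAGED = "storage_managed"
    # Possible future additions discussed above (not part of this PR):
    # FEDERATED = "federated"
    # EXTERNAL = "external"


def parse_table_type(value):
    """Parse a table type string, rejecting values this sketch does not know."""
    try:
        return TableType(value)
    except ValueError:
        raise ValueError(f"unknown table type: {value!r}") from None
```

Adding a new type later is then a matter of adding a member, without changing the shape of the field.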

yanghua · 2025-04-17T05:34:33Z

+We recommend tools to offer the following configurations in some form or shape 
+for users to configure connection to a Lance catalog:
+
+| Config Key | Description                                                                                 | Required?                   | 


Do we need to describe the default value?

if it is required, why does it need a default value?

I mean, if we can set a default for the type. e.g. default is rest?

yanghua · 2025-04-27T12:22:21Z

+This mode ensures the catalog service is aware of all activities in the table,
+and can thus fully enforce any governance and management features for the table. 
+
+##### Storage Managed Table


What about naming it a native table? That would mean tables created directly with the native SDK.

yanghua · 2025-04-27T12:26:25Z

-          * EXIST_OK: Create the namespace if it does not exist. If a namespace of the same name already exists, the operation succeeds and the existing namespace is kept.
-          * OVERWRITE: Create the namespace if it does not exist. If a namespace of the same name already exists, the existing namespace is dropped and a new namespace with this name with no table is created.
-      operationId: CreateNamespace
+        Create a new catalog. A catalog can manage either a collection of child catalogs, or a collection of tables.


I am thinking about how we map a pattern like catalog1.catalog2.catalog3....table to a traditional catalog system like catalog1.database1.table.

you will have something like the following:

GET /catalogs
> [ "database1" ]

POST /catalogs
{ "name": "database2" }

GET /catalogs
> [ "database1", "database2" ]

GET /catalogs/database1
> { "name": "database1" }

GET /tables?catalog=database1
> [ "table1" ]

GET /tables/database1.table1
> { // table metadata }
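One way to read the mapping question is that a dotted identifier splits into a chain of catalog levels plus a table name, so the traditional database.table shape is just the case with one catalog level. The Python sketch below shows that split. The function name is hypothetical, and the "." delimiter is assumed from the `database1.table1` example above.

```python
def split_identifier(identifier, delimiter="."):
    """Split 'c1.c2....table' into (list of catalog levels, table name)."""
    parts = identifier.split(delimiter)
    if any(not part for part in parts):
        raise ValueError(f"empty level in identifier: {identifier!r}")
    *catalogs, table = parts
    return catalogs, table


one_level = split_identifier("database1.table1")
multi_level = split_identifier("catalog1.catalog2.catalog3.table")
```

A client of a traditional system would always see exactly one entry in the catalog list, while a multi-level catalog can return as many levels as it manages.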

github-actions Bot added the spec Restful openapi spec label Apr 11, 2025

jackye1995 mentioned this pull request Apr 11, 2025

Discuss support for Lance tables in a directory #9

Closed

jackye1995 marked this pull request as draft April 11, 2025 23:55

jackye1995 mentioned this pull request Apr 15, 2025

spec: register table #22

Merged

github-actions Bot added java Java features rust Rust features labels Apr 16, 2025

jackye1995 changed the title ~~Draft for a text spec for Lance Catalog~~ spec: add a text spec to fully define Lance Catalog Apr 16, 2025

github-actions Bot added the enhancement New feature or request label Apr 16, 2025

jackye1995 requested review from westonpace and yanghua April 16, 2025 20:29

jackye1995 marked this pull request as ready for review April 16, 2025 23:45

yanghua reviewed Apr 17, 2025


jackye1995 mentioned this pull request Apr 23, 2025

Multi-level Catalog? #58

Closed

jackye1995 added 16 commits April 26, 2025 22:36

Draft for a text spec for Lance Catalog

31788ea

minor change in wording

3250673

minor change in wording

8a38c87

minor change in wording

e5274fb

minor change in wording

603e568

minor change in wording

3cbc368

minor change in wording

63581b2

fix path

3889bb2

fix wording

110f191

fix wording

3f9b713

fix wording

2ed760d

fix wording

d46cea5

fix wording

46709c3

commit

f23cacc

update based on comment

c10c420

use catalog instead of namespace

747a3ea

codegen

b5f81d0

github-actions Bot added the python Python features label Apr 27, 2025

minor wording

49dcffa

jackye1995 changed the title ~~spec: add a text spec to fully define Lance Catalog~~ spec: support multi-catalog concept Apr 27, 2025

jackye1995 requested a review from yanghua April 27, 2025 06:59

yanghua reviewed Apr 27, 2025


jackye1995 added 3 commits April 27, 2025 13:40

clarify identifier

fd6f7eb

fix codegen

ded5cd0

fix based on comments

49163db

jackye1995 closed this Apr 28, 2025

jackye1995 reopened this Apr 28, 2025

jackye1995 closed this by deleting the head repository Apr 28, 2025
