
spec: support multi-catalog concept#11

Closed
jackye1995 wants to merge 21 commits into lance-format:main from jackye1995:text-spec

Conversation

@jackye1995
Collaborator

@jackye1995 jackye1995 commented Apr 11, 2025

Add a text specification that defines what a Lance catalog is, and how a directory of Lance tables relates to a collection of Lance tables accessed through the REST catalog protocol. It does this by making the catalog concept recursive and allowing multi-level catalogs.

Change all namespace concepts to be catalog concepts in REST spec.

Rename catalog.yaml to rest-catalog.yaml to make it clear that the OpenAPI spec is for the REST catalog specifically.

Closes #9
Closes #58

@github-actions github-actions Bot added the spec Restful openapi spec label Apr 11, 2025
@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@jackye1995 jackye1995 marked this pull request as draft April 11, 2025 23:55
@jackye1995 jackye1995 mentioned this pull request Apr 15, 2025
@github-actions github-actions Bot added java Java features rust Rust features labels Apr 16, 2025
@jackye1995 jackye1995 changed the title from "Draft for a text spec for Lance Catalog" to "spec: add a text spec to fully define Lance Catalog" Apr 16, 2025
@github-actions github-actions Bot added the enhancement New feature or request label Apr 16, 2025
@jackye1995 jackye1995 marked this pull request as ready for review April 16, 2025 23:45
Comment thread spec/spec.md Outdated
A Lance catalog is a centralized repository for discovering, organizing, and managing Lance tables.
There are 2 types of Lance catalogs: storage directory and REST catalog.

## Lance Directory
Contributor

Should we call this "Lance Directory Catalog" to make it clearer? That lets users know it's a catalog, not a concept of the Lance format. WDYT?

Comment thread spec/spec.md Outdated
Lance storage directory, or **Lance Directory**, is a lightweight and simple catalog
for people to get started with creating and using Lance tables directly on top of any local or remote storage system.

A directory in a storage can contain a list of Lance tables,
Contributor

Do we need to describe here how the catalog and namespace concepts map to each other? E.g. a directory acts as a catalog, and a sub-directory could act as a namespace?

Collaborator Author

Tbh, I went back and forth on this.

It seems like so far the use pattern is just connecting to a directory, and then there is a list of tables in that directory.

I saw the ListingCatalog was added recently, but there is not an established pattern for that extra level yet. This seems very similar to the HadoopCatalog in Iceberg, which has been very controversial and is now deprecated, because (1) there were quite a few operations that basically can never be supported with this directory setup compared to a normal catalog, and (2) the more features we added to it, the more inconsistent its behavior became across storage solutions.

My current feeling is that a directory with a list of tables is sufficient, and we could avoid the mistake in Iceberg HadoopCatalog by not having that one extra layer of directories. But I know you added the ListingCatalog. If that has some important use cases on your side, I think we should for sure add that support, but just document the limitations.

Contributor

> But I know you added the ListingCatalog. If that has some important use cases on your side

There are no important use cases. The background of ListingCatalog is that, at that time, we had two traits: Catalog and Database. It's a code-level abstraction.

My concern is mainly about:

  • users may get confused that the namespace is missing here, while our REST catalog has this concept;
  • end users may want to provide an abstraction themselves so they can switch between the two catalogs easily, and a unified abstraction could provide some convenience.

Collaborator Author

There are a few alternative ways we could evaluate:

Option 1: 2-level directory

This is basically the ListingCatalog approach. My main concern is that this was a disapproved and deprecated usage pattern in Iceberg, and it feels like we should not go down the same path again, but I might be wrong.

Option 2: make it an engine level feature

That is what I was trying to do in Spark; take a look at lance-format/lance-spark#13 (comment). To paste it here: users could construct the second level themselves on the engine side, using something like:

spark =
        SparkSession.builder()
            .appName("spark-lance-connector-test")
            .master("local")
            .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceCatalog")
            .config("spark.sql.catalog.lance.type", "dir")
            .config("spark.sql.catalog.lance.paths.ns1", dbPath1)
            .config("spark.sql.catalog.lance.paths.ns2", dbPath2)
            .getOrCreate();

This will map Spark namespace ns1 to dbPath1 and ns2 to dbPath2, so the user can run:

spark.sql("SELECT t1.c1, t2.c2 FROM ns1.t1 as t1, ns2.t2 as t2 WHERE t1.c3 = t2.c3")
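To make the mapping concrete, here is a minimal sketch of how an engine-side catalog might turn those `paths.<namespace>` entries into table locations. `NamespacePaths` is a hypothetical helper, not part of lance-spark, and the `.lance` directory suffix is an assumption about how tables are laid out under each path.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: resolves "namespace.table" names against per-namespace directory paths. */
final class NamespacePaths {
    // Config prefix taken from the keys in the snippet above.
    static final String PREFIX = "spark.sql.catalog.lance.paths.";

    private final Map<String, String> pathByNamespace = new LinkedHashMap<>();

    NamespacePaths(Map<String, String> conf) {
        // Keep only the "paths.<namespace>" entries; other catalog options are ignored here.
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                pathByNamespace.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
    }

    /** Maps "ns1.t1" to "<path of ns1>/t1.lance"; the ".lance" suffix is an assumption. */
    String resolve(String identifier) {
        int dot = identifier.indexOf('.');
        if (dot <= 0 || dot == identifier.length() - 1) {
            throw new IllegalArgumentException("Expected <namespace>.<table>: " + identifier);
        }
        String namespace = identifier.substring(0, dot);
        String table = identifier.substring(dot + 1);
        String base = pathByNamespace.get(namespace);
        if (base == null) {
            throw new IllegalArgumentException("Unknown namespace: " + namespace);
        }
        String separator = base.endsWith("/") ? "" : "/";
        return base + separator + table + ".lance";
    }
}
```

With this shape, the directory catalog itself stays one level deep, and the second level exists only in the engine configuration.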

Option 3: make the REST spec 1 level

This means instead of having GET /v1/namespaces/{ns}/tables/{table}, we could make it top level, like GET /v1/tables/{table}?namespace={ns}. This essentially makes the additional level of namespace optional.

It probably also achieves your goal of "end users may want to provide an abstraction themself, to switch between two catalogs simply".
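A small sketch of the two route shapes may help compare them. `TableRoutes` is a hypothetical client-side helper, and it uses `URLEncoder` (form-style encoding) only as a stand-in for proper path-segment escaping, which is fine for simple identifiers.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that builds the two route shapes discussed in option 3. */
final class TableRoutes {
    /** Nested form: the namespace is a required path segment. */
    static String nested(String namespace, String table) {
        return "/v1/namespaces/" + enc(namespace) + "/tables/" + enc(table);
    }

    /** Flat form: the namespace is an optional query parameter. */
    static String flat(String table, String namespace) {
        String path = "/v1/tables/" + enc(table);
        if (namespace == null || namespace.isEmpty()) {
            return path; // a directory catalog with no namespace level uses this form
        }
        return path + "?namespace=" + enc(namespace);
    }

    // Form-style encoding; a real client would use path-segment escaping instead.
    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

In the flat form, a one-level directory catalog and a namespaced REST catalog can share the same call site, differing only in whether the namespace argument is supplied.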

Option 4: use a file system-based design that can actually do more.

There are ways to make the file system-based approach actually overcome the limitations of a 2 level directory setup as a multi-level catalog.

For example, my friends and I are developing a project, https://olympiaformat.org/, for exactly that purpose. But this will make the design very complicated, and I probably won't recommend using it at its current stage.

For a simpler implementation, you could also imagine just having a catalog file that holds the metadata of everything in the catalog. For example, it could be just a JSON map from object name to a JSON definition that is consistent with the REST spec.

My current take

I have clearly experimented with option 2 already, but tbh I am still undecided on that option too. I am curious what you think!

Contributor

Fine, let's take option 2 since we can achieve the final goal.

Comment thread spec/spec.md Outdated
This mode ensures the catalog service is aware of all activities in the table,
and can thus fully enforce any governance and management features for the table.

#### Storage Managed Table
Contributor

Compared with "Non Catalog Managed Table", which name is better? That would give one true case and one false case.

Collaborator Author

I thought about that, but I ended up feeling that using a "type" enum might be better. This is also related to why I was renaming "federated" to "storage managed": we might want to actually introduce a "federated" type in the future.

This is basically what Polaris is developing right now. Imagine 2 Lance REST catalogs: catalog A has table ns1.t1, and I could create a federated table in catalog B with name ns2.t2. It will just be a pointer to catalogA.ns1.t1, and catalog B will basically just serve as a proxy (maybe + some caching).

Another type we might want to consider is external. It has become increasingly irrelevant in the data lake world with transactional table formats, but when talking to data engineers, many still like to stick to the definition of a static external table. So there is also a likelihood that we would add that in the future.

These are probably a bit far away for Lance catalog, but using a type allows us to reserve the ability to add more types.
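As a rough illustration of the trade-off, the sketch below models the type as an enum instead of a boolean flag. `TableType`, its constant names, and the lowercase wire format are assumptions made for this example; only the catalog managed and storage managed modes come from the spec text under review.

```java
import java.util.Locale;

/** Hypothetical table "type" field; the constant names are illustrative, not from the spec. */
enum TableType {
    CATALOG_MANAGED,
    STORAGE_MANAGED;
    // Possible future values mentioned in this thread, not part of the spec: FEDERATED, EXTERNAL.

    /** Parses a wire value such as "storage_managed"; the wire format is an assumption. */
    static TableType fromWire(String value) {
        return TableType.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** What a boolean "catalog managed" flag would have expressed. */
    boolean isCatalogManaged() {
        return this == CATALOG_MANAGED;
    }
}
```

Adding a new type later is then a new constant, whereas a boolean field would need a second field or a breaking change.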

Contributor

> but using a type allows us to reserve the ability to add more types

sounds reasonable.

Comment thread spec/spec.md Outdated
We recommend tools to offer the following configurations in some form or shape
for users to configure connection to a Lance catalog:

| Config Key | Description | Required? |
Contributor

Do we need to describe the default value?

Collaborator Author

if it is required, why does it need a default value?

Contributor

I mean, whether we can set a default for the type, e.g. default to rest?

@jackye1995 jackye1995 mentioned this pull request Apr 23, 2025
@github-actions github-actions Bot added the python Python features label Apr 27, 2025
@jackye1995 jackye1995 changed the title from "spec: add a text spec to fully define Lance Catalog" to "spec: support multi-catalog concept" Apr 27, 2025
@jackye1995 jackye1995 requested a review from yanghua April 27, 2025 06:59
Comment thread spec/spec.md
This mode ensures the catalog service is aware of all activities in the table,
and can thus fully enforce any governance and management features for the table.

##### Storage Managed Table
Contributor

What about naming it "native table"? That would mean tables created directly with the native SDK.

Comment thread spec/rest-catalog.yaml Outdated
* EXIST_OK: Create the namespace if it does not exist. If a namespace of the same name already exists, the operation succeeds and the existing namespace is kept.
* OVERWRITE: Create the namespace if it does not exist. If a namespace of the same name already exists, the existing namespace is dropped and a new namespace with this name with no table is created.
operationId: CreateNamespace
Create a new catalog. A catalog can manage either a collection of child catalogs, or a collection of tables.
Contributor

@yanghua yanghua Apr 27, 2025

I am thinking about how we would map a pattern like catalog1.catalog2.catalog3....table to a traditional catalog system's catalog1.database1.table.

Collaborator Author

you will have something like the following:

GET /catalogs
> [ "database1" ]

POST /catalogs
{ "name": "database2" }

GET /catalogs
> [ "database1", "database2" ]

GET /catalogs/database1
> { "name": "database1" }

GET /tables?catalog=database1
> [ "table1" ]

GET /tables/database1.table1
> { // table metadata }
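
The request sequence above can be mimicked with a small in-memory model, which may make the mapping easier to reason about. This is a sketch under stated assumptions: `InMemoryCatalogs` and its methods are hypothetical, table metadata is reduced to an opaque string, and `putTable` exists only to seed the example since table creation is not shown in the thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of the calls above; not the REST server implementation. */
final class InMemoryCatalogs {
    // catalog name -> (table name -> table metadata), kept in insertion order
    private final Map<String, Map<String, String>> catalogs = new LinkedHashMap<>();

    /** Mirrors POST /catalogs. */
    void createCatalog(String name) {
        if (catalogs.putIfAbsent(name, new LinkedHashMap<>()) != null) {
            throw new IllegalStateException("Catalog already exists: " + name);
        }
    }

    /** Mirrors GET /catalogs. */
    List<String> listCatalogs() {
        return new ArrayList<>(catalogs.keySet());
    }

    /** Seeds a table for the sketch only. */
    void putTable(String catalog, String table, String metadata) {
        tablesOf(catalog).put(table, metadata);
    }

    /** Mirrors GET /tables?catalog={catalog}. */
    List<String> listTables(String catalog) {
        return new ArrayList<>(tablesOf(catalog).keySet());
    }

    /** Mirrors GET /tables/{catalog}.{table}; the last dot separates catalog from table. */
    String getTable(String identifier) {
        int dot = identifier.lastIndexOf('.');
        if (dot <= 0) {
            throw new IllegalArgumentException("Expected <catalog>.<table>: " + identifier);
        }
        String metadata = tablesOf(identifier.substring(0, dot)).get(identifier.substring(dot + 1));
        if (metadata == null) {
            throw new IllegalArgumentException("No such table: " + identifier);
        }
        return metadata;
    }

    private Map<String, String> tablesOf(String catalog) {
        Map<String, String> tables = catalogs.get(catalog);
        if (tables == null) {
            throw new IllegalArgumentException("No such catalog: " + catalog);
        }
        return tables;
    }
}
```

In this model a traditional "database" is simply a catalog that holds tables, so catalog1.database1.table corresponds to connecting to catalog1 and addressing database1.table within it.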

@jackye1995 jackye1995 closed this Apr 28, 2025
@jackye1995 jackye1995 reopened this Apr 28, 2025
@jackye1995 jackye1995 closed this by deleting the head repository Apr 28, 2025

Labels

enhancement New feature or request java Java features python Python features rust Rust features spec Restful openapi spec

Development

Successfully merging this pull request may close these issues.

Multi-level Catalog?
Discuss support for Lance tables in a directory

2 participants