spec: support multi-catalog concept #11
> A Lance catalog is a centralized repository for discovering, organizing, and managing Lance tables.
> There are 2 types of Lance catalogs: storage directory and REST catalog.
>
> ## Lance Directory
Should we call this "Lance Directory Catalog" to make it clearer? That lets users know it is a catalog, not a concept of the Lance format. WDYT?
> Lance storage directory, or **Lance Directory**, is a lightweight and simple catalog
> for people to get started with creating and using Lance tables directly on top of any local or remote storage system.
>
> A directory in a storage can contain a list of Lance tables,
Do we need to describe here how the catalog and namespace concepts correspond? E.g. a directory acts as a catalog, and a sub-directory could act as a namespace?
Tbh, I went back and forth on this.
It seems like so far the use pattern is just connecting to a directory, and then there is a list of tables in that directory.
I saw the ListingCatalog was added recently, but there is no established pattern for that extra level yet. This seems very similar to the HadoopCatalog in Iceberg, which has been very controversial and is now in a deprecated state, because (1) quite a few operations basically can never be supported with this directory setup compared to a normal catalog, and (2) the more features were added to it, the more inconsistent its behavior became across storage solutions.
My current feeling is that a directory with a list of tables is sufficient, and we could avoid the mistake in Iceberg HadoopCatalog by not having that one extra layer of directories. But I know you added the ListingCatalog. If that has some important use cases on your side, I think we should for sure add that support, but just document the limitations.
> But I know you added the ListingCatalog. If that has some important use cases on your side
There are no important use cases. The background of ListingCatalog is that, at the time, we had two traits: Catalog and Database. It's a code-level abstraction.
My concern is mainly that:
- users may get confused because a `namespace` is missing here, while our `rest` catalog has this concept;
- end users may want to provide an abstraction themselves, to switch between two catalogs simply, so a unified abstraction could provide some convenience.
There are a few alternative ways we could evaluate:
Option 1: 2-level directory
This is basically the ListingCatalog approach. My main concern is that this was a discouraged and deprecated usage pattern in Iceberg, and it feels like we should not go down the same path again, but I might be wrong.
Option 2: make it an engine level feature
That is what I was trying to do in Spark, take a look at lance-format/lance-spark#13 (comment). To paste it here, users could construct the second level themselves on the engine side, using something like:
```java
spark =
    SparkSession.builder()
        .appName("spark-lance-connector-test")
        .master("local")
        .config("spark.sql.catalog.lance", "com.lancedb.lance.spark.LanceCatalog")
        .config("spark.sql.catalog.lance.type", "dir")
        .config("spark.sql.catalog.lance.paths.ns1", dbPath1)
        .config("spark.sql.catalog.lance.paths.ns2", dbPath2)
        .getOrCreate();
```
This will map Spark namespace `ns1` to `dbPath1` and `ns2` to `dbPath2`, so users can run:
```java
spark.sql("SELECT t1.c1, t2.c2 FROM ns1.t1 as t1, ns2.t2 as t2 WHERE t1.c3 = t2.c3")
```
Option 3: make the REST spec 1 level
This means that instead of having `GET /v1/namespaces/{ns}/tables/{table}`, we could make it top level, like `GET /v1/tables/{table}?namespace={ns}`. This essentially makes the additional namespace level optional.
It probably also achieves your goal of "end users may want to provide an abstraction themself, to switch between two catalogs simply".
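To make the two URL shapes in option 3 concrete, here is a minimal sketch that builds both request paths. The class `TablePaths` and its method names are hypothetical and exist only for illustration; only the two path templates come from the comment above, and the sketch assumes names are already URL-safe.

```java
// Hypothetical helper, not part of the Lance REST spec.
final class TablePaths {
    // Two-level form: the namespace is a mandatory path segment.
    static String nested(String ns, String table) {
        return "/v1/namespaces/" + ns + "/tables/" + table;
    }

    // One-level form: the namespace is an optional query parameter.
    // Assumption: names need no URL encoding.
    static String flat(String table, String ns) {
        String path = "/v1/tables/" + table;
        return (ns == null || ns.isEmpty()) ? path : path + "?namespace=" + ns;
    }

    public static void main(String[] args) {
        System.out.println(nested("ns1", "t1"));
        System.out.println(flat("t1", "ns1"));
        System.out.println(flat("t1", null));
    }
}
```

With the flat form, a directory-style catalog that has no namespace level can simply omit the query parameter, while a REST catalog can keep passing it.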
Option 4: use a file system-based design that can actually do more.
There are ways to make the file system-based approach actually overcome the limitations of a 2 level directory setup as a multi-level catalog.
For example, my friends and I are developing a project, https://olympiaformat.org/, for exactly that purpose. But this would make the design very complicated, and I probably would not recommend using it at its current stage.
For a simpler implementation, you could also imagine just having a catalog file that holds the metadata of everything in the catalog. For example, it could just be a JSON list of object name -> JSON definition, consistent with the REST spec.
My current take
I have clearly experimented with option 2 already, but tbh I am undecided on that option too. I am curious what you think!
Fine, let's take option 2 since we can achieve the final goal.
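For reference, the namespace-to-path mapping in option 2 could be resolved along these lines. This is a sketch under stated assumptions, not the lance-spark implementation: the class `NamespacePaths` is hypothetical, the option keys are assumed to arrive with the `spark.sql.catalog.lance.` prefix already stripped, and the `<table>.lance` directory naming is an assumption as well.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: maps engine-level namespaces to storage directories.
final class NamespacePaths {
    private static final String PREFIX = "paths.";
    private final Map<String, String> nsToPath = new HashMap<>();

    // Collect every "paths.<namespace>" option; other options are ignored.
    NamespacePaths(Map<String, String> options) {
        for (Map.Entry<String, String> e : options.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                nsToPath.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
    }

    // Resolve ns.table to a location under the directory mapped to ns.
    // Assumption: each table lives in a "<table>.lance" directory.
    String tableLocation(String ns, String table) {
        String base = nsToPath.get(ns);
        if (base == null) {
            throw new IllegalArgumentException("Unknown namespace: " + ns);
        }
        String sep = base.endsWith("/") ? "" : "/";
        return base + sep + table + ".lance";
    }
}
```

The point of the design is that the second level exists only in engine configuration, so the directory catalog itself stays a flat list of tables.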
> This mode ensures the catalog service is aware of all activities in the table,
> and can thus fully enforce any governance and management features for the table.
>
> #### Storage Managed Table
How does this compare with calling it "Non Catalog Managed Table"? Which one is better? That would make it a flag: one true, one false.
I thought about that, but I ended up feeling that a "type" enum might be better. This is also related to why I renamed "federated" to "storage managed": we might want to actually introduce a "federated" type in the future.
This is basically what Polaris is developing right now. Imagine 2 Lance REST catalogs: catalog A has table `ns1.t1`, and I could create a federated table in catalog B named `ns2.t2`. It would just be a pointer to `catalogA.ns1.t1`, and catalog B would basically just serve as a proxy (maybe plus some caching).
Another type we might want to consider is external. It has become increasingly irrelevant in the data lake world with transactional table formats, but when I talk to data engineers, many still like to stick to the definition of a static external table. So there is also a chance that we would add that in the future.
These are probably a bit far off for the Lance catalog, but using a type allows us to reserve the ability to add more types.
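A small sketch of why an enum leaves more room than a boolean. Everything here is hypothetical: the enum name `TableType`, the constant `CATALOG_MANAGED` (inferred from the discussion, the spec wording may differ), and the lowercase wire values are assumptions for illustration only.

```java
import java.util.Locale;

// Hypothetical table type enum. A boolean "catalogManaged" flag could only
// ever express the first two values; an enum can grow later, for example
// with FEDERATED or EXTERNAL, without changing the field's shape.
enum TableType {
    CATALOG_MANAGED,
    STORAGE_MANAGED;

    // Parse an assumed wire value such as "storage_managed".
    // Unknown values fail loudly instead of silently mapping to a default.
    static TableType fromWire(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // True only when every write goes through the catalog service.
    boolean catalogSeesAllWrites() {
        return this == CATALOG_MANAGED;
    }
}
```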
> but using a type allows us to reserve the ability to add more types
sounds reasonable.
> We recommend tools to offer the following configurations in some form or shape
> for users to configure connection to a Lance catalog:
>
> | Config Key | Description | Required? |
Do we need to describe the default value?
if it is required, why does it need a default value?
I mean, could we set a default for the type? E.g. the default is `rest`?
> This mode ensures the catalog service is aware of all activities in the table,
> and can thus fully enforce any governance and management features for the table.
>
> ##### Storage Managed Table
What about naming it "native table"? That would mean tables created directly with the native SDK.
> * EXIST_OK: Create the namespace if it does not exist. If a namespace of the same name already exists, the operation succeeds and the existing namespace is kept.
> * OVERWRITE: Create the namespace if it does not exist. If a namespace of the same name already exists, the existing namespace is dropped and a new namespace with this name with no table is created.
> operationId: CreateNamespace
> Create a new catalog. A catalog can manage either a collection of child catalogs, or a collection of tables.
I am thinking about how we map a pattern like `catalog1.catalog2.catalog3....table` to a traditional catalog system's `catalog1.database1.table`.
You will have something like the following:
```
GET /catalogs
> [ "database1" ]

POST /catalogs
{ "name": "database2" }

GET /catalogs
> [ "database1", "database2" ]

GET /catalogs/database1
> { "name": "database1" }

GET /tables?catalog=database1
> [ "table1" ]

GET /tables/database1.table1
> { // table metadata }
```
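The dotted identifier in the last request (`database1.table1`) can be split into a parent catalog path plus a table name, which is one way to line a recursive catalog up with a traditional `catalog.database.table` layout. A minimal sketch, assuming names never contain dots and every table has at least one parent catalog; the class `TableId` is hypothetical and not part of the spec.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical parser for dotted table identifiers.
final class TableId {
    final List<String> catalogPath; // parent catalogs, outermost first
    final String table;             // last segment of the identifier

    private TableId(List<String> catalogPath, String table) {
        this.catalogPath = catalogPath;
        this.table = table;
    }

    static TableId parse(String dotted) {
        // Limit -1 keeps trailing empty segments so "a.b." is rejected below.
        String[] parts = dotted.split("\\.", -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Expected <catalog>.<table>: " + dotted);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty name segment in: " + dotted);
            }
        }
        String[] parents = Arrays.copyOf(parts, parts.length - 1);
        return new TableId(Arrays.asList(parents), parts[parts.length - 1]);
    }
}
```

For `catalog1.database1.table1`, this yields the catalog path `[catalog1, database1]` and the table `table1`, so a traditional two-level system is just the case where the path has length two.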
Add a text specification that defines what a Lance catalog is, and the relationship between a directory of Lance tables and a collection of Lance tables accessed through the REST catalog protocol, by making the catalog concept recursive and allowing multi-level catalogs.
Change all namespace concepts to catalog concepts in the REST spec.
Rename `catalog.yaml` to `rest-catalog.yaml` to make it clear that the OpenAPI spec is for the REST catalog specifically.
Closes #9
Closes #58