feat: add spec for Iceberg namespace #118
| 1. the [`location`](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L2494) must point to the root location of the Lance table
| 2. the [`properties`](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L2499) must follow:
|     1. there is a key `table_type` set to `lance` (case insensitive)
|     2. there is a key `managed_by` set to either `storage` or `impl` (case insensitive). If not set, default to `storage`
There was a problem hiding this comment.
If the default is `storage`, should we add an equivalent section to "Requirement for Implementation Managed Table" for storage managed tables?
There was a problem hiding this comment.
yes good point, let me do that
There was a problem hiding this comment.
I ended up not adding a section, since I can't think of any particular requirement beyond the ones in the table definition. I will add one later if there are additional points unique to storage managed tables.
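For reference, the property rules in the diff above can be sketched as a small check. This is only an illustration of the rules as written, not code from the PR: the function name `resolve_managed_by` is hypothetical, and it assumes the property keys are matched exactly while only the values are case insensitive.

```python
from typing import Mapping, Optional


def resolve_managed_by(properties: Mapping[str, str]) -> Optional[str]:
    """Return 'storage' or 'impl' for a Lance table, or None if not a Lance table.

    Hypothetical helper: keys are assumed to be matched exactly,
    values are compared case-insensitively as the spec text says.
    """
    if properties.get("table_type", "").lower() != "lance":
        return None  # not a Lance table, nothing to resolve

    # If `managed_by` is not set, the spec text defaults it to `storage`.
    managed_by = properties.get("managed_by", "storage").lower()
    if managed_by not in ("storage", "impl"):
        raise ValueError(f"unsupported managed_by value: {managed_by!r}")
    return managed_by
```

With this reading, a table that only sets `table_type` resolves to `storage`, which is why a storage managed table needs no extra properties.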
| ## Requirement for Implementation Managed Table
|
| An update to the implementation managed table must go through IRC [UpdateTable](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L997) API
| or [CommitTransaction](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L1336) API
There was a problem hiding this comment.
Is the plan to support all IRC metadata updates? Currently Lance doesn't have branching, partitioning, etc. Maybe you could add how incompatibilities will be handled.
There was a problem hiding this comment.
I guess this part is still hypothetical. I think we will need to start with storage managed table, which will only leverage LoadTable and CreateTable, and all the update commits directly go against the storage.
| An update to the implementation managed table must go through IRC [UpdateTable](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L997) API
| or [CommitTransaction](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L1336) API
| with a requirement that the [`assert-ref-snapshot-id`](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L3051) is the current Lance table version.
| If the commit fails due to unresolvable concurrent commits, the IRC server must fail with `409 Conflict` according to the IRC spec.
There was a problem hiding this comment.
Does Lance support concurrent commits or are all commits serialized using the version number?
There was a problem hiding this comment.
> Does Lance support concurrent commits or are all commits serialized using the version number?

I think it's both; it's basically the same as Iceberg. There is a rebase process among concurrent commits, and then there is the commit to the next version (equivalent to Iceberg's commit to the catalog). If that commit fails, it retries the rebase and commit against the new concurrent commit.
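The rebase-then-commit loop described above can be sketched roughly as follows. This is a toy in-memory model, not the Lance or Iceberg API: `InMemoryTable`, `CommitConflict`, `commit_with_retry`, and the `before_commit` hook are all hypothetical names, the version check stands in for the `assert-ref-snapshot-id` requirement, and the "rebase" is reduced to re-reading the latest version (a real rebase would also check that the concurrent changes are compatible, and give up with a conflict if they are not).

```python
class CommitConflict(Exception):
    """Stands in for an IRC `409 Conflict` response (hypothetical)."""


class InMemoryTable:
    """Toy table whose only state is a version counter and a change log."""

    def __init__(self):
        self.version = 1
        self.changes = []

    def commit(self, expected_version, change):
        # Analogous to asserting that `assert-ref-snapshot-id`
        # equals the current table version.
        if expected_version != self.version:
            raise CommitConflict(
                f"expected version {expected_version}, found {self.version}"
            )
        self.changes.append(change)
        self.version += 1
        return self.version


def commit_with_retry(table, change, max_attempts=3, before_commit=None):
    """Retry the commit against the latest version until it succeeds."""
    for attempt in range(max_attempts):
        base_version = table.version  # "rebase": re-read the latest version
        if before_commit is not None:
            before_commit(attempt)  # hook used only to simulate a concurrent writer
        try:
            return table.commit(base_version, change)
        except CommitConflict:
            continue  # someone else committed first; rebase and try again
    raise CommitConflict(f"gave up after {max_attempts} attempts")
```

In this model, a writer that loses the race on its first attempt simply lands on the next version on its second attempt; only when retries are exhausted does the caller see the conflict.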
| 1. the [`location`](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L2494) must point to the root location of the Lance table
| 2. the [`properties`](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L2499) must follow:
|     1. there is a key `table_type` set to `lance` (case insensitive)
There was a problem hiding this comment.
When a request comes in it seems we would need to know the type before we load the metadata?
There was a problem hiding this comment.
In my mind the IRC server should be able to manipulate what is returned as the response of LoadTable, for example if it's a Lance table, it can just generate a TableMetadata JSON payload without the need to read any actual Iceberg metadata JSON file, and provide this information in the properties section. Is that feasible?
There was a problem hiding this comment.
HMS with Iceberg works like this. It looks up the generic table object and then sees what the `table_type` property is.
this also reminds me of Polaris generic table
https://polaris.apache.org/in-dev/unreleased/generic-table/
There was a problem hiding this comment.
> it can just generate a TableMetadata JSON payload without the need to read any actual Iceberg metadata JSON file, and provide this information in the properties section. Is that feasible?

I think it's feasible as long as the reader uses the TableMetadata and does not re-read the metadata.json file (I think Spark does this today).
There was a problem hiding this comment.
> HMS with iceberg works like this. It look up the generic table object and then see what the table_type property is

yes exactly, this is derived from the HMS experience.
> i think its feasible as long as the reader uses the TableMetadata and not re-read the metadata.json file (i think spark does this today).

Yes. Maybe I should add a line that the metadata-location must not be set.
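To make the idea concrete, here is a rough sketch of a server synthesizing a LoadTable response for a Lance table without reading any Iceberg metadata JSON file. The function name `synthesize_load_table_result` is hypothetical, the payload is deliberately trimmed and is not a complete TableMetadata, and the choice of `format-version` 2 and a freshly generated `table-uuid` are assumptions made only so the example has plausible values.

```python
import json
import uuid


def synthesize_load_table_result(location, managed_by="storage"):
    """Build a trimmed LoadTable-style payload for a Lance table (sketch only)."""
    metadata = {
        # Placeholder values; a real server would decide these (assumption).
        "format-version": 2,
        "table-uuid": str(uuid.uuid4()),
        # Root location of the Lance table, per the spec text under review.
        "location": location,
        "properties": {
            "table_type": "lance",
            "managed_by": managed_by,
        },
    }
    # `metadata-location` is intentionally left out, following the suggestion
    # that it must not be set for Lance tables.
    return {"metadata": metadata}


if __name__ == "__main__":
    # The bucket path below is a made-up example.
    result = synthesize_load_table_result("s3://bucket/path/to/table.lance")
    print(json.dumps(result, indent=2))
```

A reader that trusts the returned `metadata` and checks `properties.table_type` can then route to a Lance reader; a reader that re-reads a metadata file would find no `metadata-location` to follow.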
| It describes how a metadata service like Apache Hive MetaStore (HMS), Apache Gravitino, Unity Catalog, etc.
| should store and use Lance tables, as well as how ML/AI tools and analytics compute engines should integrate with Lance tables.
| It describes how a metadata service like Apache Hive MetaStore (HMS), Apache Iceberg REST Catalog (IRC),
| Apache Gravitino, Unity Catalog, etc. should store and use Lance tables,
There was a problem hiding this comment.
are the "Apache Gravitino, Unity Catalog" integrations via IRC or custom?
There was a problem hiding this comment.
I think we are considering 2 approaches: (1) through HMS, since that has been the most mature integration route in both Unity and Gravitino, and (2) through REST, for which we will add another REST server to these projects.
See apache/gravitino#7358 for example.
For IRC, I am actually not sure. Maybe we could integrate through this Iceberg way, but both in Gravitino and Unity, IRC is purposely built for Iceberg tables at this moment. It would require some paradigm shift. But if we think this is a good direction, happy to push for that approach in both communities.
| ## Namespace Mapping
|
| An IRC server can be viewed as the root Lance namespace.
There was a problem hiding this comment.
nit: Iceberg tables in IRC are addressed as (namespace, table), where the namespace can be "nested"/"leveled".
It's not clear to me how Lance tables will be presented here.
There was a problem hiding this comment.
I think the mapping is:
- IRC server: Lance root namespace
- IRC multi-level namespace: Lance multi-level namespace
- IRC tables in a namespace: Lance tables in a namespace

Do you think the wording is not clear? Or did I misunderstand the comment?
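Spelled out as code, the mapping above is essentially the identity on the namespace levels. The sketch below assumes identifiers are represented as tuples of strings, which is an illustration choice rather than something the spec states, and `to_lance_identifier` is a hypothetical name.

```python
def to_lance_identifier(irc_namespace, table=None):
    """Map an IRC (namespace levels, table) pair to Lance path components.

    The IRC server itself is the root Lance namespace, represented here
    by the empty tuple (an assumption made for illustration).
    """
    levels = tuple(irc_namespace)
    return levels if table is None else levels + (table,)
```

So the IRC server maps to `()`, the IRC namespace `["a", "b"]` maps to the Lance namespace `("a", "b")`, and the table `t` in that namespace maps to `("a", "b", "t")`.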
Merging to proceed with the next step.
No description provided.