feat: add spec for Iceberg namespace #118

Merged: jackye1995 merged 2 commits into lance-format:main from jackye1995:iceberg on Jul 9, 2025
@jackye1995

Collaborator

No description provided.

@github-actions (bot) added the `enhancement` (New feature or request) and `spec` (Restful openapi spec) labels on Jul 8, 2025
1. the [`location`](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L2494) must point to the root location of the Lance table
2. the [`properties`](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L2499) must follow:
    1. there is a key `table_type` set to `lance` (case insensitive)
    2. there is a key `managed_by` set to either `storage` or `impl` (case insensitive). If not set, default to `storage`
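
For illustration, the two property rules above could be checked roughly as follows. This is only a sketch: the helper name `parse_lance_properties` is hypothetical, and it assumes that case insensitivity applies to the property values while the keys are matched exactly.

```python
from typing import Mapping

# Allowed values for `managed_by`, taken from the rule quoted above.
VALID_MANAGED_BY = {"storage", "impl"}


def parse_lance_properties(properties: Mapping[str, str]) -> str:
    """Validate the table properties and return the normalized `managed_by` value.

    Hypothetical helper, not part of any Lance or Iceberg library.
    Assumption: keys are matched exactly, values are compared case-insensitively.
    """
    table_type = properties.get("table_type", "")
    if table_type.lower() != "lance":
        raise ValueError(f"not a Lance table: table_type={table_type!r}")

    # `managed_by` defaults to `storage` when the key is absent.
    managed_by = properties.get("managed_by", "storage").lower()
    if managed_by not in VALID_MANAGED_BY:
        raise ValueError(f"unsupported managed_by value: {managed_by!r}")
    return managed_by
```

A caller would treat a `ValueError` as "this table entry is not a valid Lance table" and otherwise branch on the returned `storage` or `impl` value.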

Contributor

What is this property for?

if the default is storage, should we add an equivalent section to "Requirement for Implementation Managed Table"?

Collaborator Author

yes good point, let me do that

Collaborator Author

Ended up not adding a section since there is no particular requirement I can think of except for the ones in the table definition. Will add one later if there are additional points unique to storage managed tables.

## Requirement for Implementation Managed Table

An update to the implementation managed table must go through IRC [UpdateTable](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L997) API
or [CommitTransaction](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L1336) API

Contributor

Is the plan to support all IRC metadata updates? Currently Lance doesn't have branching, partitioning, etc. Maybe you could add how incompatibilities will be handled.

Collaborator Author

I guess this part is still hypothetical. I think we will need to start with storage managed table, which will only leverage LoadTable and CreateTable, and all the update commits directly go against the storage.

An update to the implementation managed table must go through IRC [UpdateTable](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L997) API
or [CommitTransaction](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L1336) API
with a requirement that the [`assert-ref-snapshot-id`](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L3051) is the current Lance table version.
If the commit fails due to unresolvable concurrent commits, the IRC server must fail with `409 Conflict` according to the IRC spec.
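
As a rough sketch of the requirement above, a client could build the commit body like this. The `assert-ref-snapshot-id` requirement shape (`type`, `ref`, `snapshot-id`) follows the IRC OpenAPI spec; the helper names and the use of `main` as the ref are assumptions for illustration, since the proposal does not say which ref a Lance table would use.

```python
from typing import Any, Dict, List


def build_commit_request(
    lance_version: int,
    updates: List[Dict[str, Any]],
    ref: str = "main",  # assumption: the proposal does not name the ref to assert on
) -> Dict[str, Any]:
    """Build an UpdateTable request body that pins the commit to a Lance version.

    Hypothetical helper. The current Lance table version is passed where
    IRC expects a snapshot ID.
    """
    return {
        "requirements": [
            {
                "type": "assert-ref-snapshot-id",
                "ref": ref,
                "snapshot-id": lance_version,
            }
        ],
        "updates": updates,
    }


def is_conflict(status_code: int) -> bool:
    """Return True when the server reported an unresolvable concurrent commit."""
    return status_code == 409
```

The `updates` list is left to the caller because the discussion does not settle which IRC metadata updates would be supported for Lance tables.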

Contributor

Does Lance support concurrent commits or are all commits serialized using the version number?

Collaborator Author

> Does Lance support concurrent commits or are all commits serialized using the version number?

I think it's both; it's basically the same as Iceberg. Concurrent commits go through a rebase process, and then there is the commit to the next version (equivalent to Iceberg's commit to the catalog). If that fails, it retries the rebase and commit against the new concurrent commit.
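
The rebase-then-commit loop described here can be sketched with a toy in-memory table. This is not Lance's actual implementation: `InMemoryTable`, `commit_with_retry`, and the `rebase` callback are hypothetical names that only illustrate the optimistic retry pattern.

```python
from typing import Callable


class InMemoryTable:
    """Toy stand-in for a table whose commits are serialized by version number."""

    def __init__(self) -> None:
        self.version = 1

    def commit(self, expected_version: int) -> bool:
        """Advance to the next version only if nobody else committed first."""
        if self.version != expected_version:
            return False
        self.version += 1
        return True


def commit_with_retry(
    table: InMemoryTable,
    rebase: Callable[[int], None],
    max_attempts: int = 3,
) -> int:
    """Rebase onto the latest version, try to commit, and retry on conflict."""
    for _ in range(max_attempts):
        base = table.version
        rebase(base)  # re-apply the pending change on top of the latest version
        if table.commit(base):
            return table.version
    # After exhausting retries, this is the case an IRC server would map to 409 Conflict.
    raise RuntimeError("unresolvable concurrent commits")
```

In a real system the version check and the write would have to be a single atomic operation in the storage layer or catalog; the toy class skips that detail.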


1. the [`location`](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L2494) must point to the root location of the Lance table
2. the [`properties`](https://github.com/apache/iceberg/blob/apache-iceberg-1.9.0/open-api/rest-catalog-open-api.yaml#L2499) must follow:
    1. there is a key `table_type` set to `lance` (case insensitive)

Contributor

When a request comes in it seems we would need to know the type before we load the metadata?

Collaborator Author

In my mind the IRC server should be able to manipulate what is returned as the response of LoadTable, for example if it's a Lance table, it can just generate a TableMetadata JSON payload without the need to read any actual Iceberg metadata JSON file, and provide this information in the properties section. Is that feasible?

HMS with Iceberg works like this: it looks up the generic table object and then checks what the `table_type` property is.

this also reminds me of Polaris generic table
https://polaris.apache.org/in-dev/unreleased/generic-table/

> it can just generate a TableMetadata JSON payload without the need to read any actual Iceberg metadata JSON file, and provide this information in the properties section. Is that feasible?

I think it's feasible as long as the reader uses the TableMetadata and does not re-read the metadata.json file (I think Spark does this today).

Collaborator Author

> HMS with Iceberg works like this: it looks up the generic table object and then checks what the `table_type` property is.

Yes exactly, this is derived from the HMS experience.

> I think it's feasible as long as the reader uses the TableMetadata and does not re-read the metadata.json file (I think Spark does this today).

Yes. Maybe I should add a line that the metadata-location must not be set.
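
To make the idea in this thread concrete, here is a minimal sketch of a LoadTable response that a server might synthesize for a Lance table without reading any Iceberg metadata file. The function name is hypothetical, the field set is deliberately minimal, and `format-version: 2` is an arbitrary placeholder; a real server would need to decide how to fill the remaining TableMetadata fields.

```python
from typing import Any, Dict


def synthesize_load_table_result(
    location: str,
    table_uuid: str,
    managed_by: str = "storage",
) -> Dict[str, Any]:
    """Build a LoadTable response body for a Lance table.

    Hypothetical helper. `metadata-location` is intentionally left out, following
    the suggestion that it must not be set, so that readers rely on the returned
    TableMetadata instead of re-reading a metadata.json file.
    """
    return {
        "metadata": {
            "format-version": 2,  # placeholder value, an assumption of this sketch
            "table-uuid": table_uuid,
            "location": location,  # root location of the Lance table
            "properties": {
                "table_type": "lance",
                "managed_by": managed_by,
            },
        },
    }
```

A client could then inspect `metadata.properties.table_type` in the response to decide whether to open the location as a Lance table.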

@kevinjqliu kevinjqliu left a comment

cool!

Comment thread docs/src/index.md
It describes how a metadata service like Apache Hive MetaStore (HMS), Apache Gravitino, Unity Catalog, etc.
should store and use Lance tables, as well as how ML/AI tools and analytics compute engines should integrate with Lance tables.
It describes how a metadata service like Apache Hive MetaStore (HMS), Apache Iceberg REST Catalog (IRC),
Apache Gravitino, Unity Catalog, etc. should store and use Lance tables,

are the "Apache Gravitino, Unity Catalog" integrations via IRC or custom?

Collaborator Author

I think we are considering 2 approaches: (1) through HMS, since that has been the most mature integration route in both Unity and Gravitino, and (2) through REST, for which we will add another REST server in these projects.

See apache/gravitino#7358 for example.

For IRC, I am actually not sure. Maybe we could integrate the Iceberg way, but in both Gravitino and Unity, IRC is purpose-built for Iceberg tables at the moment. It would require some paradigm shift. But if we think this is a good direction, I'm happy to push for that approach in both communities.


## Namespace Mapping

An IRC server can be viewed as the root Lance namespace.

nit: Iceberg tables in IRC are addressed as (namespace, table), where the namespace can be "nested"/"leveled"

it's not clear to me how Lance tables will be presented here

Collaborator Author

I think the mapping is that

  • IRC server: Lance root namespace
  • IRC multi-level namespace: Lance multi-level namespace
  • IRC tables in a namespace: Lance tables in a namespace

Do you think the wording is not clear? Or did I misunderstand the comment?
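
The mapping listed above could be sketched as follows. Both helper names are hypothetical, and representing a Lance table identifier as a list of strings relative to the root namespace is an assumption of this sketch. The unit separator (0x1F) is how the IRC spec joins multi-level namespace parts in a path parameter, as far as I understand it.

```python
from typing import List, Sequence

# IRC joins the levels of a multi-level namespace with the unit separator byte.
UNIT_SEPARATOR = "\x1f"


def parse_irc_namespace(encoded: str) -> List[str]:
    """Split a URL-decoded IRC namespace path parameter into its levels."""
    return encoded.split(UNIT_SEPARATOR) if encoded else []


def to_lance_table_id(irc_namespace: Sequence[str], table_name: str) -> List[str]:
    """Map an IRC (namespace, table) pair to an identifier under the Lance root.

    The IRC server itself is the root Lance namespace, so it contributes no
    level of its own; each IRC namespace level becomes one Lance namespace level.
    """
    if not table_name:
        raise ValueError("table name must not be empty")
    return [*irc_namespace, table_name]
```

For example, a table `embeddings` in the two-level IRC namespace `ml` / `features` would map to the three-part identifier `["ml", "features", "embeddings"]`.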


@jackye1995

Collaborator Author

Merging to proceed with the next step

@jackye1995 jackye1995 merged commit 6f2f3f3 into lance-format:main Jul 9, 2025
9 checks passed
3 participants