Merged
9 changes: 6 additions & 3 deletions docs/mkdocs.yml
@@ -67,11 +67,14 @@ extra:
nav:
- Introduction: index.md
- Spec:
- Text: spec/spec.md
- OpenAPI: https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/docs/src/spec/rest.yaml
- Concepts: spec/concepts.md
- Operations: spec/operations.md
- Implementations: spec/implementations.md
- Tool Integration: spec/tools.md
- OpenAPI: https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/spec/rest.yaml
- Native Implementations:
- Overview: spec/impls/overview.md
- Rest: spec/impls/rest.md
- REST: spec/impls/rest.md
- Directory: spec/impls/dir.md
- Apache Hive MetaStore: spec/impls/hive.md
- Contributing: contributing.md
20 changes: 20 additions & 0 deletions docs/src/index.md
@@ -2,8 +2,28 @@

![logo](./logo/wide.png)

## Lance Namespace Specification

**Lance Namespace Specification** is an open specification on top of the storage-based Lance data format
to standardize access to a collection of Lance tables (a.k.a. Lance datasets).
It describes how a metadata service like Apache Hive MetaStore (HMS), Apache Gravitino, Unity Catalog, etc.
should store and use Lance tables, as well as how ML/AI tools and analytics compute engines should integrate with Lance tables.

## Why _Namespace_ not _Catalog_?

There are many terms used to describe the concept of a container in database systems
— such as _namespace_, _catalog_, _schema_, _database_, _metastore_, and _metalake_.
Among these, namespace and catalog have become the most prominent in modern lakehouse architectures.

The term catalog typically implies a hierarchical structure with at least two levels.
For example, Apache Hive uses a catalog → database → table model,
while Apache Iceberg’s REST catalog adopts a catalog → multi-level namespace → table hierarchy.

In contrast, the ML and AI communities tend to favor a flatter organizational model.
It’s common to store datasets in simple directories
and categorize them using flexible systems like tagging, rather than rigid hierarchies.

To better support this usage pattern, Lance adopts the term **_namespace_** to represent all container concepts
— including what would traditionally be called a catalog.
With the **Lance Namespace Specification**, we provide a flexible, multi-level namespace abstraction
that allows users to structure and manage Lance datasets in ways that best align with their data organization strategies.
79 changes: 79 additions & 0 deletions docs/src/spec/concepts.md
@@ -0,0 +1,79 @@
# Namespace Concepts

## Namespace Definition

A Lance namespace is a centralized repository for discovering, organizing, and managing Lance tables.
It can contain either a collection of tables or, recursively, a collection of Lance namespaces.
It is designed to encapsulate concepts such as namespace, metastore, database, and schema
that frequently appear in similar data systems, allowing easy integration with systems that use any type of object hierarchy.

Here is an example layout of a Lance namespace:

![Lance namespace layout](layout.png)

## Parent & Child

We use the terms **parent** and **child** to describe the relationship between two objects.
If namespace A directly contains B, then A is the parent namespace of B, i.e. B is a child of A.
For example:

- Namespace `ns1` contains a **child namespace** `ns4`, i.e. `ns1` is the **parent namespace** of `ns4`.
- Namespace `ns2` contains a **child table** `t2`, i.e. `t2` belongs to **parent namespace** `ns2`.

## Root Namespace

A root namespace is a namespace that has no parent.
The root namespace is assumed to always exist and is ready to be connected to by a tool to explore objects in the namespace.
The lifecycle management (e.g. creation, deletion) of the root namespace is out of scope of this specification.

## Object Name

The **name** of an object is a string that uniquely identifies the object within the parent namespace it belongs to.
The name of any object must be unique among all other objects that share the same parent namespace.
For example:

- `cat2`, `cat3` and `cat4` are all unique names under the root namespace
- `t3` and `t4` are both unique names under `cat4`

## Object Identifier

The **identifier** of an object uniquely identifies the object within the root namespace it belongs to.
The identifier of any object must be unique among all other objects that share the same root namespace.

Based on the uniqueness property of an object name within its parent namespace,
an object identifier is the list of object names starting from (not including) the root namespace to (including) the object itself.
This is also called a **list identifier**.
For example:

- the list identifier of `cat5` is `[cat2, cat5]`
- the list identifier of `t1` is `[cat2, cat5, t1]`

The dot (`.`) symbol is typically used as the delimiter to join all the names to form a **string identifier**,
but another symbol can be used if a dot appears in an object name.
For example:

- the string identifier of `cat5` is `cat2.cat5`
- the string identifier of `t1` is `cat2.cat5.t1`
- the string identifier of `t3` is `cat4$t3` when using delimiter `$`
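
The conversion between the two identifier forms can be sketched as follows. This is a minimal illustration, not part of the specification: the function names `to_string_identifier` and `to_list_identifier` are hypothetical, and rejecting a name that contains the delimiter is an assumed way of handling the "dot is used in the object name" case.

```python
from typing import List, Optional


def to_string_identifier(names: Optional[List[str]], delimiter: str = ".") -> str:
    """Join a list identifier into a string identifier.

    An empty or null list is treated as the root namespace and maps to "".
    """
    if not names:
        return ""
    for name in names:
        # Assumption: a name containing the delimiter is rejected so that the
        # string form stays unambiguous; the caller should pick another delimiter.
        if delimiter in name:
            raise ValueError(
                f"name {name!r} contains delimiter {delimiter!r}; use another delimiter"
            )
    return delimiter.join(names)


def to_list_identifier(identifier: Optional[str], delimiter: str = ".") -> List[str]:
    """Split a string identifier back into a list identifier."""
    if not identifier:
        return []
    return identifier.split(delimiter)
```

For instance, `to_string_identifier(["cat4", "t3"], delimiter="$")` produces `cat4$t3`, matching the last example above.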

## Name and Identifier for Root Namespace

The root namespace itself has no name or identifier.
When represented in code, its name and string identifier are represented by an empty or null string,
and its list identifier is represented by an empty or null list.

The actual name and identifier of the root namespace are typically
assigned by users through some configuration when used in a tool.
For example, a root namespace can be called `cat1` in Ray, but called `cat2` in Apache Spark,
and they are both configured to connect to the same root namespace.

## Namespace Level

If every table has the same number of namespaces all the way to the root namespace,
the namespace is called **leveled**. The [example above](#namespace-definition) is not leveled
because `t1` has 2 namespaces `ns1` and `ns4` before root, whereas `t2` has 1 namespace `ns2` before root.

For a leveled namespace, the number of namespaces up to and including the root for any table
is referred to as the **number of levels**.
For example, a [directory namespace](../impls/dir) is a 1-level namespace,
and a [Hive 2.x namespace](../impls/hive) is a 2-level namespace.
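
One way to check whether a namespace is leveled is to compare the lengths of the list identifiers of its tables. A table's list identifier excludes the root but includes the table itself, so its length equals the number of namespaces up to and including the root. The sketch below relies on that observation; the function name `number_of_levels` is hypothetical.

```python
from typing import Iterable, List, Optional


def number_of_levels(table_identifiers: Iterable[List[str]]) -> Optional[int]:
    """Return the number of levels if the namespace is leveled, else None.

    Each element is the list identifier of one table. With no tables at all
    the level is undetermined, so None is returned in that case too.
    """
    depths = {len(identifier) for identifier in table_identifiers}
    if len(depths) != 1:
        return None
    return depths.pop()
```

With the identifiers `[ns1, ns4, t1]` and `[ns2, t2]` from the example layout, the function returns `None` because the two tables sit at different depths.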
41 changes: 41 additions & 0 deletions docs/src/spec/implementations.md
@@ -0,0 +1,41 @@
# Namespace Implementations

A **Lance Namespace Implementation** is an implementation of the Lance namespace specification,
more specifically:

1. It satisfies all the Lance namespace definitions and concepts.
2. It declares and implements a list of supported Lance namespace operations.

## Implementation and Storage

Except for any storage-only implementation (e.g. [directory namespace](../impls/dir)),
a Lance table exists both in the storage and the implementation.
For example, a Lance table exists both in HMS and storage for the [Hive namespace](../impls/hive).
There are two possible ways to manage a Lance table under such a setting.
A Lance namespace implementation can choose to support one or both:

### Implementation Managed Table

An implementation managed Lance table is a table that is fully managed by the Lance namespace implementation.
The implementation must maintain information about the latest version of the Lance table.
Any modifications to the table must happen through the implementation.
If a user directly modifies the underlying table in the storage bypassing the implementation,
the implementation must not reflect the changes in the table to the namespace users.

This mode ensures the namespace service is aware of all activities in the table,
and can thus fully enforce any governance and management features for the table.

### Storage Managed Table

A storage managed Lance table is a table that is fully managed by the storage
with a metadata definition in the Lance namespace implementation.
The implementation only contains information about the table directory location.
It is expected that a tool finds the latest version of the Lance table based on the contents
in the table directory according to the Lance format specification.
A modification to the table can happen either directly against the storage,
or happen as a request to the implementation, where the implementation is responsible for applying the corresponding
change to the underlying storage according to the Lance format specification.

This mode is more flexible for real-world ML/AI workflows,
but the implementation loses full visibility and control over the actions performed against the table,
so it is harder to enforce governance and management features for storage managed tables.
34 changes: 30 additions & 4 deletions docs/src/spec/impls/hive.md
@@ -1,6 +1,32 @@
# Lance HMS Namespace
# Lance Hive Namespace

Lance HMS Namespace directly integrates with HMS to offer a 2-level Lance namespace experience.
The root namespace maps to the entire HMS, which has HMS databases as child namespaces.
**Lance Hive Namespace** is an implementation using Apache Hive MetaStore (HMS).
For more details about HMS, please read [HMS AdminManual 2.x](https://hive.apache.org/docs/latest/adminmanual-metastore-administration_27362076/)
and [HMS AdminManual 3.x](https://hive.apache.org/docs/latest/adminmanual-metastore-3-0-administration_75978150/).

TODO: add more information after implementation is officially added.
## Namespace Mapping

An HMS server can be viewed as the root Lance namespace.

For HMS 2.x and below, a database in HMS maps to the first level Lance namespace
to form a 2-level Lance namespace as a whole.

For HMS 3.x and above, a catalog in HMS maps to the first level Lance namespace,
and a database in HMS maps to the second level Lance namespace
to form a 3-level Lance namespace as a whole.

## Table Definition

A Lance table should appear as a [Table object](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L631)
in HMS with the following requirements:

1. the [tableType](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L643) must be set as `EXTERNAL_TABLE` to indicate this is not a managed Hive table
2. the [parameters](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L640) must follow:
    1. there is a key `table_type` set to `lance` (case insensitive)
    2. there is a key `managed_by` set to either `storage` or `impl` (case insensitive). If not set, default to `storage`
    3. there is a key `version` set to the latest numeric version number of the table. This field will only be respected if `managed_by=impl`
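
The requirements above can be checked with a small validation routine. The sketch below operates on plain Python values rather than on real HMS Thrift objects; the names `LanceTableInfo` and `parse_lance_parameters` are hypothetical. It assumes that "case insensitive" applies to the parameter values (keys are matched exactly) and that a missing `version` is an error when `managed_by=impl`; neither point is spelled out in the text.

```python
from typing import Mapping, NamedTuple, Optional


class LanceTableInfo(NamedTuple):
    managed_by: str          # "storage" or "impl"
    version: Optional[int]   # only set when managed_by == "impl"


def parse_lance_parameters(table_type: str, parameters: Mapping[str, str]) -> LanceTableInfo:
    """Validate an HMS table definition against the Lance table requirements."""
    if table_type != "EXTERNAL_TABLE":
        raise ValueError("tableType must be EXTERNAL_TABLE")
    if parameters.get("table_type", "").lower() != "lance":
        raise ValueError("parameter table_type must be set to lance")

    # managed_by defaults to storage when the key is absent.
    managed_by = parameters.get("managed_by", "storage").lower()
    if managed_by not in ("storage", "impl"):
        raise ValueError("parameter managed_by must be storage or impl")

    version = None
    if managed_by == "impl":
        # Assumption: version is mandatory for implementation managed tables.
        raw_version = parameters.get("version")
        if raw_version is None:
            raise ValueError("parameter version is required when managed_by=impl")
        version = int(raw_version)
    # For storage managed tables any version value is ignored.
    return LanceTableInfo(managed_by, version)
```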

## Requirement for Implementation Managed Table

An update to an implementation managed table must use Hive's atomic update feature ([HIVE-26882](https://issues.apache.org/jira/browse/HIVE-26882))
and use the `version` parameter value to perform a conditional update through [alter_table_with_environment_context](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L2733).
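
The logic of a version-conditioned update can be illustrated with an in-memory stand-in. This sketch does not call HMS and is not how the check is implemented there: in a real deployment the comparison and the write are performed atomically on the server side. The names `VersionConflict` and `conditional_update` are hypothetical.

```python
from typing import Dict


class VersionConflict(Exception):
    """Raised when the stored version differs from the expected one."""


def conditional_update(parameters: Dict[str, str], expected_version: int, new_version: int) -> None:
    """Advance `version` only if it still equals `expected_version`.

    `parameters` stands in for the HMS table parameters map. This in-memory
    check is not atomic by itself; it only shows the compare-then-write rule.
    """
    current = int(parameters["version"])
    if current != expected_version:
        raise VersionConflict(
            f"expected version {expected_version}, found {current}"
        )
    parameters["version"] = str(new_version)
```

A writer that reads version 3 and commits version 4 succeeds only if no other writer has advanced the table in between; otherwise it receives `VersionConflict` and would typically reload the table and retry.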
2 changes: 1 addition & 1 deletion docs/src/spec/impls/overview.md
@@ -1,5 +1,5 @@
## Native Implementations

A native Lance namespace implementation is a [Lance Namespace implementation](../spec.md#namespace-implementations)
A native Lance namespace implementation is a [Lance Namespace implementation](../../../spec/implementations)
that is maintained in this `lance-namespace` repository.
Any implementation outside this repository is considered a third-party implementation.
2 changes: 1 addition & 1 deletion docs/src/spec/impls/rest.md
@@ -4,7 +4,7 @@ In an enterprise environment, typically there is a requirement to store tables i
for more advanced governance features around access control, auditing, lineage tracking, etc.
**Lance REST Namespace** is an OpenAPI protocol that enables reading, writing and managing Lance tables
by connecting to those metadata services or building a custom metadata server in a standardized way.
The REST server definition can be found in the OpenAPI specification.
The REST server definition can be found in the [OpenAPI specification](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/spec/rest.yaml).

## REST Routes

35 changes: 35 additions & 0 deletions docs/src/spec/operations.md
@@ -0,0 +1,35 @@
# Namespace Operations

The Lance Namespace Specification defines a list of operations that can be performed against any Lance namespace:

| Operation ID | Description |
|---------------------|--------------------------------------------------------------------------------------------------------------------|
| ListOperations | List the operations that are supported by this Lance namespace |
| ListNamespaces | List the names of child namespaces in the parent namespace, or root namespace if parent namespace is not specified |
| NamespaceExists | Check if a namespace exists |
| DescribeNamespace | Describe information of a namespace |
| CreateNamespace | Create a new namespace under a parent namespace, or root namespace if parent namespace is not specified |
| AlterNamespace | Alter information of a namespace |
| DropNamespace | Drop a namespace from its parent namespace, or root namespace if parent namespace is not specified |
| ListTables | List the names of tables in a namespace |
| TableExists | Check if a table exists |
| DescribeTable | Describe information of a Lance table in the namespace |
| CreateTable | Create a new Lance table under a namespace |
| RegisterTable | Register an existing table at a given storage location to a namespace |
| AlterTable | Alter information of a Lance table |
| DropTable | Drop a table from its namespace |
| DeregisterTable | Deregister a table from its namespace, table content is kept unchanged in storage |
| DescribeTransaction | Describe information of a transaction |
| AlterTransaction | Alter information of a transaction |

## Operation Versioning

There is no versioning concept within an operation. When a backwards-incompatible change is introduced,
a new operation needs to be created, with a naming convention of `<operationId>V<version>`,
for example `ListNamespacesV2`, `DescribeTableV3`, etc.
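
The naming convention can be expressed as a pair of helpers. This is a sketch only; the function names are hypothetical, and it assumes that the original, unsuffixed operation counts as version 1, which the text implies but does not state.

```python
import re
from typing import Tuple

# Matches "<operationId>V<version>", e.g. "ListNamespacesV2".
_VERSIONED_ID = re.compile(r"^(?P<base>.+?)V(?P<version>[1-9][0-9]*)$")


def versioned_operation_id(operation_id: str, version: int) -> str:
    """Build the operation ID for a given version of an operation."""
    if version < 1:
        raise ValueError("version must be >= 1")
    # Assumption: version 1 is the original operation and carries no suffix.
    if version == 1:
        return operation_id
    return f"{operation_id}V{version}"


def parse_operation_id(operation_id: str) -> Tuple[str, int]:
    """Split an operation ID into its base ID and version."""
    match = _VERSIONED_ID.match(operation_id)
    if match is None:
        return operation_id, 1
    return match.group("base"), int(match.group("version"))
```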

## Operation Request and Response Schema

Each operation has a request and response.
The request and response schemas are defined using JSON Schema in the
`components/schemas` section of the [OpenAPI specification](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/spec/rest.yaml).