Commit e81eb3d

feat: fill missing Hive definition and reorganize full text spec (#115)
Fill in Hive namespace implementation spec with proper definition of how a Lance table should be stored. Also reorganize the original text spec into pieces for better readability.
1 parent 3575873 commit e81eb3d

10 files changed

Lines changed: 235 additions & 184 deletions

File tree

docs/mkdocs.yml

Lines changed: 6 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -67,11 +67,14 @@ extra:
6767
nav:
6868
- Introduction: index.md
6969
- Spec:
70-
- Text: spec/spec.md
71-
- OpenAPI: https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/docs/src/spec/rest.yaml
70+
- Concepts: spec/concepts.md
71+
- Operations: spec/operations.md
72+
- Implementations: spec/implementations.md
73+
- Tool Integration: spec/tools.md
74+
- OpenAPI: https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/spec/rest.yaml
7275
- Native Implementations:
7376
- Overview: spec/impls/overview.md
74-
- Rest: spec/impls/rest.md
77+
- REST: spec/impls/rest.md
7578
- Directory: spec/impls/dir.md
7679
- Apache Hive MetaStore: spec/impls/hive.md
7780
- Contributing: contributing.md

docs/src/index.md

Lines changed: 20 additions & 0 deletions
@@ -2,8 +2,28 @@
22

33
![logo](./logo/wide.png)
44

5+
## Lance Namespace Specification
6+
57
**Lance Namespace Specification** is an open specification on top of the storage-based Lance data format
68
to standardize access to a collection of Lance tables (a.k.a. Lance datasets).
79
It describes how a metadata service like Apache Hive MetaStore (HMS), Apache Gravitino, Unity Catalog, etc.
810
should store and use Lance tables, as well as how ML/AI tools and analytics compute engines should integrate with Lance tables.
911

12+
## Why _Namespace_ not _Catalog_?
13+
14+
There are many terms used to describe the concept of a container in database systems
15+
— such as _namespace_, _catalog_, _schema_, _database_, _metastore_, and _metalake_.
16+
Among these, namespace and catalog have become the most prominent in modern lakehouse architectures.
17+
18+
The term catalog typically implies a hierarchical structure with at least two levels.
19+
For example, Apache Hive uses a catalog → database → table model,
20+
while Apache Iceberg’s REST catalog adopts a catalog → multi-level namespace → table hierarchy.
21+
22+
In contrast, the ML and AI communities tend to favor a flatter organizational model.
23+
It’s common to store datasets in simple directories
24+
and categorize them using flexible systems like tagging, rather than rigid hierarchies.
25+
26+
To better support this usage pattern, Lance adopts the term **_namespace_** to represent all container concepts
27+
— including what would traditionally be called a catalog.
28+
With the **Lance Namespace Specification**, we provide a flexible, multi-level namespace abstraction
29+
that allows users to structure and manage Lance datasets in ways that best align with their data organization strategies.

docs/src/spec/concepts.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
1+
# Namespace Concepts
2+
3+
## Namespace Definition
4+
5+
A Lance namespace is a centralized repository for discovering, organizing, and managing Lance tables.
6+
It can either contain a collection of tables, or a collection of Lance namespaces recursively.
7+
It is designed to encapsulate concepts including namespace, metastore, database, schema, etc.
8+
that frequently appear in other similar data systems to allow easy integration with any system of any type of object hierarchy.
9+
10+
Here is an example layout of a Lance namespace:
11+
12+
![Lance namespace layout](layout.png)
13+
14+
## Parent & Child
15+
16+
We use the terms **parent** and **child** to describe the relationship between two objects.
17+
If namespace A directly contains B, then A is the parent namespace of B, i.e. B is a child of A.
18+
For example:
19+
20+
- Namespace `ns1` contains a **child namespace** `ns4`, i.e. `ns1` is the **parent namespace** of `ns4`.
21+
- Namespace `ns2` contains a **child table** `t2`, i.e. `t2` belongs to **parent namespace** `ns2`.
22+
23+
## Root Namespace
24+
25+
A root namespace is a namespace that has no parent.
26+
The root namespace is assumed to always exist and is ready to be connected to by a tool to explore objects in the namespace.
27+
The lifecycle management (e.g. creation, deletion) of the root namespace is out of scope of this specification.
28+
29+
## Object Name
30+
31+
The **name** of an object is a string that uniquely identifies the object within the parent namespace it belongs to.
32+
The name of any object must be unique among all other objects that share the same parent namespace.
33+
For example:
34+
35+
- `cat2`, `cat3` and `cat4` are all unique names under the root namespace
36+
- `t3` and `t4` are both unique names under `cat4`
37+
38+
## Object Identifier
39+
40+
The **identifier** of an object uniquely identifies the object within the root namespace it belongs to.
41+
The identifier of any object must be unique among all other objects that share the same root namespace.
42+
43+
Based on the uniqueness property of an object name within its parent namespace,
44+
an object identifier is the list of object names starting from (not including) the root namespace to (including) the object itself.
45+
This is also called a **list identifier**.
46+
For example:
47+
48+
- the list identifier of `cat5` is `[cat2, cat5]`
49+
- the list identifier of `t1` is `[cat2, cat5, t1]`
50+
51+
The dot (`.`) symbol is typically used as the delimiter to join all the names to form a **string identifier**,
52+
but other symbols could also be used if dot is used in the object name.
53+
For example:
54+
55+
- the string identifier of `cat5` is `cat2.cat5`
56+
- the string identifier of `t1` is `cat2.cat5.t1`
57+
- the string identifier of `t3` is `cat4$t3` when using delimiter `$`
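
The conversion between list and string identifiers described above can be sketched in a few lines of Python. This is only an illustration: the function names `to_string_identifier` and `to_list_identifier` are hypothetical and not part of the specification, and rejecting a name that contains the delimiter is an assumption made here to keep the conversion reversible. The root namespace is represented by an empty list or empty string.

```python
from typing import List, Optional


def to_string_identifier(names: Optional[List[str]], delimiter: str = ".") -> str:
    """Join a list identifier into a string identifier.

    An empty or null list stands for the root namespace and maps to "".
    """
    if not names:
        return ""
    for name in names:
        # Assumption of this sketch: a name containing the delimiter is
        # rejected, so the caller must pick another delimiter (e.g. "$").
        if delimiter in name:
            raise ValueError(f"name {name!r} contains delimiter {delimiter!r}")
    return delimiter.join(names)


def to_list_identifier(identifier: Optional[str], delimiter: str = ".") -> List[str]:
    """Split a string identifier back into a list identifier.

    An empty or null string stands for the root namespace and maps to [].
    """
    if not identifier:
        return []
    return identifier.split(delimiter)
```

For instance, `to_string_identifier(["cat4", "t3"], "$")` yields `cat4$t3`, matching the last bullet above.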
58+
59+
## Name and Identifier for Root Namespace
60+
61+
The root namespace itself has no name or identifier.
62+
When represented in code, its name and string identifier are represented by an empty or null string,
63+
and its list identifier is represented by an empty or null list.
64+
65+
The actual name and identifier of the root namespace are typically
66+
assigned by users through some configuration when used in a tool.
67+
For example, a root namespace can be called `cat1` in Ray, but called `cat2` in Apache Spark,
68+
and they are both configured to connect to the same root namespace.
69+
70+
## Namespace Level
71+
72+
If every table has the same number of namespaces all the way to the root namespace,
73+
the namespace is called **leveled**. The [example above](#namespace-definition) is not leveled
74+
because `t1` has 2 namespaces `ns1` and `ns4` before root, whereas `t2` has 1 namespace `ns2` before root.
75+
76+
For a leveled namespace, the number of namespaces up to and including the root for any table
77+
is referred to as the **number of levels**.
78+
For example, a [directory namespace](../impls/dir) is a 1-level namespace,
79+
and a [Hive 2.x namespace](../impls/hive) is a 2-level namespace.
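
As a rough illustration of the leveled definition, the check below works on the list identifiers of all tables in a namespace. The helper name `number_of_levels` is hypothetical. It relies on the observation that a table's list identifier has one entry per namespace below the root plus the table itself, so its length equals the number of namespaces up to and including the root.

```python
from typing import Iterable, List, Optional


def number_of_levels(table_identifiers: Iterable[List[str]]) -> Optional[int]:
    """Return the number of levels if the namespace is leveled, else None.

    Each element is the list identifier of one table, e.g. ["db1", "t1"].
    An empty input also returns None, since no level can be inferred.
    """
    # Collect the distinct identifier lengths across all tables.
    depths = {len(identifier) for identifier in table_identifiers}
    if len(depths) != 1:
        return None
    return depths.pop()
```

With this sketch, tables `[ns1, ns4, t1]` and `[ns2, t2]` give `None` (not leveled), while a namespace whose tables all look like `[db, table]` gives 2.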

docs/src/spec/implementations.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
1+
# Namespace Implementations
2+
3+
A **Lance Namespace Implementation** is an implementation of the Lance namespace specification,
4+
more specifically:
5+
6+
1. It satisfies all the Lance namespace definitions and concepts.
7+
2. It declares and implements a list of supported Lance namespace operations.
8+
9+
## Implementation and Storage
10+
11+
Except for any storage-only implementation (e.g. [directory namespace](../impls/dir)),
12+
a Lance table exists both in the storage and the implementation.
13+
For example, a Lance table exists both in HMS and storage for the [Hive namespace](../impls/hive).
14+
There are 2 possible ways to manage a Lance table under such a setting.
15+
A Lance namespace implementation can choose to support one or both:
16+
17+
### Implementation Managed Table
18+
19+
An implementation managed Lance table is a table that is fully managed by the Lance namespace implementation.
20+
The implementation must maintain information about the latest version of the Lance table.
21+
Any modifications to the table must happen through the implementation.
22+
If a user directly modifies the underlying table in the storage bypassing the implementation,
23+
the implementation must not reflect the changes in the table to the namespace users.
24+
25+
This mode ensures the namespace service is aware of all activities in the table,
26+
and can thus fully enforce any governance and management features for the table.
27+
28+
### Storage Managed Table
29+
30+
A storage managed Lance table is a table that is fully managed by the storage
31+
with a metadata definition in the Lance namespace implementation.
32+
The implementation only contains information about the table directory location.
33+
It is expected that a tool finds the latest version of the Lance table based on the contents
34+
in the table directory according to the Lance format specification.
35+
A modification to the table can happen either directly against the storage,
36+
or happen as a request to the implementation, where the implementation is responsible for applying the corresponding
37+
change to the underlying storage according to the Lance format specification.
38+
39+
This mode is more flexible for real-world ML/AI workflows,
40+
but the implementation loses full visibility and control over the actions performed against the table,
41+
so it will be harder to enforce any governance and management features for storage managed tables.

docs/src/spec/impls/hive.md

Lines changed: 30 additions & 4 deletions
@@ -1,6 +1,32 @@
1-
# Lance HMS Namespace
1+
# Lance Hive Namespace
22

3-
Lance HMS Namespace directly integrates with HMS to offer a 2-level Lance namespace experience.
4-
The root namespace maps to the entire HMS, which has HMS databases as child namespaces.
3+
**Lance Hive Namespace** is an implementation using Apache Hive MetaStore (HMS).
4+
For more details about HMS, please read [HMS AdminManual 2.x](https://hive.apache.org/docs/latest/adminmanual-metastore-administration_27362076/)
5+
and [HMS AdminManual 3.x](https://hive.apache.org/docs/latest/adminmanual-metastore-3-0-administration_75978150/).
56

6-
TODO: add more information after implementation is officially added.
7+
## Namespace Mapping
8+
9+
An HMS server can be viewed as the root Lance namespace.
10+
11+
For HMS 2.x and below, a database in HMS maps to the first level Lance namespace
12+
to form a 2-level Lance namespace as a whole.
13+
14+
For HMS 3.x and above, a catalog in HMS maps to the first level Lance namespace,
15+
and a database in HMS maps to the second level Lance namespace
16+
to form a 3-level Lance namespace as a whole.
17+
18+
## Table Definition
19+
20+
A Lance table should appear as a [Table object](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L631)
21+
in HMS with the following requirements:
22+
23+
1. the [tableType](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L643) must be set as `EXTERNAL_TABLE` to indicate this is not a managed Hive table
24+
2. the [parameters](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L640) must follow:
25+
    1. there is a key `table_type` set to `lance` (case insensitive)
26+
    2. there is a key `managed_by` set to either `storage` or `impl` (case insensitive). If not set, it defaults to `storage`
27+
    3. there is a key `version` set to the latest numeric version number of the table. This field is only respected if `managed_by=impl`
28+
29+
## Requirement for Implementation Managed Table
30+
31+
An update to the implementation managed table must use Hive's atomic update feature ([HIVE-26882](https://issues.apache.org/jira/browse/HIVE-26882))
32+
and use the `version` parameter value to perform a conditional update through [alter_table_with_environment_context](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L2733).
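
To make the table definition rules concrete, here is a minimal sketch of how a tool might interpret an HMS table's `tableType` and `parameters`. It works on plain Python values rather than a real HMS client, and the names `LanceTableInfo` and `parse_lance_table` are hypothetical. Two choices are assumptions of this sketch rather than statements of the spec: parameter keys are matched exactly (only values are compared case-insensitively), and a missing `version` on an `impl` managed table is treated as an error.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class LanceTableInfo:
    managed_by: str          # "storage" or "impl"
    version: Optional[int]   # only populated when managed_by == "impl"


def parse_lance_table(table_type: str, parameters: Mapping[str, str]) -> Optional[LanceTableInfo]:
    """Return LanceTableInfo if the HMS table looks like a Lance table, else None."""
    # Requirement 1: the table must be an external table in HMS.
    if table_type != "EXTERNAL_TABLE":
        return None
    # Requirement 2.1: table_type=lance, compared case-insensitively.
    if parameters.get("table_type", "").lower() != "lance":
        return None
    # Requirement 2.2: managed_by is "storage" or "impl", defaulting to "storage".
    managed_by = parameters.get("managed_by", "storage").lower()
    if managed_by not in ("storage", "impl"):
        raise ValueError(f"unsupported managed_by value: {managed_by!r}")
    # Requirement 2.3: version is only respected for implementation managed tables.
    version = None
    if managed_by == "impl":
        raw_version = parameters.get("version")
        if raw_version is None:
            # Assumption of this sketch: treat a missing version as an error.
            raise ValueError("implementation managed table has no version parameter")
        version = int(raw_version)
    return LanceTableInfo(managed_by=managed_by, version=version)
```

For a storage managed table the `version` parameter is ignored, so a tool would instead resolve the latest version from the table directory.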

docs/src/spec/impls/overview.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
## Native Implementations
22

3-
A native Lance namespace implementation is a [Lance Namespace implementation](../spec.md#namespace-implementations)
3+
A native Lance namespace implementation is a [Lance Namespace implementation](../../../spec/implementations)
44
that is maintained in this `lance-namespace` repository.
55
Any implementation that is outside the repository is considered as a third-party implementation.

docs/src/spec/impls/rest.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ In an enterprise environment, typically there is a requirement to store tables i
44
for more advanced governance features around access control, auditing, lineage tracking, etc.
55
**Lance REST Namespace** is an OpenAPI protocol that enables reading, writing and managing Lance tables
66
by connecting those metadata services or building a custom metadata server in a standardized way.
7-
The REST server definition can be found in the OpenAPI specification.
7+
The REST server definition can be found in the [OpenAPI specification](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/spec/rest.yaml).
88

99
## REST Routes
1010

docs/src/spec/operations.md

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
# Namespace Operations
2+
3+
The Lance Namespace Specification defines a list of operations that can be performed against any Lance namespace:
4+
5+
| Operation ID | Description |
6+
|---------------------|--------------------------------------------------------------------------------------------------------------------|
7+
| ListOperations | List the operations that are supported by this Lance namespace |
8+
| ListNamespaces | List the names of child namespaces in the parent namespace, or root namespace if parent namespace is not specified |
9+
| NamespaceExists | Check if a namespace exists |
10+
| DescribeNamespace | Describe information of a namespace |
11+
| CreateNamespace | Create a new namespace under a parent namespace, or root namespace if parent namespace is not specified |
12+
| AlterNamespace | Alter information of a namespace |
13+
| DropNamespace       | Drop a namespace from its parent namespace, or root namespace if parent namespace is not specified                |
14+
| ListTables | List the names of tables in a namespace |
15+
| TableExists | Check if a table exists |
16+
| DescribeTable | Describe information of a Lance table in the namespace |
17+
| CreateTable | Create a new Lance table under a namespace |
18+
| RegisterTable | Register an existing table at a given storage location to a namespace |
19+
| AlterTable | Alter information of a Lance table |
20+
| DropTable | Drop a table from its namespace |
21+
| DeregisterTable | Deregister a table from its namespace, table content is kept unchanged in storage |
22+
| DescribeTransaction | Describe information of a transaction |
23+
| AlterTransaction | Alter information of a transaction |
24+
25+
## Operation Versioning
26+
27+
There is no versioning concept within an operation. When a backwards-incompatible change is introduced,
28+
a new operation needs to be created, with a naming convention of `<operationId>V<version>`,
29+
for example `ListNamespacesV2`, `DescribeTableV3`, etc.
30+
31+
## Operation Request and Response Schema
32+
33+
Each operation has a request and response.
34+
The request and response schema is defined using JSON schema in the
35+
`components/schemas` section of the [OpenAPI specification](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/spec/rest.yaml).
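
The `<operationId>V<version>` naming convention from the Operation Versioning section can be sketched as a pair of helpers. The function names are hypothetical, and treating an unsuffixed operation ID as version 1 is an assumption of this sketch; the specification only gives examples such as `ListNamespacesV2` and `DescribeTableV3`.

```python
import re
from typing import Tuple

# Base operation ID (letters only), optionally followed by "V<number>".
_OPERATION_ID = re.compile(r"([A-Za-z]+?)(?:V([0-9]+))?")


def versioned_operation_id(operation_id: str, version: int) -> str:
    """Build the operation ID for a given version of an operation."""
    if version < 1:
        raise ValueError("version must be >= 1")
    # Assumption: the original operation carries no suffix and counts as version 1.
    return operation_id if version == 1 else f"{operation_id}V{version}"


def parse_operation_id(name: str) -> Tuple[str, int]:
    """Split an operation ID into its base ID and version number."""
    match = _OPERATION_ID.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid operation ID: {name!r}")
    base, version = match.group(1), match.group(2)
    return base, int(version) if version is not None else 1
```

A client could use `parse_operation_id` on the result of `ListOperations` to pick the newest version of each operation that it understands.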
