From ecdd665a93d854f551d2374d71d374fe00087992 Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Fri, 4 Jul 2025 22:41:36 -0700 Subject: [PATCH 1/6] feat: fill missing Hive definition and reorganize full text spec --- docs/mkdocs.yml | 9 +- docs/src/spec/concepts.md | 79 ++++++++++++++ docs/src/spec/implementations.md | 41 ++++++++ docs/src/spec/impls/hive.md | 28 ++++- docs/src/spec/impls/overview.md | 2 +- docs/src/spec/impls/rest.md | 2 +- docs/src/spec/operations.md | 35 +++++++ docs/src/spec/spec.md | 175 ------------------------------- docs/src/spec/tools.md | 22 ++++ 9 files changed, 209 insertions(+), 184 deletions(-) create mode 100644 docs/src/spec/concepts.md create mode 100644 docs/src/spec/implementations.md create mode 100644 docs/src/spec/operations.md delete mode 100644 docs/src/spec/spec.md create mode 100644 docs/src/spec/tools.md diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 923659625..d2a3637c0 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -67,11 +67,14 @@ extra: nav: - Introduction: index.md - Spec: - - Text: spec/spec.md - - OpenAPI: https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/docs/src/spec/rest.yaml + - Concepts: spec/concepts.md + - Operations: spec/operations.md + - Implementations: spec/implementations.md + - Tool Integration: spec/tools.md + - OpenAPI: https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/spec/rest.yaml - Native Implementations: - Overview: spec/impls/overview.md - - Rest: spec/impls/rest.md + - REST: spec/impls/rest.md - Directory: spec/impls/dir.md - Apache Hive MetaStore: spec/impls/hive.md - Contributing: contributing.md diff --git a/docs/src/spec/concepts.md b/docs/src/spec/concepts.md new file mode 100644 index 000000000..126e872c7 --- /dev/null +++ b/docs/src/spec/concepts.md @@ -0,0 +1,79 @@ +# Namespace Concepts + +## Namespace Definition + +A Lance namespace is a 
centralized repository for discovering, organizing, and managing Lance tables. +It can either contain a collection of tables, or a collection of Lance namespaces recursively. +It is designed to encapsulate concepts including namespace, metastore, database, schema, etc. +that frequently appear in other similar data systems to allow easy integration with any system of any type of object hierarchy. + +Here is an example layout of a Lance namespace: + +![Lance namespace layout](layout.png) + +## Parent & Child + +We use the terms **parent** and **child** to describe the relationship between two objects. +If namespace A directly contains B, then A is the parent namespace of B, i.e. B is a child of A. +For example: + +- Namespace `ns1` contains a **child namespace** `ns4`, i.e. `ns1` is the **parent namespace** of `ns4`. +- Namespace `ns2` contains a **child table** `t2`, i.e. `t2` belongs to **parent namespace** `ns2`. + +## Root Namespace + +A root namespace is a namespace that has no parent. +The root namespace is assumed to always exist and is ready to be connected to by a tool to explore objects in the namespace. +The lifecycle management (e.g. creation, deletion) of the root namespace is out of scope of this specification. + +## Object Name + +The **name** of an object is a string that uniquely identifies the object within the parent namespace it belongs to. +The name of any object must be unique among all other objects that share the same parent namespace. +For example: + +- `cat2`, `cat3` and `cat4` are all unique names under the root namespace +- `t3` and `t4` are both unique names under `cat4` + +## Object Identifier + +The **identifier** of an object uniquely identifies the object within the root namespace it belongs to. +The identifier of any object must be unique among all other objects that share the same root namespace. 
+ +Based on the uniqueness property of an object name within its parent namespace, +an object identifier is the list of object names starting from (not including) the root namespace to (including) the object itself. +This is also called a **list identifier**. +For example: + +- the list identifier of `cat5` is `[cat2, cat5]` +- the list identifier of `t1` is `[cat2, cat5, t1]` + +The dot (`.`) symbol is typically used as the delimiter to join all the names to form a **string identifier**, +but other symbols could also be used if a dot is used in an object name. +For example: + +- the string identifier of `cat5` is `cat2.cat5` +- the string identifier of `t1` is `cat2.cat5.t1` +- the string identifier of `t3` is `cat4$t3` when using delimiter `$` + +## Name and Identifier for Root Namespace + +The root namespace itself has no name or identifier. +When represented in code, its name and string identifier are represented by an empty or null string, +and its list identifier is represented by an empty or null list. + +The actual name and identifier of the root namespace are typically +assigned by users through some configuration when used in a tool. +For example, a root namespace can be called `cat1` in Ray, but called `cat2` in Apache Spark, +and they are both configured to connect to the same root namespace. + +## Namespace Level + +If every table has the same number of namespaces all the way to the root namespace, +the namespace is called **leveled**. The [example above](#namespace-definition) is not leveled +because `t1` has 2 namespaces `ns1` and `ns4` before root, whereas `t2` has 1 namespace `ns2` before root. + +For a leveled namespace, the number of namespaces up to and including the root for any table +is referred to as the **number of levels**. +For example, a [directory namespace](../impls/dir) is a 1-level namespace, +and a [Hive namespace](../impls/hive) is a 2-level namespace. 
\ No newline at end of file diff --git a/docs/src/spec/implementations.md b/docs/src/spec/implementations.md new file mode 100644 index 000000000..d8432abe5 --- /dev/null +++ b/docs/src/spec/implementations.md @@ -0,0 +1,41 @@ +# Namespace Implementations + +A **Lance Namespace Implementation** is an implementation of the Lance namespace specification, +more specifically: + +1. It satisfies all the Lance namespace definitions and concepts. +2. It declares and implements a list of supported Lance namespace operations. + +## Implementation and Storage + +Except for any storage-only implementation (e.g. [directory namespace](../impls/dir)), +a Lance table exists both in the storage and the implementation. +For example, a Lance table exists both in HMS and storage for the [Hive namespace](../impls/hive). +There are two possible ways to manage a Lance table under such a setting. +A Lance namespace implementation can choose to support one or both: + +### Implementation Managed Table + +An implementation managed Lance table is a table that is fully managed by the Lance namespace implementation. +The implementation must maintain information about the latest version of the Lance table. +Any modifications to the table must happen through the implementation. +If a user directly modifies the underlying table in the storage bypassing the implementation, +the implementation must not reflect the changes in the table to the namespace users. + +This mode ensures the namespace service is aware of all activities in the table, +and can thus fully enforce any governance and management features for the table. + +### Storage Managed Table + +A storage managed Lance table is a table that is fully managed by the storage +with a metadata definition in the Lance namespace implementation. +The implementation only contains information about the table directory location. 
+It is expected that a tool finds the latest version of the Lance table based on the contents +in the table directory according to the Lance format specification. +A modification to the table can happen either directly against the storage, +or happen as a request to the implementation, where the implementation is responsible for applying the corresponding +change to the underlying storage according to the Lance format specification. + +This mode is more flexible for real world ML/AI workflows +but the implementation loses full visibility and control over the actions performed against the table, +so it will be harder to enforce any governance and management features for storage managed tables. \ No newline at end of file diff --git a/docs/src/spec/impls/hive.md b/docs/src/spec/impls/hive.md index 81ab17f29..67d3cd857 100644 --- a/docs/src/spec/impls/hive.md +++ b/docs/src/spec/impls/hive.md @@ -1,6 +1,26 @@ -# Lance HMS Namespace +# Lance Hive Namespace -Lance HMS Namespace directly integrates with HMS to offer a 2-level Lance namespace experience. -The root namespace maps to the entire HMS, which has HMS databases as child namespaces. +** Lance Hive Namespace** is an implementation using Apache Hive MetaStore (HMS). +For more details about HMS, please read [HMS AdminManual 2.x](https://hive.apache.org/docs/latest/adminmanual-metastore-administration_27362076/) +and [HMS AdminManual 3.x](https://hive.apache.org/docs/latest/adminmanual-metastore-3-0-administration_75978150/). -TODO: add more information after implementation is officially added. \ No newline at end of file +## Namespace Mapping + +A HMS server can be viewed as the root Lance namespace, and a database in HMS maps to the first level Lance namespace +to offer a 2-level Lance namespace as a whole. 
+ +## Table Definition + +A Lance table should appear as a [Table object](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L631) +in HMS with the following requirements: + +1. the [tableType](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L643) must be set as `EXTERNAL_TABLE` to indicate this is not a managed Hive table +2. the [parameters](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L640) must follow: + 1. there is a key `table_type` set to `lance` (case insensitive) + 2. there is a key `managed_by` set to either `storage` or `impl` (case insensitive). If not set, default to `storage` + 3. there is a key `version` set to the latest numeric version number of the table. This field will only be respected if `managed_by=impl` + +## Requirement for Implementation Managed Table + +An update to the implementation managed table must use Hive's atomic update feature ([HIVE-26882](https://issues.apache.org/jira/browse/HIVE-26882)) +and use the `version` parameter value to perform conditional update through [alter_table_with_environment_context](https://github.com/apache/hive/blob/branch-4.0/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift#L2733) \ No newline at end of file diff --git a/docs/src/spec/impls/overview.md b/docs/src/spec/impls/overview.md index fdb697807..8b2533d1f 100644 --- a/docs/src/spec/impls/overview.md +++ b/docs/src/spec/impls/overview.md @@ -1,5 +1,5 @@ ## Native Implementations -A native Lance namespace implementation is a [Lance Namespace implementation](../spec.md#namespace-implementations) +A native Lance namespace implementation is a [Lance Namespace implementation](../../../spec/implementations) that is maintained in this `lance-namespace` repository. 
Any implementation that is outside the repository is considered as a third-party implementation. \ No newline at end of file diff --git a/docs/src/spec/impls/rest.md b/docs/src/spec/impls/rest.md index 8478eb501..f4863e262 100644 --- a/docs/src/spec/impls/rest.md +++ b/docs/src/spec/impls/rest.md @@ -4,7 +4,7 @@ In an enterprise environment, typically there is a requirement to store tables i for more advanced governance features around access control, auditing, lineage tracking, etc. **Lance REST Namespace** is an OpenAPI protocol that enables reading, writing and managing Lance tables by connecting those metadata services or building a custom metadata server in a standardized way. -The REST server definition can be found in the OpenAPI specification. +The REST server definition can be found in the [OpenAPI specification](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/spec/rest.yaml). ## REST Routes diff --git a/docs/src/spec/operations.md b/docs/src/spec/operations.md new file mode 100644 index 000000000..fde569a8f --- /dev/null +++ b/docs/src/spec/operations.md @@ -0,0 +1,35 @@ +# Namespace Operations + +The Lance Namespace Specification defines a list of operations that can be performed against any Lance namespace: + +| Operation ID | Description | +|---------------------|--------------------------------------------------------------------------------------------------------------------| +| ListOperations | List the operations that are supported by this Lance namespace | +| ListNamespaces | List the names of child namespaces in the parent namespace, or root namespace if parent namespace is not specified | +| NamespaceExists | Check if a namespace exists | +| DescribeNamespace | Describe information of a namespace | +| CreateNamespace | Create a new namespace under a parent namespace, or root namespace if parent namespace is not specified | +| AlterNamespace | Alter information of a namespace 
| +| DropNamespace | Drop a namespace from its parent namespace, or root namespace if parent namespace is not specified | +| ListTables | List the names of tables in a namespace | +| TableExists | Check if a table exists | +| DescribeTable | Describe information of a Lance table in the namespace | +| CreateTable | Create a new Lance table under a namespace | +| RegisterTable | Register an existing table at a given storage location to a namespace | +| AlterTable | Alter information of a Lance table | +| DropTable | Drop a table from its namespace | +| DeregisterTable | Deregister a table from its namespace; the table content is kept unchanged in storage | +| DescribeTransaction | Describe information of a transaction | +| AlterTransaction | Alter information of a transaction | + +## Operation Versioning + +There is no versioning concept within an operation. When a backwards-incompatible change is introduced, +a new operation needs to be created, with a naming convention of `V`, +for example `ListNamespacesV2`, `DescribeTableV3`, etc. + +## Operation Request and Response Schema + +Each operation has a request and response. +The request and response schema is defined using JSON schema in the +`components/schemas` section of the [OpenAPI specification](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lancedb/lance-namespace/refs/heads/main/docs/src/spec/rest.yaml). diff --git a/docs/src/spec/spec.md b/docs/src/spec/spec.md deleted file mode 100644 index 9faaeb97f..000000000 --- a/docs/src/spec/spec.md +++ /dev/null @@ -1,175 +0,0 @@ -# Lance Namespace Specification - -**Lance Namespace Specification** is an open specification on top of the storage-based Lance data format -to standardize access to a collection of Lance tables (a.k.a. Lance datasets). -It describes how a metadata service like Apache Hive MetaStore (HMS), Apache Gravitino, Unity Catalog, etc. 
-should store and use Lance tables, as well as how ML/AI tools and analytics compute engines -(will together be called _"tools"_ in this document) should integrate with Lance tables. - -## Namespace Concepts - -### Namespace Definition - -A Lance namespace is a centralized repository for discovering, organizing, and managing Lance tables. -It can either contain a collection of tables, or a collection of Lance namespaces recursively. -It is designed to encapsulates concepts including namespace, metastore, database, schema, etc. -that frequently appear in other similar data systems to allow easy integration with any system of any type of object hierarchy. - -Here is an example layout of a Lance namespace: - -![Lance namespace layout](layout.png) - -### Parent & Child - -We use the term **parent** and **child** to describe relationship between 2 objects. -If namespace A directly contains B, then A is the parent namespace of B, i.e. B is a child of A. -For examples: - -- Namespace `ns1` contains a **child namespace** `ns4`. i.e. `ns1` is the **parent namespace** of `ns4`. -- Namespace `ns2` contains a **child table** `t2`, i.e. `t2` belongs to **parent namespace** `ns2`. - -### Root Namespace - -A root namespace is a namespace that has no parent. -The root namespace is assumed to always exist and is ready to be connected to by a tool to explore objects in the namespace. -The lifecycle management (e.g. creation, deletion) of the root namespace is out of scope of this specification. - -### Object Name - -The **name** of an object is a string that uniquely identifies the object within the parent namespace it belongs to. -The name of any object must be unique among all other objects that share the same parent namespace. 
-For examples: - -- `cat2`, `cat3` and `cat4` are all unique names under the root namespace -- `t3` and `t4` are both unique names under `cat4` - -### Object Identifier - -The **identifier** of an object uniquely identifies the object within the root namespace it belongs to. -The identifier of any object must be unique among all other objects that share the same root namespace. - -Based on the uniqueness property of an object name within its parent namespace, -an object identifier is the list of object names starting from (not including) the root namespace to (including) the object itself. -This is also called an **list identifier**. -For examples: - -- the list identifier of `cat5` is `[cat2, cat5]` -- the list identifier of `t1` is `[cat2, cat5, t1]` - -The dot (`.`) symbol is typically used as the delimiter to join all the names to form an **string identifier**, -but other symbols could also be used if dot is used in the object name. -For examples: - -- the string identifier of `cat5` is `cat2.cat5` -- the string identifier of `t1` is `cat2.cat5.t1` -- the string identifier of `t3` is `cat4$t3` when using delimiter `$` - -### Name and Identifier for Root Namespace - -The root namespace itself has no name or identifier. -When represented in code, its name and string identifier is represented by an empty or null string, -and its list identifier is represented by an empty or null list. - -The actual name and identifier of the root namespace is typically -assigned by users through some configuration when used in a tool. -For example, a root namespace can be called `cat1` in Ray, but called `cat2` in Apache Spark, -and they are both configured to connect to the same root namespace. 
- -## Namespace Operations - -The Lance Namespace Specification defines a list of operations that can be performed against any Lance namespace: - -| Operation ID | Description | -|---------------------|--------------------------------------------------------------------------------------------------------------------| -| ListOperations | List the operations that are supported by this Lance namespace | -| ListNamespaces | List the names of child namespaces in the parent namespace, or root namespace if parent namespace is not specified | -| NamespaceExists | Check if a namespace exists | -| DescribeNamespace | Describe information of a namespace | -| CreateNamespace | Create a new namespace under a parent namespace, or root namespace if parent namespace is not specified | -| AlterNamespace | Alter information of a namespace | -| DropNamespace | Drop a namespace from its a parent namespace, or root namespace if parent namespace is not specified | -| ListTables | List the names of tables in a namespace | -| TableExists | Check if a table exists | -| DescribeTable | Describe information of a Lance table in the namespace | -| CreateTable | Create a new Lance table under a namespace | -| RegisterTable | Register an existing table at a given storage location to a namespace | -| AlterTable | Alter information of a Lance table | -| DropTable | Drop a table from its namespace | -| DeregisterTable | Deregister a table from its namespace, table content is kept unchanged in storage | -| DescribeTransaction | Describe information of a transaction | -| AlterTransaction | Alter information of a transaction | - -### Operation Versioning - -There is no versioning concept within an operation. When backwards incompatible change is introduced, -a new operation needs to be created, with a naming convention of `V`, -for example `ListNamespacesV2`, `DescribeTableV3`, etc. - -### Operation Request and Response Schema - -Each operation has a request and response. 
-The request and response schema is defined using JSON schema in the `components/schemas` section of [rest.yaml](rest.yaml). - -## Namespace Implementations - -A **Lance Namespace Implementation** is an implementation of the Lance namespace specification, -more specifically: - -1. It satisfies all the Lance namespace definitions and concepts. -2. It declares and implements a list of supported Lance namespace operations. - -### Implementation and Storage - -Except for any storage-only implementation (e.g. [Lance directory namespace](#lance-directory-namespace)), -a Lance table exists both in the storage and the implementation. -For example, a Lance table exists both in HMS and storage for the [Lance HMS namespace](#lance-hms-namespace). -There are 2 possible ways to manage a Lance table under such setting. -A Lance namespace implementation can choose to support one or both: - -#### Implementation Managed Table - -A implementation managed Lance table is a table that is fully managed by the Lance namespace implementation. -The implementation must maintain information about the latest version of the Lance table. -Any modifications to the table must happen through the implementation. -If a user directly modifies the underlying table in the storage bypassing the implementation, -the implementation must not reflect the changes in the table to the namespace users. - -This mode ensures the namespace service is aware of all activities in the table, -and can thus fully enforce any governance and management features for the table. - -#### Storage Managed Table - -A storage managed Lance table is a table that is fully managed by the storage -with a metadata definition in the Lance namespace implementation. -The implementation only contains information about the table directory location. -It is expected that a tool finds the latest version of the Lance table based on the contents -in the table directory according to the Lance format specification. 
-A modification to the table can happen either directly against the storage, -or happen as a request to the implementation, where the implementation is responsible for applying the corresponding -change to the underlying storage according to the Lance format specification. - -This mode is more flexible for real world ML/AI workflows -but the implementation loses full visibility and control over the actions performed against the table, -so it will be harder to enforce any governance and management features for storage managed tables. - -## Tool Integration Guidelines - -The following are guidelines for tools to integrate with Lance namespaces. -Note that these are recommendations rather than hard requirements. -The goal of these guidelines is to offer a consistent user experience across different tools. - -### Configuring the Implementation - -We recommend tools to offer a `impl` config key that allows user to configure the Namespace implementation. -We recommend the following values for the natively supported implementations: - -| Implementation | `impl` Value | -|-----------------------|--------------| -| Directory | dir | -| Apache Hive MetaStore | hive | -| REST | rest | - -### Configuring an Implementation Details - -We recommend tools to offer implementation specific configurations using the `impl` value as the config key prefix. -For example, all config keys for the directory namespace should start with `dir.`, like `dir.path`. \ No newline at end of file diff --git a/docs/src/spec/tools.md b/docs/src/spec/tools.md new file mode 100644 index 000000000..1afb8f78d --- /dev/null +++ b/docs/src/spec/tools.md @@ -0,0 +1,22 @@ +# Tool Integration Guidelines + +Tools refer to all the ML/AI training tools and analytics compute engines that can integrate with Lance tables. +The following are guidelines for tools to integrate with Lance namespaces. +Note that these are recommendations rather than hard requirements. 
+The goal of these guidelines is to offer a consistent user experience across different tools. + +## Configuring the Implementation + +We recommend that tools offer an `impl` config key that allows users to configure the namespace implementation. +We recommend the following values for the natively supported implementations: + +| Implementation | `impl` Value | +|-----------------------|--------------| +| Directory | dir | +| Apache Hive MetaStore | hive | +| REST | rest | + +### Configuring Implementation Details + +We recommend that tools offer implementation-specific configurations using the `impl` value as the config key prefix. +For example, all config keys for the directory namespace should start with `dir.`, like `dir.path`. \ No newline at end of file From efb8b1e0dd93a3d4589772a4a43bb6caeb0ad172 Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Sat, 5 Jul 2025 20:55:55 -0700 Subject: [PATCH 2/6] address comments --- docs/src/index.md | 23 ++++++++++++++++++++++- docs/src/spec/impls/hive.md | 10 ++++++++-- 2 files changed, 30 insertions(+), 3 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index cec599c19..e35d135fe 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -1,9 +1,30 @@ -# Introduction +# ![logo](./logo/wide.png) +## Introduction + **Lance Namespace Specification** is an open specification on top of the storage-based Lance data format to standardize access to a collection of Lance tables (a.k.a. Lance datasets). It describes how a metadata service like Apache Hive MetaStore (HMS), Apache Gravitino, Unity Catalog, etc. should store and use Lance tables, as well as how ML/AI tools and analytics compute engines should integrate with Lance tables. +## Why _Namespace_ not _Catalog_? + +There are many equivalent terms that provides a container concept in a database system, +including _namespace_, _catalog_, _schema_, _database_, _metastore_, _metalake_, etc. 
+Namespace and catalog are the 2 most popular terms used in modern lakehouse systems. + +Between namespace and catalog, catalog typically implies at least a 2-level hierarchy, +such as `catalog -> database -> table` in Apache Hive MetaStore, +and `catalog -> multi-level namespace -> table` in Apache Iceberg REST catalog. + +Lance is a format for ML and LLM use cases, and we observe a popularity for a 1-level hierarchy in the ML/AI community. +People commonly just use a simple directory to store datasets, +and categorize them through mechanisms like tagging instead of organizing them into a fixed hierarchy. + +To accommodate this architecture as a first-class citizen, +we decide to use the term **_namespace_** to represent all container concepts including a catalog. +By offering a multi-level namespace semantics on top of Lance through the Lance Namespace Specification, +we are able to flexibly model against any data categorization strategy, +and allow users to store and manage Lance datasets in their system. \ No newline at end of file diff --git a/docs/src/spec/impls/hive.md b/docs/src/spec/impls/hive.md index 67d3cd857..7f5c45423 100644 --- a/docs/src/spec/impls/hive.md +++ b/docs/src/spec/impls/hive.md @@ -6,8 +6,14 @@ and [HMS AdminManual 3.x](https://hive.apache.org/docs/latest/adminmanual-metast ## Namespace Mapping -A HMS server can be viewed as the root Lance namespace, and a database in HMS maps to the first level Lance namespace -to offer a 2-level Lance namespace as a whole. +A HMS server can be viewed as the root Lance namespace. + +For HMS 2.x and below, a database in HMS maps to the first level Lance namespace +to form a 2-level Lance namespace as a whole. + +For HMS 3.x and above, a catalog in HMS maps to the first level Lance namespace, +and a database in HMS maps to the second level Lance namespace +to form a 3-level Lance namespace as a whole. 
## Table Definition From b9478bf34bf0dcd80701ab4056064d2036c97edf Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Sat, 5 Jul 2025 21:00:24 -0700 Subject: [PATCH 3/6] wording fix --- docs/src/index.md | 33 ++++++++++++++++----------------- 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index e35d135fe..9d428a573 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -11,20 +11,19 @@ should store and use Lance tables, as well as how ML/AI tools and analytics comp ## Why _Namespace_ not _Catalog_? -There are many equivalent terms that provides a container concept in a database system, -including _namespace_, _catalog_, _schema_, _database_, _metastore_, _metalake_, etc. -Namespace and catalog are the 2 most popular terms used in modern lakehouse systems. - -Between namespace and catalog, catalog typically implies at least a 2-level hierarchy, -such as `catalog -> database -> table` in Apache Hive MetaStore, -and `catalog -> multi-level namespace -> table` in Apache Iceberg REST catalog. - -Lance is a format for ML and LLM use cases, and we observe a popularity for a 1-level hierarchy in the ML/AI community. -People commonly just use a simple directory to store datasets, -and categorize them through mechanisms like tagging instead of organizing them into a fixed hierarchy. - -To accommodate this architecture as a first-class citizen, -we decide to use the term **_namespace_** to represent all container concepts including a catalog. -By offering a multi-level namespace semantics on top of Lance through the Lance Namespace Specification, -we are able to flexibly model against any data categorization strategy, -and allow users to store and manage Lance datasets in their system. \ No newline at end of file +There are many terms used to describe the concept of a container in database systems +— such as _namespace_, _catalog_, _schema_, _database_, _metastore_, and _metalake_. 
+Among these, namespace and catalog have become the most prominent in modern lakehouse architectures. + +The term catalog typically implies a hierarchical structure with at least two levels. +For example, Apache Hive uses a catalog → database → table model, +while Apache Iceberg’s REST catalog adopts a catalog → multi-level namespace → table hierarchy. + +In contrast, the ML and AI communities tend to favor a flatter organizational model. +It’s common to store datasets in simple directories +and categorize them using flexible systems like tagging, rather than rigid hierarchies. + +To better support this usage pattern, Lance adopts the term **_namespace_** to represent all container concepts +— including what would traditionally be called a catalog. +With the **Lance Namespace Specification**, we provide a flexible, multi-level namespace abstraction +that allows users to structure and manage Lance datasets in ways that best align with their data organization strategies. \ No newline at end of file From fb174d8877f4203b56ae1082fefb01bd34fa2951 Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Sat, 5 Jul 2025 21:06:08 -0700 Subject: [PATCH 4/6] commit --- docs/src/spec/impls/hive.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/spec/impls/hive.md b/docs/src/spec/impls/hive.md index 7f5c45423..bd033c945 100644 --- a/docs/src/spec/impls/hive.md +++ b/docs/src/spec/impls/hive.md @@ -1,6 +1,6 @@ # Lance Hive Namespace -** Lance Hive Namespace** is an implementation using Apache Hive MetaStore (HMS). +**Lance Hive Namespace** is an implementation using Apache Hive MetaStore (HMS). For more details about HMS, please read [HMS AdminManual 2.x](https://hive.apache.org/docs/latest/adminmanual-metastore-administration_27362076/) and [HMS AdminManual 3.x](https://hive.apache.org/docs/latest/adminmanual-metastore-3-0-administration_75978150/). 
From ef69e1edf8b4af41098da1dd1e20f9f79ce43488 Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Sun, 6 Jul 2025 16:23:24 -0700 Subject: [PATCH 5/6] fix --- docs/src/spec/concepts.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/src/spec/concepts.md b/docs/src/spec/concepts.md index 126e872c7..ab7bd213a 100644 --- a/docs/src/spec/concepts.md +++ b/docs/src/spec/concepts.md @@ -76,4 +76,4 @@ because `t1` has 2 namespaces `ns1` and `ns4` before root, whereas `t2` has 1 na For a leveled namespace, the number of namespaces up to and including the root for any table is referred to as the **number of levels**. For example, a [directory namespace](../impls/dir) is a 1-level namespace, -and a [Hive namespace](../impls/hive) is a 2-level namespace. \ No newline at end of file +and a [Hive 2.x namespace](../impls/hive) is a 2-level namespace. \ No newline at end of file From 77fb766cddb13e404630405c20d8222ef220e344 Mon Sep 17 00:00:00 2001 From: Jack Ye Date: Sun, 6 Jul 2025 20:45:40 -0700 Subject: [PATCH 6/6] fix index title --- docs/src/index.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/src/index.md b/docs/src/index.md index 9d428a573..cedaa40f0 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -1,8 +1,8 @@ -# +# Introduction ![logo](./logo/wide.png) -## Introduction +## Lance Namespace Specification **Lance Namespace Specification** is an open specification on top of the storage-based Lance data format to standardize access to a collection of Lance tables (a.k.a. Lance datasets).
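
The identifier rules added in `docs/src/spec/concepts.md` and the HMS table parameters added in `docs/src/spec/impls/hive.md` can be illustrated with a small Python sketch. This is only a sketch under stated assumptions, not code from the `lance-namespace` repository: the function names `to_string_identifier`, `resolve_managed_by`, and `resolve_version` are hypothetical, parameter keys are assumed to be matched exactly while values are compared case-insensitively, and raising an error when the delimiter occurs inside a name is one possible reaction (the spec text only says that another delimiter could be used).

```python
from typing import Mapping, Optional, Sequence


def to_string_identifier(names: Sequence[str], delimiter: str = ".") -> str:
    """Join a list identifier into a string identifier.

    An empty list identifier stands for the root namespace and maps to an
    empty string identifier.
    """
    for name in names:
        if delimiter in name:
            # Assumption: refuse ambiguous identifiers instead of escaping.
            raise ValueError(
                f"delimiter {delimiter!r} appears in object name {name!r}; "
                "choose a different delimiter"
            )
    return delimiter.join(names)


def resolve_managed_by(parameters: Mapping[str, str]) -> str:
    """Return 'storage' or 'impl' for an HMS table's parameters.

    Per the Hive namespace text, a missing `managed_by` defaults to `storage`.
    """
    value = parameters.get("managed_by", "storage").lower()
    if value not in ("storage", "impl"):
        raise ValueError(f"unsupported managed_by value: {value!r}")
    return value


def resolve_version(parameters: Mapping[str, str]) -> Optional[int]:
    """Return the table version, which is only respected if managed_by=impl."""
    if resolve_managed_by(parameters) != "impl":
        return None
    raw = parameters.get("version")
    if raw is None:
        # Assumption: an implementation managed table without a version
        # is treated as malformed.
        raise ValueError("implementation managed table is missing 'version'")
    return int(raw)
```

For instance, `to_string_identifier(["cat4", "t3"], "$")` gives `cat4$t3`, matching the `$` delimiter example in the concepts page, and `resolve_version` ignores `version` for a storage managed table.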