2 changes: 1 addition & 1 deletion .github/workflows/java.yml
@@ -23,7 +23,7 @@ on:
- ready_for_review
- reopened
paths:
- docs/src/rest.yaml
- docs/src/spec.yaml
- java/**
- .github/workflows/java.yml

2 changes: 1 addition & 1 deletion Makefile
@@ -12,7 +12,7 @@

.PHONY: lint
lint:
uv run openapi-spec-validator --errors all docs/src/rest.yaml
uv run openapi-spec-validator --errors all docs/src/spec.yaml

.PHONY: clean-rust
clean-rust:
2 changes: 1 addition & 1 deletion docs/Makefile
@@ -12,7 +12,7 @@

# Java model docs source and destination
JAVA_DOCS_SRC := ../java/lance-namespace-apache-client/docs
MODELS_DEST := src/client/operations/models
MODELS_DEST := src/namespace/operations/models

# API files to exclude (Java-specific, not data models)
API_FILES := DataApi.md IndexApi.md MetadataApi.md NamespaceApi.md TableApi.md TagApi.md TransactionApi.md
5 changes: 2 additions & 3 deletions docs/mkdocs.yml
@@ -1,5 +1,5 @@
site_name: Lance Namespace
site_description: open specification on top of the storage-based Lance data format to standardize access to a collection of Lance tables
site_name: Lance Catalog & Namespace
site_description: Open specification for managing collections of Lance tables through catalog specs (Directory and REST) and a unified Namespace SDK
site_url: https://lance.org/format/namespace/
docs_dir: src

@@ -68,4 +68,3 @@ extra:

extra_javascript:
- https://unpkg.com/mermaid@10/dist/mermaid.min.js

7 changes: 2 additions & 5 deletions docs/src/.pages
@@ -1,7 +1,4 @@
nav:
- index.md
- Client Spec: client
- Directory Namespace: dir
- REST Namespace: rest
- Catalog Integrations: integrations
- Partitioning Spec: partitioning-spec.md
- Catalog Specs: catalog
- Namespace Client Spec: namespace
5 changes: 5 additions & 0 deletions docs/src/catalog/.pages
@@ -0,0 +1,5 @@
title: Catalog Specs
nav:
- Overview: index.md
- Directory Catalog: dir
- REST Catalog: rest
33 changes: 19 additions & 14 deletions docs/src/dir/catalog-spec.md → docs/src/catalog/dir/index.md
@@ -1,7 +1,10 @@
# Lance Directory Namespace Catalog Spec
# Directory Catalog Format Specification

**Lance directory namespace** is a catalog that stores tables in a directory structure
on any local or remote storage system. It has gone through 2 major spec versions so far:
The **Lance Directory Catalog** is a storage-native catalog format that stores tables in a directory structure on any local or remote storage system. It requires no external metadata service — only a filesystem or object store.

Machine learning workloads frequently operate on datasets stored in object storage and favor minimal operational dependencies, even in production environments. However, existing lakehouse formats typically require an external catalog service, while storage-only approaches lack the transactional guarantees required for reliable production use. The Directory Catalog addresses this gap by providing a catalog built directly on top of the Lance table format.

The Directory Catalog has gone through 2 major spec versions:

- **V1 (Directory Listing)**: A lightweight, simple 1-level namespace that discovers tables by scanning the directory.
- **V2 (Manifest)**: A more advanced implementation backed by a manifest table (a Lance table) that supports nested namespaces and better performance at scale.
@@ -13,11 +16,11 @@ This mode is ideal for getting started quickly with Lance tables.

### Directory Layout

A directory namespace maps to a directory on storage, called the **namespace directory**.
A Lance table corresponds to a subdirectory in the namespace directory that has the format `<table_name>.lance`,
A directory catalog maps to a directory on storage, called the **catalog directory**.
A Lance table corresponds to a subdirectory in the catalog directory that has the format `<table_name>.lance`,
called a **table directory**.

Consider the following example namespace directory layout:
Consider the following example catalog directory layout:

```
.
@@ -38,15 +41,15 @@ Consider the following example namespace directory layout:
└── .lance-reserved # Marker: table4 is reserved but not created
```

This describes a Lance directory namespace with the namespace directory at `/my/dir1/`.
This describes a Lance Directory Catalog with the catalog directory at `/my/dir1/`.
It contains active tables `table1` and `table2` at table directories
`/my/dir1/table1.lance` and `/my/dir1/table2.lance`.
Table `table3` exists on storage but is deregistered (excluded from table listings).
Table `table4` is reserved but not yet created with data.

### Table Existence

In V1, a table exists in a Lance directory namespace if a table directory of the specific name exists
In V1, a table exists in a Lance Directory Catalog if a table directory of the specific name exists
and the table is not marked as deregistered.
In object store terms, this means the prefix `<table_name>.lance/` has at least one file in it
and the file `<table_name>.lance/.lance-deregistered` does not exist.
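
The V1 existence rule above can be sketched for a local filesystem as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the function name `table_exists_v1` and the use of `pathlib` are hypothetical, and an object store client would check the prefix with a list call rather than a directory walk.

```python
from pathlib import Path

# Marker file name taken from the spec text above.
DEREGISTERED_MARKER = ".lance-deregistered"


def table_exists_v1(catalog_dir: str, table_name: str) -> bool:
    """Hypothetical helper applying the V1 rule to a local catalog directory."""
    table_dir = Path(catalog_dir) / f"{table_name}.lance"

    # No table directory (prefix) at all: the table does not exist.
    if not table_dir.is_dir():
        return False

    # A deregistered table is treated as not existing.
    if (table_dir / DEREGISTERED_MARKER).exists():
        return False

    # The prefix must contain at least one file.
    return any(p.is_file() for p in table_dir.rglob("*"))
```

A call such as `table_exists_v1("/my/dir1", "table1")` would check `/my/dir1/table1.lance`.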
@@ -65,16 +68,18 @@ is created inside the table directory. This causes the table to be excluded from
and to return "not found" for `DescribeTable` and `TableExists` operations, while preserving the table data
for potential re-registration.

## V2: Manifest
## V2: Manifest

V2 uses a special `__manifest` table (a Lance table) stored in the namespace directory to track all tables
V2 uses a special `__manifest` table (a Lance table) stored in the catalog directory to track all tables
and namespaces. This provides several advantages over V1:

- **Nested namespaces**: Support for hierarchical namespace organization
- **Better performance**: Table discovery queries the manifest table instead of scanning the directory and leverages Lance's random access capability.
- **Metadata support**: All operations can be supported, e.g. namespaces can have associated properties/metadata, tables can be renamed.
- **Optimized directory path**: Hash-based directory naming prevents conflicts and maximizes throughput in object storage.

Because the catalog metadata is itself stored as a Lance table, the catalog inherits the transactional semantics, snapshot isolation, and schema evolution guarantees of the table format, while also benefiting from Lance's random-access-friendly file layout and table-level indexing capabilities.

### Directory Layout

```
@@ -107,13 +112,13 @@ The `__manifest` table has the following schema:

**Primary Key**: The `object_id` column is the [unenforced primary key](https://lance.org/format/table/#unenforced-primary-key) for the manifest table. Implementation of this spec must always enforce the primary key uniqueness using features like Lance merge insert with primary key deduplication.

**Schema Extensibility**: The `__manifest` table schema may include additional columns beyond those listed above. Extensions like [partitioned namespaces](../partitioning-spec.md) add columns for efficient filtering. Implementations should preserve unrecognized columns during updates.
**Schema Extensibility**: The `__manifest` table schema may include additional columns beyond those listed above. Implementations should preserve unrecognized columns during updates, since extensions may add columns for filtering or other metadata-driven behaviors.

### Root Namespace Properties

In V2, the root namespace is implicit and does not have a row in the `__manifest` table. Instead, root namespace properties are stored in the `__manifest` Lance table's metadata map. Properties are stored as key-value pairs where the key is the property name and the value is a UTF-8 encoded byte array.

For example, a partitioned namespace stores its `partition_spec_v1`, `partition_spec_v2`, and `schema` properties in the `__manifest` table's metadata.
For example, implementations may store catalog-level properties in the `__manifest` table's metadata.

### Manifest Table Indexes

@@ -145,7 +150,7 @@ In [compatibility mode](#compatibility-mode), root namespace tables use `<table_

### Table Version Management

V2 optionally supports managed table versioning, where table versions are tracked in the `__manifest` table instead of relying on Lance's native version management. When enabled, the directory namespace acts as an [external manifest store](https://lance.org/format/table/transaction/#external-manifest-store). This feature must be enabled for the entire namespace.
V2 optionally supports managed table versioning, where table versions are tracked in the `__manifest` table instead of relying on Lance's native version management. When enabled, the directory catalog acts as an [external manifest store](https://lance.org/format/table/transaction/#external-manifest-store). This feature must be enabled for the entire catalog.

#### Enabling Table Version Management

@@ -186,7 +191,7 @@ Example metadata JSON:

## Compatibility Mode

By default, the directory namespace operates in compatibility mode, supporting both V1 and V2 tables simultaneously. This allows gradual migration from V1 to V2 without disrupting existing workflows.
By default, the directory catalog operates in compatibility mode, supporting both V1 and V2 tables simultaneously. This allows gradual migration from V1 to V2 without disrupting existing workflows.

In compatibility mode:

29 changes: 29 additions & 0 deletions docs/src/catalog/index.md
@@ -0,0 +1,29 @@
# Lance Catalog Specs

A **catalog** manages collections of tables and provides table discovery, management, and transactional coordination. Catalog implementations vary widely across deployments, ranging from lightweight environments to enterprise platforms integrating with authorization systems or metadata services such as Apache Hive metastores.

To support this range of environments, Lance provides two catalog approaches:

## Directory Catalog

The **[Directory Catalog](dir/index.md)** is a storage-native catalog format that requires only a filesystem or object store — no additional services are needed. This makes it suitable for lightweight deployments, or even embedded in-process databases.

Key characteristics:

- **Zero infrastructure**: Requires only storage (local filesystem, S3, GCS, Azure, etc.)
- **Transactional guarantees**: Catalog metadata is stored as a Lance table, inheriting transactional semantics, snapshot isolation, and schema evolution guarantees
- **Simple deployment**: Ideal for ML/AI workloads that favor minimal operational dependencies

## REST Catalog

The **[REST Catalog](rest/index.md)** is an OpenAPI-based protocol that enables reading, writing, and managing Lance tables through a REST API. This is ideal for enterprise environments that require integration with existing governance, access control, and compliance systems.

Key characteristics:

- **Enterprise integration**: Connect to existing metadata services and authorization systems
- **Standardized API**: OpenAPI specification enables consistent client/server implementations
- **External manifest store**: Table version management APIs can act as an external manifest store for governance policies

## Supported Catalogs

Beyond the natively maintained catalog specs, Lance supports integration with external catalog systems through the [Namespace Client Spec](../namespace/index.md). Namespace Client implementation specs for systems like Apache Polaris, Unity Catalog, Apache Hive Metastore, and Apache Iceberg REST Catalog are maintained separately and can be found in the [Supported Catalogs](../namespace/supported-catalogs/index.md) section.
48 changes: 26 additions & 22 deletions docs/src/rest/catalog-spec.md → docs/src/catalog/rest/index.md
@@ -1,22 +1,26 @@
# Lance REST Namespace Catalog Spec
# REST Catalog API Specification

In an enterprise environment, typically there is a requirement to store tables in a metadata service
for more advanced governance features around access control, auditing, lineage tracking, etc.
**Lance REST Namespace** is an OpenAPI catalog protocol that enables reading, writing and managing Lance tables
by connecting those metadata services or building a custom metadata server in a standardized way.
The REST server definition can be found in the [OpenAPI specification](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lance-format/lance-namespace/refs/heads/main/docs/src/rest.yaml).
In enterprise environments, ML teams often must integrate with existing catalog systems to satisfy governance, access control, and compliance requirements. The **Lance REST Catalog** is an OpenAPI protocol that enables reading, writing, and managing Lance tables by connecting to metadata services or building a custom metadata server in a standardized way.

## Duality with Client-Side Access Spec
The REST Catalog specification, defined as an OpenAPI document, describes the data models and metadata operations needed to discover and manage Lance tables. It also defines data operations such as `QueryTable` and `InsertIntoTable` which exchange Arrow record batches via Apache Arrow IPC streams for efficient data transfer and interoperability with Arrow-native compute engines.

The Lance Namespace client-side access spec defines request and response models using OpenAPI.
The REST namespace spec leverages this fact — the REST API is largely identical to the client-side access spec,
The REST server definition can be found in the [OpenAPI specification](https://editor-next.swagger.io/?url=https://raw.githubusercontent.com/lance-format/lance-namespace/refs/heads/main/docs/src/spec.yaml).

## External Manifest Store

The REST Catalog also exposes table version management APIs that can act as an external manifest store. When used, table commits are coordinated through the catalog before the resulting table metadata is written to storage. This enables organizations to enforce governance policies such as auditing, access control, and commit validation while still preserving the Lance table format as the authoritative source of table state.

## Duality with Namespace Client Spec

The Lance Namespace Client spec defines request and response models using OpenAPI.
The REST Catalog spec leverages this fact — the REST API is largely identical to the Namespace Client spec,
with the request and response schemas directly used as HTTP request and response bodies.

This duality minimizes data conversion between client and server:
a client can serialize its request model directly to JSON for the HTTP body,
and deserialize the HTTP response body directly into the response model.
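
The duality described above can be sketched as follows. The model classes and their fields are hypothetical stand-ins for generated OpenAPI models (only the operation name `DescribeTable` comes from the spec), so this shows the round trip rather than the actual schemas.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class DescribeTableRequest:
    # Hypothetical field: the real request schema is defined in the OpenAPI spec.
    id: List[str]


@dataclass
class DescribeTableResponse:
    # Hypothetical field, for illustration only.
    location: Optional[str] = None


def to_http_body(request: DescribeTableRequest) -> bytes:
    """Serialize the request model directly into a JSON HTTP body."""
    return json.dumps(asdict(request)).encode("utf-8")


def from_http_body(body: bytes) -> DescribeTableResponse:
    """Deserialize a JSON HTTP response body directly into the response model."""
    return DescribeTableResponse(**json.loads(body.decode("utf-8")))
```

No intermediate wire-format type appears between the model and the HTTP body, which is the saving the duality is meant to provide.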

There are a few exceptions where the REST spec diverges from the client-side access spec.
There are a few exceptions where the REST spec diverges from the Namespace Client spec.
For example, for some operations like `InsertIntoTable`, `CreateTable`, `MergeInsertIntoTable`,
the HTTP request body is used for transmitting Arrow IPC binary data,
and the operation request fields are transmitted through query parameters instead.
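
The exception above, where the body carries Arrow IPC bytes and the request fields move to query parameters, might look like the sketch below. Everything specific here is an assumption for illustration: the URL path, the `mode` field, and the `Content-Type` value are not taken from the spec, and the request is only constructed, never sent.

```python
from typing import Dict
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_insert_request(
    base_url: str,
    table_id: str,
    fields: Dict[str, str],
    arrow_ipc_bytes: bytes,
) -> Request:
    """Build (but do not send) an InsertIntoTable-style HTTP request."""
    # Operation request fields travel as query parameters.
    query = urlencode(fields)
    # Hypothetical route: consult the OpenAPI document for the real path.
    url = f"{base_url}/tables/{quote(table_id, safe='')}/insert?{query}"
    # The HTTP body is reserved for the Arrow IPC stream payload.
    return Request(
        url,
        data=arrow_ipc_bytes,
        method="POST",
        # Assumed media type for an Arrow IPC stream.
        headers={"Content-Type": "application/vnd.apache.arrow.stream"},
    )
```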
@@ -48,7 +52,7 @@ When the information in the request body is missing, the server must use the inf
## Identity Header Mapping

All request schemas include an optional `identity` field for authentication.
For REST Namespace, the identity fields are mapped to HTTP headers:
For REST Catalog, the identity fields are mapped to HTTP headers:

| Identity Field | REST Form | Location |
|----------------|-----------------|----------|
@@ -65,7 +69,7 @@ All request schemas include an optional `context` field for passing arbitrary ke
This allows clients to send implementation-specific context that can be used by the server
or forwarded to downstream services.

For REST Namespace, context entries are mapped to HTTP headers using the naming convention:
For REST Catalog, context entries are mapped to HTTP headers using the naming convention:

| Context Entry | REST Form | Location |
|----------------------------|-------------------------------|----------|
@@ -297,33 +301,33 @@ Both request and response bodies are direct objects (map of string to string) in
|----------------|---------------|-------------------------------------------------------------------|
| `metadata` | Response body | Direct object `{"key": "value", ...}` (not `{"metadata": {...}}`) |

## Namespace Server and Adapter
## REST Catalog Server and Adapter

Any REST HTTP server that implements this OpenAPI protocol is called a **Lance Namespace server**.
If you are a metadata service provider that is building a custom implementation of Lance namespace,
Any REST HTTP server that implements this OpenAPI protocol is called a **Lance REST Catalog server**.
If you are a metadata service provider that is building a custom implementation of Lance catalog,
building a REST server gives you standardized integration to Lance
without the need to worry about tool support and
continuously distribute newer library versions compared to using an implementation.

If the main purpose of this server is to be a proxy on top of an existing metadata service,
converting back and forth between Lance REST API models and native API models of the metadata service,
then this Lance namespace server is called a **Lance Namespace adapter**.
then this Lance REST Catalog server is called a **Lance Catalog adapter**.

## Choosing between an Adapter vs an Implementation

Any adapter can always be directly a Lance namespace implementation bypassing the REST server,
Any adapter can always be directly a Lance catalog implementation bypassing the REST server,
and vice versa. In fact, an implementation is basically the backend of an adapter.
For example, we natively support a Lance HMS Namespace implementation,
as well as a Lance namespace adapter for HMS by using the HMS Namespace implementation to fulfill requests in the Lance REST server.
For example, we natively support a Lance HMS Catalog implementation,
as well as a Lance catalog adapter for HMS by using the HMS Catalog implementation to fulfill requests in the Lance REST server.

If you are considering between a Lance namespace adapter vs implementation to build or use in your environment,
If you are considering between a Lance catalog adapter vs implementation to build or use in your environment,
here are some criteria to consider:

1. **Multi-Language Feasibility & Maintenance Cost**: If you want a single strategy that works across all Lance language bindings, an adapter is preferred.
Sometimes it is not even possible for an integration to go with the implementation approach since it cannot support all the languages.
Sometimes an integration is popular or important enough that it is viable to build an implementation and maintain one library per language.
2. **Tooling Support**: each tool needs to declare the Lance namespace implementations it supports.
That means there will be a preference for tools to always support a REST namespace,
2. **Tooling Support**: each tool needs to declare the Lance catalog implementations it supports.
That means there will be a preference for tools to always support a REST catalog,
but it might not always support a specific implementation. This favors the adapter approach.
3. **Security**: if you have security concerns about the adapter being a man-in-the-middle, you should choose an implementation
4. **Performance**: after all, adapter adds one layer of indirection and is thus not the most performant solution.