Skip to content

feat(catalog): add Unity Catalog integration with extensible connector architecture#1

Open
jja725 wants to merge 55 commits into
mainfrom
feat/unity-catalog-integration
Open

feat(catalog): add Unity Catalog integration with extensible connector architecture#1
jja725 wants to merge 55 commits into
mainfrom
feat/unity-catalog-integration

Conversation

@jja725

@jja725 jja725 commented Feb 27, 2026

Copy link
Copy Markdown
Owner

Summary

  • Add Unity Catalog (OSS) integration for browsing catalog metadata and auto-registering tables into SqlEngine
  • Introduce Presto-inspired extensible connector architecture with clean separation of catalog metadata (CatalogProvider) and data format reading (TableReader)
  • Support Delta Lake and Parquet table formats out of the box

Architecture

Inspired by Presto's connector SPI:

Layer Component Purpose
SPI CatalogProvider trait Browse catalog metadata (like Presto's ConnectorMetadata)
SPI TableReader trait Read data in specific formats (like Presto's ConnectorPageSourceProvider)
Facade Connector struct Bundles catalog + readers (like Presto's Connector)

Extensibility guarantee:

  • New catalog (e.g., AWS Glue) → implement CatalogProvider, reuses existing Delta/Parquet readers
  • New format (e.g., Iceberg) → implement TableReader, works with any catalog

Python API

from lance_graph import UnityCatalog

uc = UnityCatalog("http://localhost:8080/api/2.1/unity-catalog")

# Browse
catalogs = uc.list_catalogs()
tables = uc.list_tables("unity", "default")
table = uc.get_table("unity", "default", "marksheet")
print(table.columns())

# Query — auto-registers Delta + Parquet tables
engine = uc.create_sql_engine("unity", "default")
result = engine.execute("SELECT * FROM marksheet WHERE mark > 80")

New files

  • lance-graph-catalog: catalog_provider.rs, table_reader.rs, connector.rs, type_mapping.rs, unity_catalog.rs
  • lance-graph: table_readers.rs (Parquet + Delta), sql_catalog.rs (bridge)
  • lance-graph-python: catalog.rs (PyO3 bindings)

Test plan

  • 12 unit tests for UC type → Arrow type mapping
  • 15 wiremock integration tests for UC REST client
  • 9 Python unit tests for UnityCatalog class
  • 6 Python integration tests (require live UC server, skipped in CI)
  • All 110 existing tests pass unchanged

Lance Release Bot and others added 30 commits December 18, 2025 00:45
…er (lance-format#68)

* feat: implement arithmetic expression computation in DataFusion planner

Implement proper translation of Cypher arithmetic expressions (+, -, *, /, %)
to DataFusion BinaryExpr instead of returning a placeholder lit(0).

Changes:
- Add full arithmetic operator support in to_df_value_expr()
- Map ArithmeticOperator variants to DataFusion Operator equivalents
- Update tests to verify proper BinaryExpr generation with correct operators

* fix format

---------

Co-authored-by: jianjian.xie <jianjian.xie@uber.com>
…ance-format#70)

* Fix lance dependency and remove the implicit feather fallback

* Fix lint
* feat: support gcs and s3 as storage

* fix lint
* feat: implement the LIKE pattern matching
* ci: upgrade lance in dev-dep to 1.0.0

* ci: upgrade the datafusion version to 50.3

* ci: upgrade arrow version to 56.2

* format code

* ci: disable cache targets
add ilike support

Co-authored-by: jianjian.xie <jianjian.xie@uber.com>
* feat: support vector search/similarity as udfs

* feat: optimize the value expression parsing

* refactor: remove the parameterized query
* feat: support vector literals
* feat: support lance vector search

* feat(python): support vector search

* ci: install protoc in python lint workflow

* ci: install essential-only tools for python link workflow

* style: fix python lint errors
…on (lance-format#84)

* fix(simple-executor): return consistent column names using dot notation

Update simple executor to return Cypher dot notation (e.g., p.name)
instead of double underscore format (e.g., p__name) for column names.
This ensures consistency with the DataFusion executor output.

Changes:
- Add to_cypher_column_name() function to convert to dot notation
- Update apply_return_with_qualifier() to apply conversion for final output
- Add unit tests for the new function
- Preserve explicit user-provided aliases

Fixes lance-format#30

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* cargo fmt

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…mat#82)

* feat: add toLower/toUpper support, fix type coercion error

- Add support for toLower/toUpper (Cypher) and lower/upper (SQL) string functions
- Change fallback for unsupported functions from lit(0) to ScalarValue::Null
  to prevent type coercion errors in any context (string or numeric)
- Add unit tests for new string functions
- Add integration tests including exact bug reproduction

Fixes type coercion error when using toLower(s.name) CONTAINS 'x' with
integer columns in RETURN clause.

* chore: remove redundant test section headers

* fix: move string function tests to correct section

* style: apply cargo fmt formatting

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Adds COLLECT() aggregation that collects values into an array,
translating to DataFusion's array_agg function.

- Add collect case in to_df_value_expr for DataFusion translation
- Update contains_aggregate to recognize collect
- Update semantic validation to allow COLLECT with bare variables
- Add tests for COLLECT with and without grouping

Co-authored-by: Allen Cheng <dacheng@Allens-MacBook-Pro.local>
…t#86)

Adds WITH clause for intermediate projections, aggregations, and
query chaining in Cypher queries.

Supported syntax:
- WITH projection: WITH p.name AS name
- WITH aggregation: WITH city, count(*) AS total
- WITH ORDER BY/LIMIT: WITH p ORDER BY p.age DESC LIMIT 10
- Post-WITH WHERE: WITH ... WHERE total > 1
- Post-WITH MATCH: WITH ... MATCH (p2:Person) ...

Changes:
- Add WithClause to AST with items, order_by, limit fields
- Add with_clause, post_with_match_clauses, post_with_where_clause to CypherQuery
- Parse WITH clause and post-WITH MATCH/WHERE in parser
- Add semantic analysis for WITH scope
- Add plan_with_clause in logical planner
- Chain post-WITH MATCH using plan_match_clause_with_base
- Add 5 comprehensive tests for WITH functionality

Note: WITH p (passing whole node) then MATCH (p)-[]->(f) requires
explicit property projection. Use WITH p.id AS id ... instead.

Co-authored-by: Allen Cheng <dacheng@Allens-MacBook-Pro.local>
…ormat#89)

Previously, every query would enumerate all datasets on cloud storage
and load all of them, causing ~10s latency with 20+ datasets on GCS.

Now the query parser extracts which tables are actually referenced
(via node_labels() and relationship_types()), and only those specific
datasets are loaded. Paths are computed directly from the root path
without enumeration.

Fixes lance-format#87

Co-authored-by: Claude <noreply@anthropic.com>
* refactor: adopt workspace layout

* ci: skip rust publish on PRs
* fix(query): error on unsupported Cypher functions

* test: format scalar function test

* style: sort imports in scalar function test

* style(rust): appease rustfmt

* test(query): cover scalar function semantics

* chore: add TODO for hard error on unknown functions

* fix(simple): respect RETURN aliases

* test: reorganize simple executor coverage

* fix: update semantic for post-WITH reading clauses

* test: restore UNWIND coverage

* style: format unwind test
* feat: implement case-insensitive support
Lance Release Bot and others added 25 commits February 1, 2026 04:28
…format#124)

Currently, every call of `query.execute()` needs to build the catalog,
which is an expensive operation. This PR exposes a `CypherEngine` API in
Python for multi-query execution, with a reusable catalog.

An example usage (also included in the README.md):

```python
import pyarrow as pa
from lance_graph import CypherEngine, GraphConfig

cfg = (
    GraphConfig.builder()
    .with_node_label("Person", "id")
    .with_node_label("City", "id")
    .with_relationship("lives_in", "src", "dst")
    .build()
)

datasets = {
    "Person": pa.table({"id": [1, 2], "name": ["Alice", "Bob"], "age": [30, 25]}),
    "City": pa.table({"id": [10, 20], "name": ["London", "Sydney"]}),
    "lives_in": pa.table({"src": [1, 2], "dst": [10, 20]}),
}

# Create engine once - builds catalog
engine = CypherEngine(cfg, datasets)

# Execute multiple queries efficiently - catalog is reused
result1 = engine.execute("MATCH (p:Person) WHERE p.age > 25 RETURN p.name")
result2 = engine.execute("MATCH (p:Person)-[:lives_in]->(c:City) RETURN p.name, c.name")
result3 = engine.execute("MATCH (p:Person) RETURN count(*) as total")

print(result1.to_pylist())
# [{'p.name': 'Alice'}]
```
This PR adds support for reusing variables across multiple patterns in a
Cypher query without introducing duplicate column errors. It implements
logic in the DataFusion planner to detect when a target variable in a
pattern is already bound in the schema, and applies a filter constraint
(e.g., `prev.id = next.id`) instead of performing a redundant join.
## Summary
- add `lance-graph-catalog` crate for namespace/catalog utilities
- re-export catalog/namespace types from `lance-graph` for API
compatibility
- update workspace and crate docs to reference the new catalog crate

## Testing
- /home/user/.cargo/bin/cargo check
- /home/user/.cargo/bin/cargo test --all
## Summary
Addresses code review feedback from PR lance-format#130 by removing redundant
re-export modules and updating imports to use `lance_graph_catalog`
directly.

## Changes
- Removed `crates/lance-graph/src/namespace/` directory (only contained
re-exports)
- Removed `crates/lance-graph/src/source_catalog.rs` (only contained
re-exports)
- Updated `lib.rs` to re-export catalog types directly from
`lance_graph_catalog`
- Updated all internal imports in datafusion_planner, query, and Python
bindings to use `lance_graph_catalog`

## Test plan
- [x] `cargo check --all` passes
- [x] `cargo test --all` passes (all 4 doc tests pass, all unit tests
pass)

Resolves review comments from lance-format#130

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…format#135)

## Summary
- move PyO3 Rust crate from python/ to crates/lance-graph-python
- wire new manifest into workspace, maturin config, Makefile, docs,
examples, and release workflow
- rebuild bindings and run Python tests to validate relocation

## Testing
- cargo check --workspace
- uv run --project . maturin develop
- uv run --project . pytest python/tests -vv
…ld error (lance-format#137)

## Summary
Explicitly set `module-name = "lance_graph._internal"` in
`pyproject.toml`.
This fixes a build error where maturin was looking for `_internal` as a
top-level module in the python source directory, instead of placing it
inside the `lance_graph` package.

## Test plan
- Verified locally that `maturin develop` works correctly with this
change.
- CI should pass.


Made with [Cursor](https://cursor.com)

Co-authored-by: Cursor <cursoragent@cursor.com>
…format#136)

This PR exposes the vector rerank method in the `CypherEngine` API,
similar to the existing `CypherQuery` API.

An example:

```python
engine = CypherEngine(config, datasets)

results = engine.execute_with_vector_rerank(
    "MATCH (d:Document) WHERE d.category = 'tech' RETURN d.id, d.name, d.embedding",
    VectorSearch("d.embedding")
    .query_vector([1.0, 0.0, 0.0])
    .metric(DistanceMetric.L2)
    .top_k(2),
)

data = results.to_pydict()
```
Currently, we cannot create a new release due to the error:
`FileNotFoundError: File not found: 'python/Cargo.toml'`

This PR fixes the issue by pointing to the correct Python folder.
This PR adds the `lance-graph-catalog` to the `.bumpverison.toml` to
unblock the release.
Fix the python's makefile for publishing the python release.
…t#140)

## Summary

- Add a VectorSearch.use_lance_index(True) opt-in to run a vector-first
path when datasets are Lance datasets.
- When enabled, CypherQuery.execute_with_vector_rerank uses Lance ANN
(nearest) to get top-k rows for the vector label, then executes Cypher
on that reduced dataset.\n- Adds a Python test behind requires_lance to
validate the path.


## Notes

- This is an explicit opt-in and only applies when the Cypher query has
no WITH/WHERE clauses; otherwise it falls back to the existing rerank
behavior.
- Semantics can differ from candidate-then-rerank if you enable it on
filtered queries; the guard avoids that by default.
 
## Motivation

Using Lance datasets with ANN indices can significantly reduce latency
for GraphRAG hybrid retrieval on large datasets by avoiding full-table
reranking.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
)

## Summary
Optimize CI workflows to dramatically reduce build times through
improved caching and build configuration.

## Changes

### 1. Shared Rust Dependency Cache
- Added `shared-key: "lance-graph-deps"` to all workflows
- Cache is shared across Python, Rust, Build, and Style workflows
- Include both `crates/lance-graph` and `crates/lance-graph-python`
workspaces
- Prevents duplicate compilation across different workflows

### 2. Fix Double Rust Build in Python Tests
- Removed `uv pip install -e .[tests]` which was triggering Rust
compilation
- Now only `maturin develop` builds the extension once
- **Result: 24 min → 6 min (74% faster!)**

### 3. Remove Nightly from Rust Test Matrix
- Only test on stable toolchain (matches upstream lance project)
- Prevents nightly breakage from blocking PRs
- Stable is the primary deployment target

### 4. Explicit CARGO_INCREMENTAL=0
- Set explicitly for CI (rust-cache overrides to 0 anyway)
- Makes intention clear: incremental artifacts not useful in CI

## Performance Impact

**Before:**
- Python Tests: ~24 minutes
- Rust compiled twice (18min + 5min)
- No cache sharing across workflows

**After:**
- Python Tests: **6 minutes (74% faster)** ✅
- Rust compiled once (5min)
- Shared cache across all workflows

## Test Results

All CI checks passing:
- ✅ Python Tests (test 3.11) - 6 minutes
- ✅ Rust Tests (stable)
- ✅ Rust Coverage
- ✅ Build (stable)
- ✅ All Linters (clippy, format, Python lint, spell check)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Achieved returning full node objects by expanding to all properties in
logical_plan. Also added unit and system tests

Closes lance-format#110
Add SqlQuery and SqlEngine that let users run standard SQL directly
against their datasets without requiring a GraphConfig. This is useful
for data analytics workflows where users want explicit JOINs and
aggregations against node/relationship tables. DataFusion handles SQL
parsing and execution.
…r architecture

Add support for browsing and querying tables from Unity Catalog (OSS).
Inspired by Presto's connector SPI, the design cleanly separates:

- CatalogProvider trait: catalog metadata browsing (UC first, extensible
  to Hive Metastore, AWS Glue, Iceberg REST Catalog)
- TableReader trait: format-specific data reading (Parquet + Delta Lake,
  extensible to CSV, Iceberg, ORC)
- Connector struct: facade bundling catalog + readers

Key features:
- Full UC REST API client (list/get catalogs, schemas, tables, columns)
- UC type → Arrow type mapping (20 type mappings)
- ParquetTableReader via DataFusion register_parquet()
- DeltaTableReader via deltalake 0.29 (behind "delta" feature flag)
- Auto-register UC tables into SqlEngine via create_sql_engine()
- Python bindings: UnityCatalog, CatalogInfo, SchemaInfo, TableInfo
- 15 wiremock integration tests for UC REST client
- 12 type mapping unit tests
- 9 Python unit tests

Python API:
  uc = UnityCatalog("http://localhost:8080/api/2.1/unity-catalog")
  engine = uc.create_sql_engine("unity", "default")
  result = engine.execute("SELECT * FROM my_table")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cargo fmt fixes across all new files
- Replace EnumName::Variant with Self::Variant (clippy::unnecessary_structure_name_repetition)
- Fix Python import sorting and line length (ruff)
…re, GCS)

- Add `storage_options` parameter to `TableReader::register_table()` trait
- `Connector::with_storage_options()` stores credentials and passes them
  to table readers during registration
- `DeltaTableReader` uses `open_table_with_storage_options()` when
  storage options are provided
- Enable deltalake cloud features: s3, azure, gcs
- Python: `UnityCatalog(url, storage_options={...})` accepts cloud creds

Usage:
  uc = UnityCatalog(
      "http://localhost:8080/api/2.1/unity-catalog",
      storage_options={
          "azure_storage_account_name": "myaccount",
          "azure_storage_account_key": "...",
      }
  )
  engine = uc.create_sql_engine("unity", "default")
Add examples for UnityCatalog browsing, create_sql_engine, and
cloud storage options (S3, Azure, GCS) to both project and Python READMEs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants