feat(catalog): add Unity Catalog integration with extensible connector architecture#1
Open
jja725 wants to merge 55 commits into
Open
feat(catalog): add Unity Catalog integration with extensible connector architecture#1jja725 wants to merge 55 commits into
jja725 wants to merge 55 commits into
Conversation
…er (lance-format#68) * feat: implement arithmetic expression computation in DataFusion planner Implement proper translation of Cypher arithmetic expressions (+, -, *, /, %) to DataFusion BinaryExpr instead of returning a placeholder lit(0). Changes: - Add full arithmetic operator support in to_df_value_expr() - Map ArithmeticOperator variants to DataFusion Operator equivalents - Update tests to verify proper BinaryExpr generation with correct operators * fix format --------- Co-authored-by: jianjian.xie <jianjian.xie@uber.com>
…ance-format#70) * Fix lance dependency and remove the implicit feather fallback * Fix lint
* feat: support gcs and s3 as storage * fix lint
* feat: implement the LIKE pattern matching
* ci: upgrade lance in dev-dep to 1.0.0 * ci: upgrade the datafusion version to 50.3 * ci: upgrade arrow version to 56.2 * format code * ci: disable cache targets
add ilike support Co-authored-by: jianjian.xie <jianjian.xie@uber.com>
* feat: support vector search/similarity as udfs * feat: optimize the value expression parsing * refactor: remove the parameterized query
* feat: support vector literals
* feat: support lance vector search * feat(python): support vector search * ci: install protoc in python lint workflow * ci: install essential-only tools for python link workflow * style: fix python lint errors
…on (lance-format#84) * fix(simple-executor): return consistent column names using dot notation Update simple executor to return Cypher dot notation (e.g., p.name) instead of double underscore format (e.g., p__name) for column names. This ensures consistency with the DataFusion executor output. Changes: - Add to_cypher_column_name() function to convert to dot notation - Update apply_return_with_qualifier() to apply conversion for final output - Add unit tests for the new function - Preserve explicit user-provided aliases Fixes lance-format#30 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> * cargo fmt --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…mat#82) * feat: add toLower/toUpper support, fix type coercion error - Add support for toLower/toUpper (Cypher) and lower/upper (SQL) string functions - Change fallback for unsupported functions from lit(0) to ScalarValue::Null to prevent type coercion errors in any context (string or numeric) - Add unit tests for new string functions - Add integration tests including exact bug reproduction Fixes type coercion error when using toLower(s.name) CONTAINS 'x' with integer columns in RETURN clause. * chore: remove redundant test section headers * fix: move string function tests to correct section * style: apply cargo fmt formatting Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com>
Adds COLLECT() aggregation that collects values into an array, translating to DataFusion's array_agg function. - Add collect case in to_df_value_expr for DataFusion translation - Update contains_aggregate to recognize collect - Update semantic validation to allow COLLECT with bare variables - Add tests for COLLECT with and without grouping Co-authored-by: Allen Cheng <dacheng@Allens-MacBook-Pro.local>
…t#86) Adds WITH clause for intermediate projections, aggregations, and query chaining in Cypher queries. Supported syntax: - WITH projection: WITH p.name AS name - WITH aggregation: WITH city, count(*) AS total - WITH ORDER BY/LIMIT: WITH p ORDER BY p.age DESC LIMIT 10 - Post-WITH WHERE: WITH ... WHERE total > 1 - Post-WITH MATCH: WITH ... MATCH (p2:Person) ... Changes: - Add WithClause to AST with items, order_by, limit fields - Add with_clause, post_with_match_clauses, post_with_where_clause to CypherQuery - Parse WITH clause and post-WITH MATCH/WHERE in parser - Add semantic analysis for WITH scope - Add plan_with_clause in logical planner - Chain post-WITH MATCH using plan_match_clause_with_base - Add 5 comprehensive tests for WITH functionality Note: WITH p (passing whole node) then MATCH (p)-[]->(f) requires explicit property projection. Use WITH p.id AS id ... instead. Co-authored-by: Allen Cheng <dacheng@Allens-MacBook-Pro.local>
…ormat#89) Previously, every query would enumerate all datasets on cloud storage and load all of them, causing ~10s latency with 20+ datasets on GCS. Now the query parser extracts which tables are actually referenced (via node_labels() and relationship_types()), and only those specific datasets are loaded. Paths are computed directly from the root path without enumeration. Fixes lance-format#87 Co-authored-by: Claude <noreply@anthropic.com>
* refactor: adopt workspace layout * ci: skip rust publish on PRs
* fix(query): error on unsupported Cypher functions * test: format scalar function test * style: sort imports in scalar function test * style(rust): appease rustfmt * test(query): cover scalar function semantics * chore: add TODO for hard error on unknown functions * fix(simple): respect RETURN aliases * test: reorganize simple executor coverage * fix: update semantic for post-WITH reading clauses * test: restore UNWIND coverage * style: format unwind test
* feat: implement case-insensitive support
…format#124) Currently, every call of `query.execute()` needs to build the catalog, which is an expensive operation. This PR exposes a `CypherEngine` API in Python for multi-query execution, with a reusable catalog. An example usage (also included in the README.md): ```python import pyarrow as pa from lance_graph import CypherEngine, GraphConfig cfg = ( GraphConfig.builder() .with_node_label("Person", "id") .with_node_label("City", "id") .with_relationship("lives_in", "src", "dst") .build() ) datasets = { "Person": pa.table({"id": [1, 2], "name": ["Alice", "Bob"], "age": [30, 25]}), "City": pa.table({"id": [10, 20], "name": ["London", "Sydney"]}), "lives_in": pa.table({"src": [1, 2], "dst": [10, 20]}), } # Create engine once - builds catalog engine = CypherEngine(cfg, datasets) # Execute multiple queries efficiently - catalog is reused result1 = engine.execute("MATCH (p:Person) WHERE p.age > 25 RETURN p.name") result2 = engine.execute("MATCH (p:Person)-[:lives_in]->(c:City) RETURN p.name, c.name") result3 = engine.execute("MATCH (p:Person) RETURN count(*) as total") print(result1.to_pylist()) # [{'p.name': 'Alice'}] ```
This PR adds support for reusing variables across multiple patterns in a Cypher query without introducing duplicate column errors. It implements logic in the DataFusion planner to detect when a target variable in a pattern is already bound in the schema, and applies a filter constraint (e.g., `prev.id = next.id`) instead of performing a redundant join.
## Summary - add `lance-graph-catalog` crate for namespace/catalog utilities - re-export catalog/namespace types from `lance-graph` for API compatibility - update workspace and crate docs to reference the new catalog crate ## Testing - /home/user/.cargo/bin/cargo check - /home/user/.cargo/bin/cargo test --all
## Summary Addresses code review feedback from PR lance-format#130 by removing redundant re-export modules and updating imports to use `lance_graph_catalog` directly. ## Changes - Removed `crates/lance-graph/src/namespace/` directory (only contained re-exports) - Removed `crates/lance-graph/src/source_catalog.rs` (only contained re-exports) - Updated `lib.rs` to re-export catalog types directly from `lance_graph_catalog` - Updated all internal imports in datafusion_planner, query, and Python bindings to use `lance_graph_catalog` ## Test plan - [x] `cargo check --all` passes - [x] `cargo test --all` passes (all 4 doc tests pass, all unit tests pass) Resolves review comments from lance-format#130 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…format#135) ## Summary - move PyO3 Rust crate from python/ to crates/lance-graph-python - wire new manifest into workspace, maturin config, Makefile, docs, examples, and release workflow - rebuild bindings and run Python tests to validate relocation ## Testing - cargo check --workspace - uv run --project . maturin develop - uv run --project . pytest python/tests -vv
…ld error (lance-format#137) ## Summary Explicitly set `module-name = "lance_graph._internal"` in `pyproject.toml`. This fixes a build error where maturin was looking for `_internal` as a top-level module in the python source directory, instead of placing it inside the `lance_graph` package. ## Test plan - Verified locally that `maturin develop` works correctly with this change. - CI should pass. Made with [Cursor](https://cursor.com) Co-authored-by: Cursor <cursoragent@cursor.com>
…format#136) This PR exposes the vector rerank method in the `CypherEngine` API, similar to the existing `CypherQuery` API. An example: ```python engine = CypherEngine(config, datasets) results = engine.execute_with_vector_rerank( "MATCH (d:Document) WHERE d.category = 'tech' RETURN d.id, d.name, d.embedding", VectorSearch("d.embedding") .query_vector([1.0, 0.0, 0.0]) .metric(DistanceMetric.L2) .top_k(2), ) data = results.to_pydict() ```
Currently, we cannot create a new release due to the error: `FileNotFoundError: File not found: 'python/Cargo.toml'` This PR fixes the issue by pointing to the correct Python folder.
This PR adds the `lance-graph-catalog` to the `.bumpverison.toml` to unblock the release.
Fix the python's makefile for publishing the python release.
…t#140) ## Summary - Add a VectorSearch.use_lance_index(True) opt-in to run a vector-first path when datasets are Lance datasets. - When enabled, CypherQuery.execute_with_vector_rerank uses Lance ANN (nearest) to get top-k rows for the vector label, then executes Cypher on that reduced dataset.\n- Adds a Python test behind requires_lance to validate the path. ## Notes - This is an explicit opt-in and only applies when the Cypher query has no WITH/WHERE clauses; otherwise it falls back to the existing rerank behavior. - Semantics can differ from candidate-then-rerank if you enable it on filtered queries; the guard avoids that by default. ## Motivation Using Lance datasets with ANN indices can significantly reduce latency for GraphRAG hybrid retrieval on large datasets by avoiding full-table reranking. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
) ## Summary Optimize CI workflows to dramatically reduce build times through improved caching and build configuration. ## Changes ### 1. Shared Rust Dependency Cache - Added `shared-key: "lance-graph-deps"` to all workflows - Cache is shared across Python, Rust, Build, and Style workflows - Include both `crates/lance-graph` and `crates/lance-graph-python` workspaces - Prevents duplicate compilation across different workflows ### 2. Fix Double Rust Build in Python Tests - Removed `uv pip install -e .[tests]` which was triggering Rust compilation - Now only `maturin develop` builds the extension once - **Result: 24 min → 6 min (74% faster!)** ### 3. Remove Nightly from Rust Test Matrix - Only test on stable toolchain (matches upstream lance project) - Prevents nightly breakage from blocking PRs - Stable is the primary deployment target ### 4. Explicit CARGO_INCREMENTAL=0 - Set explicitly for CI (rust-cache overrides to 0 anyway) - Makes intention clear: incremental artifacts not useful in CI ## Performance Impact **Before:** - Python Tests: ~24 minutes - Rust compiled twice (18min + 5min) - No cache sharing across workflows **After:** - Python Tests: **6 minutes (74% faster)** ✅ - Rust compiled once (5min) - Shared cache across all workflows ## Test Results All CI checks passing: - ✅ Python Tests (test 3.11) - 6 minutes - ✅ Rust Tests (stable) - ✅ Rust Coverage - ✅ Build (stable) - ✅ All Linters (clippy, format, Python lint, spell check) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Achieved returning full node objects by expanding to all properties in logical_plan. Also added unit and system tests Closes lance-format#110
Add SqlQuery and SqlEngine that let users run standard SQL directly against their datasets without requiring a GraphConfig. This is useful for data analytics workflows where users want explicit JOINs and aggregations against node/relationship tables. DataFusion handles SQL parsing and execution.
…r architecture
Add support for browsing and querying tables from Unity Catalog (OSS).
Inspired by Presto's connector SPI, the design cleanly separates:
- CatalogProvider trait: catalog metadata browsing (UC first, extensible
to Hive Metastore, AWS Glue, Iceberg REST Catalog)
- TableReader trait: format-specific data reading (Parquet + Delta Lake,
extensible to CSV, Iceberg, ORC)
- Connector struct: facade bundling catalog + readers
Key features:
- Full UC REST API client (list/get catalogs, schemas, tables, columns)
- UC type → Arrow type mapping (20 type mappings)
- ParquetTableReader via DataFusion register_parquet()
- DeltaTableReader via deltalake 0.29 (behind "delta" feature flag)
- Auto-register UC tables into SqlEngine via create_sql_engine()
- Python bindings: UnityCatalog, CatalogInfo, SchemaInfo, TableInfo
- 15 wiremock integration tests for UC REST client
- 12 type mapping unit tests
- 9 Python unit tests
Python API:
uc = UnityCatalog("http://localhost:8080/api/2.1/unity-catalog")
engine = uc.create_sql_engine("unity", "default")
result = engine.execute("SELECT * FROM my_table")
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- cargo fmt fixes across all new files - Replace EnumName::Variant with Self::Variant (clippy::unnecessary_structure_name_repetition) - Fix Python import sorting and line length (ruff)
…re, GCS)
- Add `storage_options` parameter to `TableReader::register_table()` trait
- `Connector::with_storage_options()` stores credentials and passes them
to table readers during registration
- `DeltaTableReader` uses `open_table_with_storage_options()` when
storage options are provided
- Enable deltalake cloud features: s3, azure, gcs
- Python: `UnityCatalog(url, storage_options={...})` accepts cloud creds
Usage:
uc = UnityCatalog(
"http://localhost:8080/api/2.1/unity-catalog",
storage_options={
"azure_storage_account_name": "myaccount",
"azure_storage_account_key": "...",
}
)
engine = uc.create_sql_engine("unity", "default")
Add examples for UnityCatalog browsing, create_sql_engine, and cloud storage options (S3, Azure, GCS) to both project and Python READMEs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SqlEngineCatalogProvider) and data format reading (TableReader)Architecture
Inspired by Presto's connector SPI:
CatalogProvidertraitConnectorMetadata)TableReadertraitConnectorPageSourceProvider)ConnectorstructConnector)Extensibility guarantee:
CatalogProvider, reuses existing Delta/Parquet readersTableReader, works with any catalogPython API
New files
lance-graph-catalog:catalog_provider.rs,table_reader.rs,connector.rs,type_mapping.rs,unity_catalog.rslance-graph:table_readers.rs(Parquet + Delta),sql_catalog.rs(bridge)lance-graph-python:catalog.rs(PyO3 bindings)Test plan