Commit cb49627

feat: generalize storage_options for GCS/Azure and any object_store backend (#46)
## Summary

Closes #45. The Rust + PyO3 layer already forwards an arbitrary `storage_options` dict through to lance's `DatasetBuilder::with_storage_options`, but the Python constructor only exposed ergonomic shortcuts for AWS keys, which made it look like GCS/Azure weren't supported and blocked tiered-memory work on `gs://` URIs.

This PR:

- Promotes `storage_options` to the canonical, backend-agnostic way to configure remote stores, aligned with how `lance` and `lance-graph` handle backends.
- Keeps `aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`, `region`, `endpoint_url`, and `allow_http` as backwards-compatible shortcuts that now emit a single `DeprecationWarning` pointing callers at `storage_options`.
- Documents S3, GCS, and Azure usage in the Python docstring and the README.
- Adds `python/tests/test_storage_options.py` with 8 unit tests covering merge / pass-through / precedence / deprecation semantics.
- Adds `python/tests/test_gcs_persistence.py`, an opt-in real-GCS integration test gated on `LANCE_CONTEXT_GCS_BUCKET` plus one of `LANCE_CONTEXT_GCS_SERVICE_ACCOUNT_KEY` / `GOOGLE_APPLICATION_CREDENTIALS` / `LANCE_CONTEXT_GCS_ENDPOINT` (for `fake-gcs-server`-style emulators).
- Rewrites the existing S3 integration test to use the canonical `storage_options=` path and adds a companion back-compat test that asserts the deprecated AWS kwargs still work and emit the warning.
- Fixes the `moto.server` subprocess invocation for moto >= 5 (dropped the positional service arg) and adds `moto[s3,server]` so `flask` / `flask-cors` are available; without this the S3 suite was being silently skipped on modern environments.

No behavior change for local or S3 callers that already pass `storage_options=`.
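The merge / precedence / deprecation semantics described in the summary can be sketched standalone. This is a simplified mirror for illustration only (the function name and key map are hypothetical, not the package's actual code): explicit `storage_options` keys win over the deprecated AWS kwargs, and exactly one `DeprecationWarning` fires when any legacy kwarg is used.

```python
from __future__ import annotations

import warnings
from typing import Any


def merge_storage_options(
    storage_options: dict[str, Any] | None, **aws_kwargs: Any
) -> dict[str, Any]:
    """Hypothetical mirror of the PR's merge semantics."""
    # Legacy kwarg names map onto the generic object_store key names.
    key_map = {"region": "aws_region", "endpoint_url": "aws_endpoint_url"}
    options = dict(storage_options or {})
    used = {k: v for k, v in aws_kwargs.items() if v is not None}
    if used:
        # One warning total, regardless of how many legacy kwargs appear.
        warnings.warn(
            f"Deprecated AWS kwargs: {sorted(used)}; pass `storage_options` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    for name, value in used.items():
        # setdefault gives the generic dict precedence over legacy kwargs.
        options.setdefault(key_map.get(name, name), value)
    return options


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    merged = merge_storage_options(
        {"aws_region": "eu-west-1"},  # explicit option...
        region="us-east-1",           # ...beats the deprecated kwarg
        aws_access_key_id="abc",
    )

assert merged == {"aws_region": "eu-west-1", "aws_access_key_id": "abc"}
assert len(caught) == 1 and caught[0].category is DeprecationWarning
```

The `setdefault` call is what gives `storage_options` precedence, matching the "precedence" cases the new unit tests cover.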
## Notes

An emulator-based GCS test was explored with `gcp-storage-emulator`, but it has two compatibility issues outside this repo's scope: (1) it imports `fs` (pyfilesystem2), which calls `pkg_resources.declare_namespace`, broken under setuptools >= 81 (Python 3.12+); (2) its JSON responses fail OpenDAL's deserializer with `invalid type: null, expected a string`. The opt-in integration test works against `fake-gcs-server` (a Go binary) and real GCS, which is what the UP+GCS tiered-memory rollout will actually use.

## Test plan

- [x] `cargo fmt --all -- --check`
- [x] `cargo clippy --all-targets -- -D warnings`
- [x] `cargo test --workspace` (6 passed)
- [x] `uv run pytest` (39 passed, 2 skipped: PIL image test + opt-in GCS)
- [x] `uv run ruff format --check python/`
- [x] `uv run ruff check python/`
- [x] `uv run pyright` (0 errors)
- [ ] Run `test_gcs_persistence.py` against real GCS with `LANCE_CONTEXT_GCS_BUCKET=... GOOGLE_APPLICATION_CREDENTIALS=...` from the UP+GCS environment before/after merge to confirm end-to-end behavior for the tiered-memory work

Made with [Cursor](https://cursor.com)
1 parent 0ba80e5 commit cb49627

7 files changed

Lines changed: 1085 additions & 43 deletions


README.md

Lines changed: 37 additions & 9 deletions
```diff
@@ -18,7 +18,8 @@ Key motivations inspired by the broader Lance roadmap<sup>[1](https://github.com
 - Unified schema for agent messages (`ContextRecord`) with optional embeddings and metadata.
 - Automatic versioning via Lance manifests with `checkout(version)` support.
 - Background compaction to optimize storage and read performance.
-- Remote persistence: point the store at `s3://` URIs with either AWS environment variables or explicit credentials/endpoint overrides.
+- Remote persistence on any `object_store` backend (S3, GCS, Azure Blob, ...)
+  via the generic `storage_options` dict, aligned with `lance` and `lance-graph`.
 - Python API (`lance_context.api.Context`) aligned with the Rust implementation.
 - Integration tests that exercise real persistence, image serialization, and version rollbacks.
 
@@ -68,16 +69,42 @@ ctx.checkout(first_version)
 
 print("Entries after checkout:", ctx.entries())
 
-# Store context in S3 (e.g., for MinIO/moto test endpoints)
+# Remote persistence on any object_store backend uses a generic `storage_options`
+# dict, matching the conventions used by `lance` and `lance-graph`.
+#
+# Amazon S3 (and S3-compatible endpoints like MinIO / moto):
 ctx = Context.create(
     "s3://my-bucket/context.lance",
-    aws_access_key_id="minioadmin",
-    aws_secret_access_key="minioadmin",
-    region="us-east-1",
-    endpoint_url="http://localhost:9000",
-    allow_http=True,
+    storage_options={
+        "aws_access_key_id": "minioadmin",
+        "aws_secret_access_key": "minioadmin",
+        "aws_region": "us-east-1",
+        "aws_endpoint_url": "http://localhost:9000",  # optional
+        "aws_allow_http": "true",  # optional
+    },
+)
+# Environment variables (AWS_ACCESS_KEY_ID, ...) are picked up by lance when
+# `storage_options` isn't provided; pass overrides only when you need them.
+
+# Google Cloud Storage:
+ctx = Context.create(
+    "gs://my-bucket/context.lance",
+    storage_options={
+        # Pick one: inline service-account JSON, path to the JSON file, or ADC.
+        "google_service_account_key": service_account_json,
+        # "google_service_account_path": "/path/to/sa.json",
+        # "google_application_credentials": "/path/to/adc.json",
+    },
+)
+
+# Azure Blob Storage:
+ctx = Context.create(
+    "az://my-container/context.lance",
+    storage_options={
+        "azure_storage_account_name": "...",
+        "azure_storage_account_key": "...",
+    },
 )
-# AWS_* environment variables work too—pass overrides only when you need custom endpoints.
 
 # Background Compaction - optimize storage and read performance
 ctx = Context.create(
@@ -141,7 +168,8 @@ println!("Current version {}", store.version());
 
 We are tracking future enhancements as GitHub issues:
 
-- [Support S3-backed context stores](https://github.com/lance-format/lance-context/issues/14)
+- ~~[Support S3-backed context stores](https://github.com/lance-format/lance-context/issues/14)~~**Implemented**
+- ~~[Support standard storage_options / GCS](https://github.com/lance-format/lance-context/issues/45)~~**Implemented**
 - [Add relationship column for GraphRAG workflows](https://github.com/lance-format/lance-context/issues/15)
 - ~~[Background compaction for Lance fragments](https://github.com/lance-format/lance-context/issues/16)~~**Implemented**
 
```

python/pyproject.toml

Lines changed: 3 additions & 1 deletion
```diff
@@ -41,7 +41,9 @@ requires = ["maturin>=1.4"]
 build-backend = "maturin"
 
 [project.optional-dependencies]
-tests = ["pytest", "ruff", "moto[s3]", "boto3", "botocore"]
+# `moto[server]` pulls in flask + flask-cors so moto.server can be launched
+# as a subprocess for the S3 integration tests.
+tests = ["pytest", "ruff", "moto[s3,server]", "boto3", "botocore"]
 dev = ["ruff", "pyright"]
 
 [tool.ruff]
```
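The commit message notes that moto >= 5 dropped the positional service argument from the `moto.server` invocation (one multi-service server replaced the per-service ones). A small sketch of building a version-appropriate command line; the helper name is invented for illustration, not part of this repo:

```python
import sys


def moto_server_argv(port: int, moto_major: int) -> list[str]:
    """Hypothetical helper: build the moto server command line per version."""
    argv = [sys.executable, "-m", "moto.server", "-p", str(port)]
    if moto_major < 5:
        # Legacy (moto < 5): a positional service argument selected which
        # service to serve, e.g. `python -m moto.server s3 -p 5000`.
        argv.insert(3, "s3")
    # moto >= 5 serves all services from one endpoint; passing the old
    # positional argument now fails, which is why the suite was skipped.
    return argv


assert "s3" in moto_server_argv(5000, moto_major=4)
assert "s3" not in moto_server_argv(5000, moto_major=5)
```

Dropping the positional argument is the actual fix this PR applies to the S3 test harness.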

python/python/lance_context/api.py

Lines changed: 122 additions & 17 deletions
```diff
@@ -1,5 +1,6 @@
 from __future__ import annotations
 
+import warnings
 from datetime import datetime
 from io import BytesIO
 from typing import Any
@@ -134,40 +135,145 @@ def _normalize_search_hit(raw: dict[str, Any]) -> dict[str, Any]:
     return result
 
 
+_AWS_KWARG_MAP: dict[str, str] = {
+    "aws_access_key_id": "aws_access_key_id",
+    "aws_secret_access_key": "aws_secret_access_key",
+    "aws_session_token": "aws_session_token",
+    "region": "aws_region",
+    "endpoint_url": "aws_endpoint_url",
+}
+
+
+def _merge_storage_options(
+    storage_options: dict[str, Any] | None,
+    *,
+    aws_access_key_id: str | None,
+    aws_secret_access_key: str | None,
+    aws_session_token: str | None,
+    region: str | None,
+    endpoint_url: str | None,
+    allow_http: bool,
+) -> dict[str, Any]:
+    """Merge deprecated AWS-specific kwargs into a generic storage_options dict.
+
+    Emits a single DeprecationWarning when any AWS kwarg is used so callers
+    can migrate to the generic `storage_options` path (which works for S3,
+    GCS, Azure, and any other lance/object_store backend).
+    """
+    options: dict[str, Any] = dict(storage_options or {})
+
+    aws_kwargs = {
+        "aws_access_key_id": aws_access_key_id,
+        "aws_secret_access_key": aws_secret_access_key,
+        "aws_session_token": aws_session_token,
+        "region": region,
+        "endpoint_url": endpoint_url,
+    }
+    used = [name for name, value in aws_kwargs.items() if value is not None]
+    if allow_http:
+        used.append("allow_http")
+
+    if used:
+        warnings.warn(
+            "The AWS-specific kwargs "
+            f"({', '.join(sorted(used))}) are deprecated and will be removed in a "
+            "future release. Pass credentials via the generic "
+            "`storage_options` dict instead (e.g. "
+            "storage_options={'aws_access_key_id': ..., "
+            "'aws_secret_access_key': ...} for S3, or "
+            "storage_options={'google_service_account_key': ...} for GCS).",
+            DeprecationWarning,
+            stacklevel=3,
+        )
+
+    for kwarg_name, option_key in _AWS_KWARG_MAP.items():
+        value = aws_kwargs[kwarg_name]
+        if value is not None:
+            options.setdefault(option_key, value)
+    if allow_http:
+        options.setdefault("aws_allow_http", True)
+
+    return options
+
+
 class Context:
+    """Multimodal, versioned context store backed by Lance.
+
+    Storage backends are configured via the generic ``storage_options`` dict,
+    aligned with the conventions used by ``lance`` and ``lance-graph``. Any
+    keys understood by the underlying ``object_store`` crate are accepted.
+
+    Examples:
+        Local filesystem::
+
+            Context.create("/tmp/context.lance")
+
+        Amazon S3 (or S3-compatible endpoints like MinIO / moto)::
+
+            Context.create(
+                "s3://bucket/prefix/context.lance",
+                storage_options={
+                    "aws_access_key_id": "...",
+                    "aws_secret_access_key": "...",
+                    "aws_region": "us-east-1",
+                    "aws_endpoint_url": "http://localhost:9000",  # optional
+                    "aws_allow_http": "true",  # optional
+                },
+            )
+
+        Google Cloud Storage::
+
+            Context.create(
+                "gs://bucket/prefix/context.lance",
+                storage_options={
+                    # Any one of these is enough; pick whatever fits your
+                    # deployment (inline JSON, file path, or ADC).
+                    "google_service_account_key": service_account_json,
+                    # "google_service_account_path": "/path/to/sa.json",
+                    # "google_application_credentials": "/path/to/adc.json",
+                },
+            )
+
+        Azure Blob Storage::
+
+            Context.create(
+                "az://container/prefix/context.lance",
+                storage_options={
+                    "azure_storage_account_name": "...",
+                    "azure_storage_account_key": "...",
+                },
+            )
+    """
+
     def __init__(
         self,
         uri: str,
         *,
         storage_options: dict[str, Any] | None = None,
+        # --- Deprecated AWS-specific shortcuts (kept for backwards compat). ---
         aws_access_key_id: str | None = None,
         aws_secret_access_key: str | None = None,
         aws_session_token: str | None = None,
         region: str | None = None,
        endpoint_url: str | None = None,
         allow_http: bool = False,
-        # Compaction configuration
+        # --- Compaction configuration. ---
         enable_background_compaction: bool = False,
         compaction_interval_secs: int = 300,
         compaction_min_fragments: int = 5,
         compaction_target_rows: int = 1_000_000,
         quiet_hours: list[tuple[int, int]] | None = None,
     ) -> None:
-        options = dict(storage_options or {})
-        if aws_access_key_id is not None:
-            options["aws_access_key_id"] = aws_access_key_id
-        if aws_secret_access_key is not None:
-            options["aws_secret_access_key"] = aws_secret_access_key
-        if aws_session_token is not None:
-            options["aws_session_token"] = aws_session_token
-        if region is not None:
-            options["aws_region"] = region
-        if endpoint_url is not None:
-            options["aws_endpoint_url"] = endpoint_url
-        if allow_http:
-            options["aws_allow_http"] = True
-
-        # Build compaction config
+        options = _merge_storage_options(
+            storage_options,
+            aws_access_key_id=aws_access_key_id,
+            aws_secret_access_key=aws_secret_access_key,
+            aws_session_token=aws_session_token,
+            region=region,
+            endpoint_url=endpoint_url,
+            allow_http=allow_http,
+        )
+
        compaction_config = {
             "enabled": enable_background_compaction,
             "check_interval_secs": compaction_interval_secs,
@@ -197,7 +303,6 @@ def create(
         region: str | None = None,
         endpoint_url: str | None = None,
         allow_http: bool = False,
-        # Compaction configuration
         enable_background_compaction: bool = False,
         compaction_interval_secs: int = 300,
         compaction_min_fragments: int = 5,
```
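Note the `stacklevel=3` in `_merge_storage_options` above: the warning passes through two frames (the helper and `Context.__init__`) so it is attributed to the user's call site. A minimal standalone illustration of that mechanic, with invented function names standing in for the real call chain:

```python
import warnings


def _merge() -> None:
    # stacklevel=1 would blame _merge itself, 2 would blame init(),
    # and 3 blames the line that called init() -- the user's code.
    warnings.warn("deprecated AWS kwargs", DeprecationWarning, stacklevel=3)


def init() -> None:
    _merge()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    init()  # the recorded warning is attributed to this line

assert len(caught) == 1
assert caught[0].category is DeprecationWarning
```

This matters for `DeprecationWarning` in particular, since Python's default filter only shows it when it originates in `__main__`; pointing the warning at the caller makes the deprecation visible in scripts.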
python/tests/test_gcs_persistence.py

Lines changed: 90 additions & 0 deletions

```diff
@@ -0,0 +1,90 @@
+"""Opt-in GCS integration tests.
+
+Skipped by default. To run locally against a real (or emulated) GCS:
+
+    # Option A: real GCS
+    export LANCE_CONTEXT_GCS_BUCKET=my-test-bucket
+    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa.json
+    uv run pytest python/tests/test_gcs_persistence.py -v
+
+    # Option B: against fake-gcs-server or another emulator, point the
+    # relevant storage_options at the emulator endpoint (e.g. via
+    # `use_opendal=true`, `endpoint=http://...`, `allow_anonymous=true`).
+    export LANCE_CONTEXT_GCS_BUCKET=test-bucket
+    export LANCE_CONTEXT_GCS_ENDPOINT=http://127.0.0.1:4443
+    uv run pytest python/tests/test_gcs_persistence.py -v
+
+These tests intentionally do not bring up their own emulator in CI because
+there is no pure-Python GCS emulator that is both (a) maintained on modern
+Python and (b) fully compatible with the lance-io GCS backend. The S3
+suite uses moto, which has no GCS counterpart of equivalent quality.
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+import uuid
+from pathlib import Path
+
+import pytest
+
+PACKAGE_ROOT = Path(__file__).resolve().parents[2] / "python" / "python"
+if str(PACKAGE_ROOT) not in sys.path:
+    sys.path.insert(0, str(PACKAGE_ROOT))
+
+from lance_context.api import Context  # noqa: E402
+
+lance = pytest.importorskip("lance")
+
+GCS_BUCKET = os.environ.get("LANCE_CONTEXT_GCS_BUCKET")
+GCS_ENDPOINT = os.environ.get("LANCE_CONTEXT_GCS_ENDPOINT")
+GCS_SA_JSON = os.environ.get("LANCE_CONTEXT_GCS_SERVICE_ACCOUNT_KEY")
+GCS_ADC = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
+
+_has_gcs_config = bool(GCS_BUCKET) and (
+    bool(GCS_SA_JSON) or bool(GCS_ADC) or bool(GCS_ENDPOINT)
+)
+
+pytestmark = pytest.mark.skipif(
+    not _has_gcs_config,
+    reason=(
+        "Set LANCE_CONTEXT_GCS_BUCKET plus one of "
+        "LANCE_CONTEXT_GCS_SERVICE_ACCOUNT_KEY / "
+        "GOOGLE_APPLICATION_CREDENTIALS / "
+        "LANCE_CONTEXT_GCS_ENDPOINT to run GCS integration tests."
+    ),
+)
+
+
+def _gcs_storage_options() -> dict[str, str]:
+    options: dict[str, str] = {}
+    if GCS_SA_JSON is not None:
+        options["google_service_account_key"] = GCS_SA_JSON
+    if GCS_ADC is not None:
+        options["google_application_credentials"] = GCS_ADC
+    if GCS_ENDPOINT is not None:
+        # Emulator path: OpenDAL backend supports a custom endpoint and
+        # anonymous auth, which is how fake-gcs-server is typically driven.
+        options["use_opendal"] = "true"
+        options["endpoint"] = GCS_ENDPOINT
+        options.setdefault("allow_anonymous", "true")
+    return options
+
+
+def test_gcs_round_trip_via_storage_options() -> None:
+    """End-to-end: Context.create(gs://...) with generic storage_options."""
+    assert GCS_BUCKET is not None
+    key = f"contexts/{uuid.uuid4().hex}/context.lance"
+    uri = f"gs://{GCS_BUCKET}/{key}"
+    options = _gcs_storage_options()
+
+    ctx = Context.create(uri, storage_options=options)
+
+    ctx.add("user", "gcs-hello")
+    ctx.add("assistant", "gcs-response")
+    assert ctx.entries() == 2
+
+    dataset = lance.dataset(uri, storage_options=options)
+    rows = dataset.to_table().to_pylist()
+    assert [row["text_payload"] for row in rows] == ["gcs-hello", "gcs-response"]
```
