feat(ci): support publishing wheels to custom PyPI index by jackye1995 · Pull Request #94 · jackye1995/lance

jackye1995 · 2026-04-07T22:57:42Z

Summary

Add custom_pypi_url and custom_pypi_token workflow dispatch inputs to pypi-publish.yml
Update upload_wheel action to support custom PyPI uploads via twine --repository-url
Allow manual publishing of any released version to custom indexes like Azure Artifacts
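
A minimal sketch of what a custom-index upload could look like when driven from Python. The helper name `build_twine_upload`, the example URL, and the default `__token__` username are assumptions for illustration (the username a given index expects may differ); only `twine upload --repository-url` and the `TWINE_USERNAME` / `TWINE_PASSWORD` environment variables are standard twine features.

```python
import os
from typing import Dict, List, Tuple


def build_twine_upload(
    wheels: List[str],
    repository_url: str,
    token: str,
    username: str = "__token__",  # assumption: token-style auth; adjust per index
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for uploading wheels to a custom index with twine."""
    argv = [
        "twine",
        "upload",
        "--non-interactive",
        "--repository-url",
        repository_url,
        *wheels,
    ]
    # Pass credentials through the environment so the token never
    # appears on the command line or in process listings.
    env = dict(os.environ)
    env["TWINE_USERNAME"] = username
    env["TWINE_PASSWORD"] = token
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env, check=True)`. Keeping the token out of `argv` is a deliberate choice here, since workflow logs often echo commands.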

Test plan

Manually trigger workflow with custom PyPI URL and token to verify upload works
Verify existing release and workflow_dispatch flows still work as expected

🤖 Generated with Claude Code

…6146) fix CI error: `FAILED python/tests/test_integration.py::test_duckdb_pushdown_extension_types - _duckdb.Error: DeprecationWarning: fetch_arrow_table() is deprecated, use to_arrow_table() instead.`

Over 20% faster for a 2GB index; the gain could be larger for larger indexes.

) This PR fixes the regression benchmarks workflow failing to resolve the pinned `google-github-actions/auth` action. The workflow had quoted the entire `uses` value, which caused the trailing `# v2` comment to be parsed as part of the action ref.

There was a conflict table in transaction.rs but this was incomplete (some rows/columns missing) and seemed to be imprecise or incorrect in a few spots. I've attempted to more thoroughly document this in transaction.md instead.

…ance-format#6160) Previously, `adjust_child_validity` would call `ArrayData::try_new` with a null bitmap on a `DataType::Null` array, causing an `.unwrap()` panic with `InvalidArgumentError("Arrays of type Null cannot contain a null bitmask")`. The trigger: when a user inserts rows where a struct sub-field has only null values, Arrow infers `DataType::Null` for that column. If a subsequent fragment omits that nullable sub-field, Lance inserts a `NullReader` to fill it in. `MergeStream` then merges the real batch (with null struct rows) and the `NullReader` batch (all-null struct), recursing into the struct where `adjust_child_validity` is called with the `Null`-typed child and a non-empty parent validity — triggering the panic. Fix: skip the bitmask operation when `child.data_type() == DataType::Null`. A `Null` array is always entirely null by definition and needs no validity adjustment. Closes lance-format#6159 --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

…e-format#6163) Previously, when `FragReuseIndexDetails` exceeded 204800 bytes (triggered by large compactions with many fragments), the code wrote the details to an external file (`details.binpb`). On local filesystems, `ObjectStore::create` returns a `LocalWriter` that atomically renames a temp file to the final path in `Writer::shutdown`. However, `frag_reuse.rs` imported `tokio::io::AsyncWriteExt` but not `lance_io::traits::Writer`, so `writer.shutdown()` resolved to `AsyncWriteExt::shutdown` (flush/close only) — the temp file was deleted on drop without being persisted. Any subsequent `load_indices` call would fail with `Not found: .../details.binpb`. Fixed by using UFCS `Writer::shutdown(writer.as_mut()).await?` to explicitly call the lance trait method, matching the existing pattern in `ivf.rs` and `blob.rs`. Fixes lance-format#6161 --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

This breaks the "build_partitions" stage into "build_partitions" and "merge_partitions", and also updates the progress reporting on the shuffle phase to be in terms of rows instead of batches.

This PR moves a few unrelated clippy cleanups out of lance-format#6168 so the blob empty-range fix can stay focused on the regression it addresses. The changes here are all mechanical simplifications with no intended behavior change.

…t#6175) This PR moves the Linux and Windows workflows that currently run on Warp onto GitHub-hosted runners. The goal is to reduce reliance on custom runners and take advantage of the sponsored larger GitHub-hosted machines for the slowest CI paths. This is focused on the current CI bottlenecks we observed in recent successful PR runs, especially Rust ARM and Python Windows jobs, while keeping the existing macOS and benchmark-specific runners unchanged until we verify equivalent GitHub-hosted options for them.

Context:
- Recent PR history shows Rust `linux-arm` and Python `windows` as the dominant critical-path jobs.
- This change upgrades those jobs to larger GitHub-hosted runners where available (`ubuntu-24.04-8x`, `ubuntu-24.04-arm64-8x`, `windows-latest-4x`) and aligns the remaining Linux/Windows workflows with the same runner family.
- I validated the workflow YAML locally after the runner migration; no product code or test logic changed.

---

Updates:
- Rust linux-arm: 40.7 -> 19.4, about -52%
- Rust windows-build: 27.7 -> 21.0, about -24%
- Python windows: 36.5 -> 23.1, about -37%
- Python Linux 3.13 ARM: 26.9 -> 20.7, about -23%
- Python Linux 3.13 x86_64: 26.8 -> 19.1, about -29%
- Python Linux 3.9 x86_64: 25.9 -> 19.2, about -26%
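
The percentages in the update list follow from a plain relative-change calculation. A small sketch that recomputes them from the before/after values quoted above (the helper name `percent_change` is only illustrative, and the source does not state the unit of the timings):

```python
def percent_change(before: float, after: float) -> int:
    """Relative change from `before` to `after`, rounded to a whole percent."""
    return round((after - before) / before * 100)


# (before, after) pairs copied from the update list above.
timings = {
    "Rust linux-arm": (40.7, 19.4),
    "Rust windows-build": (27.7, 21.0),
    "Python windows": (36.5, 23.1),
    "Python Linux 3.13 ARM": (26.9, 20.7),
    "Python Linux 3.13 x86_64": (26.8, 19.1),
    "Python Linux 3.9 x86_64": (25.9, 19.2),
}

for job, (before, after) in timings.items():
    print(f"{job}: {percent_change(before, after)}%")
```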

Improvements lance-format#4247 alicloud storage config doc. Signed-off-by: FarmerChillax <farmerchillax@outlook.com>

Blob reads should return empty bytes when the logical blob is empty or the cursor is already at EOF. Today `BlobFile::read` / `read_up_to` can still issue a `get_range(start..end)` request with `start == end`, which is tolerated by local readers but rejected by cloud object stores. This showed up while investigating `random_blob` failures on the original-scale `laion10m-full` dataset, where legacy blob reads on S3 failed with errors like `Range started at 1 and ended at 1`. The fix short-circuits empty reads and restores the cursor to blob-relative semantics after `read()`, and adds regression coverage for both the empty-range case and packed-blob cursor behavior.
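
The short-circuit can be sketched in a few lines. This is a Python illustration under stated assumptions, not the Rust implementation: `RangeStore` and `BlobReader` are hypothetical stand-ins, with `RangeStore` mimicking a cloud object store that rejects `start == end` requests.

```python
class RangeStore:
    """Hypothetical stand-in for an object store that rejects empty ranges."""

    def __init__(self, data: bytes):
        self.data = data
        self.calls = 0

    def get_range(self, start: int, end: int) -> bytes:
        self.calls += 1
        if start >= end:
            raise ValueError(f"Range started at {start} and ended at {end}")
        return self.data[start:end]


class BlobReader:
    """Reads one blob stored at [offset, offset + size) inside a larger file."""

    def __init__(self, store: RangeStore, offset: int, size: int):
        self.store = store
        self.offset = offset
        self.size = size
        self.cursor = 0  # blob-relative position, not a file offset

    def read(self) -> bytes:
        # Empty blob or cursor already at EOF: return without any I/O.
        if self.cursor >= self.size:
            return b""
        start = self.offset + self.cursor
        end = self.offset + self.size
        data = self.store.get_range(start, end)
        # Keep the cursor blob-relative after the read.
        self.cursor = self.size
        return data
```

A second `read()` on the same reader returns `b""` and never reaches the store, which is the behavior cloud backends need.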

…format#6179)

<img width="1340" height="800" alt="image" src="https://github.com/user-attachments/assets/355caf26-14cb-4823-9474-6e4c9e780823" />

- FTS indexing is ~2.5x faster: this removes the merge phase and produces large partitions directly.
- The memory footprint is reduced by ~60%: posting lists are compressed while they are being built, which saves a lot of memory and reduces fragmented objects in memory.

This also bumps the default worker memory budget from 256MiB to 1GiB because we need to produce larger partitions directly, but the memory footprint is still much lower.

This adds a new param `memory_limit` so that users can control how the indexing should work.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: LuQQiu <luqiujob@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>

…ity (lance-format#6182)

…#6187) This fixes the reader panic in lance-format#6185 when a page keeps nullable rep/def layer metadata but does not materialize any definition levels. The decoder now treats that page-local state as all-valid and includes a regression test that reproduces the mixed-page case before the fix. Closes lance-format#6185.

) This fixes the merge-insert fast path for delete-by-source operations while preserving the existing `UpdateIf` semantics. It also keeps full-schema `FixedSizeList` merges on the optimized path so target-side payload columns are pruned from the join build side. Fix lancedb/lancedb#3094

This updates the benchmark TPC-H datagen path to use DuckDB's `to_arrow_reader()` API instead of the deprecated `fetch_arrow_reader()` call. The benchmark CI treats `DeprecationWarning` as an error, so this removes the warning that was breaking the random access benchmark job. I also dropped a leftover `print(ds.count_rows())` debug statement to keep benchmark logs clean.

…lance-format#6191) Signed-off-by: BubbleCal <bubble-cal@outlook.com>

In retrospect the old name was somewhat presumptuous. It would probably be good to get the Arrow project's permission before taking up cargo real estate. This also adds a README which was preventing the publish.

…mat#6145)

## Summary

Closes lance-format#6138

This PR extends `index_matches_criteria()` in `rust/lance/src/index/scalar.rs` to handle vector index types in addition to scalar indices.

## Problem

Previously, `index_matches_criteria()` contained an early return at lines 464-467 that rejected all non-scalar (vector) indices. This made it impossible to use `describe_indices` to filter for vector indices on a specific column.

## Solution

- Removed the early return that rejected all vector indices
- Refactored FTS and exact equality checks to only apply to scalar indices (these checks are not relevant for vector indices)
- Vector indices now pass through when matching basic criteria (name and column filters)

## Changes

- 1 file modified: `rust/lance/src/index/scalar.rs`
- 15 lines added, 16 lines removed
- Updated existing test `test_index_matches_criteria_vector_index()` to reflect the new expected behavior

## Testing

- Updated the existing unit test for vector index criteria matching
- The test now correctly expects vector indices to match basic criteria instead of being rejected

## AI Disclosure

This contribution was developed with the assistance of Claude (AI by Anthropic). The implementation approach, code, and PR description were AI-assisted. All changes are focused on resolving the specific issue described above.

Co-Authored-By: AI Assistant (Claude) <ai-assistant@contributor-bot.dev>
Signed-off-by: ndpvt-web <ndpvt-web@users.noreply.github.com>
Co-authored-by: ndpvt-web <ndpvt-web@users.noreply.github.com>
Co-authored-by: AI Assistant (Claude) <ai-assistant@contributor-bot.dev>

…er (lance-format#6197) Signed-off-by: BubbleCal <bubble-cal@outlook.com>

…rmat#6194) This PR makes two changes to ensure stale credentials are not used: (1) In the Directory namespace if either vending is not enabled or a credential vendor is not configured we return `None` for storage options. (2) The `DynamicStorageOptionsCredentialProvider` falls back to the default credential provider (lazily loaded) if it is not able to retrieve credentials. Closes lance-format/lance-spark#292 --------- Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>

…ance-format#6119) SimpleIndex (HNSW over centroids) previously only supported fp32 centroids, causing fp16 vector workloads to fall back to brute-force partition assignment — O(K×D) per vector instead of O(log K × D). For 31K centroids × 1024 dims this is a ~600x difference per vector. Cast fp16 centroids to fp32 at HNSW construction time (one-time cost) and cast fp16 query vectors at search time (1024 floats per query, negligible vs the distance computations saved). --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Xuanwo <github@xuanwo.io>

lance-format#6142) Previously we would use the default file version when creating new index files. This was originally done to get some testing of the 2.0 format before it was made the default. However, this led to a bit of a potential compatibility problem. If we change the default file version then the files created by the new release would become unreadable on very old versions that didn't know how to read that file, even if the dataset itself had an older file version and the old version knew how to handle the index otherwise. To avoid this we change things in this PR so that new index files use the same format version as the dataset. This should mean the indexes are always readable if the dataset is readable, regardless of what version was used to write the index. --- Parts of this PR were written with Claude (Opus 4.6) and I take full responsibility for its contents.

# Summary

Support round-tripping bf16 data from PyTorch.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

## Summary

- Add `ObjectStore::prefers_lite_scheduler()` that returns `true` for `file+uring://` stores, so the lite scheduler is used automatically without needing the env var
- Change `SchedulerConfig::use_lite_scheduler` from `bool` to `Option<bool>`: `Some(true/false)` overrides, `None` defers to the object store's preference
- `LANCE_USE_LITE_SCHEDULER` env var still works as an override when explicitly set

## Test plan

- [x] `cargo check -p lance-io --tests --benches` compiles cleanly
- [x] `cargo test -p lance-io`: all 148 tests pass
- [x] `cargo clippy -p lance-io --tests --benches -- -D warnings`: no warnings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
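
The tri-state resolution described above can be illustrated with `Optional[bool]` in Python. This is only a sketch of the precedence between an explicit setting and the store's preference; the function names are hypothetical, and the `LANCE_USE_LITE_SCHEDULER` env var is deliberately not modeled because the description does not spell out where it is applied.

```python
from typing import Optional


def prefers_lite_scheduler(uri: str) -> bool:
    """Store-level preference: only `file+uring://` stores opt in."""
    return uri.startswith("file+uring://")


def resolve_use_lite_scheduler(configured: Optional[bool], uri: str) -> bool:
    """An explicit True/False wins; None defers to the object store."""
    if configured is not None:
        return configured
    return prefers_lite_scheduler(uri)
```

Using `None` as "no opinion" is what lets the default change per store without breaking callers that set the flag explicitly.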

1. Remove `storage_options_provider` in Python and Java. To make managed versioning work, we updated the codebase to pass the namespace and table ID into the Python and Java binding layer, so a language-specific `storage_options_provider` that is then bound to Rust is no longer necessary: we can directly construct the Rust `StorageOptionsProvider` using the bound namespace client.
2. Rename the following: `namespace` to `namespace_client`, `namespace_impl` to `namespace_client_impl`, `namespace_properties` to `namespace_client_properties`, and `namespace` (where it means the namespace path) to `namespace_path`. This is done for all code in Rust, Python, and Java. This rename is based on community feedback and aims to clarify the concept of the Namespace Client SDK and its implementations versus the namespace path like `["ns1", "ns2"]`.
3. Add `vend_input_storage_options` and `ops_metrics_enabled` so that we can now use DirectoryNamespace directly for testing all these changes, without needing to rely on an extra tracking namespace. Update all tests accordingly to use the new feature.
4. Fix the known bug where the Python and Java bindings for non-native namespace client implementations did not fully work with managed versioning due to binding-level model conversion.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Happy <yesreply@happy.engineering>

This change adds `with_index_segments()` for vector queries and makes ANN planning prune to the selected index segments instead of always searching the full logical index. It also makes `with_fragments()` participate in segment selection and flat fallback computation so fragment-filtered and segment-filtered searches stay correct when only part of the logical index is queried. This feature makes distributed search much faster by avoiding loading unrelated index segments.

---

FTS should also support this; it will be added after lance-format#6305 has been merged.

This adds `external_blob_mode="ingest"` for blob v2 writes so datasets can import external blob bytes into Lance-managed storage instead of always persisting URI references. The write path now streams external sources through `lance-io` reader streams for packed and dedicated blobs while preserving inline materialization for small payloads. Closes lance-format#6321.

…at#6375)

## Problem

Scanning `_row_created_at_version` or `_row_last_updated_at_version` is extremely slow on fragments with deletion vectors: **53 seconds for 1M rows** (vs 0.03s for a regular data column on the same table). This makes the `delta()` API (`get_inserted_rows()` / `get_updated_rows()`) unusable on any table that has had deletions.

## Root Cause

In `apply_row_id_and_deletes()` (`lance-table/src/utils/stream.rs`), the version column is built by:

```rust
sequence.versions()
    .skip(r.start as usize)
    .take((r.end - r.start) as usize)
```

`versions()` creates a fresh iterator from the start of the RLE sequence. For each range in the selection, `skip(r.start)` walks through all prior elements, which is O(rows) per range. With deletion vectors creating many small ranges, this becomes O(rows × ranges).

## Fix

Replace with `version_values_for_selection()` that:

- **Fast-paths single-run fragments** (common case: all rows same version) with `vec![version; count]`: O(1)
- **Binary search over precomputed run offsets** for O(log(runs)) per position in the multi-run case

## Benchmark

1M rows, 33% deleted, `_row_created_at_version` scan:

| | Before | After |
|---|---|---|
| Time | 53s | 0.02s |
| Complexity | O(rows × ranges) | O(rows × log(runs)) |

## Minimal Reproduction

```python
import time

import lance
import numpy as np
import pyarrow as pa

uri = "/tmp/test_version_col_perf.lance"
lance.write_dataset(
    pa.table({"val": np.random.randint(0, 1000, 1_000_000)}),
    uri,
    mode="overwrite",
    enable_stable_row_ids=True,
)
ds = lance.dataset(uri)
ds.delete("val < 333")
ds = lance.dataset(uri)  # reopen to pick up the deletion

t0 = time.time()
ds.scanner(columns=["_row_created_at_version"]).to_table()
print(f"_row_created_at_version with deletions: {time.time()-t0:.2f}s")

t0 = time.time()
ds.scanner(columns=["val"]).to_table()
print(f"normal column with deletions: {time.time()-t0:.2f}s")
```

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
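
The lookup strategy from the Fix section can be sketched in Python with `bisect`. This is an illustration of the technique rather than the Rust `version_values_for_selection()` itself: the function name `version_values_for_positions`, the `(version, run_length)` representation, and the assumption that run lengths are positive and positions are in range are all choices made for this sketch.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence, Tuple


def version_values_for_positions(
    runs: Sequence[Tuple[int, int]],  # (version, run_length) pairs, lengths > 0
    positions: Sequence[int],         # row offsets within the fragment
) -> List[int]:
    """Look up the version of each position in a run-length-encoded sequence."""
    # Fast path: a single run means every row has the same version.
    if len(runs) == 1:
        return [runs[0][0]] * len(positions)

    # starts[i] is the row offset at which run i begins.
    lengths = [length for _, length in runs]
    starts = [0] + list(accumulate(lengths))[:-1]

    # For each position, binary-search for the last run starting at or before it.
    return [runs[bisect_right(starts, pos) - 1][0] for pos in positions]
```

Because `starts` is computed once, each lookup costs O(log(runs)) regardless of how many small ranges the deletion vector produces, instead of re-walking the sequence from the beginning for every range.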

…6368) track_batch_for_wal was returning a pre-resolved watcher instead of the actual BatchDurableWatcher, so durable_write=true never blocked waiting for the WAL flush to complete. Also include close_duration in benchmark timing for accurate end-to-end throughput reporting. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ance-format#6210)

…at#6393)

## Summary

- Call `target.put_part` synchronously in `ThrottledMultipartUpload::put_part` to lock in part ordering at creation time, rather than deferring the call into the async future body where await order would determine part order.
- Remove the unnecessary `Arc<Mutex<...>>` wrapper around the inner upload target since `&mut self` already prevents concurrent `put_part` calls.
- Add test `test_throttled_multipart_reorders_parts` that verifies parts are ordered by creation, not by await order.

## Test plan

- [x] `cargo test -p lance-io throttle`: all 21 tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
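
The ordering issue can be shown with a small `asyncio` analogue. This is a sketch of the idea only, not the Rust code: `OrderedUpload` is a hypothetical class whose `put_part` reserves the part slot synchronously and defers only the data transfer to the returned coroutine.

```python
import asyncio
from typing import Coroutine, List, Optional


class OrderedUpload:
    """Hypothetical multipart upload that fixes part order at creation time."""

    def __init__(self) -> None:
        self.parts: List[Optional[bytes]] = []

    def put_part(self, data: bytes) -> Coroutine:
        # Reserve the slot synchronously, before any awaiting happens.
        index = len(self.parts)
        self.parts.append(None)

        async def upload() -> None:
            await asyncio.sleep(0)  # stand-in for the actual network transfer
            self.parts[index] = data

        return upload()


async def demo() -> List[Optional[bytes]]:
    upload = OrderedUpload()
    first = upload.put_part(b"a")
    second = upload.put_part(b"b")
    # Await in the opposite order; the part order must not change.
    await second
    await first
    return upload.parts


print(asyncio.run(demo()))
```

If the slot were reserved inside `upload()` instead, awaiting `second` before `first` would swap the two parts.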

…rmat#6299) Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0.

Release notes, sourced from [requests's releases](https://github.com/psf/requests/releases) and [requests's changelog](https://github.com/psf/requests/blob/main/HISTORY.md):

2.33.0 (2026-03-25)

Announcements
- 📣 Requests is adding inline types. If you have a typed code base that uses Requests, please take a look at [#7271](https://redirect.github.com/psf/requests/issues/7271). Give it a try, and report any gaps or feedback you may have in the issue. 📣

Security
- CVE-2026-25645 `requests.utils.extract_zipped_paths` now extracts contents to a non-deterministic location to prevent malicious file replacement. This does not affect default usage of Requests, only applications calling the utility function directly.

Improvements
- Migrated to a PEP 517 build system using setuptools. ([#7012](https://redirect.github.com/psf/requests/issues/7012))

Bugfixes
- Fixed an issue where an empty netrc entry could cause malformed authentication to be applied to Requests on Python 3.11+. ([#7205](https://redirect.github.com/psf/requests/issues/7205))

Deprecations
- Dropped support for Python 3.9 following its end of support. ([#7196](https://redirect.github.com/psf/requests/issues/7196))

Documentation
- Various typo fixes and doc improvements.

New Contributors
- [@M0d3v1](https://github.com/M0d3v1) made their first contribution in [psf/requests#6865](https://redirect.github.com/psf/requests/pull/6865)
- [@aminvakil](https://github.com/aminvakil) made their first contribution in [psf/requests#7220](https://redirect.github.com/psf/requests/pull/7220)
- [@E8Price](https://github.com/E8Price) made their first contribution in [psf/requests#6960](https://redirect.github.com/psf/requests/pull/6960)
- [@mitre88](https://github.com/mitre88) made their first contribution in [psf/requests#7244](https://redirect.github.com/psf/requests/pull/7244)
- [@magsen](https://github.com/magsen) made their first contribution in [psf/requests#6553](https://redirect.github.com/psf/requests/pull/6553)
- [@Rohan5commit](https://github.com/Rohan5commit) made their first contribution in [psf/requests#7227](https://redirect.github.com/psf/requests/pull/7227)

Full Changelog: https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25

Commits
- [`bc04dfd`](https://github.com/psf/requests/commit/bc04dfd6dad4cb02cd92f5daa81eb562d280a761) v2.33.0
- [`66d21cb`](https://github.com/psf/requests/commit/66d21cb07bd6255b1280291c4fafb71803cdb3b7) Merge commit from fork
- [`8b9bc8f`](https://github.com/psf/requests/commit/8b9bc8fc0f63be84602387913c4b689f19efd028) Move badges to top of README ([#7293](https://redirect.github.com/psf/requests/issues/7293))
- [`e331a28`](https://github.com/psf/requests/commit/e331a288f369973f5de0ec8901c94cae4fa87286) Remove unused extraction call ([#7292](https://redirect.github.com/psf/requests/issues/7292))
- [`753fd08`](https://github.com/psf/requests/commit/753fd08c5eacce0aa0df73fe47e49525c67e0a29) docs: fix FAQ grammar in httplib2 example
- [`774a0b8`](https://github.com/psf/requests/commit/774a0b837a194ee885d4fdd9ca947900cc3daf71) docs(socks): same block as other sections
- [`9c72a41`](https://github.com/psf/requests/commit/9c72a41bec8597f948c9d8caa5dc3f12273b3303) Bump github/codeql-action from 4.33.0 to 4.34.1
- [`ebf7190`](https://github.com/psf/requests/commit/ebf71906798ec82f34e07d3168f8b8aecaf8a3be) Bump github/codeql-action from 4.32.0 to 4.33.0
- [`0e4ae38`](https://github.com/psf/requests/commit/0e4ae38f0c93d4f92a96c774bd52c069d12a4798) docs: exclude Response.is_permanent_redirect from API docs ([#7244](https://redirect.github.com/psf/requests/issues/7244))
- [`d568f47`](https://github.com/psf/requests/commit/d568f47278492e630cc990a259047c67991d007a) docs: clarify Quickstart POST example ([#6960](https://redirect.github.com/psf/requests/issues/6960))
- Additional commits viewable in [compare view](https://github.com/psf/requests/compare/v2.32.5...v2.33.0)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

## Summary

- Threads accurate `data_size` (in bytes) from `DataBlock::data_size()` at the encoding layer through the full decode pipeline to the final `RecordBatch`
- Implements `DataBlock::data_size()` for `Struct` and `Dictionary` variants (were `todo!()`)
- Uses the accurate data size for the "batch is too large" warning instead of Arrow's `get_array_memory_size()`, which over-reports due to shared page buffers
- Changes `DecodeArrayTask::decode()` to return `(ArrayRef, u64)` so data size flows through naturally

## Test plan

- [x] All 364 existing `lance-encoding` tests pass
- [x] `cargo clippy -p lance-encoding --tests -- -D warnings` clean
- [x] `cargo clippy -p lance-file --tests -- -D warnings` clean
- [x] `cargo fmt --all -- --check` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

This updates `python/uv.lock` to match the current result of `uv sync --extra tests --extra dev` in the Python project. The lock file was lagging behind the resolved `lance-namespace` packages, which caused recurring diffs after local syncs. I validated the change by rerunning `uv sync --extra tests --extra dev` successfully in `python/`.

lance-format#6396)

### Description

Removes the leftover `tempfile.NamedTemporaryFile` save in `train_ivf_centroids_on_accelerator`. This was a debugging/checkpoint artifact that is no longer needed: the `IvfModel.save()` API now provides explicit persistence to any URI (local or cloud). The temp file was created with `delete=False` and never cleaned up, leaking disk space over repeated runs.

The CPU training path (Rust `indices.train_ivf_model`) does not have this behavior, so this change also makes the two paths consistent.

### Changes

- **`python/python/lance/vector.py`**: Removed the `tempfile.NamedTemporaryFile` + `np.save` + log line from `train_ivf_centroids_on_accelerator` (3 lines deleted).

Closes lance-format#6395

This PR introduces `LogicalVectorIndex` as a logical aggregate and moves IVF-specific partition inspection into `LogicalIvfView`, so the API boundary matches the actual semantics.

…-format#6367) Avoid confusion with object store regions (e.g., AWS regions) which are unrelated to the MemWAL concept of a unique writer/reader instance. Closes lance-format#6355 Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

This changes segmented vector index optimize so the default rebalance path keeps segment boundaries and rewrites only the single worst segment in each run. It builds on lance-format#6400's logical vector index / IVF view work and avoids the current behavior where segmented optimize treats the logical index as one physical index. I also added a regression test that creates a skewed two-segment IVF index and verifies that optimize replaces only the oversized segment while leaving the other segment untouched.

…rmat#6379) [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=aiohttp&package-manager=uv&previous-version=3.12.15&new-version=3.13.4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…e-format#6394) Expose Python progress callbacks for index creation and segment merge. Add the `IndexProgress` event type and pump Rust async progress events back into Python while waiting on index work.

…lance-format#6132) Closes lance-format#6111

- Implement `count_table_rows` with optional version checkout and predicate filter
- Implement `insert_into_table` with append/overwrite modes via Arrow IPC
- Implement `query_table` supporting vector similarity search (with distance type, nprobes, refine factor, prefilter) and plain scan with filter/limit/offset/version
- Add `lance-linalg` dependency for `DistanceType` in vector search
- Add 8 unit tests covering all new methods and edge cases

## Test plan

- [x] `cargo test -p lance-namespace-impls` passes all new tests
- [x] `count_table_rows` returns correct count with and without predicate filter
- [x] `insert_into_table` correctly appends and overwrites data
- [x] `query_table` returns correct results for vector search and plain scan

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>

)

## Summary

- Add `check_column_indices()` validation in `rust/lance/src/io/commit.rs` that rejects non-leaf fields (structs, lists) with real column indices in v2.1+ data files at commit time, preventing cryptic read-time errors
- Exempts packed structs and blob fields which legitimately have column indices in v2.1+
- Wired into both `commit_transaction` and `do_commit_detached_transaction` paths

Closes lance-format#6412

## Test plan

- [x] `test_check_column_indices_rejects_struct_with_column`: struct with column_index=0 in v2.1 → error
- [x] `test_check_column_indices_rejects_list_with_column`: list with column_index=0 in v2.1 → error
- [x] `test_check_column_indices_allows_correct_v21`: correct indices (non-leaf=-1, leaf>=0) → ok
- [x] `test_check_column_indices_allows_packed_struct`: packed struct with real column_index → ok
- [x] `test_check_column_indices_skips_v20`: non-leaf with column_index>=0 in v2.0 → ok (no validation)
- [x] `cargo clippy -p lance --tests -- -D warnings` passes
- [x] `cargo fmt --all` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
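
The validation rule described above can be sketched in Python. This is a simplified model, not the Rust `check_column_indices()`: the `FieldInfo` dataclass, its attribute names, and the flat list of fields are assumptions made for the illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FieldInfo:
    """Hypothetical, flattened view of one schema field in a data file."""

    name: str
    is_leaf: bool
    column_index: int  # -1 means "no physical column"
    is_packed_struct: bool = False
    is_blob: bool = False


def check_column_indices(fields: List[FieldInfo], file_version: Tuple[int, int]) -> None:
    """Reject non-leaf fields with a real column index in v2.1+ files."""
    if file_version < (2, 1):
        return  # v2.0 files are not validated
    for field in fields:
        exempt = field.is_packed_struct or field.is_blob
        if not field.is_leaf and not exempt and field.column_index >= 0:
            raise ValueError(
                f"non-leaf field {field.name!r} has column index "
                f"{field.column_index}; expected -1"
            )
```

Running the check at commit time means a malformed manifest is rejected before any reader ever sees it.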

…ormat#6416)

## Summary

- Add a 5.0.0 section to the migration guide documenting how `DataFile.column_indices` changed with data storage version 2.1: non-leaf fields (structs, lists) now get `-1` instead of sequential column indices
- Add an admonition to the table format spec's Data Files section noting the version difference
- Includes a concrete before/after example and opt-out instructions

Closes lance-format#6411

## Test plan

- [x] Docs build successfully with `mkdocs build`
- [ ] Verify rendered migration guide section and table format admonition look correct

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…erialization (lance-format#6405) Fixes lance-format#6403

`FragmentMetadata.to_json()` raised `NotImplementedError` when `row_id_meta` was present (`enable_stable_row_ids=True`), because `PyRowIdMeta.asdict()` was stubbed out. This also broke `FragmentMetadata.from_json()` for the round-trip, since `RowIdMeta(**dict)` doesn't work on a PyO3 class.

### Repro

```python
import lance
import pyarrow as pa

uri = "/tmp/repro.lance"
ds = lance.write_dataset(pa.table({"x": [1, 2, 3]}), uri, enable_stable_row_ids=True)
ds.get_fragments()[0].metadata.to_json()
# NotImplementedError: PyRowIdMeta.asdict is not yet supported.
```

### Fix

- Implement `PyRowIdMeta.asdict()` via `pythonize` (Rust struct → Python dict)
- Add `PyRowIdMeta.from_dict()` via `depythonize` (Python dict → Rust struct)
- Update `FragmentMetadata.from_json()` to use `RowIdMeta.from_dict()` instead of `RowIdMeta(**dict)`
- Add JSON round-trip test to the existing `test_fragment_metadata_pickle` parametrized test

…rmat#6420)

This canonicalizes all-valid validity bitmaps into the same rep/def state as no-null arrays, so sub-schema `merge_insert` updates on `data_storage_version=2.2` stop emitting inconsistent control-word metadata and corrupting variable-width pages.

This PR picks up the proposed fix for lance-format#6338 and adds regression coverage for both the end-to-end binary `merge_insert` failure and the underlying repdef canonicalization invariant.

---------

Co-authored-by: Eran Dagan <eran@botika.io>
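The canonicalization idea can be illustrated with a small Python sketch. It is not Lance's repdef code: representing a validity bitmap as a list of booleans, using `None` for "no bitmap", and the helper name are all assumptions made for the example. The point is only that an array with an all-valid bitmap and an array with no bitmap end up in the same state.

```python
from typing import List, Optional


def canonicalize_validity(validity: Optional[List[bool]]) -> Optional[List[bool]]:
    """Collapse an all-valid bitmap to None (hypothetical helper).

    After this step, "bitmap present but every bit set" and "no bitmap"
    are indistinguishable, so later stages see a single representation.
    """
    if validity is not None and all(validity):
        return None
    return validity
```

A bitmap that contains at least one null is returned unchanged, so only the redundant all-valid case is rewritten.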

…ormat#6330)

## Summary

Add support for writing blob v2 columns with external URI references that are outside registered base paths. This enables use cases like INSERT INTO SELECT across Lance tables where the target table stores external blob references pointing to the source table's blob files instead of copying the actual blob bytes.

## Changes

- **WriteParams.java**: Add `allowExternalBlobOutsideBases` `Optional<Boolean>` field, getter, and builder method
- **Fragment.java**: Pass the new field through `createWithFfiArray` and `createWithFfiStream` native methods
- **fragment.rs (JNI)**: Thread the new `Optional<Boolean>` parameter through all fragment creation functions to `extract_write_params`
- **utils.rs (JNI)**: Parse the new parameter and set `allow_external_blob_outside_bases` on Rust `WriteParams`
- **blocking_dataset.rs (JNI)**: Pass `JObject::null()` for the new param in `Dataset.write()` path (not needed there)

## Context

This is a prerequisite for lance-spark blob JOIN support (lance-format/lance-spark#355). When blob data flows through Spark's shuffle during JOIN + INSERT INTO, the target table needs to write external blob references pointing to the source table's physical blob files. The Rust `BlobPreprocessor` already supports this via `allow_external_blob_outside_bases`, but the Java SDK had no way to set it.

Ref: lance-format#6321, lance-format#6322

## Test plan

- [x] Rust JNI code compiles cleanly (no errors in changed files)
- [ ] Java unit tests (CI)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary

- Fixes lance-format#6417: Overwrite validation incorrectly used the old manifest's storage format, causing strict legacy checks to reject valid STABLE-format fragments that omit struct parent fields
- Pass `None` instead of `Some(manifest)` for Overwrite validation since all fragments are replaced and the old format is irrelevant
- Added regression test for LEGACY→STABLE overwrite with struct fields

## Test plan

- [x] `test_overwrite_legacy_to_stable_with_struct_fields`: verifies that overwriting a LEGACY dataset with STABLE fragments containing struct fields succeeds
- [ ] Re-enable `replaceTableChangesStorageVersion` test in lance-spark after this lands

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add workflow dispatch inputs for `custom_pypi_url` and `custom_pypi_token` to allow manual publishing to custom PyPI indexes like Azure Artifacts. When these inputs are provided, wheels are uploaded to the custom index instead of the default PyPI or Fury repositories.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
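As a rough illustration of the branching described above, the following Python sketch builds a `twine upload` invocation that targets a custom index only when both inputs are supplied. It is not the contents of the `upload_wheel` action: the helper name, the wheel path, and the `__token__` username are assumptions for the example. The `--repository-url` and `--non-interactive` flags and the `TWINE_USERNAME` / `TWINE_PASSWORD` environment variables are standard twine features.

```python
from typing import Dict, List, Optional, Tuple


def build_twine_upload(
    wheel_glob: str,
    custom_pypi_url: Optional[str] = None,
    custom_pypi_token: Optional[str] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, extra_env) for a twine upload (hypothetical helper)."""
    cmd = ["twine", "upload", "--non-interactive"]
    env: Dict[str, str] = {}
    if custom_pypi_url:
        if not custom_pypi_token:
            raise ValueError("custom_pypi_token is required with custom_pypi_url")
        cmd += ["--repository-url", custom_pypi_url]
        # Assumption: token-style auth. Some indexes expect another username.
        env["TWINE_USERNAME"] = "__token__"
        # The token goes through the environment rather than argv so it
        # does not show up in process listings or echoed commands.
        env["TWINE_PASSWORD"] = custom_pypi_token
    cmd.append(wheel_glob)
    return cmd, env
```

A caller could then run `subprocess.run(cmd, env={**os.environ, **env}, check=True)`. When no custom URL is given, the command is left without `--repository-url`, so whatever default repository configuration already applies stays in effect.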

esteban and others added 30 commits March 10, 2026 02:15

chore: ci/cd workflow improvements and fixes (lance-format#6127)

88b2635

chore: release beta version 4.0.0-beta.8

e7eb014

fix: replace fetch_arrow_table with to_arrow_table (lance-format#…

31de8d4

…6146) fix CI error: `FAILED python/tests/test_integration.py::test_duckdb_pushdown_extension_types - _duckdb.Error: DeprecationWarning: fetch_arrow_table() is deprecated, use to_arrow_table() instead.`

perf: parallelize FTS prewarming (lance-format#6144)

0681a08

20%+ faster for 2GB index, could be more for larger index

feat: clearer progress reporting for IVF (lance-format#6126)

fa64837

This breaks the "build_partitions" stage into "build_partitions" and "merge_partitions", and also updates the progress reporting on the shuffle phase to be in terms of rows instead of batches.

chore: release beta version 4.0.0-beta.9

e133d82

chore: clippy cleanups (lance-format#6172)

31cd8d3

This PR moves a few unrelated clippy cleanups out of lance-format#6168 so the blob empty-range fix can stay focused on the regression it addresses. The changes here are all mechanical simplifications with no intended behavior change.

docs: add alicloud oss configuration (lance-format#6167)

f3e50d7

Improves the alicloud storage config doc from lance-format#4247.

Signed-off-by: FarmerChillax <farmerchillax@outlook.com>

perf: remove shard content key sorting from distributed merge (lance-…

f4adbc0

…format#6179)

chore: release beta version 4.0.0-beta.10

3f74834

docs: update the rules for data replacement conflicts to reflect real…

46826b1

…ity (lance-format#6182)

perf(inverted): reuse posting batch builder and merge tail partitions (…

e675fdb

…lance-format#6191) Signed-off-by: BubbleCal <bubble-cal@outlook.com>

refactor: rename arrow-scalar to lance-arrow-scalar (lance-format#6199)

7aa6d33

In retrospect, the old name was somewhat presumptuous. It would probably be good to get the Arrow project's permission before taking up cargo real estate. This also adds a README, the absence of which was preventing the publish.

fix: memory_limit and num_workers params are not passed to index work…

c9c8c46

…er (lance-format#6197) Signed-off-by: BubbleCal <bubble-cal@outlook.com>

chore: release beta version 4.0.0-beta.11

e8109ad

chore: release beta version 4.0.0-beta.12

1e7b725

eddyxu and others added 29 commits April 1, 2026 13:41

feat: support bf16 from pytorch dataset (lance-format#6342)

21d830a

# Summary

Support round-tripping bf16 data from PyTorch.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

chore: release beta version 5.0.0-beta.3

5504998

feat!: add progress monitoring via callbacks for distributed merge (l…

cc89488

…ance-format#6210)

chore: release beta version 5.0.0-beta.4

d9068e7

feat: refine logical vector index into an IVF view (lance-format#6400)

effca10

This PR introduces `LogicalVectorIndex` as a logical aggregate and moves IVF-specific partition inspection into `LogicalIvfView`, so the API boundary matches the actual semantics.

feat: support index build progress callbacks in Python bindings (lanc…

2c20d75

…e-format#6394) Expose Python progress callbacks for index creation and segment merge. Add the `IndexProgress` event type, and pump Rust async progress events back into Python while waiting on index work.

chore: release beta version 5.0.0-beta.5

d630106

github-actions Bot added the enhancement New feature or request label Apr 7, 2026

jackye1995 wants to merge 150 commits into main from custom-pypi

jackye1995 commented Apr 7, 2026


20 participants
