
feat(ci): support publishing wheels to custom PyPI index#94

Open
jackye1995 wants to merge 150 commits into main from custom-pypi

Conversation

@jackye1995
Owner

Summary

  • Add custom_pypi_url and custom_pypi_token workflow dispatch inputs to pypi-publish.yml
  • Update upload_wheel action to support custom PyPI uploads via twine --repository-url
  • Allow manual publishing of any released version to custom indexes like Azure Artifacts
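
As a rough illustration of the custom-index path, the sketch below builds a `twine upload` invocation that switches to `--repository-url` when a custom URL is supplied. The helper name `build_twine_upload` and its argument handling are hypothetical and are not taken from the `upload_wheel` action; the `__token__` username follows the PyPI convention, and other indexes such as Azure Artifacts may expect a different value.

```python
from typing import Dict, List, Optional, Tuple


def build_twine_upload(
    wheel_paths: List[str],
    custom_pypi_url: Optional[str] = None,
    custom_pypi_token: Optional[str] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, extra_env) for a twine upload.

    Hypothetical helper for illustration; the real logic lives in the
    upload_wheel action and may differ.
    """
    argv = ["twine", "upload", "--non-interactive"]
    env: Dict[str, str] = {}
    if custom_pypi_url:
        # Target the custom index instead of the default PyPI repository.
        argv += ["--repository-url", custom_pypi_url]
    if custom_pypi_token:
        # Pass the secret through the environment so it never shows up in argv.
        # Assumption: token auth with the PyPI-style "__token__" username.
        env["TWINE_USERNAME"] = "__token__"
        env["TWINE_PASSWORD"] = custom_pypi_token
    argv += wheel_paths
    return argv, env
```

The returned pair could then be handed to something like `subprocess.run(argv, env={**os.environ, **env}, check=True)`.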

Test plan

  • Manually trigger workflow with custom PyPI URL and token to verify upload works
  • Verify existing release and workflow_dispatch flows still work as expected

🤖 Generated with Claude Code

esteban and others added 30 commits March 10, 2026 02:15
…6146)

fix CI error: `FAILED
python/tests/test_integration.py::test_duckdb_pushdown_extension_types -
_duckdb.Error: DeprecationWarning: fetch_arrow_table() is deprecated,
use to_arrow_table() instead.`
More than 20% faster for a 2GB index; the gain could be larger for larger indexes.
)

This PR fixes the regression benchmarks workflow failing to resolve the
pinned `google-github-actions/auth` action. The workflow had quoted the
entire `uses` value, which caused the trailing `# v2` comment to be
parsed as part of the action ref.
There was a conflict table in transaction.rs but this was incomplete
(some rows/columns missing) and seemed to be imprecise or incorrect in a
few spots. I've attempted to more thoroughly document this in
transaction.md instead.
…ance-format#6160)

Previously, `adjust_child_validity` would call `ArrayData::try_new` with
a null bitmap on a `DataType::Null` array, causing an `.unwrap()` panic
with `InvalidArgumentError("Arrays of type Null cannot contain a null
bitmask")`.

The trigger: when a user inserts rows where a struct sub-field has only
null values, Arrow infers `DataType::Null` for that column. If a
subsequent fragment omits that nullable sub-field, Lance inserts a
`NullReader` to fill it in. `MergeStream` then merges the real batch
(with null struct rows) and the `NullReader` batch (all-null struct),
recursing into the struct where `adjust_child_validity` is called with
the `Null`-typed child and a non-empty parent validity — triggering the
panic.

Fix: skip the bitmask operation when `child.data_type() ==
DataType::Null`. A `Null` array is always entirely null by definition
and needs no validity adjustment.

Closes lance-format#6159

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…e-format#6163)

Previously, when `FragReuseIndexDetails` exceeded 204800 bytes
(triggered by large compactions with many fragments), the code wrote the
details to an external file (`details.binpb`). On local filesystems,
`ObjectStore::create` returns a `LocalWriter` that atomically renames a
temp file to the final path in `Writer::shutdown`. However,
`frag_reuse.rs` imported `tokio::io::AsyncWriteExt` but not
`lance_io::traits::Writer`, so `writer.shutdown()` resolved to
`AsyncWriteExt::shutdown` (flush/close only) — the temp file was deleted
on drop without being persisted. Any subsequent `load_indices` call
would fail with `Not found: .../details.binpb`.

Fixed by using UFCS `Writer::shutdown(writer.as_mut()).await?` to
explicitly call the lance trait method, matching the existing pattern in
`ivf.rs` and `blob.rs`.

Fixes lance-format#6161

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This breaks the "build_partitions" stage into "build_partitions" and
"merge_partitions", and also updates the progress reporting on the
shuffle phase to be in terms of rows instead of batches.
This PR moves a few unrelated clippy cleanups out of lance-format#6168 so the blob
empty-range fix can stay focused on the regression it addresses. The
changes here are all mechanical simplifications with no intended
behavior change.
…t#6175)

This PR moves the Linux and Windows workflows that currently run on Warp
onto GitHub-hosted runners. The goal is to reduce reliance on custom
runners and take advantage of the sponsored larger GitHub-hosted
machines for the slowest CI paths.

This is focused on the current CI bottlenecks we observed in recent
successful PR runs, especially Rust ARM and Python Windows jobs, while
keeping the existing macOS and benchmark-specific runners unchanged
until we verify equivalent GitHub-hosted options for them.

Context:
- Recent PR history shows Rust `linux-arm` and Python `windows` as the
dominant critical-path jobs.
- This change upgrades those jobs to larger GitHub-hosted runners where
available (`ubuntu-24.04-8x`, `ubuntu-24.04-arm64-8x`,
`windows-latest-4x`) and aligns the remaining Linux/Windows workflows
with the same runner family.
- I validated the workflow YAML locally after the runner migration; no
product code or test logic changed.

---

Updates:

- Rust linux-arm: 40.7 -> 19.4, about -52%
- Rust windows-build: 27.7 -> 21.0, about -24%
- Python windows: 36.5 -> 23.1, about -37%
- Python Linux 3.13 ARM: 26.9 -> 20.7, about -23%
- Python Linux 3.13 x86_64: 26.8 -> 19.1, about -29%
- Python Linux 3.9 x86_64: 25.9 -> 19.2, about -26%
Improves the alicloud storage config doc (lance-format#4247).

Signed-off-by: FarmerChillax <farmerchillax@outlook.com>
Blob reads should return empty bytes when the logical blob is empty or
the cursor is already at EOF. Today `BlobFile::read` / `read_up_to` can
still issue a `get_range(start..end)` request with `start == end`, which
is tolerated by local readers but rejected by cloud object stores.

This showed up while investigating `random_blob` failures on the
original-scale `laion10m-full` dataset, where legacy blob reads on S3
failed with errors like `Range started at 1 and ended at 1`. The fix
short-circuits empty reads and restores the cursor to blob-relative
semantics after `read()`, and adds regression coverage for both the
empty-range case and packed-blob cursor behavior.
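
The short-circuit described above can be sketched in Python as follows. This is a toy model under stated assumptions, not the Rust `BlobFile` implementation: `BlobCursor`, its constructor arguments, and the `get_range(start, end)` callable are hypothetical names chosen for illustration.

```python
from typing import Callable


class BlobCursor:
    """Toy reader over the byte range [blob_start, blob_start + size) of a file."""

    def __init__(self, get_range: Callable[[int, int], bytes], blob_start: int, size: int):
        self._get_range = get_range    # stand-in for an object-store range read
        self._blob_start = blob_start  # file offset where the blob begins
        self._size = size              # logical blob length in bytes
        self._cursor = 0               # blob-relative position

    def read_up_to(self, n: int) -> bytes:
        length = min(n, self._size - self._cursor)
        if length <= 0:
            # Empty blob or cursor at EOF: never issue a start == end request.
            return b""
        start = self._blob_start + self._cursor
        data = self._get_range(start, start + length)
        self._cursor += len(data)  # the cursor stays blob-relative
        return data

    def read(self) -> bytes:
        """Read everything from the cursor to the end of the blob."""
        return self.read_up_to(self._size - self._cursor)
```

With this shape, a second `read()` after the blob has been fully consumed returns empty bytes without touching the store, which is the behavior a strict cloud backend requires.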
<img width="1340" height="800" alt="image"
src="https://github.com/user-attachments/assets/355caf26-14cb-4823-9474-6e4c9e780823"
/>

- FTS indexing is ~2.5x faster: this removes the merge phase and
produces large partitions directly.
- The memory footprint is reduced by ~60%: posting lists are compressed
while they are being built, which saves a lot of memory and reduces
fragmented objects in memory.

This also bumps the default worker memory budget from 256MiB to 1GiB
because we need to produce larger partitions directly, but the memory
footprint is still much lower.

This adds a new param `memory_limit` so that users can control how the
indexing should work.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: LuQQiu <luqiujob@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
…#6187)

This fixes the reader panic in lance-format#6185 when a page keeps nullable rep/def
layer metadata but does not materialize any definition levels. The
decoder now treats that page-local state as all-valid and includes a
regression test that reproduces the mixed-page case before the fix.

Closes lance-format#6185.
)

This fixes the merge-insert fast path for delete-by-source operations
while preserving the existing `UpdateIf` semantics. It also keeps
full-schema `FixedSizeList` merges on the optimized path so target-side
payload columns are pruned from the join build side.

Fix lancedb/lancedb#3094
This updates the benchmark TPC-H datagen path to use DuckDB's
`to_arrow_reader()` API instead of the deprecated `fetch_arrow_reader()`
call.

The benchmark CI treats `DeprecationWarning` as an error, so this
removes the warning that was breaking the random access benchmark job. I
also dropped a leftover `print(ds.count_rows())` debug statement to keep
benchmark logs clean.
In retrospect the old name was somewhat presumptuous. It would probably
be good to get the Arrow project's permission before taking up cargo
real estate. This also adds a README; the missing README was preventing the publish.
…mat#6145)

## Summary

Closes lance-format#6138

This PR extends `index_matches_criteria()` in
`rust/lance/src/index/scalar.rs` to handle vector index types in
addition to scalar indices.

## Problem

Previously, `index_matches_criteria()` contained an early return at
lines 464-467 that rejected all non-scalar (vector) indices. This made
it impossible to use `describe_indices` to filter for vector indices on
a specific column.

## Solution

- Removed the early return that rejected all vector indices
- Refactored FTS and exact equality checks to only apply to scalar
indices (these checks are not relevant for vector indices)
- Vector indices now pass through when matching basic criteria (name and
column filters)

## Changes

- 1 file modified: `rust/lance/src/index/scalar.rs`
- 15 lines added, 16 lines removed
- Updated existing test `test_index_matches_criteria_vector_index()` to
reflect the new expected behavior

## Testing

- Updated the existing unit test for vector index criteria matching
- The test now correctly expects vector indices to match basic criteria
instead of being rejected

## AI Disclosure

This contribution was developed with the assistance of Claude (AI by
Anthropic). The implementation approach, code, and PR description were
AI-assisted. All changes are focused on resolving the specific issue
described above.

Co-Authored-By: AI Assistant (Claude) <ai-assistant@contributor-bot.dev>

Signed-off-by: ndpvt-web <ndpvt-web@users.noreply.github.com>
Co-authored-by: ndpvt-web <ndpvt-web@users.noreply.github.com>
Co-authored-by: AI Assistant (Claude) <ai-assistant@contributor-bot.dev>
…er (lance-format#6197)

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…rmat#6194)

This PR makes two changes to ensure stale credentials are not used:
(1) In the Directory namespace if either vending is not enabled or a
credential vendor is not configured we return `None` for storage
options.
(2) The `DynamicStorageOptionsCredentialProvider` falls back to the
default credential provider (lazily loaded) if it is not able to
retrieve credentials.

Closes lance-format/lance-spark#292

---------

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
…ance-format#6119)

SimpleIndex (HNSW over centroids) previously only supported fp32
centroids, causing fp16 vector workloads to fall back to brute-force
partition assignment — O(K×D) per vector instead of O(log K × D). For
31K centroids × 1024 dims this is a ~600x difference per vector.

Cast fp16 centroids to fp32 at HNSW construction time (one-time cost)
and cast fp16 query vectors at search time (1024 floats per query,
negligible vs the distance computations saved).

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
lance-format#6142)

Previously we would use the default file version when creating new index
files. This was originally done to get some testing of the 2.0 format
before it was made the default. However, this led to a bit of a
potential compatibility problem. If we change the default file version
then the files created by the new release would become unreadable on
very old versions that didn't know how to read that file, even if the
dataset itself had an older file version and the old version knew how to
handle the index otherwise.

To avoid this we change things in this PR so that new index files use
the same format version as the dataset. This should mean the indexes are
always readable if the dataset is readable, regardless of what version
was used to write the index.

---

Parts of this PR were written with Claude (Opus 4.6) and I take full
responsibility for its contents.
eddyxu and others added 29 commits April 1, 2026 13:41
# Summary

Support round-trip to use bf16 from PyTorch


Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary

- Add `ObjectStore::prefers_lite_scheduler()` that returns `true` for
`file+uring://` stores, so the lite scheduler is used automatically
without needing the env var
- Change `SchedulerConfig::use_lite_scheduler` from `bool` to
`Option<bool>` — `Some(true/false)` overrides, `None` defers to the
object store's preference
- `LANCE_USE_LITE_SCHEDULER` env var still works as an override when
explicitly set
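
The `Option<bool>` semantics above amount to a three-state setting. A minimal Python sketch, using a hypothetical function name and leaving out the environment-variable handling (whose exact precedence is not spelled out here):

```python
from typing import Optional


def resolve_use_lite_scheduler(configured: Optional[bool], store_prefers_lite: bool) -> bool:
    """Pick the scheduler: an explicit setting wins, None defers to the store."""
    if configured is not None:
        # Corresponds to Some(true) / Some(false): an explicit override.
        return configured
    # Corresponds to None: follow the object store's own preference.
    return store_prefers_lite
```

Under this reading, a `file+uring://` store with no explicit setting would get the lite scheduler, while an explicit `False` would still turn it off.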

## Test plan

- [x] `cargo check -p lance-io --tests --benches` compiles cleanly
- [x] `cargo test -p lance-io` — all 148 tests pass
- [x] `cargo clippy -p lance-io --tests --benches -- -D warnings` — no
warnings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1. Remove `storage_options_provider` in Python and Java. To make managed
versioning work, we have updated the codebase to pass the namespace and
table ID into the Python and Java binding layer, so it becomes
unnecessary to build a language-specific `storage_options_provider` and
bind it to Rust: we can directly construct the Rust
`StorageOptionsProvider` using the bound namespace client.
2. Rename the following: `namespace` to `namespace_client`,
`namespace_impl` to `namespace_client_impl`, `namespace_properties` to
`namespace_client_properties`, and `namespace` (where it means the
namespace path) to `namespace_path`. This is done for all code in Rust,
Python, and Java. The rename is based on community feedback and aims to
clarify the concept of the Namespace Client SDK and its implementations
versus a namespace path like `["ns1", "ns2"]`.
3. Add `vend_input_storage_options` and `ops_metrics_enabled` so that we
can now use DirectoryNamespace directly for testing all these changes,
without the need to rely on an extra tracking namespace. Update all
tests accordingly to use the new feature.
4. Fix the known bug where the Python and Java bindings for non-native
namespace client implementations do not fully work with managed
versioning due to binding-level model conversion.

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Happy <yesreply@happy.engineering>
This change adds `with_index_segments()` for vector queries and makes
ANN planning prune to the selected index segments instead of always
searching the full logical index. It also makes `with_fragments()`
participate in segment selection and flat fallback computation so
fragment-filtered and segment-filtered searches stay correct when only
part of the logical index is queried.

This feature makes distributed search much faster by avoiding loading
unrelated index segments.

---

FTS should also support this; it will be added after
lance-format#6305 has been merged.
This adds `external_blob_mode="ingest"` for blob v2 writes so datasets
can import external blob bytes into Lance-managed storage instead of
always persisting URI references. The write path now streams external
sources through `lance-io` reader streams for packed and dedicated blobs
while preserving inline materialization for small payloads.


Closes lance-format#6321.
…at#6375)

## Problem

Scanning `_row_created_at_version` or `_row_last_updated_at_version` is
extremely slow on fragments with deletion vectors — **53 seconds for 1M
rows** (vs 0.03s for a regular data column on the same table).

This makes the `delta()` API (`get_inserted_rows()` /
`get_updated_rows()`) unusable on any table that has had deletions.

## Root Cause

In `apply_row_id_and_deletes()` (`lance-table/src/utils/stream.rs`), the
version column is built by:

```rust
sequence.versions()
    .skip(r.start as usize)
    .take((r.end - r.start) as usize)
```

`versions()` creates a fresh iterator from the start of the RLE
sequence. For each range in the selection, `skip(r.start)` walks through
all prior elements — O(rows) per range. With deletion vectors creating
many small ranges, this becomes O(rows × ranges).

## Fix

Replace with `version_values_for_selection()` that:
- **Fast-paths single-run fragments** (common case: all rows same
version) with `vec![version; count]` — O(1)
- **Binary search over precomputed run offsets** for O(log(runs)) per
position in the multi-run case
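
The two paths above can be sketched in Python with `bisect`. This is an illustration under assumptions rather than the Rust code: the function name `version_values_sketch`, the `(version, run_length)` representation of the RLE sequence, and the half-open `(start, end)` ranges are all chosen here for clarity, and every requested position is assumed to fall inside the sequence.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def version_values_sketch(
    runs: List[Tuple[int, int]],    # (version, run_length) pairs, in row order
    ranges: List[Tuple[int, int]],  # half-open [start, end) row ranges to keep
) -> List[int]:
    if len(runs) == 1:
        # Fast path: every row has the same version, no lookup needed.
        version = runs[0][0]
        return [version] * sum(end - start for start, end in ranges)

    # run_ends[i] is the exclusive end offset of run i, computed once up front.
    run_ends = list(accumulate(length for _, length in runs))
    out: List[int] = []
    for start, end in ranges:
        for pos in range(start, end):
            # Binary search for the first run whose end lies beyond pos.
            run_idx = bisect_right(run_ends, pos)
            out.append(runs[run_idx][0])
    return out
```

Because the run offsets are precomputed, the cost of each range no longer depends on how far into the sequence it starts, which is what removes the O(rows × ranges) behavior.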

## Benchmark

1M rows, 33% deleted, `_row_created_at_version` scan:

| | Before | After |
|---|---|---|
| Time | 53s | 0.02s |
| Complexity | O(rows × ranges) | O(rows × log(runs)) |

## Minimal Reproduction

```python
import lance, pyarrow as pa, numpy as np, time

uri = "/tmp/test_version_col_perf.lance"
lance.write_dataset(
    pa.table({"val": np.random.randint(0, 1000, 1_000_000)}),
    uri, mode="overwrite", enable_stable_row_ids=True,
)
ds = lance.dataset(uri)
ds.delete("val < 333")
ds = lance.dataset(uri)

t0 = time.time()
ds.scanner(columns=["_row_created_at_version"]).to_table()
print(f"_row_created_at_version with deletions: {time.time()-t0:.2f}s")

t0 = time.time()
ds.scanner(columns=["val"]).to_table()
print(f"normal column with deletions: {time.time()-t0:.2f}s")
```

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…6368)

track_batch_for_wal was returning a pre-resolved watcher instead of the
actual BatchDurableWatcher, so durable_write=true never blocked waiting
for the WAL flush to complete. Also include close_duration in benchmark
timing for accurate end-to-end throughput reporting.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…at#6393)

## Summary
- Call `target.put_part` synchronously in
`ThrottledMultipartUpload::put_part` to lock in part ordering at
creation time, rather than deferring the call into the async future body
where await order would determine part order.
- Remove the unnecessary `Arc<Mutex<...>>` wrapper around the inner
upload target since `&mut self` already prevents concurrent `put_part`
calls.
- Add test `test_throttled_multipart_reorders_parts` that verifies parts
are ordered by creation, not by await order.
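
The ordering idea in the first bullet can be shown with a small asyncio model. It is a sketch only: `RecordingUpload` is a hypothetical stand-in for the upload target, and `asyncio.sleep(0)` stands in for throttling and network I/O.

```python
import asyncio
from typing import Awaitable, List


class RecordingUpload:
    """Toy upload target: a part's slot is reserved when put_part is called."""

    def __init__(self) -> None:
        self.parts: List[bytes] = []

    def put_part(self, data: bytes) -> Awaitable[None]:
        # Synchronous section: reserve the slot now, so ordering follows call
        # order regardless of when the returned awaitable is awaited.
        index = len(self.parts)
        self.parts.append(b"")

        async def upload() -> None:
            await asyncio.sleep(0)  # stand-in for throttling and network I/O
            self.parts[index] = data

        return upload()


async def demo() -> List[bytes]:
    target = RecordingUpload()
    first = target.put_part(b"part-0")
    second = target.put_part(b"part-1")
    # Await in the opposite order from creation.
    await second
    await first
    return target.parts
```

If the slot were reserved inside `upload()` instead, the part awaited first would take the first slot, which is the reordering the fix avoids.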

## Test plan
- [x] `cargo test -p lance-io throttle` — all 21 tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…rmat#6299)

Bumps [requests](https://github.com/psf/requests) from 2.32.5 to 2.33.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/psf/requests/releases">requests's
releases</a>.</em></p>
<blockquote>
<h2>v2.33.0</h2>
<h2>2.33.0 (2026-03-25)</h2>
<p><strong>Announcements</strong></p>
<ul>
<li>📣 Requests is adding inline types. If you have a typed code base
that uses Requests, please take a look at <a
href="https://redirect.github.com/psf/requests/issues/7271">#7271</a>.
Give it a try, and report any gaps or feedback you may have in the
issue. 📣</li>
</ul>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2026-25645 <code>requests.utils.extract_zipped_paths</code> now
extracts contents to a non-deterministic location to prevent malicious
file replacement. This does not affect default usage of Requests, only
applications calling the utility function directly.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Migrated to a PEP 517 build system using setuptools. (<a
href="https://redirect.github.com/psf/requests/issues/7012">#7012</a>)</li>
</ul>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed an issue where an empty netrc entry could cause malformed
authentication to be applied to Requests on Python 3.11+. (<a
href="https://redirect.github.com/psf/requests/issues/7205">#7205</a>)</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Dropped support for Python 3.9 following its end of support. (<a
href="https://redirect.github.com/psf/requests/issues/7196">#7196</a>)</li>
</ul>
<p><strong>Documentation</strong></p>
<ul>
<li>Various typo fixes and doc improvements.</li>
</ul>
<h2>New Contributors</h2>
<ul>
<li><a href="https://github.com/M0d3v1"><code>@​M0d3v1</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6865">psf/requests#6865</a></li>
<li><a href="https://github.com/aminvakil"><code>@​aminvakil</code></a>
made their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7220">psf/requests#7220</a></li>
<li><a href="https://github.com/E8Price"><code>@​E8Price</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6960">psf/requests#6960</a></li>
<li><a href="https://github.com/mitre88"><code>@​mitre88</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7244">psf/requests#7244</a></li>
<li><a href="https://github.com/magsen"><code>@​magsen</code></a> made
their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/6553">psf/requests#6553</a></li>
<li><a
href="https://github.com/Rohan5commit"><code>@​Rohan5commit</code></a>
made their first contribution in <a
href="https://redirect.github.com/psf/requests/pull/7227">psf/requests#7227</a></li>
</ul>
<p><strong>Full Changelog</strong>: <a
href="https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25">https://github.com/psf/requests/blob/main/HISTORY.md#2330-2026-03-25</a></p>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's
changelog</a>.</em></p>
<blockquote>
<h2>2.33.0 (2026-03-25)</h2>
<p><strong>Announcements</strong></p>
<ul>
<li>📣 Requests is adding inline types. If you have a typed code base
that
uses Requests, please take a look at <a
href="https://redirect.github.com/psf/requests/issues/7271">#7271</a>.
Give it a try, and report
any gaps or feedback you may have in the issue. 📣</li>
</ul>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2026-25645 <code>requests.utils.extract_zipped_paths</code> now
extracts
contents to a non-deterministic location to prevent malicious file
replacement. This does not affect default usage of Requests, only
applications calling the utility function directly.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Migrated to a PEP 517 build system using setuptools. (<a
href="https://redirect.github.com/psf/requests/issues/7012">#7012</a>)</li>
</ul>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed an issue where an empty netrc entry could cause
malformed authentication to be applied to Requests on
Python 3.11+. (<a
href="https://redirect.github.com/psf/requests/issues/7205">#7205</a>)</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Dropped support for Python 3.9 following its end of support. (<a
href="https://redirect.github.com/psf/requests/issues/7196">#7196</a>)</li>
</ul>
<p><strong>Documentation</strong></p>
<ul>
<li>Various typo fixes and doc improvements.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/psf/requests/commit/bc04dfd6dad4cb02cd92f5daa81eb562d280a761"><code>bc04dfd</code></a>
v2.33.0</li>
<li><a
href="https://github.com/psf/requests/commit/66d21cb07bd6255b1280291c4fafb71803cdb3b7"><code>66d21cb</code></a>
Merge commit from fork</li>
<li><a
href="https://github.com/psf/requests/commit/8b9bc8fc0f63be84602387913c4b689f19efd028"><code>8b9bc8f</code></a>
Move badges to top of README (<a
href="https://redirect.github.com/psf/requests/issues/7293">#7293</a>)</li>
<li><a
href="https://github.com/psf/requests/commit/e331a288f369973f5de0ec8901c94cae4fa87286"><code>e331a28</code></a>
Remove unused extraction call (<a
href="https://redirect.github.com/psf/requests/issues/7292">#7292</a>)</li>
<li><a
href="https://github.com/psf/requests/commit/753fd08c5eacce0aa0df73fe47e49525c67e0a29"><code>753fd08</code></a>
docs: fix FAQ grammar in httplib2 example</li>
<li><a
href="https://github.com/psf/requests/commit/774a0b837a194ee885d4fdd9ca947900cc3daf71"><code>774a0b8</code></a>
docs(socks): same block as other sections</li>
<li><a
href="https://github.com/psf/requests/commit/9c72a41bec8597f948c9d8caa5dc3f12273b3303"><code>9c72a41</code></a>
Bump github/codeql-action from 4.33.0 to 4.34.1</li>
<li><a
href="https://github.com/psf/requests/commit/ebf71906798ec82f34e07d3168f8b8aecaf8a3be"><code>ebf7190</code></a>
Bump github/codeql-action from 4.32.0 to 4.33.0</li>
<li><a
href="https://github.com/psf/requests/commit/0e4ae38f0c93d4f92a96c774bd52c069d12a4798"><code>0e4ae38</code></a>
docs: exclude Response.is_permanent_redirect from API docs (<a
href="https://redirect.github.com/psf/requests/issues/7244">#7244</a>)</li>
<li><a
href="https://github.com/psf/requests/commit/d568f47278492e630cc990a259047c67991d007a"><code>d568f47</code></a>
docs: clarify Quickstart POST example (<a
href="https://redirect.github.com/psf/requests/issues/6960">#6960</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/psf/requests/compare/v2.32.5...v2.33.0">compare
view</a></li>
</ul>
</details>
<br />

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Summary
- Threads accurate `data_size` (in bytes) from `DataBlock::data_size()`
at the encoding layer through the full decode pipeline to the final
`RecordBatch`
- Implements `DataBlock::data_size()` for `Struct` and `Dictionary`
variants (were `todo!()`)
- Uses the accurate data size for the "batch is too large" warning
instead of Arrow's `get_array_memory_size()`, which over-reports due to
shared page buffers
- Changes `DecodeArrayTask::decode()` to return `(ArrayRef, u64)` so
data size flows through naturally

## Test plan
- [x] All 364 existing `lance-encoding` tests pass
- [x] `cargo clippy -p lance-encoding --tests -- -D warnings` clean
- [x] `cargo clippy -p lance-file --tests -- -D warnings` clean
- [x] `cargo fmt --all -- --check` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This updates `python/uv.lock` to match the current result of `uv sync
--extra tests --extra dev` in the Python project.

The lock file was lagging behind the resolved `lance-namespace`
packages, which caused recurring diffs after local syncs.

I validated the change by rerunning `uv sync --extra tests --extra dev`
successfully in `python/`.
lance-format#6396)

### Description

Removes the leftover `tempfile.NamedTemporaryFile` save in
`train_ivf_centroids_on_accelerator`.

This was a debugging/checkpoint artifact that is no longer needed — the
`IvfModel.save()` API now provides explicit persistence to any URI
(local or cloud). The temp file was created with `delete=False` and
never cleaned up, leaking disk space over repeated runs.

The CPU training path (Rust `indices.train_ivf_model`) does not have
this behavior, so this change also makes the two paths consistent.

### Changes

- **`python/python/lance/vector.py`**: Removed the
`tempfile.NamedTemporaryFile` + `np.save` + log line from
`train_ivf_centroids_on_accelerator` (3 lines deleted).

Closes lance-format#6395
This PR introduces `LogicalVectorIndex` as a logical aggregate and moves
IVF-specific partition inspection into `LogicalIvfView`, so the API
boundary matches the actual semantics.
…-format#6367)

Avoid confusion with object store regions (e.g., AWS regions) which are
unrelated to the MemWAL concept of a unique writer/reader instance.

Closes lance-format#6355

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This changes segmented vector index optimize so the default rebalance
path keeps segment boundaries and rewrites only the single worst segment
in each run. It builds on lance-format#6400's logical vector index / IVF view work
and avoids the current behavior where segmented optimize treats the
logical index as one physical index.

I also added a regression test that creates a skewed two-segment IVF
index and verifies that optimize replaces only the oversized segment
while leaving the other segment untouched.
…rmat#6379)

[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=aiohttp&package-manager=uv&previous-version=3.12.15&new-version=3.13.4)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/lance-format/lance/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…e-format#6394)

Expose Python progress callbacks for index creation and segment merge.

Add the IndexProgress event type and pump Rust async progress events
back into Python while waiting on index work.
…lance-format#6132)

close lance-format#6111

- Implement `count_table_rows` with optional version checkout and
predicate filter
- Implement `insert_into_table` with append/overwrite modes via Arrow
IPC
- Implement `query_table` supporting vector similarity search (with
distance type, nprobes, refine factor, prefilter) and plain scan with
filter/limit/offset/version
- Add `lance-linalg` dependency for `DistanceType` in vector search
- Add 8 unit tests covering all new methods and edge cases

## Test plan

- [x] `cargo test -p lance-namespace-impls` passes all new tests
- [x] count_table_rows returns correct count with and without predicate
filter
- [x] insert_into_table correctly appends and overwrites data
- [x] query_table returns correct results for vector search and plain
scan

---------

Co-authored-by: Jack Ye <yezhaoqin@gmail.com>
)

## Summary

- Add `check_column_indices()` validation in
`rust/lance/src/io/commit.rs` that rejects non-leaf fields (structs,
lists) with real column indices in v2.1+ data files at commit time,
preventing cryptic read-time errors
- Exempts packed structs and blob fields which legitimately have column
indices in v2.1+
- Wired into both `commit_transaction` and
`do_commit_detached_transaction` paths
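
The validation rule can be modeled in Python as below. This is a simplified sketch, not the Rust `check_column_indices`: the `SchemaField` dataclass, the `packed` and `blob` flags, the version tuple, and the choice not to descend into exempt fields are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SchemaField:
    name: str
    column_index: int  # -1 means "no physical column"
    children: List["SchemaField"] = field(default_factory=list)
    packed: bool = False  # packed structs may keep a real column index
    blob: bool = False    # blob fields may keep a real column index


def check_column_indices(fields: List[SchemaField], version: Tuple[int, int]) -> None:
    """Raise ValueError if a non-leaf field has a real column index in v2.1+."""
    if version < (2, 1):
        return  # the rule only applies from data storage version 2.1 onward
    for f in fields:
        if f.packed or f.blob:
            continue  # exempt; assumption: their children are not checked
        if f.children and f.column_index >= 0:
            raise ValueError(
                f"non-leaf field '{f.name}' has column index {f.column_index}"
            )
        check_column_indices(f.children, version)
```

Running a check of this kind at commit time turns a malformed data file into an immediate, descriptive error instead of a failure on a later read.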

Closes lance-format#6412

## Test plan

- [x] `test_check_column_indices_rejects_struct_with_column` — struct
with column_index=0 in v2.1 → error
- [x] `test_check_column_indices_rejects_list_with_column` — list with
column_index=0 in v2.1 → error
- [x] `test_check_column_indices_allows_correct_v21` — correct indices
(non-leaf=-1, leaf>=0) → ok
- [x] `test_check_column_indices_allows_packed_struct` — packed struct
with real column_index → ok
- [x] `test_check_column_indices_skips_v20` — non-leaf with
column_index>=0 in v2.0 → ok (no validation)
- [x] `cargo clippy -p lance --tests -- -D warnings` passes
- [x] `cargo fmt --all` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…ormat#6416)

## Summary

- Add a 5.0.0 section to the migration guide documenting how
`DataFile.column_indices` changed with data storage version 2.1:
non-leaf fields (structs, lists) now get `-1` instead of sequential
column indices
- Add an admonition to the table format spec's Data Files section noting
the version difference
- Includes a concrete before/after example and opt-out instructions

Closes lance-format#6411
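
As a rough illustration of the change (not the example from the migration guide itself), the toy model below assigns column indices in a pre-order walk of a schema. The traversal order, the `assign_column_indices` helper, and the `leaf_only` flag are assumptions made for this sketch; only the outcome, sequential indices for every field before and `-1` for non-leaf fields from 2.1 on, comes from the description above.

```python
from typing import Dict, List, Tuple

# A schema is a list of (name, children) pairs; leaves have no children.
Schema = List[Tuple[str, list]]


def assign_column_indices(schema: Schema, leaf_only: bool) -> Dict[str, int]:
    """Assign column indices in pre-order.

    leaf_only=False models the older behaviour (every field gets an index);
    leaf_only=True models 2.1+ (non-leaf fields get -1).
    """
    indices: Dict[str, int] = {}
    next_col = 0

    def visit(fields: Schema, prefix: str) -> None:
        nonlocal next_col
        for name, children in fields:
            path = f"{prefix}.{name}" if prefix else name
            if children and leaf_only:
                indices[path] = -1  # struct/list parent: no column of its own
            else:
                indices[path] = next_col
                next_col += 1
            visit(children, path)

    visit(schema, "")
    return indices


schema = [("id", []), ("point", [("x", []), ("y", [])])]
before = assign_column_indices(schema, leaf_only=False)
# {'id': 0, 'point': 1, 'point.x': 2, 'point.y': 3}
after = assign_column_indices(schema, leaf_only=True)
# {'id': 0, 'point': -1, 'point.x': 1, 'point.y': 2}
```

Code that looks up a column by position in `column_indices` would therefore need to skip `-1` entries when reading 2.1+ files.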

## Test plan

- [x] Docs build successfully with `mkdocs build`
- [ ] Verify rendered migration guide section and table format
admonition look correct

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…erialization (lance-format#6405)

Fixes lance-format#6403

`FragmentMetadata.to_json()` raised `NotImplementedError` when
`row_id_meta` was present (`enable_stable_row_ids=True`), because
`PyRowIdMeta.asdict()` was stubbed out. This also broke
`FragmentMetadata.from_json()` for the round-trip, since
`RowIdMeta(**dict)` doesn't work on a PyO3 class.

### Repro

```python
import lance
import pyarrow as pa

uri = "/tmp/repro.lance"
ds = lance.write_dataset(pa.table({"x": [1, 2, 3]}), uri, enable_stable_row_ids=True)
ds.get_fragments()[0].metadata.to_json()  # NotImplementedError: PyRowIdMeta.asdict is not yet supported.
```

### Fix

- Implement `PyRowIdMeta.asdict()` via `pythonize` (Rust struct → Python
dict)
- Add `PyRowIdMeta.from_dict()` via `depythonize` (Python dict → Rust
struct)
- Update `FragmentMetadata.from_json()` to use `RowIdMeta.from_dict()`
instead of `RowIdMeta(**dict)`
- Add JSON round-trip test to the existing
`test_fragment_metadata_pickle` parametrized test
…rmat#6420)

This canonicalizes all-valid validity bitmaps into the same rep/def
state as no-null arrays, so sub-schema `merge_insert` updates on
`data_storage_version=2.2` stop emitting inconsistent control-word
metadata and corrupting variable-width pages.

This PR picks up the proposed fix for lance-format#6338 and adds regression coverage
for both the end-to-end binary `merge_insert` failure and the underlying
repdef canonicalization invariant.
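
The invariant can be pictured with a small Python sketch. It assumes a validity bitmap is modelled as an optional list of booleans, which is a simplification: the actual encoder works on Arrow null buffers and rep/def levels, and `canonicalize_validity` is a hypothetical name used only here.

```python
from typing import List, Optional


def canonicalize_validity(validity: Optional[List[bool]]) -> Optional[List[bool]]:
    """Collapse an all-valid bitmap to None, the same state as "no nulls"."""
    if validity is not None and all(validity):
        return None
    return validity


# An explicit all-valid bitmap and a missing bitmap end up identical,
# so later stages see a single representation for "no nulls".
explicit = canonicalize_validity([True, True, True])
absent = canonicalize_validity(None)
with_null = canonicalize_validity([True, False, True])
```

With a single representation for null-free data, two pages holding the same logical values cannot disagree about whether definition levels are present.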

---------

Co-authored-by: Eran Dagan <eran@botika.io>
…ormat#6330)

## Summary

Add support for writing blob v2 columns with external URI references
that are outside registered base paths. This enables use cases like
INSERT INTO SELECT across Lance tables where the target table stores
external blob references pointing to the source table's blob files
instead of copying the actual blob bytes.

## Changes

- **WriteParams.java**: Add `allowExternalBlobOutsideBases`
`Optional<Boolean>` field, getter, and builder method
- **Fragment.java**: Pass the new field through `createWithFfiArray` and
`createWithFfiStream` native methods
- **fragment.rs (JNI)**: Thread the new `Optional<Boolean>` parameter
through all fragment creation functions to `extract_write_params`
- **utils.rs (JNI)**: Parse the new parameter and set
`allow_external_blob_outside_bases` on Rust `WriteParams`
- **blocking_dataset.rs (JNI)**: Pass `JObject::null()` for the new
param in `Dataset.write()` path (not needed there)

## Context

This is a prerequisite for lance-spark blob JOIN support
(lance-format/lance-spark#355). When blob data flows through Spark's
shuffle during JOIN + INSERT INTO, the target table needs to write
external blob references pointing to the source table's physical blob
files. The Rust `BlobPreprocessor` already supports this via
`allow_external_blob_outside_bases`, but the Java SDK had no way to set
it.

Ref: lance-format#6321, lance-format#6322

## Test plan

- [x] Rust JNI code compiles cleanly (no errors in changed files)
- [ ] Java unit tests (CI)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary

- Fixes lance-format#6417: Overwrite validation incorrectly used the old manifest's
storage format, causing strict legacy checks to reject valid
STABLE-format fragments that omit struct parent fields
- Pass `None` instead of `Some(manifest)` for Overwrite validation since
all fragments are replaced and the old format is irrelevant
- Added regression test for LEGACY→STABLE overwrite with struct fields

## Test plan

- [x] `test_overwrite_legacy_to_stable_with_struct_fields` — verifies
that overwriting a LEGACY dataset with STABLE fragments containing
struct fields succeeds
- [ ] Re-enable `replaceTableChangesStorageVersion` test in lance-spark
after this lands

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add workflow dispatch inputs for custom_pypi_url and custom_pypi_token
to allow manual publishing to custom PyPI indexes like Azure Artifacts.
When these inputs are provided, wheels are uploaded to the custom index
instead of the default PyPI or Fury repositories.
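
The selection logic can be sketched as follows. This is not the contents of the `upload_wheel` action: `build_upload_command` is a hypothetical helper, the `__token__` username is an assumption (some indexes expect a different one), and the default PyPI/Fury branch is reduced to a placeholder. The `--repository-url` flag and the `TWINE_USERNAME` / `TWINE_PASSWORD` environment variables are standard twine features.

```python
from typing import Dict, List, Optional, Tuple


def build_upload_command(
    wheels: List[str],
    custom_pypi_url: Optional[str] = None,
    custom_pypi_token: Optional[str] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, extra_env) for uploading wheels with twine."""
    argv = ["twine", "upload"]
    env: Dict[str, str] = {}
    if custom_pypi_url and custom_pypi_token:
        # Custom index: point twine at the given URL.
        argv += ["--repository-url", custom_pypi_url]
        # Pass credentials through the environment so the token
        # does not appear in the process argument list.
        env["TWINE_USERNAME"] = "__token__"  # assumption, index-specific
        env["TWINE_PASSWORD"] = custom_pypi_token
    # Otherwise: placeholder for the existing default upload path.
    argv += list(wheels)
    return argv, env


argv, env = build_upload_command(
    ["dist/example-1.0-py3-none-any.whl"],
    custom_pypi_url="https://example.invalid/pypi/upload/",
    custom_pypi_token="secret-token",
)
```

The returned pair could then be passed to `subprocess.run(argv, env={**os.environ, **env}, check=True)`.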

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions github-actions Bot added the enhancement New feature or request label Apr 7, 2026