fix(namespace): align error handling with namespace spec#95
Open
jackye1995 wants to merge 206 commits into
…6146) fix CI error: `FAILED python/tests/test_integration.py::test_duckdb_pushdown_extension_types - _duckdb.Error: DeprecationWarning: fetch_arrow_table() is deprecated, use to_arrow_table() instead.`
More than 20% faster for a 2GB index; the gain could be larger for bigger indexes.
There was a conflict table in transaction.rs but this was incomplete (some rows/columns missing) and seemed to be imprecise or incorrect in a few spots. I've attempted to more thoroughly document this in transaction.md instead.
…ance-format#6160) Previously, `adjust_child_validity` would call `ArrayData::try_new` with a null bitmap on a `DataType::Null` array, causing an `.unwrap()` panic with `InvalidArgumentError("Arrays of type Null cannot contain a null bitmask")`. The trigger: when a user inserts rows where a struct sub-field has only null values, Arrow infers `DataType::Null` for that column. If a subsequent fragment omits that nullable sub-field, Lance inserts a `NullReader` to fill it in. `MergeStream` then merges the real batch (with null struct rows) and the `NullReader` batch (all-null struct), recursing into the struct where `adjust_child_validity` is called with the `Null`-typed child and a non-empty parent validity — triggering the panic. Fix: skip the bitmask operation when `child.data_type() == DataType::Null`. A `Null` array is always entirely null by definition and needs no validity adjustment. Closes lance-format#6159 --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…e-format#6163) Previously, when `FragReuseIndexDetails` exceeded 204800 bytes (triggered by large compactions with many fragments), the code wrote the details to an external file (`details.binpb`). On local filesystems, `ObjectStore::create` returns a `LocalWriter` that atomically renames a temp file to the final path in `Writer::shutdown`. However, `frag_reuse.rs` imported `tokio::io::AsyncWriteExt` but not `lance_io::traits::Writer`, so `writer.shutdown()` resolved to `AsyncWriteExt::shutdown` (flush/close only) — the temp file was deleted on drop without being persisted. Any subsequent `load_indices` call would fail with `Not found: .../details.binpb`. Fixed by using UFCS `Writer::shutdown(writer.as_mut()).await?` to explicitly call the lance trait method, matching the existing pattern in `ivf.rs` and `blob.rs`. Fixes lance-format#6161 --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
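The method-resolution pitfall described above can be reproduced without any I/O. The following is a minimal sketch using only the standard library; `FlushClose`, `PersistingWriter`, and `TempFileWriter` are hypothetical stand-ins (not the real `AsyncWriteExt` or `lance_io::traits::Writer`), and the methods are synchronous for brevity. It shows that method syntax picks whichever trait is in scope, while the fully qualified form names the trait explicitly.

```rust
mod flush_only {
    /// Hypothetical stand-in for a generic I/O trait whose `shutdown` only flushes.
    pub trait FlushClose {
        fn shutdown(&mut self) -> &'static str;
    }
}

mod persisting {
    /// Hypothetical stand-in for a storage trait whose `shutdown` also persists the file.
    pub trait PersistingWriter {
        fn shutdown(&mut self) -> &'static str;
    }
}

/// Hypothetical writer that implements both traits.
pub struct TempFileWriter;

impl flush_only::FlushClose for TempFileWriter {
    fn shutdown(&mut self) -> &'static str {
        "flushed only"
    }
}

impl persisting::PersistingWriter for TempFileWriter {
    fn shutdown(&mut self) -> &'static str {
        "renamed temp file to final path"
    }
}

/// Only `FlushClose` is imported here, so method syntax silently resolves to it.
pub fn shutdown_with_method_syntax(w: &mut TempFileWriter) -> &'static str {
    use flush_only::FlushClose;
    w.shutdown()
}

/// Fully qualified syntax names the trait, so the set of imports cannot change the result.
pub fn shutdown_with_ufcs(w: &mut TempFileWriter) -> &'static str {
    persisting::PersistingWriter::shutdown(w)
}
```

Calling both helpers on the same writer returns different strings, which is the same shape as the bug: the code compiled, but the wrong `shutdown` ran.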
This breaks the "build_partitions" stage into "build_partitions" and "merge_partitions", and also updates the progress reporting on the shuffle phase to be in terms of rows instead of batches.
This PR moves a few unrelated clippy cleanups out of lance-format#6168 so the blob empty-range fix can stay focused on the regression it addresses. The changes here are all mechanical simplifications with no intended behavior change.
…t#6175) This PR moves the Linux and Windows workflows that currently run on Warp onto GitHub-hosted runners. The goal is to reduce reliance on custom runners and take advantage of the sponsored larger GitHub-hosted machines for the slowest CI paths.
This is focused on the current CI bottlenecks we observed in recent successful PR runs, especially Rust ARM and Python Windows jobs, while keeping the existing macOS and benchmark-specific runners unchanged until we verify equivalent GitHub-hosted options for them.
Context:
- Recent PR history shows Rust `linux-arm` and Python `windows` as the dominant critical-path jobs.
- This change upgrades those jobs to larger GitHub-hosted runners where available (`ubuntu-24.04-8x`, `ubuntu-24.04-arm64-8x`, `windows-latest-4x`) and aligns the remaining Linux/Windows workflows with the same runner family.
- I validated the workflow YAML locally after the runner migration; no product code or test logic changed.
---
Updates:
- Rust linux-arm: 40.7 -> 19.4, about -52%
- Rust windows-build: 27.7 -> 21.0, about -24%
- Python windows: 36.5 -> 23.1, about -37%
- Python Linux 3.13 ARM: 26.9 -> 20.7, about -23%
- Python Linux 3.13 x86_64: 26.8 -> 19.1, about -29%
- Python Linux 3.9 x86_64: 25.9 -> 19.2, about -26%
Improves the alicloud storage config doc (lance-format#4247). Signed-off-by: FarmerChillax <farmerchillax@outlook.com>
Blob reads should return empty bytes when the logical blob is empty or the cursor is already at EOF. Today `BlobFile::read` / `read_up_to` can still issue a `get_range(start..end)` request with `start == end`, which is tolerated by local readers but rejected by cloud object stores. This showed up while investigating `random_blob` failures on the original-scale `laion10m-full` dataset, where legacy blob reads on S3 failed with errors like `Range started at 1 and ended at 1`. The fix short-circuits empty reads and restores the cursor to blob-relative semantics after `read()`, and adds regression coverage for both the empty-range case and packed-blob cursor behavior.
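The short-circuit described above can be sketched as follows. This is a minimal illustration using only the standard library, not the actual `BlobFile` code: `StrictStore` is a hypothetical in-memory store that rejects `start == end` ranges the way the cloud stores are described to, and `BlobCursor` is a hypothetical reader that keeps a blob-relative cursor.

```rust
use std::ops::Range;

/// Hypothetical store that rejects empty ranges, like the cloud stores described above.
pub struct StrictStore {
    pub data: Vec<u8>,
}

impl StrictStore {
    pub fn get_range(&self, range: Range<usize>) -> Result<Vec<u8>, String> {
        if range.start >= range.end {
            return Err(format!(
                "Range started at {} and ended at {}",
                range.start, range.end
            ));
        }
        self.data
            .get(range)
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| "range out of bounds".to_string())
    }
}

/// Hypothetical blob reader; `cursor` is relative to the start of the blob.
pub struct BlobCursor<'a> {
    store: &'a StrictStore,
    blob_start: usize,
    blob_len: usize,
    cursor: usize,
}

impl<'a> BlobCursor<'a> {
    pub fn new(store: &'a StrictStore, blob_start: usize, blob_len: usize) -> Self {
        Self { store, blob_start, blob_len, cursor: 0 }
    }

    /// Reads at most `len` bytes. Empty blobs and reads at EOF return empty
    /// bytes without ever issuing a `start == end` request to the store.
    pub fn read_up_to(&mut self, len: usize) -> Result<Vec<u8>, String> {
        let remaining = self.blob_len.saturating_sub(self.cursor);
        let n = len.min(remaining);
        if n == 0 {
            return Ok(Vec::new());
        }
        let start = self.blob_start + self.cursor;
        let bytes = self.store.get_range(start..start + n)?;
        self.cursor += n;
        Ok(bytes)
    }
}
```

The check happens before any offset arithmetic, so the store is only contacted when at least one byte is actually requested.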
<img width="1340" height="800" alt="image" src="https://github.com/user-attachments/assets/355caf26-14cb-4823-9474-6e4c9e780823" />
- FTS indexing is ~2.5x faster: this removes the merge phase and produces large partitions directly.
- The memory footprint is reduced by ~60%: posting lists are compressed while they are built, which saves a lot of memory and reduces fragmented objects in memory.
This also bumps the default worker memory budget from 256MiB to 1GiB because we need to produce larger partitions directly, but the memory footprint is still much lower.
This adds a new param `memory_limit` so that users can control how the indexing should work.
---------
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: LuQQiu <luqiujob@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
…#6187) This fixes the reader panic in lance-format#6185 when a page keeps nullable rep/def layer metadata but does not materialize any definition levels. The decoder now treats that page-local state as all-valid and includes a regression test that reproduces the mixed-page case before the fix. Closes lance-format#6185.
) This fixes the merge-insert fast path for delete-by-source operations while preserving the existing `UpdateIf` semantics. It also keeps full-schema `FixedSizeList` merges on the optimized path so target-side payload columns are pruned from the join build side. Fix lancedb/lancedb#3094
This updates the benchmark TPC-H datagen path to use DuckDB's `to_arrow_reader()` API instead of the deprecated `fetch_arrow_reader()` call. The benchmark CI treats `DeprecationWarning` as an error, so this removes the warning that was breaking the random access benchmark job. I also dropped a leftover `print(ds.count_rows())` debug statement to keep benchmark logs clean.
…lance-format#6191) Signed-off-by: BubbleCal <bubble-cal@outlook.com>
In retrospect the old name was somewhat presumptuous. It would probably be good to get the Arrow project's permission before taking up cargo real estate. This also adds a README which was preventing the publish.
…mat#6145)
## Summary
Closes lance-format#6138
This PR extends `index_matches_criteria()` in `rust/lance/src/index/scalar.rs` to handle vector index types in addition to scalar indices.
## Problem
Previously, `index_matches_criteria()` contained an early return at lines 464-467 that rejected all non-scalar (vector) indices. This made it impossible to use `describe_indices` to filter for vector indices on a specific column.
## Solution
- Removed the early return that rejected all vector indices
- Refactored FTS and exact equality checks to only apply to scalar indices (these checks are not relevant for vector indices)
- Vector indices now pass through when matching basic criteria (name and column filters)
## Changes
- 1 file modified: `rust/lance/src/index/scalar.rs`
- 15 lines added, 16 lines removed
- Updated existing test `test_index_matches_criteria_vector_index()` to reflect the new expected behavior
## Testing
- Updated the existing unit test for vector index criteria matching
- The test now correctly expects vector indices to match basic criteria instead of being rejected
## AI Disclosure
This contribution was developed with the assistance of Claude (AI by Anthropic). The implementation approach, code, and PR description were AI-assisted. All changes are focused on resolving the specific issue described above.
Co-Authored-By: AI Assistant (Claude) <ai-assistant@contributor-bot.dev>
Signed-off-by: ndpvt-web <ndpvt-web@users.noreply.github.com>
Co-authored-by: ndpvt-web <ndpvt-web@users.noreply.github.com>
Co-authored-by: AI Assistant (Claude) <ai-assistant@contributor-bot.dev>
…er (lance-format#6197) Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…rmat#6194) This PR makes two changes to ensure stale credentials are not used: (1) In the Directory namespace if either vending is not enabled or a credential vendor is not configured we return `None` for storage options. (2) The `DynamicStorageOptionsCredentialProvider` falls back to the default credential provider (lazily loaded) if it is not able to retrieve credentials. Closes lance-format/lance-spark#292 --------- Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
…ance-format#6119) SimpleIndex (HNSW over centroids) previously only supported fp32 centroids, causing fp16 vector workloads to fall back to brute-force partition assignment — O(K×D) per vector instead of O(log K × D). For 31K centroids × 1024 dims this is a ~600x difference per vector. Cast fp16 centroids to fp32 at HNSW construction time (one-time cost) and cast fp16 query vectors at search time (1024 floats per query, negligible vs the distance computations saved). --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Xuanwo <github@xuanwo.io>
lance-format#6142) Previously we would use the default file version when creating new index files. This was originally done to get some testing of the 2.0 format before it was made the default. However, this led to a bit of a potential compatibility problem. If we change the default file version then the files created by the new release would become unreadable on very old versions that didn't know how to read that file, even if the dataset itself had an older file version and the old version knew how to handle the index otherwise. To avoid this we change things in this PR so that new index files use the same format version as the dataset. This should mean the indexes are always readable if the dataset is readable, regardless of what version was used to write the index. --- Parts of this PR were written with Claude (Opus 4.6) and I take full responsibility for its contents.
…lance-format#6477)
## Summary
- Change `DataFile.fields` and `DataFile.column_indices` from `Vec<i32>` to `Arc<[i32]>` so that fragments with identical field lists share a single heap allocation
- Add `DataFileFieldInterner` that deduplicates these slices during manifest deserialization
- In homogeneous tables (the common case), every fragment carries the same field list, so at 20M fragments this saves **~2.4 GB** of redundant heap allocations
## Motivation
When dataset manifests grow large (>1 GB with millions of fragments), opening the dataset becomes very expensive in terms of memory. Each `DataFile` previously owned its own `Vec<i32>` for `fields` and `column_indices`, even though in most tables every fragment has the exact same field list. This PR deduplicates those allocations at deserialization time.
### Per-fragment memory breakdown (before)
| Field | Size per fragment |
|-------|------------------|
| `fields: Vec<i32>` (10 fields) | ~64 bytes |
| `column_indices: Vec<i32>` (10 cols) | ~64 bytes |
| **Total redundant** | **~128 bytes x 20M = ~2.4 GB** |
### After this change
With interning, all 20M fragments share a single `Arc<[i32]>` allocation (~80 bytes total instead of 2.4 GB).
## Changes
- **`lance-table/src/format/fragment.rs`**: Core struct change (`Vec<i32>` → `Arc<[i32]>`), custom `Serialize`/`Deserialize` impls, and `DataFileFieldInterner`
- **`lance-table/src/format/manifest.rs`**: Use interner during manifest deserialization
- **`lance/src/dataset/fragment.rs`**, **`merge_insert.rs`**, **`io/commit.rs`**: Tombstoning and field-remapping rebuilt as new `Arc<[i32]>` instead of in-place mutation
- **`python/src/fragment.rs`**, **`java/lance-jni/src/fragment.rs`**: FFI boundary conversions
- Various test files: Updated struct literals and assertions
## Compatibility
- No format change: protobuf schema is unchanged
- Serde JSON output is identical (custom impl serializes `Arc<[i32]>` as `[i32]`)
- All public API signatures that take `Vec<i32>` (e.g., `DataFile::new()`, `Fragment::add_file()`) still accept `Vec<i32>` and convert internally
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
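The interning idea in this commit can be illustrated with a small standard-library sketch. `FieldInterner` below is a hypothetical name and a simplified design (a `HashSet` keyed by the slice contents); it is not the `DataFileFieldInterner` from the PR, whose internals are not shown here.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical interner: identical field lists end up sharing one allocation.
#[derive(Default)]
pub struct FieldInterner {
    seen: HashSet<Arc<[i32]>>,
}

impl FieldInterner {
    /// Returns a shared `Arc<[i32]>` for `fields`, reusing an earlier
    /// allocation when the same contents were interned before.
    pub fn intern(&mut self, fields: Vec<i32>) -> Arc<[i32]> {
        // `Arc<[i32]>` borrows as `[i32]`, so the lookup needs no allocation.
        if let Some(existing) = self.seen.get(fields.as_slice()) {
            return Arc::clone(existing);
        }
        let shared: Arc<[i32]> = Arc::from(fields);
        self.seen.insert(Arc::clone(&shared));
        shared
    }
}
```

Interning the same list twice yields two handles to one allocation, which `Arc::ptr_eq` can confirm; each additional fragment then costs only a pointer copy and a reference-count increment.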
…mory (lance-format#6499)
## Summary
- Change `RowDatasetVersionMeta::Inline` from `Vec<u8>` to `Arc<[u8]>` so that fragments with identical version metadata share a single heap allocation
- Extend `DataFileFieldInterner` to deduplicate these inline byte payloads during manifest deserialization
- Introduce `InternCache<T>`: a hybrid cache that uses Vec linear scan for ≤16 entries and upgrades to HashMap for larger caches
- Add custom `Serialize`/`Deserialize` impls for `RowDatasetVersionMeta` to handle `Arc<[u8]>` transparently
## Motivation
Follow-up to lance-format#6477 (interning `DataFile.fields`/`column_indices`). After a compaction, all fragments are stamped with the same version metadata (both `last_updated_at_version_meta` and `created_at_version_meta`), but each fragment previously owned its own `Vec<u8>` copy.
### Per-fragment memory breakdown (before)
| Field | Size per fragment |
|-------|------------------|
| `last_updated_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| `created_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| **Total redundant at 20M fragments** | **~480 MB+** |
### After this change
With interning, all 20M fragments share a single `Arc<[u8]>` allocation per unique payload.
## Benchmark results
Microbenchmark at 100K fragments (10 fields per fragment):
| Scenario | No interning | With interning | Delta |
|----------|-------------|----------------|-------|
| **Uniform (1 unique version)** | 24.5 ms | 17.9 ms | **27% faster** |
| **Diverse (10 unique)** | 25.7 ms | 19.7 ms | **23% faster** |
| **Diverse (100 unique)** | 26.0 ms | 23.4 ms | **10% faster** |
| **Diverse (500 unique)** | 26.0 ms | 22.8 ms | **12% faster** |

| Memory (100K fragments) | No interning | With interning | Savings |
|------------------------|-------------|----------------|---------|
| **10 fields** | 39.47 MB | 29.74 MB | **24.6%** |
| **50 fields** | 69.99 MB | 29.74 MB | **57.5%** |

Both memory and speed improve across all scenarios. The hybrid `InternCache` uses fast Vec scan for the common case (1-3 unique values) and upgrades to HashMap when diversity exceeds 16 entries.
Run with: `cargo bench -p lance-table --bench manifest_intern`
## Changes
- **`rust/lance-table/src/rowids/version.rs`**: `Inline(Vec<u8>)` → `Inline(Arc<[u8]>)`, custom serde impls, updated protobuf conversions
- **`rust/lance-table/src/format/fragment.rs`**: `InternCache<T>` (Vec/HashMap hybrid), extended `DataFileFieldInterner` with version meta interning
- **`rust/lance-table/benches/manifest_intern.rs`**: Microbenchmark covering uniform and diverse scenarios
## Compatibility
- No format change: protobuf schema is unchanged
- Serde JSON output is identical (custom impl serializes `Arc<[u8]>` as `[u8]`)
- `from_sequence()` still works as before (converts internally)
## Test plan
- [x] `cargo check --workspace --tests` passes
- [x] `cargo clippy -p lance-table -p lance -- -D warnings` passes
- [x] All 88 `lance-table` tests pass
- [x] `cargo fmt --all -- --check` passes
- [x] Microbenchmark validates performance across uniform and diverse scenarios
- [ ] CI
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
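A hybrid cache of the kind described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's `InternCache<T>`: the name `HybridInternCache` is hypothetical, it is specialised to `[u8]` payloads, and it uses a `HashSet` rather than a `HashMap` for the large representation. Only the threshold of 16 entries is taken from the description.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Threshold from the description: linear scan for up to 16 entries.
const LINEAR_SCAN_LIMIT: usize = 16;

/// Hypothetical hybrid cache: a `Vec` while small, a hash set once it grows.
pub enum HybridInternCache {
    Small(Vec<Arc<[u8]>>),
    Large(HashSet<Arc<[u8]>>),
}

impl HybridInternCache {
    pub fn new() -> Self {
        HybridInternCache::Small(Vec::new())
    }

    /// True once the cache has switched to the hashed representation.
    pub fn is_hashed(&self) -> bool {
        matches!(self, HybridInternCache::Large(_))
    }

    pub fn intern(&mut self, payload: &[u8]) -> Arc<[u8]> {
        match self {
            HybridInternCache::Small(entries) => {
                // Common case: a handful of unique payloads, so a scan is cheap.
                if let Some(hit) = entries.iter().find(|e| &e[..] == payload) {
                    return Arc::clone(hit);
                }
                let shared: Arc<[u8]> = Arc::from(payload);
                entries.push(Arc::clone(&shared));
                if entries.len() > LINEAR_SCAN_LIMIT {
                    // Too many unique payloads for scanning: move them into a hash set.
                    let set: HashSet<Arc<[u8]>> = entries.drain(..).collect();
                    *self = HybridInternCache::Large(set);
                }
                shared
            }
            HybridInternCache::Large(set) => {
                if let Some(hit) = set.get(payload) {
                    return Arc::clone(hit);
                }
                let shared: Arc<[u8]> = Arc::from(payload);
                set.insert(Arc::clone(&shared));
                shared
            }
        }
    }
}
```

The upgrade moves the existing `Arc`s into the set, so handles returned before the switch still point at the same allocations afterwards.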
…rmat#6308)
- `list_all_tables`
- `restore_table`
- `update_table_schema_metadata`
- `get_table_stats`
- `explain_table_query_plan`
- `analyze_table_query_plan`
---------
Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
## Summary
- Adds `#[instrument]` attributes from the `tracing` crate to key functions across the `mem_wal` module
- Covers write path (`RegionWriter::open`, `put`, `close`), flush path (`MemTableFlusher::flush`, `flush_with_indexes`), WAL operations, manifest store, memtable inserts, scanner/planner, point lookups, and vector search
- Uses appropriate trace levels (`info` for high-level operations, `debug` for internals) with relevant fields (region_id, epoch, row counts, batch counts)
## Test plan
- [x] `cargo check` passes: no functional changes, only attribute additions
- [x] Existing `mem_wal` tests continue to pass
- [ ] Tracing output verified with `RUST_LOG=debug` showing instrumented spans
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)
## Summary
Refactor `FullZipScheduler::create_page_load_task` to accept a pre-submitted I/O future instead of deferring I/O submission until the async task executes. This allows the I/O requests to be submitted immediately during scheduling, enabling the object store layer to batch and parallelize them.
close lance-format#6504
## I/O Model Change
### Before: Lazy I/O submission (serialized)
Previously, `create_page_load_task` received a `FullZipReadSource::Remote(io)` along with byte ranges and priority. The actual `io.submit_request()` call happened **inside** the async block, meaning the I/O request was not submitted until the future was first polled. When decoding multiple pages (e.g. across many fragments), this created a sequential I/O pattern:
```
Page 1: [schedule] -> [poll] -> [submit I/O] -> [wait response] -> [decode]
Page 2: [schedule] -> [poll] -> [submit I/O] -> [wait response] -> [decode]
Page 3: [schedule] -> [poll] -> ...
```
Each page's I/O request could only be submitted after the previous task started executing. The I/O scheduler had no visibility into upcoming requests, preventing it from batching or parallelizing them effectively.
### After: Eager I/O submission (pipelined)
Now, `io.submit_request()` is called **before** constructing the `PageLoadTask`, and the resulting future is passed into `create_page_load_task`. All I/O requests for all pages are submitted upfront during the scheduling phase:
```
[schedule all pages] --> submit I/O page 1 -+
                     --> submit I/O page 2 -+
                     --> submit I/O page 3 -+  (all in-flight concurrently)
                     --> submit I/O page N -+
                                            |
[poll] -> [await page 1 response] -> [decode]
[poll] -> [await page 2 response] -> [decode]
[poll] -> [await page 3 response] -> [decode]
```
The object store layer can now see all pending requests at once and optimize I/O through batching, connection multiplexing, and parallel fetches. The async tasks only await the already-in-flight I/O futures.
## Changes
- `rust/lance-encoding/src/encodings/logical/primitive.rs`:
  - Changed `create_page_load_task` signature to accept `BoxFuture<'static, Result<Vec<Bytes>>>` instead of `FullZipReadSource` + byte ranges + priority
  - Moved `io.submit_request()` calls to happen eagerly at both call sites (`schedule_ranges_with_rep_index` and the non-rep-index path), before constructing the page load task
## Performance
Tested with a multi-fragment dataset containing fixed-width columns (768-dim float32 vectors, 40 fragments, 50 rows/fragment):
| Benchmark | Before (p50) | After (p50) | Speedup |
|---|---|---|---|
| Fixed-width column scan | 3453 ms | 523 ms | **6.6x** |

The improvement comes entirely from I/O pipelining; the decoding logic itself is unchanged. The effect is most pronounced with many fragments or pages, where the serialized I/O submission was the dominant bottleneck.
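The difference between the two submission orders can be shown with a standard-library analogy. This sketch uses OS threads in place of async futures, which is an assumption made to keep it dependency-free; `submit_request` is a hypothetical stand-in for `io.submit_request()`, and the returned `JoinHandle` plays the role of the already in-flight I/O future.

```rust
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for `io.submit_request()`: the read starts as soon as
/// this is called, and the handle can be awaited later.
fn submit_request(page: usize) -> JoinHandle<Vec<u8>> {
    thread::spawn(move || vec![page as u8; 4])
}

/// Lazy model: each request is submitted only when its task runs,
/// so submission and waiting alternate page by page.
pub fn load_lazily(pages: usize) -> Vec<String> {
    let mut log = Vec::new();
    for p in 0..pages {
        log.push(format!("submit {p}"));
        let handle = submit_request(p);
        log.push(format!("await {p}"));
        let _bytes = handle.join().unwrap();
    }
    log
}

/// Eager model: every request is submitted during scheduling,
/// and the tasks only wait on handles that are already in flight.
pub fn load_eagerly(pages: usize) -> Vec<String> {
    let mut log = Vec::new();
    let mut in_flight = Vec::new();
    for p in 0..pages {
        log.push(format!("submit {p}"));
        in_flight.push(submit_request(p));
    }
    for (p, handle) in in_flight.into_iter().enumerate() {
        log.push(format!("await {p}"));
        let _bytes = handle.join().unwrap();
    }
    log
}
```

The returned logs make the ordering visible: the lazy version interleaves `submit` and `await`, while the eager version emits all `submit` entries before the first `await`.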
## Summary
- Add `blob_max_pack_file_bytes` to `WriteParams`, allowing users to override the default 1 GiB maximum pack (`.blob`) sidecar file size
- Thread the configuration through the full write path: `WriteParams` -> `WriterGenerator` -> `WriterOptions` -> `BlobPreprocessor` -> `PackWriter`
- Expose the option in Python (`write_dataset`) and Java (`WriteParams.Builder`) bindings
## Test plan
- [x] All 37 existing blob tests pass (`cargo test -p lance blob`)
- [x] Clippy clean on `lance` and `lance-jni` crates
- [x] Verify Python binding works end-to-end with `blob_max_pack_file_bytes` kwarg
- [x] Verify Java binding compiles with `./mvnw compile`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Bump `jieba-rs` from 0.8.1 to 0.9.0 to fix the `build-no-lock` CI job
- The `core2` crate v0.4.0 was yanked from crates.io, breaking fresh dependency resolution (`jieba-rs` → `include-flate` → `libflate` → `core2`)
- `jieba-rs` 0.9.0 drops the `include-flate`/`libflate`/`core2` chain entirely, removing 9 transitive dependencies with no API changes
## Test plan
- [x] `cargo check -p lance-index --features tokenizer-jieba` passes
- [x] Verified build succeeds without `Cargo.lock` (simulating the CI job)
- [ ] CI `build-no-lock` job passes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This tightens the repository's environment guidance so language-specific tasks must follow the documented workflow before reporting missing tools or dependencies. For Python work, the docs now make `uv sync --extra tests --extra dev` and `uv run ...` mandatory, and explicitly call out the common failure mode where slow `uv sync` is interrupted or global Python is used instead.
This changes per-base runtime configuration to use exact `ObjectStoreParams` bindings keyed by `BasePath.path` instead of per-base storage option overrides. Dataset-level and write-level store params now act only as fallbacks, while reads, target-base writes, and external blob resolution all consult the same base-scoped binding model. This keeps provider-specific runtime state out of the manifest and follows the direction in discussion lance-format#6307 to keep `BasePath` focused on identity.
This PR vendors the tokenizer stack Lance actually uses into a new `rust/lance-tokenizer` crate and rewires FTS and inverted-index code to depend on it instead of `tantivy` and `lindera-tantivy`. It keeps the existing document and query tokenization semantics in-tree, renames the old FTS document adapter module to `document_tokenizer`, and preserves upstream license headers on vendored code.
…ormat#6517)
## Summary
- Add hand-written AVX2 and AVX-512 VNNI backends for u8 squared L2 distance (`Σ(a-b)²`) in new `l2_u8.rs`
- Add fused single-pass u8 cosine distance kernel in new `cosine_u8.rs`, which computes `dot(a,b)`, `‖a‖²`, `‖b‖²` simultaneously, halving memory traffic vs the previous 2-3 pass approach
- Wire both into the `L2 for u8` and `Cosine for u8` trait impls
- Add benchmarks comparing scalar vs SIMD for both kernels
### Algorithmic approach (adapted from [NumKong](https://github.com/ashvardanian/NumKong))
- **L2 (AVX2):** Saturating subtraction for `|a-b|`, zero-extend u8→i16, `VPMADDWD(diff, diff)` to square and accumulate into i32. 32 elements/iter.
- **L2 (AVX-512 VNNI):** Same abs-diff approach with `VPDPWSSD` for fused square-accumulate. 64 elements/iter.
- **Cosine (AVX2):** Zero-extend both vectors to i16, triple `VPMADDWD` per half (a·b, a·a, b·b). 32 elements/iter, single pass.
- **Cosine (AVX-512 VNNI):** Same three-accumulator approach with `VPDPWSSD`. 64 elements/iter.

Both kernels use `OnceLock`-based runtime CPU dispatch, falling back to portable scalar on non-x86 platforms.
### Benchmarks
*1M × 1024-dim u8 vectors.*
**x86_64: AMD Ryzen 5 4500 6-Core (AVX2, no AVX-512)**
| Kernel | Scalar | SIMD | Speedup |
|--------|--------|------|---------|
| L2(u8) | 73.5 ms | 58.2 ms | **1.26x** |
| Cosine(u8) | 122.2 ms | 82.1 ms | **1.49x** |

L2 auto-vectorization baseline was 91.5 ms, so SIMD is 1.57x faster than that path.
**aarch64: Apple Silicon M3 Max (no AVX2, scalar fallback)**
| Kernel | Scalar | SIMD (dispatch) |
|--------|--------|-----------------|
| L2(u8) | 26.8 ms | 27.3 ms |
| Cosine(u8) | 90.1 ms | 90.4 ms |

On aarch64 the SIMD path falls through to scalar (no AVX2), so times are effectively identical, which confirms no regression on non-x86 platforms. AVX-512 VNNI systems (Ice Lake+, Zen 4+) should see larger gains.
## Test plan
- [x] All 11 new tests pass: SIMD backends verified against scalar reference across 18 vector sizes (0–4097), boundary values (0/255), alternating patterns, random seeds
- [x] All 63 existing lance-linalg tests pass (no regressions)
- [x] Clippy clean, fmt clean
- [x] Benchmarked on x86_64 AVX2 (AMD Ryzen 5 4500): L2 1.26x, Cosine 1.49x faster
- [ ] Verify on AVX-512 VNNI system for additional speedup data
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
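The arithmetic behind both kernels can be checked with a portable scalar sketch. This is not the SIMD code from `l2_u8.rs` or `cosine_u8.rs`; the function names are hypothetical. Two further assumptions are labeled in the comments: combining the two saturating subtractions with a bitwise OR (a common idiom for unsigned abs-diff), and returning a distance of 1.0 when either vector is all zeros.

```rust
/// |a - b| for unsigned bytes built from saturating subtraction only.
/// Assumption: the two one-sided differences are combined with OR; one of
/// them is always zero, so the OR equals the absolute difference.
pub fn abs_diff_saturating(a: u8, b: u8) -> u8 {
    a.saturating_sub(b) | b.saturating_sub(a)
}

/// Scalar squared L2 distance, Σ(a-b)², accumulated in u64 so that long
/// vectors cannot overflow in this sketch.
pub fn l2_squared_u8(a: &[u8], b: &[u8]) -> u64 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = abs_diff_saturating(x, y) as u64;
            d * d
        })
        .sum()
}

/// Fused single-pass cosine distance: one loop accumulates a·b, a·a and b·b.
pub fn cosine_distance_u8(a: &[u8], b: &[u8]) -> f32 {
    assert_eq!(a.len(), b.len());
    let (mut ab, mut aa, mut bb) = (0u64, 0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as u64, y as u64);
        ab += x * y;
        aa += x * x;
        bb += y * y;
    }
    // Assumption: a zero vector is treated as maximally distant.
    if aa == 0 || bb == 0 {
        return 1.0;
    }
    1.0 - (ab as f64 / ((aa as f64).sqrt() * (bb as f64).sqrt())) as f32
}
```

Because all three sums are built in the same loop, each input byte is loaded once, which is the source of the reduced memory traffic mentioned in the summary.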
This fixes the release bump configuration after `lance-tokenizer` was added to the workspace dependencies. `.bumpversion.toml` was missing the corresponding replacement rule, so version bumps could leave that internal dependency on the previous version. This is a targeted config-only fix to keep the release automation updating all workspace crates consistently.
This fixes the directory namespace CI failure where single-instance concurrent create/drop operations on `__manifest` could time out with `TooMuchWriteContention`, especially in the Windows build. Manifest mutations are now serialized within a single `ManifestNamespace` instance so concurrent operations stop racing on stale in-memory snapshots, and inline manifest maintenance now defers compaction/index merges until the table has accumulated enough fragments. Context: https://github.com/lance-format/lance/actions/runs/24439767878/job/71401857043
Blob columns can be represented either as loaded values or as unloaded descriptor schemas, but our schema projection logic still treated those views as incompatible types. This change teaches field projection and intersection to recognize blob loaded/unloaded pairs as the same logical column, and adds regression coverage for both the core schema path and the projection-plan path that previously failed.
## Summary
- Adds `ChopBatchesStream`, a stream wrapper that splits oversized batches (>1.5x target `batch_size_bytes`) into smaller sub-batches using zero-copy `RecordBatch::slice`
- Wraps the filtered read output stream with `ChopBatchesStream` when `batch_size_bytes` is configured via `FileReaderOptions`
- Serves as a safety net when the underlying file reader doesn't estimate batch sizes accurately enough

**Stacked on feat/byte-sized-batches-file-reader**: wait for that to merge first, then rebase this PR.
## Test plan
- [x] Unit tests for `ChopBatchesStream`: splits large batches, passes small batches through, `wrap_if_needed(None)` is a no-op
- [x] `cargo clippy` clean
- [x] `cargo fmt` clean
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
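One way to picture the chopping rule is as a function from a batch's size to the row ranges that would be passed to a slice call. The sketch below is hypothetical and standard-library only: `chop_ranges` is not part of the PR, and splitting rows evenly on the assumption that rows are roughly equal in size is a guess at a reasonable policy, since the description only states the 1.5x trigger.

```rust
use std::ops::Range;

/// Returns the row ranges a batch would be sliced into.
/// Batches at or below 1.5x the target pass through as one range.
pub fn chop_ranges(num_rows: usize, batch_bytes: usize, target_bytes: usize) -> Vec<Range<usize>> {
    // Integer form of `batch_bytes <= 1.5 * target_bytes`.
    if num_rows == 0 || target_bytes == 0 || 2 * batch_bytes <= 3 * target_bytes {
        return vec![0..num_rows];
    }
    // Assumption: rows are roughly equal in size, so rows are split evenly.
    let wanted_chunks = (batch_bytes + target_bytes - 1) / target_bytes;
    let num_chunks = wanted_chunks.min(num_rows);
    let rows_per_chunk = (num_rows + num_chunks - 1) / num_chunks;
    (0..num_rows)
        .step_by(rows_per_chunk)
        .map(|start| start..(start + rows_per_chunk).min(num_rows))
        .collect()
}
```

Each returned range maps to one `(offset, length)` pair for a zero-copy slice, so the sub-batches share the parent's buffers instead of copying them.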
…-format#6503) Add protobuf encode/decode for `ANNIvfSubIndexExec` --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This isolates the `test_memory_leaks` index statistics probe into a fresh subprocess instead of running it inside the long-lived pytest worker process. That keeps the test focused on repeated `index_statistics` calls and avoids false positives from RSS growth left behind by earlier tests such as the recent batch-chopping coverage added in `test_dataset.py`.
…at#6352) This PR improves blob I/O in two complementary ways: `BlobFile` instances that resolve to the same physical object now share a lazy `BlobSource` and can opportunistically coalesce concurrent reads before handing them to Lance's existing scheduler, and datasets now expose a planned `read_blobs` API for materializing blob payloads directly. It also adds explicit cursor-preserving range reads for `BlobFile` across Rust, Python, and Java, with end-to-end Python coverage for the new API and the edge cases it uncovered.
This keeps the optimization aligned with Lance's existing scheduler model while giving callers a higher-level path for sequential and batched blob access.
## Python example
```python
import lance

dataset = lance.dataset("/path/to/dataset")
blobs = dataset.read_blobs(
    "images",
    indices=[0, 4, 8],
    target_request_bytes=8 * 1024 * 1024,
    max_gap_bytes=64 * 1024,
    max_concurrency=4,
    preserve_order=True,
)
for row_address, payload in blobs:
    print(row_address, len(payload))
```
…mat#6540)
## Summary
- Adds `f64x4` and `f64x8` SIMD types to `lance-linalg` with support for x86_64 (AVX2/AVX-512), aarch64 (NEON), and loongarch64 (LASX)
- Replaces auto-vectorization-dependent f64 distance functions with explicit SIMD using two-level unrolling (f64x8 + f64x4 + scalar tail)
- Updates norm_l2, dot, L2, and cosine distance for f64
## Benchmark Results (Apple M-series, aarch64 NEON)
1M vectors × 1024 dimensions:
| Benchmark | Before | After | Change |
|-----------|--------|-------|--------|
| NormL2(f64, auto-vec) | 117.76 ms | 116.04 ms | ~same |
| NormL2(f64, SIMD) | N/A (TODO) | 119.16 ms | new |
| Dot(f64, auto-vec) | 129.36 ms | 130.23 ms | ~same |
| L2(f64, auto-vec) | 132.53 ms | 135.15 ms | ~same |
| **Cosine(f64, auto-vec)** | **202.52 ms** | **139.23 ms** | **-31.4%** |

The biggest win is **cosine distance**, which previously had an empty `impl Cosine for f64 {}` falling back to the scalar path. The explicit SIMD implementation is **31% faster**.
For norm_l2, dot, and L2, LLVM's auto-vectorization with the LANES=8 hint was already producing good code on this platform. The explicit SIMD ensures consistent performance across compilers and platforms rather than relying on fragile auto-vectorization hints.
## Test plan
- [x] All 59 lance-linalg tests pass
- [x] Clippy clean (`-D warnings`)
- [x] `cargo fmt` clean
- [ ] CI passes on all platforms
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…at#6506)
## Summary
- Add AVX-512 VNNI and AVX2 backends for unsigned int8 dot product with runtime CPU feature detection and automatic fallback to scalar
- Replace the unoptimized `Dot<u8>` impl (which had an explicit TODO) with dispatched SIMD kernel
- All existing callers including SQ distance computation benefit automatically with zero changes to lance-index
## Details
### New file: `rust/lance-linalg/src/distance/dot_u8.rs`
Three backends selected at runtime via `OnceLock` + `is_x86_feature_detected!`:
| Backend | Instruction | Elements/iter | CPU |
|---|---|---|---|
| AVX-512 VNNI | `VPDPBUSD` + XOR-0x80 bias trick | 64 | Ice Lake+ / Zen 4+ |
| AVX2 | `VPMADDWD` on zero-extended u16 | 32 | Haswell+ / Zen 1+ |
| Scalar | portable reference | - | any (including ARM) |
### The VNNI bias trick
`VPDPBUSD` expects one unsigned and one signed operand, but SQ vectors are u8×u8. We XOR one operand with 0x80 to map it to the signed domain, then correct by adding `128·Σa` at the end. The correction uses `VPSADBW` which runs on execution port 5 while `VPDPBUSD` runs on port 0 — they execute in parallel every cycle, making the correction effectively free.
### SQ integration (automatic)
`SQDistCalculator::distance()` already calls `dot_distance()` → `u8::dot()` for Dot distance type. Replacing the `Dot<u8>` body is the only change needed.
## Benchmarks
### Ryzen 4500 (AVX2, no VNNI)
1M total u8 elements, varying vector dimension. Scalar baseline vs AVX2-dispatched path:
| Dimension | Scalar | Dispatch (AVX2) | Speedup |
|-----------|--------|-----------------|---------|
| 128 | 51.02 µs | 58.25 µs | 0.88x (dispatch overhead dominates) |
| 256 | 44.96 µs | 38.62 µs | **1.16x** |
| 512 | 42.82 µs | 28.27 µs | **1.51x** |
| 1024 | 41.00 µs | 25.17 µs | **1.63x** |

AVX2 delivers up to 1.63x throughput at dim=1024. At dim=128 the `OnceLock` dispatch and AVX2 loop setup overhead exceeds the SIMD gains on short vectors. AVX-512 VNNI (Ice Lake+ / Zen 4+) is expected to show larger gains with 64 elements/iter.
### Apple M4 (ARM64, scalar fallback)
On ARM64 the dispatch falls back to scalar, so both paths perform identically (~13 µs at dim=1024). A follow-up ARM NEON `UDOT` path would bring SIMD gains to Apple Silicon.
### Out of scope (follow-up)
- L2/Cosine u8 SIMD optimization (different kernel: `Σ(a-b)²`)
- Native `VPDPBUUD` (unsigned×unsigned, Sierra Forest+) — too new for stable Rust
- ARM NEON `UDOT` path
- Precomputed norms for SQ L2/Cosine (requires storage format change)
## Test plan
- [x] Unit tests: random inputs across 18 vector sizes (0-4097), boundary values (all 0s, all 255s, alternating), one-sided zeros, all-ones patterns
- [x] Each backend tested independently against scalar reference (with `#[cfg]` guards for missing CPU features)
- [x] Existing `dot` tests continue to pass (9/9)
- [x] `cargo clippy -p lance-linalg --tests --benches -- -D warnings` clean
- [x] Benchmark on x86_64 with AVX2: `cargo bench --bench dot -p lance-linalg -- "Dot\(u8"`

🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
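The arithmetic behind the VNNI bias trick described in this commit can be checked with a scalar sketch. XOR-ing a `u8` value `b` with 0x80 and reinterpreting it as `i8` yields `b - 128`, so `Σ a·(b − 128) + 128·Σa = Σ a·b`. The function names below are hypothetical, and the code models only the math, not the `VPDPBUSD` / `VPSADBW` intrinsics the commit uses.

```rust
/// Portable reference: plain unsigned dot product.
pub fn dot_u8_scalar(a: &[u8], b: &[u8]) -> u64 {
    a.iter().zip(b).map(|(&x, &y)| x as u64 * y as u64).sum()
}

/// Scalar model of the bias trick: treat `a` as unsigned and `b ^ 0x80`
/// as signed, accumulate, then add back the correction term 128 * sum(a).
pub fn dot_u8_biased(a: &[u8], b: &[u8]) -> u64 {
    let mut acc: i64 = 0; // models the unsigned x signed accumulator
    let mut sum_a: i64 = 0; // models the running sum of the unsigned operand
    for (&x, &y) in a.iter().zip(b) {
        let y_signed = (y ^ 0x80) as i8; // equals y - 128
        acc += x as i64 * y_signed as i64;
        sum_a += x as i64;
    }
    // The corrected total is always non-negative, so the cast is lossless.
    (acc + 128 * sum_a) as u64
}
```

Both functions should agree on any pair of equal-length inputs, which is the same property the commit's test plan checks for each backend against the scalar reference.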
## Summary
- Replaces the external `numkong` dependency with in-tree C kernels for
**bf16 distance computation** (dot product, L2, cosine, norm_l2)
- Follows the existing f16 kernel pattern: C source compiled via
`build.rs` with per-architecture flags, runtime CPU dispatch via
`SIMD_SUPPORT`
- Kernels are only enabled when the CPU supports the required
instructions (NEON on aarch64, AVX2/AVX-512 on x86_64, LSX/LASX on
loongarch64), with scalar fallback otherwise
- Gated behind the existing `fp16kernels` feature flag
## Benchmark Results
Tested on two platforms with 1M x 1024-dim vectors:
### Apple Silicon (M-series, NEON)
| Benchmark | Before (scalar) | After (C kernel) | Change |
|-----------|-----------------|-------------------|--------|
| **Dot(bf16)** | 144 ms | 55 ms | **2.6x faster** |
| **NormL2(bf16)** | 90 ms | 36 ms | **2.5x faster** |
### AMD Ryzen 5 4500 (Zen 2, AVX2)
| Benchmark | Before (scalar) | After (C kernel) | Change |
|-----------|-----------------|-------------------|--------|
| **Dot(bf16)** | 578 ms | 363 ms | **1.6x faster** (−37%) |
| **NormL2(bf16)** | 365 ms | 207 ms | **1.8x faster** (−43%) |
### Why the approach works
BF16-to-f32 conversion is a simple left-shift by 16 bits. The C kernels
compiled with architecture-specific flags (`-march=haswell`,
`-mtune=apple-m1`, etc.) plus `-ffast-math` and vectorization pragmas
give the compiler more freedom to emit tight SIMD code than LLVM gets
from the Rust scalar loops. ARM benefits more because the baseline Rust
auto-vectorization was weaker there.
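
The conversion mentioned above can be sketched in a few lines. A bf16 value is the upper 16 bits of an IEEE-754 f32, so widening it means placing those bits in the high half of a 32-bit word. The helper names here are hypothetical and the code is a scalar Rust illustration, not the C kernel from `bf16.c`.

```rust
/// Widen raw bf16 bits to f32 by shifting them into the high 16 bits.
pub fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Scalar dot product over raw bf16 bit patterns, accumulating in f32.
pub fn dot_bf16_scalar(a: &[u16], b: &[u16]) -> f32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| bf16_bits_to_f32(x) * bf16_bits_to_f32(y))
        .sum()
}
```

For instance, the bit pattern `0x3F80` widens to `0x3F80_0000`, which is `1.0` as an f32.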
## Files Changed
- **New**: `rust/lance-linalg/src/simd/bf16.c` — C kernels for dot, L2,
cosine, norm_l2
- `rust/lance-linalg/build.rs` — compile bf16.c for each architecture
- `rust/lance-linalg/src/distance/{dot,l2,cosine,norm_l2}.rs` — runtime
SIMD dispatch for bf16
- `rust/lance-linalg/Cargo.toml` — removed `numkong` dependency and
feature
- `rust/lance-linalg/benches/{dot,l2,cosine}.rs` — removed numkong
benchmark sections
- **Deleted**: `scripts/bench_numkong.sh`
## Test plan
- [x] `cargo test -p lance-linalg --features fp16kernels` — all bf16
tests pass (kernel path)
- [x] `cargo test -p lance-linalg` — all bf16 tests pass (scalar
fallback)
- [x] `cargo clippy -p lance-linalg --features fp16kernels --tests
--benches -- -D warnings` — clean
- [x] Benchmarked on Apple Silicon (ARM NEON)
- [x] Benchmarked on AMD Ryzen 5 4500 (x86_64 AVX2)
- To reproduce:
```bash
git checkout HEAD~1
TARGET_TIME=3 cargo bench -p lance-linalg --features fp16kernels --bench dot -- --save-baseline before "bf16"
TARGET_TIME=3 cargo bench -p lance-linalg --features fp16kernels --bench norm_l2 -- --save-baseline before "bf16"
git checkout -
TARGET_TIME=3 cargo bench -p lance-linalg --features fp16kernels --bench dot -- --baseline before "bf16"
TARGET_TIME=3 cargo bench -p lance-linalg --features fp16kernels --bench norm_l2 -- --baseline before "bf16"
```
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…a object store calls (lance-format#6507)
## Summary
- Adds `dir_listing_to_manifest_migration_enabled` flag (default: `false`) to `DirectoryNamespaceBuilder` and `DirectoryNamespace`
- When `false` and both `manifest_enabled` and `dir_listing_enabled` are `true`, root-level table operations (`table_exists`, `describe_table`, `list_tables`) skip the manifest check and use directory listing directly, avoiding extra object store listing calls
- When `true`, preserves the existing hybrid behavior of checking manifest first then falling back to directory listing
- Includes a test with a counting object store wrapper verifying only a single `list_with_delimiter` call is made for root-level table operations without migration mode
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
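The flag semantics described in this commit can be summarized as a small decision function. This sketch covers only the case the description spells out (both `manifest_enabled` and `dir_listing_enabled` set to `true`); the `Source` enum and `root_lookup_order` function are hypothetical names for illustration and are not part of `DirectoryNamespace`.

```rust
/// Where a root-level table operation looks for tables (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Source {
    Manifest,
    DirListing,
}

/// Lookup order for root-level table operations, per the commit description.
/// Returns `None` for flag combinations the description does not cover.
pub fn root_lookup_order(
    manifest_enabled: bool,
    dir_listing_enabled: bool,
    migration_enabled: bool,
) -> Option<Vec<Source>> {
    match (manifest_enabled, dir_listing_enabled, migration_enabled) {
        // Migration off: skip the manifest check, list the directory directly.
        (true, true, false) => Some(vec![Source::DirListing]),
        // Migration on: hybrid behavior, manifest first, then directory listing.
        (true, true, true) => Some(vec![Source::Manifest, Source::DirListing]),
        // Other combinations are outside what the commit message describes.
        _ => None,
    }
}
```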
…6562)
## Summary
- The `FairSpillPool` divides memory evenly across spillable consumers. With up to 8 partitions, each sort consumer was limited to ~12.5MB from a flat 100MB pool, causing merge_insert operations with large payloads to fail with "not enough memory to continue external sort" at very small batch sizes (e.g. 5 rows with 1MB payloads).
- Scale the default pool size to 100MB **per partition** so each consumer gets a reasonable allocation. Explicit `LANCE_MEM_POOL_SIZE` or `mem_pool_size` settings are respected as-is.
- This is a partial fix — very large batches can still exhaust the per-partition budget. A more complete fix may involve revisiting the pool type or spilling behavior for merge_insert.
## Test plan
- [x] Added unit test `test_mem_pool_size_scales_with_partitions` verifying pool size scales correctly
- [x] Verified with a Python repro script that merge_insert with 1MB-per-row payloads no longer fails at 5 rows (now succeeds up to ~50 rows)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
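The sizing arithmetic in this commit can be worked through with a short sketch. It assumes "100MB" means 100 MiB and that every consumer is spillable, so the even split is a plain division. The names `effective_pool_size` and `fair_share` are hypothetical and do not come from the Lance or DataFusion code.

```rust
/// Assumed default budget: 100 MiB per partition.
pub const DEFAULT_POOL_BYTES_PER_PARTITION: u64 = 100 * 1024 * 1024;

/// Pool size after the change: an explicit setting is used as-is,
/// otherwise the default scales with the partition count.
pub fn effective_pool_size(explicit: Option<u64>, partitions: u64) -> u64 {
    match explicit {
        Some(bytes) => bytes,
        None => DEFAULT_POOL_BYTES_PER_PARTITION * partitions.max(1),
    }
}

/// Simplified model of an even split across spillable consumers.
pub fn fair_share(pool_bytes: u64, spillable_consumers: u64) -> u64 {
    pool_bytes / spillable_consumers.max(1)
}
```

Under these assumptions, a flat 100 MiB pool split 8 ways gives each consumer 12.5 MiB, while the scaled default gives each of 8 consumers the full 100 MiB.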
b101398 to 11f5d9f
…e-format#6570)
## Summary
The `linux-build` CI job installs unpinned `nightly`, which broke after 2026-04-17 because `ethnum 1.5.2` uses `unsafe { mem::transmute(()) }` to create `TryFromIntError`. Newer nightly builds reject this with `error[E0512]: cannot transmute between types of different sizes`.
Dependency chain: `lance-arrow` → `jsonb 0.5.6` → `ethnum 1.5.2`
This PR pins `nightly-2026-04-16` (last known-good date) in both the toolchain install step and the `cargo +nightly` invocation.
## Root Cause
`ethnum-1.5.2/src/error.rs:16`:
```rust
pub const fn tfie() -> TryFromIntError {
    unsafe { mem::transmute(()) } // () is 0 bits, TryFromIntError is 8 bits
}
```
Rust nightly `e9e32aca5` (2026-04-17) tightened `transmute` checks, making this a hard error.
## Follow-up
- Upstream fix needed in [`nlordell/ethnum-rs`](https://github.com/nlordell/ethnum-rs) to replace the transmute hack
- Once `ethnum` publishes a fix and `jsonb` picks it up, the pin can be removed
## Test plan
- [x] `linux-build` job passes with pinned nightly
11f5d9f to dec39ed
Summary
- … `ErrorResponse` model, instead of guessing errors from HTTP status codes
- … `NamespaceError` variants per the spec's per-operation error tables
- … `ErrorResponse` model, and correct HTTP status code mappings (406 for Unsupported, 409 for NamespaceNotEmpty/InvalidTableState)
- … `ErrorCode::Throttled` to `ErrorCode::Throttling` to match Python/Java SDK naming
- … `{:?}` (Debug) formatting for all source errors to preserve the full error chain
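
The status code mappings named in the summary (406 for Unsupported, 409 for NamespaceNotEmpty and InvalidTableState) can be illustrated with a small match. The `SketchError` enum and `http_status` function are hypothetical stand-ins; the real `NamespaceError` type has more variants and its API is not shown on this page.

```rust
/// Hypothetical stand-in covering only the three cases named in the summary.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchError {
    Unsupported,
    NamespaceNotEmpty,
    InvalidTableState,
}

/// HTTP status for each error, following the mappings listed in the summary.
pub fn http_status(err: SketchError) -> u16 {
    match err {
        SketchError::Unsupported => 406,
        SketchError::NamespaceNotEmpty | SketchError::InvalidTableState => 409,
    }
}
```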