
fix(namespace): align error handling with namespace spec#95

Open
jackye1995 wants to merge 206 commits into main from ns-error-message


@jackye1995 jackye1995 commented Apr 12, 2026

Summary

  • Fix RestNamespace to parse the spec-defined flat error response format using the generated ErrorResponse model, instead of guessing errors from HTTP status codes
  • Fix DirectoryNamespace to return correct NamespaceError variants per the spec's per-operation error tables
  • Fix REST adapter to produce spec-compliant error responses using ErrorResponse model, and correct HTTP status code mappings (406 for Unsupported, 409 for NamespaceNotEmpty/InvalidTableState)
  • Rename ErrorCode::Throttled to ErrorCode::Throttling to match Python/Java SDK naming
  • Use {:?} (Debug) formatting for all source errors to preserve the full error chain
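
As a rough illustration of the status-code mapping in the third bullet, here is a minimal sketch. The `ErrorKind` enum and `http_status` function are hypothetical names, not the PR's types; only the 406 and 409 mappings come from the summary above, and the 500 catch-all is an assumption.

```rust
/// Hypothetical subset of namespace error kinds; the real `ErrorCode`
/// enum touched by this PR has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorKind {
    Unsupported,
    NamespaceNotEmpty,
    InvalidTableState,
    /// Assumption: a catch-all that is not described in the PR summary.
    Internal,
}

/// Map an error kind to an HTTP status code.
/// 406 and 409 follow the PR summary; 500 is an assumed default.
fn http_status(kind: ErrorKind) -> u16 {
    match kind {
        ErrorKind::Unsupported => 406,
        ErrorKind::NamespaceNotEmpty | ErrorKind::InvalidTableState => 409,
        ErrorKind::Internal => 500,
    }
}
```

Keeping the mapping in one exhaustive `match` means a newly added variant fails to compile until someone decides on its status code.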

esteban and others added 30 commits March 10, 2026 02:15
…6146)

fix CI error: `FAILED
python/tests/test_integration.py::test_duckdb_pushdown_extension_types -
_duckdb.Error: DeprecationWarning: fetch_arrow_table() is deprecated,
use to_arrow_table() instead.`
20%+ faster for a 2GB index; could be more for larger indexes
)

This PR fixes the regression benchmarks workflow failing to resolve the
pinned `google-github-actions/auth` action. The workflow had quoted the
entire `uses` value, which caused the trailing `# v2` comment to be
parsed as part of the action ref.
There was a conflict table in transaction.rs but this was incomplete
(some rows/columns missing) and seemed to be imprecise or incorrect in a
few spots. I've attempted to more thoroughly document this in
transaction.md instead.
…ance-format#6160)

Previously, `adjust_child_validity` would call `ArrayData::try_new` with
a null bitmap on a `DataType::Null` array, causing an `.unwrap()` panic
with `InvalidArgumentError("Arrays of type Null cannot contain a null
bitmask")`.

The trigger: when a user inserts rows where a struct sub-field has only
null values, Arrow infers `DataType::Null` for that column. If a
subsequent fragment omits that nullable sub-field, Lance inserts a
`NullReader` to fill it in. `MergeStream` then merges the real batch
(with null struct rows) and the `NullReader` batch (all-null struct),
recursing into the struct where `adjust_child_validity` is called with
the `Null`-typed child and a non-empty parent validity — triggering the
panic.

Fix: skip the bitmask operation when `child.data_type() ==
DataType::Null`. A `Null` array is always entirely null by definition
and needs no validity adjustment.

Closes lance-format#6159

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…e-format#6163)

Previously, when `FragReuseIndexDetails` exceeded 204800 bytes
(triggered by large compactions with many fragments), the code wrote the
details to an external file (`details.binpb`). On local filesystems,
`ObjectStore::create` returns a `LocalWriter` that atomically renames a
temp file to the final path in `Writer::shutdown`. However,
`frag_reuse.rs` imported `tokio::io::AsyncWriteExt` but not
`lance_io::traits::Writer`, so `writer.shutdown()` resolved to
`AsyncWriteExt::shutdown` (flush/close only) — the temp file was deleted
on drop without being persisted. Any subsequent `load_indices` call
would fail with `Not found: .../details.binpb`.

Fixed by using UFCS `Writer::shutdown(writer.as_mut()).await?` to
explicitly call the lance trait method, matching the existing pattern in
`ivf.rs` and `blob.rs`.
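
The method-resolution pitfall described above can be reproduced without tokio or Lance. In this synchronous sketch, `FlushOnly` and `PersistingWriter` are hypothetical stand-ins for `AsyncWriteExt` and `lance_io::traits::Writer`; both declare a `shutdown` method, and which one runs depends on which trait is in scope unless the call names the trait explicitly.

```rust
/// Stand-in for `tokio::io::AsyncWriteExt`: flushes and closes only.
trait FlushOnly {
    fn shutdown(&mut self) -> &'static str;
}

/// Stand-in for `lance_io::traits::Writer`: also persists the temp file.
trait PersistingWriter {
    fn shutdown(&mut self) -> &'static str;
}

struct LocalWriter;

impl FlushOnly for LocalWriter {
    fn shutdown(&mut self) -> &'static str {
        "flushed"
    }
}

impl PersistingWriter for LocalWriter {
    fn shutdown(&mut self) -> &'static str {
        "flushed and renamed"
    }
}

mod only_flush_in_scope {
    use super::{FlushOnly, LocalWriter};

    pub fn close(w: &mut LocalWriter) -> &'static str {
        // Only `FlushOnly` is imported here, so method syntax resolves to it.
        w.shutdown()
    }
}

fn close_explicit(w: &mut LocalWriter) -> &'static str {
    // Fully qualified call: the trait is named, so imports no longer matter.
    PersistingWriter::shutdown(w)
}
```

With both traits imported, plain `w.shutdown()` would be rejected as ambiguous, so the fully qualified form is also the safer choice if the imports change later.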

Fixes lance-format#6161

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This breaks the "build_partitions" stage into "build_partitions" and
"merge_partitions", and also updates the progress reporting on the
shuffle phase to be in terms of rows instead of batches.
This PR moves a few unrelated clippy cleanups out of lance-format#6168 so the blob
empty-range fix can stay focused on the regression it addresses. The
changes here are all mechanical simplifications with no intended
behavior change.
…t#6175)

This PR moves the Linux and Windows workflows that currently run on Warp
onto GitHub-hosted runners. The goal is to reduce reliance on custom
runners and take advantage of the sponsored larger GitHub-hosted
machines for the slowest CI paths.

This is focused on the current CI bottlenecks we observed in recent
successful PR runs, especially Rust ARM and Python Windows jobs, while
keeping the existing macOS and benchmark-specific runners unchanged
until we verify equivalent GitHub-hosted options for them.

Context:
- Recent PR history shows Rust `linux-arm` and Python `windows` as the
dominant critical-path jobs.
- This change upgrades those jobs to larger GitHub-hosted runners where
available (`ubuntu-24.04-8x`, `ubuntu-24.04-arm64-8x`,
`windows-latest-4x`) and aligns the remaining Linux/Windows workflows
with the same runner family.
- I validated the workflow YAML locally after the runner migration; no
product code or test logic changed.

---

Updates:

- Rust linux-arm: 40.7 -> 19.4, about -52%
- Rust windows-build: 27.7 -> 21.0, about -24%
- Python windows: 36.5 -> 23.1, about -37%
- Python Linux 3.13 ARM: 26.9 -> 20.7, about -23%
- Python Linux 3.13 x86_64: 26.8 -> 19.1, about -29%
- Python Linux 3.9 x86_64: 25.9 -> 19.2, about -26%
Improves the alicloud storage config doc (lance-format#4247).

Signed-off-by: FarmerChillax <farmerchillax@outlook.com>
Blob reads should return empty bytes when the logical blob is empty or
the cursor is already at EOF. Today `BlobFile::read` / `read_up_to` can
still issue a `get_range(start..end)` request with `start == end`, which
is tolerated by local readers but rejected by cloud object stores.

This showed up while investigating `random_blob` failures on the
original-scale `laion10m-full` dataset, where legacy blob reads on S3
failed with errors like `Range started at 1 and ended at 1`. The fix
short-circuits empty reads and restores the cursor to blob-relative
semantics after `read()`, and adds regression coverage for both the
empty-range case and packed-blob cursor behavior.
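
A minimal sketch of the short-circuit, assuming a hypothetical `StrictStore` that rejects zero-length ranges the way the cloud stores above do. The free function `read_up_to` and its parameters are illustrative and do not match the signature of `BlobFile::read_up_to`.

```rust
use std::ops::Range;

/// Hypothetical object store that rejects zero-length ranges.
struct StrictStore {
    data: Vec<u8>,
}

impl StrictStore {
    fn get_range(&self, range: Range<usize>) -> Result<Vec<u8>, String> {
        if range.start >= range.end {
            return Err(format!(
                "Range started at {} and ended at {}",
                range.start, range.end
            ));
        }
        self.data
            .get(range)
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| "out of bounds".to_string())
    }
}

/// Read up to `len` bytes starting at the blob-relative `cursor`.
fn read_up_to(
    store: &StrictStore,
    blob_start: usize,
    blob_len: usize,
    cursor: usize,
    len: usize,
) -> Result<Vec<u8>, String> {
    let remaining = blob_len.saturating_sub(cursor);
    let to_read = len.min(remaining);
    if to_read == 0 {
        // Empty blob or cursor at EOF: answer locally, issue no request.
        return Ok(Vec::new());
    }
    let start = blob_start + cursor;
    store.get_range(start..start + to_read)
}
```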

- FTS indexing is ~2.5x faster: this removes the merge phase and produces
large partitions directly.
- Memory footprint is reduced by ~60%: posting lists are compressed
while they are built, which saves a lot of memory and reduces
fragmented objects in memory.

This also bumps the default worker memory budget from 256MiB to 1GiB
because we need to produce larger partitions directly, but the memory
footprint is still much lower.

This adds a new param `memory_limit` so that users can control how the
indexing should work.

---------

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Co-authored-by: LuQQiu <luqiujob@gmail.com>
Co-authored-by: Weston Pace <weston.pace@gmail.com>
…#6187)

This fixes the reader panic in lance-format#6185 when a page keeps nullable rep/def
layer metadata but does not materialize any definition levels. The
decoder now treats that page-local state as all-valid and includes a
regression test that reproduces the mixed-page case before the fix.

Closes lance-format#6185.
)

This fixes the merge-insert fast path for delete-by-source operations
while preserving the existing `UpdateIf` semantics. It also keeps
full-schema `FixedSizeList` merges on the optimized path so target-side
payload columns are pruned from the join build side.

Fix lancedb/lancedb#3094
This updates the benchmark TPC-H datagen path to use DuckDB's
`to_arrow_reader()` API instead of the deprecated `fetch_arrow_reader()`
call.

The benchmark CI treats `DeprecationWarning` as an error, so this
removes the warning that was breaking the random access benchmark job. I
also dropped a leftover `print(ds.count_rows())` debug statement to keep
benchmark logs clean.
In retrospect the old name was somewhat presumptuous. It would probably
be good to get the Arrow project's permission before taking up cargo
real estate. This also adds a README which was preventing the publish.
…mat#6145)

## Summary

Closes lance-format#6138

This PR extends `index_matches_criteria()` in
`rust/lance/src/index/scalar.rs` to handle vector index types in
addition to scalar indices.

## Problem

Previously, `index_matches_criteria()` contained an early return at
lines 464-467 that rejected all non-scalar (vector) indices. This made
it impossible to use `describe_indices` to filter for vector indices on
a specific column.

## Solution

- Removed the early return that rejected all vector indices
- Refactored FTS and exact equality checks to only apply to scalar
indices (these checks are not relevant for vector indices)
- Vector indices now pass through when matching basic criteria (name and
column filters)

## Changes

- 1 file modified: `rust/lance/src/index/scalar.rs`
- 15 lines added, 16 lines removed
- Updated existing test `test_index_matches_criteria_vector_index()` to
reflect the new expected behavior

## Testing

- Updated the existing unit test for vector index criteria matching
- The test now correctly expects vector indices to match basic criteria
instead of being rejected

## AI Disclosure

This contribution was developed with the assistance of Claude (AI by
Anthropic). The implementation approach, code, and PR description were
AI-assisted. All changes are focused on resolving the specific issue
described above.

Co-Authored-By: AI Assistant (Claude) <ai-assistant@contributor-bot.dev>

Signed-off-by: ndpvt-web <ndpvt-web@users.noreply.github.com>
Co-authored-by: ndpvt-web <ndpvt-web@users.noreply.github.com>
Co-authored-by: AI Assistant (Claude) <ai-assistant@contributor-bot.dev>
…er (lance-format#6197)

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…rmat#6194)

This PR makes two changes to ensure stale credentials are not used:
(1) In the Directory namespace if either vending is not enabled or a
credential vendor is not configured we return `None` for storage
options.
(2) The `DynamicStorageOptionsCredentialProvider` falls back to the
default credential provider (lazily loaded) if it is not able to
retrieve credentials.

Closes lance-format/lance-spark#292

---------

Signed-off-by: Daniel Rammer <hamersaw@protonmail.com>
…ance-format#6119)

SimpleIndex (HNSW over centroids) previously only supported fp32
centroids, causing fp16 vector workloads to fall back to brute-force
partition assignment — O(K×D) per vector instead of O(log K × D). For
31K centroids × 1024 dims this is a ~600x difference per vector.

Cast fp16 centroids to fp32 at HNSW construction time (one-time cost)
and cast fp16 query vectors at search time (1024 floats per query,
negligible vs the distance computations saved).

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Xuanwo <github@xuanwo.io>
lance-format#6142)

Previously we would use the default file version when creating new index
files. This was originally done to get some testing of the 2.0 format
before it was made the default. However, this led to a bit of a
potential compatibility problem. If we change the default file version
then the files created by the new release would become unreadable on
very old versions that didn't know how to read that file, even if the
dataset itself had an older file version and the old version knew how to
handle the index otherwise.

To avoid this we change things in this PR so that new index files use
the same format version as the dataset. This should mean the indexes are
always readable if the dataset is readable, regardless of what version
was used to write the index.

---

Parts of this PR were written with Claude (Opus 4.6) and I take full
responsibility for its contents.
beinan and others added 25 commits April 13, 2026 11:10
…lance-format#6477)

## Summary

- Change `DataFile.fields` and `DataFile.column_indices` from `Vec<i32>`
to `Arc<[i32]>` so that fragments with identical field lists share a
single heap allocation
- Add `DataFileFieldInterner` that deduplicates these slices during
manifest deserialization
- In homogeneous tables (the common case), every fragment carries the
same field list, so at 20M fragments this saves **~2.4 GB** of redundant
heap allocations

## Motivation

When dataset manifests grow large (>1 GB with millions of fragments),
opening the dataset becomes very expensive in terms of memory. Each
`DataFile` previously owned its own `Vec<i32>` for `fields` and
`column_indices`, even though in most tables every fragment has the
exact same field list. This PR deduplicates those allocations at
deserialization time.

### Per-fragment memory breakdown (before)

| Field | Size per fragment |
|-------|------------------|
| `fields: Vec<i32>` (10 fields) | ~64 bytes |
| `column_indices: Vec<i32>` (10 cols) | ~64 bytes |
| **Total redundant** | **~128 bytes x 20M = ~2.4 GB** |

### After this change

With interning, all 20M fragments share a single `Arc<[i32]>` allocation
(~80 bytes total instead of 2.4 GB).
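
The sharing can be sketched with a small interner. `FieldInterner` below is a hypothetical, simplified stand-in; the PR's `DataFileFieldInterner` may be structured differently.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical, simplified interner for field-id lists.
#[derive(Default)]
struct FieldInterner {
    seen: HashSet<Arc<[i32]>>,
}

impl FieldInterner {
    /// Return a shared slice equal to `fields`, allocating only the
    /// first time a given list is seen.
    fn intern(&mut self, fields: Vec<i32>) -> Arc<[i32]> {
        if let Some(existing) = self.seen.get(fields.as_slice()) {
            return Arc::clone(existing);
        }
        let shared: Arc<[i32]> = Arc::from(fields);
        self.seen.insert(Arc::clone(&shared));
        shared
    }
}
```

Each fragment then holds an `Arc<[i32]>` (a pointer plus a length) that points at one shared allocation instead of owning a private copy of the list.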

## Changes

- **`lance-table/src/format/fragment.rs`** — Core struct change
(`Vec<i32>` → `Arc<[i32]>`), custom `Serialize`/`Deserialize` impls, and
`DataFileFieldInterner`
- **`lance-table/src/format/manifest.rs`** — Use interner during
manifest deserialization
- **`lance/src/dataset/fragment.rs`**, **`merge_insert.rs`**,
**`io/commit.rs`** — Tombstoning and field-remapping rebuilt as new
`Arc<[i32]>` instead of in-place mutation
- **`python/src/fragment.rs`**, **`java/lance-jni/src/fragment.rs`** —
FFI boundary conversions
- Various test files — Updated struct literals and assertions

## Compatibility

- No format change — protobuf schema is unchanged
- Serde JSON output is identical (custom impl serializes `Arc<[i32]>` as
`[i32]`)
- All public API signatures that take `Vec<i32>` (e.g.,
`DataFile::new()`, `Fragment::add_file()`) still accept `Vec<i32>` and
convert internally

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…mory (lance-format#6499)

## Summary

- Change `RowDatasetVersionMeta::Inline` from `Vec<u8>` to `Arc<[u8]>`
so that fragments with identical version metadata share a single heap
allocation
- Extend `DataFileFieldInterner` to deduplicate these inline byte
payloads during manifest deserialization
- Introduce `InternCache<T>`: a hybrid cache that uses Vec linear scan
for ≤16 entries and upgrades to HashMap for larger caches
- Add custom `Serialize`/`Deserialize` impls for `RowDatasetVersionMeta`
to handle `Arc<[u8]>` transparently

## Motivation

Follow-up to lance-format#6477 (interning `DataFile.fields`/`column_indices`). After
a compaction, all fragments are stamped with the same version metadata
(both `last_updated_at_version_meta` and `created_at_version_meta`), but
each fragment previously owned its own `Vec<u8>` copy.

### Per-fragment memory breakdown (before)

| Field | Size per fragment |
|-------|------------------|
| `last_updated_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| `created_at_version_meta: Inline(Vec<u8>)` | ~24 bytes + payload |
| **Total redundant at 20M fragments** | **~480 MB+** |

### After this change

With interning, all 20M fragments share a single `Arc<[u8]>` allocation
per unique payload.

## Benchmark results

Microbenchmark at 100K fragments (10 fields per fragment):

| Scenario | No interning | With interning | Delta |
|----------|-------------|----------------|-------|
| **Uniform (1 unique version)** | 24.5 ms | 17.9 ms | **27% faster** |
| **Diverse (10 unique)** | 25.7 ms | 19.7 ms | **23% faster** |
| **Diverse (100 unique)** | 26.0 ms | 23.4 ms | **10% faster** |
| **Diverse (500 unique)** | 26.0 ms | 22.8 ms | **12% faster** |

| Memory (100K fragments) | No interning | With interning | Savings |
|------------------------|-------------|----------------|---------|
| **10 fields** | 39.47 MB | 29.74 MB | **24.6%** |
| **50 fields** | 69.99 MB | 29.74 MB | **57.5%** |

Both memory and speed improve across all scenarios. The hybrid
`InternCache` uses fast Vec scan for the common case (1-3 unique values)
and upgrades to HashMap when diversity exceeds 16 entries.
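
One way to picture the hybrid cache is the sketch below. It is an assumption-laden model, not the PR's `InternCache<T>`: it uses a `HashSet` rather than a `HashMap` for the large case, and the names `LINEAR_SCAN_LIMIT` and `is_hashed` are invented for illustration. Only the 16-entry threshold comes from the description above.

```rust
use std::collections::HashSet;
use std::hash::Hash;
use std::sync::Arc;

/// Threshold from the PR description: linear scan up to 16 entries.
const LINEAR_SCAN_LIMIT: usize = 16;

/// Hypothetical hybrid cache: a Vec while small, a hash set once large.
enum InternCache<T> {
    Small(Vec<Arc<[T]>>),
    Large(HashSet<Arc<[T]>>),
}

impl<T: Eq + Hash + Clone> InternCache<T> {
    fn new() -> Self {
        InternCache::Small(Vec::new())
    }

    fn is_hashed(&self) -> bool {
        matches!(self, InternCache::Large(_))
    }

    fn intern(&mut self, value: &[T]) -> Arc<[T]> {
        match self {
            InternCache::Small(entries) => {
                // Linear scan: cheap when only a few unique values exist.
                if let Some(hit) = entries.iter().find(|e| e[..] == *value) {
                    return Arc::clone(hit);
                }
                let shared: Arc<[T]> = Arc::from(value);
                entries.push(Arc::clone(&shared));
                if entries.len() > LINEAR_SCAN_LIMIT {
                    // Too diverse for scanning: move everything into a set.
                    let set: HashSet<Arc<[T]>> = entries.drain(..).collect();
                    *self = InternCache::Large(set);
                }
                shared
            }
            InternCache::Large(set) => {
                if let Some(hit) = set.get(value) {
                    return Arc::clone(hit);
                }
                let shared: Arc<[T]> = Arc::from(value);
                set.insert(Arc::clone(&shared));
                shared
            }
        }
    }
}
```

Values interned before the upgrade stay shared afterwards, because the same `Arc` handles are moved into the set rather than re-created.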

Run with: `cargo bench -p lance-table --bench manifest_intern`

## Changes

- **`rust/lance-table/src/rowids/version.rs`** — `Inline(Vec<u8>)` →
`Inline(Arc<[u8]>)`, custom serde impls, updated protobuf conversions
- **`rust/lance-table/src/format/fragment.rs`** — `InternCache<T>`
(Vec/HashMap hybrid), extended `DataFileFieldInterner` with version meta
interning
- **`rust/lance-table/benches/manifest_intern.rs`** — Microbenchmark
covering uniform and diverse scenarios

## Compatibility

- No format change — protobuf schema is unchanged
- Serde JSON output is identical (custom impl serializes `Arc<[u8]>` as
`[u8]`)
- `from_sequence()` still works as before (converts internally)

## Test plan

- [x] `cargo check --workspace --tests` passes
- [x] `cargo clippy -p lance-table -p lance -- -D warnings` passes
- [x] All 88 `lance-table` tests pass
- [x] `cargo fmt --all -- --check` passes
- [x] Microbenchmark validates performance across uniform and diverse
scenarios
- [ ] CI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rmat#6308)

- `list_all_tables`
- `restore_table`
- `update_table_schema_metadata`
- `get_table_stats`
- `explain_table_query_plan`
- `analyze_table_query_plan`

---------

Co-authored-by: zhangyue19921010 <zhangyue.1010@bytedance.com>
## Summary

- Adds `#[instrument]` attributes from the `tracing` crate to key
functions across the `mem_wal` module
- Covers write path (`RegionWriter::open`, `put`, `close`), flush path
(`MemTableFlusher::flush`, `flush_with_indexes`), WAL operations,
manifest store, memtable inserts, scanner/planner, point lookups, and
vector search
- Uses appropriate trace levels (`info` for high-level operations,
`debug` for internals) with relevant fields (region_id, epoch, row
counts, batch counts)

## Test plan

- [x] `cargo check` passes — no functional changes, only attribute
additions
- [x] Existing `mem_wal` tests continue to pass
- [ ] Tracing output verified with `RUST_LOG=debug` showing instrumented
spans

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
)

## Summary

Refactor `FullZipScheduler::create_page_load_task` to accept a
pre-submitted I/O future instead of deferring I/O submission until the
async task executes. This allows the I/O requests to be submitted
immediately during scheduling, enabling the object store layer to batch
and parallelize them. Closes lance-format#6504.

## I/O Model Change

### Before: Lazy I/O submission (serialized)

Previously, `create_page_load_task` received a
`FullZipReadSource::Remote(io)` along with byte ranges and priority. The
actual `io.submit_request()` call happened **inside** the async block,
meaning the I/O request was not submitted until the future was first
polled.

When decoding multiple pages (e.g. across many fragments), this created
a sequential I/O pattern:

```
Page 1: [schedule] -> [poll] -> [submit I/O] -> [wait response] -> [decode]
Page 2:                                          [schedule] -> [poll] -> [submit I/O] -> [wait response] -> [decode]
Page 3:                                                                                   [schedule] -> [poll] -> ...
```

Each page's I/O request could only be submitted after the previous task
started executing. The I/O scheduler had no visibility into upcoming
requests, preventing it from batching or parallelizing them effectively.

### After: Eager I/O submission (pipelined)

Now, `io.submit_request()` is called **before** constructing the
`PageLoadTask`, and the resulting future is passed into
`create_page_load_task`. All I/O requests for all pages are submitted
upfront during the scheduling phase:

```
[schedule all pages] --> submit I/O page 1 -+
                     --> submit I/O page 2 -+
                     --> submit I/O page 3 -+  (all in-flight concurrently)
                     --> submit I/O page N -+
                                            |
                     [poll] -> [await page 1 response] -> [decode]
                     [poll] -> [await page 2 response] -> [decode]
                     [poll] -> [await page 3 response] -> [decode]
```

The object store layer can now see all pending requests at once and
optimize I/O through batching, connection multiplexing, and parallel
fetches. The async tasks only await the already-in-flight I/O futures.
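
The difference in submission order can be modelled with plain threads instead of futures. This is only an analogy under stated assumptions: `submit_request`, `decode`, `load_lazy`, and `load_eager` are hypothetical helpers, a background thread stands in for an in-flight I/O request, and joining the thread stands in for awaiting the response.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

type Log = Arc<Mutex<Vec<String>>>;

/// Hypothetical stand-in for `io.submit_request()`: records the
/// submission and starts the "I/O" on a background thread right away.
fn submit_request(page: usize, log: &Log) -> JoinHandle<Vec<u8>> {
    log.lock().unwrap().push(format!("submit {page}"));
    thread::spawn(move || vec![page as u8; 4])
}

/// Pretend decode step: records the call and returns the byte count.
fn decode(page: usize, bytes: Vec<u8>, log: &Log) -> usize {
    log.lock().unwrap().push(format!("decode {page}"));
    bytes.len()
}

/// Before: each request is submitted only when its task runs.
fn load_lazy(pages: usize, log: &Log) -> usize {
    (0..pages)
        .map(|p| {
            let handle = submit_request(p, log);
            decode(p, handle.join().unwrap(), log)
        })
        .sum()
}

/// After: all requests are submitted during scheduling; tasks only wait.
fn load_eager(pages: usize, log: &Log) -> usize {
    let in_flight: Vec<_> = (0..pages)
        .map(|p| (p, submit_request(p, log)))
        .collect();
    in_flight
        .into_iter()
        .map(|(p, handle)| decode(p, handle.join().unwrap(), log))
        .sum()
}
```

In the lazy variant the log interleaves one submit with one decode per page; in the eager variant every submit is logged before the first decode, which is what lets a scheduler see all pending requests at once.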

## Changes

- `rust/lance-encoding/src/encodings/logical/primitive.rs`:
  - Changed `create_page_load_task` signature to accept
    `BoxFuture<'static, Result<Vec<Bytes>>>` instead of
    `FullZipReadSource` + byte ranges + priority
  - Moved `io.submit_request()` calls to happen eagerly at both call
    sites (`schedule_ranges_with_rep_index` and the non-rep-index path),
    before constructing the page load task

## Performance

Tested with a multi-fragment dataset containing fixed-width columns
(768-dim float32 vectors, 40 fragments, 50 rows/fragment):

| Benchmark | Before (p50) | After (p50) | Speedup |
|---|---|---|---|
| Fixed-width column scan | 3453 ms | 523 ms | **6.6x** |

The improvement comes entirely from I/O pipelining — the decoding logic
itself is unchanged. The effect is most pronounced with many fragments
or pages, where the serialized I/O submission was the dominant
bottleneck.
## Summary
- Add `blob_max_pack_file_bytes` to `WriteParams`, allowing users to
override the default 1 GiB maximum pack (`.blob`) sidecar file size
- Thread the configuration through the full write path: `WriteParams` ->
`WriterGenerator` -> `WriterOptions` -> `BlobPreprocessor` ->
`PackWriter`
- Expose the option in Python (`write_dataset`) and Java
(`WriteParams.Builder`) bindings

## Test plan
- [x] All 37 existing blob tests pass (`cargo test -p lance blob`)
- [x] Clippy clean on `lance` and `lance-jni` crates
- [x] Verify Python binding works end-to-end with
`blob_max_pack_file_bytes` kwarg
- [x] Verify Java binding compiles with `./mvnw compile`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

- Bump `jieba-rs` from 0.8.1 to 0.9.0 to fix the `build-no-lock` CI job
- The `core2` crate v0.4.0 was yanked from crates.io, breaking fresh
dependency resolution (`jieba-rs` → `include-flate` → `libflate` →
`core2`)
- `jieba-rs` 0.9.0 drops the `include-flate`/`libflate`/`core2` chain
entirely, removing 9 transitive dependencies with no API changes

## Test plan

- [x] `cargo check -p lance-index --features tokenizer-jieba` passes
- [x] Verified build succeeds without `Cargo.lock` (simulating the CI
job)
- [ ] CI `build-no-lock` job passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This tightens the repository's environment guidance so language-specific
tasks must follow the documented workflow before reporting missing tools
or dependencies.

For Python work, the docs now make `uv sync --extra tests --extra dev`
and `uv run ...` mandatory, and explicitly call out the common failure
mode where slow `uv sync` is interrupted or global Python is used
instead.
This changes per-base runtime configuration to use exact
`ObjectStoreParams` bindings keyed by `BasePath.path` instead of
per-base storage option overrides. Dataset-level and write-level store
params now act only as fallbacks, while reads, target-base writes, and
external blob resolution all consult the same base-scoped binding model.

This keeps provider-specific runtime state out of the manifest and
follows the direction in discussion lance-format#6307 to keep `BasePath` focused on
identity.
This PR vendors the tokenizer stack Lance actually uses into a new
`rust/lance-tokenizer` crate and rewires FTS and inverted-index code to
depend on it instead of `tantivy` and `lindera-tantivy`. It keeps the
existing document and query tokenization semantics in-tree, renames the
old FTS document adapter module to `document_tokenizer`, and preserves
upstream license headers on vendored code.
…ormat#6517)

## Summary

- Add hand-written AVX2 and AVX-512 VNNI backends for u8 squared L2
distance (`Σ(a-b)²`) in new `l2_u8.rs`
- Add fused single-pass u8 cosine distance kernel in new `cosine_u8.rs`
— computes `dot(a,b)`, `‖a‖²`, `‖b‖²` simultaneously, halving memory
traffic vs the previous 2-3 pass approach
- Wire both into the `L2 for u8` and `Cosine for u8` trait impls
- Add benchmarks comparing scalar vs SIMD for both kernels

### Algorithmic approach (adapted from
[NumKong](https://github.com/ashvardanian/NumKong))

**L2 (AVX2):** Saturating subtraction for `|a-b|`, zero-extend u8→i16,
`VPMADDWD(diff, diff)` to square and accumulate into i32. 32
elements/iter.

**L2 (AVX-512 VNNI):** Same abs-diff approach with `VPDPWSSD` for fused
square-accumulate. 64 elements/iter.

**Cosine (AVX2):** Zero-extend both vectors to i16, triple `VPMADDWD`
per half (a·b, a·a, b·b). 32 elements/iter, single pass.

**Cosine (AVX-512 VNNI):** Same three-accumulator approach with
`VPDPWSSD`. 64 elements/iter.

Both kernels use `OnceLock`-based runtime CPU dispatch, falling back to
portable scalar on non-x86 platforms.
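
For orientation, a portable scalar version of the two kernels might look like the sketch below. The function names are hypothetical and the accumulator widths are chosen for simplicity rather than to mirror the PR; the cosine function assumes the "1 minus cosine similarity" convention and does not guard against all-zero vectors.

```rust
/// Squared L2 distance over u8: sum of (a - b)^2, computed from the
/// absolute difference so nothing goes negative.
fn l2_u8_scalar(a: &[u8], b: &[u8]) -> u64 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x.abs_diff(y) as u64;
            d * d
        })
        .sum()
}

/// Cosine distance in a single pass: dot(a, b), |a|^2 and |b|^2 are
/// accumulated in the same loop, so each input is read once.
/// Assumption: distance is defined as 1 - cosine similarity.
fn cosine_u8_fused(a: &[u8], b: &[u8]) -> f32 {
    let (mut ab, mut aa, mut bb) = (0u64, 0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as u64, y as u64);
        ab += x * y;
        aa += x * x;
        bb += y * y;
    }
    let similarity = ab as f64 / ((aa as f64).sqrt() * (bb as f64).sqrt());
    (1.0 - similarity) as f32
}
```

The SIMD backends described above keep the same three running sums, just spread across vector lanes.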

### Benchmarks

*1M × 1024-dim u8 vectors.*

**x86_64 — AMD Ryzen 5 4500 6-Core (AVX2, no AVX-512)**

| Kernel | Scalar | SIMD | Speedup |
|--------|--------|------|---------|
| L2(u8) | 73.5 ms | 58.2 ms | **1.26x** |
| Cosine(u8) | 122.2 ms | 82.1 ms | **1.49x** |

L2 auto-vectorization baseline was 91.5 ms, so SIMD is 1.57x faster than
that path.

**aarch64 — Apple Silicon M3 Max (no AVX2, scalar fallback)**

| Kernel | Scalar | SIMD (dispatch) |
|--------|--------|-----------------|
| L2(u8) | 26.8 ms | 27.3 ms |
| Cosine(u8) | 90.1 ms | 90.4 ms |

On aarch64 the SIMD path falls through to scalar (no AVX2), so times are
effectively identical, which confirms no regression on non-x86
platforms. AVX-512 VNNI systems (Ice Lake+, Zen 4+) should see larger
gains.

## Test plan

- [x] All 11 new tests pass: SIMD backends verified against scalar
reference across 18 vector sizes (0–4097), boundary values (0/255),
alternating patterns, random seeds
- [x] All 63 existing lance-linalg tests pass (no regressions)
- [x] Clippy clean, fmt clean
- [x] Benchmarked on x86_64 AVX2 (AMD Ryzen 5 4500) — L2 1.26x, Cosine
1.49x faster
- [ ] Verify on AVX-512 VNNI system for additional speedup data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This fixes the release bump configuration after `lance-tokenizer` was
added to the workspace dependencies. `.bumpversion.toml` was missing the
corresponding replacement rule, so version bumps could leave that
internal dependency on the previous version. This is a targeted
config-only fix to keep the release automation updating all workspace
crates consistently.
This fixes the directory namespace CI failure where single-instance
concurrent create/drop operations on `__manifest` could time out with
`TooMuchWriteContention`, especially in the Windows build.

Manifest mutations are now serialized within a single
`ManifestNamespace` instance so concurrent operations stop racing on
stale in-memory snapshots, and inline manifest maintenance now defers
compaction/index merges until the table has accumulated enough
fragments.

Context:
https://github.com/lance-format/lance/actions/runs/24439767878/job/71401857043
Blob columns can be represented either as loaded values or as unloaded
descriptor schemas, but our schema projection logic still treated those
views as incompatible types. This change teaches field projection and
intersection to recognize blob loaded/unloaded pairs as the same logical
column, and adds regression coverage for both the core schema path and
the projection-plan path that previously failed.
## Summary

- Adds `ChopBatchesStream`, a stream wrapper that splits oversized
batches (>1.5x target `batch_size_bytes`) into smaller sub-batches using
zero-copy `RecordBatch::slice`
- Wraps the filtered read output stream with `ChopBatchesStream` when
`batch_size_bytes` is configured via `FileReaderOptions`
- Serves as a safety net when the underlying file reader doesn't
estimate batch sizes accurately enough
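
The splitting rule can be sketched as a function that plans `(offset, len)` row ranges, which a caller could then feed to `RecordBatch::slice`. `chop_plan` is a hypothetical helper, and it assumes bytes are spread evenly across rows; the actual `ChopBatchesStream` may choose piece sizes differently.

```rust
/// Plan row ranges `(offset, len)` for slicing one batch.
/// Assumption: bytes are distributed evenly across rows.
fn chop_plan(num_rows: usize, batch_bytes: usize, target_bytes: usize) -> Vec<(usize, usize)> {
    // Batches up to 1.5x the target (or with nothing to split) pass through.
    if num_rows <= 1 || target_bytes == 0 || batch_bytes * 2 <= target_bytes * 3 {
        return vec![(0, num_rows)];
    }
    // Ceiling divisions: number of pieces, then rows per piece.
    let pieces = ((batch_bytes + target_bytes - 1) / target_bytes).min(num_rows);
    let rows_per_piece = (num_rows + pieces - 1) / pieces;
    (0..num_rows)
        .step_by(rows_per_piece)
        .map(|offset| (offset, rows_per_piece.min(num_rows - offset)))
        .collect()
}
```

For example, a 10-row batch that is three times the target size is planned as three slices of 4, 4, and 2 rows.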

**Stacked on feat/byte-sized-batches-file-reader** — wait for that to
merge first, then rebase this PR.

## Test plan

- [x] Unit tests for `ChopBatchesStream`: splits large batches, passes
small batches through, `wrap_if_needed(None)` is a no-op
- [x] `cargo clippy` clean
- [x] `cargo fmt` clean

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-format#6503)

Add protobuf encode/decode for `ANNIvfSubIndexExec`

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This isolates the `test_memory_leaks` index statistics probe into a
fresh subprocess instead of running it inside the long-lived pytest
worker process. That keeps the test focused on repeated
`index_statistics` calls and avoids false positives from RSS growth left
behind by earlier tests such as the recent batch-chopping coverage added
in `test_dataset.py`.
…at#6352)

This PR improves blob I/O in two complementary ways: `BlobFile`
instances that resolve to the same physical object now share a lazy
`BlobSource` and can opportunistically coalesce concurrent reads before
handing them to Lance's existing scheduler, and datasets now expose a
planned `read_blobs` API for materializing blob payloads directly. It
also adds explicit cursor-preserving range reads for `BlobFile` across
Rust, Python, and Java, with end-to-end Python coverage for the new API
and the edge cases it uncovered.

This keeps the optimization aligned with Lance's existing scheduler
model while giving callers a higher-level path for sequential and
batched blob access.

## Python example

```python
import lance

dataset = lance.dataset("/path/to/dataset")
blobs = dataset.read_blobs(
    "images",
    indices=[0, 4, 8],
    target_request_bytes=8 * 1024 * 1024,
    max_gap_bytes=64 * 1024,
    max_concurrency=4,
    preserve_order=True,
)

for row_address, payload in blobs:
    print(row_address, len(payload))
```
…mat#6540)

## Summary

- Adds `f64x4` and `f64x8` SIMD types to `lance-linalg` with support for
x86_64 (AVX2/AVX-512), aarch64 (NEON), and loongarch64 (LASX)
- Replaces auto-vectorization-dependent f64 distance functions with
explicit SIMD using two-level unrolling (f64x8 + f64x4 + scalar tail)
- Updates norm_l2, dot, L2, and cosine distance for f64
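
The two-level unrolling can be illustrated in portable Rust, with fixed-size arrays standing in for the `f64x8` and `f64x4` types. `dot_f64_unrolled` is a hypothetical name, and this sketch relies on the compiler to vectorize the inner loops rather than using explicit intrinsics as the PR does.

```rust
/// Dot product with a two-level unroll: 8-lane blocks, then at most one
/// 4-lane block, then a scalar tail.
fn dot_f64_unrolled(a: &[f64], b: &[f64]) -> f64 {
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);

    // Level 1: eight independent accumulators (stand-in for f64x8).
    let mut acc8 = [0.0f64; 8];
    let mut i = 0;
    while i + 8 <= n {
        for lane in 0..8 {
            acc8[lane] += a[i + lane] * b[i + lane];
        }
        i += 8;
    }

    // Level 2: at most one block of four (stand-in for f64x4).
    let mut acc4 = [0.0f64; 4];
    if i + 4 <= n {
        for lane in 0..4 {
            acc4[lane] += a[i + lane] * b[i + lane];
        }
        i += 4;
    }

    // Scalar tail: the remaining 0 to 3 elements.
    let tail: f64 = a[i..].iter().zip(&b[i..]).map(|(x, y)| x * y).sum();

    acc8.iter().sum::<f64>() + acc4.iter().sum::<f64>() + tail
}
```

Because the partial sums are combined in a different order than a simple left-to-right loop, results can differ from a naive implementation in the last bits for general inputs.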

## Benchmark Results (Apple M-series, aarch64 NEON)

1M vectors × 1024 dimensions:

| Benchmark | Before | After | Change |
|-----------|--------|-------|--------|
| NormL2(f64, auto-vec) | 117.76 ms | 116.04 ms | ~same |
| NormL2(f64, SIMD) | N/A (TODO) | 119.16 ms | new |
| Dot(f64, auto-vec) | 129.36 ms | 130.23 ms | ~same |
| L2(f64, auto-vec) | 132.53 ms | 135.15 ms | ~same |
| **Cosine(f64, auto-vec)** | **202.52 ms** | **139.23 ms** | **-31.4%** |

The biggest win is **cosine distance**, which previously had an empty
`impl Cosine for f64 {}` falling back to the scalar path. The explicit
SIMD implementation is **31% faster**.

For norm_l2, dot, and L2, LLVM's auto-vectorization with the LANES=8
hint was already producing good code on this platform. The explicit SIMD
ensures consistent performance across compilers and platforms rather
than relying on fragile auto-vectorization hints.

## Test plan
- [x] All 59 lance-linalg tests pass
- [x] Clippy clean (`-D warnings`)
- [x] `cargo fmt` clean
- [ ] CI passes on all platforms

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…at#6506)

## Summary

- Add AVX-512 VNNI and AVX2 backends for unsigned int8 dot product with
runtime CPU feature detection and automatic fallback to scalar
- Replace the unoptimized `Dot<u8>` impl (which had an explicit TODO)
with dispatched SIMD kernel
- All existing callers including SQ distance computation benefit
automatically with zero changes to lance-index

## Details

### New file: `rust/lance-linalg/src/distance/dot_u8.rs`

Three backends selected at runtime via `OnceLock` +
`is_x86_feature_detected!`:

| Backend | Instruction | Elements/iter | CPU |
|---|---|---|---|
| AVX-512 VNNI | `VPDPBUSD` + XOR-0x80 bias trick | 64 | Ice Lake+ / Zen 4+ |
| AVX2 | `VPMADDWD` on zero-extended u16 | 32 | Haswell+ / Zen 1+ |
| Scalar | portable reference | - | any (including ARM) |

### The VNNI bias trick

`VPDPBUSD` expects one unsigned and one signed operand, but SQ vectors
are u8×u8. We XOR one operand with 0x80 to map it to the signed domain,
then correct by adding `128·Σa` at the end. The correction uses
`VPSADBW` which runs on execution port 5 while `VPDPBUSD` runs on port 0
— they execute in parallel every cycle, making the correction
effectively free.
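
The arithmetic behind the bias trick can be checked with a small Python sketch. It models only the integer identity, not the SIMD instructions themselves; the helper names `to_i8` and `dot_u8_via_bias` are hypothetical and do not appear in `dot_u8.rs`.

```python
def to_i8(byte):
    """Reinterpret an 8-bit pattern as a signed two's-complement value."""
    return byte - 256 if byte >= 128 else byte


def dot_u8_via_bias(a, b):
    """u8 x u8 dot product computed through an unsigned x signed product."""
    # Flipping the top bit of b and reading it as signed yields b - 128.
    b_signed = [to_i8(x ^ 0x80) for x in b]
    # This is what an unsigned-by-signed multiply-accumulate would produce.
    biased = sum(x * y for x, y in zip(a, b_signed))
    # Undo the bias: sum(a * (b - 128)) + 128 * sum(a) == sum(a * b).
    return biased + 128 * sum(a)
```

The correction term depends only on `a`, which is why it can be accumulated separately from the main multiply-accumulate.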

### SQ integration (automatic)

`SQDistCalculator::distance()` already calls `dot_distance()` →
`u8::dot()` for Dot distance type. Replacing the `Dot<u8>` body is the
only change needed.

## Benchmarks

### Ryzen 4500 (AVX2, no VNNI)

1M total u8 elements, varying vector dimension. Scalar baseline vs
AVX2-dispatched path:

| Dimension | Scalar | Dispatch (AVX2) | Speedup |
|-----------|--------|-----------------|---------|
| 128 | 51.02 µs | 58.25 µs | 0.88x (dispatch overhead dominates) |
| 256 | 44.96 µs | 38.62 µs | **1.16x** |
| 512 | 42.82 µs | 28.27 µs | **1.51x** |
| 1024 | 41.00 µs | 25.17 µs | **1.63x** |

AVX2 delivers up to 1.63x throughput at dim=1024. At dim=128 the
`OnceLock` dispatch and AVX2 loop setup overhead exceeds the SIMD gains
on short vectors. AVX-512 VNNI (Ice Lake+ / Zen 4+) is expected to show
larger gains with 64 elements/iter.

### Apple M4 (ARM64, scalar fallback)

On ARM64 the dispatch falls back to scalar, so both paths perform
identically (~13 µs at dim=1024). A follow-up ARM NEON `UDOT` path would
bring SIMD gains to Apple Silicon.

### Out of scope (follow-up)
- L2/Cosine u8 SIMD optimization (different kernel: `Σ(a-b)²`)
- Native `VPDPBUUD` (unsigned×unsigned, Sierra Forest+) — too new for
stable Rust
- ARM NEON `UDOT` path
- Precomputed norms for SQ L2/Cosine (requires storage format change)

## Test plan

- [x] Unit tests: random inputs across 18 vector sizes (0-4097),
boundary values (all 0s, all 255s, alternating), one-sided zeros,
all-ones patterns
- [x] Each backend tested independently against scalar reference (with
`#[cfg]` guards for missing CPU features)
- [x] Existing `dot` tests continue to pass (9/9)
- [x] `cargo clippy -p lance-linalg --tests --benches -- -D warnings`
clean
- [x] Benchmark on x86_64 with AVX2: `cargo bench --bench dot -p
lance-linalg -- "Dot\(u8"`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary

- Replaces the external `numkong` dependency with in-tree C kernels for
**bf16 distance computation** (dot product, L2, cosine, norm_l2)
- Follows the existing f16 kernel pattern: C source compiled via
`build.rs` with per-architecture flags, runtime CPU dispatch via
`SIMD_SUPPORT`
- Kernels are only enabled when the CPU supports the required
instructions (NEON on aarch64, AVX2/AVX-512 on x86_64, LSX/LASX on
loongarch64), with scalar fallback otherwise
- Gated behind the existing `fp16kernels` feature flag

## Benchmark Results

Tested on two platforms with 1M x 1024-dim vectors:

### Apple Silicon (M-series, NEON)

| Benchmark | Before (scalar) | After (C kernel) | Change |
|-----------|-----------------|-------------------|--------|
| **Dot(bf16)** | 144 ms | 55 ms | **2.6x faster** |
| **NormL2(bf16)** | 90 ms | 36 ms | **2.5x faster** |

### AMD Ryzen 5 4500 (Zen 2, AVX2)

| Benchmark | Before (scalar) | After (C kernel) | Change |
|-----------|-----------------|-------------------|--------|
| **Dot(bf16)** | 578 ms | 363 ms | **1.6x faster** (−37%) |
| **NormL2(bf16)** | 365 ms | 207 ms | **1.8x faster** (−43%) |

### Why the approach works

BF16-to-f32 conversion is a simple left-shift by 16 bits. The C kernels
compiled with architecture-specific flags (`-march=haswell`,
`-mtune=apple-m1`, etc.) plus `-ffast-math` and vectorization pragmas
give the compiler more freedom to emit tight SIMD code than LLVM gets
from the Rust scalar loops. ARM benefits more because the baseline Rust
auto-vectorization was weaker there.
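
As a minimal illustration of the left-shift conversion, the following Python sketch widens a bf16 bit pattern to f32 and uses it in a dot product. It is not the C kernel from `bf16.c`; the function names are hypothetical and the inputs are assumed to be raw 16-bit patterns.

```python
import struct


def bf16_bits_to_f32(bits):
    """Widen a 16-bit bfloat16 pattern to the float32 value it encodes."""
    # bf16 is the upper half of an IEEE-754 float32, so shifting left by
    # 16 bits and reinterpreting the result as float32 is exact.
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def dot_bf16(a_bits, b_bits):
    """Dot product over two sequences of bf16 bit patterns."""
    return sum(bf16_bits_to_f32(x) * bf16_bits_to_f32(y)
               for x, y in zip(a_bits, b_bits))
```

For instance, `0x3F80` is the bf16 pattern for 1.0 and `0xC000` is the pattern for -2.0.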

## Files Changed

- **New**: `rust/lance-linalg/src/simd/bf16.c` — C kernels for dot, L2,
cosine, norm_l2
- `rust/lance-linalg/build.rs` — compile bf16.c for each architecture
- `rust/lance-linalg/src/distance/{dot,l2,cosine,norm_l2}.rs` — runtime
SIMD dispatch for bf16
- `rust/lance-linalg/Cargo.toml` — removed `numkong` dependency and
feature
- `rust/lance-linalg/benches/{dot,l2,cosine}.rs` — removed numkong
benchmark sections
- **Deleted**: `scripts/bench_numkong.sh`

## Test plan

- [x] `cargo test -p lance-linalg --features fp16kernels` — all bf16
tests pass (kernel path)
- [x] `cargo test -p lance-linalg` — all bf16 tests pass (scalar
fallback)
- [x] `cargo clippy -p lance-linalg --features fp16kernels --tests
--benches -- -D warnings` — clean
- [x] Benchmarked on Apple Silicon (ARM NEON)
- [x] Benchmarked on AMD Ryzen 5 4500 (x86_64 AVX2)
- To reproduce:
  ```bash
  git checkout HEAD~1
  TARGET_TIME=3 cargo bench -p lance-linalg --features fp16kernels --bench dot -- --save-baseline before "bf16"
  TARGET_TIME=3 cargo bench -p lance-linalg --features fp16kernels --bench norm_l2 -- --save-baseline before "bf16"
  git checkout -
  TARGET_TIME=3 cargo bench -p lance-linalg --features fp16kernels --bench dot -- --baseline before "bf16"
  TARGET_TIME=3 cargo bench -p lance-linalg --features fp16kernels --bench norm_l2 -- --baseline before "bf16"
  ```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…a object store calls (lance-format#6507)

## Summary

- Adds `dir_listing_to_manifest_migration_enabled` flag (default:
`false`) to `DirectoryNamespaceBuilder` and `DirectoryNamespace`
- When `false` and both `manifest_enabled` and `dir_listing_enabled` are
`true`, root-level table operations (`table_exists`, `describe_table`,
`list_tables`) skip the manifest check and use directory listing
directly, avoiding extra object store listing calls
- When `true`, preserves the existing hybrid behavior of checking
manifest first then falling back to directory listing
- Includes a test with a counting object store wrapper verifying only a
single `list_with_delimiter` call is made for root-level table
operations without migration mode
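
The flag's effect on root-level lookups can be summarized with a small decision sketch. This is not the `DirectoryNamespace` implementation: the function `root_table_lookup_order` and its string labels are hypothetical, and the branches where only one of the two modes is enabled are assumptions not described above.

```python
def root_table_lookup_order(manifest_enabled, dir_listing_enabled,
                            migration_enabled=False):
    """Return the sources consulted, in order, for a root-level table op."""
    if manifest_enabled and dir_listing_enabled:
        if migration_enabled:
            # Hybrid behavior: check the manifest, then fall back to listing.
            return ["manifest", "dir_listing"]
        # Default: skip the manifest check and list the directory directly.
        return ["dir_listing"]
    # Assumption: with a single mode enabled, only that source is used.
    if manifest_enabled:
        return ["manifest"]
    if dir_listing_enabled:
        return ["dir_listing"]
    return []
```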

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…6562)

## Summary

- The `FairSpillPool` divides memory evenly across spillable consumers.
With up to 8 partitions, each sort consumer was limited to ~12.5MB from
a flat 100MB pool, causing merge_insert operations with large payloads
to fail with "not enough memory to continue external sort" at very small
batch sizes (e.g. 5 rows with 1MB payloads).
- Scale the default pool size to 100MB **per partition** so each
consumer gets a reasonable allocation. Explicit `LANCE_MEM_POOL_SIZE` or
`mem_pool_size` settings are respected as-is.
- This is a partial fix — very large batches can still exhaust the
per-partition budget. A more complete fix may involve revisiting the
pool type or spilling behavior for merge_insert.
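
The numbers above follow from simple division, sketched here in Python. This models only the even split described in the summary, not DataFusion's actual accounting (it ignores any memory held by non-spillable consumers); the function names are hypothetical.

```python
MB = 1024 * 1024


def per_consumer_share(pool_bytes, spillable_consumers):
    """Even split of a pool across spillable consumers (simplified model)."""
    return pool_bytes / spillable_consumers


def scaled_default_pool_bytes(partitions, per_partition_bytes=100 * MB):
    """Default pool size after this change: 100MB for each partition."""
    return partitions * per_partition_bytes


# Before: a flat 100MB pool shared by 8 sort consumers.
before = per_consumer_share(100 * MB, 8)
# After: the default pool grows with the partition count.
after = per_consumer_share(scaled_default_pool_bytes(8), 8)
```

Under this model `before` is 12.5MB per consumer and `after` is 100MB per consumer, matching the figures in the summary.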

## Test plan

- [x] Added unit test `test_mem_pool_size_scales_with_partitions`
verifying pool size scales correctly
- [x] Verified with a Python repro script that merge_insert with
1MB-per-row payloads no longer fails at 5 rows (now succeeds up to ~50
rows)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jackye1995 jackye1995 changed the title fix(namespace): align RestNamespace error handling with REST spec fix(namespace): align error handling with namespace spec Apr 19, 2026
LuciferYang and others added 2 commits April 18, 2026 23:19
…e-format#6570)

## Summary

The `linux-build` CI job installs unpinned `nightly`, which broke after
2026-04-17 because `ethnum 1.5.2` uses `unsafe { mem::transmute(()) }`
to create `TryFromIntError`. Newer nightly builds reject this with
`error[E0512]: cannot transmute between types of different sizes`.

Dependency chain: `lance-arrow` → `jsonb 0.5.6` → `ethnum 1.5.2`

This PR pins `nightly-2026-04-16` (last known-good date) in both the
toolchain install step and the `cargo +nightly` invocation.

## Root Cause

`ethnum-1.5.2/src/error.rs:16`:
```rust
pub const fn tfie() -> TryFromIntError {
    unsafe { mem::transmute(()) }  // () is 0 bits, TryFromIntError is 8 bits
}
```

Rust nightly `e9e32aca5` (2026-04-17) tightened `transmute` checks,
making this a hard error.

## Follow-up

- Upstream fix needed in
[`nlordell/ethnum-rs`](https://github.com/nlordell/ethnum-rs) to replace
the transmute hack
- Once `ethnum` publishes a fix and `jsonb` picks it up, the pin can be
removed

## Test plan

- [x] `linux-build` job passes with pinned nightly
Labels

bug Something isn't working
