Add experimental columnar indexing api#15990

Open
Tim-Brooks wants to merge 26 commits into apache:main from Tim-Brooks:parent_field_to_indexing_chain+columns_simpler

Conversation

@Tim-Brooks
Contributor

This commit adds an experimental columnar indexing api to the
IndexWriter. It allows the user to provide Long, Binary, and Vector
columns to the indexing chain to significantly reduce the per field
overhead when indexing batches of documents.
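For readers unfamiliar with the shape of such an API, here is a minimal self-contained sketch of the core idea: one cursor per field per batch, drained value-by-value, instead of one `Field` object per value per document. All names below (`LongColumnModel`, `drain`, etc.) are illustrative stand-ins, not the PR's actual signatures.

```java
// Hypothetical model of a long-valued column in a batch. This is a sketch of
// the concept only; the PR's real LongColumn API may differ.
abstract class LongColumnModel {
  public abstract int docCount();  // rows in this batch
  public abstract long nextLong(); // next value, in docID order
}

public class ColumnSketch {
  // Wrap a plain long[] as a column (stand-in for the batch source).
  public static LongColumnModel fromArray(long[] values) {
    return new LongColumnModel() {
      int pos = 0;
      public int docCount() { return values.length; }
      public long nextLong() { return values[pos++]; }
    };
  }

  // A consumer (standing in for the indexing chain) drains the column once
  // per batch, rather than validating and dispatching per document field.
  public static long drain(LongColumnModel col) {
    long sum = 0;
    for (int i = 0; i < col.docCount(); i++) sum += col.nextLong();
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(drain(fromArray(new long[] {1, 2, 3, 4}))); // prints 10
  }
}
```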

@Tim-Brooks
Contributor Author

This is not ready. It is a POC to provide an example to the Lucene mailing list discussion of adding a columnar api. It would need more tests and cursor API refinement before moving forward.

* <p>The default implementation calls {@link #nextLong()} in a loop. Override to provide a more
* efficient bulk fill (for example a {@link System#arraycopy} from a backing array).
*/
public void fill(long[] dst, int offset, int length) {
Contributor Author


This would essentially be a fast path users could optionally implement if they wanted to optimize bulk adds from whatever their backing source is into the doc values writer. It would probably also make sense to add this for points support.

It makes much less sense for sorted set doc values, etc., where the value widths are variable.

This obviously isn't a requirement, but was helpful when I was prototyping as it made a considerable difference in the low-level performance.
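As a concrete illustration of that fast path, a self-contained sketch (the class names mirror the snippet above but are illustrative stand-ins, not the PR's code):

```java
// Sketch of the fill() fast path: the base class falls back to a nextLong()
// loop, while an array-backed column overrides fill() with a single
// System.arraycopy from its backing array.
abstract class BulkLongColumn {
  public abstract long nextLong();

  // Default implementation: per-value loop calling nextLong().
  public void fill(long[] dst, int offset, int length) {
    for (int i = 0; i < length; i++) dst[offset + i] = nextLong();
  }
}

public class FillSketch {
  public static class ArrayLongColumn extends BulkLongColumn {
    private final long[] values;
    private int pos;

    public ArrayLongColumn(long[] values) { this.values = values; }

    @Override public long nextLong() { return values[pos++]; }

    // Fast path: bulk copy instead of length individual nextLong() calls.
    @Override public void fill(long[] dst, int offset, int length) {
      System.arraycopy(values, pos, dst, offset, length);
      pos += length;
    }
  }

  public static void main(String[] args) {
    long[] dst = new long[4];
    new ArrayLongColumn(new long[] {7, 8, 9, 10}).fill(dst, 0, 4);
    System.out.println(java.util.Arrays.toString(dst)); // prints [7, 8, 9, 10]
  }
}
```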

@github-actions Bot added this to the 10.5.0 milestone Apr 28, 2026
@msfroh
Contributor

msfroh commented Apr 28, 2026

Conceptually, I'm really excited about this.

Some colleagues and I have been working on ingesting data into Lucene from Apache Arrow format. IMO, this would make that considerably easier (and ideally more efficient -- imagine if the IndexWriter could read an Arrow RecordBatch without copying).

I'll try to take a look in the coming days. This sounds cool!

@Tim-Brooks
Contributor Author

IMO, this would make that considerably easier (and ideally more efficient

Yes, ideally this would be designed to work nicely with other columnar formats. I don't think it would be zero-copy, as there would still be a copy from the source format into Lucene's buffers. But ideally there would be just one copy, with optimized paths for dense batches that can be copied in bulk. It would be up to the user to implement the bytes -> longs (endianness, unpacking, etc.) step for doc values, and it would have to sit on top of the sort-order encoding for points in binary columns. I haven't really gone too far with points in this PR.
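That user-side bytes -> longs step might look roughly like this (a sketch assuming a little-endian source buffer, as Arrow-style formats use; `decodeLongsLE` is a hypothetical helper, not part of this PR):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the user-supplied decode step from a columnar byte buffer
// (e.g. an Arrow-style little-endian long vector) into the long[] values
// the indexing chain would consume. Handles endianness explicitly rather
// than shifting bytes per value.
public class DecodeSketch {
  public static long[] decodeLongsLE(byte[] src) {
    long[] out = new long[src.length / Long.BYTES];
    ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN).asLongBuffer().get(out);
    return out;
  }

  public static void main(String[] args) {
    byte[] src = new byte[16];
    ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN).putLong(0, 42L).putLong(8, -1L);
    long[] decoded = decodeLongsLE(src);
    System.out.println(decoded[0] + " " + decoded[1]); // prints 42 -1
  }
}
```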

};
}

private static class ArrayLongColumn extends LongColumn {
Contributor


Do you plan to make ArrayLongColumn and similar classes public for easier use?

Contributor Author


I had sent an email to the Lucene mailing list about this. I had mentioned that at some point Lucene might want to add ergonomic builders (similar to IntField, DoubleField, etc) vs just low level apis (similar to how users can directly override Field to encode points / dv building).

I didn't really plan to make them public in this PR, as I was worried about the surface area and/or quickly adding implementations to APIs which are still in flux.

But I can certainly add them if there is specific interest.

@mayya-sharipova
Contributor

Great idea! And the implementation is intuitive and relatively simple!

Contributor

@msfroh left a comment


I really like this!

I see some nice similarity with e.g. Arrow columns / record-batches (which makes sense -- I imagine lots of columnar APIs will look similar).

* @param <T> the vector array type, either {@code float[]} or {@code byte[]}
* @lucene.experimental
*/
public abstract class VectorTupleCursor<T> {
Contributor


Does this necessarily need to be a dedicated vector cursor?

Or could this just be ObjectTupleCursor<T>? Just from an interface standpoint, T can be any non-primitive type.

By the same token, BinaryTupleCursor is pretty much the same thing with T as BytesRef.

Maybe these TupleCursor classes don't need to be directly connected to the associated Column classes. Instead, we could do specialized primitive TupleCursors and a single generic ObjectTupleCursor<T>?
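What this suggestion might look like in code (an illustrative sketch; `ObjectTupleCursor` is a name proposed in this review thread, not an existing Lucene class):

```java
// Sketch of the suggestion: a single generic cursor for any non-primitive
// payload, so float[] vectors and byte[]-like binary values share one shape,
// alongside specialized cursors for primitives.
abstract class ObjectTupleCursor<T> {
  public abstract boolean advance(); // move to the next (docID, value) tuple
  public abstract int docID();
  public abstract T value();
}

public class CursorSketch {
  // Parallel-array-backed cursor (stand-in for a real batch source).
  public static <T> ObjectTupleCursor<T> fromArrays(int[] docIDs, T[] values) {
    return new ObjectTupleCursor<T>() {
      int pos = -1;
      public boolean advance() { return ++pos < docIDs.length; }
      public int docID() { return docIDs[pos]; }
      public T value() { return values[pos]; }
    };
  }

  public static void main(String[] args) {
    // The same cursor type serves vectors (T = float[]) or binary (T = byte[]).
    ObjectTupleCursor<float[]> c =
        fromArrays(new int[] {0, 2}, new float[][] {{1f, 2f}, {3f, 4f}});
    while (c.advance()) {
      System.out.println(c.docID() + " -> " + c.value().length);
    }
  }
}
```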

Contributor Author


I'll take a look at some changes in this area in a day or two.

Comment on lines +713 to +721
ColumnValidation.validateColumnHasIndexingFeature(fieldName, fieldType);

if (column instanceof BinaryColumn bc) {
ColumnValidation.validateBinaryColumn(bc, fieldType);
} else if (column instanceof LongColumn lc) {
ColumnValidation.validateLongColumn(lc, fieldType);
} else if (column instanceof VectorColumn<?> vc) {
ColumnValidation.validateVectorColumn(vc, fieldType);
}
Contributor


IMO, not worth addressing in this PR, but I wonder if this repeated validation will eventually take a non-trivial amount of compute, especially for a large schema. We can probably define a "schema" class (I'm thinking of Arrow's VectorSchemaRoot), which can be "frozen" after validation. Then each batch can reference the same schema and we can skip validation on subsequent batches.

Just a thought -- for now I think the per-batch validation should be fine.
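The "frozen schema" idea could be sketched like so (names such as `ColumnSchema` are hypothetical, loosely modeled on Arrow's VectorSchemaRoot; Lucene has no such class):

```java
import java.util.Map;

// Sketch of amortizing validation: validate column names/types once at
// construction, freeze the result, and let every batch reference the frozen
// schema with only a cheap shape check. ColumnSchema is a hypothetical name
// from this review discussion.
public class SchemaSketch {
  public static final class ColumnSchema {
    private final Map<String, Class<?>> columns;

    public ColumnSchema(Map<String, Class<?>> columns) {
      // The expensive per-field validation would run once, here.
      for (Map.Entry<String, Class<?>> e : columns.entrySet()) {
        if (e.getKey().isEmpty()) throw new IllegalArgumentException("empty field name");
      }
      this.columns = Map.copyOf(columns); // frozen: immutable afterwards
    }

    // Per-batch check degrades to a cheap equality comparison.
    public void checkBatch(Map<String, Class<?>> batchColumns) {
      if (!columns.equals(batchColumns)) {
        throw new IllegalArgumentException("schema mismatch");
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Class<?>> cols = Map.of("price", long.class, "title", byte[].class);
    ColumnSchema schema = new ColumnSchema(cols); // validated once
    for (int batch = 0; batch < 1000; batch++) {
      schema.checkBatch(cols); // no full re-validation per batch
    }
    System.out.println("ok"); // prints ok
  }
}
```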

Contributor Author


I do agree that this is not something I will address here as I think embedded schemas into Lucene would probably be a discussion of its own.

My use case does demonstrate significant overhead for field validation using the document API: I was indexing something like 150 fields per document, and the time spent in processDocument was greater than the actual indexing.

Moving towards batches of 1000 with 150 columns made all the validation disappear from any sample-able threads. That doesn't mean it is not there, though.

Contributor Author


This was related: #15886.

* #processField}) and the column-batch row pass ({@link #processRowColumns}). Returns {@code
* true} if this is a unique indexed field with postings.
*/
private boolean invertAndStore(int docID, IndexableField field, PerField pf) throws IOException {
Contributor


While stored fields definitely make sense as a row-oriented thing, I'm wondering if there's anything clever we can do on the "invert" side if we want to process a bunch of values for the same field.

Since the resulting data structures are field-oriented, it feels to me like there should be something we can do, but I haven't had enough coffee yet to have a clever idea.

Contributor Author


I wrote on the mailing list that DOC + no norms can be processed in a columnar fashion.

In terms of optimizations, once the API has landed I plan to propose a long (or int) column with an associated array dictionary. Lucene would then only index the dictionary once per column batch.

This would be targeting inverted index and sorted set DV optimizations for low cardinality use cases. Without exposing any Lucene hashing or equality internals.

But I have not actually gone through the steps of implementing something like this yet.
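A rough sketch of what such a dictionary-backed column could look like (illustrative names and layout; not part of this PR):

```java
// Sketch of the low-cardinality idea: a column carries int ordinals plus a
// small dictionary of distinct values. The consumer would index the
// dictionary once per batch and then process cheap ordinals, without Lucene
// exposing its hashing or equality internals to the user.
public class DictColumnSketch {
  public static final class DictLongColumn {
    private final long[] dictionary; // distinct values, indexed once per batch
    private final int[] ords;        // per-row ordinal into the dictionary

    public DictLongColumn(long[] dictionary, int[] ords) {
      this.dictionary = dictionary;
      this.ords = ords;
    }

    public int rowCount() { return ords.length; }

    public long valueAt(int row) { return dictionary[ords[row]]; }
  }

  public static void main(String[] args) {
    // 6 rows but only 2 distinct values: the batch ships 2 longs + 6 ints.
    DictLongColumn col =
        new DictLongColumn(new long[] {100L, 200L}, new int[] {0, 0, 1, 0, 1, 1});
    long sum = 0;
    for (int row = 0; row < col.rowCount(); row++) sum += col.valueAt(row);
    System.out.println(sum); // prints 900
  }
}
```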

@Tim-Brooks
Contributor Author

This is now, imo, ready for review and (hopefully) eventual merge.
