Add experimental columnar indexing api#15990

Open
Tim-Brooks wants to merge 26 commits into apache:main from Tim-Brooks:parent_field_to_indexing_chain+columns_simpler

Conversation

@Tim-Brooks
Contributor

This commit adds an experimental columnar indexing api to the
IndexWriter. It allows the user to provide Long, Binary, and Vector
columns to the indexing chain to significantly reduce the per field
overhead when indexing batches of documents.
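For readers unfamiliar with the shape of such an API, here is a minimal self-contained sketch of the core idea: one cursor per field per batch, drained value-by-value, instead of one `Field` object per value per document. All names below (`LongColumnModel`, `drain`, etc.) are illustrative stand-ins, not the PR's actual signatures.

```java
// Hypothetical model of a long-valued column in a batch. This is a sketch of
// the concept only; the PR's real LongColumn API may differ.
abstract class LongColumnModel {
  public abstract int docCount();  // rows in this batch
  public abstract long nextLong(); // next value, in docID order
}

public class ColumnSketch {
  // Wrap a plain long[] as a column (stand-in for the batch source).
  public static LongColumnModel fromArray(long[] values) {
    return new LongColumnModel() {
      int pos = 0;
      public int docCount() { return values.length; }
      public long nextLong() { return values[pos++]; }
    };
  }

  // A consumer (standing in for the indexing chain) drains the column once
  // per batch, rather than validating and dispatching per document field.
  public static long drain(LongColumnModel col) {
    long sum = 0;
    for (int i = 0; i < col.docCount(); i++) sum += col.nextLong();
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(drain(fromArray(new long[] {1, 2, 3, 4}))); // prints 10
  }
}
```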

@Tim-Brooks
Contributor Author

This is not ready. It is a POC to provide an example to the Lucene mailing list discussion of adding a columnar api. It would need more tests and cursor API refinement before moving forward.

* <p>The default implementation calls {@link #nextLong()} in a loop. Override to provide a more
* efficient bulk fill (for example a {@link System#arraycopy} from a backing array).
*/
public void fill(long[] dst, int offset, int length) {
Contributor Author


This would essentially be a fast path users could optionally implement if they wanted to optimize bulk adds from whatever their backing source is into the doc values writer. It would probably also make sense to add this for points support.

It makes much less sense for sorted set doc values, etc., where the value widths are variable.

This obviously isn't a requirement, but was helpful when I was prototyping as it made a considerable difference in the low-level performance.
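As a concrete illustration of that fast path, a self-contained sketch (the class names mirror the snippet above but are illustrative stand-ins, not the PR's code):

```java
// Sketch of the fill() fast path: the base class falls back to a nextLong()
// loop, while an array-backed column overrides fill() with a single
// System.arraycopy from its backing array.
abstract class BulkLongColumn {
  public abstract long nextLong();

  // Default implementation: per-value loop calling nextLong().
  public void fill(long[] dst, int offset, int length) {
    for (int i = 0; i < length; i++) dst[offset + i] = nextLong();
  }
}

public class FillSketch {
  public static class ArrayLongColumn extends BulkLongColumn {
    private final long[] values;
    private int pos;

    public ArrayLongColumn(long[] values) { this.values = values; }

    @Override public long nextLong() { return values[pos++]; }

    // Fast path: bulk copy instead of length individual nextLong() calls.
    @Override public void fill(long[] dst, int offset, int length) {
      System.arraycopy(values, pos, dst, offset, length);
      pos += length;
    }
  }

  public static void main(String[] args) {
    long[] dst = new long[4];
    new ArrayLongColumn(new long[] {7, 8, 9, 10}).fill(dst, 0, 4);
    System.out.println(java.util.Arrays.toString(dst)); // prints [7, 8, 9, 10]
  }
}
```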

@github-actions Bot added this to the 10.5.0 milestone Apr 28, 2026
@msfroh
Contributor

msfroh commented Apr 28, 2026

Conceptually, I'm really excited about this.

Some colleagues and I have been working on ingesting data into Lucene from Apache Arrow format. IMO, this would make that considerably easier (and ideally more efficient -- imagine if the IndexWriter could read an Arrow RecordBatch without copying).

I'll try to take a look in the coming days. This sounds cool!

@Tim-Brooks
Contributor Author

IMO, this would make that considerably easier (and ideally more efficient

Yes, ideally this would be designed to work nicely with other columnar formats. I don't think it would be zero-copy, as there would still be a copy from the source format into Lucene's buffers. But ideally there would be just one copy, with optimized paths for dense batches that can be copied in bulk. It would be up to the user to implement the bytes -> longs (endianness, unpacking, etc.) step for doc values, and it would have to sit on top of the sort-order encoding for points in binary columns. I haven't really gone too far with points in this PR.
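That user-side bytes -> longs step might look roughly like this (a sketch assuming a little-endian source buffer, as Arrow-style formats use; `decodeLongsLE` is a hypothetical helper, not part of this PR):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of the user-supplied decode step from a columnar byte buffer
// (e.g. an Arrow-style little-endian long vector) into the long[] values
// the indexing chain would consume. Handles endianness explicitly rather
// than shifting bytes per value.
public class DecodeSketch {
  public static long[] decodeLongsLE(byte[] src) {
    long[] out = new long[src.length / Long.BYTES];
    ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN).asLongBuffer().get(out);
    return out;
  }

  public static void main(String[] args) {
    byte[] src = new byte[16];
    ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN).putLong(0, 42L).putLong(8, -1L);
    long[] decoded = decodeLongsLE(src);
    System.out.println(decoded[0] + " " + decoded[1]); // prints 42 -1
  }
}
```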

};
}

private static class ArrayLongColumn extends LongColumn {
Contributor


Do you plan to make ArrayLongColumn and similar classes public for easier use?

Contributor Author


I had sent an email to the Lucene mailing list about this. I had mentioned that at some point Lucene might want to add ergonomic builders (similar to IntField, DoubleField, etc) vs just low level apis (similar to how users can directly override Field to encode points / dv building).

I didn't really plan to make them public in this PR, as I was worried about the surface area and/or quickly adding implementations to APIs which are still in flux.

But I can certainly add them if there is specific interest.

@mayya-sharipova
Contributor

Great idea! And the implementation is intuitive and relatively simple!

Contributor

@msfroh left a comment


I really like this!

I see some nice similarity with e.g. Arrow columns / record-batches (which makes sense -- I imagine lots of columnar APIs will look similar).

* @param <T> the vector array type, either {@code float[]} or {@code byte[]}
* @lucene.experimental
*/
public abstract class VectorTupleCursor<T> {
Contributor


Does this necessarily need to be a dedicated vector cursor?

Or could this just be ObjectTupleCursor<T>? Just from an interface standpoint, T can be any non-primitive type.

By the same token, BinaryTupleCursor is pretty much the same thing with T as BytesRef.

Maybe these TupleCursor classes don't need to be directly connected to the associated Column classes. Instead, we could do specialized primitive TupleCursors and a single generic ObjectTupleCursor<T>?
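What this suggestion might look like in code (an illustrative sketch; `ObjectTupleCursor` is a name proposed in this review thread, not an existing Lucene class):

```java
// Sketch of the suggestion: a single generic cursor for any non-primitive
// payload, so float[] vectors and byte[]-like binary values share one shape,
// alongside specialized cursors for primitives.
abstract class ObjectTupleCursor<T> {
  public abstract boolean advance(); // move to the next (docID, value) tuple
  public abstract int docID();
  public abstract T value();
}

public class CursorSketch {
  // Parallel-array-backed cursor (stand-in for a real batch source).
  public static <T> ObjectTupleCursor<T> fromArrays(int[] docIDs, T[] values) {
    return new ObjectTupleCursor<T>() {
      int pos = -1;
      public boolean advance() { return ++pos < docIDs.length; }
      public int docID() { return docIDs[pos]; }
      public T value() { return values[pos]; }
    };
  }

  public static void main(String[] args) {
    // The same cursor type serves vectors (T = float[]) or binary (T = byte[]).
    ObjectTupleCursor<float[]> c =
        fromArrays(new int[] {0, 2}, new float[][] {{1f, 2f}, {3f, 4f}});
    while (c.advance()) {
      System.out.println(c.docID() + " -> " + c.value().length);
    }
  }
}
```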

Contributor Author


I'll take a look at some changes in this area in a day or two.

Comment on lines +713 to +721
ColumnValidation.validateColumnHasIndexingFeature(fieldName, fieldType);

if (column instanceof BinaryColumn bc) {
ColumnValidation.validateBinaryColumn(bc, fieldType);
} else if (column instanceof LongColumn lc) {
ColumnValidation.validateLongColumn(lc, fieldType);
} else if (column instanceof VectorColumn<?> vc) {
ColumnValidation.validateVectorColumn(vc, fieldType);
}
Contributor


IMO, not worth addressing in this PR, but I wonder if this repeated validation will eventually take a non-trivial amount of compute, especially for a large schema. We can probably define a "schema" class (I'm thinking of Arrow's VectorSchemaRoot), which can be "frozen" after validation. Then each batch can reference the same schema and we can skip validation on subsequent batches.

Just a thought -- for now I think the per-batch validation should be fine.
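The "frozen schema" idea could be sketched like so (names such as `ColumnSchema` are hypothetical, loosely modeled on Arrow's VectorSchemaRoot; Lucene has no such class):

```java
import java.util.Map;

// Sketch of amortizing validation: validate column names/types once at
// construction, freeze the result, and let every batch reference the frozen
// schema with only a cheap shape check. ColumnSchema is a hypothetical name
// from this review discussion.
public class SchemaSketch {
  public static final class ColumnSchema {
    private final Map<String, Class<?>> columns;

    public ColumnSchema(Map<String, Class<?>> columns) {
      // The expensive per-field validation would run once, here.
      for (Map.Entry<String, Class<?>> e : columns.entrySet()) {
        if (e.getKey().isEmpty()) throw new IllegalArgumentException("empty field name");
      }
      this.columns = Map.copyOf(columns); // frozen: immutable afterwards
    }

    // Per-batch check degrades to a cheap equality comparison.
    public void checkBatch(Map<String, Class<?>> batchColumns) {
      if (!columns.equals(batchColumns)) {
        throw new IllegalArgumentException("schema mismatch");
      }
    }
  }

  public static void main(String[] args) {
    Map<String, Class<?>> cols = Map.of("price", long.class, "title", byte[].class);
    ColumnSchema schema = new ColumnSchema(cols); // validated once
    for (int batch = 0; batch < 1000; batch++) {
      schema.checkBatch(cols); // no full re-validation per batch
    }
    System.out.println("ok"); // prints ok
  }
}
```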

Contributor Author


I do agree that this is not something I will address here as I think embedded schemas into Lucene would probably be a discussion of its own.

My use case does demonstrate significant overhead for field validation using the document API: I was indexing something like 150 fields per document, and the time spent in processDocument was greater than the actual indexing.

Moving towards batches of 1000 with 150 columns made all the validation disappear from any sample-able threads. That doesn't mean it is not there, though.

Contributor Author


This was related: #15886.

* #processField}) and the column-batch row pass ({@link #processRowColumns}). Returns {@code
* true} if this is a unique indexed field with postings.
*/
private boolean invertAndStore(int docID, IndexableField field, PerField pf) throws IOException {
Contributor


While stored fields definitely make sense as a row-oriented thing, I'm wondering if there's anything clever we can do on the "invert" side if we want to process a bunch of values for the same field.

Since the resulting data structures are field-oriented, it feels to me like there should be something we can do, but I haven't had enough coffee yet to have a clever idea.

Contributor Author


I wrote on the mailing list that DOC + no norms can be processed in a columnar fashion.

In terms of optimizations, once the API has landed I plan to propose a long (or int) column with an associated array dictionary. Lucene would then only index the dictionary once per column batch.

This would be targeting inverted index and sorted set DV optimizations for low cardinality use cases. Without exposing any Lucene hashing or equality internals.

But I have not actually gone through the steps of implementing something like this yet.
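A rough sketch of what such a dictionary-backed column could look like (illustrative names and layout; not part of this PR):

```java
// Sketch of the low-cardinality idea: a column carries int ordinals plus a
// small dictionary of distinct values. The consumer would index the
// dictionary once per batch and then process cheap ordinals, without Lucene
// exposing its hashing or equality internals to the user.
public class DictColumnSketch {
  public static final class DictLongColumn {
    private final long[] dictionary; // distinct values, indexed once per batch
    private final int[] ords;        // per-row ordinal into the dictionary

    public DictLongColumn(long[] dictionary, int[] ords) {
      this.dictionary = dictionary;
      this.ords = ords;
    }

    public int rowCount() { return ords.length; }

    public long valueAt(int row) { return dictionary[ords[row]]; }
  }

  public static void main(String[] args) {
    // 6 rows but only 2 distinct values: the batch ships 2 longs + 6 ints.
    DictLongColumn col =
        new DictLongColumn(new long[] {100L, 200L}, new int[] {0, 0, 1, 0, 1, 1});
    long sum = 0;
    for (int row = 0; row < col.rowCount(); row++) sum += col.valueAt(row);
    System.out.println(sum); // prints 900
  }
}
```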

@Tim-Brooks
Contributor Author

This is now, imo, ready for review and (hopefully) eventual merge.
