Add experimental columnar indexing api #15990
Tim-Brooks wants to merge 26 commits into apache:main
Conversation
This is not ready. It is a POC to provide an example for the Lucene mailing list discussion about adding a columnar API. It would need more tests and cursor API refinement before moving forward.
```java
 * <p>The default implementation calls {@link #nextLong()} in a loop. Override to provide a more
 * efficient bulk fill (for example a {@link System#arraycopy} from a backing array).
 */
public void fill(long[] dst, int offset, int length) {
```
This would essentially be a fast path that users could optionally implement if they wanted to optimize bulk adds from whatever their binary backing source is to the doc values writer. It would probably also make sense to add for points support.
It makes much less sense for sorted set doc values, etc., where the value widths are variable.
This obviously isn't a requirement, but it was helpful when I was prototyping, as it made a considerable difference in the low-level performance.
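For illustration, a minimal sketch of that fast path, assuming a column backed by a plain `long[]`. The `LongColumn`, `nextLong()`, and `fill()` names come from the diff above; everything else is hypothetical:

```java
// Hypothetical column over a plain long[]; fill() is the optional fast path.
final class BackingArrayLongColumn extends LongColumn {
  private final long[] values;
  private int pos;

  BackingArrayLongColumn(long[] values) {
    this.values = values;
  }

  @Override
  public long nextLong() {
    return values[pos++];
  }

  // The fast path: one bulk copy instead of `length` virtual calls to nextLong().
  @Override
  public void fill(long[] dst, int offset, int length) {
    System.arraycopy(values, pos, dst, offset, length);
    pos += length;
  }
}
```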
Conceptually, I'm really excited about this. Some colleagues and I have been working on ingesting data into Lucene from Apache Arrow format. IMO, this would make that considerably easier (and ideally more efficient). I'll try to take a look in the coming days. This sounds cool!
Yes, ideally this would be designed to work nicely with other columnar formats. I don't think it would be zero copy, as there would still be a copy from the source format to Lucene's buffers. But ideally there would be just one copy, with optimized paths for dense batches that can be copied in bulk. It would be up to the user to implement the bytes-to-longs step (endianness, unpacking, etc.) for doc values. And it would have to sit on top of the sort order encoding for points in binary columns. I haven't really gone too far with points in this PR.
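As a concrete (hypothetical) illustration of the one-copy path, a bridge from an Arrow `BigIntVector` to the proposed `LongColumn` might look like this. `BigIntVector` and its `get(int)` accessor are real Arrow APIs; the `LongColumn` shape follows the diff above, and the rest is an assumption:

```java
import org.apache.arrow.vector.BigIntVector;

// Hypothetical bridge: the single copy happens inside fill(), from the Arrow
// buffer into Lucene's destination array.
final class ArrowLongColumn extends LongColumn {
  private final BigIntVector vector;
  private int pos;

  ArrowLongColumn(BigIntVector vector) {
    this.vector = vector;
  }

  @Override
  public long nextLong() {
    return vector.get(pos++);
  }

  @Override
  public void fill(long[] dst, int offset, int length) {
    // For a dense (null-free) batch this loop is the one format -> Lucene copy.
    for (int i = 0; i < length; i++) {
      dst[offset + i] = vector.get(pos + i);
    }
    pos += length;
  }
}
```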
```java
  };
}

private static class ArrayLongColumn extends LongColumn {
```
Do you plan to make ArrayLongColumn and similar classes public for easier use?
I had sent an email to the Lucene mailing list about this. I mentioned that at some point Lucene might want to add ergonomic builders (similar to IntField, DoubleField, etc.) versus just low-level APIs (similar to how users can directly override Field to encode points / build doc values).
I didn't really plan to make them public in this PR, as I was worried about the surface area and/or quickly adding implementations to APIs which are still in flux.
But I can certainly add them if there is specific interest.
Great idea! And very intuitively and relatively simply implemented!
msfroh left a comment
I really like this!
I see some nice similarity with e.g. Arrow columns / record-batches (which makes sense -- I imagine lots of columnar APIs will look similar).
```java
 * @param <T> the vector array type, either {@code float[]} or {@code byte[]}
 * @lucene.experimental
 */
public abstract class VectorTupleCursor<T> {
```
Does this necessarily need to be a dedicated vector cursor?
Or could this just be ObjectTupleCursor<T>? Just from an interface standpoint, T can be any non-primitive type.
By the same token, BinaryTupleCursor is pretty much the same thing with T as BytesRef.
Maybe these TupleCursor classes don't need to be directly connected to the associated Column classes. Instead, we could do specialized primitive TupleCursors and a single generic ObjectTupleCursor<T>?
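A sketch of that suggestion, with all names hypothetical: specialized primitive cursors plus one generic object cursor, decoupled from the Column classes.

```java
// Hypothetical generic cursor; VectorTupleCursor<float[]> and a binary cursor
// (T = BytesRef) would both collapse into specializations of this shape.
public abstract class ObjectTupleCursor<T> {
  /** Advances to the next (docID, value) tuple; returns false when exhausted. */
  public abstract boolean advance();

  /** Doc ID of the current tuple. */
  public abstract int docID();

  /** Current value: float[] or byte[] for vectors, BytesRef for binary, etc. */
  public abstract T value();
}
```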
I'll take a look at some changes in this area in a day or two.
```java
ColumnValidation.validateColumnHasIndexingFeature(fieldName, fieldType);

if (column instanceof BinaryColumn bc) {
  ColumnValidation.validateBinaryColumn(bc, fieldType);
} else if (column instanceof LongColumn lc) {
  ColumnValidation.validateLongColumn(lc, fieldType);
} else if (column instanceof VectorColumn<?> vc) {
  ColumnValidation.validateVectorColumn(vc, fieldType);
}
```
IMO, not worth addressing in this PR, but I wonder if this repeated validation will eventually take a non-trivial amount of compute, especially for a large schema. We can probably define a "schema" class (I'm thinking of Arrow's VectorSchemaRoot), which can be "frozen" after validation. Then each batch can reference the same schema and we can skip validation on subsequent batches.
Just a thought -- for now I think the per-batch validation should be fine.
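A minimal sketch of that validate-once idea. `ColumnSchema` and its methods are hypothetical; the `ColumnValidation` call is from the diff above. A frozen schema could be attached to each batch so per-batch validation is skipped:

```java
import java.util.Map;
import org.apache.lucene.document.FieldType;

// Hypothetical: validate the field types once, then reuse across batches.
final class ColumnSchema {
  private final Map<String, FieldType> fieldTypes;
  private boolean frozen;

  ColumnSchema(Map<String, FieldType> fieldTypes) {
    this.fieldTypes = Map.copyOf(fieldTypes);
  }

  ColumnSchema freeze() {
    for (Map.Entry<String, FieldType> e : fieldTypes.entrySet()) {
      ColumnValidation.validateColumnHasIndexingFeature(e.getKey(), e.getValue());
    }
    frozen = true;
    return this;
  }

  boolean isFrozen() {
    return frozen;
  }
}
```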
I do agree that this is not something I will address here, as I think embedding schemas into Lucene would probably be a discussion of its own.
My use case does demonstrate significant overhead for field validation using the document API. I was indexing something like 150 fields per document, and the time spent in processDocument was greater than the actual indexing.
Moving to batches of 1000 with 150 columns made all the validation disappear from any sampled threads. That doesn't mean it is not there, though.
```java
 * #processField}) and the column-batch row pass ({@link #processRowColumns}). Returns {@code
 * true} if this is a unique indexed field with postings.
 */
private boolean invertAndStore(int docID, IndexableField field, PerField pf) throws IOException {
```
While stored fields definitely make sense as a row-oriented thing, I'm wondering if there's anything clever we can do on the "invert" side if we want to process a bunch of values for the same field.
Since the resulting data structures are field-oriented, it feels to me like there should be something we can do, but I haven't had enough coffee yet to have a clever idea.
I wrote on the mailing list that DOC-only fields with no norms can be processed columnar.
In terms of optimizations, once the API has landed I planned to propose a long (or int) column with an associated array dictionary. Lucene would then only index the dictionary once per column batch.
This would target inverted index and sorted set DV optimizations for low-cardinality use cases, without exposing any Lucene hashing or equality internals.
But I have not actually gone through the steps of implementing something like this yet.
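A sketch of what that dictionary column could look like (all names hypothetical). Per batch, Lucene would index only `dictionary` once; each row carries a small integer code into it, and the caller does the dictionary encoding up front, so no Lucene hashing or equality internals are exposed:

```java
// Hypothetical dictionary-encoded column for low-cardinality fields.
final class DictionaryLongColumn {
  private final long[] dictionary; // distinct values, indexed once per batch
  private final int[] codes;       // one entry per row: an index into dictionary

  DictionaryLongColumn(long[] dictionary, int[] codes) {
    this.dictionary = dictionary;
    this.codes = codes;
  }

  long valueForRow(int row) {
    return dictionary[codes[row]];
  }

  int cardinality() {
    return dictionary.length;
  }
}
```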
This is now, IMO, in a ready-to-review state for (hopefully) eventual merge.
This commit adds an experimental columnar indexing API to the
IndexWriter. It allows the user to provide Long, Binary, and Vector
columns to the indexing chain to significantly reduce the per-field
overhead when indexing batches of documents.
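For orientation, a heavily hedged usage sketch, reusing the `BackingArrayLongColumn` sketch from earlier in this thread. The PR text here does not spell out the public entry point, so `addColumnBatch` and the map-of-columns shape are assumptions made purely to show the intended call pattern; `IndexWriter` and `FSDirectory` are real Lucene APIs:

```java
import java.nio.file.Path;
import java.util.Map;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ColumnarIndexingSketch {
  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Path.of("/tmp/columnar-index"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      long[] prices = {100L, 250L, 75L}; // one column, three rows
      // One call covers the whole batch instead of one Document per row:
      writer.addColumnBatch( // hypothetical method name
          Map.of("price", new BackingArrayLongColumn(prices)), prices.length);
    }
  }
}
```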