Skip to content

Cuvs-Lucene-139: Fix GPU OOMs && Java Heap OOMs#141

Open
nvzm123 wants to merge 2 commits into
rapidsai:mainfrom
nvzm123:cuvslucene-139__zackm
Open

Cuvs-Lucene-139: Fix GPU OOMs && Java Heap OOMs#141
nvzm123 wants to merge 2 commits into
rapidsai:mainfrom
nvzm123:cuvslucene-139__zackm

Conversation

@nvzm123
Copy link
Copy Markdown

@nvzm123 nvzm123 commented Apr 29, 2026

GPU OOM fix:
Replaced usage of CuVSMatrix.deviceBuilder with usage of CuVSMatrix.hostBuilder instead. deviceBuilder was eagerly loading the full dataset to GPU. using hostBuilder gives responsibility to CAGRA as to how to optimally stream data from host memory to GPU.

Java Heap OOM fix:
Stream subsets of data during HNSW graph construction instead of trying to load the full dataset onto the Java Heap.

Note: this PR also includes the commit from cuvs-lucene-137 (#137)

EC2 Default User added 2 commits April 28, 2026 21:13
without loading the full set of data on the Java Heap, but instead allows us
to stream the set of data to the Java Heap.
@nvzm123 nvzm123 requested a review from a team as a code owner April 29, 2026 04:26
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@nvzm123
Copy link
Copy Markdown
Author

nvzm123 commented Apr 29, 2026

@narangvivek10 narangvivek10 added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Apr 30, 2026
int size,
int dimensions,
CuVSMatrix adjacencyListMatrix,
CuVSMatrix vectorDataset,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have createMultiLayerHnswGraph method just above this one, mainly differing in this parameter. Is there any strong reasoning behind having a copy of the method almost the same code? Can we not just modify the above to mainly change vectors param to vectorDataset?

builder.addVector(mergedVectors.vectorValue(it.index()));
}
CuVSHostMatrix dataset = builder.build();
writeFieldInternal(fieldInfo, dataset, size);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can derive size from dataset so is there a strong reasoning behind passing size as a separate primitive int parameter? Will there a situation where the size parameter's value will be different from dataset.size()?

* @param size number of vectors in the dataset
* @throws IOException
*/
private void writeFieldInternal(FieldInfo fieldInfo, CuVSHostMatrix dataset, int size)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar observation with the duplicated writeFieldInternal method and almost the same code as above, with mainly one parameter change from vectors to dataset.

CuVSMatrix.DataType.FLOAT);

// Add vectors one by one - builder copies directly to device memory
CuVSMatrix.hostBuilder( // was: CuVSMatrix.deviceBuilder(resources, ...
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This being part of Utils, maybe we can think of allowing the caller of this method to choose between host or device in case it is needed anywhere in the future?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improves an existing functionality non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants