
Commit d3a606c

beinan and claude committed
docs: add TODO for batched blob resolution performance optimization
Currently takeBlobs() is called once per row, which will be a bottleneck for large joins (e.g. 20B rows). Document the planned optimization to batch references by (datasetUri, columnName) and call takeBlobs() once per group.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 2e655be commit d3a606c

1 file changed

Lines changed: 6 additions & 0 deletions


lance-spark-base_2.12/src/main/java/org/lance/spark/utils/BlobReferenceResolver.java

@@ -34,6 +34,12 @@
  *
  * <p>Datasets are cached for the lifetime of this resolver to avoid re-opening the same dataset for
  * every row.
+ *
+ * <p>TODO: Batch blob resolution for performance. Currently {@code resolve()} calls {@code
+ * Dataset.takeBlobs()} once per row, which incurs JNI + I/O overhead per call. For large joins
+ * (e.g. 20B rows), this becomes a bottleneck. The fix is to collect all blob references in a
+ * batch, group by (datasetUri, columnName), and call {@code takeBlobs()} once per group with all
+ * row addresses. This would reduce the number of calls from O(N) to O(N/batch_size).
  */
 public class BlobReferenceResolver implements AutoCloseable {
