docs: add TODO for batched blob resolution performance optimization

beinan · claude · beinan · commit d3a606ccf7ad · 2026-03-27T22:21:27.000Z
Currently takeBlobs() is called once per row, which will be a
bottleneck for large joins (e.g. 20B rows). Document the planned
optimization to batch references by (datasetUri, columnName) and
call takeBlobs() once per group.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
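
The grouping step described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: `BlobRef`, `BatchFetcher`, and `resolveBatch` are hypothetical names, and `BatchFetcher` is only a stand-in for a batched `takeBlobs()` call, whose real signature is not shown here. The sketch also assumes the fetcher returns blobs in the same order as the row addresses it was given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class BatchedBlobResolutionSketch {

  /** Hypothetical blob reference: identifies one row's blob in one dataset column. */
  static final class BlobRef {
    final String datasetUri;
    final String columnName;
    final long rowAddress;

    BlobRef(String datasetUri, String columnName, long rowAddress) {
      this.datasetUri = datasetUri;
      this.columnName = columnName;
      this.rowAddress = rowAddress;
    }
  }

  /** Hypothetical stand-in for one batched takeBlobs() call on a single dataset column. */
  interface BatchFetcher {
    List<byte[]> fetch(String datasetUri, String columnName, List<Long> rowAddresses);
  }

  /** Resolves a batch of references with one fetch per (datasetUri, columnName) group. */
  static List<byte[]> resolveBatch(List<BlobRef> refs, BatchFetcher fetcher) {
    // Map each (datasetUri, columnName) key to the input positions that belong to it.
    Map<List<String>, List<Integer>> groups = new LinkedHashMap<>();
    for (int i = 0; i < refs.size(); i++) {
      BlobRef ref = refs.get(i);
      List<String> key = Arrays.asList(ref.datasetUri, ref.columnName);
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
    }

    byte[][] resolved = new byte[refs.size()][];
    for (Map.Entry<List<String>, List<Integer>> entry : groups.entrySet()) {
      List<Integer> positions = entry.getValue();
      List<Long> addresses = new ArrayList<>(positions.size());
      for (int position : positions) {
        addresses.add(refs.get(position).rowAddress);
      }

      // One call per group instead of one call per row.
      List<byte[]> blobs = fetcher.fetch(entry.getKey().get(0), entry.getKey().get(1), addresses);
      if (blobs.size() != positions.size()) {
        throw new IllegalStateException("fetcher returned " + blobs.size()
            + " blobs for " + positions.size() + " row addresses");
      }

      // Assumption: results come back in request order, so scatter them to input positions.
      for (int j = 0; j < positions.size(); j++) {
        resolved[positions.get(j)] = blobs.get(j);
      }
    }
    return Arrays.asList(resolved);
  }
}
```

Tracking input positions keeps the output aligned with the input rows even when references to different datasets are interleaved. If the real call does not preserve request order, the scatter step would need to match results by row address instead.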
diff --git a/lance-spark-base_2.12/src/main/java/org/lance/spark/utils/BlobReferenceResolver.java b/lance-spark-base_2.12/src/main/java/org/lance/spark/utils/BlobReferenceResolver.java
@@ -34,6 +34,12 @@
  *
  * <p>Datasets are cached for the lifetime of this resolver to avoid re-opening the same dataset for
  * every row.
+ *
+ * <p>TODO: Batch blob resolution for performance. Currently {@code resolve()} calls {@code
+ * Dataset.takeBlobs()} once per row, which incurs JNI + I/O overhead per call. For large joins
+ * (e.g. 20B rows), this becomes a bottleneck. The fix is to collect all blob references in a
+ * batch, group by (datasetUri, columnName), and call {@code takeBlobs()} once per group with all
+ * row addresses. This would cut calls from O(N) to O(G * N/batch_size), G = groups per batch.
  */
 public class BlobReferenceResolver implements AutoCloseable {