Commit 8a48a87
perf: optimize object store requests when reading JSON (#20823)
## Which issue does this PR close?
- Closes #.
## Rationale for this change
This is an alternative approach to
- #19687
Instead of reading the entire range in the JSON FileOpener, implement an AlignedBoundaryStream that wraps the original stream returned by the ObjectStore and scans the range for newlines as the FileStream requests data from it.
This eliminates the overhead of the two extra get_opts requests needed by calculate_range and, more importantly, allows for efficient read-ahead implementations by the underlying ObjectStore. Previously this was inefficient because calculate_range opened one stream from `(start - 1)` to file_size and another from `(end - 1)` to file_size, just to find the two relevant newlines.
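
For readers unfamiliar with the old behaviour, here is a minimal in-memory sketch of the alignment that calculate_range is described as performing above. It is not the DataFusion code: `after_first_newline` and `aligned_range` are hypothetical names, the file is a plain byte slice, and each call to `after_first_newline` stands in for one of the extra ranged get_opts requests.

```rust
/// Absolute offset just past the first `\n` at or after `from`,
/// or the file length if no newline remains.
/// In this model, each call represents one extra ranged request.
fn after_first_newline(file: &[u8], from: usize) -> usize {
    file[from..]
        .iter()
        .position(|&b| b == b'\n')
        .map_or(file.len(), |i| from + i + 1)
}

/// Hypothetical model of the two-probe alignment: move both ends of the
/// nominal partition `[start, end)` forward to the next line boundary.
fn aligned_range(file: &[u8], start: usize, end: usize) -> (usize, usize) {
    // Probe 1: scan from `start - 1` (skipped for the first partition).
    let aligned_start = if start == 0 {
        0
    } else {
        after_first_newline(file, start - 1)
    };
    // Probe 2: scan from `end - 1` (skipped for the last partition).
    let aligned_end = if end == file.len() {
        end
    } else {
        after_first_newline(file, end - 1)
    };
    (aligned_start, aligned_end)
}
```

With three 8-byte JSON lines, splitting at byte 10 gives the first partition lines one and two and the second partition line three. A partition that falls entirely inside one line collapses to an empty range.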
## What changes are included in this PR?
Added the AlignedBoundaryStream, which wraps a stream returned by the object store and finds the delimiting newlines for a particular file range. Notably, it doesn't do any standalone reads (unlike the calculate_range function), eliminating two calls to get_opts.
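
The idea can be illustrated with a synchronous, std-only sketch. This is not the AlignedBoundaryStream from this PR: `AlignedChunks` and `read_partition` are hypothetical names, a plain `Iterator<Item = Vec<u8>>` stands in for the async object store stream, and the sketch assumes that stream starts at `start - 1` (or 0 for the first partition) and that `start < end`.

```rust
/// Hypothetical, synchronous stand-in for a newline-aligning stream wrapper.
struct AlignedChunks<I> {
    inner: I,
    /// Absolute file offset of the next byte `inner` will yield.
    pos: usize,
    /// Nominal (unaligned) end of the partition, exclusive.
    end: usize,
    /// True while the partial leading line is still being discarded.
    skipping: bool,
    /// True once the terminating newline was seen or `inner` ran dry.
    done: bool,
}

impl<I: Iterator<Item = Vec<u8>>> AlignedChunks<I> {
    /// Assumption: `inner` starts at offset `start - 1` (0 when `start == 0`).
    fn new(inner: I, start: usize, end: usize) -> Self {
        Self { inner, pos: start.saturating_sub(1), end, skipping: start > 0, done: false }
    }
}

impl<I: Iterator<Item = Vec<u8>>> Iterator for AlignedChunks<I> {
    type Item = Vec<u8>;

    fn next(&mut self) -> Option<Vec<u8>> {
        while !self.done {
            let Some(chunk) = self.inner.next() else {
                self.done = true;
                break;
            };
            let chunk_start = self.pos;
            self.pos += chunk.len();
            let last = self.end.saturating_sub(1); // offset of `end - 1`

            // Leading boundary: drop bytes up to and including the first newline.
            let mut from = 0;
            if self.skipping {
                match chunk.iter().position(|&b| b == b'\n') {
                    // That newline also closes the partition: nothing to emit.
                    Some(i) if chunk_start + i >= last => {
                        self.done = true;
                        break;
                    }
                    Some(i) => {
                        from = i + 1;
                        self.skipping = false;
                    }
                    None => continue,
                }
            }

            // Trailing boundary: stop after the first newline at or after `end - 1`.
            let mut to = chunk.len();
            let search_from = from.max(last.saturating_sub(chunk_start));
            if search_from < chunk.len() {
                if let Some(i) = chunk[search_from..].iter().position(|&b| b == b'\n') {
                    to = search_from + i + 1;
                    self.done = true;
                }
            }
            if from < to {
                return Some(chunk[from..to].to_vec());
            }
        }
        None
    }
}

/// Test helper: serve `file` from `start - 1` in fixed-size chunks and align it.
fn read_partition(file: &[u8], start: usize, end: usize, chunk_size: usize) -> Vec<u8> {
    let fetch_from = start.saturating_sub(1);
    let chunks: Vec<Vec<u8>> =
        file[fetch_from..].chunks(chunk_size).map(|c| c.to_vec()).collect();
    AlignedChunks::new(chunks.into_iter(), start, end).flatten().collect()
}
```

Because both boundaries are found while the data is consumed, a single sequential read per partition is enough in this model, and the wrapper stops pulling chunks once the closing newline has been seen. Concatenating the output of adjacent partitions reproduces the file with every line read exactly once.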
## Are these changes tested?
Yes, added unit tests.
## Are there any user-facing changes?
No
---------
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
1 parent 7c3b22c commit 8a48a87
File tree
4 files changed, +1672 -34 lines changed
- datafusion
  - core/tests/datasource
  - datasource-json/src