Debug logging and fixes for column stats write memory#3136
Draft
poodlewars wants to merge 1 commit into
Label error. Requires exactly 1 of: patch, minor, major. Found: |
Big explanation from Claude about why swapping to io executor wasn't enough, |
LMDB can mmap large amounts of data into the process.
With random (incompressible) data, the mapped amount is roughly the size of the dataframe we're testing, so it would show up in peak RSS.
So the benchmark uses all zeros, which compress down to almost nothing, and the peak RSS measurement is not affected by LMDB's mmap (which holds the compressed data, so is now tiny). Alternatively, we could do this testing against S3.
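To make the compressibility point concrete, here is a minimal sketch. It uses `zlib` from the Python standard library purely as a stand-in codec (an assumption; it is not the codec the storage layer uses), and the 1 MiB payload size is arbitrary.

```python
import os
import zlib

# Hypothetical payload size standing in for one column's worth of bytes.
size = 1 << 20  # 1 MiB

zeros = bytes(size)        # all-zero data, as in the benchmark
noise = os.urandom(size)   # random data, effectively incompressible

zeros_compressed = len(zlib.compress(zeros))
noise_compressed = len(zlib.compress(noise))

# The zero payload shrinks to a tiny fraction of its size; the random
# payload stays at roughly its original size.
print(f"zeros:  {size} -> {zeros_compressed} bytes")
print(f"random: {size} -> {noise_compressed} bytes")
```

With zeros, whatever LMDB maps is negligible next to the in-memory dataframe, so peak RSS reflects the write path rather than the storage mmap.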
Changing to `InlineExecutor` here is important:
It guarantees that the processing (column stats creation) is done immediately after the segment load. The `.via(io_executor).thenValueInline(process)` version still queued the processing after all the segment loads.

There are advantages to implementing this with the semaphore, though: we can keep the whole IO pool busy in the case where stats generation is fast relative to IO, and we can utilize the CPU pool (which is not used at all with this idea). Having said that, in the long run column stats generation will be automatic, so we don't need to get this manual pipeline too perfect. The semaphore idea is only clearly safe in the special case where processing consumes a single segment at a time; otherwise we need to worry about deadlocks.
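The ordering effect can be illustrated with a toy model. This is a single-threaded FIFO queue written in plain Python, not folly, and `peak_segments_held` is a hypothetical name. It only models the behaviour described above: a continuation that is re-queued runs behind every pending load, while an inline continuation runs straight away.

```python
from collections import deque


def peak_segments_held(num_segments, inline):
    """Simulate a FIFO executor and return the peak number of live segments."""
    queue = deque()
    held = 0
    peak = 0

    def process():
        nonlocal held
        held -= 1  # stats computed, segment memory can be released

    def load():
        nonlocal held, peak
        held += 1  # segment is now resident in memory
        peak = max(peak, held)
        if inline:
            process()              # continuation runs right after the load
        else:
            queue.append(process)  # continuation waits behind pending loads

    for _ in range(num_segments):
        queue.append(load)
    while queue:
        queue.popleft()()
    return peak


print(peak_segments_held(8, inline=False))  # 8: every segment loaded first
print(peak_segments_held(8, inline=True))   # 1: one segment live at a time
```

In the queued case peak memory grows with the number of segments, which matches the symptom described; in the inline case it stays flat.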
I think something like this approach should be fine.
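For comparison, the semaphore alternative discussed above could look roughly like the sketch below. It is a Python analogy using `concurrent.futures`, not the PR's code: `run_pipeline`, `load_segment`, `compute_stats`, the pool sizes and `max_in_flight` are all hypothetical. It assumes each processing step consumes exactly one segment, which is the case identified above as clearly safe from deadlock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(keys, load_segment, compute_stats, max_in_flight=2):
    """Load on an IO pool, process on a CPU pool, cap segments in memory."""
    slots = threading.Semaphore(max_in_flight)
    lock = threading.Lock()
    held = 0
    peak = 0

    with ThreadPoolExecutor(max_workers=4) as io_pool, \
            ThreadPoolExecutor(max_workers=2) as cpu_pool:

        def process(segment):
            nonlocal held
            try:
                return compute_stats(segment)
            finally:
                with lock:
                    held -= 1
                slots.release()  # free the slot once the segment is done

        def load_then_hand_off(key):
            nonlocal held
            try:
                segment = load_segment(key)
            except BaseException:
                with lock:
                    held -= 1
                slots.release()  # do not leak the slot if the load fails
                raise
            return cpu_pool.submit(process, segment)

        handoffs = []
        for key in keys:
            slots.acquire()  # blocks while max_in_flight segments are live
            with lock:
                held += 1
                peak = max(peak, held)
            handoffs.append(io_pool.submit(load_then_hand_off, key))

        # Resolve inside the with-block so both pools are still running.
        stats = [handoff.result().result() for handoff in handoffs]
    return stats, peak


stats, peak = run_pipeline(range(20), lambda k: [k] * 10, sum, max_in_flight=3)
print(stats[:3], peak)
```

Because a slot is taken before each load and released only after that same segment is processed, no step waits on a second slot while holding one. If processing needed several segments at once, steps could each hold a slot while waiting for more, which is the deadlock risk noted above.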