External Sort Operator
The external sort operator works like other Drill operators in its use of an incoming batch, in the next( ) call hierarchy, and in acting as a record batch for its "output" (downstream, up-the-call-stack) operator. The external sort operator is interesting for two key reasons:
- It buffers all the incoming rows so that they can be sorted together, and
- It optionally spills data to disk due to memory pressure.
Like all operators, the external sort (xsort) operator does the following:
- Holds an incoming operator (incoming).
- Calls incoming.next( ) to obtain a record batch.
- Supports an incoming batch with an associated selection vector. (?)
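The pull-and-buffer behavior described above can be sketched as follows. This is only a minimal illustration: the `Upstream` interface and the use of `int[]` as a "batch" are hypothetical stand-ins, not Drill classes, and the real next( ) call in Drill has a different contract than returning the data directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not Drill classes.
class PullAndBufferSketch {

    /** Hypothetical upstream operator: next() returns a batch, or null at end of data. */
    interface Upstream {
        int[] next();
    }

    /** Drains the upstream operator and keeps every batch so all rows can be sorted together. */
    static List<int[]> bufferAll(Upstream incoming) {
        List<int[]> batches = new ArrayList<>();
        int[] batch;
        while ((batch = incoming.next()) != null) {
            batches.add(batch);
        }
        return batches;
    }

    public static void main(String[] args) {
        // Two small "batches" served by a fake upstream operator.
        Iterator<int[]> source = Arrays.asList(new int[] {3, 1}, new int[] {2}).iterator();
        Upstream incoming = () -> source.hasNext() ? source.next() : null;
        List<int[]> buffered = bufferAll(incoming);
        System.out.println(buffered.size() + " batches buffered");
    }
}
```

The point of the sketch is only that the sort cannot emit anything until the upstream operator is exhausted, which is why memory pressure, and therefore spilling, becomes a concern.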
Sorting is done via an indirection vector. That is, the data in the vectors does not move during the sort. Instead, the xsort operator uses an int-based selection vector (SelectionVector4 sv4) to hold indexes to the data. During the sort, the indexes are reordered, not the vector data.
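The idea of sorting through an indirection vector can be shown with plain arrays. The sketch below is an assumption-laden simplification: `sortIndexes` and the `int[]` key column are hypothetical, and it does not use or mimic the actual SelectionVector4 API. It only shows that the index array is reordered while the data stays in place.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration of an indirection sort; not Drill's SelectionVector4.
class IndirectSortSketch {

    /** Returns row indexes ordered by key value. The keys array itself is never reordered. */
    static int[] sortIndexes(int[] keys) {
        Integer[] boxed = new Integer[keys.length];
        for (int i = 0; i < keys.length; i++) {
            boxed[i] = i; // start with the identity mapping: index i points at row i
        }
        // Compare indexes by the data they point to, so only the indexes move.
        Arrays.sort(boxed, Comparator.comparingInt(i -> keys[i]));
        int[] indexes = new int[keys.length];
        for (int i = 0; i < keys.length; i++) {
            indexes[i] = boxed[i];
        }
        return indexes;
    }

    public static void main(String[] args) {
        int[] keys = {30, 10, 20};
        int[] sv = sortIndexes(keys);
        System.out.println(Arrays.toString(sv));   // [1, 2, 0]
        System.out.println(Arrays.toString(keys)); // [30, 10, 20], unchanged
    }
}
```

Reading the rows in the order given by the index array yields sorted output (10, 20, 30) without copying any of the underlying values.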
PriorityQueueCopier
The operator spills to disk (when)? batch group size? threshold?
Disk management is integrated into the xsort operator class itself. The algorithm is:
- Use the HDFS FileSystem class, referenced by fs, to manage spill files. This allows files to be written into the DFS file system in addition to local disk. This is done when cluster nodes are configured with limited local storage, and so large spill files are written to DFS instead.
- A list of spill directories (dirs) to which spill files are written (in round robin manner?)
- MappingSet
- Dir algorithm
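The notes above leave open whether spill directories are chosen in round-robin order. If they are, the selection might look like the sketch below. Everything here is hypothetical: the class name, the `nextDir` method, and the example paths are invented for illustration and are not taken from the xsort code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical round-robin chooser; assumes, without confirmation, that dirs is cycled in order.
class SpillDirChooserSketch {
    private final List<String> dirs;
    private int nextIndex = 0;

    SpillDirChooserSketch(List<String> dirs) {
        if (dirs.isEmpty()) {
            throw new IllegalArgumentException("at least one spill directory is required");
        }
        this.dirs = dirs;
    }

    /** Returns the directory for the next spill file, wrapping around at the end of the list. */
    String nextDir() {
        String dir = dirs.get(nextIndex);
        nextIndex = (nextIndex + 1) % dirs.size();
        return dir;
    }

    public static void main(String[] args) {
        // Example paths are made up for the demonstration.
        SpillDirChooserSketch chooser =
            new SpillDirChooserSketch(Arrays.asList("/tmp/spill-a", "/tmp/spill-b"));
        for (int i = 0; i < 3; i++) {
            System.out.println(chooser.nextDir());
        }
    }
}
```

Cycling through the list spreads spill files evenly across the configured directories, which would matter most when each directory sits on a separate disk.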