Skip to content

External Sort Operator

Paul Rogers edited this page Oct 31, 2016 · 12 revisions

External Sort Operator

The external operator works like other Drill operators in it's use of an incoming batch, the next( ) call hierarchy, and that the operator is a record batch for its "output" (downstream, up-the-call-stack) operator. The external sort operator is interesting for two key reasons:

  1. It buffers all the incoming rows so that they can be sorted together, and
  2. It optionally spills data to disk due to memory pressure.

General Flow

Like all operators, the external sort (xsort) operator does the following:

  • Holds an incoming operator (incoming).
  • Calls incoming.next( ) to obtain a record batch.
  • Supports an incoming batch with an associated selection vector. (?)

Sort Implementation

Sorting is done via an indirection vector. That is, the data in the vectors does not move during the sort. Instead, the xsort operator uses a int-based selection vector (SelectionVector4 sv4) to hold indexes to the data. During sort, the indexes are reordered, not the vector data.

Schema Change Handling

PriorityQueueCopier

Spilling

The operator spills to disk (when)? batch group size? threshold?

Disk Management

Disk management is integrated into the xsort operator class itself. The algorithm is:

  • Use the HDFS FileSystem class, referenced by fs to manage spill files. Allows files to be written into the DFS file system in addition to local disk. This is done when cluster nodes are configured with limited local storage, and so large spill files are written to DFS instead.
  • A list of spill directories (dirs) to which spill files are written (in round robin manner?)

Open Questions

  • MappingSet
  • Dir algorithm

Clone this wiki locally