Optimization of External Sort Performance

Sorting is one of the fundamental operations of any query system. Good sort performance pays benefits in overall query performance. Drill external sort performs an in-memory sort-merge when data fits in memory, and spills to disk otherwise. The performance of the in-memory sort is already quite good. Can we make it better? It turns out we can, by about 23%.

Baseline Sort Numbers

To determine where we stand, we can start with a baseline using several available sort algorithms. All tests use a data set of 10 million randomly generated integers uniformly distributed over the entire range of values (both positive and negative.)

The sorts are:

Arrays.sort(int) time: 847
Arrays.sort(Integer) time: 4681
TimSort.sort(Integer, c) time: 4982
Hadoop QS time (Integer): 5920
Hadoop QS time (int): 2398
Hadoop QS time (ByteBuffer): 4090
Hadoop QS time (ByteBuffer, SV): 9840

The first two data points are for using the basic Arrays.sort function: the first on an array of int values, the second on an array of Integer objects. In Java 8, the sort method is implemented using a "Tim Sort" (see Javadocs for details.) The third test uses the TimSort directly passing in a comparator object to measure the overhead of the extra method call.

Hadoop provides a QuickSort algorithm which is the basis for Drill's in-memory sort. The next two lines measure the time using an array of Integer and int. Drill uses direct memory buffers. To simulate that, the penultimate line uses a Hadoop QuickSort with data stored in a ByteBuffer. Drill actually uses a level of indirection, the so-called "selection vector." The last line simulates that indirection using a simulated selection vector and data vector, both in `ByteBuffers.

The numbers above came from a test run on a Mac. Values shown are the last iteration of a run with five iterations of each test.

As we will see, Drill actually compares very well against these experiments.

Drill Baseline

To provide a place to start, consider a query of 10 million rows using the mock data source described on the Performance Basics page. The tests use an embedded Drillbit in the test framework also described on that page, along with the configuration options also described (turning of assertions, disabling debug logging, etc.) Sufficient memory is provided that the sort occurs in memory. The query is:

SELECT id_i FROM `mock`.employee_10M ORDER BY id_i

This simply says to select an integer field ("_i") that we call "id" for convenience. Create 10 million rows ("_10M") from a table we call "employee" for convenience. Sort by the integer field.

The query runs in 6201 ms (6.2 seconds). The physical plan, and time breakdown is as follows:

Op: 0 Screen
  Setup:   0 - 0%, 0%
  Process: 3 - 0%, 0%
  Wait:    4
Op: 1 Project
  Setup:   1 - 100%, 0%
  Process: 1 - 0%, 0%
Op: 2 SelectionVectorRemover
  Setup:   0 - 0%, 0%
  Process: 762 - 12%, 12%
Op: 3 Sort
  Setup:   0 - 0%, 0%
  Process: 5016 - 84%, 84%
  Spills: 0
  Memory: 118
Op: 4 Scan
  Setup:   0 - 0%, 0%
  Process: 143 - 2%, 2%
Total:
  Setup:   1
  Process: 5925

As expected, the bulk of the time (84%) is spent in the (in-memory) sort. A bit of analysis showed that the time is spent in two key "inner loops" the in-memory sorter and the in-memory merger (called, for some reason, an "MSort".)

Optimizing the In-Memory Sorter

The bulk of the time in the in-memory sorter is in the code generated from the SingleBatchSorterTemplate template. Fortunately, DRILL-5052 gives us the tools to see, and debug the generated code. Here is an abbreviated form:

public class SingleBatchSorterGen1 extends SingleBatchSorterTemplate {
    IntVector vv0;
    IntVector vv4;

    public int doEval(char leftIndex, char rightIndex) throws SchemaChangeException {
        IntHolder out3 = new IntHolder();
        out3.value = vv0.getAccessor().get(leftIndex);
        IntHolder out7 = new IntHolder();
        out7.value = vv4.getAccessor().get(rightIndex);
        IntHolder out8 = new IntHolder();
        final IntHolder out = new IntHolder();
        IntHolder left = out3;
        IntHolder right = out7;
        out.value = left.value < right.value ? -1 : (left.value == right.value ? 0 : 1);
        return out8 .value;
    }

This code is unique to the fact that we are sorting (non-nullable) integers. The code would vary if the sort were for nullable columns, were for another data type, were for variable-length columns (e.g. VARCHAR), etc. That's why Drill needs code generation.

What we see is that the sorter acts as if it has two incoming vector: vv0 and vv4. This is an artifact because, of course, a sorter just has a single vector. The code grabs a value from both, given the index, and compares them. Simple enough.

Note also that the generated code uses "holders" to work with vector values. Each holder is a new Java object. We might think that we could gain some performance by removing the temporary objects, but that already happens. Drill already performs "scalar value replacement" at the byte code level: it is the magic that replaces these temporary object with scalar variables. However, in modern Java, even that is unnecessary: the Java compiler and/or runtime does the work "for free."

To make our task a bit easier, we can rewrite the code as the JVM sees it:

    public int doEval(char leftIndex, char rightIndex) throws SchemaChangeException {
        left = vv0.getAccessor().get(leftIndex);
        right = vv0.getAccessor().get(rightIndex);
        return = left < right ? -1 : (left == right ? 0 : 1);
    }

Note that since vv0 and vv4 are the same, we retain just one of them.

What we want to notice runs a bit deeper. Notice that we get the values as follows:

left = vv0.getAccessor().get(leftIndex);

That is two function calls per value, four per comparison. That's got to add up. How could we replace them? First, we could cache accessors instead of vectors. Here is the current setup code (retaining just vv0):

    public void doSetup(FragmentContext context, VectorAccessible incoming, RecordBatch outgoing)
        throws SchemaChangeException
    {
        int[] fieldIds1 = new int[ 1 ] = { 0 };
        Object tmp2 = (incoming).getValueAccessorById(IntVector.class, fieldIds1).getValueVector();
        vv0 = ((IntVector) tmp2);
    }

Caching the accessor is simple:

        a0 = vv0.getAccessor();

Now eval is:

        left = a0.get(leftIndex);
        right = a0.get(rightIndex);

Optimization of External Sort Performance

Baseline Sort Numbers

Drill Baseline

Optimizing the In-Memory Sorter

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!