Value Vector Implementation

Paul Rogers edited this page Sep 26, 2016 · 7 revisions

Value vectors are the heart and soul of Drill: readers build them, operators transform them, network operations ship them around, and clients consume them. The core concepts of value vectors appear here. The implementation, however, is quite complex.

Value vectors work with Java primitive types (bits, ints, and so on). In Java, each primitive requires distinct code. That is, a "get" method returns either a short or an int; a single method can't do both. Many applications provide uniform access by converting primitives to objects ("boxing"), but doing so creates garbage. In Drill's case, high-speed processing of vectors would produce vast amounts of garbage, resulting in large GC overhead. Drill's solution is to provide type-specific vectors with the corresponding primitive get/set methods.

Each primitive also has a distinct width, which influences how the primitive elements map into a byte vector. So, each vector type encodes its width information into its implementation methods.

Since Drill provides around 38 types ("minor types" in Drill terminology), Drill provides 38 different value vector types.

Further, Drill provides three forms of cardinality ("mode" in Drill terminology): Required (non-nullable), Optional (nullable) and Repeated (array).

The combination of (minor type, mode) gives rise to a "major type" in Drill, of which about 108 exist. Drill provides a separate value vector class for each major type. Because it is not practical to maintain this many classes (with virtually identical code) by hand, Drill generates the classes. You can find them in the vector project in org.apache.drill.exec.vector.

Since each major type has its own vector class, code that uses vectors must also be written specific to each major type. This means writing 100+ different implementations. Because this would be impossible to do by hand, all code anywhere in Drill that works with vectors is also generated. Thus, when working with vectors, one must think in terms of meta-programming: writing code that generates code that works with vectors.
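
To make the meta-programming idea concrete, here is a minimal sketch of code that writes code. It is not Drill's actual generator: the class VectorNameSketch, its Mode enum, and both helper methods are hypothetical, and the class-naming convention (no prefix for required, "Nullable" and "Repeated" prefixes for the other modes) is an assumption made for illustration.

```java
// Hypothetical sketch of "code that generates code"; not Drill's real generator.
class VectorNameSketch {
    enum Mode { REQUIRED, OPTIONAL, REPEATED }

    // Assumed naming convention: required vectors carry no prefix.
    static String className(String minorType, Mode mode) {
        switch (mode) {
            case OPTIONAL: return "Nullable" + minorType + "Vector";
            case REPEATED: return "Repeated" + minorType + "Vector";
            default:       return minorType + "Vector";
        }
    }

    // Emits the source text of a fixed-width get() method for one primitive type.
    // The width is baked into the generated text, so each type gets its own code.
    static String getMethod(String javaType, String bufferGetter, int width) {
        return "public " + javaType + " get(int index) { return data."
            + bufferGetter + "(index * " + width + "); }";
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(className("Int", mode));
        }
        System.out.println(getMethod("int", "getInt", 4));
        System.out.println(getMethod("long", "getLong", 8));
    }
}
```

The point of the sketch is that one template, parameterized by type name, Java type, and width, stands in for many hand-written classes that would otherwise differ only in those details.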

Every value vector consists of:

  • A "payload" vector with the actual data values.
  • An optional set of offset vectors that map into the payload vector. (See below.)
  • An accessor class to retrieve values from the vector.
  • A mutator class to write values into the vector.
  • A field reader that ((more info needed...))

Required (non-nullable) Vectors

We start with vectors for the required mode since they are the simplest. Consider the IntVector class which holds signed, 32-bit integers. Client code does not work with vectors directly. Instead, clients work via (type-specific) accessor classes:

class Accessor extends BaseDataValueVector.BaseAccessor {
  public int getValueCount() ...
  public boolean isNull(int index) ...
  public int get(int index) ...
  ...
}

The getValueCount() and isNull() methods are inherited (and hence generic to all vectors). The get() method, however, is type-specific. Both isNull() and get() take an index, which is the record index relative to the start of the record (vector) batch. In this case, since the vector stores only required values, isNull() always returns false. Internally, the get() method converts the record index to a byte buffer index. Since ints are a constant 4 bytes, the conversion is simple:

public int get(int index) { return data.getInt(index * 4); }
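
The same index arithmetic can be tried outside Drill. The sketch below uses java.nio.ByteBuffer as a stand-in for the vector's underlying data buffer; the class FixedWidthIntSketch and its set() method are hypothetical names introduced only for this example.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a required int vector, backed by a plain ByteBuffer.
class FixedWidthIntSketch {
    static final int WIDTH = 4; // bytes per 32-bit int

    private final ByteBuffer data;

    FixedWidthIntSketch(int valueCount) {
        // One contiguous buffer; value i occupies bytes [i * 4, i * 4 + 4).
        data = ByteBuffer.allocate(valueCount * WIDTH);
    }

    // Absolute write: record index -> byte offset.
    void set(int index, int value) { data.putInt(index * WIDTH, value); }

    // Absolute read: the same conversion as the get() shown above.
    int get(int index) { return data.getInt(index * WIDTH); }

    public static void main(String[] args) {
        FixedWidthIntSketch v = new FixedWidthIntSketch(3);
        v.set(0, 10);
        v.set(1, -20);
        v.set(2, 30);
        System.out.println(v.get(1));
    }
}
```

Because the width is a constant, no per-value metadata is needed: the record index alone determines where the value lives.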

For variable-length vectors, such as VarCharVector, the vector has a level of indirection through an offset vector.
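
A minimal sketch of that indirection follows. It assumes one common layout, in which the offset vector holds one more entry than there are values and value i occupies the payload bytes from offsets[i] up to offsets[i + 1]. The class VarLengthSketch is hypothetical and uses plain Java arrays rather than Drill's buffers, so it should be read as an illustration of the idea and not as a description of VarCharVector's internals.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical variable-length vector: a payload byte array plus an offset array.
class VarLengthSketch {
    private final byte[] data;    // payload: all values packed end to end
    private final int[] offsets;  // assumed layout: valueCount + 1 entries

    VarLengthSketch(String... values) {
        offsets = new int[values.length + 1];
        byte[][] encoded = new byte[values.length][];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
            offsets[i + 1] = offsets[i] + encoded[i].length; // running end position
        }
        data = new byte[offsets[values.length]];
        for (int i = 0; i < values.length; i++) {
            System.arraycopy(encoded[i], 0, data, offsets[i], encoded[i].length);
        }
    }

    // Two lookups: first the offsets, then the payload bytes they delimit.
    String get(int index) {
        int start = offsets[index];
        int length = offsets[index + 1] - start;
        return new String(data, start, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        VarLengthSketch v = new VarLengthSketch("a", "", "drill");
        System.out.println(v.get(2));
    }
}
```

Under this layout an empty value costs no payload bytes, only an offset entry equal to its predecessor.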

Optional (nullable) Vectors

Repeated (array) Vectors