EVF Row Set

This is a practical guide, so let's immediately start using the EVF in its simplest form: the Row Set framework used to create, read and compare record batches in unit tests. The row set framework makes a number of simplifying assumptions to make tests as simple as possible:

When creating a batch, we define the schema up front.
No projection, null columns, or type conversion is needed.
Batches are "small"; the framework does not enforce memory limits.

While the row set framework is specific to tests, the column accessor mechanism is used throughout the EVF. The easiest way to understand the column accessors is to start with the row set framework. (The EVF uses the term "accessor" to mean either a vector reader or a vector writer.)

In this example, we refer to this documentation and the ExampleTest class.

Create a Test File

Since the row set framework is typically used in a test, our example will work in that context. Find a handy spot within Drill to define your temporary test file. (Unfortunately, Drill is not designed to allow you to create these files as a separate project outside of Drill.) You can create the file in the test package if you like.

import org.apache.drill.test.SubOperatorTest;
import org.junit.Test;

public class ExampleTest extends SubOperatorTest {

  @Test
  public void myTest() {
  }
}

The SubOperatorTest base class takes care of configuring and launching an in-memory version of Drill so that you can focus on the specific test case at hand.

Define Your Schema

Next, define your schema using the SchemaBuilder class. Careful, there are two such classes in Drill: you want the one in the metadata package. Let's define a simple schema with two columns: a non-nullable Int and a nullable Varchar.

import org.apache.drill.exec.record.metadata.SchemaBuilder;
import org.apache.drill.exec.record.metadata.TupleMetadata;
import org.apache.drill.common.types.TypeProtos.MinorType;
...

  @Test
  public void myTest() {
    final TupleMetadata schema = new SchemaBuilder()
      .add("id", MinorType.INT)
      .addNullable("name", MinorType.VARCHAR)
      .buildSchema();
  }

Some things to note:

The schema builder allows a fluent notation which is very handy in tests. Production code is never this easy since the schema is not known at compile time.
The add() methods typically add a column with no options which is a non-nullable (AKA Required) column.
The addNullable() methods add a nullable (Optional) column.
Drill defines two classes called MinorType. Use the import shown above to get the correct one.
The result of buildSchema() is a TupleMetadata (implemented by the TupleSchema class).
The schema builder can also build a BatchSchema by calling build(). BatchSchema is used by the VectorContainer class, but TupleMetadata holds a more complete set of metadata, and can properly define extended types that BatchSchema cannot.

The TupleMetadata class describes both a record and (as we'll see later) a Drill "Map" (really a struct.) Each tuple is made up of columns, defined by the ColumnMetadata interface. ColumnMetadata provides a rich set of information about each column. Combined, the metadata classes drive much of the EVF as we'll see.

Now that you are familiar with the schema classes, we'll leave it as an exercise for the reader to explore them and learn all that they have to offer.

Vector Types

Drill's value vectors are at the core of the Drill execution engine. Vectors are of multiple types. We'll work through each type one-by-one.

Create a Record Batch the Easy Way

The next step is to create a record batch using the schema. In tests, the easy way to do this is with the RowSetBuilder class:

      final RowSet rowSet = new RowSetBuilder(allocator, schema)
        .addRow(1, "kiwi")
        .addRow(2, "watermelon")
        .build();

Things to notice here:

The RowSet interface provides an easy-to-use wrapper around the actual record batch.
For more advanced tests, you may need to use one of the subclasses of RowSet.
The record batch itself is available via the foo() method.
The RowSetBuilder class provides a fluent way to create, populate, and return a row set.
The addRow() method takes a list of Java objects. The code uses the type of the Java object to figure out which set method to call. (We'll discuss those methods shortly.)
If you want to create a row with a single column, use addSingleCol() instead. Otherwise, Java sometimes gets confused about the type of the single argument.
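
The single-column caveat comes from how Java resolves calls to a varargs method. The sketch below is a plain-Java model, not Drill code: VarargsSketch, its addRow() and its setterFor() are hypothetical stand-ins. addRow() only counts the column values Java delivers, and setterFor() names the set method that a type-driven builder might choose.

```java
// Plain-Java model of a varargs addRow(); all names here are hypothetical, not Drill's.
final class VarargsSketch {

  // Returns how many column values Java actually delivered.
  static int addRow(Object... values) {
    return values.length;
  }

  // Picks a setter name from the runtime type of the value.
  static String setterFor(Object value) {
    if (value == null) { return "setNull"; }
    if (value instanceof Integer) { return "setInt"; }
    if (value instanceof Long) { return "setLong"; }
    if (value instanceof Double) { return "setDouble"; }
    if (value instanceof String) { return "setString"; }
    throw new IllegalArgumentException("Unsupported type: " + value.getClass());
  }

  public static void main(String[] args) {
    String[] names = {"apple", "manzana"};
    // Intended: one array-valued column. Java instead treats the array as the
    // varargs list itself, so the row appears to have two columns.
    System.out.println(addRow(names));           // 2
    // Casting to Object forces a single argument.
    System.out.println(addRow((Object) names));  // 1
    System.out.println(setterFor(1));            // setInt
    System.out.println(setterFor("kiwi"));       // setString
  }
}
```

A dedicated single-column method avoids the cast: its parameter is a plain Object, so an array argument can only mean "one value".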

Create a Record Batch using Column Writers

The above technique is often all you need when writing tests to verify some operation. (You will write such unit tests for your present work, right? I thought so.)

If you are creating an operator that works in production code, you won't know the data at compile time. Instead, you must work with each column one-by-one using the column writer classes.

    DirectRowSet rs = DirectRowSet.fromSchema(allocator, schema);
    RowSetWriter writer = rs.writer();
    writer.scalar("id").setInt(1);
    writer.scalar("name").setString("kiwi");
    writer.save();
    ...
    final SingleRowSet rowSet = writer.done();

Some things to note:

Here we saw a number of row set subclasses. DirectRowSet holds a writeable row set.
SingleRowSet holds a readable row set which may or may not have a single-batch (SV2) selection vector. In our case, it has no selection vector.
The RowSetWriter is a kind of TupleWriter that provides extra methods to work with entire rows, such as the save() method that says that the row is complete. (TupleWriter is also used to write to Map vectors.)
The row set writer is always ready to write a row, so there is no "start row" method here. (Note that there is such a method in the result set loader as we'll see later.)
If you omit the call to save(), the row set writer will happily overwrite any existing value in the current row. This is done deliberately to handle advanced use cases.
The scalar(name) method looks up a ColumnWriter by name.
The returned column writer has many different set methods. We use setInt() and setString() here.
The setString() method is a convenience method: it converts a Java string into the byte array required by the vector. If you already have a byte array, you can call the setBytes() method instead.
Every scalar writer supports all the set methods. This avoids the need for casting to the correct writer type. Also, as we'll see later, it allows automatic type conversions when configured to do so.
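
To make the save() behavior concrete, here is a minimal plain-Java model of "write, then save". It is only a sketch: RowWriterSketch is a hypothetical class with a single int column, not Drill's RowSetWriter, and the fixed capacity is an assumption made to keep the example short.

```java
import java.util.Arrays;

// Plain-Java model of a row writer with one int column; not Drill's RowSetWriter.
final class RowWriterSketch {
  private final int[] ids;
  private int rowIndex = 0;  // The row currently being written.

  RowWriterSketch(int capacity) {
    ids = new int[capacity];
  }

  // Writes into the current row, replacing any value already there.
  void setId(int value) {
    ids[rowIndex] = value;
  }

  // Marks the current row as complete and moves to the next one.
  void save() {
    rowIndex++;
  }

  // Returns only the rows that were saved.
  int[] done() {
    return Arrays.copyOf(ids, rowIndex);
  }

  public static void main(String[] args) {
    RowWriterSketch writer = new RowWriterSketch(4);
    writer.setId(1);
    writer.save();
    writer.setId(2);  // Not saved...
    writer.setId(3);  // ...so this overwrites the 2.
    writer.save();
    writer.setId(4);  // Never saved, so it is not part of the result.
    System.out.println(Arrays.toString(writer.done()));  // [1, 3]
  }
}
```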

Scalar Vectors

The term "scalar" above refers to vectors that hold a single value per row.

Non-nullable fixed-width vectors: provide a simple array of values:

Simple Value Vector

Non-nullable variable-width vectors: a combination of two vectors: a buffer that contains variable-size chunks of data, along with an offset vector that points to the start of each data value:

Variable-Width Vector

(Thanks to the Drill documentation team for the images!)

In both cases, the call to set() (for writing) or get() (for reading) converts the vector value to/from the corresponding Java data type. Typically the data type is obvious (int, String, double, etc.)

The accessors convert most integral types to a Java int (TinyInt, SmallInt, Int, UInt1, UInt2). (There is no advantage to having, say, setShort() or setByte() methods.)

Larger integrals use long (BigInt, UInt4).

Floating point values use double (Float4, Float8).

Date/time types use the Joda classes. (Limitations of the Java 8 date/time classes prevent their use with Drill's vectors.)
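
The variable-width layout described above (a data buffer plus an offset vector) can be modeled in a few lines of plain Java. This is a sketch under stated assumptions, not Drill's vector code: VarWidthSketch is a hypothetical class, it stores strings as UTF-8 bytes, and it keeps one extra offset entry to mark the end of the last value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Plain-Java model of a non-nullable variable-width vector; not Drill's implementation.
final class VarWidthSketch {
  private byte[] data = new byte[16];
  // offsets[i] is the start of value i; offsets[count] is the end of the data.
  private int[] offsets = {0};
  private int count = 0;

  // Appends the value's bytes to the data buffer and records where they end.
  void add(String value) {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
    int start = offsets[count];
    int end = start + bytes.length;
    if (end > data.length) {
      data = Arrays.copyOf(data, Math.max(end, data.length * 2));
    }
    System.arraycopy(bytes, 0, data, start, bytes.length);
    offsets = Arrays.copyOf(offsets, count + 2);
    offsets[++count] = end;
  }

  // Uses two adjacent offsets to find the value's start and length.
  String get(int index) {
    int start = offsets[index];
    int length = offsets[index + 1] - start;
    return new String(data, start, length, StandardCharsets.UTF_8);
  }

  int[] offsets() {
    return Arrays.copyOf(offsets, count + 1);
  }

  public static void main(String[] args) {
    VarWidthSketch vector = new VarWidthSketch();
    vector.add("kiwi");
    vector.add("watermelon");
    System.out.println(Arrays.toString(vector.offsets()));  // [0, 4, 14]
    System.out.println(vector.get(1));                      // watermelon
  }
}
```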

Caching Column Writers

The above used the "get by name" methods to simplify the code. You'll want to optimize production code. You can do so by referencing columns by position (as defined by the schema):

    writer.scalar(0).setInt(1);
    writer.scalar(1).setString("kiwi");

Or, you can cache the column writers:

    RowSetWriter writer = rs.writer();
    ScalarWriter idWriter = writer.scalar("id");
    ScalarWriter nameWriter = writer.scalar("name");
    idWriter.setInt(1);
    nameWriter.setString("kiwi");
    writer.save();
    ...

Note that the set() methods themselves are heavily optimized: they do the absolute minimum work to write your value into the underlying value vector. This consists of a couple of checks (for empty slots and to detect when the vector is full). Using the column writers has been shown to be at least as efficient as using the value vector Mutator classes (and, for non-nullable and array values, much faster.)
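
The difference between the three access styles is just how often a name is resolved. The following plain-Java model is a hypothetical illustration (WriterCacheSketch and IntColumn are not Drill classes): it counts name lookups so the effect of caching a writer is visible.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of looking up column writers; not Drill's TupleWriter.
final class WriterCacheSketch {

  // A trivial column writer that stores one int per row.
  static final class IntColumn {
    final int[] values = new int[1024];

    void setInt(int row, int value) {
      values[row] = value;
    }
  }

  private final Map<String, Integer> indexByName = new HashMap<>();
  private final IntColumn[] columns;
  int nameLookups = 0;  // How many times a name had to be resolved.

  WriterCacheSketch(String... names) {
    columns = new IntColumn[names.length];
    for (int i = 0; i < names.length; i++) {
      indexByName.put(names[i], i);
      columns[i] = new IntColumn();
    }
  }

  // Lookup by name: one hash lookup per call.
  IntColumn scalar(String name) {
    nameLookups++;
    return columns[indexByName.get(name)];
  }

  // Lookup by position: a plain array access.
  IntColumn scalar(int index) {
    return columns[index];
  }

  public static void main(String[] args) {
    WriterCacheSketch writer = new WriterCacheSketch("id", "qty");
    for (int row = 0; row < 1000; row++) {
      writer.scalar("id").setInt(row, row);  // Resolves "id" on every row.
    }
    System.out.println(writer.nameLookups);  // 1000

    IntColumn qty = writer.scalar("qty");    // Resolved once, then reused.
    for (int row = 0; row < 1000; row++) {
      qty.setInt(row, row * 2);
    }
    System.out.println(writer.nameLookups);  // 1001
  }
}
```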

Reading a Row Set

Now that you have a record batch, the next step is to do something with it. The simplest thing you can do (in a test) is to print the record batch so you can see what you have:

    rowSet.print();

Output:

You can also verify vectors. Suppose we want to verify that the two forms of writing to vectors above produce the same record batch:

   RowSet rs1 = // Build using RowSetBuilder
   RowSet rs2 = // Build using column writers

   RowSetUtilities.verify(rs1, rs2);

The above takes the first argument as the "expected" value and the second as the "actual", then compares the schemas and values. This is how we use the row set framework to verify the result of some operation on a record batch (including the result of an entire query.)

If we want to work with individual values, we can use the column readers which work much like the column writers. Let's assume we've created a print() method that will print a value.

   final RowSetReader reader = rowSet.reader();
   while (reader.next()) {
     print(reader.scalar("id").getInt());
     print(reader.scalar("name").getString());
   }

Notes:

The RowSetReader is a specialized TupleReader that iterates over records in a batch by calling the next() method.
The reader starts positioned before the first record, so you must call next() to move to the first record.
Access to column readers works very much like the writer example. You can cache the column readers for performance, or access them by column index.
For reading, you call get() methods of the type appropriate for your column.
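
The "starts before the first record" rule is the usual cursor convention. A minimal plain-Java model, assuming a hypothetical RowCursorSketch class over a single int column (this is not Drill's RowSetReader):

```java
// Plain-Java model of a row cursor; not Drill's RowSetReader.
final class RowCursorSketch {
  private final int[] ids;
  private int position = -1;  // Positioned before the first row.

  RowCursorSketch(int... ids) {
    this.ids = ids.clone();
  }

  // Advances to the next row; returns false when no rows remain.
  boolean next() {
    if (position + 1 >= ids.length) {
      return false;
    }
    position++;
    return true;
  }

  // Reads the value at the current row.
  int getInt() {
    if (position < 0) {
      throw new IllegalStateException("Call next() before reading");
    }
    return ids[position];
  }

  public static void main(String[] args) {
    RowCursorSketch reader = new RowCursorSketch(1, 2);
    while (reader.next()) {
      System.out.println(reader.getInt());  // 1, then 2
    }
  }
}
```

Because the cursor starts at -1, the same while loop handles an empty batch without a special case.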

Nullable Columns

Thus far we've shown how to work with non-nullable columns and values. Our name column is nullable, however. How do we work with nulls? Drill defines two kinds of nullable vectors:

Nullable fixed-width vectors: a combination of two fixed-width vectors: one for the data, another for the "null bits" (really, the is-set bit: the value is 1 if set, 0 if null.)

Nullable Fixed-width Vector

Note that, in actual practice, the is-set flags are bytes, not bits as suggested by the diagram.

Nullable variable-width vectors: a combination of two vectors (one of which itself contains two vectors): an is-set vector and a variable-width vector.

When writing, we can either set a column to null explicitly:

      nameWriter.setNull();

Or, we can simply omit writing any value to the column:

    idWriter.setInt(1);
    // No value set for the `name` column.
    writer.save();

When reading, we must first ask if the column is NULL:

    ScalarReader nameReader = reader.scalar("name");
    if (nameReader.isNull()) {
      print("null");
    } else {
      print(nameReader.getString());
    }
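
The is-set mechanism explains why omitting a write yields a null. The sketch below is a plain-Java model of a nullable fixed-width vector with one is-set byte per row. NullableIntSketch is hypothetical, not a Drill class, and throwing when a null is read is a choice made for this model only.

```java
// Plain-Java model of a nullable fixed-width vector; not Drill's implementation.
final class NullableIntSketch {
  private final byte[] isSet;  // 1 = value present, 0 = null.
  private final int[] values;

  NullableIntSketch(int rowCount) {
    isSet = new byte[rowCount];  // Zero-filled, so every row starts out null.
    values = new int[rowCount];
  }

  void setInt(int row, int value) {
    values[row] = value;
    isSet[row] = 1;
  }

  void setNull(int row) {
    isSet[row] = 0;
  }

  boolean isNull(int row) {
    return isSet[row] == 0;
  }

  int getInt(int row) {
    if (isNull(row)) {
      throw new IllegalStateException("Row " + row + " is null");
    }
    return values[row];
  }

  public static void main(String[] args) {
    NullableIntSketch vector = new NullableIntSketch(3);
    vector.setInt(0, 42);
    vector.setNull(1);
    // Row 2 is never written.
    for (int row = 0; row < 3; row++) {
      System.out.println(vector.isNull(row) ? "null" : String.valueOf(vector.getInt(row)));
    }
    // Prints 42, null, null
  }
}
```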

Arrays (Repeated Vectors)

Arrays in Drill use an offset vector, similar to variable-width vectors. To model this, array accessors introduce another level of structure, as in JSON: the array writer and reader. You use the array accessor to traverse the array, then a value-specific accessor (typically scalar) to work with each value. First define a schema:

  @Test
  public void arrayTest() {
    final TupleMetadata schema = new SchemaBuilder()
      .add("id", MinorType.INT)
      .addArray("names", MinorType.VARCHAR)
      .buildSchema();
  }

We can use the RowSetBuilder with some convenience functions:

      final RowSet rowSet = new RowSetBuilder(allocator, schema)
        .addRow(1, strArray("apple", "manzana"))
        .addRow(2, strArray("watermelon", "sandía"))
        .build();

We can use the column writers:

    RowSetWriter writer = rs.writer();
    ArrayWriter arrayWriter = writer.array("names");
    ScalarWriter nameWriter = arrayWriter.scalar();
    writer.scalar("id").setInt(1);
    nameWriter.setString("apple");
    arrayWriter.save();
    nameWriter.setString("manzana");
    arrayWriter.save();
    writer.save();

Notes:

Notice the two-level structure as described earlier: the ArrayWriter that contains a ScalarWriter.
The ArrayWriter iterates over the array. The writer starts pointing to the first entry for the current row. Call save() to advance to the next position.
As before, the writers live for the life of the RowSetWriter and can be cached if desired.

To read an array:

   final RowSetReader reader = rowSet.reader();
   ArrayReader arrayReader = reader.array("names");
   ScalarReader nameReader = arrayReader.scalar();
   while (reader.next()) {
     print(reader.scalar("id").getInt());
     while (arrayReader.next()) {
       print(nameReader.getString());
     }
   }
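
The two levels of save() map naturally onto the offset vector. Here is a plain-Java model of a repeated int column. It is a sketch only: RepeatedIntSketch and its method names are hypothetical, and the real accessors split this work between the array writer and the row writer.

```java
import java.util.Arrays;

// Plain-Java model of a repeated (array) int vector; not Drill's implementation.
final class RepeatedIntSketch {
  private int[] values = new int[8];
  private int valueCount = 0;
  // Elements of row r live in values[offsets[r]] up to, but excluding, values[offsets[r + 1]].
  private int[] offsets = {0};
  private int rowCount = 0;

  // Array-level save: appends one element to the row being written.
  void addElement(int value) {
    if (valueCount == values.length) {
      values = Arrays.copyOf(values, values.length * 2);
    }
    values[valueCount++] = value;
  }

  // Row-level save: closes the current row's array.
  void saveRow() {
    offsets = Arrays.copyOf(offsets, rowCount + 2);
    offsets[++rowCount] = valueCount;
  }

  // Returns a copy of the elements of one row.
  int[] row(int index) {
    return Arrays.copyOfRange(values, offsets[index], offsets[index + 1]);
  }

  public static void main(String[] args) {
    RepeatedIntSketch vector = new RepeatedIntSketch();
    vector.addElement(10);
    vector.addElement(20);
    vector.saveRow();
    vector.saveRow();  // A row with an empty array.
    vector.addElement(30);
    vector.saveRow();
    System.out.println(Arrays.toString(vector.row(0)));  // [10, 20]
    System.out.println(Arrays.toString(vector.row(1)));  // []
    System.out.println(Arrays.toString(vector.row(2)));  // [30]
  }
}
```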

Maps (Structs)

The final type you should know is the Drill map. As we've said multiple times, a Drill "Map" is not a true map: it is closer to a C or Hive "struct". Every row has the same set of columns. (In a true map, each row would have an independent set of name/value pairs.) The map is, in fact, little different from Drill's top-level row: both contain columns indexed by name (and, in the EVF, by position.) As a result, both are created using the same mechanisms:

TupleMetadata to describe both a row and a struct.
TupleWriter to write both a row and a struct.
TupleReader to read both a row and a struct.

In fact, working with maps is nearly identical to working with rows (except that maps don't contain the row-specific methods.)

To define a schema:

  @Test
  public void mapTest() {
    final TupleMetadata schema = new SchemaBuilder()
      .add("id", MinorType.INT)
      .addMap("names")
        .addNullable("english", MinorType.VARCHAR)
        .addNullable("spanish", MinorType.VARCHAR)
        .resumeSchema()
      .buildSchema();
  }

Notes:

The addMap() method creates the map and returns a builder for that map.
Build the map just as you built the top-level row.
Call the resumeSchema() method to mark the map as complete and to return to building the top-level row.
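
The addMap() / resumeSchema() pairing is an instance of the nested fluent builder pattern. The following plain-Java model shows the idea; it is only a sketch. SchemaSketch and MapSketch are hypothetical classes, types are plain strings, and the built "schema" is just a list of descriptions rather than real metadata objects.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of a nested fluent schema builder; not Drill's SchemaBuilder.
final class SchemaSketch {
  private final List<String> columns = new ArrayList<>();

  SchemaSketch add(String name, String type) {
    columns.add(name + " " + type + " NOT NULL");
    return this;
  }

  // Switches to a builder for the map's members.
  MapSketch addMap(String name) {
    return new MapSketch(this, name);
  }

  List<String> buildSchema() {
    return new ArrayList<>(columns);
  }

  // Builder for one map. It has no buildSchema(), so the caller must resume the parent.
  static final class MapSketch {
    private final SchemaSketch parent;
    private final String mapName;
    private final List<String> members = new ArrayList<>();

    private MapSketch(SchemaSketch parent, String mapName) {
      this.parent = parent;
      this.mapName = mapName;
    }

    MapSketch addNullable(String name, String type) {
      members.add(name + " " + type);
      return this;
    }

    // Completes the map, adds it to the parent, and returns the parent builder.
    SchemaSketch resumeSchema() {
      parent.columns.add(mapName + " MAP<" + String.join(", ", members) + ">");
      return parent;
    }
  }

  public static void main(String[] args) {
    List<String> schema = new SchemaSketch()
        .add("id", "INT")
        .addMap("names")
          .addNullable("english", "VARCHAR")
          .addNullable("spanish", "VARCHAR")
          .resumeSchema()
        .buildSchema();
    System.out.println(schema);
    // [id INT NOT NULL, names MAP<english VARCHAR, spanish VARCHAR>]
  }
}
```

Returning a different builder type from addMap() means the compiler, not a runtime check, stops you from finishing the schema while a map is still open.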

You can build a map using the RowSetBuilder:

      final RowSet rowSet = new RowSetBuilder(allocator, schema)
        .addRow(1, map("apple", "manzana"))
        .addRow(2, map("watermelon", "sandía"))
        .build();

Using the column accessors:

    RowSetWriter writer = rs.writer();
    TupleWriter nameMap = writer.tuple("names");
    ScalarWriter englishWriter = nameMap.scalar("english");
    ScalarWriter spanishWriter = nameMap.scalar("spanish");
    writer.scalar("id").setInt(1);
    englishWriter.setString("apple");
    spanishWriter.setString("manzana");
    writer.save();

Notes:

You access writers within the map exactly as you do for those in the row.
There is no save() method to call for the map since there is exactly one map value per row.

To read a map:

   final RowSetReader reader = rowSet.reader();
   TupleReader mapReader = reader.tuple("names");
   ScalarReader englishReader = mapReader.scalar("english");
   ScalarReader spanishReader = mapReader.scalar("spanish");
   while (reader.next()) {
     print(reader.scalar("id").getInt());
     print(englishReader.getString());
     print(spanishReader.getString());
   }

Advanced Accessors

Drill provides a number of advanced data types that you can also use:

Repeated Map: Represented as an array accessor that contains a map accessor.
Repeated List: Represented as an array accessor that contains an array accessor.
Union: Represented as a "Variant" accessor which acts like a map accessor, except that the members are indexed by type rather than name.
List: Represented as an array accessor that contains a union accessor.

Of these, only Repeated Map is fully supported in Drill. Although the accessors work for all types (some required considerable bug fixes to make the underlying vectors work), most Drill operators do not support these vector types. Unions and Lists are listed as "experimental" in the documentation (and have been for many years.)

The List type is particularly complex since it can act like a Repeated vector if it has just one type, or a Repeated Union if it has multiple types.

The general advice is to stick to scalar types, repeated scalars, maps and repeated maps. Expect considerable work throughout Drill to get the other types to work.
