Storage Plugin Model

Drill exists to rapidly scan (and analyze) data from a variety of data sources. Each data source is represented in Drill as a storage plugin. This discussion describes how the storage plugin system works from a number of angles. The goal is to provide background information for those who maintain the existing storage plugins, and perhaps ease the learning curve for those creating new storage plugins.

Topics

Overview

A storage plugin provides access to a set of tables. In Hadoop-based systems, tables are implemented as files in the HDFS file system, so Drill's primary storage plugin is the FileSystemPlugin. Since a file system stores many kinds of files, the file system plugin is associated with a collection of format plugins (see below). Drill supports any arbitrary storage plugin, including for systems other than file systems.

The storage plugin is a plan-time concept and is not directly available at run time (while fragments execute.) The storage plugin itself is meant to be a light-weight description of the external system and, as such, can be created and discarded frequently, even within the scope of a single query plan session.

A storage plugin itself is a "type": it provides system-specific behavior but typically provides the ability to define multiple storage system instances for a given storage plugin type. It is these instances that we see in the storage plugin section of the Drill UI: we often call these storage plugins, but they are really storage plugin configurations: the information needed to define a specific instance.

Every storage plugin must implement the StoragePlugin interface, often by subclassing the AbstractStoragePlugin class. Each storage plugin also defines a Jackson-serialized configuration object which extends StoragePluginConfig: it is the JSON-serialized version of this object which we see in Drill's storage plugin web UI.

Drill provides a top-level namespace of storage plugins. The names here are not the storage plugin names (types) themselves, but rather the names associated with storage plugin configurations. (The schema name space also holds the names of schemas, which will be discussed later.) Then each storage plugin defines a name space for tables:

SELECT * FROM `myPlugin`.`myTable`

Here, myPlugin is the name of a storage plugin configuration (that maps to a storage plugin via the type field in the configuration), and myTable is a table defined by that plugin instance.

Scan Operator

With the plan complete, attention now shifts to execution. The foreman serializes the plan (including the sub scan operator descriptions), then ships the plan to each Drillbit for execution. Each minor fragment is run by a dedicated thread. The first task is to create the record batch (operator implementation) for each operator (description). To do this, the Foreman deserializes the plan to recreate the operator tree, using Jackson to create the plan objects as described above.

Next, the fragment executor must create an record batch each operator definition via the indirection of a record batch creator which implements BatchCreator. This is done using a process similar to the way that plugin configurations are linked to plugins: via a map from operator definition class to record batch creator constructor.

During start the Drillbit searches the scan path for classes which implement BatchCreator. For each, the code gets the list of implemented interfaces, one of which must be BatchCreator. BatchCreator itself is a parameterized type, so the code next looks for the type of the parameterization argument. For example:

public class MockScanBatchCreator implements BatchCreator<MockSubScanPOP> {

The creator class must have a zero-argument constructor. If so, that constructor is placed into a map keyed by the parameterized class (here MockSubScanPOP.) The result: a map from physical operator (pop) definition to creator constructor.

Given this map, the fragment executor can easily find the constructor for the batch creator. From the constructor the executor gets the batch creator class and creates an instance. Finally, the executor calls:

  public ScanBatch getBatch(FragmentContext context, MockSubScanPOP config, List<RecordBatch> children)

To get an instance of the scan batch (operator implementation) given the fragment context, operator definition and the list of previously-created children (upstream operators) for this operator. Since our focus here is on scanners, the child list will always be empty.

Given the batch created above, the fragment executor initializes the batch by calling the setup method. Operators that know their schema can set up the schema here. Otherwise, schema setup can wait until later.

Most scanners extend ScanBatch, but doing so is not required; it is simply a convenience. ScanBatch creates an operator-specific RecordReader subclass to handle actual reading, allowing the generic ScanBatch operator implementation to handle interfacing into the Drill operator hierarchy.

Runtime Events

The fragment now runs. The tree calls next() on the ScanBatch to return a batch of records. (By convention, the first batch should return just a schema.)

ScanBatch calls allocate() on the RecordReader to set up the vectors for the record batch.
Repeatedly calls the next() method on the RecordReader to read a batch of rows into the vectors allocated above.
Sets the row count on each of the value vectors.
Determines if the batch includes a schema different than the previous batch.
Passes along the proper status code to the caller.

Of course, some of these steps involve more than the summary suggests.

ScanBatch defines a Mutator class which must be used to build the set of value vectors. The Mutator takes a field schema in the form of a MaterializedField.
Uses the TypeHelper to convert from the field schema to a value vector instance.
Registers the value vector with the vector container associated with the ScanBatch.
Adds the value vector to the field vector map, indexed by field name (actually, a field path for nested structures.)

Comments on Current Design

A number of improvements are possible to the current design.

Each plugin should have a registration file that identifies the class name of the plugin. It is far cheaper to search for such files than to scan every class looking for those that extends some particular interface.
Each plugin should use annotations to provide static properties such as the plugin tag name (file, mock or whatever) instead of looking for the type of the first constructor argument as is done now.
Rather than searching for a known constructor, require that plugins have no constructor or a zero-argument constructor. In the plugin interface define a register method that does what the three-argument constructor currently does. This moves registration into the API rather than as a special non-obvious form of the constructor.
Each plugin configuration should name its storage plugin using the tag from the plugin definition.

The above greatly simplifies the storage plugin system:

Definition files name the storage plugin class.
The class (or definition entry) gives a tag to identify the plugin.
Given a definition file, Drill builds a (tag --> plugin) table.
Given a storage plugin definition, the plugin tag in the definition maps, via the above table, to the plugin itself.

Other changes:

Storage plugins should have extended lives. As it is, they are created multiple times for each query, and thus many times across queries. This makes it had for the plugin to hold onto state (such as a connection to an external system, cached data, etc.)
Clearly separate the planner-time (transient) objects and the persistent plan objects passed across the network. This will allow the planner-only objects to contain handles to internal state (something that now is impossible given the internal Jackson serializations.)
Avoid unnecessary copies of the planning-time objects by tightening up the API. (Provide columns once rather than twice, etc.)

Storage Plugin Model

Topics

Overview

Scan Operator

Runtime Events

Comments on Current Design

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!