EVF Tutorial Scan Framework Creator

Paul Rogers edited this page May 27, 2019 · 7 revisions

Scan Framework Creator

We've created a batch reader for the log plugin. But thus far it has been an "orphan": nothing calls it. Let's fix that.

EVF is a general-purpose framework: it handles many kinds of scans. We must customize it for each specific reader. Rather than doing so by creating subclasses, we instead assemble the pieces we need through composition by providing a framework builder class. This builder class is what allows the Easy framework to operate with both "legacy" and EVF-based readers.

Note: The description here depends on the PR for DRILL-7261, which has not yet been merged into master.

Create Row Batch Reader

Prior to the EVF, Easy format plugins were based on the original ScanBatch. At the start of execution, the Easy framework calls getRecordReader() in your plugin class to create a record reader for each file split. The Easy framework then passes all the readers to the scan batch.

With EVF, we use the record batch reader we just created. Instead of creating all readers up-front, EVF creates them on-the-fly as needed. EVF will create a separate instance for each file split. The easiest way to do so is to override the following method in your plugin class:

TODO: Replace with code for log reader.

  @Override
  public ManagedReader<? extends FileSchemaNegotiator> newBatchReader(
      EasySubScan scan, OptionManager options) throws ExecutionSetupException {
    TextParsingSettingsV3 settings = new TextParsingSettingsV3();
    settings.set(getConfig());
    return new CompliantTextBatchReader(settings);
  }

In an advanced case, we could even create a different reader depending on some interesting condition. For example, the Parquet reader has both a "new" and "old" version with different capabilities. We have access to the scan, options, and the format config (via the getConfig() method).

This method is not yet called anywhere, so the plugin should still run using the old reader.
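
The difference between creating all readers up-front and creating them on demand can be sketched in plain Java. Everything in this sketch (SplitReader, EagerScan, LazyScan) is a hypothetical stand-in rather than a Drill class; it only illustrates the timing of reader creation, not how EVF is actually implemented.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for a per-split reader; not a Drill class.
class SplitReader {
  static int created = 0;   // counts how many readers have been constructed
  final String split;

  SplitReader(String split) {
    this.split = split;
    created++;
  }
}

// Legacy style: every reader is built before the scan starts.
class EagerScan {
  final List<SplitReader> readers = new ArrayList<>();

  EagerScan(List<String> splits) {
    for (String split : splits) {
      readers.add(new SplitReader(split));
    }
  }
}

// EVF style: a reader is built only when the scan reaches its split.
class LazyScan {
  private final Iterator<String> splits;
  private final Function<String, SplitReader> factory;

  LazyScan(List<String> splits, Function<String, SplitReader> factory) {
    this.splits = splits.iterator();
    this.factory = factory;
  }

  // Returns the reader for the next split, or null when no splits remain.
  SplitReader nextReader() {
    return splits.hasNext() ? factory.apply(splits.next()) : null;
  }
}
```

With three splits, constructing an EagerScan builds three readers immediately, while a LazyScan builds none until nextReader() is called.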

Create the Scan Framework Builder

EVF supports a number of "scan frameworks" and a wide variety of options. We use the "builder" pattern to specify how we want the scan to work: we create a builder, pass it options to configure the framework, then let the Easy scan framework do the actual building for us. Here's how we configure the file scan framework by adding a method to the plugin class:

  @Override
  protected FileScanBuilder frameworkBuilder(
      FragmentContext context, EasySubScan scan) throws ExecutionSetupException {
    FileScanBuilder builder = new FileScanBuilder();
 
    // The default type of regex columns is nullable VarChar,
    // so let's use that as the missing column type.

    builder.setNullType(Types.optional(MinorType.VARCHAR));
    return builder;
  }

The log reader reads from a file, so we use the FileScanBuilder class. We could support the columns column to read into an array, like CSV, if we wanted.

We call setNullType() to define a type to use for missing columns rather than the traditional nullable INT. We observe that the native type of a regex column is nullable VarChar. So, if the user asks for a column that we don't have, we should use that same type so that types remain unchanged when the user later decides to define that column.

After you add this method, the log reader will still use the old version of the reader because we've not told the Easy framework to call the method we just created.
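
The reasoning behind setNullType() can be illustrated with a small sketch. ProjectionSketch and its resolve() method are hypothetical helpers, not part of Drill, and types are represented as plain strings for simplicity. The sketch resolves the type of each projected column, falling back to a configurable "null type" for columns the reader does not provide.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Drill: resolves the type of each
// projected column, using nullType for columns the reader lacks.
class ProjectionSketch {
  static Map<String, String> resolve(
      List<String> projected,
      Map<String, String> readerColumns,
      String nullType) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String name : projected) {
      // Known columns keep their own type; missing ones get the null type.
      result.put(name, readerColumns.getOrDefault(name, nullType));
    }
    return result;
  }
}
```

If the null type matches the reader's native column type, a missing column resolves to the same type before and after the user defines it. With nullable INT as the null type, the column's type would change once it is defined.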

Alternative: Define the Row Batch Reader Creator

The above method handles typical cases. If you find you have a special case (such as in the text reader), you can gain more control by defining a "reader creator" class. (The Easy framework provides a default version that calls the above method.) Here's an example, defined as a nested class within LogFormatPlugin:

  private static class LogReaderFactory extends FileReaderFactory {

    private final LogFormatPlugin plugin;

    public LogReaderFactory(LogFormatPlugin plugin) {
      this.plugin = plugin;
    }

    @Override
    public ManagedReader<? extends FileSchemaNegotiator> newReader() {
       return new LogBatchReader(plugin.getConfig());
    }
  }

This is simple enough: EVF calls the newReader() method when it is ready to read the next file split. The split itself is obtained as shown previously. The advantage of this technique is that we can pass additional information from our plugin into the batch reader if we need more than what the Easy framework provides.

We then use our class as follows:

  @Override
  protected FileScanBuilder frameworkBuilder(
      FragmentContext context, EasySubScan scan) throws ExecutionSetupException {
    FileScanBuilder builder = new FileScanBuilder();
    builder.setReaderFactory(new LogReaderFactory(this));
    ...

We specify the builder for our batch readers by calling setReaderFactory() with an instance of our reader creator.

If you use this method, then you don't need to implement the newBatchReader() method on the plugin as we did earlier.
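
The relationship between the plugin-level newBatchReader() hook and a custom reader factory can be modeled roughly as follows. All names here (SketchReader, SketchReaderFactory, SketchPlugin and its subclasses) are hypothetical and are not the Drill classes. The sketch assumes, as described above, that the default factory simply delegates to the plugin's hook.

```java
// Hypothetical stand-ins; none of these are Drill classes.
interface SketchReader {
  String describe();
}

interface SketchReaderFactory {
  SketchReader newReader();
}

class SketchPlugin {
  final String pattern;   // stands in for plugin-specific configuration

  SketchPlugin(String pattern) {
    this.pattern = pattern;
  }

  // Simple hook: plugins that need nothing special override this.
  SketchReader newBatchReader() {
    throw new UnsupportedOperationException("newBatchReader() not implemented");
  }

  // Default factory: delegates to the hook each time a reader is needed.
  SketchReaderFactory readerFactory() {
    return this::newBatchReader;
  }
}

// Typical case: override only the hook.
class SimpleSketchPlugin extends SketchPlugin {
  SimpleSketchPlugin(String pattern) {
    super(pattern);
  }

  @Override
  SketchReader newBatchReader() {
    return () -> "reader from newBatchReader()";
  }
}

// Special case: supply a factory that passes plugin state to the reader.
class CustomSketchPlugin extends SketchPlugin {
  CustomSketchPlugin(String pattern) {
    super(pattern);
  }

  @Override
  SketchReaderFactory readerFactory() {
    return () -> {
      SketchReader reader = () -> "reader for pattern " + pattern;
      return reader;
    };
  }
}
```

In this model, CustomSketchPlugin never overrides newBatchReader(), because its own factory replaces the default one that would have called it.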

Select the Traditional or Enhanced Scan Framework

We are now ready to switch over to the new, enhanced (EVF-based) reader. To do so, we simply set one option in our plugin configuration:

  private static EasyFormatConfig easyConfig(Configuration fsConf, LogFormatConfig pluginConfig) {
    EasyFormatConfig config = new EasyFormatConfig();
    ...
    config.useEnhancedScan = true;
    return config;
  }

With this change, the EVF-based version is now live.

Conditionally Selecting the Original and EVF-Based Readers

If you are especially cautious, you can leverage the framework builder mechanism to offer both the new and old versions of your reader. Just override the useEnhancedScan() method. By default, the method just returns the option we set above:

  protected boolean useEnhancedScan(OptionManager options) {
    return easyConfig.useEnhancedScan;
  }

But we could instead select the framework based on a system/session option, as was done with the "v2" and "v3" versions of the text (CSV) reader in Drill 1.16:

  @Override
  protected boolean useEnhancedScan(OptionManager options) {
    return options.getBoolean(ExecConstants.ENABLE_V3_TEXT_READER_KEY);
  }
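
The selection logic can be tried out in isolation with a small model. SketchOptions, SketchLogPlugin, and the option name sketch.log.use_evf are all hypothetical; they are not Drill's OptionManager or any real Drill option. Unlike the override above, this sketch falls back to the plugin's configured default when the option is unset, which is one possible design choice rather than what Drill does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical option store; stands in for a system/session option manager.
class SketchOptions {
  private final Map<String, Boolean> values = new HashMap<>();

  void set(String key, boolean value) {
    values.put(key, value);
  }

  // Returns the stored value, or defaultValue when the option is unset.
  boolean getBoolean(String key, boolean defaultValue) {
    return values.getOrDefault(key, defaultValue);
  }
}

// Hypothetical plugin that chooses between the two scan frameworks.
class SketchLogPlugin {
  static final String USE_EVF_KEY = "sketch.log.use_evf";  // made-up option name

  private final boolean configDefault;  // plays the role of useEnhancedScan in the config

  SketchLogPlugin(boolean configDefault) {
    this.configDefault = configDefault;
  }

  // The session option wins when set; otherwise use the configured default.
  boolean useEnhancedScan(SketchOptions options) {
    return options.getBoolean(USE_EVF_KEY, configDefault);
  }

  String scanKind(SketchOptions options) {
    return useEnhancedScan(options) ? "enhanced" : "legacy";
  }
}
```

A cautious rollout could keep the configured default at false and let individual sessions opt in by setting the option.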

Test

With this method in place, our new version is "live". You should use your unit tests to step through the new code to make sure it works -- and to ensure you understand the EVF, or at least the parts you need.

Next Steps

We've now completed a "bare bones" conversion to the new framework. We'd be fine if we stopped here.

The new framework offers additional features that can further simplify the log format plugin. We'll look at those topics in the next section.


Next: Discover Schema While Reading
