Paul Rogers edited this page Jan 15, 2018 · 8 revisions

As previously mentioned, this project upgraded two readers to test out the various frameworks: the CSV reader and the JSON reader. The JSON reader poses special challenges because of a set of semi-supported features. As we will see, upgrading JSON to handle the supported features was simple. Problems arose, however, in the semi-supported and unsupported features of JSON. Further issues surfaced with the realization that Drill's design for handling JSON has gaping holes that we've tended to ignore, perhaps because JSON is not used as frequently as we'd expect.

Requirements

To understand the JSON work, it is best to start with requirements.

  • Use the Jackson JSON parser to produce a stream of JSON tokens.
  • Create a record batch schema based on the type of the first value.
  • Support two forms of records: a sequence of top-level objects, or a single array of objects.
  • Read JSON scalars as either typed values or text (the so-called "all-text" mode.)
  • Support maps and arrays of maps to any depth.
  • Support varying types (maybe.)
  • Support varying lists (maybe.)
  • Support two-dimensional lists.
  • Support null values in place of lists, maps or scalars.

The original code handles all of the above with some caveats:

  • Nulls cannot appear before a non-null value. (If they do, Drill guesses some arbitrary type.)
  • Varying types are supported only through the Union and List vectors, which are themselves unsupported.
  • Support for List vectors is very vague. List vectors themselves have many bugs, and other operators seem not to support a List of Unions.

The revised code has a different set of caveats:

  • Handles any number of leading nulls, as long as a non-null value appears in the first batch.
  • Handles List vectors, but other operators still don't handle them.
  • Does not handle varying types (unions).

The discussion that follows will explain this odd turn of events.

Structure

Structurally, the revised JSON reader follows the overall structure of the existing reader.

Layer            | Existing                   | Revised
Format Plugin    | Derived from Easy plugin   | Same, using new framework
Reader           | Derived from RecordReader  | Derived from ManagedReader
Semantic Parser  | JSONReader                 | JSONParser
Vector writer    | Complex writers            | Result set loader
Token Parser     | Jackson parser             | Jackson parser

The json package contains both the old and new JSON implementations. The old implementation is still used by the Drill function to parse JSON, by the Kafka plugin, and by the MapR-DB JSON format plugin.

Format Plugin

The JSONFormatPlugin is modified as described in the Easy Format Plugin section:

  • Uses the new EasyFormatConfig to configure the plugin.
  • Implements scanBatchCreator() to create the scan batch mechanism.
  • Provides the JsonScanBatchCreator class which builds up the scan operator from the required components, including the file scan framework and so on. Sets the type of null columns to nullable VarChar (a better guess than nullable Int.)
  • Implements the JsonReaderCreator class to create JSON batch readers as needed.

Batch Reader

As in the original implementation, the JsonBatchReader acts as a bridge between the scan operator and the JSON semantic parser:

  • Opens the file described by a FileWork object.
  • Retrieves JSON session options and passes them to the JSON semantic parser.
  • Sets up a "type negotiator" (see below).
  • On each call to next(), reads records from the semantic parser until the batch is full.
  • Releases resources on close().

JSON Loader

Continuing to follow the previous design, the new code defines a JsonLoader interface, akin to the prior JSONReader, that defines the services to read JSON records. In keeping with a theme for this project, the interface is dead simple:

public interface JsonLoader {
  boolean next();
  void endBatch();
  void close();
}

Here, next() reads one record.
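
As a rough illustration of how a batch reader might drive this interface, here is a minimal sketch. The ListLoader class, the readBatch() helper, the batch size, and the assumption that next() returns false at end of input are all hypothetical and are not taken from the Drill code; only the three interface methods come from the text above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class JsonLoaderSketch {

  // Same three methods as the JsonLoader interface above.
  interface JsonLoader {
    boolean next();
    void endBatch();
    void close();
  }

  // Hypothetical stand-in for the real loader: "records" are plain strings.
  static class ListLoader implements JsonLoader {
    private final Iterator<String> source;
    final List<String> batch = new ArrayList<>();
    int batchesEnded = 0;
    boolean closed = false;

    ListLoader(List<String> records) { source = records.iterator(); }

    // Assumption: returns false once the input is exhausted.
    @Override public boolean next() {
      if (!source.hasNext()) { return false; }
      batch.add(source.next());
      return true;
    }

    @Override public void endBatch() { batchesEnded++; }

    @Override public void close() { closed = true; }
  }

  // Reads records until the batch holds batchSize rows or the input ends.
  // Returns the number of rows read into this batch.
  static int readBatch(ListLoader loader, int batchSize) {
    loader.batch.clear();
    int rows = 0;
    while (rows < batchSize && loader.next()) {
      rows++;
    }
    loader.endBatch();
    return rows;
  }

  public static void main(String[] args) {
    ListLoader loader = new ListLoader(Arrays.asList("r1", "r2", "r3"));
    int rows;
    while ((rows = readBatch(loader, 2)) > 0) {
      System.out.println("batch of " + rows + " rows: " + loader.batch);
    }
    loader.close();
  }
}
```

In this sketch the caller, not the loader, decides when a batch is full; the loader only needs to know how to read one record and how to finish a batch.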

Semantic Parser

The bulk of the actual JSON parsing work resides in the new parser package.

Briefly, the semantic parser creates a parse tree to parse each JSON element. Thus, there is a parser for the root object, another for each field, and more for nested maps. The idea is to use a separate class for each JSON element, then combine them into an overall parse tree for the file. This approach contrasts with the flags-and-states approach of the prior version, which kept track of differences by setting a large variety of flags. That approach is very difficult to maintain, enhance and unit test, hence the revised approach.

JSON Loader Implementation

The JsonLoaderImpl class is the core of the semantic processor. It:

  • Holds onto the JSON parse options provided by the caller (the JsonBatchReader).
  • Reads and returns rows.
  • Maintains the root of the parse tree for the document.
  • Resolves ambiguous types (discussed below).
  • Handles errors.
  • Provides the input error recovery created by a community member.

Jackson Parser

The Jackson JSON parser does the low-level parsing, providing a series of JSON tokens.

It is worth noting some details of the Jackson parser. First, it maintains its own internal state machine about the syntax of the JSON document. It is therefore not a simple "tokenizer"; it is a full parser in its own right. This behavior means that it is very, very difficult to get the parser to recover from errors. Once the parser gets into an error state, it does not want to come back out.

Second, it means that the semantic layer parser does not have to worry about syntax errors or details. The Jackson parser handles, say, the colons that separate key/value pairs, handles the difference between quoted and unquoted identifiers, and will catch invalid token sequences. The semantic parser just worries about a stream of valid tokens.

Third, the parser cannot handle look-ahead. Look-ahead, however, is often needed, so we must provide our own implementation in the form of the TokenIterator class.

In general, look-ahead is difficult because we have to buffer the value of prior tokens. Fortunately, when parsing JSON, the only tokens that are subject to look-ahead are simple tokens such as {, [, and so on. So, the token iterator caches only simple tokens, but not token values.
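
The push-back idea can be sketched as follows. This is not the actual TokenIterator: the Token enum, the PushBackTokens class, and the detectRootForm() helper are hypothetical names, and a plain list of tokens stands in for the Jackson parser. The sketch assumes, as a simplification, that only tokens without text may be pushed back, so no token values ever need to be buffered.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class TokenLookAheadSketch {

  // Hypothetical token kinds; the real code uses the Jackson token type.
  enum Token {
    START_OBJECT, END_OBJECT, START_ARRAY, END_ARRAY, VALUE_NULL, // simple
    FIELD_NAME, VALUE_STRING;                                     // carry text

    boolean carriesText() { return this == FIELD_NAME || this == VALUE_STRING; }
  }

  // Wraps a token source and allows simple tokens to be pushed back.
  static class PushBackTokens {
    private final Iterator<Token> source;
    private final Deque<Token> pushedBack = new ArrayDeque<>();

    PushBackTokens(List<Token> tokens) { source = tokens.iterator(); }

    // Returns the next token, or null at end of input.
    Token next() {
      if (!pushedBack.isEmpty()) { return pushedBack.pop(); }
      return source.hasNext() ? source.next() : null;
    }

    // Caches only the token kind; tokens that carry text are rejected
    // (an assumption of this sketch) so values never need buffering.
    void unget(Token token) {
      if (token.carriesText()) {
        throw new IllegalArgumentException("Cannot push back " + token);
      }
      pushedBack.push(token);
    }
  }

  enum RootForm { OBJECT_SEQUENCE, SINGLE_ARRAY, EMPTY }

  // Uses one token of look-ahead to decide which record form the input uses.
  static RootForm detectRootForm(PushBackTokens tokens) {
    Token first = tokens.next();
    if (first == null) { return RootForm.EMPTY; }
    switch (first) {
      case START_ARRAY:
        return RootForm.SINGLE_ARRAY;     // the "[" is consumed
      case START_OBJECT:
        tokens.unget(first);              // let the object parser see "{"
        return RootForm.OBJECT_SEQUENCE;
      default:
        throw new IllegalStateException("Unexpected root token: " + first);
    }
  }

  public static void main(String[] args) {
    PushBackTokens tokens = new PushBackTokens(
        Arrays.asList(Token.START_OBJECT, Token.END_OBJECT));
    System.out.println(detectRootForm(tokens));
    System.out.println(tokens.next());
  }
}
```

Here the look-ahead touches only structural tokens, which is what makes caching the token kind alone sufficient.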

Parse Node

The semantic parser is based on the idea of a parse tree. Each node in the tree parses one JSON element, then passes control to child nodes to parse any contained elements. The idea is classic parser theory, but the implementation requires a bit of explanation. As we noted, the Jackson parser does syntax validation. All the semantic parser has to do is handle the meaning of the tokens (hence the name.)

Each parse node handles one JSON element: a map, a scalar, an array, etc. It understands the tokens expected at that specific point in the parse. Often this means dealing with a scalar or null, or dealing with a nested structure.

Since each node handles just one task, it is easy to reason about the nodes. Also, it is quite easy to assemble them to handle JSON structures of any depth.

Each parse node implements JsonElementParser:

  interface JsonElementParser {
    String key();
    JsonLoader loader();
    JsonElementParser parent();
    boolean isAnonymous();
    boolean parse();
    ColumnMetadata schema();
  }
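
To make the one-node-per-element idea concrete, here is a much-reduced sketch. It does not use the JsonElementParser interface above: the ElementParser interface, the Tok and Kind types, and the ObjectParser and ScalarParser classes are hypothetical, a pre-built token list replaces the Jackson parser, and the sketch assumes each field keeps the same kind (object or scalar) across records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ParseTreeSketch {

  enum Kind { START_OBJECT, END_OBJECT, FIELD_NAME, VALUE_STRING, VALUE_NUMBER }

  // Hypothetical token: a kind plus optional text (field name or value).
  static final class Tok {
    final Kind kind;
    final String text;
    Tok(Kind kind, String text) { this.kind = kind; this.text = text; }
  }

  static Tok tok(Kind kind, String text) { return new Tok(kind, text); }

  // Hypothetical, reduced parse node: parses one element, given its first token.
  interface ElementParser {
    String key();
    void parse(Tok first, Iterator<Tok> tokens);
  }

  // Leaf node: records the text of each scalar it is handed.
  static class ScalarParser implements ElementParser {
    private final String key;
    final List<String> values = new ArrayList<>();
    ScalarParser(String key) { this.key = key; }
    @Override public String key() { return key; }
    @Override public void parse(Tok first, Iterator<Tok> tokens) {
      values.add(first.text);
    }
  }

  // Interior node: one child parser per member, created on first sight.
  static class ObjectParser implements ElementParser {
    private final String key;
    final Map<String, ElementParser> children = new LinkedHashMap<>();
    ObjectParser(String key) { this.key = key; }
    @Override public String key() { return key; }
    @Override public void parse(Tok first, Iterator<Tok> tokens) {
      // "first" is the START_OBJECT token; read members until END_OBJECT.
      while (true) {
        Tok tok = tokens.next();
        if (tok.kind == Kind.END_OBJECT) { return; }
        String name = tok.text;           // tok is a FIELD_NAME
        Tok value = tokens.next();
        ElementParser child = children.get(name);
        if (child == null) {
          child = value.kind == Kind.START_OBJECT
              ? new ObjectParser(name) : new ScalarParser(name);
          children.put(name, child);
        }
        child.parse(value, tokens);       // pass control to the child node
      }
    }
  }

  // Parses one top-level record with the given root node.
  static void parseRecord(ObjectParser root, List<Tok> tokens) {
    Iterator<Tok> iter = tokens.iterator();
    root.parse(iter.next(), iter);
  }

  public static void main(String[] args) {
    // Tokens for: {"a": 1, "m": {"b": "x"}}
    List<Tok> record = Arrays.asList(
        tok(Kind.START_OBJECT, null),
        tok(Kind.FIELD_NAME, "a"), tok(Kind.VALUE_NUMBER, "1"),
        tok(Kind.FIELD_NAME, "m"), tok(Kind.START_OBJECT, null),
        tok(Kind.FIELD_NAME, "b"), tok(Kind.VALUE_STRING, "x"),
        tok(Kind.END_OBJECT, null),
        tok(Kind.END_OBJECT, null));
    ObjectParser root = new ObjectParser(null);
    parseRecord(root, record);
    System.out.println(root.children.keySet());
  }
}
```

Because the tree persists across records, the second record reuses the nodes built for the first; nesting to any depth falls out of ObjectParser creating further ObjectParser children.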

Note: many parse node classes appear as classes nested inside other top-level classes. This was simply for convenience when developing as the classes were subject to rapid change. If desired, the classes can be pulled out to be top-level classes now that the structure has settled down.

Root Parser: JSON as Sequence of Objects

Root Parser: JSON as a Single Array

Object as Row

Object Members

Scalars

Arrays

Maps

Multi-Dimensional Arrays

Ambiguous Elements

All of the above would be really quite simple if JSON were well-behaved. But, JSON in the wild is complex (as we'll discuss below.) A whole set of parsers handles the ambiguities inherent in the JSON model:

  • Runs of nulls before seeing a type
  • Runs of empty arrays before seeing a type
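
One way to picture the first case, a run of nulls before a type is seen, is a column that postpones its type decision. The sketch below is only an illustration of that idea, not the parsers used in the revised reader: DeferredColumn, ColType, and inferType() are hypothetical, Java objects stand in for JSON scalars, and the sketch deliberately leaves open what happens if a batch ends while the column is still unresolved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DeferredTypeSketch {

  // Hypothetical type names, used only within this sketch.
  enum ColType { BIGINT, FLOAT8, BIT, VARCHAR }

  // Picks a type from the first non-null value seen.
  static ColType inferType(Object value) {
    if (value instanceof Long || value instanceof Integer) { return ColType.BIGINT; }
    if (value instanceof Double) { return ColType.FLOAT8; }
    if (value instanceof Boolean) { return ColType.BIT; }
    return ColType.VARCHAR;
  }

  // A column that counts leading nulls instead of guessing a type for them.
  static class DeferredColumn {
    private ColType type;               // null while only nulls have been seen
    private int pendingNulls = 0;
    private final List<Object> values = new ArrayList<>();

    void write(Object value) {
      if (type == null) {
        if (value == null) {
          pendingNulls++;               // still ambiguous: just remember the null
          return;
        }
        type = inferType(value);        // first non-null value fixes the type
        for (int i = 0; i < pendingNulls; i++) {
          values.add(null);             // back-fill the deferred nulls
        }
        pendingNulls = 0;
      }
      values.add(value);
    }

    boolean isResolved() { return type != null; }
    ColType type() { return type; }
    int pendingNulls() { return pendingNulls; }
    List<Object> values() { return values; }
  }

  public static void main(String[] args) {
    DeferredColumn col = new DeferredColumn();
    for (Object v : Arrays.asList(null, null, 10L, null)) {
      col.write(v);
    }
    System.out.println(col.type() + " " + col.values());
  }
}
```

The same deferral pattern could in principle apply to runs of empty arrays, with the element type standing in for the column type; that case is not shown here.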

Design Flaws

Experimental Workarounds

Null Handling

Empty List Handling

Ambiguous Type Handling

  • Type negotiator

Components

Plugin

Semantic Parser

Element Parsers

Future Work

  • Move JSON options from session options to format plugin options to allow per-file (rather than per-session) values.
