BH JSON Reader
As previously mentioned, this project upgraded two readers to test out the various frameworks: the CSV reader and the JSON reader. The JSON reader has special challenges because of a set of semi-supported features. As we will see, upgrading JSON to handle supported features was simple. Problems arose, however, in the semi- and un-supported features of JSON. Additional issues arose on realizing that Drill's design for handling JSON has gaping holes that we've tended to ignore, perhaps because JSON is not used as frequently as we'd expect.
To understand the JSON work, it is best to start with requirements.
- Use the Jackson JSON parser to produce a stream of JSON tokens.
- Create a record batch schema based on the type of the first value.
- Support two forms of records: a sequence of top-level objects, or a single array of objects.
- Read JSON scalars as either typed values or text (the so-called "all-text" mode.)
- Support maps and arrays of maps to any depth.
- Support varying types (maybe.)
- Support varying lists (maybe.)
- Support two-dimensional lists.
- Support null values in place of lists, maps or scalars.
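To make the "schema from the first value" and "all-text" requirements concrete, here is a minimal sketch. The `ValueKind` and `ColumnType` enums and the `FirstValueTyper` class are hypothetical names invented for illustration, and the scalar-to-type mapping is an assumption rather than a statement of what the reader does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for JSON scalar kinds and column types.
enum ValueKind { STRING, INTEGER, FLOAT, BOOLEAN, NULL }
enum ColumnType { VARCHAR, BIGINT, FLOAT8, BIT }

// Picks a column type the first time a non-null value is seen for a key.
class FirstValueTyper {
    private final boolean allTextMode;
    private final Map<String, ColumnType> schema = new LinkedHashMap<>();

    FirstValueTyper(boolean allTextMode) { this.allTextMode = allTextMode; }

    // Returns the column's type, or null while only nulls have been seen.
    ColumnType observe(String key, ValueKind kind) {
        ColumnType existing = schema.get(key);
        if (existing != null) { return existing; }    // the first value wins
        if (kind == ValueKind.NULL) { return null; }  // nothing to infer yet
        // In all-text mode every scalar is read as text (assumed mapping).
        ColumnType type = allTextMode ? ColumnType.VARCHAR : typeOf(kind);
        schema.put(key, type);
        return type;
    }

    private static ColumnType typeOf(ValueKind kind) {
        switch (kind) {
            case INTEGER: return ColumnType.BIGINT;
            case FLOAT:   return ColumnType.FLOAT8;
            case BOOLEAN: return ColumnType.BIT;
            default:      return ColumnType.VARCHAR;
        }
    }
}
```

In this sketch a later value of a different kind does not change the column type, which is why varying types need separate treatment.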
The original code handles all of the above with some caveats:
- Nulls cannot appear before a non-null value. (If they do, Drill guesses some arbitrary type.)
- Varying types are supported only through the Union and List vectors, which are themselves unsupported.
- List vector support is very vague. List vectors themselves have many bugs. Other operators seem not to support a List of Unions.
The revised code has a different set of caveats:
- Handles any number of leading nulls, as long as a non-null value appears in the first batch.
- Handles List vectors, but other operators still don't handle them.
- Does not handle varying types (unions).
The discussion that follows will explain this odd turn of events.
Structurally, the revised JSON reader follows the overall structure of the existing reader.
| Layer | Existing | Revised |
|---|---|---|
| Format Plugin | Derived from Easy plugin | Same, using new framework |
| Reader | Derived from `RecordReader` | Derived from `ManagedReader` |
| Semantic Parser | `JSONReader` | `JSONParser` |
| Vector writer | Complex writers | Result set loader |
| Token Parser | Jackson parser | Jackson parser |
The json package contains both the old and new JSON implementations. The old implementation is still used by the Drill function to parse JSON, by the Kafka plugin, and by the MapR-DB JSON format plugin.
The JSONFormatPlugin is modified as described in the Easy Format Plugin section:
- Uses the new `EasyFormatConfig` to configure the plugin.
- Implements `scanBatchCreator()` to create the scan batch mechanism.
- Provides the `JsonScanBatchCreator` class which builds up the scan operator from the required components, including the file scan framework and so on. Sets the type of null columns to nullable `VarChar` (a better guess than nullable `Int`).
- Implements the `JsonReaderCreator` class to create JSON batch readers as needed.
As in the original implementation, the JsonBatchReader acts as a bridge between the scan operator and the JSON semantic parser:
- Opens the file described by a `FileWork` object.
- Retrieves JSON session options and passes them to the JSON semantic parser.
- Sets up a "type negotiator" (see below).
- On each call to `next()`, reads records from the semantic parser until the batch is full.
- Releases resources on `close()`.
Continuing to follow the previous design, the new code defines a `JsonLoader` interface, akin to the prior `JSONReader`, that defines the services to read JSON records. In keeping with a theme for this project, the interface is dead simple:
```java
public interface JsonLoader {
  boolean next();
  void endBatch();
  void close();
}
```
Here, next() reads one record.
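The way a batch reader might drive this interface can be sketched as follows. This is an illustration only: `ListJsonLoader`, `BatchLoop` and the `maxRows` limit are hypothetical, and the sketch assumes that `next()` returns `false` at end of input, which the text above does not state.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Mirrors the interface shown above.
interface JsonLoader {
    boolean next();
    void endBatch();
    void close();
}

// Hypothetical in-memory loader: each "record" is just a string.
class ListJsonLoader implements JsonLoader {
    private final Iterator<String> records;
    int rowsRead;
    int batchesEnded;
    boolean closed;

    ListJsonLoader(List<String> records) { this.records = records.iterator(); }

    @Override public boolean next() {
        if (!records.hasNext()) { return false; } // assumed: false means end of input
        records.next();
        rowsRead++;
        return true;
    }
    @Override public void endBatch() { batchesEnded++; }
    @Override public void close() { closed = true; }
}

class BatchLoop {
    // Reads records until the batch is full or input ends; returns the row count.
    static int readBatch(JsonLoader loader, int maxRows) {
        int rows = 0;
        // The size check comes first so no record is consumed once the batch is full.
        while (rows < maxRows && loader.next()) {
            rows++;
        }
        loader.endBatch();
        return rows;
    }

    public static void main(String[] args) {
        ListJsonLoader loader = new ListJsonLoader(Arrays.asList("{}", "{}", "{}"));
        try {
            int rows;
            while ((rows = readBatch(loader, 2)) > 0) {
                System.out.println("batch of " + rows + " rows");
            }
        } finally {
            loader.close(); // release resources even if a batch fails
        }
    }
}
```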
The bulk of the actual JSON parsing work resides in the new parser package.
Briefly, the semantic parser creates a parse tree to parse each JSON element. Thus, there is a parser for the root object, another for each field, and more for nested maps. The idea is to use a separate class for each JSON element, then combine them into an overall parse tree for the file. This approach contrasts with the flags-and-states approach of the prior version: it kept track of differences by setting a large variety of flags. That approach, however, is very difficult to maintain, enhance and unit test, hence the revised approach.
The JsonLoaderImpl class is the core of the semantic processor. It:
- Holds onto the JSON parse options provided by the caller (the `JsonBatchReader`).
- Reads and returns rows.
- Maintains the root of the parse tree for the document.
- Resolves ambiguous types (discussed below).
- Handles errors.
- Provides the input error recovery created by a community member.
The Jackson JSON parser does the low-level parsing, providing a series of JSON tokens.
It is worth noting some details of the Jackson parser. First, it maintains its own internal state machine about the syntax of the JSON document. In this way, it is not a simple "tokenizer", it is a full parser in its own right. This behavior means that it is very, very difficult to get the parser to recover from errors. Once the parser gets into an error state, it does not want to come back out.
Second, it means that the semantic layer parser does not have to worry about syntax errors or details. The Jackson parser handles, say, the colons that separate key/value pairs, handles the difference between quoted and unquoted identifiers, and will catch invalid token sequences. The semantic parser just worries about a stream of valid tokens.
Third, the parser cannot handle look-ahead. Look-ahead, however, is often needed, so we must provide our own implementation in the form of the TokenIterator class.
In general, look-ahead is difficult because we have to buffer the value of prior tokens. Fortunately, when parsing JSON, the only tokens that are subject to look-ahead are simple tokens such as {, [, and so on. So, the token iterator caches only simple tokens, but not token values.
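A one-token look-ahead of the kind described can be sketched as below. This is not the actual `TokenIterator`; the `SimpleToken` enum, `PushbackTokens` class and `isArrayOfObjects` helper are hypothetical, and a plain Java iterator stands in for the Jackson parser.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical simple token kinds; the real reader works with Jackson's tokens.
enum SimpleToken { START_OBJECT, END_OBJECT, START_ARRAY, END_ARRAY, VALUE }

// One-token look-ahead: a token can be pushed back and read again.
class PushbackTokens {
    private final Iterator<SimpleToken> source;
    private SimpleToken pushedBack; // only the token kind is cached, never a value

    PushbackTokens(Iterator<SimpleToken> source) { this.source = source; }

    // Returns the next token, or null at end of input.
    SimpleToken next() {
        if (pushedBack != null) {
            SimpleToken token = pushedBack;
            pushedBack = null;
            return token;
        }
        return source.hasNext() ? source.next() : null;
    }

    // Looks at the next token without consuming it.
    SimpleToken peek() {
        SimpleToken token = next();
        if (token != null) { unget(token); }
        return token;
    }

    void unget(SimpleToken token) {
        if (pushedBack != null) {
            throw new IllegalStateException("only one token of look-ahead");
        }
        pushedBack = token;
    }
}

class LookaheadDemo {
    // Distinguishes a single array of objects from a sequence of top-level objects.
    static boolean isArrayOfObjects(PushbackTokens tokens) {
        return tokens.peek() == SimpleToken.START_ARRAY;
    }

    public static void main(String[] args) {
        PushbackTokens tokens = new PushbackTokens(
            Arrays.asList(SimpleToken.START_ARRAY, SimpleToken.START_OBJECT).iterator());
        System.out.println(isArrayOfObjects(tokens));
        System.out.println(tokens.next()); // the peeked token is still available
    }
}
```

Because only the token kind is held, pushing back a `VALUE` here would lose its text, which is the reason for restricting look-ahead to simple structural tokens.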
The semantic parser is based on the idea of a parse tree. Each node in the tree parses one JSON element, then passes control to child nodes to parse any contained elements. The idea is classic parser theory, but the implementation requires a bit of explanation. As we noted, the Jackson parser does syntax validation. All the semantic parser has to do is handle the meaning of the tokens (hence the name.)
Each parse node handles one JSON element: a map, a scalar, an array, etc. It understands the tokens expected at that specific point in the parse. Often this means dealing with a scalar or null, or dealing with a nested structure.
Since each node handles just one task, it is easy to reason about the nodes. Also, it is quite easy to assemble them to handle JSON structures of any depth.
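The following toy version shows the one-node-per-element idea in miniature. It is a sketch under simplifying assumptions, not the reader's code: `Tok`, `NodeParser`, `ScalarNode` and `MapNode` are hypothetical, arrays and nulls are left out, and the token stream is assumed to be valid (as the Jackson parser would guarantee).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical token: a kind plus optional text (field name or scalar value).
class Tok {
    enum Kind { START_OBJECT, END_OBJECT, FIELD_NAME, VALUE }
    final Kind kind;
    final String text;
    Tok(Kind kind, String text) { this.kind = kind; this.text = text; }
}

// One node of the parse tree: consumes the tokens of exactly one JSON element.
interface NodeParser {
    Object parse(Deque<Tok> tokens);
}

// Leaf node: consumes a single scalar value token.
class ScalarNode implements NodeParser {
    @Override public Object parse(Deque<Tok> tokens) {
        return tokens.removeFirst().text;
    }
}

// Interior node: consumes "{ name value ... }" and delegates each value to a child.
class MapNode implements NodeParser {
    private final Map<String, NodeParser> children = new LinkedHashMap<>();

    @Override public Object parse(Deque<Tok> tokens) {
        tokens.removeFirst(); // START_OBJECT; the stream is assumed to be valid
        Map<String, Object> row = new LinkedHashMap<>();
        while (true) {
            Tok tok = tokens.removeFirst();
            if (tok.kind == Tok.Kind.END_OBJECT) { return row; }
            // Otherwise tok is a FIELD_NAME: find or create the child parser.
            NodeParser child = children.get(tok.text);
            if (child == null) {
                boolean nested = tokens.peekFirst().kind == Tok.Kind.START_OBJECT;
                child = nested ? new MapNode() : new ScalarNode();
                children.put(tok.text, child);
            }
            row.put(tok.text, child.parse(tokens));
        }
    }
}

class ParseTreeDemo {
    // Tokens for {"a": 1, "m": {"b": "x"}}
    static Deque<Tok> sampleTokens() {
        return new ArrayDeque<>(Arrays.asList(
            new Tok(Tok.Kind.START_OBJECT, null),
            new Tok(Tok.Kind.FIELD_NAME, "a"), new Tok(Tok.Kind.VALUE, "1"),
            new Tok(Tok.Kind.FIELD_NAME, "m"),
            new Tok(Tok.Kind.START_OBJECT, null),
            new Tok(Tok.Kind.FIELD_NAME, "b"), new Tok(Tok.Kind.VALUE, "x"),
            new Tok(Tok.Kind.END_OBJECT, null),
            new Tok(Tok.Kind.END_OBJECT, null)));
    }

    public static void main(String[] args) {
        System.out.println(new MapNode().parse(sampleTokens()));
    }
}
```

Nesting falls out of the recursion: a `MapNode` child is just another node, so no flags are needed to track depth.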
Each parse node implements `JsonElementParser`:
```java
interface JsonElementParser {
  String key();
  JsonLoader loader();
  JsonElementParser parent();
  boolean isAnonymous();
  boolean parse();
  ColumnMetadata schema();
}
```
Note: many parse node classes appear as classes nested inside other top-level classes. This was simply for convenience when developing as the classes were subject to rapid change. If desired, the classes can be pulled out to be top-level classes now that the structure has settled down.
All of the above would be really quite simple if JSON were well-behaved. But, JSON in the wild is complex (as we'll discuss below.) A whole set of parsers handles the ambiguities inherent in the JSON model:
- Runs of nulls before seeing a type
- Runs of empty arrays before seeing a type
- Type negotiator
- Move JSON options from session options to format plugin options to allow per-file (rather than per-session) values.
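As an illustration of the "runs of nulls before seeing a type" case, the sketch below defers a column's type until the first non-null value, then reports how many earlier nulls need back-filling. `DeferredColumn` is a hypothetical class, types are plain strings for brevity, and the end-of-batch fallback is modeled on the nullable `VarChar` guess mentioned earlier; none of this is the reader's actual mechanism.

```java
// Hypothetical column that postpones its type until a non-null value arrives.
class DeferredColumn {
    private String type;      // null while the type is still unknown
    private int pendingNulls; // leading nulls seen before any typed value

    // Records a null; only counted while the type is undecided.
    void addNull() {
        if (type == null) { pendingNulls++; }
    }

    // The first typed value fixes the type; returns how many nulls to back-fill.
    int addValue(String valueType) {
        if (type != null) { return 0; }
        type = valueType;
        int backfill = pendingNulls;
        pendingNulls = 0;
        return backfill;
    }

    // A batch needs a concrete type: fall back to a guess if none was seen.
    String resolveAtEndOfBatch(String fallbackType) {
        if (type == null) {
            type = fallbackType;
            pendingNulls = 0;
        }
        return type;
    }
}
```

A column that sees `null, null, 10` would resolve to the type of `10` with two nulls to back-fill, while a column that is null for the whole batch would take the fallback.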