Skip to content

Latest commit

 

History

History
169 lines (152 loc) · 4.18 KB

File metadata and controls

169 lines (152 loc) · 4.18 KB

Creating a schema

A schema describes the data in a table. A schema consists of several fields. A field has a name and a type, e.g. a field with name 'id' and type 'string'. A schema has three types of field: row keys, sort keys and values.

See a full example for an example of a schema.

For example, a simple key-value schema with a string key and a string value would allow rows to be retrieved by querying for a key. The schema for this would be:

    {
      "rowKeyFields": [
        {
          "name": "key",
          "type": "StringType"
        }
      ],
      "sortKeyFields": [],
      "valueFields": [
        {
          "name": "value",
          "type": "StringType"
        }
      ]
    }

Note that if there are no sort or value fields then this must be indicated with an empty list.

If we wanted to sort the rows for a particular field by a timestamp, we could add a sort field of type long:

    {
      "rowKeyFields": [
        {
          "name": "key",
          "type": "StringType"
        }
      ],
      "sortKeyFields": [
        {
          "name": "timestamp",
          "type": "LongType"
        }
      ],
      "valueFields": [
        {
          "name": "value",
          "type": "StringType"
        }
      ]
    }

This would cause rows for a particular key to be stored (and retrieved) in increasing order of timestamps.

The following types are permitted as row keys and sort keys: IntType, LongType, StringType, ByteArrayType. All of these types can be used for values. Additionally, value fields may be of type ListType or MapType. Here is an example schema where there are several value fields:

    {
      "rowKeyFields": [
        {
          "name": "key",
          "type": "StringType"
        }
      ],
      "sortKeyFields": [
        {
          "name": "timestamp",
          "type": "LongType"
        }
      ],
      "valueFields": [
        {
          "name": "value1",
          "type": "StringType"
        },
        {
          "name": "value2",
          "type": "ByteArrayType"
        },
        {
          "name": "value3",
          "type": {
            "ListType": {
              "elementType": "IntType"
            }
          }
        },
        {
          "name": "value4",
          "type": {
            "MapType": {
              "keyType": "IntType",
              "valueType": "StringType"
            }
          }
        }
      ]
    }

The field with name value3 is a list with integer elements. The field with name value4 is a map with integer keys and string values.

Value fields can optionally be declared as nullable, meaning a row may have no value for that field. This is done by adding "nullable": true to the field definition:

    {
      "rowKeyFields": [
        {
          "name": "key",
          "type": "StringType"
        }
      ],
      "sortKeyFields": [],
      "valueFields": [
        {
          "name": "required_value",
          "type": "StringType"
        },
        {
          "name": "optional_value",
          "type": "StringType",
          "nullable": true
        }
      ]
    }

When a nullable field is absent from a row, querying that field will return null. Row key and sort key fields cannot be nullable.

There may be multiple row key fields. In the following example two string fields are used as row keys:

    {
      "rowKeyFields": [
        {
          "name": "key1",
          "type": "StringType"
        },
        {
          "name": "key2",
          "type": "StringType"
        }
      ],
      "sortKeyFields": [
        {
          "name": "timestamp",
          "type": "LongType"
        }
      ],
      "valueFields": [
        {
          "name": "value",
          "type": "StringType"
        }
      ]
    }

Sleeper will store the rows sorted by key1 and then key2. Thus retrieving all rows where key1 and key2 have specified values will be quick. A range scan to retrieve all rows where key1 has a certain value and key2 can take any value will also be quick. But a query for all rows where key2 has a specified value but key1 can take any value will not be quick.