Skip to content

Clarify and migrate semantically meaningful *_UNSPECIFIED enum defaults #1069

@benbellick

Description

@benbellick

Problem

Substrait uses protobuf enum values like *_UNSPECIFIED = 0, which is consistent with protobuf best practices. However, some fields currently assign semantic meaning to either the unspecified value or the proto3 default/unset state.

This is problematic because protobuf recommends that the zero enum value not be semantically meaningful:
https://protobuf.dev/best-practices/dos-donts/#unspecified-enum

This came up in PR #1044:
#1044 (comment)

Examples in the current spec:

  • HashJoinRel.build_input says the field defaults to BUILD_INPUT_RIGHT, but the enum has BUILD_INPUT_UNSPECIFIED = 0.

    // Specifies which side of input to build the hash table for this hash join. Default is `BUILD_INPUT_RIGHT`.
    BuildInput build_input = 9;
    enum JoinType {
    JOIN_TYPE_UNSPECIFIED = 0;
    JOIN_TYPE_INNER = 1;
    JOIN_TYPE_OUTER = 2;
    JOIN_TYPE_LEFT = 3;
    JOIN_TYPE_RIGHT = 4;
    JOIN_TYPE_LEFT_SEMI = 5;
    JOIN_TYPE_RIGHT_SEMI = 6;
    JOIN_TYPE_LEFT_ANTI = 7;
    JOIN_TYPE_RIGHT_ANTI = 8;
    JOIN_TYPE_LEFT_SINGLE = 9;
    JOIN_TYPE_RIGHT_SINGLE = 10;
    JOIN_TYPE_LEFT_MARK = 11;
    JOIN_TYPE_RIGHT_MARK = 12;
    }
    enum BuildInput {
    BUILD_INPUT_UNSPECIFIED = 0;
    BUILD_INPUT_LEFT = 1;
    BUILD_INPUT_RIGHT = 2;

  • The physical join docs also describe Build Input as optional and defaulting to right.

    | Left Input | A relational input. | Required |
    | Right Input | A relational input. | Required |
    | Build Input | Specifies which input is the `Build`. | Optional, defaults to build `Right`, probe `Left`. |
    | Left Keys | References to the fields to join on in the left input. | Required |
    | Right Keys | References to the fields to join on in the right input. | Required |
    | Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. |

  • AggregationPhase gives AGGREGATION_PHASE_UNSPECIFIED semantic meaning.

    // Describes which part of an aggregation or window function to perform within
    // the context of distributed algorithms.
    enum AggregationPhase {
    // Implies `INTERMEDIATE_TO_RESULT`.
    AGGREGATION_PHASE_UNSPECIFIED = 0;
    // Specifies that the function should be run only up to the point of
    // generating an intermediate value, to be further aggregated later using
    // INTERMEDIATE_TO_INTERMEDIATE or INTERMEDIATE_TO_RESULT.
    AGGREGATION_PHASE_INITIAL_TO_INTERMEDIATE = 1;
    // Specifies that the inputs of the aggregate or window function are the
    // intermediate values of the function, and that the output should also be
    // an intermediate value, to be further aggregated later using
    // INTERMEDIATE_TO_INTERMEDIATE or INTERMEDIATE_TO_RESULT.
    AGGREGATION_PHASE_INTERMEDIATE_TO_INTERMEDIATE = 2;
    // A complete invocation: the function should aggregate the given set of
    // inputs to yield a single return value. This style must be used for
    // aggregate or window functions that are not decomposable.
    AGGREGATION_PHASE_INITIAL_TO_RESULT = 3;
    // Specifies that the inputs of the aggregate or window function are the
    // intermediate values of the function, generated previously using
    // INITIAL_TO_INTERMEDIATE and possibly INTERMEDIATE_TO_INTERMEDIATE calls.
    // This call should combine the intermediate values to yield the final
    // return value.
    AGGREGATION_PHASE_INTERMEDIATE_TO_RESULT = 4;

  • AggregationInvocation gives AGGREGATION_INVOCATION_UNSPECIFIED semantic meaning.

    // Specifies whether equivalent records are merged before being aggregated.
    // Optional, defaults to AGGREGATION_INVOCATION_ALL.
    AggregationInvocation invocation = 6;
    // deprecated; use arguments instead
    repeated Expression args = 2 [deprecated = true];
    // Method in which equivalent records are merged before being aggregated.
    enum AggregationInvocation {
    // This default value implies AGGREGATION_INVOCATION_ALL.
    AGGREGATION_INVOCATION_UNSPECIFIED = 0;
    // Use all values in the aggregation calculation.
    AGGREGATION_INVOCATION_ALL = 1;
    // Use only distinct values in the aggregation calculation.
    AGGREGATION_INVOCATION_DISTINCT = 2;

  • FetchMode says producers must set a defined mode, but the docs say Top-N mode defaults to ROWS_ONLY.

    // Determines how a fetch operation handles rows tied with the last returned row.
    enum FetchMode {
    // Unspecified. Producers must set one of the defined modes.
    FETCH_MODE_UNSPECIFIED = 0;
    // Only return the requested number of rows.
    FETCH_MODE_ROWS_ONLY = 1;
    // Include additional rows tied with the last row (per the sort fields).
    FETCH_MODE_WITH_TIES = 2;

    | Property | Description | Required |
    | ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------- |
    | Input | The relational input. | Required |
    | Sort Fields | List of one or more fields to sort by. Uses the same properties as the [orderedness](basics.md#orderedness) property. | At least one field is required |
    | Offset Expression | An expression evaluating to a non-negative integer or null. Declares the offset for retrieval of records. Null is treated as 0. | Optional, defaults to 0. |
    | Count Expression | An expression evaluating to a non-negative integer or null. Declares the number of records to return. Null means ALL. | Required |
    | Mode | Determines how to handle rows tied with the last returned row. `ROWS_ONLY` (default) returns exactly `count` rows. `WITH_TIES` also includes additional rows tied with the last row (per the sort fields). | Optional, defaults to `ROWS_ONLY`. |

Proposal

For each affected field, choose one of two approaches:

  1. Deprecate the existing field and replace it with a new presence-aware field.
  2. Migrate in place: producers stop emitting *_UNSPECIFIED, and after a compatibility period, consumers start rejecting *_UNSPECIFIED for semantically meaningful fields.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions