Problem
Substrait uses protobuf enum values like *_UNSPECIFIED = 0, which is consistent with protobuf best practices. However, some fields currently assign semantic meaning to either the unspecified value or the proto3 default/unset state.
This is problematic because protobuf recommends that the zero enum value not be semantically meaningful:
https://protobuf.dev/best-practices/dos-donts/#unspecified-enum
This came up in PR #1044:
#1044 (comment)
Examples in the current spec:
-
HashJoinRel.build_input says the field defaults to BUILD_INPUT_RIGHT, but the enum has BUILD_INPUT_UNSPECIFIED = 0.
|
// Specifies which side of input to build the hash table for this hash join. Default is `BUILD_INPUT_RIGHT`. |
|
BuildInput build_input = 9; |
|
|
|
enum JoinType { |
|
JOIN_TYPE_UNSPECIFIED = 0; |
|
JOIN_TYPE_INNER = 1; |
|
JOIN_TYPE_OUTER = 2; |
|
JOIN_TYPE_LEFT = 3; |
|
JOIN_TYPE_RIGHT = 4; |
|
JOIN_TYPE_LEFT_SEMI = 5; |
|
JOIN_TYPE_RIGHT_SEMI = 6; |
|
JOIN_TYPE_LEFT_ANTI = 7; |
|
JOIN_TYPE_RIGHT_ANTI = 8; |
|
JOIN_TYPE_LEFT_SINGLE = 9; |
|
JOIN_TYPE_RIGHT_SINGLE = 10; |
|
JOIN_TYPE_LEFT_MARK = 11; |
|
JOIN_TYPE_RIGHT_MARK = 12; |
|
} |
|
|
|
enum BuildInput { |
|
BUILD_INPUT_UNSPECIFIED = 0; |
|
BUILD_INPUT_LEFT = 1; |
|
BUILD_INPUT_RIGHT = 2; |
-
The physical join docs also describe Build Input as optional and defaulting to right.
|
| Left Input | A relational input. | Required | |
|
| Right Input | A relational input. | Required | |
|
| Build Input | Specifies which input is the `Build`. | Optional, defaults to build `Right`, probe `Left`. | |
|
| Left Keys | References to the fields to join on in the left input. | Required | |
|
| Right Keys | References to the fields to join on in the right input. | Required | |
|
| Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. | |
-
AggregationPhase gives AGGREGATION_PHASE_UNSPECIFIED semantic meaning.
|
// Describes which part of an aggregation or window function to perform within |
|
// the context of distributed algorithms. |
|
enum AggregationPhase { |
|
// Implies `INTERMEDIATE_TO_RESULT`. |
|
AGGREGATION_PHASE_UNSPECIFIED = 0; |
|
|
|
// Specifies that the function should be run only up to the point of |
|
// generating an intermediate value, to be further aggregated later using |
|
// INTERMEDIATE_TO_INTERMEDIATE or INTERMEDIATE_TO_RESULT. |
|
AGGREGATION_PHASE_INITIAL_TO_INTERMEDIATE = 1; |
|
|
|
// Specifies that the inputs of the aggregate or window function are the |
|
// intermediate values of the function, and that the output should also be |
|
// an intermediate value, to be further aggregated later using |
|
// INTERMEDIATE_TO_INTERMEDIATE or INTERMEDIATE_TO_RESULT. |
|
AGGREGATION_PHASE_INTERMEDIATE_TO_INTERMEDIATE = 2; |
|
|
|
// A complete invocation: the function should aggregate the given set of |
|
// inputs to yield a single return value. This style must be used for |
|
// aggregate or window functions that are not decomposable. |
|
AGGREGATION_PHASE_INITIAL_TO_RESULT = 3; |
|
|
|
// Specifies that the inputs of the aggregate or window function are the |
|
// intermediate values of the function, generated previously using |
|
// INITIAL_TO_INTERMEDIATE and possibly INTERMEDIATE_TO_INTERMEDIATE calls. |
|
// This call should combine the intermediate values to yield the final |
|
// return value. |
|
AGGREGATION_PHASE_INTERMEDIATE_TO_RESULT = 4; |
-
AggregationInvocation gives AGGREGATION_INVOCATION_UNSPECIFIED semantic meaning.
|
// Specifies whether equivalent records are merged before being aggregated. |
|
// Optional, defaults to AGGREGATION_INVOCATION_ALL. |
|
AggregationInvocation invocation = 6; |
|
|
|
// deprecated; use arguments instead |
|
repeated Expression args = 2 [deprecated = true]; |
|
|
|
// Method in which equivalent records are merged before being aggregated. |
|
enum AggregationInvocation { |
|
// This default value implies AGGREGATION_INVOCATION_ALL. |
|
AGGREGATION_INVOCATION_UNSPECIFIED = 0; |
|
|
|
// Use all values in the aggregation calculation. |
|
AGGREGATION_INVOCATION_ALL = 1; |
|
|
|
// Use only distinct values in the aggregation calculation. |
|
AGGREGATION_INVOCATION_DISTINCT = 2; |
-
FetchMode says producers must set a defined mode, but the docs say Top-N mode defaults to ROWS_ONLY.
|
// Determines how a fetch operation handles rows tied with the last returned row. |
|
enum FetchMode { |
|
// Unspecified. Producers must set one of the defined modes. |
|
FETCH_MODE_UNSPECIFIED = 0; |
|
// Only return the requested number of rows. |
|
FETCH_MODE_ROWS_ONLY = 1; |
|
// Include additional rows tied with the last row (per the sort fields). |
|
FETCH_MODE_WITH_TIES = 2; |
|
| Property | Description | Required | |
|
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------- | |
|
| Input | The relational input. | Required | |
|
| Sort Fields | List of one or more fields to sort by. Uses the same properties as the [orderedness](basics.md#orderedness) property. | At least one field is required | |
|
| Offset Expression | An expression evaluating to a non-negative integer or null. Declares the offset for retrieval of records. Null is treated as 0. | Optional, defaults to 0. | |
|
| Count Expression | An expression evaluating to a non-negative integer or null. Declares the number of records to return. Null means ALL. | Required | |
|
| Mode | Determines how to handle rows tied with the last returned row. `ROWS_ONLY` (default) returns exactly `count` rows. `WITH_TIES` also includes additional rows tied with the last row (per the sort fields). | Optional, defaults to `ROWS_ONLY`. | |
Proposal
For each affected field, choose one of two approaches:
- Deprecate the existing field and replace it with a new presence-aware field.
- Migrate in place: producers stop emitting
*_UNSPECIFIED, and after a compatibility period, consumers start rejecting *_UNSPECIFIED for semantically meaningful fields.
Problem
Substrait uses protobuf enum values like
*_UNSPECIFIED = 0, which is consistent with protobuf best practices. However, some fields currently assign semantic meaning to either the unspecified value or the proto3 default/unset state.This is problematic because protobuf recommends that the zero enum value not be semantically meaningful:
https://protobuf.dev/best-practices/dos-donts/#unspecified-enum
This came up in PR #1044:
#1044 (comment)
Examples in the current spec:
HashJoinRel.build_inputsays the field defaults toBUILD_INPUT_RIGHT, but the enum hasBUILD_INPUT_UNSPECIFIED = 0.substrait/proto/substrait/algebra.proto
Lines 868 to 890 in bf7a9ca
The physical join docs also describe
Build Inputas optional and defaulting to right.substrait/site/docs/relations/physical_relations.md
Lines 21 to 26 in bf7a9ca
AggregationPhasegivesAGGREGATION_PHASE_UNSPECIFIEDsemantic meaning.substrait/proto/substrait/algebra.proto
Lines 1788 to 1815 in bf7a9ca
AggregationInvocationgivesAGGREGATION_INVOCATION_UNSPECIFIEDsemantic meaning.substrait/proto/substrait/algebra.proto
Lines 1861 to 1877 in bf7a9ca
FetchModesays producers must set a defined mode, but the docs say Top-N mode defaults toROWS_ONLY.substrait/proto/substrait/algebra.proto
Lines 588 to 595 in bf7a9ca
substrait/site/docs/relations/physical_relations.md
Lines 152 to 158 in bf7a9ca
Proposal
For each affected field, choose one of two approaches:
*_UNSPECIFIED, and after a compatibility period, consumers start rejecting*_UNSPECIFIEDfor semantically meaningful fields.