diff --git a/LogicalTypes.md b/LogicalTypes.md
index e7a0ce04..d9fd6a29 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti
 
 The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
-The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
+The type-defined sort order for `FLOAT16` is signed (with special handling of NaNs and signed zeros),
+as for `FLOAT` and `DOUBLE`. It is recommended that writers use `IEEE754TotalOrder` when writing columns
+of this type for a well-defined handling of NaNs and signed zeros. See the `ColumnOrder` union in the
+[Thrift definition](src/main/thrift/parquet.thrift) for details.
 
 ## Temporal Types
 
diff --git a/README.md b/README.md
index ae7272fb..afecf332 100644
--- a/README.md
+++ b/README.md
@@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types].
 Parquet stores min/max statistics at several levels (such as Column Chunk,
 Column Index, and Data Page). These statistics are according to a sort order,
 which is defined for each column in the file footer. Parquet supports common
-sort orders for logical and primitve types. The details are documented in the
+sort orders for logical and primitive types and also special orders for types
+where the common sort order is not unambiguously defined (e.g., NaN ordering
+for floating point types). The details are documented in the
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 
 ## Nested Encoding
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 59ec5f17..1c49106b 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -1030,6 +1030,9 @@ struct RowGroup {
 /** Empty struct to signal the order defined by the physical or logical type */
 struct TypeDefinedOrder {}
 
+/** Empty struct to signal IEEE 754 total order for floating point types */
+struct IEEE754TotalOrder {}
+
 /**
  * Union to specify the order used for the min_value and max_value fields for a
  * column. This union takes the role of an enhanced enum that allows rich
@@ -1038,6 +1041,7 @@ struct TypeDefinedOrder {}
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
  *                      physical type (if there is no logical type).
+ * * IEEE754TotalOrder - the floating point column uses IEEE 754 total order.
  *
  * If the reader does not support the value of this union, min and max stats
  * for this column should be ignored.
@@ -1082,8 +1086,12 @@ union ColumnOrder {
    *   BYTE_ARRAY - unsigned byte-wise comparison
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating
+   *     point types due to underspecified handling of NaN and -0/+0,
+   *     it is recommended that writers use IEEE_754_TOTAL_ORDER
+   *     for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following
    *     compatibility rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
@@ -1099,6 +1107,53 @@ union ColumnOrder {
    *       `-0.0` should be written into the min statistics field.
    */
   1: TypeDefinedOrder TYPE_ORDER;
+
+  /*
+   * The floating point type is ordered according to the totalOrder predicate,
+   * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
+   * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering.
+   *
+   * Intuitively, this orders floats mathematically, but defines -0 to be less
+   * than +0, -NaN to be less than anything else, and +NaN to be greater than
+   * anything else. It also defines an order between different bit representations
+   * of the same value.
+   *
+   * The formal definition is as follows:
+   *   a) If x<y, totalOrder(x, y) is true.
+   *   b) If x>y, totalOrder(x, y) is false.
+   *   c) If x=y:
+   *     1) totalOrder(−0, +0) is true.
+   *     2) totalOrder(+0, −0) is false.
+   *     3) If x and y represent the same floating-point datum:
+   *       i) If x and y have negative sign, totalOrder(x, y) is true if and
+   *          only if the exponent of x ≥ the exponent of y
+   *       ii) otherwise totalOrder(x, y) is true if and only if the exponent
+   *           of x ≤ the exponent of y.
+   *   d) If x and y are unordered numerically because x or y is NaN:
+   *     1) totalOrder(−NaN, y) is true where −NaN represents a NaN with
+   *        negative sign bit and y is a non-NaN floating-point number.
+   *     2) totalOrder(x, +NaN) is true where +NaN represents a NaN with
+   *        positive sign bit and x is a non-NaN floating-point number.
+   *     3) If x and y are both NaNs, then totalOrder reflects a total ordering
+   *        based on:
+   *        i) negative sign orders below positive sign
+   *        ii) signaling orders below quiet for +NaN, reverse for −NaN
+   *        iii) lesser payload, when regarded as an integer, orders below
+   *             greater payload for +NaN, reverse for −NaN.
+   *
+   * Note that this ordering can be implemented efficiently in software by bit-wise
+   * operations on the integer representation of the floating point values.
+   * E.g., this is a possible implementation for DOUBLE in Rust:
+   *
+   * pub fn totalOrder(x: f64, y: f64) -> bool {
+   *   let mut x_int = x.to_bits() as i64;
+   *   let mut y_int = y.to_bits() as i64;
+   *   x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
+   *   y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
+   *   return x_int <= y_int;
+   * }
+   */
+  2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER;
 }
 
 struct PageLocation {
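
Review note: to check the bit trick from the new `IEEE_754_TOTAL_ORDER` comment against the edge cases it names, a minimal Rust sketch follows. `total_order_f64` restates the predicate from the patch under a snake_case name. `total_order_f16` is a hypothetical extension to `FLOAT16` and assumes the two bytes of the `FIXED_LEN_BYTE_ARRAY` hold a little-endian IEEE 754 binary16 value, which this patch does not state. Neither function is part of the proposed change.

```rust
/// totalOrder(x, y) for DOUBLE, using the same bit trick as the patch.
pub fn total_order_f64(x: f64, y: f64) -> bool {
    let mut x_int = x.to_bits() as i64;
    let mut y_int = y.to_bits() as i64;
    // For values with the sign bit set, flip every bit except the sign bit,
    // so that plain signed integer comparison matches IEEE 754 total order.
    x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
    y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
    x_int <= y_int
}

/// Hypothetical FLOAT16 variant working on the raw 2-byte value.
/// Assumption: the bytes are little-endian IEEE 754 binary16.
pub fn total_order_f16(x: [u8; 2], y: [u8; 2]) -> bool {
    // Same transformation as above, applied to a 16-bit pattern.
    let key = |b: [u8; 2]| -> i16 {
        let v = i16::from_le_bytes(b);
        v ^ ((((v >> 15) as u16) >> 1) as i16)
    };
    key(x) <= key(y)
}
```

Under these assumptions, `total_order_f64(-0.0, 0.0)` holds while `total_order_f64(0.0, -0.0)` does not, a NaN with the sign bit set orders below negative infinity, and a NaN with the sign bit clear orders above positive infinity. The standard library's `f64::total_cmp` implements the same ordering and can serve as a cross-check.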