feat: optional codec and data type #33
Conversation
Looks good; the only issue I see is that the fill value representation does not allow `null` for the base data type fill value, e.g. for a base data type of `json` or a nested `optional`. Instead, you could specify the base data type fill value in a single-element array: `null` -> missing.
While I see the simplicity here of toggling between

I suppose it is a limitation, but I'd note that we don't have any data types that permit a

Also, for a multiply nested optional type like

I think
I've implemented this in
I am open to explicit suggestions that satisfy both, or only the first. The latter is a bit burdensome to support for something I suspect nobody would use. What I currently do:
I previously suggested wrapping any non-`None` fill value in a one-element array. That solves both issues and is syntactically pretty minimal.
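A sketch of how the one-element-array convention could disambiguate the three cases for an `optional` data type (hypothetical metadata fragments, not normative):

```json
{"fill_value": null}
{"fill_value": ["a"]}
{"fill_value": [null]}
```

Here `null` would denote a missing (null) fill value, `["a"]` a present string `"a"`, and `[null]` a present `null` of a base data type such as `json`.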
I'm still not quite following why this current implementation is not just a special case of a more general sum type, i.e. an enum type in Rust. Currently, `Optional` has a bit that toggles between

Why build this infrastructure for
Interesting... masked data has come up in a few discussions I've had, but never a more general sum type. Will people use this? It seems like not many people are complaining about the lack of struct support in Zarr V3. A more general enum type would probably need to encode each variant through separate codec chains. E.g.

```rust
enum EnumType {
    U8(u8),
    String(String),
}
```

```json
{
    "data_type": {
        "name": "enum",
        "configuration": {
            "data_types": [
                { "name": "uint8" },
                { "name": "string" }
            ]
        }
    },
    "fill_value": "?",
    "codecs": [
        {
            "name": "enum",
            "configuration": {
                "discriminator_data_type": { "name": "uint8" },
                "discriminator_codecs": [
                    { "name": "bytes" }
                ],
                "variant_codecs": [
                    [ { "name": "bytes" } ],
                    [ { "name": "vlen-utf8" } ]
                ]
            }
        }
    ]
}
```

Specialising the above for an optional would need a new
The other two null-like types I would immediately like to use this for are:

I do not think these are well represented by

```julia
julia> NaN == true
false

julia> missing == true
missing
```

If we could generalize this to any two types and introduce a
A recent application is in the tracking standard GEFF; they introduced a

While they use Python's

Thanks @jbms, that ended up working out quite nicely! I've updated the spec and implementation, and added example data.
Would it be in-scope to map between the optional codec and nullable string dtype? Like when people do the following, it could store as optional?

```python
>>> import numpy as np
>>> import zarr
>>> arr_str = np.array(["x", None], dtype=np.dtypes.StringDType(na_object=np.nan))
>>> arr_str
array(['x', nan], dtype=StringDType(na_object=nan))
>>> g = zarr.open_group(store={})
>>> g["arr_str"] = arr_str  # errors on `main` as of #3695
```

And be read:

```python
>>> g["arr_str"][:]
array(['x', nan], dtype=StringDType(na_object=nan))
```

I believe since a 1-element array containing
No problem,

```json
"data_type": {
    "name": "optional",
    "configuration": {
        "name": "string",
        "configuration": {}
    }
}
```

in zarr.json, and handle converting to a masked representation internally.
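A minimal sketch of what "converting to a masked representation internally" could look like, where the decoded optional data is a list of base-type payloads plus a validity mask. The helper name and the plain-list representation are hypothetical, not from the spec or any implementation:

```python
# Sketch: decode an optional column into (values, valid) and substitute a
# reader-chosen na_object for invalid entries. Names are hypothetical.
def to_masked(values, valid, na_object):
    """Replace entries whose validity flag is False with na_object."""
    return [v if ok else na_object for v, ok in zip(values, valid)]

values = ["x", ""]      # base-type payloads; the payload under a null is arbitrary
valid = [True, False]   # True = present, False = null
print(to_masked(values, valid, float("nan")))  # → ['x', nan]
```

The na_object never needs to appear in the stored data; it is injected purely at read time, which matches the "mask plus convention" approach discussed below.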
OK, let me rephrase: I’m questioning this:

In the Python world, the NA objects are often not necessarily of the same type as the array: a nullable numpy string array dtype might have it set to

So I guess as given, there would be no way to express that, and it would have to be expressed externally (e.g. by storing a separate 1-element array with the NA object). Would it make sense to have this NA value be different from a
The numpy string dtype `na_object` semantics are a bit complex for zarr to handle -- I think

When is it important that the `na_object` gets stored? The alternative, where zarr just stores a mask and readers agree on a convention for interpreting values outside the mask, seems easier at the moment.
That’s correct. I think it might be possible to change the spec so that it would be possible to natively support all of them that zarr-python can handle already (which might only be

See footnote of previous comment. But yeah, deciding that this is out-of-scope of the spec is valid. As said, I think that’d probably mean that

A third option is to roll this logic into a self-contained data type that basically adds the `na_object` to the variable-length UTF-8 string configuration. If it's important that readers across language barriers agree on the in-memory interpretation of the

I really want to emphasize that conflating a (missing or NA) type with
Polars has a

There is also a summary of attempts to add an NA type to NumPy in NEP 26:

Just because NumPy has not worked this out should not constrain our choices here. R, being developed as a statistical language, has had NA values and semantics for quite some time:

Julia copied R's NA semantics as the
mkitti left a comment
We may need to define null better here. Perhaps we should follow null from Polars and Apache Arrow.
> For nested optional types, this representation is applied recursively.
> The table below demonstrates valid `data_type` and `fill_value` combinations with an `optional` and nested `optional` data type, along with their equivalent Rust [`Option`](https://doc.rust-lang.org/std/option/) values.
I think we may need to provide additional mappings here to more clearly illustrate what we mean. Conceptually, this seems more similar to how nullable column types work in Polars and Apache Arrow than to how Rust's `Option` works.
It may be necessary to define `null` as its own data type. I think we should also clarify whether `null` is also meant to be analogous to `NA` in R and `missing` in Julia.
From the scale-offset codec conversation, I can also see a potential need to define how to compute with Optional types. I propose that null values with optional booleans should participate in Kleene ternary logic. Additionally, scaling or offsetting a null value should result in a null value.
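The proposed Kleene (three-valued) logic can be sketched in plain Python, using `None` for the null/unknown value. This is an illustration of the proposed semantics, not part of the spec:

```python
# Kleene three-valued logic, with None standing in for null/unknown.
def kleene_and(a, b):
    if a is False or b is False:
        return False  # False AND anything is False, even unknown
    if a is None or b is None:
        return None   # otherwise unknown propagates
    return True

def kleene_or(a, b):
    if a is True or b is True:
        return True   # True OR anything is True, even unknown
    if a is None or b is None:
        return None
    return False

print(kleene_and(None, False))  # → False
print(kleene_or(None, True))    # → True
print(kleene_and(None, True))   # → None
```

Note how `null AND False` is `False` and `null OR True` is `True`; only the genuinely undetermined combinations stay `null`.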
I don't think zarr usually defines runtime behavior of data, which would mean that discussing the “best” semantics here is probably a distraction. That being said, numpy explicitly made it possible to use e.g. `pd.NA` there, so you're treating it too harshly.
As it currently stands, `fill_value` is either of the current data type or `null`, and I don't think zarr is in the business of telling people which runtime behavior in-memory representations of types should have. But maybe using R's/pandas' NA as an example wouldn't hurt.
> It may be necessary to define `null` as its own data type

Why? `null` is just being used as a fill value with specific behaviour for the `optional` data type.
> From the scale-offset codec conversation

Codecs like `scale_offset` could add support for the `optional` data type if there is demand, but I think that is outside the scope of this PR. Inner codecs in `data_codecs` would not interact with the `optional` data type anyway. See below:
> While array-to-array codecs MAY support the `optional` data type, implementations SHOULD use the `optional` codec as the sole top-level codec.
> This approach is preferred because the codecs contained within the `optional` codec configuration do not need to explicitly handle optional data type semantics.
Array-to-array codecs that perform shape manipulation (e.g. reshape) could be an exception here as they support all data types.
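One plausible shape of the split data/mask encoding referred to here can be sketched as follows. The one-byte-per-element mask, the base-type fill for null slots, and the function names are all assumptions for illustration, not the normative byte layout:

```python
# Sketch of a split data/mask encoding for an optional data type.
# Assumption (not normative): one mask byte per element (1 = present,
# 0 = null), and null payload slots are filled with a base-type fill value
# so that inner codecs never have to see optional semantics.
def encode_optional(values, base_fill=0):
    mask = bytes(1 if v is not None else 0 for v in values)
    data = [v if v is not None else base_fill for v in values]
    return mask, data

def decode_optional(mask, data):
    return [v if m else None for m, v in zip(mask, data)]

mask, data = encode_optional([10, None, 30])
assert decode_optional(mask, data) == [10, None, 30]
```

Because the mask and the densely filled data are separated before the inner codec chains run, those chains operate on ordinary non-optional arrays, which is exactly why they need no special handling.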
> Why? `null` is just being used as a fill value with specific behaviour for the `optional` data type.
null has meaning in other contexts, particularly in other storage contexts. The term arises in SQL, Apache Arrow, and Polars. In the data type fill value description here, we state that null indicates the absence of value.
There are two possible ways to interpret an absence of value:
- The value does not exist.
- The value is not known.
The first interpretation is typically from the perspective of a software engineer. The value is completely absent, perhaps because it was never defined. Trying to access a value that has not been defined is then thought of as an error that must be handled.
The second interpretation may be from a statistician. The value is missing because we have not measured it or do not know it. The value may exist, but it is not known to us. Trying to access the value is not an error; rather, the lack of knowledge is represented.
Knowing the history of this specification, I would guess that you may mean the first interpretation. In Zarr v2, a `null` fill value also meant "no fill value", i.e. the values of an "empty chunk" are undefined. However, in the context of other analogous data libraries, SQL, Apache Arrow, and Polars, `null` means the second interpretation.
Leaving the interpretation ambiguous will result in interoperability issues. In the second interpretation, scale-offset is actually well defined in a number of languages.

Julia:

```julia
julia> [1, 2, 3, missing] .- 1
4-element Vector{Union{Missing, Int64}}:
 0
 1
 2
  missing

julia> ([1, 2, 3, missing] .- 1) .* 5
4-element Vector{Union{Missing, Int64}}:
 0
 5
 10
  missing
```

R:

```r
> c(1,2,3, NA) - 1
[1]  0  1  2 NA
> (c(1,2,3, NA) - 1) * 5
[1]  0  5 10 NA
```

Please clarify how `null` should be interpreted here.
Another ambiguity is that it is not clear whether the implementation of the optional data type should be as an enum or via a sentinel value.
It is not clearly stated that the null/missing value is not also of the underlying data type. In this case, a specific value of the original data type could act as a sentinel. Perhaps one may read this as understanding that the fill_value attribute defines that sentinel value.
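The difference between the two readings can be made concrete with a small sketch (illustrative only, using NaN as a hypothetical sentinel for an optional float): a sentinel implementation reuses a value of the base type, while an enum/mask implementation keeps validity out of band:

```python
import math

raw = [1.0, float("nan"), 3.0]

# Sentinel interpretation: a designated base-type value (here NaN) means null,
# so that value can no longer represent real data.
sentinel_decoded = [None if math.isnan(x) else x for x in raw]

# Enum/mask interpretation: validity is stored separately, so every base-type
# value (including NaN) remains representable.
valid = [True, False, True]
mask_decoded = [x if ok else None for x, ok in zip(raw, valid)]

assert sentinel_decoded == mask_decoded == [1.0, None, 3.0]
```

The two agree here only because the sentinel never occurs as real data; the mask form is the one that stays unambiguous for all base types.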
> I don't think zarr usually defines runtime behavior of data, which would mean that discussing the “best” semantics here is probably a distraction. That being said numpy explicitly made it possible to use e.g. `pd.NA` there so you're treating it too harshly.
The issue I'm pointing out is that data types do come with some implied runtime behavior. The floating point types rely on IEEE 754 not only for their format but also for arithmetic on those types. There is implied modular arithmetic on signed and unsigned integers of fixed width.
In this case, the definition of this codec and data type comes very close to how Polars addresses missing data. For each column, Polars also stores a validity array. In Rust, missing values are also referred to by `Option<T>` when interacting with `Series`.
```rust
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Explicitly defining a vector with missing values
    let values: Vec<Option<i32>> = vec![
        Some(10),
        None, // This becomes a null in Polars
        Some(30),
        None,
    ];
    let s = Series::new("counts".into(), values);
    println!("Series with nulls:\n{}", s);
    Ok(())
}

fn decode_series(s: &Series) {
    // Convert a Series back into a Vec of Options
    // .i32() attempts to view the series as Int32 chunks
    let decoded: Vec<Option<i32>> = s.i32()
        .unwrap()
        .into_iter()
        .collect();
    for val in decoded {
        match val {
            Some(v) => println!("Found value: {}", v),
            None => println!("Found a missing entry!"),
        }
    }
}
```

However, in Polars, you can multiply a column with missing data by a scalar.
```rust
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // 1. Create a DataFrame with Some values and None (nulls)
    let df = df!(
        "sensor_id" => ["A1", "A2", "A3", "A4"],
        "reading" => [Some(10.0), None, Some(25.0), None],
        "visited" => [Some(true), Some(false), None, Some(true)],
        "checked" => [Some(false), None, Some(true), None]
    )?;

    // 2. Perform arithmetic or logic on columns with missing data
    //    Polars handles the 'None' entries automatically—they stay 'null'.
    let calibrated_df = df.lazy()
        .with_column(
            (col("reading") * lit(1.5)).alias("calibrated_reading")
        )
        .with_column(
            col("visited").or(col("checked")).alias("visited_or_checked")
        )
        .collect()?;

    println!("Resulting DataFrame:\n{}", calibrated_df);
    Ok(())
}
```

This results in the following output:

```text
Resulting DataFrame:
shape: (4, 6)
┌───────────┬─────────┬─────────┬─────────┬────────────────────┬────────────────────┐
│ sensor_id ┆ reading ┆ visited ┆ checked ┆ calibrated_reading ┆ visited_or_checked │
│ ---       ┆ ---     ┆ ---     ┆ ---     ┆ ---                ┆ ---                │
│ str       ┆ f64     ┆ bool    ┆ bool    ┆ f64                ┆ bool               │
╞═══════════╪═════════╪═════════╪═════════╪════════════════════╪════════════════════╡
│ A1        ┆ 10.0    ┆ true    ┆ false   ┆ 15.0               ┆ true               │
│ A2        ┆ null    ┆ false   ┆ null    ┆ null               ┆ null               │
│ A3        ┆ 25.0    ┆ null    ┆ true    ┆ 37.5               ┆ true               │
│ A4        ┆ null    ┆ true    ┆ null    ┆ null               ┆ true               │
└───────────┴─────────┴─────────┴─────────┴────────────────────┴────────────────────┘
```
My problem is that I might implement this in Julia. There I have a choice to map `null` to either `missing` or `nothing`. I demonstrated above that I can perform subtraction and multiplication on `missing`. Trying to do so on `nothing` is undefined.

```julia
julia> [1, 2, 3, nothing] .- 1
ERROR: MethodError: no method matching -(::Nothing, ::Int64)
The function `-` exists, but no method is defined for this combination of argument types.
```

The minimum I think we should do is provide guidance that by "null" we do not define any arithmetic operations, unlike Polars and Apache Arrow. Therefore, scale-offset is then also undefined on an optional data type. As an implementer, I would then avoid implicitly mapping missing data to `missing` or `NA`. In Julia, I might choose `nothing` and force the user to explicitly map those values to `missing`.

That said, I think we should consider making our `null` consistent with `null` as used in Polars and Apache Arrow.
Apache Arrow defines null as “unknown”, not “missing” https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/
> Apache Arrow defines null as “unknown”, not “missing” https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/
I'm a little confused on which point you are trying to make because my earlier dichotomy was between "unknown" and "undefined". The Julia missing type represents an "unknown" value and participates in logic and arithmetic.
Just to be clear, I double checked pyarrow and pyarrow.compute. There null participates in arithmetic and logic there as well.
```python
import pyarrow as pa
import pyarrow.compute as pc

# 1. Create a column with missing data (None represents a Null)
data = [10, 20, None, 40, 50]
column = pa.array(data, type=pa.int64())

# 2. Define your scalar
scalar = 2

# 3. Perform the multiplication
#    Arrow handles the null automatically: any_value * null = null
multiplied = pc.multiply(column, scalar)

# 4. Do comparisons
is_greater_than_25 = pc.greater(column, 25)
is_less_than_45 = pc.less(column, 45)

# 5. Logic
between_25_and_45 = pc.and_(is_greater_than_25, is_less_than_45)

print("Original Column:", column)
print("Scalar:", scalar)
print("Multiplied Column: ", multiplied)
print("Greater than 25 Column: ", is_greater_than_25)
print("Less than 45 Column: ", is_less_than_45)
print("Between 25 and 45 Column: ", between_25_and_45)
```

```text
Original Column: [
  10,
  20,
  null,
  40,
  50
]
Scalar: 2
Multiplied Column:  [
  20,
  40,
  null,
  80,
  100
]
Greater than 25 Column:  [
  false,
  false,
  null,
  true,
  true
]
Less than 45 Column:  [
  true,
  true,
  null,
  true,
  false
]
Between 25 and 45 Column:  [
  false,
  false,
  null,
  true,
  false
]
```
I also checked Arrow in Rust:
```rust
use arrow::array::{ArrayRef, Int64Array, BooleanArray};
use arrow::compute::kernels::cmp::{lt, gt};
use arrow::compute::kernels::numeric::mul;
use arrow::compute::{and, or_kleene};
use arrow::record_batch::RecordBatch;
use arrow::util::pretty::pretty_format_batches_with_options;
use arrow::util::display::FormatOptions;
use std::sync::Arc;

fn main() {
    // 1. Setup Data
    let data = vec![Some(10), Some(20), None, Some(40), Some(50)];
    let column = Int64Array::from(data);

    // 2. Perform Operations
    let mul_val = Int64Array::new_scalar(2);
    let gt_val = Int64Array::new_scalar(25);
    let lt_val = Int64Array::new_scalar(45);
    let multiplied = mul(&column, &mul_val).unwrap();
    let is_gt_25 = gt(&column, &gt_val).unwrap();
    let is_lt_45 = lt(&column, &lt_val).unwrap();

    // and() still takes two BooleanArrays
    let between = and(&is_gt_25, &is_lt_45).unwrap();
    let true_array = BooleanArray::from(vec![true; between.len()]);
    let or_true = or_kleene(&between, &true_array).unwrap();

    // 3. Collect into a RecordBatch for compact printing
    let batch = RecordBatch::try_from_iter(vec![
        ("original", Arc::new(column) as ArrayRef),
        ("multiplied", Arc::new(multiplied) as ArrayRef),
        ("> 25", Arc::new(is_gt_25) as ArrayRef),
        ("< 45", Arc::new(is_lt_45) as ArrayRef),
        ("between", Arc::new(between) as ArrayRef),
        ("or true", Arc::new(or_true) as ArrayRef),
    ]).unwrap();

    let options = FormatOptions::default()
        .with_null("null");
    let table = pretty_format_batches_with_options(&[batch], &options).unwrap();
    println!("{}", table);
}
```

```text
+----------+------------+-------+-------+---------+---------+
| original | multiplied | > 25  | < 45  | between | or true |
+----------+------------+-------+-------+---------+---------+
| 10       | 20         | false | true  | false   | true    |
| 20       | 40         | false | true  | false   | true    |
| null     | null       | null  | null  | null    | true    |
| 40       | 80         | true  | true  | true    | true    |
| 50       | 100        | true  | false | false   | true    |
+----------+------------+-------+-------+---------+---------+
```
@mkitti @flying-sheep etc, I have made some changes that hopefully address concerns. Firstly, I've written a fairly general section on how other array-to-array codecs could deal with optional data types. Secondly, I've added a recommendation that if implementations wish to impose a specific in-memory representation for null fill values, they should do that through a registered attribute. I think this is fairly reasonable as opposed to adding an additional field to

Spitballing (I have no intention of standardising this or anything similar myself):

```json
"attributes": {
    "py_array_representation": "np.ma.MaskedArray",
    # "py_array_representation": "np.typing.NDArray[np.dtypes.StringDType(na_object=np.nan)]"
    "julia_optional_data_type_missing_element_representation": "missing"
}
```

There could even be a more targeted convention around how missing data should be interpreted that is language-agnostic (e.g. missing, undefined, unknown). But I'd still recommend that as an attribute, and it should not block this PR.
> An `optional` data type with no nesting could be represented using a masked array, such as a NumPy [`numpy.ma.MaskedArray`](https://numpy.org/doc/stable/reference/maskedarray.generic.html).
> A `numpy` array using the `StringDType` with an `na_object` that is not `None` could use the `optional` data type with a `string` underlying data type.
> However, the `na_object` itself would not be stored in the Zarr metadata of the `optional` data type.
> The `na_object` could be set via a runtime option, or alternatively be encoded separately as an attribute, for example.
Suggested change (inserting a list of possible Python representations):

> An `optional` data type with no nesting could be represented using a masked array, such as a NumPy [`numpy.ma.MaskedArray`](https://numpy.org/doc/stable/reference/maskedarray.generic.html).
> In Python, the representation of `null` values in an `optional` data type may depend on the use of other libraries.
>
> * `None` may be used when using only the Python standard library.
> * [`pandas.NA`](https://pandas.pydata.org/docs/reference/api/pandas.NA.html) may be used in conjunction with Pandas.
> * [`polars.null`](https://docs.pola.rs/user-guide/expressions/missing-data/) is an appropriate direct mapping for Polars.
> * [`pyarrow.null`](https://arrow.apache.org/docs/python/generated/pyarrow.null.html#pyarrow.null) could also be used with pyarrow.
> * [`np.nan`](https://numpy.org/doc/2.3/reference/constants.html#numpy.nan) or [`numpy.ma.MaskedArray`](https://numpy.org/doc/stable/reference/maskedarray.generic.html) could be used in NumPy.
>
> A `numpy` array using the `StringDType` with an `na_object` that is not `None` could use the `optional` data type with a `string` underlying data type.
> However, the `na_object` itself would not be stored in the Zarr metadata of the `optional` data type.
> The `na_object` could be set via a runtime option, or alternatively be encoded separately as an attribute, for example.
To be clear, my request has not been to add language-specific implementation details but rather to explicitly state what is defined or not defined for

The other lingering issue is whether
> Defines a data type for optional (nullable) values that can contain either a value of a specified underlying data type or be missing/undefined/null.
> The `optional` data type does not define how null values should be _interpreted_ and represented in-memory.
Suggested change:

> The `optional` data type does not define how null values should be _interpreted_, used semantically, or represented in-memory.
> For nested optional types, this representation is applied recursively.
> ## In-memory representations
While I realize that we intend to use this with the optional codec, is it invalid to create an array with just the bytes codec and an optional data type? If it is valid to create such an array, what is the on-disk representation of the optional data type?

```json
{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [100, 100],
    "data_type": {
        "name": "optional",
        "configuration": {
            "base_type": "uint8"
        }
    },
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [50, 50]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "fill_value": null,
    "codecs": [
        {
            "name": "bytes",
            "configuration": {
                "endian": "little"
            }
        }
    ],
    "attributes": {}
}
```
It is invalid until it is spec'd. A bytes representation would need to be defined for optional data.
I'd rather not include it here and would encourage the split data / mask encoding. However, the bytes codec could be supported simply with a tag (present/missing) + payload encoding, always the same size. A more efficient Rust-like niche optimisation could be supported for certain inner types, but perhaps that is a bit too fancy?
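The tag + payload idea can be sketched as follows. This is an illustration under assumed choices (a 1-byte tag, a `uint8` payload, hypothetical function names); nothing here is spec'd:

```python
import struct

def encode_tagged(values):
    """Fixed-size encoding: 1-byte tag (0 = missing, 1 = present) + 1-byte uint8 payload."""
    out = bytearray()
    for v in values:
        if v is None:
            out += b"\x00\x00"  # missing: tag 0, payload slot still written
        else:
            out += b"\x01" + struct.pack("B", v)
    return bytes(out)

def decode_tagged(buf):
    values = []
    for i in range(0, len(buf), 2):
        tag, payload = buf[i], buf[i + 1]
        values.append(payload if tag == 1 else None)
    return values

encoded = encode_tagged([7, None, 255])
assert len(encoded) == 6  # always the same size per element
assert decode_tagged(encoded) == [7, None, 255]
```

Because every element occupies a fixed two bytes regardless of presence, the representation keeps the fixed-size property that a plain bytes codec expects, at the cost of one tag byte per element that a niche optimisation could avoid.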
I'm still finalising an implementation, but here is a draft spec.