
feat: optional codec and data type #33

Open
LDeakin wants to merge 16 commits into zarr-developers:main from LDeakin:optional_codec_and_data_type

Conversation

Member

@LDeakin LDeakin commented Oct 21, 2025

I'm still finalising an implementation, but here is a draft spec.

Contributor

jbms commented Oct 21, 2025

Looks good. The only issue I see is that the fill value representation does not allow null for the base data type fill value, e.g. for a base data type of json or a nested optional. Instead, you could specify the base data type fill value in a single-element array:

null -> missing
[1] -> value of 1
[null] -> value of null
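The mapping above could be parsed as in this short sketch (purely illustrative; `parse_fill_value` is a hypothetical helper, not part of any implementation):

```python
def parse_fill_value(metadata_fill_value):
    """Interpret an ``optional`` fill value under the proposed
    single-element-array convention:

    null -> the element is missing
    [x]  -> the element is present with inner fill value x,
            where x may itself be null (e.g. for a json base type)
    """
    if metadata_fill_value is None:
        return ("missing", None)
    if isinstance(metadata_fill_value, list) and len(metadata_fill_value) == 1:
        return ("present", metadata_fill_value[0])
    raise ValueError("fill_value must be null or a single-element array")
```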

Contributor

mkitti commented Oct 24, 2025

While I see the simplicity here of toggling between null and a type, I am wondering if there is a further opportunity to introduce a full type tag to implement tagged unions in Zarr.

Member Author

LDeakin commented Oct 28, 2025

not allowing null for the base data type fill value

I suppose it is a limitation, but I'd note that we don't have any data types that permit a null fill value. Can anyone think of any that might arise?

Also, for a multiply nested optional type like Option<Option<u8>>, you could potentially have the "null" at different levels. An implementation would have to track the "depth" of the fill value and change it each time it goes through the optional codec.

Contributor

d-v-b commented Oct 28, 2025

Can anyone think of any that might arise?

I think null will be a valid fill value for the JSON data type

Member Author

LDeakin commented Nov 30, 2025

I've implemented this in zarrs with codec/data type support for arbitrarily nested optional data. I'll add example data later. I have not addressed these two issues:

the only issue I see is the fill value representation not allowing null for the base data type fill value

for a multiply nested optional type like Option&lt;Option&lt;u8&gt;&gt;, you could potentially have the "null" at different levels

I am open to explicit suggestions that satisfy both, or only the first. The latter is a bit burdensome to support for something I suspect nobody would use. What I currently do:

  • null always means an empty element with the optional data type, irrespective of the inner type
  • A non-null fill value gets propagated down to the deepest non-optional inner type.

Contributor

jbms commented Nov 30, 2025

I am open to explicit suggestions that satisfy both, or only the first.

I previously suggested wrapping any non-None fill value in a one-element array. That solves both issues and is syntactically pretty minimal.

Contributor

mkitti commented Dec 1, 2025

I'm still not quite following why this current implementation is not just a special case of a more general sum type, an enum type in Rust.

Currently, Optional has a bit that toggles between None and Some(T). Why not generalize this to any two data types?

null is just a singleton instance of a null_type. The bit in the mask is then a toggle between null_type and T.

Why build this infrastructure for null rather than any two types? Why not generalize further to N types?

Member Author

LDeakin commented Dec 1, 2025

Interesting... masked data has come up in a few discussions I've had, but never a more general sum type. Will people use this? It seems like not many people are complaining about the lack of struct support in Zarr V3.

A more general enum type would probably need to encode each variant through separate codec chains. E.g.

enum EnumType {
  U8(u8),
  String(String),
}

{
    "data_type": {
        "name": "enum",
        "configuration": {
            "data_types": [
                {
                    "name": "uint8"
                },
                {
                    "name": "string"
                }
            ]
        }
    },
    "fill_value": "?",
    "codecs": [
        {
            "name": "enum",
            "configuration": {
                "discriminator_data_type": {
                    "name": "uint8"
                },
                "discriminator_codecs": [
                    {
                        "name": "bytes"
                    }
                ],
                "variant_codecs": [
                    [
                        {
                            "name": "bytes"
                        }
                    ],
                    [
                        {
                            "name": "vlen-utf8"
                        }
                    ]
                ]
            }
        }
    ]
}
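One way to picture the layout proposed above: elements would be split into a discriminator stream (encoded by `discriminator_codecs`) and one value stream per variant (each encoded by its matching `variant_codecs` chain). A rough sketch, where `split_variants` is a hypothetical helper, not part of any spec:

```python
def split_variants(elements):
    """Split (variant_index, value) pairs into a discriminator list and
    per-variant value lists, mirroring the separate codec chains above."""
    discriminators = []
    per_variant = {}
    for index, value in elements:
        discriminators.append(index)
        per_variant.setdefault(index, []).append(value)
    return discriminators, per_variant
```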

Specialising the above for an optional would need a new null_type or similar (zero-sized), as you mentioned. Yet another thing to standardise... I think there is space in the Zarr ecosystem to handle enum, optional, and struct data types separately.

Contributor

mkitti commented Dec 1, 2025

The other two null-like types I would immediately like to use this for are:

  1. NA or missing to represent missing data in the statistical sense.
  2. NaN as a general analog of IEEE 754 floating-point NaN. Practically, we would probably just use a floating point type here.

I do not think these are well represented by null. Each has its own semantics.

julia> NaN == true
false

julia> missing == true
missing

If we could generalize this to any two types and introduce a null_type, I think this could cover a much wider array of statistical and numerical applications.

Contributor

mkitti commented Dec 1, 2025

A recent application is the tracking standard GEFF, where they introduced a missing mask.

https://liveimagetrackingtools.org/geff/latest/specification/#the-props-group-and-node-property-groups

While they use Python's None to represent this, the meaning is that the value does not exist and can either be ignored or imputed, as opposed to the value being undefined or erroneous.

@LDeakin LDeakin marked this pull request as ready for review December 14, 2025 02:24
@LDeakin
Copy link
Copy Markdown
Member Author

LDeakin commented Dec 14, 2025

I previously suggested wrapping any non-None fill value in a one-element array. That solves both issues and is syntactically pretty minimal.

Thanks @jbms, that ended up working out quite nicely! I've updated the spec and implementation, and added example data.


flying-sheep commented Feb 11, 2026

Would it be in-scope to map between the optional codec and nullable string dtype?

So when people do the following, it could be stored as optional?

>>> import numpy as np
>>> import zarr
>>> arr_str = np.array(["x", None], dtype=np.dtypes.StringDType(na_object=np.nan))
>>> arr_str
array(['x', nan], dtype=StringDType(na_object=nan))
>>> g = zarr.open_group(store={})
>>> g["arr_str"] = arr_str  # errors on `main` as of #3695

And be read

>>> g["arr_str"][:]
array(['x', nan], dtype=StringDType(na_object=nan))

I believe since a 1-element array containing np.nan can be serialized, this would work.

Member Author

LDeakin commented Feb 12, 2026

No problem, zarr-python could consider StringDType(na_object=nan) as mapping to

 "data_type": {
    "name": "optional",
    "configuration": {
      "name": "string",
      "configuration": {}
    }
  }

in zarr.json, and handle converting to a masked representation internally. Nulls don't need to be serialised with the actual string data; the mask is used to reconstruct them.
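The round trip could look like this sketch (plain lists for clarity; `to_mask_and_values` and `from_mask_and_values` are hypothetical helpers, and identity comparison with the na_object is an assumption):

```python
def to_mask_and_values(strings, na_object):
    """Split a sequence with NA entries into a validity mask and the
    string payload; the na_object itself is never stored."""
    mask = [s is not na_object for s in strings]  # True = value present
    values = [s if present else "" for s, present in zip(strings, mask)]
    return mask, values

def from_mask_and_values(mask, values, na_object):
    """Reconstruct the original sequence, reinserting the caller's
    na_object wherever the mask marks an element as missing."""
    return [v if present else na_object for present, v in zip(mask, values)]
```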


flying-sheep commented Feb 12, 2026

OK, let me rephrase: np.nan isn’t special, it just happens to be a common choice for a “NA object” that is already serializable by zarr-python (as a 1-element float array).

I’m questioning this:

The value of the fill_value metadata key MUST be null or a single element array containing any valid fill value of the underlying data type.

In the Python world, the NA object is often not of the same type as the array: a nullable numpy string array dtype might have it set to "" (str), None (types.NoneType), numpy.nan (float), or pandas.NA (some anonymous type).1

So I guess as given, there would be no way to express that, and it would have to be expressed externally (e.g. by storing a separate 1-element array with the NA object).

Would it make sense to have this NA value be different from a fill_value (as used in other parts of zarr) to allow for objects that aren’t of the same type, or should this spec stay as it is (which means that plain zarr-python probably won’t support nullable string arrays that don’t happen to have a str-typed NA value)?

Footnotes

  1. The motivation is comparison semantics: np.nan compares unequal to everything, even itself; comparisons with pd.NA propagate pd.NA; and None has no special comparison behavior.

Contributor

d-v-b commented Feb 12, 2026

the numpy string dtype na_object semantics are a bit complex for zarr to handle -- I think na_object can be literally any python object:

>>> np.dtypes.StringDType(na_object={"a": 10}).na_object # set it to a dict
{'a': 10}
>>> np.dtypes.StringDType(na_object=np).na_object # set it to the numpy module
<module 'numpy' from '/Users/d-v-b/.cache/uv/archive-v0/iYIlm0FQZbEH39yv_HetG/lib/python3.13/site-packages/numpy/__init__.py'>
>>> x = {'mutable': True}
>>> np.dtypes.StringDType(na_object=x).na_object
{'mutable': True}
>>> x['mutable'] = False
>>> np.dtypes.StringDType(na_object=x).na_object
{'mutable': False}

When is it important that the na_object gets stored? The alternative where zarr just stores a mask, and readers agree on a convention for interpreting values outside the mask, seems easier at the moment.


flying-sheep commented Feb 12, 2026

I think na_object can be literally any python object

That’s correct. I think it might be possible to change the spec to natively support all of the na_object values that zarr-python can already handle (which might only be np.nan, but at least it’s not an arbitrary choice).

When is it important that the na_object gets stored

See footnote of previous comment. But yeah, deciding that this is out-of-scope of the spec is valid.

As said, I think that’d probably mean that zarr-python won’t natively support np.StringDType(na_object=…) ever, and people would have to manually store things, making zarr-python a bit less convenient, as instead of g["arr"] = str_arr, people would have to pick the codec manually.

Contributor

d-v-b commented Feb 12, 2026

As said, I think that’d probably mean that zarr-python won’t natively support np.StringDType(na_object=…) ever, and people would have to manually store things, making zarr-python a bit less convenient, as instead of g["arr"] = str_arr, people would have to pick the codec manually.

A third option is to roll this logic into a self-contained data type that basically adds the na_object to the variable-length utf8 string configuration. If it's important that readers across language barriers agree on the in-memory interpretation of the na_object, then this is probably the best option.

Contributor

mkitti commented Feb 13, 2026

OK, let me rephrase: np.nan isn’t special, it just happens to be a common choice for a “NA object” that is already serializable by zarr-python (as a 1-element float array).

I really want to emphasize that conflating a (missing or NA) type with NaN is a really terrible choice from a data science perspective.

  • NA indicates a semantically unknown or absent value.
  • NaN indicates a failed or undefined numerical calculation.

Polars has a null value and some discussion on this in the documentation:
https://docs.pola.rs/user-guide/expressions/missing-data/#not-a-number-or-nan-values

There is also a summary of attempts to add a NA type to NumPy in NEP 26:
https://numpy.org/neps/nep-0026-missing-data-summary.html

Just because NumPy has not worked this out should not constrain our choices here.

R, being developed as a statistical language, has had NA values and semantics for quite some time:
https://rlang.r-lib.org/reference/missing.html

Julia copied R's NA semantics as the missing type:
https://docs.julialang.org/en/v1/manual/missing/#missing

Contributor

@mkitti mkitti left a comment

We may need to define null better here. Perhaps we should follow null from Polars and Apache Arrow.

Comment thread data-types/optional/README.md Outdated

For nested optional types, this representation is applied recursively.

The table below demonstrates valid `data_type` and `fill_value` combinations with an `optional` and nested `optional` data type, along with their equivalent Rust [`Option`](https://doc.rust-lang.org/std/option/) values.
Contributor

I think we may need to provide additional mappings here to more clearly illustrate what we mean. In Rust, this conceptually seems more similar to how nullable column types work in Polars and Apache Arrow than how Optional works.

It may be necessary to define null as its own data type. I think we should also clarify if null is also meant to be analogous to NA in R and missing in Julia.

From the scale-offset codec conversation, I can also see a potential need to define how to compute with Optional types. I propose that null values with optional booleans should participate in Kleene ternary logic. Additionally, scaling or offsetting a null value should result in a null value.
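The Kleene (three-valued) logic proposed here can be sketched with Python's None standing in for null (an illustration of the proposal, not spec text; `kleene_and` and `kleene_or` are hypothetical helpers):

```python
def kleene_and(a, b):
    """Three-valued AND: False dominates; otherwise None (null) propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def kleene_or(a, b):
    """Three-valued OR: True dominates; otherwise None (null) propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False
```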


I don't think zarr usually defines runtime behavior of data, which would mean that discussing the “best” semantics here is probably a distraction. That being said, numpy explicitly made it possible to use e.g. pd.NA there, so you're treating it too harshly.

As it currently stands, fill_value is either of the current data type or null, and I don't think zarr is in the business of telling people which runtime behavior in-memory representations of types should have. But maybe using R’s/pandas’ NA as an example wouldn't hurt.

Member Author

It may be necessary to define null as its own data type

Why? null is just being used as a fill value with specific behaviour for the optional data type.

From the scale-offset codec conversation

Codecs like scale_offset could add support for the optional data type if there is demand, but I think that is outside the scope of this PR. Inner codecs in data_codecs would not interact with the optional data type anyway. See below:

While array-to-array codecs MAY support the optional data type, implementations SHOULD use the optional codec as the sole top-level codec.
This approach is preferred because the codecs contained within the optional codec configuration do not need to explicitly handle optional data type semantics.

Array-to-array codecs that perform shape manipulation (e.g. reshape) could be an exception here as they support all data types.

Contributor

@mkitti mkitti Feb 14, 2026

Why? null is just being used as a fill value with specific behaviour for the optional data type.

null has meaning in other contexts, particularly in other storage contexts. The term arises in SQL, Apache Arrow, and Polars. In the data type fill value description here, we state that null indicates the absence of value.

There are two possible ways to interpret an absence of value:

  1. The value does not exist.
  2. The value is not known.

The first interpretation is typically from the perspective of a software engineer. The value is completely absent. This may be because it was not defined. Trying to access a value that has not been defined is then thought of as an error that must be handled.

The second interpretation may be from a statistician. The value is missing because we have not measured it or we do not know it. The value may exist, but it is not known to us. Trying to access the value is not an error; rather, the lack of knowledge is represented.

Knowing the history of this specification, I would guess that you may mean the first interpretation. In Zarr v2, a null fill value also meant "no fill value", i.e. that the values of an "empty chunk" are undefined. However, in the context of other analogous data libraries, SQL, Apache Arrow, and Polars, null means the second interpretation.

Leaving the interpretation ambiguous will result in interoperability issues. In the second interpretation, scale-offset is actually well defined in a number of languages:

Julia:

julia> [1, 2, 3, missing] .- 1
4-element Vector{Union{Missing, Int64}}:
 0
 1
 2
  missing

julia> ([1, 2, 3, missing] .- 1) .* 5
4-element Vector{Union{Missing, Int64}}:
  0
  5
 10
   missing

R:

> c(1,2,3, NA) - 1
[1]  0  1  2 NA

> (c(1,2,3, NA) - 1) * 5
[1]  0  5 10 NA

Please clarify how null should be interpreted here.

Contributor

Another ambiguity is that it is not clear whether the optional data type should be implemented as an enum or via a sentinel value.

It is not clearly stated that the null/missing value is not itself a value of the underlying data type. If it were, a specific value of the underlying data type could act as a sentinel, and one might read the fill_value attribute as defining that sentinel value.

Contributor

@mkitti mkitti Feb 17, 2026

I don't think zarr usually defines runtime behavior of data, which would mean that discussing the “best” semantics here is probably a distraction. That being said, numpy explicitly made it possible to use e.g. pd.NA there, so you're treating it too harshly.

The issue I'm pointing out is that data types do come with some implied runtime behavior. The floating point types rely on IEEE 754 for their formatting, and IEEE 754 also describes arithmetic on those types. There is implied modular arithmetic on fixed-width signed and unsigned integers.

In this case, the definition of this codec and data type come very close to how Pola.rs addresses missing data. For each column, Pola.rs also stores a validity array. In Rust, missing values are also referred to by Option&lt;T&gt; when interacting with Series.

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Explicitly defining a vector with missing values
    let values: Vec<Option<i32>> = vec![
        Some(10), 
        None,      // This becomes a null in Polars
        Some(30), 
        None
    ];

    let s = Series::new("counts".into(), values);
    
    println!("Series with nulls:\n{}", s);
    Ok(())
}

fn decode_series(s: &Series) {
    // Convert a Series back into a Vec of Options
    // .i32() attempts to view the series as Int32 chunks
    let decoded: Vec<Option<i32>> = s.i32()
        .unwrap()
        .into_iter()
        .collect();

    for val in decoded {
        match val {
            Some(v) => println!("Found value: {}", v),
            None    => println!("Found a missing entry!"),
        }
    }
}

However, in Pola.rs, you can multiply a column with missing data by a scalar.

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // 1. Create a DataFrame with Some values and None (nulls)
    let df = df!(
        "sensor_id" => ["A1", "A2", "A3", "A4"],
        "reading" => [Some(10.0), None, Some(25.0), None],
        "visited" => [Some(true), Some(false), None, Some(true)],
        "checked" => [Some(false), None, Some(true), None]
    )?;

    // 2. Perform arithmetic or logic on columns with missing data
    // Polars handles the 'None' entries automatically—they stay 'null'.
    let calibrated_df = df.lazy()
        .with_column(
            (col("reading") * lit(1.5)).alias("calibrated_reading")
        )
        .with_column(
            col("visited").or(col("checked")).alias("visited_or_checked")
        )
        .collect()?;

    println!("Resulting DataFrame:\n{}", calibrated_df);

    Ok(())
}

This results in the following output:

Resulting DataFrame:
shape: (4, 6)
┌───────────┬─────────┬─────────┬─────────┬────────────────────┬────────────────────┐
│ sensor_id ┆ reading ┆ visited ┆ checked ┆ calibrated_reading ┆ visited_or_checked │
│ ---       ┆ ---     ┆ ---     ┆ ---     ┆ ---                ┆ ---                │
│ str       ┆ f64     ┆ bool    ┆ bool    ┆ f64                ┆ bool               │
╞═══════════╪═════════╪═════════╪═════════╪════════════════════╪════════════════════╡
│ A1        ┆ 10.0    ┆ true    ┆ false   ┆ 15.0               ┆ true               │
│ A2        ┆ null    ┆ false   ┆ null    ┆ null               ┆ null               │
│ A3        ┆ 25.0    ┆ null    ┆ true    ┆ 37.5               ┆ true               │
│ A4        ┆ null    ┆ true    ┆ null    ┆ null               ┆ true               │
└───────────┴─────────┴─────────┴─────────┴────────────────────┴────────────────────┘

Contributor

My problem is that I might implement this in Julia. There I have a choice to map null to either missing or nothing. I demonstrated above that I can perform subtraction and multiplication on missing. Trying to do so on nothing is undefined.

julia> [1, 2, 3, nothing] .- 1
ERROR: MethodError: no method matching -(::Nothing, ::Int64)
The function `-` exists, but no method is defined for this combination of argument types.

The minimum I think we should do is provide guidance that, unlike Polars and Apache Arrow, we do not define any arithmetic operations on "null". Scale-offset is then also undefined on an optional data type. As an implementer, I would then avoid implicitly mapping missing data to missing or NA. In Julia, I might choose nothing and force the user to explicitly map those values to missing.

That said, I think we should consider making our null consistent with null as used in Polars and Apache Arrow.


Apache Arrow defines null as “unknown”, not “missing” https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/

Contributor

Apache Arrow defines null as “unknown”, not “missing” https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/

I'm a little confused about which point you are trying to make, because my earlier dichotomy was between "unknown" and "undefined". The Julia missing type represents an "unknown" value and participates in logic and arithmetic.

Just to be clear, I double-checked pyarrow and pyarrow.compute. There, null participates in arithmetic and logic as well.

import pyarrow as pa
import pyarrow.compute as pc

# 1. Create a column with missing data (None represents a Null)
data = [10, 20, None, 40, 50]
column = pa.array(data, type=pa.int64())

# 2. Define your scalar
scalar = 2

# 3. Perform the multiplication
# Arrow handles the null automatically: any_value * null = null
multiplied = pc.multiply(column, scalar)

# 4. Do comparisons
is_greater_than_25 = pc.greater(column, 25)
is_less_than_45 = pc.less(column, 45)

# 5. Logic
between_25_and_45 = pc.and_(is_greater_than_25, is_less_than_45)

print("Original Column:", column)
print("Scalar:", scalar)
print("Multiplied Column:  ", multiplied)
print("Greater than 25 Column:  ", is_greater_than_25)
print("Less than 45 Column:  ", is_less_than_45)
print("Between 25 and 45 Column: ", between_25_and_45)

Original Column: [
  10,
  20,
  null,
  40,
  50
]
Scalar: 2
Multiplied Column:   [
  20,
  40,
  null,
  80,
  100
]
Greater than 25 Column:   [
  false,
  false,
  null,
  true,
  true
]
Less than 45 Column:   [
  true,
  true,
  null,
  true,
  false
]
Between 25 and 45 Column:  [
  false,
  false,
  null,
  true,
  false
]

Contributor

@mkitti mkitti Feb 17, 2026

I also checked Arrow in Rust:

use arrow::array::{ArrayRef, Int64Array, BooleanArray};
use arrow::compute::kernels::cmp::{lt, gt};
use arrow::compute::kernels::numeric::mul;
use arrow::compute::{and,or_kleene};
use arrow::record_batch::RecordBatch;
use arrow::util::pretty::pretty_format_batches_with_options;
use arrow::util::display::FormatOptions;
use std::sync::Arc;

fn main() {
    // 1. Setup Data
    let data = vec![Some(10), Some(20), None, Some(40), Some(50)];
    let column = Int64Array::from(data);
    
    // 2. Perform Operations
    let mul_val = Int64Array::new_scalar(2);
    let gt_val = Int64Array::new_scalar(25);
    let lt_val = Int64Array::new_scalar(45);
    let multiplied = mul(&column, &mul_val).unwrap();
    let is_gt_25 = gt(&column, &gt_val).unwrap();
    let is_lt_45 = lt(&column, &lt_val).unwrap();
    
    // and() still takes two BooleanArrays
    let between = and(&is_gt_25, &is_lt_45).unwrap();

    let true_array = BooleanArray::from(vec![true; between.len()]);
    let or_true = or_kleene(&between, &true_array).unwrap();

    // 3. Collect into a RecordBatch for compact printing
    let batch = RecordBatch::try_from_iter(vec![
        ("original", Arc::new(column) as ArrayRef),
        ("multiplied", Arc::new(multiplied) as ArrayRef),
        ("> 25", Arc::new(is_gt_25) as ArrayRef),
        ("< 45", Arc::new(is_lt_45) as ArrayRef),
        ("between", Arc::new(between) as ArrayRef),
        ("or true", Arc::new(or_true) as ArrayRef)
    ]).unwrap();

    let options = FormatOptions::default()
    .with_null("null");

    let table = pretty_format_batches_with_options(&[batch], &options).unwrap();
    println!("{}", table);
}

+----------+------------+-------+-------+---------+---------+
| original | multiplied | > 25  | < 45  | between | or true |
+----------+------------+-------+-------+---------+---------+
| 10       | 20         | false | true  | false   | true    |
| 20       | 40         | false | true  | false   | true    |
| null     | null       | null  | null  | null    | true    |
| 40       | 80         | true  | true  | true    | true    |
| 50       | 100        | true  | false | false   | true    |
+----------+------------+-------+-------+---------+---------+

Member Author

LDeakin commented Feb 24, 2026

@mkitti @flying-sheep etc, I have made some changes that hopefully address concerns.

Firstly, I've written a fairly general section on how other array-to-array codecs could deal with optional data types.

Secondly, I've added a recommendation that if implementations wish to impose a specific in-memory representation for null fill values, they should do that through a registered attribute. I think this is fairly reasonable as opposed to adding an additional field to optional data type, since it does not impact encoding/decoding and it would be ignored by implementations in alternate languages. I just don't like the idea of mixing in language-specific implementation details that do not impact the actual encoding and decoding of the data.

Spitballing (I have no intention of standardising this or anything similar myself):

"attributes": {
  "py_array_representation": "np.ma.MaskedArray",
  # "py_array_representation": "np.typing.NDArray[np.dtypes.StringDType(na_object=np.nan)]"
  "julia_optional_data_type_missing_element_representation": "missing"
}

There could even be a more targeted convention around how missing data should be interpreted that is language-agnostic (e.g. missing, undefined, unknown). But I'd still recommend that as an attribute, and it should not block this PR.

Comment thread data-types/optional/README.md
Comment on lines +75 to +79
An `optional` data type with no nesting could be represented using a masked array, such as a NumPy [`numpy.ma.MaskedArray`](https://numpy.org/doc/stable/reference/maskedarray.generic.html).

A `numpy` array using the `StringDType` with an `na_object` that is not `None` could use the `optional` data type with a `string` underlying data type.
However, the `na_object` itself would not be stored in the Zarr metadata of the `optional` data type.
The `na_object` could be set via a runtime option, or alternatively be encoded separately as an attribute, for example.
Contributor

Suggested change
An `optional` data type with no nesting could be represented using a masked array, such as a NumPy [`numpy.ma.MaskedArray`](https://numpy.org/doc/stable/reference/maskedarray.generic.html).
A `numpy` array using the `StringDType` with an `na_object` that is not `None` could use the `optional` data type with a `string` underlying data type.
However, the `na_object` itself would not be stored in the Zarr metadata of the `optional` data type.
The `na_object` could be set via a runtime option, or alternatively be encoded separately as an attribute, for example.
In Python, the representation of `null` values in an `optional` data type may depend on the use of other libraries.
* `None` may be used when using only the Python standard library.
* [`pandas.NA`](https://pandas.pydata.org/docs/reference/api/pandas.NA.html) may be used in conjunction with Pandas.
* [`polars.null`](https://docs.pola.rs/user-guide/expressions/missing-data/) is an appropriate direct mapping for Polars.
* [`pyarrow.null`](https://arrow.apache.org/docs/python/generated/pyarrow.null.html#pyarrow.null) could also be used with pyarrow.
* [`np.nan`](https://numpy.org/doc/2.3/reference/constants.html#numpy.nan) or [`numpy.ma.MaskedArray`](https://numpy.org/doc/stable/reference/maskedarray.generic.html) could be used in NumPy.
A `numpy` array using the `StringDType` with an `na_object` that is not `None` could use the `optional` data type with a `string` underlying data type.
However, the `na_object` itself would not be stored in the Zarr metadata of the `optional` data type.
The `na_object` could be set via a runtime option, or alternatively be encoded separately as an attribute, for example.

@mkitti
Contributor

mkitti commented Feb 25, 2026

I just don't like the idea of mixing in language-specific implementation details that do not impact the actual encoding and decoding of the data.

To be clear, my request has not been to add language-specific implementation details, but rather to explicitly state what is or is not defined for null in this specification and what is left to implementations or codecs. 348a3d8 mostly accomplishes this, although I would also explicitly state that the specification does not define how null should be used semantically.

The other lingering issue is whether null should be its own data type. In other words, should we allow a Zarr array of data_type null to exist, where all values are exactly null and thus no chunks need to be stored and no validity bitmap is required? That would allow us to create metadata-only arrays with an explicit shape and chunking but no data. Such an array could serve as a template, or it could be useful for explicitly constructing a very sparse array by combining datasets with something like xarray.concat via VirtualiZarr.


Defines a data type for optional (nullable) values that can contain either a value of a specified underlying data type or be missing/undefined/null.

The `optional` data type does not define how null values should be _interpreted_ and represented in-memory.
Contributor


Suggested change
The `optional` data type does not define how null values should be _interpreted_ and represented in-memory.
The `optional` data type does not define how null values should be _interpreted_, used semantically, or represented in-memory.


For nested optional types, this representation is applied recursively.
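The recursive fill-value representation can be sketched as a small decoder, assuming the single-element-array wrapping discussed earlier in this thread (`null` is missing at a level; `[x]` is present, with `x` the fill value one level down). The function name `unwrap_fill_value` is illustrative, not part of any implementation.

```python
def unwrap_fill_value(fill_value, depth):
    """Peel one wrapping per optional nesting level.

    Returns (level, value): the level at which null was hit, or
    (depth, base_value) if a concrete base-type fill value was reached.
    """
    for level in range(depth):
        if fill_value is None:
            return (level, None)  # missing at this nesting level
        if not (isinstance(fill_value, list) and len(fill_value) == 1):
            raise ValueError("expected null or a single-element list")
        fill_value = fill_value[0]
    return (depth, fill_value)

# optional<uint8>:          null -> missing, [1] -> value 1
# optional<optional<uint8>>: [null] -> inner missing, [[3]] -> value 3
```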

## In-memory representations
Contributor


While I realize that we intend to use this with the optional codec, is it invalid to create an array with just the bytes codec and an optional data type? If it is valid to create such an array, what is the on-disk representation of the optional data type?

{
    "zarr_format": 3,
    "node_type": "array",
    "shape": [100, 100],
    "data_type": {
        "name": "optional",
        "configuration": {
            "base_type": "uint8"
        }
    },
    "chunk_grid": {
        "name": "regular",
        "configuration": {
            "chunk_shape": [50, 50]
        }
    },
    "chunk_key_encoding": {
        "name": "default",
        "configuration": {
            "separator": "/"
        }
    },
    "fill_value": null,
    "codecs": [
        {
            "name": "bytes",
            "configuration": {
                "endian": "little"
            }
        }
    ],
    "attributes": {}
}

Member Author


It is invalid until it is spec'd. A bytes representation would need to be defined for optional data.

I'd rather not include it here and would encourage the split data / mask encoding. However, the bytes codec could be supported simply with a tag (present/missing) + payload encoding, where every element has the same fixed encoded size. A more efficient Rust-like niche optimisation could be supported for certain inner types, but perhaps that is a bit too fancy?
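For illustration only, the tag + payload idea could look like the following for an `optional<uint8>` element stream: one tag byte (1 = present, 0 = missing) followed by a fixed-size payload, so every element occupies the same number of bytes. This is not part of the spec, and the function names are hypothetical.

```python
import struct

def encode_optional_u8(elements):
    # Each element: 1 tag byte + 1 payload byte (zeroed when missing).
    out = bytearray()
    for e in elements:
        if e is None:
            out += struct.pack("<BB", 0, 0)  # tag 0: payload is ignored
        else:
            out += struct.pack("<BB", 1, e)  # tag 1: payload is the value
    return bytes(out)

def decode_optional_u8(data):
    return [payload if tag else None
            for tag, payload in struct.iter_unpack("<BB", data)]

encoded = encode_optional_u8([7, None, 255])
assert decode_optional_u8(encoded) == [7, None, 255]
assert len(encoded) == 6  # fixed 2 bytes per element
```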
