The NWB schema defines default_value: 1.0 for conversion. When reading TimeSeries.data.conversion = 1.0 from an NWB file, two distinct scenarios are indistinguishable: (1) the user measured or verified that their data requires no conversion (it is already in SI units), or (2) the user never determined the conversion factor and the system applied the default. This is especially problematic for TimeSeries subtypes that fix the unit, such as ElectricalSeries and CurrentClampSeries (volts) and VoltageClampSeries (amperes). In these types, correct interpretation of the data requires the user to set conversion and offset so that readers can apply the formula real_value_in_units = data * conversion + offset to obtain values in the declared unit. The data is usually stored as raw ADC counts, and if users forget to set the conversion and offset, docval silently fills conversion=1.0, making the file claim that raw counts are volts. The same ambiguity applies to offset (default_value: 0.0).
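To make the stakes concrete, here is a minimal sketch of the formula readers apply. The gain value is made up for illustration; the point is that with conversion=1.0 and offset=0.0, the "converted" values are just the raw counts:

```python
import numpy as np

# Hypothetical raw int16 ADC counts from an acquisition system.
raw_counts = np.array([-2048, 0, 1024, 2047], dtype=np.int16)

# Values a careful user would set explicitly; docval would otherwise
# silently fill conversion=1.0 and offset=0.0.
conversion = 1.953e-7   # volts per ADC count (made-up gain for this example)
offset = 0.0

# The formula readers apply to recover values in the declared unit:
real_value_in_units = raw_counts * conversion + offset

# With the defaults instead, the file would claim the counts ARE volts:
defaulted = raw_counts * 1.0 + 0.0
```

If the defaults were silently applied, `defaulted` equals the raw counts exactly, and nothing in the file signals that a conversion step was skipped.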
Heuristics in the inspector could help: data.dtype being float32 rather than int16 suggests the data may already be calibrated, or if unit is set to something other than "unknown", the user likely considered the conversion. However, these remain heuristics, and the fact is that we are failing to provide proper provenance for this value.
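A sketch of what such an inspector check might look like (a hypothetical helper, not the actual nwbinspector API):

```python
import numpy as np

def conversion_looks_unset(data: np.ndarray, unit: str, conversion: float) -> bool:
    """Heuristic: flag conversion=1.0 as suspicious when the data
    looks like raw, uncalibrated ADC counts."""
    if conversion != 1.0:
        return False  # an explicit non-default value was clearly chosen
    if unit != "unknown":
        return False  # user likely considered units, and thus conversion
    # Integer dtypes suggest raw ADC counts; floats suggest calibrated data.
    return bool(np.issubdtype(data.dtype, np.integer))

print(conversion_looks_unset(np.zeros(4, dtype=np.int16), "unknown", 1.0))   # True
print(conversion_looks_unset(np.zeros(4, dtype=np.float32), "volts", 1.0))   # False
```

Even this best-effort check misfires on data that genuinely is integer-valued in its natural unit, which is exactly why it remains a heuristic rather than provenance.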
I think the best solution is to only write information that the user explicitly provided. Omit the attribute entirely if the user did not specify it (conversion=None), leaving it unset in the HDF5 file. This way None means the value was not set and 1.0 means it was chosen explicitly. When units are set, None also carries unambiguous meaning: no conversion is required and the data is already in the units that unit declares. The cost is backwards compatibility, as existing readers that assume conversion is always present would need updating.
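The proposed convention can be sketched at the HDF5 level (the group path and variable names here are illustrative, not the pynwb write path): the attribute is written only when the user provided it, so its absence is itself information.

```python
import h5py
import numpy as np

# Write side: only record `conversion` when the user explicitly provided it.
with h5py.File("example.nwb", "w") as f:
    dset = f.create_dataset("acquisition/ts/data",
                            data=np.arange(4, dtype=np.int16))
    user_conversion = None  # user never specified it
    if user_conversion is not None:
        dset.attrs["conversion"] = user_conversion  # written only when explicit

# Read side: an absent attribute is unambiguous.
with h5py.File("example.nwb", "r") as f:
    conversion = f["acquisition/ts/data"].attrs.get("conversion")
    if conversion is None:
        print("conversion was never set; provenance is preserved")
    else:
        print(f"conversion was explicitly set to {conversion}")
```

Under this scheme a stored 1.0 is always a deliberate statement, and readers that assume the attribute is always present would need a None check, which is the backwards-compatibility cost mentioned above.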
Two other alternatives: making conversion and offset required at construction time would resolve the ambiguity, but it forces users to supply values even when the data is already in natural units and, worse, forces them to make up values when the information is genuinely unavailable. Alternatively, a companion schema attribute (e.g., is_conversion_user_set: bool) set automatically by the API would preserve backwards compatibility, but it seems logically wrong for the schema to track metadata about other elements of the schema rather than describe scientific data. I prefer this option least.
On a wider scope, two additional considerations are worth noting regarding the use of default_value. First, not all uses of default_value in the schema carry the same risk. There are three categories:
- Provenance-destroying numeric defaults: the core problem described above. These are
conversion (1.0) and offset (0.0) on TimeSeries, as well as ImagingPlane.conversion (1.0). These are valid numeric values indistinguishable from user-provided ones. resolution (-1.0) is the exception: -1.0 is not a natural value a user would set, so it effectively acts as a sentinel.
- Domain-specific units that should be fixed values:
SpatialSeries.data.unit (meters), ImagingPlane.origin_coords.unit (meters), ImagingPlane.grid_spacing.unit (meters), and Subject.age_reference (birth). These define the coordinate system or domain convention for the type and are not meant to be set by the user. Could the schema use something like a fixed value: rather than default_value: for these? They seem different in an important sense.
- Sentinel strings: TimeSeries.description defaults to "no description", TimeSeries.comments defaults to "no comments", AbstractFeatureSeries.data.unit defaults to "see 'feature_units'", and ImageSeries.format defaults to "raw". These are less problematic because the string values are clearly sentinel-like and carry no ambiguity about whether real information was provided.
Second, will the upcoming LinkML migration alleviate this problem? Currently, when a user creates a TimeSeries without specifying conversion, the docval decorator on TimeSeries fills in the default value 1.0 at construction time. This value is then indistinguishable from a user-provided value; there is no API to determine whether conversion was explicitly set or filled by docval. LinkML's ifabsent has clearer semantics, but the documentation does not specify whether there is an API to determine if a value was explicitly provided by the user or filled in by ifabsent.
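A toy illustration of the provenance loss (this is not pynwb's actual docval machinery, just the same pattern in plain Python):

```python
# Once a keyword default fills in at construction time, the caller's
# intent is gone: there is no way to tell whether conversion was passed
# explicitly or filled in by the default.
def make_timeseries(data, unit, conversion=1.0, offset=0.0):
    return {"data": data, "unit": unit,
            "conversion": conversion, "offset": offset}

explicit = make_timeseries([1, 2, 3], "volts", conversion=1.0)
defaulted = make_timeseries([1, 2, 3], "volts")
print(explicit == defaulted)  # True: the two cases are indistinguishable
```

Any fix at the API level would need to break this symmetry, e.g. by using None as the in-memory sentinel and deferring the question of what to write to the file.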
Related issues: #639 ("Question: Functional difference between default_value for datasets and attributes?") and #641 ("Suggestion: Use default_value: null for unused fields in IndexSeries").