NWB use cases and schema requirements: acquisition, processing, and archival #675

@h-mayorquin

@bendichter @rly @oruebel @ehennestad

In today's CN+LBNL NWB sync (2026-03-17) we discussed #672, and the team agreed to reframe it as follows: NWB is a single format used in at least three distinct contexts, and requirements that make sense in one context can be inappropriate or impossible in another.

There are at least three distinct contexts in which NWB files are created and used:

  • Acquisition. NWB files written directly from hardware during an experiment, by tools like Open Ephys or AqNWB. At this stage, many required schema fields are simply unavailable: the software does not know the anatomical location of the probe, the calcium indicator, or the excitation_lambda. Time alignment between data streams is performed online and is inherently approximate. The session_start_time may be ambiguous. Requiring well-annotated, fully aligned data at this stage is not realistic.

  • Processing / conversion. This context covers two related but distinct cases.

    Analysis tools like CaImAn and SpikeInterface operate on already-processed data and offer NWB as an output format. They do not have access to experimental metadata: CaImAn hardcodes location="brain" and indicator="OGB-1" because that information is simply not in its input data and the fields are required. SpikeInterface users may want to write quality metrics or spike sorting results to NWB at some point during a pipeline, again without having the full experimental context available.

    Automated conversion tools like NeuroConv play a different role: they can integrate multiple source files, align timestamps across sources, and build a considerably more complete NWB file than any single analysis tool could. But there are hard limits to what can be inferred automatically. The experimental context that the user holds (which brain region, which subject, what protocol) is often not encoded in any source file and cannot be filled in automatically. Still, we should be able to produce files without those fields so that downstream tools or the user can complete them later.

    In both cases, this is where the placeholder problem of #672 (Required fields need official placeholders for automated NWB file builders) is most acute; a minimal PyNWB sketch after this list illustrates it. Time alignment can be corrected at this stage (post-hoc alignment is more accurate than online), but the file is not yet complete for archival purposes.

  • Archival. NWB files published to DANDI or shared for reuse. Here, the bar should be high: accurate time alignment, complete metadata, meaningful values in all required fields. DANDI already enforces stricter requirements on top of the schema via the NWB Inspector, creating a de facto two-tiered system. An archival file with location="unknown" is a quality failure.
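To make the processing-context placeholder problem concrete, here is a minimal PyNWB sketch (not CaImAn's actual code): an ImagingPlane requires location, indicator, and excitation_lambda, so a tool that only sees processed data has to invent them. All values and object names below are illustrative, and depending on the PyNWB version other fields (e.g. imaging_rate) may also be required.

```python
from datetime import datetime, timezone
from uuid import uuid4

from pynwb import NWBFile
from pynwb.ophys import OpticalChannel

# A processing tool only sees its input data; none of the experimental
# metadata below is available to it, yet the schema requires it.
nwbfile = NWBFile(
    session_description="output of a processing pipeline",
    identifier=str(uuid4()),
    session_start_time=datetime.now(timezone.utc),  # often ambiguous or unknown here
)

device = nwbfile.create_device(name="Microscope")  # placeholder device
optical_channel = OpticalChannel(
    name="OpticalChannel",
    description="not available to the processing tool",
    emission_lambda=500.0,  # required, but unknown at this stage
)
nwbfile.create_imaging_plane(
    name="ImagingPlane",
    optical_channel=optical_channel,
    description="not available to the processing tool",
    device=device,
    excitation_lambda=488.0,  # required, but unknown at this stage
    indicator="OGB-1",        # the placeholder CaImAn hardcodes today
    location="brain",         # the placeholder CaImAn hardcodes today
)
```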

When NWB was created, DANDI did not exist, so the schema carried all the validation weight alone. The heavy field requirements were the only filter between a well-annotated file and a poorly-annotated one. Now that DANDI and the NWB Inspector exist, a two-tier system is already in place in practice, even if it was never explicitly designed that way.

Two concrete examples illustrate how the same design question looks different across these contexts. The placeholder problem in #672 is a direct consequence of applying archival-level field requirements to acquisition and processing contexts: a MissingValueEnum or making certain fields optional both address this, but the right approach depends on which context we are designing for. Similarly, @oruebel raised that online time alignment between data streams during acquisition is often less precise than post-hoc correction, so an archival context can reasonably expect well-aligned timestamps while an acquisition context cannot. This has already surfaced concretely in AqNWB: NeurodataWithoutBorders/aqnwb#251 discusses how clock offsets between acquisition devices and the file writer need to be stored, precisely because online alignment is approximate and the corrected alignment happens post-hoc.
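For illustration, a generic post-hoc alignment sketch (this is not AqNWB's implementation; all names and numbers are made up): if the same sync events are recorded on both the device clock and the file writer's clock, a linear fit recovers offset and drift and lets the device timestamps be remapped after acquisition.

```python
import numpy as np

# The same sync events, timestamped on the primary (file writer) clock and
# on an acquisition device's clock. Values are illustrative.
sync_on_primary_clock = np.array([0.000, 10.001, 20.003, 30.004])  # seconds
sync_on_device_clock = np.array([5.200, 15.199, 25.200, 35.199])   # seconds

# Fit device_time -> primary_time as a linear map (captures offset and slow drift).
slope, intercept = np.polyfit(sync_on_device_clock, sync_on_primary_clock, deg=1)

def to_primary_clock(device_timestamps):
    """Map timestamps from the device clock onto the primary clock."""
    return slope * np.asarray(device_timestamps) + intercept

# Post-hoc correction of the device's timestamps; this is what an archival
# file can reasonably be expected to contain.
aligned = to_primary_clock(np.array([5.2, 6.2, 7.2]))
```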

My preferred direction. The approach I favor is the one @bendichter proposed in the meeting: make the problematic fields non-required in the base schema. I think we should calibrate the requirements to the acquisition context, where the fewest fields are available. This is the simplest path and does not force anyone to invent data or adjust their pipelines. It opens NWB to more use cases and makes it easier to write. This does not mean abandoning the idea that NWB enforces structure: the type system, data organization, and semantics are still there; we simply become more permissive about required metadata.

The concern @rly raised, that users complain "I have a valid NWB file, why can't I upload it to DANDI?", is real, but I don't think it has a clean solution at the schema level. DANDI's requirements have changed before and may change again, and the schema cannot play a cat-and-mouse game chasing them. If other archives emerge with different requirements, the problem compounds: the schema cannot satisfy all of them simultaneously and should not try. The right place to address this is better communication from archives about what they require on top of NWB, combined with deeper integration of the NWB Inspector with the APIs (PyNWB, MatNWB). Perhaps we can add per-archive validation profiles to the Inspector.
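For reference, the de facto two-tier check already looks roughly like the sketch below. The PyNWB call is standard; the nwbinspector names (inspect_nwbfile, load_config) follow my reading of its documentation and should be treated as assumptions to verify. A per-archive profile would generalize the second step.

```python
from pynwb import NWBHDF5IO, validate
from nwbinspector import inspect_nwbfile, load_config  # names assumed from the docs

nwbfile_path = "sub-01_ses-01.nwb"  # hypothetical file

# Tier 1: does the file conform to the NWB schema?
with NWBHDF5IO(nwbfile_path, mode="r") as io:
    schema_errors = validate(io=io)

# Tier 2: does it meet DANDI's stricter expectations on top of the schema?
dandi_messages = list(
    inspect_nwbfile(
        nwbfile_path=nwbfile_path,
        config=load_config(filepath_or_keyword="dandi"),
    )
)
```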
