Multi-file external_file on ImageSeries cannot represent time alignment with gaps #677

@h-mayorquin

The external_file dataset on ImageSeries accepts an array of file paths paired with a starting_frame attribute. I could not find the original discussion around this design (the external_file dataset predates the move to YAML and was already present in the initial JSON schema), but the intent appears to have been covering the common case of consecutive, back-to-back files from a single continuous recording. This is a standard output pattern in camera acquisition software (e.g., Basler Pylon) that automatically splits recordings when files hit a size or container limit. The design only addresses frame access ("which file do I open to get frame N?") and no per-file timing metadata was ever added, which suggests that temporal gaps between files were not a considered use case.

The schema, however, does not state this explicitly. Nothing prevents a writer from listing files with temporal gaps under a single ImageSeries, and some users are already attempting this (see catalystneuro/nwb-video-widgets#34, where a user needs to map a global session time to the right file and local time for each camera across multiple video files with gaps and offsets). When gaps are present, starting_time + rate cannot be used, because a single rate implies uniform spacing across all frames and leaves no way to represent the gap between files. The only alternative is to provide the full concatenated timestamps array, where the gap appears as a jump in timestamp values, and to look up timestamps[starting_frame[i]] to recover the start time of each file. The closest guidance for this is a comment in pynwb#1318, where Oliver Ruebel confirmed that timestamps should be concatenated and used alongside starting_frame, but this is not documented in the schema itself. It is also problematic from a performance standpoint: even when each segment has a constant sampling rate, the writer is forced to store the full timestamps array (one entry per frame across all files), since starting_time + rate cannot represent gaps.
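To make the timestamps[starting_frame[i]] workaround concrete, here is a minimal sketch in plain Python. The function names (`file_start_times`, `locate`) and the toy values are hypothetical and not part of PyNWB or the schema; the sketch assumes `timestamps` is the concatenated, sorted array for all files and `starting_frame` holds the global index of each file's first frame.

```python
from bisect import bisect_right


def file_start_times(timestamps, starting_frame):
    """Recover each file's start time as timestamps[starting_frame[i]]."""
    return [timestamps[i] for i in starting_frame]


def locate(session_time, timestamps, starting_frame):
    """Map a global session time to (file_index, local_frame, local_time).

    Picks the last frame whose timestamp is <= session_time, then the
    file whose starting_frame is the largest one not exceeding that frame.
    """
    frame = bisect_right(timestamps, session_time) - 1
    if frame < 0:
        raise ValueError("session_time precedes the first frame")
    file_index = bisect_right(starting_frame, frame) - 1
    local_frame = frame - starting_frame[file_index]
    local_time = session_time - timestamps[starting_frame[file_index]]
    return file_index, local_frame, local_time


# Toy example (assumed values): two files of 3 frames each at 10 Hz,
# with a gap of several seconds between them.
timestamps = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
starting_frame = [0, 3]
```

With these toy values, `locate(5.1, timestamps, starting_frame)` resolves to file index 1 and local frame 1. Note that a session time falling inside the gap (for example 3.0) maps to the last frame of the first file, so a real reader would likely need an extra check to report "no frame at this time".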

In summary, the current data type does not have all the necessary information to resolve time alignment of individual videos in an ImageSeries with multiple external files. Time alignment is a critical feature of NWB and we should solve this. One option would be to extend the schema to add per-file timing fields (e.g., a per-file starting_time), but I think this is an undesirable increase in complexity with no real gain, as it would mean maintaining two separate timing mechanisms on the same object: one inherited from TimeSeries for the series as a whole, and a new one for individual files. In my view, multi-file external_file should be scoped to its original intent (gapless continuous recordings split across files for storage reasons), and the schema should guide users toward separate ImageSeries objects when they have multiple videos with gaps or offsets. Each ImageSeries already carries its own starting_time, rate, and timestamps through the TimeSeries inheritance, so separate series handle the gap case naturally without any schema changes.
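As a rough illustration of the separate-series proposal, the sketch below turns one gapped multi-file description into one record per file, each with its own starting_time and rate. `SeriesSpec` and `split_into_series` are hypothetical stand-ins for illustration only, not PyNWB classes or functions, and the sketch assumes every file has the same constant rate.

```python
from dataclasses import dataclass


@dataclass
class SeriesSpec:
    """Hypothetical stand-in for one single-file ImageSeries."""
    external_file: str
    starting_time: float
    rate: float


def split_into_series(files, starting_frame, timestamps, rate):
    """Build one spec per file; the start time is timestamps[starting_frame[i]]."""
    return [
        SeriesSpec(external_file=f, starting_time=timestamps[s], rate=rate)
        for f, s in zip(files, starting_frame)
    ]


# Assumed toy values: two 3-frame files at 10 Hz with a gap between them.
specs = split_into_series(
    files=["cam_part1.avi", "cam_part2.avi"],
    starting_frame=[0, 3],
    timestamps=[0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
    rate=10.0,
)
```

Each record carries two numbers of timing metadata instead of one timestamp per frame, which is the storage argument for separate series when every segment is regularly sampled.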

Even for the gapless case, the schema is not fully self-describing: num_samples is missing for external-mode ImageSeries (#561), so a reader cannot determine the duration of the last file without opening it. Adding num_samples would close this gap and likely make #543 (num_frames_per_external_file) redundant, since frame counts for all but the last file are already inferrable from starting_frame.
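The inference described above can be sketched as follows. `frames_per_file` is a hypothetical helper, and `num_samples` is the field proposed in #561, not something the current schema provides for external-mode ImageSeries.

```python
def frames_per_file(starting_frame, num_samples):
    """Frame count of each file from starting_frame plus the total frame count.

    Counts for all but the last file are differences of consecutive
    starting_frame entries; the last file needs num_samples.
    """
    bounds = list(starting_frame) + [num_samples]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Assumed toy values: three files starting at global frames 0, 3 and 7,
# with 12 frames in total.
counts = frames_per_file([0, 3, 7], num_samples=12)
```

Here `counts` is `[3, 4, 5]`. Without `num_samples`, only the first two counts (3 and 4) could be derived, which is the gap the paragraph above points out.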

Finally, it is true that even under the original intent of automatic file rollover, a frame may occasionally be dropped because of a bug. These are recording artifacts, not intentional gaps, and crucially the timestamps path is still available to represent the data in that edge case.
