Would it be possible to use a video codec as chunk compressor? #1086

Description

@FirefoxMetzger

I have a dataset of around 500k (half a million) videos that I want to train ML models on, so I am looking for an efficient data format that lets me read the data quickly. Each video is 10 s long and subsampled to 256x256 at 10 Hz, i.e., a decoded video can be viewed as a (frame, height, width, channel) ndarray of shape (100, 256, 256, 3) and dtype uint8, and the entire dataset as an ndarray of shape (500k, 100, 256, 256, 3) and dtype uint8.

The naive approach would be to store the data as individual video files. This is the format I have now, but it isn't ideal because each file is only about 1 MB on disk. This makes loading a pain since I can't keep that many file handles open. Instead, I am constantly opening and closing small files, and I am wondering whether this is really the best way to go.
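As a rough back-of-the-envelope check on the sizes involved, the snippet below multiplies out the shapes given above. It assumes that "about 1 MB" means 10**6 bytes per encoded file, so the resulting ratio is only an estimate.

```python
import math

# Figures taken from the description above.
VIDEO_SHAPE = (100, 256, 256, 3)  # (frame, height, width, channel), dtype uint8
NUM_VIDEOS = 500_000

# Assumption: "about 1 MB on disk" is taken as exactly 10**6 bytes.
ENCODED_BYTES_PER_VIDEO = 1_000_000

# uint8 means one byte per element, so the element count is the byte count.
raw_bytes_per_video = math.prod(VIDEO_SHAPE)
raw_bytes_total = raw_bytes_per_video * NUM_VIDEOS
encoded_bytes_total = ENCODED_BYTES_PER_VIDEO * NUM_VIDEOS
compression_ratio = raw_bytes_per_video / ENCODED_BYTES_PER_VIDEO

print(f"raw size per video:   {raw_bytes_per_video / 1e6:.2f} MB")
print(f"raw size of dataset:  {raw_bytes_total / 1e12:.2f} TB")
print(f"encoded dataset size: {encoded_bytes_total / 1e12:.2f} TB")
print(f"approx. compression:  {compression_ratio:.1f}x")
```

Under that assumption, a decoded video is roughly 19.7 MB, the decoded dataset is close to 10 TB, and the video codec saves a factor of about 20, which is why storing uncompressed arrays is unattractive here.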

My other idea is to store the dataset (or shards of it) as Zarr arrays where each chunk is compressed using a video codec. This way I can keep the amazing compression ratio of video codecs while also getting some of Python's nice ndarray semantics. I realize that this might be a crazy idea, but part of me thinks that it sounds like the kind of crazy that deserves a try.

Would something like this be achievable with Zarr?
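To make the idea concrete: Zarr delegates chunk compression to codec objects that expose `encode` and `decode` methods (the numcodecs `Codec` interface), so a video-backed compressor would presumably have the shape sketched below. This is only an illustration under several assumptions: `FrameDeltaCodec`, its `codec_id`, and the `frame_nbytes` parameter are hypothetical names, the class does not subclass `numcodecs.abc.Codec` so that it stays standard-library only, and a lossless frame-delta plus zlib scheme stands in for a real video codec, which would need ffmpeg bindings and would usually be lossy.

```python
import zlib


class FrameDeltaCodec:
    """Hypothetical chunk codec with a numcodecs-style encode/decode pair.

    Stand-in for a real video codec: each frame is stored as its byte-wise
    difference (mod 256) from the previous frame, and the residuals are
    deflated with zlib. Unlike most video codecs, this is lossless.
    """

    codec_id = "frame_delta_zlib"  # hypothetical identifier

    def __init__(self, frame_nbytes, level=6):
        self.frame_nbytes = frame_nbytes  # height * width * channels for uint8
        self.level = level                # zlib compression level

    def _split(self, raw):
        # Cut a flat byte string into equally sized frames.
        n = self.frame_nbytes
        if n <= 0 or len(raw) % n:
            raise ValueError("buffer is not a whole number of frames")
        return [raw[i:i + n] for i in range(0, len(raw), n)]

    def encode(self, buf):
        prev = bytes(self.frame_nbytes)  # all-zero reference for frame 0
        residuals = bytearray()
        for frame in self._split(bytes(buf)):
            residuals += bytes((c - p) & 0xFF for c, p in zip(frame, prev))
            prev = frame
        return zlib.compress(bytes(residuals), self.level)

    def decode(self, buf):
        prev = bytes(self.frame_nbytes)
        out = bytearray()
        for res in self._split(zlib.decompress(bytes(buf))):
            frame = bytes((r + p) & 0xFF for r, p in zip(res, prev))
            out += frame
            prev = frame
        return bytes(out)


# Tiny synthetic "video": 10 identical 8x8 RGB frames, flattened to bytes.
frame_nbytes = 8 * 8 * 3
chunk = bytes(range(frame_nbytes)) * 10

codec = FrameDeltaCodec(frame_nbytes)
encoded = codec.encode(chunk)
restored = codec.decode(encoded)
```

The round trip is exact here only because the stand-in is lossless. With a lossy video codec the decoded chunk would differ from the input, and each chunk would have to hold whole frames (for example, one video per chunk) so that the codec can exploit the redundancy between consecutive frames.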
