Add optional chunk caching to get() function#296
Conversation
🦋 Changeset detectedLatest commit: 7fe18d9 The changes in this PR will be included in the next version bump. This PR includes changesets to release 2 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
@manzt - any thoughts on this? We can work around the issue but some kind of support in Zarrita itself would make it cleaner. zarr-python added caching recently so we could also consider modeling a solution closer to that. This PR is just the simplest thing I could think of when I first ran into issues. |
|
out of scope but please let us know how we can improve our cache on the zarr-python side! it's still flagged as experimental, and I would love the kind of feedback / design input one gets from another implementation |
|
Thanks for the PR and the ping! I've been thinking about this. My hesitation is that this adds caching at the indexing level (decompressed chunks), whereas I've generally thought of caching as a store-level concern. The vizarr approach (lru-store.ts) wraps the store itself, which keeps You could even do decompression in a store wrapper - but then we'd need to expose the codec pipeline somehow, which opens a different can of worms. Is decompression specifically the bottleneck? I'd be curious to explore that first since my assumption is that I/O latency is usually the limiting factor, so a store-level LRU (caching raw bytes) would help. That'd help me understand if this needs to live in
Would be curious to see/understand how other implementations are thinking about this use case. |
|
In my testing decompression does seem to be a bottleneck. Here is a gist with my benchmark in case it helps. I think this is for two reasons:
So far we've worked around this by using If these logical chunks are smaller than the source chunk size, it may decompress the same source chunk for each sub-chunk. As another example, we were using Again I totally understand if this is not an intended use case! My other attempt was to create a wrapper around the |
02099c7 to
1facac0
Compare
|
Hey @aganders3, sorry for the slow reply. I've been mulling this over. Your benchmarks are convincing that decompression is a bottleneck. But I don't think All import type { Array, Chunk, DataType, Readable } from "zarrita";
function cached<D extends DataType, S extends Readable>(
arr: Array<D, S>,
maxSize = 256,
): Array<D, S> {
let cache = new Map<string, Chunk<D>>();
return new Proxy(arr, {
get(target, prop, receiver) {
if (prop === "getChunk") {
return async (
chunk_coords: number[],
...rest: unknown[]
): Promise<Chunk<D>> => {
let key = chunk_coords.join("/");
let hit = cache.get(key);
if (hit) {
cache.delete(key);
cache.set(key, hit);
return hit;
}
let chunk = await target.getChunk(chunk_coords, ...rest);
cache.set(key, chunk);
while (cache.size > maxSize) {
cache.delete(cache.keys().next().value!);
}
return chunk;
};
}
return Reflect.get(target, prop, receiver);
},
});
}Then usage is just: let arr = cached(await zarr.open(store, { kind: "array" }));
let a = await zarr.get(arr, [zarr.slice(0, 1), null]);
let b = await zarr.get(arr, [zarr.slice(1, 2), null]); // reuses cached chunksHopefully that covers your use case. I'm going to close this out, but ping me if it doesn't. I do think a byte-level store cache is within scope for the library (similar to zarr-python's |
|
Thanks - no worries on closing. I appreciate the help here and I agree threading the cache through each call to I think what you propose covers my use case, and seems closer to what I was initially trying to do with this attempt but cleaner and more concise. It looks like it will be even better once #384 is landed, so I will keep an eye out for that. |
`AsyncReadable<Options>` existed so stores could receive arbitrary per-call state. Auth headers, presigning context, cancellation signals, even chunk-layer concerns like caching and prefetch priority (#296, vole-core's `wrapArray`). It was a catch-all for extensions that had nowhere else to live, and the cost was a pile of type magic: higher-kinded-type encoding in the store middleware system (#384), threading generic types that didn't actually provide that much type safety. (TypeScript could often bail out to `any` when inference broke.) Those extensions now have proper homes. Store middleware (#384) gives transport-layer concerns (auth, presigning, request transformation) a proper extension point, and the custom `fetch` option on `FetchStore` (#388) handles the per-store cases at the callsite. Chunk-layer concerns will move to `zarr.extendArray` in a follow-up. What's left is `signal`, which now lives properly on the (non-generic) `AsyncReadable` interface and can be passed directly in `zarr.get` and `zarr.set` from the caller: // Before interface AsyncReadable<Options = unknown> { get(key: AbsolutePath, opts?: Options): Promise<Uint8Array | undefined>; } await zarr.get(arr, null, { opts: { signal: ctl.signal } }); // After interface AsyncReadable { get(key: AbsolutePath, opts?: { signal?: AbortSignal }): Promise<Uint8Array | undefined>; } await zarr.get(arr, null, { signal: ctl.signal }); Batched caller signals in `withRangeBatching` are now merged with `AbortSignal.any` instead of a user-supplied `mergeOptions` reducer. The deprecated `opts?: { signal? }` shape still works for one major version and is folded into the new `signal` via `AbortSignal.any`.
`AsyncReadable<Options>` existed so stores could receive arbitrary per-call state. Auth headers, presigning context, cancellation signals, even chunk-layer concerns like caching and prefetch priority (#296, vole-core's `wrapArray`). It was a catch-all for extensions that had nowhere else to live, and the cost was a pile of type magic: higher-kinded-type encoding in the store middleware system (#384), threading generic types that didn't actually provide that much type safety. (TypeScript could often bail out to `any` when inference broke.) Those extensions now have proper homes. Store middleware (#384) gives transport-layer concerns (auth, presigning, request transformation) a proper extension point, and the custom `fetch` option on `FetchStore` (#388) handles the per-store cases at the callsite. Chunk-layer concerns will move to `zarr.extendArray` in a follow-up. What's left is `signal`, which now lives properly on the (non-generic) `AsyncReadable` interface and can be passed directly in `zarr.get` and `zarr.set` from the caller: // Before interface AsyncReadable<Options = unknown> { get(key: AbsolutePath, opts?: Options): Promise<Uint8Array | undefined>; } await zarr.get(arr, null, { opts: { signal: ctl.signal } }); // After interface AsyncReadable { get(key: AbsolutePath, opts?: { signal?: AbortSignal }): Promise<Uint8Array | undefined>; } await zarr.get(arr, null, { signal: ctl.signal }); Batched caller signals in `withRangeBatching` are now merged with `AbortSignal.any` instead of a user-supplied `mergeOptions` reducer. The deprecated `opts?: { signal? }` shape still works for one major version and is folded into the new `signal` via `AbortSignal.any`.
`AsyncReadable<Options>` existed so stores could receive arbitrary per-call state. Auth headers, presigning context, cancellation signals, even chunk-layer concerns like caching and prefetch priority (#296, vole-core's `wrapArray`). It was a catch-all for extensions that had nowhere else to live, and the cost was a pile of type magic: higher-kinded-type encoding in the store middleware system (#384), threading generic types that didn't actually provide that much type safety. (TypeScript could often bail out to `any` when inference broke.) Those extensions now have proper homes. Store middleware (#384) gives transport-layer concerns (auth, presigning, request transformation) a proper extension point, and the custom `fetch` option on `FetchStore` (#388) handles the per-store cases at the callsite. Chunk-layer concerns will move to `zarr.extendArray` in a follow-up. What's left is `signal`, which now lives properly on the (non-generic) `AsyncReadable` interface and can be passed directly in `zarr.get` and `zarr.set` from the caller: // Before interface AsyncReadable<Options = unknown> { get(key: AbsolutePath, opts?: Options): Promise<Uint8Array | undefined>; } await zarr.get(arr, null, { opts: { signal: ctl.signal } }); // After interface AsyncReadable { get(key: AbsolutePath, opts?: { signal?: AbortSignal }): Promise<Uint8Array | undefined>; } await zarr.get(arr, null, { signal: ctl.signal }); Batched caller signals in `withRangeBatching` are now merged with `AbortSignal.any` instead of a user-supplied `mergeOptions` reducer. The deprecated `opts?: { signal? }` shape still works for one major version and is folded into the new `signal` via `AbortSignal.any`.
* Drop `Options` generic and thread `signal` directly `AsyncReadable<Options>` existed so stores could receive arbitrary per-call state. Auth headers, presigning context, cancellation signals, even chunk-layer concerns like caching and prefetch priority (#296, vole-core's `wrapArray`). It was a catch-all for extensions that had nowhere else to live, and the cost was a pile of type magic: higher-kinded-type encoding in the store middleware system (#384), threading generic types that didn't actually provide that much type safety. (TypeScript could often bail out to `any` when inference broke.) Those extensions now have proper homes. Store middleware (#384) gives transport-layer concerns (auth, presigning, request transformation) a proper extension point, and the custom `fetch` option on `FetchStore` (#388) handles the per-store cases at the callsite. Chunk-layer concerns will move to `zarr.extendArray` in a follow-up. What's left is `signal`, which now lives properly on the (non-generic) `AsyncReadable` interface and can be passed directly in `zarr.get` and `zarr.set` from the caller: // Before interface AsyncReadable<Options = unknown> { get(key: AbsolutePath, opts?: Options): Promise<Uint8Array | undefined>; } await zarr.get(arr, null, { opts: { signal: ctl.signal } }); // After interface AsyncReadable { get(key: AbsolutePath, opts?: { signal?: AbortSignal }): Promise<Uint8Array | undefined>; } await zarr.get(arr, null, { signal: ctl.signal }); Batched caller signals in `withRangeBatching` are now merged with `AbortSignal.any` instead of a user-supplied `mergeOptions` reducer. The deprecated `opts?: { signal? }` shape still works for one major version and is folded into the new `signal` via `AbortSignal.any`. * Add regression tests for signal propagation Covers the three paths that previously only worked incidentally through the `Options` generic: `zarr.get()` with a first-class `signal`, propagation through the sharded chunk getter (#306), and the deprecated `opts.signal` shim.

Summary
This adds optional chunk caching to the
get()function to avoid repeated decompression of chunks when accessing overlapping selections or making multiple calls to the same data. I've noticed this can be a significant bottleneck when indexing data slice-by-slice, for example, where the chunks span many slices.This may be a niche need, so no worries if this is out of scope for this library!
New API
cache?: ChunkCacheinGetOptionsChunkCacheinterface withget(key: string): Chunk<DataType> | undefinedandset(key: string, value: Chunk<DataType>): any(meant to be compatible with simpleMap)Usage
Implementation
store_N:${array.path}:${chunkKey}formatWeakMapassigns unique IDs to store instances to prevent cache collisions when sharing caches across stores (perhaps not recommended)