Conversation
dstansby
left a comment
There was a problem hiding this comment.
I've left a first round of comments - hopefully enough to start iterating on. I haven't yet reviewed the implementation or tests, and will do that next.
Co-authored-by: David Stansby <dstansby@gmail.com>
| "ignore:Unclosed client session <aiohttp.client.ClientSession.*:ResourceWarning" | ||
| ] | ||
| markers = [ | ||
| "asyncio: mark test as asyncio test", |
There was a problem hiding this comment.
Did you need to add this to stop some tests failing (or pytest in general failing)? It seems unrelated to the tests you added, so I'm a bit surprised.
There was a problem hiding this comment.
I can remove it for now, but yes I think there were some failing tests within the github actions when I didn't include it. Tests pass locally fine without it.
dstansby
left a comment
There was a problem hiding this comment.
I've reviewd the implementation, and have left (lots of!) comments. I get the impression that you might have taken code from v2 of zarr-python and added code to work with v3 of zarr-python, but then haven't removed a lot of the code that is no longer relevant for v3 zarr-python. For example, there are lots of hasattr() checks that will either always return False or True if you look at the definition of the Store abstract class, and there are lots of methods you've defined that duplicate functionality in the async methods.
I think my comments inline cover most of what needs changing/updating, but it would be good if you could cross check what you've written against some other v3 stores to make sure the code structure and methods implemented are similar.
Co-authored-by: David Stansby <dstansby@gmail.com>
dstansby
left a comment
There was a problem hiding this comment.
Some general comments:
- There are several
if hasattr(...)blocks that are redundant, because all stores define those attributes. You can check which properties/attributes/methods all stores have at https://zarr.readthedocs.io/en/stable/api/zarr/abc/store/index.html#zarr.abc.store.Store. These redudnant blocks should be simplified. I started commenting on these, but there's enough that I stopped, so you'll have to find them all. - There are several branches that claim to handle "dict-like" objects, but
self._storeis always a store. So these branches can be removed. I started commenting on these, but there's enough that I stopped, so you'll have to find them all. - There are a few methods that are redundant or duplicate functionality, and should be removed (see inline comments)
| Extract directory listing from store keys by filtering keys that start with the given path. | ||
| """ | ||
| children: set[str] = set() | ||
| # Handle both Store objects and dict-like objects |
There was a problem hiding this comment.
Why are you handling both Store and dict-like objects, when the type of store in the signature only allows Store objects?
There was a problem hiding this comment.
Remove handling of both types of objects.
| self._keys_cache = [] | ||
| return self._keys_cache | ||
|
|
||
| def listdir(self, path: Path) -> list[str]: |
There was a problem hiding this comment.
This whole method can be deleted, because it's made redundant by list_dir(), which is the method that the Store base class says has to be implemented.
| return sorted(children) | ||
|
|
||
|
|
||
| def listdir(store: Store, path: Path | None = None) -> list[str]: |
There was a problem hiding this comment.
This whole function can be deleted, because it's only used in listdir() below, which can also be delted (see other comment below).
| supports_writes: bool = True | ||
| supports_deletes: bool = True | ||
| supports_partial_writes: bool = True | ||
| supports_listing: bool = True |
There was a problem hiding this comment.
I think this will not always be True, because it depends on if the underlying store supports listing. In fact, I think all these properties will depend on the underlying store. So they should be re-implemented as properties that return the values from the underlying store.
There was a problem hiding this comment.
Made into properties that use underlying store.
| self._contains_cache.clear() | ||
| self._listdir_cache.clear() | ||
|
|
||
| def _invalidate_keys_unsafe(self) -> None: |
There was a problem hiding this comment.
Instead of defining this and invalidate_keys() and duplicating code, I would keep just invalidate_keys(), and when you need to call it just always call it outsdie a with self._mutex: block (because it handles the mutex iteself).
There was a problem hiding this comment.
Fair enough, implemented this.
| self._contains_cache.clear() | ||
| self._listdir_cache.clear() | ||
|
|
||
| def _invalidate_value_unsafe(self, key: Any) -> None: |
| self._check_writable() | ||
|
|
||
| # Check if it's a Store object vs dict-like object | ||
| if hasattr(self._store, "supports_listing"): |
There was a problem hiding this comment.
This is redundant, all stores define the supports_listing property.
| with self._mutex: | ||
| self._invalidate_keys_unsafe() | ||
| cache_key = self._normalize_key(key) | ||
| self._invalidate_value_unsafe(cache_key) |
There was a problem hiding this comment.
Here I think it's easier to just do _invalidate_keys() and then _invalidate_value() without the mutex block.
|
|
||
| async def exists(self, key: str) -> bool: | ||
| # Delegate to the underlying store | ||
| return await self._store.exists(key) |
There was a problem hiding this comment.
Shouldn't this be checking the cache first instead of checking the store every time? Otherwise it defeats the point of having the cache!
There was a problem hiding this comment.
True, matching with getsize now.
Co-authored-by: David Stansby <dstansby@gmail.com>
I've addressed a lot of the inline comments and removed the dict-like branches, but I'm getting lots of the same mypy errors when removing if hasattr blocks either: Will we always be delegating to the underlying store in these cases? In which instance, I can get rid of the if-else statement which resolves the mypy issues. If we want to have the underlying store or the cache for these methods then I'm not sure how to keep the if-else statement without running into mypy issues. |
ruaridhg
left a comment
There was a problem hiding this comment.
I've commented and/or implemented changes for the inline comments.
| Extract directory listing from store keys by filtering keys that start with the given path. | ||
| """ | ||
| children: set[str] = set() | ||
| # Handle both Store objects and dict-like objects |
There was a problem hiding this comment.
Remove handling of both types of objects.
| self._keys_cache = [] | ||
| return self._keys_cache | ||
|
|
||
| def listdir(self, path: Path) -> list[str]: |
| return sorted(children) | ||
|
|
||
|
|
||
| def listdir(store: Store, path: Path | None = None) -> list[str]: |
| supports_writes: bool = True | ||
| supports_deletes: bool = True | ||
| supports_partial_writes: bool = True | ||
| supports_listing: bool = True |
There was a problem hiding this comment.
Made into properties that use underlying store.
| supports_partial_writes: bool = True | ||
| supports_listing: bool = True | ||
|
|
||
| root: Path |
| with self._mutex: | ||
| self._invalidate_keys_unsafe() | ||
| cache_key = self._normalize_key(key) | ||
| self._invalidate_value_unsafe(cache_key) |
| self._check_writable() | ||
|
|
||
| # Check if it's a Store object vs dict-like object | ||
| if hasattr(self._store, "supports_listing"): |
|
|
||
| async def exists(self, key: str) -> bool: | ||
| # Delegate to the underlying store | ||
| return await self._store.exists(key) |
There was a problem hiding this comment.
True, matching with getsize now.
| underlying_store = self._store.with_read_only(read_only) | ||
| return LRUStoreCache(underlying_store, max_size=self._max_size) | ||
|
|
||
| def _normalize_key(self, key: str | Path) -> str: |
There was a problem hiding this comment.
Changed to cache_key = key without normalize_key function so that it's obvious that it's a cached key but can change if cleaner.
| raise ValueError("max_size must be a positive integer (bytes)") | ||
|
|
||
| # Always inherit read_only state from the underlying store | ||
| read_only = getattr(store, "read_only", False) |
|
dstansby
left a comment
There was a problem hiding this comment.
The implementaiton is looking real nice now 🕺 . I left a few more comments, but nothing major. Mostly more code that I think can be binned.
I had a first look at the tests, and they're a bit funny at the moment. I really like the idea behind CountingDict to track the method calls, but it needs some improvement because you should be testing the method calls on an actual Store object, not a dict. I left a suggestion as to the best way to implement this in the inline comments. Once that's done, I will probably have some more comments/questions on the tests.
| from zarr.testing.store import StoreTests | ||
|
|
||
|
|
||
| class CountingDict(dict[Any, Any]): |
There was a problem hiding this comment.
Checking how many times the different methods have been called is a really neat idea! This currently needs some fixing though, because you should be testing how many times method on a Store obejct is called, but this is not a Store object (although it sort of looks like one).
So my recommendation to fix this is to remove CountingDict, and instead implement a store that is a thin wrapper of a MemoryStore, that also contains logic to track method calls. Something like:
class CounterStore(MemoryStore):
"""
A thin wrapper of MemoryStore to count different method calls for testing.
"""
def __init__(self) -> None:
super().__init__()
self.counter: Counter[tuple[str, Any] | str] = Counter()
async def clear(self) -> None:
self.counter["clear"] += 1
# docstring inherited
self._store_dict.clear()
# TODO: implement other methods that should be trackedDoes that make sense?
dstansby
left a comment
There was a problem hiding this comment.
I have some more questions, left inline. As well as checking the tests run, can you check that pre-commit runs too, and fix any errors? It looks like there's currently a few typing errors that need addressing in the new code.
| self.misses += 1 | ||
| size = await self._store.getsize(key) | ||
|
|
||
| # Try to get and cache the value if it's reasonably small i.e. ≤10% of max cache size |
There was a problem hiding this comment.
What was the motivation for you choosing this threshold for the value size?
One of the ideas of the cache store is it's useful for stores (e.g., remote stores) where making a query for the store is the reason for them being slow, not then getting a value from the store. So my thinking was it getsize(key) takes just as long as getvalue(key) for a remote store, the implementation of LRUCacheStore.getsize might as well 1) get the value 2) cache it, and 3) 'manually' calculate and return the size of the value, instead of relying on the underlying _store.getsize(key) implementation.
Please could you do some benchmarking using a remote store to see if it makes sense to do what I suggested above?
There was a problem hiding this comment.
I changed this now and did some quick benchmarking so your suggestion is correct that the time saved is worth the new implementation.
I've run pre-commit for the 2 files I changed i.e. _cache.py and the tests |
There was a problem hiding this comment.
Looks like you accidentally committed this file?
There was a problem hiding this comment.
Looks like you accidentally committed this file too?
[Description of PR]
TODO:
docs/user-guide/*.rstchanges/