LRUStoreCache by ruaridhg · Pull Request #1 · ruaridhg/zarr-python

ruaridhg · 2025-08-27T14:34:01Z

[Description of PR]

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.rst
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

dstansby

I've left a first round of comments - hopefully enough to start iterating on. I haven't yet reviewed the implementation or tests, and will do that next.

dstansby · 2025-09-01T09:23:56Z

    "ignore:Unclosed client session <aiohttp.client.ClientSession.*:ResourceWarning"
 ]
 markers = [
+    "asyncio: mark test as asyncio test",


Did you need to add this to stop some tests failing (or pytest in general failing)? It seems unrelated to the tests you added, so I'm a bit surprised.

I can remove it for now, but yes, I think some tests failed in GitHub Actions when I didn't include it. The tests pass fine locally without it.

dstansby

I've reviewed the implementation, and have left (lots of!) comments. I get the impression that you might have taken code from v2 of zarr-python and added code to work with v3 of zarr-python, but then haven't removed a lot of the code that is no longer relevant for v3 zarr-python. For example, there are lots of hasattr() checks that will either always return False or True if you look at the definition of the Store abstract class, and there are lots of methods you've defined that duplicate functionality in the async methods.

I think my comments inline cover most of what needs changing/updating, but it would be good if you could cross check what you've written against some other v3 stores to make sure the code structure and methods implemented are similar.

dstansby

Some general comments:

There are several if hasattr(...) blocks that are redundant, because all stores define those attributes. You can check which properties/attributes/methods all stores have at https://zarr.readthedocs.io/en/stable/api/zarr/abc/store/index.html#zarr.abc.store.Store. These redundant blocks should be simplified. I started commenting on these, but there's enough that I stopped, so you'll have to find them all.
There are several branches that claim to handle "dict-like" objects, but self._store is always a store. So these branches can be removed. I started commenting on these, but there's enough that I stopped, so you'll have to find them all.
There are a few methods that are redundant or duplicate functionality, and should be removed (see inline comments)

dstansby · 2025-09-04T13:14:22Z

+    Extract directory listing from store keys by filtering keys that start with the given path.
+    """
+    children: set[str] = set()
+    # Handle both Store objects and dict-like objects


Why are you handling both Store and dict-like objects, when the type of store in the signature only allows Store objects?

Remove handling of both types of objects.

dstansby · 2025-09-04T13:20:02Z

+                    self._keys_cache = []
+        return self._keys_cache
+
+    def listdir(self, path: Path) -> list[str]:


This whole method can be deleted, because it's made redundant by list_dir(), which is the method that the Store base class says has to be implemented.

Deleted this.

dstansby · 2025-09-04T13:20:35Z

+    return sorted(children)
+
+
+def listdir(store: Store, path: Path | None = None) -> list[str]:


This whole function can be deleted, because it's only used in listdir() below, which can also be deleted (see other comment below).

Deleted this.

dstansby · 2025-09-04T13:22:44Z

+    supports_writes: bool = True
+    supports_deletes: bool = True
+    supports_partial_writes: bool = True
+    supports_listing: bool = True


I think this will not always be True, because it depends on if the underlying store supports listing. In fact, I think all these properties will depend on the underlying store. So they should be re-implemented as properties that return the values from the underlying store.

Made into properties that use underlying store.
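
A minimal sketch of what delegating these capability flags could look like. `FakeStore` and `CachingWrapper` are hypothetical stand-ins used only for illustration, not the classes in this PR; only the `supports_*` names come from the diff above.

```python
from dataclasses import dataclass


@dataclass
class FakeStore:
    """Hypothetical stand-in for an underlying store (not a zarr class)."""

    supports_writes: bool = True
    supports_deletes: bool = True
    supports_partial_writes: bool = False
    supports_listing: bool = False


class CachingWrapper:
    """Hypothetical wrapper whose capability flags mirror the wrapped store."""

    def __init__(self, store: FakeStore) -> None:
        self._store = store

    @property
    def supports_writes(self) -> bool:
        return self._store.supports_writes

    @property
    def supports_deletes(self) -> bool:
        return self._store.supports_deletes

    @property
    def supports_partial_writes(self) -> bool:
        return self._store.supports_partial_writes

    @property
    def supports_listing(self) -> bool:
        return self._store.supports_listing


wrapper = CachingWrapper(FakeStore())
```

With this shape, wrapping a store that cannot list reports `supports_listing` as `False` instead of a hard-coded `True`.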

dstansby · 2025-09-04T13:45:56Z

+            self._contains_cache.clear()
+            self._listdir_cache.clear()
+
+    def _invalidate_keys_unsafe(self) -> None:


Instead of defining this and invalidate_keys() and duplicating code, I would keep just invalidate_keys(), and when you need to call it just always call it outside a with self._mutex: block (because it handles the mutex itself).

Fair enough, implemented this.
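
The pattern suggested above might be sketched as follows. `CacheState` and `on_write` are hypothetical names, and the choice of a plain non-reentrant `threading.Lock` is an assumption; the PR may use a different lock type. With a non-reentrant lock, calling a self-locking method from inside a `with self._mutex:` block would deadlock, which is one reason to keep a single locking variant and call it from outside the lock.

```python
import threading


class CacheState:
    """Hypothetical holder for caches like those named in the diff above."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()  # assumption: plain, non-reentrant lock
        self._keys_cache: list[str] = ["a", "b"]
        self._contains_cache: set[str] = {"a"}
        self._values_cache: dict[str, bytes] = {"a": b"123"}

    def invalidate_keys(self) -> None:
        # Acquires the lock itself, so callers must not already hold it.
        with self._mutex:
            self._keys_cache = []
            self._contains_cache.clear()

    def invalidate_value(self, key: str) -> None:
        # Same rule: this method owns the locking.
        with self._mutex:
            self._values_cache.pop(key, None)

    def on_write(self, key: str) -> None:
        # Called outside any `with self._mutex:` block.
        self.invalidate_keys()
        self.invalidate_value(key)


state = CacheState()
state.on_write("a")
```

The two invalidations are no longer atomic as a pair, which is acceptable in this sketch because each one only discards cached data.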

dstansby · 2025-09-04T13:46:10Z

+        self._contains_cache.clear()
+        self._listdir_cache.clear()
+
+    def _invalidate_value_unsafe(self, key: Any) -> None:


Same comment as above.

dstansby · 2025-09-04T13:49:28Z

+        self._check_writable()
+
+        # Check if it's a Store object vs dict-like object
+        if hasattr(self._store, "supports_listing"):


This is redundant, all stores define the supports_listing property.

dstansby · 2025-09-04T13:50:57Z

+        with self._mutex:
+            self._invalidate_keys_unsafe()
+            cache_key = self._normalize_key(key)
+            self._invalidate_value_unsafe(cache_key)


Here I think it's easier to just do _invalidate_keys() and then _invalidate_value() without the mutex block.

dstansby · 2025-09-04T13:51:19Z

+
+    async def exists(self, key: str) -> bool:
+        # Delegate to the underlying store
+        return await self._store.exists(key)


Shouldn't this be checking the cache first instead of checking the store every time? Otherwise it defeats the point of having the cache!

True, matching with getsize now.
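
A rough sketch of a cache-first `exists`, under stated assumptions: `CountingStore` and `CachedExists` are hypothetical stand-ins, and the cache here is a plain dict with no eviction. It only illustrates that a cached value already proves the key exists, so the underlying store need not be queried.

```python
import asyncio


class CountingStore:
    """Hypothetical stand-in for a slow underlying store."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = data
        self.exists_calls = 0

    async def exists(self, key: str) -> bool:
        self.exists_calls += 1
        return key in self._data


class CachedExists:
    """Hypothetical wrapper that consults its value cache before the store."""

    def __init__(self, store: CountingStore) -> None:
        self._store = store
        self._values_cache: dict[str, bytes] = {}

    async def exists(self, key: str) -> bool:
        # A cached value proves the key exists; skip the store round trip.
        if key in self._values_cache:
            return True
        return await self._store.exists(key)


async def demo() -> tuple[bool, bool, int]:
    store = CountingStore({"a": b"1", "b": b"2"})
    cache = CachedExists(store)
    cache._values_cache["a"] = b"1"  # pretend "a" was read earlier
    hit = await cache.exists("a")  # answered from the cache
    miss = await cache.exists("b")  # falls through to the store
    return hit, miss, store.exists_calls
```

In `demo()`, only the lookup for `"b"` reaches the underlying store.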

ruaridhg · 2025-09-05T09:46:14Z

Some general comments:

There are several if hasattr(...) blocks that are redundant, because all stores define those attributes. You can check which properties/attributes/methods all stores have at https://zarr.readthedocs.io/en/stable/api/zarr/abc/store/index.html#zarr.abc.store.Store. These redundant blocks should be simplified. I started commenting on these, but there's enough that I stopped, so you'll have to find them all.

There are several branches that claim to handle "dict-like" objects, but self._store is always a store. So these branches can be removed. I started commenting on these, but there's enough that I stopped, so you'll have to find them all.

There are a few methods that are redundant or duplicate functionality, and should be removed (see inline comments)

I've addressed a lot of the inline comments and removed the dict-like branches, but when I remove the if hasattr blocks I keep getting one of two mypy errors:
Function "list" could always be true in boolean context [truthy-function] - when I changed it to if self._store.list:
Statement is unreachable [unreachable] - when I changed it to if callable(self._store.keys):

async def list(self) -> AsyncIterator[str]:
    # Delegate to the underlying store
    if self._store.list:
        async for key in self._store.list():
            yield key
    else:
        # Fallback for stores that don't have async list
        if callable(self._store.keys):
            for key in list(self._store.keys()):
                yield key

Will we always be delegating to the underlying store in these cases? If so, I can get rid of the if-else statement, which resolves the mypy issues. If we want to use either the underlying store or the cache for these methods, then I'm not sure how to keep the if-else statement without running into mypy issues.

ruaridhg

I've commented and/or implemented changes for the inline comments.

ruaridhg · 2025-09-05T07:48:58Z

+    Extract directory listing from store keys by filtering keys that start with the given path.
+    """
+    children: set[str] = set()
+    # Handle both Store objects and dict-like objects


Remove handling of both types of objects.

ruaridhg · 2025-09-05T07:52:13Z

+                    self._keys_cache = []
+        return self._keys_cache
+
+    def listdir(self, path: Path) -> list[str]:


Deleted this.

ruaridhg · 2025-09-05T07:52:56Z

+    return sorted(children)
+
+
+def listdir(store: Store, path: Path | None = None) -> list[str]:


Deleted this.

ruaridhg · 2025-09-05T08:00:20Z

+    supports_writes: bool = True
+    supports_deletes: bool = True
+    supports_partial_writes: bool = True
+    supports_listing: bool = True


Made into properties that use underlying store.

ruaridhg · 2025-09-05T08:01:13Z

+    supports_partial_writes: bool = True
+    supports_listing: bool = True
+
+    root: Path


Removed this.

ruaridhg · 2025-09-05T08:23:41Z

+        with self._mutex:
+            self._invalidate_keys_unsafe()
+            cache_key = self._normalize_key(key)
+            self._invalidate_value_unsafe(cache_key)


ruaridhg · 2025-09-05T08:24:23Z

+        self._check_writable()
+
+        # Check if it's a Store object vs dict-like object
+        if hasattr(self._store, "supports_listing"):


ruaridhg · 2025-09-05T08:26:36Z

+
+    async def exists(self, key: str) -> bool:
+        # Delegate to the underlying store
+        return await self._store.exists(key)


True, matching with getsize now.

ruaridhg · 2025-09-05T08:31:37Z

+        underlying_store = self._store.with_read_only(read_only)
+        return LRUStoreCache(underlying_store, max_size=self._max_size)
+
+    def _normalize_key(self, key: str | Path) -> str:


Changed to cache_key = key, without the _normalize_key function, so that it's still obvious that it's a cache key. I can change this if something else is cleaner.

ruaridhg · 2025-09-05T09:12:58Z

+            raise ValueError("max_size must be a positive integer (bytes)")
+
+        # Always inherit read_only state from the underlying store
+        read_only = getattr(store, "read_only", False)


Yep changed this now.

dstansby · 2025-09-05T09:56:36Z

Function "list" could always be true in boolean context [truthy-function] - when I changed it to if self._store.list:
Statement is unreachable [unreachable] - when I changed it to if callable(self._store.keys):

.list is a function, so I'm not actually sure how if self._store.list: is evaluated. Instead, you can use .supports_listing to check if the store supports listing or not, and if it doesn't, an error should be raised by LRUStoreCache because it won't be possible to list the underlying store.
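
A minimal sketch of the `supports_listing` check described above. A bound method object is always truthy in Python, so `if self._store.list:` can never take its `else` branch, which is what the mypy messages point at. The classes below are hypothetical stand-ins, and raising `NotImplementedError` is an assumption; the thread does not say which exception type should be used.

```python
import asyncio
from collections.abc import AsyncIterator


class ListableStore:
    """Hypothetical store that supports listing."""

    supports_listing = True

    def __init__(self, keys: tuple[str, ...]) -> None:
        self._keys = keys

    async def list(self) -> AsyncIterator[str]:
        for key in self._keys:
            yield key


class UnlistableStore:
    """Hypothetical store that cannot list its keys."""

    supports_listing = False


class CacheWrapper:
    """Hypothetical wrapper that delegates listing to the wrapped store."""

    def __init__(self, store) -> None:
        self._store = store

    async def list(self) -> AsyncIterator[str]:
        # Check the capability flag rather than the truthiness of a method.
        if not self._store.supports_listing:
            raise NotImplementedError("underlying store does not support listing")
        async for key in self._store.list():
            yield key


async def collect(store) -> list[str]:
    return [key async for key in CacheWrapper(store).list()]
```

Because there is only one code path after the check, the unreachable fallback branch disappears.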

dstansby

The implementation is looking real nice now 🕺 . I left a few more comments, but nothing major. Mostly more code that I think can be binned.

I had a first look at the tests, and they're a bit funny at the moment. I really like the idea behind CountingDict to track the method calls, but it needs some improvement because you should be testing the method calls on an actual Store object, not a dict. I left a suggestion as to the best way to implement this in the inline comments. Once that's done, I will probably have some more comments/questions on the tests.

dstansby · 2025-09-11T12:26:18Z

+from zarr.testing.store import StoreTests
+
+
+class CountingDict(dict[Any, Any]):


Checking how many times the different methods have been called is a really neat idea! This currently needs some fixing though, because you should be testing how many times methods on a Store object are called, but this is not a Store object (although it sort of looks like one).

So my recommendation to fix this is to remove CountingDict, and instead implement a store that is a thin wrapper of a MemoryStore, that also contains logic to track method calls. Something like:

from collections import Counter
from typing import Any

from zarr.storage import MemoryStore


class CounterStore(MemoryStore):
    """
    A thin wrapper of MemoryStore to count different method calls for testing.
    """

    def __init__(self) -> None:
        super().__init__()
        self.counter: Counter[tuple[str, Any] | str] = Counter()

    async def clear(self) -> None:
        self.counter["clear"] += 1
        # docstring inherited
        self._store_dict.clear()

    # TODO: implement other methods that should be tracked

Does that make sense?

dstansby

I have some more questions, left inline. As well as checking the tests run, can you check that pre-commit runs too, and fix any errors? It looks like there's currently a few typing errors that need addressing in the new code.

dstansby · 2025-09-15T12:40:50Z

+        self.misses += 1
+        size = await self._store.getsize(key)
+
+        # Try to get and cache the value if it's reasonably small i.e. ≤10% of max cache size


What was the motivation for you choosing this threshold for the value size?

One of the ideas of the cache store is that it's useful for stores (e.g., remote stores) where the round trip of making a query is what makes them slow, rather than transferring the value itself. So my thinking was that if getsize(key) takes just as long as getvalue(key) for a remote store, the implementation of LRUStoreCache.getsize might as well 1) get the value 2) cache it, and 3) 'manually' calculate and return the size of the value, instead of relying on the underlying _store.getsize(key) implementation.

Please could you do some benchmarking using a remote store to see if it makes sense to do what I suggested above?

I've changed this now and did some quick benchmarking. Your suggestion is correct: the time saved is worth the new implementation.
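
The three steps suggested above (get the value, cache it, compute the size locally) might be sketched like this. `SlowStore` and `SizeCachingWrapper` are hypothetical stand-ins, values are plain `bytes`, and eviction is omitted, so this is an illustration of the idea rather than the PR's implementation.

```python
import asyncio


class SlowStore:
    """Hypothetical remote-like store where every request is expensive."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = data
        self.requests = 0

    async def get(self, key: str) -> bytes:
        self.requests += 1
        return self._data[key]


class SizeCachingWrapper:
    """Hypothetical wrapper whose getsize populates the value cache."""

    def __init__(self, store: SlowStore) -> None:
        self._store = store
        self._values_cache: dict[str, bytes] = {}

    async def get(self, key: str) -> bytes:
        if key not in self._values_cache:
            self._values_cache[key] = await self._store.get(key)
        return self._values_cache[key]

    async def getsize(self, key: str) -> int:
        # 1) get the value, 2) cache it (inside get), 3) measure it locally.
        value = await self.get(key)
        return len(value)


async def demo() -> tuple[int, bytes, int]:
    store = SlowStore({"a": b"abc"})
    cache = SizeCachingWrapper(store)
    size = await cache.getsize("a")  # one request to the underlying store
    value = await cache.get("a")  # served from the cache
    return size, value, store.requests
```

In `demo()`, a `getsize` followed by a `get` for the same key costs a single request to the underlying store.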

ruaridhg · 2025-09-16T08:10:14Z

I have some more questions, left inline. As well as checking the tests run, can you check that pre-commit runs too, and fix any errors? It looks like there's currently a few typing errors that need addressing in the new code.

I've run pre-commit for the 2 files I changed i.e. _cache.py and the tests

dstansby · 2025-09-17T09:03:24Z

Looks like you accidentally committed this file?

dstansby · 2025-09-17T09:04:34Z

Looks like you accidentally committed this file too?

ruaridhg and others added 9 commits August 7, 2025 14:27

Add _cache.py first attempt

abb764e

test.py ran without error, creating test.zarr/

d72078f

Added testing for cache.py LRUStoreCache for v3

e1266b4

Fix ruff errors

40e6f46

Add working example comparing LocalStore to LRUStoreCache

eadc7bb

Delete test.py to clean-up

5f90a71

Added lrustorecache to changes and user-guide docs

ae51d23

Fix linting issues

e58329a

Merge branch 'zarr-developers:main' into rmg/LRUStoreCache

995ad1b

ruaridhg marked this pull request as ready for review August 27, 2025 14:37

Fix doctest errors

e84ebbe

dstansby reviewed Aug 29, 2025

ruaridhg and others added 5 commits August 29, 2025 11:48

Update docs/user-guide/lrustorecache.rst

8b22c6b

Co-authored-by: David Stansby <dstansby@gmail.com>

Update LRUStoreCache docstring and modify max_size to remove None as an option

715296e

Expand changes description

34328f4

Improve wording in lrustorecache.rst

6033416

Fix pre-commit errors and failing tests

b41d9e4

dstansby reviewed Sep 1, 2025

Remove asyncio marker from pyproject.toml

54322d2

dstansby reviewed Sep 1, 2025

ruaridhg and others added 10 commits September 1, 2025 12:46

Apply suggestions from code review

ae65b38

Co-authored-by: David Stansby <dstansby@gmail.com>

Fixed failing tests with some PR review comments addressed

94634b3

Modify **_item before potential deletion

b31fd7c

Remove **_item methods

f211f9a

Add warning for data exceeding cache and test

5431e41

Remove unused functions

fcab264

Fix linting

b27014d

Add tests to increase code coverage

aa9f12e

Add methods for consistency with other stores

b7f4458

Add in test for else statement in listdir

fde1ff7

ruaridhg requested a review from dstansby September 4, 2025 10:06

Modify listdir method for LRUStoreCache

2a2692f

dstansby reviewed Sep 4, 2025

ruaridhg and others added 4 commits September 4, 2025 16:38

Apply suggestions from code review

7e5b83d

Co-authored-by: David Stansby <dstansby@gmail.com>

Matching underline lengths for titles

5915d84

Address latest PR comments removing redundant functions and updating tests to match

7c1ff74

Remove hasattr and dict-like object references

8aaef7e

ruaridhg commented Sep 5, 2025

Fix remaining mypy issues

80fa2b2

ruaridhg requested a review from dstansby September 5, 2025 14:01

dstansby reviewed Sep 11, 2025

ruaridhg added 2 commits September 15, 2025 11:53

Updated _cache.py to remove redundant functions

4f4be57

Add tests for new getsize implementation

b4c2aca

ruaridhg requested a review from dstansby September 15, 2025 12:20

dstansby reviewed Sep 15, 2025

ruaridhg added 5 commits September 15, 2025 15:14

Modify getsize

7761c5c

Delete test files

95353d9

Delete local tests

5ade440

Remove dict-like references in LRUStoreCache and tests

8b46576

Remove dimension separator test function

115390f

ruaridhg requested a review from dstansby September 16, 2025 15:36

dstansby reviewed Sep 17, 2025

Comment thread caching_store.py Outdated

dstansby Sep 17, 2025

Looks like you accidentally committed this file?

dstansby reviewed Sep 17, 2025

Comment thread uv.lock Outdated

dstansby Sep 17, 2025

Looks like you accidentally committed this file too?

Remove unused files

51aeab7
