Cache storage.Client per-thread in GCSFileSystem#929
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces thread-local caching of the Google Cloud Storage client in GCSFileSystem.sign to avoid redundant client instantiations while ensuring thread isolation. However, storing a threading.local object directly as an instance attribute will break pickling of GCSFileSystem instances, which is critical for distributed environments like Dask. To resolve this, the reviewer suggests storing the thread-local clients in a module-level weakref.WeakKeyDictionary mapping filesystem instances to their respective threading.local containers, and provides corresponding code suggestions for the implementation and tests.
| import threading | ||
| import uuid | ||
| import warnings | ||
| import weakref |
There was a problem hiding this comment.
Storing a threading.local object directly as an instance attribute (self._thread_local_storage_client) will break pickling of GCSFileSystem instances, raising TypeError: cannot pickle '_thread._local' object. This is a critical issue because gcsfs filesystems are frequently pickled and distributed in multi-processing/distributed environments like Dask.
To resolve this cleanly without breaking pickling, we can store the thread-local clients in a module-level weakref.WeakKeyDictionary mapping the filesystem instances to their respective threading.local containers.
| import threading | |
| import uuid | |
| import warnings | |
| import weakref | |
| import threading | |
| import uuid | |
| import warnings | |
| import weakref | |
| _THREAD_LOCAL_STORAGE_CLIENTS = weakref.WeakKeyDictionary() |
There was a problem hiding this comment.
lets make sure all testing scenarios including pickling are added for this change like integration tests which run on actual bucket for single process multi thread env and multi process with single thread env and multi thread with multi thread env, to make sure this is not breaking for any scenario
Also, please update PR desciprion with before and after numbers using real world scneario which is benefitted from this
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #929 +/- ##
==========================================
+ Coverage 89.50% 89.68% +0.18%
==========================================
Files 16 16
Lines 3553 3558 +5
==========================================
+ Hits 3180 3191 +11
+ Misses 373 367 -6 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| credentials=self.credentials.credentials, | ||
| project=self.project, | ||
| ) | ||
| client = storage.Client( |
There was a problem hiding this comment.
Is this Client fundamentally different from the one used for IO elsewhere? Can some part of the IO client be used instead, since that's running on a known asyncio thread anyway?
There was a problem hiding this comment.
It's different from IO client because it contains a complex cryptographic process. https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/_signing.py
The
signmethod was instantiating a new synchronousstorage.Clienton every single call. Creating clients is highly expensive due to the overhead of credential resolution.This commit introduces a thread-local cache (
_thread_local_storage_client) on theGCSFileSysteminstance usingthreading.local(). The client is lazily initialized on the first call tosign()within any given thread and cached as the.instanceattribute.Subsequent calls to
sign()on the same thread reuse the cached client, completely eliminating the redundant instantiation overhead. Using a thread-local container guarantees full thread-safety, ensuring threads do not share the underlying connection pool or transport.