refactor!: Overhaul the StorageClient interface#3570
Conversation
barjin
left a comment
There was a problem hiding this comment.
some quick ideas, otherwise this looks pretty solid to me 👍 thank you!
| createStorageCollectionClient: client[collectionClientName!].bind(client), | ||
| }; | ||
| ): Promise<DatasetClient | KeyValueStoreClient | RequestQueueClient> { | ||
| switch (storageConstructorName) { |
There was a problem hiding this comment.
Some devs were having issues with this when subclassing the storage classes (building an encrypted KVS - ancient Slack thread).
Do we want to do anything about this? The reflection-based approach has always felt a little too hacky to me, but I cannot imagine a better way for this without a larger refactor right now.
There was a problem hiding this comment.
Oh we are in for a refactor of the whole StorageManager, hold on to your pants! And being able to use something like an "encryption wrapper" is definitely on the radar. I'd prefer to do it in a separate PR though.
| const config = serviceLocator.getConfiguration(); | ||
| const requestQueue = new RequestQueue({ id: requestQueueInfo.id, client }, config); | ||
| const rqInfo = await rqClient.getMetadata(); | ||
| const requestQueue = new RequestQueue({ id: rqInfo.id, client: rqClient }, config); |
There was a problem hiding this comment.
idea: is passing the id necessary, if we have the whole "backend" (rqClient) already?
There was a problem hiding this comment.
Just to be clear, are you suggesting that we update the RequestQueue constructor, or are you looking to improve the test code?
There was a problem hiding this comment.
The constructor part - having to pass the id feels like the rqClient should serve multiple queues (and you use id to differentiate which one to access). This imo gets even more intriguing if we rename the option to backend.
Do we need the id in the constructor (synchronously) for something? I thought the RequestQueue / Dataset ... classes are meant as thin wrappers over the backend implementations.
There was a problem hiding this comment.
Probably not, I'd look into this in a subsequent PR. You are not supposed to call the RequestQueue constructor anyway, as a regular Joe, you should use RequestQueue.open.
|
Regarding the |
|
This should now be ready for review. It's a first step in a larger series of refactors to achieve parity with the storage system in crawlee-python. See the PR description for a rough outline of what's missing and let me know if something needs to be added. |
|
|
||
| // Create a key value store for all images we find | ||
| const imageStore = await KeyValueStore.open('images'); | ||
| const imageStore = await KeyValueStore.open({ name: 'images' }); |
There was a problem hiding this comment.
can we keep the old signature supported too? this feels like unnecessary break that will affect most of the users
There was a problem hiding this comment.
We can, but at the cost of higher complexity. Also once we add aliases, it will become even more confusing - if you want to use an alias, you need the "new" signature. I would prefer a clean break here, but your call.
There was a problem hiding this comment.
It's a one-liner, or what am I missing?
function open(k: string | { name: string } | { alias: string }) {
const opts = k = typeof k === 'string' ? { name: k } : k;
// ...
}There was a problem hiding this comment.
This is a one-liner, but the "try name, then ID" logic needs a bit more care. The storage client interface would need to specify clearly what must happen if the storage is not found. Not impossible, but annoying.
There was a problem hiding this comment.
That sounds like a second one-liner. I get your point, but this would really affect pretty much every project, so I would rather preserve the BC here.
If we're making this break, I would at least do this in v3 with a deprecation warning, but IMO better to stay compatible here.
There was a problem hiding this comment.
Sure, let's try it and see what happens
barjin
left a comment
There was a problem hiding this comment.
Thank you, @janbuchar ! I've got a few more questions, mostly interface design (so they might be noop) ⬇️
| clientKey: s.string().optional(), | ||
| timeoutSecs: s.number().optional(), | ||
| }).parse(options); | ||
| async storageExists(id: string, type: 'Dataset' | 'KeyValueStore' | 'RequestQueue'): Promise<boolean> { |
There was a problem hiding this comment.
I'm not sure how I feel about the "constructor name" parameter (as that's exactly the thing causing this).
The alternative would be datasetExists, keyValueStoreExists, and requestQueueExists methods, which kinda bloat the interface. On the other hand, it mimics the structure of the createXYZClient methods.
Wdyt, have you considered something like this?
There was a problem hiding this comment.
I considered the three separate methods approach and it felt odd to me to require three methods just to ensure BC. I'd argue that accepting literals here is much less of a problem than in the linked comment - it's at a much deeper level.
barjin
left a comment
There was a problem hiding this comment.
Just a few more nits, otherwise this is good to go from me 👍
*CollectionClientinterfacesStorageClientnow requires acreate*Clientmethod insteadDataset,RequestQueue,KeyValueStore), thenopenmethod expects aStorageIdentifier({id?: string; name?: string}) to match what crawlee-python doesDeliberately left out:
aliassupportStorageManagerrefactorMemoryStorageinto two storage clients and using crawlee-storage for the filesystem one (once this happens, we can also removeMemoryStorageEmulatorfrom tests)