Skip to content

refactor!: Overhaul the StorageClient interface#3570

Merged
janbuchar merged 21 commits intov4from
storages-refactor
Apr 22, 2026
Merged

refactor!: Overhaul the StorageClient interface#3570
janbuchar merged 21 commits intov4from
storages-refactor

Conversation

@janbuchar
Copy link
Copy Markdown
Contributor

@janbuchar janbuchar commented Apr 14, 2026

Deliberately left out:

  • alias support
  • final StorageManager refactor
  • splitting MemoryStorage into two storage clients and using crawlee-storage for the filesystem one (once this happens, we can also remove MemoryStorageEmulator from tests)
  • revisiting storage "frontend" (Dataset/RequestQueue/KeyValueStore) constructors - unnecessary parameters (refactor!: Overhaul the StorageClient interface #3570 (comment)) etc.

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Apr 14, 2026
@janbuchar janbuchar requested review from B4nan, barjin and l2ysho April 14, 2026 16:02
@janbuchar janbuchar changed the base branch from master to v4 April 14, 2026 16:03
Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some quick ideas, otherwise this looks pretty solid to me 👍 thank you!

createStorageCollectionClient: client[collectionClientName!].bind(client),
};
): Promise<DatasetClient | KeyValueStoreClient | RequestQueueClient> {
switch (storageConstructorName) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some devs were having issues with this when subclassing the storage classes (building an encrypted KVS - ancient Slack thread).

Do we want to do anything about this? The reflection-based approach has always felt a little too hacky to me, but I cannot imagine a better way for this without a larger refactor right now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh we are in for a refactor of the whole StorageManager, hold on to your pants! And being able to use something like an "encryption wrapper" is definitely on the radar. I'd prefer to do it in a separate PR though.

const config = serviceLocator.getConfiguration();
const requestQueue = new RequestQueue({ id: requestQueueInfo.id, client }, config);
const rqInfo = await rqClient.getMetadata();
const requestQueue = new RequestQueue({ id: rqInfo.id, client: rqClient }, config);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idea: is passing the id necessary, if we have the whole "backend" (rqClient) already?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, are you suggesting that we update the RequestQueue constructor, or are you looking to improve the test code?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The constructor part - having to pass the id feels like the rqClient should serve multiple queues (and you use id to differentiate which one to access). This imo gets even more intriguing if we rename the option to backend.

Do we need the id in the constructor (synchronously) for something? I thought the RequestQueue / Dataset ... classes are meant as thin wrappers over the backend implementations.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, I'd look into this in a subsequent PR. You are not supposed to call the RequestQueue constructor anyway, as a regular Joe, you should use RequestQueue.open.

@barjin
Copy link
Copy Markdown
Member

barjin commented Apr 15, 2026

Regarding the client -> backend rename, I had a similar idea when reading the diff :) But once I got to the end, I got used to it, I'm honestly fine with the way it is now

@janbuchar
Copy link
Copy Markdown
Contributor Author

This should now be ready for review. It's a first step in a larger series of refactors to achieve parity with the storage system in crawlee-python. See the PR description for a rough outline of what's missing and let me know if something needs to be added.

@janbuchar janbuchar marked this pull request as ready for review April 17, 2026 07:43

// Create a key value store for all images we find
const imageStore = await KeyValueStore.open('images');
const imageStore = await KeyValueStore.open({ name: 'images' });
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we keep the old signature supported too? this feels like unnecessary break that will affect most of the users

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, but at the cost of higher complexity. Also once we add aliases, it will become even more confusing - if you want to use an alias, you need the "new" signature. I would prefer a clean break here, but your call.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a one-liner, or what am I missing?

function open(k: string | { name: string } | { alias: string }) {
  const opts = k = typeof k === 'string' ? { name: k } : k;
  // ...
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a one-liner, but the "try name, then ID" logic needs a bit more care. The storage client interface would need to specify clearly what must happen if the storage is not found. Not impossible, but annoying.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a second one-liner. I get your point, but this would really affect pretty much every project, so I would rather preserve the BC here.

If we're making this break, I would at least do this in v3 with a deprecation warning, but IMO better to stay compatible here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let's try it and see what happens

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Welp, almost a one-liner 632e0f6

@janbuchar janbuchar self-assigned this Apr 20, 2026
@janbuchar janbuchar requested review from B4nan and barjin April 20, 2026 09:48
Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @janbuchar ! I've got a few more questions, mostly interface design (so they might be noop) ⬇️

Comment thread test/core/crawlers/basic_crawler.test.ts Outdated
clientKey: s.string().optional(),
timeoutSecs: s.number().optional(),
}).parse(options);
async storageExists(id: string, type: 'Dataset' | 'KeyValueStore' | 'RequestQueue'): Promise<boolean> {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how I feel about the "constructor name" parameter (as that's exactly the thing causing this).

The alternative would be datasetExists, keyValueStoreExists, and requestQueueExists methods, which kinda bloat the interface. On the other hand, it mimics the structure of the createXYZClient methods.

Wdyt, have you considered something like this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered the three separate methods approach and it felt odd to me to require three methods just to ensure BC. I'd argue that accepting literals here is much less of a problem than in the linked comment - it's at a much deeper level.

Comment thread packages/memory-storage/src/memory-storage.ts Outdated
Comment thread packages/memory-storage/src/memory-storage.ts
@janbuchar janbuchar requested a review from barjin April 21, 2026 13:39
Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more nits, otherwise this is good to go from me 👍

Comment thread docs/upgrading/upgrading_v4.md
Comment thread packages/core/src/crawlers/crawler_commons.ts Outdated
Comment thread packages/core/src/storages/dataset.ts Outdated
@janbuchar janbuchar merged commit ffff334 into v4 Apr 22, 2026
6 checks passed
@janbuchar janbuchar deleted the storages-refactor branch April 22, 2026 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rework the storage client system

4 participants