Skip to content

refactor!: Make the StorageManager more like StorageInstanceManager from crawlee-python#3584

Open
janbuchar wants to merge 18 commits intov4from
refactor-storage-manager
Open

refactor!: Make the StorageManager more like StorageInstanceManager from crawlee-python#3584
janbuchar wants to merge 18 commits intov4from
refactor-storage-manager

Conversation

@janbuchar
Copy link
Copy Markdown
Contributor

@janbuchar janbuchar commented Apr 20, 2026

@janbuchar janbuchar added the t-tooling Issues with this label are in the ownership of the tooling team. label Apr 20, 2026
@janbuchar janbuchar requested review from B4nan, barjin and l2ysho April 20, 2026 16:39
@janbuchar janbuchar changed the base branch from master to storages-refactor April 20, 2026 16:39
@janbuchar janbuchar changed the title refactor\!: Make the StorageManager more like StorageInstanceManager from crawlee-python refactor!: Make the StorageManager more like StorageInstanceManager from crawlee-python Apr 20, 2026
@janbuchar janbuchar self-assigned this Apr 21, 2026
Base automatically changed from storages-refactor to v4 April 22, 2026 12:45
@janbuchar janbuchar force-pushed the refactor-storage-manager branch from fc1d565 to afc367c Compare April 22, 2026 13:59
@janbuchar janbuchar force-pushed the refactor-storage-manager branch from fc3505c to 4f69d63 Compare April 23, 2026 18:54
Comment thread packages/core/src/storages/storage_instance_manager.ts
@janbuchar janbuchar marked this pull request as ready for review April 23, 2026 20:01
Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @janbuchar !

Here are a few ideas while I still have the context from the previous PR.

Comment thread test/core/crawlers/basic_crawler.test.ts
Comment thread packages/types/src/storages.ts Outdated
Comment on lines +231 to +234
client,
config,
clientOpener,
clientCacheKey,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With client, clientOpener, and clientCacheKey, the client- prefix is imo too overloaded to read this comfortably.

Maybe it's time to switch to the backend naming? iiuc, client and clientCacheKey refer to the "big" client (backend), and clientOpener refers to the DatasetClient etc., right?

Comment on lines +297 to +304
const instance = new cls(
{
id: storageInfo.id,
name: storageInfo.name,
client: subClient,
},
serviceLocator.getConfiguration(),
) as T;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in the previous PR (and tracked in this issue - #3594), shouldn't subClient be enough?

Additionally, won't the storage frontends fetch the configuration from the serviceLocator by default?

Comment thread packages/core/src/storages/storage_instance_manager.ts
Comment thread packages/core/src/storages/storage_instance_manager.ts Outdated
@janbuchar janbuchar requested a review from barjin May 4, 2026 16:05
Copy link
Copy Markdown
Member

@barjin barjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @janbuchar , I have a few more ideas / questions, see below ⬇️

// pass in failed requests back to the `crawler.run()`, otherwise they would be considered as handled and
// ignored - as a failed requests is still handled.
if (this.requestQueue?.name === 'default' && purgeRequestQueue) {
const isDefaultQueue = this.requestQueue?.name === 'default' || this.requestQueue?.name === '__default__';
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const isDefaultQueue = this.requestQueue?.name === 'default' || this.requestQueue?.name === '__default__';
const isDefaultQueue = this.requestQueue?.name === 'default' || this.requestQueue?.alias === '__default__';

Is this right? I haven't got used to this new paradigm yet, but __default__ seems to be referenced to as an alias throughout this PR.

Comment on lines +279 to +282
for (const instance of ServiceLocator.storageInstanceManager.getAllInstances()) {
if ('clearCache' in instance && typeof (instance as any).clearCache === 'function') {
(instance as any).clearCache();
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for the ServiceLocator to access the internals of StorageInstanceManager? Couldn't we extract this to storageInstanceManager.clearCache?

Additionally, if this happens, do we need ServiceLocator.clearStorageManagerCache? Wouldn't calling ServiceLocator.storageInstanceManager.clearCache(); suffice?

}

if (typeof identifier === 'string') {
if (client.storageExists && (await client.storageExists(identifier, 'Dataset'))) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is Dataset hardcoded here? What if we want to spawn, e.g., a named RequestQueue or KVS with this plain string arg? 🤔

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we even using string names here btw? Bundlers hate this (they mangle class names), and I would love v4 to be bundler-friendly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a string enum, it's not compared to a constructor.name or anything. But what Jindra says is valid, the storage type should be an argument to resolveStorageIdentifier and passed to storageExists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t-tooling Issues with this label are in the ownership of the tooling team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants