
refactor!: Rename storages related methods from get_ to open_ #1418

Closed
Pijukatel wants to merge 6 commits into master from rename-get-storages-to-open

Conversation

Collaborator

@Pijukatel commented Sep 19, 2025

Description

Renamed (see the sketch below):

  • BasicCrawler.get_dataset -> BasicCrawler.open_dataset
  • BasicCrawler.get_key_value_store -> BasicCrawler.open_key_value_store

Added:

  • BasicCrawler.open_request_queue

Fixed:

  • Potential storage client change when running the crawler again with purge_on_start=True
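For reference, a minimal before/after sketch of the rename (import path per crawlee's public Python API; illustrative only):

import asyncio

from crawlee.crawlers import BasicCrawler


async def main() -> None:
    crawler = BasicCrawler()

    # Before this PR: await crawler.get_dataset()
    dataset = await crawler.open_dataset()

    # Before this PR: await crawler.get_key_value_store()
    kvs = await crawler.open_key_value_store()

    # Newly added by this PR:
    rq = await crawler.open_request_queue()


asyncio.run(main())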

Issues

Testing

  • Added test

Checklist

  • CI passed

@github-actions Bot added this to the 123rd sprint - Tooling team milestone Sep 19, 2025
@github-actions Bot added the t-tooling and tested labels Sep 19, 2025
@Pijukatel changed the title Rename get storages to open refactor!: Rename get storages to open Sep 19, 2025
@Pijukatel changed the title refactor!: Rename get storages to open refactor!: Rename storages realted methods from get_... to open_... Sep 19, 2025
@Pijukatel changed the title refactor!: Rename storages realted methods from get_... to open_... refactor!: Rename storages related methods from get_... to open_... Sep 19, 2025
@Pijukatel changed the title refactor!: Rename storages related methods from get_... to open_... refactor!: Rename storages related methods from get_ to open_ Sep 19, 2025
@Pijukatel requested a review from vdusek September 19, 2025 13:37
Collaborator

@vdusek left a comment


I am not sure about those:

BasicCrawler.get_request_manager -> BasicCrawler.open_request_manager 

This should return the already configured request manager that is being used by the crawler. Not opening a new one. So the get here makes sense, probably.

And this:

BasicCrawler.open_request_queue

not sure whether it is needed or whether it makes sense (more potential problematic edge cases?)...

Revert get_request_manager
@Pijukatel
Collaborator Author

I am not sure about those:

BasicCrawler.get_request_manager -> BasicCrawler.open_request_manager 

This should return the already configured request manager that is being used by the crawler. Not opening a new one. So the get here makes sense, probably.

Ok, I am fine with that.

And this:

BasicCrawler.open_request_queue

not sure whether it is needed or whether it makes sense (more potential problematic edge cases?)...

This is needed when thinking about the Crawler having its own service_locator. So when you want an RQ created from such a crawler, you need such a helper method. On top of that, the fixed bug was exactly this problem: the crawler would use some global RQ instead of the one specific to it. A sketch of that scenario follows.
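A hypothetical sketch of that scenario (the failure mode is simplified; Configuration, MemoryStorageClient and the constructor parameters follow crawlee's public API):

import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import BasicCrawler
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    crawler = BasicCrawler(
        configuration=Configuration(purge_on_start=True),
        storage_client=MemoryStorageClient(),  # crawler-specific, not global
    )
    # Before the fix, running the crawler again could resolve the request
    # queue through the global service locator instead of the crawler's own
    # storage client.
    rq = await crawler.open_request_queue()  # the helper added in this PR


asyncio.run(main())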

@Pijukatel marked this pull request as ready for review September 19, 2025 13:49
@Pijukatel requested a review from vdusek September 19, 2025 13:49
Collaborator

@vdusek left a comment


LGTM, but please spend some time describing the changes in the upgrading guide.

Comment thread docs/upgrading/upgrading_to_v1.md Outdated
Comment on lines +215 to +222
## BasicCrawler changes

### Renamed methods for opening storages
`BasicCrawler.get_dataset` -> `BasicCrawler.open_dataset`
`BasicCrawler.get_key_value_store` -> `BasicCrawler.open_key_value_store`

### Added method for opening RequestQueue that uses configuration and storage client of the crawler
`BasicCrawler.open_request_queue`
Collaborator


Use words. This will be a page in the docs. It is not a changelog. Check out the other sections.

Member

B4nan commented Sep 19, 2025

BasicCrawler.open_request_queue

IMO this should stay as get_request_queue, just like in the JS version. I know it kinda does the same as the SDK methods, but it's not configurable; it always gives you the default RQ for the crawler, and it only opens it if it wasn't opened already. If you pass in a RQ in the crawler options, it acts as a simple getter.
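A rough Python transcription of those semantics (a hypothetical SketchCrawler, not the actual implementation):

from crawlee.storages import RequestQueue


class SketchCrawler:
    """Illustrates the getter semantics only; not crawlee's real class."""

    def __init__(self, request_queue: RequestQueue | None = None) -> None:
        self._request_queue = request_queue

    async def get_request_queue(self) -> RequestQueue:
        # Acts as a plain getter if a RQ was passed in the crawler options;
        # otherwise it lazily opens the default RQ on first access.
        if self._request_queue is None:
            self._request_queue = await RequestQueue.open()
        return self._request_queue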

Member

B4nan commented Sep 19, 2025

BasicCrawler.get_dataset -> BasicCrawler.open_dataset

Same for this one. I thought we were talking about something else here (the factory methods on the storage classes, to be specific). The crawler methods should be named based on the JS version.

@Pijukatel
Collaborator Author

BasicCrawler.open_request_queue

IMO this should stay as get_request_queue, just like in the JS version. I know it kinda does the same as the SDK methods, but it's not configurable; it always gives you the default RQ for the crawler, and it only opens it if it wasn't opened already. If you pass in a RQ in the crawler options, it acts as a simple getter.

We can have:
crawler.open_request_queue(alias=...)
crawler.open_request_queue(name=...)
crawler.open_request_queue(id=...)

and since you can now have differently configured crawlers, it makes sense: differently configured crawlers can give you different storages (depending on their init args):
crawler_1.open_request_queue(...)
crawler_2.open_request_queue(...)

Member

B4nan commented Sep 19, 2025

I am honestly not sold on having those methods like this; it's the first time I'm hearing we have this in Python. The reason we added BC.getRequestQueue was to allow creating the crawler instance without passing a RQ that was explicitly created (and therefore, to give access to this implicitly created RQ). Similarly, the BC.getDataset is there because you can pushData from the crawling context, and you need a way to access the data.

To create any storage instance, people should be using the factory methods directly (e.g. RequestQueue.open). We shouldn't replicate those static factories on the BasicCrawler interface.
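The factory-method route described here, as a minimal sketch (the request_manager parameter name is assumed from crawlee's Python API):

import asyncio

from crawlee.crawlers import BasicCrawler
from crawlee.storages import RequestQueue


async def main() -> None:
    # Open (or create) the storage explicitly via the static factory...
    rq = await RequestQueue.open(name='my-queue')

    # ...and hand it to the crawler, instead of asking the crawler to open it.
    crawler = BasicCrawler(request_manager=rq)


asyncio.run(main())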

@Pijukatel
Collaborator Author

I am honestly not sold on having those methods like this; it's the first time I'm hearing we have this in Python. The reason we added BC.getRequestQueue was to allow creating the crawler instance without passing a RQ that was explicitly created (and therefore, to give access to this implicitly created RQ). Similarly, the BC.getDataset is there because you can pushData from the crawling context, and you need a way to access the data.

To create any storage instance, people should be using the factory methods directly (e.g. RequestQueue.open). We shouldn't replicate those static factories on the BasicCrawler interface.

One of the reworks we did was to allow having multiple different crawlers that can be configured to use different services, detached from the global state (if desired) to limit strange side effects. Now the crawlers can be isolated, and in such a situation the crawler-specific storage creation methods are very convenient.

@Pijukatel requested a review from vdusek September 19, 2025 15:04
Member

B4nan commented Sep 19, 2025

How is this about global state? In JS you either just create the crawler without RQ option, and it creates its implicit RQ if it needs one (and you can access it via BC.getRequestQueue()), or you create it explicitly and pass it to the crawler options (that's what RQ.open method is for).

My point is that it sounds like BasicCrawler.get_request_queue does the same as RequestQueue.open, and I can't say I like that. Or what are the differences?

Collaborator

vdusek commented Sep 19, 2025

IMO this should stay as get_request_queue, just like in the JS version.

In JS, Crawler.getRequestQueue, which returns a RequestProvider, is equivalent to Crawler.get_manager in Python. It simply returns the configured request manager for the crawler (don't be confused by the naming difference - provider vs manager - that's due to historical reasons from when the request loaders were first implemented). That's why I suggested this method should not accept id/name/alias arguments and should keep the get_... prefix, since it only returns the already-configured instance.

The other get_... methods (for dataset or KVS) actually open a new instance by calling Storage.open() internally. And they accept the name/alias/id (ID or name in JS). This is the same behavior as the open_... methods on the Actor. The difference is only in naming: here we use get_..., while in the SDK we use open_.... The goal is to unify these differently named methods that share the same behavior.

Lastly, there's the new method Crawler.open_request_queue (not to be confused with Crawler.get_request_manager). This method opens a new request queue using the storage_client and configuration that were previously passed to the Crawler, just like Crawler.open_dataset and Crawler.open_key_value_store.

Summary (see the sketch below):

  • get_... methods do not accept id/name/alias and do not open a new storage instance*; they only return the one configured on the crawler (*or a default one if it is not set yet).
  • open_... methods do accept those parameters, create a new instance under the hood, and return it, so their behavior is different.
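A sketch contrasting the two families as summarized above (signatures assumed from this PR and crawlee's current API):

import asyncio

from crawlee.crawlers import BasicCrawler


async def main() -> None:
    crawler = BasicCrawler()

    # get_...: no id/name/alias; returns the instance already configured on
    # the crawler (or the default one if none has been set yet).
    manager = await crawler.get_request_manager()

    # open_...: accepts id/name/alias and opens an instance under the hood,
    # using the crawler's configuration and storage client.
    dataset = await crawler.open_dataset(name='results')


asyncio.run(main())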

Member

B4nan commented Sep 20, 2025

In JS, Crawler.getRequestQueue, which returns a RequestProvider, is equivalent to Crawler.get_manager in Python. It simply returns the configured request manager for the crawler (don't be confused by the naming difference - provider vs manager - that's due to historical reasons from when the request loaders were first implemented). That's why I suggested this method should not accept id/name/alias arguments and should keep the get_... prefix, since it only returns the already-configured instance.

Is it really get_manager? If so, I'd rename it to get_request_manager, that's way too abstract.

The other get_... methods (for dataset or KVS) actually open a new instance by calling Storage.open() internally. And they accept the name/alias/id (ID or name in JS). This is the same behavior as the open_... methods on the Actor. The difference is only in naming: here we use get_..., while in the SDK we use open_.... The goal is to unify these differently named methods that share the same behavior.

I would personally just remove all of them, they don't fit the crawler interface at all. I understand that those respect the crawler config, but let's be honest, 99% of our users will use the default config and only override crawler options unrelated to the storage, we shouldn't pollute the main interface (the crawler class) because of an edge case.

Collaborator

vdusek commented Sep 21, 2025

Is it really get_manager? If so, I'd rename it to get_request_manager, that's way too abstract.

Sorry, it is get_request_manager.

I would personally just remove all of them, they don't fit the crawler interface at all. I understand that those respect the crawler config, but let's be honest, 99% of our users will use the default config and only override crawler options unrelated to the storage, we shouldn't pollute the main interface (the crawler class) because of an edge case.

Sure, let's remove them then. I am all for it, I don't see much benefit in them as well.

As I said, it's analogous to https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler#getDataset , so I suppose we should remove it from the Crawlee JS v4 as well. The get_key_value_store is missing there.

Btw., we have get_key_value_store helper also on the BasicCrawlingContext level (surprisingly get_dataset is not there), I guess it can be removed as well?

Py - https://crawlee.dev/python/api/class/BasicCrawlingContext#get_key_value_store
JS - https://crawlee.dev/js/api/core/interface/CrawlingContext#getKeyValueStore

@Pijukatel
Collaborator Author

Btw., we have get_key_value_store helper also on the BasicCrawlingContext level (surprisingly get_dataset is not there), I guess it can be removed as well?

Well, we have BasicCrawlingContext.push_data in the context, which is kind of (Dataset.open_dataset + Dataset.push_data)...

So I guess we should make a decision. Do we want a limited amount of basic methods and do the stuff "only one way", or do we want a bunch of convenient helpers, which are probably more user-friendly but lead to multiple ways of achieving the same?

It is pretty confusing to develop if we subscribe to both of those concepts at the same time and arbitrarily apply one of them in various situations.

Member

B4nan commented Sep 22, 2025

Well, we have BasicCrawlingContext.push_data in the context, which is kind of (Dataset.open_dataset + Dataset.push_data)...

Afaik this was added because of adaptive crawling. I'd say we want to push people to use the context helpers. Let's wait for @janbuchar to get back, I'd like to hear his opinion too.

or do we want a bunch of convenient helpers, which are probably more user-friendly but lead to multiple ways of achieving the same?

My point is, we already have a convenient helper to do this, at least in JS, we have the static methods on the storage classes, so it doesn't make much sense to have exactly the same functionality on the crawler class.

As I said, it's analogous to crawlee.dev/js/api/basic-crawler/class/BasicCrawler#getDataset , so I suppose we should remove it from the Crawlee JS v4 as well.

I don't think we want to remove this one; we need a way to work with the implicit dataset instance of the crawler, the one that you push data to via the context helper. Again, to me this is a getter, so it should be called getDataset instead of openDataset: you'll want to call it after the crawler finishes, and by that time the dataset will already be opened.
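That post-run getter usage, sketched in Python (keeping the pre-rename get_dataset name per the argument above; the handler and URL are placeholders):

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext


async def main() -> None:
    crawler = BasicCrawler()

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://example.com'])

    # By now the implicit dataset is already open, so this call behaves as a
    # getter rather than opening anything new.
    dataset = await crawler.get_dataset()
    data = await dataset.get_data()


asyncio.run(main())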

@Pijukatel force-pushed the rename-get-storages-to-open branch from a463739 to ca092f8 September 22, 2025 07:41
Collaborator Author

Pijukatel commented Sep 22, 2025

I think removing the helpers is a step in the wrong direction. I will state a use case where I believe it is best visible:

...
async with Actor(...):
    custom_configuration_1 = Configuration(...)
    custom_storage_client_1 = MemoryStorageClient()

    custom_configuration_2 = Configuration(...)
    custom_storage_client_2 = FileSystemStorageClient()

    crawler_1 = BasicCrawler(
        configuration=custom_configuration_1,
        storage_client=custom_storage_client_1,
    )

    crawler_2 = BasicCrawler(
        configuration=custom_configuration_2,
        storage_client=custom_storage_client_2,
    )

    # Same for all other storage helpers. Note the awaits: the open_* methods
    # are coroutines, so the instances must be awaited before comparison.
    assert await Actor.open_dataset() is not await crawler_1.open_dataset()
    assert await Actor.open_dataset() is not await crawler_2.open_dataset()
    assert await crawler_1.open_dataset() is not await crawler_2.open_dataset()
...

Helpers have almost no extra maintenance cost. They just make sure that the proper storage is opened, based on the context. Crawlers can have a different context than the Actor, and if we delete the helpers, the user will have to track such context manually, which would be quite inconvenient.

If we keep such a helper just for one of the storages (BasicCrawler.open_dataset), then we are quite inconsistent: users will have convenient access to the correct dataset, but they will have to do extra work to get convenient access to the other storages. I think such inconsistency will just be confusing.

Member

B4nan commented Sep 22, 2025

I don't think we should rename (or remove) the Crawler.get_dataset method (nor the Crawler.get_request_queue), it should only serve as a getter for the implicit dataset/RQ, not as a way to open a new dataset/RQ (for that you should use Dataset.open). Keep in mind the use case you highlighted above is an edge case, 99% of our users won't use multiple crawler instances, we shouldn't make the API more confusing (e.g. have multiple methods to do the same) because of that.

With the datasets it's a bit more complicated, since we allow ctx.pushData to go to a named dataset too, not just to the implicit/default, so crawler.getDataset has the idOrName argument in JS.

Overall, I'd say we should unify things to do what the JS version is doing, we weren't planning any changes in this part there. But let's see what Honza thinks about this.

@Pijukatel
Collaborator Author

Keep in mind the use case you highlighted above is an edge case, 99% of our users won't use multiple crawler instances

I do not think this argument holds. The previous version of the tooling was very obstructive in a multiple-crawler scenario, and was discouraging users from going that way, so I am not surprised it used to be an edge case. I do not see an objective reason for this constraint, though.

we shouldn't make the API more confusing (e.g. have multiple methods to do the same) because of that.

If the users don't get confused with crawler.getDataset (or crawler.openDataset), then I do not see a reason why they should get confused with the same method for the other two storage types. On the contrary, treating all three storage types consistently makes the API more predictable as opposed to having helper methods just for some storage types.

Collaborator

@janbuchar left a comment


Well, we have BasicCrawlingContext.push_data in the context, which is kind of (Dataset.open_dataset + Dataset.push_data)...

Afaik this was added because of adaptive crawling. I'd say we want to push people to use the context helpers. Let's wait for @janbuchar to get back, I'd like to hear his opinion too.

Exactly, even though I'm slowly steering away from pushing people to use the crawling context - I believe that in the future, we'll be able to have "transactional" access to storages even without modifying the context.

I don't think we should rename (or remove) the Crawler.get_dataset method (nor the Crawler.get_request_queue), it should only serve as a getter for the implicit dataset/RQ, not as a way to open a new dataset/RQ (for that you should use Dataset.open).

After seeing the changes, I have to agree with this. Very frequently, you'll call crawler.open_dataset long after the crawler has finished, after having used the dataset profusely, so crawler.get_dataset is more fitting IMO.

Keep in mind the use case you highlighted above is an edge case, 99% of our users won't use multiple crawler instances

I do not think this argument holds. The previous version of the tooling was very obstructive in a multiple-crawler scenario, and was discouraging users from going that way, so I am not surprised it used to be an edge case. I do not see an objective reason for this constraint, though.

💯

In conclusion... I'd probably unify the naming in the opposite direction. Let me know if I missed some nuance of the discussion, it's quite a lot to unpack.

return self._request_manager

async def get_dataset(
async def open_request_queue(
Collaborator


This has some funk to it - if the crawler uses a non-default request manager, this will still return the default request queue. If somebody does that, they will probably be surprised that adding requests to this queue does nothing 😁

Perhaps the method could throw if there is a non-default request manager in place?
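One possible shape of that guard, as a hypothetical sketch (SketchCrawler and the error message are invented for illustration):

from crawlee.request_loaders import RequestManager
from crawlee.storages import RequestQueue


class SketchCrawler:
    def __init__(self, request_manager: RequestManager | None = None) -> None:
        self._request_manager = request_manager

    async def open_request_queue(self) -> RequestQueue:
        # Refuse to hand out the default RQ when a non-default request
        # manager is in place; adding requests to it would do nothing.
        if self._request_manager is not None and not isinstance(
            self._request_manager, RequestQueue
        ):
            raise RuntimeError('Crawler uses a non-default request manager.')
        return await RequestQueue.open()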

@Pijukatel
Collaborator Author

I will close this PR as we have no consensus on renaming the methods. I will merge the small non-breaking fix in a separate PR.


Labels

t-tooling: Issues with this label are in the ownership of the tooling team.
tested: Temporary label used only programmatically for some analytics.


Development

Successfully merging this pull request may close these issues.

Unify get_"storage" and open_"storage" helper methods
