
refactor!: Rename storages related methods from get_ to open_ #1418

Closed
Pijukatel wants to merge 6 commits into master from rename-get-storages-to-open

Conversation

Collaborator

@Pijukatel commented Sep 19, 2025

Description

Renamed (see the sketch below):

  • BasicCrawler.get_dataset -> BasicCrawler.open_dataset
  • BasicCrawler.get_key_value_store -> BasicCrawler.open_key_value_store

Added:

  • BasicCrawler.open_request_queue

Fixed:

  • Potential storage client change when running the crawler again with purge_on_start=True
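For reference, a minimal before/after sketch of the rename (import path per crawlee's public Python API; illustrative only):

import asyncio

from crawlee.crawlers import BasicCrawler


async def main() -> None:
    crawler = BasicCrawler()

    # Before this PR: await crawler.get_dataset()
    dataset = await crawler.open_dataset()

    # Before this PR: await crawler.get_key_value_store()
    kvs = await crawler.open_key_value_store()

    # Newly added by this PR:
    rq = await crawler.open_request_queue()


asyncio.run(main())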

Issues

Testing

  • Added test

Checklist

  • CI passed

@github-actions Bot added this to the 123rd sprint - Tooling team milestone Sep 19, 2025
@github-actions Bot added the t-tooling and tested labels Sep 19, 2025
@Pijukatel changed the title Rename get storages to open refactor!: Rename get storages to open Sep 19, 2025
@Pijukatel changed the title refactor!: Rename get storages to open refactor!: Rename storages realted methods from get_... to open_... Sep 19, 2025
@Pijukatel changed the title refactor!: Rename storages realted methods from get_... to open_... refactor!: Rename storages related methods from get_... to open_... Sep 19, 2025
@Pijukatel changed the title refactor!: Rename storages related methods from get_... to open_... refactor!: Rename storages related methods from get_ to open_ Sep 19, 2025
@Pijukatel requested a review from vdusek September 19, 2025 13:37
Collaborator

@vdusek left a comment


I am not sure about those:

BasicCrawler.get_request_manager -> BasicCrawler.open_request_manager 

This should return the already configured request manager that is being used by the crawler. Not opening a new one. So the get here makes sense, probably.

And this:

BasicCrawler.open_request_queue

not sure whether it is needed or whether it makes sense (more potential problematic edge cases?)...

Revert get_request_manager
@Pijukatel
Collaborator Author

I am not sure about those:

BasicCrawler.get_request_manager -> BasicCrawler.open_request_manager 

This should return the already configured request manager that is being used by the crawler. Not opening a new one. So the get here makes sense, probably.

Ok, I am fine with that.

And this:

BasicCrawler.open_request_queue

not sure whether it is needed or whether it makes sense (more potential problematic edge cases?)...

This is needed when thinking about the Crawler having its own service_locator. So when you want an RQ created from such a crawler, you need such a helper method. On top of that, the fixed bug was exactly this problem: the crawler would use some global RQ instead of the one specific to it. A sketch of that scenario follows.
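A hypothetical sketch of that scenario (the failure mode is simplified; Configuration, MemoryStorageClient and the constructor parameters follow crawlee's public API):

import asyncio

from crawlee.configuration import Configuration
from crawlee.crawlers import BasicCrawler
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    crawler = BasicCrawler(
        configuration=Configuration(purge_on_start=True),
        storage_client=MemoryStorageClient(),  # crawler-specific, not global
    )
    # Before the fix, running the crawler again could resolve the request
    # queue through the global service locator instead of the crawler's own
    # storage client.
    rq = await crawler.open_request_queue()  # the helper added in this PR


asyncio.run(main())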

@Pijukatel marked this pull request as ready for review September 19, 2025 13:49
@Pijukatel requested a review from vdusek September 19, 2025 13:49
Collaborator

@vdusek left a comment


LGTM, but please spend some time describing the changes in the upgrading guide.

Comment thread docs/upgrading/upgrading_to_v1.md Outdated
Comment on lines +215 to +222
## BasicCrawler changes

### Renamed methods for opening storages
`BasicCrawler.get_dataset` -> `BasicCrawler.open_dataset`
`BasicCrawler.get_key_value_store` -> `BasicCrawler.open_key_value_store`

### Added method for opening RequestQueue that uses configuration and storage client of the crawler
`BasicCrawler.open_request_queue`
Collaborator


Use words. This will be a page in the docs. It is not a changelog. Check out the other sections.

Member

B4nan commented Sep 19, 2025

BasicCrawler.open_request_queue

IMO this should stay as get_request_queue, just like in the JS version. I know it kinda does the same as the SDK methods, but it's not configurable; it always gives you the default RQ for the crawler, and it only opens it if it wasn't opened already. If you pass in a RQ in the crawler options, it acts as a simple getter.
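A rough Python transcription of those semantics (a hypothetical SketchCrawler, not the actual implementation):

from crawlee.storages import RequestQueue


class SketchCrawler:
    """Illustrates the getter semantics only; not crawlee's real class."""

    def __init__(self, request_queue: RequestQueue | None = None) -> None:
        self._request_queue = request_queue

    async def get_request_queue(self) -> RequestQueue:
        # Acts as a plain getter if a RQ was passed in the crawler options;
        # otherwise it lazily opens the default RQ on first access.
        if self._request_queue is None:
            self._request_queue = await RequestQueue.open()
        return self._request_queue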

Member

B4nan commented Sep 19, 2025

BasicCrawler.get_dataset -> BasicCrawler.open_dataset

Same for this one. I thought we were talking about something else here (the factory methods on the storage classes, to be specific). The crawler methods should be named based on the JS version.

@Pijukatel
Collaborator Author

BasicCrawler.open_request_queue

IMO this should stay as get_request_queue, just like in the JS version. I know it kinda does the same as the SDK methods, but it's not configurable; it always gives you the default RQ for the crawler, and it only opens it if it wasn't opened already. If you pass in a RQ in the crawler options, it acts as a simple getter.

We can have:
crawler.open_request_queue(alias=...)
crawler.open_request_queue(name=...)
crawler.open_request_queue(id=...)

and since you can now have differently configured crawlers, it makes sense: differently configured crawlers can give you different storages (depending on their init args):
crawler_1.open_request_queue(...)
crawler_2.open_request_queue(...)

Member

B4nan commented Sep 19, 2025

I am honestly not sold on having those methods like this; it's the first time I'm hearing we have this in Python. The reason we added BC.getRequestQueue was to allow creating the crawler instance without passing a RQ that was explicitly created (and therefore, to give access to this implicitly created RQ). Similarly, the BC.getDataset is there because you can pushData from the crawling context, and you need a way to access the data.

To create any storage instance, people should be using the factory methods directly (e.g. RequestQueue.open). We shouldn't replicate those static factories on the BasicCrawler interface.
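The factory-method route described here, as a minimal sketch (the request_manager parameter name is assumed from crawlee's Python API):

import asyncio

from crawlee.crawlers import BasicCrawler
from crawlee.storages import RequestQueue


async def main() -> None:
    # Open (or create) the storage explicitly via the static factory...
    rq = await RequestQueue.open(name='my-queue')

    # ...and hand it to the crawler, instead of asking the crawler to open it.
    crawler = BasicCrawler(request_manager=rq)


asyncio.run(main())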

@Pijukatel
Collaborator Author

I am honestly not sold on having those methods like this; it's the first time I'm hearing we have this in Python. The reason we added BC.getRequestQueue was to allow creating the crawler instance without passing a RQ that was explicitly created (and therefore, to give access to this implicitly created RQ). Similarly, the BC.getDataset is there because you can pushData from the crawling context, and you need a way to access the data.

To create any storage instance, people should be using the factory methods directly (e.g. RequestQueue.open). We shouldn't replicate those static factories on the BasicCrawler interface.

One of the reworks we did was to allow having multiple different crawlers that can be configured to use different services, detached from the global state (if desired) to limit strange side effects. Now the crawlers can be isolated, and in such a situation the crawler-specific storage creation methods are very convenient.

@Pijukatel requested a review from vdusek September 19, 2025 15:04
Member

B4nan commented Sep 19, 2025

How is this about global state? In JS you either just create the crawler without RQ option, and it creates its implicit RQ if it needs one (and you can access it via BC.getRequestQueue()), or you create it explicitly and pass it to the crawler options (that's what RQ.open method is for).

My point is that it sounds like BasicCrawler.get_request_queue does the same as RequestQueue.open, and I can't say I like that. Or what are the differences?

Collaborator

vdusek commented Sep 19, 2025

IMO this should stay as get_request_queue, just like in the JS version.

In JS, Crawler.getRequestQueue, which returns a RequestProvider, is equivalent to Crawler.get_manager in Python. It simply returns the configured request manager for the crawler (don't be confused by the naming difference - provider vs manager - that's due to historical reasons from when the request loaders were first implemented). That's why I suggested this method should not accept id/name/alias arguments and should keep the get_... prefix, since it only returns the already-configured instance.

The other get_... methods (for dataset or KVS) actually open a new instance by calling Storage.open() internally. And they accept the name/alias/id (ID or name in JS). This is the same behavior as the open_... methods on the Actor. The difference is only in naming: here we use get_..., while in the SDK we use open_.... The goal is to unify these differently named methods that share the same behavior.

Lastly, there's the new method Crawler.open_request_queue (not to be confused with Crawler.get_request_manager). This method opens a new request queue using the storage_client and configuration that were previously passed to the Crawler, just like Crawler.open_dataset and Crawler.open_key_value_store.

Summary (see the sketch below):

  • get_... methods do not accept id/name/alias and do not open a new storage instance*; they only return the one configured on the crawler (*or a default one if it is not set yet).
  • open_... methods do accept those parameters, create a new instance under the hood, and return it, so their behavior is different.
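A sketch contrasting the two families as summarized above (signatures assumed from this PR and crawlee's current API):

import asyncio

from crawlee.crawlers import BasicCrawler


async def main() -> None:
    crawler = BasicCrawler()

    # get_...: no id/name/alias; returns the instance already configured on
    # the crawler (or the default one if none has been set yet).
    manager = await crawler.get_request_manager()

    # open_...: accepts id/name/alias and opens an instance under the hood,
    # using the crawler's configuration and storage client.
    dataset = await crawler.open_dataset(name='results')


asyncio.run(main())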

Member

B4nan commented Sep 20, 2025

In JS, Crawler.getRequestQueue, which returns a RequestProvider, is equivalent to Crawler.get_manager in Python. It simply returns the configured request manager for the crawler (don't be confused by the naming difference - provider vs manager - that's due to historical reasons from when the request loaders were first implemented). That's why I suggested this method should not accept id/name/alias arguments and should keep the get_... prefix, since it only returns the already-configured instance.

Is it really get_manager? If so, I'd rename it to get_request_manager, that's way too abstract.

The other get_... methods (for dataset or KVS) actually open a new instance by calling Storage.open() internally. And they accept the name/alias/id (ID or name in JS). This is the same behavior as the open_... methods on the Actor. The difference is only in naming: here we use get_..., while in the SDK we use open_.... The goal is to unify these differently named methods that share the same behavior.

I would personally just remove all of them, they don't fit the crawler interface at all. I understand that those respect the crawler config, but let's be honest, 99% of our users will use the default config and only override crawler options unrelated to the storage, we shouldn't pollute the main interface (the crawler class) because of an edge case.

Collaborator

vdusek commented Sep 21, 2025

Is it really get_manager? If so, I'd rename it to get_request_manager, that's way too abstract.

Sorry, it is get_request_manager.

I would personally just remove all of them, they don't fit the crawler interface at all. I understand that those respect the crawler config, but let's be honest, 99% of our users will use the default config and only override crawler options unrelated to the storage, we shouldn't pollute the main interface (the crawler class) because of an edge case.

Sure, let's remove them then. I am all for it, I don't see much benefit in them as well.

As I said, it's analogous to https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler#getDataset , so I suppose we should remove it from the Crawlee JS v4 as well. The get_key_value_store is missing there.

Btw., we have get_key_value_store helper also on the BasicCrawlingContext level (surprisingly get_dataset is not there), I guess it can be removed as well?

Py - https://crawlee.dev/python/api/class/BasicCrawlingContext#get_key_value_store
JS - https://crawlee.dev/js/api/core/interface/CrawlingContext#getKeyValueStore

@Pijukatel
Collaborator Author

Btw., we have get_key_value_store helper also on the BasicCrawlingContext level (surprisingly get_dataset is not there), I guess it can be removed as well?

Well, we have BasicCrawlingContext.push_data in the context, which is kind of (Dataset.open_dataset + Dataset.push_data)...

So I guess we should make a decision. Do we want a limited amount of basic methods and do the stuff "only one way", or do we want a bunch of convenient helpers, which are probably more user-friendly but lead to multiple ways of achieving the same?

It is pretty confusing to develop if we subscribe to both of those concepts at the same time and arbitrarily apply one of them in various situations.

Member

B4nan commented Sep 22, 2025

Well, we have BasicCrawlingContext.push_data in the context, which is kind of (Dataset.open_dataset + Dataset.push_data)...

Afaik this was added because of adaptive crawling. I'd say we want to push people to use the context helpers. Let's wait for @janbuchar to get back, I'd like to hear his opinion too.

or do we want a bunch of convenient helpers, which are probably more user-friendly but lead to multiple ways of achieving the same?

My point is, we already have a convenient helper to do this, at least in JS, we have the static methods on the storage classes, so it doesn't make much sense to have exactly the same functionality on the crawler class.

As I said, it's analogous to crawlee.dev/js/api/basic-crawler/class/BasicCrawler#getDataset , so I suppose we should remove it from the Crawlee JS v4 as well.

I don't think we want to remove this one; we need a way to work with the implicit dataset instance of the crawler, the one that you push data to via the context helper. Again, to me this is a getter, so it should be called getDataset instead of openDataset: you'll want to call it after the crawler finishes, and by that time the dataset will already be opened.
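That post-run getter usage, sketched in Python (keeping the pre-rename get_dataset name per the argument above; the handler and URL are placeholders):

import asyncio

from crawlee.crawlers import BasicCrawler, BasicCrawlingContext


async def main() -> None:
    crawler = BasicCrawler()

    @crawler.router.default_handler
    async def handler(context: BasicCrawlingContext) -> None:
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://example.com'])

    # By now the implicit dataset is already open, so this call behaves as a
    # getter rather than opening anything new.
    dataset = await crawler.get_dataset()
    data = await dataset.get_data()


asyncio.run(main())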

@Pijukatel force-pushed the rename-get-storages-to-open branch from a463739 to ca092f8 September 22, 2025 07:41
Collaborator Author

Pijukatel commented Sep 22, 2025

I think removing the helpers is a step in the wrong direction. I will state a use case where I believe it is best visible:

...
async with Actor(...):
    custom_configuration_1 = Configuration(...)
    custom_storage_client_1 = MemoryStorageClient()

    custom_configuration_2 = Configuration(...)
    custom_storage_client_2 = FileSystemStorageClient()

    crawler_1 = BasicCrawler(
        configuration=custom_configuration_1,
        storage_client=custom_storage_client_1,
    )

    crawler_2 = BasicCrawler(
        configuration=custom_configuration_2,
        storage_client=custom_storage_client_2,
    )

    # Same for all other storage helpers. Note the awaits: the open_* methods
    # are coroutines, so the instances must be awaited before comparison.
    assert await Actor.open_dataset() is not await crawler_1.open_dataset()
    assert await Actor.open_dataset() is not await crawler_2.open_dataset()
    assert await crawler_1.open_dataset() is not await crawler_2.open_dataset()
...

Helpers have almost no extra maintenance cost. They just make sure that the proper storage is opened, based on the context. Crawlers can have a different context than the Actor, and if we delete the helpers, the user will have to track such context manually, which would be quite inconvenient.

If we keep such a helper just for one of the storages (BasicCrawler.open_dataset), then we are quite inconsistent: users will have convenient access to the correct dataset, but they will have to do extra work to get convenient access to the other storages. I think such inconsistency will just be confusing.

Member

B4nan commented Sep 22, 2025

I don't think we should rename (or remove) the Crawler.get_dataset method (nor the Crawler.get_request_queue), it should only serve as a getter for the implicit dataset/RQ, not as a way to open a new dataset/RQ (for that you should use Dataset.open). Keep in mind the use case you highlighted above is an edge case, 99% of our users won't use multiple crawler instances, we shouldn't make the API more confusing (e.g. have multiple methods to do the same) because of that.

With the datasets it's a bit more complicated, since we allow ctx.pushData to go to a named dataset too, not just to the implicit/default, so crawler.getDataset has the idOrName argument in JS.

Overall, I'd say we should unify things to do what the JS version is doing, we weren't planning any changes in this part there. But let's see what Honza thinks about this.

@Pijukatel
Collaborator Author

Keep in mind the use case you highlighted above is an edge case, 99% of our users won't use multiple crawler instances

I do not think this argument holds. The previous version of the tooling was very obstructive in a multiple-crawler scenario, and was discouraging users from going that way, so I am not surprised it used to be an edge case. I do not see an objective reason for this constraint, though.

we shouldn't make the API more confusing (e.g. have multiple methods to do the same) because of that.

If the users don't get confused with crawler.getDataset (or crawler.openDataset), then I do not see a reason why they should get confused with the same method for the other two storage types. On the contrary, treating all three storage types consistently makes the API more predictable as opposed to having helper methods just for some storage types.

Collaborator

@janbuchar left a comment


Well, we have BasicCrawlingContext.push_data in the context, which is kind of (Dataset.open_dataset + Dataset.push_data)...

Afaik this was added because of adaptive crawling. I'd say we want to push people to use the context helpers. Let's wait for @janbuchar to get back, I'd like to hear his opinion too.

Exactly, even though I'm slowly steering away from pushing people to use the crawling context - I believe that in the future, we'll be able to have "transactional" access to storages even without modifying the context.

I don't think we should rename (or remove) the Crawler.get_dataset method (nor the Crawler.get_request_queue), it should only serve as a getter for the implicit dataset/RQ, not as a way to open a new dataset/RQ (for that you should use Dataset.open).

After seeing the changes, I have to agree with this. Very frequently, you'll call crawler.open_dataset long after the crawler has finished, after having used the dataset profusely, so crawler.get_dataset is more fitting IMO.

Keep in mind the use case you highlighted above is an edge case, 99% of our users won't use multiple crawler instances

I do not think this argument holds. The previous version of the tooling was very obstructive in a multiple-crawler scenario, and was discouraging users from going that way, so I am not surprised it used to be an edge case. I do not see an objective reason for this constraint, though.

💯

In conclusion... I'd probably unify the naming in the opposite direction. Let me know if I missed some nuance of the discussion, it's quite a lot to unpack.

return self._request_manager

async def get_dataset(
async def open_request_queue(
Collaborator


This has some funk to it - if the crawler uses a non-default request manager, this will still return the default request queue. If somebody does that, they will probably be surprised that adding requests to this queue does nothing 😁

Perhaps the method could throw if there is a non-default request manager in place?
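One possible shape of that guard, as a hypothetical sketch (SketchCrawler and the error message are invented for illustration):

from crawlee.request_loaders import RequestManager
from crawlee.storages import RequestQueue


class SketchCrawler:
    def __init__(self, request_manager: RequestManager | None = None) -> None:
        self._request_manager = request_manager

    async def open_request_queue(self) -> RequestQueue:
        # Refuse to hand out the default RQ when a non-default request
        # manager is in place; adding requests to it would do nothing.
        if self._request_manager is not None and not isinstance(
            self._request_manager, RequestQueue
        ):
            raise RuntimeError('Crawler uses a non-default request manager.')
        return await RequestQueue.open()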

@Pijukatel
Collaborator Author

I will close this PR as we have no consensus on renaming the methods. I will merge the small non-breaking fix in a separate PR.


Labels

t-tooling: Issues with this label are in the ownership of the tooling team.
tested: Temporary label used only programmatically for some analytics.


Development

Successfully merging this pull request may close these issues.

Unify get_"storage" and open_"storage" helper methods
