refactor!: Rename storage-related methods from get_ to open_ (#1418)
vdusek
left a comment
I am not sure about those:
BasicCrawler.get_request_manager -> BasicCrawler.open_request_manager
This should return the already configured request manager that the crawler is using, not open a new one. So the `get` here makes sense, probably.
And this:
BasicCrawler.open_request_queue
not sure whether it is needed or whether it makes sense (more potential problematic edge cases?)...
Revert get_request_manager
Ok, I am fine with that.
This is needed when thinking about the Crawler having its own service_locator. So when you want an RQ created from such a crawler, you need such a helper method. On top of that, the fixed bug was exactly this problem: the crawler would use some global RQ instead of the one specific to it.
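The point about a crawler-scoped service locator can be sketched in plain Python. This is a hypothetical illustration (the class and method names are made up, not the real crawlee API): a helper that resolves through the crawler's own services never accidentally falls back to the shared global queue, which is exactly the bug described above.

```python
# Hypothetical sketch, not the real crawlee API: why a crawler-scoped
# helper avoids silently using a global request queue.

GLOBAL_QUEUE = {"name": "global-default"}


class ServiceLocator:
    def __init__(self, queue=None):
        self._queue = queue

    def get_queue(self):
        # Fall back to the shared global queue only when nothing
        # crawler-specific was configured.
        return self._queue if self._queue is not None else GLOBAL_QUEUE


class Crawler:
    def __init__(self, queue=None):
        # Each crawler carries its own locator instead of reading global state.
        self._services = ServiceLocator(queue)

    def open_request_queue(self):
        # Crawler-specific helper: resolves through this crawler's services.
        return self._services.get_queue()


isolated = Crawler(queue={"name": "crawler-specific"})
default = Crawler()

print(isolated.open_request_queue()["name"])  # crawler-specific
print(default.open_request_queue()["name"])   # global-default
```

Without such a helper, code that opens the queue "the usual way" would always get the global one, even for a crawler configured with its own services.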
vdusek
left a comment
LGTM, but please spend some time describing the changes in the upgrading guide.
## BasicCrawler changes

### Renamed methods for opening storages

`BasicCrawler.get_dataset` -> `BasicCrawler.open_dataset`
`BasicCrawler.get_key_value_store` -> `BasicCrawler.open_key_value_store`

### Added method for opening RequestQueue that uses configuration and storage client of the crawler

`BasicCrawler.open_request_queue`
Use words. This will be a page in the docs. It is not a changelog. Check out the other sections.
IMO this should stay as-is.
Same for this one. I thought we were talking about something else here (the factory methods on the storage classes, to be specific). The crawler methods should be named based on the JS version.
We can have that, and since you can now have differently configured crawlers, it makes sense: a differently configured Crawler can give a different storage (depending on its init args).
I am honestly not sold on having those methods like this; first time hearing we have this in Python. The reason we added the factory methods: to create any storage instance, people should be using them directly (e.g. `Dataset.open`).
One of the reworks we did was to allow having multiple different crawlers that can be configured to use different services, detaching them from the global state (if desired) to limit strange side effects. Now the crawlers can be isolated, and in such a situation, the crawler-specific storage creation methods are very convenient.
How is this about global state? In JS you either just create the crawler without the RQ option, and it creates its implicit RQ if it needs one (and you can access it via the getter). My point is that it sounds like a method that opens a new queue, rather than a getter.
In JS, …. The other …. Lastly, there's the new method `open_request_queue`. Summary: ….
Is it really …?
I would personally just remove all of them; they don't fit the crawler interface at all. I understand that those respect the crawler config, but let's be honest: 99% of our users will use the default config and only override crawler options unrelated to the storage. We shouldn't pollute the main interface (the crawler class) because of an edge case.
Sorry, it is ….
Sure, let's remove them then. I am all for it; I don't see much benefit in them either. As I said, it's analogous to https://crawlee.dev/js/api/basic-crawler/class/BasicCrawler#getDataset , so I suppose we should remove it from Crawlee JS v4 as well. Btw., we have it in Py too: https://crawlee.dev/python/api/class/BasicCrawlingContext#get_key_value_store
Well, we have BasicCrawlingContext.push_data in the context, which is kind of (Dataset.open_dataset + Dataset.push_data)... So I guess we should make a decision. Do we want a limited set of basic methods and do things "only one way", or do we want a bunch of convenience helpers, which are probably more user-friendly but lead to multiple ways of achieving the same thing? It is pretty confusing to develop if we subscribe to both of those concepts at the same time and arbitrarily apply one of them in various situations.
Afaik this was added because of adaptive crawling. I'd say we want to push people to use the context helpers. Let's wait for @janbuchar to get back, I'd like to hear his opinion too.
My point is, we already have a convenient helper to do this, at least in JS, we have the static methods on the storage classes, so it doesn't make much sense to have exactly the same functionality on the crawler class.
I don't think we want to remove this one; we need a way to work with the implicit dataset instance of the crawler, the one that you push data to via the context helper. Again, to me this is a getter, so it should be called `get_dataset`.
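The getter-vs-factory distinction being argued here can be illustrated with a toy sketch. The names are illustrative only, not the actual crawlee API: `Dataset.open` is create-or-open by name, while the crawler method lazily creates and then always returns the same implicit instance, which is why `get_` reads more naturally for it.

```python
# Hypothetical sketch of "open" (factory) vs "get" (getter) semantics.
# Not the real crawlee API; names are illustrative.

class Dataset:
    _registry: dict = {}

    def __init__(self, name):
        self.name = name
        self.items = []

    @classmethod
    def open(cls, name="default"):
        # Factory: create-or-open semantics, keyed by name.
        return cls._registry.setdefault(name, cls(name))


class Crawler:
    def __init__(self):
        self._dataset = None

    def get_dataset(self):
        # Getter: lazily create once, then always return the same
        # implicit instance the crawler pushes data to.
        if self._dataset is None:
            self._dataset = Dataset.open("default")
        return self._dataset


crawler = Crawler()
assert crawler.get_dataset() is crawler.get_dataset()      # same implicit instance
assert Dataset.open("other") is not crawler.get_dataset()  # factory opens another one
```

Under this reading, renaming the crawler method to `open_dataset` would wrongly suggest the factory behavior of the second call.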
Force-pushed from a463739 to ca092f8
I think removing the helpers is a step in the wrong direction. I will state a use case where I believe it is best visible: …. Helpers have almost no extra maintenance cost. They just make sure that the proper storage is opened, based on the context. Crawlers can have a different context than the Actor, and if we delete the helpers, the user will have to track such context manually, which would be quite inconvenient. If we keep such a helper for just one of the storages (…).
This reverts commit ca092f8.
I don't think we should rename (or remove) the Crawler.get_dataset method (nor Crawler.get_request_queue); it should only serve as a getter for the implicit dataset/RQ, not as a way to open a new one (for that you should use Dataset.open). With the datasets it's a bit more complicated, since we allow …. Overall, I'd say we should unify things to do what the JS version is doing; we weren't planning any changes in this part there. But let's see what Honza thinks about this.
I do not think this argument holds. The previous version of the tooling was very obstructive in a multiple-crawler scenario, and was discouraging users from going that way, so I am not surprised it used to be an edge case. I do not see an objective reason for this constraint, though.
If the users don't get confused with ….
janbuchar
left a comment
> Well, we have BasicCrawlingContext.push_data in the context, which is kind of (Dataset.open_dataset + Dataset.push_data)...

> Afaik this was added because of adaptive crawling. I'd say we want to push people to use the context helpers. Let's wait for @janbuchar to get back, I'd like to hear his opinion too.
Exactly, even though I'm slowly steering away from pushing people to use the crawling context - I believe that in the future, we'll be able to have "transactional" access to storages even without modifying the context.
> I don't think we should rename (or remove) the Crawler.get_dataset method (nor the Crawler.get_request_queue), it should only serve as a getter for the implicit dataset/RQ, not as a way to open a new dataset/RQ (for that you should use Dataset.open).
After seeing the changes, I have to agree with this. Very frequently, you'll call crawler.open_dataset long after the crawler has finished, after having used the dataset profusely, so I think crawler.get_dataset is more fitting IMO.
> Keep in mind the use case you highlighted above is an edge case, 99% of our users won't use multiple crawler instances.

> I do not think this argument holds. The previous version of the tooling was very obstructive in a multiple-crawler scenario, and was discouraging users from going that way, so I am not surprised it used to be an edge case. I do not see an objective reason for this constraint, though.
💯
In conclusion... I'd probably unify the naming in the opposite direction. Let me know if I missed some nuance of the discussion, it's quite a lot to unpack.
return self._request_manager
...
- async def get_dataset(
+ async def open_request_queue(
This has some funk to it - if the crawler uses a non-default request manager, this will still return the default request queue. If somebody does that, they will probably be surprised that adding requests to this queue does nothing 😁
Perhaps the method could throw if there is a non-default request manager in place?
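The suggested guard could look roughly like this. It is a hypothetical sketch, not the real crawlee implementation: raise as soon as someone asks for the default queue while a custom request manager is in place, instead of silently handing back a queue the crawler will never read from.

```python
# Hypothetical sketch of the suggested guard; illustrative names only,
# not the real crawlee code.

class RequestQueue:
    def __init__(self):
        self.requests = []


class Crawler:
    def __init__(self, request_manager=None):
        # A custom request manager replaces the implicit request queue.
        self._request_manager = request_manager
        self._queue = None

    def open_request_queue(self):
        if self._request_manager is not None:
            # Fail loudly: requests added to the default queue would be ignored.
            raise RuntimeError(
                "Crawler uses a custom request manager; the default request "
                "queue is not consumed by this crawler."
            )
        if self._queue is None:
            self._queue = RequestQueue()
        return self._queue


plain = Crawler()
plain.open_request_queue().requests.append("https://example.com")

custom = Crawler(request_manager=object())
try:
    custom.open_request_queue()
except RuntimeError:
    print("raised as suggested")
```

The error surfaces the surprise at call time rather than letting enqueued requests vanish.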
I will close this PR as we have no consensus on renaming the methods. I will merge the small non-breaking fix in a separate PR.
Description

Rename:
- `BasicCrawler.get_dataset` -> `BasicCrawler.open_dataset`
- `BasicCrawler.get_key_value_store` -> `BasicCrawler.open_key_value_store`

Add:
- `BasicCrawler.open_request_queue`

Fix:
- Potential storage client change when running the crawler again with `purge_on_start=True`

Issues

Testing

Checklist