Document how to handle action failures#157
Document how to handle action failures#157Gallaecio wants to merge 12 commits intoscrapy-plugins:mainfrom
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #157 +/- ##
==========================================
- Coverage 97.89% 97.40% -0.50%
==========================================
Files 14 15 +1
Lines 1618 1693 +75
Branches 297 309 +12
==========================================
+ Hits 1584 1649 +65
- Misses 14 20 +6
- Partials 20 24 +4
🚀 New features to boost your workflow:
|
|
|
||
| ====================== | ||
| Handling action errors | ||
| ====================== |
There was a problem hiding this comment.
There is a lot of code in this section; I wonder if we should make it a part of the library itself instead.
It seems like a general theme: if action failed, should the response be considered a success or a failure? The current default is success, but it seems that making it a failure is better (with an option to get back), as it allows to detect more issues.
There was a problem hiding this comment.
deedfff does this for the middleware. Do we want to do something about the cache policy, or is it OK to keep that documentation-only?
| :class:`~scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware`, you might | ||
| want to use a custom :setting:`HTTPCACHE_POLICY <scrapy:HTTPCACHE_POLICY>` to | ||
| prevent responses with failed actions (i.e. after exceeding retries) to be | ||
| cached: |
There was a problem hiding this comment.
Is there any interaction with the ZYTE_API_ACTION_ERROR_HANDLING / ZYTE_API_ACTION_ERROR_RETRY_ENABLED settings? E.g. do you need this for all values option values? Or do some of these option values require caching to work?
There was a problem hiding this comment.
I guess you only need this for the default handing, "pass", as I don’t imagine caching caches exceptions. But I will test this out manually.
There was a problem hiding this comment.
I don’t imagine caching caches exceptions.
I was wrong.
I tested all 6 scenarios, default policy/custom policy, pass/ignore/err, and pass/ignore/err make no difference, with the default policy responses are cached, with the custom policy they are not. Not sure if the current docs need any clarification, given the custom policy is relevant regardless of those settings.
Test details
I cloned https://github.com/zytedata/zyte-spider-templates-project and added 2 files:
# zyte_spider_templates_project/spiders/test.py
from scrapy import Request, Spider
class TestSpider(Spider):
name = "test"
custom_settings = {
"HTTPCACHE_ENABLED": True,
}
def start_requests(self):
yield Request(
"https://toscrape.com",
meta={
"zyte_api_automap": {
"browserHtml": True,
"actions": [
{
"action": "waitForRequest",
"urlPattern": "unexising/pattern",
"timeout": 0,
},
],
},
},
)
def parse(self, response):
response_actions = getattr(response, "raw_api_response", {}).get("actions", None)
self.logger.debug(response_actions)# zyte_spider_templates_project/extensions.py
from scrapy import Request
from scrapy.extensions.httpcache import DummyPolicy
from scrapy.http import Response
from scrapy_zyte_api.responses import ZyteAPIResponse, ZyteAPITextResponse
class ZyteAPIFailedActionsPolicy(DummyPolicy):
def should_cache_response(self, response: Response, request: Request):
if isinstance(response, (ZyteAPIResponse, ZyteAPITextResponse)) and any(
"error" in action for action in response.raw_api_response.get("actions", [])
):
return False
return super().should_cache_response(response, request)And then tried all 6 scenarios.
Default cache policy, default action error handling ("pass"), responses are cached:
$ scrapy crawl test |& grep "\[test\] DEBUG"
2024-12-23 11:39:08 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test |& grep "\[test\] DEBUG"
2024-12-23 11:39:24 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/
Default cache policy, "ignore" action error handling, responses are cached:
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' |& grep "\[test\] DEBUG"
2024-12-23 11:43:34 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' |& grep "\[test\] DEBUG"
2024-12-23 11:43:42 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/
Default cache policy, "err" action error handling, responses are cached:
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' |& grep "\[test\] DEBUG"
2024-12-23 11:44:41 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' |& grep "\[test\] DEBUG"
2024-12-23 11:44:59 [test] DEBUG: None
$ rm -rf .scrapy/httpcache/
Custom cache policy, default action error handling ("pass"), responses are not cached:
$ scrapy crawl test -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:50:26 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:50:39 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/
Custom cache policy, "ignore" action error handling, responses are not cached:
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:51:34 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=ignore' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:51:46 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.102, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/
Custom cache policy, "err" action error handling, responses are not cached:
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:52:14 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ scrapy crawl test -s 'ZYTE_API_ACTION_ERROR_HANDLING=err' -s 'HTTPCACHE_POLICY=zyte_spider_templates_project.extensions.ZyteAPIFailedActionsPolicy' |& grep "\[test\] DEBUG"
2024-12-23 11:52:28 [test] DEBUG: [{'action': 'waitForRequest', 'elapsedTime': 0.101, 'status': 'returned', 'error': 'Timed out after waiting 100ms'}]
$ rm -rf .scrapy/httpcache/
There was a problem hiding this comment.
(I’m not sure I fully understood your question, though, I might be missing some angle)
Resolves #118, resolves #119, resolves #120.
To do: