docs: Add guide "HttpCrawler with custom parser" #1622
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##           master    #1622      +/-   ##
==========================================
- Coverage   92.49%   92.48%   -0.02%
==========================================
  Files         157      157
  Lines       10439    10439
==========================================
- Hits         9656     9654       -2
- Misses        783      785       +2
Thanks for the guide, Max. While the code contained in there works, the idea was to use AbstractHttpCrawler along with a custom implementation of AbstractHttpParser, similarly to how BeautifulSoupCrawler and company are implemented.
The cool thing about this is that you can then use the parser class in AdaptivePlaywrightCrawler as well.
Got it. What do you think about expanding this guide with another section on implementation based on
Well, just adding it to the bottom won't cut it, but you can make the guide into two parts: a quick-and-dirty solution and the "native" way. You can also explain the benefits of each approach.
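To make the "native" way concrete: the idea is a parser class implementing an abstract interface, which the HTTP crawler then drives, the same way BeautifulSoupCrawler wraps BeautifulSoup. The sketch below is structural only and uses the stdlib so it runs anywhere; the class and method names (`AbstractParser`, `find_links`) are illustrative stand-ins, not crawlee's actual AbstractHttpParser API, and a real implementation would back `parse` with selectolax's LexborHTMLParser instead.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser


# Illustrative stand-in for an abstract parser interface; the real
# crawlee AbstractHttpParser has a different, richer surface.
class AbstractParser(ABC):
    @abstractmethod
    def parse(self, raw: bytes) -> object: ...

    @abstractmethod
    def find_links(self, parsed: object) -> list[str]: ...


class _LinkCollector(HTMLParser):
    """Collects href attributes of <a> tags while parsing."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs) -> None:
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


class StdlibParser(AbstractParser):
    """Concrete parser; a selectolax-backed parser would subclass the
    abstract interface the same way and be reusable by other crawlers."""

    def parse(self, raw: bytes) -> _LinkCollector:
        collector = _LinkCollector()
        collector.feed(raw.decode('utf-8'))
        return collector

    def find_links(self, parsed: _LinkCollector) -> list[str]:
        return parsed.links


parser = StdlibParser()
page = parser.parse(b'<a href="https://example.com">home</a>')
print(parser.find_links(page))  # ['https://example.com']
```

Because the crawler only talks to the abstract interface, the same parser class can be plugged into other crawlers, which is what makes it usable from AdaptivePlaywrightCrawler as mentioned above.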
async def final_step(
    context: ParsedHttpCrawlingContext[LexborHTMLParser],
) -> AsyncGenerator[SelectolaxLexborContext, None]:
    yield SelectolaxLexborContext.from_parsed_http_crawling_context(context)

# Build context pipeline: HTTP request -> parsing -> custom context.
kwargs['_context_pipeline'] = (
    self._create_static_content_crawler_pipeline().compose(final_step)
)
It'd be nice to mention that these are not strictly necessary; you only need this if you want "idiomatic" helpers on the context (parser in this case). If you don't mind sticking with parsed_content, you can skip a lot of boilerplate.
I think we should mention that this step is optional in a comment or something right here.
vdusek left a comment
Looks great!
I'm also updating the description to note that this resolves #702 as well, due to saxonche.
However, I would say this content would fit better in the HTTP crawlers guide. I'd suggest merging it there.
"Using HttpCrawler with a custom parser" could be the next section after "HttpCrawler", and "Creating a custom crawler" is essentially the same as "Creating a custom HTTP crawler".
docs/guides/code_examples/crawler_custom_parser/selectolax_adaptive_run.py
…ptive_run.py Co-authored-by: Jan Buchar <Teyras@gmail.com>
vdusek left a comment
This is really good, thanks!
Just a few minor things, mostly regarding more concise titles and consistent grammatical forms.
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
vdusek left a comment
One more nit. Otherwise LGTM.
async def final_step(
    context: ParsedHttpCrawlingContext[LexborHTMLParser],
) -> AsyncGenerator[SelectolaxLexborContext, None]:
    yield SelectolaxLexborContext.from_parsed_http_crawling_context(context)

# Build context pipeline: HTTP request -> parsing -> custom context.
kwargs['_context_pipeline'] = (
    self._create_static_content_crawler_pipeline().compose(final_step)
)
I think we should mention that this step is optional in a comment or something right here.
Co-authored-by: Jan Buchar <Teyras@gmail.com>