Commit 65f8e0d

docs: Flatten scraper examples and fix guide inaccuracies
1 parent 3f25d4a commit 65f8e0d

6 files changed

Lines changed: 209 additions & 154 deletions

docs/03_guides/06_scrapy.mdx

Lines changed: 3 additions & 3 deletions
@@ -23,9 +23,9 @@ In this guide, you'll learn how to use the [Scrapy](https://scrapy.org/) framewo

 ## Integrating Scrapy with the Apify platform

-The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install the Twisted's `asyncioreactor` to run Twisted's asyncio compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
+The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install Twisted's `asyncioreactor` to run Twisted's asyncio compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.

-<CodeBlock className="language-python" title="__main.py__: The Actor entry point ">
+<CodeBlock className="language-python" title="__main__.py: The Actor entry point">
 {UnderscoreMainExample}
 </CodeBlock>

@@ -74,7 +74,7 @@ For further details, see the [Scrapy migration guide](https://docs.apify.com/cli
 The following example shows a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.

 <Tabs>
-<TabItem value="__main__.py" label="__main.py__">
+<TabItem value="__main__.py" label="__main__.py">
 <CodeBlock className="language-python">
 {UnderscoreMainExample}
 </CodeBlock>

docs/03_guides/07_running_webserver.mdx

Lines changed: 10 additions & 2 deletions
@@ -18,9 +18,9 @@ The URL is available in the following places:

 - In Apify Console, on the Actor run details page as the **Container URL** field.
 - In the API as the `container_url` property of the [Run object](https://docs.apify.com/api/v2#/reference/actors/run-object/get-run).
-- In the Actor as the `Actor.configuration.container_url` property.
+- In the Actor as the `Actor.configuration.web_server_url` property.

-The web server running inside the container must listen at the port defined by the `Actor.configuration.container_port` property. When running Actors locally, the port defaults to `4321`, so the web server will be accessible at `http://localhost:4321`.
+The web server running inside the container must listen at the port defined by the `Actor.configuration.web_server_port` property. When running Actors locally, the port defaults to `4321`, so the web server will be accessible at `http://localhost:4321`.

 ## Example Actor

@@ -30,6 +30,14 @@ The following example shows how to start a simple web server in your Actor, whic
 {WebserverExample}
 </RunnableCodeBlock>

+## Actor Standby
+
+The example above runs a web server for the duration of a single Actor run. With [Actor Standby](https://docs.apify.com/platform/actors/development/programming-interface/standby), you can instead expose your Actor as an always-ready HTTP API: the platform keeps the Actor running in the background and routes incoming HTTP requests to the web server inside it, spinning up additional instances as the load grows.
+
+From the SDK's perspective, a Standby Actor is built the same way as the web server above: start an HTTP server listening on the port from `Actor.configuration.web_server_port`. The difference is operational: instead of doing its work once and exiting, a Standby Actor stays up and serves requests. This makes it a good fit for low-latency, on-demand use cases, such as serving scraped data or acting as a microservice.
+
+To get started quickly, use the [Standby Python template](https://apify.com/templates/python-standby). For details on enabling Standby, request routing, and readiness probes, see the [Actor Standby documentation](https://docs.apify.com/platform/actors/development/programming-interface/standby).
+
 ## Conclusion

 In this guide, you learned how to run a web server inside your Apify Actor. By leveraging the container URL and port provided by the platform, you can expose HTTP endpoints for monitoring, reporting, or serving content during Actor execution. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU).
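
The port convention described in the web server guide above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the SDK's implementation: in a real Actor the port would come from `Actor.configuration.web_server_port`, whereas the hypothetical helper `resolve_port` below reads an `ACTOR_WEB_SERVER_PORT` environment variable (assumed to be the value behind that property) and falls back to the local default of `4321`. `StatusHandler` and `start_server` are likewise illustrative names.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Mapping, Optional

DEFAULT_PORT = 4321  # local default mentioned in the guide


def resolve_port(env: Optional[Mapping[str, str]] = None) -> int:
    """Hypothetical helper: choose the port the web server should listen on."""
    env = os.environ if env is None else env
    # ACTOR_WEB_SERVER_PORT is an assumption of this sketch; inside an Actor,
    # prefer Actor.configuration.web_server_port.
    return int(env.get('ACTOR_WEB_SERVER_PORT', DEFAULT_PORT))


class StatusHandler(BaseHTTPRequestHandler):
    """Answer every GET request with a short plain-text status message."""

    def do_GET(self) -> None:
        body = b'Actor web server is running'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        # Silence the default per-request logging to stderr.
        pass


def start_server(port: int) -> ThreadingHTTPServer:
    """Start the server in a background thread and return it."""
    # An empty host binds to all interfaces, so the platform can reach the server.
    server = ThreadingHTTPServer(('', port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_server(resolve_port())` on a local machine would serve at `http://localhost:4321` unless the variable is set. A Standby-style Actor would keep the process alive afterwards instead of exiting.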

docs/03_guides/code/01_beautifulsoup_httpx.py

Lines changed: 51 additions & 38 deletions
@@ -1,4 +1,5 @@
 import asyncio
+from typing import Any
 from urllib.parse import urljoin

 import httpx
@@ -7,6 +8,40 @@
 from apify import Actor, Request


+async def scrape_page(
+    client: httpx.AsyncClient, url: str
+) -> tuple[dict[str, Any], list[str]]:
+    """Fetch a single page with HTTPX and extract its data and links.
+
+    Keeping the fetching and parsing in this helper keeps the Actor's main loop
+    shallow. It returns the extracted data together with the links found on the
+    page, so `main` only has to decide what to store and what to enqueue.
+    """
+    # Fetch the HTTP response from the specified URL using HTTPX.
+    response = await client.get(url, follow_redirects=True)
+
+    # Parse the HTML content using Beautiful Soup.
+    soup = BeautifulSoup(response.content, 'html.parser')
+
+    # Extract the desired data.
+    data = {
+        'url': url,
+        'title': soup.title.string if soup.title else None,
+        'h1s': [h1.text for h1 in soup.find_all('h1')],
+        'h2s': [h2.text for h2 in soup.find_all('h2')],
+        'h3s': [h3.text for h3 in soup.find_all('h3')],
+    }
+
+    # Collect absolute links found on the page so the caller can enqueue them.
+    links: list[str] = []
+    for link in soup.find_all('a'):
+        link_url = urljoin(url, link.get('href'))
+        if link_url.startswith(('http://', 'https://')):
+            links.append(link_url)
+
+    return data, links
+
+
 async def main() -> None:
     # Enter the context of the Actor.
     async with Actor:
@@ -23,65 +58,43 @@ async def main() -> None:
         # Open the default request queue for handling URLs to be processed.
         request_queue = await Actor.open_request_queue()

-        # Enqueue the start URLs with an initial crawl depth of 0.
+        # Enqueue the start URLs. Their crawl depth defaults to 0.
         for start_url in start_urls:
             url = start_url.get('url')
             Actor.log.info(f'Enqueuing {url} ...')
-            new_request = Request.from_url(url, user_data={'depth': 0})
-            await request_queue.add_request(new_request)
+            await request_queue.add_request(Request.from_url(url))

         # Create an HTTPX client to fetch the HTML content of the URLs.
         async with httpx.AsyncClient() as client:
             # Process the URLs from the request queue.
             while request := await request_queue.fetch_next_request():
                 url = request.url

-                if not isinstance(request.user_data['depth'], (str, int)):
-                    raise TypeError('Request.depth is an unexpected type.')
-
-                depth = int(request.user_data['depth'])
+                # Read the crawl depth tracked by the request itself.
+                depth = request.crawl_depth
                 Actor.log.info(f'Scraping {url} (depth={depth}) ...')

                 try:
-                    # Fetch the HTTP response from the specified URL using HTTPX.
-                    response = await client.get(url, follow_redirects=True)
-
-                    # Parse the HTML content using Beautiful Soup.
-                    soup = BeautifulSoup(response.content, 'html.parser')
-
-                    # If the current depth is less than max_depth, find nested links
-                    # and enqueue them.
-                    if depth < max_depth:
-                        for link in soup.find_all('a'):
-                            link_href = link.get('href')
-                            link_url = urljoin(url, link_href)
-
-                            if link_url.startswith(('http://', 'https://')):
-                                Actor.log.info(f'Enqueuing {link_url} ...')
-                                new_request = Request.from_url(
-                                    link_url,
-                                    user_data={'depth': depth + 1},
-                                )
-                                await request_queue.add_request(new_request)
-
-                    # Extract the desired data.
-                    data = {
-                        'url': url,
-                        'title': soup.title.string if soup.title else None,
-                        'h1s': [h1.text for h1 in soup.find_all('h1')],
-                        'h2s': [h2.text for h2 in soup.find_all('h2')],
-                        'h3s': [h3.text for h3 in soup.find_all('h3')],
-                    }
+                    # Fetch the page and extract its data and nested links.
+                    data, links = await scrape_page(client, url)

                     # Store the extracted data to the default dataset.
                     await Actor.push_data(data)

+                    # If we are not too deep yet, enqueue the links we found.
+                    if depth < max_depth:
+                        for link_url in links:
+                            Actor.log.info(f'Enqueuing {link_url} ...')
+                            new_request = Request.from_url(link_url)
+                            new_request.crawl_depth = depth + 1
+                            await request_queue.add_request(new_request)
+
                 except Exception:
                     Actor.log.exception(f'Cannot extract data from {url}.')

                 finally:
-                    # Mark the request as handled to ensure it is not processed again.
-                    await request_queue.mark_request_as_handled(new_request)
+                    # Mark the request as handled so it is not processed again.
+                    await request_queue.mark_request_as_handled(request)


 if __name__ == '__main__':
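
The link-collection step that this change moves into `scrape_page` can be tried in isolation. The sketch below uses only the standard library: it swaps Beautiful Soup for `html.parser.HTMLParser` so it runs without dependencies, and `LinkCollector` and `collect_links` are hypothetical names that do not appear in the commit. It keeps the two rules from the diff: resolve each `href` against the page URL with `urljoin`, then keep only results that start with `http://` or `https://`.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the raw href value of every <a> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(
        self, tag: str, attrs: list[tuple[str, Optional[str]]]
    ) -> None:
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value is not None:
                self.hrefs.append(value)


def collect_links(page_url: str, html: str) -> list[str]:
    """Return absolute http(s) links found in `html`, resolved against `page_url`."""
    collector = LinkCollector()
    collector.feed(html)

    links: list[str] = []
    for href in collector.hrefs:
        # Relative references become absolute; other schemes pass through unchanged.
        link_url = urljoin(page_url, href)
        # Drop anything that is not a web link, such as mailto: addresses.
        if link_url.startswith(('http://', 'https://')):
            links.append(link_url)
    return links
```

Filtering after `urljoin` rather than before means relative links such as `page2.html` are kept, while non-web schemes are discarded because `urljoin` leaves them untouched.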

docs/03_guides/code/02_parsel_impit.py

Lines changed: 50 additions & 43 deletions
@@ -1,4 +1,5 @@
 import asyncio
+from typing import Any
 from urllib.parse import urljoin

 import impit
@@ -7,6 +8,40 @@
 from apify import Actor, Request


+async def scrape_page(
+    client: impit.AsyncClient, url: str
+) -> tuple[dict[str, Any], list[str]]:
+    """Fetch a single page with Impit and extract its data and links.
+
+    Keeping the fetching and parsing in this helper keeps the Actor's main loop
+    shallow. It returns the extracted data together with the links found on the
+    page, so `main` only has to decide what to store and what to enqueue.
+    """
+    # Fetch the HTTP response from the specified URL using Impit.
+    response = await client.get(url)
+
+    # Parse the HTML content using a Parsel selector.
+    selector = parsel.Selector(text=response.text)
+
+    # Extract the desired data using Parsel selectors.
+    data = {
+        'url': url,
+        'title': selector.css('title::text').get(),
+        'h1s': selector.css('h1::text').getall(),
+        'h2s': selector.css('h2::text').getall(),
+        'h3s': selector.css('h3::text').getall(),
+    }
+
+    # Collect absolute links found on the page so the caller can enqueue them.
+    links: list[str] = []
+    for link_href in selector.css('a::attr(href)').getall():
+        link_url = urljoin(url, link_href)
+        if link_url.startswith(('http://', 'https://')):
+            links.append(link_url)
+
+    return data, links
+
+
 async def main() -> None:
     # Enter the context of the Actor.
     async with Actor:
@@ -23,70 +58,42 @@ async def main() -> None:
         # Open the default request queue for handling URLs to be processed.
         request_queue = await Actor.open_request_queue()

-        # Enqueue the start URLs with an initial crawl depth of 0.
+        # Enqueue the start URLs. Their crawl depth defaults to 0.
         for start_url in start_urls:
             url = start_url.get('url')
             Actor.log.info(f'Enqueuing {url} ...')
-            new_request = Request.from_url(url, user_data={'depth': 0})
-            await request_queue.add_request(new_request)
+            await request_queue.add_request(Request.from_url(url))

         # Create an Impit client to fetch the HTML content of the URLs.
         async with impit.AsyncClient() as client:
             # Process the URLs from the request queue.
             while request := await request_queue.fetch_next_request():
                 url = request.url

-                if not isinstance(request.user_data['depth'], (str, int)):
-                    raise TypeError('Request.depth is an unexpected type.')
-
-                depth = int(request.user_data['depth'])
+                # Read the crawl depth tracked by the request itself.
+                depth = request.crawl_depth
                 Actor.log.info(f'Scraping {url} (depth={depth}) ...')

                 try:
-                    # Fetch the HTTP response from the specified URL using Impit.
-                    response = await client.get(url)
-
-                    # Parse the HTML content using Parsel Selector.
-                    selector = parsel.Selector(text=response.text)
-
-                    # If the current depth is less than max_depth, find nested links
-                    # and enqueue them.
-                    if depth < max_depth:
-                        # Extract all links using CSS selector
-                        links = selector.css('a::attr(href)').getall()
-                        for link_href in links:
-                            link_url = urljoin(url, link_href)
-
-                            if link_url.startswith(('http://', 'https://')):
-                                Actor.log.info(f'Enqueuing {link_url} ...')
-                                new_request = Request.from_url(
-                                    link_url,
-                                    user_data={'depth': depth + 1},
-                                )
-                                await request_queue.add_request(new_request)
-
-                    # Extract the desired data using Parsel selectors.
-                    title = selector.css('title::text').get()
-                    h1s = selector.css('h1::text').getall()
-                    h2s = selector.css('h2::text').getall()
-                    h3s = selector.css('h3::text').getall()
-
-                    data = {
-                        'url': url,
-                        'title': title,
-                        'h1s': h1s,
-                        'h2s': h2s,
-                        'h3s': h3s,
-                    }
+                    # Fetch the page and extract its data and nested links.
+                    data, links = await scrape_page(client, url)

                     # Store the extracted data to the default dataset.
                     await Actor.push_data(data)

+                    # If we are not too deep yet, enqueue the links we found.
+                    if depth < max_depth:
+                        for link_url in links:
+                            Actor.log.info(f'Enqueuing {link_url} ...')
+                            new_request = Request.from_url(link_url)
+                            new_request.crawl_depth = depth + 1
+                            await request_queue.add_request(new_request)
+
                 except Exception:
                     Actor.log.exception(f'Cannot extract data from {url}.')

                 finally:
-                    # Mark the request as handled to ensure it is not processed again.
+                    # Mark the request as handled so it is not processed again.
                     await request_queue.mark_request_as_handled(request)
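
The control flow shared by both rewritten examples can be sketched without the SDK: read the depth from the request, store the result, enqueue links only while `depth < max_depth`, and always mark the fetched request as handled. Everything below is a stand-in for illustration: `FakeRequest`, `FakeQueue`, `PAGES`, and `crawl` are hypothetical, and the in-memory queue only imitates the deduplication assumed of a real request queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

# Hypothetical site map: page URL -> links found on that page.
PAGES = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/c'],
    'https://example.com/b': ['https://example.com/'],
    'https://example.com/c': [],
}


@dataclass
class FakeRequest:
    url: str
    crawl_depth: int = 0  # start URLs default to depth 0


class FakeQueue:
    """In-memory stand-in for a request queue that skips already-seen URLs."""

    def __init__(self) -> None:
        self._pending: deque = deque()
        self._seen: set = set()
        self.handled: list = []

    def add_request(self, request: FakeRequest) -> None:
        if request.url not in self._seen:
            self._seen.add(request.url)
            self._pending.append(request)

    def fetch_next_request(self) -> Optional[FakeRequest]:
        return self._pending.popleft() if self._pending else None

    def mark_request_as_handled(self, request: FakeRequest) -> None:
        self.handled.append(request.url)


def crawl(start_url: str, max_depth: int) -> tuple:
    """Return (url -> depth at which it was scraped, handled URLs in order)."""
    queue = FakeQueue()
    queue.add_request(FakeRequest(start_url))
    scraped: dict = {}

    while (request := queue.fetch_next_request()) is not None:
        depth = request.crawl_depth
        try:
            links = PAGES[request.url]  # raises KeyError for unknown pages
            scraped[request.url] = depth
            # Enqueue nested links only while below the depth limit.
            if depth < max_depth:
                for link_url in links:
                    queue.add_request(FakeRequest(link_url, crawl_depth=depth + 1))
        except KeyError:
            pass  # a real Actor would log the failure here
        finally:
            # Mark the request that was fetched, not one that was just created.
            queue.mark_request_as_handled(request)

    return scraped, queue.handled
```

With `max_depth=1`, pages at depth 1 are still scraped but their links are not followed, which matches the `depth < max_depth` check in the diff. Marking `request` in the `finally` block keeps the queue consistent even when scraping fails.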