docs: Flatten scraper examples and fix guide inaccuracies

vdusek · vdusek · commit 65f8e0d1bdce · 2026-06-05T13:24:12.000+02:00
diff --git a/docs/03_guides/06_scrapy.mdx b/docs/03_guides/06_scrapy.mdx
@@ -23,9 +23,9 @@ In this guide, you'll learn how to use the [Scrapy](https://scrapy.org/) framewo
 
 ## Integrating Scrapy with the Apify platform
 
-The Apify SDK provides an Apify-Scrapy integration. The main challenge of this is to combine two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key thing is to install the Twisted's `asyncioreactor` to run Twisted's asyncio compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
+The Apify SDK provides an Apify-Scrapy integration. The main challenge is combining two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key step is to install Twisted's `asyncioreactor`, which runs Twisted on an asyncio-compatible event loop. The `apify.scrapy.run_scrapy_actor` function handles this reactor installation automatically. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.
 
-<CodeBlock className="language-python" title="__main.py__: The Actor entry point ">
+<CodeBlock className="language-python" title="__main__.py: The Actor entry point">
     {UnderscoreMainExample}
 </CodeBlock>
 
@@ -74,7 +74,7 @@ For further details, see the [Scrapy migration guide](https://docs.apify.com/cli
 The following example shows a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.
 
 <Tabs>
-    <TabItem value="__main__.py" label="__main.py__">
+    <TabItem value="__main__.py" label="__main__.py">
         <CodeBlock className="language-python">
             {UnderscoreMainExample}
         </CodeBlock>
diff --git a/docs/03_guides/07_running_webserver.mdx b/docs/03_guides/07_running_webserver.mdx
@@ -18,9 +18,9 @@ The URL is available in the following places:
 
 - In Apify Console, on the Actor run details page as the **Container URL** field.
 - In the API as the `container_url` property of the [Run object](https://docs.apify.com/api/v2#/reference/actors/run-object/get-run).
-- In the Actor as the `Actor.configuration.container_url` property.
+- In the Actor as the `Actor.configuration.web_server_url` property.
 
-The web server running inside the container must listen at the port defined by the `Actor.configuration.container_port` property. When running Actors locally, the port defaults to `4321`, so the web server will be accessible at `http://localhost:4321`.
+The web server running inside the container must listen at the port defined by the `Actor.configuration.web_server_port` property. When running Actors locally, the port defaults to `4321`, so the web server will be accessible at `http://localhost:4321`.
 
 ## Example Actor
 
@@ -30,6 +30,14 @@ The following example shows how to start a simple web server in your Actor, whic
     {WebserverExample}
 </RunnableCodeBlock>
 
+## Actor Standby
+
+The example above runs a web server for the duration of a single Actor run. With [Actor Standby](https://docs.apify.com/platform/actors/development/programming-interface/standby), you can instead expose your Actor as an always-ready HTTP API: the platform keeps the Actor running in the background and routes incoming HTTP requests to the web server inside it, spinning up additional instances as the load grows.
+
+From the SDK's perspective, a Standby Actor is built the same way as the web server above: it starts an HTTP server listening on the port from `Actor.configuration.web_server_port`. The difference is operational: instead of doing its work once and exiting, a Standby Actor stays up and serves requests. This makes it a good fit for low-latency, on-demand use cases, such as serving scraped data or acting as a microservice.
+
+To get started quickly, use the [Standby Python template](https://apify.com/templates/python-standby). For details on enabling Standby, request routing, and readiness probes, see the [Actor Standby documentation](https://docs.apify.com/platform/actors/development/programming-interface/standby).
+
 ## Conclusion
 
 In this guide, you learned how to run a web server inside your Apify Actor. By leveraging the container URL and port provided by the platform, you can expose HTTP endpoints for monitoring, reporting, or serving content during Actor execution. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU).
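
Note on the port handling described in this guide: the sketch below shows the general shape of "listen on a port supplied from outside" using only the Python standard library. It is not the guide's `WebserverExample`. The names `HelloHandler`, `start_server`, and `fetch` are hypothetical, and in a real Actor the port would presumably be read from `Actor.configuration.web_server_port` rather than passed in by hand.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answer every GET request with a short plain-text body."""

    def do_GET(self) -> None:
        body = b'Hello from the Actor web server!'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        # Silence the default per-request logging to stderr.
        pass


def start_server(port: int, host: str = '') -> HTTPServer:
    """Start the server in a background thread and return it.

    An empty host binds to all interfaces. In an Actor, `port` would come
    from the SDK configuration (assumption, see the note above).
    """
    server = HTTPServer((host, port), HelloHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(url: str) -> str:
    """Fetch a URL directly, ignoring any proxy environment variables."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.read().decode('utf-8')
```

Calling `start_server(4321)` locally and opening `http://localhost:4321` should return the greeting. Passing `0` as the port lets the operating system pick a free one, which is convenient in tests.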
diff --git a/docs/03_guides/code/01_beautifulsoup_httpx.py b/docs/03_guides/code/01_beautifulsoup_httpx.py
@@ -1,4 +1,5 @@
 import asyncio
+from typing import Any
 from urllib.parse import urljoin
 
 import httpx
@@ -7,6 +8,40 @@
 from apify import Actor, Request
 
 
+async def scrape_page(
+    client: httpx.AsyncClient, url: str
+) -> tuple[dict[str, Any], list[str]]:
+    """Fetch a single page with HTTPX and extract its data and links.
+
+    Doing the fetching and parsing in this helper keeps the Actor's main loop
+    shallow. It returns the extracted data together with the links found on the
+    page, so `main` only has to decide what to store and what to enqueue.
+    """
+    # Fetch the HTTP response from the specified URL using HTTPX.
+    response = await client.get(url, follow_redirects=True)
+
+    # Parse the HTML content using Beautiful Soup.
+    soup = BeautifulSoup(response.content, 'html.parser')
+
+    # Extract the desired data.
+    data = {
+        'url': url,
+        'title': soup.title.string if soup.title else None,
+        'h1s': [h1.text for h1 in soup.find_all('h1')],
+        'h2s': [h2.text for h2 in soup.find_all('h2')],
+        'h3s': [h3.text for h3 in soup.find_all('h3')],
+    }
+
+    # Collect absolute links found on the page so the caller can enqueue them.
+    links: list[str] = []
+    for link in soup.find_all('a'):
+        link_url = urljoin(url, link.get('href'))
+        if link_url.startswith(('http://', 'https://')):
+            links.append(link_url)
+
+    return data, links
+
+
 async def main() -> None:
     # Enter the context of the Actor.
     async with Actor:
@@ -23,65 +58,43 @@ async def main() -> None:
         # Open the default request queue for handling URLs to be processed.
         request_queue = await Actor.open_request_queue()
 
-        # Enqueue the start URLs with an initial crawl depth of 0.
+        # Enqueue the start URLs. Their crawl depth defaults to 0.
         for start_url in start_urls:
             url = start_url.get('url')
             Actor.log.info(f'Enqueuing {url} ...')
-            new_request = Request.from_url(url, user_data={'depth': 0})
-            await request_queue.add_request(new_request)
+            await request_queue.add_request(Request.from_url(url))
 
         # Create an HTTPX client to fetch the HTML content of the URLs.
         async with httpx.AsyncClient() as client:
             # Process the URLs from the request queue.
             while request := await request_queue.fetch_next_request():
                 url = request.url
 
-                if not isinstance(request.user_data['depth'], (str, int)):
-                    raise TypeError('Request.depth is an unexpected type.')
-
-                depth = int(request.user_data['depth'])
+                # Read the crawl depth tracked by the request itself.
+                depth = request.crawl_depth
                 Actor.log.info(f'Scraping {url} (depth={depth}) ...')
 
                 try:
-                    # Fetch the HTTP response from the specified URL using HTTPX.
-                    response = await client.get(url, follow_redirects=True)
-
-                    # Parse the HTML content using Beautiful Soup.
-                    soup = BeautifulSoup(response.content, 'html.parser')
-
-                    # If the current depth is less than max_depth, find nested links
-                    # and enqueue them.
-                    if depth < max_depth:
-                        for link in soup.find_all('a'):
-                            link_href = link.get('href')
-                            link_url = urljoin(url, link_href)
-
-                            if link_url.startswith(('http://', 'https://')):
-                                Actor.log.info(f'Enqueuing {link_url} ...')
-                                new_request = Request.from_url(
-                                    link_url,
-                                    user_data={'depth': depth + 1},
-                                )
-                                await request_queue.add_request(new_request)
-
-                    # Extract the desired data.
-                    data = {
-                        'url': url,
-                        'title': soup.title.string if soup.title else None,
-                        'h1s': [h1.text for h1 in soup.find_all('h1')],
-                        'h2s': [h2.text for h2 in soup.find_all('h2')],
-                        'h3s': [h3.text for h3 in soup.find_all('h3')],
-                    }
+                    # Fetch the page and extract its data and nested links.
+                    data, links = await scrape_page(client, url)
 
                     # Store the extracted data to the default dataset.
                     await Actor.push_data(data)
 
+                    # If we are not too deep yet, enqueue the links we found.
+                    if depth < max_depth:
+                        for link_url in links:
+                            Actor.log.info(f'Enqueuing {link_url} ...')
+                            new_request = Request.from_url(link_url)
+                            new_request.crawl_depth = depth + 1
+                            await request_queue.add_request(new_request)
+
                 except Exception:
                     Actor.log.exception(f'Cannot extract data from {url}.')
 
                 finally:
-                    # Mark the request as handled to ensure it is not processed again.
-                    await request_queue.mark_request_as_handled(new_request)
+                    # Mark the request as handled so it is not processed again.
+                    await request_queue.mark_request_as_handled(request)
 
 
 if __name__ == '__main__':
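
Note on the depth check in this file: the `depth < max_depth` rule can be exercised without the SDK or the network. The sketch below is a simplified stand-in, not the Actor code. It replaces the request queue with an in-memory `deque` and replaces page fetching with a hypothetical `LINK_GRAPH` dictionary. The `crawl` function and the example URLs are assumptions made for illustration.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: URL -> links found on it.
LINK_GRAPH: dict[str, list[str]] = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b'],
    'https://example.com/a': ['https://example.com/a/deep'],
    'https://example.com/b': [],
    'https://example.com/a/deep': ['https://example.com/a/deeper'],
}


def crawl(start_urls: list[str], max_depth: int) -> list[tuple[str, int]]:
    """Visit pages breadth-first and return (url, depth) in visiting order."""
    # Start URLs enter the queue at depth 0.
    queue: deque[tuple[str, int]] = deque((url, 0) for url in start_urls)
    seen = set(start_urls)
    visited: list[tuple[str, int]] = []

    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))

        # Same rule as in the diff: only enqueue links while below max_depth.
        if depth < max_depth:
            for link_url in LINK_GRAPH.get(url, []):
                # The `seen` set stands in for the queue's deduplication.
                if link_url not in seen:
                    seen.add(link_url)
                    queue.append((link_url, depth + 1))

    return visited
```

With `max_depth=1`, the start page is visited at depth 0 and its two links at depth 1, while `https://example.com/a/deep` is never enqueued. Pages at exactly `max_depth` are still scraped; only their links are dropped.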
diff --git a/docs/03_guides/code/02_parsel_impit.py b/docs/03_guides/code/02_parsel_impit.py
@@ -1,4 +1,5 @@
 import asyncio
+from typing import Any
 from urllib.parse import urljoin
 
 import impit
@@ -7,6 +8,40 @@
 from apify import Actor, Request
 
 
+async def scrape_page(
+    client: impit.AsyncClient, url: str
+) -> tuple[dict[str, Any], list[str]]:
+    """Fetch a single page with Impit and extract its data and links.
+
+    Doing the fetching and parsing in this helper keeps the Actor's main loop
+    shallow. It returns the extracted data together with the links found on the
+    page, so `main` only has to decide what to store and what to enqueue.
+    """
+    # Fetch the HTTP response from the specified URL using Impit.
+    response = await client.get(url)
+
+    # Parse the HTML content using a Parsel selector.
+    selector = parsel.Selector(text=response.text)
+
+    # Extract the desired data using Parsel selectors.
+    data = {
+        'url': url,
+        'title': selector.css('title::text').get(),
+        'h1s': selector.css('h1::text').getall(),
+        'h2s': selector.css('h2::text').getall(),
+        'h3s': selector.css('h3::text').getall(),
+    }
+
+    # Collect absolute links found on the page so the caller can enqueue them.
+    links: list[str] = []
+    for link_href in selector.css('a::attr(href)').getall():
+        link_url = urljoin(url, link_href)
+        if link_url.startswith(('http://', 'https://')):
+            links.append(link_url)
+
+    return data, links
+
+
 async def main() -> None:
     # Enter the context of the Actor.
     async with Actor:
@@ -23,70 +58,42 @@ async def main() -> None:
         # Open the default request queue for handling URLs to be processed.
         request_queue = await Actor.open_request_queue()
 
-        # Enqueue the start URLs with an initial crawl depth of 0.
+        # Enqueue the start URLs. Their crawl depth defaults to 0.
         for start_url in start_urls:
             url = start_url.get('url')
             Actor.log.info(f'Enqueuing {url} ...')
-            new_request = Request.from_url(url, user_data={'depth': 0})
-            await request_queue.add_request(new_request)
+            await request_queue.add_request(Request.from_url(url))
 
         # Create an Impit client to fetch the HTML content of the URLs.
         async with impit.AsyncClient() as client:
             # Process the URLs from the request queue.
             while request := await request_queue.fetch_next_request():
                 url = request.url
 
-                if not isinstance(request.user_data['depth'], (str, int)):
-                    raise TypeError('Request.depth is an unexpected type.')
-
-                depth = int(request.user_data['depth'])
+                # Read the crawl depth tracked by the request itself.
+                depth = request.crawl_depth
                 Actor.log.info(f'Scraping {url} (depth={depth}) ...')
 
                 try:
-                    # Fetch the HTTP response from the specified URL using Impit.
-                    response = await client.get(url)
-
-                    # Parse the HTML content using Parsel Selector.
-                    selector = parsel.Selector(text=response.text)
-
-                    # If the current depth is less than max_depth, find nested links
-                    # and enqueue them.
-                    if depth < max_depth:
-                        # Extract all links using CSS selector
-                        links = selector.css('a::attr(href)').getall()
-                        for link_href in links:
-                            link_url = urljoin(url, link_href)
-
-                            if link_url.startswith(('http://', 'https://')):
-                                Actor.log.info(f'Enqueuing {link_url} ...')
-                                new_request = Request.from_url(
-                                    link_url,
-                                    user_data={'depth': depth + 1},
-                                )
-                                await request_queue.add_request(new_request)
-
-                    # Extract the desired data using Parsel selectors.
-                    title = selector.css('title::text').get()
-                    h1s = selector.css('h1::text').getall()
-                    h2s = selector.css('h2::text').getall()
-                    h3s = selector.css('h3::text').getall()
-
-                    data = {
-                        'url': url,
-                        'title': title,
-                        'h1s': h1s,
-                        'h2s': h2s,
-                        'h3s': h3s,
-                    }
+                    # Fetch the page and extract its data and nested links.
+                    data, links = await scrape_page(client, url)
 
                     # Store the extracted data to the default dataset.
                     await Actor.push_data(data)
 
+                    # If we are not too deep yet, enqueue the links we found.
+                    if depth < max_depth:
+                        for link_url in links:
+                            Actor.log.info(f'Enqueuing {link_url} ...')
+                            new_request = Request.from_url(link_url)
+                            new_request.crawl_depth = depth + 1
+                            await request_queue.add_request(new_request)
+
                 except Exception:
                     Actor.log.exception(f'Cannot extract data from {url}.')
 
                 finally:
-                    # Mark the request as handled to ensure it is not processed again.
+                    # Mark the request as handled so it is not processed again.
                     await request_queue.mark_request_as_handled(request)
 
 
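
Note on the link collection shared by both `scrape_page` helpers: the sketch below isolates the `urljoin` plus scheme-filter step so it can be checked on its own. The helper name `absolute_http_links` is hypothetical. Skipping empty or missing `href` values is an addition made here for illustration and is not part of the diff.

```python
from typing import Iterable, List, Optional
from urllib.parse import urljoin


def absolute_http_links(base_url: str, hrefs: Iterable[Optional[str]]) -> List[str]:
    """Resolve hrefs against base_url and keep only http(s) links."""
    links: List[str] = []
    for href in hrefs:
        # Assumption for this sketch: anchors without an href are skipped.
        if not href:
            continue

        # Relative links are resolved against the page URL.
        link_url = urljoin(base_url, href)

        # Drop non-web schemes such as mailto: or javascript:.
        if link_url.startswith(('http://', 'https://')):
            links.append(link_url)

    return links
```

For example, resolving `guide` and `/about` against `https://example.com/docs/` gives `https://example.com/docs/guide` and `https://example.com/about`, while a `mailto:` link is filtered out.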
diff --git a/docs/03_guides/code/03_playwright.py b/docs/03_guides/code/03_playwright.py
diff --git a/docs/03_guides/code/04_selenium.py b/docs/03_guides/code/04_selenium.py