Commit 8a854f2

docs: make Selenium example runnable, proxy in a separate section
1 parent 0b8d10a commit 8a854f2

3 files changed

Lines changed: 94 additions & 75 deletions

docs/03_guides/04_selenium.mdx

Lines changed: 13 additions & 6 deletions
@@ -5,8 +5,10 @@ description: Build an Apify Actor that scrapes dynamic web pages using Selenium
 ---
 
 import CodeBlock from '@theme/CodeBlock';
+import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
 
-import SeleniumExample from '!!raw-loader!./code/04_selenium.py';
+import SeleniumExample from '!!raw-loader!roa-loader!./code/04_selenium.py';
+import SeleniumProxyExample from '!!raw-loader!./code/04_selenium_proxy.py';
 
 In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for browser automation and web scraping in your Apify Actors.
 
@@ -36,16 +38,21 @@ This is a simple Actor that recursively scrapes data from linked pages on the sa
 
 It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load.
 
-{/* Not runnable from the docs: the "Run on Apify" link encodes the whole snippet into the URL, and this Actor (with its inline proxy-auth extension) is large enough to exceed the URL length limit and fail with an HTTP 414. */}
-<CodeBlock className="language-python">
+<RunnableCodeBlock className="language-python" language="python">
 {SeleniumExample}
-</CodeBlock>
+</RunnableCodeBlock>
 
 ## Using Apify Proxy
 
-Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with `Actor.create_proxy_configuration` and routes the browser through it for the whole run.
+Running on the Apify platform gives your scraper access to [Apify Proxy](https://docs.apify.com/platform/proxy), which rotates IP addresses to avoid rate limiting and blocking. The runnable example above skips the proxy to stay simple. This section extends it to route the browser through Apify Proxy. The snippet below isn't a complete, runnable Actor on its own. It shows only the proxy-specific parts you add to the example above.
+
+Chrome ignores the credentials passed in the `--proxy-server` flag. To use an authenticated proxy such as Apify Proxy, configure it from inside a small extension. The `proxy_auth_extension` helper builds one at runtime. Its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. The proxy-aware `build_chrome_driver` below replaces the simple one from the example above and loads that extension. The new headless mode (`--headless=new`) is required for Chrome to load it.
+
+<CodeBlock className="language-python">
+{SeleniumProxyExample}
+</CodeBlock>
 
-Chrome ignores the credentials passed in the `--proxy-server` flag. Because of that, configure an authenticated proxy such as Apify Proxy from inside a small extension. The `proxy_auth_extension` helper builds one at runtime: its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. Note that the new headless mode (`--headless=new`) is required for Chrome to load the extension. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
+To wire it in, create the proxy configuration in `main` with `Actor.create_proxy_configuration`, get a URL with `await proxy_configuration.new_url()`, and pass it to `build_chrome_driver`. To select specific proxy groups or a country, pass the relevant arguments to `Actor.create_proxy_configuration`. For details, see [Proxy management](../concepts/proxy-management).
 
 ## Conclusion
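
The wiring step described in the new guide text (create the proxy configuration, get one URL, pass it to `build_chrome_driver`) can be sketched as follows. This is a minimal sketch, not code from the commit: `resolve_proxy_url`, `SupportsNewUrl`, and `FakeProxyConfiguration` are hypothetical names, the URL is a placeholder, and the fake only stands in for the object returned by `Actor.create_proxy_configuration` so the snippet runs without the Apify SDK.

```python
from __future__ import annotations

import asyncio
from typing import Protocol


class SupportsNewUrl(Protocol):
    """Hypothetical stand-in for the proxy configuration object."""

    async def new_url(self) -> str | None: ...


async def resolve_proxy_url(proxy_configuration: SupportsNewUrl | None) -> str | None:
    """Return one proxy URL for the whole run, or None when no proxy is configured."""
    if proxy_configuration is None:
        return None
    return await proxy_configuration.new_url()


class FakeProxyConfiguration:
    """Illustrative fake; a real Actor would use Actor.create_proxy_configuration."""

    async def new_url(self) -> str | None:
        # Placeholder URL, not a real Apify Proxy endpoint.
        return 'http://user:secret@proxy.example.com:8000'


if __name__ == '__main__':
    print(asyncio.run(resolve_proxy_url(FakeProxyConfiguration())))
    print(asyncio.run(resolve_proxy_url(None)))
```

In an Actor's `main`, the resolved URL would then presumably be handed to the proxy-aware `build_chrome_driver(proxy_url)`. Since that variant requires a URL, the `None` case would need to fall back to the simple driver from the runnable example.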

docs/03_guides/code/04_selenium.py

Lines changed: 3 additions & 69 deletions
@@ -1,10 +1,6 @@
 import asyncio
-import json
-from pathlib import Path
-from tempfile import mkdtemp
 from typing import Any
 from urllib.parse import urljoin, urlsplit
-from zipfile import ZipFile
 
 from selenium import webdriver
 from selenium.webdriver.chrome.options import Options as ChromeOptions
@@ -18,71 +14,17 @@
 # On the Apify platform, it's already in the Actor's Docker image.
 
 
-def proxy_auth_extension(proxy_url: str) -> str:
-    """Build a Chrome extension that routes Chrome through an authenticated proxy."""
-    parts = urlsplit(proxy_url)
-
-    manifest = {
-        'name': 'Apify Proxy',
-        'version': '1.0.0',
-        'manifest_version': 3,
-        'permissions': ['proxy', 'webRequest', 'webRequestAuthProvider'],
-        'host_permissions': ['<all_urls>'],
-        'background': {'service_worker': 'background.js'},
-        'minimum_chrome_version': '108',
-    }
-
-    # The service worker sets the proxy and answers the auth challenge.
-    proxy_config = json.dumps(
-        {
-            'mode': 'fixed_servers',
-            'rules': {
-                'singleProxy': {
-                    'scheme': parts.scheme,
-                    'host': parts.hostname,
-                    'port': parts.port,
-                },
-            },
-        }
-    )
-    credentials = json.dumps(
-        {'username': parts.username or '', 'password': parts.password or ''}
-    )
-    background = (
-        'chrome.proxy.settings.set('
-        '{value: ' + proxy_config + ', scope: "regular"});\n'
-        'chrome.webRequest.onAuthRequired.addListener(\n'
-        ' () => ({authCredentials: ' + credentials + '}),\n'
-        ' {urls: ["<all_urls>"]},\n'
-        ' ["blocking"],\n'
-        ');\n'
-    )
-
-    extension_path = Path(mkdtemp()) / 'apify_proxy.zip'
-    with ZipFile(extension_path, 'w') as archive:
-        archive.writestr('manifest.json', json.dumps(manifest))
-        archive.writestr('background.js', background)
-    return str(extension_path)
-
-
-def build_chrome_driver(proxy_url: str | None = None) -> webdriver.Chrome:
-    """Create a headless Chrome WebDriver, optionally routed through a proxy."""
+def build_chrome_driver() -> webdriver.Chrome:
+    """Create a headless Chrome WebDriver suitable for a container."""
     chrome_options = ChromeOptions()
 
     if Actor.configuration.headless:
-        # The new headless mode is required to load the proxy extension.
         chrome_options.add_argument('--headless=new')
 
     chrome_options.add_argument('--no-sandbox')
     chrome_options.add_argument('--disable-dev-shm-usage')
     chrome_options.add_argument('--disable-gpu')
 
-    if proxy_url:
-        chrome_options.add_extension(proxy_auth_extension(proxy_url))
-        chrome_options.add_argument(
-            '--disable-features=DisableLoadExtensionCommandLineSwitch'
-        )
-
     return webdriver.Chrome(options=chrome_options)
 
 
@@ -140,9 +82,6 @@ async def main() -> None:
             Actor.log.info('No start URLs specified in Actor input, exiting...')
             await Actor.exit()
 
-        # Selenium proxies at the browser level, so one URL is shared per run.
-        proxy_configuration = await Actor.create_proxy_configuration()
-
         # Open the request queue and enqueue the start URLs (crawl depth 0).
         request_queue = await Actor.open_request_queue()
         for start_url in start_urls:
@@ -154,13 +93,8 @@ async def main() -> None:
         max_requests = 10
         handled_requests = 0
 
-        # Fresh proxy URL for the run (None if no proxy).
-        proxy_url = None
-        if proxy_configuration:
-            proxy_url = await proxy_configuration.new_url()
-
         Actor.log.info('Launching Chrome WebDriver...')
-        driver = build_chrome_driver(proxy_url)
+        driver = build_chrome_driver()
 
         while handled_requests < max_requests and (
             request := await request_queue.fetch_next_request()
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+import json
+from pathlib import Path
+from tempfile import mkdtemp
+from urllib.parse import urlsplit
+from zipfile import ZipFile
+
+from selenium import webdriver
+from selenium.webdriver.chrome.options import Options as ChromeOptions
+
+from apify import Actor
+
+
+def proxy_auth_extension(proxy_url: str) -> str:
+    """Build a Chrome extension that routes Chrome through an authenticated proxy."""
+    parts = urlsplit(proxy_url)
+
+    manifest = {
+        'name': 'Apify Proxy',
+        'version': '1.0.0',
+        'manifest_version': 3,
+        'permissions': ['proxy', 'webRequest', 'webRequestAuthProvider'],
+        'host_permissions': ['<all_urls>'],
+        'background': {'service_worker': 'background.js'},
+        'minimum_chrome_version': '108',
+    }
+
+    # The service worker sets the proxy and answers the auth challenge.
+    proxy_config = json.dumps(
+        {
+            'mode': 'fixed_servers',
+            'rules': {
+                'singleProxy': {
+                    'scheme': parts.scheme,
+                    'host': parts.hostname,
+                    'port': parts.port,
+                },
+            },
+        }
+    )
+    credentials = json.dumps(
+        {'username': parts.username or '', 'password': parts.password or ''}
+    )
+    background = (
+        'chrome.proxy.settings.set('
+        '{value: ' + proxy_config + ', scope: "regular"});\n'
+        'chrome.webRequest.onAuthRequired.addListener(\n'
+        ' () => ({authCredentials: ' + credentials + '}),\n'
+        ' {urls: ["<all_urls>"]},\n'
+        ' ["blocking"],\n'
+        ');\n'
+    )
+
+    extension_path = Path(mkdtemp()) / 'apify_proxy.zip'
+    with ZipFile(extension_path, 'w') as archive:
+        archive.writestr('manifest.json', json.dumps(manifest))
+        archive.writestr('background.js', background)
+    return str(extension_path)
+
+
+def build_chrome_driver(proxy_url: str) -> webdriver.Chrome:
+    """Create a headless Chrome WebDriver routed through an authenticated proxy."""
+    chrome_options = ChromeOptions()
+
+    if Actor.configuration.headless:
+        # The new headless mode is required to load the proxy extension.
+        chrome_options.add_argument('--headless=new')
+
+    chrome_options.add_argument('--no-sandbox')
+    chrome_options.add_argument('--disable-dev-shm-usage')
+    chrome_options.add_argument('--disable-gpu')
+
+    # Load the proxy extension and keep it enabled in headless mode.
+    chrome_options.add_extension(proxy_auth_extension(proxy_url))
+    chrome_options.add_argument(
+        '--disable-features=DisableLoadExtensionCommandLineSwitch'
+    )
+
+    return webdriver.Chrome(options=chrome_options)
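
`proxy_auth_extension` relies on `urlsplit` to pull the scheme, host, port, and credentials out of the proxy URL before embedding them in the service worker. A small standard-library sketch of that parsing step is shown here, assuming a placeholder URL and a hypothetical helper name `proxy_parts` (neither is part of the commit):

```python
from urllib.parse import urlsplit


def proxy_parts(proxy_url: str) -> dict[str, object]:
    """Split a proxy URL into the fields the extension helper reads."""
    parts = urlsplit(proxy_url)
    return {
        'scheme': parts.scheme,
        'host': parts.hostname,
        'port': parts.port,
        # Missing credentials fall back to empty strings, as in the helper.
        'username': parts.username or '',
        'password': parts.password or '',
    }


# Placeholder URL for illustration only; not a real Apify Proxy endpoint.
example = proxy_parts('http://auto:secret@proxy.example.com:8000')
print(example)
```

The `scheme`, `host`, and `port` values correspond to the `singleProxy` rule in the helper, while `username` and `password` feed the `authCredentials` object returned from the `onAuthRequired` listener.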
