id selenium
title Browser automation with Selenium
description Build an Apify Actor that scrapes dynamic web pages using Selenium WebDriver.

import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import SeleniumExample from '!!raw-loader!roa-loader!./code/04_selenium.py';

In this guide, you'll learn how to use Selenium for browser automation and web scraping in your Apify Actors.

Introduction

Selenium is a tool for web automation and testing that can also be used for web scraping. It allows you to control a web browser programmatically and interact with web pages just as a human would.

Some of the key features of Selenium for web scraping include:

  • Broad ecosystem - Selenium has a large community and extensive documentation, with support for multiple programming languages beyond Python.
  • WebDriver protocol - Selenium uses the W3C WebDriver protocol, providing standardized browser automation that works with Chrome, Firefox, Edge, and Safari.
  • Headless and headful modes - Selenium can run with or without a visible browser window, making it suitable for both local development and containerized environments.
  • Flexible element selection - Selenium provides CSS selectors, XPath, ID, class name, and other strategies for locating elements on a page.
  • User interaction emulation - Selenium allows you to emulate user actions like clicking, scrolling, filling out forms, and typing, which is useful for scraping dynamic websites.
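
As a small illustration of the features listed above, the following sketch starts Chrome in headless mode, locates elements with CSS selector and tag name strategies, and reads the page title, headings, and links. It assumes the `selenium` package (version 4 or later) and a local Chrome installation. The `chrome_arguments` and `scrape_headings` helpers are hypothetical names used only for this illustration and are not part of the Actor template.

```python
from __future__ import annotations


def chrome_arguments(headless: bool = True) -> list[str]:
    """Return Chrome command-line arguments (hypothetical helper)."""
    # These two flags are commonly used when Chrome runs inside a container.
    args = ['--no-sandbox', '--disable-dev-shm-usage']
    if headless:
        args.append('--headless=new')
    return args


def scrape_headings(url: str, headless: bool = True) -> dict:
    """Open a page in Chrome and collect its title, headings, and links."""
    # Imported here so chrome_arguments() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        headings = driver.find_elements(By.CSS_SELECTOR, 'h1, h2, h3')
        links = driver.find_elements(By.TAG_NAME, 'a')
        return {
            'title': driver.title,
            'headings': [heading.text for heading in headings],
            'links': [link.get_attribute('href') for link in links],
        }
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()
```

Calling `scrape_headings('https://example.com')` returns a dictionary with the collected values. Passing `headless=False` opens a visible browser window instead, which can help when debugging selectors locally.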

To create Actors that use Selenium, start from the Selenium & Python Actor template.

On the Apify platform, the Actor will already have Selenium and the necessary browsers preinstalled in its Docker image, including the tools and setup necessary to run browsers in headful mode.

When running the Actor locally, you'll need to install the Selenium browser drivers yourself. Refer to the Selenium documentation for installation instructions.

Example Actor

This is a simple Actor that recursively scrapes data from linked pages on the same site, up to a maximum depth, starting from URLs in the Actor input.

It uses Selenium ChromeDriver to open the pages in an automated Chrome browser, and to extract the title, headings, and links after the pages load.

{SeleniumExample}
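
The depth-limited, same-site crawl described above can also be sketched independently of Selenium and the Apify SDK. This is only a simplified illustration of the idea, not the code used by the example Actor: `crawl` and its `fetch_links` callback are hypothetical names, and in a real Actor the callback would be backed by the browser.

```python
from __future__ import annotations

from collections import deque
from typing import Callable, Iterable
from urllib.parse import urljoin, urlparse


def crawl(
    start_urls: Iterable[str],
    max_depth: int,
    fetch_links: Callable[[str], Iterable[str]],
) -> list[tuple[str, int]]:
    """Visit pages breadth-first and return (url, depth) pairs in visit order.

    fetch_links is a hypothetical callback that returns the hrefs found on a page.
    """
    queue = deque((url, 0) for url in start_urls)
    seen = {url for url, _ in queue}
    visited: list[tuple[str, int]] = []

    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # Do not follow links from pages at the maximum depth.
        for href in fetch_links(url):
            link = urljoin(url, href)  # Resolve relative links against the page URL.
            same_site = urlparse(link).netloc == urlparse(url).netloc
            if same_site and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Tracking the depth alongside each URL keeps the stopping rule local to the loop, and the `seen` set prevents the same page from being queued twice when pages link back to each other.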

Using Apify Proxy

Running on the Apify platform gives your scraper access to Apify Proxy, which rotates IP addresses to avoid rate limiting and blocking. The example creates a proxy configuration with Actor.create_proxy_configuration and routes the browser through it for the whole run.
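
A minimal sketch of obtaining a proxy URL and separating the server address from the credentials might look like the following. It assumes the Apify SDK for Python is installed and that the Actor has access to Apify Proxy. `split_proxy_url` and `get_proxy_parts` are hypothetical helpers written for this illustration, and the example Actor may structure this differently.

```python
from __future__ import annotations

from urllib.parse import urlparse


def split_proxy_url(proxy_url: str) -> tuple[str, str | None, str | None]:
    """Split a proxy URL into (server, username, password). Hypothetical helper."""
    parsed = urlparse(proxy_url)
    # The server part carries no credentials, so it is safe to pass to the browser.
    server = f'{parsed.scheme}://{parsed.hostname}:{parsed.port}'
    return server, parsed.username, parsed.password


async def get_proxy_parts() -> tuple[str, str | None, str | None] | None:
    """Create a proxy configuration and return the parts of one proxy URL."""
    # Imported here so split_proxy_url() can be tried without the Apify SDK installed.
    from apify import Actor

    async with Actor:
        proxy_configuration = await Actor.create_proxy_configuration()
        if proxy_configuration is None:
            return None  # No proxy is configured for this run.
        proxy_url = await proxy_configuration.new_url()
        if proxy_url is None:
            return None
        return split_proxy_url(proxy_url)
```

Keeping the credentials separate from the server address makes it possible to give the browser only the address and to supply the username and password through another channel.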

Chrome ignores credentials passed in the --proxy-server flag, so an authenticated proxy such as Apify Proxy has to be configured from inside a small extension. The proxy_auth_extension helper builds one at runtime: its service worker sets the proxy server and answers the browser's authentication challenge with the username and password. Note that Chrome requires the new headless mode (--headless=new) to load the extension.

To select specific proxy groups or a country, pass the relevant arguments to Actor.create_proxy_configuration. For details, see Proxy management.
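
As a rough illustration of the extension approach, the sketch below writes a Manifest V3 extension to a temporary directory. This is not the implementation of the `proxy_auth_extension` helper from the example. `build_proxy_extension` is a hypothetical name, and the manifest and service worker contents are assumptions based on Chrome's `chrome.proxy` and `chrome.webRequest.onAuthRequired` extension APIs.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path

# Manifest for an unpacked extension whose only job is proxy setup and authentication.
MANIFEST = {
    'manifest_version': 3,
    'name': 'Proxy auth (sketch)',
    'version': '1.0',
    'permissions': ['proxy', 'webRequest', 'webRequestAuthProvider'],
    'host_permissions': ['<all_urls>'],
    'background': {'service_worker': 'background.js'},
}

WORKER_TEMPLATE = """
const proxy = %(proxy)s;
const credentials = %(credentials)s;

// Route all traffic through the fixed proxy server.
chrome.proxy.settings.set({
  value: {mode: 'fixed_servers', rules: {singleProxy: proxy}},
  scope: 'regular',
});

// Answer the proxy's authentication challenge with the stored credentials.
chrome.webRequest.onAuthRequired.addListener(
  (details, callback) => callback({authCredentials: credentials}),
  {urls: ['<all_urls>']},
  ['asyncBlocking'],
);
"""


def build_proxy_extension(
    host: str, port: int, username: str, password: str, scheme: str = 'http'
) -> Path:
    """Write an unpacked extension to a temporary directory and return its path."""
    extension_dir = Path(tempfile.mkdtemp(prefix='proxy-auth-'))
    # json.dumps produces valid JavaScript literals and escapes special characters.
    worker = WORKER_TEMPLATE % {
        'proxy': json.dumps({'scheme': scheme, 'host': host, 'port': port}),
        'credentials': json.dumps({'username': username, 'password': password}),
    }
    (extension_dir / 'manifest.json').write_text(json.dumps(MANIFEST, indent=2), encoding='utf-8')
    (extension_dir / 'background.js').write_text(worker, encoding='utf-8')
    return extension_dir
```

The returned directory would then be handed to Chrome as an unpacked extension, for example through the `--load-extension` argument where the installed Chrome build supports it. Because the credentials are written to disk in plain text, the directory is best treated as a secret and removed when the run ends.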

Conclusion

In this guide, you learned how to use Selenium for web scraping in Apify Actors. You can now create your own Actors that use Selenium to scrape dynamic websites and interact with web pages just as a human would. See the Actor templates to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our GitHub or join our Discord community. Happy scraping!

Additional resources