Apify SDK for Python

The official Python SDK for building Apify Actors.


apify is the official SDK for building Apify Actors in Python. Actors are serverless programs that run on the Apify platform, where you can scale them, schedule them, and monetize them. The SDK manages the Actor lifecycle, gives you access to storages (datasets, key-value stores, request queues), handles platform events, configures Apify Proxy, and supports pay-per-event monetization. It builds on the Crawlee web scraping framework and bundles the Apify API client.

If you only need to consume the Apify API from Python (running Actors, reading datasets, managing storages) rather than build Actors, use the Apify API client for Python on its own. This SDK also bundles the client, so Actor code can reach it without a separate install.


Installation

The Apify SDK for Python requires Python 3.11 or higher. It is published on PyPI as the apify package and can be installed with pip:

pip install apify

or with uv:

uv add apify

To use the Scrapy integration, install the scrapy extra:

pip install 'apify[scrapy]'
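
To confirm that an environment meets the requirements above before installing or running anything, a small standard-library check can help. This is a minimal sketch: the helper names `python_is_supported` and `installed_version` are made up for illustration, and the minimum version is simply the one stated in this section.

```python
import sys
from importlib import metadata
from typing import Optional

# Minimum interpreter version, taken from the requirement stated in this README.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == '__main__':
    print('Python supported:', python_is_supported())
    print('apify version:', installed_version('apify'))
```

If `installed_version('apify')` returns `None`, the package is not installed in the active environment.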

Quick start

An Actor is a Python program that runs inside the async with Actor: context. The context initializes the Actor when it starts and tears it down when it finishes. Here's a minimal Actor that reads its input and stores a result:

from apify import Actor


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input()
        Actor.log.info('Actor input: %s', actor_input)
        await Actor.set_value('OUTPUT', 'Hello, world!')
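
The snippet above defines `main()` but something still has to drive it on an event loop, and it may help to see what "initialize on entry, tear down on exit" means in plain Python. The sketch below uses only the standard library. `ToyActor` is a hypothetical stand-in written for this illustration, not the SDK's `Actor` class, and it only records the order of events.

```python
import asyncio


class ToyActor:
    """Hypothetical stand-in for illustration only; not the apify.Actor API."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def __aenter__(self) -> 'ToyActor':
        # Runs when the `async with` block is entered (initialization).
        self.events.append('init')
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        # Runs when the block is left, whether it finished or raised.
        self.events.append('fail' if exc_type else 'exit')
        return False  # Do not swallow exceptions.


async def main() -> list[str]:
    actor = ToyActor()
    async with actor:
        actor.events.append('work')
    return actor.events


if __name__ == '__main__':
    # asyncio.run() starts an event loop and runs the coroutine to completion.
    print(asyncio.run(main()))
```

The same `asyncio.run(main())` call is one common way to start a coroutine such as the Quick start `main()`; projects scaffolded by the CLI already include their own entry point.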

The quickest way to scaffold a full Actor project, with the .actor configuration, input schema, and Dockerfile already in place, is the Apify CLI:

  1. Install the CLI:

    npm install -g apify-cli

  2. Create a new Actor from the Python "getting started" template:

    apify create my-actor --template python-start

  3. Run it locally:

    cd my-actor
    apify run

To create, run, and deploy your first Actor step by step, see the Quick start guide.

Features

  • Actor lifecycle management — async with Actor: initializes the Actor, then handles exit, failure, status messages, and reboots (Actor lifecycle).
  • Typed Actor input — read input validated against your input schema with Actor.get_input() (Actor input).
  • Storage access — read and write datasets, key-value stores, and request queues, both locally and on the platform (Working with storages).
  • Platform events — react to system info, migration, and abort events streamed over a WebSocket (Actor events).
  • Proxy management — route requests through Apify Proxy with residential or datacenter groups, country targeting, and rotation (Proxy management).
  • Actor orchestration — start, call, abort, and metamorph other Actors and tasks, and register webhooks for run events (Interacting with other Actors, Webhooks).
  • Pay-per-event monetization — charge for the events your Actor emits (Pay-per-event).
  • Direct Apify API access — reach the full Apify API through a preconfigured ApifyClient (Accessing the Apify API).
  • Built on Crawlee — combine the SDK with Crawlee crawlers, or any HTTP or browser library you prefer (Crawlee guide).
  • Scrapy integration — run existing Scrapy spiders as Apify Actors through the apify[scrapy] extra (Scrapy guide).

Usage examples

The SDK works with whatever scraping stack you prefer. The examples below show two common setups. For more, see the Guides.

HTTPX with BeautifulSoup

Scrape pages with HTTPX and BeautifulSoup, using the Actor's request queue to track URLs:

from bs4 import BeautifulSoup
from httpx import AsyncClient

from apify import Actor


async def main() -> None:
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])

        # Open the default request queue for handling URLs to be processed.
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs.
        for start_url in start_urls:
            url = start_url.get('url')
            await request_queue.add_request(url)

        # Process the URLs from the request queue.
        while request := await request_queue.fetch_next_request():
            Actor.log.info(f'Scraping {request.url} ...')

            # Fetch the HTTP response from the specified URL using HTTPX.
            async with AsyncClient() as client:
                response = await client.get(request.url)

            # Parse the HTML content using Beautiful Soup.
            soup = BeautifulSoup(response.content, 'html.parser')

            # Extract the desired data.
            data = {
                'url': request.url,
                # Pages without a <title> tag yield None instead of raising.
                'title': soup.title.string if soup.title else None,
                'h1s': [h1.text for h1 in soup.find_all('h1')],
                'h2s': [h2.text for h2 in soup.find_all('h2')],
                'h3s': [h3.text for h3 in soup.find_all('h3')],
            }

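            # Store the extracted data to the default dataset.
            await Actor.push_data(data)

            # Mark the request as handled so it is not processed again.
            await request_queue.mark_request_as_handled(request)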
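
The record built above has a fixed shape: the URL, the page title, and lists of h1 to h3 texts. As a dependency-free illustration of that extraction step, the sketch below produces the same shape with the standard library's `html.parser` instead of BeautifulSoup. `HeadingCollector` and `extract_record` are hypothetical names introduced here, and real-world pages may need the more forgiving parsing that BeautifulSoup provides.

```python
from html.parser import HTMLParser
from typing import Optional


class HeadingCollector(HTMLParser):
    """Collect the text of <title>, <h1>, <h2>, and <h3> elements."""

    TRACKED = ('title', 'h1', 'h2', 'h3')

    def __init__(self) -> None:
        super().__init__()
        self.found: dict[str, list[str]] = {tag: [] for tag in self.TRACKED}
        self._current: Optional[str] = None
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs) -> None:
        # Start buffering text when a tracked element opens.
        if tag in self.TRACKED:
            self._current = tag
            self._buffer = []

    def handle_data(self, data) -> None:
        # Text inside nested tags (for example <em>) is collected as well.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag) -> None:
        # Store the buffered text when the tracked element closes.
        if tag == self._current:
            self.found[tag].append(''.join(self._buffer).strip())
            self._current = None


def extract_record(url: str, html: str) -> dict:
    """Build a record with the same keys as the example above."""
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    titles = parser.found['title']
    return {
        'url': url,
        'title': titles[0] if titles else None,
        'h1s': parser.found['h1'],
        'h2s': parser.found['h2'],
        'h3s': parser.found['h3'],
    }
```

A dictionary like this is the kind of value the example passes to `Actor.push_data()`.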

PlaywrightCrawler from Crawlee

Scrape pages with Crawlee's PlaywrightCrawler, which handles queueing, concurrency, and browser automation for you:

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

from apify import Actor


async def main() -> None:
    async with Actor:
        # Retrieve the Actor input, and use default values if not provided.
        actor_input = await Actor.get_input() or {}
        start_urls = [url.get('url') for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])]

        # Exit if no start URLs are provided.
        if not start_urls:
            Actor.log.info('No start URLs specified in Actor input, exiting...')
            await Actor.exit()

        # Create a crawler.
        crawler = PlaywrightCrawler(
            # Limit the crawl to max requests. Remove or increase it for crawling all links.
            max_requests_per_crawl=50,
            headless=True,
        )

        # Define a request handler, which will be called for every request.
        @crawler.router.default_handler
        async def request_handler(context: PlaywrightCrawlingContext) -> None:
            url = context.request.url
            Actor.log.info(f'Scraping {url}...')

            # Extract the desired data.
            data = {
                'url': context.request.url,
                'title': await context.page.title(),
                'h1s': [await h1.text_content() for h1 in await context.page.locator('h1').all()],
                'h2s': [await h2.text_content() for h2 in await context.page.locator('h2').all()],
                'h3s': [await h3.text_content() for h3 in await context.page.locator('h3').all()],
            }

            # Store the extracted data to the default dataset.
            await context.push_data(data)

            # Enqueue additional links found on the current page.
            await context.enqueue_links()

        # Run the crawler with the starting URLs.
        await crawler.run(start_urls)
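
Both examples read `start_urls` as a list of `{'url': ...}` objects and fall back to a default when the input is missing. Entries without a usable `url` would otherwise reach the queue or crawler as `None`, so it can be worth normalizing the list first. The helper below is a minimal standard-library sketch; `extract_start_urls` and `DEFAULT_START_URLS` are hypothetical names, and the input shape is assumed to match the examples above.

```python
from typing import Any, Optional

# Same fallback as in the examples above.
DEFAULT_START_URLS = [{'url': 'https://apify.com'}]


def extract_start_urls(actor_input: Optional[dict[str, Any]]) -> list[str]:
    """Return unique, non-empty URL strings from the Actor input."""
    entries = (actor_input or {}).get('start_urls', DEFAULT_START_URLS)
    urls: list[str] = []
    seen: set[str] = set()
    for entry in entries:
        # Skip entries that are not objects or that lack a string URL.
        url = entry.get('url') if isinstance(entry, dict) else None
        if not isinstance(url, str):
            continue
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The result can replace the list comprehension in the example; an empty list then really does mean that no usable start URLs were provided.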

What are Actors?

Actors are serverless cloud programs that can do almost anything a human can do in a web browser. They range from small tasks, such as filling in forms or unsubscribing from online services, all the way up to scraping and processing vast numbers of web pages.

They run either locally or on the Apify platform, where you can run them at scale, monitor them, schedule them, or publish and monetize them. If you're new to Apify, learn what Apify is in the platform documentation.

Documentation

The full documentation lives at docs.apify.com/sdk/python.

| Section | What you'll find |
| --- | --- |
| Overview | What the SDK is, what Actors are, and how the pieces fit together. |
| Quick start | Create, run, and deploy your first Python Actor. |
| Concepts | Actor lifecycle, input, storages, events, proxy management, interacting with other Actors, webhooks, accessing the Apify API, logging, configuration, and pay-per-event. |
| Guides | Integrations with BeautifulSoup, Parsel, Playwright, Selenium, Crawlee, Scrapy, Crawl4AI, and Browser Use, plus running a web server and using uv. |
| Upgrading | Migrating between major versions. |
| API reference | Generated reference for every class and method. |
| Changelog | Release history and breaking changes. |

Related projects

Support and community

Contributing

Bug reports, fixes, and improvements are welcome! See CONTRIBUTING.md for the development setup, coding standards, testing, and release process. The project uses uv for project management and Poe the Poet as a task runner; the typical loop is:

uv run poe install-dev   # install dev dependencies and git hooks
uv run poe check-code    # lint, type-check, and unit tests

License

Released under the Apache License 2.0.