Apify Actors Development Guide

Important: Before you begin, fill in the generatedBy property in the meta section of .actor/actor.json. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models.

What are Apify Actors?

  • Actors are serverless programs that run in the cloud. They're inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems.
  • Actors are programs packaged as Docker images that run in isolated containers

Core Concepts

  • Accept well-defined JSON input
  • Perform isolated tasks (web scraping, automation, data processing)
  • Produce structured JSON output to datasets and/or store data in key-value stores
  • Can run from seconds to hours or even indefinitely
  • Persist state and can be restarted

Do

  • accept well-defined JSON input and produce structured JSON output
  • use Apify SDK (apify) for code running ON Apify platform
  • validate input early with proper error handling and fail gracefully
  • use CheerioCrawler for static HTML content (10x faster than browsers)
  • use PlaywrightCrawler only for JavaScript-heavy sites and dynamic content
  • use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls
  • implement retry strategies with exponential backoff for failed requests
  • use proper concurrency settings (HTTP: 10-50, Browser: 1-5)
  • set sensible defaults in .actor/input_schema.json for all optional fields
  • set up output schema in .actor/output_schema.json
  • clean and validate data before pushing to dataset
  • use semantic CSS selectors and fallback strategies for missing elements
  • respect robots.txt, ToS, and implement rate limiting with delays
  • check which tools (cheerio/playwright/crawlee) are installed before applying guidance
  • use Actor.log for logging (censors sensitive data)
  • implement readiness probe handler for standby Actors
  • handle the aborting event to gracefully shut down when Actor is stopped
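
The retry guidance in the list above can be sketched with the standard library alone. This is a minimal illustration, not the SDK's or Crawlee's own retry mechanism; the helper name `retry_with_backoff` and its parameters are hypothetical, and it is meant for ad-hoc calls (for example a one-off API request) that a crawler does not already manage.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% random jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")  # keeps type checkers satisfied
```

The `sleep` parameter is injectable so the delays can be observed in tests without actually waiting.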

Don't

  • do not rely on Dataset.getInfo() for final counts on Cloud platform
  • do not use browser crawlers when HTTP/Cheerio works (massive performance gains with HTTP)
  • do not hard code values that should be in input schema or environment variables
  • do not skip input validation or error handling
  • do not overload servers - use appropriate concurrency and delays
  • do not scrape prohibited content or ignore Terms of Service
  • do not store personal/sensitive data unless explicitly permitted
  • do not use deprecated options like requestHandlerTimeoutMillis on CheerioCrawler (v3.x)
  • do not use additionalHttpHeaders - use preNavigationHooks instead
  • do not assume that local storage is persistent or automatically synced to Apify Console - when running locally with apify run, the storage/ directory is local-only and is NOT pushed to the Cloud
  • do not disable standby mode (usesStandbyMode: false) without explicit permission

Logging

  • ALWAYS use Actor.log for logging - This logger contains critical security logic including censoring sensitive data (Apify tokens, API keys, credentials) to prevent accidental exposure in logs

Available Log Levels

The Apify Actor logger provides the following methods for logging:

  • Actor.log.debug() - Debug level logs (detailed diagnostic information)
  • Actor.log.info() - Info level logs (general informational messages)
  • Actor.log.warning() - Warning level logs (warning messages for potentially problematic situations)
  • Actor.log.error() - Error level logs (error messages for failures)
  • Actor.log.exception() - Exception level logs (for exceptions with stack traces)

Best practices:

  • Use Actor.log.debug() for detailed operation-level diagnostics (inside functions)
  • Use Actor.log.info() for general informational messages (API requests, successful operations)
  • Use Actor.log.warning() for potentially problematic situations (validation failures, unexpected states)
  • Use Actor.log.error() for actual errors and failures
  • Use Actor.log.exception() for caught exceptions with stack traces
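
As a rough illustration of the level choices above, the sketch below takes a logger as a parameter so it can run outside an Actor. It assumes that `Actor.log` behaves like a standard `logging.Logger`; the function `process_record` and the record fields are hypothetical.

```python
import logging
from typing import Optional


def process_record(record: dict, log: logging.Logger) -> Optional[dict]:
    """Validate one scraped record, logging at the level suggested above."""
    log.debug("Processing record: %r", record)  # detailed, operation-level
    if "url" not in record:
        log.warning("Skipping record without a url: %r", record)  # unexpected state
        return None
    try:
        price = float(record.get("price", 0))
    except (TypeError, ValueError):
        # exception() logs at error level and appends the stack trace
        log.exception("Could not parse price for %s", record["url"])
        return None
    log.info("Record OK: %s", record["url"])  # successful operation
    return {"url": record["url"], "price": price}
```

Inside an Actor, the call would look like `process_record(item, Actor.log)`.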

Graceful Abort Handling

Handle the aborting event to terminate the Actor quickly when it is stopped by the user or the platform. This minimizes costs, especially under pay-per-usage (PPU) and pay-per-event plus usage (PPE+U) billing.

import asyncio

from apify import Actor

async def on_aborting() -> None:
    # Persist state and run any cleanup here, then call `await Actor.exit()` explicitly as soon as possible.
    # Exiting promptly is a best effort to honor any per-run cost limit set by the user.
    # Wait 1 second so that Crawlee/SDK state persistence operations can complete.
    # This is a temporary workaround until the SDK implements proper state persistence in the aborting event.
    await asyncio.sleep(1)
    await Actor.exit()

# Register the handler for the aborting event
Actor.on('aborting', on_aborting)

Standby Mode

  • NEVER disable standby mode (usesStandbyMode: false) in .actor/actor.json without explicit permission - Standby mode keeps the Actor ready in the background, waiting for incoming HTTP requests. The Actor then behaves like a real-time web server or standard API server instead of running its logic once to process everything in a batch. Always keep usesStandbyMode: true unless there is a specific, documented reason to disable it
  • ALWAYS implement readiness probe handler for standby Actors - Handle the x-apify-container-server-readiness-probe header at GET / endpoint to ensure proper Actor lifecycle management

You can recognize a standby Actor by checking the usesStandbyMode property in .actor/actor.json. Only implement the readiness probe if this property is set to true.

Readiness Probe Implementation Example

# Apify standby readiness probe
from http.server import SimpleHTTPRequestHandler

class GetHandler(SimpleHTTPRequestHandler):
    def do_GET(self):
        # Handle Apify standby readiness probe
        if 'x-apify-container-server-readiness-probe' in self.headers:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'Readiness probe OK')
            return

        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'Actor is ready')

Key points:

  • Detect the x-apify-container-server-readiness-probe header in incoming requests
  • Respond with HTTP 200 status code for both readiness probe and normal requests
  • This enables proper Actor lifecycle management in standby mode
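
One way to keep the probe logic easy to test is to separate the decision from the HTTP plumbing. The sketch below is an assumption-laden variant of the example above, not the SDK's own server: `build_response`, `StandbyHandler`, and `serve` are hypothetical names, and the port is left as a parameter because this guide does not say which port a standby Actor listens on.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Mapping, Tuple

PROBE_HEADER = "x-apify-container-server-readiness-probe"


def build_response(headers: Mapping[str, str]) -> Tuple[int, bytes]:
    """Decide the status and body for a GET / request."""
    # HTTP header names are case-insensitive, so compare in lower case.
    names = {name.lower() for name in headers.keys()}
    if PROBE_HEADER in names:
        return 200, b"Readiness probe OK"
    return 200, b"Actor is ready"


class StandbyHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = build_response(self.headers)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int) -> None:
    """Serve requests until the process exits (blocks the calling thread)."""
    HTTPServer(("", port), StandbyHandler).serve_forever()
```

Because `build_response` is a plain function, both branches can be checked without opening a socket.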

Commands

# Bootstrap & local development
apify create [name]                    # Create new Actor project from a template
apify init                             # Initialize Actor in current directory
apify run                              # Run Actor locally with simulated platform env
apify run --purge                      # Run after clearing previous local storage
apify validate-schema                  # Validate .actor/input_schema.json

# Authentication & account
apify login                            # Authenticate account (token stored in ~/.apify)
apify logout                           # Remove stored credentials
apify info                             # Print currently authenticated account info

# Deployment & remote execution
apify push                             # Deploy Actor to platform per .actor/actor.json
apify pull <actor>                     # Download Actor code from the platform
apify call <actor>                     # Execute Actor remotely on the platform
apify actors build <actor>             # Create a new build of an Actor
apify runs ls                          # List recent runs

# Discovery (search Apify Store for community Actors)
apify actors search "<query>" --user-agent <your-agent-name>
apify actors info <actor>              # Get details about a specific Actor

# Secrets (referenced from actor.json via "@mySecret")
apify secrets add <name> <value>       # Store a secret locally; uploaded on push
apify secrets ls                       # List stored secret keys

# Direct API access
apify api <endpoint>                   # Send an authenticated HTTP request to Apify API

# Help
apify help                             # List all commands
apify <command> --help                 # Get help for a specific command

Note: If no dedicated Actor exists for your target, search Apify Store for community options with apify actors search "<query>" --user-agent <your-agent-name> before building from scratch.

Tip: Inside a running Actor, prefer the SDK (Actor.get_input(), Actor.push_data(), Actor.set_value()) over the equivalent apify actor runtime subcommands.

Apify Platform Environment

When the Actor runs on the Apify platform, the API token is automatically available via the APIFY_TOKEN environment variable (note: the variable is APIFY_TOKEN, not APIFY_API_TOKEN). The Apify SDK reads it automatically, so you do not need to pass it explicitly. Locally, run apify login once and the SDK will use your stored credentials.
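
For code that talks to the Apify API without going through the SDK, the token can be read from the environment directly. This is a small sketch only: the helper names are hypothetical, and the Bearer scheme in `auth_headers` is an assumption about how the caller chooses to authenticate.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_apify_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the token from APIFY_TOKEN, or None when it is not set."""
    token = env.get("APIFY_TOKEN", "").strip()
    return token or None


def auth_headers(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build request headers, failing early if no token is available."""
    token = resolve_apify_token(env)
    if token is None:
        raise RuntimeError("APIFY_TOKEN is not set; run `apify login` or run on the platform")
    return {"Authorization": f"Bearer {token}"}
```

Note that a variable named APIFY_API_TOKEN is deliberately ignored, matching the naming warning above.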

Safety and Permissions

Allowed without prompt:

  • read files with Actor.get_value()
  • push data with Actor.push_data()
  • set values with Actor.set_value()
  • enqueue requests to RequestQueue
  • run locally with apify run

Ask first:

  • npm/pip package installations
  • apify push (deployment to cloud)
  • proxy configuration changes (requires paid plan)
  • Dockerfile changes affecting builds
  • deleting datasets or key-value stores

Project Structure

.actor/
├── actor.json           # Actor config: name, version, env vars, runtime settings
├── input_schema.json    # Input validation & Console form definition
└── output_schema.json   # Specifies where an Actor stores its output
src/
└── main.js              # Actor entry point and orchestrator
storage/                 # Local-only storage for development (NOT synced to Cloud)
├── datasets/            # Output items (JSON objects)
├── key_value_stores/    # Files, config, INPUT
└── request_queues/      # Pending crawl requests
Dockerfile               # Container image definition
AGENTS.md                # AI agent instructions (this file)

Local vs Cloud Storage

When running locally with apify run, the Apify SDK emulates Cloud storage APIs using the local storage/ directory. This local storage behaves differently from Cloud storage:

  • Local storage is NOT persistent - The storage/ directory is meant for local development and testing only. Data stored there (datasets, key-value stores, request queues) exists only on your local disk.
  • Local storage is NOT automatically pushed to Apify Console - Running apify run does not upload any storage data to the Apify platform. The data stays local.
  • Each local run may overwrite previous data - The local storage/ directory is reused between runs, but this is local-only behavior, not Cloud persistence.
  • Cloud storage only works when running on Apify platform - After deploying with apify push and running the Actor in the Cloud, storage calls (Actor.push_data(), Actor.set_value(), etc.) interact with real Apify Cloud storage, which is then visible in the Apify Console.
  • To verify Actor output, deploy and run in Cloud - Do not rely on local storage/ contents as proof that data will appear in the Apify Console. Always test by deploying (apify push) and running the Actor on the platform.
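
A run can make the local/Cloud distinction explicit in its own log output. The sketch below assumes that Cloud runs have the APIFY_IS_AT_HOME environment variable set to "1" and that local runs do not; treat that variable and both helper names as assumptions to verify against the platform documentation.

```python
import os
from typing import Mapping


def running_on_apify(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: the platform sets APIFY_IS_AT_HOME to "1" inside Cloud runs.
    return env.get("APIFY_IS_AT_HOME") == "1"


def describe_storage(env: Mapping[str, str] = os.environ) -> str:
    """Return a one-line reminder of where storage calls end up."""
    if running_on_apify(env):
        return "Cloud storage: results are visible in Apify Console"
    return "Local storage/ directory: results stay on this machine"
```

Logging `describe_storage()` at startup is one way to avoid mistaking local output for Cloud output.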

Actor Input Schema

The input schema defines the input parameters for an Actor. It's a JSON object comprising various field types supported by the Apify platform.

Structure

{
    "title": "<INPUT-SCHEMA-TITLE>",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        /* define input fields here */
    },
    "required": []
}

Example

{
    "title": "E-commerce Product Scraper Input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start scraping from (category pages or product pages)",
            "editor": "requestListSources",
            "default": [{ "url": "https://example.com/category" }],
            "prefill": [{ "url": "https://example.com/category" }]
        },
        "followVariants": {
            "title": "Follow Product Variants",
            "type": "boolean",
            "description": "Whether to scrape product variants (different colors, sizes)",
            "default": true
        },
        "maxRequestsPerCrawl": {
            "title": "Max Requests per Crawl",
            "type": "integer",
            "description": "Maximum number of pages to scrape (0 = unlimited)",
            "default": 1000,
            "minimum": 0
        },
        "proxyConfiguration": {
            "title": "Proxy Configuration",
            "type": "object",
            "description": "Proxy settings for anti-bot protection",
            "editor": "proxy",
            "default": { "useApifyProxy": false }
        },
        "locale": {
            "title": "Locale",
            "type": "string",
            "description": "Language/country code for localized content",
            "default": "cs",
            "enum": ["cs", "en", "de", "sk"],
            "enumTitles": ["Czech", "English", "German", "Slovak"]
        }
    },
    "required": ["startUrls"]
}
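
The schema validates input on the platform, but validating again in code protects local runs and direct API calls. The following is a minimal sketch that mirrors the defaults and constraints of the example schema above; `validate_input` is a hypothetical helper, and the URL check is a simplification.

```python
import copy
from typing import Any, Dict, Optional

# Defaults copied from the example schema above.
DEFAULTS: Dict[str, Any] = {
    "followVariants": True,
    "maxRequestsPerCrawl": 1000,
    "proxyConfiguration": {"useApifyProxy": False},
    "locale": "cs",
}
ALLOWED_LOCALES = ("cs", "en", "de", "sk")


def validate_input(raw: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge defaults into the raw input and raise ValueError on invalid values."""
    actor_input = {**copy.deepcopy(DEFAULTS), **(raw or {})}

    start_urls = actor_input.get("startUrls")
    if not isinstance(start_urls, list) or not start_urls:
        raise ValueError("startUrls is required and must be a non-empty list")
    for source in start_urls:
        url = source.get("url", "") if isinstance(source, dict) else ""
        if not str(url).startswith(("http://", "https://")):
            raise ValueError(f"Invalid start URL entry: {source!r}")

    max_requests = actor_input["maxRequestsPerCrawl"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_requests, bool) or not isinstance(max_requests, int) or max_requests < 0:
        raise ValueError("maxRequestsPerCrawl must be an integer >= 0")

    if actor_input["locale"] not in ALLOWED_LOCALES:
        raise ValueError(f"locale must be one of {ALLOWED_LOCALES}")

    return actor_input
```

In an Actor this would typically wrap the result of `Actor.get_input()`, with a failed validation logged and the run ended early.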

Actor Output Schema

The Actor output schema builds upon the schemas for the dataset and key-value store. It specifies where an Actor stores its output and defines templates for accessing that output. Apify Console uses these output definitions to display run results.

Structure

{
    "actorOutputSchemaVersion": 1,
    "title": "<OUTPUT-SCHEMA-TITLE>",
    "properties": {
        /* define your outputs here */
    }
}

Example

{
    "actorOutputSchemaVersion": 1,
    "title": "Output schema of the files scraper",
    "properties": {
        "files": {
            "type": "string",
            "title": "Files",
            "template": "{{links.apiDefaultKeyValueStoreUrl}}/keys"
        },
        "dataset": {
            "type": "string",
            "title": "Dataset",
            "template": "{{links.apiDefaultDatasetUrl}}/items"
        }
    }
}

Output Schema Template Variables

  • links (object) - Contains quick links to most commonly used URLs
  • links.publicRunUrl (string) - Public run URL in format https://console.apify.com/view/runs/:runId
  • links.consoleRunUrl (string) - Console run URL in format https://console.apify.com/actors/runs/:runId
  • links.apiRunUrl (string) - API run URL in format https://api.apify.com/v2/actor-runs/:runId
  • links.apiDefaultDatasetUrl (string) - API URL of the default dataset in format https://api.apify.com/v2/datasets/:defaultDatasetId
  • links.apiDefaultKeyValueStoreUrl (string) - API URL of the default key-value store in format https://api.apify.com/v2/key-value-stores/:defaultKeyValueStoreId
  • links.containerRunUrl (string) - URL of a web server running inside the run in format https://<containerId>.runs.apify.net/
  • run (object) - Contains the same information about the run as the GET Run API endpoint returns
  • run.defaultDatasetId (string) - ID of the default dataset
  • run.defaultKeyValueStoreId (string) - ID of the default key-value store
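
To see how a template such as `{{links.apiDefaultDatasetUrl}}/items` resolves, the substitution can be imitated locally. The platform performs the real rendering; this sketch only illustrates the idea, and `render_template`, `DATASET_ID`, and `STORE_ID` are placeholders rather than real names or IDs.

```python
import re
from typing import Any, Mapping

# Matches {{ dotted.path }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z0-9_.]+)\s*\}\}")


def render_template(template: str, context: Mapping[str, Any]) -> str:
    """Replace each {{a.b}} placeholder with context["a"]["b"]."""

    def lookup(match: "re.Match[str]") -> str:
        value: Any = context
        for part in match.group(1).split("."):
            value = value[part]  # KeyError signals an unknown variable
        return str(value)

    return PLACEHOLDER.sub(lookup, template)


context = {
    "links": {
        "apiDefaultDatasetUrl": "https://api.apify.com/v2/datasets/DATASET_ID",
        "apiDefaultKeyValueStoreUrl": "https://api.apify.com/v2/key-value-stores/STORE_ID",
    }
}
print(render_template("{{links.apiDefaultDatasetUrl}}/items", context))
# prints https://api.apify.com/v2/datasets/DATASET_ID/items
```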

Dataset Schema Specification

The dataset schema defines how your Actor's output data is structured, transformed, and displayed in the Output tab in the Apify Console.

Example

Consider an example Actor that calls Actor.push_data() to store data in the dataset:

# Dataset push example (Python)
import asyncio
from datetime import datetime
from apify import Actor

async def main():
    await Actor.init()

    # Actor code
    await Actor.push_data({
        'numericField': 10,
        'pictureUrl': 'https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png',
        'linkUrl': 'https://google.com',
        'textField': 'Google',
        'booleanField': True,
        'dateField': datetime.now().isoformat(),
        'arrayField': ['#hello', '#world'],
        'objectField': {},
    })

    # Exit successfully
    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())

To set up the Actor's output tab UI, reference a dataset schema file in .actor/actor.json:

{
    "actorSpecification": 1,
    "name": "book-library-scraper",
    "title": "Book Library Scraper",
    "version": "1.0.0",
    "storages": {
        "dataset": "./dataset_schema.json"
    }
}

Then create the dataset schema in .actor/dataset_schema.json:

{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "overview": {
            "title": "Overview",
            "transformation": {
                "fields": [
                    "pictureUrl",
                    "linkUrl",
                    "textField",
                    "booleanField",
                    "arrayField",
                    "objectField",
                    "dateField",
                    "numericField"
                ]
            },
            "display": {
                "component": "table",
                "properties": {
                    "pictureUrl": {
                        "label": "Image",
                        "format": "image"
                    },
                    "linkUrl": {
                        "label": "Link",
                        "format": "link"
                    },
                    "textField": {
                        "label": "Text",
                        "format": "text"
                    },
                    "booleanField": {
                        "label": "Boolean",
                        "format": "boolean"
                    },
                    "arrayField": {
                        "label": "Array",
                        "format": "array"
                    },
                    "objectField": {
                        "label": "Object",
                        "format": "object"
                    },
                    "dateField": {
                        "label": "Date",
                        "format": "date"
                    },
                    "numericField": {
                        "label": "Number",
                        "format": "number"
                    }
                }
            }
        }
    }
}

Structure

{
    "actorSpecification": 1,
    "fields": {},
    "views": {
        "<VIEW_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "transformation": {
                "fields": ["string (required)"],
                "unwind": ["string (optional)"],
                "flatten": ["string (optional)"],
                "omit": ["string (optional)"],
                "limit": "integer (optional)",
                "desc": "boolean (optional)"
            },
            "display": {
                "component": "table (required)",
                "properties": {
                    "<FIELD_NAME>": {
                        "label": "string (optional)",
                        "format": "text|number|date|link|boolean|image|array|object (optional)"
                    }
                }
            }
        }
    }
}

Dataset Schema Properties:

  • actorSpecification (integer, required) - Specifies the version of dataset schema structure document (currently only version 1)
  • fields (JSON Schema object, required) - Schema of one dataset object (use JSON Schema Draft 2020-12 or compatible)
  • views (DatasetView object, required) - Object with API and UI views description

DatasetView Properties:

  • title (string, required) - Visible in UI Output tab and API
  • description (string, optional) - Only available in API response
  • transformation (ViewTransformation object, required) - Data transformation applied when loading from Dataset API
  • display (ViewDisplay object, required) - Output tab UI visualization definition

ViewTransformation Properties:

  • fields (string[], required) - Fields to present in output (order matches column order)
  • unwind (string[], optional) - Deconstructs nested children into parent object
  • flatten (string[], optional) - Transforms nested object into flat structure
  • omit (string[], optional) - Removes specified fields from output
  • limit (integer, optional) - Maximum number of results (default: all)
  • desc (boolean, optional) - Sort order (true = newest first)
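
The transformation options are easier to reason about with a local imitation. The sketch below covers `fields`, `flatten`, `omit`, `limit`, and `desc` only. The order of operations, the dotted-key naming, and the handling of missing fields are assumptions, not the Dataset API's documented behavior, and `apply_view` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List, Optional, Sequence


def flatten_field(item: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Replace a nested dict under `name` with dotted keys (one level deep)."""
    value = item.get(name)
    if not isinstance(value, dict):
        return dict(item)
    flat = {key: val for key, val in item.items() if key != name}
    for child_key, child_value in value.items():
        flat[f"{name}.{child_key}"] = child_value
    return flat


def apply_view(
    items: Iterable[Dict[str, Any]],
    fields: Sequence[str],
    flatten: Sequence[str] = (),
    omit: Sequence[str] = (),
    limit: Optional[int] = None,
    desc: bool = False,
) -> List[Dict[str, Any]]:
    # Items are assumed to be stored oldest first, so desc reverses them.
    ordered = list(items)
    if desc:
        ordered.reverse()
    if limit is not None:
        ordered = ordered[:limit]
    rows = []
    for item in ordered:
        for name in flatten:
            item = flatten_field(item, name)
        # Column order follows `fields`; omitted names are dropped.
        rows.append({name: item.get(name) for name in fields if name not in omit})
    return rows
```

For example, with `flatten=["meta"]` an item `{"meta": {"lang": "cs"}}` exposes the column `meta.lang`, which can then be listed in `fields`.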

ViewDisplay Properties:

  • component (string, required) - Only table is available
  • properties (Object, optional) - Keys matching transformation.fields with ViewDisplayProperty values

ViewDisplayProperty Properties:

  • label (string, optional) - Table column header
  • format (string, optional) - One of: text, number, date, link, boolean, image, array, object

Key-Value Store Schema Specification

The key-value store schema organizes keys into logical groups called collections for easier data management.

Example

Consider an example Actor that calls Actor.set_value() to save records into the key-value store:

# Key-Value Store set example (Python)
import asyncio
from apify import Actor

async def main():
    await Actor.init()

    # Actor code
    await Actor.set_value('document-1', 'my text data', content_type='text/plain')

    image_id = '123'          # example placeholder
    image_buffer = b'...'     # bytes buffer with image data
    await Actor.set_value(f'image-{image_id}', image_buffer, content_type='image/jpeg')

    # Exit successfully
    await Actor.exit()

if __name__ == '__main__':
    asyncio.run(main())

To configure the key-value store schema, reference a schema file in .actor/actor.json:

{
    "actorSpecification": 1,
    "name": "data-collector",
    "title": "Data Collector",
    "version": "1.0.0",
    "storages": {
        "keyValueStore": "./key_value_store_schema.json"
    }
}

Then create the key-value store schema in .actor/key_value_store_schema.json:

{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "Key-Value Store Schema",
    "collections": {
        "documents": {
            "title": "Documents",
            "description": "Text documents stored by the Actor",
            "keyPrefix": "document-"
        },
        "images": {
            "title": "Images",
            "description": "Images stored by the Actor",
            "keyPrefix": "image-",
            "contentTypes": ["image/jpeg"]
        }
    }
}

Structure

{
    "actorKeyValueStoreSchemaVersion": 1,
    "title": "string (required)",
    "description": "string (optional)",
    "collections": {
        "<COLLECTION_NAME>": {
            "title": "string (required)",
            "description": "string (optional)",
            "key": "string (conditional - use key OR keyPrefix)",
            "keyPrefix": "string (conditional - use key OR keyPrefix)",
            "contentTypes": ["string (optional)"],
            "jsonSchema": "object (optional)"
        }
    }
}

Key-Value Store Schema Properties:

  • actorKeyValueStoreSchemaVersion (integer, required) - Version of key-value store schema structure document (currently only version 1)
  • title (string, required) - Title of the schema
  • description (string, optional) - Description of the schema
  • collections (Object, required) - Object where each key is a collection ID and value is a Collection object

Collection Properties:

  • title (string, required) - Collection title shown in UI tabs
  • description (string, optional) - Description appearing in UI tooltips
  • key (string, conditional*) - Single specific key for this collection
  • keyPrefix (string, conditional*) - Prefix for keys included in this collection
  • contentTypes (string[], optional) - Allowed content types for validation
  • jsonSchema (object, optional) - JSON Schema Draft 07 format for application/json content type validation

*Either key or keyPrefix must be specified for each collection, but not both.
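
The key/keyPrefix rule and the way keys fall into collections can be checked locally before deploying. This is a minimal sketch with hypothetical helper names; it mirrors the rule stated above and is not the platform's own validator.

```python
from typing import Any, Dict, List

Collections = Dict[str, Dict[str, Any]]


def validate_collections(collections: Collections) -> None:
    """Raise ValueError unless each collection sets exactly one of key / keyPrefix."""
    for name, spec in collections.items():
        if ("key" in spec) == ("keyPrefix" in spec):
            raise ValueError(
                f"Collection {name!r} must define exactly one of 'key' or 'keyPrefix'"
            )


def collections_for_key(key: str, collections: Collections) -> List[str]:
    """Return the IDs of all collections that a record key belongs to."""
    matches = []
    for name, spec in collections.items():
        exact = spec.get("key") == key
        prefixed = "keyPrefix" in spec and key.startswith(spec["keyPrefix"])
        if exact or prefixed:
            matches.append(name)
    return matches
```

With the example schema above, `document-1` lands in `documents`, `image-123` lands in `images`, and a key such as `INPUT` belongs to no collection.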

Actor README

Always generate a README.md file as part of Actor development. The README is the Actor's public landing page on Apify Store - it serves as SEO, first impression, documentation, and support page combined.

Required: Generate README automatically

When building an Actor, always create a README.md in the project root. Do not wait for the user to ask for it. The README is a critical part of a complete Actor.

README structure

Write in Markdown. Use H2 (##) for main sections (these become the table of contents) and H3 (###) for subsections. Do not use H1 - the Actor name is automatically the H1. Aim for at least 300 words.

Include these sections in order:

  1. What does [Actor name] do? - 2-3 sentences explaining what it does, what data it extracts, and how to try it. Link to the target website. Mention Apify platform advantages (API access, scheduling, integrations, proxy rotation, monitoring).
  2. Why use [Actor name]? - Business use cases and benefits.
  3. How to use [Actor name] - Numbered step-by-step tutorial. Keep it simple and reassuring.
  4. Input - Describe input fields. Reference the Input tab. Optionally include a screenshot or JSON example of the input schema.
  5. Output - Show a simplified JSON output example. Mention "You can download the dataset in various formats such as JSON, HTML, CSV, or Excel."
  6. Data table - If the Actor extracts data, include a table of the main data fields it outputs.
  7. Pricing / Cost estimation - Set expectations on cost. Mention free tier limits if applicable. Frame as "How much does it cost to scrape [target site]?"
  8. Tips or Advanced options - How to optimize runs, limit compute units, improve speed or accuracy.
  9. FAQ, disclaimers, and support - Legality disclaimer for scrapers, known limitations, link to Issues tab for feedback, mention custom solution availability.

README best practices

  • Write SEO-friendly headings with relevant keywords (e.g., "How to scrape [site] data" not just "Tutorial")
  • Bold the most important words in the intro
  • The first 25% of the README matters most - front-load the value proposition
  • Match the tone to the target audience: simple language for no-code users, technical details for developers
  • Include a JSON output example showing 1-2 representative items
  • Reference these top Actors for README best practices: https://apify.com/apify/instagram-scraper and https://apify.com/compass/crawler-google-places
  • Embed YouTube video URLs on their own line (Apify Console auto-renders them)
  • Use HTML for image sizing if needed; CSS is not supported

MCP Tools

Apify MCP

If the Apify MCP server is configured, use these tools for documentation:

  • search-apify-docs - Search documentation
  • fetch-apify-docs - Get full doc pages

Otherwise, reference: @https://mcp.apify.com/

Playwright MCP (debugging)

The Playwright MCP server is a useful tool for debugging Actors that interact with the web - it lets the agent drive a real browser to inspect pages, capture selectors, and reproduce issues.

Install with the Claude Code CLI:

claude mcp add playwright npx @playwright/mcp@latest

Or add it manually to your MCP config:

{
    "mcpServers": {
        "playwright": {
            "command": "npx",
            "args": ["@playwright/mcp@latest"]
        }
    }
}

Resources