|
1 | 1 | """Apify platform tools for Strands Agents. |
2 | 2 |
|
3 | | -This module provides web scraping, data extraction, and automation capabilities |
4 | | -using the Apify platform. It lets you run any Actor, task, fetch dataset |
5 | | -results, and scrape individual URLs. |
| 3 | +Apify is a large marketplace of tools for web scraping, data extraction, |
| 4 | +and web automation. These tools are called Actors — serverless cloud applications that |
| 5 | +take JSON input and store results in a dataset (structured, tabular output) or key-value |
| 6 | +store (files and unstructured data). Actors exist for social media, e-commerce, search |
| 7 | +engines, maps, travel sites, and many other sources. |
6 | 8 |
|
7 | 9 | Available Tools: |
8 | 10 | --------------- |
|
16 | 18 | Setup Requirements: |
17 | 19 | ------------------ |
18 | 20 | 1. Create an Apify account at https://apify.com |
19 | | -2. Obtain your API token: Apify Console > Settings > API & Integrations > Personal API tokens |
| 21 | +2. Get your API token: Apify Console > Settings > API & Integrations > Personal API tokens |
20 | 22 | 3. Install the optional dependency: pip install strands-agents-tools[apify] |
21 | 23 | 4. Set the environment variable: |
22 | 24 | APIFY_API_TOKEN=your_api_token_here |
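The setup steps above boil down to making the token visible to the process. A minimal sketch of a pre-flight check (the variable names here are illustrative, not part of the module):

```python
import os

# Minimal sketch: confirm the APIFY_API_TOKEN from step 4 is visible to
# the process before any Apify tool is called.
token = os.environ.get("APIFY_API_TOKEN", "")
configured = bool(token)
if not configured:
    print("APIFY_API_TOKEN is not set; Apify tools will fail to authenticate")
```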
@@ -366,18 +368,22 @@ def apify_run_actor( |
366 | 368 | ) -> Dict[str, Any]: |
367 | 369 | """Run any Apify Actor and return the run metadata as JSON. |
368 | 370 |
|
369 | | - Executes the Actor synchronously - blocks until the Actor run finishes or the timeout |
370 | | - is reached. Use this when you need to run a specific Actor and then inspect or process |
371 | | - the results separately. |
| 371 | + An Actor is a serverless cloud app on the Apify platform — it takes JSON input, |
| 372 | + runs the scraping or automation job, and writes results to a dataset. This tool |
| 373 | + executes the Actor synchronously and returns run metadata only (run_id, status, |
| 374 | + dataset_id, timestamps). Use apify_run_actor_and_get_dataset to also fetch the |
| 375 | + output data in one call, or apify_scrape_url for quick single-URL extraction. |
372 | 376 |
|
373 | 377 | Common Actors: |
374 | | - - "apify/website-content-crawler" - scrape websites and extract content |
375 | | - - "apify/web-scraper" - general-purpose web scraper |
376 | | - - "apify/google-search-scraper" - scrape Google search results |
| 378 | + - "apify/website-content-crawler" — scrape websites and extract content as markdown |
| 379 | + - "apify/web-scraper" — general-purpose web scraper with JS rendering |
| 380 | + - "apify/google-search-scraper" — scrape Google search results |
377 | 381 |
|
378 | 382 | Args: |
379 | | - actor_id: Actor identifier, e.g. "apify/website-content-crawler" or "username/actor-name". |
380 | | - run_input: JSON-serializable input for the Actor. Each Actor defines its own input schema. |
| 383 | + actor_id: Actor identifier in "username/actor-name" format, |
| 384 | + e.g. "apify/website-content-crawler". Find Actors at https://apify.com/store. |
| 385 | + run_input: JSON-serializable input for the Actor. Each Actor defines its own |
| 386 | + input schema — check the Actor README on Apify Store for required fields. |
381 | 387 | timeout_secs: Maximum time in seconds to wait for the Actor run to finish. Defaults to 300. |
382 | 388 | memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default if not set. |
383 | 389 | build: Actor build tag or number to run a specific version. Uses latest build if not set. |
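A sketch of driving apify_run_actor as documented above. The "startUrls" and "maxCrawlPages" field names follow the apify/website-content-crawler README and should be treated as assumptions here; the live call is commented out because it needs APIFY_API_TOKEN:

```python
# Input for apify/website-content-crawler (field names per the Actor's
# README on Apify Store; treat them as an assumption in this sketch).
run_input = {
    "startUrls": [{"url": "https://example.com"}],
    "maxCrawlPages": 5,
}

# The call itself requires APIFY_API_TOKEN, so it stays commented out:
# run = apify_run_actor(
#     actor_id="apify/website-content-crawler",
#     run_input=run_input,
#     timeout_secs=300,
# )
# run metadata includes a dataset_id to pass to apify_get_dataset_items.
```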
@@ -419,8 +425,9 @@ def apify_get_dataset_items( |
419 | 425 | ) -> Dict[str, Any]: |
420 | 426 | """Fetch items from an existing Apify dataset and return them as JSON. |
421 | 427 |
|
422 | | - Use this after running an Actor to retrieve the structured results from its |
423 | | - default dataset, or to access any dataset by ID. |
| 428 | + Every Actor run writes its output to a dataset — a structured, append-only store |
| 429 | + for tabular data. Use the dataset_id from the run metadata returned by apify_run_actor |
| 430 | + or apify_run_task. Use offset for pagination through large datasets. |
424 | 431 |
|
425 | 432 | Args: |
426 | 433 | dataset_id: The Apify dataset ID to fetch items from. |
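The offset-based pagination mentioned above can be sketched as a loop over a fetch function shaped like apify_get_dataset_items. The loop shape and the page-callback signature are assumptions for illustration, not part of the tool:

```python
# Pagination sketch: keep bumping offset by one page until a page comes
# back empty, then return everything collected.
def fetch_all_items(dataset_id, fetch_page, page_size=100):
    items, offset = [], 0
    while True:
        page = fetch_page(dataset_id, limit=page_size, offset=offset)
        if not page:
            break
        items.extend(page)
        offset += page_size
    return items
```

With the real tool you would pass a small wrapper as fetch_page that calls apify_get_dataset_items and unpacks the item list from its result dict.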
@@ -457,15 +464,17 @@ def apify_run_actor_and_get_dataset( |
457 | 464 | ) -> Dict[str, Any]: |
458 | 465 | """Run an Apify Actor and fetch its dataset results in one step. |
459 | 466 |
|
460 | | - Convenience tool that combines running an Actor and fetching its default |
461 | | - dataset items into a single call. Use this when you want both the run metadata and the |
| 467 | + Convenience tool that combines running an Actor and fetching its default dataset |
| 468 | + items into a single call. Use this when you want both the run metadata and the |
462 | 469 | result data without making two separate tool calls. |
463 | 470 |
|
464 | 471 | Args: |
465 | | - actor_id: Actor identifier, e.g. "apify/website-content-crawler" or "username/actor-name". |
466 | | - run_input: JSON-serializable input for the Actor. |
| 472 | + actor_id: Actor identifier in "username/actor-name" format, |
| 473 | + e.g. "apify/website-content-crawler". Find Actors at https://apify.com/store. |
| 474 | + run_input: JSON-serializable input for the Actor. Each Actor defines its own |
| 475 | + input schema — check the Actor README on Apify Store for required fields. |
467 | 476 | timeout_secs: Maximum time in seconds to wait for the Actor run to finish. Defaults to 300. |
468 | | - memory_mbytes: Memory allocation in MB for the Actor run. |
| 477 | + memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default if not set. |
469 | 478 | build: Actor build tag or number to run a specific version. Uses latest build if not set. |
470 | 479 | dataset_items_limit: Maximum number of dataset items to return. Defaults to 100. |
471 | 480 | dataset_items_offset: Number of dataset items to skip for pagination. Defaults to 0. |
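A sketch of the one-call variant using apify/google-search-scraper. The "queries" input field follows that Actor's README and is an assumption here, as is the shape of the returned dict; the live call is commented out because it needs APIFY_API_TOKEN:

```python
# Input for apify/google-search-scraper ("queries" field per the Actor
# README; an assumption in this sketch).
run_input = {"queries": "web scraping frameworks"}

# Requires APIFY_API_TOKEN, so it stays commented out:
# result = apify_run_actor_and_get_dataset(
#     actor_id="apify/google-search-scraper",
#     run_input=run_input,
#     dataset_items_limit=10,
# )
```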
@@ -509,15 +518,16 @@ def apify_run_task( |
509 | 518 | timeout_secs: int = DEFAULT_TIMEOUT_SECS, |
510 | 519 | memory_mbytes: Optional[int] = None, |
511 | 520 | ) -> Dict[str, Any]: |
512 | | - """Run an Apify task and return the run metadata as JSON. |
| 521 | + """Run a saved Apify task and return the run metadata as JSON. |
513 | 522 |
|
514 | | - Tasks are saved Actor configurations with preset inputs. Use this when a task |
515 | | - has already been configured in Apify Console, so you don't need to specify |
516 | | - the full Actor input every time. |
| 523 | + Tasks are saved Actor configurations with preset inputs, managed in Apify Console. |
| 524 | + Use this when a task has already been configured, so you don't need to specify |
| 525 | + the full Actor input every time. Use apify_run_task_and_get_dataset to also fetch |
| 526 | + the output data in one call. |
517 | 527 |
|
518 | 528 | Args: |
519 | | - task_id: Task identifier, e.g. "user/my-task" or a task ID string. |
520 | | - task_input: Optional JSON-serializable input to override the task's default input. |
| 529 | + task_id: Task identifier in "username~task-name" format or a task ID string. |
| 530 | + task_input: Optional JSON-serializable input to override the task's default input fields. |
521 | 531 | timeout_secs: Maximum time in seconds to wait for the task run to finish. Defaults to 300. |
522 | 532 | memory_mbytes: Memory allocation in MB for the task run. Uses task default if not set. |
523 | 533 |
|
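A sketch of overriding a single preset field on a saved task. Both the task name "username~my-task" and the "maxResults" field are hypothetical placeholders; the live call is commented out because it needs APIFY_API_TOKEN:

```python
# Override just one field of the task's preset input; all other preset
# fields keep their saved values ("maxResults" is a hypothetical field).
task_input = {"maxResults": 10}

# Requires APIFY_API_TOKEN, so it stays commented out:
# run = apify_run_task(task_id="username~my-task", task_input=task_input)
# run metadata includes a dataset_id to pass to apify_get_dataset_items.
```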
@@ -558,17 +568,17 @@ def apify_run_task_and_get_dataset( |
558 | 568 | dataset_items_limit: int = DEFAULT_DATASET_ITEMS_LIMIT, |
559 | 569 | dataset_items_offset: int = 0, |
560 | 570 | ) -> Dict[str, Any]: |
561 | | - """Run an Apify task and fetch its dataset results in one step. |
| 571 | + """Run a saved Apify task and fetch its dataset results in one step. |
562 | 572 |
|
563 | | - Convenience tool that combines running a task and fetching its default |
564 | | - dataset items into a single call. Use this when you want both the run metadata and the |
| 573 | + Convenience tool that combines running a task and fetching its default dataset |
| 574 | + items into a single call. Use this when you want both the run metadata and the |
565 | 575 | result data without making two separate tool calls. |
566 | 576 |
|
567 | 577 | Args: |
568 | | - task_id: Task identifier, e.g. "user/my-task" or a task ID string. |
569 | | - task_input: Optional JSON-serializable input to override the task's default input. |
| 578 | + task_id: Task identifier in "username~task-name" format or a task ID string. |
| 579 | + task_input: Optional JSON-serializable input to override the task's default input fields. |
570 | 580 | timeout_secs: Maximum time in seconds to wait for the task run to finish. Defaults to 300. |
571 | | - memory_mbytes: Memory allocation in MB for the task run. |
| 581 | + memory_mbytes: Memory allocation in MB for the task run. Uses task default if not set. |
572 | 582 | dataset_items_limit: Maximum number of dataset items to return. Defaults to 100. |
573 | 583 | dataset_items_offset: Number of dataset items to skip for pagination. Defaults to 0. |
574 | 584 |
|
@@ -613,14 +623,16 @@ def apify_scrape_url( |
613 | 623 |
|
614 | 624 | Uses the Website Content Crawler Actor under the hood, pre-configured for |
615 | 625 | fast single-page scraping. This is the simplest way to extract readable content |
616 | | - from any web page. |
| 626 | + from any web page — no Actor input schema needed. For multi-page crawls, use |
| 627 | + apify_run_actor_and_get_dataset with "apify/website-content-crawler" directly. |
617 | 628 |
|
618 | 629 | Args: |
619 | 630 | url: The URL to scrape, e.g. "https://example.com". |
620 | 631 | timeout_secs: Maximum time in seconds to wait for scraping to finish. Defaults to 120. |
621 | | - crawler_type: Crawler engine to use. One of "cheerio" (fastest, no JS rendering, |
622 | | - default), "playwright:adaptive" (fast, renders JS if present), or |
623 | | - "playwright:firefox" (reliable, renders JS, best at avoiding blocking but slower). |
| 632 | + crawler_type: Crawler engine to use. One of: |
| 633 | + - "cheerio" (default): Fastest, no JavaScript rendering. Best for static HTML. |
| 634 | + - "playwright:adaptive": Renders JS only when needed. Good general-purpose choice. |
| 635 | + - "playwright:firefox": Full JS rendering, best at bypassing anti-bot protection but slowest. |
624 | 636 |
|
625 | 637 | Returns: |
626 | 638 | Dict with status and content containing the markdown content of the scraped page. |
|