|
1 | 1 | """Apify platform tools for Strands Agents. |
2 | 2 |
|
3 | | -This module provides web scraping, data extraction, and automation capabilities |
4 | | -using the Apify platform. It lets you run any Actor, task, fetch dataset |
5 | | -results, and scrape individual URLs. |
| 3 | +
|
| 4 | +Apify is the world's largest marketplace of tools for web scraping, crawling, data extraction, and web automation. |
| 5 | +These tools, called Actors, are serverless cloud programs that take JSON input and store results |
| 6 | +in a dataset (structured, tabular output) or key-value store (files and unstructured data). |
| 7 | +Get structured data from social media, e-commerce, search engines, maps, travel sites, or any other website. |
6 | 8 |
|
7 | 9 | Available Tools: |
8 | 10 | --------------- |
|
16 | 18 | Setup Requirements: |
17 | 19 | ------------------ |
18 | 20 | 1. Create an Apify account at https://apify.com |
19 | | -2. Obtain your API token: Apify Console > Settings > API & Integrations > Personal API tokens |
| 21 | +2. Get your API token: Apify Console > Settings > API & Integrations > Personal API tokens |
20 | 22 | 3. Install the optional dependency: pip install strands-agents-tools[apify] |
21 | 23 | 4. Set the environment variable: |
22 | 24 | APIFY_API_TOKEN=your_api_token_here |
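Steps 3 and 4 above, as a shell fragment (replace the token placeholder with your own value from Apify Console):

```shell
# Install the optional Apify extra and export the API token (setup steps 3-4)
pip install 'strands-agents-tools[apify]'
export APIFY_API_TOKEN=your_api_token_here
```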
@@ -328,7 +330,7 @@ def scrape_url( |
328 | 330 | timeout_secs: int = DEFAULT_SCRAPE_TIMEOUT_SECS, |
329 | 331 | crawler_type: CrawlerType = "cheerio", |
330 | 332 | ) -> str: |
331 | | - """Scrape a single URL using Website Content Crawler and return markdown.""" |
| 333 | + """Scrape a single URL using Website Content Crawler and return Markdown.""" |
332 | 334 | self._validate_url(url) |
333 | 335 | self._validate_positive(timeout_secs, "timeout_secs") |
334 | 336 | if crawler_type not in WEBSITE_CONTENT_CRAWLER_TYPES: |
@@ -375,20 +377,24 @@ def apify_run_actor( |
375 | 377 | ) -> Dict[str, Any]: |
376 | 378 | """Run any Apify Actor and return the run metadata as JSON. |
377 | 379 |
|
378 | | - Executes the Actor synchronously - blocks until the Actor run finishes or the timeout |
379 | | - is reached. Use this when you need to run a specific Actor and then inspect or process |
380 | | - the results separately. |
| 380 | + An Actor is a serverless cloud app on the Apify platform — it takes JSON input, |
| 381 | + runs the scraping or automation job, and writes results to a dataset. This tool |
| 382 | + executes the Actor synchronously and returns run metadata only (run_id, status, |
| 383 | + dataset_id, timestamps). Use apify_run_actor_and_get_dataset to also fetch the |
| 384 | + output data in one call, or apify_scrape_url for quick single-URL extraction. |
381 | 385 |
|
382 | 386 | Common Actors: |
383 | | - - "apify/website-content-crawler" - scrape websites and extract content |
384 | | - - "apify/web-scraper" - general-purpose web scraper |
385 | | - - "apify/google-search-scraper" - scrape Google search results |
| 387 | + - "apify/website-content-crawler" - scrape websites and extract content as Markdown |
| 388 | + - "apify/web-scraper" - general-purpose web scraper with JS rendering |
| 389 | +  - "apify/google-search-scraper" - scrape Google search results |
386 | 390 |
|
387 | 391 | Args: |
388 | | - actor_id: Actor identifier, e.g. "apify/website-content-crawler" or "username/actor-name". |
389 | | - run_input: JSON-serializable input for the Actor. Each Actor defines its own input schema. |
| 392 | + actor_id: Actor identifier in "username/actor-name" format, |
| 393 | + e.g. "apify/website-content-crawler". Find Actors at https://apify.com/store. |
| 394 | + run_input: JSON-serializable input for the Actor. Each Actor defines its own |
| 395 | + input schema - check the Actor README on Apify Store for required fields. |
390 | 396 | timeout_secs: Maximum time in seconds to wait for the Actor run to finish. Defaults to 300. |
391 | | - memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default if not set. |
| 397 | + memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default `memory` value if not set. |
392 | 398 | build: Actor build tag or number to run a specific version. Uses latest build if not set. |
393 | 399 |
|
394 | 400 | Returns: |
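As a sketch of the call shape, the snippet below stubs out `apify_run_actor` (a real run needs an Apify account and `APIFY_API_TOKEN`) and shows how the returned run metadata is read. The exact return keys are an assumption based on the docstring, and the `startUrls` input field follows the website-content-crawler README:

```python
# Stub standing in for apify_run_actor -- a real call needs APIFY_API_TOKEN.
# The return shape (status + content with run metadata) follows the docstring;
# treat the exact keys as an assumption, not the library's guaranteed API.
def apify_run_actor(actor_id, run_input, timeout_secs=300,
                    memory_mbytes=None, build=None):
    return {
        "status": "success",
        "content": [{
            "run_id": "hypothetical-run-id",
            "status": "SUCCEEDED",
            "dataset_id": "hypothetical-dataset-id",
        }],
    }

result = apify_run_actor(
    "apify/website-content-crawler",
    {"startUrls": [{"url": "https://example.com"}]},  # Actor-specific input
    timeout_secs=300,
)
run = result["content"][0]
print(run["status"], run["dataset_id"])  # dataset_id feeds apify_get_dataset_items
```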
@@ -428,8 +434,9 @@ def apify_get_dataset_items( |
428 | 434 | ) -> Dict[str, Any]: |
429 | 435 | """Fetch items from an existing Apify dataset and return them as JSON. |
430 | 436 |
|
431 | | - Use this after running an Actor to retrieve the structured results from its |
432 | | - default dataset, or to access any dataset by ID. |
| 437 | + Every Actor run writes its output to a dataset — a structured, append-only store |
| 438 | + for tabular data. Use the dataset_id from the run metadata returned by apify_run_actor |
| 439 | + or apify_run_task. Use offset for pagination through large datasets. |
433 | 440 |
|
434 | 441 | Args: |
435 | 442 | dataset_id: The Apify dataset ID to fetch items from. |
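The "offset for pagination" note above can be sketched as a loop that advances the offset until a short page signals the end of the dataset; `fetch_page` here stands in for `apify_get_dataset_items`, which needs a live Apify dataset:

```python
# Pagination sketch: fetch_page is a stand-in for apify_get_dataset_items.
def paginate(fetch_page, limit=100):
    offset = 0
    items = []
    while True:
        page = fetch_page(offset=offset, limit=limit)
        items.extend(page)
        if len(page) < limit:  # short page means the dataset is exhausted
            break
        offset += limit
    return items

# Demo with an in-memory "dataset" of 250 items
data = list(range(250))
pages_fetched = []

def fake_fetch(offset, limit):
    pages_fetched.append(offset)
    return data[offset:offset + limit]

all_items = paginate(fake_fetch, limit=100)
print(len(all_items))   # 250
print(pages_fetched)    # [0, 100, 200]
```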
@@ -466,15 +473,17 @@ def apify_run_actor_and_get_dataset( |
466 | 473 | ) -> Dict[str, Any]: |
467 | 474 | """Run an Apify Actor and fetch its dataset results in one step. |
468 | 475 |
|
469 | | - Convenience tool that combines running an Actor and fetching its default |
470 | | - dataset items into a single call. Use this when you want both the run metadata and the |
| 476 | + Convenience tool that combines running an Actor and fetching its default dataset |
| 477 | + items into a single call. Use this when you want both the run metadata and the |
471 | 478 | result data without making two separate tool calls. |
472 | 479 |
|
473 | 480 | Args: |
474 | | - actor_id: Actor identifier, e.g. "apify/website-content-crawler" or "username/actor-name". |
475 | | - run_input: JSON-serializable input for the Actor. |
| 481 | + actor_id: Actor identifier in "username/actor-name" format, |
| 482 | + e.g. "apify/website-content-crawler". Find Actors at https://apify.com/store. |
| 483 | + run_input: JSON-serializable input for the Actor. Each Actor defines its own |
| 484 | + input schema - check the Actor README on Apify Store for required fields. |
476 | 485 | timeout_secs: Maximum time in seconds to wait for the Actor run to finish. Defaults to 300. |
477 | | - memory_mbytes: Memory allocation in MB for the Actor run. |
| 486 | + memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default `memory` value if not set. |
478 | 487 | build: Actor build tag or number to run a specific version. Uses latest build if not set. |
479 | 488 | dataset_items_limit: Maximum number of dataset items to return. Defaults to 100. |
480 | 489 | dataset_items_offset: Number of dataset items to skip for pagination. Defaults to 0. |
@@ -518,17 +527,18 @@ def apify_run_task( |
518 | 527 | timeout_secs: int = DEFAULT_TIMEOUT_SECS, |
519 | 528 | memory_mbytes: Optional[int] = None, |
520 | 529 | ) -> Dict[str, Any]: |
521 | | - """Run an Apify task and return the run metadata as JSON. |
| 530 | + """Run a saved Apify task and return the run metadata as JSON. |
522 | 531 |
|
523 | | - Tasks are saved Actor configurations with preset inputs. Use this when a task |
524 | | - has already been configured in Apify Console, so you don't need to specify |
525 | | - the full Actor input every time. |
| 532 | + Tasks are saved Actor configurations with preset inputs, managed in Apify Console. |
| 533 | + Use this when a task has already been configured, so you don't need to specify |
| 534 | + the full Actor input every time. Use apify_run_task_and_get_dataset to also fetch |
| 535 | + the output data in one call. |
526 | 536 |
|
527 | 537 | Args: |
528 | | - task_id: Task identifier, e.g. "user/my-task" or a task ID string. |
529 | | - task_input: Optional JSON-serializable input to override the task's default input. |
| 538 | + task_id: Task identifier in "username/task-name" format or a task ID string. |
| 539 | + task_input: Optional JSON-serializable input to override the task's default input fields. |
530 | 540 | timeout_secs: Maximum time in seconds to wait for the task run to finish. Defaults to 300. |
531 | | - memory_mbytes: Memory allocation in MB for the task run. Uses task default if not set. |
| 541 | + memory_mbytes: Memory allocation in MB for the task run. Uses task default `memory` value if not set. |
532 | 542 |
|
533 | 543 | Returns: |
534 | 544 | Dict with status and content containing run metadata: run_id, status, dataset_id, |
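To illustrate what "override the task's default input fields" means, here is a shallow-merge sketch; the actual merge happens server-side on Apify, and the field names below are hypothetical:

```python
# Hypothetical preset input of a saved task (field names are made up here).
task_defaults = {"maxPages": 10, "proxy": {"useApifyProxy": True}}

# task_input passed to apify_run_task overrides matching top-level fields;
# fields not listed keep the task's preset values.
task_input = {"maxPages": 50}

effective = {**task_defaults, **task_input}
print(effective)  # {'maxPages': 50, 'proxy': {'useApifyProxy': True}}
```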
@@ -567,17 +577,17 @@ def apify_run_task_and_get_dataset( |
567 | 577 | dataset_items_limit: int = DEFAULT_DATASET_ITEMS_LIMIT, |
568 | 578 | dataset_items_offset: int = 0, |
569 | 579 | ) -> Dict[str, Any]: |
570 | | - """Run an Apify task and fetch its dataset results in one step. |
| 580 | + """Run a saved Apify task and fetch its dataset results in one step. |
571 | 581 |
|
572 | | - Convenience tool that combines running a task and fetching its default |
573 | | - dataset items into a single call. Use this when you want both the run metadata and the |
| 582 | + Convenience tool that combines running a task and fetching its default dataset |
| 583 | + items into a single call. Use this when you want both the run metadata and the |
574 | 584 | result data without making two separate tool calls. |
575 | 585 |
|
576 | 586 | Args: |
577 | | - task_id: Task identifier, e.g. "user/my-task" or a task ID string. |
578 | | - task_input: Optional JSON-serializable input to override the task's default input. |
| 587 | + task_id: Task identifier in "username/task-name" format or a task ID string. |
| 588 | + task_input: Optional JSON-serializable input to override the task's default input fields. |
579 | 589 | timeout_secs: Maximum time in seconds to wait for the task run to finish. Defaults to 300. |
580 | | - memory_mbytes: Memory allocation in MB for the task run. |
| 590 | + memory_mbytes: Memory allocation in MB for the task run. Uses task default `memory` value if not set. |
581 | 591 | dataset_items_limit: Maximum number of dataset items to return. Defaults to 100. |
582 | 592 | dataset_items_offset: Number of dataset items to skip for pagination. Defaults to 0. |
583 | 593 |
|
@@ -618,21 +628,23 @@ def apify_scrape_url( |
618 | 628 | timeout_secs: int = DEFAULT_SCRAPE_TIMEOUT_SECS, |
619 | 629 | crawler_type: CrawlerType = "cheerio", |
620 | 630 | ) -> Dict[str, Any]: |
621 | | - """Scrape a single URL and return its content as markdown. |
| 631 | + """Scrape a single URL and return its content as Markdown. |
622 | 632 |
|
623 | 633 | Uses the Website Content Crawler Actor under the hood, pre-configured for |
624 | 634 | fast single-page scraping. This is the simplest way to extract readable content |
625 | | - from any web page. |
| 635 | + from any web page — no Actor input schema needed. For multi-page crawls, use |
| 636 | + apify_run_actor_and_get_dataset with "apify/website-content-crawler" directly. |
626 | 637 |
|
627 | 638 | Args: |
628 | 639 | url: The URL to scrape, e.g. "https://example.com". |
629 | 640 | timeout_secs: Maximum time in seconds to wait for scraping to finish. Defaults to 120. |
630 | | - crawler_type: Crawler engine to use. One of "cheerio" (fastest, no JS rendering, |
631 | | - default), "playwright:adaptive" (fast, renders JS if present), or |
632 | | - "playwright:firefox" (reliable, renders JS, best at avoiding blocking but slower). |
| 641 | + crawler_type: Crawler engine to use. One of: |
| 642 | + - "cheerio" (default): Fastest, no JavaScript rendering. Best for static HTML. |
| 643 | + - "playwright:adaptive": Renders JS only when needed. Good general-purpose choice. |
| 644 | + - "playwright:firefox": Full JS rendering, best at bypassing anti-bot protection but slowest. |
633 | 645 |
|
634 | 646 | Returns: |
635 | | - Dict with status and content containing the markdown content of the scraped page. |
| 647 | + Dict with status and content containing the Markdown content of the scraped page. |
636 | 648 | """ |
637 | 649 | try: |
638 | 650 | _check_dependency() |
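The crawler_type trade-offs documented above can be summarized as a small heuristic; this helper is an illustration only (an assumption, not part of the library), but the three engine names are the documented values:

```python
# Heuristic sketch (not part of the library): pick the fastest crawler_type
# that is likely to work for a given page, per the docstring's trade-offs.
def choose_crawler_type(needs_js: bool, anti_bot: bool) -> str:
    if anti_bot:
        return "playwright:firefox"   # best at bypassing anti-bot, slowest
    if needs_js:
        return "playwright:adaptive"  # renders JS only when needed
    return "cheerio"                  # fastest, static HTML only

print(choose_crawler_type(False, False))  # cheerio
print(choose_crawler_type(True, False))   # playwright:adaptive
print(choose_crawler_type(True, True))    # playwright:firefox
```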
|