Commit 38ccda7

Merge pull request #4 from jirispilka/feat/strands-core-apify-tools
fix: improve Apify tools docstrings with Apify context
2 parents b1a792c + 2eab80c commit 38ccda7

3 files changed

Lines changed: 55 additions & 43 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -970,7 +970,7 @@ from strands_tools.apify import APIFY_CORE_TOOLS
 
 agent = Agent(tools=APIFY_CORE_TOOLS)
 
-# Scrape a single URL and get markdown content
+# Scrape a single URL and get Markdown content
 content = agent.tool.apify_scrape_url(url="https://example.com")
 
 # Run an Actor and get results in one step
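The README snippet above assumes a live `Agent` with Apify credentials. The same two-step flow (run an Actor, then fetch its dataset) can be sketched offline with a stub in place of the real tools; `StubTools` and its hard-coded return values are hypothetical, while the field names (`run_id`, `status`, `dataset_id`) follow the run metadata documented in `apify.py` below.

```python
# Illustrative stub of the two-step pattern. Real calls go through
# agent.tool.apify_run_actor / agent.tool.apify_get_dataset_items and
# require APIFY_API_TOKEN plus the apify-client package.

class StubTools:
    def apify_run_actor(self, actor_id, run_input, timeout_secs=300):
        # The real tool blocks until the run finishes and returns run metadata.
        return {"run_id": "run_1", "status": "SUCCEEDED", "dataset_id": "ds_1"}

    def apify_get_dataset_items(self, dataset_id, limit=100, offset=0):
        # The real tool returns the items the Actor run stored in its dataset.
        return {"status": "success", "items": [{"url": "https://example.com"}]}


tools = StubTools()
run = tools.apify_run_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://example.com"}], "maxCrawlPages": 1},
)
# Feed the dataset_id from the run metadata into the dataset fetch.
items = tools.apify_get_dataset_items(dataset_id=run["dataset_id"])
```

The point of the stub is the data flow: the first call yields only metadata, and the `dataset_id` it returns is what the second call needs.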

docs/apify_tool.md

Lines changed: 2 additions & 2 deletions
@@ -191,9 +191,9 @@ items = agent.tool.apify_get_dataset_items(
 | `APIFY_API_TOKEN environment variable is not set` | Token not configured | Set the `APIFY_API_TOKEN` environment variable |
 | `apify-client package is required` | Optional dependency not installed | Run `pip install strands-agents-tools[apify]` |
 | `Actor ... finished with status FAILED` | Actor execution error | Check Actor input parameters and run logs in [Apify Console](https://console.apify.com) |
-| `Task ... finished with status FAILED` | task execution error | Check task configuration and run logs in [Apify Console](https://console.apify.com) |
+| `Task ... finished with status FAILED` | Task execution error | Check task configuration and run logs in [Apify Console](https://console.apify.com) |
 | `Actor/task ... finished with status TIMED-OUT` | Timeout too short for the workload | Increase the `timeout_secs` parameter |
-| `Task ... returned no run data` | task `call()` returned `None` (wait timeout) | Increase the `timeout_secs` parameter |
+| `Task ... returned no run data` | Task `call()` returned `None` (wait timeout) | Increase the `timeout_secs` parameter |
 | `No content returned for URL` | Website Content Crawler returned empty results | Verify the URL is accessible and returns content |
 
 ## References
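The troubleshooting table's fix for TIMED-OUT runs is to raise `timeout_secs`. That advice can be automated with a simple doubling retry; this is a sketch only, assuming nothing beyond what the docstrings state (the run returns metadata with a `status` field), and `run_with_timeout_backoff`, `run_fn`, and `fake_run` are all hypothetical names, not part of the toolkit.

```python
def run_with_timeout_backoff(run_fn, initial_timeout_secs=300, max_attempts=3):
    """Retry a run, doubling the timeout each time it comes back TIMED-OUT."""
    timeout = initial_timeout_secs
    for _ in range(max_attempts):
        result = run_fn(timeout_secs=timeout)
        if result.get("status") != "TIMED-OUT":
            return result
        timeout *= 2  # give the workload more time on the next attempt
    return result  # still timed out after max_attempts


# Stub run that only succeeds once the timeout reaches 600 seconds.
def fake_run(timeout_secs):
    status = "SUCCEEDED" if timeout_secs >= 600 else "TIMED-OUT"
    return {"status": status, "timeout_used": timeout_secs}


result = run_with_timeout_backoff(fake_run)
```

With the stub, the first attempt (300 s) times out and the second (600 s) succeeds, which is exactly the manual workflow the table recommends.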

src/strands_tools/apify.py

Lines changed: 52 additions & 40 deletions
@@ -1,8 +1,10 @@
 """Apify platform tools for Strands Agents.
 
-This module provides web scraping, data extraction, and automation capabilities
-using the Apify platform. It lets you run any Actor, task, fetch dataset
-results, and scrape individual URLs.
+
+Apify is the world's largest marketplace of tools for web scraping, crawling, data extraction, and web automation.
+These tools are called Actors, serverless cloud programs that take JSON input and store results
+in a dataset (structured, tabular output) or key-value store (files and unstructured data).
+Get structured data from social media, e-commerce, search engines, maps, travel sites, or any other website.
 
 Available Tools:
 ---------------
@@ -16,7 +18,7 @@
 Setup Requirements:
 ------------------
 1. Create an Apify account at https://apify.com
-2. Obtain your API token: Apify Console > Settings > API & Integrations > Personal API tokens
+2. Get your API token: Apify Console > Settings > API & Integrations > Personal API tokens
 3. Install the optional dependency: pip install strands-agents-tools[apify]
 4. Set the environment variable:
    APIFY_API_TOKEN=your_api_token_here
@@ -328,7 +330,7 @@ def scrape_url(
     timeout_secs: int = DEFAULT_SCRAPE_TIMEOUT_SECS,
     crawler_type: CrawlerType = "cheerio",
 ) -> str:
-    """Scrape a single URL using Website Content Crawler and return markdown."""
+    """Scrape a single URL using Website Content Crawler and return Markdown."""
     self._validate_url(url)
     self._validate_positive(timeout_secs, "timeout_secs")
     if crawler_type not in WEBSITE_CONTENT_CRAWLER_TYPES:
@@ -375,20 +377,24 @@ def apify_run_actor(
 ) -> Dict[str, Any]:
     """Run any Apify Actor and return the run metadata as JSON.
 
-    Executes the Actor synchronously - blocks until the Actor run finishes or the timeout
-    is reached. Use this when you need to run a specific Actor and then inspect or process
-    the results separately.
+    An Actor is a serverless cloud app on the Apify platform — it takes JSON input,
+    runs the scraping or automation job, and writes results to a dataset. This tool
+    executes the Actor synchronously and returns run metadata only (run_id, status,
+    dataset_id, timestamps). Use apify_run_actor_and_get_dataset to also fetch the
+    output data in one call, or apify_scrape_url for quick single-URL extraction.
 
     Common Actors:
-    - "apify/website-content-crawler" - scrape websites and extract content
-    - "apify/web-scraper" - general-purpose web scraper
-    - "apify/google-search-scraper" - scrape Google search results
+    - "apify/website-content-crawler" - scrape websites and extract content as Markdown
+    - "apify/web-scraper" - general-purpose web scraper with JS rendering
+    - "apify/google-search-scraper" - scrape Google search results
 
     Args:
-        actor_id: Actor identifier, e.g. "apify/website-content-crawler" or "username/actor-name".
-        run_input: JSON-serializable input for the Actor. Each Actor defines its own input schema.
+        actor_id: Actor identifier in "username/actor-name" format,
+            e.g. "apify/website-content-crawler". Find Actors at https://apify.com/store.
+        run_input: JSON-serializable input for the Actor. Each Actor defines its own
+            input schema - check the Actor README on Apify Store for required fields.
         timeout_secs: Maximum time in seconds to wait for the Actor run to finish. Defaults to 300.
-        memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default if not set.
+        memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default `memory` value if not set.
         build: Actor build tag or number to run a specific version. Uses latest build if not set.
 
     Returns:
@@ -428,8 +434,9 @@ def apify_get_dataset_items(
 ) -> Dict[str, Any]:
     """Fetch items from an existing Apify dataset and return them as JSON.
 
-    Use this after running an Actor to retrieve the structured results from its
-    default dataset, or to access any dataset by ID.
+    Every Actor run writes its output to a dataset — a structured, append-only store
+    for tabular data. Use the dataset_id from the run metadata returned by apify_run_actor
+    or apify_run_task. Use offset for pagination through large datasets.
 
     Args:
         dataset_id: The Apify dataset ID to fetch items from.
@@ -466,15 +473,17 @@ def apify_run_actor_and_get_dataset(
 ) -> Dict[str, Any]:
     """Run an Apify Actor and fetch its dataset results in one step.
 
-    Convenience tool that combines running an Actor and fetching its default
-    dataset items into a single call. Use this when you want both the run metadata and the
+    Convenience tool that combines running an Actor and fetching its default dataset
+    items into a single call. Use this when you want both the run metadata and the
     result data without making two separate tool calls.
 
     Args:
-        actor_id: Actor identifier, e.g. "apify/website-content-crawler" or "username/actor-name".
-        run_input: JSON-serializable input for the Actor.
+        actor_id: Actor identifier in "username/actor-name" format,
+            e.g. "apify/website-content-crawler". Find Actors at https://apify.com/store.
+        run_input: JSON-serializable input for the Actor. Each Actor defines its own
+            input schema - check the Actor README on Apify Store for required fields.
         timeout_secs: Maximum time in seconds to wait for the Actor run to finish. Defaults to 300.
-        memory_mbytes: Memory allocation in MB for the Actor run.
+        memory_mbytes: Memory allocation in MB for the Actor run. Uses Actor default `memory` value if not set.
         build: Actor build tag or number to run a specific version. Uses latest build if not set.
         dataset_items_limit: Maximum number of dataset items to return. Defaults to 100.
         dataset_items_offset: Number of dataset items to skip for pagination. Defaults to 0.
@@ -518,17 +527,18 @@ def apify_run_task(
     timeout_secs: int = DEFAULT_TIMEOUT_SECS,
     memory_mbytes: Optional[int] = None,
 ) -> Dict[str, Any]:
-    """Run an Apify task and return the run metadata as JSON.
+    """Run a saved Apify task and return the run metadata as JSON.
 
-    Tasks are saved Actor configurations with preset inputs. Use this when a task
-    has already been configured in Apify Console, so you don't need to specify
-    the full Actor input every time.
+    Tasks are saved Actor configurations with preset inputs, managed in Apify Console.
+    Use this when a task has already been configured, so you don't need to specify
+    the full Actor input every time. Use apify_run_task_and_get_dataset to also fetch
+    the output data in one call.
 
     Args:
-        task_id: Task identifier, e.g. "user/my-task" or a task ID string.
-        task_input: Optional JSON-serializable input to override the task's default input.
+        task_id: Task identifier in "username/task-name" format or a task ID string.
+        task_input: Optional JSON-serializable input to override the task's default input fields.
         timeout_secs: Maximum time in seconds to wait for the task run to finish. Defaults to 300.
-        memory_mbytes: Memory allocation in MB for the task run. Uses task default if not set.
+        memory_mbytes: Memory allocation in MB for the task run. Uses task default `memory` value if not set.
 
     Returns:
         Dict with status and content containing run metadata: run_id, status, dataset_id,
@@ -567,17 +577,17 @@ def apify_run_task_and_get_dataset(
     dataset_items_limit: int = DEFAULT_DATASET_ITEMS_LIMIT,
     dataset_items_offset: int = 0,
 ) -> Dict[str, Any]:
-    """Run an Apify task and fetch its dataset results in one step.
+    """Run a saved Apify task and fetch its dataset results in one step.
 
-    Convenience tool that combines running a task and fetching its default
-    dataset items into a single call. Use this when you want both the run metadata and the
+    Convenience tool that combines running a task and fetching its default dataset
+    items into a single call. Use this when you want both the run metadata and the
     result data without making two separate tool calls.
 
     Args:
-        task_id: Task identifier, e.g. "user/my-task" or a task ID string.
-        task_input: Optional JSON-serializable input to override the task's default input.
+        task_id: Task identifier in "username/task-name" format or a task ID string.
+        task_input: Optional JSON-serializable input to override the task's default input fields.
         timeout_secs: Maximum time in seconds to wait for the task run to finish. Defaults to 300.
-        memory_mbytes: Memory allocation in MB for the task run.
+        memory_mbytes: Memory allocation in MB for the task run. Uses task default `memory` value if not set.
         dataset_items_limit: Maximum number of dataset items to return. Defaults to 100.
         dataset_items_offset: Number of dataset items to skip for pagination. Defaults to 0.
 
@@ -618,21 +628,23 @@ def apify_scrape_url(
     timeout_secs: int = DEFAULT_SCRAPE_TIMEOUT_SECS,
     crawler_type: CrawlerType = "cheerio",
 ) -> Dict[str, Any]:
-    """Scrape a single URL and return its content as markdown.
+    """Scrape a single URL and return its content as Markdown.
 
     Uses the Website Content Crawler Actor under the hood, pre-configured for
     fast single-page scraping. This is the simplest way to extract readable content
-    from any web page.
+    from any web page — no Actor input schema needed. For multi-page crawls, use
+    apify_run_actor_and_get_dataset with "apify/website-content-crawler" directly.
 
     Args:
         url: The URL to scrape, e.g. "https://example.com".
         timeout_secs: Maximum time in seconds to wait for scraping to finish. Defaults to 120.
-        crawler_type: Crawler engine to use. One of "cheerio" (fastest, no JS rendering,
-            default), "playwright:adaptive" (fast, renders JS if present), or
-            "playwright:firefox" (reliable, renders JS, best at avoiding blocking but slower).
+        crawler_type: Crawler engine to use. One of:
+            - "cheerio" (default): Fastest, no JavaScript rendering. Best for static HTML.
+            - "playwright:adaptive": Renders JS only when needed. Good general-purpose choice.
+            - "playwright:firefox": Full JS rendering, best at bypassing anti-bot protection but slowest.
 
     Returns:
-        Dict with status and content containing the markdown content of the scraped page.
+        Dict with status and content containing the Markdown content of the scraped page.
     """
     try:
         _check_dependency()
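The docstrings in this diff document `dataset_items_offset` as the pagination knob for large datasets. The paging loop they imply can be sketched as follows; `iter_dataset_items` and `fake_fetch` are hypothetical helpers, with `fake_fetch` standing in for `apify_get_dataset_items` so the sketch runs without an Apify account.

```python
def iter_dataset_items(fetch, dataset_id, page_size=100):
    """Yield every item from a dataset, fetching one page at a time."""
    offset = 0
    while True:
        page = fetch(dataset_id=dataset_id, limit=page_size, offset=offset)
        if not page:
            return  # an empty page means the dataset is exhausted
        yield from page
        offset += page_size


# Stub dataset of 250 items to exercise the pagination loop.
DATA = [{"i": i} for i in range(250)]


def fake_fetch(dataset_id, limit, offset):
    # Stand-in for apify_get_dataset_items: slices a local list instead of
    # calling the Apify API.
    return DATA[offset:offset + limit]


all_items = list(iter_dataset_items(fake_fetch, "ds_1"))
```

Three fetches (offsets 0, 100, 200) drain the 250-item stub, and a fourth empty page terminates the loop, mirroring how `limit`/`offset` paging works against a real dataset.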
