Commit 55c9a4f

committed
finish the second lesson (no proofreading yet)
1 parent 493aac5 commit 55c9a4f

5 files changed

Lines changed: 77 additions & 133 deletions

File tree

sources/academy/platform/scraping_with_apify_and_ai/02_developing_scraper_ai_agent.md

Lines changed: 76 additions & 133 deletions
@@ -181,205 +181,148 @@ When Cursor opens the Actor's project folder, we'll see something similar to the
 
 ![Cursor ready](images/cursor-ready.webp)
 
-Now, finally, let's do some agentic coding!
+We can select files and browse or modify their content, just like in the Web IDE. But in addition, we now have an integrated AI agent we can prompt, and it'll make whatever changes we need to the code at hand.
+
+Finally, onto some agentic coding!
 
 ## Modifying code with Cursor
 
-:::note Course under construction
-This section hasn't been written yet. Come later, please!
-:::
+First, let's simplify how we can run the Actor. This will be our prompt:
 
-## Verifying changes
+```text
+Change the default input URL of the Actor
+to https://warehouse-theme-metal.myshopify.com/collections/sales
+```
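
For context, in Actor templates the default input usually lives in `.actor/input_schema.json` as a `prefill` value, so the agent will most likely end up editing something along these lines (a hypothetical sketch based on the Apify input schema format, not the template's exact file):

```json
{
  "title": "Input schema for the scraper",
  "type": "object",
  "schemaVersion": 1,
  "properties": {
    "startUrls": {
      "title": "Start URLs",
      "type": "array",
      "editor": "requestListSources",
      "prefill": [
        { "url": "https://warehouse-theme-metal.myshopify.com/collections/sales" }
      ]
    }
  },
  "required": ["startUrls"]
}
```

We don't need to know any of this, though. That's the point of prompting the agent instead of editing files ourselves.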
 
-:::note Course under construction
-This section hasn't been written yet. Come later, please!
-:::
+After we submit the prompt, the agent will start reading the code, planning, and working on completing the task. Before it runs commands, it'll ask us to approve them.
 
-## Pushing Actor to Apify
+![Cursor asking for approval to run a command](images/cursor-approve.webp)
 
-:::note Course under construction
-This section hasn't been written yet. Come later, please!
-:::
+When done, it'll print a summary of its work, and we'll be able to review all the changes it made.
 
-## Wrapping up
+![Cursor asking for a review of changes](images/cursor-review.webp)
 
-<!--
-
-## Creating an Actor
-
-Now let's use the Apify CLI to help us kick off a new Actor:
+We'll approve all changes and go to the command line to check whether the Actor now works as expected:
 
 ```text
-apify create warehouse-scraper
+apify run
 ```
 
-It starts a wizard where we can choose from various options. For each option, let's press <kbd>↵</kbd> to accept the default:
+We should see scraper output like before, including the following line:
 
 ```text
-✔ Choose the programming language of your new Actor: JavaScript
-✔ Choose a template for your new Actor. You can check more information at https://apify.com/templates. Crawlee + Cheerio
-✔ Almost done! Last step is to install dependencies. Install dependencies
-
-...
-
-Success: ✅ Actor 'warehouse-scraper' created successfully!
+INFO CheerioCrawler: Processing page: https://warehouse-theme-metal.myshopify.com/collections/sales
+```
 
-Next steps:
+That's our first successful change to the Actor with an AI agent, with no back-and-forth between the IDE and an AI chat like ChatGPT. Before pushing this change back to Apify, let's make one more improvement to the scraper.
 
-cd "warehouse-scraper"
-apify run
+## Scraping prices
 
-💡 Tip: Use 'apify push' to deploy your Actor to the Apify platform
-📖 Docs: https://docs.apify.com/platform/actors/development
-🌱 Git repository initialized in 'warehouse-scraper'. You can now commit and push your Actor to Git.
-```
+In the previous lesson, we noticed that the prices in our resulting dataset are rather raw:
 
-Now that's a lot of output, but no worries, the important part is that we've successfully used a template to set up a new Actor project!
+| name | url | price |
+| --- | --- | --- |
+| JBL Flip 4 Waterproof Portable Bluetooth Speaker | https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker | Sale price$74.95 |
+| Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv | Sale priceFrom $1,398.00 |
+| Sony SACS9 10" Active Subwoofer | https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer | Sale price$158.00 |
 
-A new directory `warehouse-scraper` has been created for us, with a variety of files and directories inside. The output instructs us to go to this new project directory, so let's do it:
+Let's change that. We'll prompt the agent like this, with a clear example of what we want:
 
 ```text
-cd "warehouse-scraper"
+Change the code so that the Actor saves prices as numbers.
+Because some prices are "from", let's call the "price" field
+"minPrice" instead, as in minimum price. Example follows.
+
+Before:
+Sale price$74.95
+Sale priceFrom $1,398.00
+Sale price$158.00
+
+After:
+74.95
+1398.00
+158.00
 ```
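
For the curious, the transformation we're asking for boils down to a few lines of JavaScript. The agent will write its own version inside the Actor's code, but it could look roughly like this (a hypothetical sketch; the name `parseMinPrice` is made up for illustration):

```javascript
// Hypothetical sketch of the price parsing the agent might generate.
// It grabs the first dollar amount from the raw text, drops the thousands
// separators, and parses the result as a number.
function parseMinPrice(rawPrice) {
  const match = rawPrice.match(/\$([\d,]+(?:\.\d+)?)/);
  if (!match) return null;
  return parseFloat(match[1].replaceAll(',', ''));
}

console.log(parseMinPrice('Sale price$74.95'));         // 74.95
console.log(parseMinPrice('Sale priceFrom $1,398.00')); // 1398
```

Note that in JavaScript the numbers `1398.00` and `1398` are the same value, so trailing zeros won't survive into the dataset.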
 
-Now we can run commands which control this new project. We didn't change the template in any way though, so it won't scrape the Warehouse store for us yet.
-
-Out of the box, the template includes a sample Actor that walks through the [crawlee.dev](https://crawlee.dev/) website and downloads all its pages. This process is called _crawling_, and the sample Actor uses a crawling tool called Crawlee, so its documentation is chosen as a sample target website. Let's see if we can run it:
+When the agent is done, we'll approve the changes and verify in the command line that the Actor runs locally:
 
 ```text
 apify run
 ```
 
-If we see a flood of output mentioning something called `CheerioCrawler`, it means the template works and we can move on to editing its files so that it does what we want.
+It runs, which is nice! But looking at the output, we can't verify what exactly gets scraped. While we're at it, let's change that with another prompt:
 
 ```text
-...
-INFO CheerioCrawler: Starting the crawler.
-INFO CheerioCrawler: enqueueing new URLs
-INFO CheerioCrawler: Crawlee · Build reliable crawlers. Fast. {"url":"https://crawlee.dev/"}
-...
-INFO CheerioCrawler: Finished! Total 107 requests: 107 succeeded, 0 failed. {"terminal":true}
+In the output of the scraper, I want to see
+what the items being saved look like.
 ```
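
In code terms, we're asking for a small change where the crawler saves each item: log it right before pushing it to the dataset. The agent's version will differ, but the idea is roughly this (a sketch with made-up names; `log` stands in for the logger Crawlee passes to the request handler):

```javascript
// Hypothetical sketch: log each dataset item right before it's saved.
// In Crawlee, the request handler receives a `log` object; minimal
// stand-ins are used here so the idea can run on its own.
function saveItem(item, log, dataset) {
  log.info(`Saving dataset item ${JSON.stringify(item)}`);
  dataset.push(item);
}

// Stand-ins for demonstration:
const log = { info: (msg) => console.log(`INFO CheerioCrawler: ${msg}`) };
const dataset = [];
saveItem({ name: 'JBL Flip 4 Waterproof Portable Bluetooth Speaker', minPrice: 74.95 }, log, dataset);
```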
 
-We're done with commands for now, but do not close the Terminal or Command Prompt window yet, as we'll soon need it again.
-
-:::caution Debugging
-If we run into issues with the template wizard or the sample Actor, let's share this tutorial with [ChatGPT](https://chatgpt.com/), include the errors we saw, and ask for help debugging.
-:::
-
-## Scraping products
-
-Now we're ready to get our own scraper done. We'll open the `src` directory inside the Actor project and find a file called `main.js`.
-
-We'll open it in a _plain text editor_. Every operating system includes one: Notepad on Windows, TextEdit on macOS, and similar tools on Linux.
-
-:::danger Avoid rich text editors
-Let's not use a _rich text editor_, such as Microsoft Word. They're great for human-readable documents with rich formatting, but for code editing, we'll use either dedicated coding editors, or the simplest tool possible.
-:::
-
-In the editor, we can see JavaScript code. Let's select all the code and copy to our clipboard. Then we'll open a _new ChatGPT conversation_ and start with a prompt like this:
-
-```text
-I'm building an Apify Actor that will run on the Apify platform.
-I need to modify a sample template project so it downloads
-https://warehouse-theme-metal.myshopify.com/collections/sales,
-extracts all products in Sales, and returns data with
-the following information for each product:
-
-- Product name
-- Product detail page URL
-- Price
-
-Before the program ends, it should log how many products it collected.
-Code from main.js follows. Reply with a code block containing
-a new version of that file.
-```
-
-We'll use <kbd>Shift+↵</kbd> to add a few empty lines, then paste the code from our clipboard. After submitting, the AI chat should return a large code block with a new version of `main.js`. Copy it, go back to our text editor, and replace the original `main.js` content.
-
-:::info Code and colors
-Code is plain text. Some tools color it to make it easier to read, and ChatGPT does this by default. Plain text editors usually show code in black and white, and that's completely fine.
-:::
-
-When we're done, we must not forget to _save the change_ with <kbd>Ctrl+S</kbd> or, on macOS, <kbd>Cmd+S</kbd>. Now let's see if the new code works. To run the program, let's go back to Terminal (macOS/Linux) or PowerShell (Windows) and use Apify CLI again:
+We'll approve all changes and go to the command line again:
 
 ```text
 apify run
 ```
 
-If all goes well, the output should be similar to this:
+Now the output of the scraper contains the actual items being scraped, and we can verify that we've successfully changed the format of the prices (they appear at the very end of each line):
 
 ```text
-Run: npm run start
-
-> warehouse-scraper@0.0.1 start
-> node src/main.js
-
-INFO System info {"apifyVersion":"3.6.0","apifyClientVersion":"2.22.2","crawleeVersion":"3.16.0","osType":"Darwin","nodeVersion":"v25.6.1"}
 ...
-INFO CheerioCrawler: Starting the crawler.
 INFO CheerioCrawler: Processing page: https://warehouse-theme-metal.myshopify.com/collections/sales
+INFO CheerioCrawler: Saving dataset item {"name":"JBL Flip 4 Waterproof Portable Bluetooth Speaker","url":"https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker","minPrice":74.95}
+INFO CheerioCrawler: Saving dataset item {"name":"Sony XBR-950G BRAVIA 4K HDR Ultra HD TV","url":"https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv","minPrice":1398}
+INFO CheerioCrawler: Saving dataset item {"name":"Sony SACS9 10\" Active Subwoofer","url":"https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer","minPrice":158}
+INFO CheerioCrawler: Saving dataset item {"name":"Sony PS-HX500 Hi-Res USB Turntable","url":"https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable","minPrice":398}
 ...
-INFO CheerioCrawler: Finished!
-INFO Total products collected: 24
 ```
 
-This output says `Total products collected: 24`. The Sales page displays 24 products per page and contains 50 products in total.
+Now let's push the changes back to Apify, so that the scheduled scraping on the platform can benefit from the improvements we've made locally on our computer.
+
+:::tip Automatically approving changes
 
-Depending on whether ChatGPT decided to walk through all pages or scrape just the first one, we might get 24 or more products. For now, any sign that it scraped products is good news.
+If you grow tired of approvals, you can enable _auto-keep_: go to **Cursor** → **Settings…** → **Cursor Settings** → **Agents** → **Applying Changes** and turn off **Inline Diffs**.
 
-:::caution Debugging
-If our program crashes instead, let's copy the error message, send it to our ChatGPT conversation, and ask for a fix.
 :::
 
-## Exporting to CSV
+## Pushing Actor to Apify
 
-Our program likely works, but we haven't seen the data yet. Let's add a CSV export. CSV is a format most data apps can read, including Microsoft Excel, Google Sheets, and Apple Numbers. Let's continue our ChatGPT conversation with:
+To replace the Actor files living on the Apify platform with the ones we have locally, we can run the following command:
 
 ```text
-Before the program ends, I want it to export all data
-as "dataset.csv" in the current working directory.
+apify push
 ```
 
-ChatGPT should return a new code block with CSV export added. Let's replace `main.js` with that version and save our changes. Then let's run the scraper again:
+The command can take a while to finish, because it also immediately triggers a build. Once it's done, the new version of the Actor is ready to be run. The output of the command ends with these two lines:
 
 ```text
-apify run
+...
+Actor detail https://console.apify.com/actors/EL7U7aNddXOzwEJ66
+Success: Actor was deployed to Apify cloud and built there.
 ```
 
-In the project directory, a new file called `dataset.csv` should emerge. We can use any of the programs mentioned earlier to check what's inside:
-
-| productName | productUrl | price |
-| --- | --- | --- |
-| JBL Flip 4 Waterproof Portable Bluetooth Speaker | https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker | Sale price$74.95 |
-| Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv | Sale priceFrom $1,398.00 |
-| Sony SACS9 10" Active Subwoofer | https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer | Sale price$158.00 |
+We'll follow the link in our browser, and in the Apify interface, we'll click the **Start** button. Soon we should see items popping up in the **Output** section. For a full overview, let's switch to **All fields** again:
 
-…and so on. Looks good!
+![Modified Apify output](images/apify-output-modified.webp)
 
-Well, does it? If we look closely, the prices include extra text, which isn't ideal. We'll improve this in one of the next lessons. We'll also improve the workflow so we don't have to keep copying and pasting.
+We've done it: the prices are saved as numbers!
 
-Despite a few flaws, we've successfully created a first working prototype of a price-watching app with no coding knowledge. And with a bit of extra command-line work, we now have something we can deploy to a platform where it can run regularly and reliably. In the next lesson, we'll do exactly that.
+:::tip Specifying output schema
 
--->
+If we don't want to always click **All fields** to see full items, we can specify an [output schema](https://docs.apify.com/platform/actors/development/actor-definition/output-schema) so that the platform knows what to expect and how to display it in the interface. With Cursor, such a change is just a single prompt away:
 
-<!--
-Explaining benefits (delegation and independent work, AGENTS.md). Getting environment ready, learning the ropes with a GUI/TUI. Using the `apify` CLI to start a project. Creating a basic scraper which does what we need.
+```text
+Change the output schema of the Actor
+so that it represents the items being
+saved in the best way in the Apify interface.
+```
 
-In lesson 3, students would try to make changes via ChatGPT and see that it gets tedious, which leads to introducing an agent-based IDE to work inside the template more comfortably.
+:::
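
For reference, the agent would implement this by adding a dataset schema to the Actor definition. Based on the Apify output schema documentation, the generated file might look something like the following sketch (the view name and field labels are our guesses; the agent's version may differ):

```json
{
  "actorSpecification": 1,
  "views": {
    "overview": {
      "title": "Overview",
      "transformation": {
        "fields": ["name", "url", "minPrice"]
      },
      "display": {
        "component": "table",
        "properties": {
          "name": { "label": "Name", "format": "text" },
          "url": { "label": "URL", "format": "link" },
          "minPrice": { "label": "Min price", "format": "number" }
        }
      }
    }
  }
}
```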
 
-The lesson should use Cursor (or Google Antigravity). Only if it truly scales to zero as they claim and it is not required to have a paid account to try an agent. Minimal friction, just install – beats any other decision factors.
+## Wrapping up
 
-If the paragraph above turns out being a wrong direction, we should use VS Code and tell people to spend $10 to try Copilot. VS Code is mainstream. Paying for Copilot is the cheapest agent offering, and it's quite powerful.
--->
+This lesson involved a lot of installing and setting up, but once our environment was ready, we could reap the benefits of making fast changes to our scraper.
 
-<!--
-We'll choose [Cursor](https://cursor.com/), because it has a free plan and it's beginner-friendly.
+With a single prompt, we tackled a significant change in how our app stores prices, and we still didn't need to know any coding.
 
-#### Installing development environment
-Explaining benefits (delegation and independent work, AGENTS.md). Getting environment ready. Use https://docs.apify.com/platform/actors/development/quick-start/build-with-ai
-#### Scraping vendor names
-Learning the ropes with a GUI/TUI, prompting the agent to update the code so that it scrapes vendor names. Run the program again, get better results.
+To improve our project further, we ask the agent to perform a change, review and approve its work, then run `apify run` in the command line to verify that it works, and finally `apify push` to upload our Actor files to Apify.
 
-Teaser: Explain why this is fragile. In the next lesson we'll learn how to develop features of the scraper in a robust way by first specifying them as documentation.
--->
+In the next lesson, we'll look at developing our scraper by documenting how it should behave, instead of prompting the AI agent feature by feature with no record of our intentions.

sources/academy/platform/scraping_with_apify_and_ai/03_docs_driven_prompting.md

Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ unlisted: true
 ---
 
 <!--
+explain AGENTS.md
 Improving the README, e.g. input output. Pointing the agent to the README and turning the design to reality.
 -->
 
3 binary files changed (98.6 KB, 46.1 KB, 142 KB)
