
Commit 4dbf9ef

fix: update doc examples and lint errors for include/exclude API
- Replace remaining `globs` → `include` in docs (4 examples + 2 guides)
- Convert `type` to `interface` for UrlPatternObject, GlobObject, RegExpObject (ESLint)
1 parent 223eb13 commit 4dbf9ef

7 files changed

Lines changed: 18 additions & 15 deletions
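The rename changes only the option name, not the matching behavior. As a rough illustration of what an `include` glob filter does (a toy sketch, not Crawlee's implementation — `EnqueueLinksOptions` here is a stand-in type, and this matcher handles only the `*` wildcard, while real globs also support forms like `http?(s)`):

```typescript
// Stand-in for illustration, not the real Crawlee type.
interface EnqueueLinksOptions {
    include?: string[]; // formerly `globs`
    exclude?: string[];
}

// Toy matcher: keep a URL if it matches any `include` glob.
function matchesInclude(url: string, options: EnqueueLinksOptions): boolean {
    if (!options.include) return true; // no patterns: keep everything
    return options.include.some((glob) => {
        // Escape regex metacharacters, then turn `*` into `.*`.
        const pattern = glob
            .split('*')
            .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
            .join('.*');
        return new RegExp(`^${pattern}$`).test(url);
    });
}

console.log(matchesInclude('https://www.iana.org/domains', {
    include: ['https://www.iana.org/*'],
})); // true
```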


docs/deployment/apify_platform_init_exit.ts

Lines changed: 1 addition & 1 deletion

@@ -13,7 +13,7 @@ const crawler = new CheerioCrawler({
 
         // Add URLs that match the provided pattern.
         await enqueueLinks({
-            globs: ['https://www.iana.org/*'],
+            include: ['https://www.iana.org/*'],
         });
 
         // Save extracted data to dataset.

docs/deployment/apify_platform_main.ts

Lines changed: 1 addition & 1 deletion

@@ -12,7 +12,7 @@ await Actor.main(async () => {
 
     // Add URLs that match the provided pattern.
     await enqueueLinks({
-        globs: ['https://www.iana.org/*'],
+        include: ['https://www.iana.org/*'],
     });
 
     // Save extracted data to dataset.

docs/examples/crawl_some_links.ts

Lines changed: 1 addition & 1 deletion

@@ -9,7 +9,7 @@ const crawler = new CheerioCrawler({
         log.info(request.url);
         // Add some links from page to the crawler's RequestQueue
         await enqueueLinks({
-            globs: ['http?(s)://crawlee.dev/*/*'],
+            include: ['http?(s)://crawlee.dev/*/*'],
         });
     },
 });

docs/examples/puppeteer_recursive_crawl.ts

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@ const crawler = new PuppeteerCrawler({
         log.info(`Title of ${request.url}: ${title}`);
 
         await enqueueLinks({
-            globs: ['http?(s)://www.iana.org/**'],
+            include: ['http?(s)://www.iana.org/**'],
         });
     },
     maxRequestsPerCrawl: 10,

docs/introduction/03-adding-urls.mdx

Lines changed: 3 additions & 3 deletions

@@ -140,17 +140,17 @@ If you provide one of those options, the default `same-hostname` strategy will *
 
 ```ts
 await enqueueLinks({
-    globs: ['http?(s)://apify.com/*/*'],
+    include: ['http?(s)://apify.com/*/*'],
 });
 ```
 
 ### Transform requests
 
-To have absolute control, we have the <ApiLink to="core/interface/EnqueueLinksOptions/#transformRequestFunction">`transformRequestFunction`</ApiLink>. Just before a new <ApiLink to="core/class/Request">`Request`</ApiLink> is constructed and enqueued to the <ApiLink to="core/class/RequestQueue">`RequestQueue`</ApiLink>, this function can be used to skip it or modify its contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple requests to the queue, and these requests share the same URL, but differ in methods or payloads. Another use case is to dynamically update or create the `userData`.
+To have absolute control, we have the <ApiLink to="core/interface/EnqueueLinksOptions/#transformRequestFunction">`transformRequestFunction`</ApiLink>. After request options are filtered by `include`/`exclude` patterns, this function can be used to skip them or modify their contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple requests to the queue, and these requests share the same URL, but differ in methods or payloads. Another use case is to dynamically update or create the `userData`.
 
 ```ts
 await enqueueLinks({
-    globs: ['http?(s)://apify.com/*/*'],
+    include: ['http?(s)://apify.com/*/*'],
     transformRequestFunction(req) {
         // ignore all links ending with `.pdf`
         if (req.url.endsWith('.pdf')) return false;

docs/upgrading/upgrading_v3.md

Lines changed: 3 additions & 4 deletions

@@ -188,14 +188,13 @@ One common helper that received more attention is the `enqueueLinks`. As mention
 
 This means we can even call `enqueueLinks()` without any parameters. By default, it will go through all the links found on current page and filter only those targeting the same subdomain.
 
-Moreover, we can specify patterns the URL should match via globs:
+Moreover, we can specify patterns the URL should match via `include`:
 
 ```ts
 const crawler = new PlaywrightCrawler({
     async requestHandler({ enqueueLinks }) {
         await enqueueLinks({
-            globs: ['https://crawlee.dev/*/*'],
-            // we can also use `regexps` and `pseudoUrls` keys here
+            include: ['https://crawlee.dev/*/*'],
         });
     },
 });

@@ -231,7 +230,7 @@ Labeling requests used to work via the `Request.userData` object. With Crawlee,
     async requestHandler({ request, enqueueLinks }) {
         if (request.label !== 'DETAIL') {
             await enqueueLinks({
-                globs: ['...'],
+                include: ['...'],
                 label: 'DETAIL',
             });
         }

packages/core/src/enqueue_links/shared.ts

Lines changed: 8 additions & 4 deletions

@@ -18,16 +18,20 @@ const MAX_ENQUEUE_LINKS_CACHE_SIZE = 1000;
  */
 const enqueueLinksPatternCache = new Map();
 
-export type UrlPatternObject = {
+export interface UrlPatternObject {
     glob?: string;
     regexp?: RegExp;
-};
+}
 
-export type GlobObject = { glob: string };
+export interface GlobObject {
+    glob: string;
+}
 
 export type GlobInput = string | GlobObject;
 
-export type RegExpObject = { regexp: RegExp };
+export interface RegExpObject {
+    regexp: RegExp;
+}
 
 export type RegExpInput = RegExp | RegExpObject;
 
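Since `GlobInput` and `RegExpInput` accept either the bare form or the object form, callers typically fold them into a single `UrlPatternObject`. A minimal sketch of that normalization, using the shapes declared in the diff above (the helper name `normalizePattern` is hypothetical, not part of the package):

```typescript
// Shapes from shared.ts above.
interface UrlPatternObject { glob?: string; regexp?: RegExp; }
interface GlobObject { glob: string; }
interface RegExpObject { regexp: RegExp; }
type GlobInput = string | GlobObject;
type RegExpInput = RegExp | RegExpObject;

// Hypothetical helper: fold every accepted input shape into UrlPatternObject.
function normalizePattern(input: GlobInput | RegExpInput): UrlPatternObject {
    if (typeof input === 'string') return { glob: input };  // bare glob string
    if (input instanceof RegExp) return { regexp: input };  // bare RegExp
    return input; // already a GlobObject or RegExpObject
}

console.log(normalizePattern('https://crawlee.dev/*/*')); // { glob: 'https://crawlee.dev/*/*' }
```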
