docs/deployment/apify_platform_init_exit.ts (1 addition & 1 deletion)
@@ -13,7 +13,7 @@ const crawler = new CheerioCrawler({

// Add URLs that match the provided pattern.
await enqueueLinks({
globs: ['https://www.iana.org/*'],
include: ['https://www.iana.org/*'],
});

// Save extracted data to dataset.
docs/deployment/apify_platform_main.ts (1 addition & 1 deletion)
@@ -12,7 +12,7 @@ await Actor.main(async () => {

// Add URLs that match the provided pattern.
await enqueueLinks({
globs: ['https://www.iana.org/*'],
include: ['https://www.iana.org/*'],
});

// Save extracted data to dataset.
docs/examples/crawl_some_links.mdx (1 addition & 1 deletion)
@@ -7,7 +7,7 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';
import ApiLink from '@site/src/components/ApiLink';
import CrawlSource from '!!raw-loader!roa-loader!./crawl_some_links.ts';

This <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink> example uses the <ApiLink to="core/interface/EnqueueLinksOptions#globs">`globs`</ApiLink> property in the <ApiLink to="cheerio-crawler/interface/CheerioCrawlingContext#enqueueLinks">`enqueueLinks()`</ApiLink> method to only add links to the <ApiLink to="core/class/RequestQueue">`RequestQueue`</ApiLink> queue if they match the specified pattern.
This <ApiLink to="cheerio-crawler/class/CheerioCrawler">`CheerioCrawler`</ApiLink> example uses the <ApiLink to="core/interface/EnqueueLinksOptions#include">`include`</ApiLink> property in the <ApiLink to="cheerio-crawler/interface/CheerioCrawlingContext#enqueueLinks">`enqueueLinks()`</ApiLink> method to only add links to the <ApiLink to="core/class/RequestQueue">`RequestQueue`</ApiLink> queue if they match the specified pattern.

<RunnableCodeBlock className="language-js" type="cheerio">
{CrawlSource}
docs/examples/crawl_some_links.ts (1 addition & 1 deletion)
@@ -9,7 +9,7 @@ const crawler = new CheerioCrawler({
log.info(request.url);
// Add some links from page to the crawler's RequestQueue
await enqueueLinks({
globs: ['http?(s)://crawlee.dev/*/*'],
include: ['http?(s)://crawlee.dev/*/*'],
});
},
});
docs/examples/puppeteer_recursive_crawl.ts (1 addition & 1 deletion)
@@ -6,7 +6,7 @@ const crawler = new PuppeteerCrawler({
log.info(`Title of ${request.url}: ${title}`);

await enqueueLinks({
globs: ['http?(s)://www.iana.org/**'],
include: ['http?(s)://www.iana.org/**'],
});
},
maxRequestsPerCrawl: 10,
docs/introduction/03-adding-urls.mdx (4 additions & 4 deletions)
@@ -130,7 +130,7 @@ await enqueueLinks({

### Filter URLs with patterns

For even more control, you can use `globs`, `regexps` and `pseudoUrls` to filter the URLs. Each of those arguments is always an `Array`, but the contents can take on many forms. <ApiLink to="core/interface/EnqueueLinksOptions">See the reference</ApiLink> for more information about them as well as other options.
For even more control, you can use `include` and `exclude` to filter the URLs. Each accepts an `Array` of glob pattern strings, `{ glob: string }` objects, `RegExp` instances, or `{ regexp: RegExp }` objects. <ApiLink to="core/interface/EnqueueLinksOptions">See the reference</ApiLink> for more information about them as well as other options.

:::caution Defaults override

@@ -140,17 +140,17 @@ If you provide one of those options, the default `same-hostname` strategy will *

```ts
await enqueueLinks({
globs: ['http?(s)://apify.com/*/*'],
include: ['http?(s)://apify.com/*/*'],
});
```
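The `include` option accepts four interchangeable pattern forms. The following self-contained sketch illustrates how these forms could be normalized to a common shape; `normalizePattern`, `UrlPatternInput`, and `UrlPatternObject` here are illustrative stand-ins, not Crawlee's actual internals.

```typescript
// Hypothetical normalizer for the four accepted pattern forms:
// glob string, { glob } object, RegExp instance, { regexp } object.
type UrlPatternInput = string | RegExp | { glob: string } | { regexp: RegExp };
type UrlPatternObject = { glob: string } | { regexp: RegExp };

function normalizePattern(input: UrlPatternInput): UrlPatternObject {
    if (typeof input === 'string') return { glob: input };
    if (input instanceof RegExp) return { regexp: input };
    return input; // already a { glob } or { regexp } object
}

const patterns: UrlPatternObject[] = [
    'http?(s)://apify.com/*/*',                // glob string
    { glob: 'http?(s)://apify.com/jobs/*' },   // glob object
    /^https:\/\/apify\.com\/store/,            // RegExp instance
    { regexp: /^https:\/\/apify\.com\/blog/ }, // regexp object
].map(normalizePattern);
```

Remember that glob matching is case-insensitive, so reach for one of the regexp forms when case matters.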

### Transform requests

To have absolute control, we have the <ApiLink to="core/interface/EnqueueLinksOptions/#transformRequestFunction">`transformRequestFunction`</ApiLink>. Just before a new <ApiLink to="core/class/Request">`Request`</ApiLink> is constructed and enqueued to the <ApiLink to="core/class/RequestQueue">`RequestQueue`</ApiLink>, this function can be used to skip it or modify its contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple requests to the queue, and these requests share the same URL, but differ in methods or payloads. Another use case is to dynamically update or create the `userData`.
To have absolute control, we have the <ApiLink to="core/interface/EnqueueLinksOptions/#transformRequestFunction">`transformRequestFunction`</ApiLink>. After request options are filtered by `include`/`exclude` patterns, this function can be used to skip them or modify their contents such as `userData`, `payload` or, most importantly, `uniqueKey`. This is useful when you need to enqueue multiple requests to the queue, and these requests share the same URL, but differ in methods or payloads. Another use case is to dynamically update or create the `userData`.

```ts
await enqueueLinks({
globs: ['http?(s)://apify.com/*/*'],
include: ['http?(s)://apify.com/*/*'],
transformRequestFunction(req) {
// ignore all links ending with `.pdf`
if (req.url.endsWith('.pdf')) return false;
docs/upgrading/upgrading_v3.md (3 additions & 4 deletions)
@@ -188,14 +188,13 @@ One common helper that received more attention is the `enqueueLinks`. As mention

This means we can even call `enqueueLinks()` without any parameters. By default, it will go through all the links found on the current page and filter only those targeting the same subdomain.

Moreover, we can specify patterns the URL should match via globs:
Moreover, we can specify patterns the URL should match via `include`:

```ts
const crawler = new PlaywrightCrawler({
async requestHandler({ enqueueLinks }) {
await enqueueLinks({
globs: ['https://crawlee.dev/*/*'],
// we can also use `regexps` and `pseudoUrls` keys here
include: ['https://crawlee.dev/*/*'],
});
},
});
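The combined filtering semantics can be sketched in isolation. This is a toy model of the documented behavior, not Crawlee's implementation: the real library uses a full glob matcher (the `?(s)` extglob syntax, for instance, is not handled below), and with an empty `include` it falls back to the same-subdomain strategy rather than allowing everything.

```typescript
// Toy glob support: only `*` is understood; matching is case-insensitive,
// mirroring the documented behavior of glob patterns.
function globToRegExp(glob: string): RegExp {
    const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&').replace(/\*/g, '.*');
    return new RegExp(`^${escaped}$`, 'i');
}

// A URL is enqueued when it matches no exclude pattern and at least one
// include pattern (or when no include patterns are given, in this toy model).
function shouldEnqueue(url: string, include: (string | RegExp)[], exclude: (string | RegExp)[]): boolean {
    const matches = (p: string | RegExp) => (typeof p === 'string' ? globToRegExp(p) : p).test(url);
    if (exclude.some(matches)) return false; // exclude always wins
    return include.length === 0 || include.some(matches);
}

// Example: keep two-segment docs pages, drop the changelog.
shouldEnqueue('https://crawlee.dev/docs/intro', ['https://crawlee.dev/*/*'], []); // true
shouldEnqueue('https://crawlee.dev/changelog', [], ['https://crawlee.dev/changelog*']); // false
```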
@@ -231,7 +230,7 @@ Labeling requests used to work via the `Request.userData` object. With Crawlee,
async requestHandler({ request, enqueueLinks }) {
if (request.label !== 'DETAIL') {
await enqueueLinks({
globs: ['...'],
include: ['...'],
label: 'DETAIL',
});
}
packages/core/package.json (0 additions & 1 deletion)
@@ -50,7 +50,6 @@
"@apify/consts": "^2.41.0",
"@apify/datastructures": "^2.0.3",
"@apify/log": "^2.5.18",
"@apify/pseudo_url": "^2.0.59",
"@apify/timeout": "^0.3.2",
"@apify/utilities": "^2.15.5",
"@crawlee/memory-storage": "workspace:*",
packages/core/src/crawlers/crawler_commons.ts (4 additions & 6 deletions)
@@ -58,8 +58,7 @@ export interface RestrictedCrawlingContext<UserData extends Dictionary = Diction
* This function automatically finds and enqueues links from the current page, adding them to the {@apilink RequestQueue}
* currently used by the crawler.
*
* Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions
* and override settings of the enqueued {@apilink Request} objects.
* Optionally, the function allows you to filter the target links' URLs using an array of glob or regexp patterns.
*
* Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links) example
* for more details regarding its usage.
@@ -69,7 +68,7 @@ export interface RestrictedCrawlingContext<UserData extends Dictionary = Diction
* ```ts
* async requestHandler({ enqueueLinks }) {
* await enqueueLinks({
* globs: [
* include: [
* 'https://www.example.com/handbags/*',
* ],
* });
@@ -116,8 +115,7 @@ export interface CrawlingContext<UserData extends Dictionary = Dictionary> exten
* This function automatically finds and enqueues links from the current page, adding them to the {@apilink RequestQueue}
* currently used by the crawler.
*
* Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions
* and override settings of the enqueued {@apilink Request} objects.
* Optionally, the function allows you to filter the target links' URLs using an array of glob or regexp patterns.
*
* Check out the [Crawl a website with relative links](https://crawlee.dev/js/docs/examples/crawl-relative-links) example
* for more details regarding its usage.
@@ -127,7 +125,7 @@ export interface CrawlingContext<UserData extends Dictionary = Dictionary> exten
* ```ts
* async requestHandler({ enqueueLinks }) {
* await enqueueLinks({
* globs: [
* include: [
* 'https://www.example.com/handbags/*',
* ],
* });
packages/core/src/enqueue_links/enqueue_links.ts (27 additions & 98 deletions)
@@ -6,27 +6,22 @@ import type { SetRequired } from 'type-fest';

import type { RequestOptions } from '../request.js';
import { Request } from '../request.js';
import { serviceLocator } from '../service_locator.js';
import type {
AddRequestsBatchedOptions,
AddRequestsBatchedResult,
RequestProvider,
RequestQueueOperationOptions,
} from '../storages/request_provider.js';
import type {
GlobInput,
PseudoUrlInput,
RegExpInput,
RequestTransform,
SkippedRequestCallback,
SkippedRequestReason,
UrlPatternInput,
UrlPatternObject,
} from './shared.js';
import {
applyRequestTransform,
constructGlobObjectsFromGlobs,
constructRegExpObjectsFromPseudoUrls,
constructRegExpObjectsFromRegExps,
constructUrlPatternObjects,
createRequestOptions,
filterRequestOptionsByPatterns,
} from './shared.js';
@@ -50,8 +45,7 @@ export interface EnqueueLinksOptions extends RequestQueueOperationOptions {
/**
* Sets {@apilink Request.label} for newly enqueued requests.
*
* This option has the lowest priority and can be overwritten by request options
* specified in `globs`, `regexps`, or `pseudoUrls` objects, as well as by `transformRequestFunction`.
* Can be overwritten by `transformRequestFunction`.
*/
label?: string;

@@ -71,65 +65,30 @@ export interface EnqueueLinksOptions extends RequestQueueOperationOptions {
baseUrl?: string;

/**
* An array of glob pattern strings or plain objects
* containing glob pattern strings matching the URLs to be enqueued.
* An array of URL patterns that URLs must match to be enqueued.
*
* The plain objects must include at least the `glob` property, which holds the glob pattern string.
* All remaining keys will be used as request options for the corresponding enqueued {@apilink Request} objects.
*
* The matching is always case-insensitive.
* If you need case-sensitive matching, use `regexps` property directly.
*
* If `globs` is an empty array or `undefined`, and `regexps` are also not defined, then the function
* enqueues the links with the same subdomain.
*/
globs?: readonly GlobInput[];

/**
* An array of glob pattern strings, regexp patterns or plain objects
* containing patterns matching URLs that will **never** be enqueued.
*
* The plain objects must include either the `glob` property or the `regexp` property.
* Accepts glob pattern strings, `{ glob: string }` objects, `RegExp` instances, or `{ regexp: RegExp }` objects.
*
* Glob matching is always case-insensitive.
* If you need case-sensitive matching, provide a regexp.
*/
exclude?: readonly (GlobInput | RegExpInput)[];

/**
* An array of regular expressions or plain objects
* containing regular expressions matching the URLs to be enqueued.
*
* The plain objects must include at least the `regexp` property, which holds the regular expression.
* All remaining keys will be used as request options for the corresponding enqueued {@apilink Request} objects.
* If you need case-sensitive matching, use a `RegExp`.
*
* If `regexps` is an empty array or `undefined`, and `globs` are also not defined, then the function
* If `include` is an empty array or `undefined`, then the function
* enqueues the links with the same subdomain.
*/
regexps?: readonly RegExpInput[];
include?: readonly UrlPatternInput[];

/**
* *NOTE:* In future versions of SDK the options will be removed.
* Please use `globs` or `regexps` instead.
* An array of URL patterns. Matching URLs will **not** be enqueued.
*
* An array of {@apilink PseudoUrl} strings or plain objects
* containing {@apilink PseudoUrl} strings matching the URLs to be enqueued.
* Accepts glob pattern strings, `{ glob: string }` objects, `RegExp` instances, or `{ regexp: RegExp }` objects.
*
* The plain objects must include at least the `purl` property, which holds the pseudo-URL string.
* All remaining keys will be used as request options for the corresponding enqueued {@apilink Request} objects.
*
* With a pseudo-URL string, the matching is always case-insensitive.
* If you need case-sensitive matching, use `regexps` property directly.
*
* If `pseudoUrls` is an empty array or `undefined`, then the function
* enqueues the links with the same subdomain.
*
* @deprecated prefer using `globs` or `regexps` instead
* Glob matching is always case-insensitive.
* If you need case-sensitive matching, use a `RegExp`.
*/
pseudoUrls?: readonly PseudoUrlInput[];
exclude?: readonly UrlPatternInput[];

/**
* After request options are filtered by patterns, this function can be used
* After request options are filtered by `include`/`exclude` patterns, this function can be used
* to remove them or modify their contents such as `userData`, `payload` or, most importantly `uniqueKey`. This is useful
* when you need to enqueue multiple `Requests` to the queue that share the same URL, but differ in methods or payloads,
* or to dynamically update or create `userData`.
@@ -148,8 +107,8 @@ export interface EnqueueLinksOptions extends RequestQueueOperationOptions {
* }
* ```
*
* Note that `transformRequestFunction` has the highest priority and can overwrite request options
* specified in `globs`, `regexps`, or `pseudoUrls` objects, as well as the global `label` option.
* Note that `transformRequestFunction` has the highest priority and can overwrite
* the global `label` option.
*
* The function receives a {@apilink RequestOptions} object and can return either:
* - The modified {@apilink RequestOptions} object
@@ -259,8 +218,7 @@ export enum EnqueueStrategy {
* This function enqueues the urls provided to the {@apilink RequestQueue} provided. If you want to automatically find and enqueue links,
* you should use the context-aware `enqueueLinks` function provided on the crawler contexts.
*
* Optionally, the function allows you to filter the target links' URLs using an array of globs or regular expressions
* and override settings of the enqueued {@apilink Request} objects.
* Optionally, the function allows you to filter the target links' URLs using an array of glob or regexp patterns.
*
* **Example usage**
*
@@ -269,7 +227,7 @@ export enum EnqueueStrategy {
* urls: aListOfFoundUrls,
* requestQueue,
* selector: 'a.product-detail',
* globs: [
* include: [
* 'https://www.example.com/handbags/*',
* 'https://www.example.com/purses/*'
* ],
@@ -298,6 +256,8 @@ export async function enqueueLinks(
);
}

const urlPatternValidator = ow.any(ow.string, ow.regExp, ow.object.hasKeys('glob'), ow.object.hasKeys('regexp'));

ow(
options as any,
ow.object.exactShape({
@@ -313,12 +273,8 @@
baseUrl: ow.optional.string,
userData: ow.optional.object,
label: ow.optional.string,
pseudoUrls: ow.optional.array.ofType(ow.any(ow.string, ow.object.hasKeys('purl'))),
globs: ow.optional.array.ofType(ow.any(ow.string, ow.object.hasKeys('glob'))),
exclude: ow.optional.array.ofType(
ow.any(ow.string, ow.regExp, ow.object.hasKeys('glob'), ow.object.hasKeys('regexp')),
),
regexps: ow.optional.array.ofType(ow.any(ow.regExp, ow.object.hasKeys('regexp'))),
include: ow.optional.array.ofType(urlPatternValidator),
exclude: ow.optional.array.ofType(urlPatternValidator),
transformRequestFunction: ow.optional.function,
strategy: ow.optional.string.oneOf(Object.values(EnqueueStrategy)),
waitForAllRequestsToBeAdded: ow.optional.boolean,
@@ -329,43 +285,17 @@
requestQueue,
limit,
urls,
// oxlint-disable-next-line typescript/no-deprecated -- still accepted for backwards compat
pseudoUrls,
include,
exclude,
globs,
regexps,
transformRequestFunction,
forefront,
waitForAllRequestsToBeAdded,
robotsTxtFile,
onSkippedRequest,
} = options;

const urlExcludePatternObjects: UrlPatternObject[] = [];
const urlPatternObjects: UrlPatternObject[] = [];

if (exclude?.length) {
for (const excl of exclude) {
if (typeof excl === 'string' || 'glob' in excl) {
urlExcludePatternObjects.push(...constructGlobObjectsFromGlobs([excl]));
} else if (excl instanceof RegExp || 'regexp' in excl) {
urlExcludePatternObjects.push(...constructRegExpObjectsFromRegExps([excl]));
}
}
}

if (pseudoUrls?.length) {
serviceLocator.getLogger().deprecated('`pseudoUrls` option is deprecated, use `globs` or `regexps` instead');
urlPatternObjects.push(...constructRegExpObjectsFromPseudoUrls(pseudoUrls));
}

if (globs?.length) {
urlPatternObjects.push(...constructGlobObjectsFromGlobs(globs));
}

if (regexps?.length) {
urlPatternObjects.push(...constructRegExpObjectsFromRegExps(regexps));
}
const urlExcludePatternObjects: UrlPatternObject[] = exclude?.length ? constructUrlPatternObjects(exclude) : [];
const urlPatternObjects: UrlPatternObject[] = include?.length ? constructUrlPatternObjects(include) : [];

if (!urlPatternObjects.length) {
options.strategy ??= EnqueueStrategy.SameHostname;
@@ -450,8 +380,7 @@ export async function enqueueLinks(
async function createFilteredRequests() {
const skippedRequests: string[] = [];

// Step 1: Filter request options by exclude patterns, user patterns (globs/regexps), and strategy patterns.
// Pattern-level options (label, userData, method, etc.) are merged during this step.
// Step 1: Filter request options by exclude patterns, user include patterns, and strategy patterns.
let filteredOptions: RequestOptions[];
if (urlPatternObjects.length === 0) {
filteredOptions = filterRequestOptionsByPatterns(
@@ -570,7 +499,7 @@ export interface ResolveBaseUrl {
}

/**
* Internal function that changes the enqueue globs to match both http and https
* Internal function that changes the enqueue glob patterns to match both http and https
*/
function ignoreHttpSchema(pattern: string): string {
return pattern.replace(/^(https?):\/\//, 'http{s,}://');
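This helper is small enough to exercise in isolation. The function body below is copied verbatim from the diff; the example calls show how a leading `http://` or `https://` becomes the brace glob `http{s,}://` so a single pattern covers both schemes.

```typescript
// Verbatim reproduction of the internal helper shown above: rewrites a
// leading http:// or https:// into the brace glob http{s,}://.
function ignoreHttpSchema(pattern: string): string {
    return pattern.replace(/^(https?):\/\//, 'http{s,}://');
}

ignoreHttpSchema('https://example.com/*'); // 'http{s,}://example.com/*'
ignoreHttpSchema('ftp://example.com');     // unchanged: no http(s) prefix to rewrite
```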