Commit d1f4c98

B4nan and claude authored
feat: redesign Configuration class for v4 (#3484)
## Summary

- **Replaces `get()`/`set()` with direct property access**: `config.headless` instead of `config.get('headless')`. Both methods are removed (v4 breaking change).
- **Constructor options now take highest priority**: `new Configuration({ headless: false })` works even when `CRAWLEE_HEADLESS=true` is set. New priority: constructor > env vars > crawlee.json > schema defaults.
- **Immutable instances**: assigning to a config property throws `TypeError`.
- **Zod-based field definitions**: schema, env var mapping, and defaults defined in one place (`crawleeConfigFields`).
- **Extensible via subclassing**: subclasses override `protected static fields` to register additional config fields (e.g. Apify SDK).

Closes #3080

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 0d0c58d commit d1f4c98

46 files changed

Lines changed: 74915 additions & 68621 deletions

docs/guides/configuration.mdx

Lines changed: 23 additions & 21 deletions
@@ -15,13 +15,13 @@ There are three ways of changing the configuration parameters:
 - using the `Configuration` class

 You could also combine all the above, but you should keep in mind, that the precedence for these 3 options is the following:
-***`crawlee.json`*** < ***constructor options*** < ***environment variables***.
+***constructor options*** > ***environment variables*** > ***`crawlee.json`***.

-`crawlee.json` is a baseline. The options provided in the `Configuration` constructor will override the options provided in the JSON. Environment variables will override both.
+Constructor options have the highest priority. Environment variables override `crawlee.json`. The JSON file serves as a baseline.

 ## `crawlee.json`

-The first option you could use for configuring Crawlee is `crawlee.json` file. The only thing you need to do is specify the <ApiLink to="core/interface/ConfigurationOptions">`ConfigurationOptions`</ApiLink> in the file, place the file in the root of your project, and Crawlee will use provided options as global configuration.
+The first option you could use for configuring Crawlee is `crawlee.json` file. The only thing you need to do is specify the configuration options in the file, place the file in the root of your project, and Crawlee will use provided options as global configuration. See the <ApiLink to="core/class/Configuration">`Configuration`</ApiLink> class for the full list of supported options.

 ```json title="crawlee.json"
 {
@@ -133,24 +133,28 @@ the autoscaling feature will only use up to 2048 MB of memory.

 ## Configuration class

-The last option to adjust Crawlee configuration is to use the <ApiLink to="core/class/Configuration">`Configuration`</ApiLink> class in the code.
+The last option to adjust Crawlee configuration is to use the <ApiLink to="core/class/Configuration">`Configuration`</ApiLink> class in the code. Configuration is immutable - values are set via the constructor and cannot be changed afterwards.

 ### Global Configuration

-By default, there is a global singleton instance of `Configuration` class, it is used by the crawlers and some other classes that depend on a configurable behavior. In most cases you don't need to adjust any options there, but if needed - you can get access to it via <ApiLink to="core/class/Configuration#getGlobalConfig">`Configuration.getGlobalConfig()`</ApiLink> function. Now you can easily <ApiLink to="core/class/Configuration#get">`get`</ApiLink> and <ApiLink to="core/class/Configuration#set">`set`</ApiLink> the <ApiLink to="core/interface/ConfigurationOptions">`ConfigurationOptions`</ApiLink>.
+By default, there is a global singleton instance of `Configuration` class, it is used by the crawlers and some other classes that depend on a configurable behavior. In most cases you don't need to adjust any options there, but if needed - you can access it via <ApiLink to="core/class/Configuration#getGlobalConfig">`Configuration.getGlobalConfig()`</ApiLink>, which delegates to the global <ApiLink to="core/class/ServiceLocator">`serviceLocator`</ApiLink> - the single source of truth for Crawlee's shared services (for example the configuration, event manager, storage client, and logger). You can also reach the same instance directly via `serviceLocator.getConfiguration()` or swap services globally with `serviceLocator.setConfiguration(...)` before any crawler is created. Configuration values are accessible directly as properties on the instance.

 ```js
 import { CheerioCrawler, Configuration, sleep } from 'crawlee';

 // Get the global configuration
 const config = Configuration.getGlobalConfig();
-// Set the 'persistStateIntervalMillis' option
-// of global configuration to 10 seconds
-config.set('persistStateIntervalMillis', 10_000);
+// Access configuration values directly as properties
+console.log(config.persistStateIntervalMillis);

-// Note, that we are not passing the configuration to the crawler
-// as it's using the global configuration
-const crawler = new CheerioCrawler();
+// To use custom configuration values, create a new Configuration instance
+const configuration = new Configuration({
+    // Set the 'persistStateIntervalMillis' option to 10 seconds
+    persistStateIntervalMillis: 10_000,
+});
+
+// Pass the configuration to the crawler
+const crawler = new CheerioCrawler({ configuration });

 crawler.router.addDefaultHandler(async ({ request }) => {
     // For the first request we wait for 5 seconds,
@@ -170,15 +174,13 @@ crawler.router.addDefaultHandler(async ({ request }) => {
 await crawler.run(['https://www.example.com/1']);
 ```

-This is pretty much the same example we used for showing `crawlee.json` usage,
-but now we're using the global configuration, which is the only difference.
-If you run this example - you will find the `SDK_CRAWLER_STATISTICS` file in default Key-Value store as before,
-which would show the same number of finishes requests (one) and the same crawler runtime (~10 seconds).
-This confirms that provided parameters worked: the state was persisted after 10 seconds, as it was set in the global configuration.
+If you run this example - you will find the `SDK_CRAWLER_STATISTICS` file in default Key-Value store,
+which would show the same number of finished requests (one) and the same crawler runtime (~10 seconds).
+This confirms that provided parameters worked: the state was persisted after 10 seconds, as it was set in the configuration.

 :::note

-After running the same example with commented two lines of code related to `Configuration` there will be
+After running the same example without the custom configuration, there will be
 no `SDK_CRAWLER_STATISTICS` file stored in the default Key-Value store:
 as we did not change the `persistStateIntervalMillis`, Crawlee used the default value of 60 seconds,
 and the crawler was forcefully aborted after ~15 seconds of run time before it persisted the state for the first time.
@@ -187,19 +189,19 @@ and the crawler was forcefully aborted after ~15 seconds of run time before it p

 ### Custom configuration

-Alternatively, you can create a custom configuration. In this case you need to pass it to the class that is going to use it, e.g. to the crawler. Let's adjust the previous example:
+You can create a custom configuration and pass it to the crawler via the `configuration` option:

 ```js
 import { CheerioCrawler, Configuration, sleep } from 'crawlee';

 // Create new configuration
-const config = new Configuration({
+const configuration = new Configuration({
     // Set the 'persistStateIntervalMillis' option to 10 seconds
     persistStateIntervalMillis: 10_000,
 });

-// Now we need to pass the configuration to the crawler
-const crawler = new CheerioCrawler({}, config);
+// Pass the configuration to the crawler
+const crawler = new CheerioCrawler({ configuration });

 crawler.router.addDefaultHandler(async ({ request }) => {
     // for the first request we wait for 5 seconds,

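The precedence documented in this file (constructor options > environment variables > `crawlee.json` > schema defaults) can be modelled with a layered merge. The sketch below is a simplified illustration and not Crawlee's implementation: `resolveOptions`, `optionsFromEnv`, and the default values are hypothetical, and only `CRAWLEE_HEADLESS` is read from the environment object passed in.

```typescript
// Illustrative model of the documented precedence; not Crawlee's actual code.
interface Options {
    headless?: boolean;
    persistStateIntervalMillis?: number;
}

type Env = Record<string, string | undefined>;

// Hypothetical schema defaults, chosen only for this sketch.
const SCHEMA_DEFAULTS: Required<Options> = {
    headless: true,
    persistStateIntervalMillis: 60_000,
};

// Translate the environment variables this sketch knows about into options.
function optionsFromEnv(env: Env): Options {
    const result: Options = {};
    const rawHeadless = env['CRAWLEE_HEADLESS'];
    if (rawHeadless !== undefined) {
        result.headless = rawHeadless === 'true' || rawHeadless === '1';
    }
    return result;
}

// Later spreads win, so the sources are listed from lowest to highest priority.
function resolveOptions(constructorOptions: Options, env: Env, jsonFile: Options): Required<Options> {
    return {
        ...SCHEMA_DEFAULTS,
        ...jsonFile,
        ...optionsFromEnv(env),
        ...constructorOptions,
    };
}

// The constructor value beats the environment variable; the JSON value beats the default.
const resolved = resolveOptions(
    { headless: false },
    { CRAWLEE_HEADLESS: 'true' },
    { persistStateIntervalMillis: 10_000 },
);
console.log(resolved);
```

Passing the environment in as a plain object keeps the sketch free of global state, which also makes each layer easy to test in isolation.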
docs/guides/parallel-scraping/parallel-scraper.mjs

Lines changed: 3 additions & 5 deletions
@@ -73,15 +73,13 @@ if (!process.env.IN_WORKER_THREAD) {
     // or a configuration option. This is just for show 😈
     workerLogger.setLevel(log.LEVELS.DEBUG);

-    // Disable the automatic purge on start
-    // This is needed when running locally, as otherwise multiple processes will try to clear the default storage (and that will cause clashes)
-    Configuration.set('purgeOnStart', false);
-
     // Get the request queue
     const requestQueue = await getOrInitQueue(false);

-    // Configure crawlee to store the worker-specific data in a separate directory (needs to be done AFTER the queue is initialized when running locally)
+    // Disable the automatic purge on start and configure crawlee to store the worker-specific data in a separate directory
+    // (needs to be done AFTER the queue is initialized when running locally)
     const config = new Configuration({
+        purgeOnStart: false,
         storageClientOptions: {
             localDataDirectory: `./storage/worker-${process.env.WORKER_INDEX}`,
         },

docs/guides/parallel-scraping/parallel-scraping.mdx

Lines changed: 5 additions & 8 deletions
@@ -132,22 +132,19 @@ We use this to ensure the parent process stays alive until all the worker proces

 There are three steps we want to do for the worker processes:

-- ensure the default storages do **not** get purged on start, as otherwise we'd lose the queue we prepared
 - get the queue that supports locking from the same location as the parent process
-- initialize a special storage for worker processes so they do not collide with each other
+- ensure the default storages do **not** get purged on start, as otherwise we'd lose the queue we prepared, and initialize a special storage for worker processes so they do not collide with each other

 In order, that's what these lines do:

 ```javascript title="src/parallel-scraper.mjs"
-// Disable the automatic purge on start (step 1)
-// This is needed when running locally, as otherwise multiple processes will try to clear the default storage (and that will cause clashes)
-Configuration.set('purgeOnStart', false);
-
-// Get the request queue from the parent process (step 2)
+// Get the request queue from the parent process (step 1)
 const requestQueue = await getOrInitQueue(false);

-// Configure crawlee to store the worker-specific data in a separate directory (needs to be done AFTER the queue is initialized when running locally) (step 3)
+// Disable the automatic purge on start and configure crawlee to store the worker-specific data
+// in a separate directory (needs to be done AFTER the queue is initialized when running locally) (step 2)
 const config = new Configuration({
+    purgeOnStart: false,
     storageClientOptions: {
         localDataDirectory: `./storage/worker-${process.env.WORKER_INDEX}`,
     },

docs/upgrading/upgrading_v4.md

Lines changed: 36 additions & 0 deletions
@@ -55,6 +55,7 @@ The crawler following options are removed:
 - `FileDownloadOptions.streamHandler` - streaming should now be handled directly in the `requestHandler` instead
 - `playwrightUtils.registerUtilsToContext` and `puppeteerUtils.registerUtilsToContext` - this is now added to the context via `ContextPipeline` composition
 - `puppeteerUtils.blockResources` and `puppeteerUtils.cacheResponses` (deprecated)
+- `Configuration.systemInfoV2` / `CRAWLEE_SYSTEM_INFO_V2` environment variable - the v2 behavior is now the default (see [Available resource detection](#available-resource-detection))

 ### The protected `BasicCrawler.crawlingContexts` map is removed

@@ -154,6 +155,40 @@ The `KeyValueStore.getPublicUrl` method is now asynchronous and reads the public

 The `preNavigationHooks` option in `HttpCrawler` subclasses no longer accepts the `gotOptions` object as a second parameter. Modify the `crawlingContext` fields (e.g. `.request`) directly instead.

+## Configuration class redesign
+
+The `Configuration` class has been redesigned for v4. The main changes are:
+
+### Direct property access replaces `get()` and `set()`
+
+**Before:**
+```ts
+const config = Configuration.getGlobalConfig();
+config.set('persistStateIntervalMillis', 10_000);
+const headless = config.get('headless');
+```
+
+**After:**
+```ts
+// Configuration is now immutable - set options via the constructor
+const config = new Configuration({ persistStateIntervalMillis: 10_000 });
+const headless = config.headless;
+```
+
+The `get()` and `set()` methods are removed. Access config values directly as properties.
+Configuration instances are immutable - attempting to assign a property throws a `TypeError`.
+
+### Constructor options now take precedence over environment variables
+
+**New priority order (highest to lowest):**
+1. Constructor options
+2. Environment variables
+3. `crawlee.json`
+4. Schema defaults
+
+Previously, environment variables always won. Now `new Configuration({ headless: false })`
+works even when `CRAWLEE_HEADLESS=true` is set.
+
 ## Service management moved from `Configuration` to `ServiceLocator`

 The service management functionality has been extracted from `Configuration` into a new `ServiceLocator` class, following the pattern established in Crawlee for Python.
@@ -166,6 +201,7 @@ The following methods and properties have been removed from `Configuration`:
 - `Configuration.getEventManager()` - moved to `ServiceLocator.getEventManager()`
 - `Configuration.useStorageClient()` - use `ServiceLocator.setStorageClient()` instead
 - `Configuration.useEventManager()` - use `ServiceLocator.setEventManager()` instead
+- `Configuration.resetGlobalState()` - use `serviceLocator.reset()` instead
 - `Configuration.storageManagers` - moved to `ServiceLocator.storageManagers`

 The `EventManager` and `LocalEventManager` constructors now accept an options object for configuring event intervals (e.g. `persistStateIntervalMillis`, `systemInfoIntervalMillis`). You can also use the new `LocalEventManager.fromConfig()` factory method to create an instance with intervals derived from a `Configuration` object.

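The upgrade notes in this file state that assigning to a configuration property throws a `TypeError`. One generic way to get that behaviour is a `Proxy` whose write traps always throw; the sketch below shows the idea. How `Configuration` enforces immutability internally is not shown in this diff, so `createImmutable` and `CrawlerSettings` are hypothetical names used only for illustration.

```typescript
// Illustrative only: a generic "assignment throws TypeError" wrapper.
// `CrawlerSettings` and `createImmutable` are hypothetical, not Crawlee APIs.
interface CrawlerSettings {
    headless: boolean;
    persistStateIntervalMillis: number;
}

function createImmutable(values: CrawlerSettings): Readonly<CrawlerSettings> {
    // Copy first so later changes to the caller's object cannot leak in.
    return new Proxy({ ...values }, {
        set(_target, property) {
            throw new TypeError(`Cannot assign to read-only property '${String(property)}'`);
        },
        deleteProperty(_target, property) {
            throw new TypeError(`Cannot delete read-only property '${String(property)}'`);
        },
    });
}

const settings = createImmutable({ headless: true, persistStateIntervalMillis: 60_000 });

// Reads work as plain property access.
console.log(settings.headless);

// A changed value requires a new instance, mirroring `new Configuration({ ... })`.
const adjusted = createImmutable({ ...settings, persistStateIntervalMillis: 10_000 });
console.log(adjusted.persistStateIntervalMillis);
```

Throwing from the trap, rather than relying on `Object.freeze`, gives the same error in strict and sloppy mode.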
packages/browser-crawler/src/internals/browser-launcher.ts

Lines changed: 3 additions & 3 deletions
@@ -189,7 +189,7 @@ export abstract class BrowserLauncher<
             ...this.launchOptions,
         };

-        if (this.config.get('disableBrowserSandbox')) {
+        if (this.config.disableBrowserSandbox) {
             launchOptions.args.push('--no-sandbox');
         }

@@ -209,11 +209,11 @@ export abstract class BrowserLauncher<
     }

     protected _getDefaultHeadlessOption(): boolean {
-        return this.config.get('headless')! && !this.config.get('xvfb', false);
+        return this.config.headless && !this.config.xvfb;
     }

     protected _getChromeExecutablePath(): string {
-        return this.config.get('chromeExecutablePath', this._getTypicalChromeExecutablePath());
+        return this.config.chromeExecutablePath ?? this._getTypicalChromeExecutablePath();
     }

     /**

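The hunks in this file swap `get(key, defaultValue)` for property access with `??`. The nullish coalescing operator falls back only when the left side is `null` or `undefined`, so explicitly configured falsy values such as `false`, `0`, or `''` are kept. A minimal standalone sketch, assuming a hypothetical `LauncherConfig` shape and free functions in place of the class methods:

```typescript
// Illustrative stand-in for the config fields read by the launcher; not the real class.
interface LauncherConfig {
    headless: boolean;
    xvfb: boolean;
    chromeExecutablePath?: string;
}

// Headless is the default only when it is enabled and no virtual display (xvfb) is requested.
function getDefaultHeadlessOption(config: LauncherConfig): boolean {
    return config.headless && !config.xvfb;
}

// `??` uses the fallback only for null/undefined, unlike `||`,
// which would also replace an empty string.
function getChromeExecutablePath(config: LauncherConfig, typicalPath: string): string {
    return config.chromeExecutablePath ?? typicalPath;
}

// '/opt/chrome' is a made-up path used only for this example.
console.log(getChromeExecutablePath({ headless: true, xvfb: false }, '/opt/chrome'));
```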
packages/core/package.json

Lines changed: 2 additions & 2 deletions
@@ -59,14 +59,14 @@
         "@sapphire/async-queue": "^1.5.5",
         "@vladfrangu/async_event_emitter": "^2.4.6",
         "csv-stringify": "^6.5.2",
-        "fs-extra": "^11.3.0",
         "json5": "^2.2.3",
         "minimatch": "^10.0.1",
         "ow": "^2.0.0",
         "stream-json": "^1.9.1",
         "tldts": "^7.0.6",
         "tough-cookie": "^6.0.0",
         "tslib": "^2.8.1",
-        "type-fest": "^4.41.0"
+        "type-fest": "^4.41.0",
+        "zod": "^3.24.0 || ^4.0.0"
     }
 }

packages/core/src/autoscaling/snapshotter.ts

Lines changed: 3 additions & 5 deletions
@@ -171,21 +171,19 @@ export class Snapshotter {
      * Starts capturing snapshots at configured intervals.
      */
     async start(): Promise<void> {
-        const memoryMbytes = serviceLocator.getConfiguration().get('memoryMbytes', 0);
+        const memoryMbytes = serviceLocator.getConfiguration().memoryMbytes ?? 0;

         if (memoryMbytes > 0) {
             this.maxMemoryBytes = memoryMbytes * 1024 * 1024;
         } else {
-            const containerized = serviceLocator.getConfiguration().get('containerized', await isContainerized());
+            const containerized = serviceLocator.getConfiguration().containerized ?? (await isContainerized());
             const memInfo = await getMemoryInfo({
                 containerized,
                 logger: serviceLocator.getLogger(),
             });
             const totalBytes = memInfo.totalBytes;

-            this.maxMemoryBytes = Math.ceil(
-                totalBytes * serviceLocator.getConfiguration().get('availableMemoryRatio')!,
-            );
+            this.maxMemoryBytes = Math.ceil(totalBytes * serviceLocator.getConfiguration().availableMemoryRatio);
             this.log.debug(
                 `Setting max memory of this run to ${Math.round(this.maxMemoryBytes / 1024 / 1024)} MB. ` +
                     'Use the CRAWLEE_MEMORY_MBYTES or CRAWLEE_AVAILABLE_MEMORY_RATIO environment variable to override it.',

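The memory-limit logic in `Snapshotter.start()` reduces to a small pure function: an explicit `memoryMbytes` wins, otherwise the limit is a ratio of the detected total memory. The sketch below isolates that arithmetic. `computeMaxMemoryBytes` and `MemoryInputs` are hypothetical names, the inputs are passed in rather than read from `serviceLocator`, and the ratio in the usage line is an example value.

```typescript
// Illustrative extraction of the arithmetic in the hunk above; not part of Crawlee.
interface MemoryInputs {
    memoryMbytes?: number; // explicit limit in megabytes, if configured
    availableMemoryRatio: number; // fraction of total memory to use otherwise
    totalBytes: number; // total memory detected on the machine or container
}

function computeMaxMemoryBytes({ memoryMbytes, availableMemoryRatio, totalBytes }: MemoryInputs): number {
    // Treat an unset limit as 0, mirroring `memoryMbytes ?? 0` in the diff.
    const explicitMbytes = memoryMbytes ?? 0;
    if (explicitMbytes > 0) {
        return explicitMbytes * 1024 * 1024;
    }
    // No explicit limit: round the ratio of total memory up to a whole byte.
    return Math.ceil(totalBytes * availableMemoryRatio);
}

// An explicit 2048 MB limit ignores the detected total entirely.
console.log(computeMaxMemoryBytes({ memoryMbytes: 2048, availableMemoryRatio: 0.25, totalBytes: 16 * 1024 ** 3 }));
```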