## Summary
- **Replaces `get()`/`set()` with direct property access** —
`config.headless` instead of `config.get('headless')`. Both methods are
removed (v4 breaking change).
- **Constructor options now take highest priority** — `new
Configuration({ headless: false })` works even when
`CRAWLEE_HEADLESS=true` is set. New priority: constructor > env vars >
crawlee.json > schema defaults.
- **Immutable instances** — assigning to a config property throws
`TypeError`.
- **Zod-based field definitions** — schema, env var mapping, and
defaults defined in one place (`crawleeConfigFields`).
- **Extensible via subclassing** — subclasses override `protected static
fields` to register additional config fields (e.g. Apify SDK).
Closes #3080
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
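
As a rough illustration of the priority order and immutability described in the summary, here is a self-contained sketch. It does not use Crawlee itself: `Options`, `fromEnv`, and `resolveOptions` are hypothetical stand-ins, only two options are modeled, and the `CRAWLEE_PERSIST_STATE_INTERVAL_MILLIS` name is assumed here for illustration.

```ts
// Hypothetical stand-in for illustration only; not Crawlee's implementation.
interface Options {
    headless: boolean;
    persistStateIntervalMillis: number;
}

// The environment is passed in explicitly so the sketch has no runtime dependencies.
type Env = Record<string, string | undefined>;

// Schema defaults: the lowest-priority source.
const defaults: Options = { headless: true, persistStateIntervalMillis: 60_000 };

// Reads only the variables that are actually set (assumed env var names).
function fromEnv(env: Env): Partial<Options> {
    const out: Partial<Options> = {};
    if (env.CRAWLEE_HEADLESS !== undefined) {
        out.headless = env.CRAWLEE_HEADLESS === 'true';
    }
    if (env.CRAWLEE_PERSIST_STATE_INTERVAL_MILLIS !== undefined) {
        out.persistStateIntervalMillis = Number(env.CRAWLEE_PERSIST_STATE_INTERVAL_MILLIS);
    }
    return out;
}

// Later spreads win, so sources are listed from lowest to highest priority:
// schema defaults < crawlee.json < env vars < constructor options.
function resolveOptions(ctor: Partial<Options>, env: Env, json: Partial<Options>): Readonly<Options> {
    return Object.freeze({ ...defaults, ...json, ...fromEnv(env), ...ctor });
}

const config = resolveOptions(
    { headless: false }, // constructor options
    { CRAWLEE_HEADLESS: 'true', CRAWLEE_PERSIST_STATE_INTERVAL_MILLIS: '10000' }, // env vars
    { persistStateIntervalMillis: 30_000 }, // crawlee.json
);

console.log(config.headless); // false: the constructor beats CRAWLEE_HEADLESS=true
console.log(config.persistStateIntervalMillis); // 10000: the env var beats crawlee.json
```

The frozen result mirrors the "immutable instances" bullet: in strict mode, assigning to a property of a frozen object throws `TypeError`.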
-`crawlee.json` is a baseline. The options provided in the `Configuration` constructor will override the options provided in the JSON. Environment variables will override both.
+Constructor options have the highest priority. Environment variables override `crawlee.json`. The JSON file serves as a baseline.

## `crawlee.json`

-The first option you could use for configuring Crawlee is `crawlee.json` file. The only thing you need to do is specify the <ApiLink to="core/interface/ConfigurationOptions">`ConfigurationOptions`</ApiLink> in the file, place the file in the root of your project, and Crawlee will use provided options as global configuration.
+The first option you could use for configuring Crawlee is `crawlee.json` file. The only thing you need to do is specify the configuration options in the file, place the file in the root of your project, and Crawlee will use provided options as global configuration. See the <ApiLink to="core/class/Configuration">`Configuration`</ApiLink> class for the full list of supported options.

```json title="crawlee.json"
{
@@ -133,24 +133,28 @@ the autoscaling feature will only use up to 2048 MB of memory.

## Configuration class

-The last option to adjust Crawlee configuration is to use the <ApiLink to="core/class/Configuration">`Configuration`</ApiLink> class in the code.
+The last option to adjust Crawlee configuration is to use the <ApiLink to="core/class/Configuration">`Configuration`</ApiLink> class in the code. Configuration is immutable — values are set via the constructor and cannot be changed afterwards.

### Global Configuration

-By default, there is a global singleton instance of `Configuration` class, it is used by the crawlers and some other classes that depend on a configurable behavior. In most cases you don't need to adjust any options there, but if needed - you can get access to it via <ApiLink to="core/class/Configuration#getGlobalConfig">`Configuration.getGlobalConfig()`</ApiLink> function. Now you can easily <ApiLink to="core/class/Configuration#get">`get`</ApiLink> and <ApiLink to="core/class/Configuration#set">`set`</ApiLink> the <ApiLink to="core/interface/ConfigurationOptions">`ConfigurationOptions`</ApiLink>.
+By default, there is a global singleton instance of `Configuration` class, it is used by the crawlers and some other classes that depend on a configurable behavior. In most cases you don't need to adjust any options there, but if needed - you can access it via <ApiLink to="core/class/Configuration#getGlobalConfig">`Configuration.getGlobalConfig()`</ApiLink>, which delegates to the global <ApiLink to="core/class/ServiceLocator">`serviceLocator`</ApiLink> — the single source of truth for Crawlee's shared services (for example the configuration, event manager, storage client, and logger). You can also reach the same instance directly via `serviceLocator.getConfiguration()` or swap services globally with `serviceLocator.setConfiguration(...)` before any crawler is created. Configuration values are accessible directly as properties on the instance.
This is pretty much the same example we used for showing `crawlee.json` usage,
-but now we're using the global configuration, which is the only difference.
-If you run this example - you will find the `SDK_CRAWLER_STATISTICS` file in default Key-Value store as before,
-which would show the same number of finishes requests (one) and the same crawler runtime (~10 seconds).
-This confirms that provided parameters worked: the state was persisted after 10 seconds, as it was set in the global configuration.
+If you run this example - you will find the `SDK_CRAWLER_STATISTICS` file in default Key-Value store,
+which would show the same number of finished requests (one) and the same crawler runtime (~10 seconds).
+This confirms that provided parameters worked: the state was persisted after 10 seconds, as it was set in the configuration.

:::note

-After running the same example with commented two lines of code related to `Configuration` there will be
+After running the same example without the custom configuration, there will be
no `SDK_CRAWLER_STATISTICS` file stored in the default Key-Value store:
as we did not change the `persistStateIntervalMillis`, Crawlee used the default value of 60 seconds,
and the crawler was forcefully aborted after ~15 seconds of run time before it persisted the state for the first time.
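
The timing in this note can be checked with simple arithmetic. The sketch below assumes the state is persisted by a fixed-interval timer whose first tick fires one full interval after start, and it ignores any persistence that might happen on shutdown. `persistCount` is a hypothetical helper, not a Crawlee function.

```ts
// Counts how many times a fixed-interval timer fires before the run is aborted.
// Assumption: the first tick happens one full interval after start.
function persistCount(runtimeMillis: number, intervalMillis: number): number {
    return Math.floor(runtimeMillis / intervalMillis);
}

// Custom configuration: persistStateIntervalMillis = 10 s, run aborted after ~15 s.
console.log(persistCount(15_000, 10_000)); // 1

// Default configuration: 60 s interval, same ~15 s run.
console.log(persistCount(15_000, 60_000)); // 0
```

With the 10-second interval the timer fires once before the abort, so the statistics file exists. With the 60-second default it never fires, which matches the missing file.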
@@ -187,19 +189,19 @@ and the crawler was forcefully aborted after ~15 seconds of run time before it p

### Custom configuration

-Alternatively, you can create a custom configuration. In this case you need to pass it to the class that is going to use it, e.g. to the crawler. Let's adjust the previous example:
+You can create a custom configuration and pass it to the crawler via the `configuration` option:
docs/guides/parallel-scraping/parallel-scraping.mdx: 5 additions & 8 deletions
@@ -132,22 +132,19 @@ We use this to ensure the parent process stays alive until all the worker proces

There are three steps we want to do for the worker processes:

-- ensure the default storages do **not** get purged on start, as otherwise we'd lose the queue we prepared
 - get the queue that supports locking from the same location as the parent process
-- initialize a special storage for worker processes so they do not collide with each other
+- ensure the default storages do **not** get purged on start, as otherwise we'd lose the queue we prepared, and initialize a special storage for worker processes so they do not collide with each other

In order, that's what these lines do:

```javascript title="src/parallel-scraper.mjs"
-// Disable the automatic purge on start (step 1)
-// This is needed when running locally, as otherwise multiple processes will try to clear the default storage (and that will cause clashes)
-Configuration.set('purgeOnStart', false);
-
-// Get the request queue from the parent process (step 2)
+// Get the request queue from the parent process (step 1)
const requestQueue = await getOrInitQueue(false);

-// Configure crawlee to store the worker-specific data in a separate directory (needs to be done AFTER the queue is initialized when running locally) (step 3)
+// Disable the automatic purge on start and configure crawlee to store the worker-specific data
+// in a separate directory (needs to be done AFTER the queue is initialized when running locally) (step 2)
docs/upgrading/upgrading_v4.md: 36 additions & 0 deletions
@@ -55,6 +55,7 @@ The crawler following options are removed:
 - `FileDownloadOptions.streamHandler` - streaming should now be handled directly in the `requestHandler` instead
 - `playwrightUtils.registerUtilsToContext` and `puppeteerUtils.registerUtilsToContext` - this is now added to the context via `ContextPipeline` composition
 - `puppeteerUtils.blockResources` and `puppeteerUtils.cacheResponses` (deprecated)
+- `Configuration.systemInfoV2` / `CRAWLEE_SYSTEM_INFO_V2` environment variable — the v2 behavior is now the default (see [Available resource detection](#available-resource-detection))

### The protected `BasicCrawler.crawlingContexts` map is removed

@@ -154,6 +155,40 @@ The `KeyValueStore.getPublicUrl` method is now asynchronous and reads the public

The `preNavigationHooks` option in `HttpCrawler` subclasses no longer accepts the `gotOptions` object as a second parameter. Modify the `crawlingContext` fields (e.g. `.request`) directly instead.

+## Configuration class redesign
+
+The `Configuration` class has been redesigned for v4. The main changes are:
+
+### Direct property access replaces `get()` and `set()`
+
+**Before:**
+```ts
+const config = Configuration.getGlobalConfig();
+config.set('persistStateIntervalMillis', 10_000);
+const headless = config.get('headless');
+```
+
+**After:**
+```ts
+// Configuration is now immutable — set options via the constructor
## Service management moved from `Configuration` to `ServiceLocator`
The service management functionality has been extracted from `Configuration` into a new `ServiceLocator` class, following the pattern established in Crawlee for Python.
@@ -166,6 +201,7 @@ The following methods and properties have been removed from `Configuration`:
 - `Configuration.getEventManager()` - moved to `ServiceLocator.getEventManager()`
 - `Configuration.useStorageClient()` - use `ServiceLocator.setStorageClient()` instead
 - `Configuration.useEventManager()` - use `ServiceLocator.setEventManager()` instead
+- `Configuration.resetGlobalState()` - use `serviceLocator.reset()` instead
 - `Configuration.storageManagers` - moved to `ServiceLocator.storageManagers`

The `EventManager` and `LocalEventManager` constructors now accept an options object for configuring event intervals (e.g. `persistStateIntervalMillis`, `systemInfoIntervalMillis`). You can also use the new `LocalEventManager.fromConfig()` factory method to create an instance with intervals derived from a `Configuration` object.
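
To picture the extraction of service management into a locator, here is a minimal, self-contained sketch of the pattern. `MiniServiceLocator` and `MiniConfig` are hypothetical stand-ins, not Crawlee's classes. The method names echo `getConfiguration()`, `setConfiguration()`, and `reset()` from the text, while the guard against late replacement is an assumption modeled on the "before any crawler is created" wording.

```ts
// Hypothetical stand-in for illustration only; not Crawlee's ServiceLocator.
interface MiniConfig {
    readonly headless: boolean;
}

class MiniServiceLocator {
    private configuration: MiniConfig | undefined;
    private retrieved = false;

    // Lazily creates a default configuration on first access.
    getConfiguration(): MiniConfig {
        if (this.configuration === undefined) {
            this.configuration = Object.freeze({ headless: true });
        }
        this.retrieved = true;
        return this.configuration;
    }

    // Assumed rule: a service may only be swapped before anything has used it.
    setConfiguration(configuration: MiniConfig): void {
        if (this.retrieved) {
            throw new Error('Configuration is already in use; set it before creating crawlers.');
        }
        this.configuration = configuration;
    }

    // Clears all state, e.g. between tests.
    reset(): void {
        this.configuration = undefined;
        this.retrieved = false;
    }
}

const locator = new MiniServiceLocator();
locator.setConfiguration(Object.freeze({ headless: false }));
console.log(locator.getConfiguration().headless); // false
```

Keeping shared services behind one object means a reset clears everything in one call, which is the role `serviceLocator.reset()` takes over from `Configuration.resetGlobalState()`.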