Commit ba3a356

refactor!: Extract service management from Configuration into ServiceLocator class (#3325)
closes #3073
1 parent eb908c5 commit ba3a356

53 files changed

Lines changed: 1419 additions & 816 deletions

docs/upgrading/upgrading_v4.md

Lines changed: 92 additions & 0 deletions
@@ -40,6 +40,7 @@ The crawler following options are removed:

- `BasicCrawler._cleanupContext` (protected) - this is now handled by the `ContextPipeline`
- `BasicCrawler.isRequestBlocked` (protected)
- `BasicCrawler.events` (protected) - this should be accessed via `BasicCrawler.serviceLocator`
- `BrowserRequestHandler` and `BrowserErrorHandler` types in `@crawlee/browser`
- `BrowserCrawler.userProvidedRequestHandler` (protected)
- `BrowserCrawler.requestHandlerTimeoutInnerMillis` (protected)
@@ -115,6 +116,97 @@ The `KeyValueStore.getPublicUrl` method is now asynchronous and reads the public

The `preNavigationHooks` option in `HttpCrawler` subclasses no longer accepts the `gotOptions` object as a second parameter. Modify the `crawlingContext` fields (e.g. `.request`) directly instead.

## Service management moved from `Configuration` to `ServiceLocator`

The service management functionality has been extracted from `Configuration` into a new `ServiceLocator` class, following the pattern established in Crawlee for Python.

### Breaking changes

The following methods and properties have been removed from `Configuration`:

- `Configuration.getStorageClient()` - moved to `ServiceLocator.getStorageClient()`
- `Configuration.getEventManager()` - moved to `ServiceLocator.getEventManager()`
- `Configuration.useStorageClient()` - use `ServiceLocator.setStorageClient()` instead
- `Configuration.useEventManager()` - use `ServiceLocator.setEventManager()` instead
- `Configuration.storageManagers` - moved to `ServiceLocator.storageManagers`

The `EventManager` and `LocalEventManager` constructors now accept an options object for configuring event intervals (e.g. `persistStateIntervalMillis`, `systemInfoIntervalMillis`). You can also use the new `LocalEventManager.fromConfig()` factory method to create an instance with intervals derived from a `Configuration` object.
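The relationship between the options-object constructor and the `fromConfig()` factory can be pictured with the self-contained sketch below. It does not import Crawlee: `SketchEventManager`, `SketchConfig`, the default interval values, and the configuration field names are hypothetical stand-ins. Only the option names `persistStateIntervalMillis` and `systemInfoIntervalMillis` are taken from the text above.

```typescript
// Hypothetical stand-ins for illustration; these are NOT Crawlee's classes.
interface EventIntervalOptions {
    persistStateIntervalMillis?: number;
    systemInfoIntervalMillis?: number;
}

// Assumed shape of the configuration values that the factory reads.
interface SketchConfig {
    persistStateIntervalMillis: number;
    systemInfoIntervalMillis: number;
}

class SketchEventManager {
    readonly persistStateIntervalMillis: number;
    readonly systemInfoIntervalMillis: number;

    // Intervals arrive in an options object; the defaults are made up for this sketch.
    constructor(options: EventIntervalOptions = {}) {
        this.persistStateIntervalMillis = options.persistStateIntervalMillis ?? 60000;
        this.systemInfoIntervalMillis = options.systemInfoIntervalMillis ?? 1000;
    }

    // Factory that derives the intervals from a configuration object.
    static fromConfig(config: SketchConfig): SketchEventManager {
        return new SketchEventManager({
            persistStateIntervalMillis: config.persistStateIntervalMillis,
            systemInfoIntervalMillis: config.systemInfoIntervalMillis,
        });
    }
}

// Explicit options: unspecified intervals fall back to the sketch defaults.
const explicit = new SketchEventManager({ persistStateIntervalMillis: 30000 });

// Factory: every interval is copied from the configuration object.
const derived = SketchEventManager.fromConfig({
    persistStateIntervalMillis: 10000,
    systemInfoIntervalMillis: 5000,
});

console.log(explicit.persistStateIntervalMillis, derived.systemInfoIntervalMillis);
```

The factory keeps interval handling in one place, so code that already carries a configuration object does not need to unpack individual values itself.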

### Migration guide

If you were using the removed `Configuration` methods directly, you need to update your code:

**Before:**
```typescript
import { Configuration } from 'crawlee';

const config = Configuration.getGlobalConfig();
const storageClient = config.getStorageClient();
const eventManager = config.getEventManager();

// or static methods (renamed variable to avoid redeclaring `storageClient`)
const staticStorageClient = Configuration.getStorageClient();
```

**After:**
```typescript
import { serviceLocator } from 'crawlee';

const storageClient = serviceLocator.getStorageClient();
const eventManager = serviceLocator.getEventManager();
```

### Using per-crawler services (recommended)

The new `ServiceLocator` supports per-crawler service isolation, allowing you to use different storage clients or event managers for different crawlers by passing them via options:

```typescript
import { BasicCrawler, Configuration, LocalEventManager } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

const crawler = new BasicCrawler({
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    },
    configuration: new Configuration({ headless: false }),
    storageClient: new MemoryStorage(),
    eventManager: LocalEventManager.fromConfig(),
});

await crawler.run(['https://example.com']);
```
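To make the idea of per-crawler isolation concrete, here is a minimal, self-contained model of the pattern. It is a sketch under stated assumptions, not Crawlee's implementation: `SketchServiceLocator`, `SketchCrawler`, `StorageClientLike`, and the rule that a crawler gets its own locator only when a service is passed explicitly are all hypothetical.

```typescript
// Hypothetical model of the pattern; NOT Crawlee's implementation.
interface StorageClientLike {
    name: string;
}

class SketchServiceLocator {
    private storageClient?: StorageClientLike;

    setStorageClient(client: StorageClientLike): void {
        this.storageClient = client;
    }

    // Return the configured client, lazily creating a default one if none was set.
    getStorageClient(): StorageClientLike {
        if (!this.storageClient) {
            this.storageClient = { name: 'default' };
        }
        return this.storageClient;
    }
}

// Stand-in for a process-wide singleton locator.
const globalLocator = new SketchServiceLocator();

class SketchCrawler {
    readonly serviceLocator: SketchServiceLocator;

    constructor(options: { storageClient?: StorageClientLike } = {}) {
        if (options.storageClient) {
            // Assumption: an explicitly passed service gives the crawler its own locator.
            this.serviceLocator = new SketchServiceLocator();
            this.serviceLocator.setStorageClient(options.storageClient);
        } else {
            // Otherwise the crawler shares the global locator.
            this.serviceLocator = globalLocator;
        }
    }
}

globalLocator.setStorageClient({ name: 'global' });

const isolated = new SketchCrawler({ storageClient: { name: 'isolated' } });
const shared = new SketchCrawler();

console.log(isolated.serviceLocator.getStorageClient().name); // isolated
console.log(shared.serviceLocator.getStorageClient().name); // global
```

In this model, changing the services of one crawler cannot affect another crawler running in the same process, which is the property that per-crawler services are meant to provide.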

### Using the global service locator

For most use cases, the global `serviceLocator` singleton works well:

```typescript
import { serviceLocator, BasicCrawler } from 'crawlee';
import { MemoryStorage } from '@crawlee/memory-storage';

// Configure global services (optional)
serviceLocator.setStorageClient(new MemoryStorage());

// All crawlers will use the global service locator by default
const crawler = new BasicCrawler({
    requestHandler: async ({ request, log }) => {
        log.info(`Processing ${request.url}`);
    },
});
```

### Accessing configuration

`Configuration.getGlobalConfig()` remains as a utility function, but in most cases, you should use `serviceLocator.getConfiguration()` instead:

```typescript
import { serviceLocator } from 'crawlee';

const config = serviceLocator.getConfiguration();
```

Note that `Configuration.getGlobalConfig()` is currently misnamed: in specific circumstances, it returns the configuration of the currently active service locator rather than the global configuration object.
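The naming caveat can be illustrated with a small self-contained sketch. How Crawlee tracks the "currently active" service locator is not described here, so the module-level `activeLocator` variable and the `withLocator` helper below are assumptions made purely for illustration, as are `SketchLocator` and `SketchConfiguration`.

```typescript
// Hypothetical illustration; the tracking mechanism is an assumption, not Crawlee's code.
interface SketchConfiguration {
    label: string;
}

class SketchLocator {
    constructor(private readonly config: SketchConfiguration) {}

    getConfiguration(): SketchConfiguration {
        return this.config;
    }
}

const defaultLocator = new SketchLocator({ label: 'global' });

// Assumed mechanism: a module-level pointer to whichever locator is active right now.
let activeLocator: SketchLocator = defaultLocator;

// Run `fn` with `locator` active, restoring the previous locator afterwards.
function withLocator<T>(locator: SketchLocator, fn: () => T): T {
    const previous = activeLocator;
    activeLocator = locator;
    try {
        return fn();
    } finally {
        activeLocator = previous;
    }
}

// Despite its name, this returns the ACTIVE locator's configuration.
function getGlobalConfig(): SketchConfiguration {
    return activeLocator.getConfiguration();
}

const crawlerLocator = new SketchLocator({ label: 'per-crawler' });

const outside = getGlobalConfig().label;
const inside = withLocator(crawlerLocator, () => getGlobalConfig().label);

console.log(outside, inside); // global per-crawler
```

Code that needs a specific configuration is therefore safer reading it from the locator it actually holds than relying on the "global" accessor.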
209+
118210
## `transformRequestFunction` precedence in `enqueueLinks`
119211

120212
The `transformRequestFunction` callback in `enqueueLinks` now runs **after** URL pattern filtering (`globs`, `regexps`, `pseudoUrls`) instead of before. This means it has the highest priority and can overwrite any request options set by patterns or the global `label` option.
