Skip to content

Commit a6957fa

Browse files
committed
fix(scraper): replace misleading 403 hint for AI Scraper Studio errors
When a `bdata scraper create` succeeds on the template POST but the subsequent AI-trigger POST 429s (e.g. because the user hit the AI Flow parallel-job cap), the half-built `collector_id` is still printed. If that id is then passed to `bdata scraper run`, the API returns 403 + body `{"error":"Collector does not have a template"}`. Today the CLI maps any 403 to a fixed hint: Hint: Access denied. Check your zone permissions in the control panel. This sends the user 30+ minutes down a zone-permission rabbit hole that has nothing to do with the actual problem (the AI Flow never finished generating selectors for this collector). Observed multiple times during stress testing. This change is structured so the AI Scraper Studio error vocabulary stays in the scraper command and does NOT leak into the shared HTTP client. `scrape`, `search`, `discover`, `pipelines`, and `browser` are unaffected. Mechanism: * `src/utils/client.ts` gains a generic `hints?: Body_hint[]` field on `Request_opts`. The pure helper `pick_hint(status, body, hints)` consults the caller's list first and falls back to the existing `ERROR_HINTS` status-code map. The shared client ships ZERO command-specific patterns. * `src/commands/scraper.ts` defines `SCRAPER_BODY_HINTS` — two patterns: - /collector does not have a template/i → AI generation didn't complete; re-run `scraper create`; web-UI URL for manual recovery. - /cannot run more than \d+ jobs in parallel/i → AI-Flow concurrent-job cap; serialise launches. Every `post`/`get` call in `handle_create_scraper`, `handle_run_scraper`, and `run_batch` passes `hints: SCRAPER_BODY_HINTS` so a 4xx from any of them is translated with the right vocabulary. * Real zone-permission 403s (any body that doesn't match the scraper patterns) still get the original "Access denied" hint — test 'does not consult ERROR_HINTS when an extra-hint pattern matches' locks this in. Tests: 8 unit tests for `client.pick_hint` using mock generic patterns (covers mechanism + asserts the shared client carries no scraper vocabulary in ERROR_HINTS), plus 5 scraper command tests asserting the scraper patterns are well-formed and travel via `hints` to client.post on every AI-Flow call. Two existing tests relaxed from strict opts-object matches to objectContaining-style. 58 / 58 tests in the affected files pass. The 9 pre-existing failures in unrelated suites (daemon, add-mcp, browser, discover, scrape) on main are unchanged by this PR.
1 parent fadfc2b commit a6957fa

4 files changed

Lines changed: 232 additions & 14 deletions

File tree

src/__tests__/commands/scraper.test.ts

Lines changed: 83 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,7 @@ import {
5858
parse_sync_timeout,
5959
is_realtime_page_limit_error,
6060
classify_dataset,
61+
SCRAPER_BODY_HINTS,
6162
} from '../../commands/scraper';
6263

6364
describe('commands/scraper', ()=>{
@@ -68,6 +69,85 @@ describe('commands/scraper', ()=>{
6869
mocks.start.mockReturnValue({stop: mocks.stop});
6970
});
7071

72+
// PR-12: AI Scraper Studio specific hints live with this command,
73+
// not in the shared HTTP client. These tests assert (a) the
74+
// patterns are well-formed and (b) they actually travel to the
75+
// HTTP layer on every API call this command makes.
76+
describe('SCRAPER_BODY_HINTS', ()=>{
77+
it('contains a hint for the stub-collector 403 response body',
78+
()=>{
79+
const stub_hint = SCRAPER_BODY_HINTS.find(h=>
80+
h.pattern.test('Collector does not have a template'));
81+
expect(stub_hint).toBeDefined();
82+
expect(stub_hint!.hint).toMatch(/AI generation has not completed/);
83+
expect(stub_hint!.hint).toMatch(/bdata scraper create/);
84+
// Must NOT echo the misleading generic 403 hint.
85+
expect(stub_hint!.hint).not.toMatch(/zone permissions/);
86+
});
87+
88+
it('contains a hint for the parallel-job cap 429 response body',
89+
()=>{
90+
const cap_hint = SCRAPER_BODY_HINTS.find(h=>
91+
h.pattern.test('Cannot run more than 3 jobs in parallel'));
92+
expect(cap_hint).toBeDefined();
93+
expect(cap_hint!.hint).toMatch(/concurrent-job cap/);
94+
expect(cap_hint!.hint).toMatch(/serialise/);
95+
});
96+
97+
it('cap pattern is case-insensitive and number-agnostic '
98+
+'(future-proof if the cap changes)', ()=>{
99+
const cap_pattern = SCRAPER_BODY_HINTS[1].pattern;
100+
expect(cap_pattern.test('cannot run more than 5 jobs '
101+
+'in parallel')).toBe(true);
102+
expect(cap_pattern.test('CANNOT RUN MORE THAN 100 JOBS '
103+
+'IN PARALLEL')).toBe(true);
104+
});
105+
106+
it('is passed via `hints` opt to client.post on every '
107+
+'AI-Flow call in handle_create_scraper', async()=>{
108+
mocks.post
109+
.mockResolvedValueOnce({id: 'c_abc', name: 'cli-scraper-1'})
110+
.mockResolvedValueOnce({id: 'ia_xyz', queued: false});
111+
mocks.poll_until.mockResolvedValue({
112+
result: {status: 'done', completed_steps: []},
113+
attempts: 1,
114+
});
115+
await handle_create_scraper(
116+
'https://example.com/p/1',
117+
'extract title',
118+
{}
119+
);
120+
// Both `post` calls (template create + AI trigger) must
121+
// include the scraper hints so any 4xx is translated
122+
// with the right vocabulary.
123+
for (const call of mocks.post.mock.calls)
124+
{
125+
const opts = call[3] as {hints?: unknown};
126+
expect(opts).toBeDefined();
127+
expect(opts.hints).toBe(SCRAPER_BODY_HINTS);
128+
}
129+
});
130+
131+
it('is passed via `hints` opt to client.post on the '
132+
+'trigger_immediate call in handle_run_scraper', async()=>{
133+
mocks.post.mockResolvedValueOnce({response_id: 'r_xyz'});
134+
const fetch_spy = vi.spyOn(global, 'fetch')
135+
.mockImplementation(()=>Promise.resolve({
136+
status: 200,
137+
text: ()=>Promise.resolve('{"title":"ok"}'),
138+
} as unknown as Response));
139+
mocks.poll_until.mockImplementation(async(o: never)=>{
140+
const cfg = o as {fetch_once: ()=>Promise<unknown>};
141+
const r = await cfg.fetch_once();
142+
return {result: r, attempts: 1};
143+
});
144+
await handle_run_scraper('c_abc', 'https://x.com/p/1', {});
145+
const opts = mocks.post.mock.calls[0][3] as {hints?: unknown};
146+
expect(opts.hints).toBe(SCRAPER_BODY_HINTS);
147+
fetch_spy.mockRestore();
148+
});
149+
});
150+
71151
describe('build_template_request', ()=>{
72152
it('uses auto-generated name and stub webhook by default', ()=>{
73153
const before = Math.floor(Date.now()/1000);
@@ -176,15 +256,15 @@ describe('commands/scraper', ()=>{
176256
expect.objectContaining({
177257
deliver: expect.objectContaining({type: 'webhook'}),
178258
}),
179-
{timing: undefined}
259+
expect.objectContaining({timing: undefined})
180260
);
181261
expect(mocks.post).toHaveBeenNthCalledWith(
182262
2,
183263
'api_key',
184264
'/dca/collectors/c_abc/automate_template',
185265
{description: 'extract title',
186266
urls: ['https://example.com/p/1']},
187-
{timing: undefined}
267+
expect.objectContaining({timing: undefined})
188268
);
189269
expect(mocks.poll_until).toHaveBeenCalledTimes(1);
190270
expect(mocks.poll_until).toHaveBeenCalledWith(
@@ -404,7 +484,7 @@ describe('commands/scraper', ()=>{
404484
expect.stringMatching(
405485
/\/dca\/trigger_immediate\?collector=c_abc/),
406486
{url: 'https://x.com/p/1'},
407-
{timing: undefined}
487+
expect.objectContaining({timing: undefined})
408488
);
409489
expect(mocks.print).toHaveBeenCalledWith(
410490
{title: 'hello'},

src/__tests__/utils/client.test.ts

Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
import {describe, it, expect} from 'vitest';
2+
import {pick_hint, ERROR_HINTS, type Body_hint} from '../../utils/client';
3+
4+
// pick_hint is the layer the per-command hint lists plug into.
5+
// The client itself ships ZERO command-specific patterns — every test
6+
// here uses generic mock patterns so the contract is clear.
7+
8+
const MOCK_HINTS: Body_hint[] = [
9+
{pattern: /quota.*exceeded/i,
10+
hint: 'Your monthly quota is exhausted. Top up in the dashboard.'},
11+
{pattern: /not allowed in region/i,
12+
hint: 'This API is geo-restricted. Use --country to override.'},
13+
];
14+
15+
describe('utils/client.pick_hint', ()=>{
16+
it('returns the status-based hint when no extra hints are passed', ()=>{
17+
expect(pick_hint(401, 'anything')).toBe(ERROR_HINTS[401]);
18+
expect(pick_hint(403, 'anything')).toBe(ERROR_HINTS[403]);
19+
expect(pick_hint(404, 'anything')).toBe(ERROR_HINTS[404]);
20+
expect(pick_hint(429, 'anything')).toBe(ERROR_HINTS[429]);
21+
});
22+
23+
it('returns the status-based hint when extra hints are passed '
24+
+'but nothing matches the body', ()=>{
25+
expect(pick_hint(403, 'plain unauthorized message', MOCK_HINTS))
26+
.toBe(ERROR_HINTS[403]);
27+
});
28+
29+
it('overrides with the first matching extra hint', ()=>{
30+
expect(pick_hint(429, 'monthly quota exceeded', MOCK_HINTS))
31+
.toMatch(/Top up in the dashboard/);
32+
expect(pick_hint(403, 'not allowed in region: CN', MOCK_HINTS))
33+
.toMatch(/geo-restricted/);
34+
});
35+
36+
it('is case-insensitive on the caller\'s pattern', ()=>{
37+
expect(pick_hint(429, 'MONTHLY QUOTA EXCEEDED', MOCK_HINTS))
38+
.toMatch(/Top up in the dashboard/);
39+
});
40+
41+
it('respects first-match-wins order in the caller\'s list', ()=>{
42+
const ordered: Body_hint[] = [
43+
{pattern: /generic/, hint: 'first wins'},
44+
{pattern: /generic/, hint: 'second loses'},
45+
];
46+
expect(pick_hint(500, 'generic error', ordered))
47+
.toBe('first wins');
48+
});
49+
50+
it('returns undefined for unknown statuses with no body match', ()=>{
51+
expect(pick_hint(418, 'teapot')).toBeUndefined();
52+
expect(pick_hint(599, 'mystery')).toBeUndefined();
53+
});
54+
55+
it('does not consult ERROR_HINTS when an extra-hint pattern '
56+
+'matches', ()=>{
57+
// The override should win even when the generic status hint
58+
// would also be applicable.
59+
const matches_403: Body_hint[] = [
60+
{pattern: /custom 403 case/i,
61+
hint: 'override hint, not the zone-permissions one'},
62+
];
63+
expect(pick_hint(403, 'this is a custom 403 case', matches_403))
64+
.toBe('override hint, not the zone-permissions one');
65+
});
66+
67+
it('does not leak any scraper-studio vocabulary in the default '
68+
+'hints', ()=>{
69+
// The shared client must stay generic. Catches accidental
70+
// regressions where someone adds a domain-specific hint here.
71+
for (const hint of Object.values(ERROR_HINTS))
72+
{
73+
expect(hint).not.toMatch(/scraper create/i);
74+
expect(hint).not.toMatch(/collector/i);
75+
expect(hint).not.toMatch(/AI[- ]Flow/i);
76+
expect(hint).not.toMatch(/automate_template/i);
77+
}
78+
});
79+
});

src/commands/scraper.ts

Lines changed: 38 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
import {Command} from 'commander';
2-
import {post, get} from '../utils/client';
2+
import {post, get, type Body_hint} from '../utils/client';
33
import {load as load_config} from '../utils/config';
44
import {ensure_authenticated} from '../utils/auth';
55
import {start as start_spinner} from '../utils/spinner';
@@ -18,6 +18,37 @@ import type {
1818
Batch_trigger_response,
1919
} from '../types/scraper';
2020

21+
// AI Scraper Studio specific error-body hints. Lives with the command
22+
// that owns the DCA endpoints — NOT in the shared HTTP client — so the
23+
// AI-Flow vocabulary doesn't leak into `scrape`, `search`, `discover`,
24+
// `pipelines`, or `browser`.
25+
//
26+
// Both patterns describe the same incident from two perspectives:
27+
// * The 429 fires during `scraper create` when too many AI-Flow jobs
28+
// are in flight at once (undocumented 3-job cap as of v0.2.0).
29+
// * The 403 "Collector does not have a template" fires later when
30+
// the user tries to `scraper run` against the half-built
31+
// collector_id the failed create left behind.
32+
//
33+
// First match wins. Add more patterns here as we learn about other
34+
// scraper-API error bodies that deserve a better hint.
35+
const SCRAPER_BODY_HINTS: Body_hint[] = [
36+
{
37+
pattern: /collector does not have a template/i,
38+
hint: 'AI generation has not completed for this collector. '
39+
+'Re-run `bdata scraper create` (the previous attempt may '
40+
+'have hit the AI-Flow parallel-job cap), or open '
41+
+'https://brightdata.com/cp/scrapers to inspect / finish '
42+
+'the half-built collector in the web UI.',
43+
},
44+
{
45+
pattern: /cannot run more than \d+ jobs in parallel/i,
46+
hint: 'You hit the AI-Flow concurrent-job cap. Wait for an '
47+
+'in-flight `scraper create` to finish, or serialise '
48+
+'your launches.',
49+
},
50+
];
51+
2152
const CREATE_TEMPLATE_ENDPOINT = '/dca/collector';
2253
const AI_TRIGGER_PATH = 'automate_template';
2354
const AI_PROGRESS_PATH = 'automate_template/progress';
@@ -105,7 +136,7 @@ const handle_create_scraper = async(
105136
api_key,
106137
CREATE_TEMPLATE_ENDPOINT,
107138
template_body,
108-
{timing: opts.timing}
139+
{timing: opts.timing, hints: SCRAPER_BODY_HINTS}
109140
);
110141
create_spinner.stop();
111142
if (!template.id)
@@ -129,7 +160,7 @@ const handle_create_scraper = async(
129160
api_key,
130161
`/dca/collectors/${collector_id}/${AI_TRIGGER_PATH}`,
131162
build_ai_request(url, description),
132-
{timing: opts.timing}
163+
{timing: opts.timing, hints: SCRAPER_BODY_HINTS}
133164
);
134165
trigger_spinner.stop();
135166
} catch(e) {
@@ -148,7 +179,7 @@ const handle_create_scraper = async(
148179
fetch_once: ()=>get<Ai_progress_response>(
149180
api_key,
150181
`/dca/collectors/${collector_id}/${AI_PROGRESS_PATH}`,
151-
{timing: opts.timing}
182+
{timing: opts.timing, hints: SCRAPER_BODY_HINTS}
152183
),
153184
get_status: extract_progress_status,
154185
running_statuses: [RUNNING_SENTINEL],
@@ -338,7 +369,7 @@ const run_batch = async(
338369
api_key,
339370
`${BATCH_TRIGGER_ENDPOINT}?${query}`,
340371
[build_run_request(url)],
341-
{timing: opts.timing}
372+
{timing: opts.timing, hints: SCRAPER_BODY_HINTS}
342373
);
343374
trigger_spinner.stop();
344375
if (!trigger.collection_id)
@@ -474,7 +505,7 @@ const handle_run_scraper = async(
474505
api_key,
475506
`${TRIGGER_IMMEDIATE_ENDPOINT}?${trigger_query}`,
476507
build_run_request(url),
477-
{timing: opts.timing}
508+
{timing: opts.timing, hints: SCRAPER_BODY_HINTS}
478509
);
479510
trigger_spinner.stop();
480511
if (!trigger.response_id)
@@ -592,4 +623,5 @@ export {
592623
parse_sync_timeout,
593624
is_realtime_page_limit_error,
594625
classify_dataset,
626+
SCRAPER_BODY_HINTS,
595627
};

src/utils/client.ts

Lines changed: 32 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -11,12 +11,35 @@ const ERROR_HINTS: Record<number, string> = {
1111
429: 'Rate limit exceeded. Wait a moment and try again.',
1212
};
1313

14+
// Each command can supply its own ordered list of body-pattern overrides
15+
// via Request_opts.hints. The shared HTTP layer knows nothing about
16+
// per-command error semantics — it just consults the caller's hints
17+
// before falling back to the generic ERROR_HINTS map.
18+
type Body_hint = {pattern: RegExp; hint: string};
19+
20+
const pick_hint = (
21+
status: number,
22+
body: string,
23+
extra_hints?: Body_hint[]
24+
): string|undefined=>{
25+
if (extra_hints)
26+
{
27+
for (const {pattern, hint} of extra_hints)
28+
{
29+
if (pattern.test(body))
30+
return hint;
31+
}
32+
}
33+
return ERROR_HINTS[status];
34+
};
35+
1436
type Request_opts = {
1537
method?: string;
1638
body?: unknown;
1739
headers?: Record<string, string>;
1840
timing?: boolean;
1941
raw_buffer?: boolean;
42+
hints?: Body_hint[];
2043
};
2144

2245
type Api_error = {
@@ -27,10 +50,14 @@ type Api_error = {
2750

2851
const sleep = (ms: number)=>new Promise(resolve=>setTimeout(resolve, ms));
2952

30-
const format_error = (status: number, detail: string): Api_error=>({
53+
const format_error = (
54+
status: number,
55+
detail: string,
56+
extra_hints?: Body_hint[]
57+
): Api_error=>({
3158
status,
3259
message: detail,
33-
hint: ERROR_HINTS[status],
60+
hint: pick_hint(status, detail, extra_hints),
3461
});
3562

3663
const request = async<T = unknown>(
@@ -93,7 +120,7 @@ const request = async<T = unknown>(
93120
if (err_body)
94121
detail = err_body;
95122
} catch(_e) {}
96-
const api_err = format_error(res.status, detail);
123+
const api_err = format_error(res.status, detail, opts.hints);
97124
const msg = [
98125
`Error: ${api_err.message}`,
99126
` Status: ${api_err.status}`,
@@ -133,5 +160,5 @@ const get = <T = unknown>(
133160
opts: Omit<Request_opts, 'method'> = {}
134161
): Promise<T>=>request<T>(api_key, endpoint, {method: 'GET', ...opts});
135162

136-
export {request, post, get};
137-
export type {Request_opts, Api_error};
163+
export {request, post, get, pick_hint, ERROR_HINTS};
164+
export type {Request_opts, Api_error, Body_hint};

0 commit comments

Comments
 (0)