Commit 7efe277

feat: Cap schema inference depth, drop raw schema from get-dataset
Calibration decision from #882 (probe: 10 top store Actors):

- generateSchemaFromItems collapses objects/arrays nested at depth 3 or deeper (the item root is depth 0) to a bare type; deep social-media items (Facebook posts: 15.3 KB schema before the cap) now fit the ~2K-token budget.
- get-dataset stops returning the raw Apify dataset.schema (93–95% of response bytes on top Actors, 23–39% phantom fields); nextStep routes to get-dataset-schema instead.

Note for apify-mcp-server-internal: get-dataset structuredContent no longer carries the schema key.

https://claude.ai/code/session_01Sf9wACoa9h9y2m2WZ2Sde5
1 parent dd60d45 commit 7efe277

6 files changed

Lines changed: 97 additions & 10 deletions

src/tools/common/get_dataset.ts

Lines changed: 10 additions & 3 deletions
@@ -22,8 +22,9 @@ export const getDataset: ToolEntry = Object.freeze({
     name: HelperTools.DATASET_GET,
     description: dedent`
         Get metadata for a dataset (collection of structured data created by an Actor run).
-        The results will include dataset details such as itemCount, schema, fields, and stats.
+        The results will include dataset details such as itemCount, fields, and stats.
         Use fields to understand structure for filtering with ${HelperTools.DATASET_GET_ITEMS}.
+        For a JSON schema of the item structure, use ${HelperTools.DATASET_SCHEMA_GET}.
         Note: itemCount updates may be delayed by up to ~5 seconds.
 
         USAGE:
@@ -51,17 +52,23 @@ export const getDataset: ToolEntry = Object.freeze({
         if (!dataset) {
             return buildStorageNotFound(`Dataset '${datasetId}' not found.`);
         }
+        // The API also returns a raw `schema` (untyped in apify-client). It is 93–95% of the
+        // response bytes on top store Actors and declares fields that may be absent from the
+        // data, so drop it — get-dataset-schema infers a compact schema from real items (#882).
+        const { schema, ...metadata } = dataset as typeof dataset & { schema?: unknown };
         // Apify returns `fields` slash-separated AND with array indices expanded
         // (e.g. `latestComments/0/owner/username`). For a real Instagram-scraper
         // dataset this inflates ~78 schema fields into 528 paths (~85% bloat) and
         // produces slash-notation paths that aren't directly usable as projection
         // hints for `get-dataset-items` (which expects dot-notation). Run the same
         // normalization `buildRunDataset` applies so this tool's `fields` matches
         // the structured `storages.datasets.default.fields` shape.
-        const normalized = dataset.fields ? { ...dataset, fields: normalizeDatasetFields(dataset.fields) } : dataset;
+        const normalized = metadata.fields
+            ? { ...metadata, fields: normalizeDatasetFields(metadata.fields) }
+            : metadata;
         const fieldCount = Array.isArray(normalized.fields) ? normalized.fields.length : undefined;
         const summary = `Dataset '${normalized.name ?? datasetId}' has ${normalized.itemCount ?? 0} items${fieldCount !== undefined ? `, ${fieldCount} fields` : ''}.`;
-        const nextStep = `Use ${HelperTools.DATASET_GET_ITEMS} with datasetId=${datasetId} and limit (for example 20) to fetch items.`;
+        const nextStep = `Use ${HelperTools.DATASET_GET_ITEMS} with datasetId=${datasetId} and limit (for example 20) to fetch items, or ${HelperTools.DATASET_SCHEMA_GET} to infer item structure.`;
         return buildStorageResponse({
             structuredContent: normalized as unknown as Record<string, unknown>,
             summary,
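
The schema-stripping step in this hunk can be illustrated in isolation. The following is a minimal sketch, not the repository's code: `DatasetLike`, `stripRawSchema`, and `schemaShare` are hypothetical names, and the size measure counts serialized characters rather than exact bytes.

```typescript
// Hypothetical stand-in for the dataset object returned by the API client.
type DatasetLike = Record<string, unknown> & { schema?: unknown };

/** Returns a shallow copy of `dataset` without the raw `schema` key. */
function stripRawSchema(dataset: DatasetLike): Record<string, unknown> {
    // Rest destructuring copies every key except `schema`; the input is not mutated.
    const { schema: _schema, ...metadata } = dataset;
    return metadata;
}

/**
 * Approximate share of the serialized payload taken by `schema`.
 * Assumption: string length (UTF-16 code units) is close enough to bytes
 * for mostly-ASCII JSON; returns 0 when no schema is present.
 */
function schemaShare(dataset: DatasetLike): number {
    if (dataset.schema === undefined) return 0;
    const total = JSON.stringify(dataset).length;
    const without = JSON.stringify(stripRawSchema(dataset)).length;
    return (total - without) / total;
}
```

Because the rest-destructuring form returns a shallow copy, the object handed back by the client stays untouched, which is the property the `toMatchObject(MOCK_DATASET)` assertion in the unit test relies on.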

src/tools/common/get_dataset_schema.ts

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,7 @@ import { compileSchema } from '../../utils/ajv.js';
 import { stripQuoteWrappers } from '../../utils/generic.js';
 import { getHttpStatusCode } from '../../utils/logging.js';
 import { buildMCPResponse } from '../../utils/mcp.js';
-import { generateSchemaFromItems } from '../../utils/schema_generation.js';
+import { DEFAULT_MAX_SCHEMA_DEPTH, generateSchemaFromItems } from '../../utils/schema_generation.js';
 import { datasetSchemaOutputSchema } from '../structured_output_schemas.js';
 import { buildStorageNotFound, buildStorageResponse } from './storage_helpers.js';
 
@@ -36,6 +36,7 @@ export const getDatasetSchema: ToolEntry = Object.freeze({
         Generate a JSON schema from a sample of dataset items.
         The schema describes the structure of the data and can be used for validation, documentation, or processing.
         Use this to understand the dataset before fetching many items.
+        Nesting is described up to ${DEFAULT_MAX_SCHEMA_DEPTH} levels deep; deeper objects/arrays appear as a bare type.
 
         USAGE:
         - Use when you need to infer the structure of dataset items for downstream processing or validation.

src/tools/structured_output_schemas.ts

Lines changed: 2 additions & 1 deletion
@@ -541,7 +541,8 @@ export const datasetItemsOutputSchema = {
 
 /**
  * Schema for dataset metadata (get-dataset). Documents the fields the LLM acts on; the raw API
- * response carries more keys (stats, schema, access settings), allowed as additional properties.
+ * response carries more keys (stats, access settings), allowed as additional properties.
+ * The raw `schema` key is stripped by the tool — get-dataset-schema owns schema output (#882).
  */
 export const datasetMetadataOutputSchema = {
     type: 'object' as const,

src/utils/schema_generation.ts

Lines changed: 14 additions & 5 deletions
@@ -19,8 +19,15 @@ export type SchemaGenerationOptions = {
     limit?: number;
     /** If true, strips empty arrays from items before inference. Default is true. */
     clean?: boolean;
+    /**
+     * Maximum nesting depth described in the schema; objects/arrays deeper than this collapse
+     * to a bare `{ type }`. Caps token cost on deeply nested items (#882). Default is 3.
+     */
+    maxDepth?: number;
 };
 
+export const DEFAULT_MAX_SCHEMA_DEPTH = 3;
+
 /**
  * Local counterpart to the dataset API's `clean=true` — empty arrays carry no schema info.
  * Strips only empty arrays; keeps null / '' / empty objects so schema inference still sees those fields.
@@ -101,23 +108,25 @@ function inferType(value: unknown): JsonSchemaPrimitiveType {
     return 'object';
 }
 
-function inferSchema(value: unknown): JsonSchemaProperty {
+function inferSchema(value: unknown, depth: number, maxDepth: number): JsonSchemaProperty {
     const type = inferType(value);
 
     if (type === 'object') {
+        if (depth >= maxDepth) return { type: 'object' };
         const entries = Object.entries(value as Record<string, unknown>);
         if (entries.length === 0) return { type: 'object' };
         const properties: Record<string, JsonSchemaProperty> = {};
         for (const [k, v] of entries) {
-            properties[k] = inferSchema(v);
+            properties[k] = inferSchema(v, depth + 1, maxDepth);
         }
         return { type: 'object', properties };
     }
 
     if (type === 'array') {
+        if (depth >= maxDepth) return { type: 'array' };
         const arr = value as unknown[];
         if (arr.length === 0) return { type: 'array' };
-        const merged = arr.map(inferSchema).reduce(mergeSchemas);
+        const merged = arr.map((v) => inferSchema(v, depth + 1, maxDepth)).reduce(mergeSchemas);
         return { type: 'array', items: merged };
     }
 
@@ -182,14 +191,14 @@ export function generateSchemaFromItems(
     datasetItems: unknown[],
     options: SchemaGenerationOptions = {},
 ): JsonSchemaArray | null {
-    const { limit = 5, clean = true } = options;
+    const { limit = 5, clean = true, maxDepth = DEFAULT_MAX_SCHEMA_DEPTH } = options;
 
     const itemsToUse = datasetItems.slice(0, limit);
     if (itemsToUse.length === 0) return null;
 
     const processed = clean ? itemsToUse.map(cleanEmptyArrays) : itemsToUse;
 
-    const itemSchemas = processed.map(inferSchema);
+    const itemSchemas = processed.map((item) => inferSchema(item, 0, maxDepth));
     const merged = itemSchemas.reduce(mergeSchemas);
 
     return { type: 'array', items: merged };
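
The depth cap in `inferSchema` can be sketched on its own. This simplified version assumes the item root counts as depth 0, as in the hunks above. Unlike the real code, it describes only the first array element instead of merging all elements with `mergeSchemas`. `Schema`, `inferTypeSketch`, and `inferCapped` are hypothetical names.

```typescript
// Hypothetical, reduced schema shape; the real JsonSchemaProperty type is not shown in the diff.
type Schema = { type: string; properties?: Record<string, Schema>; items?: Schema };

// Simplified type detection (assumption: integers are reported separately from numbers).
function inferTypeSketch(value: unknown): string {
    if (value === null) return 'null';
    if (Array.isArray(value)) return 'array';
    if (typeof value === 'number') return Number.isInteger(value) ? 'integer' : 'number';
    return typeof value; // 'object', 'string', 'boolean', ...
}

function inferCapped(value: unknown, depth: number, maxDepth: number): Schema {
    const type = inferTypeSketch(value);

    if (type === 'object') {
        // At or beyond the cap, keep only the bare type and stop recursing.
        if (depth >= maxDepth) return { type };
        const entries = Object.entries(value as Record<string, unknown>);
        if (entries.length === 0) return { type };
        const properties: Record<string, Schema> = {};
        for (const [key, child] of entries) {
            properties[key] = inferCapped(child, depth + 1, maxDepth);
        }
        return { type, properties };
    }

    if (type === 'array') {
        const arr = value as unknown[];
        if (depth >= maxDepth || arr.length === 0) return { type };
        // Simplification: only the first element is described; array nesting
        // still adds one level of depth, matching the behavior in the diff.
        return { type, items: inferCapped(arr[0], depth + 1, maxDepth) };
    }

    // Primitives are never collapsed, whatever their depth.
    return { type };
}

// With maxDepth = 3 the object at a.b.c sits at depth 3 and collapses to { type: 'object' }.
const capped = inferCapped({ a: { b: { c: { d: 1 } } } }, 0, 3);
console.log(JSON.stringify(capped));
```

Passing `depth` and `maxDepth` explicitly keeps the function pure, at the cost of wrapping it in an arrow function wherever it was previously passed directly to `map`.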

tests/unit/schema_generation.test.ts

Lines changed: 46 additions & 0 deletions
@@ -241,6 +241,52 @@ describe('generateSchemaFromItems — options', () => {
     });
 });
 
+describe('generateSchemaFromItems — depth cap', () => {
+    // Calibration probe (#882): unbounded recursion blew Facebook-posts schemas to ~15 KB
+    // via deep subtrees (`sharedPost`, `media`). Values deeper than maxDepth collapse to a bare type.
+    it('collapses objects deeper than the default maxDepth to a bare object type', () => {
+        const result = generateSchemaFromItems([{ a: { b: { c: { d: 1 } } } }]);
+        const c = props(result)!.a?.properties?.b?.properties?.c;
+        expect(c?.type).toBe('object');
+        expect(c?.properties).toBeUndefined();
+    });
+
+    it('collapses arrays deeper than the default maxDepth to a bare array type', () => {
+        const result = generateSchemaFromItems([{ a: { b: { c: [1, 2] } } }]);
+        const c = props(result)!.a?.properties?.b?.properties?.c;
+        expect(c?.type).toBe('array');
+        expect(c?.items).toBeUndefined();
+    });
+
+    it('keeps everything above the cap fully described', () => {
+        const result = generateSchemaFromItems([{ a: { b: { s: 'x', n: 1 } } }]);
+        const b = props(result)!.a?.properties?.b;
+        expect(b?.properties?.s?.type).toBe('string');
+        expect(b?.properties?.n?.type).toBe('integer');
+    });
+
+    it('counts array nesting toward the depth', () => {
+        const result = generateSchemaFromItems([{ a: [{ b: { c: 1 } }] }]);
+        const b = props(result)!.a?.items?.properties?.b;
+        expect(b?.type).toBe('object');
+        expect(b?.properties).toBeUndefined();
+    });
+
+    it('respects a custom maxDepth', () => {
+        const result = generateSchemaFromItems([{ a: { b: 1 } }], { maxDepth: 1 });
+        const { a } = props(result)!;
+        expect(a?.type).toBe('object');
+        expect(a?.properties).toBeUndefined();
+    });
+
+    it('merges capped and uncapped schemas across items without resurrecting depth', () => {
+        const result = generateSchemaFromItems([{ a: { b: { c: { d: 1 } } } }, { a: { b: { c: { e: 'x' } } } }]);
+        const c = props(result)!.a?.properties?.b?.properties?.c;
+        expect(c?.type).toBe('object');
+        expect(c?.properties).toBeUndefined();
+    });
+});
+
 describe('generateSchemaFromItems — user-reported regression', () => {
     it('emits all four top-level keys from the NYC sushi dataset sample', () => {
         const items = [

tests/unit/tools.get_dataset.test.ts

Lines changed: 23 additions & 0 deletions
@@ -54,6 +54,29 @@ describe('get-dataset', () => {
         expect(content[0].text).toContain("Dataset 'missing' not found");
     });
 
+    it('strips the raw schema field from the response', async () => {
+        // Calibration probe (#882): raw `dataset.schema` was 93–95% of the response bytes on
+        // top store Actors and declares fields absent from the data. get-dataset-schema is
+        // the schema source; this tool returns metadata only.
+        const result = await (getDataset as HelperTool).call(
+            stubToolCallContext(
+                { datasetId: 'ds-1' },
+                stubApifyClient({ ...MOCK_DATASET, schema: { fields: {}, views: {} } }),
+            ),
+        );
+        const { structuredContent } = result as { structuredContent: Record<string, unknown> };
+        expect(structuredContent).not.toHaveProperty('schema');
+        expect(structuredContent).toMatchObject(MOCK_DATASET);
+    });
+
+    it('points nextStep at get-dataset-schema for structure inference', async () => {
+        const result = await (getDataset as HelperTool).call(
+            stubToolCallContext({ datasetId: 'ds-1' }, stubApifyClient(MOCK_DATASET)),
+        );
+        const { structuredContent } = result as { structuredContent: { nextStep: string } };
+        expect(structuredContent.nextStep).toContain(HelperTools.DATASET_SCHEMA_GET);
+    });
+
     it('rejects empty datasetId via ajv validation', () => {
         const tool = getDataset as HelperTool;
         expect(tool.ajvValidate({ datasetId: '' })).toBe(false);
