Commit bef46e3

feat: Drop raw schema from get-dataset, stop nudging get-dataset-schema
Calibration outcome for #882 (probe: 10 top store Actors; Mixpanel: the get-dataset-schema tool is rarely called):

- get-dataset no longer returns the raw Apify dataset.schema (93–95% of response bytes on top Actors, 23–39% phantom fields). The flat `fields` list it already returns is the complete, projection-ready inventory.
- Remove the nudges that steer the LLM into get-dataset-schema (a rarely needed, context-heavy call): drop it from get-dataset's description and nextStep, and from get-dataset-items' last-page nextStep (now points at get-dataset for the field list).
- get-dataset-schema stays as an on-demand tool, unchanged: no depth cap; when explicitly called it returns the full inferred schema.

Note for apify-mcp-server-internal: get-dataset structuredContent no longer carries the schema key.

https://claude.ai/code/session_01Sf9wACoa9h9y2m2WZ2Sde5
1 parent dd60d45 commit bef46e3

5 files changed

Lines changed: 29 additions & 7 deletions

src/tools/common/get_dataset.ts

Lines changed: 8 additions & 2 deletions
@@ -22,7 +22,7 @@ export const getDataset: ToolEntry = Object.freeze({
     name: HelperTools.DATASET_GET,
     description: dedent`
         Get metadata for a dataset (collection of structured data created by an Actor run).
-        The results will include dataset details such as itemCount, schema, fields, and stats.
+        The results will include dataset details such as itemCount, fields, and stats.
         Use fields to understand structure for filtering with ${HelperTools.DATASET_GET_ITEMS}.
         Note: itemCount updates may be delayed by up to ~5 seconds.
 
@@ -51,14 +51,20 @@ export const getDataset: ToolEntry = Object.freeze({
         if (!dataset) {
             return buildStorageNotFound(`Dataset '${datasetId}' not found.`);
         }
+        // The API also returns a raw `schema` (untyped in apify-client). It is 93–95% of the
+        // response bytes on top store Actors and declares fields that may be absent from the
+        // data, so drop it — get-dataset-schema infers a compact schema from real items (#882).
+        const { schema, ...metadata } = dataset as typeof dataset & { schema?: unknown };
         // Apify returns `fields` slash-separated AND with array indices expanded
         // (e.g. `latestComments/0/owner/username`). For a real Instagram-scraper
         // dataset this inflates ~78 schema fields into 528 paths (~85% bloat) and
         // produces slash-notation paths that aren't directly usable as projection
         // hints for `get-dataset-items` (which expects dot-notation). Run the same
         // normalization `buildRunDataset` applies so this tool's `fields` matches
         // the structured `storages.datasets.default.fields` shape.
-        const normalized = dataset.fields ? { ...dataset, fields: normalizeDatasetFields(dataset.fields) } : dataset;
+        const normalized = metadata.fields
+            ? { ...metadata, fields: normalizeDatasetFields(metadata.fields) }
+            : metadata;
         const fieldCount = Array.isArray(normalized.fields) ? normalized.fields.length : undefined;
         const summary = `Dataset '${normalized.name ?? datasetId}' has ${normalized.itemCount ?? 0} items${fieldCount !== undefined ? `, ${fieldCount} fields` : ''}.`;
         const nextStep = `Use ${HelperTools.DATASET_GET_ITEMS} with datasetId=${datasetId} and limit (for example 20) to fetch items.`;
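
For illustration only, the two steps in this hunk (dropping the raw `schema` key, then collapsing slash-separated, index-expanded paths into dot-notation) might look roughly like the sketch below. `stripSchemaSketch` and `normalizeFieldsSketch` are hypothetical names. The real `normalizeDatasetFields` is not shown in this commit, so dropping numeric segments and de-duplicating the result is an assumption about its behavior, not its implementation.

```typescript
// Hypothetical sketch; NOT the repository's normalizeDatasetFields.
// Assumption: purely numeric path segments are array indices and can be dropped.
function normalizeFieldsSketch(fields: string[]): string[] {
    const result: string[] = [];
    for (const field of fields) {
        const dotted = field
            .split('/')
            .filter((segment) => !/^\d+$/.test(segment))
            .join('.');
        // Keep the first occurrence of each collapsed path, in input order.
        if (dotted !== '' && result.indexOf(dotted) === -1) {
            result.push(dotted);
        }
    }
    return result;
}

// Mirrors the rest-destructuring in the diff: pull `schema` out, keep everything else.
function stripSchemaSketch(dataset: Record<string, unknown>): Record<string, unknown> {
    const { schema: _schema, ...metadata } = dataset;
    void _schema; // discarded on purpose
    return metadata;
}

const example = stripSchemaSketch({
    id: 'ds-1',
    itemCount: 3,
    schema: { fields: {}, views: {} },
});
const exampleFields = normalizeFieldsSketch([
    'latestComments/0/owner/username',
    'latestComments/1/owner/username',
    'url',
]);
console.log(example, exampleFields);
```

Under these assumptions the two index-expanded `latestComments` paths collapse into a single `latestComments.owner.username` entry, which is the kind of reduction the comment in the hunk describes.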

src/tools/common/storage_helpers.ts

Lines changed: 2 additions & 2 deletions
@@ -67,7 +67,7 @@ export function buildStorageListSummaryNextStep(params: {
 
 /**
  * Pagination-aware {summary, nextStep}: when more items remain, point at the next page;
- * otherwise point at get-dataset-schema for structure inspection.
+ * otherwise point at get-dataset for the field list (structure lives there, not in a heavy schema dump).
  */
 export function buildDatasetItemsSummaryNextStep(params: {
     datasetId: string;
@@ -88,7 +88,7 @@ export function buildDatasetItemsSummaryNextStep(params: {
         : `Fetched ${itemCount} of ${totalItemCount} items (offset=${offset}); no more pages.`;
     return {
         summary,
-        nextStep: `Use ${HelperTools.DATASET_SCHEMA_GET} with datasetId=${datasetId} to inspect structure if needed.`,
+        nextStep: `Use ${HelperTools.DATASET_GET} with datasetId=${datasetId} to see the field list if you need the data structure.`,
     };
 }
 
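
A minimal sketch of the pagination-aware behavior the doc comment describes, assuming the tool names are the literal strings `get-dataset` and `get-dataset-items`. `summaryNextStepSketch` and `PageInfoSketch` are hypothetical, and the wording of the "more pages remain" branch is invented for illustration because the diff does not show it. Only the two last-page messages are taken from this commit and its tests.

```typescript
// Hypothetical sketch; the real buildDatasetItemsSummaryNextStep is only partly visible in the diff.
interface PageInfoSketch {
    datasetId: string;
    itemCount: number; // items returned in this page
    totalItemCount: number; // items in the whole dataset
    offset: number; // offset of this page
}

function summaryNextStepSketch(page: PageInfoSketch): { summary: string; nextStep: string } {
    const { datasetId, itemCount, totalItemCount, offset } = page;
    const nextOffset = offset + itemCount;

    if (nextOffset < totalItemCount) {
        // Assumed wording: more items remain, so point at the next page.
        return {
            summary: `Fetched ${itemCount} of ${totalItemCount} items (offset=${offset}).`,
            nextStep: `Use get-dataset-items with datasetId=${datasetId} and offset=${nextOffset} to fetch the next page.`,
        };
    }

    // Last page: point at get-dataset for the field list instead of get-dataset-schema.
    const summary = offset === 0
        ? `Fetched all ${itemCount} items.`
        : `Fetched ${itemCount} of ${totalItemCount} items (offset=${offset}); no more pages.`;
    return {
        summary,
        nextStep: `Use get-dataset with datasetId=${datasetId} to see the field list if you need the data structure.`,
    };
}

console.log(summaryNextStepSketch({ datasetId: 'ds-1', itemCount: 1, totalItemCount: 1, offset: 0 }));
```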

src/tools/structured_output_schemas.ts

Lines changed: 2 additions & 1 deletion
@@ -541,7 +541,8 @@ export const datasetItemsOutputSchema = {
 
 /**
  * Schema for dataset metadata (get-dataset). Documents the fields the LLM acts on; the raw API
- * response carries more keys (stats, schema, access settings), allowed as additional properties.
+ * response carries more keys (stats, access settings), allowed as additional properties.
+ * The raw `schema` key is stripped by the tool — get-dataset-schema owns schema output (#882).
  */
 export const datasetMetadataOutputSchema = {
     type: 'object' as const,

tests/unit/tools.get_dataset.test.ts

Lines changed: 15 additions & 0 deletions
@@ -54,6 +54,21 @@ describe('get-dataset', () => {
         expect(content[0].text).toContain("Dataset 'missing' not found");
     });
 
+    it('strips the raw schema field from the response', async () => {
+        // Calibration probe (#882): raw `dataset.schema` was 93–95% of the response bytes on
+        // top store Actors and declares fields absent from the data. get-dataset-schema is
+        // the schema source; this tool returns metadata only.
+        const result = await (getDataset as HelperTool).call(
+            stubToolCallContext(
+                { datasetId: 'ds-1' },
+                stubApifyClient({ ...MOCK_DATASET, schema: { fields: {}, views: {} } }),
+            ),
+        );
+        const { structuredContent } = result as { structuredContent: Record<string, unknown> };
+        expect(structuredContent).not.toHaveProperty('schema');
+        expect(structuredContent).toMatchObject(MOCK_DATASET);
+    });
+
     it('rejects empty datasetId via ajv validation', () => {
         const tool = getDataset as HelperTool;
         expect(tool.ajvValidate({ datasetId: '' })).toBe(false);
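
The 93–95% figure quoted in the test comment comes from a calibration probe that is not part of this commit. A rough way to estimate such a share, assuming the serialized JSON length is an acceptable proxy for response bytes (exact only for ASCII-only payloads), could look like the sketch below. `schemaShareSketch` is a hypothetical helper, not the probe that was actually used.

```typescript
// Hypothetical sketch; not the calibration probe referenced in #882.
// Returns the fraction of the serialized response taken up by the `schema` value.
function schemaShareSketch(dataset: Record<string, unknown>): number {
    if (dataset.schema === undefined) {
        return 0;
    }
    const totalLength = JSON.stringify(dataset).length;
    // Ignores the small overhead of the `"schema":` key and its separating comma.
    const schemaLength = JSON.stringify(dataset.schema).length;
    return schemaLength / totalLength;
}

const share = schemaShareSketch({ id: 'a', schema: { x: 'yyyyyyyyyy' } });
console.log(`schema share: ${(share * 100).toFixed(1)}%`);
```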

tests/unit/tools.get_dataset_items.test.ts

Lines changed: 2 additions & 2 deletions
@@ -181,7 +181,7 @@ describe('get-dataset-items', () => {
         expect(structuredContent.datasetId).toBe('user~my-dataset');
     });
 
-    it('emits a last-page summary and a schema nextStep when all items are returned', async () => {
+    it('emits a last-page summary and a get-dataset nextStep when all items are returned', async () => {
         const result = await (getDatasetItems as HelperTool).call(
             stubToolCallContext({ datasetId: 'ds-1' }, stubApifyClient()),
         );
@@ -190,7 +190,7 @@ describe('get-dataset-items', () => {
         };
 
         expect(structuredContent.summary).toBe('Fetched all 1 items.');
-        expect(structuredContent.nextStep).toContain(HelperTools.DATASET_SCHEMA_GET);
+        expect(structuredContent.nextStep).toContain(HelperTools.DATASET_GET);
         expect(structuredContent.nextStep).toContain('datasetId=ds-1');
         // summary + nextStep ship as a separate text block after the fenced data.
         expect(content[1].text).toBe(`${structuredContent.summary}\n${structuredContent.nextStep}`);
