Commit db6255f

prosdev and claude committed
fix(core): revert Linear Merge chunking — breaks merge semantics
Chunking Linear Merge causes each chunk to delete the previous chunk's records (server thinks each subset is the full dataset). Reverted to single-call approach.

The Antfly payload size limit (~6k docs) is an Antfly-side issue that needs a fix in the server (raise JSON body limit or support streaming). Tracked in scratchpad.

chunk() utility kept: useful elsewhere.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent ab41cd8 commit db6255f
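
The failure mode described in the commit message can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: `FakeTable` and its `merge` method are hypothetical stand-ins that implement the semantics as the commit describes them (records absent from the submitted set are deleted). They are not Antfly's implementation or API.

```typescript
// Hypothetical result shape, modeled on the counters used in the diff.
type MergeResult = { upserted: number; skipped: number; deleted: number };

// Hypothetical in-memory table with "submitted set is the full dataset" semantics.
class FakeTable {
  private rows = new Map<string, string>();

  merge(records: Record<string, string>): MergeResult {
    const result: MergeResult = { upserted: 0, skipped: 0, deleted: 0 };
    const incoming = new Set(Object.keys(records));

    // Upsert every submitted record; unchanged ones count as skipped.
    for (const id of incoming) {
      if (this.rows.get(id) === records[id]) {
        result.skipped++;
      } else {
        this.rows.set(id, records[id]);
        result.upserted++;
      }
    }

    // Delete every stored record that was not part of this call.
    for (const id of [...this.rows.keys()]) {
      if (!incoming.has(id)) {
        this.rows.delete(id);
        result.deleted++;
      }
    }
    return result;
  }

  keys(): string[] {
    return [...this.rows.keys()].sort();
  }
}

// Single call: all four records survive.
const whole = new FakeTable();
whole.merge({ a: 'A', b: 'B', c: 'C', d: 'D' });
const wholeKeys = whole.keys();

// Two chunks: the second call deletes what the first call wrote.
const chunked = new FakeTable();
chunked.merge({ a: 'A', b: 'B' });
const second = chunked.merge({ c: 'C', d: 'D' });
const chunkedKeys = chunked.keys();

console.log(wholeKeys, chunkedKeys, second);
```

Under this model the chunked table ends up holding only the last chunk, which matches the reason given for the revert.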

2 files changed

Lines changed: 41 additions & 71 deletions

.claude/scratchpad.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@
 
 - **`getDocsByFilePath` fetches all docs client-side (capped at 5k).** Uses `getAll(limit: 5000)` + exact path filter. Fine for single repos (dev-agent has ~2,200 docs). Won't scale to monorepos with 50k+ files. Future fix: server-side path filter in Antfly SDK.
 - **Two clones of the same repo share one index.** Storage path is hashed from git remote URL (`prosdevlab/dev-agent` → `a1b2c3d4`). Two local clones on different branches share the same index, graph cache, and watcher snapshot. Stale data possible if branches diverge significantly. Pre-existing design — not introduced by graph cache. Fix would be to include branch or worktree path in the hash.
-- **Antfly Linear Merge fails on large batch sizes (~6k+ docs).** Tested with cli/cli (5,933 docs): `decoding request: json: string unexpected end of JSON input`. The scanner completes successfully but Antfly's HTTP endpoint fails to process the full payload. Workaround: none currently — the full index fails. Fix options: (1) batch the linearMerge into chunks of ~3k docs, (2) raise the limit on Antfly side, (3) stream instead of single POST. This blocks indexing medium-large Go/Rust repos (>~5k components).
+- **Antfly Linear Merge fails on large JSON payloads (~6k+ docs).** Tested with cli/cli (5,933 docs): `decoding request: json: string unexpected end of JSON input`. The scanner completes successfully but Antfly's HTTP endpoint can't parse the JSON body. Chunking is NOT a viable fix — Linear Merge semantics require ALL records in one call (the server deletes records not in the set, so each chunk deletes the previous chunk's data). Fix must be Antfly-side: raise the JSON body size limit, or support streaming/chunked transfer encoding. File a ticket with Antfly. Blocks indexing repos with >~5k components.
 - **Rust/Go callee extraction does not resolve target files.** tree-sitter callees have `name` and `line` but no `file` field (unlike ts-morph which resolves cross-file references). This means `dev_map` hot paths show 0 refs for Rust/Go repos, and `dev_refs --depends-on` won't trace cross-file paths. The dependency graph only has edges when callees include a `file` field. Future: cross-file resolution for tree-sitter languages.
 
 ## Open Questions

packages/core/src/vector/antfly-store.ts

Lines changed: 40 additions & 70 deletions
@@ -7,7 +7,6 @@
  */
 
 import { AntflyClient } from '@antfly/sdk';
-import { chunk } from '../utils/chunking';
 import type {
   EmbeddingDocument,
   SearchOptions,
@@ -304,13 +303,6 @@ export class AntflyVectorStore implements VectorStore {
    * Use ONLY for full-index operations. For incremental updates, use batchUpsertAndDelete().
    * Records must be sorted lexicographically by key (handled internally).
    */
-  /**
-   * Maximum documents per Linear Merge HTTP request.
-   * Antfly's endpoint fails on large JSON payloads (~6k+ docs).
-   * Chunking into smaller batches avoids the limit.
-   */
-  private static readonly MERGE_BATCH_SIZE = 3000;
-
   async linearMerge(
     documents: EmbeddingDocument[],
     lastMergedId = '',
@@ -322,21 +314,51 @@ export class AntflyVectorStore implements VectorStore {
     this.assertReady();
 
     const sorted = [...documents].sort((a, b) => a.id.localeCompare(b.id));
+    const records: Record<string, unknown> = {};
+    for (const doc of sorted) {
+      records[doc.id] = { text: doc.text, metadata: JSON.stringify(doc.metadata) };
+    }
+
     const total = documents.length;
     const totals: LinearMergeResult = { upserted: 0, skipped: 0, deleted: 0 };
-
-    // Chunk documents to avoid Antfly HTTP payload size limit
-    const chunks = chunk(sorted, AntflyVectorStore.MERGE_BATCH_SIZE);
+    let cursor = lastMergedId;
 
     try {
-      for (const chunk of chunks) {
-        const result = await this.linearMergeChunk(chunk, lastMergedId);
-        totals.upserted += result.upserted;
-        totals.skipped += result.skipped;
-        totals.deleted += result.deleted;
-        if (result.took) totals.took = (totals.took ?? 0) + result.took;
+      const raw = this.client.getRawClient();
+      do {
+        const result = await raw.POST('/tables/{tableName}/merge', {
+          params: { path: { tableName: this.cfg.table } },
+          body: { records, last_merged_id: cursor },
+        });
+
+        if (result.error) {
+          throw new Error(
+            typeof result.error === 'object' && 'error' in result.error
+              ? String((result.error as Record<string, unknown>).error)
+              : String(result.error)
+          );
+        }
+
+        const data = result.data;
+        if (!data) {
+          throw new Error('Linear Merge returned no data');
+        }
+
+        totals.upserted += data.upserted ?? 0;
+        totals.skipped += data.skipped ?? 0;
+        totals.deleted += data.deleted ?? 0;
+        if (data.took) totals.took = (totals.took ?? 0) + data.took;
+
         onProgress?.(totals.upserted + totals.skipped, total);
-      }
+
+        if (data.status === 'partial' && data.next_cursor) {
+          cursor = data.next_cursor;
+        } else {
+          break;
+        }
+        // biome-ignore lint/correctness/noConstantCondition: pagination loop exits via break
+      } while (true);
+
       return totals;
     } catch (error) {
       throw new Error(
@@ -345,58 +367,6 @@ export class AntflyVectorStore implements VectorStore {
     }
   }
 
-  /**
-   * Merge a single chunk of documents via Antfly's merge endpoint.
-   * Handles server-side pagination (status: "partial" + next_cursor).
-   */
-  private async linearMergeChunk(
-    chunk: EmbeddingDocument[],
-    lastMergedId: string
-  ): Promise<LinearMergeResult> {
-    const records: Record<string, unknown> = {};
-    for (const doc of chunk) {
-      records[doc.id] = { text: doc.text, metadata: JSON.stringify(doc.metadata) };
-    }
-
-    const totals: LinearMergeResult = { upserted: 0, skipped: 0, deleted: 0 };
-    let cursor = lastMergedId;
-
-    const raw = this.client.getRawClient();
-    do {
-      const result = await raw.POST('/tables/{tableName}/merge', {
-        params: { path: { tableName: this.cfg.table } },
-        body: { records, last_merged_id: cursor },
-      });
-
-      if (result.error) {
-        throw new Error(
-          typeof result.error === 'object' && 'error' in result.error
-            ? String((result.error as Record<string, unknown>).error)
-            : String(result.error)
-        );
-      }
-
-      const data = result.data;
-      if (!data) {
-        throw new Error('Linear Merge returned no data');
-      }
-
-      totals.upserted += data.upserted ?? 0;
-      totals.skipped += data.skipped ?? 0;
-      totals.deleted += data.deleted ?? 0;
-      if (data.took) totals.took = (totals.took ?? 0) + data.took;
-
-      if (data.status === 'partial' && data.next_cursor) {
-        cursor = data.next_cursor;
-      } else {
-        break;
-      }
-      // biome-ignore lint/correctness/noConstantCondition: pagination loop exits via break
-    } while (true);
-
-    return totals;
-  }
-
   /**
    * Combined upsert + delete in a single batchOp call.
    * Safe for incremental updates and concurrent calls.
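
The restored `linearMerge` resends the same full record set on every request and only advances the cursor while the server reports `status: 'partial'`. The loop can be exercised in isolation as sketched below. `MergePage`, `PostMerge`, `mergeAllPages`, and `makeFakeServer` are hypothetical names for this example, and the fake server's paging behavior (a fixed number of keys per call, resuming after the cursor) is an assumption rather than documented Antfly behavior.

```typescript
// Hypothetical response shape, reduced to the fields the loop needs.
type MergePage = {
  status: 'success' | 'partial';
  next_cursor?: string;
  upserted: number;
};

// Hypothetical transport: stands in for the raw POST to the merge endpoint.
type PostMerge = (cursor: string) => Promise<MergePage>;

// Follows next_cursor until the server stops reporting a partial result.
async function mergeAllPages(
  post: PostMerge,
  startCursor = ''
): Promise<{ upserted: number; calls: number }> {
  let cursor = startCursor;
  let upserted = 0;
  let calls = 0;

  while (true) {
    const page = await post(cursor);
    calls++;
    upserted += page.upserted;

    if (page.status === 'partial' && page.next_cursor) {
      cursor = page.next_cursor;
    } else {
      break;
    }
  }
  return { upserted, calls };
}

// Fake server (assumption): processes `pageSize` keys per call, in sorted
// order, starting strictly after the cursor it was given.
function makeFakeServer(keys: string[], pageSize: number): PostMerge {
  const sorted = [...keys].sort();
  return async (cursor: string): Promise<MergePage> => {
    const remaining = sorted.filter((k) => k > cursor);
    const page = remaining.slice(0, pageSize);
    if (remaining.length <= pageSize) {
      return { status: 'success', upserted: page.length };
    }
    return {
      status: 'partial',
      next_cursor: page[page.length - 1],
      upserted: page.length,
    };
  };
}
```

Calling `mergeAllPages(makeFakeServer(['a', 'b', 'c', 'd', 'e'], 2))` walks three pages in this model. The important difference from client-side chunking is that the request body never shrinks; only the cursor changes between calls.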
