diff --git a/CHANGELOG.md b/CHANGELOG.md index cebdb62..444e811 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Changed +- ⚠️ BREAKING: `vault.read` is now **faithful by default** (ADR-203, [#133](https://github.com/aaronsb/obsidian-mcp-plugin/issues/133)). It returns the **complete, byte-exact file source** (no more newline-flattened fragments) when the file fits a ~50k-char budget. Large files return a **verbatim page 1 with absolute line bookends** (`page=N` to continue) instead of a context-breaking raw dump. `returnFullFile: true` is the explicit whole-large-file override; `query`/`strategy`/`maxFragments` still return semantic fragments. The structured envelope no longer double-encodes the body. Clients that relied on the old fragmented default should pass fragment params explicitly. + ### Security - πŸ”΄ CRITICAL: Identified authentication vulnerability - no API key validation ([#9](https://github.com/aaronsb/obsidian-mcp-plugin/issues/9)) - πŸ”΄ CRITICAL: Identified path traversal vulnerability in file operations ([#10](https://github.com/aaronsb/obsidian-mcp-plugin/issues/10)) diff --git a/docs/architecture/tools/ADR-203-faithful-by-default-content-reads-concise-by-default-tool-results.md b/docs/architecture/tools/ADR-203-faithful-by-default-content-reads-concise-by-default-tool-results.md index 516c603..7f81110 100644 --- a/docs/architecture/tools/ADR-203-faithful-by-default-content-reads-concise-by-default-tool-results.md +++ b/docs/architecture/tools/ADR-203-faithful-by-default-content-reads-concise-by-default-tool-results.md @@ -66,14 +66,43 @@ from **action/result rendering** rather than by one blunt format switch. Adopt a **two-bucket response contract**. This ADR is the decision of record; implementation is tracked separately and is *not* part of this ADR. -1. **Content reads are faithful by default.** `vault.read` returns the - **complete, verbatim file source** by default β€” byte-exact as stored by - Obsidian (newlines, indentation, trailing whitespace, tabs, fenced - blocks, frontmatter delimiters preserved). Fragmentation / summarization - becomes an **explicit opt-in** for large-file context savings via the - *existing* knobs (`query`, `strategy`, `maxFragments`); `returnFullFile` - is no longer required to get the truth and is retired or demoted to a - no-op alias. The readβ†’edit round-trip is correct by construction. +1. **Content reads are faithful by default β€” but never context-breaking.** + `vault.read` returns **byte-exact source** as stored by Obsidian + (newlines, indentation, trailing whitespace, tabs, fenced blocks, + frontmatter delimiters preserved β€” never flattened/fragmented unless + fragments are explicitly requested). The hard invariant: **a default read + must not hand the agent a context-breaking raw dump.** + + The paginate/no-paginate boundary is a **character budget** + (`READ_PAGE_CHARS`, default **50000** β‰ˆ ~12k tokens, one tunable + constant) β€” *not* a line count. A line count is an invalid proxy for the + thing being bounded: 1500 five-word lines and 1500 hundred-word lines + have wildly different context cost. **Bookends stay line-based** (for + `edit.at_line`); the two reconcile by defining a *page* as the longest + run of **whole lines whose cumulative size ≀ `READ_PAGE_CHARS`**, + reported with the line range it covers. Lines are never split mid-line. + + - **Fits the budget (the common case β€” most notes):** the **whole file, + verbatim, in one load**. No pagination structure, no ceremony. + - **Exceeds the budget:** **page 1 only** β€” a single *contiguous* + verbatim block: the longest whole-line run within `READ_PAGE_CHARS` + (not an array of chunk objects), carrying **bookends**: `lineStart`, + `lineEnd`, `totalLines`, `bytes`, and a `nextPage` hint + (`vault.read(path, page=2)`). Absolute line numbers are preserved, so + `edit.at_line` on a large file still works precisely and the agent + pages forward instead of a truncated preview, lossy fragments, or a + context bomb. (A single line that alone exceeds the budget is returned + whole as its own page β€” never split β€” flagged; correctness wins over + the budget in that rare case.) + - **`returnFullFile: true`:** explicit override β€” the entire file + verbatim regardless of size. `returnFullFile` is therefore **retained + and repurposed** (not retired): it is the deliberate large-context + opt-in, used when the agent knowingly accepts the cost. + - **`query` / `strategy` / `maxFragments`:** semantic fragment retrieval, + unchanged. + + The readβ†’edit round-trip is correct by construction in every branch + (per-page content is byte-exact for its line range). 2. **Action/result outputs stay concise-formatted by default.** Tools whose result is a status / confirmation / list (create, update, delete, move, @@ -94,12 +123,13 @@ record; implementation is tracked separately and is *not* part of this ADR. `_Formatter error_` fallback; the failing `returnFullFile` formatted branch is fixed or removed as part of (1). -Backward compatibility: (1) changes the default shape/size of `vault.read` -for clients that currently rely on the fragmented/flattened default. This -is a deliberate breaking change to fix a correctness bug; it ships with a -clear changelog entry and a documented opt-in (`strategy`/`maxFragments`) -for the previous context-saving behaviour. A prerelease + functional -verification of a real readβ†’`edit.window` round-trip gates the rollout. +Backward compatibility: (1) changes the default shape of `vault.read` for +clients relying on the fragmented/flattened default. Deliberate breaking +change to fix a correctness bug; ships with a changelog entry, the +`strategy`/`maxFragments` opt-in for the old context-saving behaviour, and +`returnFullFile:true` for whole-large-file access. A prerelease + functional +verification of a real readβ†’`edit.window` round-trip (incl. a large file +exercising bookended page 1 β†’ `at_line`) gates the rollout. #133 is **reframed, not closed by fiat**: it remains the user-facing tracking issue for bucket 1; this ADR records the agreed direction and @@ -124,19 +154,24 @@ explicitly supersedes its specific "global `raw` flip" remedy. default `vault.read` (smaller, lossy payloads). Mitigated by changelog, the existing `strategy`/`maxFragments` opt-in for context savings, and a gated prerelease test. -- Large files now return full content by default β†’ higher per-call context - cost for the read-everything pattern. Mitigated: fragmentation is still - one explicit parameter away, and a `wordCount`/size `warning` is retained - to nudge large-file callers toward it. +- Large files no longer return verbatim by default β€” they return a + bookended page 1. This is a behavioural change for any client that read a + large file in one default call; `returnFullFile:true` restores that + explicitly. Accepted: a context-breaking default is the failure this ADR + exists to prevent. +- Pagination adds a small surface to `vault.read` (`page` param, + `lineStart/lineEnd/totalLines/bytes/nextPage` fields). Justified: it's + what lets large-file `edit.at_line` keep working instead of forcing + fragments. ### Neutral - Two-bucket contract should be documented in the README and tool descriptions so the default semantics are discoverable. -- `returnFullFile` becomes redundant; retire or alias with a deprecation - note. -- No change to `IObsidianAPI` or the MCP tool surface beyond the default - response shape/size of `vault.read` and the de-duplicated envelope. +- `returnFullFile` is retained and repurposed as the explicit + whole-large-file override (not retired). +- No change to `IObsidianAPI`; the MCP `vault` tool gains a `page` + parameter and the read response gains pagination bookends. ## Alternatives Considered @@ -155,3 +190,15 @@ explicitly supersedes its specific "global `raw` flip" remedy. - **New dedicated `vault.source` tool, leave `read` lossy.** Rejected: splits the surface and leaves the obvious tool (`read`) a footgun; the principled fix is making the obvious tool correct by default. +- **Return full verbatim regardless of size (no guard).** Rejected + outright: a large file dumped raw breaks the agent's context β€” this is + the exact failure mode the pagination guard exists to prevent, and the + reason `returnFullFile:true` is an *explicit, knowing* opt-in. +- **Line-count page budget (e.g. 1500 lines).** Rejected: line count does + not bound context β€” 1500 short lines vs 1500 long lines differ by orders + of magnitude. The budget must be size-based (`READ_PAGE_CHARS`); only the + *bookends* are line-based, for `at_line`. +- **Array of page/chunk objects in one response.** Rejected: that is just + the fragmented default in another shape and is not "one load." A page is + a single contiguous verbatim block; multi-page access is sequential via + the `page` parameter. diff --git a/src/formatters/vault.ts b/src/formatters/vault.ts index fdb3059..0f45153 100644 --- a/src/formatters/vault.ts +++ b/src/formatters/vault.ts @@ -150,13 +150,15 @@ export interface FileReadFragment { } export interface FileReadResponse { - path: string; + path?: string; content: string | FileReadFragment[]; metadata?: { - size: number; - modified: number; + size?: number; + modified?: number; created?: number; - extension: string; + extension?: string; + totalLines?: number; + bytes?: number; }; frontmatter?: Record; tags?: string[]; @@ -166,22 +168,41 @@ export interface FileReadResponse { strategy: string; query?: string; }; + pagination?: { + paginated: boolean; + page: number; + pageLineStart: number; + pageLineEnd: number; + totalLines: number; + bytes: number; + hasMore: boolean; + nextPage: string | null; + oversizedLine?: boolean; + beyondEnd?: boolean; + }; + warning?: string; } export function formatFileRead(response: FileReadResponse): string { - const { path, content, metadata, frontmatter, tags, fragmentMetadata } = response; + const { path, content, metadata, frontmatter, tags, fragmentMetadata, pagination } = response; const lines: string[] = []; - const fileName = path.split('/').pop() || path; + const safePath = path || 'file'; + const fileName = safePath.split('/').pop() || safePath; lines.push(header(1, `File: ${fileName}`)); lines.push(''); // Metadata summary - lines.push(property('Path', path, 0)); - if (metadata) { + lines.push(property('Path', safePath, 0)); + if (metadata && typeof metadata.size === 'number') { lines.push(property('Size', formatFileSize(metadata.size), 0)); + } + if (metadata && typeof metadata.modified === 'number') { lines.push(property('Modified', formatDate(metadata.modified), 0)); } + if (metadata && typeof metadata.totalLines === 'number') { + lines.push(property('Lines', String(metadata.totalLines), 0)); + } // Tags if (tags && tags.length > 0) { @@ -217,6 +238,17 @@ export function formatFileRead(response: FileReadResponse): string { // Content - handle both string and fragment array formats lines.push(''); + if (content != null && typeof content !== 'string' && !Array.isArray(content)) { + // Unrecognized/binary structured passthrough β€” render a safe note + // instead of falling through to the _Formatter error_ fallback. + lines.push(header(2, 'Content')); + lines.push(''); + lines.push('_(non-text content; use `raw: true` for the structured payload)_'); + lines.push(divider()); + lines.push(summaryFooter()); + return joinLines(lines); + } + if (Array.isArray(content)) { // Fragment-based response const fragments = content; @@ -239,18 +271,35 @@ export function formatFileRead(response: FileReadResponse): string { lines.push(`... and ${fragments.length - 5} more fragments`); } } else { - // Simple string content + // Verbatim string content (ADR-203: content reads are faithful by + // default β€” the formatted default path must NOT truncate, or an agent + // cannot derive a byte-matching edit.window oldText without raw:true, + // which is the exact #133 friction this ADR retires). The data layer + // already bounds size to READ_PAGE_CHARS (or it's an explicit + // returnFullFile override), so emitting the full block is safe. No + // wrapping code fence: the body may itself contain ``` fences, and a + // wrapper would corrupt round-trip fidelity. lines.push(header(2, 'Content')); lines.push(''); + lines.push(content); + } - const contentLines = content.split('\n'); - const previewLines = contentLines.slice(0, 50); - lines.push('```markdown'); - lines.push(previewLines.join('\n')); - if (contentLines.length > 50) { - lines.push(`\n... (${contentLines.length - 50} more lines)`); + if (pagination && pagination.paginated) { + lines.push(''); + lines.push(header(2, 'Pagination')); + lines.push(property('Page', `${pagination.page} (lines ${pagination.pageLineStart}-${pagination.pageLineEnd} of ${pagination.totalLines})`, 0)); + if (pagination.hasMore && pagination.nextPage) { + lines.push(property('Next', pagination.nextPage, 0)); } - lines.push('```'); + if (pagination.beyondEnd) { + lines.push(' (requested page is past end of file)'); + } + lines.push(' returnFullFile=true for the whole file Β· query/strategy/maxFragments for fragments Β· line numbers are absolute (edit.at_line works)'); + } + + if (response.warning) { + lines.push(''); + lines.push(`> ${response.warning}`); } lines.push(divider()); diff --git a/src/semantic/operations/vault.ts b/src/semantic/operations/vault.ts index 63b98a8..fa0d168 100644 --- a/src/semantic/operations/vault.ts +++ b/src/semantic/operations/vault.ts @@ -46,6 +46,7 @@ export async function executeVaultOperation(ctx: RouterContext, action: string, return await readFileWithFragments(ctx.api, ctx.fragmentRetriever, { path, returnFullFile: paramBool(params, 'returnFullFile'), + page: paramNum(params, 'page'), query: paramStr(params, 'query'), strategy, maxFragments: paramNum(params, 'maxFragments') diff --git a/src/tools/semantic-tools.ts b/src/tools/semantic-tools.ts index 0513f2e..e446b32 100644 --- a/src/tools/semantic-tools.ts +++ b/src/tools/semantic-tools.ts @@ -433,7 +433,7 @@ function getParametersForOperation(operation: string): Record { }, page: { type: 'number', - description: 'Page number for paginated results' + description: 'Page number for paginated results (list/search; also large-file read β€” see returnFullFile)' }, pageSize: { type: 'number', @@ -450,7 +450,7 @@ function getParametersForOperation(operation: string): Record { }, returnFullFile: { type: 'boolean', - description: 'Return full file instead of fragments (WARNING: large files can consume significant context)' + description: 'read: force the ENTIRE file verbatim regardless of size (explicit large-context override). Default read already returns the whole file verbatim when it fits the size budget; large files return a verbatim page 1 with absolute line bookends (use page=N to continue, or query/strategy/maxFragments for fragments).' }, includeContent: { type: 'boolean', diff --git a/src/utils/file-reader.ts b/src/utils/file-reader.ts index e1660b5..01aa35f 100644 --- a/src/utils/file-reader.ts +++ b/src/utils/file-reader.ts @@ -2,18 +2,47 @@ import { ObsidianAPI } from './obsidian-api'; import { isImageFile } from '../types/obsidian'; import { UniversalFragmentRetriever } from '../indexing/fragment-retriever'; +/** + * Character budget that decides whole-file vs. paginated reads (ADR-203). + * + * Size-based on purpose: a line count is an invalid proxy for context cost + * (1500 short lines vs 1500 long lines differ by orders of magnitude). Only + * the *bookends* are line-based, so `edit.at_line` keeps working on a large + * file. ~50k chars β‰ˆ ~12k tokens β€” generous enough that the overwhelming + * majority of notes return whole in one load. + */ +export const READ_PAGE_CHARS = 50000; + interface FileReadOptions { path: string; + /** Explicit whole-large-file override (ADR-203): full verbatim regardless of size. */ returnFullFile?: boolean; + /** Sequential page (1-based) for large files that exceed READ_PAGE_CHARS. */ + page?: number; query?: string; strategy?: 'auto' | 'adaptive' | 'proximity' | 'semantic'; maxFragments?: number; } interface FileReadResult { + path?: string; content?: unknown; metadata?: unknown; + frontmatter?: unknown; + tags?: unknown; originalContentLength?: number; + pagination?: { + paginated: boolean; + page: number; + pageLineStart: number; + pageLineEnd: number; + totalLines: number; + bytes: number; + hasMore: boolean; + nextPage: string | null; + oversizedLine?: boolean; + beyondEnd?: boolean; + }; fragmentMetadata?: { totalFragments: number; strategy: string; @@ -28,82 +57,211 @@ interface FileReadResult { } /** - * Shared file reading logic with fragment support - * Used by both classic tools and semantic operations + * Build one page: the longest run of whole lines (starting at `startIdx`, + * 0-based) whose joined size stays within READ_PAGE_CHARS. A single line + * larger than the budget is returned whole as its own page (never split). + */ +function buildPage(lines: string[], startIdx: number): { + text: string; + lineStart: number; + lineEnd: number; + nextIdx: number; + oversizedLine: boolean; +} { + const parts: string[] = []; + let size = 0; + let i = startIdx; + let oversizedLine = false; + while (i < lines.length) { + const ln = lines[i]; + const candidate = parts.length === 0 ? ln.length : size + 1 + ln.length; + if (candidate > READ_PAGE_CHARS && parts.length > 0) break; + if (candidate > READ_PAGE_CHARS && parts.length === 0) oversizedLine = true; + parts.push(ln); + size = candidate; + i++; + if (oversizedLine) break; + } + return { + text: parts.join('\n'), + lineStart: startIdx + 1, + lineEnd: i, // 1-based inclusive end == count of lines consumed + nextIdx: i, + oversizedLine, + }; +} + +/** + * Shared file reading logic (ADR-203). + * + * Faithful by default, never context-breaking: + * - fragment params (query/strategy/maxFragments) β†’ semantic fragments + * - returnFullFile:true β†’ entire file verbatim (explicit large override) + * - fits READ_PAGE_CHARS β†’ entire file verbatim, one load (common case) + * - exceeds budget β†’ bookended page 1 (or `page` N): one contiguous + * verbatim block + line bookends so edit.at_line still works + * + * Content is byte-exact in every branch (never flattened); the structured + * envelope no longer double-encodes the body. */ export async function readFileWithFragments( api: ObsidianAPI, fragmentRetriever: UniversalFragmentRetriever, options: FileReadOptions ): Promise { - const { path, returnFullFile, query, strategy, maxFragments } = options; - - // Get the file + const { path, returnFullFile, page, query, strategy, maxFragments } = options; + const fileResponse = await api.getFile(path); - - // Check if it's an image file + + // Image / binary: passthrough unchanged if (isImageFile(fileResponse)) { return fileResponse as FileReadResult; } - - // Extract content from the response + + // Extract verbatim content + metadata (metadata WITHOUT a copy of the body) let fileContent: string; - let metadata: Record = {}; - + let metaNoBody: Record = {}; + let frontmatter: unknown; + let tags: unknown; + if (typeof fileResponse === 'string') { fileContent = fileResponse; } else if (fileResponse && typeof fileResponse === 'object' && 'content' in fileResponse) { - // Handle structured response from Obsidian API - fileContent = fileResponse.content; - metadata = { ...fileResponse }; - - // If it's still not a string (might be an image or binary file) - if (typeof fileContent !== 'string') { - return fileResponse as FileReadResult; + const fc = (fileResponse as { content: unknown }).content; + if (typeof fc !== 'string') { + return fileResponse as FileReadResult; // image/binary structured } + fileContent = fc; + // Strip the body so it is not embedded twice in the envelope (ADR-203 Β§3) + const { content: _body, frontmatter: fm, tags: tg, ...rest } = + fileResponse as unknown as Record; + void _body; + metaNoBody = rest; + frontmatter = fm; + tags = tg; } else { - // Handle other non-text files return fileResponse as FileReadResult; } - - // Return full file if requested - if (returnFullFile) { - const wordCount = fileContent.split(/\s+/).length; - + + const totalChars = fileContent.length; + const lines = fileContent.split('\n'); + const totalLines = lines.length; + + // 1. Explicit fragment retrieval (unchanged behaviour) + const wantsFragments = + query !== undefined || strategy !== undefined || maxFragments !== undefined; + if (wantsFragments) { + const docId = `file:${path}`; + fragmentRetriever.indexDocument(docId, path, fileContent); + const fragmentQuery = query || path.split('/').pop()?.replace('.md', '') || ''; + const fragmentResponse = fragmentRetriever.retrieveFragments(fragmentQuery, { + strategy: strategy || 'auto', + maxFragments: maxFragments || 5, + }); + return { + path, + ...metaNoBody, + frontmatter, + tags, + content: fragmentResponse.result, + originalContentLength: totalChars, + fragmentMetadata: { + totalFragments: fragmentResponse.result.length, + strategy: strategy || 'auto', + query: fragmentQuery, + }, + workflow: fragmentResponse.workflow, + efficiency_hints: fragmentResponse.efficiency_hints, + }; + } + + // 2. Whole file, one load β€” fits the budget OR explicit override + if (returnFullFile || totalChars <= READ_PAGE_CHARS) { + const overrideOnLarge = !!returnFullFile && totalChars > READ_PAGE_CHARS; return { - content: fileResponse, + path, + content: fileContent, // verbatim, single contiguous string + frontmatter, + tags, metadata: { - ...metadata, - wordCount, - warning: wordCount > 2000 ? - `This file contains ${wordCount} words. Consider using fragment retrieval (remove returnFullFile parameter) to reduce context consumption.` : - null - } + ...metaNoBody, + totalLines, + bytes: totalChars, + }, + pagination: { + paginated: false, + page: 1, + pageLineStart: 1, + pageLineEnd: totalLines, + totalLines, + bytes: totalChars, + hasMore: false, + nextPage: null, + }, + warning: overrideOnLarge + ? `Returned entire large file verbatim (${totalLines} lines, ${totalChars} bytes) via returnFullFile override.` + : undefined, }; } - - // Use fragment retrieval - const docId = `file:${path}`; - fragmentRetriever.indexDocument(docId, path, fileContent); - - // Retrieve relevant fragments based on query or path - const fragmentQuery = query || path.split('/').pop()?.replace('.md', '') || ''; - const fragmentResponse = fragmentRetriever.retrieveFragments(fragmentQuery, { - strategy: strategy || 'auto', - maxFragments: maxFragments || 5 - }); - - // Return structured response with fragments + + // 3. Large file, no override, no fragments β†’ bookended page + const requested = typeof page === 'number' && page >= 1 ? Math.floor(page) : 1; + let idx = 0; + let cur = 1; + let built = buildPage(lines, idx); + while (cur < requested && built.nextIdx < lines.length) { + idx = built.nextIdx; + cur++; + built = buildPage(lines, idx); + } + + // Requested a page past EOF + if (cur < requested) { + return { + path, + content: '', + frontmatter, + tags, + metadata: { ...metaNoBody, totalLines, bytes: totalChars }, + pagination: { + paginated: true, + page: requested, + pageLineStart: totalLines + 1, + pageLineEnd: totalLines, + totalLines, + bytes: totalChars, + hasMore: false, + nextPage: null, + beyondEnd: true, + }, + warning: `Requested page ${requested} is past end of file (file has ${totalLines} lines, last page is ${cur}).`, + }; + } + + const hasMore = built.nextIdx < lines.length; + const nextPageNum = cur + 1; return { - ...metadata, - content: fragmentResponse.result, - originalContentLength: fileContent.length, - fragmentMetadata: { - totalFragments: fragmentResponse.result.length, - strategy: strategy || 'auto', - query: fragmentQuery + path, + content: built.text, // contiguous verbatim block for this line range + frontmatter, + tags, + metadata: { ...metaNoBody, totalLines, bytes: totalChars }, + pagination: { + paginated: true, + page: cur, + pageLineStart: built.lineStart, + pageLineEnd: built.lineEnd, + totalLines, + bytes: totalChars, + hasMore, + nextPage: hasMore ? `vault.read(path='${path}', page=${nextPageNum})` : null, + oversizedLine: built.oversizedLine || undefined, }, - workflow: fragmentResponse.workflow, - efficiency_hints: fragmentResponse.efficiency_hints + warning: + `Large file (${totalLines} lines, ${totalChars} bytes). Returned page ${cur} ` + + `(lines ${built.lineStart}-${built.lineEnd}, verbatim). ` + + (hasMore ? `Use page=${nextPageNum} for more, ` : '') + + `returnFullFile=true for the whole file, or query/strategy/maxFragments for fragments. ` + + `Line numbers are absolute β€” edit.at_line works on this page.`, }; -} \ No newline at end of file +} diff --git a/tests/vault-read-fidelity.test.ts b/tests/vault-read-fidelity.test.ts new file mode 100644 index 0000000..cbb6f63 --- /dev/null +++ b/tests/vault-read-fidelity.test.ts @@ -0,0 +1,134 @@ +/** + * ADR-203 β€” faithful-by-default content reads with char-budget pagination + * and line bookends. Covers #133's intent + the large-raw guard. + */ +import { readFileWithFragments, READ_PAGE_CHARS } from '../src/utils/file-reader'; +import { formatFileRead } from '../src/formatters/vault'; +import { UniversalFragmentRetriever } from '../src/indexing/fragment-retriever'; +import { ObsidianAPI } from '../src/utils/obsidian-api'; +import { App } from 'obsidian'; + +class MockAPI extends ObsidianAPI { + files = new Map(); + constructor() { super({} as App); } + async getFile(path: string): Promise { + const content = this.files.get(path); + if (content === undefined) throw new Error(`not found: ${path}`); + return { path, content, tags: ['#demo'], frontmatter: { title: 'T' } }; + } +} + +const fr = () => new UniversalFragmentRetriever(); + +// A whitespace/structure-sensitive small file β€” the #133 fidelity case. +const TRICKY = '---\ntitle: T\n---\n\n# H\n\npara **b**\n\n```python\ndef f(x):\n return x*2 # indented\n```\n\ttab-line \n'; + +describe('vault.read fidelity & pagination (ADR-203)', () => { + test('small file: whole verbatim source, byte-exact, not paginated, body not duplicated', async () => { + const api = new MockAPI(); + api.files.set('s.md', TRICKY); + const r: any = await readFileWithFragments(api, fr(), { path: 's.md' }); + + expect(typeof r.content).toBe('string'); + expect(r.content).toBe(TRICKY); // exact bytes, no flatten + expect(r.pagination.paginated).toBe(false); + expect(r.pagination.totalLines).toBe(TRICKY.split('\n').length); + // metadata must NOT carry a second copy of the body (ADR-203 Β§3) + expect(JSON.stringify(r.metadata)).not.toContain('def f(x)'); + }); + + test('round-trip: a substring taken from the read matches the file for edit.window', async () => { + const api = new MockAPI(); + api.files.set('s.md', TRICKY); + const r: any = await readFileWithFragments(api, fr(), { path: 's.md' }); + // The exact indented code line an editing agent would target: + expect(r.content).toContain(' return x*2 # indented'); + expect(r.content.split('\n')).toContain('\ttab-line '); + }); + + const big = Array.from({ length: 4000 }, (_, i) => `line ${i + 1} ${'x'.repeat(20)}`).join('\n'); + + test('large file: default returns bookended page 1, not the whole dump', async () => { + const api = new MockAPI(); + api.files.set('big.md', big); + const r: any = await readFileWithFragments(api, fr(), { path: 'big.md' }); + + expect(r.pagination.paginated).toBe(true); + expect(r.pagination.page).toBe(1); + expect(r.pagination.pageLineStart).toBe(1); + expect(r.pagination.pageLineEnd).toBeLessThan(r.pagination.totalLines); + expect(r.pagination.hasMore).toBe(true); + expect(r.pagination.nextPage).toContain('page=2'); + // page content is bounded by the char budget (the agent-safety invariant) + expect((r.content as string).length).toBeLessThanOrEqual(READ_PAGE_CHARS); + // ...and is a verbatim contiguous prefix (line 1 present, exact) + expect((r.content as string).split('\n')[0]).toBe('line 1 ' + 'x'.repeat(20)); + }); + + test('large file: page 2 continues contiguously from page 1 (absolute line numbers)', async () => { + const api = new MockAPI(); + api.files.set('big.md', big); + const p1: any = await readFileWithFragments(api, fr(), { path: 'big.md', page: 1 }); + const p2: any = await readFileWithFragments(api, fr(), { path: 'big.md', page: 2 }); + + expect(p2.pagination.pageLineStart).toBe(p1.pagination.pageLineEnd + 1); + const firstLineOfP2 = `line ${p2.pagination.pageLineStart} ${'x'.repeat(20)}`; + expect((p2.content as string).split('\n')[0]).toBe(firstLineOfP2); + }); + + test('large file: returnFullFile=true overrides to the entire verbatim file', async () => { + const api = new MockAPI(); + api.files.set('big.md', big); + const r: any = await readFileWithFragments(api, fr(), { path: 'big.md', returnFullFile: true }); + expect(r.content).toBe(big); + expect(r.pagination.paginated).toBe(false); + expect(r.warning).toMatch(/returnFullFile override/i); + }); + + test('page past EOF is reported, not an error', async () => { + const api = new MockAPI(); + api.files.set('big.md', big); + const r: any = await readFileWithFragments(api, fr(), { path: 'big.md', page: 9999 }); + expect(r.pagination.beyondEnd).toBe(true); + expect(r.content).toBe(''); + expect(r.warning).toMatch(/past end of file/i); + }); + + test('fragment params still route to semantic fragments (unchanged)', async () => { + const api = new MockAPI(); + api.files.set('big.md', big); + const r: any = await readFileWithFragments(api, fr(), { path: 'big.md', maxFragments: 3 }); + expect(Array.isArray(r.content)).toBe(true); + expect(r.fragmentMetadata).toBeDefined(); + }); + + test('formatted (non-raw) default output is byte-faithful β€” no truncation, no fence corruption', () => { + // The blocking #133 case: an agent reading the DEFAULT (raw:false) + // formatted output must be able to lift a byte-exact edit.window oldText. + const out = formatFileRead({ + path: 's.md', content: TRICKY, + metadata: { totalLines: TRICKY.split('\n').length, bytes: TRICKY.length }, + pagination: { paginated: false, page: 1, pageLineStart: 1, pageLineEnd: TRICKY.split('\n').length, totalLines: TRICKY.split('\n').length, bytes: TRICKY.length, hasMore: false, nextPage: null }, + } as any); + // Verbatim body present in full, including the inner ```python fence, + // the indented code line, and the trailing-space + tab line. + expect(out).toContain(TRICKY); + expect(out).toContain(' return x*2 # indented'); + expect(out).toContain('\ttab-line '); + expect(out).not.toMatch(/more lines\)/); // no truncation marker + }); + + test('formatter renders paginated + edge shapes without the _Formatter error_ crash', () => { + const out2 = formatFileRead({ + path: 'big.md', content: 'line 1 ...\nline 2 ...', + metadata: { totalLines: 4000, bytes: 90000 }, + pagination: { paginated: true, page: 1, pageLineStart: 1, pageLineEnd: 1800, totalLines: 4000, bytes: 90000, hasMore: true, nextPage: "vault.read(path='big.md', page=2)" }, + warning: 'Large file …', + } as any); + expect(out2).toContain('Pagination'); + expect(out2).toContain('page=2'); + expect(out2).not.toContain('Formatter error'); + // non-text passthrough must not crash + expect(() => formatFileRead({ path: 'x.bin', content: { binary: true } } as any)).not.toThrow(); + }); +});