Skip to content

feat(vault.read): faithful-by-default reads with char-budget pagination (ADR-203, #133)#205

Merged
aaronsb merged 3 commits into
mainfrom
feat/203-faithful-paginated-reads
May 24, 2026
Merged

feat(vault.read): faithful-by-default reads with char-budget pagination (ADR-203, #133)#205
aaronsb merged 3 commits into
mainfrom
feat/203-faithful-paginated-reads

Conversation

@aaronsb
Copy link
Copy Markdown
Owner

@aaronsb aaronsb commented May 19, 2026

Implements ADR-203 (Accepted, amended through design dialogue). Resolves the #133 intent + the large-raw-context guard.

Contract

Condition Result
Fits READ_PAGE_CHARS (50000 ≈ ~12k tok) Whole file, byte-exact, one load (common case)
Exceeds budget Verbatim page 1, single contiguous block + line bookends (lineStart/lineEnd/totalLines/nextPage); page=N continues
returnFullFile: true Entire file verbatim (explicit large-context override — param retained & repurposed)
query/strategy/maxFragments Semantic fragments (unchanged)
  • Budget is char-based on purpose: line count is an invalid proxy for context cost (1500 short vs 1500 long lines). Only bookends are line-based so edit.at_line still works on a large file.
  • Hard invariant: a default read never hands the agent a context-breaking raw dump.
  • Structured envelope no longer double-encodes the body (metadata sans body — ~halves raw read payload).
  • Fixes the latent formatFileRead crash (_Formatter error_) on full-file shapes; formatter now renders verbatim + a Pagination section.

Files

  • src/utils/file-reader.ts — rewritten to the contract (READ_PAGE_CHARS, buildPage, the four branches).
  • src/semantic/operations/vault.ts — threads page into the read options.
  • src/formatters/vault.tsFileReadResponse extended; formatFileRead hardened (guarded size/modified, pagination block, no crash).
  • src/tools/semantic-tools.tsreturnFullFile/page descriptions reflect the new semantics.
  • CHANGELOG.md — breaking-change entry.
  • tests/vault-read-fidelity.test.ts — 8 tests: byte-exact small read, code-fence/tab round-trip, bookended page 1, contiguous page 2, returnFullFile override, page-past-EOF, fragments-unchanged, formatter-no-crash.

Verification

make check: 0 errors, baseline 5 warnings (no new), 243/243 (21 suites). tsc clean.

Rollout

Per ADR-203 + the standing process: a prerelease will be cut for a BRAT functional test of a real read→edit.window round-trip and a large-file bookended-page-1 → at_line before this is considered field-validated. #133 stays open until that confirms, then is closed/reframed.

Refs ADR-203, #133.

aaronsb added 2 commits May 19, 2026 00:24
Refined through design dialogue:
- paginate/no-paginate boundary is a CHARACTER budget (READ_PAGE_CHARS,
  default 50000), not a line count (line count is an invalid proxy for
  context cost — 1500 short vs 1500 long lines differ by orders of magnitude)
- bookends stay line-based (lineStart/lineEnd/totalLines) so edit.at_line
  on a large file still works; a page = longest whole-line run within the
  char budget
- fits budget → whole file verbatim, one load (common case, no ceremony)
- exceeds → bookended page 1 (single contiguous block, not chunk array) +
  nextPage hint; sequential paging via new `page` param
- returnFullFile:true RETAINED and repurposed as the explicit
  whole-large-file override (not retired)
- hard invariant recorded: a default read must never hand the agent a
  context-breaking raw dump
Alternatives added: no-guard full dump, line-count budget, chunk-array — all rejected with rationale.
…on (ADR-203, #133)

Default vault.read no longer fragments-and-flattens. New contract:
- fits READ_PAGE_CHARS (50000) → whole file, byte-exact, one load (common case)
- exceeds → verbatim page 1, single contiguous block, with line bookends
  (lineStart/lineEnd/totalLines + nextPage); page=N to continue. Absolute
  line numbers preserved so edit.at_line still works on large files.
- returnFullFile:true → entire file verbatim (explicit large override; param
  retained & repurposed, not retired)
- query/strategy/maxFragments → semantic fragments (unchanged)
- structured envelope no longer double-encodes the body (metadata sans body)

Budget is char-based on purpose: line count is an invalid proxy for context
cost; only bookends are line-based (for at_line). Hard invariant: a default
read must never hand the agent a context-breaking raw dump.

Also fixes the latent formatFileRead crash (_Formatter error_) on full-file
shapes; formatter now renders verbatim + a Pagination section.

src/utils/file-reader.ts rewritten; vault.ts threads `page`; formatter
hardened; tool description + CHANGELOG (breaking) updated.
make check green (0 errors, baseline 5 warnings, 243/243 incl. 8 new
ADR-203 round-trip/pagination/fidelity tests).
@github-actions
Copy link
Copy Markdown

✅ Build succeeded! Artifacts are available in the Actions tab.

@aaronsb
Copy link
Copy Markdown
Owner Author

aaronsb commented May 19, 2026

Code Review — ADR-203 faithful-by-default paginated reads

Reviewed against the amended ADR-203 (decision of record), #133, and the full diff. Forensic verification done on buildPage round-trip, the page walk, and consumer breakage.

Verdict: Request changes — one blocking issue


BLOCKING

formatFileRead truncates content to 50 lines in the default (non-raw) path — breaks ADR-203 bucket 1

Location: src/formatters/vault.ts:267-273

const contentLines = content.split('\n');
const previewLines = contentLines.slice(0, 50);
lines.push('```markdown');
lines.push(previewLines.join('\n'));
if (contentLines.length > 50) {
  lines.push(`\n... (${contentLines.length - 50} more lines)`);
}

Problem: ADR-203 §1 places vault.read in bucket 1 — "faithful by default". Bucket 2 (concise) is explicitly enumerated as create/update/delete/move/rename/copy/search/graph./edit.read is deliberately not in it. The ADR Consequences/Positive states "Read→edit round-trips become correct by default" — i.e. without raw:true.

But raw defaults to false (src/tools/semantic-tools.ts:257,331: rawMode = args.raw === true), so the default vault.read response is formatFileRead's output — and that output truncates the string-content branch to a 50-line preview. For any file/page over 50 lines, the agent receives a truncated markdown preview and cannot derive a byte-matching edit.window oldText without re-calling with raw:true. That is precisely the read→edit friction #133 documents and that ADR-203 §1 exists to retire.

The PR description claims "formatter now renders verbatim + a Pagination section." The Pagination section is implemented; the verbatim rendering is not — the .slice(0, 50) is still there.

Why it matters: This silently negates the core correctness guarantee of the ADR for the common (non-raw) call path. It is the exact failure class the brief defines as blocking ("behaviour-fidelity bugs in a byte-exact feature").

Suggestion: For the whole-file and paginated branches, emit the content whole inside the fenced block — drop the 50-line slice. This is safe by construction: the page is already bounded to ≤ READ_PAGE_CHARS in file-reader.ts, so there is no unbounded-context risk. (The fragment-array branch's truncate(frag.content, 500) is a separate, legitimately-bucketed concern and can stay.)

Also add a fidelity test asserting the formatted (non-raw) output contains the byte-exact content for both the whole-file and paginated branches. The current tests/vault-read-fidelity.test.ts:105-122 only asserts not.toThrow() and the presence of the strings Pagination/page=2 — it never checks that emitted text equals the source, so this can regress undetected. The 8 tests passing is not evidence the contract holds; it's evidence the tests don't check it.


Verified correct (no action needed)

buildPage byte-exactness and contiguity — Verified by an independent round-trip harness across pathological inputs: trailing/no-trailing newline, blank-line runs, CRLF, single oversized line, multiple consecutive oversized lines, empty string, only-newlines, exact-budget boundary. In every case pages.join('\n') === original byte-for-byte. The split('\n')/join('\n') pairing is faithful because \r stays attached to the line (CRLF preserved) and blank lines round-trip as empty strings. No off-by-one: lineStart = startIdx+1, lineEnd = i (1-based inclusive) is correct, and page N+1 pageLineStart === page N pageLineEnd + 1 holds — verified in the logic, not just the test (3 pages cover lines 1–4000 as 1–1648 / 1649–3260 / 3261–4000, no gap/overlap/dropped line).

Pathological single-line > budget — Returned whole as its own page, oversizedLine flagged, nextIdx advances by 1 so the walk continues; no infinite loop. Multiple consecutive oversized lines each become their own page correctly with hasMore accurate on the last.

Page edge casespage=0, page<0, page=NaN clamp to page 1; page=2.7 floors to 2; page-past-EOF returns beyondEnd:true, empty content, and reports the real last page number. All sane.

ADR contract — four branches match exactly: fragments (query/strategy/maxFragments) unchanged; returnFullFile:true → entire verbatim with override warning; fits READ_PAGE_CHARS → whole verbatim, paginated:false, one load; exceeds → bookended page 1 with lineStart/lineEnd/totalLines/bytes/nextPage. The data layer is faithful in every branch.

De-dup (ADR §3) — Body is correctly stripped from metadata in all branches including fragments (const { content: _body, ... } = ...; metaNoBody = rest). No path embeds the body twice. Confirmed no consumer anywhere in src/ reads the old nested result.content.content / metadata.content / metadata.wordCount shape — only one caller (vault.ts:46) and one formatter consumer (formatFileRead), so the breaking de-dup is contained.

Image / buffer interactions — Image files are intercepted by the mimeType/base64Data check in semantic-tools.ts:296-308 before reaching the formatter, so formatFileRead never receives an image passthrough. ContentBufferManager/from_buffer are edit.*-only and do not interact with vault.read.


Non-blocking notes

  1. Unrecognized-binary passthrough (file-reader.ts:132,143) — an exotic object with neither string content nor base64Data/mimeType would reach formatFileRead, hit content.split('\n') (vault.ts:267) and fall to the _Formatter error_ catch (index.ts:649). This is pre-existing behaviour, not introduced here, and outside the faithful-text contract — but a typeof content !== 'string' guard in the formatter would close it cleanly.

  2. Page-walk perf (file-reader.ts:209-216) — building page N re-walks from page 1, O(totalLines × pageNum). Acceptable for the target sizes (notes, files of a few thousand lines: 3-4 pages). Not a concern at these scales; worth a comment noting the deliberate trade-off.


Once the truncation is fixed and a byte-exact formatted-output test is added, this is sound: the data layer is genuinely byte-faithful and the pagination math is correct. The blocking issue is purely the presentation layer silently undercutting the default-path guarantee.

AI-assisted review via Claude

…blocking review)

formatFileRead truncated the string branch to a 50-line preview; since raw
defaults false, the DEFAULT vault.read was that truncated view — an agent
reading any >50-line file could not derive a byte-matching edit.window
oldText without raw:true, the exact #133 friction ADR-203 retires. The data
layer was already byte-faithful; the presentation default undercut it.

- string branch: emit full content verbatim, no 50-line slice, no wrapping
  code fence (body may contain fences — a wrapper would corrupt round-trip)
- add non-text/binary passthrough guard (closes a pre-existing latent
  formatter-error path; reviewer non-blocking note)
- test now asserts the FORMATTED (raw:false) output is byte-exact, not just
  not.toThrow

make check green: 0 errors, baseline 5 warnings, 244/244.
@github-actions
Copy link
Copy Markdown

✅ Build succeeded! Artifacts are available in the Actions tab.

@aaronsb aaronsb merged commit f6c108e into main May 24, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant