Improve recall cue quality

focaxisdev · web-flow · commit dca820a9581b · 2026-04-23T21:36:11.000+08:00
Adds cue quality lint warnings, gist-first summaries, boundary-aware chunking, and v0.3.1 release metadata.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,14 @@
 
 All notable changes to this project will be documented in this file.
 
+## [0.3.1] - 2026-04-23
+
+- Added impression cue quality warnings to `deja-vu-lint-memory` for sparse, oversized, duplicate, generic, and repeated keyword sets.
+- Updated the default summary generator to preserve decision, rationale, and trigger gist cues instead of only truncating source content.
+- Updated the default chunker to preserve Markdown heading and paragraph boundaries before falling back to hard splitting.
+- Added source tests for cue lint warnings, gist summaries, and boundary-aware chunking.
+- Updated package metadata for the 0.3.1 patch release.
+
 ## [0.3.0] - 2026-04-22
 
 - Repositioned Deja Vu as a cue-first memory protocol centered on `task cue -> familiarity score -> minimal recall -> durable writeback`.
diff --git a/README.md b/README.md
@@ -16,6 +16,8 @@ The protocol is packaged as three project-local assets:
 
 The goal is not to give every agent a heavy runtime. The goal is to give any agent a repeatable discipline for spending almost no tokens until the task proves that deeper memory is useful.
 
+The current patch line also emphasizes memory quality control: compact cue linting, gist-first summaries, and boundary-aware chunks keep recall routes small instead of merely adding more stored text.
+
 ## What Deja Vu Is
 
 Deja Vu defines a shared memory behavior for agents working inside one project.
@@ -86,6 +88,7 @@ The canonical layout and field rules are specified in [docs/storage-markdown.md]
 - Use a single-project scope only in MVP: `project:<project-id>`.
 - Recall before substantial work, but follow a strict recall budget.
 - Prefer scripted impression scans first; open summary or detailed records only when needed.
+- Keep impression cues sparse, specific, and linted so the first recall step stays cheap.
 - Write back only durable memory:
   - decisions
   - architecture intent
@@ -163,6 +166,8 @@ The public TypeScript exports remain intact for hosts that want semantic recall.
 
 `scanImpressions()` performs token-only familiarity scanning and does not load summaries or chunks.
 
+The default engine helpers preserve low-token recall quality by generating decision/rationale/trigger summaries and by chunking Markdown or paragraph boundaries before falling back to character splits.
+
 ## Examples
 
 - Protocol-first example: [examples/protocol-project](./examples/protocol-project)
@@ -209,4 +214,5 @@ npm run lint:memory
 - [docs/scripted-recall.md](./docs/scripted-recall.md)
 - [docs/bootstrap-instructions.md](./docs/bootstrap-instructions.md)
 - [docs/project-rules-template.md](./docs/project-rules-template.md)
+- [docs/release-v0.3.1.md](./docs/release-v0.3.1.md)
 - [llms.txt](./llms.txt)
diff --git a/docs/engine/semantic-engine.md b/docs/engine/semantic-engine.md
@@ -20,6 +20,8 @@ The engine provides:
 - threshold gating
 - scoring and ranking
 - in-memory demo adapters
+- gist-first default summaries
+- Markdown and paragraph boundary-aware default chunking
 
 ## What it does not do
 
diff --git a/docs/impression-layer.md b/docs/impression-layer.md
@@ -77,6 +77,7 @@ Default keyword discipline:
 - prefer nouns, feature names, decision names, and project-specific phrases
 - avoid full sentences
 - use `aliases` only for stable alternate names
+- run the memory linter to catch sparse, oversized, generic, duplicate, or repeated cue sets
 
 ## Retention Behavior
 
diff --git a/docs/release-v0.3.1.md b/docs/release-v0.3.1.md
@@ -0,0 +1,15 @@
+# Deja Vu v0.3.1
+
+Deja Vu v0.3.1 is a recall-quality patch for the cue-first protocol release.
+
+## Highlights
+
+- `deja-vu-lint-memory` now warns when impression cues are too sparse, too large, duplicated, too generic, or repeated across records.
+- Default engine summaries now preserve gist cues as decision, rationale, and trigger fields when those labels are available.
+- Default engine chunking now respects Markdown headings and paragraph boundaries before using hard character splits.
+- The patch keeps the v0.3 protocol surface unchanged while making low-token recall routes cleaner and less noisy.
+
+## Validation
+
+- `npm run test:src`
+- `npm run lint:memory`
diff --git a/docs/scripted-recall.md b/docs/scripted-recall.md
@@ -20,6 +20,13 @@ The companion linter checks whether the impression index is structurally usable:
 deja-vu-lint-memory
 ```
 
+The linter also warns about low-quality cue routes that make future recall more expensive:
+
+- too few or too many keywords
+- duplicate keywords inside one record
+- too many generic keywords
+- duplicate keyword sets across records
+
 ## Inputs
 
 The script reads:
diff --git a/encoding-status.md b/encoding-status.md
@@ -4,13 +4,13 @@
 | --- | --- | --- |
 | `encoding-status.md` | 編碼正常（新建，UTF-8） | Project-local registry created by agent. |
 | `.gitignore` | 編碼正常（新建，UTF-8） | New text file. |
-| `package.json` | 編碼正常（已檢查） | Rewritten in UTF-8; version bumped to 0.3.0 and package description now uses cue-first positioning. |
-| `package-lock.json` | 編碼正常（已檢查） | Rewritten in UTF-8; package version metadata now matches the 0.3.0 release. |
+| `package.json` | 編碼正常（已檢查） | Updated in UTF-8; version bumped to 0.3.1 with cue-quality package description and keywords. |
+| `package-lock.json` | 編碼正常（已檢查） | Updated in UTF-8; package version metadata now matches the 0.3.1 release. |
 | `tsconfig.json` | 編碼正常（新建，UTF-8） | New text file. |
 | `LICENSE` | 編碼正常（新建，UTF-8） | New text file. |
-| `README.md` | 編碼正常（已檢查） | Rewritten in UTF-8; repo entrypoint now centers cue-first recall, minimum memory files, and recall budget. |
-| `CHANGELOG.md` | 編碼正常（已檢查） | Rewritten in UTF-8; now includes the 0.3.0 cue-first protocol release notes. |
-| `llms.txt` | 編碼正常（已檢查） | Rewritten in UTF-8; AI-readable index now points to cue-first recall and recall budget concepts. |
+| `README.md` | 編碼正常（已檢查） | Updated in UTF-8; overview now includes cue quality control, gist summaries, and boundary-aware chunks. |
+| `CHANGELOG.md` | 編碼正常（已檢查） | Updated in UTF-8; now includes the 0.3.1 recall-quality patch release notes. |
+| `llms.txt` | 編碼正常（已檢查） | Updated in UTF-8; AI-readable index now includes cue quality linting, gist summaries, and boundary-aware chunking. |
 | `docs/architecture.md` | 編碼正常（已檢查） | Rewritten in UTF-8; architecture doc now describes the engine as the optional layer inside a protocol-first product. |
 | `docs/agent-handshake.md` | 編碼正常（已檢查） | Rewritten in UTF-8; handshake now starts from cue-first adoption and recall-budget discipline. |
 | `docs/project-rules-template.md` | 編碼正常（已檢查） | Rewritten in UTF-8; points to canonical AGENTS and memory templates. |
@@ -20,7 +20,7 @@
 | `docs/protocol.md` | 編碼正常（已檢查） | Rewritten in UTF-8; protocol now defines cue-first v0.3, minimum artifacts, and recall budget. |
 | `docs/workflow.md` | 編碼正常（已檢查） | Rewritten in UTF-8; workflow now uses cue-first recall budget and lower-priority event writeback. |
 | `docs/storage-markdown.md` | 編碼正常（已檢查） | Rewritten in UTF-8; storage contract now separates required, recommended, and optional layouts. |
-| `docs/engine/semantic-engine.md` | 編碼正常（新建，UTF-8） | New optional engine overview. |
+| `docs/engine/semantic-engine.md` | 編碼正常（已檢查） | Updated in UTF-8; engine overview now mentions gist-first summaries and boundary-aware chunking. |
 | `docs/engine/protocol-to-engine.md` | 編碼正常（新建，UTF-8） | New mapping from protocol workflow to semantic engine usage. |
 | `docs/templates/AGENTS.template.md` | 編碼正常（已檢查） | Rewritten in UTF-8; template now uses protocol v0.3, recall budget, and optional index/events rules. |
 | `docs/templates/memory/index.md` | 編碼正常（新建，UTF-8） | New memory index template. |
@@ -35,8 +35,8 @@
 | `src/utils/id.ts` | 編碼正常（新建，UTF-8） | New text file. |
 | `src/utils/math.ts` | 編碼正常（新建，UTF-8） | New text file. |
 | `src/utils/text.ts` | 編碼正常（新建，UTF-8） | New text file. |
-| `src/memory/default-chunker.ts` | 編碼正常（新建，UTF-8） | New text file. |
-| `src/memory/default-summary-generator.ts` | 編碼正常（新建，UTF-8） | New text file. |
+| `src/memory/default-chunker.ts` | 編碼正常（已檢查） | Updated in UTF-8; default chunking now preserves Markdown and paragraph boundaries before hard splitting. |
+| `src/memory/default-summary-generator.ts` | 編碼正常（已檢查） | Updated in UTF-8; default summaries now preserve decision/rationale/trigger gist cues. |
 | `src/scoring/hybrid-scoring-strategy.ts` | 編碼正常（新建，UTF-8） | New text file. |
 | `src/plugins/mock-embedding-provider.ts` | 編碼正常（新建，UTF-8） | Updated in UTF-8; hybrid token and trigram demo embeddings. |
 | `src/plugins/create-in-memory-engine.ts` | 編碼正常（新建，UTF-8） | New text file. |
@@ -58,15 +58,16 @@
 | `examples/protocol-project/memory/context/project-context.md` | 編碼正常（新建，UTF-8） | New example project context. |
 | `examples/protocol-project/memory/decisions/protocol-first-positioning.md` | 編碼正常（新建，UTF-8） | New example decision record. |
 | `examples/protocol-project/memory/open-loops/add-engine-later.md` | 編碼正常（新建，UTF-8） | New example open-loop record. |
-| `tests/semantic-recall-engine.test.ts` | 編碼正常（已檢查） | Checked in UTF-8; source tests now run through the updated Node 24-compatible test script. |
-| `docs/impression-layer.md` | 編碼正常（已檢查） | Rewritten in UTF-8; impression layer now anchors cue-first token spending and keyword discipline. |
-| `docs/scripted-recall.md` | 編碼正常（已檢查） | Rewritten in UTF-8; script contract now requires only summary, impressions, and scanner for bootstrap. |
+| `tests/semantic-recall-engine.test.ts` | 編碼正常（已檢查） | Updated in UTF-8; source tests now cover gist summaries and boundary-aware chunking. |
+| `docs/impression-layer.md` | 編碼正常（已檢查） | Updated in UTF-8; keyword discipline now points to linter checks for low-quality cue routes. |
+| `docs/scripted-recall.md` | 編碼正常（已檢查） | Updated in UTF-8; linter docs now describe low-quality cue warnings. |
 | `docs/release-v0.2.1.md` | 編碼正常（新建，UTF-8） | New release note for scripted impression-first recall. |
 | `docs/release-v0.3.0.md` | 編碼正常（新建，UTF-8） | New release note for cue-first protocol and recall budget release. |
+| `docs/release-v0.3.1.md` | 編碼正常（新建，UTF-8） | New release note for cue quality, gist summaries, and boundary-aware chunking. |
 | `docs/templates/memory/impressions.jsonl` | 編碼正常（新建，UTF-8） | New impression index template. |
 | `docs/templates/memory/events/YYYY-MM.md` | 編碼正常（新建，UTF-8） | New event ledger template. |
 | `scripts/dejavu-scan-memory.mjs` | 編碼正常（新建，UTF-8） | New default memory impression scanner. |
 | `examples/protocol-project/memory/impressions.jsonl` | 編碼正常（新建，UTF-8） | New example impression index. |
 | `examples/protocol-project/memory/events/2026-04.md` | 編碼正常（新建，UTF-8） | New example event ledger. |
-| `scripts/dejavu-lint-memory.mjs` | 編碼正常（新建，UTF-8） | New memory impression index linter. |
-| `tests/memory-cli.test.ts` | 編碼正常（新建，UTF-8） | New CLI and package smoke tests. |
+| `scripts/dejavu-lint-memory.mjs` | 編碼正常（已檢查） | Updated in UTF-8; linter now warns on low-quality impression cues and duplicate keyword sets. |
+| `tests/memory-cli.test.ts` | 編碼正常（已檢查） | Updated in UTF-8; CLI tests now cover low-quality cue warnings. |
diff --git a/llms.txt b/llms.txt
@@ -7,6 +7,7 @@ Deja Vu helps agents preserve useful project memory through:
 - explicit project rules
 - a repeatable cue-scan, minimal-recall, and writeback workflow
 - tiny Markdown and JSONL memory files inside the repository
+- cue quality checks that keep the first recall step sparse and specific
 
 The minimum adoption path does not require npm, embeddings, vector search, or a dedicated memory service.
 
@@ -36,6 +37,9 @@ The minimum adoption path does not require npm, embeddings, vector search, or a
 - project-scoped continuity
 - recall before substantial work
 - impression-first scripted recall
+- cue quality linting
+- gist-first summaries
+- boundary-aware chunking
 - recall budget
 - selective writeback
 - event ledger continuity
diff --git a/package-lock.json b/package-lock.json
diff --git a/package.json b/package.json
@@ -1,7 +1,7 @@
 {
   "name": "@focaxisdev/deja-vu",
-  "version": "0.3.0",
-  "description": "Deja Vu: a cue-first memory protocol for AI agents with an optional semantic recall engine.",
+  "version": "0.3.1",
+  "description": "Deja Vu: a cue-first memory protocol with quality-gated recall cues and an optional semantic engine.",
   "type": "module",
   "main": "./dist/src/index.js",
   "types": "./dist/src/index.d.ts",
@@ -43,6 +43,10 @@
     "memory-protocol",
     "project-memory",
     "markdown-memory",
+    "cue-first-recall",
+    "cue-quality",
+    "gist-summary",
+    "boundary-aware-chunking",
     "semantic-recall",
     "memory-engine",
     "ai-agent",
diff --git a/scripts/dejavu-lint-memory.mjs b/scripts/dejavu-lint-memory.mjs
@@ -16,6 +16,27 @@ const rootPath = resolve(process.cwd(), memoryRoot);
 const impressionsPath = resolve(rootPath, "impressions.jsonl");
 const diagnostics = [];
 const seenIds = new Set();
+const keywordSignatures = new Map();
+const genericKeywords = new Set([
+  "about",
+  "agent",
+  "change",
+  "context",
+  "data",
+  "detail",
+  "file",
+  "general",
+  "info",
+  "memory",
+  "note",
+  "project",
+  "record",
+  "summary",
+  "task",
+  "thing",
+  "update",
+  "work",
+]);
 
 function addDiagnostic(level, message, details = {}) {
   diagnostics.push({ level, message, ...details });
@@ -25,6 +46,10 @@ function isStringArray(value) {
   return Array.isArray(value) && value.every((item) => typeof item === "string" && item.length > 0);
 }
 
+function normalizedKeywords(keywords) {
+  return keywords.map((keyword) => keyword.trim().toLowerCase()).filter(Boolean);
+}
+
 if (!existsSync(impressionsPath)) {
   addDiagnostic("error", "Missing memory/impressions.jsonl", { path: impressionsPath });
 } else {
@@ -57,6 +82,60 @@ if (!existsSync(impressionsPath)) {
 
     if (!isStringArray(record.keywords)) {
       addDiagnostic("error", "keywords must be a non-empty string array", { path: impressionsPath, line: lineNumber });
+    } else {
+      const keywords = normalizedKeywords(record.keywords);
+      const uniqueKeywords = new Set(keywords);
+
+      if (keywords.length < 3) {
+        addDiagnostic("warning", "keywords should include at least 3 cue terms", {
+          path: impressionsPath,
+          line: lineNumber,
+          id: record.id,
+        });
+      }
+
+      if (keywords.length > 12) {
+        addDiagnostic("warning", "keywords should stay at or below 12 cue terms", {
+          path: impressionsPath,
+          line: lineNumber,
+          id: record.id,
+          count: keywords.length,
+        });
+      }
+
+      if (uniqueKeywords.size !== keywords.length) {
+        addDiagnostic("warning", "keywords contain duplicate cue terms", {
+          path: impressionsPath,
+          line: lineNumber,
+          id: record.id,
+        });
+      }
+
+      const genericMatches = keywords.filter((keyword) => genericKeywords.has(keyword));
+      if (genericMatches.length >= 3) {
+        addDiagnostic("warning", "keywords rely on too many generic cue terms", {
+          path: impressionsPath,
+          line: lineNumber,
+          id: record.id,
+          keywords: genericMatches,
+        });
+      }
+
+      const signature = [...uniqueKeywords].sort().join("|");
+      if (signature) {
+        const previous = keywordSignatures.get(signature);
+        if (previous) {
+          addDiagnostic("warning", "duplicate keyword set across impression records", {
+            path: impressionsPath,
+            line: lineNumber,
+            id: record.id,
+            duplicate_of: previous.id,
+            duplicate_line: previous.line,
+          });
+        } else {
+          keywordSignatures.set(signature, { id: record.id, line: lineNumber });
+        }
+      }
     }
 
     if (record.aliases !== undefined && !isStringArray(record.aliases)) {
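
The cue quality checks added to `scripts/dejavu-lint-memory.mjs` above can be read as one pure function over parsed records. The sketch below is illustrative only: `lintCueQuality`, the `rule` labels, and the sample records are hypothetical names that do not appear in the commit, and the generic-term list is shortened. The thresholds (fewer than 3 terms, more than 12 terms, 3 or more generic terms) are copied from the diff.

```javascript
// Hypothetical helper, not part of the package: same thresholds as the diff,
// but returning warnings instead of pushing into a shared diagnostics array.
const GENERIC_TERMS = new Set(["project", "memory", "note", "task", "update", "work"]); // shortened list

function lintCueQuality(records) {
  const warnings = [];
  const seenSignatures = new Map();

  records.forEach((record, index) => {
    const line = index + 1;
    // Same normalization as the diff: trim, lowercase, drop empty strings.
    const keywords = record.keywords.map((k) => k.trim().toLowerCase()).filter(Boolean);
    const unique = new Set(keywords);
    const warn = (rule, extra = {}) => warnings.push({ id: record.id, line, rule, ...extra });

    if (keywords.length < 3) warn("sparse");
    if (keywords.length > 12) warn("oversized");
    if (unique.size !== keywords.length) warn("duplicate-terms");
    if (keywords.filter((k) => GENERIC_TERMS.has(k)).length >= 3) warn("generic");

    // An order-independent signature detects identical keyword sets across records.
    const signature = [...unique].sort().join("|");
    if (signature) {
      const previous = seenSignatures.get(signature);
      if (previous) warn("duplicate-set", { duplicateOf: previous.id });
      else seenSignatures.set(signature, { id: record.id, line });
    }
  });

  return warnings;
}

// Made-up sample records, one per warning type.
const sampleRecords = [
  { id: "a", keywords: ["cue-first", "recall budget", "writeback"] },
  { id: "b", keywords: ["Writeback", "recall budget", "cue-first"] }, // same set as "a"
  { id: "c", keywords: ["project", "memory", "note", "linter"] },     // three generic terms
  { id: "d", keywords: ["chunker", "chunker"] },                      // sparse and duplicated
];
const cueWarnings = lintCueQuality(sampleRecords);
console.log(cueWarnings.map((w) => `${w.id}:${w.rule}`));
```

Because keywords are lowercased and sorted before comparison, record `b` is reported as a duplicate of `a` even though its terms are capitalized and ordered differently.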
diff --git a/src/memory/default-chunker.ts b/src/memory/default-chunker.ts
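
The body of this file's diff is not shown here, so the following is not the committed code. It is a minimal sketch of the technique the changelog describes (split on Markdown headings, then on paragraph boundaries, then fall back to hard character splits), under assumed names: `chunkMarkdown` and `maxChars` are hypothetical and may not match the real chunker's interface.

```javascript
// Hypothetical sketch of boundary-aware chunking; not the package's implementation.
function chunkMarkdown(text, maxChars = 400) {
  // Step 1: start a new section at every Markdown heading line.
  const sections = text.split(/\n(?=#{1,6} )/).map((s) => s.trim()).filter(Boolean);
  const chunks = [];

  for (const section of sections) {
    if (section.length <= maxChars) {
      chunks.push(section);
      continue;
    }

    // Step 2: the section is too large, so pack whole paragraphs into chunks.
    let current = "";
    const paragraphs = section.split(/\n{2,}/).map((p) => p.trim()).filter(Boolean);
    for (const paragraph of paragraphs) {
      if (paragraph.length > maxChars) {
        if (current) {
          chunks.push(current);
          current = "";
        }
        // Step 3: last resort, hard split a single oversized paragraph.
        for (let i = 0; i < paragraph.length; i += maxChars) {
          chunks.push(paragraph.slice(i, i + maxChars));
        }
        continue;
      }
      const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
      if (candidate.length > maxChars) {
        chunks.push(current);
        current = paragraph;
      } else {
        current = candidate;
      }
    }
    if (current) chunks.push(current);
  }

  return chunks;
}

// Made-up input: one short section and one section that needs a paragraph split.
const sampleDoc =
  "# Decision\nUse cue-first recall.\n\n## Rationale\n" + "x".repeat(50) + "\n\n" + "y".repeat(50);
const sampleChunks = chunkMarkdown(sampleDoc, 80);
console.log(sampleChunks.length, sampleChunks.map((c) => c.length));
```

Trying the coarsest boundary first keeps each heading attached to its text, and the character split is reached only when a single paragraph exceeds the budget.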
diff --git a/src/memory/default-summary-generator.ts b/src/memory/default-summary-generator.ts
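
This file's diff body is also not shown, so the sketch below only illustrates the idea stated in the release notes: keep decision, rationale, and trigger gist cues when those labels are available, and otherwise fall back to truncation. The function name `summarizeGist`, the `Label: text` line syntax, and the `maxChars` default are assumptions made for the example and are not taken from the commit.

```javascript
// Hypothetical sketch of a gist-first summary; not the package's implementation.
const GIST_LABELS = ["decision", "rationale", "trigger"];

function summarizeGist(content, maxChars = 160) {
  const gist = [];
  for (const label of GIST_LABELS) {
    // Assumed syntax: a line such as "Decision: ..." or "- decision: ...".
    const pattern = new RegExp("^[ \\t]*(?:[-*][ \\t]*)?" + label + "[ \\t]*:[ \\t]*(.+)$", "im");
    const match = content.match(pattern);
    if (match) gist.push(`${label}: ${match[1].trim()}`);
  }

  // Fall back to whitespace-collapsed source text when no labels are present.
  const summary = gist.length > 0 ? gist.join("; ") : content.replace(/\s+/g, " ").trim();
  return summary.length > maxChars ? `${summary.slice(0, maxChars - 3)}...` : summary;
}

// Made-up note with labeled lines mixed into background text.
const sampleNote = [
  "# Chunker change",
  "Decision: split on headings first",
  "Some background that is not needed for recall.",
  "Rationale: keeps chunks readable",
  "Trigger: editing default-chunker.ts",
].join("\n");
const sampleSummary = summarizeGist(sampleNote);
console.log(sampleSummary);
```

The labeled lines survive even though they are not at the start of the note, which plain truncation of the source text would not guarantee.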
diff --git a/tests/memory-cli.test.ts b/tests/memory-cli.test.ts
diff --git a/tests/semantic-recall-engine.test.ts b/tests/semantic-recall-engine.test.ts