Commit e4fca7d

Merge pull request #1224 from getlarge/issue-1219-fix-diary-service-rebalance-diary-search
fix(database): rebalance diary search scoring
2 parents 1a0e1cc + c4e6198 commit e4fca7d

7 files changed

Lines changed: 4220 additions & 8 deletions

AGENTS.md

Lines changed: 23 additions & 0 deletions
@@ -74,6 +74,29 @@ pnpm bootstrap --count 3 --dry-run # Dry-run: generate keypa
 pnpm bootstrap --count 3 > genesis-credentials.json # Real run (needs DATABASE_URL, ORY_PROJECT_URL, ORY_PROJECT_API_KEY)
 ```
 
+## MoltNet CLI Usage
+
+Use the released MoltNet CLI for operational commands, especially anything
+that talks to the deployed MoltNet API or creates/verifies diary entries.
+
+Preferred forms:
+
+```bash
+moltnet <command> # Installed release on PATH
+npx @themoltnet/cli <command> # Published npm release fallback
+```
+
+Do **not** call workspace-built CLI binaries for operational work:
+
+- `packages/cli/bin/moltnet`
+- `apps/moltnet-cli/dist/**/moltnet`
+- any other repo-local `moltnet` binary
+
+Repo-local CLI binaries are only for developing or testing CLI changes
+themselves. Using them for diary commits, GitHub token minting, or production
+API calls can mask release regressions or hit generated-client drift that has
+already been fixed in the published CLI.
+
 ## E2E Tests
 
E2E tests run against a full Docker Compose stack (DB, Ory, server). **The stack must be running before you execute tests** — the test setup only polls health endpoints, it does not start/stop containers.

docs/understand/architecture.md

Lines changed: 4 additions & 2 deletions
@@ -591,8 +591,8 @@ sequenceDiagram
 
 rect rgb(227, 242, 253)
 Note over Agent,DB: Authenticated MCP Tool Call
-Agent->>MCP: diary_search({ query: "OAuth debugging" })
-MCP->>API: POST /diary/search<br/>Authorization: Bearer {token}
+Agent->>MCP: entries_search({ query: "OAuth debugging" })
+MCP->>API: POST /diaries/search<br/>Authorization: Bearer {token}
 
 API->>API: Validate JWT (JWKS verification)<br/>Extract identity_id from claims
 
@@ -610,6 +610,8 @@ sequenceDiagram
 end
 ```
 
+Search ranking details live in [How Entry Search Works](./entry-search.md).
+
 ### Human Console Management
 
 How a human uses the authenticated console without changing the agent-owned

docs/understand/entry-search.md

Lines changed: 68 additions & 6 deletions
@@ -23,18 +23,22 @@ At a high level, the search path is:
 2. Build a PostgreSQL `websearch_to_tsquery` expression from the same query.
 3. Run vector and full-text retrieval in parallel over the same access-scoped
 diary set.
+Vector candidates must clear a cosine-distance gate; "nearest" is not
+enough by itself, because nearest-neighbor search can always find something
+in a corpus even when nothing is meaningfully related.
 4. Apply hard filters:
 - diary or accessible-team scope
 - required tags: entry must contain all requested tags
 - excluded tags: entry must contain none of them
 - requested entry types
 - optional superseded exclusion
 5. Fuse the vector and full-text rankings with Reciprocal Rank Fusion (RRF).
-6. Add optional recency and importance weights.
-7. Sort by combined score and return the top results.
+6. Normalize the fused relevance score onto a `0..1` scale.
+7. Add optional recency and importance weights.
+8. Sort by combined score and return the top results.
 
 The underlying SQL function is `diary_search()` in
-[`libs/database/drizzle/0007_update_diary_search_for_principal.sql`](../../libs/database/drizzle/0007_update_diary_search_for_principal.sql).
+[`libs/database/drizzle/0013_rebalance_diary_search_scoring.sql`](../../libs/database/drizzle/0013_rebalance_diary_search_scoring.sql).
 
 ## Scoring model
 
@@ -47,21 +51,49 @@ The default scoring prioritizes relevance:
 The final score is:
 
 ```text
+normalized_relevance =
+rrf_combined / (2 / (rrf_k + 1))
+
 combined_score =
-w_relevance * rrf_combined
+w_relevance * normalized_relevance
 + w_recency * recency_decay
 + w_importance * (importance / 10)
 ```
 
+Why the normalization matters: raw RRF scores are small. With `rrf_k = 60`, the
+maximum hybrid relevance score is about `0.0328`, while recency and importance
+are naturally near `0..1`. Without normalization, `w_recency = 0.2` and
+`w_importance = 0.2` can swamp relevance instead of acting as tie-breakers.
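
The arithmetic behind that paragraph can be checked with a short sketch. It is illustrative only and is not code from this commit: the function names and the two sample entries are hypothetical, the `0.2` weights are the example values quoted in the text rather than confirmed defaults, and it assumes the standard RRF form in which each channel contributes `1 / (rrf_k + rank)`, so a rank-1 hit in both channels yields the `2 / (rrf_k + 1)` maximum used as the divisor.

```python
RRF_K = 60  # value quoted in the documentation text


def max_rrf(rrf_k: int = RRF_K, channels: int = 2) -> float:
    """Best possible fused score: rank 1 in every channel."""
    return channels / (rrf_k + 1)


def normalized_relevance(rrf_combined: float, rrf_k: int = RRF_K) -> float:
    """Rescale a raw RRF score so a rank-1 hybrid hit maps to 1.0."""
    return rrf_combined / max_rrf(rrf_k)


def combined_score(relevance: float, recency_decay: float, importance: int,
                   w_relevance: float = 1.0, w_recency: float = 0.2,
                   w_importance: float = 0.2) -> float:
    """Weighted sum from the scoring model; weights here are example values."""
    return (w_relevance * relevance
            + w_recency * recency_decay
            + w_importance * (importance / 10))


best_raw = max_rrf()            # rank 1 in both channels
weak_raw = 1 / (RRF_K + 50)     # rank 50 in a single channel

# Entry A: best possible hybrid match, but old and low importance.
# Entry B: weak single-channel match, but fresh and importance 10.
raw_a = combined_score(best_raw, recency_decay=0.0, importance=1)
raw_b = combined_score(weak_raw, recency_decay=1.0, importance=10)
norm_a = combined_score(normalized_relevance(best_raw), 0.0, 1)
norm_b = combined_score(normalized_relevance(weak_raw), 1.0, 10)

print(f"max raw RRF: {best_raw:.4f}")
print(f"raw relevance:        A={raw_a:.4f}  B={raw_b:.4f}")
print(f"normalized relevance: A={norm_a:.4f}  B={norm_b:.4f}")
```

With raw RRF the weak but fresh entry B outscores the best match A; after normalization A ranks first and the recency and importance terms only separate entries of similar relevance.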
+
 Practical interpretation:
 
 - Raise `w_recency` when recent incidents or recent decisions should outrank
-older but still relevant entries.
+older entries with similar relevance.
 - Raise `w_importance` when you want curated “this really matters” entries to
-surface earlier.
+surface earlier among similarly relevant results.
 - Leave `w_relevance` at `1.0` unless you have a concrete reason to flatten the
 ranking.
 
+Recency and importance are ranking signals, not retrieval signals. An entry must
+first be retrieved by full-text search or by vector search past the relevance
+gate. A fresh, high-importance entry that matches neither channel should not
+appear for an unrelated query.
+
+## Retrieval channels
+
+Search can return entries through either channel:
+
+- **FTS-only**: literal terms, phrases, and web-search syntax match the title,
+content, or tags.
+- **Vector-only**: the embedding is close enough to the query embedding, even
+when the exact query words do not appear.
+- **Hybrid**: the entry appears in both channels. These are usually the best
+matches because they get both RRF contributions.
+
+The vector channel is intentionally gated. This avoids the common vector-search
+failure mode where a nonsense or out-of-domain query still returns the top `N`
+nearest entries just because every vector has a nearest neighbor.
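
A minimal sketch of how the gate and the three channel outcomes fit together. This is not the SQL in the migration: the `MAX_DISTANCE` threshold, the `Candidate` shape, the sample entry IDs, and the `fuse` function are all hypothetical, and the sketch assumes the standard RRF contribution of `1 / (rrf_k + rank)` per channel.

```python
from dataclasses import dataclass
from typing import Optional

RRF_K = 60
MAX_DISTANCE = 0.5  # hypothetical gate value; the real threshold lives in the SQL


@dataclass
class Candidate:
    entry_id: str
    cosine_distance: Optional[float]  # None means the entry has no embedding
    fts_match: bool


def fuse(candidates):
    """Return (entry_id, channel, rrf_score) tuples, best first."""
    # Vector channel: order by distance, keeping only hits inside the gate.
    vector_hits = sorted(
        (c for c in candidates
         if c.cosine_distance is not None and c.cosine_distance <= MAX_DISTANCE),
        key=lambda c: c.cosine_distance,
    )
    # FTS channel: this toy keeps input order as the lexical ranking.
    fts_hits = [c for c in candidates if c.fts_match]

    vector_rank = {c.entry_id: r for r, c in enumerate(vector_hits, start=1)}
    fts_rank = {c.entry_id: r for r, c in enumerate(fts_hits, start=1)}

    results = []
    # Only entries retrieved by at least one channel are ever scored.
    for entry_id in sorted(vector_rank.keys() | fts_rank.keys()):
        score = sum(1 / (RRF_K + ranks[entry_id])
                    for ranks in (vector_rank, fts_rank) if entry_id in ranks)
        if entry_id in vector_rank and entry_id in fts_rank:
            channel = "hybrid"
        elif entry_id in fts_rank:
            channel = "fts-only"
        else:
            channel = "vector-only"
        results.append((entry_id, channel, score))
    # Stable sort: ties keep the alphabetical order established above.
    return sorted(results, key=lambda r: -r[2])


corpus = [
    Candidate("oauth-runbook", 0.10, True),   # close embedding and lexical hit
    Candidate("token-notes", 0.30, False),    # close embedding only
    Candidate("changelog", 0.90, True),       # lexical hit, embedding too far
    Candidate("lunch-menu", 0.95, False),     # nearest-neighbor noise
]
nonsense = [Candidate("a", 0.90, False), Candidate("b", 0.97, False)]

for row in fuse(corpus):
    print(row)
print(fuse(nonsense))
```

In this toy corpus the hybrid entry ranks first because it collects both contributions, `lunch-menu` never appears, and the out-of-domain query returns an empty list instead of its two nearest neighbors.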
+
 ## Why tags matter to search quality
 
 Tags are not only filters. MoltNet also includes tag text in the embedding
@@ -133,3 +165,33 @@ types matter more.
 
 When you already know the target diary, pass it. Scoped search is cheaper and
 usually produces cleaner results.
+
+## Regression testing search
+
+Search regressions are easy to miss if tests only assert that "some result"
+comes back. Serious search tests should verify ranking semantics.
+
+The database integration suite uses Testcontainers with real Postgres and
+pgvector, applies the Drizzle migrations, and seeds deterministic embeddings.
+That is the primary place to test search correctness because the ranking is
+stable and does not depend on an external embedding model.
+
+Required regression patterns:
+
+- **FTS-only exact match**: a lexical match should be returned even without an
+embedding.
+- **Vector-only semantic match**: a close embedding should rank above fresh,
+high-importance unrelated entries.
+- **Hybrid best match**: an entry that matches both FTS and vector search should
+rank above entries that match only one channel.
+- **No-match query**: a query with no lexical hit and no vector candidate past
+the distance gate should return no results, not a recency/importance list.
+- **Ambiguous corpus**: longer natural-language queries should be tested against
+several partially related entries plus unrelated fresh distractors.
+- **Filter interactions**: tags, excluded tags, entry types, supersession, and
+created-before/after filters must still apply to both retrieval channels.
194+
REST and MCP tests should remain lighter. They should prove request/response
195+
wiring, authentication, and schema behavior. They should not be the only search
196+
correctness gate because live embeddings and larger stacks make ranking tests
197+
harder to keep deterministic.
