
perf(core): speed up vector sync and tune fastembed defaults#731

Merged
phernandez merged 12 commits into main from
wip/vector-sync-window-batching
Apr 9, 2026

Conversation

@phernandez
Member

Summary

  • batch vector-sync prepare reads across SQLite and Postgres while keeping the hot path parallel where it matters
  • make local bm reindex --full match cloud-style full rebuild semantics more closely for search and embeddings
  • tune FastEmbed auto defaults to reserve a small CPU cushion and keep parallel=1, with runtime settings logged at batch start

Validation

  • just fast-check
  • full local rebuild benchmarks with uv run bm reindex --project main --full

Benchmark notes

  • older default-ish full rebuild baseline: about 11:15 total, embed_seconds_total=625.598
  • tuned threads=8, parallel=1: about 9:55 total, embed_seconds_total=546.204
  • tuned threads=6, parallel=1: about 9:10 total in one explicit run, embed_seconds_total=497.861
  • final auto-path validation with config back on null: 9:04.46 total, embed_seconds_total=486.343, queue_wait_seconds_total=2717.249

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 3m 14s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Read key changed files
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

Summary

This is a solid, well-motivated performance PR. The benchmark numbers demonstrate a real speedup (~18% on wall clock, ~22% reduction in embed time), and the code follows the project's literate programming conventions throughout. Below is the full checklist and specific findings.


Code Quality & Standards

  • Follows Basic Memory coding conventions — decision-point comments, section headers, trigger/why/outcome patterns are consistently applied
  • Python 3.12+ type annotations and async patterns — correct use of type aliases, asyncio.gather, asynccontextmanager
  • SQLAlchemy 2.0 best practices — uses text() with parameterized binds throughout; no raw string interpolation with user data
  • 100-character line length — a few lines in the log format strings are long but that's the existing pattern in this file
  • No commented-out code — clean

Minor: magic number for window size cap

search_repository_base.py:1116:

return max(1, min(self._semantic_embedding_sync_batch_size, 8))

The 8 is the shared-batch window cap for non-Postgres backends. The comment explains the rationale well, but extracting this to a named constant would make it searchable and easier to tune:

_SQLITE_VECTOR_PREPARE_WINDOW_MAX = 8



Testing & Documentation

  • Unit tests for new functions: tests/cli/test_db_reindex.py (+322 lines) comprehensively covers all bm reindex flag combinations and edge cases
  • Integration tests: test_force_full_reindexes_unchanged_files validates the key behavioral change in sync_service.py using monkeypatching to inspect what files were re-indexed
  • Edge cases covered — cloud project filter, disabled semantic search, per-project reindex, window error propagation
  • 100% coverage maintained: # pragma: no cover used only for else branch guards and runtime-mode-dependent paths

Basic Memory Architecture

  • Batched window reads are a clean abstraction: _prepare_entity_vector_jobs_window shares one session and two batched SELECTs across all entities in a window, then fans out async per-entity prepare work. The shared read / serialized write split is cleanly separated between _fetch_prepare_window_* (shared session) and _prepare_entity_write_scope (per-entity lock for SQLite).
  • Polymorphism is well-structured: _vector_prepare_window_size() and _prepare_entity_write_scope() override correctly in each backend; Postgres uses its own higher concurrency window, SQLite keeps the write lock.
  • scanned_paths in SyncReport — the force_full change in sync_service.py passes all scanned paths (not just diffs) to _index_changed_files, which is the correct approach for matching cloud-style full rebuilds. The test confirms unchanged files get re-indexed.
  • sync_embeddings=False decoupling — prevents the FTS sync phase from triggering a redundant embedding pass when the CLI runs FTS and embeddings as separate phases.

Potential: started_span attributes set after context exit

search_repository_base.py:1091-1104:

        # The `with telemetry.started_span(...) as batch_span:` block closes above
        if batch_span is not None:
            batch_span.set_attributes({...})

The started_span context wraps logfire.span(), which closes the span when the with block exits. Setting attributes on batch_span after the context has returned means the span has already been committed. This pattern may work with Logfire's implementation (some span backends accept post-close mutations), but semantically the attributes should be set before the span closes to guarantee they appear in traces. Consider moving the set_attributes call inside the with block, or accumulating them earlier and passing to the span on entry.
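The timing issue can be reproduced with a toy span that drops attribute writes once its context manager commits it. This is a minimal sketch with hypothetical _Span/started_span stand-ins, not Logfire's actual implementation:

```python
from contextlib import contextmanager

class _Span:
    """Toy span: attribute writes after close are silently dropped."""
    def __init__(self):
        self.attributes = {}
        self._closed = False
    def set_attributes(self, attrs):
        if not self._closed:
            self.attributes.update(attrs)

@contextmanager
def started_span(name):
    span = _Span()
    try:
        yield span
    finally:
        span._closed = True  # the span is committed when the context exits

# Too late: the span was already committed when the attributes arrive.
with started_span("vector_sync.batch") as late_span:
    pass
late_span.set_attributes({"entities": 42})
assert late_span.attributes == {}

# Correct: set attributes before the context exits.
with started_span("vector_sync.batch") as batch_span:
    batch_span.set_attributes({"entities": 42})
assert batch_span.attributes == {"entities": 42}
```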

Window-level exception collapse

search_repository_base.py:1225-1226:

        except Exception as exc:
            return [exc for _ in entity_ids]

When the shared read session fails, all entities in the window get the same exception. This is the right fail-safe — we can't know which entities would have been OK — but worth a comment noting that a window-level DB failure marks all window entities as failed (which is already visible in the outer warning log).
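The fail-safe described above can be sketched with a minimal window-prepare function (the names and signatures are hypothetical stand-ins, not the repository's real API):

```python
import asyncio

async def prepare_window(entity_ids, fetch):
    """One shared read per window; a window-level failure marks every
    entity in the window as failed with the same exception."""
    try:
        rows = await fetch(entity_ids)  # shared read session
    except Exception as exc:
        return [exc for _ in entity_ids]  # fan the failure out per entity
    return [rows.get(eid) for eid in entity_ids]

async def _demo():
    async def failing_fetch(ids):
        raise RuntimeError("shared read session failed")
    return await prepare_window([1, 2, 3], failing_fetch)

results = asyncio.run(_demo())
assert len(results) == 3
assert all(isinstance(r, RuntimeError) for r in results)
```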


Security & Performance

  • No hardcoded secrets — none
  • Parameterized SQL throughout — the comment at postgres_search_repository.py:479 explicitly notes that SQL template construction uses only integer enumerate() indices; no user-controlled text is interpolated
  • FastEmbed auto-tuning: _resolve_fastembed_runtime_knobs caps at min(8, available_cpus - 2) and keeps parallel=1. The benchmark notes show this is empirically validated. The two-path explicit-config vs. auto-tune split is clean.
  • Telemetry metric caching: _METRICS dict prevents repeated metric instrument creation; properly cleared in reset_telemetry_state() for test isolation

Compatibility

  • No Windows path issues introduced — the new scanned_paths: Set[str] in SyncReport holds string paths derived from existing path handling; no new Path comparisons are introduced
  • Unicode in CLI output: db.py:276 and db.py:324:
    console.print("  [green]✓[/green] Full-text search index rebuilt")
    Per CLAUDE.md, console and log output should avoid unicode characters. These appear to follow the existing convention in this file (the reset --reindex path already uses similar Rich markup elsewhere), but the ✓ is technically outside ASCII. If the project aims to be fully Windows-terminal-safe, these should use ASCII alternatives like [OK] or done.

Checklist Result

Area Status Notes
Code conventions ✅ Pass Excellent literate-style comments throughout
Type annotations ✅ Pass
SQLAlchemy patterns ✅ Pass
Unit tests ✅ Pass Full CLI wiring coverage
Integration tests ✅ Pass force_full path validated
100% coverage ✅ Pass # pragma: no cover used appropriately
MCP tool composability ✅ Pass No MCP tool changes
DB migrations ✅ Pass No schema changes; vector table recreation handled at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
Windows path compat ⚠️ Minor unicode in 2 console.print lines
started_span timing ⚠️ Minor Attributes set after span context exits
Magic number 8 ⚠️ Minor Could be a named constant

Overall: Approve with minor suggestions. The core batching and FastEmbed tuning work is correct and well-tested. The three flagged items are all minor — the unicode and span attribute timing are the most worth addressing before merge.

@phernandez phernandez marked this pull request as ready for review April 8, 2026 20:15
Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez phernandez force-pushed the wip/vector-sync-window-batching branch from 8c75457 to 66eb72d on April 8, 2026 20:17
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 28s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Read key changed files
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture concerns
  • Review security & performance
  • Review compatibility
  • Post final review summary

Summary

This PR delivers two distinct improvements: (1) batched window reads for vector-sync prepare phase to eliminate per-entity round-trips on both SQLite and Postgres, and (2) CPU-aware FastEmbed thread/parallel defaults to tune local full-rebuild performance. The benchmark numbers are compelling (~15% wall-clock reduction). Overall the implementation is solid and follows project conventions well. A few items worth addressing before merge:


Code Quality & Standards

✅ Passes

  • Full Python 3.12+ type annotations, | unions, type alias syntax used throughout
  • Literate programming style with Trigger/Why/Outcome comments on every non-obvious branch — exactly per CLAUDE.md
  • Section headers in files with multi-phase logic
  • SQLAlchemy 2.0 async patterns (no sync calls, proper async_sessionmaker)
  • Parameterized SQL throughout — no user data interpolated into f-strings (only enumerate() indices used for dynamic bind-param names, which is safe)

⚠️ Minor issues

Magic numbers should be named constants — two places use 8 without an explanatory constant name:

search_repository_base.py:1116 — base window size cap:

return max(1, min(self._semantic_embedding_sync_batch_size, 8))

embedding_provider_factory.py:58 — max thread cap:

threads = min(8, max(2, available_cpus - 2))

Both are explained by comments, but a named constant like _SQLITE_MAX_PREPARE_WINDOW = 8 and _FASTEMBED_MAX_THREADS = 8 would make the values easier to find and change in future tuning.

Duplicated early-exit delete blocks (search_repository_base.py:1267–1297):

Two consecutive early-exit branches both do the exact same thing (delete chunks, commit, compute prepare_seconds, return empty _PreparedEntityVectorSync). The only difference is the guard condition (not source_rows vs not chunk_records). These could be merged or the repeated boilerplate extracted into an inline helper, reducing the surface area for future drift.
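A minimal sketch of the suggested consolidation, with hypothetical names standing in for the real repository internals:

```python
import time

def prepare_entity_jobs(source_rows, chunk_records, delete_chunks):
    """Merge the two identical early-exit branches behind one inner helper
    (all names here are hypothetical stand-ins for the real code)."""
    started = time.monotonic()

    def _delete_and_finish():
        delete_chunks()  # delete chunks + commit in the real code
        return {"jobs": [], "prepare_seconds": time.monotonic() - started}

    if not source_rows:    # entity has no source rows
        return _delete_and_finish()
    if not chunk_records:  # chunking produced nothing
        return _delete_and_finish()
    return {"jobs": chunk_records, "prepare_seconds": time.monotonic() - started}

deleted = []
result = prepare_entity_jobs([], [], lambda: deleted.append(True))
assert result["jobs"] == [] and deleted == [True]
```

Either guard now funnels through the same helper, so future changes to the delete-and-return path happen in one place.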

Private method access across abstraction boundary (search_repository_base.py:1630):

effective_parallel=provider._effective_parallel(),

The base class imports and calls a private method on FastEmbedEmbeddingProvider. This works today but is a leaky abstraction. Consider adding a runtime_log_attrs() -> dict property to EmbeddingProvider / FastEmbedEmbeddingProvider so the base class can log provider details without knowing internal API.
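A sketch of the suggested provider-side method, using hypothetical class shapes rather than the project's real interfaces:

```python
from typing import Protocol

class EmbeddingProvider(Protocol):
    def runtime_log_attrs(self) -> dict: ...

class FastEmbedEmbeddingProvider:
    def __init__(self, threads: int = 8, parallel: int = 1):
        self._threads = threads
        self._parallel = parallel

    def _effective_parallel(self) -> int:  # private detail stays private
        return self._parallel

    def runtime_log_attrs(self) -> dict:
        # Public surface for logging; callers never touch _effective_parallel().
        return {"threads": self._threads,
                "effective_parallel": self._effective_parallel()}

provider = FastEmbedEmbeddingProvider()
assert provider.runtime_log_attrs() == {"threads": 8, "effective_parallel": 1}
```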


Testing & Documentation

✅ Passes

  • New tests/cli/test_db_reindex.py (322 lines) — covers reindex CLI wiring for all flag combinations, cloud project guard, and missing project guard
  • Semantic search base tests significantly expanded (+334 lines) including window batch tests, shard plan tests, and error handling
  • SQLite vector search repository tests expanded with _prepare_window_existing_rows_sql variants
  • Postgres unit tests updated for new window path
  • Telemetry tests added (+56 lines)
  • Sync incremental tests include new force_full path (+27 lines)

⚠️ Minor gap

test_db_reindex.py tests the _reindex async function but the outer reindex() Typer command is marked # pragma: no cover at db.py:158. This is expected for Typer entrypoints and matches existing patterns in the file. Coverage appears maintained.


Basic Memory Architecture

✅ Passes

  • Local-first architecture preserved — no external dependencies added
  • SQLite write serialization maintained via _sqlite_prepare_write_lock in _prepare_entity_write_scope() — concurrent reads now fan out per window, but writes still funnel through the lock, preventing WAL contention
  • Postgres keeps its higher _semantic_postgres_prepare_concurrency window via the _vector_prepare_window_size() override — backend-specific tuning preserved
  • _clear_project_vectors_for_full_reindex() correctly handles the SQLite cascade-delete difference vs Postgres
  • Window-based prepare path preserves return_exceptions=True semantics — individual entity failures don't abort the whole window, and the error is propagated back per-entity

Security & Performance

✅ Passes

  • All dynamically-constructed SQL uses bind parameters only (chunk/entity indices from enumerate(), never from user input)
  • _prepare_window_entity_params correctly builds deterministic bind-param names from integer indices
  • FastEmbed parallel=1 default aligns with documented single-process preference in provider — avoids spawning multiple ONNX worker processes that would fight for the same CPU budget
  • Auto-thread formula min(8, max(2, available_cpus - 2)) reserves a 2-CPU cushion — sound for mixed-workload environments
  • No secrets, credentials, or sensitive data in logs
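The auto-thread formula quoted above can be sketched as a standalone function (the constant name is a hypothetical, per the naming suggestion earlier in this review):

```python
_FASTEMBED_MAX_THREADS = 8  # hypothetical constant name

def resolve_fastembed_threads(available_cpus: int) -> int:
    """Cap at 8 threads while reserving a 2-CPU cushion, with a floor of 2."""
    return min(_FASTEMBED_MAX_THREADS, max(2, available_cpus - 2))

assert resolve_fastembed_threads(4) == 2    # small box: the floor applies
assert resolve_fastembed_threads(10) == 8   # big box: the cap applies
```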

⚠️ Observation

The window-batch path opens a shared read session per window, then opens per-entity write sessions inside _prepare_entity_vector_jobs_prefetched. Under a slow or failing write, the shared read results are already in memory, so there is no double-read risk. This is correct behavior, but worth a comment at _prepare_entity_vector_jobs_window:1216 explaining that the read session is closed before write sessions open.


Compatibility

❌ Unicode in new console output (blocking per CLAUDE.md)

db.py:276 and db.py:325 use the Unicode check mark in new console.print calls added by this PR:

console.print("  [green]✓[/green] Full-text search index rebuilt")   # line 276
console.print(f"  [green]✓[/green] Embeddings complete: ...")         # line 325

Per CLAUDE.md: "Avoid using emojis and unicode characters in console and log output." These are new additions in this diff.

✅ File path handling

All path operations use Path objects — Windows compatible.


Checklist

Category Status Notes
Coding conventions (CLAUDE.md) ✅ Pass Follows literate style, section headers, decision comments
Python 3.12+ type annotations ✅ Pass Full coverage
SQLAlchemy 2.0 async ✅ Pass No sync patterns
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass Extensive additions
Integration tests ✅ Pass
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass Appropriate pragma: no cover usage
Documentation/docstrings ✅ Pass
MCP atomic design n/a No new MCP tools
DB migrations ✅ Pass No schema changes; tables created at runtime as before
Local-first architecture ✅ Pass
Knowledge graph consistency ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
Error handling/logging ✅ Pass
Performance ✅ Pass Core improvement of this PR
No sensitive data in logs ✅ Pass
Windows path compatibility ✅ Pass Uses Path objects
No emojis/unicode in output ❌ Fail Unicode ✓ in 2 new console.print lines in db.py

Required Before Merge

  1. Replace ✓ with ASCII text in db.py lines 276 and 325 (compatibility rule from CLAUDE.md)

Suggested (Non-blocking)

  1. Extract named constants for 8 in _vector_prepare_window_size base and _resolve_fastembed_runtime_knobs
  2. Consolidate the two identical early-exit delete blocks in _prepare_entity_vector_jobs_prefetched
  3. Move _effective_parallel() call into a runtime_log_attrs() provider method to avoid base class reaching into private FastEmbed internals

Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez phernandez force-pushed the wip/vector-sync-window-batching branch from 66eb72d to a68933a on April 8, 2026 20:23
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 45s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Read key changed files
  • Check if previous review issues were addressed
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

Summary

This is a solid performance PR delivering batched window reads for the vector-sync prepare phase and CPU-aware FastEmbed thread/parallel defaults. The benchmark numbers (down from ~11:15 to ~9:04 wall clock, embed time from ~625s to ~486s) demonstrate a meaningful real-world improvement.

What was fixed since previous reviews

Unicode ✓ replaced with done: db.py lines 276 and 325 now use [green]done[/green] (the blocking compatibility issue from prior reviews is resolved)


Code Quality & Standards

✅ Passes

  • Python 3.12+ type annotations, type aliases, | unions throughout
  • Literate programming style with Trigger/Why/Outcome comments on every non-obvious branch
  • Section headers in multi-phase files
  • SQLAlchemy 2.0 async patterns — no sync calls in hot path
  • Parameterized SQL throughout; dynamic bind-param names come only from enumerate() indices, never from user input
  • 100-character line limit respected

⚠️ Minor: magic numbers 8 should be named constants

search_repository_base.py:1116:

return max(1, min(self._semantic_embedding_sync_batch_size, 8))

embedding_provider_factory.py:58:

threads = min(8, max(2, available_cpus - 2))

Both occurrences are documented by comments, but extracting to named constants (_SQLITE_MAX_PREPARE_WINDOW = 8, _FASTEMBED_MAX_THREADS = 8) makes them searchable and easier to tune in the future.


Testing & Documentation

✅ Passes

  • tests/cli/test_db_reindex.py (+322 lines) — comprehensive CLI wiring coverage for all flag combinations, cloud-project guard, missing-project guard
  • tests/repository/test_semantic_search_base.py (+334 lines) — window batch tests, shard plan tests, error handling paths
  • tests/repository/test_sqlite_vector_search_repository.py (+158 lines) — _prepare_window_existing_rows_sql variants
  • Postgres unit tests updated for new window path
  • Telemetry tests added (+56 lines)
  • Sync incremental tests include force_full path (+27 lines)
  • # pragma: no cover used appropriately (Typer entrypoints, TYPE_CHECKING blocks)

Basic Memory Architecture

✅ Passes

  • Local-first architecture preserved; no new external dependencies
  • SQLite write serialization maintained — _sqlite_prepare_write_lock serializes writes inside _prepare_entity_write_scope() while concurrent per-entity prepare work fans out within each window
  • Postgres overrides _vector_prepare_window_size() with its higher _semantic_postgres_prepare_concurrency — backend-specific tuning is clean
  • _clear_project_vectors_for_full_reindex() correctly handles SQLite cascade-delete vs Postgres
  • return_exceptions=True in asyncio.gather preserves per-entity failure isolation — a window-level DB failure returns the same exception for all window entities (correct behavior)
  • sync_embeddings=False decoupling prevents a redundant embedding pass when CLI runs FTS and embeddings as separate phases

⚠️ Minor: batch_span.set_attributes() is called after the span context exits

search_repository_base.py:836-842 opens the span context:

with telemetry.started_span(
    "basic_memory.vector_sync.batch",
    ...
) as batch_span:
    for window_start in range(0, total_entities, prepare_window_size):
        ...

But batch_span.set_attributes(...) at lines 1091–1104 is outside this with block (the block closes after all flush logic). The logfire span has already been committed when the attributes are set. In practice Logfire may accept post-close mutations, but semantically these attributes should be set inside the with block (or accumulated earlier and passed on entry) to guarantee they appear in traces.

⚠️ Suggested: duplicated early-exit delete blocks in _prepare_entity_vector_jobs_prefetched

Lines 1267–1280 and 1284–1297 are identical: delete chunks, commit, compute prepare_seconds, return empty _PreparedEntityVectorSync. Only the guard condition differs (not source_rows vs not chunk_records). These could be merged into a shared inline helper to reduce drift surface.

⚠️ Suggested: private method access across abstraction boundary

search_repository_base.py:1630:

effective_parallel=provider._effective_parallel(),

The base class reaches into FastEmbedEmbeddingProvider's private API. Consider adding a runtime_log_attrs() -> dict method to the provider interface so the base class can log provider details without knowing internal structure.


Security & Performance

✅ Passes

  • No hardcoded secrets or credentials
  • Parameterized SQL throughout (bind-param names from enumerate() only)
  • FastEmbed parallel=1 default matches single-process preference in docs
  • Auto-thread formula min(8, max(2, available_cpus - 2)) reserves 2-CPU cushion — sound for mixed-workload environments
  • No sensitive data in logs; _METRICS dict prevents repeated metric instrument creation

Compatibility

✅ Passes

  • Unicode done fix applied — console output now ASCII-safe
  • All path operations use Path objects — Windows compatible
  • scanned_paths: Set[str] holds string paths derived from existing path handling; no new Path comparisons introduced

Checklist Result

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Excellent literate-style comments throughout
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async ✅ Pass
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass Extensive additions
Integration tests ✅ Pass force_full path validated
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass Appropriate pragma: no cover usage
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes; vector tables recreated at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass Fixed: done
Windows path compatibility ✅ Pass
started_span timing ⚠️ Minor set_attributes after span context exits
Magic number 8 ⚠️ Minor Two instances — could be named constants
Duplicated delete blocks ⚠️ Minor _prepare_entity_vector_jobs_prefetched lines 1267–1297
Private _effective_parallel() ⚠️ Minor Base class reaches into provider internals

Overall: Approve. The blocking compatibility issue from previous reviews (unicode ✓) has been fixed. All remaining items are minor non-blocking suggestions. The core batching and FastEmbed tuning work is correct, well-tested, and follows project conventions throughout.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 1m 58s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review
  • Read key changed files
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

Summary

All suggestions from previous review rounds have been addressed in d59b974 refactor(core): address review nits. This PR is ready to merge.


What was fixed in the latest commit

Previous suggestion Status
Unicode ✓ → ASCII done in db.py (blocking) ✅ Fixed
Magic number 8: _SQLITE_MAX_PREPARE_WINDOW in search_repository_base.py ✅ Fixed
Magic number 8: _FASTEMBED_MAX_THREADS in embedding_provider_factory.py ✅ Fixed
batch_span.set_attributes() called after span context exits ✅ Fixed — now indented inside with telemetry.started_span(...) as batch_span:
Duplicate early-exit delete blocks in _prepare_entity_vector_jobs_prefetched ✅ Fixed — consolidated into _delete_entity_chunks_and_finish() inner helper
Base class reaching into provider._effective_parallel() private method ✅ Fixed — runtime_log_attrs() added to EmbeddingProvider Protocol, FastEmbedEmbeddingProvider, and OpenAIEmbeddingProvider
Window-level exception collapse missing comment ✅ Fixed — Trigger/Why/Outcome comment added

Final Checklist

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Literate-style throughout, decision-point comments, section headers
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async ✅ Pass
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass +322 lines CLI, +334 lines semantic base, +158 lines SQLite vector
Integration tests ✅ Pass force_full path validated
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass # pragma: no cover used appropriately
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes; vector tables recreated at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass done
Windows path compatibility ✅ Pass
started_span timing ✅ Pass set_attributes now inside span context
Magic number 8 ✅ Pass _SQLITE_MAX_PREPARE_WINDOW and _FASTEMBED_MAX_THREADS
Duplicate delete blocks ✅ Pass _delete_entity_chunks_and_finish() helper
Provider abstraction boundary ✅ Pass runtime_log_attrs() on Protocol

Overall: Approve. The implementation is correct, well-tested, and follows project conventions throughout. No remaining issues.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 3m 11s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review
  • Review new changes against checklist
  • Post final review summary

What changed in 794dcc6 perf(core): lower vector sync batch defaults

Single file touched: src/basic_memory/config.py

Both semantic_embedding_batch_size and semantic_embedding_sync_batch_size dropped from 64 to 2:

# Before
semantic_embedding_batch_size: int = Field(default=64, ...)
semantic_embedding_sync_batch_size: int = Field(default=64, ...)

# After
semantic_embedding_batch_size: int = Field(default=2, ...)
semantic_embedding_sync_batch_size: int = Field(default=2, ...)

Analysis

Change is clean and minimal — config-only, no new logic.

Rationale is well-documented — the Trigger/Why/Outcome comment added in config.py:191-196 is clear: full rebuilds are dominated by queue wait (entities blocking on a full batch to form), not embedding compute. Shrinking the batch cuts wait far more than it increases per-batch overhead.

Benchmarks back it up — the PR description shows queue_wait_seconds_total=2717.249 in the final run, confirming queue saturation was the bottleneck being resolved.

Tests unaffected — all test references to 64 either use the provider's own constructor default (independent of config) or hardcode the value explicitly in test setup. The _ConcreteRepo test stub at test_semantic_search_base.py:59 hardcodes _semantic_embedding_sync_batch_size = 64 — that's correct test isolation, not a stale value.


One observation worth noting

With semantic_embedding_sync_batch_size = 2, _vector_prepare_window_size() now returns max(1, min(2, 8)) = 2 instead of the previous max(1, min(64, 8)) = 8. The window batching path introduced earlier in this PR (which shares one session across multiple entities' DB reads) now spans only 2 entities at a time instead of 8.

This is a trade-off that the benchmarks validate: the queue-wait savings outweigh the reduced batching efficiency in the read window. But it means these two configs are now more tightly coupled than they appear — increasing semantic_embedding_sync_batch_size in the future will simultaneously increase the window size (up to _SQLITE_MAX_PREPARE_WINDOW = 8). This coupling isn't obvious from the field descriptions alone. Not blocking — the comment on _vector_prepare_window_size() handles this — but worth keeping in mind for future tuning docs.
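The coupling can be seen by evaluating the window-size formula under both defaults (a sketch; the function and constant names mirror the review's description rather than the actual source):

```python
_SQLITE_MAX_PREPARE_WINDOW = 8  # hypothetical constant name

def vector_prepare_window_size(sync_batch_size: int) -> int:
    """Window size is derived from the sync batch size, capped at 8."""
    return max(1, min(sync_batch_size, _SQLITE_MAX_PREPARE_WINDOW))

assert vector_prepare_window_size(64) == 8  # old default: full window of 8
assert vector_prepare_window_size(2) == 2   # new default: window shrinks too
assert vector_prepare_window_size(0) == 1   # floor of 1
```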


Checklist

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Trigger/Why/Outcome comment present
Python 3.12+ type annotations ✅ Pass No new code
SQLAlchemy patterns n/a Config-only change
Unit tests ✅ Pass No test updates needed; existing tests use explicit values
Integration tests ✅ Pass
100% coverage ✅ Pass
No hardcoded secrets ✅ Pass
No unicode/emoji in output ✅ Pass Already fixed in prior commit
Performance ✅ Pass Benchmark-validated
All prior review items ✅ Pass All addressed in d59b974

Overall: Approve. This is a clean, minimal, benchmark-backed config tuning on top of an already-approved implementation. All previous review findings remain resolved.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 2m 17s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (0434d9c fix(sync): stabilize windows checksum writes)
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

What changed in 0434d9c fix(sync): stabilize windows checksum writes

Three files touched: file_utils.py, services/file_service.py, tests/services/test_file_service.py

This commit addresses a Windows-specific checksum divergence bug through two complementary fixes:

1. Force LF line endings at write time (file_utils.py)

# Before
async with aiofiles.open(temp_path, mode="w", encoding="utf-8") as f:

# After
async with aiofiles.open(temp_path, mode="w", encoding="utf-8", newline="\n") as f:

Applied to both write_file_atomic and format_markdown_builtin. Prevents Windows text mode from translating \n to \r\n, which would make stored bytes diverge from the in-memory strings used for indexing and hashing.
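A minimal standard-library demonstration of why forcing newline="\n" keeps disk bytes identical to the in-memory string (this is not the project's actual write path):

```python
import os
import tempfile

content = "line one\nline two\n"

# Passing newline="\n" disables universal-newline translation on write,
# so "\n" is never expanded to "\r\n" even in Windows text mode.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="\n", delete=False, suffix=".md"
) as f:
    f.write(content)
    path = f.name

try:
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.unlink(path)

# Disk bytes match the in-memory string, so checksums of either agree.
assert raw == content.encode("utf-8")
```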

2. Compute checksum from disk, not from the in-memory string (file_service.py)

# Before (write_file)
checksum = await file_utils.compute_checksum(final_content)

# After
checksum = await self.compute_checksum(full_path)

Same fix applied in update_frontmatter_with_result. The FileService.compute_checksum(path) method reads raw bytes in 64KB chunks — so the returned hash always matches what's actually stored, regardless of any byte-level transformation that happened during the write.
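A sketch of the chunked checksum-from-disk approach in the same spirit (sha256 and the exact helper shape are assumptions based on the review text; only the 64KB chunk size comes from it):

```python
import hashlib
import os
import tempfile

def compute_checksum(path: str, chunk_size: int = 64 * 1024) -> str:
    """Hash the bytes actually on disk in bounded 64KB chunks, so the
    result reflects any transformation applied during the write."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

content = b"# note\ncontent\n" * 10_000  # well over one 64KB chunk
with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(content)
    path = f.name
try:
    digest_hex = compute_checksum(path)
finally:
    os.unlink(path)

# The chunked read over the file equals a one-shot hash of the same bytes.
assert digest_hex == hashlib.sha256(content).hexdigest()
```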


Analysis

✅ Code Quality & Standards

  • Trigger/Why/Outcome comments on every changed branch — follows the project's literate programming conventions exactly
  • newline="\n" is the correct aiofiles.open() parameter; encoding and newline are independent parameters and both are needed here
  • 100-character line limit maintained
  • No commented-out code

✅ Testing & Documentation

  • test_update_frontmatter_checksum_matches_persisted_bytes monkeypatches write_file_atomic to simulate CRLF byte writes and asserts the returned checksum equals compute_checksum(path) (the disk read). This directly exercises the regression scenario.
  • The write_file path change is implicitly covered by existing test_write_* tests; the root cause (CRLF) is now prevented upstream in write_file_atomic.
  • # pragma: no cover on the formatter branch is correct — format_on_save is off in test config.
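The shape of that regression scenario can be sketched without the real FileService fixtures; crlf_write below is a hypothetical stand-in for a monkeypatched write_file_atomic that persists CRLF bytes:

```python
import hashlib
import pathlib
import tempfile

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def crlf_write(path: pathlib.Path, text: str) -> None:
    # Stand-in for Windows text mode: logical "\n" lands on disk as b"\r\n".
    path.write_bytes(text.replace("\n", "\r\n").encode("utf-8"))

content = "---\ntitle: note\n---\nbody\n"
with tempfile.TemporaryDirectory() as d:
    note = pathlib.Path(d) / "note.md"
    crlf_write(note, content)

    # The disk hash matches storage; the pre-write string hash does not.
    disk_checksum = sha256_hex(note.read_bytes())
    string_checksum = sha256_hex(content.encode("utf-8"))

assert disk_checksum != string_checksum
```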

✅ Basic Memory Architecture

  • Checksum-from-disk is the more robust design: it makes FileService the single authoritative source for checksums, regardless of upstream transformations (formatters, platform newline handling, BOM stripping)
  • No schema changes; no migration needed

✅ Security & Performance

  • No secrets
  • compute_checksum(full_path) uses chunked async reads (64KB) — memory usage is bounded and unchanged
  • The extra disk read per write is negligible; checksums are computed post-write already

✅ Compatibility

  • This fix is the Windows compatibility fix — forcing newline="\n" closes the CRLF divergence at the root. Combined with reading from disk, sync and move detection now produce deterministic checksums across platforms.

⚠️ Minor observation (non-blocking)

file_service.py:211-216:

if self.app_config:
    formatted_content = await file_utils.format_file(
        full_path, self.app_config, is_markdown=self.is_markdown(path)
    )
    if formatted_content is not None:
        pass  # pragma: no cover

format_file is still called for its side effect (writing reformatted content to disk), but the return value is now explicitly discarded with pass. The intent is correct — we'll read the checksum from disk anyway — but the empty if body is slightly unusual. A one-line comment like # formatter wrote to disk; checksum read below would make it self-documenting. Non-blocking.


Checklist

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Trigger/Why/Outcome on every changed branch
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async n/a
FastAPI/Typer conventions n/a
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass Regression test covers the CRLF divergence case
Integration tests ✅ Pass Existing write tests cover the write path
Edge case coverage ✅ Pass Monkeypatched CRLF write simulates the actual failure
100% test coverage ✅ Pass pragma: no cover used appropriately
DB migrations n/a No schema changes
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass Already fixed in prior commit
Windows path compatibility ✅ Pass This commit specifically resolves the Windows checksum bug
Empty if body in write path ⚠️ Minor pass is correct but could use an explanatory comment

Overall: Approve. This is a clean, targeted, correctly scoped fix. The dual-layer approach (prevent CRLF at the write source + always hash from disk) is defensive in depth and the right design. All previous review findings remain resolved.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 2m 38s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (b54c5f2 fix(sync): preserve platform newlines)
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

What changed in b54c5f2 fix(sync): preserve platform newlines

Single file touched: src/basic_memory/file_utils.py — 3 insertions, 7 deletions.

This commit reverses one half of the dual-layer Windows fix introduced in 0434d9c:

| Fix | 0434d9c | b54c5f2 |
| --- | --- | --- |
| Force LF via newline="\n" in write_file_atomic and format_markdown_builtin | ✅ Added | ❌ Reverted |
| Compute checksum from disk bytes in write_file and update_frontmatter_with_result | ✅ Added | ✅ Kept |

The newline="\n" override is gone from both write paths, allowing the platform to use its native line endings (CRLF on Windows). The disk-based checksum approach remains.


Analysis

✅ Functional correctness is maintained

I traced the checksum comparison paths to confirm there is no regression:

  • Write path: write_file → binary compute_checksum(path) → stored in DB
  • Sync comparison: file_service.compute_checksum(path) (binary rb mode) → consistent
  • Initial entity load: read_file_bytes + compute_checksum(bytes) → consistent
  • read_file: computes checksum from text-mode string, but callers at entity_service.py:627 and entity_service.py:1127 discard the checksum (_), so this is not in the sync comparison path

The disk-read checksum is the single authoritative source for sync comparisons. Removing newline="\n" allows CRLF on Windows, but since the checksum is always computed from the actual on-disk bytes, subsequent reads will always agree with what was stored.

✅ CRLF regression test still present

The regression test test_update_frontmatter_checksum_matches_persisted_bytes (tests/services/test_file_service.py:172) is unchanged and still validates that FileService hashes the stored file, not the pre-write string. This test covers the root scenario that motivated 0434d9c.


⚠️ Missing decision-point comment (Minor — CLAUDE.md violation)

file_utils.py:117-118:

# Use aiofiles for non-blocking write
async with aiofiles.open(temp_path, mode="w", encoding="utf-8") as f:

The original Trigger/Why/Outcome comment from 0434d9c was removed and replaced with a generic "what" comment that has nothing to do with the decision being made here. The comment describes aiofiles (obvious) but says nothing about why newline="\n" was removed. Per CLAUDE.md, decision-point comments must explain:

  • Trigger: when does this apply?
  • Why: the rationale
  • Outcome: what changes downstream

Something like:

# Trigger: caller passes normalized Python strings (LF-only); write_file / update_frontmatter
#          already compute the checksum from on-disk bytes, so platform newline translation
#          does not cause checksum drift.
# Why: preserving platform-native newlines (CRLF on Windows) keeps files compatible
#      with Obsidian and other local editors that expect the host OS line ending.
# Outcome: file bytes may differ across platforms; downstream consumers use
#          FileService.compute_checksum(path) (binary mode) to stay consistent.
async with aiofiles.open(temp_path, mode="w", encoding="utf-8") as f:


Checklist

Area Status Notes
Code conventions (CLAUDE.md) ⚠️ Minor Decision-point comment missing in write_file_atomic
Python 3.12+ type annotations ✅ Pass No new code
SQLAlchemy patterns n/a Utility change
Unit tests ✅ Pass CRLF regression test retained and still valid
Integration tests ✅ Pass
100% coverage ✅ Pass
No hardcoded secrets ✅ Pass
No unicode/emoji in output ✅ Pass Already fixed in prior commit
Windows path compatibility ✅ Pass Disk-based checksum keeps sync consistent across platforms
All prior review items ✅ Pass All addressed in d59b974

Overall: Approve with minor suggestion. The functional change is correct — the disk-based checksum approach is sufficient to maintain sync consistency without forcing LF, and the CRLF regression test is still in place. The only item worth fixing before merge is the missing Trigger/Why/Outcome comment in write_file_atomic, which doesn't explain the reasoning behind removing newline="\n".

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 1m 7s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (3f23b3e docs(core): explain windows newline checksum behavior)
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

What changed in 3f23b3e docs(core): explain windows newline checksum behavior

Single file touched: src/basic_memory/file_utils.py — 14 insertions, 1 deletion.

This commit adds the missing Trigger/Why/Outcome decision-point comments that the previous review flagged in b54c5f2. Two locations updated:

write_file_atomic (line 117–123):

# Trigger: callers hand us normalized Python text, but the final bytes are allowed
#          to use the host platform's native newline convention during the write.
# Why: preserving CRLF on Windows keeps local files aligned with editors like
#      Obsidian, while FileService now hashes the persisted file bytes instead of
#      the pre-write string.
# Outcome: this async write stays editor-friendly across platforms without
#          reintroducing checksum drift in sync or move detection.
async with aiofiles.open(temp_path, mode="w", encoding="utf-8") as f:

format_markdown_builtin (line 177–183):

# Trigger: mdformat may rewrite markdown content, then the host platform
#          decides the newline bytes for the follow-up async text write.
# Why: we want formatter output to preserve native newlines instead of
#      forcing LF, and the authoritative checksum comes from rereading the
#      stored file bytes later in FileService.
# Outcome: formatting remains compatible with local editors on Windows while
#          checksum-based sync logic stays anchored to on-disk bytes.
async with aiofiles.open(path, mode="w", encoding="utf-8") as f:

Analysis

✅ Missing decision-point comment addressed

Both comments follow the Trigger/Why/Outcome format exactly as specified in CLAUDE.md. They correctly explain:

  • Trigger: what kind of input arrives (normalized Python text)
  • Why: editor compatibility (CRLF on Windows for Obsidian) + the disk-checksum approach eliminates drift risk
  • Outcome: editor-friendly cross-platform writes without sync regression

The second comment in format_markdown_builtin is a bonus — it wasn't flagged in the prior review but follows the same reasoning pattern, which makes the dual-write path self-documenting throughout.

✅ No regressions introduced

This is a docs-only commit — no logic changes, no new tests needed. All previously passing review items remain intact.


Final Checklist (all items from all review rounds)

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Literate-style throughout; all decision-point comments present
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async ✅ Pass
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass +322 lines CLI, +334 lines semantic base, +158 lines SQLite vector
Integration tests ✅ Pass force_full path validated
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass # pragma: no cover used appropriately
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes; vector tables recreated at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass Fixed in prior commit
Windows path compatibility ✅ Pass Disk-based checksum resolves cross-platform drift
started_span timing ✅ Pass set_attributes inside span context (fixed in d59b974)
Magic number 8 ✅ Pass _SQLITE_MAX_PREPARE_WINDOW and _FASTEMBED_MAX_THREADS
Duplicate delete blocks ✅ Pass _delete_entity_chunks_and_finish() helper
Provider abstraction boundary ✅ Pass runtime_log_attrs() on Protocol
Decision-point comment (write_file_atomic) ✅ Pass Trigger/Why/Outcome added in this commit
Decision-point comment (format_markdown_builtin) ✅ Pass Added proactively in this commit

Overall: Approve. Every item raised across all review rounds has been resolved. The PR delivers a real, benchmark-validated performance improvement with a clean implementation, comprehensive tests, and thorough documentation. Ready to merge.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 1m 0s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (5069b46 test(core): clarify windows checksum regression)
  • Review new changes against checklist
  • Post final review summary

What changed in 5069b46 test(core): clarify windows checksum regression

Single file touched: tests/services/test_file_service.py — 4 insertions, 3 deletions. Test-only, no logic changes.

Three clarifications made to test_update_frontmatter_checksum_matches_persisted_bytes:

| Change | Before | After |
| --- | --- | --- |
| Test name | test_update_frontmatter_checksum_matches_persisted_bytes | test_update_frontmatter_checksum_matches_windows_crlf_persisted_bytes |
| Docstring | "Frontmatter writes should hash the stored file, not the pre-write string." | "Windows-style CRLF writes should hash the stored file, not the pre-write string." |
| Trigger comment | "Trigger: simulate a writer that persists CRLF bytes like Windows text mode." | "Trigger: simulate Windows text-mode persistence, where logical LF strings land on disk as CRLF bytes." |

Analysis

✅ Naming and clarity improvements

All three changes make the test's intent more precise. The rename makes the platform-specific regression scenario explicit in the test name itself, so developers reading a failing test output or coverage report immediately know what scenario is covered. The expanded trigger comment follows CLAUDE.md's literate style more closely than the prior one-liner.

✅ No regressions introduced

Test-only commit with zero logic changes. All previously passing review items remain intact.

✅ Trigger/Why/Outcome format preserved

The expanded trigger comment still follows the Trigger/Why/Outcome structure exactly. The Why and Outcome lines are unchanged from 0434d9c; only the Trigger line was made more descriptive.


Final Checklist (all items across all review rounds)

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Literate-style throughout; all decision-point comments present
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async ✅ Pass
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass +322 lines CLI, +334 lines semantic base, +158 lines SQLite vector
Integration tests ✅ Pass force_full path validated
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass # pragma: no cover used appropriately
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes; vector tables recreated at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass Fixed in prior commit
Windows path compatibility ✅ Pass Disk-based checksum resolves cross-platform drift
started_span timing ✅ Pass set_attributes inside span context (fixed in d59b974)
Magic number 8 ✅ Pass _SQLITE_MAX_PREPARE_WINDOW and _FASTEMBED_MAX_THREADS
Duplicate delete blocks ✅ Pass _delete_entity_chunks_and_finish() helper
Provider abstraction boundary ✅ Pass runtime_log_attrs() on Protocol
Decision-point comment (write_file_atomic) ✅ Pass Trigger/Why/Outcome added in 3f23b3e
Decision-point comment (format_markdown_builtin) ✅ Pass Added in 3f23b3e
Windows CRLF regression test clarity ✅ Pass Test name and trigger comment sharpened in this commit

Overall: Approve. This is a pure test-clarity commit with no logic changes. The renamed test and expanded trigger comment make the Windows CRLF regression scenario unambiguous. Every item raised across all review rounds remains resolved. Ready to merge.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 2m 32s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (7421cfc fix(core): align checksums with persisted bytes)
  • Review new changes against checklist
  • Post final review summary

What changed in 7421cfc fix(core): align checksums with persisted bytes

Two files touched: src/basic_memory/services/file_service.py (+8/-1) and tests/indexing/test_batch_indexer.py (+16/-2).


file_service.py: read_file() checksum now reads from disk

# Before
checksum = await file_utils.compute_checksum(content)

# After
# Trigger: text-mode reads normalize line endings on Windows, so the
#          decoded string can differ from the bytes we just wrote.
# Why: write_file/update_frontmatter now return the checksum of the
#      persisted file, and read_file should report the same authority.
# Outcome: callers get human-readable content plus the checksum for the
#          exact bytes stored on disk.
checksum = await self.compute_checksum(full_path)

file_utils.compute_checksum(content) hashes in-memory text (LF-normalized by Python's text-mode decoder), while self.compute_checksum(full_path) reads raw bytes in binary mode. This makes read_file consistent with write_file and update_frontmatter_with_result, which were already fixed to use disk-based checksums in 0434d9c. The disk-based checksum approach is now the single, authoritative strategy across all FileService methods that return a checksum.


test_batch_indexer.py — two assertions made platform-neutral

# Before
assert result.indexed[0].markdown_content == original_content  # LF literal

# After
persisted_content = (project_config.home / path).read_text()  # reads from disk
assert result.indexed[0].markdown_content == persisted_content

Same change applied to test_batch_indexer_returns_original_markdown_content_when_no_frontmatter (line 272) and test_batch_indexer_uses_parsed_markdown_body_for_malformed_frontmatter (line 527). Both previously compared against string literals with LF line endings. On Windows, the file is persisted with CRLF (now intentionally, per the b54c5f2 decision), and markdown_content is produced from a text-mode file read — same path as read_text(). Both sides now go through the same text-mode normalization, making the assertion platform-safe.


Analysis

✅ Code Quality & Standards

  • Trigger/Why/Outcome comment follows CLAUDE.md's literate programming format exactly
  • Change is minimal — single line replaced in read_file, two test assertions updated
  • 100-character line limit maintained
  • No commented-out code

✅ Functional correctness

The only two callers of read_file in production code (entity_service.py:627 and entity_service.py:1127) both discard the returned checksum (current_content, _ = await self.file_service.read_file(...)). The behavior change has no effect on those code paths today, but it ensures any future caller gets the authoritative disk-based checksum rather than a LF-normalized hash that could diverge on Windows.

On Linux/macOS the two approaches are equivalent (no CRLF translation), so there is no regression on those platforms.

✅ Testing

  • Test changes are platform-idiomatic: read_text() normalizes CRLF → LF on all platforms, consistent with how markdown_content is produced
  • Trigger/Why/Outcome comments added to both test assertion blocks
  • No new test is needed for the read_file change itself — the write path already has test_update_frontmatter_checksum_matches_windows_crlf_persisted_bytes, and both callers discard the checksum

⚠️ Minor observation (non-blocking)

read_file now does two sequential async I/O passes per call: one text-mode read for content and one binary read for the checksum. The extra pass is negligible (files are typically small), and the semaphore in compute_checksum bounds concurrency correctly. Worth noting only because future profiling might flag the double read.


Final Checklist (all items across all review rounds)

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Trigger/Why/Outcome comment present and correct
Python 3.12+ type annotations ✅ Pass No new code
SQLAlchemy 2.0 async n/a
FastAPI/Typer conventions n/a
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass +322 lines CLI, +334 lines semantic base, +158 lines SQLite vector
Integration tests ✅ Pass force_full path validated
Batch indexer test platform safety ✅ Pass read_text() replaces LF literals in 2 assertions
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass # pragma: no cover used appropriately
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass Fixed in prior commit
Windows path compatibility ✅ Pass Disk-based checksum resolves cross-platform drift
read_file checksum consistency ✅ Pass Now uses self.compute_checksum(path) like all other FileService methods
started_span timing ✅ Pass Fixed in d59b974
Magic number 8 ✅ Pass _SQLITE_MAX_PREPARE_WINDOW and _FASTEMBED_MAX_THREADS
Duplicate delete blocks ✅ Pass _delete_entity_chunks_and_finish() helper
Provider abstraction boundary ✅ Pass runtime_log_attrs() on Protocol
Decision-point comments ✅ Pass All present and complete across all changed files

Overall: Approve. This commit completes the disk-based checksum strategy by closing the last remaining read_file gap. The change is minimal, correctly scoped, and consistent with the approach established across this PR. All items raised across all review rounds remain resolved. Ready to merge.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 4m 17s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (7c37e49 test(core): preserve persisted windows newlines)
  • Trace content pipeline to validate assertion correctness
  • Review against checklist
  • Post final review summary

What changed in 7c37e49 test(core): preserve persisted windows newlines

Single file touched: tests/indexing/test_batch_indexer.py — 2 insertions, 2 deletions. Test-only, no logic changes.

Both modified assertions change read_text() to read_bytes().decode("utf-8"):

# Before (7421cfc)
persisted_content = (project_config.home / path).read_text()

# After (7c37e49)
persisted_content = (project_config.home / path).read_bytes().decode("utf-8")

Applied to both test_batch_indexer_returns_original_markdown_content_when_no_frontmatter_rewrite and test_batch_indexer_uses_parsed_markdown_body_for_malformed_frontmatter_delimiters.


Analysis

✅ The change is correct — and fixes a subtle Windows bug in the prior commit

Tracing the content pipeline for result.indexed[0].markdown_content:

file_service.read_file_bytes() — opens "rb" (binary, no newline translation)
  ↓
IndexInputFile.content: bytes
  ↓
file.content.decode("utf-8") — no normalization, CRLF preserved on Windows
  ↓
_PreparedMarkdownFile.content: str  ← CRLF preserved on Windows
  ↓
markdown_content = prepared.content  ← CRLF preserved on Windows

prepared.content comes from a binary read + .decode("utf-8") — there is no text-mode normalization anywhere in this path. On Windows, a file written by write_file_atomic (text mode, no newline="\n") will contain CRLF bytes on disk. The binary decode produces a string with literal \r\n.

read_bytes().decode("utf-8") is the correct mirror of this: it reads raw bytes and decodes without normalization, so both sides of the assertion have CRLF on Windows and LF on Linux/macOS. The assertion is now truly platform-neutral.

The prior commit 7421cfc used read_text(), whose universal-newline mode normalizes CRLF to LF. On Windows, where the persisted file contains CRLF, that would have made persisted_content (LF) diverge from markdown_content (CRLF) — a silent test failure on Windows CI. 7c37e49 closes that gap.

On Linux (where CI runs), both approaches are equivalent: the file is persisted with LF, so universal-newline translation has nothing to change and the tests pass either way. The fix only matters on Windows.
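The divergence between the two read styles is visible on any platform by writing CRLF bytes explicitly:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as d:
    p = pathlib.Path(d) / "note.md"
    p.write_bytes(b"alpha\r\nbeta\r\n")  # what a Windows text-mode write persists

    normalized = p.read_text(encoding="utf-8")  # universal newlines: CRLF becomes LF
    raw = p.read_bytes().decode("utf-8")        # exact on-disk bytes, CRLF preserved

print(repr(normalized))  # 'alpha\nbeta\n'
print(repr(raw))         # 'alpha\r\nbeta\r\n'
```

An assertion built on read_text() compares LF-normalized text against a CRLF markdown_content on Windows; read_bytes().decode() follows the same binary-decode path as the indexer and stays in agreement.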

✅ Comments are still accurate

The Trigger/Why/Outcome comments from 7421cfc remain and are still correct:

  • "Outcome: compare against the exact markdown text stored on disk for this file." — accurate; read_bytes().decode("utf-8") reads exactly what's on disk.
  • "Why: Windows can still surface that unchanged file with CRLF line endings." — now enforced by the assertion approach itself.

No comment changes needed.

✅ No regressions

Test-only commit with zero logic changes. All previously passing review items remain intact.


Final Checklist (all items across all review rounds)

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Literate-style throughout; Trigger/Why/Outcome on all decision branches
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async ✅ Pass
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass +322 lines CLI, +334 lines semantic base, +158 lines SQLite vector
Integration tests ✅ Pass force_full path validated
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass # pragma: no cover used appropriately
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes; vector tables recreated at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass done (fixed in prior commit)
Windows path compatibility ✅ Pass Disk-based checksum resolves cross-platform drift
started_span timing ✅ Pass Fixed in d59b974
Magic number 8 ✅ Pass _SQLITE_MAX_PREPARE_WINDOW and _FASTEMBED_MAX_THREADS
Duplicate delete blocks ✅ Pass _delete_entity_chunks_and_finish() helper
Provider abstraction boundary ✅ Pass runtime_log_attrs() on Protocol
Decision-point comments ✅ Pass All present and complete across all changed files
Windows CRLF assertion accuracy ✅ Pass read_bytes().decode("utf-8") matches binary-decoded prepared.content exactly

Overall: Approve. This commit fixes a subtle Windows test divergence from 7421cfc: read_text() normalizes CRLF→LF, which would diverge from prepared.content (binary-decoded, CRLF preserved) on Windows. The switch to read_bytes().decode("utf-8") makes both sides of the assertion follow the same decoding path. On Linux the two are equivalent, so CI stays green. All items raised across all review rounds remain resolved. Ready to merge.

@phernandez phernandez merged commit 7945c1e into main Apr 9, 2026
41 checks passed
@phernandez phernandez deleted the wip/vector-sync-window-batching branch April 9, 2026 05:20