
perf(core): speed up vector sync and tune fastembed defaults#731

Merged
phernandez merged 12 commits into main from
wip/vector-sync-window-batching
Apr 9, 2026

Conversation

@phernandez
Member

Summary

  • batch vector-sync prepare reads across SQLite and Postgres while keeping the hot path parallel where it matters
  • make local bm reindex --full match cloud-style full rebuild semantics more closely for search and embeddings
  • tune FastEmbed auto defaults to reserve a small CPU cushion and keep parallel=1, with runtime settings logged at batch start

Validation

  • just fast-check
  • full local rebuild benchmarks with uv run bm reindex --project main --full

Benchmark notes

  • older default-ish full rebuild baseline: about 11:15 total, embed_seconds_total=625.598
  • tuned threads=8, parallel=1: about 9:55 total, embed_seconds_total=546.204
  • tuned threads=6, parallel=1: about 9:10 total in one explicit run, embed_seconds_total=497.861
  • final auto-path validation with config back on null: 9:04.46 total, embed_seconds_total=486.343, queue_wait_seconds_total=2717.249

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 3m 14s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Read key changed files
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

Summary

This is a solid, well-motivated performance PR. The benchmark numbers demonstrate a real speedup (~18% on wall clock, ~22% reduction in embed time), and the code follows the project's literate programming conventions throughout. Below is the full checklist and specific findings.


Code Quality & Standards

  • Follows Basic Memory coding conventions — decision-point comments, section headers, trigger/why/outcome patterns are consistently applied
  • Python 3.12+ type annotations and async patterns — correct use of type aliases, asyncio.gather, asynccontextmanager
  • SQLAlchemy 2.0 best practices — uses text() with parameterized binds throughout; no raw string interpolation with user data
  • 100-character line length — a few lines in the log format strings are long but that's the existing pattern in this file
  • No commented-out code — clean

Minor: magic number for window size cap

search_repository_base.py:1116:

return max(1, min(self._semantic_embedding_sync_batch_size, 8))

The 8 is the shared-batch window cap for non-Postgres backends. The comment explains the rationale well, but extracting this to a named constant would make it searchable and easier to tune:

_SQLITE_VECTOR_PREPARE_WINDOW_MAX = 8



Testing & Documentation

  • Unit tests for new functions: tests/cli/test_db_reindex.py (+322 lines) comprehensively covers all bm reindex flag combinations and edge cases
  • Integration tests: test_force_full_reindexes_unchanged_files validates the key behavioral change in sync_service.py using monkeypatching to inspect what files were re-indexed
  • Edge cases covered — cloud project filter, disabled semantic search, per-project reindex, window error propagation
  • 100% coverage maintained: # pragma: no cover used only for else branch guards and runtime-mode-dependent paths

Basic Memory Architecture

  • Batched window reads are a clean abstraction: _prepare_entity_vector_jobs_window shares one session and two batched SELECTs across all entities in a window, then fans out async per-entity prepare work. The shared read / serialized write split is cleanly separated between _fetch_prepare_window_* (shared session) and _prepare_entity_write_scope (per-entity lock for SQLite).
  • Polymorphism is well-structured: _vector_prepare_window_size() and _prepare_entity_write_scope() override correctly in each backend; Postgres uses its own higher concurrency window, SQLite keeps the write lock.
  • scanned_paths in SyncReport — the force_full change in sync_service.py passes all scanned paths (not just diffs) to _index_changed_files, which is the correct approach for matching cloud-style full rebuilds. The test confirms unchanged files get re-indexed.
  • sync_embeddings=False decoupling — prevents the FTS sync phase from triggering a redundant embedding pass when the CLI runs FTS and embeddings as separate phases.

Potential: started_span attributes set after context exit

search_repository_base.py:1091-1104:

        # The `with telemetry.started_span(...) as batch_span:` block closes above
        if batch_span is not None:
            batch_span.set_attributes({...})

The started_span context wraps logfire.span(), which closes the span when the with block exits. Setting attributes on batch_span after the context has returned means the span has already been committed. This pattern may work with Logfire's implementation (some span backends accept post-close mutations), but semantically the attributes should be set before the span closes to guarantee they appear in traces. Consider moving the set_attributes call inside the with block, or accumulating them earlier and passing to the span on entry.
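The timing issue can be reproduced with a toy span that drops attribute writes once its context manager commits it. This is a minimal sketch with hypothetical _Span/started_span stand-ins, not Logfire's actual implementation:

```python
from contextlib import contextmanager

class _Span:
    """Toy span: attribute writes after close are silently dropped."""
    def __init__(self):
        self.attributes = {}
        self._closed = False
    def set_attributes(self, attrs):
        if not self._closed:
            self.attributes.update(attrs)

@contextmanager
def started_span(name):
    span = _Span()
    try:
        yield span
    finally:
        span._closed = True  # the span is committed when the context exits

# Too late: the span was already committed when the attributes arrive.
with started_span("vector_sync.batch") as late_span:
    pass
late_span.set_attributes({"entities": 42})
assert late_span.attributes == {}

# Correct: set attributes before the context exits.
with started_span("vector_sync.batch") as batch_span:
    batch_span.set_attributes({"entities": 42})
assert batch_span.attributes == {"entities": 42}
```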

Window-level exception collapse

search_repository_base.py:1225-1226:

        except Exception as exc:
            return [exc for _ in entity_ids]

When the shared read session fails, all entities in the window get the same exception. This is the right fail-safe — we can't know which entities would have been OK — but worth a comment noting that a window-level DB failure marks all window entities as failed (which is already visible in the outer warning log).
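The fail-safe described above can be sketched with a minimal window-prepare function (the names and signatures are hypothetical stand-ins, not the repository's real API):

```python
import asyncio

async def prepare_window(entity_ids, fetch):
    """One shared read per window; a window-level failure marks every
    entity in the window as failed with the same exception."""
    try:
        rows = await fetch(entity_ids)  # shared read session
    except Exception as exc:
        return [exc for _ in entity_ids]  # fan the failure out per entity
    return [rows.get(eid) for eid in entity_ids]

async def _demo():
    async def failing_fetch(ids):
        raise RuntimeError("shared read session failed")
    return await prepare_window([1, 2, 3], failing_fetch)

results = asyncio.run(_demo())
assert len(results) == 3
assert all(isinstance(r, RuntimeError) for r in results)
```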


Security & Performance

  • No hardcoded secrets — none
  • Parameterized SQL throughout — the comment at postgres_search_repository.py:479 explicitly notes that SQL template construction uses only integer enumerate() indices; no user-controlled text is interpolated
  • FastEmbed auto-tuning: _resolve_fastembed_runtime_knobs caps at min(8, available_cpus - 2) and keeps parallel=1. The benchmark notes show this is empirically validated. The two-path explicit-config vs. auto-tune split is clean.
  • Telemetry metric caching: _METRICS dict prevents repeated metric instrument creation; properly cleared in reset_telemetry_state() for test isolation

Compatibility

  • No Windows path issues introduced — the new scanned_paths: Set[str] in SyncReport holds string paths derived from existing path handling; no new Path comparisons are introduced
  • Unicode in CLI output: db.py:276 and db.py:324:
    console.print("  [green]✓[/green] Full-text search index rebuilt")
    Per CLAUDE.md, console and log output should avoid unicode characters. These appear to follow the existing convention in this file (the reset --reindex path already uses similar Rich markup elsewhere), but the ✓ is technically outside ASCII. If the project aims to be fully Windows-terminal-safe, these should use ASCII alternatives like [OK] or done.

Checklist Result

Area Status Notes
Code conventions ✅ Pass Excellent literate-style comments throughout
Type annotations ✅ Pass
SQLAlchemy patterns ✅ Pass
Unit tests ✅ Pass Full CLI wiring coverage
Integration tests ✅ Pass force_full path validated
100% coverage ✅ Pass # pragma: no cover used appropriately
MCP tool composability ✅ Pass No MCP tool changes
DB migrations ✅ Pass No schema changes; vector table recreation handled at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
Windows path compat ⚠️ Minor unicode in 2 console.print lines
started_span timing ⚠️ Minor Attributes set after span context exits
Magic number 8 ⚠️ Minor Could be a named constant

Overall: Approve with minor suggestions. The core batching and FastEmbed tuning work is correct and well-tested. The three flagged items are all minor — the unicode and span attribute timing are the most worth addressing before merge.

@phernandez phernandez marked this pull request as ready for review April 8, 2026 20:15
Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez phernandez force-pushed the wip/vector-sync-window-batching branch from 8c75457 to 66eb72d on April 8, 2026 20:17
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 28s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Read key changed files
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture concerns
  • Review security & performance
  • Review compatibility
  • Post final review summary

Summary

This PR delivers two distinct improvements: (1) batched window reads for vector-sync prepare phase to eliminate per-entity round-trips on both SQLite and Postgres, and (2) CPU-aware FastEmbed thread/parallel defaults to tune local full-rebuild performance. The benchmark numbers are compelling (~15% wall-clock reduction). Overall the implementation is solid and follows project conventions well. A few items worth addressing before merge:


Code Quality & Standards

✅ Passes

  • Full Python 3.12+ type annotations, | unions, type alias syntax used throughout
  • Literate programming style with Trigger/Why/Outcome comments on every non-obvious branch — exactly per CLAUDE.md
  • Section headers in files with multi-phase logic
  • SQLAlchemy 2.0 async patterns (no sync calls, proper async_sessionmaker)
  • Parameterized SQL throughout — no user data interpolated into f-strings (only enumerate() indices used for dynamic bind-param names, which is safe)

⚠️ Minor issues

Magic numbers should be named constants — two places use 8 without an explanatory constant name:

search_repository_base.py:1116 — base window size cap:

return max(1, min(self._semantic_embedding_sync_batch_size, 8))

embedding_provider_factory.py:58 — max thread cap:

threads = min(8, max(2, available_cpus - 2))

Both are explained by comments, but a named constant like _SQLITE_MAX_PREPARE_WINDOW = 8 and _FASTEMBED_MAX_THREADS = 8 would make the values easier to find and change in future tuning.

Duplicated early-exit delete blocks (search_repository_base.py:1267–1297):

Two consecutive early-exit branches both do the exact same thing (delete chunks, commit, compute prepare_seconds, return empty _PreparedEntityVectorSync). The only difference is the guard condition (not source_rows vs not chunk_records). These could be merged or the repeated boilerplate extracted into an inline helper, reducing the surface area for future drift.
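A minimal sketch of the suggested consolidation, with hypothetical names standing in for the real repository internals:

```python
import time

def prepare_entity_jobs(source_rows, chunk_records, delete_chunks):
    """Merge the two identical early-exit branches behind one inner helper
    (all names here are hypothetical stand-ins for the real code)."""
    started = time.monotonic()

    def _delete_and_finish():
        delete_chunks()  # delete chunks + commit in the real code
        return {"jobs": [], "prepare_seconds": time.monotonic() - started}

    if not source_rows:    # entity has no source rows
        return _delete_and_finish()
    if not chunk_records:  # chunking produced nothing
        return _delete_and_finish()
    return {"jobs": chunk_records, "prepare_seconds": time.monotonic() - started}

deleted = []
result = prepare_entity_jobs([], [], lambda: deleted.append(True))
assert result["jobs"] == [] and deleted == [True]
```

Either guard now funnels through the same helper, so future changes to the delete-and-return path happen in one place.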

Private method access across abstraction boundary (search_repository_base.py:1630):

effective_parallel=provider._effective_parallel(),

The base class imports and calls a private method on FastEmbedEmbeddingProvider. This works today but is a leaky abstraction. Consider adding a runtime_log_attrs() -> dict property to EmbeddingProvider / FastEmbedEmbeddingProvider so the base class can log provider details without knowing internal API.
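A sketch of the suggested provider-side method, using hypothetical class shapes rather than the project's real interfaces:

```python
from typing import Protocol

class EmbeddingProvider(Protocol):
    def runtime_log_attrs(self) -> dict: ...

class FastEmbedEmbeddingProvider:
    def __init__(self, threads: int = 8, parallel: int = 1):
        self._threads = threads
        self._parallel = parallel

    def _effective_parallel(self) -> int:  # private detail stays private
        return self._parallel

    def runtime_log_attrs(self) -> dict:
        # Public surface for logging; callers never touch _effective_parallel().
        return {"threads": self._threads,
                "effective_parallel": self._effective_parallel()}

provider = FastEmbedEmbeddingProvider()
assert provider.runtime_log_attrs() == {"threads": 8, "effective_parallel": 1}
```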


Testing & Documentation

✅ Passes

  • New tests/cli/test_db_reindex.py (322 lines) — covers reindex CLI wiring for all flag combinations, cloud project guard, and missing project guard
  • Semantic search base tests significantly expanded (+334 lines) including window batch tests, shard plan tests, and error handling
  • SQLite vector search repository tests expanded with _prepare_window_existing_rows_sql variants
  • Postgres unit tests updated for new window path
  • Telemetry tests added (+56 lines)
  • Sync incremental tests include new force_full path (+27 lines)

⚠️ Minor gap

test_db_reindex.py tests the _reindex async function but the outer reindex() Typer command is marked # pragma: no cover at db.py:158. This is expected for Typer entrypoints and matches existing patterns in the file. Coverage appears maintained.


Basic Memory Architecture

✅ Passes

  • Local-first architecture preserved — no external dependencies added
  • SQLite write serialization maintained via _sqlite_prepare_write_lock in _prepare_entity_write_scope() — concurrent reads now fan out per window, but writes still funnel through the lock, preventing WAL contention
  • Postgres keeps its higher _semantic_postgres_prepare_concurrency window via the _vector_prepare_window_size() override — backend-specific tuning preserved
  • _clear_project_vectors_for_full_reindex() correctly handles the SQLite cascade-delete difference vs Postgres
  • Window-based prepare path preserves return_exceptions=True semantics — individual entity failures don't abort the whole window, and the error is propagated back per-entity

Security & Performance

✅ Passes

  • All dynamically-constructed SQL uses bind parameters only (chunk/entity indices from enumerate(), never from user input)
  • _prepare_window_entity_params correctly builds deterministic bind-param names from integer indices
  • FastEmbed parallel=1 default aligns with documented single-process preference in provider — avoids spawning multiple ONNX worker processes that would fight for the same CPU budget
  • Auto-thread formula min(8, max(2, available_cpus - 2)) reserves a 2-CPU cushion — sound for mixed-workload environments
  • No secrets, credentials, or sensitive data in logs
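The auto-thread formula quoted above can be sketched as a standalone function (the constant name is a hypothetical, per the naming suggestion earlier in this review):

```python
_FASTEMBED_MAX_THREADS = 8  # hypothetical constant name

def resolve_fastembed_threads(available_cpus: int) -> int:
    """Cap at 8 threads while reserving a 2-CPU cushion, with a floor of 2."""
    return min(_FASTEMBED_MAX_THREADS, max(2, available_cpus - 2))

assert resolve_fastembed_threads(4) == 2    # small box: the floor applies
assert resolve_fastembed_threads(10) == 8   # big box: the cap applies
```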

⚠️ Observation

The window-batch path opens a shared read session per window, then opens per-entity write sessions inside _prepare_entity_vector_jobs_prefetched. Under a slow or failing write, the shared read results are already in memory, so there is no double-read risk. This is correct behavior, but worth a comment at _prepare_entity_vector_jobs_window:1216 explaining that the read session is closed before write sessions open.


Compatibility

❌ Unicode in new console output (blocking per CLAUDE.md)

db.py:276 and db.py:325 use the Unicode check mark in new console.print calls added by this PR:

console.print("  [green]✓[/green] Full-text search index rebuilt")   # line 276
console.print(f"  [green]✓[/green] Embeddings complete: ...")         # line 325

Per CLAUDE.md: "Avoid using emojis and unicode characters in console and log output." These are new additions in this diff.

✅ File path handling

All path operations use Path objects — Windows compatible.


Checklist

Category Status Notes
Coding conventions (CLAUDE.md) ✅ Pass Follows literate style, section headers, decision comments
Python 3.12+ type annotations ✅ Pass Full coverage
SQLAlchemy 2.0 async ✅ Pass No sync patterns
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass Extensive additions
Integration tests ✅ Pass
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass Appropriate pragma: no cover usage
Documentation/docstrings ✅ Pass
MCP atomic design n/a No new MCP tools
DB migrations ✅ Pass No schema changes; tables created at runtime as before
Local-first architecture ✅ Pass
Knowledge graph consistency ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
Error handling/logging ✅ Pass
Performance ✅ Pass Core improvement of this PR
No sensitive data in logs ✅ Pass
Windows path compatibility ✅ Pass Uses Path objects
No emojis/unicode in output ❌ Fail Unicode ✓ in 2 new console.print lines in db.py

Required Before Merge

  1. Replace ✓ with ASCII text in db.py lines 276 and 325 (compatibility rule from CLAUDE.md)

Suggested (Non-blocking)

  1. Extract named constants for 8 in _vector_prepare_window_size base and _resolve_fastembed_runtime_knobs
  2. Consolidate the two identical early-exit delete blocks in _prepare_entity_vector_jobs_prefetched
  3. Move _effective_parallel() call into a runtime_log_attrs() provider method to avoid base class reaching into private FastEmbed internals

Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez phernandez force-pushed the wip/vector-sync-window-batching branch from 66eb72d to a68933a on April 8, 2026 20:23
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 2m 45s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Read key changed files
  • Check if previous review issues were addressed
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

Summary

This is a solid performance PR delivering batched window reads for the vector-sync prepare phase and CPU-aware FastEmbed thread/parallel defaults. The benchmark numbers (down from ~11:15 to ~9:04 wall clock, embed time from ~625s to ~486s) demonstrate a meaningful real-world improvement.

What was fixed since previous reviews

Unicode ✓ replaced with done: db.py lines 276 and 325 now use [green]done[/green] (the blocking compatibility issue from prior reviews is resolved)


Code Quality & Standards

✅ Passes

  • Python 3.12+ type annotations, type aliases, | unions throughout
  • Literate programming style with Trigger/Why/Outcome comments on every non-obvious branch
  • Section headers in multi-phase files
  • SQLAlchemy 2.0 async patterns — no sync calls in hot path
  • Parameterized SQL throughout; dynamic bind-param names come only from enumerate() indices, never from user input
  • 100-character line limit respected

⚠️ Minor: magic numbers 8 should be named constants

search_repository_base.py:1116:

return max(1, min(self._semantic_embedding_sync_batch_size, 8))

embedding_provider_factory.py:58:

threads = min(8, max(2, available_cpus - 2))

Both occurrences are documented by comments, but extracting to named constants (_SQLITE_MAX_PREPARE_WINDOW = 8, _FASTEMBED_MAX_THREADS = 8) makes them searchable and easier to tune in the future.


Testing & Documentation

✅ Passes

  • tests/cli/test_db_reindex.py (+322 lines) — comprehensive CLI wiring coverage for all flag combinations, cloud-project guard, missing-project guard
  • tests/repository/test_semantic_search_base.py (+334 lines) — window batch tests, shard plan tests, error handling paths
  • tests/repository/test_sqlite_vector_search_repository.py (+158 lines) — _prepare_window_existing_rows_sql variants
  • Postgres unit tests updated for new window path
  • Telemetry tests added (+56 lines)
  • Sync incremental tests include force_full path (+27 lines)
  • # pragma: no cover used appropriately (Typer entrypoints, TYPE_CHECKING blocks)

Basic Memory Architecture

✅ Passes

  • Local-first architecture preserved; no new external dependencies
  • SQLite write serialization maintained — _sqlite_prepare_write_lock serializes writes inside _prepare_entity_write_scope() while concurrent per-entity prepare work fans out within each window
  • Postgres overrides _vector_prepare_window_size() with its higher _semantic_postgres_prepare_concurrency — backend-specific tuning is clean
  • _clear_project_vectors_for_full_reindex() correctly handles SQLite cascade-delete vs Postgres
  • return_exceptions=True in asyncio.gather preserves per-entity failure isolation — a window-level DB failure returns the same exception for all window entities (correct behavior)
  • sync_embeddings=False decoupling prevents a redundant embedding pass when CLI runs FTS and embeddings as separate phases

⚠️ Minor: batch_span.set_attributes() is called after the span context exits

search_repository_base.py:836-842 opens the span context:

with telemetry.started_span(
    "basic_memory.vector_sync.batch",
    ...
) as batch_span:
    for window_start in range(0, total_entities, prepare_window_size):
        ...

But batch_span.set_attributes(...) at lines 1091–1104 is outside this with block (the block closes after all flush logic). The logfire span has already been committed when the attributes are set. In practice Logfire may accept post-close mutations, but semantically these attributes should be set inside the with block (or accumulated earlier and passed on entry) to guarantee they appear in traces.

⚠️ Suggested: duplicated early-exit delete blocks in _prepare_entity_vector_jobs_prefetched

Lines 1267–1280 and 1284–1297 are identical: delete chunks, commit, compute prepare_seconds, return empty _PreparedEntityVectorSync. Only the guard condition differs (not source_rows vs not chunk_records). These could be merged into a shared inline helper to reduce drift surface.

⚠️ Suggested: private method access across abstraction boundary

search_repository_base.py:1630:

effective_parallel=provider._effective_parallel(),

The base class reaches into FastEmbedEmbeddingProvider's private API. Consider adding a runtime_log_attrs() -> dict method to the provider interface so the base class can log provider details without knowing internal structure.


Security & Performance

✅ Passes

  • No hardcoded secrets or credentials
  • Parameterized SQL throughout (bind-param names from enumerate() only)
  • FastEmbed parallel=1 default matches single-process preference in docs
  • Auto-thread formula min(8, max(2, available_cpus - 2)) reserves 2-CPU cushion — sound for mixed-workload environments
  • No sensitive data in logs; _METRICS dict prevents repeated metric instrument creation

Compatibility

✅ Passes

  • Unicode done fix applied — console output now ASCII-safe
  • All path operations use Path objects — Windows compatible
  • scanned_paths: Set[str] holds string paths derived from existing path handling; no new Path comparisons introduced

Checklist Result

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Excellent literate-style comments throughout
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async ✅ Pass
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass Extensive additions
Integration tests ✅ Pass force_full path validated
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass Appropriate pragma: no cover usage
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes; vector tables recreated at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass Fixed: done
Windows path compatibility ✅ Pass
started_span timing ⚠️ Minor set_attributes after span context exits
Magic number 8 ⚠️ Minor Two instances — could be named constants
Duplicated delete blocks ⚠️ Minor _prepare_entity_vector_jobs_prefetched lines 1267–1297
Private _effective_parallel() ⚠️ Minor Base class reaches into provider internals

Overall: Approve. The blocking compatibility issue from previous reviews (unicode ✓) has been fixed. All remaining items are minor non-blocking suggestions. The core batching and FastEmbed tuning work is correct, well-tested, and follows project conventions throughout.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

Claude finished @phernandez's task in 1m 58s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review
  • Read key changed files
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

Summary

All suggestions from previous review rounds have been addressed in d59b974 refactor(core): address review nits. This PR is ready to merge.


What was fixed in the latest commit

Previous suggestion Status
Unicode ✓ → ASCII done in db.py (blocking) ✅ Fixed
Magic number 8: _SQLITE_MAX_PREPARE_WINDOW in search_repository_base.py ✅ Fixed
Magic number 8: _FASTEMBED_MAX_THREADS in embedding_provider_factory.py ✅ Fixed
batch_span.set_attributes() called after span context exits ✅ Fixed — now indented inside with telemetry.started_span(...) as batch_span:
Duplicate early-exit delete blocks in _prepare_entity_vector_jobs_prefetched ✅ Fixed — consolidated into _delete_entity_chunks_and_finish() inner helper
Base class reaching into provider._effective_parallel() private method ✅ Fixed — runtime_log_attrs() added to EmbeddingProvider Protocol, FastEmbedEmbeddingProvider, and OpenAIEmbeddingProvider
Window-level exception collapse missing comment ✅ Fixed — Trigger/Why/Outcome comment added

Final Checklist

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Literate-style throughout, decision-point comments, section headers
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async ✅ Pass
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass +322 lines CLI, +334 lines semantic base, +158 lines SQLite vector
Integration tests ✅ Pass force_full path validated
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass # pragma: no cover used appropriately
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes; vector tables recreated at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass done
Windows path compatibility ✅ Pass
started_span timing ✅ Pass set_attributes now inside span context
Magic number 8 ✅ Pass _SQLITE_MAX_PREPARE_WINDOW and _FASTEMBED_MAX_THREADS
Duplicate delete blocks ✅ Pass _delete_entity_chunks_and_finish() helper
Provider abstraction boundary ✅ Pass runtime_log_attrs() on Protocol

Overall: Approve. The implementation is correct, well-tested, and follows project conventions throughout. No remaining issues.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 3m 11s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review
  • Review new changes against checklist
  • Post final review summary

What changed in 794dcc6 perf(core): lower vector sync batch defaults

Single file touched: src/basic_memory/config.py

Both semantic_embedding_batch_size and semantic_embedding_sync_batch_size dropped from 64 to 2:

# Before
semantic_embedding_batch_size: int = Field(default=64, ...)
semantic_embedding_sync_batch_size: int = Field(default=64, ...)

# After
semantic_embedding_batch_size: int = Field(default=2, ...)
semantic_embedding_sync_batch_size: int = Field(default=2, ...)

Analysis

Change is clean and minimal — config-only, no new logic.

Rationale is well-documented — the Trigger/Why/Outcome comment added in config.py:191-196 is clear: full rebuilds are dominated by queue wait (entities blocking on a full batch to form), not embedding compute. Shrinking the batch cuts wait far more than it increases per-batch overhead.

Benchmarks back it up — the PR description shows queue_wait_seconds_total=2717.249 in the final run, confirming queue saturation was the bottleneck being resolved.

Tests unaffected — all test references to 64 either use the provider's own constructor default (independent of config) or hardcode the value explicitly in test setup. The _ConcreteRepo test stub at test_semantic_search_base.py:59 hardcodes _semantic_embedding_sync_batch_size = 64 — that's correct test isolation, not a stale value.


One observation worth noting

With semantic_embedding_sync_batch_size = 2, _vector_prepare_window_size() now returns max(1, min(2, 8)) = 2 instead of the previous max(1, min(64, 8)) = 8. The window batching path introduced earlier in this PR (which shares one session across multiple entities' DB reads) now spans only 2 entities at a time instead of 8.

This is a trade-off that the benchmarks validate: the queue-wait savings outweigh the reduced batching efficiency in the read window. But it means these two configs are now more tightly coupled than they appear — increasing semantic_embedding_sync_batch_size in the future will simultaneously increase the window size (up to _SQLITE_MAX_PREPARE_WINDOW = 8). This coupling isn't obvious from the field descriptions alone. Not blocking — the comment on _vector_prepare_window_size() handles this — but worth keeping in mind for future tuning docs.
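The coupling can be seen by evaluating the window-size formula under both defaults (a sketch; the function and constant names mirror the review's description rather than the actual source):

```python
_SQLITE_MAX_PREPARE_WINDOW = 8  # hypothetical constant name

def vector_prepare_window_size(sync_batch_size: int) -> int:
    """Window size is derived from the sync batch size, capped at 8."""
    return max(1, min(sync_batch_size, _SQLITE_MAX_PREPARE_WINDOW))

assert vector_prepare_window_size(64) == 8  # old default: full window of 8
assert vector_prepare_window_size(2) == 2   # new default: window shrinks too
assert vector_prepare_window_size(0) == 1   # floor of 1
```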


Checklist

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Trigger/Why/Outcome comment present
Python 3.12+ type annotations ✅ Pass No new code
SQLAlchemy patterns n/a Config-only change
Unit tests ✅ Pass No test updates needed; existing tests use explicit values
Integration tests ✅ Pass
100% coverage ✅ Pass
No hardcoded secrets ✅ Pass
No unicode/emoji in output ✅ Pass Already fixed in prior commit
Performance ✅ Pass Benchmark-validated
All prior review items ✅ Pass All addressed in d59b974

Overall: Approve. This is a clean, minimal, benchmark-backed config tuning on top of an already-approved implementation. All previous review findings remain resolved.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 2m 17s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (0434d9c fix(sync): stabilize windows checksum writes)
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

What changed in 0434d9c fix(sync): stabilize windows checksum writes

Three files touched: file_utils.py, services/file_service.py, tests/services/test_file_service.py

This commit addresses a Windows-specific checksum divergence bug through two complementary fixes:

1. Force LF line endings at write time (file_utils.py)

# Before
async with aiofiles.open(temp_path, mode="w", encoding="utf-8") as f:

# After
async with aiofiles.open(temp_path, mode="w", encoding="utf-8", newline="\n") as f:

Applied to both write_file_atomic and format_markdown_builtin. Prevents Windows text mode from translating \n to \r\n, which would make stored bytes diverge from the in-memory strings used for indexing and hashing.
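A minimal standard-library demonstration of why forcing newline="\n" keeps disk bytes identical to the in-memory string (this is not the project's actual write path):

```python
import os
import tempfile

content = "line one\nline two\n"

# Passing newline="\n" disables universal-newline translation on write,
# so "\n" is never expanded to "\r\n" even in Windows text mode.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="\n", delete=False, suffix=".md"
) as f:
    f.write(content)
    path = f.name

try:
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.unlink(path)

# Disk bytes match the in-memory string, so checksums of either agree.
assert raw == content.encode("utf-8")
```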

2. Compute checksum from disk, not from the in-memory string (file_service.py)

# Before (write_file)
checksum = await file_utils.compute_checksum(final_content)

# After
checksum = await self.compute_checksum(full_path)

Same fix applied in update_frontmatter_with_result. The FileService.compute_checksum(path) method reads raw bytes in 64KB chunks — so the returned hash always matches what's actually stored, regardless of any byte-level transformation that happened during the write.
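A sketch of the chunked checksum-from-disk approach in the same spirit (sha256 and the exact helper shape are assumptions based on the review text; only the 64KB chunk size comes from it):

```python
import hashlib
import os
import tempfile

def compute_checksum(path: str, chunk_size: int = 64 * 1024) -> str:
    """Hash the bytes actually on disk in bounded 64KB chunks, so the
    result reflects any transformation applied during the write."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

content = b"# note\ncontent\n" * 10_000  # well over one 64KB chunk
with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(content)
    path = f.name
try:
    digest_hex = compute_checksum(path)
finally:
    os.unlink(path)

# The chunked read over the file equals a one-shot hash of the same bytes.
assert digest_hex == hashlib.sha256(content).hexdigest()
```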


Analysis

✅ Code Quality & Standards

  • Trigger/Why/Outcome comments on every changed branch — follows the project's literate programming conventions exactly
  • newline="\n" is the correct aiofiles.open() parameter; encoding and newline are independent parameters and both are needed here
  • 100-character line limit maintained
  • No commented-out code

✅ Testing & Documentation

  • test_update_frontmatter_checksum_matches_persisted_bytes monkeypatches write_file_atomic to simulate CRLF byte writes and asserts the returned checksum equals compute_checksum(path) (the disk read). This directly exercises the regression scenario.
  • The write_file path change is implicitly covered by existing test_write_* tests; the root cause (CRLF) is now prevented upstream in write_file_atomic.
  • # pragma: no cover on the formatter branch is correct — format_on_save is off in test config.
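The shape of that regression scenario can be sketched without the real FileService fixtures; crlf_write below is a hypothetical stand-in for a monkeypatched write_file_atomic that persists CRLF bytes:

```python
import hashlib
import pathlib
import tempfile

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def crlf_write(path: pathlib.Path, text: str) -> None:
    # Stand-in for Windows text mode: logical "\n" lands on disk as b"\r\n".
    path.write_bytes(text.replace("\n", "\r\n").encode("utf-8"))

content = "---\ntitle: note\n---\nbody\n"
with tempfile.TemporaryDirectory() as d:
    note = pathlib.Path(d) / "note.md"
    crlf_write(note, content)

    # The disk hash matches storage; the pre-write string hash does not.
    disk_checksum = sha256_hex(note.read_bytes())
    string_checksum = sha256_hex(content.encode("utf-8"))

assert disk_checksum != string_checksum
```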

✅ Basic Memory Architecture

  • Checksum-from-disk is the more robust design: it makes FileService the single authoritative source for checksums, regardless of upstream transformations (formatters, platform newline handling, BOM stripping)
  • No schema changes; no migration needed

✅ Security & Performance

  • No secrets
  • compute_checksum(full_path) uses chunked async reads (64KB) — memory usage is bounded and unchanged
  • The extra disk read per write is negligible; checksums are computed post-write already

✅ Compatibility

  • This fix is the Windows compatibility fix — forcing newline="\n" closes the CRLF divergence at the root. Combined with reading from disk, sync and move detection now produce deterministic checksums across platforms.

⚠️ Minor observation (non-blocking)

file_service.py:211-216:

if self.app_config:
    formatted_content = await file_utils.format_file(
        full_path, self.app_config, is_markdown=self.is_markdown(path)
    )
    if formatted_content is not None:
        pass  # pragma: no cover

format_file is still called for its side effect (writing reformatted content to disk), but the return value is now explicitly discarded with pass. The intent is correct — we'll read the checksum from disk anyway — but the empty if body is slightly unusual. A one-line comment like # formatter wrote to disk; checksum read below would make it self-documenting. Non-blocking.


Checklist

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Trigger/Why/Outcome on every changed branch
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async n/a
FastAPI/Typer conventions n/a
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass Regression test covers the CRLF divergence case
Integration tests ✅ Pass Existing write tests cover the write path
Edge case coverage ✅ Pass Monkeypatched CRLF write simulates the actual failure
100% test coverage ✅ Pass pragma: no cover used appropriately
DB migrations n/a No schema changes
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass Already fixed in prior commit
Windows path compatibility ✅ Pass This commit specifically resolves the Windows checksum bug
Empty if body in write path ⚠️ Minor pass is correct but could use an explanatory comment

Overall: Approve. This is a clean, targeted, correctly scoped fix. The dual-layer approach (prevent CRLF at the write source + always hash from disk) is defensive in depth and the right design. All previous review findings remain resolved.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 2m 38s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (b54c5f2 fix(sync): preserve platform newlines)
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

What changed in b54c5f2 fix(sync): preserve platform newlines

Single file touched: src/basic_memory/file_utils.py — 3 insertions, 7 deletions.

This commit reverses one half of the dual-layer Windows fix introduced in 0434d9c:

| Fix | 0434d9c | b54c5f2 |
| --- | --- | --- |
| Force LF via newline="\n" in write_file_atomic and format_markdown_builtin | ✅ Added | ❌ Reverted |
| Compute checksum from disk bytes in write_file and update_frontmatter_with_result | ✅ Added | ✅ Kept |

The newline="\n" override is gone from both write paths, allowing the platform to use its native line endings (CRLF on Windows). The disk-based checksum approach remains.


Analysis

✅ Functional correctness is maintained

I traced the checksum comparison paths to confirm there is no regression:

  • Write path: write_file → binary compute_checksum(path) → stored in DB
  • Sync comparison: file_service.compute_checksum(path) (binary rb mode) → consistent
  • Initial entity load: read_file_bytes + compute_checksum(bytes) → consistent
  • read_file: computes checksum from text-mode string, but callers at entity_service.py:627 and entity_service.py:1127 discard the checksum (_), so this is not in the sync comparison path

The disk-read checksum is the single authoritative source for sync comparisons. Removing newline="\n" allows CRLF on Windows, but since the checksum is always computed from the actual on-disk bytes, subsequent reads will always agree with what was stored.

✅ CRLF regression test still present

The regression test test_update_frontmatter_checksum_matches_persisted_bytes (tests/services/test_file_service.py:172) is unchanged and still validates that FileService hashes the stored file, not the pre-write string. This test covers the root scenario that motivated 0434d9c.


⚠️ Missing decision-point comment (Minor — CLAUDE.md violation)

file_utils.py:117-118:

# Use aiofiles for non-blocking write
async with aiofiles.open(temp_path, mode="w", encoding="utf-8") as f:

The original Trigger/Why/Outcome comment from 0434d9c was removed and replaced with a generic "what" comment that has nothing to do with the decision being made here. The comment describes aiofiles (obvious) but says nothing about why newline="\n" was removed. Per CLAUDE.md, decision-point comments must explain:

  • Trigger: when does this apply?
  • Why: the rationale
  • Outcome: what changes downstream

Something like:

# Trigger: caller passes normalized Python strings (LF-only); write_file / update_frontmatter
#          already compute the checksum from on-disk bytes, so platform newline translation
#          does not cause checksum drift.
# Why: preserving platform-native newlines (CRLF on Windows) keeps files compatible
#      with Obsidian and other local editors that expect the host OS line ending.
# Outcome: file bytes may differ across platforms; downstream consumers use
#          FileService.compute_checksum(path) (binary mode) to stay consistent.
async with aiofiles.open(temp_path, mode="w", encoding="utf-8") as f:


Checklist

Area Status Notes
Code conventions (CLAUDE.md) ⚠️ Minor Decision-point comment missing in write_file_atomic
Python 3.12+ type annotations ✅ Pass No new code
SQLAlchemy patterns n/a Utility change
Unit tests ✅ Pass CRLF regression test retained and still valid
Integration tests ✅ Pass
100% coverage ✅ Pass
No hardcoded secrets ✅ Pass
No unicode/emoji in output ✅ Pass Already fixed in prior commit
Windows path compatibility ✅ Pass Disk-based checksum keeps sync consistent across platforms
All prior review items ✅ Pass All addressed in d59b974

Overall: Approve with minor suggestion. The functional change is correct — the disk-based checksum approach is sufficient to maintain sync consistency without forcing LF, and the CRLF regression test is still in place. The only item worth fixing before merge is the missing Trigger/Why/Outcome comment in write_file_atomic, which doesn't explain the reasoning behind removing newline="\n".

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 1m 7s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (3f23b3e docs(core): explain windows newline checksum behavior)
  • Review code quality & standards
  • Review testing & documentation
  • Review Basic Memory architecture
  • Review security & performance
  • Review compatibility
  • Post final review summary

What changed in 3f23b3e docs(core): explain windows newline checksum behavior

Single file touched: src/basic_memory/file_utils.py — 14 insertions, 1 deletion.

This commit adds the missing Trigger/Why/Outcome decision-point comments that the previous review flagged in b54c5f2. Two locations updated:

write_file_atomic (line 117–123):

# Trigger: callers hand us normalized Python text, but the final bytes are allowed
#          to use the host platform's native newline convention during the write.
# Why: preserving CRLF on Windows keeps local files aligned with editors like
#      Obsidian, while FileService now hashes the persisted file bytes instead of
#      the pre-write string.
# Outcome: this async write stays editor-friendly across platforms without
#          reintroducing checksum drift in sync or move detection.
async with aiofiles.open(temp_path, mode="w", encoding="utf-8") as f:

format_markdown_builtin (line 177–183):

# Trigger: mdformat may rewrite markdown content, then the host platform
#          decides the newline bytes for the follow-up async text write.
# Why: we want formatter output to preserve native newlines instead of
#      forcing LF, and the authoritative checksum comes from rereading the
#      stored file bytes later in FileService.
# Outcome: formatting remains compatible with local editors on Windows while
#          checksum-based sync logic stays anchored to on-disk bytes.
async with aiofiles.open(path, mode="w", encoding="utf-8") as f:

Analysis

✅ Missing decision-point comment addressed

Both comments follow the Trigger/Why/Outcome format exactly as specified in CLAUDE.md. They correctly explain:

  • Trigger: what kind of input arrives (normalized Python text)
  • Why: editor compatibility (CRLF on Windows for Obsidian) + the disk-checksum approach eliminates drift risk
  • Outcome: editor-friendly cross-platform writes without sync regression

The second comment in format_markdown_builtin is a bonus — it wasn't flagged in the prior review but follows the same reasoning pattern, which makes the dual-write path self-documenting throughout.

✅ No regressions introduced

This is a docs-only commit — no logic changes, no new tests needed. All previously passing review items remain intact.


Final Checklist (all items from all review rounds)

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Literate-style throughout; all decision-point comments present
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async ✅ Pass
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass +322 lines CLI, +334 lines semantic base, +158 lines SQLite vector
Integration tests ✅ Pass force_full path validated
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass # pragma: no cover used appropriately
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes; vector tables recreated at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass Fixed in prior commit
Windows path compatibility ✅ Pass Disk-based checksum resolves cross-platform drift
started_span timing ✅ Pass set_attributes inside span context (fixed in d59b974)
Magic number 8 ✅ Pass _SQLITE_MAX_PREPARE_WINDOW and _FASTEMBED_MAX_THREADS
Duplicate delete blocks ✅ Pass _delete_entity_chunks_and_finish() helper
Provider abstraction boundary ✅ Pass runtime_log_attrs() on Protocol
Decision-point comment (write_file_atomic) ✅ Pass Trigger/Why/Outcome added in this commit
Decision-point comment (format_markdown_builtin) ✅ Pass Added proactively in this commit

Overall: Approve. Every item raised across all review rounds has been resolved. The PR delivers a real, benchmark-validated performance improvement with a clean implementation, comprehensive tests, and thorough documentation. Ready to merge.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 1m 0s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (5069b46 test(core): clarify windows checksum regression)
  • Review new changes against checklist
  • Post final review summary

What changed in 5069b46 test(core): clarify windows checksum regression

Single file touched: tests/services/test_file_service.py — 4 insertions, 3 deletions. Test-only, no logic changes.

Three clarifications made to test_update_frontmatter_checksum_matches_persisted_bytes:

| Change | Before | After |
| --- | --- | --- |
| Test name | test_update_frontmatter_checksum_matches_persisted_bytes | test_update_frontmatter_checksum_matches_windows_crlf_persisted_bytes |
| Docstring | "Frontmatter writes should hash the stored file, not the pre-write string." | "Windows-style CRLF writes should hash the stored file, not the pre-write string." |
| Trigger comment | "Trigger: simulate a writer that persists CRLF bytes like Windows text mode." | "Trigger: simulate Windows text-mode persistence, where logical LF strings land on disk as CRLF bytes." |

Analysis

✅ Naming and clarity improvements

All three changes make the test's intent more precise. The rename makes the platform-specific regression scenario explicit in the test name itself, so developers reading a failing test output or coverage report immediately know what scenario is covered. The expanded trigger comment follows CLAUDE.md's literate style more closely than the prior one-liner.

✅ No regressions introduced

Test-only commit with zero logic changes. All previously passing review items remain intact.

✅ Trigger/Why/Outcome format preserved

The expanded trigger comment still follows the Trigger/Why/Outcome structure exactly. The Why and Outcome lines are unchanged from 0434d9c; only the Trigger line was made more descriptive.


Final Checklist (all items across all review rounds)

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Literate-style throughout; all decision-point comments present
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async ✅ Pass
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass +322 lines CLI, +334 lines semantic base, +158 lines SQLite vector
Integration tests ✅ Pass force_full path validated
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass # pragma: no cover used appropriately
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes; vector tables recreated at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass Fixed in prior commit
Windows path compatibility ✅ Pass Disk-based checksum resolves cross-platform drift
started_span timing ✅ Pass set_attributes inside span context (fixed in d59b974)
Magic number 8 ✅ Pass _SQLITE_MAX_PREPARE_WINDOW and _FASTEMBED_MAX_THREADS
Duplicate delete blocks ✅ Pass _delete_entity_chunks_and_finish() helper
Provider abstraction boundary ✅ Pass runtime_log_attrs() on Protocol
Decision-point comment (write_file_atomic) ✅ Pass Trigger/Why/Outcome added in 3f23b3e
Decision-point comment (format_markdown_builtin) ✅ Pass Added in 3f23b3e
Windows CRLF regression test clarity ✅ Pass Test name and trigger comment sharpened in this commit

Overall: Approve. This is a pure test-clarity commit with no logic changes. The renamed test and expanded trigger comment make the Windows CRLF regression scenario unambiguous. Every item raised across all review rounds remains resolved. Ready to merge.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 2m 32s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (7421cfc fix(core): align checksums with persisted bytes)
  • Review new changes against checklist
  • Post final review summary

What changed in 7421cfc fix(core): align checksums with persisted bytes

Two files touched: src/basic_memory/services/file_service.py (+8/-1) and tests/indexing/test_batch_indexer.py (+16/-2).


file_service.py: read_file() checksum now reads from disk

# Before
checksum = await file_utils.compute_checksum(content)

# After
# Trigger: text-mode reads normalize line endings on Windows, so the
#          decoded string can differ from the bytes we just wrote.
# Why: write_file/update_frontmatter now return the checksum of the
#      persisted file, and read_file should report the same authority.
# Outcome: callers get human-readable content plus the checksum for the
#          exact bytes stored on disk.
checksum = await self.compute_checksum(full_path)

file_utils.compute_checksum(content) hashes in-memory text (LF-normalized by Python's text-mode decoder), while self.compute_checksum(full_path) reads raw bytes in binary mode. This makes read_file consistent with write_file and update_frontmatter_with_result, which were already fixed to use disk-based checksums in 0434d9c. The disk-based checksum approach is now the single, authoritative strategy across all FileService methods that return a checksum.


test_batch_indexer.py — two assertions made platform-neutral

# Before
assert result.indexed[0].markdown_content == original_content  # LF literal

# After
persisted_content = (project_config.home / path).read_text()  # reads from disk
assert result.indexed[0].markdown_content == persisted_content

Same change applied to test_batch_indexer_returns_original_markdown_content_when_no_frontmatter (line 272) and test_batch_indexer_uses_parsed_markdown_body_for_malformed_frontmatter (line 527). Both previously compared against string literals with LF line endings. On Windows, the file is persisted with CRLF (now intentionally, per the b54c5f2 decision), and markdown_content is produced from a text-mode file read — same path as read_text(). Both sides now go through the same text-mode normalization, making the assertion platform-safe.


Analysis

✅ Code Quality & Standards

  • Trigger/Why/Outcome comment follows CLAUDE.md's literate programming format exactly
  • Change is minimal — single line replaced in read_file, two test assertions updated
  • 100-character line limit maintained
  • No commented-out code

✅ Functional correctness

The only two callers of read_file in production code (entity_service.py:627 and entity_service.py:1127) both discard the returned checksum (current_content, _ = await self.file_service.read_file(...)). The behavior change has no effect on those code paths today, but it ensures any future caller gets the authoritative disk-based checksum rather than a LF-normalized hash that could diverge on Windows.

On Linux/macOS the two approaches are equivalent (no CRLF translation), so there is no regression on those platforms.

✅ Testing

  • Test changes are platform-idiomatic: read_text() normalizes CRLF → LF on all platforms, consistent with how markdown_content is produced
  • Trigger/Why/Outcome comments added to both test assertion blocks
  • No new test is needed for the read_file change itself — the write path already has test_update_frontmatter_checksum_matches_windows_crlf_persisted_bytes, and both callers discard the checksum

⚠️ Minor observation (non-blocking)

read_file now does two sequential async I/O passes per call: one text-mode read for content and one binary read for the checksum. The extra pass is negligible (files are typically small), and the semaphore in compute_checksum bounds concurrency correctly. Worth noting only because future profiling might flag the double read.


Final Checklist (all items across all review rounds)

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Trigger/Why/Outcome comment present and correct
Python 3.12+ type annotations ✅ Pass No new code
SQLAlchemy 2.0 async n/a
FastAPI/Typer conventions n/a
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass +322 lines CLI, +334 lines semantic base, +158 lines SQLite vector
Integration tests ✅ Pass force_full path validated
Batch indexer test platform safety ✅ Pass read_text() replaces LF literals in 2 assertions
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass # pragma: no cover used appropriately
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass Fixed in prior commit
Windows path compatibility ✅ Pass Disk-based checksum resolves cross-platform drift
read_file checksum consistency ✅ Pass Now uses self.compute_checksum(path) like all other FileService methods
started_span timing ✅ Pass Fixed in d59b974
Magic number 8 ✅ Pass _SQLITE_MAX_PREPARE_WINDOW and _FASTEMBED_MAX_THREADS
Duplicate delete blocks ✅ Pass _delete_entity_chunks_and_finish() helper
Provider abstraction boundary ✅ Pass runtime_log_attrs() on Protocol
Decision-point comments ✅ Pass All present and complete across all changed files

Overall: Approve. This commit completes the disk-based checksum strategy by closing the last remaining read_file gap. The change is minimal, correctly scoped, and consistent with the approach established across this PR. All items raised across all review rounds remain resolved. Ready to merge.

Signed-off-by: phernandez <paul@basicmachines.co>
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

Claude finished @phernandez's task in 4m 17s


PR Review: perf(core): speed up vector sync and tune fastembed defaults

Tasks:

  • Gather context and understand PR scope
  • Check what changed since last review (7c37e49 test(core): preserve persisted windows newlines)
  • Trace content pipeline to validate assertion correctness
  • Review against checklist
  • Post final review summary

What changed in 7c37e49 test(core): preserve persisted windows newlines

Single file touched: tests/indexing/test_batch_indexer.py — 2 insertions, 2 deletions. Test-only, no logic changes.

Both modified assertions change read_text() to read_bytes().decode("utf-8"):

# Before (7421cfc)
persisted_content = (project_config.home / path).read_text()

# After (7c37e49)
persisted_content = (project_config.home / path).read_bytes().decode("utf-8")

Applied to both test_batch_indexer_returns_original_markdown_content_when_no_frontmatter_rewrite and test_batch_indexer_uses_parsed_markdown_body_for_malformed_frontmatter_delimiters.


Analysis

✅ The change is correct — and fixes a subtle Windows bug in the prior commit

Tracing the content pipeline for result.indexed[0].markdown_content:

file_service.read_file_bytes() — opens "rb" (binary, no newline translation)
  ↓
IndexInputFile.content: bytes
  ↓
file.content.decode("utf-8") — no normalization, CRLF preserved on Windows
  ↓
_PreparedMarkdownFile.content: str  ← CRLF preserved on Windows
  ↓
markdown_content = prepared.content  ← CRLF preserved on Windows

prepared.content comes from a binary read + .decode("utf-8") — there is no text-mode normalization anywhere in this path. On Windows, a file written by write_file_atomic (text mode, no newline="\n") will contain CRLF bytes on disk. The binary decode produces a string with literal \r\n.

read_bytes().decode("utf-8") is the correct mirror of this: it reads raw bytes and decodes without normalization, so both sides of the assertion have CRLF on Windows and LF on Linux/macOS. The assertion is now truly platform-neutral.

The prior commit 7421cfc used read_text(), whose universal-newline mode normalizes CRLF to LF. On Windows, where the persisted file contains CRLF, that would have made persisted_content (LF) diverge from markdown_content (CRLF) — a silent test failure on Windows CI. 7c37e49 closes that gap.

On Linux (where CI runs), both approaches are equivalent: the file is persisted with LF, so universal-newline translation has nothing to change and the tests pass either way. The fix only matters on Windows.
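The divergence between the two read styles is visible on any platform by writing CRLF bytes explicitly:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as d:
    p = pathlib.Path(d) / "note.md"
    p.write_bytes(b"alpha\r\nbeta\r\n")  # what a Windows text-mode write persists

    normalized = p.read_text(encoding="utf-8")  # universal newlines: CRLF becomes LF
    raw = p.read_bytes().decode("utf-8")        # exact on-disk bytes, CRLF preserved

print(repr(normalized))  # 'alpha\nbeta\n'
print(repr(raw))         # 'alpha\r\nbeta\r\n'
```

An assertion built on read_text() compares LF-normalized text against a CRLF markdown_content on Windows; read_bytes().decode() follows the same binary-decode path as the indexer and stays in agreement.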

✅ Comments are still accurate

The Trigger/Why/Outcome comments from 7421cfc remain and are still correct:

  • "Outcome: compare against the exact markdown text stored on disk for this file." — accurate; read_bytes().decode("utf-8") reads exactly what's on disk.
  • "Why: Windows can still surface that unchanged file with CRLF line endings." — now enforced by the assertion approach itself.

No comment changes needed.

✅ No regressions

Test-only commit with zero logic changes. All previously passing review items remain intact.


Final Checklist (all items across all review rounds)

Area Status Notes
Code conventions (CLAUDE.md) ✅ Pass Literate-style throughout; Trigger/Why/Outcome on all decision branches
Python 3.12+ type annotations ✅ Pass
SQLAlchemy 2.0 async ✅ Pass
FastAPI/Typer conventions ✅ Pass
100-char line limit ✅ Pass
No commented-out code ✅ Pass
Unit tests ✅ Pass +322 lines CLI, +334 lines semantic base, +158 lines SQLite vector
Integration tests ✅ Pass force_full path validated
Edge case coverage ✅ Pass Window errors, oversized entities, orphan runtime
100% test coverage ✅ Pass # pragma: no cover used appropriately
No MCP tool changes ✅ Pass
DB migrations ✅ Pass No schema changes; vector tables recreated at runtime
Local-first architecture ✅ Pass
No hardcoded secrets ✅ Pass
Input validation ✅ Pass
No unicode/emoji in output ✅ Pass done (fixed in prior commit)
Windows path compatibility ✅ Pass Disk-based checksum resolves cross-platform drift
started_span timing ✅ Pass Fixed in d59b974
Magic number 8 ✅ Pass _SQLITE_MAX_PREPARE_WINDOW and _FASTEMBED_MAX_THREADS
Duplicate delete blocks ✅ Pass _delete_entity_chunks_and_finish() helper
Provider abstraction boundary ✅ Pass runtime_log_attrs() on Protocol
Decision-point comments ✅ Pass All present and complete across all changed files
Windows CRLF assertion accuracy ✅ Pass read_bytes().decode("utf-8") matches binary-decoded prepared.content exactly

Overall: Approve. This commit fixes a subtle Windows test divergence from 7421cfc: read_text() normalizes CRLF→LF, which would diverge from prepared.content (binary-decoded, CRLF preserved) on Windows. The switch to read_bytes().decode("utf-8") makes both sides of the assertion follow the same decoding path. On Linux the two are equivalent, so CI stays green. All items raised across all review rounds remain resolved. Ready to merge.

@phernandez phernandez merged commit 7945c1e into main Apr 9, 2026
41 checks passed
@phernandez phernandez deleted the wip/vector-sync-window-batching branch April 9, 2026 05:20