
perf(core): batch postgres vector sync preparation#723

Open
phernandez wants to merge 1 commit into main from perf/core-postgres-vector-sync-batching

Conversation

@phernandez
Member

Summary

  • prepare Postgres vector sync work in small concurrent windows to hide round-trip latency during cloud indexing
  • bulk upsert vector embeddings in a single statement per flush instead of one execute call per chunk
  • update repository unit tests for the new bulk-write path and concurrent prepare behavior
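The windowed-prepare idea described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `prepare_one` is a hypothetical stand-in for the real per-entity vector preparation, and the window size of 4 is an assumption.

```python
import asyncio

WINDOW_SIZE = 4  # entities prepared concurrently per window (assumed value)

async def prepare_one(entity_id: int) -> dict:
    # Stand-in for the real per-entity prepare (chunking, embedding lookup, ...).
    await asyncio.sleep(0)  # simulates one round trip to the database
    return {"entity_id": entity_id, "chunks": []}

async def prepare_in_windows(entity_ids: list[int]) -> list[dict]:
    # Prepare entities in small concurrent windows so round-trip latency
    # overlaps within a window instead of accumulating serially.
    jobs: list[dict] = []
    for start in range(0, len(entity_ids), WINDOW_SIZE):
        window = entity_ids[start : start + WINDOW_SIZE]
        jobs.extend(await asyncio.gather(*(prepare_one(eid) for eid in window)))
    return jobs

results = asyncio.run(prepare_in_windows([1, 2, 3, 4, 5, 6]))
```

Because `asyncio.gather` preserves argument order, the flattened job list keeps the original entity order even though each window runs concurrently.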

Testing

  • uv run pytest -q tests/repository/test_postgres_search_repository_unit.py tests/repository/test_semantic_search_base.py

Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez
Member Author

Additional local verification note: just fast-check failed after the targeted repository tests passed.

Failure:

  • test-int/semantic/test_semantic_quality.py::test_semantic_quality[asyncio-postgres-openai]
  • assertion: recall@5 for postgres-openai/paraphrase/hybrid: 0.000 < 0.250

The targeted tests that passed before opening the PR were:

  • uv run pytest -q tests/repository/test_postgres_search_repository_unit.py tests/repository/test_semantic_search_base.py

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 5m 32s


PR Review: perf(core): batch postgres vector sync preparation

Tasks

  • Gather context and read changed files
  • Analyze semantic quality regression
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

The PR adds two improvements to Postgres vector sync:

  1. Concurrent entity preparation via asyncio.gather windows (hides round-trip latency on Neon/cloud)
  2. Bulk upsert of embeddings in a single SQL statement per flush (reduces execute calls)

The individual pieces look sound, but there's a known semantic quality regression (recall@5 = 0.0) that needs to be understood before merging.


Critical Issues

1. Semantic quality regression — recall@5 = 0.000

The PR author noted that just fast-check failed with:

recall@5 for postgres-openai/paraphrase/hybrid: 0.000 < 0.250

recall@5 = 0.0 (not merely below threshold — literally zero) is a strong signal that embeddings are not being stored or found at all, not just degraded. This is consistent with a silent bug in the new code path rather than a quality regression from the algorithm change.

The new sync_entity_vectors_batch override completely replaces the base class _sync_entity_vectors_internal flow. If anything goes wrong in the concurrent-prepare → flush pipeline, embeddings could silently fail to write while the method still returns a VectorSyncBatchResult with no errors (because exceptions are caught and logged as warnings).
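The silent-failure mode described here can be demonstrated with a small self-contained sketch (hypothetical names, not the PR's code): when per-entity exceptions in a gathered window are merely logged and dropped, the batch result reports no errors even though an entity was never written.

```python
import asyncio

async def prepare(entity_id: int) -> int:
    # Simulate a failure for one entity in the window.
    if entity_id == 3:
        raise RuntimeError("embedding fetch failed")
    return entity_id

async def sync_batch(entity_ids: list[int]) -> dict:
    results = await asyncio.gather(
        *(prepare(eid) for eid in entity_ids), return_exceptions=True
    )
    synced = [r for r in results if not isinstance(r, BaseException)]
    # Failure mode: exceptions are only "logged" (here: silently discarded),
    # so the caller sees a clean result even though entity 3 was never synced.
    return {"synced": synced, "errors": []}

out = asyncio.run(sync_batch([1, 2, 3, 4]))
# out["errors"] is empty despite entity 3 being lost
```

A recall@5 of exactly zero is what this pattern would produce if the failure hit every entity rather than one.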

Before merging, please confirm whether this test was already failing on main before these changes, or whether it is a regression introduced here. A quick git stash && just test-int-postgres -- -k test_semantic_quality on main would settle this.

2. Large code duplication — fragile maintenance surface

The new PostgresSearchRepository.sync_entity_vectors_batch (~100 lines) duplicates almost all orchestration logic from SearchRepositoryBase._sync_entity_vectors_internal. The only meaningful addition is the asyncio.gather window. If the base class sync logic ever changes (error handling, batch flushing, synced_entity_ids tracking), the Postgres override will silently diverge.

A less risky design would be to extract just the "prepare one window of N entities concurrently" concern into a _prepare_entity_vector_jobs_window helper and override only that in PostgresSearchRepository, still delegating the outer loop to the base class.
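One way the suggested refactor could look, with class and method bodies simplified to stubs (the real orchestration loop does far more: error handling, flushing, `synced_entity_ids` tracking):

```python
import asyncio

class SearchRepositoryBase:
    async def _prepare_one(self, entity_id: int) -> dict:
        # Stub for the real per-entity prepare.
        await asyncio.sleep(0)
        return {"entity_id": entity_id}

    async def _prepare_entity_vector_jobs_window(self, entity_ids: list[int]) -> list[dict]:
        # Default: prepare sequentially, one entity at a time.
        return [await self._prepare_one(eid) for eid in entity_ids]

    async def sync_entity_vectors_batch(self, entity_ids: list[int], window_size: int = 4) -> list[dict]:
        # The outer orchestration loop stays in the base class, so error
        # handling and flushing logic never diverge between backends.
        jobs: list[dict] = []
        for start in range(0, len(entity_ids), window_size):
            jobs.extend(
                await self._prepare_entity_vector_jobs_window(
                    entity_ids[start : start + window_size]
                )
            )
        return jobs

class PostgresSearchRepository(SearchRepositoryBase):
    async def _prepare_entity_vector_jobs_window(self, entity_ids: list[int]) -> list[dict]:
        # Postgres override: prepare the window concurrently to hide latency.
        return list(await asyncio.gather(*(self._prepare_one(e) for e in entity_ids)))

jobs = asyncio.run(PostgresSearchRepository().sync_entity_vectors_batch([1, 2, 3, 4, 5]))
```

Only the window hook differs between backends; everything else is inherited, which is the point of the suggestion.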



Moderate Issues

3. Missing type annotation on progress_callback

src/basic_memory/repository/postgres_search_repository.py — the Postgres override drops the type annotation:

```python
# base class (correct):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback: Optional[Callable[[int, int, int], Any]] = None,
) -> VectorSyncBatchResult:

# Postgres override (type annotation dropped):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback=None,  # ← no annotation
) -> VectorSyncBatchResult:
```
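The fix is mechanical, and its presence can even be asserted with `inspect`. A sketch (with `VectorSyncBatchResult` stubbed out for illustration):

```python
import inspect
from typing import Any, Callable, Optional

class VectorSyncBatchResult:  # stub standing in for the real result type
    pass

class PostgresSearchRepository:
    async def sync_entity_vectors_batch(
        self,
        entity_ids: list[int],
        # Restored annotation, matching the base class signature:
        progress_callback: Optional[Callable[[int, int, int], Any]] = None,
    ) -> VectorSyncBatchResult:
        ...

sig = inspect.signature(PostgresSearchRepository.sync_entity_vectors_batch)
```

A unit test or a type checker run (`mypy`/`pyright`) would catch this class of drift automatically.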

4. Missing explicit session.commit() before early returns in _prepare_entity_vector_jobs

The base class explicitly calls await session.commit() before each early return inside the async with db.scoped_session(...) block (e.g. lines 865, 886, 995). The Postgres override relies on the context manager auto-commit. While functionally equivalent today (the auto-commit runs on __aexit__), the explicit commits make the intent clearer and are consistent with the rest of the codebase. Add explicit await session.commit() calls before the two early returns inside the async with block.

5. incoming_by_key is built but only partially used

```python
incoming_by_key = {record["chunk_key"]: record for record in chunk_records}
```

This is only used for stale detection (if chunk_key not in incoming_by_key). The main loop iterates over chunk_records and uses existing_by_key.get(record["chunk_key"]) — it doesn't go through incoming_by_key. The dict is harmless but adds confusion.


Minor Issues

6. F-string SQL template for bulk upsert

Both the chunk upsert and the embedding upsert in _write_embeddings build SQL via f-strings (", ".join(upsert_values)). The values are driven by enumerate() (safe), but this pattern deviates from the project's style of using static text() SQL. This is acceptable for a bulk-upsert path, but a comment noting that the f-string is safe (integer indices only, no user data) would help future reviewers.
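The safe variant of this pattern can be sketched in isolation (table and column names below are assumptions, not the PR's actual schema): only `enumerate()` indices are interpolated into the SQL text, while all values travel as bound parameters.

```python
# Build an index-driven bulk upsert. The f-string only ever receives integer
# indices from enumerate(), so no user data can reach the SQL text; values
# are passed separately as bound parameters.
def build_bulk_upsert(records: list[dict]) -> tuple[str, dict]:
    placeholders: list[str] = []
    params: dict = {}
    for i, record in enumerate(records):
        placeholders.append(f"(:chunk_key_{i}, :embedding_{i})")
        params[f"chunk_key_{i}"] = record["chunk_key"]
        params[f"embedding_{i}"] = record["embedding"]
    sql = (
        "INSERT INTO entity_vectors (chunk_key, embedding) "
        f"VALUES {', '.join(placeholders)} "
        "ON CONFLICT (chunk_key) DO UPDATE SET embedding = EXCLUDED.embedding"
    )
    return sql, params

sql, params = build_bulk_upsert(
    [{"chunk_key": "a", "embedding": [0.1]}, {"chunk_key": "b", "embedding": [0.2]}]
)
```

The resulting string would be handed to SQLAlchemy's `text()` with `params` as the bound-parameter dict; a one-line comment to that effect next to the f-string is all the review is asking for.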

7. Progress callback semantics changed

The base class calls progress_callback(entity_id, sequential_index, total_entities) per entity before its individual prepare. The Postgres override calls it per entity within the window before the concurrent gather fires, using window_start + offset_within_window. The semantics are slightly different (batch-of-4 progress vs per-entity), which could affect any UI progress bars that depend on this callback.


Team Checklist

| Item | Status | Notes |
| --- | --- | --- |
| Python 3.12+ types and async patterns | ⚠️ | progress_callback missing annotation on override |
| SQLAlchemy 2.0 best practices | | F-string SQL building is safe but should be commented |
| 100-character line length | | |
| No commented-out code | | |
| Unit tests for new code | | New TestBatchPrepareConcurrency and TestWriteEmbeddings |
| 100% test coverage | ⚠️ | _update_timestamp_sql has # pragma: no cover on line 892 — verify that's still needed |
| No hardcoded secrets | | |
| Error handling / logging | | Good structured logging with loguru |
| Performance | | Core intent: latency hiding via concurrent prepare |
| Code duplication | | ~100 lines duplicated from base class orchestration loop |
| Known test regression | | test_semantic_quality[asyncio-postgres-openai] recall@5 = 0.0 |
| File path / Windows compat | | No file path operations in changed code |
| No emoji/unicode in logs | | |
| Literate programming comments | | Good Trigger/Why/Outcome comments on the new method |
| Alembic migration | N/A | No schema changes |

Bottom Line

The bulk-embedding upsert and concurrent-prepare window are good ideas. The main concerns are:

  1. Unresolved test regression — please verify whether recall@5 = 0.0 is a pre-existing failure or introduced by this PR. If it's a regression, the concurrent asyncio.gather prepare path or the new bulk-upsert chunk path may have a bug that silently loses embedding writes.

  2. Code duplication — the nearly verbatim copy of the base class orchestration loop should be refactored before this lands.
