
perf(core): batch postgres vector sync preparation#723

Open
phernandez wants to merge 1 commit into main from perf/core-postgres-vector-sync-batching

Conversation

@phernandez
Member

Summary

  • prepare Postgres vector sync work in small concurrent windows to hide round-trip latency during cloud indexing
  • bulk upsert vector embeddings in a single statement per flush instead of one execute call per chunk
  • update repository unit tests for the new bulk-write path and concurrent prepare behavior
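The windowed-prepare idea described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `prepare_one` is a hypothetical stand-in for the real per-entity vector preparation, and the window size of 4 is an assumption.

```python
import asyncio

WINDOW_SIZE = 4  # entities prepared concurrently per window (assumed value)

async def prepare_one(entity_id: int) -> dict:
    # Stand-in for the real per-entity prepare (chunking, embedding lookup, ...).
    await asyncio.sleep(0)  # simulates one round trip to the database
    return {"entity_id": entity_id, "chunks": []}

async def prepare_in_windows(entity_ids: list[int]) -> list[dict]:
    # Prepare entities in small concurrent windows so round-trip latency
    # overlaps within a window instead of accumulating serially.
    jobs: list[dict] = []
    for start in range(0, len(entity_ids), WINDOW_SIZE):
        window = entity_ids[start : start + WINDOW_SIZE]
        jobs.extend(await asyncio.gather(*(prepare_one(eid) for eid in window)))
    return jobs

results = asyncio.run(prepare_in_windows([1, 2, 3, 4, 5, 6]))
```

Because `asyncio.gather` preserves argument order, the flattened job list keeps the original entity order even though each window runs concurrently.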

Testing

  • uv run pytest -q tests/repository/test_postgres_search_repository_unit.py tests/repository/test_semantic_search_base.py

Signed-off-by: phernandez <paul@basicmachines.co>
@phernandez
Member Author

Additional local verification note: just fast-check failed after the targeted repository tests passed.

Failure:

  • test-int/semantic/test_semantic_quality.py::test_semantic_quality[asyncio-postgres-openai]
  • assertion: recall@5 for postgres-openai/paraphrase/hybrid: 0.000 < 0.250

The targeted tests that passed before opening the PR were:

  • uv run pytest -q tests/repository/test_postgres_search_repository_unit.py tests/repository/test_semantic_search_base.py

@github-actions
Contributor

github-actions bot commented Apr 7, 2026

Claude finished @phernandez's task in 5m 32s


PR Review: perf(core): batch postgres vector sync preparation

Tasks

  • Gather context and read changed files
  • Analyze semantic quality regression
  • Review implementation correctness
  • Run team checklist
  • Post final review

Summary

The PR adds two improvements to Postgres vector sync:

  1. Concurrent entity preparation via asyncio.gather windows (hides round-trip latency on Neon/cloud)
  2. Bulk upsert of embeddings in a single SQL statement per flush (reduces execute calls)

The individual pieces look sound, but there's a known semantic quality regression (recall@5 = 0.0) that needs to be understood before merging.


Critical Issues

1. Semantic quality regression — recall@5 = 0.000

The PR author noted that just fast-check failed with:

recall@5 for postgres-openai/paraphrase/hybrid: 0.000 < 0.250

recall@5 = 0.0 (not merely below threshold — literally zero) is a strong signal that embeddings are not being stored or found at all, not just degraded. This is consistent with a silent bug in the new code path rather than a quality regression from the algorithm change.

The new sync_entity_vectors_batch override completely replaces the base class _sync_entity_vectors_internal flow. If anything goes wrong in the concurrent-prepare → flush pipeline, embeddings could silently fail to write while the method still returns a VectorSyncBatchResult with no errors (because exceptions are caught and logged as warnings).
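The silent-failure mode described here can be demonstrated with a small self-contained sketch (hypothetical names, not the PR's code): when per-entity exceptions in a gathered window are merely logged and dropped, the batch result reports no errors even though an entity was never written.

```python
import asyncio

async def prepare(entity_id: int) -> int:
    # Simulate a failure for one entity in the window.
    if entity_id == 3:
        raise RuntimeError("embedding fetch failed")
    return entity_id

async def sync_batch(entity_ids: list[int]) -> dict:
    results = await asyncio.gather(
        *(prepare(eid) for eid in entity_ids), return_exceptions=True
    )
    synced = [r for r in results if not isinstance(r, BaseException)]
    # Failure mode: exceptions are only "logged" (here: silently discarded),
    # so the caller sees a clean result even though entity 3 was never synced.
    return {"synced": synced, "errors": []}

out = asyncio.run(sync_batch([1, 2, 3, 4]))
# out["errors"] is empty despite entity 3 being lost
```

A recall@5 of exactly zero is what this pattern would produce if the failure hit every entity rather than one.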

Before merging, please confirm whether this test was already failing on main before these changes, or whether it is a regression introduced here. A quick git stash && just test-int-postgres -- -k test_semantic_quality on main would settle this.

2. Large code duplication — fragile maintenance surface

The new PostgresSearchRepository.sync_entity_vectors_batch (~100 lines) duplicates almost all orchestration logic from SearchRepositoryBase._sync_entity_vectors_internal. The only meaningful addition is the asyncio.gather window. If the base class sync logic ever changes (error handling, batch flushing, synced_entity_ids tracking), the Postgres override will silently diverge.

A less risky design would be to extract just the "prepare one window of N entities concurrently" concern into a _prepare_entity_vector_jobs_window helper and override only that in PostgresSearchRepository, still delegating the outer loop to the base class.
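One way the suggested refactor could look, with class and method bodies simplified to stubs (the real orchestration loop does far more: error handling, flushing, `synced_entity_ids` tracking):

```python
import asyncio

class SearchRepositoryBase:
    async def _prepare_one(self, entity_id: int) -> dict:
        # Stub for the real per-entity prepare.
        await asyncio.sleep(0)
        return {"entity_id": entity_id}

    async def _prepare_entity_vector_jobs_window(self, entity_ids: list[int]) -> list[dict]:
        # Default: prepare sequentially, one entity at a time.
        return [await self._prepare_one(eid) for eid in entity_ids]

    async def sync_entity_vectors_batch(self, entity_ids: list[int], window_size: int = 4) -> list[dict]:
        # The outer orchestration loop stays in the base class, so error
        # handling and flushing logic never diverge between backends.
        jobs: list[dict] = []
        for start in range(0, len(entity_ids), window_size):
            jobs.extend(
                await self._prepare_entity_vector_jobs_window(
                    entity_ids[start : start + window_size]
                )
            )
        return jobs

class PostgresSearchRepository(SearchRepositoryBase):
    async def _prepare_entity_vector_jobs_window(self, entity_ids: list[int]) -> list[dict]:
        # Postgres override: prepare the window concurrently to hide latency.
        return list(await asyncio.gather(*(self._prepare_one(e) for e in entity_ids)))

jobs = asyncio.run(PostgresSearchRepository().sync_entity_vectors_batch([1, 2, 3, 4, 5]))
```

Only the window hook differs between backends; everything else is inherited, which is the point of the suggestion.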



Moderate Issues

3. Missing type annotation on progress_callback

src/basic_memory/repository/postgres_search_repository.py — the Postgres override drops the type annotation:

```python
# base class (correct):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback: Optional[Callable[[int, int, int], Any]] = None,
) -> VectorSyncBatchResult:

# Postgres override (type annotation dropped):
async def sync_entity_vectors_batch(
    self,
    entity_ids: list[int],
    progress_callback=None,  # ← no annotation
) -> VectorSyncBatchResult:
```
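The fix is mechanical, and its presence can even be asserted with `inspect`. A sketch (with `VectorSyncBatchResult` stubbed out for illustration):

```python
import inspect
from typing import Any, Callable, Optional

class VectorSyncBatchResult:  # stub standing in for the real result type
    pass

class PostgresSearchRepository:
    async def sync_entity_vectors_batch(
        self,
        entity_ids: list[int],
        # Restored annotation, matching the base class signature:
        progress_callback: Optional[Callable[[int, int, int], Any]] = None,
    ) -> VectorSyncBatchResult:
        ...

sig = inspect.signature(PostgresSearchRepository.sync_entity_vectors_batch)
```

A unit test or a type checker run (`mypy`/`pyright`) would catch this class of drift automatically.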

4. Missing explicit session.commit() before early returns in _prepare_entity_vector_jobs

The base class explicitly calls await session.commit() before each early return inside the async with db.scoped_session(...) block (e.g. lines 865, 886, 995). The Postgres override relies on the context manager auto-commit. While functionally equivalent today (the auto-commit runs on __aexit__), the explicit commits make the intent clearer and are consistent with the rest of the codebase. Add explicit await session.commit() calls before the two early returns inside the async with block.

5. incoming_by_key is built but only partially used

```python
incoming_by_key = {record["chunk_key"]: record for record in chunk_records}
```

This is only used for stale detection (if chunk_key not in incoming_by_key). The main loop iterates over chunk_records and uses existing_by_key.get(record["chunk_key"]) — it doesn't go through incoming_by_key. The dict is harmless but adds confusion.


Minor Issues

6. F-string SQL template for bulk upsert

Both the chunk upsert and the embedding upsert in _write_embeddings build SQL via f-strings (", ".join(upsert_values)). The values are driven by enumerate() (safe), but this pattern deviates from the project's style of using static text() SQL. This is acceptable for a bulk-upsert path, but a comment noting that the f-string is safe (integer indices only, no user data) would help future reviewers.
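The safe variant of this pattern can be sketched in isolation (table and column names below are assumptions, not the PR's actual schema): only `enumerate()` indices are interpolated into the SQL text, while all values travel as bound parameters.

```python
# Build an index-driven bulk upsert. The f-string only ever receives integer
# indices from enumerate(), so no user data can reach the SQL text; values
# are passed separately as bound parameters.
def build_bulk_upsert(records: list[dict]) -> tuple[str, dict]:
    placeholders: list[str] = []
    params: dict = {}
    for i, record in enumerate(records):
        placeholders.append(f"(:chunk_key_{i}, :embedding_{i})")
        params[f"chunk_key_{i}"] = record["chunk_key"]
        params[f"embedding_{i}"] = record["embedding"]
    sql = (
        "INSERT INTO entity_vectors (chunk_key, embedding) "
        f"VALUES {', '.join(placeholders)} "
        "ON CONFLICT (chunk_key) DO UPDATE SET embedding = EXCLUDED.embedding"
    )
    return sql, params

sql, params = build_bulk_upsert(
    [{"chunk_key": "a", "embedding": [0.1]}, {"chunk_key": "b", "embedding": [0.2]}]
)
```

The resulting string would be handed to SQLAlchemy's `text()` with `params` as the bound-parameter dict; a one-line comment to that effect next to the f-string is all the review is asking for.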

7. Progress callback semantics changed

The base class calls progress_callback(entity_id, sequential_index, total_entities) per entity before its individual prepare. The Postgres override calls it per entity within the window before the concurrent gather fires, using window_start + offset_within_window. The semantics are slightly different (batch-of-4 progress vs per-entity), which could affect any UI progress bars that depend on this callback.


Team Checklist

| Item | Status | Notes |
| --- | --- | --- |
| Python 3.12+ types and async patterns | ⚠️ | progress_callback missing annotation on override |
| SQLAlchemy 2.0 best practices | | F-string SQL building is safe but should be commented |
| 100-character line length | | |
| No commented-out code | | |
| Unit tests for new code | | New TestBatchPrepareConcurrency and TestWriteEmbeddings |
| 100% test coverage | ⚠️ | _update_timestamp_sql has # pragma: no cover on line 892 — verify that's still needed |
| No hardcoded secrets | | |
| Error handling / logging | | Good structured logging with loguru |
| Performance | | Core intent: latency hiding via concurrent prepare |
| Code duplication | | ~100 lines duplicated from base class orchestration loop |
| Known test regression | | test_semantic_quality[asyncio-postgres-openai] recall@5 = 0.0 |
| File path / Windows compat | | No file path operations in changed code |
| No emoji/unicode in logs | | |
| Literate programming comments | | Good Trigger/Why/Outcome comments on the new method |
| Alembic migration | N/A | No schema changes |

Bottom Line

The bulk-embedding upsert and concurrent-prepare window are good ideas. The main concerns are:

  1. Unresolved test regression — please verify whether recall@5 = 0.0 is a pre-existing failure or introduced by this PR. If it's a regression, the concurrent asyncio.gather prepare path or the new bulk-upsert chunk path may have a bug that silently loses embedding writes.

  2. Code duplication — the nearly verbatim copy of the base class orchestration loop should be refactored before this lands.
