Commit a7f0466
feat(graph): bulk_upsert_entity_with_lineage_parts Protocol + 4 backends + curation cutover (W8-2) (#1771)
Wave 8 task #13 (W8-2): adds a bulk-upsert primitive on the
``LineageGraphStore`` Protocol so callers consolidating N×M
``(description_part, lineage_member)`` tuples to the same target
entity get one transaction / one round-trip instead of N×M
sequential single upserts. Cuts ``LineageEntityMerger.merge_entities``
step 6a over to the bulk path.
## Why
``GraphCurationService.merge_entities`` (the N-source-into-1-target
user merge flow) re-anchors every source's description parts under
the target name. With N source entities each carrying M description
parts, that loop emitted N×M sequential ``upsert_entity_with_lineage``
round-trips — one full transaction per part. For typical curation
runs (3-10 sources × 5-20 parts each), that's 15-200 sequential SQL
hits where one bulk write would do.
Following the architect spec (Wave 8 candidate W8-2 sediment) and
huangheng's perf observation in task #6 (PR #1758), this PR folds
those writes into a single bulk write per backend.
## Scope (5 numbered items)
1. **Protocol method** — ``bulk_upsert_entity_with_lineage_parts(*,
parts: Sequence[tuple[EntityRecord, LineageMember]])`` on
``aperag/indexing/graph.py:LineageGraphStore``. All ``record.name``
values MUST be identical; a mismatch raises ``ValueError``.
Empty ``parts`` is a no-op. The per-part dedup key is
``(document_id, parse_version)``, last-wins. The bulk path NEVER
touches ``compacted_description`` (the existing value is preserved,
the same ``COALESCE``-style semantic as single upsert with ``None``).
2. **InMemory ref impl** — single ``asyncio.Lock``-guarded loop in
``InMemoryLineageGraphStore``.
3. **Postgres impl** — single ``INSERT … ON CONFLICT (collection_id,
name) DO UPDATE`` that strips matching keys via
``jsonb_array_elements`` ``NOT EXISTS`` against an incoming
``strip_keys_json`` array, then appends the whole new
``new_members_json`` / ``new_parts_json`` arrays. One statement,
atomic.
4. **Neo4j impl** — single Cypher MERGE + parallel-list strip-then-
append, with the strip predicate matching against the **set** of
incoming keys (``IN $strip_keys`` on the ``"<doc_id>|<pv>"`` key
string). Row-lock on MERGE serialises concurrent bulk ops on the
same entity.
5. **Nebula impl** — single ``EntityLock(target_name)`` acquire +
single read / Python merge / write. Mirrors the existing
read-modify-write pattern of single upsert but folds the strip-
then-append over the **set** of incoming keys.
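The contract in item 1 can be sketched as a small in-memory store. This is an illustrative approximation, not the shipped ``InMemoryLineageGraphStore``: the field layouts of ``EntityRecord`` and ``LineageMember``, the ``StoredEntity`` holder, and the ``SketchLineageStore`` class name are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional, Sequence


@dataclass(frozen=True)
class EntityRecord:
    # Assumed minimal shape; the real record likely carries more fields.
    name: str
    entity_type: str
    description: str


@dataclass(frozen=True)
class LineageMember:
    # Assumed shape: (document_id, parse_version) is the dedup key.
    document_id: str
    parse_version: int


@dataclass
class StoredEntity:
    # Hypothetical holder for one entity row.
    entity_type: str
    parts: dict = field(default_factory=dict)  # key -> description text
    compacted_description: Optional[str] = None


class SketchLineageStore:
    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self.entities: dict = {}

    async def bulk_upsert_entity_with_lineage_parts(
        self, *, parts: Sequence[tuple[EntityRecord, LineageMember]]
    ) -> None:
        if not parts:
            return  # empty input is a no-op
        names = {record.name for record, _ in parts}
        if len(names) != 1:
            raise ValueError(f"all parts must share one name, got {sorted(names)}")
        (name,) = names

        # Dedup within the input: the last occurrence of a key wins.
        incoming: dict = {}
        for record, member in parts:
            key = (member.document_id, member.parse_version)
            incoming.pop(key, None)
            incoming[key] = record.description

        last_type = parts[-1][0].entity_type
        async with self._lock:
            entity = self.entities.setdefault(name, StoredEntity(last_type))
            for key in incoming:  # strip existing entries with matching keys
                entity.parts.pop(key, None)
            entity.parts.update(incoming)  # then append the new ones
            entity.entity_type = last_type
            # compacted_description is deliberately left untouched.
```

The strip and append steps both run under one lock acquisition, which is what gives the whole batch a single-transaction feel in this sketch.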
## Caller cutover
* ``LineageEntityMerger.merge_entities`` (step 6a) — replaces the
N×M ``for src in source_entities: for part in
src.description_parts: await self._store.upsert_entity_with_lineage(...)``
with a single ``bulk_upsert_entity_with_lineage_parts`` call.
Step 6b's sentinel write (``__curation_merge__`` final write with
unified+compacted text) still goes through the single-upsert path
because it needs the ``compacted_description`` column write that
the bulk path intentionally doesn't carry.
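To make the round-trip saving concrete, here is a rough sketch of the step 6a loop before and after the cutover. ``CountingStore``, ``SourceEntity``, and the simplified keyword arguments are hypothetical stand-ins; the real ``upsert_entity_with_lineage`` signature is not shown in this description, so only the call counts are meaningful.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SourceEntity:
    name: str
    # Simplified stand-in: each part is (document_id, parse_version, text).
    description_parts: list


class CountingStore:
    """Hypothetical store that only counts round-trips."""

    def __init__(self) -> None:
        self.round_trips = 0

    async def upsert_entity_with_lineage(self, *, name, part) -> None:
        self.round_trips += 1  # one transaction per part

    async def bulk_upsert_entity_with_lineage_parts(self, *, parts) -> None:
        if parts:  # empty input is a no-op
            self.round_trips += 1  # one transaction for the whole batch


async def reanchor_sequential(store, final_target, source_entities):
    # Old shape: N sources x M parts sequential single upserts.
    for src in source_entities:
        for part in src.description_parts:
            await store.upsert_entity_with_lineage(name=final_target, part=part)


async def reanchor_bulk(store, final_target, source_entities):
    # New shape: flatten every part, then issue one bulk call.
    parts = [
        (final_target, part)
        for src in source_entities
        for part in src.description_parts
    ]
    await store.bulk_upsert_entity_with_lineage_parts(parts=parts)


def count_round_trips(n_sources: int, m_parts: int) -> tuple:
    sources = [
        SourceEntity(f"src{i}", [(f"doc{i}", j, "text") for j in range(m_parts)])
        for i in range(n_sources)
    ]

    async def _run():
        old, new = CountingStore(), CountingStore()
        await reanchor_sequential(old, "target", sources)
        await reanchor_bulk(new, "target", sources)
        return old.round_trips, new.round_trips

    return asyncio.run(_run())
```

With 3 sources of 5 parts each, ``count_round_trips(3, 5)`` returns ``(15, 1)``; the upper end of the typical range, ``count_round_trips(10, 20)``, returns ``(200, 1)``.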
## Alias-redirect decorator
* ``LineageGraphStoreWithAliasRedirect`` — bulk path mirrors single
upsert: each part's ``record.name`` resolves through the alias
map before forwarding to the inner store. The merger always
passes records pinned to the canonical ``final_target`` so the
redirect is a no-op in that flow, but symmetry with the single
upsert contract means a future caller writing to an aliased name
still gets correct behaviour.
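The decorator behaviour described above might look roughly like the following. ``AliasRedirectSketch``, ``RecordingStore``, the plain-dict alias map, and the two-field ``EntityRecord`` are assumptions for illustration; the real ``LineageGraphStoreWithAliasRedirect`` may resolve aliases differently.

```python
import asyncio
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EntityRecord:
    # Assumed minimal shape for the example.
    name: str
    description: str


class RecordingStore:
    """Hypothetical inner store that records each bulk call it receives."""

    def __init__(self) -> None:
        self.calls = []

    async def bulk_upsert_entity_with_lineage_parts(self, *, parts) -> None:
        self.calls.append(list(parts))


class AliasRedirectSketch:
    def __init__(self, inner, alias_map: dict) -> None:
        self._inner = inner
        self._alias_map = alias_map  # alias name -> canonical name

    async def bulk_upsert_entity_with_lineage_parts(self, *, parts) -> None:
        if not parts:
            return  # empty input short-circuits before reaching the inner store
        redirected = [
            # Names without an alias entry pass through unchanged.
            (replace(record, name=self._alias_map.get(record.name, record.name)), member)
            for record, member in parts
        ]
        await self._inner.bulk_upsert_entity_with_lineage_parts(parts=redirected)


def run_redirect(alias_map: dict, parts: list) -> list:
    inner = RecordingStore()
    store = AliasRedirectSketch(inner, alias_map)
    asyncio.run(store.bulk_upsert_entity_with_lineage_parts(parts=parts))
    return inner.calls
```

Because the name check for "all parts share one target" lives in the inner store, resolving aliases first means two parts written under different aliases of the same entity still land on one canonical name.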
## 12-invariant cross-check (Wave 7 §K.12)
* **#1 L1 not polluted**: bulk path operates on lineage SETs only; the
kg.jsonl raw extract layer is untouched. ✅
* **#3 indexer write redirect through alias map**: bulk path goes
through the same decorator alias resolution as single upsert. ✅
* **#9 alias redirect transparent**: decorator forwards bulk to
inner with redirected names; merger callers see no behaviour
change. ✅
* **#10 DB column cap**: Postgres ``compacted_description`` Text
column unchanged. The bulk path passes ``None`` end-to-end through
the Postgres SQL, so the INSERT branch sets NULL and the ON CONFLICT
branch preserves the existing value (COALESCE-equivalent). ✅
* All other invariants: unaffected.
## 4-pattern pre-check matrix
* Pattern A (kg.jsonl shape): unchanged
* Pattern B (Lineage SET semantics): bulk preserves dedup-by-
``(document_id, parse_version)`` exactly — same key as single
upsert. ✅
* Pattern C (Cypher LIST<MAP>): bulk reuses parallel-list encoding
for the strip-then-append. ✅
* Pattern D (vector 3-field payload): unchanged
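Patterns B and C together amount to a strip-then-append over parallel lists keyed by a ``"<doc_id>|<pv>"`` string. The pure-Python sketch below mimics that logic outside any database; the function name ``strip_then_append`` and the two-list layout (keys plus descriptions) are assumptions, and the actual Cypher and SQL statements are not reproduced here.

```python
def strip_then_append(keys: list, descriptions: list, incoming: list) -> tuple:
    """Return new parallel lists after applying a bulk batch.

    keys / descriptions: existing parallel lists, same length.
    incoming: list of (document_id, parse_version, description) tuples.
    """
    # Dedup within the batch: the last occurrence of a key wins.
    latest: dict = {}
    for document_id, parse_version, description in incoming:
        key = f"{document_id}|{parse_version}"
        latest.pop(key, None)
        latest[key] = description

    # Strip every existing entry whose key is in the incoming set.
    strip_keys = set(latest)
    kept = [(k, d) for k, d in zip(keys, descriptions) if k not in strip_keys]

    # Append the whole new batch after the surviving entries.
    new_keys = [k for k, _ in kept] + list(latest)
    new_descriptions = [d for _, d in kept] + list(latest.values())
    return new_keys, new_descriptions
```

For example, applying ``[("d2", 1, "B"), ("d3", 2, "c")]`` to keys ``["d1|1", "d2|1"]`` with descriptions ``["a", "b"]`` yields ``(["d1|1", "d2|1", "d3|2"], ["a", "B", "c"])``: the old ``d2|1`` entry is stripped and its replacement is appended.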
## Simple-stable 4-guardrail
* **#1 No unbounded scope growth**: 1 new Protocol method, no new public REST
surface, no new schema column, no new alembic migration.
* **#2 Ship quickly**: single PR, no spec amendment needed (architect
W8-2 candidate sediment + reuse of the Wave 7 task #1 3-backend
Protocol pattern).
* **#3 Simple and stable**: bulk semantic mirrors single upsert exactly;
caller cutover is a 1-call replacement of the N×M loop.
* **#4 Maintenance-free private deployment**: backend-portable (all 4
backends ship in this PR); no operator-facing config knob.
## Test plan
- [x] InMemory contract — 7 new tests pin empty / create / replace /
mismatched-names / dedup-within-input / preserves-compacted /
last-entity-type-wins.
- [x] Alias-redirect decorator — 3 new tests pin redirect-each-part /
no-alias-passthrough / empty-short-circuit.
- [x] LineageEntityMerger step 6a cutover — existing
``test_source_parts_reanchored_preserving_doc_lineage``
rewritten to assert the new single bulk call (was: 3
sequential single upserts) + single sentinel final write.
- [x] All 1166 unit tests pass (up from 1141 baseline; 25 new tests).
- [x] Wave 7 grep-zero contract — 10/10 still pass (intent-driven
gate unaffected).
- [x] ruff format / check clean on touched files.
- [ ] Cross-backend integration test: deferred to a Wave 8 follow-up
(out of scope for this PR; the 4-backend Protocol contract is
structurally identical to single upsert, which already has
cross-backend integration coverage).
- [ ] CR by @huangheng (focus: invariant #1 L1, #9 alias redirect
transparent, 4-backend cross-roundtrip on contract level).

1 parent 66c6861 commit a7f0466
9 files changed
Lines changed: 592 additions & 35 deletions
File tree
- aperag
- graph_curation
- indexing
- graph_storage
- tests/unit_test
- graph_curation
- indexing