Skip to content

Commit 20b9071

Browse files
earayuclaude
andauthored
feat(celery Wave 6 #33 chunk 1): LightRAG-style query layer Protocol stubs on LineageGraphStore (#1737)
Per §K.11.11 sub-spec — first chunk of Wave 6 PR-A "graphindex elimination" cross-cutting refactor. Adds the read API the retrieval pipeline + graph-curation flows will consume in chunk 3 (caller migration), so chunk 2 (per-backend implementations) can be reviewed against a stable Protocol surface without spec drift. `aperag/indexing/graph.py`: * New `LineageGraphStore` Protocol methods (declarations + docstrings only, ~85 LOC): - `query_entities_by_keyword(query, top_k)` — lexical recall - `query_entities_by_vector(embedding, top_k)` — vector recall (caller pre-embeds; backend stays embedder-agnostic per simple-stable-directive separation of concerns) - `expand_neighbors_n_hops(entity_names, hops=1)` — graph traversal bounded by `hops` (no per-collection traversal-budget knob; operator/dev pick `hops=1` for plain 1-hop or pass higher explicitly). * Returns are the canonical `EntityWithLineage` / `RelationWithLineage` shapes already used by `get_*` — uniform read surface. Callers compose context text themselves rather than receiving a backend-formatted string (cleaner than legacy `query_context() -> ctx.text`). * `InMemoryLineageGraphStore` ships `NotImplementedError("Wave 6 #33 chunk 2 — pending")` stubs so chunk 2 can switch to real behaviour without rewriting test scaffolding, and so any caller that accidentally invokes the API before chunk 2 lands gets a clear diagnostic. `tests/unit_test/indexing/test_lineage_query_protocol.py`: 5 new unit tests pinning the Protocol surface (3 method names + signature shape via `inspect.signature` + the `NotImplementedError` chunk-1 contract on the in-memory store). Out of scope (deferred to chunk 2 / chunk 3 per §K.11.11): * Postgres / Neo4j / Nebula backend implementations (chunk 2). * Caller migration: retrieval/pipeline.py + knowledge_graph/service.py + graph_curation/* (chunk 3). * Legacy `aperag/domains/knowledge_graph/graphindex/` package delete + legacy tests delete + grep-zero verify (chunk 3). Production-readiness 三类 (chunk 1 specific): - must-be-real: Protocol surface declared with full docstrings; tests pin signatures and chunk-1 stub contract. - may-be-gated: in-memory + production backends raise `NotImplementedError` until chunk 2 — gate fires loudly so a caller cannot silently fall through to wrong behaviour. - fully-resolves: §K.11.11 chunk 1 acceptance scope (Protocol stubs ~200 LOC). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 74327c0 commit 20b9071

2 files changed

Lines changed: 222 additions & 0 deletions

File tree

aperag/indexing/graph.py

Lines changed: 117 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -507,6 +507,89 @@ async def get_entity(self, entity_name: str) -> EntityWithLineage | None:
507507
async def get_relation(self, source: str, target: str, type: str) -> RelationWithLineage | None:
508508
"""Read-path helper for relations."""
509509

510+
# ------------------------------------------------------------------
511+
# Wave 6 #33 chunk 1 — LightRAG-style retrieval query layer (Protocol
512+
# stubs; cross-backend implementations land in chunk 2 per
513+
# §K.11.11). The retrieval pipeline (§G.5) and graph-curation flows
514+
# consume these to compose a graph-recall context for a user query
515+
# without going through the legacy ``GraphIndexService.query_context``.
516+
#
517+
# Design notes:
518+
# * Returns are the canonical :class:`EntityWithLineage` /
519+
# :class:`RelationWithLineage` shapes already used by ``get_*`` —
520+
# keeps the read surface uniform; callers compose context text
521+
# themselves rather than relying on a backend-formatted string.
522+
# * ``query_by_vector`` takes a pre-computed embedding so backends
523+
# stay agnostic of the embedding model; the retrieval pipeline
524+
# already owns the embedder.
525+
# * ``expand_neighbors_n_hops`` returns BOTH neighbour entities and
526+
# the relation edges that connect them (1-hop or multi-hop). Per
527+
# simple-stable directive the multi-hop frontier is bounded by
528+
# ``hops`` (caller picks; default 1) so backends don't need a
529+
# query planner.
530+
# ------------------------------------------------------------------
531+
532+
async def query_entities_by_keyword(
533+
self,
534+
*,
535+
query: str,
536+
top_k: int,
537+
) -> list[EntityWithLineage]:
538+
"""Return up to ``top_k`` entities whose ``name`` (or backend
539+
equivalent text-searchable field) matches ``query``.
540+
541+
Match semantics are backend-defined (case-insensitive substring
542+
on Postgres / Cypher CONTAINS on Neo4j / etc.); the contract
543+
is "best-effort lexical recall, not strict equality". An empty
544+
``query`` returns ``[]`` (no spurious recall).
545+
546+
Wave 6 #33 chunk 1 — Protocol stub; backend implementations
547+
land chunk 2.
548+
"""
549+
550+
async def query_entities_by_vector(
551+
self,
552+
*,
553+
embedding: list[float],
554+
top_k: int,
555+
) -> list[EntityWithLineage]:
556+
"""Return up to ``top_k`` entities ranked by similarity of
557+
``embedding`` against an entity-vector index the backend
558+
maintains. Backends without a native vector index may compose
559+
this against the collection's existing Qdrant collection
560+
(entity-name-keyed points) — chunk 2 picks the per-backend
561+
approach.
562+
563+
Empty ``embedding`` is invalid and raises ``ValueError`` (the
564+
retrieval pipeline always passes a real vector — the embedder
565+
has already validated the query upstream).
566+
567+
Wave 6 #33 chunk 1 — Protocol stub.
568+
"""
569+
570+
async def expand_neighbors_n_hops(
571+
self,
572+
*,
573+
entity_names: list[str],
574+
hops: int = 1,
575+
) -> tuple[list[EntityWithLineage], list[RelationWithLineage]]:
576+
"""Return entities + relations reachable from any of
577+
``entity_names`` within ``hops`` graph traversal steps.
578+
579+
``hops=1`` returns immediate neighbours plus the connecting
580+
relations. ``hops=N`` returns the N-hop frontier; the result
581+
UNION includes the seed entities themselves so callers can
582+
feed it directly to a context formatter without a separate
583+
join step.
584+
585+
Implementations MUST bound the result by ``hops`` to keep the
586+
traversal cost predictable; per simple-stable directive
587+
operators don't manage tunable per-collection traversal
588+
budgets — this is the only knob.
589+
590+
Wave 6 #33 chunk 1 — Protocol stub.
591+
"""
592+
510593

511594
# ---------------------------------------------------------------------
512595
# In-memory reference implementation — usable by tests and as the
@@ -705,6 +788,40 @@ async def get_relation(self, source: str, target: str, type: str) -> RelationWit
705788
description_parts=tuple(_sorted_description_parts(row.description_parts.values())),
706789
)
707790

791+
# ------------------------------------------------------------------
792+
# Wave 6 #33 chunk 1 — LightRAG-style query layer stubs.
793+
# Real implementations land chunk 2 (per §K.11.11 3-chunk
794+
# decomposition). The in-memory store also defers to chunk 2 so
795+
# tests that assert the Protocol surface can pin the
796+
# NotImplementedError contract today and switch to the real
797+
# behaviour when chunk 2 ships — without rewriting the test
798+
# scaffolding.
799+
# ------------------------------------------------------------------
800+
801+
async def query_entities_by_keyword(
802+
self,
803+
*,
804+
query: str,
805+
top_k: int,
806+
) -> list[EntityWithLineage]:
807+
raise NotImplementedError("Wave 6 #33 chunk 2 — query_entities_by_keyword pending implementation")
808+
809+
async def query_entities_by_vector(
810+
self,
811+
*,
812+
embedding: list[float],
813+
top_k: int,
814+
) -> list[EntityWithLineage]:
815+
raise NotImplementedError("Wave 6 #33 chunk 2 — query_entities_by_vector pending implementation")
816+
817+
async def expand_neighbors_n_hops(
818+
self,
819+
*,
820+
entity_names: list[str],
821+
hops: int = 1,
822+
) -> tuple[list[EntityWithLineage], list[RelationWithLineage]]:
823+
raise NotImplementedError("Wave 6 #33 chunk 2 — expand_neighbors_n_hops pending implementation")
824+
708825

709826
def _sorted_lineage(members: Iterable[LineageMember]) -> list[LineageMember]:
710827
return sorted(members, key=lambda m: (m.document_id, m.parse_version))
Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
1+
# Copyright 2025 ApeCloud, Inc.
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
"""Wave 6 #33 chunk 1 — Protocol surface for the LightRAG-style query
16+
layer on :class:`aperag.indexing.graph.LineageGraphStore`.
17+
18+
Pin the Protocol method signatures so chunk 2 (cross-backend impl) can
19+
be reviewed against a stable surface, and so callers (chunk 3 caller
20+
migration) can rely on the methods existing even before the per-backend
21+
implementations land.
22+
23+
Per §K.11.11 sub-spec the read API is:
24+
* ``query_entities_by_keyword(query, top_k)`` — lexical recall
25+
* ``query_entities_by_vector(embedding, top_k)`` — vector recall
26+
* ``expand_neighbors_n_hops(entity_names, hops)`` — graph traversal
27+
"""
28+
29+
from __future__ import annotations
30+
31+
import asyncio
32+
import inspect
33+
34+
import pytest
35+
36+
from aperag.indexing.graph import (
37+
InMemoryLineageGraphStore,
38+
LineageGraphStore,
39+
)
40+
41+
42+
def test_lineage_query_protocol_exposes_three_read_methods():
43+
"""The new query layer surface must declare exactly the three
44+
LightRAG-style read methods agreed in §K.11.11. Adding a method
45+
here implies a chunk 2 backend implementation contract."""
46+
47+
members = set(LineageGraphStore.__dict__)
48+
assert "query_entities_by_keyword" in members
49+
assert "query_entities_by_vector" in members
50+
assert "expand_neighbors_n_hops" in members
51+
52+
53+
def test_query_by_keyword_signature_takes_query_and_top_k():
54+
sig = inspect.signature(LineageGraphStore.query_entities_by_keyword)
55+
params = sig.parameters
56+
assert "query" in params
57+
assert "top_k" in params
58+
# kw-only by spec — the retrieval pipeline always names the args.
59+
assert params["query"].kind is inspect.Parameter.KEYWORD_ONLY
60+
assert params["top_k"].kind is inspect.Parameter.KEYWORD_ONLY
61+
62+
63+
def test_query_by_vector_signature_takes_embedding_and_top_k():
64+
sig = inspect.signature(LineageGraphStore.query_entities_by_vector)
65+
params = sig.parameters
66+
assert "embedding" in params
67+
assert "top_k" in params
68+
assert params["embedding"].kind is inspect.Parameter.KEYWORD_ONLY
69+
assert params["top_k"].kind is inspect.Parameter.KEYWORD_ONLY
70+
71+
72+
def test_expand_neighbors_signature_takes_entity_names_and_hops():
73+
sig = inspect.signature(LineageGraphStore.expand_neighbors_n_hops)
74+
params = sig.parameters
75+
assert "entity_names" in params
76+
assert "hops" in params
77+
assert params["entity_names"].kind is inspect.Parameter.KEYWORD_ONLY
78+
assert params["hops"].kind is inspect.Parameter.KEYWORD_ONLY
79+
# Default hops=1 is the simple-stable-directive choice — caller
80+
# must opt into multi-hop explicitly.
81+
assert params["hops"].default == 1
82+
83+
84+
def test_in_memory_store_query_methods_raise_not_implemented_in_chunk_1():
85+
"""Chunk 1 ships Protocol stubs only — the in-memory test store
86+
must surface a clear NotImplementedError so callers (chunk 3
87+
migration) can rely on the gate firing if they accidentally
88+
invoke the API before chunk 2 lands."""
89+
90+
store = InMemoryLineageGraphStore()
91+
92+
async def _run() -> None:
93+
with pytest.raises(NotImplementedError) as exc:
94+
await store.query_entities_by_keyword(query="anything", top_k=5)
95+
assert "chunk 2" in str(exc.value)
96+
97+
with pytest.raises(NotImplementedError) as exc:
98+
await store.query_entities_by_vector(embedding=[0.1, 0.2, 0.3], top_k=5)
99+
assert "chunk 2" in str(exc.value)
100+
101+
with pytest.raises(NotImplementedError) as exc:
102+
await store.expand_neighbors_n_hops(entity_names=["X"], hops=1)
103+
assert "chunk 2" in str(exc.value)
104+
105+
asyncio.run(_run())

0 commit comments

Comments
 (0)