Commit f2b9f99

cdeust and claude committed
fix(ast): uncap L6 symbol/edge ingestion; surface file-import chain
Two unrelated caps were silently strangling the cortex viz at the L6
phase. With the Cortex repo indexed (4115 files, 200K AP nodes), L6 was
emitting only 2,007 symbols and 4,121 imports, at most ~3% of the data
AP holds. This commit lifts both caps and adds the missing rel tables.

Cap 1: workflow_graph_source_ast.py:_MAX_SYMBOLS_PER_FILE
- The LIMIT clause scaled with len(paths). The L6 caller passes
  paths=[] for full-graph load, so max(0, 1) = 1 capped every per-label
  query at 500 symbols. 6 labels × 6 projects ≈ 18K theoretical, ~2K
  observed after AP's per-table sparseness, vs 91,648 after the fix.
- Drop the LIMIT entirely in load-all mode (paths=[]); keep it only
  when the caller specifies an explicit path filter.

Cap 2: mcp_client.py:line_limit
- Asyncio stream-buffer cap of 10 MB on the JSON-RPC stdio frame. AP
  responses with 100K+ symbols + edges legitimately exceed this and
  trigger LimitOverrunError, truncating L6 output mid-stream. Bumped to
  1 GB so realistic workloads never hit the cap; OS pipe buffering
  still provides backpressure.

Missing edge kinds: workflow_graph_source_ast.py:_load_edges_async
- Hardcoded a narrow set of (src, dst) label pairs for the rel-table
  enumeration. Files imported into Class/Interface/TypeAlias/Macro etc.
  were silently dropped because their tables (e.g. Imports_File_Class)
  were never queried. Calls from Macro and member-of from JVM/Swift
  containers had the same issue.
- Replaced with the full Cartesian product over the appropriate label
  sets. AP returns empty rows for missing rel tables, so
  over-enumeration is safe: it costs extra round-trips, no correctness
  risk.

Import nodes: _SYMBOL_LABELS + _NON_QUALIFIED_LABELS
- AP exposes every `import` statement as an Import node and links it
  via Defines_File_Import; that table alone holds 36,637 edges per
  project, vs the ~5K Imports_File_* set the loader was querying.
- Add Import to _SYMBOL_LABELS and a _NON_QUALIFIED_LABELS set so
  _load_symbols_async knows to read s.id / s.path instead of
  s.qualified_name / s.name (Import nodes don't carry the latter).
- Patch _run_edge to select dst.id when dst_lbl is non-qualified.
- Wire Defines_File_Import as an "imports"-kind edge in the loader.

Uses_* edges
- Type-usage edges (Method/Function uses Struct/Class/etc.) were never
  captured. Adding them yields +6,774 edges across the full roster.

Net effect on the live Cortex graph (6 AP projects):
  symbols        2,007 → 91,648   (45.7×)
  imports        4,121 → 41,846   (10.2×)
  uses               0 → 6,774    (new kind)
  defined_in    54,889 → 91,648   (1.7×; Imports become symbol nodes)
  total nodes  305,669 → 342,849
  total edges  397,382 → 479,109

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
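The growth factors quoted in the net-effect summary can be re-derived from the raw before/after counts. The following is a minimal sketch for checking that arithmetic; the names `COUNTS`, `ratio`, and `delta` are hypothetical and not part of the commit, and the numbers are copied from the commit message.

```python
# Before/after counts copied from the commit message.
COUNTS = {
    "symbols": (2_007, 91_648),
    "imports": (4_121, 41_846),
    "defined_in": (54_889, 91_648),
    "total nodes": (305_669, 342_849),
    "total edges": (397_382, 479_109),
}


def ratio(before: int, after: int) -> float:
    """Growth factor rounded to one decimal place."""
    return round(after / before, 1)


def delta(before: int, after: int) -> int:
    """Absolute increase between the two counts."""
    return after - before


if __name__ == "__main__":
    for kind, (before, after) in COUNTS.items():
        print(f"{kind:12} x{ratio(before, after)}  (+{delta(before, after):,})")
```

The `uses` row is left out because its starting count is zero, so a ratio is undefined.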
1 parent df141f5 commit f2b9f99

3 files changed

Lines changed: 94 additions & 130 deletions

.claude-plugin/plugin.json

Lines changed: 0 additions & 100 deletions
This file was deleted.

mcp_server/infrastructure/mcp_client.py

Lines changed: 6 additions & 1 deletion
@@ -84,7 +84,12 @@ async def _spawn_process(self) -> None:
         cwd = self._config.get("cwd")
         env = self._config.get("env")
         merged_env = {**os.environ, **(env or {})}
-        line_limit = 10 * 1024 * 1024
+        # Stream-buffer cap per JSON-RPC frame. Sized for the L6 path,
+        # where AP responses with 100k+ symbols + edges legitimately
+        # exceed 100MB. Keep an upper bound large enough that we never
+        # cap real workloads; OS-level subprocess pipe buffering still
+        # provides backpressure.
+        line_limit = 1024 * 1024 * 1024 # 1 GB
 
         # Validate command against allowlist (CWE-78 mitigation).
         # In test/dev, extra commands can be allowed via _extra_allowed_commands.
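The failure mode behind the `line_limit` change can be reproduced with the standard library alone. This is a minimal sketch, assuming the client reads newline-delimited JSON-RPC frames through an `asyncio.StreamReader`; the helper names `read_frame` and `demo` are hypothetical, and the tiny limits stand in for the 10 MB and 1 GB values.

```python
import asyncio


async def read_frame(payload: bytes, limit: int) -> str:
    """Feed one newline-terminated frame into a StreamReader with the given limit."""
    reader = asyncio.StreamReader(limit=limit)
    reader.feed_data(payload)
    reader.feed_eof()
    try:
        await reader.readuntil(b"\n")
    except asyncio.LimitOverrunError:
        # The separator exists, but the frame is longer than the limit.
        return "overrun"
    return "ok"


def demo() -> tuple:
    frame = b"x" * 100 + b"\n"  # stand-in for one large JSON-RPC line
    small = asyncio.run(read_frame(frame, limit=16))
    large = asyncio.run(read_frame(frame, limit=1024))
    return small, large


if __name__ == "__main__":
    print(demo())
```

The same frame is rejected under the small limit and accepted under the large one, which is the behaviour the commit relies on when it raises the cap.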

mcp_server/infrastructure/workflow_graph_source_ast.py

Lines changed: 88 additions & 29 deletions
@@ -28,8 +28,9 @@
     resolve_graph_paths,
 )
 
-# A paranoid cap so a bad Cypher can't drag in the world. Matches
-# AP's default page size for ``query_graph``.
+# Per-file paranoid cap, applied ONLY when the caller specifies paths.
+# Load-all callers (paths=[]) get an uncapped query — the L6 viz pipeline
+# legitimately needs every symbol the AP graph holds.
 _MAX_SYMBOLS_PER_FILE = 500
 
 
@@ -189,8 +190,19 @@ def _as_list(payload: Any) -> list[dict]:
     "Package",
     "Namespace",
     "Variable",
+    # Import statements (one node per ``import`` site). AP wires every
+    # file to its imports via the ``Defines_File_Import`` rel table; the
+    # nodes themselves carry ``id`` (``<file>::<modpath>``), ``path``,
+    # ``alias``, ``is_glob``. Loaded via a custom property mapping in
+    # ``_load_symbols_async`` because imports lack ``qualified_name``.
+    "Import",
 )
 
+# Labels whose nodes don't expose ``qualified_name`` / ``name``. The
+# load query falls back to ``id`` / ``path`` (or whatever the node
+# DOES carry) so they still flow into the graph.
+_NON_QUALIFIED_LABELS = {"Import"}
+
 
 def _symbol_type_from_label(label: str) -> str:
     """Map AP's label → workflow-graph symbol_type.
@@ -337,12 +349,28 @@ async def _load_symbols_async(
             for i in range(1, len(parts)):
                 path_tails.add("/".join(parts[i:]))
         for label in _SYMBOL_LABELS:
-            query = (
-                f"MATCH (s:{label}) "
-                "RETURN s.qualified_name AS qualified_name, "
-                " s.name AS name "
-                f"LIMIT {_MAX_SYMBOLS_PER_FILE * max(len(paths), 1)}"
-            )
+            # Import nodes don't carry qualified_name / name — they use
+            # ``id`` (``<file>::<modpath>``) and ``path`` (the imported
+            # module). Use those as the qualified_name / name surrogate.
+            if label in _NON_QUALIFIED_LABELS:
+                base_query = (
+                    f"MATCH (s:{label}) "
+                    "RETURN s.id AS qualified_name, "
+                    " s.path AS name"
+                )
+            else:
+                base_query = (
+                    f"MATCH (s:{label}) "
+                    "RETURN s.qualified_name AS qualified_name, "
+                    " s.name AS name"
+                )
+            # No LIMIT in load-all mode (paths=[]). When paths are given,
+            # apply a per-path paranoid cap so a stray query can't drag
+            # in the world.
+            if paths:
+                query = base_query + f" LIMIT {_MAX_SYMBOLS_PER_FILE * len(paths)}"
+            else:
+                query = base_query
             rows = await self._bridge.call(
                 "query_graph",
                 {"graph_path": graph_path, "query": query},
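The LIMIT change in this hunk is easier to see in isolation. Below is a minimal sketch that mirrors the old cap arithmetic and the new query construction outside the class; `old_limit` and `build_symbol_query` are hypothetical helper names, and the constants are copies of the module-level values shown in the diff.

```python
MAX_SYMBOLS_PER_FILE = 500
NON_QUALIFIED_LABELS = {"Import"}


def old_limit(paths: list) -> int:
    """Pre-fix cap: with paths=[] the max() floor pins the limit at 500."""
    return MAX_SYMBOLS_PER_FILE * max(len(paths), 1)


def build_symbol_query(label: str, paths: list) -> str:
    """Post-fix query: no LIMIT in load-all mode, per-path cap otherwise."""
    if label in NON_QUALIFIED_LABELS:
        # Import nodes expose id / path rather than qualified_name / name.
        columns = "s.id AS qualified_name, s.path AS name"
    else:
        columns = "s.qualified_name AS qualified_name, s.name AS name"
    query = f"MATCH (s:{label}) RETURN {columns}"
    if paths:
        query += f" LIMIT {MAX_SYMBOLS_PER_FILE * len(paths)}"
    return query


if __name__ == "__main__":
    print(old_limit([]))
    print(build_symbol_query("Function", []))
    print(build_symbol_query("Import", ["a.py", "b.py"]))
```

An empty `paths` list is falsy, so the load-all caller takes the uncapped branch without needing a separate flag.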
@@ -510,25 +538,25 @@ async def _load_edges_async(
             parts = p.split("/")
             for i in range(1, len(parts)):
                 path_tails.add("/".join(parts[i:]))
-        calls_rels = [
-            ("Function", "Function"),
-            ("Function", "Method"),
-            ("Method", "Function"),
-            ("Method", "Method"),
-        ]
-        imports_rels = [
-            ("File", "Function"),
-            ("File", "Struct"),
-            ("File", "Enum"),
-            ("File", "Trait"),
-            ("File", "Method"),
-            ("File", "Constant"),
-        ]
-        member_rels = [
-            ("Struct", "Method"),
-            ("Enum", "Method"),
-            ("Trait", "Method"),
-        ]
+        # Enumerate the full Cartesian product of label kinds AP could
+        # have produced rel tables for. AP rejects queries against rel
+        # tables that don't exist by returning empty rows, so the over-
+        # enumeration is safe — it just costs extra round-trips against
+        # missing tables. The narrower prior lists were the reason the
+        # cortex viz showed ~4k imports instead of the tens of thousands
+        # the codebase actually contains: every File→Class / File→
+        # Interface / File→TypeAlias / File→Macro etc. edge was being
+        # silently dropped because its rel table was never queried.
+        _CALL_LABELS = ("Function", "Method", "Macro")
+        _IMPORT_TARGETS = _SYMBOL_LABELS # File can import any symbol kind
+        _CONTAINER_LBLS = (
+            "Struct", "Enum", "Trait",
+            "Class", "Interface", "Protocol", "Extension", "Union",
+        )
+        _MEMBER_LBLS = ("Method", "Field", "Property", "Constant", "TypeAlias")
+        calls_rels = [(s, d) for s in _CALL_LABELS for d in _CALL_LABELS]
+        imports_rels = [("File", t) for t in _IMPORT_TARGETS]
+        member_rels = [(s, d) for s in _CONTAINER_LBLS for d in _MEMBER_LBLS]
 
         def _match(file_part: str) -> bool:
             if not path_tails:
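The cost side of the Cartesian enumeration can be estimated by counting the label pairs it produces. This is a rough sketch using `itertools.product` in place of the nested comprehensions; `label_pairs` is a hypothetical helper, and the label tuples are copied from the hunk above. `imports_rels` is omitted because the full `_SYMBOL_LABELS` tuple is not visible in this diff.

```python
from itertools import product

CALL_LABELS = ("Function", "Method", "Macro")
CONTAINER_LABELS = (
    "Struct", "Enum", "Trait",
    "Class", "Interface", "Protocol", "Extension", "Union",
)
MEMBER_LABELS = ("Method", "Field", "Property", "Constant", "TypeAlias")


def label_pairs(sources: tuple, targets: tuple) -> list:
    """Every (src, dst) pair, in the same order as the nested comprehension."""
    return list(product(sources, targets))


calls_rels = label_pairs(CALL_LABELS, CALL_LABELS)
member_rels = label_pairs(CONTAINER_LABELS, MEMBER_LABELS)

if __name__ == "__main__":
    # Each pair corresponds to one rel-table query, so these counts are
    # the number of round-trips per edge kind.
    print(len(calls_rels), len(member_rels))
```

Compared with the removed lists (4 call pairs, 3 member pairs), the product yields 9 and 40 pairs respectively, which is the extra round-trip cost the comment refers to.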
@@ -557,16 +585,26 @@ async def _run_edge(
             """
             if src_lbl == "File":
                 select_src = "src.id AS src_name"
+            elif src_lbl in _NON_QUALIFIED_LABELS:
+                select_src = "src.id AS src_name"
             else:
                 select_src = "src.qualified_name AS src_name"
+            # Import nodes (and any other ``_NON_QUALIFIED_LABELS`` kind)
+            # carry ``id`` instead of ``qualified_name``. Selecting the
+            # missing property would raise a Kuzu Binder exception.
+            dst_qn = (
+                "dst.id AS dst_name"
+                if dst_lbl in _NON_QUALIFIED_LABELS
+                else "dst.qualified_name AS dst_name"
+            )
             if has_provenance:
                 return_tail = (
-                    " dst.qualified_name AS dst_name, "
+                    f" {dst_qn}, "
                     " r.confidence AS confidence, "
                     " r.resolution_method AS reason"
                 )
             else:
-                return_tail = " dst.qualified_name AS dst_name"
+                return_tail = f" {dst_qn}"
             query = (
                 f"MATCH (src:{src_lbl})-[r:{table}]->(dst:{dst_lbl}) "
                 f"RETURN {select_src}, {return_tail}"
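The column selection in `_run_edge` reduces to choosing one property per endpoint. The sketch below restates that logic as a standalone function, assuming the same rules as the hunk above; `edge_query` is a hypothetical name, and the whitespace of the generated Cypher is normalised rather than byte-identical to the loader's output.

```python
NON_QUALIFIED_LABELS = {"Import"}


def edge_query(table: str, src_lbl: str, dst_lbl: str, has_provenance: bool) -> str:
    """Build the MATCH/RETURN text for one rel table."""
    # File and non-qualified labels are identified by id on the source side.
    src_prop = (
        "id"
        if src_lbl == "File" or src_lbl in NON_QUALIFIED_LABELS
        else "qualified_name"
    )
    # On the destination side only non-qualified labels switch to id.
    dst_prop = "id" if dst_lbl in NON_QUALIFIED_LABELS else "qualified_name"
    columns = [f"src.{src_prop} AS src_name", f"dst.{dst_prop} AS dst_name"]
    if has_provenance:
        columns += ["r.confidence AS confidence", "r.resolution_method AS reason"]
    return (
        f"MATCH (src:{src_lbl})-[r:{table}]->(dst:{dst_lbl}) "
        f"RETURN {', '.join(columns)}"
    )


if __name__ == "__main__":
    print(edge_query("Defines_File_Import", "File", "Import", False))
    print(edge_query("Uses_Method_Struct", "Method", "Struct", True))
```

Deriving both column names from one rule keeps the provenance and non-provenance branches from drifting apart, which is the same motivation as the `dst_qn` variable in the diff.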
@@ -629,6 +667,27 @@ async def _run_edge(
             await _run_edge(
                 "member_of", f"HasMethod_{s}_{d}", s, d, has_provenance=False
             )
+        # File → Import node. AP wires every ``import`` statement to its
+        # file via this rel table; counts in the tens of thousands per
+        # project. Without this, the cortex viz captures only the small
+        # subset that AP managed to RESOLVE to in-graph symbols (the
+        # ``Imports_File_*`` tables, totalling ~5k vs ~36k actual).
+        await _run_edge(
+            "imports", "Defines_File_Import", "File", "Import",
+            has_provenance=False,
+        )
+        # Type-usage edges (Method/Function uses Struct/Class/etc).
+        # Captured by AP's resolver and exposed as ``Uses_<src>_<dst>``.
+        _USES_SRC = ("Method", "Function")
+        _USES_DST = (
+            "Struct", "Enum", "Trait", "Class", "Interface",
+            "Protocol", "Extension", "Union", "TypeAlias",
+        )
+        for s in _USES_SRC:
+            for d in _USES_DST:
+                await _run_edge(
+                    "uses", f"Uses_{s}_{d}", s, d, has_provenance=True
+                )
         return out
 
 