The first version of memory-path-engine is only useful if it can be evaluated against explicit baselines.
For the higher-level benchmark portfolio strategy across public datasets, repository-owned fixtures, and private real-world gold sets, see benchmark-strategy.md.
The initial example benchmark pack lives in examples/contract_pack.
Repository-owned benchmark fixtures now live in ../benchmarks/structured_memory and are modeled as strongly typed pydantic models in the benchmark bounded context.
Inputs:
- synthetic markdown contract-like documents in examples/contract_pack/contracts
- annotated evaluation questions in examples/contract_pack/eval/questions.json
Each question includes:
- id
- query
- answer
- evidence_node_ids
- tags
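As a rough illustration of the question shape listed above, the following sketch parses question records into a typed structure. Only the five field names come from this document; the field types, the `EvalQuestion` and `parse_questions` names, the assumption that the file is a JSON array of objects, and the sample values are hypothetical.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalQuestion:
    """One annotated evaluation question (types are assumed, not confirmed)."""

    id: str
    query: str
    answer: str
    evidence_node_ids: list[str]
    tags: list[str] = field(default_factory=list)


def parse_questions(raw: str) -> list[EvalQuestion]:
    """Parse a JSON array of question objects (assumed layout)."""
    return [EvalQuestion(**item) for item in json.loads(raw)]


# Hypothetical sample record; the values are illustrative only.
sample = json.dumps([
    {
        "id": "q-001",
        "query": "What is the late-payment remedy?",
        "answer": "interest",
        "evidence_node_ids": ["contract-a:clause-7"],
        "tags": ["remedy"],
    }
])
questions = parse_questions(sample)
```

Parsing into a typed record up front means a missing or misspelled field fails at load time rather than in the middle of a benchmark run.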
Whether the retriever returns a path whose final answer matches the expected answer pattern.
Whether at least one expected evidence node is present in the returned path set.
Whether the path crosses edges that are semantically allowed by the domain pack.
Elapsed time for a single query under local execution.
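The four checks above could be computed per question roughly as follows. This is a minimal sketch, not the repository's runner: the function names are hypothetical, and treating the expected answer as a case-insensitive regular expression is an assumption about what "answer pattern" means.

```python
import re
import time


def answer_matches(answer_text: str, expected_pattern: str) -> bool:
    """True when the expected pattern occurs in the returned answer.

    Assumption: the pattern is a regular expression, matched case-insensitively.
    """
    return re.search(expected_pattern, answer_text, flags=re.IGNORECASE) is not None


def evidence_hit(returned_ids: set[str], expected_ids: set[str]) -> bool:
    """True when at least one expected evidence node was returned."""
    return bool(returned_ids & expected_ids)


def path_is_valid(edge_types: list[str], allowed: set[str]) -> bool:
    """True when every edge on the path has a type the domain pack allows."""
    return all(edge_type in allowed for edge_type in edge_types)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0
```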
For retrieval-only public benchmarks, the repository also reports:
- R@5
- R@10
- NDCG@10
These are currently used for LongMemEval session retrieval and should be interpreted as external positioning metrics, not as direct proof of MemoryPath correctness.
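For reference, recall at k and NDCG at k with binary relevance can be computed as in the sketch below. It uses the standard textbook definitions; whether the repository or LongMemEval applies exactly these conventions (for example, in how the ideal DCG is normalized) is an assumption, and the function names are hypothetical.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Hypothetical session ids, for illustration only.
ranked_sessions = ["s3", "s1", "s9", "s2"]
relevant_sessions = {"s1", "s2"}
r_at_5 = recall_at_k(ranked_sessions, relevant_sessions, 5)
ndcg_at_10 = ndcg_at_k(ranked_sessions, relevant_sessions, 10)
```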
- retrieve by lexical similarity only
- no graph expansion
- no weight-aware reranking
- retrieve by embedding similarity through a pluggable EmbeddingProvider
- no graph expansion
- no weight-aware reranking
- retrieve by embedding similarity
- expand across explicit edges
- no risk or anomaly boosts
- embedding-based candidate retrieval
- edge-aware expansion
- weighted scoring across semantic, structural, anomaly, and importance signals
- replayable MemoryPath output
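Weighted scoring across semantic, structural, anomaly, and importance signals can be pictured as a linear combination followed by a rerank. The sketch below is illustrative only: the four signal names come from the list above, while the `Signals` type, the weight values, and the linear form are assumptions rather than the repository's actual scoring function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Per-node signal values, each assumed to lie in [0, 1]."""

    semantic: float
    structural: float
    anomaly: float
    importance: float


# Hypothetical weights; real values would be tuned against the benchmark.
WEIGHTS = Signals(semantic=0.5, structural=0.2, anomaly=0.15, importance=0.15)


def weighted_score(signals: Signals, weights: Signals = WEIGHTS) -> float:
    """Linear combination of the four signals."""
    return (
        weights.semantic * signals.semantic
        + weights.structural * signals.structural
        + weights.anomaly * signals.anomaly
        + weights.importance * signals.importance
    )


def rerank(candidates: dict[str, Signals]) -> list[str]:
    """Return node ids ordered from highest to lowest weighted score."""
    return sorted(candidates, key=lambda node: weighted_score(candidates[node]), reverse=True)


# Hypothetical candidates: "b" is less similar but flagged as anomalous and important.
candidates = {
    "a": Signals(semantic=0.9, structural=0.0, anomaly=0.0, importance=0.0),
    "b": Signals(semantic=0.6, structural=0.5, anomaly=1.0, importance=0.5),
}
order = rerank(candidates)
```

In this toy case the less similar node ranks first because its structural, anomaly, and importance signals outweigh the similarity gap, which is the behavior the embedding-only baselines cannot reproduce.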
- embedding-based seed selection
- explicit activation propagation with decay and threshold controls
- edge-type-aware traversal
- semantic bonuses for exception, remedy, and escalation nodes
- replayable path output with propagation-oriented diagnostics
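Activation propagation with decay and threshold controls, combined with edge-type-aware traversal, can be sketched as below. This is a simplified stand-in, not the repository's retriever: the function name, the parameter defaults, the per-edge-type weight table, and the choice to keep the maximum activation per node (rather than summing) are all assumptions.

```python
def spread_activation(
    seeds: dict[str, float],
    edges: dict[str, list[tuple[str, str]]],
    type_weights: dict[str, float],
    decay: float = 0.5,
    threshold: float = 0.1,
    max_depth: int = 3,
) -> dict[str, float]:
    """Propagate activation outward from seed nodes.

    edges maps a node to (neighbor, edge_type) pairs. Edge types missing from
    type_weights get weight 0.0 and are therefore never traversed.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(max_depth):
        next_frontier: dict[str, float] = {}
        for node, energy in frontier.items():
            for neighbor, edge_type in edges.get(node, []):
                passed = energy * type_weights.get(edge_type, 0.0) * decay
                if passed < threshold:
                    continue  # too weak to activate the neighbor
                if passed > activation.get(neighbor, 0.0):
                    activation[neighbor] = passed
                    next_frontier[neighbor] = passed
        if not next_frontier:
            break
        frontier = next_frontier
    return activation


# Hypothetical graph and edge-type weights, for illustration only.
graph = {
    "a": [("b", "exception")],
    "b": [("c", "exception")],
    "c": [("d", "reference")],
}
activated = spread_activation({"a": 1.0}, graph, {"exception": 1.0, "reference": 0.5})
```

Here node `d` is never activated: the energy reaching it falls below the threshold, so propagation stops after two hops.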
The repository also exposes paired experiment modes that keep the retriever logic fixed while changing whether memory state is updated across queries:
- weighted_graph_static
- weighted_graph_dynamic
- activation_spreading_static
- activation_spreading_dynamic
Interpretation:
- *_static uses StaticMemoryStatePolicy, so query order should not change node memory state
- *_dynamic uses MemoryStatePolicy, so repeated queries can reinforce some nodes and decay others before later cases run
The primary repository-owned fixture for this comparison is benchmarks/structured_memory/dynamic_memory_priming_benchmark.json, which is intentionally ordered as repeated prime-* cases followed by a final probe-* case.
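The difference between the static and dynamic modes can be illustrated with two toy policies run over an ordered list of cases. The classes below are hypothetical stand-ins and do not reflect the real StaticMemoryStatePolicy or MemoryStatePolicy interfaces; the reinforcement and decay constants are likewise invented for the example.

```python
class ToyStaticPolicy:
    """Stand-in for a static policy: node state never changes between queries."""

    def after_query(self, state: dict[str, float], touched: set[str]) -> None:
        return None


class ToyDynamicPolicy:
    """Stand-in for a dynamic policy: touched nodes are reinforced, others decay."""

    def __init__(self, reinforce: float = 0.2, decay: float = 0.9) -> None:
        self.reinforce = reinforce
        self.decay = decay

    def after_query(self, state: dict[str, float], touched: set[str]) -> None:
        for node in state:
            if node in touched:
                state[node] += self.reinforce
            else:
                state[node] *= self.decay


def run_cases(policy, cases: list[set[str]], nodes: list[str]) -> dict[str, float]:
    """Apply the policy after each case, in order, and return the final state."""
    state = {node: 1.0 for node in nodes}
    for touched in cases:
        policy.after_query(state, touched)
    return state


# Two hypothetical prime-* cases that both touch node "n1".
prime_cases = [{"n1"}, {"n1"}]
static_state = run_cases(ToyStaticPolicy(), prime_cases, ["n1", "n2"])
dynamic_state = run_cases(ToyDynamicPolicy(), prime_cases, ["n1", "n2"])
```

Under the toy dynamic policy the primed node ends with a higher weight and the untouched node with a lower one, so a later probe case sees a different memory state than it would in the static run.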
- remove structure
- remove weights
- remove path expansion
If these ablations produce no meaningful change, the core design assumptions need to be revisited.
- graph-aware retrieval wins on multi-hop structured-document questions in the example benchmark
- weighted retrieval improves critical clause discovery
- path output makes failures easy to inspect
The evaluation runner can now emit detailed per-question diagnostics in addition to summary scores.
With detailed=True, each mode includes:
- avg_latency_ms
- per-question hit or miss
- expected vs returned evidence node ids
- best answer text for inspection
- surfaced semantic roles
- best-path edge types
- activated node count and propagation depth
The suite output also includes a cross-mode comparison report so you can quickly spot:
- questions missed only by one mode
- modes that win on the same question
- latency trade-offs between lexical, embedding, structure-only, and weighted retrieval
- path-hit and semantic-hit rates for graph-oriented cases
- activation breadth and propagation depth for spreading-based retrieval
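One item from the comparison report, questions missed by exactly one mode, can be derived from per-question hit flags as sketched below. The input layout (mode name to question id to hit flag), the function name, and the sample mode names are assumptions; the actual report structure is not shown in this document.

```python
def missed_only_by(results: dict[str, dict[str, bool]]) -> dict[str, list[str]]:
    """Map each mode to the question ids that it alone missed.

    results maps mode name -> question id -> hit flag. A question absent from
    a mode's results is counted as a miss for that mode.
    """
    question_ids = sorted(set().union(*(hits.keys() for hits in results.values())))
    report: dict[str, list[str]] = {mode: [] for mode in results}
    for question_id in question_ids:
        missers = [
            mode for mode, hits in results.items() if not hits.get(question_id, False)
        ]
        if len(missers) == 1:
            report[missers[0]].append(question_id)
    return report


# Hypothetical per-question outcomes for three modes.
outcomes = {
    "lexical": {"q1": False, "q2": True},
    "embedding": {"q1": True, "q2": True},
    "weighted_graph": {"q1": True, "q2": True},
}
only_missed = missed_only_by(outcomes)
```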
The structured benchmark workflow is separated into its own bounded context:
- benchmarking.domain: typed dataset, case, expectation, and report models
- benchmarking.infrastructure: JSON fixture loading
- benchmarking.application: runner and end-to-end evaluation service
This keeps evaluation logic explicit and strongly typed instead of spreading anonymous dict payloads through the codebase.
Repository-owned graph fixtures now also cover:
- exception override cases
- multi-hop chain cases
- path-shape expectations
- semantic-role and edge-type expectations
- contradiction-pair expectations
- dynamic priming cases for static vs dynamic memory comparison
The public benchmark adapters now split into two tracks:
- HotpotQA: evidence retrieval sanity on multi-document QA (evidence_hit_rate, evidence_recall, per-type breakdowns)
- LongMemEval: session-level retrieval-only memory recall (R@5, R@10, NDCG@10)