Serendipitous knowledge retrieval via graph traversal #18

@jimador

Observation

Proposition retrieval in DICE is direct:

  • MemoryRetriever.recall(query) — vector similarity to query
  • MemoryRetriever.recallAbout(entityId) — propositions mentioning an entity
  • MemoryRetriever.recallByType(type) — propositions by entity type
  • PropositionRepository.findSimilar() — embedding-based similarity search

All of these find propositions directly related to the query. But some of the most valuable knowledge connections are indirect — a proposition about entity A might be critical context for entity B, connected via a shared relationship 2-3 hops away in the knowledge graph.

DICE's text2graph pipeline builds exactly the kind of relational structure that could support this. GraphProjectionService projects propositions into ProjectedRelationship edges. But retrieval only uses direct lookups — the graph structure goes unused for discovery.

The idea

A retrieval mode inspired by spreading activation: starting from directly-relevant propositions, follow ProjectedRelationship edges in the knowledge graph to discover indirectly-related propositions that normal retrieval would miss.

Query → direct matches → follow graph edges (2-3 hops) → score by activation strength → return surprising-but-relevant propositions
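
A minimal sketch of that flow, to make the discussion concrete. It is not DICE code: the graph is a plain adjacency dict standing in for ProjectedRelationship edges, node ids stand in for propositions (or the entities they mention), and `spread_activation`, `min_hops`, `max_hops` and `decay` are hypothetical names chosen for illustration.

```python
from collections import deque


def spread_activation(graph, seeds, min_hops=2, max_hops=3, decay=0.5):
    """Return {node: activation} for nodes min_hops..max_hops away from the seeds.

    graph: dict of node id -> iterable of neighbour ids (assumed stand-in
           for ProjectedRelationship edges).
    seeds: dict of directly matched node id -> initial score.
    """
    best = dict(seeds)                 # strongest activation seen per node
    hops = {node: 0 for node in seeds}  # shortest hop distance from any seed
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] >= max_hops:
            continue                   # do not expand past the hop limit
        passed = best[node] * decay    # activation weakens with every hop
        for neighbour in graph.get(node, ()):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                best[neighbour] = passed
                queue.append(neighbour)
            elif hops[neighbour] == hops[node] + 1 and passed > best[neighbour]:
                best[neighbour] = passed  # stronger path of the same length
    # Direct matches and near neighbours are left to normal retrieval.
    return {n: a for n, a in best.items() if min_hops <= hops[n] <= max_hops}


chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
result = spread_activation(chain, {"a": 1.0})
```

On the toy chain above, only `c` and `d` come back: `b` is below the minimum hop distance and `e` is beyond the maximum. A real version would probably also weight edges by relationship type or confidence, which this sketch ignores.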

Parameters

  • Minimum hop distance — don't return direct matches (those come from normal retrieval)
  • Maximum hop distance — don't go too far (relevance drops sharply beyond 3-4 hops)
  • Activation decay per hop — strength halves with each hop
  • Separate token budget — serendipitous results are supplementary, first to drop under pressure
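
One way the separate budget could work, sketched under assumptions: direct results are charged against the total first, serendipitous results get whatever is left up to their own cap, and candidates are admitted in order of activation. The function names, the `(text, activation, token_count)` tuple shape, and the idea that token counts are known up front are all hypothetical.

```python
def serendipity_allowance(total_budget, direct_tokens, serendipity_cap):
    """Tokens left for serendipitous results after direct results are paid for."""
    return max(0, min(serendipity_cap, total_budget - direct_tokens))


def fill_budget(candidates, budget_tokens):
    """Greedily pick the most activated candidates that still fit the budget.

    candidates: list of (text, activation, token_count) tuples.
    """
    chosen, used = [], 0
    for text, _activation, tokens in sorted(
        candidates, key=lambda c: c[1], reverse=True
    ):
        if used + tokens <= budget_tokens:
            chosen.append(text)
            used += tokens
    return chosen


candidates = [("a", 0.25, 60), ("b", 0.125, 50), ("c", 0.2, 30)]
allowance = serendipity_allowance(total_budget=1000, direct_tokens=900, serendipity_cap=200)
picked = fill_budget(candidates, allowance)
```

With 900 of 1000 tokens already used by direct results, the allowance shrinks from the 200-token cap to 100, so only `a` and `c` fit. If direct results consumed the whole budget, the allowance would be 0 and nothing serendipitous would be injected.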

Safety guard

Serendipitously retrieved propositions must not contradict actively injected ones. If conflict detection (#12) is available, run it on each candidate before including it, so that graph traversal cannot introduce contradictions.
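
A sketch of what the guard might look like as a filter step. The `conflicts` predicate is a placeholder for whatever #12 ends up exposing; the triple-based `toy_conflicts` below is purely illustrative and is not how DICE represents propositions.

```python
def filter_conflicts(candidates, injected, conflicts=None):
    """Drop candidates that contradict any already injected proposition.

    conflicts: optional callable (candidate, injected_prop) -> bool.
               When no detector is available, candidates pass through unchanged.
    """
    if conflicts is None:
        return list(candidates)
    return [
        c for c in candidates
        if not any(conflicts(c, p) for p in injected)
    ]


def toy_conflicts(a, b):
    """Toy detector: same subject and predicate, different value."""
    return a[0] == b[0] and a[1] == b[1] and a[2] != b[2]


injected = [("alice", "lives_in", "Paris")]
candidates = [
    ("alice", "lives_in", "Berlin"),   # contradicts the injected proposition
    ("alice", "works_at", "Acme"),     # unrelated predicate, kept
]
safe = filter_conflicts(candidates, injected, toy_conflicts)
```

Whether candidates should pass through unchecked when no detector is configured, or be withheld entirely, is a design choice this sketch does not settle.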

Open questions

  • Is the graph dense enough? If text2graph produces sparse relationships, there may not be enough edges for multi-hop traversal to find anything useful.
  • How do you evaluate "serendipitous but useful" vs. "random noise"? Without a metric, it's hard to know if the feature is helping or just consuming tokens.
  • Should this be deterministic or probabilistic? Probabilistic traversal (random walk with restart) keeps results varied across invocations. Deterministic (shortest path) is reproducible but less surprising.
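
For the last question, a rough sketch of random walk with restart, using visit counts as the relevance score. The adjacency dict, the `restart` probability of 0.3, and the step count are assumptions for illustration. Passing a seeded `random.Random` makes a run repeatable, which may be a middle ground between the two options.

```python
import random
from collections import Counter


def random_walk_with_restart(graph, seeds, steps=1000, restart=0.3, rng=None):
    """Count how often each non-seed node is visited by a restarting walk.

    graph: dict of node id -> iterable of neighbour ids.
    seeds: directly matched node ids the walk restarts from.
    rng:   optional random.Random; pass a seeded one for reproducible runs.
    """
    if rng is None:
        rng = random.Random()
    seeds = list(seeds)
    visits = Counter()
    node = rng.choice(seeds)
    for _ in range(steps):
        neighbours = list(graph.get(node, ()))
        if not neighbours or rng.random() < restart:
            node = rng.choice(seeds)       # dead end or restart: jump home
        else:
            node = rng.choice(neighbours)  # follow a random outgoing edge
            visits[node] += 1
    for seed in seeds:
        visits.pop(seed, None)             # direct matches are not results
    return visits


chain = {"a": ["b"], "b": ["c"]}
run_1 = random_walk_with_restart(chain, ["a"], rng=random.Random(42))
run_2 = random_walk_with_restart(chain, ["a"], rng=random.Random(42))
```

Nodes closer to the seeds accumulate more visits, so the counts behave like a soft version of per-hop decay. Without a fixed seed, low-count nodes will differ between invocations, which is the varied behaviour described above.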
