
Adds a Reddit data connector and domain ontology for graph databases#35

Open
SquireGraphs wants to merge 11 commits into
neo4j-labs:mainfrom
SquireGraphs:reddit-connector
Conversation

@SquireGraphs

…s on Reddit.

connectors/reddit_connector.py — scrapes Reddit via the public JSON API (no auth required), extracts Post, Comment, Person, Subreddit, Technology, and Topic entities. Optionally enriches each post with Claude Haiku to extract PainPoint, UseCase, additional Technology, and Topic entities via LLM. Rate limit safe at ~6s between requests.
domains/graph-data-community.yaml — domain ontology for community intelligence across r/neo4j, r/graphdatabase, r/dataengineering, r/LocalLLaMA and related communities. Includes 7 entity types, 16 relationships, 5 document templates, 5 decision traces, and 11 pre-built agent tools (e.g. get_technology_pulse, find_pain_points, sentiment_by_subreddit).

Modified files:

config.py — adds REDDIT_DEFAULT_SUBREDDITS and REDDIT_DEFAULT_KEYWORDS constants
connectors/__init__.py — registers RedditConnector on import
cli.py — injects anthropic_api_key and openai_api_key into connector credentials so connectors can use them for enrichment without requiring a separate export
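
The "~6s between requests" pacing described above could be sketched roughly as follows. This is a minimal illustration, not the connector's actual code: `fetch_all`, `REQUEST_DELAY_SECONDS`, and the injectable `fetch`/`sleep` callables are hypothetical names chosen so the loop can be exercised without network access.

```python
import time
from typing import Callable, Iterable

# Assumed from the "~6s between requests" note; not read from the connector.
REQUEST_DELAY_SECONDS = 6.0


def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], dict],
    delay: float = REQUEST_DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> list[dict]:
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

Passing `sleep` as a parameter keeps the pacing testable: a test can record the requested pauses instead of actually waiting.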

Copilot AI left a comment

Contributor

Pull request overview

Adds a new Reddit ingestion path and a dedicated “graph-data-community” ontology to build a community-intelligence knowledge graph from public subreddit discussions, with optional LLM enrichment for higher-level signals (pain points/use cases/topics/technologies).

Changes:

  • Added graph-data-community domain ontology with entities, relationships, document templates, decision traces, and agent tools.
  • Implemented a RedditConnector that scrapes Reddit’s public JSON endpoints and normalizes posts/comments/authors into the fixture schema (optionally enriched via Anthropic).
  • Wired the connector into the registry and CLI flow; added default subreddit/keyword configuration.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| src/create_context_graph/domains/graph-data-community.yaml | New domain ontology for graph/data community intelligence built from Reddit discussions. |
| src/create_context_graph/connectors/reddit_connector.py | New connector that scrapes Reddit search/comments JSON endpoints and outputs normalized entities/relationships/documents, with optional LLM enrichment. |
| src/create_context_graph/connectors/__init__.py | Registers the new reddit connector via import side effects. |
| src/create_context_graph/config.py | Adds default subreddit + keyword lists for the Reddit connector. |
| src/create_context_graph/cli.py | Injects Anthropic/OpenAI API keys into connector credentials to support optional enrichment. |

Comments suppressed due to low confidence (2)

src/create_context_graph/config.py:90

  • Trailing whitespace after the comma will be flagged by linters/formatters and can cause CI style checks to fail. Remove the extra space at end of line.
    "apache", 

src/create_context_graph/connectors/reddit_connector.py:545

  • This relationship is appended even when uc becomes an empty string after strip(). That will create relationships with an empty target_name, which won’t match any node during ingest (and can lead to misleading relationship counts). Only append DEMONSTRATES when uc is truthy.
                                    "type": "DEMONSTRATES",
                                    "source_name": post_name,
                                    "source_label": "Post",
                                    "target_name": uc,
                                    "target_label": "UseCase",
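
A minimal sketch of the guard suggested above, assuming the use cases arrive as a list of possibly empty or `None` strings. The function name `demonstrates_relationships` is hypothetical; only the relationship dict shape is taken from the snippet.

```python
def demonstrates_relationships(post_name: str, use_cases: list) -> list[dict]:
    """Build DEMONSTRATES relationships, skipping empty use-case names."""
    relationships = []
    for raw in use_cases:
        uc = (raw or "").strip()
        if not uc:
            continue  # an empty target_name would never match a node at ingest
        relationships.append(
            {
                "type": "DEMONSTRATES",
                "source_name": post_name,
                "source_label": "Post",
                "target_name": uc,
                "target_label": "UseCase",
            }
        )
    return relationships
```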


Comment thread src/create_context_graph/config.py Outdated
Comment thread src/create_context_graph/config.py Outdated
Comment thread src/create_context_graph/connectors/reddit_connector.py
Comment thread src/create_context_graph/connectors/reddit_connector.py Outdated
Comment thread src/create_context_graph/connectors/reddit_connector.py Outdated
Comment thread src/create_context_graph/connectors/__init__.py
SquireGraphs and others added 6 commits May 13, 2026 12:42
…domain

PR Summary:
Refactors the Reddit connector to be a general-purpose product discovery and community intelligence tool, removing hardcoded graph/data community references so any team can adopt it for their own product or technology.
Changes:
connectors/reddit_connector.py — removed the hardcoded KEYWORD_TECH_MAP that mapped specific graph/data technology names. Keywords are now title-cased dynamically at runtime, so whatever keywords a user configures become the seed product/technology nodes. The LLM enrichment prompt is rewritten to be domain-agnostic, extracting pain points, use cases, topics, and product mentions from any community discussion.
config.py — defaults trimmed to a minimal starting point: one subreddit (neo4j) and three keywords (rag, agentic ai, knowledge graph). This gives new users a working example without prescribing a large fixed scope.
domains/product-discovery.yaml — new generic domain ontology for product discovery and community intelligence. Same entity structure as graph-data-community (Post, Comment, Subreddit, Person, Technology, PainPoint, UseCase, Topic) but with neutral language, generic agent tools (get_product_pulse, find_pain_points, product_co_occurrence), and a system prompt framed around product and developer relations teams rather than graph database practitioners.
Copilot found two bugs: one when the Anthropic API key is instantiated but not set, and another when pp/uc is null or empty but the code still tries to create a relationship.
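
As a rough illustration of the dynamic title-casing described for reddit_connector.py above, configured keywords could be turned into seed Technology entities like this. The function name and the entity dict shape are assumptions for the sketch, not the connector's real structures.

```python
def seed_technology_entities(keywords: list[str]) -> list[dict]:
    """Title-case configured keywords into de-duplicated seed entities."""
    seen = set()
    entities = []
    for keyword in keywords:
        name = keyword.strip().title()
        if name and name not in seen:
            seen.add(name)
            # Hypothetical entity shape; the real fixture schema may differ.
            entities.append({"label": "Technology", "name": name})
    return entities
```

One caveat of plain `str.title()`: acronyms lose their casing, so "rag" becomes "Rag" and "agentic ai" becomes "Agentic Ai".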
Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>
Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>
Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>
Comment thread src/create_context_graph/connectors/reddit_connector.py Outdated
Comment thread src/create_context_graph/connectors/reddit_connector.py Outdated
Comment thread src/create_context_graph/connectors/reddit_connector.py Outdated
Comment thread src/create_context_graph/connectors/reddit_connector.py
Comment thread src/create_context_graph/connectors/reddit_connector.py Outdated
"Reddit import complete: %d entities, %d relationships, %d documents",
total_entities, len(relationships), len(documents),
)
return NormalizedData(

Collaborator


do we create any indexes/embeddings for the data?

emoji: "🕸️"

entity_types:
- label: Post

Collaborator


Do we need to repeat them in code if we can configure them from the outside?
Make sure to align with the other domains so that labels and key properties match and the graphs can merge together.

Comment thread src/create_context_graph/domains/product-discovery.yaml
SquireGraphs and others added 2 commits May 14, 2026 11:43
reddit_connector.py — two remaining OAuth fixes:

End-of-subreddit delay now uses self._request_delay (respects OAuth rate limit of 1s vs public 6s)
_fetch_post_comments() now picks the correct base URL (OAUTH_URL vs BASE_URL), uses self._comment_delay, and passes self._auth_headers through to _get_json()

tests/test_connectors.py — three new test cases + fixes:

test_credential_prompts — now also asserts client_id, client_secret, reddit_username, reddit_password are present and that reddit_password is marked secret: True
test_authenticate_defaults — adds assertions that _auth_headers is falsy and API keys are None by default
test_authenticate_oauth_sets_auth_headers — patches _fetch_oauth_token to return a token and verifies the Authorization: bearer <token> header gets set
test_authenticate_oauth_falls_back_on_failure — patches _fetch_oauth_token to return None and verifies the connector gracefully falls back to unauthenticated mode
Fixed mock_get_json signature to accept **kwargs so the extra_headers argument doesn't break it
  - Deleted src/create_context_graph/domains/graph-data-community.yaml
  - product-discovery.yaml: Subreddit recolored #f97316 → #fb923c
  (entity_types + visualization)
  - reddit_connector.py:247: docstring now references product-discovery
  - name_pools.py: added product-discovery to DOMAIN_INDUSTRY_POOL
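
The OAuth behaviour described in this commit (pick the base URL and delay by auth state, fall back to unauthenticated mode when no token is returned) can be sketched as below. The names `BASE_URL` and `OAUTH_URL` come from the commit message, but their values here, the `RequestSettings` class, and `settings_from_token` are assumptions for illustration.

```python
from dataclasses import dataclass, field

BASE_URL = "https://www.reddit.com"     # assumed value: public JSON endpoints
OAUTH_URL = "https://oauth.reddit.com"  # assumed value: authenticated endpoints


@dataclass
class RequestSettings:
    """Hypothetical holder for the auth-dependent request parameters."""

    auth_headers: dict = field(default_factory=dict)

    @property
    def base_url(self) -> str:
        return OAUTH_URL if self.auth_headers else BASE_URL

    @property
    def request_delay(self) -> float:
        # 1s with OAuth vs 6s on the public API, per the commit message.
        return 1.0 if self.auth_headers else 6.0


def settings_from_token(token):
    """Fall back to unauthenticated settings when no token was obtained."""
    if not token:
        return RequestSettings()
    return RequestSettings(auth_headers={"Authorization": f"bearer {token}"})
```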

Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>
@jpadams jpadams force-pushed the reddit-connector branch from 87be31e to dccb00a Compare May 14, 2026 19:01
@jpadams jpadams marked this pull request as draft May 14, 2026 19:21
@jpadams

jpadams commented May 14, 2026

Collaborator

Lexical-layer research: Document, multi-connector composition, and
connector-shaped domain entities

Context

Coming out of the reddit-connector review, two ontologies turned out to
have connector-specific entities baked in (Post/Comment/Subreddit in
product-discovery). The user wants to understand:

  1. When the :Document node is used today
  2. Whether mix-and-match of connectors composes cleanly into a single graph
  3. Which other ontologies have the same connector-leak smell

This is a research pass to inform whether and how to introduce a proper
lexical layer. No code changes yet — pick a direction afterward.

Findings

  1. :Document today — a half-built lexical layer

Created only by the ingester, not by any ontology.

  • Scaffold-time: src/create_context_graph/ingest.py:130 (MERGE (d:Document
    {title: $title})) and the driver-fallback at :296.
  • Generated-project runtime:
    src/create_context_graph/templates/backend/shared/generate_data.py.j2:112.
  • Not declared in _base.yaml or any domain YAML. Schema view doesn't see
    it; agent tools aren't generated for it.

Every major connector emits Documents. Each maps a source artifact to one
Document with {title, content, metadata}:

┌──────────────────┬──────────────────────────────┬─────────────────────────────────────────────┐
│ Connector        │ Artifact → Document          │ Source                                      │
├──────────────────┼──────────────────────────────┼─────────────────────────────────────────────┤
│ Gmail            │ email subject + snippet      │ gmail_connector.py:225                      │
│ Slack            │ long message body            │ slack_connector.py:167                      │
│ GitHub           │ issue/PR body                │ github_connector.py:199                     │
│ Linear           │ project updates, Linear Docs │ linear_connector.py:529, 826                │
│ Jira             │ issue description            │ jira_connector.py:200                       │
│ Reddit           │ post title+body              │ reddit_connector.py:731                     │
│ Claude Code      │ session transcript           │ claude_code_connector.py:356                │
│ Google Workspace │ Drive files, comment threads │ google_workspace_connector.py (multi-stage) │
└──────────────────┴──────────────────────────────┴─────────────────────────────────────────────┘

So yes, conceptually the user's question is answered: messages/issue
bodies/PR bodies/emails/Reddit posts all do become :Document nodes today.

But the Document layer is underbuilt:

  • MENTIONS is lazy text-CONTAINS, not eager — ingest.py:159 runs d.content
    CONTAINS e.name after entities are loaded. Brittle for non-trivial names,
    no positional provenance ("this entity came from this span").
  • MERGE key is {title} only (ingest.py:130) — not {title, domain}. Two
    domains importing a doc titled "Project Plan" collide. This is also true
    for DecisionTrace (ingest.py:176).
  • No chunking — Document.content is one flat blob; connectors truncate at
    4k–10k chars (reddit_connector.py:733: content[:4000],
    claude_code_connector.py: content[:10000]). No :Chunk exists anywhere in
    the codebase.
  • No first-class author/container/thread — connectors stuff author,
    subreddit, post_id, channel, permalink into an opaque metadata dict. None
    are graph citizens linked to the Document node.
  • Frontend shows them (DocumentBrowser.tsx.j2, GET /documents, GET
    /documents/{title}) but the agent does not have document-aware tools — it
    can only reach them via raw execute_cypher().
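
To make the {title}-only MERGE collision concrete, here is a small in-memory simulation. It is a sketch only: `merge_documents` is a hypothetical stand-in for the Cypher MERGE in ingest.py, and it assumes later writes overwrite earlier properties on a matched node.

```python
def merge_documents(documents, key_fields):
    """Simulate MERGE: one stored node per distinct key tuple."""
    store = {}
    for doc in documents:
        key = tuple(doc[name] for name in key_fields)
        # Assumed semantics: a matched node has its properties overwritten.
        store.setdefault(key, {}).update(doc)
    return store


docs = [
    {"title": "Project Plan", "domain": "healthcare", "content": "A"},
    {"title": "Project Plan", "domain": "retail-ecommerce", "content": "B"},
]
by_title = merge_documents(docs, ("title",))
by_title_and_domain = merge_documents(docs, ("title", "domain"))
```

With the title-only key the two documents collapse into one node and the first document's content is lost; with the domain-scoped key both survive.
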

  2. Mix-and-match composition: supported in shape, broken in identity

CLI accepts multiple --connector flags (cli.py:57); the generated
import_data.py.j2 runs them serially and concatenates results (lines
120–136). So you can scaffold --connector github --connector slack --domain
software-engineering. But entities don't reconcile across sources:

  • Entity MERGE = {name, domain} (ingest.py:260) — Alice's GitHub username
    aliceg ≠ her Gmail "Alice G" ≠ her Slack display name. Three Person nodes.
  • Document MERGE = {title} only — cross-source title collisions are silent.
  • No identity reconciliation logic exists — no email → Person resolver, no
    fuzzy name match, no canonical-identity pass.
  • Cross-connector linking is one ad-hoc regex —
    google_workspace_connector.py:86, 1352–1405 scans for Linear-style refs
    (ENG-123) in Drive comments/Gmail subjects/meetings and creates
    RELATES_TO_ISSUE edges. Not generalized. GitHub PR refs (owner/repo#123)
    would need their own pattern.
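
A generalized version of the ad-hoc cross-reference scan could be a small pattern registry, sketched below. The regular expressions and the names `REFERENCE_PATTERNS` and `find_references` are assumptions for illustration; they are not the patterns used in google_workspace_connector.py.

```python
import re

# Hypothetical registry: one compiled pattern per reference style.
REFERENCE_PATTERNS = {
    "linear_issue": re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b"),    # e.g. ENG-123
    "github_ref": re.compile(r"\b[\w.-]+/[\w.-]+#\d+\b"),     # e.g. owner/repo#123
}


def find_references(text: str) -> list[tuple[str, str]]:
    """Return (kind, reference) pairs for every registered pattern match."""
    found = []
    for kind, pattern in REFERENCE_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((kind, match.group(0)))
    return found
```

Adding support for another system then means adding one registry entry rather than another hand-written scan.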

Net: mix-and-match is a shape that runs without crashing, not a join that
produces a coherent graph.

  3. Connector-leak survey across all 23 domains

Severity: 🔴 High
Domain: product-discovery
Suspicious labels: Post, Comment, Subreddit (+ Post.permalink,
Post.subreddit, Comment.post_id)
Verdict: Pretends generic, hardcoded to Reddit.
────────────────────────────────────────
Severity: 🟠 Medium
Domain: software-engineering
Suspicious labels: PullRequest, Issue, Repository (+ pr_id, issue_id,
Repository.visibility, Deployment.TRIGGERED_BY PullRequest)
Verdict: Portable in concept, GitHub-specific in properties.
────────────────────────────────────────
Severity: 🟢 By design
Domain: agent-memory
Suspicious labels: Conversation, Session, ToolCall
Verdict: The domain's purpose is modeling AI agent interaction. These are
abstract patterns.
────────────────────────────────────────
Severity: 🟢 By design
Domain: personal-knowledge
Suspicious labels: Bookmark, Note, JournalEntry
Verdict: User-facing abstractions, not platform shapes.
────────────────────────────────────────
Severity: ✅ Clean
Domain: 19 others
Suspicious labels: —
Verdict: conservation, data-journalism, digital-twin, financial-services,
gaming, genai-llm-ops, gis-cartography, golf-sports, healthcare,
hospitality, manufacturing, oil-gas, options-intelligence,
product-management, real-estate, retail-ecommerce, scientific-research,
trip-planning, vacation-industry, wildlife-management

Two real leaks. agent-memory and personal-knowledge look like leaks at
first glance but their entire purpose is to model a connector's output as a
domain — those are interface-by-design.

Implications

The user's intuition — that there should be a lexical layer where
connector-shaped things like Reddit posts, GitHub issues, Slack messages
live, and domain entities are extracted from them — already has a partial
implementation: Document. It just needs three things to actually fulfill
that role:

  1. Promotion — declare Document (and helpers) in _base.yaml so it's part of
    the ontology, visible to the schema view, and generated into agent tools.
  2. Structure — give it real authorship, containment, and provenance edges
    instead of opaque metadata.
  3. Identity — fix the {title}-only MERGE key, and add at least an
    email-based Person reconciliation pass so cross-connector composition
    produces one Alice, not three.

The two leaky domains (product-discovery, software-engineering) would
benefit from this: their connector-shaped entities (Post/Comment/Subreddit,
PullRequest/Issue/Repository) could be retired in favor of a lexical-layer
Document with source: "reddit" / source: "github" and authorship/container
edges, while their truly-domain entities (PainPoint, UseCase, Technology,
Service, Component, Engineer) stay where they are.

Proposed directions (pick one — none committed)

Option A — Minimal: promote Document to _base.yaml only

Declare Document as a base entity type in _base.yaml with its existing
shape (title, content, template_id, template_name, domain, plus source and
source_id for provenance). Add a MENTIONS relationship declaration.
Domain-scope the MERGE key. Generate agent tools for it.

Impact: ontology now sees Documents, agent can query them, MERGE-collision
bug fixed. Doesn't touch connectors, doesn't fix identity reconciliation,
doesn't remove connector-leak from product-discovery /
software-engineering. Lowest risk.

Option B — Document + lexical primitives

Option A plus declare Chunk, Author, Container
(channel/subreddit/repo/folder/thread) in _base.yaml with
NEXT_CHUNK/PART_OF/AUTHORED_BY/IN_CONTAINER relationships. Update ingest.py
and connectors to populate them. Now Post/Comment in product-discovery
could be retired in favor of Document + Container("subreddit") + Author.

Impact: real lexical layer, mix-and-match starts to compose meaningfully
(one Container per source system, one Author per real person if identity
reconciliation is added). Touches every connector. Bigger change.
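
As a rough sketch of what the Chunk primitive in Option B might involve, the function below splits a document into fixed-size chunks and emits PART_OF and NEXT_CHUNK relationships. The chunk id scheme, the fixed-size split, the dict shapes, and the edge direction (chunk to document) are all assumptions, not an agreed design.

```python
def chunk_document(title: str, content: str, size: int = 1000):
    """Split content into fixed-size chunks linked by NEXT_CHUNK/PART_OF."""
    chunks = [
        {"id": f"{title}#{index}", "index": index, "text": content[start:start + size]}
        for index, start in enumerate(range(0, len(content), size))
    ]
    # Every chunk points back at its parent document.
    relationships = [
        {"type": "PART_OF", "source": chunk["id"], "target": title}
        for chunk in chunks
    ]
    # Consecutive chunks are chained in reading order.
    relationships += [
        {"type": "NEXT_CHUNK", "source": current["id"], "target": following["id"]}
        for current, following in zip(chunks, chunks[1:])
    ]
    return chunks, relationships
```

Unlike the current truncation at 4k to 10k characters, this keeps the full content and gives each span an addressable node for provenance.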

Option C — B + identity reconciliation pass

Option B plus add a post-ingest reconciliation step: collapse Person nodes
that share email, then optionally fuzzy-match on name + container
co-occurrence. Generalize the Linear-regex into a configurable
cross-reference pattern registry.

Impact: mix-and-match actually works — GitHub + Slack + Gmail in the same
domain produces one Alice. Largest change; affects ingestion model.
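
The email-based part of Option C's reconciliation pass might look something like the sketch below. The record shape (`name`, `email`, `source`) and the function name are hypothetical, and the fuzzy name matching step is deliberately left out.

```python
def reconcile_by_email(people: list[dict]) -> list[dict]:
    """Collapse Person records that share a case-insensitive email address."""
    merged = {}
    unmatched = []
    for person in people:
        email = (person.get("email") or "").strip().lower()
        if not email:
            # No email to join on: keep the record as its own Person.
            unmatched.append({**person, "aliases": [person["name"]]})
            continue
        canonical = merged.setdefault(
            email, {"email": email, "name": person["name"], "aliases": []}
        )
        if person["name"] not in canonical["aliases"]:
            canonical["aliases"].append(person["name"])
    return list(merged.values()) + unmatched
```

The first name seen for an email becomes the canonical one here; a real pass would need a rule for choosing among display names.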

Recommendation

If the goal is to validate the architectural idea before committing to a
refactor, Option A is the right next step. It unblocks ontology-level
visibility of Document and the domain-scoped MERGE fix without disturbing
connectors, and it would expose any latent issues (e.g. agent tools that
assume Document is in the ontology) before the bigger change.

Once Option A lands and is stable, evaluate Option B's blast radius by
trying it on one domain — product-discovery is the smallest test case since
it's the most clearly leaked.

@jpadams

jpadams commented May 14, 2026

Collaborator

based on #35 (comment), I recommend that we remove the product-discovery domain/ontology (since it is mostly about Reddit structures and not about Product things such as: ICP, moat, use-case/feature, market, GTM_motion, competitors, etc) from this PR and just use the Reddit connector to produce "Document" nodes like the other connectors.

@SquireGraphs SquireGraphs marked this pull request as ready for review May 14, 2026 22:15

4 participants