Adds a Reddit data connector and domain ontology for graph databases by SquireGraphs · Pull Request #35 · neo4j-labs/create-context-graph

SquireGraphs · 2026-05-13T17:15:34Z

…s on Reddit.

connectors/reddit_connector.py: scrapes Reddit via the public JSON API (no auth required) and extracts Post, Comment, Person, Subreddit, Technology, and Topic entities. Optionally enriches each post with Claude Haiku to extract PainPoint, UseCase, and additional Technology and Topic entities. Rate-limit safe at ~6s between requests.

domains/graph-data-community.yaml: domain ontology for community intelligence across r/neo4j, r/graphdatabase, r/dataengineering, r/LocalLLaMA and related communities. Includes 7 entity types, 16 relationships, 5 document templates, 5 decision traces, and 11 pre-built agent tools (e.g. get_technology_pulse, find_pain_points, sentiment_by_subreddit).

Modified files:

config.py: adds REDDIT_DEFAULT_SUBREDDITS and REDDIT_DEFAULT_KEYWORDS constants
connectors/__init__.py: registers RedditConnector on import
cli.py: injects anthropic_api_key and openai_api_key into connector credentials so connectors can use them for enrichment without requiring a separate export


Copilot

Pull request overview

Adds a new Reddit ingestion path and a dedicated “graph-data-community” ontology to build a community-intelligence knowledge graph from public subreddit discussions, with optional LLM enrichment for higher-level signals (pain points/use cases/topics/technologies).

Changes:

Added graph-data-community domain ontology with entities, relationships, document templates, decision traces, and agent tools.
Implemented a RedditConnector that scrapes Reddit’s public JSON endpoints and normalizes posts/comments/authors into the fixture schema (optionally enriched via Anthropic).
Wired the connector into the registry and CLI flow; added default subreddit/keyword configuration.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.


File	Description
`src/create_context_graph/domains/graph-data-community.yaml`	New domain ontology for graph/data community intelligence built from Reddit discussions.
`src/create_context_graph/connectors/reddit_connector.py`	New connector that scrapes Reddit search/comments JSON endpoints and outputs normalized entities/relationships/documents, with optional LLM enrichment.
`src/create_context_graph/connectors/__init__.py`	Registers the new `reddit` connector via import side effects.
`src/create_context_graph/config.py`	Adds default subreddit + keyword lists for the Reddit connector.
`src/create_context_graph/cli.py`	Injects Anthropic/OpenAI API keys into connector credentials to support optional enrichment.

Comments suppressed due to low confidence (2)

src/create_context_graph/config.py:90

Trailing whitespace after the comma will be flagged by linters/formatters and can cause CI style checks to fail. Remove the extra space at end of line.

    "apache",

src/create_context_graph/connectors/reddit_connector.py:545

This relationship is appended even when uc becomes an empty string after strip(). That will create relationships with an empty target_name, which won’t match any node during ingest (and can lead to misleading relationship counts). Only append DEMONSTRATES when uc is truthy.

                                    "type": "DEMONSTRATES",
                                    "source_name": post_name,
                                    "source_label": "Post",
                                    "target_name": uc,
                                    "target_label": "UseCase",
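A minimal sketch of the guard this comment asks for, assuming use cases arrive as a list of possibly blank or None strings. The helper name `build_use_case_relationships` is hypothetical and is not taken from the connector; only the relationship fields come from the excerpt above.

```python
def build_use_case_relationships(post_name, use_cases):
    """Return DEMONSTRATES relationships, skipping blank use-case names."""
    relationships = []
    for raw in use_cases:
        # Normalize first so whitespace-only and None values become "".
        uc = (raw or "").strip()
        if not uc:
            # An empty target_name would never match a node during ingest.
            continue
        relationships.append({
            "type": "DEMONSTRATES",
            "source_name": post_name,
            "source_label": "Post",
            "target_name": uc,
            "target_label": "UseCase",
        })
    return relationships
```

The same check would apply to the pain-point relationships, since both values come out of the same enrichment step.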


…domain

PR Summary: Refactors the Reddit connector to be a general-purpose product discovery and community intelligence tool, removing hardcoded graph/data community references so any team can adopt it for their own product or technology.

Changes:

connectors/reddit_connector.py: removed the hardcoded KEYWORD_TECH_MAP that mapped specific graph/data technology names. Keywords are now title-cased dynamically at runtime, so whatever keywords a user configures become the seed product/technology nodes. The LLM enrichment prompt is rewritten to be domain-agnostic, extracting pain points, use cases, topics, and product mentions from any community discussion.

config.py: defaults trimmed to a minimal starting point: one subreddit (neo4j) and three keywords (rag, agentic ai, knowledge graph). This gives new users a working example without prescribing a large fixed scope.

domains/product-discovery.yaml: new generic domain ontology for product discovery and community intelligence. Same entity structure as graph-data-community (Post, Comment, Subreddit, Person, Technology, PainPoint, UseCase, Topic) but with neutral language, generic agent tools (get_product_pulse, find_pain_points, product_co_occurrence), and a system prompt framed around product and developer relations teams rather than graph database practitioners.
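To illustrate the dynamic title-casing described in this commit message, here is a small sketch. It assumes Python's built-in `str.title()` is what does the casing, which the commit message does not confirm, and the function name `seed_technology_entities` is hypothetical.

```python
def seed_technology_entities(keywords):
    """Turn configured keywords into seed Technology entities."""
    seen = set()
    entities = []
    for keyword in keywords:
        # Assumption: str.title() is the casing rule used.
        name = keyword.strip().title()
        if name and name not in seen:
            seen.add(name)
            entities.append({"label": "Technology", "name": name})
    return entities


defaults = ["rag", "agentic ai", "knowledge graph"]
print([e["name"] for e in seed_technology_entities(defaults)])
# ['Rag', 'Agentic Ai', 'Knowledge Graph']
```

One consequence worth noting under this assumption: acronyms lose their casing ("rag" becomes "Rag", not "RAG"), which is a side effect the removed hardcoded map would have avoided for the names it covered.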

…eate-context-graph into reddit-connector

Copilot found two bugs: one when the Anthropic API key is instantiated but not set, and another when pp/uc is null or empty but the code still tries to create a relationship.

Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>

jexp · 2026-05-13T23:55:56Z

+            "Reddit import complete: %d entities, %d relationships, %d documents",
+            total_entities, len(relationships), len(documents),
+        )
+        return NormalizedData(


do we create any indexes/embeddings for the data?

jexp · 2026-05-13T23:56:47Z

+  emoji: "🕸️"
+
+entity_types:
+  - label: Post


do we need to repeat them in code if we can configure them from the outside?
make sure to align with the other domains so that labels and key(properties) align and can merge together.

reddit_connector.py: two remaining OAuth fixes:
- End-of-subreddit delay now uses self._request_delay (respects the OAuth rate limit of 1s vs. the public 6s)
- _fetch_post_comments() now picks the correct base URL (OAUTH_URL vs. BASE_URL), uses self._comment_delay, and passes self._auth_headers through to _get_json()

tests/test_connectors.py: three new test cases + fixes:
- test_credential_prompts: now also asserts client_id, client_secret, reddit_username, reddit_password are present and that reddit_password is marked secret: True
- test_authenticate_defaults: adds assertions that _auth_headers is falsy and API keys are None by default
- test_authenticate_oauth_sets_auth_headers: patches _fetch_oauth_token to return a token and verifies the Authorization: bearer <token> header gets set
- test_authenticate_oauth_falls_back_on_failure: patches _fetch_oauth_token to return None and verifies the connector gracefully falls back to unauthenticated mode
- Fixed mock_get_json signature to accept **kwargs so the extra_headers argument doesn't break it
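The authenticated/unauthenticated split behind these fixes can be sketched as one small selection function. This is an illustration only: the constant names mirror the ones mentioned above, but their values are assumed to be Reddit's usual public and OAuth hosts, the delay constants are hypothetical names for the 1s and 6s figures quoted above, and `request_settings` is not a function in the connector.

```python
BASE_URL = "https://www.reddit.com"     # assumed value of the public base URL
OAUTH_URL = "https://oauth.reddit.com"  # assumed value of the OAuth base URL
PUBLIC_DELAY_S = 6.0                    # unauthenticated pacing, per the text above
OAUTH_DELAY_S = 1.0                     # OAuth pacing, per the text above


def request_settings(token):
    """Pick base URL, delay, and headers depending on whether a token exists."""
    if token:
        return {
            "base_url": OAUTH_URL,
            "delay": OAUTH_DELAY_S,
            "headers": {"Authorization": f"bearer {token}"},
        }
    # No token (or token fetch failed): fall back to unauthenticated mode.
    return {"base_url": BASE_URL, "delay": PUBLIC_DELAY_S, "headers": {}}
```

Deriving all three values from the same condition keeps a request from mixing, say, the OAuth host with the slower public delay, which is the class of bug the two fixes address.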

- Deleted src/create_context_graph/domains/graph-data-community.yaml
- product-discovery.yaml: Subreddit recolored #f97316 → #fb923c (entity_types + visualization)
- reddit_connector.py:247: docstring now references product-discovery
- name_pools.py: added product-discovery to DOMAIN_INDUSTRY_POOL

Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>

jpadams · 2026-05-14T19:23:27Z

Lexical-layer research: Document, multi-connector composition, and
connector-shaped domain entities

Context

Coming out of the reddit-connector review, two ontologies turned out to
have connector-specific entities baked in (Post/Comment/Subreddit in
product-discovery). The user wants to understand:

- When the :Document node is used today
- Whether mix-and-match of connectors composes cleanly into a single graph
- Which other ontologies have the same connector-leak smell

This is a research pass to inform whether and how to introduce a proper
lexical layer. No code changes yet — pick a direction afterward.

Findings

:Document today — a half-built lexical layer

Created only by the ingester, not by any ontology.

- Scaffold-time: src/create_context_graph/ingest.py:130 (MERGE (d:Document
  {title: $title})) and the driver-fallback at :296.
- Generated-project runtime:
  src/create_context_graph/templates/backend/shared/generate_data.py.j2:112.
- Not declared in _base.yaml or any domain YAML. Schema view doesn't see
  it; agent tools aren't generated for it.

Every major connector emits Documents. Each maps a source artifact to one
Document with {title, content, metadata}:

┌──────────────────┬──────────────────────────────┬─────────────────────────────────────────────┐
│ Connector        │ Artifact → Document          │ Source                                      │
├──────────────────┼──────────────────────────────┼─────────────────────────────────────────────┤
│ Gmail            │ email subject + snippet      │ gmail_connector.py:225                      │
│ Slack            │ long message body            │ slack_connector.py:167                      │
│ GitHub           │ issue/PR body                │ github_connector.py:199                     │
│ Linear           │ project updates, Linear Docs │ linear_connector.py:529, 826                │
│ Jira             │ issue description            │ jira_connector.py:200                       │
│ Reddit           │ post title+body              │ reddit_connector.py:731                     │
│ Claude Code      │ session transcript           │ claude_code_connector.py:356                │
│ Google Workspace │ Drive files, comment threads │ google_workspace_connector.py (multi-stage) │
└──────────────────┴──────────────────────────────┴─────────────────────────────────────────────┘

So yes, conceptually the user's question is answered: messages/issue
bodies/PR bodies/emails/Reddit posts all do become :Document nodes today.

But the Document layer is underbuilt:

- MENTIONS is lazy text-CONTAINS, not eager: ingest.py:159 runs d.content
  CONTAINS e.name after entities are loaded. Brittle for non-trivial names,
  no positional provenance ("this entity came from this span").
- MERGE key is {title} only (ingest.py:130), not {title, domain}. Two
  domains importing a doc titled "Project Plan" collide. This is also true
  for DecisionTrace (ingest.py:176).
- No chunking: Document.content is one flat blob; connectors truncate at
  4k–10k chars (reddit_connector.py:733: content[:4000],
  claude_code_connector.py: content[:10000]). No :Chunk exists anywhere in
  the codebase.
- No first-class author/container/thread: connectors stuff author,
  subreddit, post_id, channel, permalink into an opaque metadata dict. None
  are graph citizens linked to the Document node.
- Frontend shows them (DocumentBrowser.tsx.j2, GET /documents, GET
  /documents/{title}), but the agent does not have document-aware tools; it
  can only reach them via raw execute_cypher().
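The {title}-only MERGE collision is easy to reproduce without a database. The sketch below simulates MERGE with a dictionary keyed on the chosen properties; it is an illustration of the keying problem, not the code in ingest.py, and `merge_nodes` is a hypothetical helper.

```python
def merge_nodes(records, key_fields):
    """Mimic MERGE: records sharing the same key collapse into one node."""
    nodes = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        # Later records overwrite properties of the node they merge into.
        nodes.setdefault(key, {}).update(record)
    return nodes


documents = [
    {"title": "Project Plan", "domain": "healthcare", "content": "Clinic rollout"},
    {"title": "Project Plan", "domain": "retail-ecommerce", "content": "Store rollout"},
]

title_only = merge_nodes(documents, ("title",))
title_and_domain = merge_nodes(documents, ("title", "domain"))

print(len(title_only))        # 1
print(len(title_and_domain))  # 2
```

With the title-only key, the second document silently replaces the first one's properties in this simulation; adding domain to the key keeps both.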

Mix-and-match composition — supported in shape, broken in identity

CLI accepts multiple --connector flags (cli.py:57); the generated
import_data.py.j2 runs them serially and concatenates results (lines
120–136). So you can scaffold --connector github --connector slack --domain
software-engineering. But entities don't reconcile across sources:

- Entity MERGE = {name, domain} (ingest.py:260): Alice's GitHub username
  aliceg ≠ her Gmail "Alice G" ≠ her Slack display name. Three Person nodes.
- Document MERGE = {title} only: cross-source title collisions are silent.
- No identity reconciliation logic exists: no email → Person resolver, no
  fuzzy name match, no canonical-identity pass.
- Cross-connector linking is one ad-hoc regex:
  google_workspace_connector.py:86, 1352–1405 scans for Linear-style refs
  (ENG-123) in Drive comments/Gmail subjects/meetings and creates
  RELATES_TO_ISSUE edges. Not generalized. GitHub PR refs (owner/repo#123)
  would need their own pattern.

Net: mix-and-match is a shape that runs without crashing, not a join that
produces a coherent graph.
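As a rough idea of what generalizing the single cross-reference regex could look like, here is a sketch of a small pattern registry. The registry name, the keys, and both patterns are assumptions written for this example; they are not copied from google_workspace_connector.py.

```python
import re

# Hypothetical registry: one compiled pattern per reference style.
XREF_PATTERNS = {
    # Linear-style keys such as ENG-123 (Jira keys have the same shape).
    "linear_issue": re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b"),
    # GitHub-style refs such as owner/repo#123.
    "github_ref": re.compile(r"\b[\w.-]+/[\w.-]+#\d+\b"),
}


def find_cross_references(text):
    """Return (kind, matched_text) pairs for every registered pattern."""
    refs = []
    for kind, pattern in XREF_PATTERNS.items():
        for match in pattern.finditer(text):
            refs.append((kind, match.group(0)))
    return refs


print(find_cross_references("Fixes ENG-123, see neo4j-labs/create-context-graph#35"))
# [('linear_issue', 'ENG-123'), ('github_ref', 'neo4j-labs/create-context-graph#35')]
```

Because Linear and Jira keys share a shape, a real registry would probably also need to know which issue tracker a given project uses before creating edges.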

Connector-leak survey across all 23 domains

Severity: 🔴 High
Domain: product-discovery
Suspicious labels: Post, Comment, Subreddit (+ Post.permalink,
Post.subreddit, Comment.post_id)
Verdict: Pretends generic, hardcoded to Reddit.
────────────────────────────────────────
Severity: 🟠 Medium
Domain: software-engineering
Suspicious labels: PullRequest, Issue, Repository (+ pr_id, issue_id,
Repository.visibility, Deployment.TRIGGERED_BY PullRequest)
Verdict: Portable in concept, GitHub-specific in properties.
────────────────────────────────────────
Severity: 🟢 By design
Domain: agent-memory
Suspicious labels: Conversation, Session, ToolCall
Verdict: The domain's purpose is modeling AI agent interaction. These are
abstract patterns.
────────────────────────────────────────
Severity: 🟢 By design
Domain: personal-knowledge
Suspicious labels: Bookmark, Note, JournalEntry
Verdict: User-facing abstractions, not platform shapes.
────────────────────────────────────────
Severity: ✅ Clean
Domain: 19 others
Suspicious labels: —
Verdict: conservation, data-journalism, digital-twin, financial-services,
gaming, genai-llm-ops, gis-cartography, golf-sports, healthcare,
hospitality, manufacturing, oil-gas, options-intelligence,
product-management, real-estate, retail-ecommerce, scientific-research,
trip-planning, vacation-industry, wildlife-management

Two real leaks. agent-memory and personal-knowledge look like leaks at
first glance but their entire purpose is to model a connector's output as a
domain — those are interface-by-design.

Implications

The user's intuition — that there should be a lexical layer where
connector-shaped things like Reddit posts, GitHub issues, Slack messages
live, and domain entities are extracted from them — already has a partial
implementation: Document. It just needs three things to actually fulfill
that role:

- Promotion: declare Document (and helpers) in _base.yaml so it's part of
  the ontology, visible to the schema view, and generated into agent tools.
- Structure: give it real authorship, containment, and provenance edges
  instead of opaque metadata.
- Identity: fix the {title}-only MERGE key, and add at least an
  email-based Person reconciliation pass so cross-connector composition
  produces one Alice, not three.

The two leaky domains (product-discovery, software-engineering) would
benefit from this: their connector-shaped entities (Post/Comment/Subreddit,
PullRequest/Issue/Repository) could be retired in favor of a lexical-layer
Document with source: "reddit" / source: "github" and authorship/container
edges, while their truly-domain entities (PainPoint, UseCase, Technology,
Service, Component, Engineer) stay where they are.

Proposed directions (pick one — none committed)

Option A — Minimal: promote Document to _base.yaml only

Declare Document as a base entity type in _base.yaml with its existing
shape (title, content, template_id, template_name, domain, plus source and
source_id for provenance). Add a MENTIONS relationship declaration.
Domain-scope the MERGE key. Generate agent tools for it.

Impact: ontology now sees Documents, agent can query them, MERGE-collision
bug fixed. Doesn't touch connectors, doesn't fix identity reconciliation,
doesn't remove connector-leak from product-discovery /
software-engineering. Lowest risk.
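If MENTIONS becomes a declared relationship, the matching behind it may be worth tightening at the same time. The sketch below contrasts plain substring containment with a boundary-aware match that also records character spans. It is an assumption about one possible approach, and `find_mentions` is a hypothetical helper, not part of ingest.py.

```python
import re


def find_mentions(content, entity_names):
    """Find whole-name mentions and record where each one occurs."""
    mentions = []
    for name in entity_names:
        # Lookarounds reject matches embedded in a longer word and still
        # work for names that end in punctuation, such as "C++".
        pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
        for match in pattern.finditer(content):
            mentions.append(
                {"entity": name, "start": match.start(), "end": match.end()}
            )
    return mentions


text = "Google Cloud only"
print("Go" in text)                  # True: substring containment over-matches
print(find_mentions(text, ["Go"]))   # []
```

The recorded start and end offsets are what would let a MENTIONS edge carry positional provenance.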

Option B — Document + lexical primitives

Option A plus declare Chunk, Author, Container
(channel/subreddit/repo/folder/thread) in _base.yaml with
NEXT_CHUNK/PART_OF/AUTHORED_BY/IN_CONTAINER relationships. Update ingest.py
and connectors to populate them. Now Post/Comment in product-discovery
could be retired in favor of Document + Container("subreddit") + Author.

Impact: real lexical layer, mix-and-match starts to compose meaningfully
(one Container per source system, one Author per real person if identity
reconciliation is added). Touches every connector. Bigger change.
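To make the Chunk primitive in Option B more concrete, here is a sketch of fixed-size chunking with overlap that emits PART_OF and NEXT_CHUNK relationships. The chunk size, the overlap, the dictionary shapes, and the edge direction (chunk to document) are all assumptions made for illustration; nothing like this exists in the codebase today.

```python
def chunk_document(title, content, size=1000, overlap=100):
    """Split content into overlapping chunks and link them together."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks = []
    step = size - overlap
    for start in range(0, len(content), step):
        chunks.append({
            "document": title,
            "index": len(chunks),
            "start": start,  # character offset, kept for provenance
            "text": content[start:start + size],
        })
        if start + size >= len(content):
            break  # this chunk already reaches the end of the content
    relationships = []
    for chunk in chunks:
        relationships.append(
            {"type": "PART_OF", "source": (title, chunk["index"]), "target": title}
        )
    for prev, nxt in zip(chunks, chunks[1:]):
        relationships.append({
            "type": "NEXT_CHUNK",
            "source": (title, prev["index"]),
            "target": (title, nxt["index"]),
        })
    return chunks, relationships
```

Chunking this way would replace the current truncation, so content past the 4k or 10k cutoff would no longer be dropped.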

Option C — B + identity reconciliation pass

Option B plus add a post-ingest reconciliation step: collapse Person nodes
that share email, then optionally fuzzy-match on name + container
co-occurrence. Generalize the Linear-regex into a configurable
cross-reference pattern registry.

Impact: mix-and-match actually works — GitHub + Slack + Gmail in the same
domain produces one Alice. Largest change; affects ingestion model.
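The first stage of Option C, collapsing Person records that share an email, could look roughly like the sketch below. The record shape (name, email, source) and the function name `reconcile_by_email` are assumptions; the fuzzy-matching stage is deliberately left out.

```python
from collections import defaultdict


def reconcile_by_email(people):
    """Group Person records by normalized email; keep the rest separate."""
    groups = defaultdict(list)
    unmatched = []
    for person in people:
        email = (person.get("email") or "").strip().lower()
        if email:
            groups[email].append(person)
        else:
            # No email to key on: left for a later fuzzy-matching stage.
            unmatched.append(person)
    merged = []
    for email, members in groups.items():
        merged.append({
            "email": email,
            "name": members[0]["name"],  # first-seen name as canonical
            "aliases": sorted({m["name"] for m in members}),
            "sources": sorted({m["source"] for m in members}),
        })
    return merged, unmatched
```

Records without an email, which is common for chat display names, stay unmatched here, which is why the option also proposes a fuzzy pass on name and container co-occurrence.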

Recommendation

If the goal is to validate the architectural idea before committing to a
refactor, Option A is the right next step. It unblocks ontology-level
visibility of Document and the domain-scoped MERGE fix without disturbing
connectors, and it would expose any latent issues (e.g. agent tools that
assume Document is in the ontology) before the bigger change.

Once Option A lands and is stable, evaluate Option B's blast radius by
trying it on one domain — product-discovery is the smallest test case since
it's the most clearly leaked.

jpadams · 2026-05-14T19:26:55Z

Based on #35 (comment), I recommend that we remove the product-discovery domain/ontology from this PR, since it is mostly about Reddit structures rather than product concepts such as ICP, moat, use-case/feature, market, GTM_motion, and competitors, and just use the Reddit connector to produce "Document" nodes like the other connectors.

SquireGraphs added 2 commits May 13, 2026 10:02

Merge branch 'main' into reddit-connector

53c76b8

jexp requested a review from Copilot May 13, 2026 19:31

Copilot started reviewing on behalf of jexp May 13, 2026 19:31

Copilot AI reviewed May 13, 2026


SquireGraphs and others added 6 commits May 13, 2026 12:42

Merge branch 'reddit-connector' of https://github.com/SquireGraphs/cr…

03b060d

…eate-context-graph into reddit-connector

Fix Copilot found bug in Reddit Connector.py

83cbaa7

Copilot found two bugs: one when the Anthropic API key is instantiated but not set, and another when pp/uc is null or empty but the code still tries to create a relationship.

fix: if indent and uneeded f format

4cc2ab3

Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>

fix: remove extraneous imports

32e94b7

Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>

fix: add back local-file

665da92

Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>