Adds a Reddit data connector and domain ontology for graph databases#35
SquireGraphs wants to merge 11 commits into
…s on Reddit.

- connectors/reddit_connector.py: scrapes Reddit via the public JSON API (no auth required), extracts Post, Comment, Person, Subreddit, Technology, and Topic entities. Optionally enriches each post with Claude Haiku to extract PainPoint, UseCase, additional Technology, and Topic entities via LLM. Rate limit safe at ~6s between requests.
- domains/graph-data-community.yaml: domain ontology for community intelligence across r/neo4j, r/graphdatabase, r/dataengineering, r/LocalLLaMA and related communities. Includes 7 entity types, 16 relationships, 5 document templates, 5 decision traces, and 11 pre-built agent tools (e.g. get_technology_pulse, find_pain_points, sentiment_by_subreddit).

Modified files:

- config.py: adds REDDIT_DEFAULT_SUBREDDITS and REDDIT_DEFAULT_KEYWORDS constants
- connectors/__init__.py: registers RedditConnector on import
- cli.py: injects anthropic_api_key and openai_api_key into connector credentials so connectors can use them for enrichment without requiring a separate export
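The "~6s between requests" pacing described above could be enforced with a small throttle like the one below. This is only a sketch: the `Throttle` class, its `wait()` method, and the injectable `clock`/`sleep` parameters are hypothetical names chosen for illustration and are not taken from reddit_connector.py.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock    # injectable so tests do not need real time
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> float:
        """Sleep until min_interval has passed since the last call.

        Returns the number of seconds slept (0.0 if no sleep was needed).
        """
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

A connector would call `throttle.wait()` immediately before each HTTP request, with `Throttle(6.0)` for the unauthenticated public API.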
Pull request overview
Adds a new Reddit ingestion path and a dedicated “graph-data-community” ontology to build a community-intelligence knowledge graph from public subreddit discussions, with optional LLM enrichment for higher-level signals (pain points/use cases/topics/technologies).
Changes:
- Added graph-data-community domain ontology with entities, relationships, document templates, decision traces, and agent tools.
- Implemented a RedditConnector that scrapes Reddit’s public JSON endpoints and normalizes posts/comments/authors into the fixture schema (optionally enriched via Anthropic).
- Wired the connector into the registry and CLI flow; added default subreddit/keyword configuration.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/create_context_graph/domains/graph-data-community.yaml | New domain ontology for graph/data community intelligence built from Reddit discussions. |
| src/create_context_graph/connectors/reddit_connector.py | New connector that scrapes Reddit search/comments JSON endpoints and outputs normalized entities/relationships/documents, with optional LLM enrichment. |
| src/create_context_graph/connectors/__init__.py | Registers the new reddit connector via import side effects. |
| src/create_context_graph/config.py | Adds default subreddit + keyword lists for the Reddit connector. |
| src/create_context_graph/cli.py | Injects Anthropic/OpenAI API keys into connector credentials to support optional enrichment. |
Comments suppressed due to low confidence (2)
src/create_context_graph/config.py:90
- Trailing whitespace after the comma will be flagged by linters/formatters and can cause CI style checks to fail. Remove the extra space at end of line.
"apache",
src/create_context_graph/connectors/reddit_connector.py:545
- This relationship is appended even when uc becomes an empty string after strip(). That will create relationships with an empty target_name, which won’t match any node during ingest (and can lead to misleading relationship counts). Only append DEMONSTRATES when uc is truthy.
"type": "DEMONSTRATES",
"source_name": post_name,
"source_label": "Post",
"target_name": uc,
"target_label": "UseCase",
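The guard suggested in the comment above might look roughly like this. It is a minimal sketch, not the connector's actual code: the function name `use_case_relationships` and its parameters are hypothetical, and only the relationship dict keys are taken from the snippet quoted in the review.

```python
from typing import Dict, Iterable, List, Optional


def use_case_relationships(
    post_name: str, use_cases: Iterable[Optional[str]]
) -> List[Dict[str, str]]:
    """Build DEMONSTRATES relationships, skipping null or blank use cases."""
    relationships: List[Dict[str, str]] = []
    for raw in use_cases:
        uc = (raw or "").strip()
        if not uc:
            # An empty target_name would never match a node during ingest.
            continue
        relationships.append(
            {
                "type": "DEMONSTRATES",
                "source_name": post_name,
                "source_label": "Post",
                "target_name": uc,
                "target_label": "UseCase",
            }
        )
    return relationships
```

The same check would apply to the PainPoint branch, since both values come from LLM output that may be empty.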
…domain

PR Summary: Refactors the Reddit connector to be a general-purpose product discovery and community intelligence tool, removing hardcoded graph/data community references so any team can adopt it for their own product or technology.

Changes:

- connectors/reddit_connector.py: removed the hardcoded KEYWORD_TECH_MAP that mapped specific graph/data technology names. Keywords are now title-cased dynamically at runtime, so whatever keywords a user configures become the seed product/technology nodes. The LLM enrichment prompt is rewritten to be domain-agnostic, extracting pain points, use cases, topics, and product mentions from any community discussion.
- config.py: defaults trimmed to a minimal starting point: one subreddit (neo4j) and three keywords (rag, agentic ai, knowledge graph). This gives new users a working example without prescribing a large fixed scope.
- domains/product-discovery.yaml: new generic domain ontology for product discovery and community intelligence. Same entity structure as graph-data-community (Post, Comment, Subreddit, Person, Technology, PainPoint, UseCase, Topic) but with neutral language, generic agent tools (get_product_pulse, find_pain_points, product_co_occurrence), and a system prompt framed around product and developer relations teams rather than graph database practitioners.
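To illustrate the dynamic title-casing described in this commit, here is a rough sketch of how configured keywords could become seed Technology nodes. The function name `seed_technologies` and the entity dict shape are assumptions for illustration; the commit message only states that keywords are title-cased at runtime.

```python
from typing import Dict, Iterable, List


def seed_technologies(keywords: Iterable[str]) -> List[Dict[str, str]]:
    """Turn configured keywords into de-duplicated Technology seed entities."""
    seen = set()
    entities: List[Dict[str, str]] = []
    for keyword in keywords:
        name = keyword.strip().title()  # "knowledge graph" -> "Knowledge Graph"
        if not name or name in seen:
            continue  # skip blanks and repeated keywords
        seen.add(name)
        entities.append({"label": "Technology", "name": name})
    return entities
```

One consequence of plain `str.title()` is that acronyms lose their casing: "rag" becomes "Rag" and "agentic ai" becomes "Agentic Ai". If that matters for merging with other domains, an explicit override map for a few names may still be needed.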
…eate-context-graph into reddit-connector
Copilot found two bugs: one when the Anthropic API key is instantiated but not set, and another when pp/uc is null or empty but the code still tries to create a relationship.
Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>
Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>
Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>
"Reddit import complete: %d entities, %d relationships, %d documents",
total_entities, len(relationships), len(documents),
)
return NormalizedData(
Do we create any indexes/embeddings for the data?
emoji: "🕸️"

entity_types:
  - label: Post
Do we need to repeat them in code if we can configure them from the outside?
Make sure to align with the other domains so that labels and keys (properties) align and can merge together.
reddit_connector.py: two remaining OAuth fixes:

- End-of-subreddit delay now uses self._request_delay (respects OAuth rate limit of 1s vs public 6s)
- _fetch_post_comments() now picks the correct base URL (OAUTH_URL vs BASE_URL), uses self._comment_delay, and passes self._auth_headers through to _get_json()

tests/test_connectors.py: three new test cases + fixes:

- test_credential_prompts: now also asserts client_id, client_secret, reddit_username, reddit_password are present and that reddit_password is marked secret: True
- test_authenticate_defaults: adds assertions that _auth_headers is falsy and API keys are None by default
- test_authenticate_oauth_sets_auth_headers: patches _fetch_oauth_token to return a token and verifies the Authorization: bearer <token> header gets set
- test_authenticate_oauth_falls_back_on_failure: patches _fetch_oauth_token to return None and verifies the connector gracefully falls back to unauthenticated mode
- Fixed mock_get_json signature to accept **kwargs so the extra_headers argument doesn't break it
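The OAuth-versus-public behavior that these fixes and tests cover can be sketched as a single decision point. Everything here is illustrative: `RequestSettings` and `resolve_settings` are hypothetical names, the two URL constants are assumed values (the commit names OAUTH_URL and BASE_URL but does not show them), and the 1s/6s delays are the figures quoted in the commit message.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Assumed values; the commit only names these constants.
BASE_URL = "https://www.reddit.com"
OAUTH_URL = "https://oauth.reddit.com"


@dataclass
class RequestSettings:
    base_url: str
    request_delay: float  # seconds to wait between requests
    auth_headers: Dict[str, str] = field(default_factory=dict)


def resolve_settings(fetch_token: Callable[[], Optional[str]]) -> RequestSettings:
    """Use OAuth settings when a token is available, else public settings."""
    token = fetch_token()
    if token:
        return RequestSettings(
            base_url=OAUTH_URL,
            request_delay=1.0,
            auth_headers={"Authorization": f"bearer {token}"},
        )
    # Token fetch failed or no credentials: fall back to unauthenticated mode.
    return RequestSettings(base_url=BASE_URL, request_delay=6.0)
```

Resolving base URL, delay, and headers together in one place makes it harder for a single call site, such as a comment fetcher, to mix the OAuth URL with the public delay.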
- Deleted src/create_context_graph/domains/graph-data-community.yaml
- product-discovery.yaml: Subreddit recolored #f97316 → #fb923c (entity_types + visualization)
- reddit_connector.py:247: docstring now references product-discovery
- name_pools.py: added product-discovery to DOMAIN_INDUSTRY_POOL

Signed-off-by: Jeremy Adams <jeremy.adams@neo4j.com>
Lexical-layer research: Document, multi-connector composition, and Context

Coming out of the reddit-connector review, two ontologies turned out to

This is a research pass to inform whether and how to introduce a proper

Findings

Created only by the ingester, not by any ontology.

Every major connector emits Documents. Each maps a source artifact to one

So yes, conceptually the user's question is answered: messages/issue

But the Document layer is underbuilt:

CLI accepts multiple --connector flags (cli.py:57); the generated

Net: mix-and-match is a shape that runs without crashing, not a join that

Severity: 🔴 High

Two real leaks. agent-memory and personal-knowledge look like leaks at

Implications

The user's intuition, that there should be a lexical layer where

The two leaky domains (product-discovery, software-engineering) would

Proposed directions (pick one; none committed)

Option A (minimal): promote Document to _base.yaml only

Declare Document as a base entity type in _base.yaml with its existing

Impact: ontology now sees Documents, agent can query them, MERGE-collision

Option B: Document + lexical primitives

Option A plus declare Chunk, Author, Container

Impact: real lexical layer, mix-and-match starts to compose meaningfully

Option C: B + identity reconciliation pass

Option B plus add a post-ingest reconciliation step: collapse Person nodes

Impact: mix-and-match actually works: GitHub + Slack + Gmail in the same

Recommendation

If the goal is to validate the architectural idea before committing to a

Once Option A lands and is stable, evaluate Option B's blast radius by
Based on #35 (comment), I recommend that we remove the product-discovery domain/ontology from this PR, since it is mostly about Reddit structures and not about product concepts such as ICP, moat, use-case/feature, market, GTM_motion, competitors, etc. Instead, just use the Reddit connector to produce "Document" nodes like the other connectors.