Commit a5805a7

Construct reddit scrap foundation. (#280)
* Construct reddit scrap foundation.
* Fix CI lint fail.
* Update settings.py.
* Fix due to coderabbitai review.
1 parent 8b4cba2 commit a5805a7

26 files changed

Lines changed: 749 additions & 1 deletion

.env.example

Lines changed: 11 additions & 0 deletions
@@ -287,6 +287,17 @@ DATABASE_URL=postgres://user:password@localhost:5432/boost_dashboard
 # PINECONE_DISCORD_APP_TYPE=discord
 # PINECONE_DISCORD_NAMESPACE=discord-messages
 
+# ==============================================================================
+# Reddit (reddit_activity_tracker)
+# ==============================================================================
+# Register a "script" app at https://www.reddit.com/prefs/apps
+# REDDIT_CLIENT_ID=your_client_id
+# REDDIT_CLIENT_SECRET=your_client_secret
+# REDDIT_USER_AGENT=r_cpp_scraper/1.0 by u/yourusername
+#
+# Optional: minimum seconds between API requests (default 1.0, ~60 req/min)
+# REQUEST_INTERVAL=1.0
+
 # ==============================================================================
 # YouTube (cppa_youtube_script_tracker)
 # ==============================================================================
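
The three Reddit credential variables in `.env.example` ship commented out, so a fresh checkout starts with none of them set. A minimal sketch of a startup check that reports which ones are missing or blank is shown here. The helper name `missing_reddit_vars` and the `REQUIRED_VARS` tuple are hypothetical and not part of this commit; only the variable names come from the diff.

```python
import os

# Variable names taken from .env.example in this commit.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def missing_reddit_vars(environ=None):
    """Return the required Reddit variables that are unset or blank.

    Hypothetical helper: `environ` defaults to os.environ but can be any
    mapping, which keeps the check easy to test.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not (env.get(name) or "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "REDDIT_CLIENT_ID": "abc",
    "REDDIT_CLIENT_SECRET": "   ",  # whitespace only counts as missing
}
missing = missing_reddit_vars(example_env)
```

A caller could log `missing` and skip the Reddit sync when the list is non-empty, rather than failing later on the first API request.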

CONTRIBUTING.md

Lines changed: 1 addition & 0 deletions
@@ -69,6 +69,7 @@ Each Django app that has **models** provides a **`services.py`** module. This is
 | `clang_github_tracker` | `clang_github_tracker/services.py` | Clang/llvm GitHub issue, PR, and commit upserts; fetch watermarks. |
 | `boost_mailing_list_tracker` | `boost_mailing_list_tracker/services.py` | Mailing list messages and names. |
 | `cppa_slack_tracker` | `cppa_slack_tracker/services.py` | Slack teams, channels, messages, membership. |
+| `reddit_activity_tracker` | `reddit_activity_tracker/services.py` | Reddit submissions and comments. |
 | `wg21_paper_tracker` | `wg21_paper_tracker/services.py` | WG21 papers, authors, mailings. |
 
 For a full list of functions, parameter/return types, and validation (e.g. empty `name` raises `ValueError`), see **[docs/Service_API.md](docs/Service_API.md)** and the per-app docs in **[docs/service_api/](docs/service_api/)** (index: [docs/service_api/README.md](docs/service_api/README.md)). DTO protocols shared across trackers are documented in **[docs/service_api/core_protocols.md](docs/service_api/core_protocols.md)** (generated from `core/protocols.py`).

config/settings.py

Lines changed: 9 additions & 0 deletions
@@ -69,6 +69,7 @@
 "clang_github_tracker",
 "cppa_slack_tracker",
 "discord_activity_tracker",
+"reddit_activity_tracker",
 "wg21_paper_tracker",
 "cppa_youtube_script_tracker",
 "slack_event_handler",
@@ -162,6 +163,7 @@
 "boost_usage_tracker",
 "cppa_slack_tracker",
 "discord_activity_tracker",
+"reddit_activity_tracker",
 "boost_mailing_list_tracker",
 "wg21_paper_tracker",
 "cppa_youtube_script_tracker",
@@ -524,6 +526,13 @@ def _slack_team_scope_from_env():
 or "discord-together-c-cpp"
 ).strip()
 
+# Reddit configuration (for reddit_activity_tracker)
+REDDIT_CLIENT_ID = (env("REDDIT_CLIENT_ID", default="") or "").strip()
+REDDIT_CLIENT_SECRET = (env("REDDIT_CLIENT_SECRET", default="") or "").strip()
+REDDIT_USER_AGENT = (env("REDDIT_USER_AGENT", default="") or "").strip()
+# Minimum seconds between API requests (default 1.0, ~60 req/min). Env: REQUEST_INTERVAL.
+REDDIT_REQUEST_INTERVAL = env.float("REQUEST_INTERVAL", default=1.0)
+
 # WG21 Paper Tracker Configuration
 WG21_GITHUB_DISPATCH_ENABLED = env.bool("WG21_GITHUB_DISPATCH_ENABLED", default=False)
 WG21_GITHUB_DISPATCH_REPO = (env("WG21_GITHUB_DISPATCH_REPO", default="") or "").strip()
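
`REDDIT_REQUEST_INTERVAL` is described as a minimum number of seconds between API requests, so the default of 1.0 works out to at most 60 / 1.0 = 60 requests per minute. This commit only defines the setting. One way a client might honor it is sketched here, with a hypothetical `MinIntervalLimiter` class that is an assumption and does not appear in the diff. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum gap between consecutive calls to wait().

    Hypothetical helper; `interval` would come from
    settings.REDDIT_REQUEST_INTERVAL in the real project.
    """

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = float(interval)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self):
        """Sleep if needed so calls are at least `interval` apart; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay


# Demonstration with a fake clock so nothing actually sleeps.
fake_now = [0.0]


def fake_sleep(seconds):
    fake_now[0] += seconds


limiter = MinIntervalLimiter(1.0, clock=lambda: fake_now[0], sleep=fake_sleep)
first = limiter.wait()   # no previous call, so no delay
second = limiter.wait()  # immediately after the first, so a full interval
```

Calling `limiter.wait()` before each HTTP request keeps the request rate at or below the configured limit, independent of how long each response takes.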

config/test_settings.py

Lines changed: 1 addition & 0 deletions
@@ -83,6 +83,7 @@
 "boost_library_tracker",
 "clang_github_tracker",
 "discord_activity_tracker",
+"reddit_activity_tracker",
 "shared",
 ):
 (WORKSPACE_DIR / _slug).mkdir(parents=True, exist_ok=True)
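
The test settings create a per-app workspace directory for `reddit_activity_tracker`, and the Schema.md change in this commit says workspace JSON uses the LangChain Document shape (`page_content` + `metadata`) under `workspace/reddit_activity_tracker/{YYYY-MM}/`. The actual layout arrives in a later PR, so the following is only a sketch under stated assumptions: that the month comes from `created_utc`, and that `selftext` becomes `page_content`. The function names `month_dir` and `to_document` are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path

APP_SLUG = "reddit_activity_tracker"


def month_dir(workspace_dir, created_utc):
    """Return <workspace>/<app slug>/<YYYY-MM> for a Unix timestamp.

    Assumption: the month bucket is derived from created_utc, in UTC.
    """
    month = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")
    return Path(workspace_dir) / APP_SLUG / month


def to_document(submission):
    """Map a submission dict to a {page_content, metadata} dict.

    Assumption: selftext is the page content; every other field is metadata.
    """
    metadata = {key: value for key, value in submission.items() if key != "selftext"}
    return {"page_content": submission.get("selftext", ""), "metadata": metadata}


example = {
    "reddit_id": "t3_abc123",
    "subreddit": "cpp",
    "title": "Example post",
    "selftext": "Body text",
    "created_utc": 1700000000,
}
target_dir = month_dir("workspace", example["created_utc"])
document = to_document(example)
```

Using an explicit UTC timezone avoids the month bucket shifting with the server's local time near month boundaries.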

core/_version.py

Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 # file generated by setuptools-scm; do not edit
-version = "0.1.1.dev2+g0dd532eaf.d20260527"
+version = "0.1.1.dev579+g8b4cba29b.d20260609"

docs/Schema.md

Lines changed: 47 additions & 0 deletions
@@ -946,6 +946,51 @@ erDiagram
 
 ---
 
+### 12. Reddit Activity Tracker (`reddit_activity_tracker`)
+
+Subreddit posts and comments ingested from the Reddit OAuth API. Workspace JSON uses LangChain Document format (`page_content` + `metadata`); see PR2 workspace layout under `workspace/reddit_activity_tracker/{YYYY-MM}/`. No cross-app FKs — author identity is stored as plain strings (`author`, `author_id`).
+
+```mermaid
+erDiagram
+direction LR
+RedditSubmission ||--o{ RedditComment : "has"
+
+RedditSubmission {
+int id PK
+string reddit_id "UK IX t3_*"
+string subreddit "IX"
+string author
+string author_id
+string title
+text selftext
+text selftext_html
+string url
+string permalink
+int score
+int num_comments
+int created_utc "IX"
+datetime fetched_at
+}
+
+RedditComment {
+int id PK
+string reddit_id "UK IX t1_*"
+int submission_id FK
+string parent_id "t3_* or t1_*"
+string author
+string author_id
+text body
+string url
+int score
+int created_utc "IX"
+datetime fetched_at
+}
+```
+
+**Note:** `reddit_id` on both tables is the Reddit fullname (`t3_*` for submissions, `t1_*` for comments) and is the natural key for idempotent upserts.
+
+---
+
 ## Appendix
 
 ### Appendix A: Table summary
@@ -1018,6 +1063,8 @@ erDiagram
 | **DiscordChannel** | Channel in a guild (channel_id UK, type, category, topic, sync/activity timestamps). | 11 |
 | **DiscordMessage** | Message (`message_id` UK, content, type, pin, reply_to, attachments JSON, soft-delete flags). | 11 |
 | **DiscordReaction** | Emoji aggregate per message (unique on message + emoji). | 11 |
+| **RedditSubmission** | Reddit post (`reddit_id` t3_* UK, subreddit, title, selftext, score, created_utc). | 12 |
+| **RedditComment** | Reddit comment (`reddit_id` t1_* UK, submission FK, parent_id, body, score, created_utc). | 12 |
 | **BoostDocContent** | Globally unique scraped page by content hash (url, content_hash UK, first_version_id, last_version_id, is_upserted, scraped_at). One row per unique content hash across all versions. | 10 |
 | **BoostLibraryDocumentation** | Join table: BoostLibraryVersion × BoostDocContent. Records which pages belong to each (library, version) pair. | 10 |
 
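
The Schema.md note calls `reddit_id` the natural key for idempotent upserts, while the service functions themselves are deferred to a later PR. The idea can be illustrated with the standard-library `sqlite3` module as a stand-in for the Django models. The table layout, the reduced column set, and the `upsert_submission` function are all assumptions made for this sketch, not the project's service API. In the Django ORM, `QuerySet.update_or_create` keyed on `reddit_id` would be the usual equivalent.

```python
import sqlite3


def upsert_submission(conn, reddit_id, title, score):
    """Insert a submission, or update it when reddit_id already exists.

    Hypothetical function: relies on the UNIQUE constraint on reddit_id
    and SQLite's ON CONFLICT clause (SQLite 3.24 or newer).
    """
    if not reddit_id.startswith("t3_"):
        raise ValueError(f"expected a t3_* fullname, got {reddit_id!r}")
    conn.execute(
        """
        INSERT INTO submission (reddit_id, title, score)
        VALUES (?, ?, ?)
        ON CONFLICT(reddit_id) DO UPDATE SET
            title = excluded.title,
            score = excluded.score
        """,
        (reddit_id, title, score),
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submission ("
    "id INTEGER PRIMARY KEY, "
    "reddit_id TEXT UNIQUE NOT NULL, "
    "title TEXT, "
    "score INTEGER)"
)

# Running the same upsert twice leaves one row carrying the latest score.
upsert_submission(conn, "t3_abc123", "Example post", 1)
upsert_submission(conn, "t3_abc123", "Example post", 5)
rows = conn.execute("SELECT reddit_id, title, score FROM submission").fetchall()
```

Because the fullname is stable across fetches, re-running a sync over the same time window refreshes mutable fields such as `score` without creating duplicate rows.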

docs/Service_API.md

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,7 @@ All writes to app models must go through the service layer. The API is documente
 | **clang_github_tracker** | `clang_github_tracker.services` | Upsert llvm issue/PR/commit rows; fetch watermarks. |
 | **boost_mailing_list_tracker** | `boost_mailing_list_tracker.services` | Mailing list messages and names. |
 | **cppa_slack_tracker** | `cppa_slack_tracker.services` | Slack teams, channels, messages, membership. |
+| **reddit_activity_tracker** | `reddit_activity_tracker.services` | Reddit submissions and comments. |
 | **wg21_paper_tracker** | `wg21_paper_tracker.services` | WG21 papers, authors, mailings. |
 
 ---
@@ -37,6 +38,7 @@ All writes to app models must go through the service layer. The API is documente
 - **[service_api/clang_github_tracker.md](service_api/clang_github_tracker.md)** – API for `clang_github_tracker.services`.
 - **[service_api/boost_mailing_list_tracker.md](service_api/boost_mailing_list_tracker.md)** – API for `boost_mailing_list_tracker.services`.
 - **[service_api/cppa_slack_tracker.md](service_api/cppa_slack_tracker.md)** – API for `cppa_slack_tracker.services`.
+- **[service_api/reddit_activity_tracker.md](service_api/reddit_activity_tracker.md)** – API for `reddit_activity_tracker.services`.
 - **[service_api/wg21_paper_tracker.md](service_api/wg21_paper_tracker.md)** – API for `wg21_paper_tracker.services`.
 - **[service_api/core_protocols.md](service_api/core_protocols.md)** – `core.protocols` DTO protocols (`TrackerResult`, `ActivityRecord`, `IncrementalState`).
 

docs/cross-app-dependencies.md

Lines changed: 1 addition & 0 deletions
@@ -36,6 +36,7 @@ document. `core` is excluded because it is shared infrastructure, not a peer tr
 | `clang_github_tracker` | Clang/LLVM GitHub activity | Yes |
 | `cppa_slack_tracker` | Slack teams, channels, messages | Yes |
 | `discord_activity_tracker` | Discord servers, channels, messages | Yes |
+| `reddit_activity_tracker` | Reddit subreddit submissions and comments | Yes |
 | `wg21_paper_tracker` | WG21 paper tracking | Yes |
 | `cppa_youtube_script_tracker` | YouTube video metadata and transcripts | Yes |
 | `slack_event_handler` | Slack event listener | No (no domain models) |

docs/service_api/README.md

Lines changed: 2 additions & 0 deletions
@@ -18,6 +18,7 @@ Index of all app service modules. All writes to app models must go through the s
 | [clang_github_tracker.services](clang_github_tracker.md) | clang_github_tracker | Upsert llvm issue/PR/commit rows; DB watermarks for API fetch windows. |
 | [boost_mailing_list_tracker.services](boost_mailing_list_tracker.md) | boost_mailing_list_tracker | Mailing list messages and list names. |
 | [cppa_slack_tracker.services](cppa_slack_tracker.md) | cppa_slack_tracker | Slack teams, channels, messages, and membership changes. |
+| [reddit_activity_tracker.services](reddit_activity_tracker.md) | reddit_activity_tracker | Reddit submissions and comments (upsert helpers in PR2). |
 | [wg21_paper_tracker.services](wg21_paper_tracker.md) | wg21_paper_tracker | WG21 papers, authors, and mailings. |
 | [core.protocols](core_protocols.md) | core | Runtime-checkable DTO protocols (`TrackerResult`, `ActivityRecord`, `IncrementalState`); see also [Core public API](../Core_public_API.md). |
 
@@ -37,6 +38,7 @@ Index of all app service modules. All writes to app models must go through the s
 - **clang_github_tracker** – Upsert `ClangGithubIssueItem` / `ClangGithubCommit` during sync or backfill; read `Max(github_updated_at)` / `Max(github_committed_at)` for fetch cursors.
 - **boost_mailing_list_tracker** – Mailing list message and name helpers.
 - **cppa_slack_tracker** – Slack team/channel/message persistence and membership sync.
+- **reddit_activity_tracker** – Reddit submission and comment persistence (service functions added in PR2).
 - **wg21_paper_tracker** – WG21 paper and author persistence.
 - **core.protocols** – Structural contracts for sync outcomes and activity payloads (see [core_protocols.md](core_protocols.md)).
 

docs/service_api/reddit_activity_tracker.md

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# reddit_activity_tracker.services
+
+**Module path:** `reddit_activity_tracker.services`
+**Description:** Service layer for Reddit submissions and comments. All creates/updates/deletes for this app's models must go through functions here.
+
+**Type notation:** Model types refer to `reddit_activity_tracker.models`.
+
+---
+<!-- SERVICE_API:GENERATED:START -->
+
+## Public API (generated)
+
+| Function | Parameters | Return type | Summary |
+| --- | --- | --- | --- |
+
+<!-- SERVICE_API:GENERATED:END -->
+
+## Related
+
+- [Service API index](README.md)
+- [Schema.md](../Schema.md) – Section 12: Reddit Activity Tracker.
+- [CONTRIBUTING.md](../../CONTRIBUTING.md)
