Reddit multi-subreddit support + configurable targets. #297

Open
AuraMindNest wants to merge 4 commits into cppalliance:develop from AuraMindNest:feature/multi-subreddit

@AuraMindNest AuraMindNest commented Jun 19, 2026

Collaborator

Close #292.

Summary

Extend the Reddit activity tracker from a single hardcoded target (r/cpp) to a configurable list of subreddits. Targets are set via REDDIT_SUBREDDITS (comma-separated env var; r/ prefix optional) or overridden per run with --subreddits. The collector iterates all configured subreddits in one scheduled run, using a single shared RedditSession so Reddit API rate limits apply across all targets.
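As a rough illustration of the REDDIT_SUBREDDITS format described above (comma-separated, r/ prefix optional), the parsing could look something like the sketch below. The function name `parse_subreddit_list` and its exact handling of whitespace and empty entries are assumptions for illustration, not the code in this PR; only the default list comes from the description.

```python
import os


def parse_subreddit_list(raw: str) -> list[str]:
    """Split a comma-separated string into bare subreddit names (hypothetical helper)."""
    names = []
    for part in raw.split(","):
        name = part.strip()
        # The r/ prefix is optional, so drop it when present.
        if name.lower().startswith("r/"):
            name = name[2:]
        # Skip empty entries produced by stray commas.
        if name:
            names.append(name)
    return names


# Default targets as stated in the PR description.
DEFAULT_SUBREDDITS = "cpp,cpp_questions,programming"
REDDIT_SUBREDDITS = parse_subreddit_list(
    os.environ.get("REDDIT_SUBREDDITS", DEFAULT_SUBREDDITS)
)
```

With this sketch, `parse_subreddit_list("r/cpp, cpp_questions")` and `parse_subreddit_list("cpp,cpp_questions")` yield the same list, which is the behavior the "r/ prefix optional" wording suggests.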

Per-subreddit incremental cursors replace the previous global Max(created_utc) watermarks: get_latest_submission_created_utc and get_latest_comment_created_utc accept an optional subreddit argument, and RedditIncrementalState (new protocol_impl.py) records per-subreddit submission/comment timestamps via load_incremental_state() and _incremental_state_out. Broad subreddits can be narrowed with REDDIT_SUBREDDIT_KEYWORD_FILTERS (JSON env var; default filters r/programming to boost/c++/cpp keywords). The fetcher no longer hardcodes SUBREDDIT; fetch_submissions_in_range and fetch_comments_in_range require an explicit subreddit parameter. Default targets: cpp, cpp_questions, programming. No model migration required — RedditSubmission.subreddit is already indexed.
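To make the difference between a global watermark and a per-subreddit cursor concrete, here is a minimal, Django-free sketch. It assumes rows are plain dicts with `subreddit` and `created_utc` keys and that an empty result is reported as `None`; the real helpers query the ORM, and their return value for empty tables is not stated in this PR.

```python
from typing import Iterable, Optional


def latest_created_utc(
    rows: Iterable[dict], *, subreddit: Optional[str] = None
) -> Optional[int]:
    """Return the max created_utc, optionally limited to one subreddit."""
    timestamps = [
        row["created_utc"]
        for row in rows
        # With subreddit=None every row counts, which gives the global maximum.
        if subreddit is None or row["subreddit"] == subreddit
    ]
    return max(timestamps) if timestamps else None


# Illustrative data only.
rows = [
    {"subreddit": "cpp", "created_utc": 100},
    {"subreddit": "programming", "created_utc": 500},
]
```

Here the global maximum is 500 while the cursor for `cpp` is 100. If the collector resumed `cpp` from the global value, any `cpp` posts created between 100 and 500 would presumably be skipped, which is the gap per-subreddit cursors are meant to close.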

Apps touched

  • reddit_activity_tracker (collector loop, fetcher, services, protocol_impl.py, tests)
  • config (settings, schedule YAML)
  • docs/service_api (reddit_activity_tracker.md)
  • .env.example, README.md

Test plan

  • python -m pytest (or scoped: python -m pytest <app>/tests)
  • uv run pyright (if typed code changed)
  • lint-imports (if imports or cross-app coupling changed)
  • App command smoke-tested (if collector/command changed):
python manage.py run_reddit_activity_tracker --subreddits cpp,cpp_questions
python manage.py run_reddit_activity_tracker --help

Docs / coupling

  • cross-app-dependencies.md updated (if FKs or cross-app imports changed)
  • python scripts/generate_service_docs.py run (if services.py or core/protocols.py changed)
  • App README or docs/ updated (if behavior or ops changed)

Summary by CodeRabbit

  • New Features
    • Scrapes incrementally across multiple configured Reddit subreddits (or a CLI override).
    • Adds per-subreddit keyword filtering for both submissions and comments.
  • Bug Fixes
    • Latest timestamp tracking can now be scoped to a specific subreddit (otherwise uses global maxima).
  • Documentation
    • Updated configuration examples and service/collector documentation to reflect multi-subreddit behavior and keyword filters.
    • Expanded the reddit_activity_tracker/ README with setup, settings, and command usage details.
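The per-subreddit keyword filtering mentioned in the list above might be sketched as follows. This is an assumption-heavy illustration: the helper name `filter_by_keywords`, the field names (`title`, `selftext`, `body`), case-insensitive substring matching, and the "no keywords means keep everything" rule are all guesses, not the PR's actual `_filter_submissions_by_keywords` / `_filter_comments_by_keywords` logic.

```python
def filter_by_keywords(items, keywords, fields=("title", "selftext")):
    """Keep items whose text fields mention at least one keyword (hypothetical helper)."""
    # Assumption: a subreddit without configured keywords is not filtered at all.
    if not keywords:
        return list(items)
    lowered = [keyword.lower() for keyword in keywords]
    kept = []
    for item in items:
        # Join the chosen fields into one lowercase string for matching.
        text = " ".join(str(item.get(name) or "") for name in fields).lower()
        if any(keyword in text for keyword in lowered):
            kept.append(item)
    return kept


# Illustrative data only; the keywords mirror the defaults named in the PR summary.
submissions = [
    {"title": "Boost.Asio tips", "selftext": ""},
    {"title": "Rust news", "selftext": "nothing here"},
]
matching = filter_by_keywords(submissions, ["boost", "c++", "cpp"])
```

Plain substring matching is the simplest choice, but it also matches words such as "boosted"; a word-boundary match would be stricter at the cost of handling tokens like "c++" specially.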

coderabbitai Bot commented Jun 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d671afd-dc1a-4546-a3d9-4a92a952affa

📥 Commits

Reviewing files that changed from the base of the PR and between 85aca6d and 98b8109.

📒 Files selected for processing (5)
  • README.md
  • config/settings.py
  • core/_version.py
  • reddit_activity_tracker/README.md
  • reddit_activity_tracker/management/commands/run_reddit_activity_tracker.py
✅ Files skipped from review due to trivial changes (2)
  • core/_version.py
  • README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • config/settings.py
  • reddit_activity_tracker/management/commands/run_reddit_activity_tracker.py

📝 Walkthrough

The Reddit activity tracker is extended from single-subreddit to multi-subreddit scraping. New REDDIT_SUBREDDITS and REDDIT_SUBREDDIT_KEYWORD_FILTERS settings are parsed from environment variables with documented defaults. The fetcher's methods now require an explicit subreddit argument instead of defaulting to cpp. Services gain optional subreddit scoping for incremental timestamp lookups. A new RedditIncrementalState DTO tracks per-subreddit submission and comment cursors. The management command is rewritten to loop per subreddit with keyword filtering and a --subreddits CLI override. Comprehensive module documentation and updated tests complete the changes.

Changes

Reddit Multi-Subreddit Support and Per-Subreddit Scoping

| Layer | File(s) | Summary |
| --- | --- | --- |
| Environment config and settings parsing | .env.example, config/boost_collector_schedule.yaml, config/settings.py, README.md | Adds REDDIT_SUBREDDITS (comma-separated, r/ prefix stripped, default cpp,cpp_questions,programming) and REDDIT_SUBREDDIT_KEYWORD_FILTERS (JSON-parsed dict with fallback to programming-focused defaults) settings parsed from environment, documented in .env.example, YAML schedule, and README table. |
| Fetcher: make subreddit a required argument | reddit_activity_tracker/fetcher.py | Removes the SUBREDDIT = "cpp" module constant and makes subreddit a required keyword-only argument on fetch_comments_in_range and fetch_submissions_in_range, shifting subreddit resolution to callers. |
| Services: subreddit-scoped latest UTC queries | reddit_activity_tracker/services.py, docs/service_api/reddit_activity_tracker.md | Drops the SUBREDDIT import, derives subreddit from comment_data in submission stub creation, and adds optional subreddit keyword argument to both "latest created_utc" helpers to scope aggregation per subreddit. API documentation updated to reflect new signatures. |
| RedditIncrementalState DTO | reddit_activity_tracker/protocol_impl.py | New frozen dataclass RedditIncrementalState implementing IncrementalStateDataclass, with from_subreddit_cursors factory producing checkpoint_token (sorted subreddit names), human_readable_marker, and extras dict from per-subreddit cursor maps. |
| Management command: multi-subreddit collect loop | reddit_activity_tracker/management/commands/run_reddit_activity_tracker.py | Adds CommandError and RedditIncrementalState imports; introduces helpers for subreddit resolution (raising error if none configured), keyword filter retrieval, and submission/comment filtering by keywords. Refactors __init__, load_incremental_state, and rewrites collect() into per-subreddit loop with per-subreddit cursor tracking. Adds --subreddits CLI argument. |
| Module documentation and version bump | reddit_activity_tracker/README.md, core/_version.py | Adds comprehensive module README documenting multi-subreddit ingestion workflow, configuration variables, operational commands, scheduling location, and test execution. Includes configuration table, keyword filter behavior, and subreddit deduplication notes. Version updated. |
| Tests | reddit_activity_tracker/tests/test_collector_integration.py, reddit_activity_tracker/tests/test_run_reddit_activity_tracker_command.py, reddit_activity_tracker/tests/test_services.py | Extends integration tests with multi-subreddit iteration and keyword filtering verification; adds unit tests for filter helpers; replaces command success test with test_run_command_subreddits_override validating CLI subreddit argument; adds @override_settings for command test configuration; replaces single services test with per-subreddit and global-max aggregation variants. |
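The RedditIncrementalState row above describes a frozen dataclass built from per-subreddit cursor maps. A minimal sketch of that shape, under stated assumptions, is shown below. The class name `IncrementalStateSketch`, the comma-joined token format, the wording of the marker, and the keys inside `extras` are all hypothetical; the table only says that the token is derived from sorted subreddit names and that the factory takes submission and comment cursor maps.

```python
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True)
class IncrementalStateSketch:
    """Illustrative stand-in for RedditIncrementalState (not the PR's class)."""

    checkpoint_token: str
    human_readable_marker: str
    extras: dict = field(default_factory=dict)

    @classmethod
    def from_subreddit_cursors(
        cls, *, submissions: Mapping[str, int], comments: Mapping[str, int]
    ) -> "IncrementalStateSketch":
        # Sorting makes the token independent of the order subreddits were visited in.
        names = sorted(set(submissions) | set(comments))
        return cls(
            checkpoint_token=",".join(names),
            human_readable_marker=f"{len(names)} subreddit(s): " + ", ".join(names),
            extras={
                "submission_cursors": dict(submissions),
                "comment_cursors": dict(comments),
            },
        )


# Illustrative values only.
state = IncrementalStateSketch.from_subreddit_cursors(
    submissions={"programming": 500, "cpp": 100},
    comments={"cpp": 200},
)
```

Because the dataclass is frozen, a finished state cannot be reassigned field by field after the run, which suits a checkpoint that is written once per collection.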

Sequence Diagram(s)

sequenceDiagram
    participant CLI as Management Command
    participant Collector as RedditActivityTrackerCollector
    participant Session as RedditSession
    participant Svc as services
    participant State as RedditIncrementalState

    CLI->>Collector: collect() with resolved subreddits list
    loop for each subreddit
        Collector->>Svc: get_latest_submission_created_utc(subreddit=name)
        Collector->>Svc: get_latest_comment_created_utc(subreddit=name)
        Collector->>Session: fetch_submissions_in_range(subreddit=name)
        Session-->>Collector: submissions[]
        Collector->>Session: fetch_comments_in_range(subreddit=name)
        Session-->>Collector: comments[]
        Collector->>Collector: _filter_submissions_by_keywords(keywords)
        Collector->>Collector: _filter_comments_by_keywords(keywords)
        Collector->>Collector: upsert records, write JSON, record cursor
    end
    Collector->>State: from_subreddit_cursors(submissions=..., comments=...)
    State-->>Collector: _incremental_state_out (checkpoint_token, extras)
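
The flow in the diagram can be approximated in a few lines of Python. This is a sketch under assumptions, not the PR's `collect()`: `StubSession` stands in for the shared RedditSession, the positional `start_utc` / `end_utc` parameters are invented for illustration, and filtering and upserts are reduced to a comment. What it does reflect from the PR is the keyword-only, required `subreddit` argument and the single session reused for every target.

```python
class StubSession:
    """Stand-in for the shared RedditSession; records each call it receives."""

    def __init__(self):
        self.calls = []

    def fetch_submissions_in_range(self, start_utc, end_utc, *, subreddit):
        self.calls.append(("submissions", subreddit))
        return [{"id": f"{subreddit}-post", "created_utc": end_utc}]

    def fetch_comments_in_range(self, start_utc, end_utc, *, subreddit):
        self.calls.append(("comments", subreddit))
        return []


def collect(session, subreddits, submission_cursors, comment_cursors, now_utc):
    """Visit each subreddit once and return the advanced cursor maps."""
    new_submissions = dict(submission_cursors)
    new_comments = dict(comment_cursors)
    for name in subreddits:
        posts = session.fetch_submissions_in_range(
            submission_cursors.get(name, 0), now_utc, subreddit=name
        )
        comments = session.fetch_comments_in_range(
            comment_cursors.get(name, 0), now_utc, subreddit=name
        )
        # Keyword filtering and upserts would happen here in the real command.
        if posts:
            new_submissions[name] = max(post["created_utc"] for post in posts)
        if comments:
            new_comments[name] = max(c["created_utc"] for c in comments)
    return new_submissions, new_comments


session = StubSession()
submission_state, comment_state = collect(
    session, ["cpp", "programming"], {"cpp": 100}, {}, 1000
)
```

Since `subreddit` is keyword-only with no default, a call such as `session.fetch_comments_in_range(0, 1000)` raises `TypeError`, so a caller can no longer fall back to a hidden `cpp` default by accident.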

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • cppalliance/boost-data-collector#280: Lays the initial reddit_activity_tracker foundation; this PR extends that integration to support multi-subreddit scraping and per-subreddit keyword filtering.

Suggested reviewers

  • wpak-ai
  • snowfox1003
  • jonathanMLDev

Poem

🐇 Hop, hop, across subreddits we go,
No more single cpp default in tow.
Keywords filter the noise away,
Per-subreddit cursors mark each foray.
The rabbit checks configs with glee —
Multi-subreddit at last, finally free! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 17.65% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Reddit multi-subreddit support + configurable targets' directly summarizes the main changes: extending the tracker from single hardcoded subreddit to configurable multiple targets. |
| Description check | ✅ Passed | The description follows the template structure with Summary, Apps touched, Test plan (all items checked), and Docs/coupling sections completed. All required information is present and comprehensive. |
| Linked Issues check | ✅ Passed | All acceptance criteria from issue #292 are met: subreddit configuration via REDDIT_SUBREDDITS, per-subreddit cursor resolution, keyword filtering, error handling for missing targets, and required code changes across services and fetcher. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to multi-subreddit support objectives. Version bump in core/_version.py is expected. No unrelated modifications detected. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
reddit_activity_tracker/tests/test_collector_integration.py (1)

280-284: ⚡ Quick win

Assert override propagation for comment fetches too.

This test validates --subreddits only on submission calls; add the same check for session.fetch_comments_in_range so both required fetcher methods are contract-tested.

Suggested test assertion
     subreddit_args = [
         call.kwargs["subreddit"]
         for call in session.fetch_submissions_in_range.call_args_list
     ]
     assert subreddit_args == ["cpp_questions", "learnprogramming"]
+    comment_subreddit_args = [
+        call.kwargs["subreddit"]
+        for call in session.fetch_comments_in_range.call_args_list
+    ]
+    assert comment_subreddit_args == ["cpp_questions", "learnprogramming"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reddit_activity_tracker/tests/test_collector_integration.py` around lines 280
- 284, The test currently validates that the subreddit override propagates to
submission fetches via session.fetch_submissions_in_range, but does not verify
the same behavior for comment fetches. Add a parallel assertion block that
extracts the subreddit arguments from
session.fetch_comments_in_range.call_args_list using the same pattern as the
existing submission assertion (accessing call.kwargs["subreddit"] for each
call), and assert that these subreddit arguments also match the expected list of
["cpp_questions", "learnprogramming"] to ensure both fetcher methods properly
receive the overridden subreddits.
reddit_activity_tracker/tests/test_services.py (1)

126-154: ⚡ Quick win

Strengthen global comment-max coverage with cross-subreddit data.

The current global-max comment assertion uses a single comment record, so it doesn’t prove cross-subreddit max selection for comments. Add a second comment on another subreddit with a higher timestamp and assert that value.

Suggested test hardening
-    baker.make(
+    submission_programming = baker.make(
         RedditSubmission,
         reddit_submission_id="t3_b",
         subreddit="programming",
         title="B",
         url="https://example.com/b",
         permalink="/r/programming/comments/b/",
         created_utc=500,
     )
     baker.make(
         RedditComment,
         reddit_comment_id="t1_b",
         submission=submission,
         parent_id="t3_a",
         url="https://example.com/c",
         created_utc=200,
     )
+    baker.make(
+        RedditComment,
+        reddit_comment_id="t1_prog",
+        submission=submission_programming,
+        parent_id="t3_b",
+        url="https://example.com/prog-c",
+        created_utc=900,
+    )
     assert services.get_latest_submission_created_utc() == 500
-    assert services.get_latest_comment_created_utc() == 200
+    assert services.get_latest_comment_created_utc() == 900
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@reddit_activity_tracker/tests/test_services.py` around lines 126 - 154, The
test test_get_latest_submission_and_comment_created_utc_global_max currently
only creates a single RedditComment record with created_utc=200, which doesn't
validate cross-subreddit maximum selection for comments. Add a second
baker.make() call to create another RedditComment on the second submission (the
one in the "programming" subreddit) with a higher created_utc value than 200,
then update the assertion for services.get_latest_comment_created_utc() to
expect this new higher timestamp value to ensure the function correctly selects
the maximum comment timestamp across all subreddits.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@config/settings.py`:
- Around line 568-575: The code assumes that _parsed_keyword_filters is a
dictionary after JSON parsing, but valid JSON like empty lists or strings will
pass json.loads() without error and then fail when calling .items() with an
AttributeError. Add a guard condition after parsing _parsed_keyword_filters to
check that it is actually a dictionary (using isinstance(dict)) before
attempting to call .items() on it in the dictionary comprehension for
REDDIT_SUBREDDIT_KEYWORD_FILTERS. If the parsed JSON is not a dictionary, either
skip setting the variable or set it to an empty dictionary.

In `@reddit_activity_tracker/management/commands/run_reddit_activity_tracker.py`:
- Around line 43-53: The _resolve_subreddit_targets function currently returns a
list of subreddit targets that may contain duplicates, leading to redundant API
calls and inflated metrics. After collecting targets from either the
command-line override (via _parse_subreddit_list) or the settings configuration
(via getattr), deduplicate the targets list before the validation check and
return statement. Convert the targets list to a set to remove duplicates, then
convert it back to a list to maintain the expected return type before returning
from the function.
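
The two inline findings above can be sketched as small helpers. Both names (`parse_keyword_filters`, `dedupe_preserving_order`) are hypothetical, and returning an empty dict for invalid input is just one of the two options the review mentions; the PR may instead fall back to its documented defaults. Note that the review proposes a set round-trip for deduplication. A set does not preserve order, so this sketch swaps in `dict.fromkeys`, which keeps first-seen order and would leave order-sensitive assertions such as `["cpp_questions", "learnprogramming"]` stable.

```python
import json


def parse_keyword_filters(raw: str) -> dict[str, list[str]]:
    """Parse a JSON env value, accepting only a top-level object (hypothetical helper)."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    # Valid JSON such as [] or "cpp" has no .items(), so reject it up front.
    if not isinstance(parsed, dict):
        return {}
    return {
        str(name): [str(keyword) for keyword in keywords]
        for name, keywords in parsed.items()
        # Assumption: entries whose value is not a list are ignored.
        if isinstance(keywords, list)
    }


def dedupe_preserving_order(targets: list[str]) -> list[str]:
    """Drop repeated names while keeping first-seen order."""
    return list(dict.fromkeys(targets))
```

For example, `dedupe_preserving_order(["cpp", "programming", "cpp"])` keeps `cpp` ahead of `programming`, so each subreddit is fetched once and in the order it was configured.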

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 36add6d1-9e91-40a1-84d4-6bd70318b223

📥 Commits

Reviewing files that changed from the base of the PR and between 74c981c and 85aca6d.

📒 Files selected for processing (12)
  • .env.example
  • README.md
  • config/boost_collector_schedule.yaml
  • config/settings.py
  • docs/service_api/reddit_activity_tracker.md
  • reddit_activity_tracker/fetcher.py
  • reddit_activity_tracker/management/commands/run_reddit_activity_tracker.py
  • reddit_activity_tracker/protocol_impl.py
  • reddit_activity_tracker/services.py
  • reddit_activity_tracker/tests/test_collector_integration.py
  • reddit_activity_tracker/tests/test_run_reddit_activity_tracker_command.py
  • reddit_activity_tracker/tests/test_services.py

Comment thread config/settings.py Outdated
Comment thread README.md Outdated
Comment thread config/settings.py Outdated
@AuraMindNest AuraMindNest requested a review from snowfox1003 June 19, 2026 19:49