Date: 2026-06-02
Boost Data Collector runs two collection paradigms in one Django project without explicit architectural separation:
- Batch (scheduled) collectors: YAML-driven schedules executed by Celery Beat → `run_scheduled_collectors_task` → `run_scheduled_collectors`, which invokes `run_*` management commands sequentially within each group batch.
- Event-driven (real-time) services: Slack Socket Mode in `slack_event_handler`, handling WebSocket callbacks and background worker threads concurrently.
Both paradigms share the same process model at deploy time (one monorepo, one PostgreSQL database, one workspace/ tree, shared INSTALLED_APPS and settings). The batch path’s sequential guarantee applies only inside a single run_scheduled_collectors invocation; it does not extend to the event-driven path. Yet both can touch the same database and filesystem state.
Production scheduling is defined in config/boost_collector_schedule.yaml. As of this ADR, that file defines five groups and ten tasks (UTC default_time per group):
| Group | default_time (UTC) | Tasks (count) |
|---|---|---|
| github | 00:05 | 6 (daily, monthly, on_release) |
| boost_library_docs | 16:20 | 1 (on_release) |
| slack | 16:30 | 1 (daily) |
| discord | 16:40 | 1 (daily) |
| mailing_list | 00:10 | 1 (daily) |
Several collectors have run_* commands but are not in the production YAML (see App classification). config/boost_collector_schedule.yaml.example shows additional patterns (e.g. run_clang_github_tracker weekly, run_wg21_paper_tracker interval) that production may adopt later.
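As a rough illustration of the group table above, the sketch below turns each group's `default_time` into one daily scheduler entry. `BeatEntry`, `build_group_entries`, and the `collectors-<group>` entry names are hypothetical; the real entries are built in `boost_collector_runner/schedule_config.py`, whose API is not shown in this ADR. Per-task frequencies (`monthly`, `on_release`, `interval`) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BeatEntry:
    """Hypothetical, simplified stand-in for one Celery Beat schedule entry."""
    name: str
    group_id: str
    hour: int
    minute: int


# Group -> default_time (UTC), copied from the production schedule table.
GROUP_DEFAULT_TIMES = {
    "github": "00:05",
    "boost_library_docs": "16:20",
    "slack": "16:30",
    "discord": "16:40",
    "mailing_list": "00:10",
}


def build_group_entries(default_times):
    """Return one daily entry per group, sorted by firing time (UTC)."""
    entries = []
    for group_id, hhmm in default_times.items():
        hour_text, minute_text = hhmm.split(":")
        entries.append(
            BeatEntry(
                name=f"collectors-{group_id}",  # assumed naming scheme
                group_id=group_id,
                hour=int(hour_text),
                minute=int(minute_text),
            )
        )
    return sorted(entries, key=lambda e: (e.hour, e.minute))


if __name__ == "__main__":
    for entry in build_group_entries(GROUP_DEFAULT_TIMES):
        print(f"{entry.hour:02d}:{entry.minute:02d} UTC -> {entry.name}")
```

Each entry carries only a `group_id`, which mirrors the idea that one Beat firing runs one group's task list; nothing in the sketch coordinates two groups whose runs overlap.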
Cross-app coupling—especially Foreign Keys into cppa_user_tracker as the identity hub—makes module boundaries expensive to move. Untested apps and the absence of versioned deprecation paths (STABILITY.md) increase the cost of delaying paradigm separation.
The sequential batch model is simple and reliable for nightly runs across many sources today, but it is a poor fit for planned growth: more sources, more organizations, and higher-frequency collection. Without documented swim lanes, every release that mixes paradigms in one deployable increases operational and correctness risk.
For system context, see Architecture overview, Architecture data flow, Workflow, and Cross-app dependencies. Bus-factor and onboarding material: BUS_FACTOR_DELIVERABLES.md, onboarding walkthroughs.
- Correctness: Concurrency and state-sharing bugs (e.g. file-queue races) must not be masked by “it works on the nightly schedule.”
- Operability: Operators need to know which process runs batch work vs real-time listeners, and what fails independently.
- Scale: More collectors, schedules, and orgs imply more parallel batch groups and possibly more event-driven entry points.
- Maintainability: Cross-app FKs and import-linter contracts (cross-app-dependencies.md) favor incremental separation over a big-bang split.
- Safety net: Weak test coverage on some apps increases the value of clear boundaries before refactors.
- Trigger: Celery Beat entries built from `boost_collector_runner/schedule_config.py` reading the YAML schedule.
- Entry point: `run_scheduled_collectors_task` → management command `run_scheduled_collectors`.
- Execution model: Within one batch, commands run one after another (`call_command` in a loop). Different YAML groups get separate Beat entries and may run in parallel on different Celery workers (`tasks.py` passes `group_id` so each Beat run executes one group’s task list).
- State: PostgreSQL via each app’s `services.py`; optional files under `workspace/<app>/`.
- Sub-mode (interval batch): Tasks with `schedule: interval` and `minutes: N` are still batch collectors (management commands), but Beat fires them every N minutes independently of group `default_time` (Workflow).
flowchart TB
subgraph batch [Batch paradigm]
Beat[Celery Beat]
Task[run_scheduled_collectors_task]
Cmd[run_scheduled_collectors]
Trackers["run_* collector commands"]
PG[(PostgreSQL)]
WS1[workspace per tracker]
Beat --> Task --> Cmd --> Trackers
Trackers --> PG
Trackers --> WS1
end
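The sequential execution model described above (commands run one after another within a group) can be sketched as a plain loop. This is a minimal sketch, not the project's runner: `run_group` and the command names `run_a`, `run_b`, `run_c` are hypothetical, and continuing after a failed command is an assumption that this ADR does not confirm. In a Django process the `call` argument could be `django.core.management.call_command`.

```python
from typing import Callable, Dict, Iterable


def run_group(commands: Iterable[str], call: Callable[[str], None]) -> Dict[str, str]:
    """Run each command in order and record its outcome.

    Assumption: a failing command is recorded and the loop continues.
    The real runner may stop at the first failure instead.
    """
    results: Dict[str, str] = {}
    for name in commands:
        try:
            call(name)  # blocks until this command finishes
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    return results


if __name__ == "__main__":
    def fake_call(name: str) -> None:
        """Stand-in for call_command; fails for one hypothetical command."""
        if name == "run_b":
            raise RuntimeError("simulated failure")

    for command, status in run_group(["run_a", "run_b", "run_c"], fake_call).items():
        print(command, status)
```

Because the loop blocks on each call, ordering is guaranteed only inside one invocation. Two groups started by separate Beat entries each run their own loop, which is why the guarantee does not extend across groups or to the event-driven path.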
- Trigger: Slack events over Socket Mode (and related Bolt handlers), not the YAML schedule.
- Entry point: `run_slack_event_handler` (long-running process).
- Execution model: Concurrent event callbacks; per-team FIFO job queue with daemon worker threads (`job_queue.py`).
- State: JSON files under `workspace/slack_event_handler/` (`state.py`); optional GitHub writes via `core.operations`. No ORM models in this app.
flowchart TB
subgraph realtime [Event-driven paradigm]
Socket[Slack Socket Mode]
Handler[slack_event_handler]
Queue[JSON state and job_queue]
WS2[workspace/slack_event_handler]
Socket --> Handler --> Queue
Handler --> WS2
end
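The event-driven execution model above (per-team FIFO queue served by daemon worker threads) can be illustrated in memory. This is a sketch under stated assumptions: `TeamJobQueues` and its methods are hypothetical names, and the queue lives in process memory, whereas the real `job_queue.py` persists its state as JSON files under `workspace/slack_event_handler/`.

```python
import queue
import threading


class TeamJobQueues:
    """One FIFO queue and one daemon worker thread per team (in-memory sketch)."""

    def __init__(self, handler):
        self._handler = handler        # called as handler(team_id, job)
        self._queues = {}              # team_id -> queue.Queue
        self._lock = threading.Lock()  # guards creation of per-team queues

    def enqueue(self, team_id, job):
        """Add a job; start the team's worker thread on first use."""
        with self._lock:
            q = self._queues.get(team_id)
            if q is None:
                q = queue.Queue()
                self._queues[team_id] = q
                threading.Thread(
                    target=self._worker, args=(team_id, q), daemon=True
                ).start()
        q.put(job)

    def _worker(self, team_id, q):
        """Process one team's jobs strictly in arrival order."""
        while True:
            job = q.get()  # blocks until a job is available
            try:
                self._handler(team_id, job)
            except Exception as exc:  # keep the worker alive after a bad job
                print(f"job failed for {team_id}: {exc}")
            finally:
                q.task_done()

    def join(self):
        """Block until every queued job has been handled."""
        with self._lock:
            queues = list(self._queues.values())
        for q in queues:
            q.join()


if __name__ == "__main__":
    queues = TeamJobQueues(lambda team, job: print(team, job))
    for number in range(3):
        queues.enqueue("T1", number)
        queues.enqueue("T2", number)
    queues.join()
```

Order is preserved within a team because each team has exactly one worker; jobs for different teams interleave freely, which is the concurrency the batch runner never has to consider.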
- `core`: Collector contracts, errors, shared operations (no domain DB).
- `cppa_user_tracker`: Identity hub; batch collectors write through it via FKs and service calls. It is shared infrastructure, not “batch” or “event-driven” by itself.
| Separated today | Not separated today |
|---|---|
| Batch orchestration isolated in boost_collector_runner | Single Django project and config/settings.py |
| Celery worker + Beat run batch tasks (docker-compose.yml) | slack_event_handler not a first-class Compose service; often run manually or alongside web (Deployment) |
| Sequential order within one run_scheduled_collectors run | Parallel across YAML groups on multiple workers |
| Event-driven code lives in its own app package | Same PostgreSQL and often same workspace/ volume mount as batch collectors |
| File locks for Slack PR-bot state within slack_event_handler | No process-level fence between batch Celery work and Socket Mode in production layout |
flowchart LR
subgraph deploy [Current production layout]
BatchSvc[celery_worker and celery_beat]
WebSvc[web gunicorn]
end
DB[(PostgreSQL)]
Redis[(Redis)]
WS[workspace volume]
BatchSvc --> DB
BatchSvc --> WS
WebSvc --> DB
WebSvc -.->|may host| Realtime[run_slack_event_handler]
Realtime --> WS
| Paradigm | Django app | Primary entry | Scheduled in production YAML? |
|---|---|---|---|
| Platform | core | (library) | N/A |
| Platform | cppa_user_tracker | run_cppa_user_tracker | No |
| Batch orchestration | boost_collector_runner | run_scheduled_collectors | N/A (runner) |
| Batch collector | github_activity_tracker | via run_boost_github_activity_tracker | Yes (via github group) |
| Batch collector | boost_library_tracker | run_boost_github_activity_tracker, collect_boost_libraries, … | Yes |
| Batch collector | boost_library_docs_tracker | run_boost_library_docs_tracker | Yes (on_release) |
| Batch collector | boost_library_usage_dashboard | run_boost_library_usage_dashboard | Yes |
| Batch collector | boost_usage_tracker | run_boost_usage_tracker, run_update_created_repos_by_language | Yes |
| Batch collector | boost_mailing_list_tracker | run_boost_mailing_list_tracker | Yes |
| Batch collector | cppa_slack_tracker | run_cppa_slack_tracker | Yes |
| Batch collector | discord_activity_tracker | run_discord_activity_tracker | Yes |
| Batch collector | cppa_pinecone_sync | run_cppa_pinecone_sync | No |
| Batch collector | clang_github_tracker | run_clang_github_tracker | No (in .example only) |
| Batch collector | wg21_paper_tracker | run_wg21_paper_tracker | No (in .example as interval) |
| Batch collector | cppa_youtube_script_tracker | run_cppa_youtube_script_tracker | No |
| Event-driven | slack_event_handler | run_slack_event_handler | No (by design) |
Future event-driven candidates (document only; no commitment): live Discord gateway ingestion, webhooks, or push-based pipelines that cannot be expressed as periodic run_* commands. Any such service should follow the same swim lane as slack_event_handler, not the YAML batch runner.
The Slack PR comment bot illustrates why in-paradigm locking matters, and why process boundaries help but do not remove shared-resource risk.
| Code path | Locking |
|---|---|
| enqueue_job | Mutates queue under modify_state (advisory file lock + in-process mutex) |
| estimated_delay_sec | Calls load_state without the file lock |
| Worker loop _worker | Peeks load_state(...)["queue"] without lock, then dequeues under modify_state |
That asymmetry is a time-of-check/time-of-use (TOCTOU) race inside the event-driven paradigm: delay estimates and empty-queue sleeps can disagree with concurrent enqueue/dequeue.
Intended fix (separate work): Run read-only queue inspection and delay simulation under state_file_lock / modify_state, or otherwise align all queue readers with the same critical section as writers.
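The intended fix can be sketched as follows: every reader takes the same lock as the writers, and the worker checks for emptiness and dequeues inside one critical section. This is a hypothetical re-implementation that only borrows the names from the table above; the real signatures in `state.py` and `job_queue.py` are not shown in this ADR. The single `threading.RLock` stands in for the advisory file lock plus in-process mutex, `dequeue_job` and `per_job_sec` are invented for illustration, and the write is not atomic.

```python
import json
import tempfile
import threading
from contextlib import contextmanager
from pathlib import Path

# Assumption: one lock models both the file lock and the in-process mutex.
_STATE_LOCK = threading.RLock()


def load_state(path: Path) -> dict:
    """Unlocked read; safe only when the caller already holds the lock."""
    if not path.exists():
        return {"queue": []}
    return json.loads(path.read_text())


@contextmanager
def modify_state(path: Path):
    """Read, mutate, and write back the state inside one critical section."""
    with _STATE_LOCK:
        state = load_state(path)
        yield state
        path.write_text(json.dumps(state))


def enqueue_job(path: Path, job: dict) -> None:
    with modify_state(path) as state:
        state["queue"].append(job)


def estimated_delay_sec(path: Path, per_job_sec: float = 2.0) -> float:
    """Locked reader: takes the same lock as the writers before inspecting the queue."""
    with _STATE_LOCK:
        return len(load_state(path)["queue"]) * per_job_sec


def dequeue_job(path: Path):
    """Empty-check and pop in one critical section (no separate unlocked peek)."""
    with modify_state(path) as state:
        return state["queue"].pop(0) if state["queue"] else None


if __name__ == "__main__":
    state_path = Path(tempfile.mkdtemp()) / "state.json"
    for number in range(3):
        enqueue_job(state_path, {"id": number})
    print(estimated_delay_sec(state_path))
    print(dequeue_job(state_path))
```

A reader that holds the lock can still be stale by the time its caller acts on the value, so the estimate stays advisory; what changes is that it is computed from a consistent snapshot rather than from a file another thread is midway through rewriting.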
Why a process boundary helps: Running run_slack_event_handler in a dedicated process (not mixed with Celery worker threads or ad hoc web usage) isolates failure domains and makes it obvious that batch sequential guarantees do not apply. It does not automatically fix the file-lock bug above.
What a process boundary does not fix: Batch collectors and the Slack handler still share PostgreSQL (including cppa_user_tracker FK writes) and may share workspace/ mounts. Cross-paradigm contention (long-running batch job vs realtime DB load) remains a scheduling and schema-design concern—see cross-app-dependencies.md.
Keep one deployable; document paradigms (this ADR) and coding rules. Lowest effort; weakest isolation.
- `collector-batch`: `celery_worker` + `celery_beat` (unchanged responsibility).
- `collector-realtime`: dedicated service running only `run_slack_event_handler`.
- `web`: Gunicorn/admin; do not run Socket Mode in the web container.
Shared: PostgreSQL, Redis, workspace/, Django settings. No identity-hub DB split in the first phase.
Extract cppa_user_tracker behind a versioned API or message bus. Maximum flexibility; highest cost given MTI/FK graph and import-linter contracts. Defer unless scale forces it.
Adopt Option B in phases (see Migration path):
- Document paradigms and app mapping (this ADR).
- Harden event-driven state handling (`enqueue_job` / `estimated_delay_sec` locking).
- Add a first-class realtime deployable in Compose/systemd docs.
- Improve batch schedule coverage and cross-paradigm DB scheduling discipline.
- Optionally tighten cross-app contracts via `core.protocols` and `STABILITY.md`.
Stay in one repository and one database until operational swim lanes prove insufficient. Do not plan a multi-repo split until import/schema coupling is reduced.
Target deployables (same repo):
flowchart LR
subgraph target [Target deployables]
BatchSvc[celery_worker and beat]
RealtimeSvc[slack_event_handler service]
WebSvc[gunicorn web]
end
DB[(PostgreSQL)]
Redis[(Redis)]
WS[workspace volume]
BatchSvc --> DB
BatchSvc --> WS
RealtimeSvc --> WS
WebSvc --> DB
- Clear vocabulary for reviews and onboarding (Architecture overview can link here).
- Safer path to more collectors and higher-frequency batch (`interval`) without overloading one mental model.
- Realtime failures and restarts do not require redeploying the entire batch stack once processes are split.
- `cppa_user_tracker` FK hub remains; batch and future realtime DB writers still coordinate through PostgreSQL.
- Import-linter and cross-app imports still constrain refactors (cross-app-dependencies.md).
- Two operational surfaces (Beat schedule + long-running listener) require monitoring and runbooks.
- Phased work does not by itself improve test coverage on untested apps.
| Phase | Scope | Outcome |
|---|---|---|
| 0 | Publish this ADR; link from docs/README.md | Shared terminology and app→paradigm map |
| 1 | Fix enqueue_job / estimated_delay_sec / worker peek locking; extend test_job_queue.py | Event-driven lane internally consistent under concurrency |
| 2 | Add slack_event_handler service to Compose/systemd; update Docker.md and Deployment.md | Process boundary without repo split |
| 3 | Schedule hygiene: add or document unscheduled run_* commands; align prod YAML with .example where intended | Batch lane completeness |
| 4 | Operational: stagger heavy batch groups vs realtime peaks; document shared-DB contention mitigations | Reduced cross-paradigm interference |
| 5 (optional) | Expand core.protocols DTOs at app edges; follow STABILITY.md for deprecations | Safer future extraction |
Phase 0 is satisfied by this document. Phases 1–5 are future implementation; they are not part of the ADR file change itself.
- `enqueue_job` race fix: Phase 1; see Example: `enqueue_job` race.
- Schedule gaps: `run_cppa_user_tracker`, `run_cppa_pinecone_sync`, `run_clang_github_tracker`, `run_wg21_paper_tracker`, `run_cppa_youtube_script_tracker`.
- Import boundaries: `lint-imports`, `scripts/list_cross_app_imports.py`.
- Workspace orphan grace: Separate file-writer races documented in Workspace.md (`WORKSPACE_ORPHAN_INVALID_JSON_GRACE_SECONDS`); analogous “don’t read partial state without coordination” lesson.
- Workflow.md — batch execution order and Celery Beat behavior
- boost_collector_runner/README.md
- slack_event_handler/README.md
- Architecture_data_flow.md — batch vs long-running in persistence table
- BUS_FACTOR_DELIVERABLES.md