|
| 1 | +# Feed Turn Module Refactor Brief |
| 2 | + |
| 3 | +Status: research / implementation handoff. |
| 4 | +Goal: make the reaction -> next meme path easier to change, test, and measure without changing product behavior. |
| 5 | + |
| 6 | +## Why This Matters |
| 7 | + |
| 8 | +The Feed Turn is the product's critical hot path: |
| 9 | + |
| 10 | +```text |
| 11 | +user taps like/dislike |
| 12 | +-> reaction is persisted |
| 13 | +-> next meme or popup is delivered |
| 14 | +-> next impression row is created |
| 15 | +-> queue refill is triggered |
| 16 | +``` |
| 17 | + |
| 18 | +Today this is spread across these modules: |
| 19 | + |
| 20 | +- `src/tgbot/handlers/reaction.py` |
| 21 | +- `src/tgbot/senders/next_message.py` |
| 22 | +- `src/tgbot/senders/meme.py` |
| 23 | +- `src/recommendations/service.py` |
| 24 | +- `src/recommendations/meme_queue.py` |
| 25 | +- `src/recommendations/candidates.py` |
| 26 | +- `src/redis.py` |
| 27 | + |
| 28 | +The target is a deeper Module: a small Interface that owns one Feed Turn while keeping the existing implementations behind adapters until parity is proven. |
| 29 | + |
| 30 | +## Current Invariants |
| 31 | + |
| 32 | +- One `user_meme_reaction` row per `(user_id, meme_id)`. |
| 33 | +- A delivered meme is considered seen when the pending `user_meme_reaction` row is inserted, even before the user reacts. |
| 34 | +- `update_user_meme_reaction()` is the idempotency seam: it only updates rows with `reaction_id IS NULL` and returns `rowcount > 0`. |
| 35 | +- Duplicate callback clicks must not trigger another next meme. |
| 36 | +- Queue candidates must be `status='ok'`, language-compatible, unseen for the user, and contain `id`, `type`, `telegram_file_id`, and `recommended_by`. |
| 37 | +- Queue refill uses a Redis lock and only generates when queue length is `<= 8`. |
| 38 | +- Redis queue key format and queued meme JSON shape must not change during the refactor. |
| 39 | + |
| 40 | +## Known Failure Modes To Preserve Or Fix Deliberately |
| 41 | + |
| 42 | +- Malformed callback data can raise before persistence because the handler pattern is broad. |
| 43 | +- Duplicate callbacks currently still run counter cache update, moderator invite check, last active update, and daily reward scheduling before DB dedupe. |
| 44 | +- If no pending reaction row exists, the feed does not advance, but pre-dedupe side effects may already have happened. |
| 45 | +- `Forbidden` in `send_new_message_with_meme()` marks the user blocked and returns `None`; the caller may still create a pending next-impression row. |
| 46 | +- `TimedOut` retries after popping a meme. If Telegram actually delivered, no pending row is created for that popped meme. |
| 47 | +- `BadRequest` disables broken media and retries; after too many failures the user gets the queue-preparing alert. |
| 48 | +- `check_queue()` logs and swallows generation failures, so users may fall through to empty queue behavior. |
| 49 | + |
| 50 | +Do not "clean these up" silently. Write characterization tests first, then decide which behaviors are bugs worth changing. |
| 51 | + |
| 52 | +## Proposed Module Shape |
| 53 | + |
| 54 | +Create a new package only after tests lock current behavior: |
| 55 | + |
| 56 | +```text |
| 57 | +src/feed_turn/ |
| 58 | + __init__.py |
| 59 | + contracts.py # TurnRequest, TurnResult, QueueSnapshot, CandidateBatch |
| 60 | + ports.py # Protocols for queue, candidates, reactions, user info, delivery, observability |
| 61 | + planner.py # pure maturity-stage / engine-plan selection |
| 62 | + refill.py # queue refill orchestration |
| 63 | + turn.py # FeedTurnService: one feed turn = react + find + deliver + record + refill |
| 64 | + adapters/ |
| 65 | + legacy_queue.py |
| 66 | + legacy_candidates.py |
| 67 | + legacy_reactions.py |
| 68 | + telegram_delivery.py |
| 69 | + user_info.py |
| 70 | +``` |
| 71 | + |
| 72 | +Keep these public functions import-compatible at first: |
| 73 | + |
| 74 | +- `src.tgbot.handlers.reaction.handle_reaction` |
| 75 | +- `src.tgbot.senders.next_message.next_message` |
| 76 | +- `src.recommendations.meme_queue.check_queue` |
| 77 | +- `src.recommendations.meme_queue.generate_recommendations` |
| 78 | +- `src.tgbot.senders.meme.send_meme_to_user` |
| 79 | + |
| 80 | +## Adapter Seams |
| 81 | + |
| 82 | +- `ReactionLedgerPort`: wraps `create_user_meme_reaction`, `update_user_meme_reaction`, and `user_meme_reaction_exists`. |
| 83 | +- `QueuePort`: wraps Redis LIST operations, queue length, queued IDs, and per-user refill lock. |
| 84 | +- `CandidateRetrieverPort`: wraps `CandidatesRetriever`. |
| 85 | +- `BlenderPort`: wraps `blend()` exactly, including `fixed_pos` and `random_seed`. |
| 86 | +- `DeliveryPort`: wraps caption creation, keyboard creation, send/edit behavior, broken media handling, and blocked-user handling. |
| 87 | +- `UserInfoPort`: wraps cached user info and language lookup. |
| 88 | +- `ObservabilityPort`: structured logs / async analytics writes, never blocking Telegram delivery. |
| 89 | + |
| 90 | +## Safe Implementation Plan |
| 91 | + |
| 92 | +1. Freeze current behavior with characterization tests. |
| 93 | + Cover reaction idempotency, duplicate callbacks, positive/negative delivery path, stale queue entries, popup branch, first-meme nudge ordering, empty queue alert, `Forbidden`, `TimedOut`, `BadRequest`, and `check_queue` lock behavior. |
| 94 | + |
| 95 | +2. Extract a pure planner. |
| 96 | + Move only the decision table from `generate_recommendations()`: cold start phases, growing, mature, moderator/admin quota, fixed positions, weights, and fallback chain. No Redis writes, SQL, or Telegram calls in this step. |
| 97 | + |
| 98 | +3. Split candidate selection from enqueue. |
| 99 | + Keep `generate_recommendations()` as the compatibility wrapper, but introduce internal `select_candidates()` and `enqueue_candidates()` so tests can exercise the selection Interface without mutating Redis. |
| 100 | + |
| 101 | +4. Add observability wrappers around the legacy path. |
| 102 | + First instrument existing behavior before moving behavior. This gives baseline metrics for parity. |
| 103 | + |
| 104 | +5. Introduce `FeedTurnService` behind `next_message()`. |
| 105 | + `next_message()` should delegate to the service through legacy adapters while preserving the old function signature. |
| 106 | + |
| 107 | +6. Move `handle_reaction()` orchestration behind the service. |
| 108 | + Only after delivery behavior is locked and monitored. |
| 109 | + |
| 110 | +7. Move secondary entry points last. |
| 111 | + Examples: empty queue alert, language reset queue clear, upload completion queue check, broadcasts, moderator manual queue edits. |
| 112 | + |
| 113 | +## Feature Flag And Rollback |
| 114 | + |
| 115 | +Add a config flag such as: |
| 116 | + |
| 117 | +```text |
| 118 | +FEED_TURN_MODULE_ENABLED=false |
| 119 | +``` |
| 120 | + |
| 121 | +Default it to false until tests and production telemetry prove parity. |
| 122 | + |
| 123 | +Rollback must be a config flip, not a data migration. Keep these stable: |
| 124 | + |
| 125 | +- Redis key format: `meme_queue:{user_id}` |
| 126 | +- queued meme JSON shape |
| 127 | +- `recommended_by` values |
| 128 | +- `user_meme_reaction` write timing |
| 129 | +- existing public function signatures |
| 130 | + |
| 131 | +Shadow mode is allowed only for pure planning/candidate selection. Do not shadow queue writes or reaction writes. |
| 132 | + |
| 133 | +## Test Gate |
| 134 | + |
| 135 | +Run this focused gate before and after each phase: |
| 136 | + |
| 137 | +```bash |
| 138 | +pytest \ |
| 139 | + tests/recommendations/test_blender.py \ |
| 140 | + tests/recommendations/test_meme_queue.py \ |
| 141 | + tests/recommendations/test_queue_correctness.py \ |
| 142 | + tests/recommendations/test_engine_contracts.py \ |
| 143 | + tests/recommendations/test_reaction_service.py \ |
| 144 | + tests/tgbot/test_reaction_handler.py \ |
| 145 | + tests/tgbot/test_first_meme_nudge.py \ |
| 146 | + tests/test_redis.py |
| 147 | +``` |
| 148 | + |
| 149 | +Add new tests before moving behavior: |
| 150 | + |
| 151 | +- `tests/feed_turn/test_planner.py` |
| 152 | +- `tests/feed_turn/test_turn_service.py` |
| 153 | +- `tests/feed_turn/test_refill.py` |
| 154 | +- `tests/tgbot/test_next_message_delivery.py` |
| 155 | + |
| 156 | +Test doubles should sit at adapter seams, not inside implementation details. |
| 157 | + |
| 158 | +## Observability Contract |
| 159 | + |
| 160 | +The refactor is not done until Feed Turn can be monitored. |
| 161 | + |
| 162 | +Use low-cardinality structured logs or a non-blocking analytics writer for always-on data. If a table is added, make it append-only and time-retained. |
| 163 | + |
| 164 | +Recommended event names: |
| 165 | + |
| 166 | +- `ff.feed_turn.started` |
| 167 | +- `ff.feed_turn.completed` |
| 168 | +- `ff.feed_turn.failed` |
| 169 | +- `ff.recs.batch.generated` |
| 170 | +- `ff.recs.engine.completed` |
| 171 | + |
| 172 | +Recommended bounded dimensions: |
| 173 | + |
| 174 | +- `outcome` |
| 175 | +- `failure_class` |
| 176 | +- `reaction_id` |
| 177 | +- `maturity_stage`: `cold_start_1`, `cold_start_2`, `cold_start_3`, `growing`, `mature`, `moderator` |
| 178 | +- `user_type` |
| 179 | +- `prev_engine` |
| 180 | +- `next_engine` |
| 181 | +- `send_method`: `new`, `edit`, `popup`, `alert` |
| 182 | +- `media_type` |
| 183 | +- `language_bucket`: `ru`, `en`, `other`, `multi` |
| 184 | +- `queue_len_bucket`: `0`, `1-2`, `3-8`, `9+` |
| 185 | + |
| 186 | +Never use `user_id`, `meme_id`, `telegram_file_id`, caption text, queue key, or source URL as metric labels. Raw IDs can exist only in sampled logs or analytics tables. |
| 187 | + |
| 188 | +Core metrics: |
| 189 | + |
| 190 | +- `feed_turn_completed_total` |
| 191 | +- `reaction_duplicate_total` |
| 192 | +- `reaction_to_next_delivery_ms` |
| 193 | +- `component_latency_ms` for reaction DB update, queue pop, caption build, Telegram send/edit, impression insert |
| 194 | +- `queue_pop_attempts` |
| 195 | +- `queue_stale_pop_total` |
| 196 | +- `queue_refill_total` |
| 197 | +- `recommendation_batch_duration_ms` |
| 198 | +- `engine_candidate_count` |
| 199 | +- `engine_empty_total` |
| 200 | +- `blend_selected_count` |
| 201 | +- `delivery_failure_total` |
| 202 | +- `continuation_rate_30m` |
| 203 | +- `next_reaction_rate` |
| 204 | +- `fast_dislike_rate` |
| 205 | + |
| 206 | +Recommended alerts: |
| 207 | + |
| 208 | +- p95 `reaction_to_next_delivery_ms` regression |
| 209 | +- queue-empty alert rate spike |
| 210 | +- stale pop rate spike |
| 211 | +- refill failure rate spike |
| 212 | +- Telegram timeout rate spike |
| 213 | +- duplicate reaction rate spike |
| 214 | +- recommendation engine empty rate spike |
| 215 | +- stats freshness lag |
| 216 | + |
| 217 | +## Optional Analytics Tables |
| 218 | + |
| 219 | +If structured logs are not enough, add these only after validating write volume. |
| 220 | + |
| 221 | +`feed_turn_event`: |
| 222 | + |
| 223 | +```text |
| 224 | +turn_id |
| 225 | +event_version |
| 226 | +started_at |
| 227 | +completed_at |
| 228 | +user_id |
| 229 | +prev_meme_id |
| 230 | +prev_engine |
| 231 | +reaction_id |
| 232 | +reaction_is_new |
| 233 | +next_meme_id |
| 234 | +next_engine |
| 235 | +outcome |
| 236 | +failure_class |
| 237 | +maturity_stage |
| 238 | +user_type |
| 239 | +language_bucket |
| 240 | +send_method |
| 241 | +media_type |
| 242 | +queue_before |
| 243 | +queue_after |
| 244 | +pop_attempts |
| 245 | +stale_pops |
| 246 | +refill_triggered |
| 247 | +latencies_ms JSONB |
| 248 | +experiments JSONB |
| 249 | +created_at |
| 250 | +``` |
| 251 | + |
| 252 | +`feed_recommendation_batch`: |
| 253 | + |
| 254 | +```text |
| 255 | +batch_id |
| 256 | +created_at |
| 257 | +user_id |
| 258 | +maturity_stage |
| 259 | +limit |
| 260 | +queue_len_before |
| 261 | +exclude_count |
| 262 | +engine_counts JSONB |
| 263 | +selected_counts JSONB |
| 264 | +fallback_used |
| 265 | +enqueued_count |
| 266 | +duration_ms |
| 267 | +lock_status |
| 268 | +error_class |
| 269 | +``` |
| 270 | + |
| 271 | +For debugging recommendation decisions, prefer sampled or failure-only decision logs modeled after `crossposting_decision_log`. |
| 272 | + |
| 273 | +## New Session Prompt |
| 274 | + |
| 275 | +Use this prompt to continue in a new session: |
| 276 | + |
| 277 | +```text |
| 278 | +We are in /Users/ohld/Documents/GitHub/ff-backend. |
| 279 | +
|
| 280 | +Read AGENTS.md and specs/feed-turn-module.md first. Then inspect the current Feed Turn hot path: |
| 281 | +- src/tgbot/handlers/reaction.py |
| 282 | +- src/tgbot/senders/next_message.py |
| 283 | +- src/tgbot/senders/meme.py |
| 284 | +- src/recommendations/service.py |
| 285 | +- src/recommendations/meme_queue.py |
| 286 | +- src/recommendations/candidates.py |
| 287 | +- src/redis.py |
| 288 | +
|
| 289 | +Goal: implement the Feed Turn Module refactor incrementally without changing behavior. |
| 290 | +
|
| 291 | +Start by writing characterization tests for the current behavior. Do not move production code until tests cover the behavior being moved. |
| 292 | +
|
| 293 | +Initial preferred slice: |
| 294 | +1. Add tests for the pure planner shape you want. |
| 295 | +2. Extract a pure planner from generate_recommendations() into src/feed_turn/planner.py. |
| 296 | +3. Keep generate_recommendations() behavior and public signature unchanged. |
| 297 | +4. Run the focused test gate from specs/feed-turn-module.md. |
| 298 | +
|
| 299 | +Architecture vocabulary: |
| 300 | +- Module: anything with an Interface and Implementation. |
| 301 | +- Interface: all facts callers must know, including invariants and error modes. |
| 302 | +- Seam: where behavior can vary without editing the caller. |
| 303 | +- Adapter: concrete implementation at a Seam. |
| 304 | +- Deep Module: lots of behavior behind a small Interface. |
| 305 | +
|
| 306 | +Constraints: |
| 307 | +- Preserve Redis queue key format and queued meme JSON shape. |
| 308 | +- Preserve user_meme_reaction semantics. |
| 309 | +- No synchronous analytics writes in the hot path. |
| 310 | +- Do not introduce a broad abstraction unless there are at least two real adapters or it hides real complexity. |
| 311 | +- Keep feature flag / rollback in mind for later phases. |
| 312 | +
|
| 313 | +Before finalizing, summarize: |
| 314 | +- what behavior is now covered by tests, |
| 315 | +- what behavior changed, if any, |
| 316 | +- which metrics or logs were added, |
| 317 | +- exact test commands run. |
| 318 | +``` |
| 319 | + |
0 commit comments