Commit f3df668

Merge pull request #118 from GeiserX/feature/skip-topic-ids
feat: exclude specific topics in forum supergroups
2 parents 22de6da + 2b6da8a commit f3df668

13 files changed

Lines changed: 2942 additions & 43 deletions

.env.example

Lines changed: 5 additions & 0 deletions
@@ -70,6 +70,11 @@ CHECKPOINT_INTERVAL=1
 # Set to false to keep existing media but skip future downloads
 SKIP_MEDIA_DELETE_EXISTING=true
 
+# Skip specific topics in forum supergroups (format: chat_id:topic_id,...)
+# Messages in matching topics are completely excluded from backup
+# Example: SKIP_TOPIC_IDS=-1001234567890:42,-1001234567890:1337
+# SKIP_TOPIC_IDS=
+
 # Hour (0-23) to recalculate backup statistics daily
 # STATS_CALCULATION_HOUR=3
 

AGENTS.md

Lines changed: 74 additions & 14 deletions
@@ -21,7 +21,7 @@ You assist developers working on telegram-archive.
 
 ## Repository & Infrastructure
 
-- **License:** mit
+- **License:** GPL-3.0
 - **CI/CD:**
 - **Commits:** Follow [Conventional Commits](https://conventionalcommits.org) format
 - **Versioning:** Follow [Semantic Versioning](https://semver.org) (semver)
@@ -83,19 +83,6 @@ Follow these conventions:
 - Add comments for complex logic only
 - Keep functions focused and testable
 
-## Testing Strategy
-
-### Test Levels
-
-- **Unit:** Unit tests for individual functions and components
-- **Integration:** Integration tests for component interactions
-
-### Frameworks
-
-Use: pytest
-
-### Coverage Target: 80%
-
 ## 🔐 Security Configuration
 
 ### Secrets Management
@@ -120,6 +107,79 @@ Use: pytest
 
 **🔍 Security Audit Recommendation:** When making changes that involve authentication, data handling, API endpoints, or dependencies, proactively offer to perform a security review of the affected code.
 
+## Architecture — Key Patterns
+
+### Module Structure
+
+- **`src/telegram_backup.py`** — Scheduled backup flow: `backup_all()` → `_backup_dialog()` → iterates messages → `_process_message()` → `_commit_batch()`. Gap filling: `_fill_gaps()` → `_fill_gap_range()`. Forum topics: `_backup_forum_topics()`.
+- **`src/listener.py`** — Real-time event handlers: `on_new_message`, `on_message_edited`, `on_message_deleted`, `on_chat_action`, `on_pinned_messages`. Instantiated with `TelegramListener(config, db, client)`.
+- **`src/config.py`** — All config from env vars. Required: `API_ID`, `API_HASH`, `PHONE_NUMBER`. Properties are lazy-parsed from env.
+- **`src/message_utils.py`** — Shared utility module. Contains `extract_topic_id(message)` used by both backup and listener.
+- **`src/db/adapter.py`** — Database operations. `src/db/models.py` — SQLAlchemy models. `src/db/base.py` — DB manager.
+
+### Forum Topic Filtering
+
+Topic IDs are extracted from `message.reply_to.reply_to_top_id` (primary) with fallback to `reply_to_msg_id`. The General topic (id=1) service messages may not carry `reply_to` metadata and can bypass filtering. The `SKIP_TOPIC_IDS` env var uses format `chat_id:topic_id,...` parsed into `dict[int, set[int]]`.
+
+### Logging Rules
+
+- **Never log chat IDs, topic IDs, or topic titles** — these are considered PII per the project's guidelines. Log only aggregated counts (e.g., "skipping N topics across M chats").
+- **Never log message content** — same PII rule applies.
+
+## CI/CD Pipeline
+
+### Lint Workflow (`.github/workflows/lint.yml`)
+
+CI runs **both** `ruff check .` AND `ruff format --check .`. Always run both locally before pushing:
+```bash
+python3 -m ruff check . && python3 -m ruff format --check .
+```
+
+### Test Workflow (`.github/workflows/tests.yml`)
+
+- Runs `pytest tests/` with `--cov=src --cov-report=xml`
+- Uploads to Codecov
+- Python 3.14 on Ubuntu
+- Web tests (test_database_viewer, test_multi_user_auth, test_v720_features) require FastAPI/pydantic — may fail locally if versions mismatch
+
+### CodeRabbit
+
+- Free OSS plan has **hourly commit rate limits** (low threshold)
+- The incremental review system marks rate-limited commits as "reviewed" even though no review was posted
+- To force a full review after rate limit expires: comment `@coderabbitai full review`
+- **Do NOT trigger repeatedly** — each trigger counts against the limit and extends the cooldown
+- Wait the **full cooldown** (check the minutes shown in the rate limit message), then trigger exactly **once**
+
+## Testing — Critical Patterns
+
+### MagicMock Truthiness Pitfall
+
+When using `MagicMock()` for Telegram message objects, any attribute access returns a truthy MagicMock. This breaks code that checks `message.reply_to` or `getattr(reply_to, "forum_topic", False)`.
+
+**ALWAYS set these on mock messages:**
+```python
+msg = MagicMock()
+msg.reply_to = None # Prevents false-positive topic filtering
+```
+
+**ALWAYS set this on mock configs that use `MagicMock()` (not real Config):**
+```python
+config.should_skip_topic = MagicMock(return_value=False)
+```
+
+### Test Style
+
+- Existing tests use `unittest.TestCase` with `MagicMock`/`AsyncMock` — follow this pattern for consistency
+- Config tests use `patch.dict(os.environ, {...}, clear=True)` — required env vars: `API_ID`, `API_HASH`, `PHONE_NUMBER`
+- Async tests use `pytest.mark.asyncio`
+- The `TelegramBackup` is instantiated via `TelegramBackup.__new__(TelegramBackup)` with mocked `db`, `client`, `config`
+
+### Frameworks
+
+Use: pytest, pytest-asyncio, pytest-cov
+
+### Coverage Target: 80%
+
 ## Alembic Migrations — Critical Reminders
 
 - **`Base.metadata.create_all(checkfirst=True)`** creates ALL tables from SQLAlchemy models at once, including tables that should be created by future Alembic migrations. This means pre-Alembic databases can have schema objects from migrations that haven't "run" yet.
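
The MagicMock truthiness pitfall that this AGENTS.md change documents can be reproduced with the standard library alone. The snippet below is a minimal sketch, not taken from the repository's test suite; the variable names `safe_msg` and `safe_config` are illustrative, and the chat and topic IDs are the example values used elsewhere in this commit.

```python
from unittest.mock import MagicMock

# A bare MagicMock creates attributes on demand, and each one is truthy.
msg = MagicMock()
print(bool(msg.reply_to))                                 # True
print(bool(getattr(msg.reply_to, "forum_topic", False)))  # True

# A bare mock config "skips" everything: the call returns another truthy mock.
config = MagicMock()
print(bool(config.should_skip_topic(-1001234567890, 42)))  # True

# Pinning the values restores the behaviour the tests expect.
safe_msg = MagicMock()
safe_msg.reply_to = None
safe_config = MagicMock()
safe_config.should_skip_topic = MagicMock(return_value=False)
print(safe_msg.reply_to)                                  # None
print(safe_config.should_skip_topic(-1001234567890, 42))  # False
```

Without the two pinned values, a test message would look like a forum message in an excluded topic and be dropped before the code under test ever runs.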

README.md

Lines changed: 10 additions & 0 deletions
@@ -258,6 +258,7 @@ The **Scope** column shows whether each variable applies to the backup scheduler
 | `PRIORITY_CHAT_IDS` | - | B | Comma-separated chat IDs to process first in all operations |
 | `SKIP_MEDIA_CHAT_IDS` | - | B | Skip media downloads for specific chats (messages still backed up with text) |
 | `SKIP_MEDIA_DELETE_EXISTING` | `true` | B | Delete existing media files and DB records for chats in skip list to reclaim storage |
+| `SKIP_TOPIC_IDS` | - | B | Skip specific topics in forum supergroups (format: `chat_id:topic_id,...`) |
 | `LOG_LEVEL` | `INFO` | B/V | Logging verbosity: `DEBUG`, `INFO`, `WARNING`/`WARN`, `ERROR` |
 | **Chat Filtering** | | | See [Chat Filtering](#chat-filtering) below |
 | `CHAT_IDS` | - | B | **Whitelist mode**: backup ONLY these chats (ignores all other filters) |
@@ -341,6 +342,15 @@ CHANNELS_INCLUDE_CHAT_IDS=-1001234567890
 
 Find a chat's ID by forwarding a message to [@userinfobot](https://t.me/userinfobot).
 
+**Topic filtering** — For forum-enabled supergroups, you can exclude specific topics without excluding the entire chat using `SKIP_TOPIC_IDS`:
+
+```bash
+# Skip topics 42 and 1337 in one chat, and topic 7 in another
+SKIP_TOPIC_IDS=-1001234567890:42,-1001234567890:1337,-1009876543210:7
+```
+
+> Note: The topic-creating service message (1 per topic) may still be backed up since it lacks `reply_to` metadata. This does not affect user-generated content.
+
 ### Real-time Listener
 
 The scheduled backup only captures new messages. To also track edits and deletions between backups, enable the real-time listener:

docker-compose.yml

Lines changed: 1 addition & 0 deletions
@@ -31,6 +31,7 @@ services:
       # PRIORITY_CHAT_IDS: ${PRIORITY_CHAT_IDS:-}
       # SKIP_MEDIA_CHAT_IDS: ${SKIP_MEDIA_CHAT_IDS:-}
       # SKIP_MEDIA_DELETE_EXISTING: ${SKIP_MEDIA_DELETE_EXISTING:-true}
+      SKIP_TOPIC_IDS: ${SKIP_TOPIC_IDS:-}
       # STATS_CALCULATION_HOUR: ${STATS_CALCULATION_HOUR:-3}
 
       # =======================================================================

docs/CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -6,6 +6,10 @@ For upgrade instructions, see [Upgrading](#upgrading) at the bottom.
 
 ## [Unreleased]
 
+### Added
+
+- **Topic filtering for forum supergroups** — New `SKIP_TOPIC_IDS` environment variable to exclude specific topics from backup while keeping the rest of the chat. Format: `chat_id:topic_id,...`. Works in both scheduled backup and real-time listener flows (#117)
+
 ## [7.2.0] - 2026-03-10
 
 ### Added

src/config.py

Lines changed: 51 additions & 0 deletions
@@ -184,6 +184,11 @@ def __init__(self):
         # Delete existing media files and records for chats in skip list (reclaim storage)
         self.skip_media_delete_existing = os.getenv("SKIP_MEDIA_DELETE_EXISTING", "true").lower() == "true"
 
+        # Skip specific topics inside forum supergroups
+        # Format: SKIP_TOPIC_IDS=-1001234567890:42,-1001234567890:1337
+        # Each entry is chat_id:topic_id — skips that topic but keeps the rest of the chat
+        self.skip_topic_ids = self._parse_topic_skip_list(os.getenv("SKIP_TOPIC_IDS", ""))
+
         # Session configuration
         self.session_name = os.getenv("SESSION_NAME", "telegram_backup")
         self.telegram_proxy = build_telegram_proxy_from_env()
@@ -363,6 +368,9 @@ def __init__(self):
         if self.skip_media_chat_ids:
             cleanup_status = "will delete existing media" if self.skip_media_delete_existing else "keeps existing media"
             logger.info(f"Media downloads skipped for chat IDs: {self.skip_media_chat_ids} ({cleanup_status})")
+        if self.skip_topic_ids:
+            total_topics = sum(len(t) for t in self.skip_topic_ids.values())
+            logger.info(f"Topic filtering: skipping {total_topics} topic(s) across {len(self.skip_topic_ids)} chat(s)")
         if self.telegram_proxy:
             logger.info("Telegram proxy enabled (type=socks5, rdns=%s)", self.telegram_proxy["rdns"])
             logger.debug(
@@ -377,6 +385,49 @@ def _parse_id_list(self, id_str: str) -> set:
             return set()
         return {int(id.strip()) for id in id_str.split(",") if id.strip()}
 
+    def _parse_topic_skip_list(self, skip_str: str) -> dict[int, set[int]]:
+        """Parse SKIP_TOPIC_IDS into {chat_id: {topic_id, ...}}.
+
+        Format: chat_id:topic_id,chat_id:topic_id,...
+        Example: -1001234567890:42,-1001234567890:1337,-1009876543210:7
+        """
+        result: dict[int, set[int]] = {}
+        if not skip_str or not skip_str.strip():
+            return result
+        for entry in skip_str.split(","):
+            entry = entry.strip()
+            if not entry:
+                continue
+            if ":" not in entry:
+                raise ValueError(f"Invalid SKIP_TOPIC_IDS entry '{entry}': expected format chat_id:topic_id")
+            chat_part, topic_part = entry.split(":", 1)
+            try:
+                chat_id = int(chat_part.strip())
+                topic_id = int(topic_part.strip())
+            except ValueError as e:
+                raise ValueError(
+                    f"Invalid SKIP_TOPIC_IDS entry '{entry}': chat_id and topic_id must be integers"
+                ) from e
+            result.setdefault(chat_id, set()).add(topic_id)
+        return result
+
+    def should_skip_topic(self, chat_id: int, topic_id: int | None) -> bool:
+        """Check if a specific topic in a chat should be skipped.
+
+        Args:
+            chat_id: Telegram chat ID (marked format)
+            topic_id: Forum topic ID (reply_to_top_id), or None for non-topic messages
+
+        Returns:
+            True if this topic should be skipped, False otherwise
+        """
+        if topic_id is None or not self.skip_topic_ids:
+            return False
+        skip_set = self.skip_topic_ids.get(chat_id)
+        if skip_set is None:
+            return False
+        return topic_id in skip_set
+
     def _get_required_env(self, key: str, value_type: type):
         """
         Get a required environment variable and convert to specified type.
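
To see how the two new `Config` methods behave together, the sketch below restates them as free functions so it runs without the rest of the project. The function names `parse_topic_skip_list` and `should_skip_topic` (without `self`) are illustrative stand-ins, and the error messages are shortened; the parsing rules follow the hunk above.

```python
from typing import Optional


def parse_topic_skip_list(skip_str: str) -> dict[int, set[int]]:
    """Group 'chat_id:topic_id' entries by chat (standalone sketch)."""
    result: dict[int, set[int]] = {}
    for entry in skip_str.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty input and trailing commas
        if ":" not in entry:
            raise ValueError(f"Invalid entry '{entry}': expected chat_id:topic_id")
        chat_part, topic_part = entry.split(":", 1)
        # int() raises ValueError for non-numeric parts
        result.setdefault(int(chat_part), set()).add(int(topic_part))
    return result


def should_skip_topic(skip: dict[int, set[int]], chat_id: int, topic_id: Optional[int]) -> bool:
    """Return True only for a known (chat, topic) pair; None never matches."""
    if topic_id is None:
        return False
    return topic_id in skip.get(chat_id, set())


skip = parse_topic_skip_list("-1001234567890:42,-1001234567890:1337,-1009876543210:7")
print(skip == {-1001234567890: {42, 1337}, -1009876543210: {7}})  # True
print(should_skip_topic(skip, -1001234567890, 42))    # True
print(should_skip_topic(skip, -1001234567890, 7))     # False: topic 7 belongs to the other chat
print(should_skip_topic(skip, -1001234567890, None))  # False: not a topic message
```

Keying the lookup by chat first means a topic ID is only ever matched inside its own chat, so the same numeric topic ID in two supergroups does not collide.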

src/listener.py

Lines changed: 22 additions & 15 deletions
@@ -36,6 +36,7 @@
 from .avatar_utils import get_avatar_paths
 from .config import Config
 from .db import DatabaseAdapter, create_adapter
+from .message_utils import extract_topic_id
 from .realtime import NotificationType, RealtimeNotifier
 
 logger = logging.getLogger(__name__)
@@ -288,6 +289,9 @@ def __init__(self, config: Config, db: DatabaseAdapter, client: TelegramClient |
                 logger.info(" LISTEN_NEW_MESSAGES_MEDIA: false (media on scheduled backup)")
         else:
             logger.info(" LISTEN_NEW_MESSAGES: false (saved on scheduled backup)")
+        if config.skip_topic_ids:
+            total = sum(len(t) for t in config.skip_topic_ids.values())
+            logger.info(f" SKIP_TOPIC_IDS: {total} topic(s) excluded across {len(config.skip_topic_ids)} chat(s)")
         logger.info(f" Protection threshold: {config.mass_operation_threshold} ops triggers block")
         logger.info(f" Protection window: {config.mass_operation_window_seconds}s")
         logger.info(f" Buffer delay: {config.mass_operation_buffer_delay}s (operations held before applying)")
@@ -655,9 +659,12 @@ async def on_message_edited(event: events.MessageEdited.Event) -> None:
             if not self._should_process_chat(chat_id):
                 return
 
-            self.stats["edits_received"] += 1
-
+            # Skip edits in excluded forum topics
             message = event.message
+            if self.config.should_skip_topic(chat_id, extract_topic_id(message)):
+                return
+
+            self.stats["edits_received"] += 1
             new_text = message.text or ""
             edit_date = message.edit_date
 
@@ -790,15 +797,24 @@ async def on_new_message(event: events.NewMessage.Event) -> None:
             if not self._should_process_chat(chat_id):
                 return
 
+            # Save the message to database
+            message = event.message
+
+            # Extract topic ID early for filtering and message_data
+            # v6.2.0: reply_to_top_id added for forum topic threading
+            reply_to_top_id = extract_topic_id(message)
+
+            # Skip messages in excluded forum topics
+            if self.config.should_skip_topic(chat_id, reply_to_top_id):
+                logger.debug(f"⏭️ Skipping message in excluded topic {reply_to_top_id}: chat={chat_id}")
+                return
+
             self.stats["new_messages_received"] += 1
 
-            # If LISTEN_NEW_MESSAGES is disabled, we're done
+            # If LISTEN_NEW_MESSAGES is disabled, just track for edits/deletions
             if not self.config.listen_new_messages:
                 return
 
-            # Save the message to database
-            message = event.message
-
             # Ensure chat exists in database (prevents FK violation for new chats)
             chat_entity = await event.get_chat()
             if chat_entity:
@@ -824,15 +840,6 @@ async def on_new_message(event: events.NewMessage.Event) -> None:
                 }
                 await self.db.upsert_user(user_data)
 
-            # Extract message data
-            # v6.0.0: media_type, media_id, media_path removed - stored in media table
-            # v6.2.0: reply_to_top_id added for forum topic threading
-            reply_to_top_id = None
-            if message.reply_to and getattr(message.reply_to, "forum_topic", False):
-                reply_to_top_id = getattr(message.reply_to, "reply_to_top_id", None)
-                if reply_to_top_id is None:
-                    reply_to_top_id = getattr(message.reply_to, "reply_to_msg_id", None)
-
             message_data = {
                 "id": message.id,
                 "chat_id": chat_id,
src/message_utils.py

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+"""Shared message processing utilities used by backup and listener modules."""
+
+
+def extract_topic_id(message: object) -> int | None:
+    """Extract forum topic ID from a Telegram message's reply_to metadata.
+
+    Forum messages carry the topic ID in reply_to.reply_to_top_id.
+    When that field is absent (e.g. topic-creating service messages),
+    reply_to.reply_to_msg_id is used as a fallback.
+
+    Returns None for non-forum messages or messages without reply_to.
+    """
+    if not message.reply_to or not getattr(message.reply_to, "forum_topic", False):
+        return None
+    topic_id = getattr(message.reply_to, "reply_to_top_id", None)
+    if topic_id is None:
+        topic_id = getattr(message.reply_to, "reply_to_msg_id", None)
+    return topic_id
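
The branches of `extract_topic_id` are easiest to follow with concrete inputs. The sketch below copies the helper and feeds it `types.SimpleNamespace` objects as stand-ins for real message objects; the stand-ins carry only the three attributes the helper reads, and the IDs (42, 900) are made up for illustration.

```python
from types import SimpleNamespace
from typing import Optional


def extract_topic_id(message) -> Optional[int]:
    """Copy of the helper from the diff above, reproduced so the sketch runs alone."""
    if not message.reply_to or not getattr(message.reply_to, "forum_topic", False):
        return None
    topic_id = getattr(message.reply_to, "reply_to_top_id", None)
    if topic_id is None:
        topic_id = getattr(message.reply_to, "reply_to_msg_id", None)
    return topic_id


def fake_message(**reply_fields):
    """Hypothetical test helper: a message whose reply_to has the given fields."""
    return SimpleNamespace(reply_to=SimpleNamespace(**reply_fields))


no_reply = SimpleNamespace(reply_to=None)
plain_reply = fake_message(forum_topic=False, reply_to_top_id=None, reply_to_msg_id=900)
reply_in_topic = fake_message(forum_topic=True, reply_to_top_id=42, reply_to_msg_id=900)
top_level_post = fake_message(forum_topic=True, reply_to_top_id=None, reply_to_msg_id=42)

print(extract_topic_id(no_reply))        # None: no reply_to at all
print(extract_topic_id(plain_reply))     # None: a reply, but not in a forum topic
print(extract_topic_id(reply_in_topic))  # 42: reply_to_top_id wins over reply_to_msg_id
print(extract_topic_id(top_level_post))  # 42: falls back to reply_to_msg_id
```

Using `SimpleNamespace` instead of `MagicMock` here avoids the auto-created truthy attributes that would otherwise make every stand-in look like a forum message.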
