Commit 57fdec5

fix: deduplicate papers within batch by arxiv_id before LLM
Prevents inbox + arXiv overlap from causing double classification. (processed_ids already backfilled on server with 319 existing papers)
1 parent 7739db9 commit 57fdec5

1 file changed

Lines changed: 14 additions & 0 deletions

automation/main.py

@@ -126,7 +126,21 @@ def run_daily() -> None:
     # ── 3. Merge + deduplicate ──────────────────────────────────────────────
     all_papers = raw_papers + inbox_papers
     processed_set = set(state["processed_ids"]) | set(state["rejected_ids"])
+
+    # Dedup 1: filter already processed
     new_papers = [p for p in all_papers if p.get("arxiv_id") not in processed_set]
+
+    # Dedup 2: within this batch, keep first occurrence by arxiv_id
+    # (arXiv crawl and inbox may contain the same paper)
+    seen: dict[str, bool] = {}
+    deduped = []
+    for p in new_papers:
+        aid = p.get("arxiv_id", "")
+        if aid and aid not in seen:
+            seen[aid] = True
+            deduped.append(p)
+    new_papers = deduped
+
     logger.info("After dedup: %d new papers to process", len(new_papers))
 
     if not new_papers:
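
The two dedup passes in this hunk can also be read as one filter. The following is a minimal standalone sketch, not code from the commit: the function name `dedupe_new_papers` and its parameters are hypothetical, and it uses a `set` where the commit uses `dict[str, bool]`. Like the committed loop, it drops entries with a missing or empty `arxiv_id`, because the `if aid and ...` condition never admits them.

```python
from typing import Any, Iterable


def dedupe_new_papers(
    papers: Iterable[dict[str, Any]],
    already_handled: set[str],
) -> list[dict[str, Any]]:
    """Hypothetical helper mirroring Dedup 1 + Dedup 2 from the diff.

    Keeps the first occurrence of each arxiv_id that is not in
    already_handled. Papers without a usable arxiv_id are dropped,
    matching the behavior of the committed loop.
    """
    seen: set[str] = set()
    result: list[dict[str, Any]] = []
    for paper in papers:
        aid = paper.get("arxiv_id", "")
        # Skip: no id, already processed/rejected, or duplicate in this batch.
        if not aid or aid in already_handled or aid in seen:
            continue
        seen.add(aid)
        result.append(paper)
    return result


# Illustrative data only; these ids and fields are made up.
raw_papers = [
    {"arxiv_id": "2401.00001", "source": "crawl"},
    {"arxiv_id": "2401.00002", "source": "crawl"},
]
inbox_papers = [
    {"arxiv_id": "2401.00001", "source": "inbox"},  # overlaps with the crawl
    {"title": "entry without an arxiv_id"},
]
new_papers = dedupe_new_papers(raw_papers + inbox_papers, {"2401.00002"})
print(new_papers)  # [{'arxiv_id': '2401.00001', 'source': 'crawl'}]
```

Because the crawl results come first in the merged list, the crawl copy wins over the inbox copy, which is the same "keep first occurrence" rule the commit comment describes.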
