update papers
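
The Sci-Hub change replaces the single iframe regex with an ordered fallback (strategies A to E in `_try_scihub_download`). A rough standalone sketch of that ordering follows. The helper name `extract_pdf_url` and the example domains are hypothetical and not part of `scripts/fetch_papers.py`; the regexes are copied from the diff.

```python
import re
from typing import Optional

# Ordered fallback patterns mirroring strategies A-E in the diff.
# The first pattern that matches wins.
_PDF_PATTERNS = [
    # A: citation_pdf_url meta tag
    re.compile(r'citation_pdf_url["\']?\s+content\s*=\s*["\']([^"\']+)', re.IGNORECASE),
    # B: site-relative /storage/ path (case-sensitive, as in the diff)
    re.compile(r'(/storage/[^\s"\'<>,]+\.pdf)'),
    # C: iframe/embed src containing .pdf
    re.compile(r'(?:iframe|embed)[^>]+src\s*=\s*["\']([^"\']*\.pdf[^"\']*)["\']', re.IGNORECASE),
    # D: any absolute PDF URL
    re.compile(r'(https?://[^\s"\'<>]+\.pdf(?:\?[^\s"\'<>]*)?)', re.IGNORECASE),
    # E: protocol-relative PDF URL in a src attribute
    re.compile(r'src\s*=\s*["\']?(//[^\s"\'<>]+\.pdf[^\s"\'<>]*)', re.IGNORECASE),
]


def extract_pdf_url(html: str, domain: str) -> Optional[str]:
    """Return an absolute PDF URL found in `html`, or None (hypothetical helper)."""
    for pattern in _PDF_PATTERNS:
        m = pattern.search(html)
        if not m:
            continue
        url = m.group(1)
        if url.startswith("//"):
            # Protocol-relative link: assume https.
            return "https:" + url
        if url.startswith("/"):
            # Site-relative link: resolve against the mirror that served the page.
            return f"https://{domain}{url}"
        return url
    return None


if __name__ == "__main__":
    page = '<meta name="citation_pdf_url" content="/storage/abc/paper.pdf">'
    print(extract_pdf_url(page, "mirror.example"))
```

The site-relative branch corresponds to the new `elif pdf_url.startswith("/")` case in the diff, which the old code did not handle.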

GiggleLiu · GiggleLiu · commit 8cd6ac168151 · 2026-04-10T19:15:33.000+08:00
diff --git a/.claude/CLAUDE.md b/.claude/CLAUDE.md
@@ -39,6 +39,7 @@ These repo-local skills live under `.claude/skills/*/SKILL.md`.
 - [dev-setup](skills/dev-setup/SKILL.md) -- Interactive wizard to install and configure all development tools for new maintainers.
 - [verify-reduction](skills/verify-reduction/SKILL.md) -- Standalone mathematical verification of a reduction rule: Typst proof, constructor Python (≥5000 checks), adversary Python (≥5000 independent checks). Reports verdict, no artifacts saved. Also called as a subroutine by `/add-rule` (default behavior).
 - [tutorial](skills/tutorial/SKILL.md) -- Interactive tutorial — walk through the pred CLI to explore, reduce, and solve NP-hard problems. No Rust internals.
+- [update-papers](skills/update-papers/SKILL.md) -- Update research paper collection: download new papers from references.bib, retry failed downloads, sync to Google Drive, regenerate index.md.
 
 ## Codex Compatibility
 - Claude slash commands such as `/issue-to-pr 42 --execute` are aliases for the matching repo-local skill files under `.claude/skills/`.
diff --git a/.claude/skills/update-papers/SKILL.md b/.claude/skills/update-papers/SKILL.md
@@ -0,0 +1,129 @@
+---
+name: update-papers
+description: Update the research paper collection — download new papers from references.bib, retry failed downloads, sync to Google Drive, and regenerate index.md
+---
+
+# Update Papers
+
+Maintain the research paper collection in `docs/research/`. Downloads papers referenced in `docs/paper/references.bib`, manages the manifest, syncs to Google Drive, and keeps `docs/research/index.md` current.
+
+## Prerequisites
+
+- `rclone` installed and configured with a `gdrive` remote
+- `PAPERS_REMOTE` env var set (e.g., `gdrive:problemreductions-papers`)
+
+## Step 1: Check Current Status
+
+```bash
+make papers-status
+```
+
+Note the counts: total entries, PDFs on disk, and missing papers (split into with DOI, without DOI, and textbooks).
+
+## Step 2: Look Up New Papers
+
+Find arXiv/OA URLs for entries added to `references.bib` since the last run. The lookup is incremental: it skips entries already in the manifest.
+
+```bash
+make papers-lookup
+```
+
+Review the output:
+- New arXiv papers found
+- New OA (open access) papers found
+- Papers with no free source (will need Sci-Hub in Step 4)
+
+## Step 3: Download Free Papers
+
+Download papers with known free URLs (arXiv and open access). Skips PDFs already on disk.
+
+```bash
+make papers-download
+```
+
+Some OA downloads may fail with HTTP 403 because of publisher paywalls; this is expected. Papers that have a DOI are retried through Sci-Hub in the next step.
+
+## Step 4: Fetch Remaining via Sci-Hub
+
+For papers with DOIs that aren't on disk yet, try Sci-Hub mirrors. This is the slowest step (~5 seconds per paper).
+
+```bash
+make papers-scihub
+```
+
+The script tries each mirror in `SCIHUB_DOMAINS` in order (`sci-hub.ru`, `sci-hub.do`, `sci-hub.it.nf`, `sci-hub.es.ht`, `sci-hub.se`, `sci-hub.st`). If all mirrors are down, retry later; papers already on disk are skipped, so re-running is safe.
+
+## Step 4b: Manual Web Search for Remaining Failures
+
+After Sci-Hub, run `make papers-status` to list papers still missing. For each one that has no DOI, or whose DOI Sci-Hub couldn't resolve:
+
+1. **Web search** for `"<title>" <first-author> PDF` — try:
+   - Author homepages (e.g., university faculty pages)
+   - Open-access publishers: LIPIcs/Dagstuhl (all free), HAL archives, ECCC
+   - Preprint servers: arXiv (search by title), IACR ePrint
+2. **Download manually** with `curl -L -o docs/research/raw/<key>.pdf "<url>"`
+3. **Verify** the file is a real PDF: `file docs/research/raw/<key>.pdf`
+
+Skip textbooks (garey1979, sipser2012, cormen2022, conway1967) — these aren't available as single PDFs.
+
+## Step 5: Regenerate Index
+
+Update `docs/research/index.md` with the latest paper collection, cross-referenced against reduction rules and problem definitions in `reductions.typ`.
+
+```bash
+make papers-index
+```
+
+Verify the index looks correct:
+- Check the download count at the top
+- Spot-check that new papers appear in the correct section (rules / problems / other)
+- Confirm PDF links resolve for newly downloaded papers
+
+## Step 6: Sync to Google Drive
+
+Push updated PDFs and manifest to the shared Google Drive remote. Only uploads new/changed files.
+
+First verify the remote is configured:
+
+```bash
+echo "$PAPERS_REMOTE"  # should show e.g. gdrive:problemreductions-papers
+# If empty, set it:
+export PAPERS_REMOTE=gdrive:problemreductions-papers
+```
+
+Then push:
+
+```bash
+make papers-push
+```
+
+## Step 7: Final Status
+
+```bash
+make papers-status
+```
+
+Report to the user:
+- How many new papers were downloaded
+- How many remain missing (and why: no DOI, textbooks, Sci-Hub mirrors down)
+- Whether the Google Drive sync succeeded
+
+## One-Liner
+
+For a full update in one command:
+
+```bash
+make papers && make papers-index
+```
+
+This runs: lookup → download → scihub → status → index.
+
+## Troubleshooting
+
+**Sci-Hub mirrors all fail**: Mirrors rotate frequently. Update `SCIHUB_DOMAINS` in `scripts/fetch_papers.py` or retry later.
+
+**rclone auth expired**: Run `rclone config reconnect gdrive:` to refresh the OAuth token.
+
+**Manifest is stale**: Delete `docs/research/manifest.json` and re-run `make papers-lookup` to rebuild from scratch. Existing PDFs on disk are preserved.
+
+**New bib entry not appearing**: Ensure the entry is in `docs/paper/references.bib` with proper formatting. The parser expects `@type{key, ... }` with fields like `title`, `doi`, `author`, `year`.
diff --git a/Makefile b/Makefile
@@ -1,6 +1,6 @@
 # Makefile for problemreductions
 
-.PHONY: help build test mcp-test fmt clippy doc mdbook paper clean coverage rust-export compare qubo-testdata export-schemas release run-plan run-issue run-pipeline run-pipeline-forever run-review run-review-forever board-next board-claim board-ack board-move issue-context issue-guards pr-context pr-wait-ci worktree-issue worktree-pr diagrams jl-testdata cli cli-demo copilot-review papers papers-lookup papers-download papers-scihub papers-status papers-push papers-pull
+.PHONY: help build test mcp-test fmt clippy doc mdbook paper clean coverage rust-export compare qubo-testdata export-schemas release run-plan run-issue run-pipeline run-pipeline-forever run-review run-review-forever board-next board-claim board-ack board-move issue-context issue-guards pr-context pr-wait-ci worktree-issue worktree-pr diagrams jl-testdata cli cli-demo copilot-review papers papers-lookup papers-download papers-scihub papers-status papers-push papers-pull papers-index
 
 RUNNER ?= codex
 CLAUDE_MODEL ?= opus
@@ -627,6 +627,10 @@ papers-push:
 papers-pull:
 	python3 scripts/fetch_papers.py pull
 
+# Regenerate docs/research/index.md from references.bib + reductions.typ
+papers-index:
+	python3 scripts/gen_paper_index.py
+
 # Show current collection stats
 papers-status:
 	python3 scripts/fetch_papers.py status
diff --git a/scripts/fetch_papers.py b/scripts/fetch_papers.py
@@ -41,7 +41,7 @@
 MAX_RETRIES = 3       # retries per API call on 429
 DOWNLOAD_DELAY = 2.0  # seconds between PDF downloads
 SCIHUB_DELAY = 5.0    # seconds between Sci-Hub requests (be polite)
-SCIHUB_DOMAINS = ["sci-hub.se", "sci-hub.st", "sci-hub.ru"]
+SCIHUB_DOMAINS = ["sci-hub.ru", "sci-hub.do", "sci-hub.it.nf", "sci-hub.es.ht", "sci-hub.se", "sci-hub.st"]
 
 
 def parse_bib(path: Path) -> list[dict]:
@@ -310,7 +310,14 @@ def download_pdfs(manifest_entries: list[dict]):
 
         print(f"[{downloaded+1}] {key}: {url[:70]}...")
         try:
-            req = urllib.request.Request(url, headers={"User-Agent": "problemreductions/1.0"})
+            # Use browser-like headers to avoid 403 from publisher sites
+            headers = {
+                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+                              "AppleWebKit/537.36 (KHTML, like Gecko) "
+                              "Chrome/120.0.0.0 Safari/537.36",
+                "Accept": "application/pdf,*/*",
+            }
+            req = urllib.request.Request(url, headers=headers)
             with urllib.request.urlopen(req, timeout=60) as resp:
                 data = resp.read()
 
@@ -354,32 +361,63 @@ def _try_scihub_download(doi: str, dest: Path) -> bool:
                 dest.write_bytes(content)
                 return True
 
-            # Parse page for embedded PDF iframe/link
+            # Parse page for embedded PDF link
             html = content.decode("utf-8", errors="ignore")
-            # Look for iframe src or direct PDF link
-            pdf_match = re.search(
-                r'(?:iframe|embed)[^>]+src\s*=\s*["\']([^"\']*\.pdf[^"\']*)["\']',
-                html, re.IGNORECASE
+            pdf_path = None
+
+            # Strategy A: citation_pdf_url meta tag (sci-hub.ru pattern)
+            m = re.search(
+                r'citation_pdf_url["\']?\s+content\s*=\s*["\']([^"\']+)',
+                html, re.IGNORECASE,
             )
-            if not pdf_match:
-                pdf_match = re.search(
+            if m:
+                pdf_path = m.group(1)
+
+            # Strategy B: /storage/ path in page
+            if not pdf_path:
+                m = re.search(r'(/storage/[^\s"\'<>,]+\.pdf)', html)
+                if m:
+                    pdf_path = m.group(1)
+
+            # Strategy C: iframe/embed src with .pdf
+            if not pdf_path:
+                m = re.search(
+                    r'(?:iframe|embed)[^>]+src\s*=\s*["\']([^"\']*\.pdf[^"\']*)["\']',
+                    html, re.IGNORECASE,
+                )
+                if m:
+                    pdf_path = m.group(1)
+
+            # Strategy D: any absolute PDF URL
+            if not pdf_path:
+                m = re.search(
                     r'(https?://[^\s"\'<>]+\.pdf(?:\?[^\s"\'<>]*)?)',
-                    html, re.IGNORECASE
+                    html, re.IGNORECASE,
                 )
-            if not pdf_match:
-                # Try //domain/path pattern (protocol-relative)
-                pdf_match = re.search(
+                if m:
+                    pdf_path = m.group(1)
+
+            # Strategy E: protocol-relative PDF URL
+            if not pdf_path:
+                m = re.search(
                     r'src\s*=\s*["\']?(//[^\s"\'<>]+\.pdf[^\s"\'<>]*)',
-                    html, re.IGNORECASE
+                    html, re.IGNORECASE,
                 )
+                if m:
+                    pdf_path = m.group(1)
+
+            pdf_match = pdf_path  # keep the name used by the existing `if pdf_match:` check below
 
             if pdf_match:
-                pdf_url = pdf_match.group(1)
+                pdf_url = pdf_match
                 if pdf_url.startswith("//"):
                     pdf_url = "https:" + pdf_url
+                elif pdf_url.startswith("/"):
+                    pdf_url = f"https://{domain}{pdf_url}"
 
                 pdf_req = urllib.request.Request(pdf_url, headers={
-                    "User-Agent": "Mozilla/5.0",
+                    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+                                  "AppleWebKit/537.36",
                     "Referer": url,
                 })
                 with urllib.request.urlopen(pdf_req, timeout=60) as pdf_resp:
@@ -448,31 +486,54 @@ def show_status():
         return
 
     all_entries = list(manifest.values())
-    found = [r for r in all_entries if r.get("pdf_url")]
-    arxiv = [r for r in all_entries if r.get("arxiv_id")]
-    oa_only = [r for r in found if not r.get("arxiv_id")]
-    missing = [r for r in all_entries if not r.get("pdf_url")]
 
+    # What's actually on disk (the ground truth)
     pdfs = list(OUTPUT_DIR.glob("*.pdf")) if OUTPUT_DIR.exists() else []
-    pdf_keys = {p.stem for p in pdfs}
-    total_size = sum(p.stat().st_size for p in pdfs)
-
-    print(f"=== MANIFEST ===")
-    print(f"Total entries: {len(all_entries)}")
-    print(f"Arxiv: {len(arxiv)}")
-    print(f"OA (non-arxiv): {len(oa_only)}")
-    print(f"Not found: {len(missing)}")
-    print()
-    print(f"=== DOWNLOADS ===")
-    print(f"PDFs on disk: {len(pdfs)}")
-    print(f"Total size: {total_size / 1024 / 1024:.1f} MB")
-    print(f"Pending download: {len(found) - len(pdf_keys & {e['key'] for e in found})}")
-
-    if missing:
-        print(f"\n=== NOT FOUND ({len(missing)}) ===")
-        for r in sorted(missing, key=lambda r: r.get("year", "0")):
-            doi_str = f" doi:{r.get('doi','')}" if r.get("doi") else ""
-            print(f"  {r['key']} ({r.get('year','?')}): {r.get('title','')[:55]}{doi_str}")
+    valid_pdfs = [p for p in pdfs if p.stat().st_size > 1000]  # ignore files of 1000 bytes or less
+    pdf_keys = {p.stem for p in valid_pdfs}
+    total_size = sum(p.stat().st_size for p in valid_pdfs)
+
+    # Truly missing = in manifest but no PDF on disk
+    truly_missing = []
+    for e in all_entries:
+        if e["key"] not in pdf_keys:
+            truly_missing.append(e)
+
+    # Categorize missing
+    textbooks = {"garey1979", "sipser2012", "cormen2022", "conway1967"}
+    missing_with_doi = [e for e in truly_missing if e.get("doi") and e["key"] not in textbooks]
+    missing_no_doi = [e for e in truly_missing if not e.get("doi") and e["key"] not in textbooks]
+    missing_textbooks = [e for e in truly_missing if e["key"] in textbooks]
+
+    print("=== COLLECTION ===")
+    print(f"Total in references.bib: {len(all_entries)}")
+    print(f"PDFs on disk:            {len(pdf_keys)} ({total_size / 1024 / 1024:.1f} MB)")
+    print(f"Truly missing:           {len(truly_missing)}")
+    print(f"  With DOI (retry):      {len(missing_with_doi)}")
+    print(f"  No DOI (manual):       {len(missing_no_doi)}")
+    print(f"  Textbooks:             {len(missing_textbooks)}")
+
+    # Remote status
+    if PAPERS_REMOTE:
+        print(f"\nRemote: {PAPERS_REMOTE}")
+    else:
+        print("\nRemote: not configured (set PAPERS_REMOTE)")
+
+    if truly_missing:
+        if missing_with_doi:
+            print(f"\n=== MISSING WITH DOI — retry with 'make papers-scihub' ({len(missing_with_doi)}) ===")
+            for r in sorted(missing_with_doi, key=lambda r: r.get("year", "0")):
+                print(f"  {r['key']} ({r.get('year','?')}): {r.get('title','')[:55]}  doi:{r['doi']}")
+
+        if missing_no_doi:
+            print(f"\n=== MISSING WITHOUT DOI — manual web search needed ({len(missing_no_doi)}) ===")
+            for r in sorted(missing_no_doi, key=lambda r: r.get("year", "0")):
+                print(f"  {r['key']} ({r.get('year','?')}): {r.get('title','')[:60]}")
+
+        if missing_textbooks:
+            print(f"\n=== TEXTBOOKS — not downloadable as PDF ({len(missing_textbooks)}) ===")
+            for r in sorted(missing_textbooks, key=lambda r: r.get("year", "0")):
+                print(f"  {r['key']} ({r.get('year','?')}): {r.get('title','')[:60]}")
 
 
 def _require_remote() -> str:
diff --git a/scripts/gen_paper_index.py b/scripts/gen_paper_index.py