Commit 8cd6ac1

update papers
1 parent 7e44a41 commit 8cd6ac1

5 files changed

Lines changed: 380 additions & 40 deletions

.claude/CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -39,6 +39,7 @@ These repo-local skills live under `.claude/skills/*/SKILL.md`.
 - [dev-setup](skills/dev-setup/SKILL.md) -- Interactive wizard to install and configure all development tools for new maintainers.
 - [verify-reduction](skills/verify-reduction/SKILL.md) -- Standalone mathematical verification of a reduction rule: Typst proof, constructor Python (≥5000 checks), adversary Python (≥5000 independent checks). Reports verdict, no artifacts saved. Also called as a subroutine by `/add-rule` (default behavior).
 - [tutorial](skills/tutorial/SKILL.md) -- Interactive tutorial — walk through the pred CLI to explore, reduce, and solve NP-hard problems. No Rust internals.
+- [update-papers](skills/update-papers/SKILL.md) -- Update research paper collection: download new papers from references.bib, retry failed downloads, sync to Google Drive, regenerate index.md.
 
 ## Codex Compatibility
 - Claude slash commands such as `/issue-to-pr 42 --execute` are aliases for the matching repo-local skill files under `.claude/skills/`.

.claude/skills/update-papers/SKILL.md

Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,129 @@
+---
+name: update-papers
+description: Update the research paper collection — download new papers from references.bib, retry failed downloads, sync to Google Drive, and regenerate index.md
+---
+
+# Update Papers
+
+Maintain the research paper collection in `docs/research/`. Downloads papers referenced in `docs/paper/references.bib`, manages the manifest, syncs to Google Drive, and keeps `docs/research/index.md` current.
+
+## Prerequisites
+
+- `rclone` installed and configured with a `gdrive` remote
+- `PAPERS_REMOTE` env var set (e.g., `gdrive:problemreductions-papers`)
+
+## Step 1: Check Current Status
+
+```bash
+make papers-status
+```
+
+Note the counts: total entries, PDFs on disk, pending downloads, missing papers.
+
+## Step 2: Lookup New Papers
+
+Run the lookup to find arXiv/OA URLs for entries added to `references.bib` since the last run. The lookup is incremental: it skips entries already in the manifest.
+
+```bash
+make papers-lookup
+```
+
+Review the output:
+- New arXiv papers found
+- New OA (open access) papers found
+- Papers with no free source (will need Sci-Hub in Step 4)
+
+## Step 3: Download Free Papers
+
+Download papers with known free URLs (arXiv + open access). Skips PDFs already on disk.
+
+```bash
+make papers-download
+```
+
+Some OA downloads may fail with HTTP 403 because of publisher paywalls; this is expected. Sci-Hub picks these up in the next step.
+
+## Step 4: Fetch Remaining via Sci-Hub
+
+For papers that have a DOI but no PDF on disk yet, try Sci-Hub mirrors. This is the slowest step (~5 seconds per paper, from the delay between Sci-Hub requests).
+
+```bash
+make papers-scihub
+```
+
+The script tries the mirrors listed in `SCIHUB_DOMAINS` (`sci-hub.ru`, `sci-hub.do`, `sci-hub.it.nf`, `sci-hub.es.ht`, then `sci-hub.se` and `sci-hub.st`). If all mirrors are down, retry later; the script is idempotent, so re-running it is safe.
+
+## Step 4b: Manual Web Search for Remaining Failures
+
+After the Sci-Hub step, run `make papers-status` to list papers still missing. For each one that has a DOI but that Sci-Hub couldn't provide:
+
+1. **Web search** for `"<title>" <first-author> PDF`. Try:
+   - Author homepages (Stanford, university pages)
+   - Open-access publishers: LIPIcs/Dagstuhl (all free), HAL archives, ECCC
+   - Preprint servers: arXiv (search by title), IACR ePrint
+2. **Download manually** with `curl -L -o docs/research/raw/<key>.pdf "<url>"`
+3. **Verify** the file is a real PDF: `file docs/research/raw/<key>.pdf`
+
+Skip textbooks (garey1979, sipser2012, cormen2022, conway1967); these aren't available as single PDFs.
+
+## Step 5: Regenerate Index
+
+Update `docs/research/index.md` with the latest paper collection, cross-referenced against reduction rules and problem definitions in `reductions.typ`.
+
+```bash
+make papers-index
+```
+
+Verify the index looks correct:
+- Check the download count at the top
+- Spot-check that new papers appear in the correct section (rules / problems / other)
+- Confirm PDF links resolve for newly downloaded papers
+
+## Step 6: Sync to Google Drive
+
+Push updated PDFs and manifest to the shared Google Drive remote. Only uploads new/changed files.
+
+First verify the remote is configured:
+
+```bash
+echo $PAPERS_REMOTE  # should show e.g. gdrive:problemreductions-papers
+# If empty, set it:
+export PAPERS_REMOTE=gdrive:problemreductions-papers
+```
+
+Then push:
+
+```bash
+make papers-push
+```
+
+## Step 7: Final Status
+
+```bash
+make papers-status
+```
+
+Report to the user:
+- How many new papers were downloaded
+- How many remain missing (and why: no DOI, textbooks, Sci-Hub mirrors down)
+- Whether the Google Drive sync succeeded
+
+## One-Liner
+
+For a full update in one command:
+
+```bash
+make papers && make papers-index
+```
+
+This runs: lookup → download → scihub → status → index.
+
+## Troubleshooting
+
+**Sci-Hub mirrors all fail**: Mirrors rotate frequently. Update `SCIHUB_DOMAINS` in `scripts/fetch_papers.py` or retry later.
+
+**rclone auth expired**: Run `rclone config reconnect gdrive:` to refresh the OAuth token.
+
+**Manifest is stale**: Delete `docs/research/manifest.json` and re-run `make papers-lookup` to rebuild from scratch. Existing PDFs on disk are preserved.
+
+**New bib entry not appearing**: Ensure the entry is in `docs/paper/references.bib` with proper formatting. The parser expects `@type{key, ... }` with fields like `title`, `doi`, `author`, `year`.
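
The entry shape named in the last troubleshooting item can be illustrated with a minimal sketch. This is not the repository's `parse_bib`: the helper name `parse_bib_entries`, both regexes, and the sample entry (including its made-up DOI) are assumptions for illustration. The sketch only handles flat `field = {value}` or `field = "value"` pairs, with no nested braces, and expects the closing `}` on its own line.

```python
import re

# Hypothetical helper for illustration; not the parser in scripts/fetch_papers.py.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*?)\n\}", re.DOTALL)
FIELD_RE = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)")')


def parse_bib_entries(text: str) -> list[dict]:
    """Extract type, key, and flat fields from each @type{key, ...} entry."""
    entries = []
    for entry_type, key, body in ENTRY_RE.findall(text):
        record = {"type": entry_type.lower(), "key": key}
        for name, braced, quoted in FIELD_RE.findall(body):
            # Exactly one of the two value groups is non-empty per match.
            record[name.lower()] = (braced or quoted).strip()
        entries.append(record)
    return entries


# Made-up sample entry; the DOI is a placeholder, not a real identifier.
SAMPLE = """@article{doe2020,
  title = {An Example Reduction},
  author = {Doe, Jane},
  year = {2020},
  doi = {10.0000/example}
}
"""

entries = parse_bib_entries(SAMPLE)
```

An entry that this kind of parser silently skips (for example, one missing the comma after the key) is a plausible reason for a new bib entry not showing up in the manifest.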

Makefile

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,6 @@
 # Makefile for problemreductions
 
-.PHONY: help build test mcp-test fmt clippy doc mdbook paper clean coverage rust-export compare qubo-testdata export-schemas release run-plan run-issue run-pipeline run-pipeline-forever run-review run-review-forever board-next board-claim board-ack board-move issue-context issue-guards pr-context pr-wait-ci worktree-issue worktree-pr diagrams jl-testdata cli cli-demo copilot-review papers papers-lookup papers-download papers-scihub papers-status papers-push papers-pull
+.PHONY: help build test mcp-test fmt clippy doc mdbook paper clean coverage rust-export compare qubo-testdata export-schemas release run-plan run-issue run-pipeline run-pipeline-forever run-review run-review-forever board-next board-claim board-ack board-move issue-context issue-guards pr-context pr-wait-ci worktree-issue worktree-pr diagrams jl-testdata cli cli-demo copilot-review papers papers-lookup papers-download papers-scihub papers-status papers-push papers-pull papers-index
 
 RUNNER ?= codex
 CLAUDE_MODEL ?= opus
@@ -627,6 +627,10 @@ papers-push:
 papers-pull:
 	python3 scripts/fetch_papers.py pull
 
+# Regenerate docs/research/index.md from references.bib + reductions.typ
+papers-index:
+	python3 scripts/gen_paper_index.py
+
 # Show current collection stats
 papers-status:
 	python3 scripts/fetch_papers.py status

scripts/fetch_papers.py

Lines changed: 100 additions & 39 deletions
@@ -41,7 +41,7 @@
 MAX_RETRIES = 3  # retries per API call on 429
 DOWNLOAD_DELAY = 2.0  # seconds between PDF downloads
 SCIHUB_DELAY = 5.0  # seconds between Sci-Hub requests (be polite)
-SCIHUB_DOMAINS = ["sci-hub.se", "sci-hub.st", "sci-hub.ru"]
+SCIHUB_DOMAINS = ["sci-hub.ru", "sci-hub.do", "sci-hub.it.nf", "sci-hub.es.ht", "sci-hub.se", "sci-hub.st"]
 
 
 def parse_bib(path: Path) -> list[dict]:
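
How a mirror list such as `SCIHUB_DOMAINS` might be consumed can be sketched as follows, assuming each mirror is tried in list order until one returns PDF bytes. The function `fetch_from_mirrors`, its injected `fetch` callable, and the URL layout `https://<domain>/<doi>` are hypothetical and are not taken from `scripts/fetch_papers.py`. For brevity the sketch skips HTML responses, whereas the committed code parses them for an embedded PDF link.

```python
from typing import Callable, Iterable, Optional


def fetch_from_mirrors(
    doi: str,
    domains: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Optional[bytes]:
    """Return the first response that looks like a PDF, or None if none does."""
    for domain in domains:
        url = f"https://{domain}/{doi}"  # assumed URL layout
        try:
            content = fetch(url)
        except OSError:
            # Network-level failure; urllib.error.URLError subclasses OSError.
            continue
        if content.startswith(b"%PDF-"):
            return content
    return None
```

Passing `fetch` in as a parameter keeps the fallback logic testable without network access; a real caller could wrap `urllib.request.urlopen` and sleep for `SCIHUB_DELAY` between requests.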
@@ -310,7 +310,14 @@ def download_pdfs(manifest_entries: list[dict]):
 
         print(f"[{downloaded+1}] {key}: {url[:70]}...")
         try:
-            req = urllib.request.Request(url, headers={"User-Agent": "problemreductions/1.0"})
+            # Use browser-like headers to avoid 403 from publisher sites
+            headers = {
+                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+                              "AppleWebKit/537.36 (KHTML, like Gecko) "
+                              "Chrome/120.0.0.0 Safari/537.36",
+                "Accept": "application/pdf,*/*",
+            }
+            req = urllib.request.Request(url, headers=headers)
             with urllib.request.urlopen(req, timeout=60) as resp:
                 data = resp.read()
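
The header change in this hunk can be tried without network access. The sketch below only builds the request and inspects the headers that would be sent. The helper name `build_pdf_request` and the example URL are hypothetical; the header values are copied from the diff.

```python
import urllib.request

# Header values copied from the diff; the constant name is an assumption.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "application/pdf,*/*",
}


def build_pdf_request(url: str) -> urllib.request.Request:
    """Build a GET request carrying browser-like headers (nothing is sent)."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


req = build_pdf_request("https://example.org/paper.pdf")
# urllib stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Accept"))
```

Whether a given publisher accepts these headers is site-specific, so a 403 can still occur with them.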

@@ -354,32 +361,63 @@ def _try_scihub_download(doi: str, dest: Path) -> bool:
                 dest.write_bytes(content)
                 return True
 
-            # Parse page for embedded PDF iframe/link
+            # Parse page for embedded PDF link
             html = content.decode("utf-8", errors="ignore")
-            # Look for iframe src or direct PDF link
-            pdf_match = re.search(
-                r'(?:iframe|embed)[^>]+src\s*=\s*["\']([^"\']*\.pdf[^"\']*)["\']',
-                html, re.IGNORECASE
+            pdf_path = None
+
+            # Strategy A: citation_pdf_url meta tag (sci-hub.ru pattern)
+            m = re.search(
+                r'citation_pdf_url["\']?\s+content\s*=\s*["\']([^"\']+)',
+                html, re.IGNORECASE,
             )
-            if not pdf_match:
-                pdf_match = re.search(
+            if m:
+                pdf_path = m.group(1)
+
+            # Strategy B: /storage/ path in page
+            if not pdf_path:
+                m = re.search(r'(/storage/[^\s"\'<>,]+\.pdf)', html)
+                if m:
+                    pdf_path = m.group(1)
+
+            # Strategy C: iframe/embed src with .pdf
+            if not pdf_path:
+                m = re.search(
+                    r'(?:iframe|embed)[^>]+src\s*=\s*["\']([^"\']*\.pdf[^"\']*)["\']',
+                    html, re.IGNORECASE,
+                )
+                if m:
+                    pdf_path = m.group(1)
+
+            # Strategy D: any absolute PDF URL
+            if not pdf_path:
+                m = re.search(
                     r'(https?://[^\s"\'<>]+\.pdf(?:\?[^\s"\'<>]*)?)',
-                    html, re.IGNORECASE
+                    html, re.IGNORECASE,
                 )
-            if not pdf_match:
-                # Try //domain/path pattern (protocol-relative)
-                pdf_match = re.search(
+                if m:
+                    pdf_path = m.group(1)
+
+            # Strategy E: protocol-relative PDF URL
+            if not pdf_path:
+                m = re.search(
                     r'src\s*=\s*["\']?(//[^\s"\'<>]+\.pdf[^\s"\'<>]*)',
-                    html, re.IGNORECASE
+                    html, re.IGNORECASE,
                 )
+                if m:
+                    pdf_path = m.group(1)
+
+            pdf_match = pdf_path  # unify variable name
 
             if pdf_match:
-                pdf_url = pdf_match.group(1)
+                pdf_url = pdf_match
                 if pdf_url.startswith("//"):
                     pdf_url = "https:" + pdf_url
+                elif pdf_url.startswith("/"):
+                    pdf_url = f"https://{domain}{pdf_url}"
 
                 pdf_req = urllib.request.Request(pdf_url, headers={
-                    "User-Agent": "Mozilla/5.0",
+                    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+                                  "AppleWebKit/537.36",
                     "Referer": url,
                 })
                 with urllib.request.urlopen(pdf_req, timeout=60) as pdf_resp:
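
The fallback order in this hunk (strategies A through E) and the URL normalization that follows it can be exercised on small HTML strings. The sketch below is a hedged restatement, not the committed code: the regexes are copied from the diff, while the names `PDF_PATTERNS` and `extract_pdf_url` and any domains passed to it are assumptions.

```python
import re
from typing import Optional

# Regexes copied from the diff, tried in order (strategies A to E).
PDF_PATTERNS = [
    (r'citation_pdf_url["\']?\s+content\s*=\s*["\']([^"\']+)', re.IGNORECASE),
    (r'(/storage/[^\s"\'<>,]+\.pdf)', 0),
    (r'(?:iframe|embed)[^>]+src\s*=\s*["\']([^"\']*\.pdf[^"\']*)["\']', re.IGNORECASE),
    (r'(https?://[^\s"\'<>]+\.pdf(?:\?[^\s"\'<>]*)?)', re.IGNORECASE),
    (r'src\s*=\s*["\']?(//[^\s"\'<>]+\.pdf[^\s"\'<>]*)', re.IGNORECASE),
]


def extract_pdf_url(html: str, domain: str) -> Optional[str]:
    """Return an absolute PDF URL from the first matching pattern, else None."""
    for pattern, flags in PDF_PATTERNS:
        m = re.search(pattern, html, flags)
        if m:
            link = m.group(1)
            break
    else:
        return None  # no strategy matched
    if link.startswith("//"):
        return "https:" + link  # protocol-relative URL
    if link.startswith("/"):
        return f"https://{domain}{link}"  # root-relative path on the mirror
    return link
```

Keeping the patterns in one ordered list makes the priority explicit and lets a new mirror layout be supported by adding a single entry.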
@@ -448,31 +486,54 @@ def show_status():
         return
 
     all_entries = list(manifest.values())
-    found = [r for r in all_entries if r.get("pdf_url")]
-    arxiv = [r for r in all_entries if r.get("arxiv_id")]
-    oa_only = [r for r in found if not r.get("arxiv_id")]
-    missing = [r for r in all_entries if not r.get("pdf_url")]
 
+    # What's actually on disk (the ground truth)
     pdfs = list(OUTPUT_DIR.glob("*.pdf")) if OUTPUT_DIR.exists() else []
-    pdf_keys = {p.stem for p in pdfs}
-    total_size = sum(p.stat().st_size for p in pdfs)
-
-    print(f"=== MANIFEST ===")
-    print(f"Total entries: {len(all_entries)}")
-    print(f"Arxiv: {len(arxiv)}")
-    print(f"OA (non-arxiv): {len(oa_only)}")
-    print(f"Not found: {len(missing)}")
-    print()
-    print(f"=== DOWNLOADS ===")
-    print(f"PDFs on disk: {len(pdfs)}")
-    print(f"Total size: {total_size / 1024 / 1024:.1f} MB")
-    print(f"Pending download: {len(found) - len(pdf_keys & {e['key'] for e in found})}")
-
-    if missing:
-        print(f"\n=== NOT FOUND ({len(missing)}) ===")
-        for r in sorted(missing, key=lambda r: r.get("year", "0")):
-            doi_str = f" doi:{r.get('doi','')}" if r.get("doi") else ""
-            print(f" {r['key']} ({r.get('year','?')}): {r.get('title','')[:55]}{doi_str}")
+    pdf_keys = {p.stem for p in pdfs if p.stat().st_size > 1000}
+    total_size = sum(p.stat().st_size for p in pdfs if p.stat().st_size > 1000)
+    all_keys = {e["key"] for e in all_entries}
+
+    # Truly missing = in manifest but no PDF on disk
+    truly_missing = []
+    for e in all_entries:
+        if e["key"] not in pdf_keys:
+            truly_missing.append(e)
+
+    # Categorize missing
+    textbooks = {"garey1979", "sipser2012", "cormen2022", "conway1967"}
+    missing_with_doi = [e for e in truly_missing if e.get("doi") and e["key"] not in textbooks]
+    missing_no_doi = [e for e in truly_missing if not e.get("doi") and e["key"] not in textbooks]
+    missing_textbooks = [e for e in truly_missing if e["key"] in textbooks]
+
+    print(f"=== COLLECTION ===")
+    print(f"Total in references.bib: {len(all_entries)}")
+    print(f"PDFs on disk: {len(pdf_keys)} ({total_size / 1024 / 1024:.1f} MB)")
+    print(f"Truly missing: {len(truly_missing)}")
+    print(f" With DOI (retry): {len(missing_with_doi)}")
+    print(f" No DOI (manual): {len(missing_no_doi)}")
+    print(f" Textbooks: {len(missing_textbooks)}")
+
+    # Remote status
+    if PAPERS_REMOTE:
+        print(f"\nRemote: {PAPERS_REMOTE}")
+    else:
+        print(f"\nRemote: not configured (set PAPERS_REMOTE)")
+
+    if truly_missing:
+        if missing_with_doi:
+            print(f"\n=== MISSING WITH DOI — retry with 'make papers-scihub' ({len(missing_with_doi)}) ===")
+            for r in sorted(missing_with_doi, key=lambda r: r.get("year", "0")):
+                print(f" {r['key']} ({r.get('year','?')}): {r.get('title','')[:55]} doi:{r['doi']}")
+
+        if missing_no_doi:
+            print(f"\n=== MISSING WITHOUT DOI — manual web search needed ({len(missing_no_doi)}) ===")
+            for r in sorted(missing_no_doi, key=lambda r: r.get("year", "0")):
+                print(f" {r['key']} ({r.get('year','?')}): {r.get('title','')[:60]}")
+
+        if missing_textbooks:
+            print(f"\n=== TEXTBOOKS — not downloadable as PDF ({len(missing_textbooks)}) ===")
+            for r in sorted(missing_textbooks, key=lambda r: r.get("year", "0")):
+                print(f" {r['key']} ({r.get('year','?')}): {r.get('title','')[:60]}")
 
 
 def _require_remote() -> str:
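
The reworked `show_status` hunk above treats PDFs on disk as ground truth and splits missing entries into three buckets. A minimal sketch of that categorization on in-memory data follows. The function name `categorize_missing`, the bucket names, and the sample entries (with placeholder DOIs) are hypothetical; the textbook keys and the bucket rules mirror the diff.

```python
# Textbook keys copied from the diff.
TEXTBOOKS = {"garey1979", "sipser2012", "cormen2022", "conway1967"}


def categorize_missing(entries: list[dict], pdf_keys: set[str]) -> dict[str, list[str]]:
    """Bucket entries that have no PDF on disk, following the rules in show_status."""
    buckets: dict[str, list[str]] = {"with_doi": [], "no_doi": [], "textbooks": []}
    for entry in entries:
        key = entry["key"]
        if key in pdf_keys:
            continue  # PDF present, nothing to report
        if key in TEXTBOOKS:
            buckets["textbooks"].append(key)
        elif entry.get("doi"):
            buckets["with_doi"].append(key)  # candidate for a Sci-Hub retry
        else:
            buckets["no_doi"].append(key)  # needs a manual web search
    return buckets


# Made-up manifest entries for illustration.
sample = [
    {"key": "garey1979"},
    {"key": "a2001", "doi": "10.0000/a"},
    {"key": "b1999"},
    {"key": "c2010", "doi": "10.0000/c"},
]
report = categorize_missing(sample, pdf_keys={"c2010"})
```

Checking the textbook set first means a textbook with a DOI is never queued for a Sci-Hub retry, which matches the `not in textbooks` filters in the diff.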
