|
| 1 | +--- |
| 2 | +name: "survey-repo-explorer" |
| 3 | +description: "Extracts categorized paper links from a survey README and builds a local knowledgebase workflow. Invoke when the user wants systematic literature exploration from a survey repo." |
| 4 | +--- |
| 5 | + |
| 6 | +# Survey Repo Explorer |
| 7 | + |
| 8 | +## Purpose |
| 9 | +This skill turns the paper list in a survey repository's README into a structured local knowledgebase for scientific investigation. It covers parsing, export, downloading, PDF reading optimization, and evidence tracking. |
| 10 | + |
| 11 | +## When To Invoke |
| 12 | +Invoke this skill when the user asks to: |
| 13 | +- mine paper links from a survey README |
| 14 | +- build category-based paper datasets locally |
| 15 | +- batch download papers and organize PDFs by topic |
| 16 | +- run objective literature tracking and reporting workflows |
| 17 | +- summarize paper distribution with a simple CLI |
| 18 | + |
| 19 | +## 0. Initialization |
| 20 | +Before any extraction or file generation, ask the user: |
| 21 | +1. Reading goal (scientific question or objective) |
| 22 | +2. Preferred categories |
| 23 | +3. Depth level (breadth scan vs deep reading) |
| 24 | +4. Output preferences |
| 25 | + |
| 26 | +Proceed only after capturing these preferences. |
| 27 | + |
| 28 | +## 1. Primary Workflow |
| 29 | +1. Parse the README by category headings. |
| 30 | +2. Extract metadata with emphasis on canonical paper links. |
| 31 | +3. Build the local knowledgebase directory. |
| 32 | +4. Export per-category metadata into CSV and TXT. |
| 33 | +5. Generate scripts under `tools/` for extraction, downloading, optimization, and statistics. |
| 34 | +6. Maintain `AGENTS.md` and `REPORT.md` for process and findings. |
| 35 | + |
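Step 1 of the workflow can be sketched as a line-by-line pass over the README. This is a minimal illustration, not the skill's actual `extract_from_readme.py`: the function name `parse_readme_by_category`, the choice to treat level 2 to 6 headings as categories, and the `uncategorized` fallback are all assumptions.

```python
import re
from collections import defaultdict

# Assumption: categories are markdown headings of level 2-6; the level-1 title is ignored.
HEADING_RE = re.compile(r"^(#{2,6})\s+(.*\S)\s*$")
# Plain inline markdown links: [label](http(s)://...)
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def parse_readme_by_category(markdown_text):
    """Group inline links under the most recent category heading."""
    categories = defaultdict(list)
    current = "uncategorized"  # hypothetical bucket for links before any heading
    in_fence = False
    for line in markdown_text.splitlines():
        # Skip fenced code blocks so example links are not mistaken for papers.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        heading = HEADING_RE.match(line)
        if heading:
            current = heading.group(2)
            continue
        for label, url in LINK_RE.findall(line):
            # URLs are stored exactly as written in the README.
            categories[current].append({"label": label, "url": url})
    return dict(categories)
```

Because the pass is line-based, the same loop can be fed a long README in pieces (for example, by iterating over an open file) without loading everything into one prompt or buffer.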
| 36 | +## Directory Contract |
| 37 | +Use this structure: |
| 38 | + |
| 39 | +```text |
| 40 | +./survey2knowledgebase/ |
| 41 | +  categories/ |
| 42 | +    <category>/ |
| 43 | +      csv/ |
| 44 | +        papers.csv |
| 45 | +      txt/ |
| 46 | +        papers.txt |
| 47 | +      pdf/ |
| 48 | +        <year>-<short_name>.pdf |
| 49 | +  tools/ |
| 50 | +    extract_from_readme.py |
| 51 | +    download_pdfs.py |
| 52 | +    pdf_optimize.py |
| 53 | +    stats_cli.py |
| 54 | +  AGENTS.md |
| 55 | +  REPORT.md |
| 56 | +``` |
| 57 | + |
| 58 | +## Execution Boundary |
| 59 | +- Generate and save required files and scripts in the workspace. |
| 60 | +- Provide runnable commands for the user to execute. |
| 61 | +- Do not auto-run downloading scripts unless the user explicitly asks for execution. |
| 62 | + |
| 63 | +## Data Export Rules |
| 64 | +- One CSV and one TXT per category. |
| 65 | +- Store files inside each category directory, separated by file type (`csv/`, `txt/`, `pdf/`). |
| 66 | +- Required fields: `short_name`, `title`, `category`, `year`, `month`, `venue`, `paper_url`, `source_type`, `code_url`, `data_url`. |
| 67 | +- `short_name` rule: `FirstAuthorLastName_FirstImportantTitleWord` (example: `Vaswani_Attention`). |
| 68 | +- Preserve original links exactly. |
| 69 | +- If multiple paper links exist, choose the strongest canonical paper link in this order: DOI > ACL > OpenReview > arXiv > Website. |
| 70 | +- Use chunked parsing for long README files to avoid truncation and extraction errors. |
| 71 | + |
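The `short_name` rule and the canonical-link priority above could be implemented roughly as follows. This is a sketch under stated assumptions: the host-to-source mapping, the stopword list used to find the "first important" title word, and the function names are illustrative and not defined by the skill. The input to `pick_canonical` is assumed to contain only candidate paper links, with code and data links already separated out.

```python
import re
from urllib.parse import urlparse

# Priority taken from the rule above; earlier entries win.
SOURCE_PRIORITY = ["doi", "acl", "openreview", "arxiv", "website"]

# Assumed mapping from host name to source type (not specified by the skill).
HOST_TO_SOURCE = {
    "doi.org": "doi",
    "dx.doi.org": "doi",
    "aclanthology.org": "acl",
    "openreview.net": "openreview",
    "arxiv.org": "arxiv",
}

# Assumed stopwords skipped when looking for the first important title word.
STOPWORDS = {"a", "an", "the", "on", "of", "for", "in", "to", "and", "with", "towards"}


def classify_source(url):
    """Map a URL to one of the SOURCE_PRIORITY labels by host name."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return HOST_TO_SOURCE.get(host, "website")


def pick_canonical(urls):
    """Return (url, source_type) for the highest-priority link, URL unchanged."""
    if not urls:
        return None
    best = min(urls, key=lambda u: SOURCE_PRIORITY.index(classify_source(u)))
    return best, classify_source(best)


def make_short_name(first_author, title):
    """Build FirstAuthorLastName_FirstImportantTitleWord."""
    last_name = (first_author.split() or ["Unknown"])[-1]
    words = re.findall(r"[A-Za-z0-9]+", title)
    important = next((w for w in words if w.lower() not in STOPWORDS), "Untitled")
    return f"{last_name}_{important[0].upper()}{important[1:]}"
```

For instance, `make_short_name("Ashish Vaswani", "Attention Is All You Need")` yields `Vaswani_Attention`, matching the example in the rule. The returned `source_type` can be written directly into the `source_type` column.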
| 72 | +## PDF Download Rules |
| 73 | +- Download PDFs into `./survey2knowledgebase/categories/<category>/pdf/`. |
| 74 | +- Keep deterministic filenames: `<year>-<short_name>.pdf`. |
| 75 | +- Skip existing files unless `--force` is set. |
| 76 | +- Use randomized rate limiting between requests (`time.sleep` with a random delay in the 2–5 s range). |
| 77 | +- Use retry and error handling for network failures and 403/404 responses. |
| 78 | +- Save failures into `./survey2knowledgebase/tools/download_errors.log`. |
| 79 | + |
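The download rules above might translate into something like the following standard-library sketch. It is not the skill's actual `download_pdfs.py`: the function names, the `fetch` and `sleep` parameters (added so the logic can be exercised without network access), the tab-separated log format, and the decision to treat 403/404 as non-retryable are all assumptions.

```python
import random
import time
import urllib.error
import urllib.request
from pathlib import Path

BASE = Path("./survey2knowledgebase")
ERROR_LOG = BASE / "tools" / "download_errors.log"


def pdf_path(category, year, short_name, base=BASE):
    """Deterministic target: categories/<category>/pdf/<year>-<short_name>.pdf."""
    return Path(base) / "categories" / category / "pdf" / f"{year}-{short_name}.pdf"


def fetch_bytes(url, timeout=30):
    """Plain GET; urlopen raises urllib.error.HTTPError on 403/404."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_pdf(url, dest, force=False, retries=3, fetch=fetch_bytes,
                 sleep=time.sleep, error_log=ERROR_LOG):
    """Return 'skipped', 'ok', or 'failed'; failures are appended to error_log."""
    dest = Path(dest)
    if dest.exists() and not force:
        return "skipped"  # mirrors the --force rule
    last_error = None
    for _attempt in range(retries):
        sleep(random.uniform(2, 5))  # randomized rate limit before each request
        try:
            data = fetch(url)
        except urllib.error.HTTPError as exc:
            last_error = f"HTTP {exc.code}"
            if exc.code in (403, 404):
                break  # assumption: these are not transient, so do not retry
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = repr(exc)  # network failure; retry on the next pass
        else:
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(data)
            return "ok"
    error_log = Path(error_log)
    error_log.parent.mkdir(parents=True, exist_ok=True)
    with error_log.open("a", encoding="utf-8") as handle:
        handle.write(f"{url}\t{dest}\t{last_error}\n")
    return "failed"
```

In line with the execution boundary, a generated script built on this sketch would be handed to the user as a command to run rather than executed automatically.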
| 80 | +## PDF Reading Optimization Tools |
| 81 | +Generate utilities for: |
| 82 | +- text extraction to markdown |
| 83 | +- section split by headings |
| 84 | +- chunking for long-context agent reading |
| 85 | +- keyword indexing for quick retrieval |
| 86 | +- prefer robust libraries such as PyMuPDF or pdfplumber when available in the environment |
| 87 | + |
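Once a PDF has been converted to markdown text (for example with PyMuPDF or pdfplumber, which this sketch deliberately does not call), the section split, chunking, and keyword indexing utilities could look like this. Function names, the `preamble` label, and the default chunk sizes are assumptions chosen for illustration.

```python
import re

HEADING_RE = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def split_sections(markdown_text):
    """Split markdown into (heading, body) pairs; leading text is labeled 'preamble'."""
    sections = []
    heading, lines = "preamble", []
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the previous section (skip an empty preamble).
            if lines or heading != "preamble":
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(1), []
        else:
            lines.append(line)
    sections.append((heading, "\n".join(lines).strip()))
    return sections


def chunk_text(text, max_chars=2000, overlap=200):
    """Cut text into character windows of max_chars that overlap by `overlap`."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            return chunks
        start += max_chars - overlap


def build_keyword_index(chunks, keywords):
    """Map each keyword to the positions of chunks containing it (case-insensitive)."""
    index = {keyword: [] for keyword in keywords}
    for position, chunk in enumerate(chunks):
        lowered = chunk.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                index[keyword].append(position)
    return index
```

Character-based windows are the simplest choice here; a token-based or sentence-aware splitter may suit long-context agent reading better, at the cost of an extra dependency.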
| 88 | +## Research Records |
| 89 | +`AGENTS.md` must record: |
| 90 | +- objective, scope, extraction source, steps taken, and toolchain decisions |
| 91 | +- reproducibility details (commands, paths, parameters) |
| 92 | + |
| 93 | +`REPORT.md` must record: |
| 94 | +- key scientific findings from user dialogues and extracted papers |
| 95 | +- objective statements with evidence |
| 96 | +- each claim linked to paper title and local file path |
| 97 | + |
| 98 | +## Citation Requirements |
| 99 | +- Always cite paper title and local file link together. |
| 100 | +- Keep tone objective, falsifiable, and evidence-based. |
| 101 | +- Separate observation from hypothesis. |