Commit 74063e1

Author: Nora
feat: add survey-repo-explorer skill
- Extract categorized paper links from survey READMEs
- Auto-generate local knowledgebase directory structure
- Create Python toolchain for batch PDF downloading and text extraction
- Enforce strict metadata rules and rate-limiting for arXiv
1 parent e23525c commit 74063e1

3 files changed: 109 additions, 0 deletions

README.md

Lines changed: 5 additions & 0 deletions
@@ -87,6 +87,8 @@ Based on a systematic review of **204 papers and online resources**, this survey
- 🤝 **[Contributing](#-contributing)**: How to contribute to this project
<!-- END EXPLORE -->

+> 🧭 **Survey Exploration Skill**: We provide a local skill at **`./skill/survey-repo-explorer`** to extract categorized paper links from survey READMEs and build a structured knowledgebase under **`./survey2knowledgebase`** for agent-assisted literature exploration.
+
> 🔎 **Browse & Export**: The full paper database is searchable and exportable at **[deepsoftwareanalytics.github.io/Awesome-Issue-Resolution/admin/](https://deepsoftwareanalytics.github.io/Awesome-Issue-Resolution/admin/)** — filter by category, date, or keyword, and export results as CSV.
<!-- START DEMO -->
@@ -560,6 +562,9 @@ python app.py
python app.py --init
```

+Survey exploration skill is available at **`./skill/survey-repo-explorer`**.
+It can work with your local agent to build a categorized literature knowledgebase under **`./survey2knowledgebase`**.
+
Open **http://localhost:5000/admin** to manage papers, datasets, and methods.

| Command | Description |

app/view/sync_readme.py

Lines changed: 3 additions & 0 deletions
@@ -373,6 +373,9 @@ def generate_usage_section() -> str:
python app.py --init
```

+Survey exploration skill is available at **`./skill/survey-repo-explorer`**.
+It can work with your local agent to build a categorized literature knowledgebase under **`./survey2knowledgebase`**.
+
Open **http://localhost:5000/admin** to manage papers, datasets, and methods.
| Command | Description |
Lines changed: 101 additions & 0 deletions
@@ -0,0 +1,101 @@
---
name: "survey-repo-explorer"
description: "Extracts categorized paper links from survey README and builds a local knowledgebase workflow. Invoke when user wants systematic literature exploration from a survey repo."
---

# Survey Repo Explorer

## Purpose
This skill turns a survey repository README paper list into a structured local knowledgebase for scientific investigation, including parsing, export, downloading, reading optimization, and evidence tracking.

## When To Invoke
Invoke this skill when the user asks to:
- mine paper links from a survey README
- build category-based paper datasets locally
- batch download papers and organize PDFs by topic
- run objective literature tracking and reporting workflows
- summarize paper distribution with a simple CLI

## 0. Initialization
Before any extraction or file generation, ask the user:
1. Reading goal (scientific question or objective)
2. Preferred categories
3. Depth level (breadth scan vs deep reading)
4. Output preferences

Proceed only after capturing these preferences.

## 1. Primary Workflow
1. Parse README by category headings.
2. Extract metadata with emphasis on canonical paper links.
3. Build the local knowledgebase directory.
4. Export per-category metadata into CSV and TXT.
5. Generate scripts under `tools/` for extraction, downloading, optimization, and statistics.
6. Maintain `AGENTS.md` and `REPORT.md` for process and findings.
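
As a rough illustration of workflow steps 1 and 2, the sketch below groups Markdown links under the nearest preceding heading. It is not the script the skill generates; the function name `parse_readme_by_category`, the choice of level 2 to 6 headings as categories, and the simple link pattern (which does not handle URLs containing parentheses) are all assumptions made for this example.

```python
import re

# Assumption: categories are Markdown headings of level 2 through 6.
HEADING_RE = re.compile(r"^(#{2,6})\s+(.*\S)\s*$")
# Assumption: papers appear as inline links of the form [label](http...).
LINK_RE = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def parse_readme_by_category(text):
    """Return {heading: [{"label": ..., "url": ...}, ...]} for a README string."""
    categories = {}
    current = None
    in_fence = False
    for line in text.splitlines():
        # Ignore anything inside fenced code blocks so "## ..." in code is not a heading.
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING_RE.match(line)
        if match:
            current = match.group(2)
            categories.setdefault(current, [])
            continue
        if current is not None:
            for label, url in LINK_RE.findall(line):
                # URLs are stored exactly as written in the README.
                categories[current].append({"label": label, "url": url})
    return categories
```

Because the function walks the text line by line, a long README can be fed in pieces as long as each piece starts at a heading boundary.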

## Directory Contract
Use this structure:

```text
./survey2knowledgebase/
  categories/
    <category>/
      csv/
        papers.csv
      txt/
        papers.txt
      pdf/
        <year>-<short_name>.pdf
  tools/
    extract_from_readme.py
    download_pdfs.py
    pdf_optimize.py
    stats_cli.py
  AGENTS.md
  REPORT.md
```
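
One possible way to create this layout is sketched below with `pathlib`. The helper names `slugify` and `build_skeleton` are hypothetical, the rule for turning a heading into a directory name is an assumption, and placing `AGENTS.md` and `REPORT.md` at the knowledgebase root is one reading of the structure above.

```python
import re
from pathlib import Path


def slugify(name):
    """Turn a README heading into a filesystem-safe directory name (assumed rule)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", name).strip("-").lower()
    return slug or "uncategorized"


def build_skeleton(root, categories):
    """Create the category, tools, and record files; safe to call repeatedly."""
    root = Path(root)
    for category in categories:
        for sub in ("csv", "txt", "pdf"):
            (root / "categories" / slugify(category) / sub).mkdir(parents=True, exist_ok=True)
    (root / "tools").mkdir(parents=True, exist_ok=True)
    # Assumption: the two record files live at the knowledgebase root.
    for name in ("AGENTS.md", "REPORT.md"):
        (root / name).touch(exist_ok=True)
    return root
```

Using `exist_ok=True` throughout means a second run leaves existing files and directories untouched.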

## Execution Boundary
- Generate and save required files and scripts in the workspace.
- Provide runnable commands for the user to execute.
- Do not auto-run downloading scripts unless the user explicitly asks for execution.

## Data Export Rules
- One CSV and one TXT per category.
- Store files inside each category directory, separated by file type (`csv/`, `txt/`, `pdf/`).
- Required fields: `short_name`, `title`, `category`, `year`, `month`, `venue`, `paper_url`, `source_type`, `code_url`, `data_url`.
- `short_name` rule: `FirstAuthorLastName_FirstImportantTitleWord` (example: `Vaswani_Attention`).
- Preserve original links exactly.
- If multiple paper links exist, choose the strongest canonical paper link in this order: DOI > ACL > OpenReview > arXiv > Website.
- Use chunked parsing for long README files to avoid truncation and extraction errors.
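
The `short_name` and canonical-link rules above could be implemented roughly as follows. This is a sketch under stated assumptions: the skill does not define what counts as an "important" title word, so a small stopword list is used here; the host names used to classify links are guesses at typical URLs; and all function names are hypothetical.

```python
import csv
import re
from pathlib import Path
from urllib.parse import urlparse

FIELDS = ["short_name", "title", "category", "year", "month", "venue",
          "paper_url", "source_type", "code_url", "data_url"]
PRIORITY = ["DOI", "ACL", "OpenReview", "arXiv", "Website"]
# Assumption: an "important" word is the first title word not in this list.
STOPWORDS = {"a", "an", "the", "on", "of", "for", "in", "to", "and", "is", "are", "with"}


def classify_source(url):
    """Map a URL to one of the PRIORITY labels based on its host (assumed hosts)."""
    host = (urlparse(url).hostname or "").lower()
    if host in ("doi.org", "dx.doi.org"):
        return "DOI"
    if host == "aclanthology.org" or host.endswith(".aclanthology.org"):
        return "ACL"
    if host == "openreview.net":
        return "OpenReview"
    if host == "arxiv.org" or host.endswith(".arxiv.org"):
        return "arXiv"
    return "Website"


def pick_canonical(urls):
    """Return the highest-priority URL unchanged; ties keep README order."""
    return min(urls, key=lambda u: PRIORITY.index(classify_source(u)))


def make_short_name(first_author, title):
    """Build FirstAuthorLastName_FirstImportantTitleWord."""
    parts = first_author.split()
    last = parts[-1] if parts else "Unknown"
    words = re.findall(r"[A-Za-z0-9]+", title)
    fallback = words[0] if words else "Untitled"
    word = next((w for w in words if w.lower() not in STOPWORDS), fallback)
    return f"{last}_{word[0].upper()}{word[1:]}"


def write_category_csv(path, rows):
    """Write one papers.csv with the required columns; missing fields stay empty."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, restval="")
        writer.writeheader()
        writer.writerows(rows)
```

For the example in the rules, `make_short_name("Ashish Vaswani", "Attention Is All You Need")` yields `Vaswani_Attention`.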

## PDF Download Rules
- Download PDFs into `./survey2knowledgebase/categories/<category>/pdf/`.
- Keep deterministic filenames: `<year>-<short_name>.pdf`.
- Skip existing files unless `--force` is set.
- Use randomized rate limiting between requests (`time.sleep` in 2–5s range).
- Use retry and error handling for network failures and 403/404 responses.
- Save failures into `./survey2knowledgebase/tools/download_errors.log`.
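
A minimal sketch of a downloader that follows these rules is shown below, using only the standard library. It is illustrative rather than the generated `download_pdfs.py`: the function names, the `User-Agent` string, the tab-separated log format, and the decision to stop retrying after a 403 or 404 are assumptions. The `fetch` and `sleep` parameters exist so the logic can be exercised without network access.

```python
import random
import time
import urllib.error
import urllib.request
from pathlib import Path


def fetch_bytes(url, timeout=30):
    """Fetch a URL and return the response body (hypothetical User-Agent value)."""
    req = urllib.request.Request(url, headers={"User-Agent": "survey-repo-explorer-sketch"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


def download_pdf(url, dest, *, force=False, retries=3,
                 fetch=fetch_bytes, sleep=time.sleep, error_log=None):
    """Return "skipped", "downloaded", or "failed" for a single paper."""
    dest = Path(dest)
    if dest.exists() and not force:
        return "skipped"
    last_error = None
    for _ in range(retries):
        sleep(random.uniform(2, 5))  # randomized 2-5 s pause before every request
        try:
            data = fetch(url)
        except urllib.error.HTTPError as exc:
            last_error = f"HTTP {exc.code}"
            if exc.code in (403, 404):
                break  # assumption: these will not succeed on retry
        except OSError as exc:  # URLError and timeouts are OSError subclasses
            last_error = repr(exc)
        else:
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(data)
            return "downloaded"
    if error_log is not None:
        log_path = Path(error_log)
        log_path.parent.mkdir(parents=True, exist_ok=True)
        with log_path.open("a", encoding="utf-8") as fh:
            fh.write(f"{url}\t{dest}\t{last_error}\n")
    return "failed"
```

A `--force` command-line flag would map onto the `force` argument, and `error_log` would point at `./survey2knowledgebase/tools/download_errors.log`.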

## PDF Reading Optimization Tools
Generate utilities for:
- text extraction to markdown
- section split by headings
- chunking for long-context agent reading
- keyword indexing for quick retrieval
- prefer robust libraries such as PyMuPDF or pdfplumber when available in the environment
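
The section splitting, chunking, and keyword indexing utilities might look like the sketch below. It assumes the PDF text has already been extracted to Markdown by some other tool, so no PDF library is called here; the function names and the default chunk sizes are hypothetical, and headings inside fenced code blocks are not treated specially.

```python
import re
from collections import defaultdict

HEADING_RE = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def split_sections(markdown_text):
    """Split text into (heading, body) pairs; text before any heading is "Preamble"."""
    sections = []
    title, lines = "Preamble", []
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(1), []
        else:
            lines.append(line)
    sections.append((title, "\n".join(lines).strip()))
    # Drop the preamble when nothing preceded the first heading.
    return [(t, body) for t, body in sections if body or t != "Preamble"]


def chunk_text(text, max_chars=2000, overlap=200):
    """Cut text into overlapping character windows for long-context reading."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")
    if not text:
        return []
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks


def build_keyword_index(chunks, keywords):
    """Map each keyword to the indices of chunks that contain it (case-insensitive)."""
    index = defaultdict(list)
    for i, chunk in enumerate(chunks):
        lowered = chunk.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                index[keyword].append(i)
    return dict(index)
```

Character-based windows are the simplest choice; a token-based splitter would fit a specific model's context limit more closely.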

## Research Records
`AGENTS.md` must record:
- objective, scope, extraction source, steps taken, and toolchain decisions
- reproducibility details (commands, paths, parameters)

`REPORT.md` must record:
- key scientific findings from user dialogues and extracted papers
- objective statements with evidence
- each claim linked to paper title and local file path

## Citation Requirements
- Always cite paper title and local file link together.
- Keep tone objective, falsifiable, and evidence-based.
- Separate observation from hypothesis.
