A GitHub CI/CD pipeline that automatically fetches the latest scientific papers from arXiv.org, stores their PDFs, and saves metadata in CSV files — all version‑controlled inside a Git repository.
Researchers often need offline access to the newest literature, especially in regions with unreliable internet. This project was born to help Iranian scholars cope with recurrent internet blackouts by caching articles while connectivity is available. Once downloaded, the papers live in the repository and remain accessible even when the network is down. As I write this README, the government has blocked internet access for 69 days.
🤖 This entire project was vibe‑coded by DeepSeek, the AI assistant, with no human hand in its creation.
- Fetches the top 15 newest papers for each configured arXiv category
- Downloads full PDFs and saves them in a clean directory structure
- Creates a CSV metadata file (latest_articles.csv) with title, first author, DOI, and category
- Runs automatically every 1st and 16th of the month (approximately every 15 days) via GitHub Actions
- Manual trigger supported for on‑demand updates
- No API keys or authentication needed — arXiv’s public API is free
- Built‑in repository size management with an optional history‑squash workflow
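As a rough illustration of the fetch step listed above, the sketch below builds an arXiv API query for the newest papers in one category and extracts the fields that end up in the CSV. It is not the code in fetch_articles.py: the function names `build_query_url` and `parse_feed` and the dictionary keys are hypothetical, and only the standard library is assumed.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"
ARXIV_NS = "{http://arxiv.org/schemas/atom}"


def build_query_url(category: str, max_results: int = 15) -> str:
    """Return a query URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_feed(xml_text: str, category: str) -> list[dict]:
    """Extract id, title, first author, and DOI from an Atom feed."""
    root = ET.fromstring(xml_text)
    papers = []
    for entry in root.iter(f"{ATOM}entry"):
        abs_url = entry.findtext(f"{ATOM}id", default="").strip()
        arxiv_id = abs_url.rsplit("/abs/", 1)[-1]
        # Titles in the feed may wrap across lines; collapse the whitespace.
        title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
        first_author = entry.findtext(
            f"{ATOM}author/{ATOM}name", default=""
        ).strip()
        # Many entries have no DOI, so this is often an empty string.
        doi = entry.findtext(f"{ARXIV_NS}doi", default="").strip()
        papers.append(
            {
                "arxiv_id": arxiv_id,
                "title": title,
                "first_author": first_author,
                "doi": doi,
                "category": category,
                "pdf_url": f"https://arxiv.org/pdf/{arxiv_id}",
            }
        )
    return papers
```

In a real run the feed text would come from something like `urllib.request.urlopen(build_query_url("cs.AI")).read().decode("utf-8")`, ideally with a short pause between categories to stay polite to the arXiv servers.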
- Fork this repository to your own GitHub account
- Go to Actions → enable workflows
- Edit fetch_articles.py and adjust the CATEGORIES set to your interests:
  CATEGORIES = {"cs.AI", "cs.CL", "quant-ph"}
- (Optional) Change the schedule in .github/workflows/articles-bot.yml if you need a different frequency
- The bot will run automatically on the 1st and 16th, or you can manually trigger it from the Actions tab
.
├── .github/workflows/
│   ├── articles-bot.yml       # Main scheduled workflow
│   └── squash-history.yml     # Manual history clean‑up
├── articles/
│   └── <category>/            # e.g., cs/AI/
│       ├── <arxiv_id>.pdf     # Downloaded papers
│       └── latest_articles.csv
├── fetch_articles.py          # The core bot script
└── README.md

Simply edit the CATEGORIES set in fetch_articles.py. Each entry must be a valid arXiv category (e.g., "cs.LG", "math.NA", "physics.optics"). A complete taxonomy list is available at arXiv.org.
You can also change the number of articles fetched by modifying max_results=15 inside the script.
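To make the link between the configured categories and the on-disk layout concrete, here is a minimal sketch that maps a category to its directory and writes latest_articles.csv there. It assumes a dot in the category name becomes a directory separator, as the `cs/AI/` example in the tree suggests. The helper names `category_dir` and `write_metadata` and the column names are hypothetical, and the actual script may differ.

```python
import csv
from pathlib import Path

CATEGORIES = {"cs.AI", "cs.CL", "quant-ph"}
# Assumed column names; the source only says the CSV holds
# title, first author, DOI, and category.
FIELDNAMES = ["title", "first_author", "doi", "category"]


def category_dir(root: Path, category: str) -> Path:
    """Map "cs.AI" to root/cs/AI and "quant-ph" to root/quant-ph."""
    return root.joinpath(*category.split("."))


def write_metadata(root: Path, category: str, papers: list[dict]) -> Path:
    """Write one CSV row per paper into the category's directory."""
    target = category_dir(root, category)
    target.mkdir(parents=True, exist_ok=True)
    csv_path = target / "latest_articles.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        # Keys outside FIELDNAMES (for example a PDF URL) are skipped.
        writer = csv.DictWriter(
            handle, fieldnames=FIELDNAMES, extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(papers)
    return csv_path
```

Calling `write_metadata(Path("articles"), "cs.AI", papers)` would overwrite the previous CSV for that category, which matches the "latest" in the file name.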
Because every run adds new PDFs, the repo can grow quickly. A manual squash workflow is included:
- Go to Actions → Squash Article History → Run workflow
- Type YES to confirm
This rewrites the main branch, keeping only the latest version of all articles while discarding old commits.
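One way to judge when a squash is worthwhile is to measure how much disk space the Git history occupies in a local clone. The sketch below is not part of the repository: the helper names and the 500 MB threshold are hypothetical, chosen only to illustrate the idea.

```python
from pathlib import Path


def directory_size_mb(root: Path) -> float:
    """Return the total size of all files under root, in mebibytes."""
    total = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    return total / (1024 * 1024)


def should_squash(git_dir: Path, limit_mb: float = 500.0) -> bool:
    """Suggest a squash once the history exceeds an arbitrary limit."""
    return directory_size_mb(git_dir) > limit_mb
```

For example, `should_squash(Path(".git"))` run from the root of a clone gives a quick yes or no before triggering the workflow manually.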
- All code in this repository was generated by DeepSeek, a large language model, through an iterative vibe‑coding process.
- Thank you to arXiv for providing an amazing open‑access resource and API.
GPL-3.0: do whatever you want within the license terms, just keep the papers free.