EhsanAramide/arxiv-feedreader-bot

arXiv Article Fetcher Bot

A GitHub CI/CD pipeline that automatically fetches the latest scientific papers from arXiv.org, stores their PDFs, and saves metadata in CSV files — all version‑controlled inside a Git repository.

🎯 Why this exists

Researchers often need offline access to the newest literature, especially in regions with unreliable internet. This project was born to help Iranian scholars cope with recurrent internet blackouts by caching articles while connectivity is available. Once downloaded, the papers live in the repository, accessible even when the network is down. As I write this README, the government has blocked internet access for 69 days.

🤖 This entire project was vibe‑coded by DeepSeek, the AI assistant, with no human hand in its creation.

✨ Features

  • Fetches the top 15 newest papers for each configured arXiv category
  • Downloads full PDFs and saves them in a clean directory structure
  • Creates a CSV metadata file (latest_articles.csv) with title, first author, DOI, and category
  • Runs automatically on the 1st and 16th of each month (roughly every 15 days) via GitHub Actions
  • Manual trigger supported for on‑demand updates
  • No API keys or authentication needed — arXiv’s public API is free
  • Built‑in repository size management with an optional history‑squash workflow
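The fetch and metadata features above can be sketched roughly as follows. This is a minimal illustration, not the code in fetch_articles.py: the function names build_query_url and parse_entries are hypothetical, and the sketch assumes the bot queries arXiv's public Atom API sorted by submission date. It only builds the URL and parses a feed that has already been downloaded, so it runs without network access.

```python
import urllib.parse
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
# XML namespaces used in arXiv's Atom responses.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def build_query_url(category, max_results=15):
    """Build a query URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urllib.parse.urlencode(params)}"


def parse_entries(atom_xml):
    """Extract title, first author, DOI and primary category per entry."""
    root = ET.fromstring(atom_xml)
    rows = []
    for entry in root.findall("atom:entry", NS):
        # Titles may be wrapped across lines; collapse the whitespace.
        raw_title = entry.findtext("atom:title", default="", namespaces=NS)
        title = " ".join(raw_title.split())
        # findtext returns the first match, i.e. the first listed author.
        first_author = entry.findtext(
            "atom:author/atom:name", default="", namespaces=NS
        )
        # Many preprints have no DOI, so this may be an empty string.
        doi = entry.findtext("arxiv:doi", default="", namespaces=NS)
        primary = entry.find("arxiv:primary_category", NS)
        category = primary.get("term", "") if primary is not None else ""
        rows.append(
            {
                "title": title,
                "first_author": first_author,
                "doi": doi,
                "category": category,
            }
        )
    return rows
```

In a real run the URL would be fetched with any HTTP client and the response body passed to parse_entries. Many entries carry no DOI, so an empty DOI column is to be expected for recent preprints.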

🚀 Quick start

  1. Fork this repository to your own GitHub account
  2. Go to Actions → enable workflows
  3. Edit fetch_articles.py and adjust the CATEGORIES set to your interests:
   CATEGORIES = {"cs.AI", "cs.CL", "quant-ph"}
  4. (Optional) Change the schedule in .github/workflows/articles-bot.yml if you need a different frequency
  5. The bot will run automatically on the 1st and 16th, or you can manually trigger it from the Actions tab
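To see what a 1st-and-16th schedule means in practice, the small sketch below computes the next run date after a given day. It is purely illustrative and is not part of the bot; RUN_DAYS and next_run are hypothetical names, and the actual trigger time of day in articles-bot.yml is not assumed here.

```python
from datetime import date

RUN_DAYS = (1, 16)  # days of the month on which the bot is scheduled


def next_run(today):
    """Return the first scheduled date strictly after `today`."""
    for day in RUN_DAYS:
        if today.day < day:
            return today.replace(day=day)
    # On or past the 16th: roll over to the 1st of the next month.
    if today.month == 12:
        return date(today.year + 1, 1, 1)
    return date(today.year, today.month + 1, 1)
```

The gap from the 1st to the 16th is always 15 days, while the gap from the 16th to the next 1st varies between 13 and 16 days depending on the month's length, which is why the interval is only approximately 15 days.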

📁 Repository structure

.
├── .github/workflows/
│   ├── articles-bot.yml      # Main scheduled workflow
│   └── squash-history.yml    # Manual history clean‑up
├── articles/
│   └── <category>/           # e.g., cs/AI/
│       ├── <arxiv_id>.pdf    # Downloaded papers
│       └── latest_articles.csv
├── fetch_articles.py         # The core bot script
└── README.md
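The layout above could be produced by helpers along these lines. This is a sketch under assumptions: it supposes that a category such as cs.AI maps to the nested directory cs/AI (as the "e.g., cs/AI/" comment suggests), that arXiv IDs are new-style identifiers without a slash, and that the CSV columns match the four fields listed under Features. The names category_dir, pdf_path, metadata_csv, and FIELDS are hypothetical.

```python
import csv
import io
from pathlib import PurePosixPath

# Assumed column order for latest_articles.csv.
FIELDS = ["title", "first_author", "doi", "category"]


def category_dir(category, root="articles"):
    """Map 'cs.AI' to articles/cs/AI; 'quant-ph' stays a single directory."""
    return PurePosixPath(root, *category.split("."))


def pdf_path(category, arxiv_id, root="articles"):
    """Path of a downloaded paper inside its category directory."""
    return category_dir(category, root) / f"{arxiv_id}.pdf"


def metadata_csv(rows):
    """Render metadata rows (dicts keyed by FIELDS) as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Using the csv module rather than manual string joining matters here because paper titles frequently contain commas and quotes, which need to be escaped correctly.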

⚙️ Customization

Simply edit the CATEGORIES set in fetch_articles.py. Each entry must be a valid arXiv category (e.g., "cs.LG", "math.NA", "physics.optics"). A complete taxonomy list is available at arXiv.org.

You can also change the number of articles fetched by modifying max_results=15 inside the script.
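Before committing a customized category set, a quick shape check can catch typos. The sketch below is an assumption-laden helper, not something the bot is known to include: looks_like_category and CATEGORY_PATTERN are hypothetical names, and the pattern only checks that a string has the general form of an arXiv category (an archive, optionally followed by a dot and a subject class). It does not verify that the category actually exists in the arXiv taxonomy.

```python
import re

# Archive (lowercase, may contain hyphens), optionally ".subject-class".
CATEGORY_PATTERN = re.compile(r"^[a-z][a-z-]*(\.[A-Za-z][A-Za-z-]*)?$")


def looks_like_category(name):
    """Return True if `name` has the general shape of an arXiv category."""
    return CATEGORY_PATTERN.fullmatch(name) is not None


def malformed(categories):
    """Return the sorted entries that do not look like arXiv categories."""
    return sorted(c for c in categories if not looks_like_category(c))


CATEGORIES = {"cs.AI", "cs.CL", "quant-ph"}
bad = malformed(CATEGORIES)  # empty list when every entry is well-formed
```

A string that passes this check can still be rejected by arXiv if the category does not exist, so the taxonomy list remains the authoritative reference.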

📦 Repository size management

Because every run adds new PDFs, the repo can grow quickly. A manual squash workflow is included:

  • Go to Actions → Squash Article History → Run workflow
  • Type YES to confirm

This rewrites the main branch, keeping only the latest version of all articles while discarding old commits. ⚠️ It rewrites Git history — run it only when the repository becomes too large (recommended every few months).
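One common way to collapse a branch into a single commit is the orphan-branch technique, sketched below as a list of git commands. This is an assumption about how such a squash could work, not a description of what squash-history.yml actually runs; squash_commands, run_all, and the temporary branch name are hypothetical. The function only returns the commands, so nothing destructive happens unless they are explicitly executed.

```python
import subprocess


def squash_commands(branch="main", message="Squash article history"):
    """Return git commands that replace `branch` with one fresh commit."""
    tmp = "squashed-tmp"  # hypothetical temporary branch name
    return [
        # Start a branch with no parent commits, keeping the working tree.
        ["git", "checkout", "--orphan", tmp],
        ["git", "add", "-A"],
        ["git", "commit", "-m", message],
        # Drop the old branch and give its name to the new one.
        ["git", "branch", "-D", branch],
        ["git", "branch", "-m", branch],
        # Overwrite the remote branch; this discards the old history there.
        ["git", "push", "--force", "origin", branch],
    ]


def run_all(commands):
    """Execute the commands in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Printing the result of squash_commands() first gives a dry run to review. Anyone with an existing clone will need to re-clone or hard-reset after the force push, since their local history no longer matches the remote.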

🙏 Credits

  • All code in this repository was generated by DeepSeek, a large language model, through an iterative vibe‑coding process.
  • Thank you to arXiv for providing an amazing open‑access resource and API.

📜 License

GPL3 — do whatever you want, just keep the papers free.

