Python/Selenium scraper for personal X/Twitter archiving, with an interactive CLI and a Tauri desktop interface. It can collect public profile posts or authenticated bookmarks, then export results as JSON, Markdown, or DOCX.
- Profile scraping for public posts
- Bookmark scraping for the signed-in user's saved posts
- Count, last-N-days, and date-range scrape modes
- Best-effort expansion for long tweets and X Articles
- JSON, Markdown, and Word export
- Session cookie reuse in the desktop sidecar
- Pause, resume, cancel, and partial-result handling in supported flows
- Versioned JSON export schema (schema_version: "0.2")
- Safer output handling with sanitized filenames and per-target output folders
- Selector diagnostics and structured run logs (schema_version: "0.3")
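The three scrape modes (count, last-N-days, date range) can be pictured as filters over a list of posts. The sketch below is an illustration only, not the scraper's own logic: `select_tweets`, `parse_date`, and the mode names are hypothetical, and the only assumption taken from this README is that each post carries an ISO 8601 `date` string, as in the JSON export.

```python
from datetime import datetime, timedelta, timezone


def parse_date(tweet):
    # The export's "date" field is an ISO 8601 string with a UTC offset.
    return datetime.fromisoformat(tweet["date"])


def select_tweets(tweets, mode, count=None, days=None, start=None, end=None, now=None):
    """Hypothetical helper: apply one of the three modes to a list of posts."""
    if mode == "count":
        # Keep the first `count` posts in the order they were collected.
        return tweets[:count]
    if mode == "last_days":
        # Keep posts dated within the last `days` days of `now`.
        now = now or datetime.now(timezone.utc)
        cutoff = now - timedelta(days=days)
        return [t for t in tweets if parse_date(t) >= cutoff]
    if mode == "date_range":
        # Keep posts whose date falls inside the inclusive range.
        return [t for t in tweets if start <= parse_date(t) <= end]
    raise ValueError(f"unknown mode: {mode!r}")
```

Passing `now` explicitly keeps the last-N-days filter deterministic, which is convenient in browser-free tests.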
This project automates the X web UI through Selenium. X changes its DOM, labels, and login flows frequently, so selectors and long-form extraction can break without warning. Treat scraped data as best-effort and verify important exports manually.
This project does not bypass access controls, does not include credentials, and should only be used for educational or personal archiving workflows that you are authorized to perform.
git clone https://github.com/utkuvibing/twitter_scraper.git
cd twitter_scraper
pip install -r requirements.txt
Chrome must be installed. webdriver-manager downloads the matching ChromeDriver.
python main.py
The CLI prompts for:
- Login method
- Profile or bookmarks source
- Count, date range, or last-N-days mode
- JSON, Markdown, or DOCX export
Exports are written under output/<target>/ by default.
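As a rough illustration of how a target could be mapped to a per-target folder with a sanitized name, here is a minimal sketch. The helper names `safe_target_name` and `output_dir_for` and the exact character rules are assumptions for this example; the project's actual helpers in export_schema.py may behave differently.

```python
import re
from pathlib import Path


def safe_target_name(target: str) -> str:
    """Hypothetical helper: reduce a handle or label to a folder-safe name."""
    name = target.strip().lstrip("@")
    # Replace every run of characters outside letters, digits, "_", ".", "-".
    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", name)
    # Strip leading and trailing dots/underscores so ".." cannot survive.
    name = name.strip("._")
    return name or "unknown"


def output_dir_for(target: str, root: str = "output") -> Path:
    """Build the per-target folder path, e.g. output/<target>/."""
    return Path(root) / safe_target_name(target)
```

Because path separators are replaced before the path is joined, a hostile target such as `../../etc/passwd` stays inside the output root in this sketch.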
Run selector diagnostics without starting a scrape:
python main.py --diagnostics
The CLI opens Chrome, lets you navigate or log in, then checks the currently loaded page for core X selectors such as login fields, tweet articles, tweet text, status links, long-tweet controls, and article links. A diagnostics run writes a structured log under output/diagnostics/logs/.
Diagnostics only reports what is detectable on the current page. It does not guarantee a full scrape will succeed, and it does not bypass login, rate limits, private content, or platform restrictions.
npm install
npm run tauri dev
The desktop app uses python_sidecar/ for scraping and streams structured events to the React UI.
Every CLI scrape and sidecar scrape writes a JSON run log under:
output/<target>/logs/
Run logs include:
- scrape stage (login, profile_navigation, bookmarks_navigation, timeline_loading, tweet_parsing, full_text_extraction, article_extraction, export_saving)
- severity level
- failure reason codes such as login_failed, profile_navigation_failed, timeline_empty, tweet_parse_failed, full_text_failed, article_extraction_failed, or export_failed
- selector names where a selector was involved
- timing and diagnostic details
These logs are intended for debugging X DOM changes and incomplete runs. They do not contain credentials.
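When triaging an incomplete run, it can help to tally failure reason codes. The sketch below assumes a hypothetical log layout: a top-level `events` list whose entries have `stage`, `level`, and `reason` keys. Those key names are illustrative guesses, not the documented run-log schema, so adjust them to match a real log file.

```python
from collections import Counter


def summarize_failures(run_log: dict) -> Counter:
    """Count reason codes in a run log (hypothetical 'events'/'reason' keys)."""
    counts = Counter()
    for event in run_log.get("events", []):
        reason = event.get("reason")
        if reason:  # entries without a reason code are not failures
            counts[reason] += 1
    return counts


# Illustrative data only; the reason codes are taken from the list above.
example_log = {
    "schema_version": "0.3",
    "events": [
        {"stage": "login", "level": "info"},
        {"stage": "tweet_parsing", "level": "warning", "reason": "tweet_parse_failed"},
        {"stage": "tweet_parsing", "level": "warning", "reason": "tweet_parse_failed"},
        {"stage": "full_text_extraction", "level": "error", "reason": "full_text_failed"},
    ],
}
failure_counts = summarize_failures(example_log)
```

A repeated code such as tweet_parse_failed usually points at a selector that no longer matches the current X DOM.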
JSON exports use a stable top-level shape:
{
  "schema_version": "0.2",
  "source": "x.com",
  "scrape_type": "profile",
  "user": "@example",
  "target": "example",
  "exported_at": "2026-05-12T10:30:00+00:00",
  "total_tweets": 1,
  "tweets": [
    {
      "id": "1234567890",
      "text": "Tweet content",
      "date": "2026-05-12T10:30:00+00:00",
      "date_str": "May 12, 2026",
      "url": "https://x.com/example/status/1234567890",
      "tweet_url": "https://x.com/example/status/1234567890",
      "has_media": false,
      "media_urls": [],
      "has_article": false,
      "needs_full_text": false,
      "likes": 0,
      "retweets": 0,
      "replies": 0,
      "views": 0
    }
  ]
}
url is kept for compatibility; tweet_url is the explicit canonical field used by the app.
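A consumer of the export might sanity-check the top-level shape before using it. The following is a minimal sketch based only on the fields shown above; `check_export` and `canonical_url` are hypothetical names for this example and are not part of the project.

```python
from typing import List, Optional


def canonical_url(tweet: dict) -> Optional[str]:
    # Prefer the explicit tweet_url; fall back to the compatibility field url.
    return tweet.get("tweet_url") or tweet.get("url")


def check_export(doc: dict, expected_version: str = "0.2") -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if doc.get("schema_version") != expected_version:
        problems.append(f"unexpected schema_version: {doc.get('schema_version')!r}")
    tweets = doc.get("tweets", [])
    if doc.get("total_tweets") != len(tweets):
        problems.append("total_tweets does not match the number of tweets")
    for index, tweet in enumerate(tweets):
        if canonical_url(tweet) is None:
            problems.append(f"tweet {index} has neither tweet_url nor url")
    return problems
```

To use it, load a file with `json.load` and pass the resulting dict to `check_export`. Returning a list of problems rather than raising on the first one makes it easier to see everything wrong with a partial export at once.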
Run the local validation suite without opening a browser:
python -m unittest discover -s tests
python -m compileall main.py scraper.py document_generator.py export_schema.py diagnostics.py python_sidecar
For the desktop frontend:
npm run build
main.py                 Interactive Python CLI
scraper.py              CLI Selenium scraper
document_generator.py   JSON/Markdown/DOCX export writers
export_schema.py        Shared export schema and safe write helpers
diagnostics.py          Selector diagnostics and structured run logs
python_sidecar/         JSON-line scraper service used by Tauri
src/                    React frontend
src-tauri/              Tauri/Rust backend
tests/                  Browser-free validation tests
MIT