X Scraper

A Python/Selenium scraper for personal X/Twitter archiving, with an interactive CLI and a Tauri desktop interface. It collects public profile posts or the signed-in user's bookmarks, then exports the results as JSON, Markdown, or DOCX.

Current Capabilities

Profile scraping for public posts
Bookmark scraping for the signed-in user's saved posts
Count, last-N-days, and date-range scrape modes
Best-effort expansion for long tweets and X Articles
JSON, Markdown, and Word (DOCX) export
Session cookie reuse in the desktop sidecar
Pause, resume, cancel, and partial-result handling in supported flows
Versioned JSON export schema (schema_version: "0.2")
Safer output handling with sanitized filenames and per-target output folders
Selector diagnostics and structured run logs (schema_version: "0.3")

Important Limits

This project automates the X web UI through Selenium. X changes its DOM, labels, and login flows frequently, so selectors and long-form extraction can break without warning. Treat scraped data as best-effort and verify important exports manually.

This project does not bypass access controls, does not include credentials, and should only be used for educational or personal archiving workflows that you are authorized to perform.

Install

git clone https://github.com/utkuvibing/twitter_scraper.git
cd twitter_scraper
pip install -r requirements.txt

Chrome must be installed. webdriver-manager downloads the matching ChromeDriver.

CLI Usage

python main.py

The CLI prompts for:

Login method
Profile or bookmarks source
Count, date range, or last-N-days mode
JSON, Markdown, or DOCX export
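
As a rough illustration of how the three scrape modes differ, they can be thought of as filters over dated posts. The sketch below is not the project's code: the function name `filter_by_mode`, the mode strings, and the keyword parameters are hypothetical, and it assumes each post carries an ISO 8601 `date` string like the one in the JSON export.

```python
from datetime import datetime, timedelta, timezone


def filter_by_mode(tweets, mode, *, count=None, days=None,
                   start=None, end=None, now=None):
    """Hypothetical helper: narrow a newest-first list of tweet dicts by mode."""
    if mode == "count":
        # Keep the first `count` posts (all of them if count is None).
        return tweets[:count]
    if mode == "last_days":
        # Keep posts dated within the last `days` days before `now`.
        now = now or datetime.now(timezone.utc)
        cutoff = now - timedelta(days=days)
        return [t for t in tweets
                if datetime.fromisoformat(t["date"]) >= cutoff]
    if mode == "date_range":
        # Keep posts whose date falls inside [start, end], inclusive.
        return [t for t in tweets
                if start <= datetime.fromisoformat(t["date"]) <= end]
    raise ValueError(f"unknown mode: {mode!r}")
```

The comparisons assume timezone-aware datetimes on both sides; mixing naive and aware values raises a `TypeError` in Python.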

Exports are written under output/<target>/ by default.
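
One way to picture the sanitized, per-target layout is the minimal sketch below. It is an assumption about the general approach, not the contents of export_schema.py: `sanitize_name` and `export_path` are hypothetical names, and the allowed character set is deliberately conservative.

```python
import re
from pathlib import Path


def sanitize_name(name: str, fallback: str = "untitled") -> str:
    """Hypothetical helper: reduce a name to a conservative path component."""
    # Replace every run of characters outside A-Z, a-z, 0-9, ".", "_", "-".
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", name)
    # Trim leading/trailing dots and underscores so ".." cannot survive.
    cleaned = cleaned.strip("._")
    return cleaned or fallback


def export_path(target: str, filename: str, root: str = "output") -> Path:
    """Build <root>/<target>/<filename> from sanitized components."""
    return Path(root) / sanitize_name(target) / sanitize_name(filename)
```

For example, `export_path("@example", "tweets.json")` resolves to `output/example/tweets.json`, and a target such as `../../etc/passwd` collapses to a single folder name rather than escaping the output root.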

Selector Diagnostics

Run selector diagnostics without starting a scrape:

python main.py --diagnostics

The CLI opens Chrome, lets you navigate or log in, then checks the currently loaded page for core X selectors such as login fields, tweet articles, tweet text, status links, long-tweet controls, and article links. A diagnostics run writes a structured log under output/diagnostics/logs/.

Diagnostics report only what is detectable on the current page. They do not guarantee that a full scrape will succeed, and they do not bypass login, rate limits, private content, or platform restrictions.

Desktop Usage

npm install
npm run tauri dev

The desktop app uses python_sidecar/ for scraping and streams structured events to the React UI.

Run Logs

Every CLI scrape and sidecar scrape writes a JSON run log under:

output/<target>/logs/

Run logs include:

scrape stage (login, profile_navigation, bookmarks_navigation, timeline_loading, tweet_parsing, full_text_extraction, article_extraction, export_saving)
severity level
failure reason codes such as login_failed, profile_navigation_failed, timeline_empty, tweet_parse_failed, full_text_failed, article_extraction_failed, or export_failed
selector names where a selector was involved
timing and diagnostic details

These logs are intended for debugging X DOM changes and incomplete runs. They do not contain credentials.
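
When triaging an incomplete run, it can help to tally failures by stage and reason code. The sketch below assumes a log layout that this README does not specify: the top-level `events` list and the per-event `stage` and `reason` keys are hypothetical, so adjust them to match the actual log files.

```python
from collections import Counter


def summarize_run_log(log: dict) -> dict:
    """Hypothetical helper: count failure events by stage and reason code."""
    # Assumed layout: {"events": [{"stage": ..., "level": ..., "reason": ...}]}
    events = log.get("events", [])
    # Only events that carry a reason code are treated as failures.
    failures = [e for e in events if e.get("reason")]
    by_stage = Counter(e.get("stage", "unknown") for e in failures)
    by_reason = Counter(e["reason"] for e in failures)
    return {"by_stage": dict(by_stage), "by_reason": dict(by_reason)}
```

A run log loaded with `json.load` can be passed straight to `summarize_run_log` to see, for instance, whether failures cluster in `tweet_parsing` or `full_text_extraction`.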

JSON Export Schema

JSON exports use a stable top-level shape:

{
  "schema_version": "0.2",
  "source": "x.com",
  "scrape_type": "profile",
  "user": "@example",
  "target": "example",
  "exported_at": "2026-05-12T10:30:00+00:00",
  "total_tweets": 1,
  "tweets": [
    {
      "id": "1234567890",
      "text": "Tweet content",
      "date": "2026-05-12T10:30:00+00:00",
      "date_str": "May 12, 2026",
      "url": "https://x.com/example/status/1234567890",
      "tweet_url": "https://x.com/example/status/1234567890",
      "has_media": false,
      "media_urls": [],
      "has_article": false,
      "needs_full_text": false,
      "likes": 0,
      "retweets": 0,
      "replies": 0,
      "views": 0
    }
  ]
}

url is kept for compatibility; tweet_url is the explicit canonical field used by the app.
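
A downstream consumer might read an export along the following lines. This is a minimal sketch under stated assumptions, not part of the project: `load_export` and `SUPPORTED_SCHEMA` are hypothetical names, and rejecting any version other than "0.2" is one possible policy among several.

```python
import json

SUPPORTED_SCHEMA = "0.2"  # version shown in the example export above


def load_export(text: str) -> list[dict]:
    """Hypothetical helper: parse an export and normalize the URL field."""
    data = json.loads(text)
    version = data.get("schema_version")
    if version != SUPPORTED_SCHEMA:
        raise ValueError(f"unsupported schema_version: {version!r}")
    tweets = data.get("tweets", [])
    for tweet in tweets:
        # Prefer tweet_url; fall back to the compatibility field url.
        tweet["tweet_url"] = tweet.get("tweet_url") or tweet.get("url")
    return tweets
```

Checking `schema_version` first means a future format change fails loudly instead of silently yielding missing fields.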

Validation

Run the local validation suite without opening a browser:

python -m unittest discover -s tests
python -m compileall main.py scraper.py document_generator.py export_schema.py diagnostics.py python_sidecar

For the desktop frontend:

npm run build

Project Structure

main.py                 Interactive Python CLI
scraper.py              CLI Selenium scraper
document_generator.py   JSON/Markdown/DOCX export writers
export_schema.py        Shared export schema and safe write helpers
diagnostics.py          Selector diagnostics and structured run logs
python_sidecar/         JSON-line scraper service used by Tauri
src/                    React frontend
src-tauri/              Tauri/Rust backend
tests/                  Browser-free validation tests

License

MIT
