longchung90/SQL_Capstone_Project

🏛️ Congressional Twitter Intelligence

A decade of congressional tweets analyzed to build a data-driven lobbying targeting system.

Python · SQLite · Jupyter · scikit-learn · License: MIT

1,243,370 tweets · 548 members · 2008–2017


📊 Figures

Fig 1–2 — Top 20 Bigrams by Party
Fig 3 — Top 15 Bipartisan Bigrams
Fig 4 — Vocabulary Divergence Over Time
Fig 5 — Retweet Distribution: Senate vs House
Fig 6 — Sentiment vs Log Retweet Count
Fig 7 — OLS Regression Coefficients
Fig 8 — Top 20 Members by LLS
Fig 9 — Bipartisan Window Score Heatmap by State × Month

🗂️ Project Overview

This capstone project analyzes a decade of congressional Twitter activity to build a data-driven lobbying targeting system. The raw tweets carry no political metadata: party, chamber, and state are all missing. We recovered these fields through a three-step enrichment join against the @unitedstates legislators database, which matched 74.8% of members.
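The three steps of the enrichment join are not listed here, so the following is only a minimal sketch of one plausible cascade, using SQLite as the rest of the project does: match on Twitter handle first, fall back to the display name, and leave whatever remains unmatched. The table names (`accounts`, `legislators`), the column names, and the sample rows are all hypothetical and are not taken from the notebooks.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (screen_name TEXT, display_name TEXT);
CREATE TABLE legislators (
    twitter TEXT, full_name TEXT, party TEXT, chamber TEXT, state TEXT
);
INSERT INTO accounts VALUES
    ('SenExampleA', 'Example Senator A'),
    ('RepExampleB', 'Example Member B'),
    ('UnknownAcct', 'No Match Here');
INSERT INTO legislators VALUES
    ('senexamplea', 'Example Senator A', 'Democrat', 'sen', 'CA'),
    (NULL,          'Example Member B',  'Republican', 'rep', 'TX');
""")

# Step 1: join on the Twitter handle (case-insensitive).
# Step 2: only where step 1 found nothing, fall back to the display name.
# Step 3: accounts matched by neither step keep NULL metadata.
rows = conn.execute("""
SELECT a.screen_name,
       COALESCE(h.party,   n.party)   AS party,
       COALESCE(h.chamber, n.chamber) AS chamber,
       COALESCE(h.state,   n.state)   AS state
FROM accounts AS a
LEFT JOIN legislators AS h
       ON LOWER(h.twitter) = LOWER(a.screen_name)
LEFT JOIN legislators AS n
       ON h.twitter IS NULL
      AND LOWER(n.full_name) = LOWER(a.display_name)
ORDER BY a.screen_name
""").fetchall()

matched = sum(1 for row in rows if row[1] is not None)
match_rate = matched / len(rows)
print(f"matched {matched} of {len(rows)} accounts ({match_rate:.1%})")
```

Computing the match rate right after the join makes it easy to check a coverage figure like the one quoted above whenever the legislators database is refreshed.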


🚀 Quickstart

Note: The raw dataset is not included due to GitHub's file size limit. See Data below.

git clone https://github.com/longchung90/SQL_Capstone_Project.git
cd SQL_Capstone_Project
pip install -r requirements.txt
jupyter notebook notebooks/M3.ipynb

🛠️ Stack

| Tool | Purpose |
|---|---|
| SQLite | Primary database and storage |
| Python / pandas | Data wrangling and analysis |
| scikit-learn | TF-IDF vectorization, OLS regression |
| TextBlob | Sentiment scoring |
| matplotlib | Data visualization |
| scipy | Pearson correlation tests |
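For readers who want a feel for the bigram figures without installing the full stack, here is a simplified stand-in that uses only the standard library. It counts raw bigrams per party; it is not the scikit-learn TF-IDF pipeline used in the notebooks, it applies no stop-word filtering, and the sample tweets and party labels are invented for illustration.

```python
import re
from collections import Counter

def bigrams(text):
    """Lowercase the text, tokenize on letters/apostrophes, and pair adjacent tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return list(zip(tokens, tokens[1:]))

def top_bigrams_by_party(tweets, k=20):
    """Return the k most frequent bigrams for each party label.

    `tweets` is an iterable of (party, text) pairs.
    """
    counts = {}
    for party, text in tweets:
        counts.setdefault(party, Counter()).update(bigrams(text))
    return {party: counter.most_common(k) for party, counter in counts.items()}

# Invented sample data, not drawn from the dataset.
sample = [
    ("D", "Protect health care for working families"),
    ("D", "Health care is a right"),
    ("R", "Tax reform means tax cuts"),
    ("R", "Tax cuts help small business"),
]
top = top_bigrams_by_party(sample, k=1)
print(top)
```

Raw counts favor phrases that are common everywhere; TF-IDF weighting, as listed in the stack, down-weights those in favor of phrases distinctive to one group.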

📁 Structure

repo/
├── figures/                        # All saved charts & plots (Fig 1–9 + supplementary)
├── data/
│   ├── export/                     # CSV exports for Tableau
│   └── US_PoliticalTweets.tar.gz   # Raw dataset (not tracked — see Data section)
├── notebooks/
│   ├── M3.ipynb                    # TF-IDF, regression, custom metrics
│   ├── US_Political_Tweet_War_M2.ipynb  # Descriptive statistics
│   ├── Project_Proposal.ipynb      # Project proposal
│   └── Project_Proposal.pdf        # Proposal PDF export
├── presentation/
│   └── Lobbyists4America.html      # Final HTML presentation
├── sql/
│   └── SQL-Query.ipynb             # SQL queries notebook
├── .gitignore
└── README.md

💾 Data

The raw dataset (US_PoliticalTweets.tar.gz, 229 MB) is not included in this repo because it exceeds GitHub's file size limit.

Download it from: [link to original source or Google Drive]

Once downloaded, place it in the data/ folder and run notebooks/M3.ipynb from the top.
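It is not stated whether the notebook unpacks the archive itself. If it expects extracted files, a small helper along these lines could be used; this is a sketch, and the function name `extract_archive` and the destination folder are assumptions rather than part of the project.

```python
import tarfile
from pathlib import Path

def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    archive = Path(archive_path)
    if not archive.is_file():
        raise FileNotFoundError(f"Expected the raw dataset at {archive}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        # Only extract archives from a source you trust.
        tar.extractall(dest)
    return names

# Example call (paths follow the layout in the Structure section):
# extract_archive("data/US_PoliticalTweets.tar.gz", "data/")
```

Failing early with a clear message when the archive is missing is friendlier than a stack trace from deep inside the first notebook cell.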


⚠️ Caveats

  • Dataset covers 2008–2017 only, so it does not reflect later developments such as Twitter's 280-character limit, subsequent follower growth, or changes in political intensity
  • 24.4% of tweets could not be matched to a party (Independent members, data gaps)
  • Sentiment was scored with TextBlob on a random sample of 20K tweets, not the full corpus
  • LLS and BWS (Bipartisan Window Score) values should be re-validated against updated data before operational use
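Because the sentiment scores rest on a random sample, fixing the seed makes the sample reproducible across runs. The sketch below shows one way to do that against a SQLite table; the table name `tweets`, the `id` column, the seed value, and the toy rows are assumptions for illustration and may not match the notebooks.

```python
import random
import sqlite3

def sample_tweet_ids(conn, n, seed=42):
    """Return a sorted, reproducible random sample of up to n tweet ids."""
    # Read ids in a fixed order so the same seed always yields the same sample.
    ids = [row[0] for row in conn.execute("SELECT id FROM tweets ORDER BY id")]
    rng = random.Random(seed)
    return sorted(rng.sample(ids, min(n, len(ids))))

# Toy in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO tweets (text) VALUES (?)",
    [(f"tweet {i}",) for i in range(1000)],
)

sample = sample_tweet_ids(conn, 20)
print(len(sample), "ids sampled")
```

SQLite's `ORDER BY RANDOM()` cannot be seeded, which is why the sampling is done in Python here.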

About

Analyzing 1.24M U.S. congressional tweets (2008–2017) to build a lobbying targeting system, using TF-IDF, OLS regression, sentiment analysis, and two custom metrics (LLS + BWS), built in Python and SQLite.
