| 1 | +# Real-Time Bitcoin Sentiment Analysis with spaCy and Selenium |
| 2 | + |
| 3 | +**Author**: Siddhi Rohan |
| 4 | +**UID**: 121302823 |
| 5 | +**Course**: DATA605 — Spring 2025 |
| 6 | + |
| 7 | +## Project Overview and Goals |
| 8 | + |
| 9 | +This project, **TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium**, is a real-time sentiment analysis pipeline focused on Bitcoin-related tweets from X (Twitter). It leverages Selenium for web scraping, spaCy for natural language processing, VADER for sentiment analysis, and the CoinGecko API for Bitcoin price data, with results visualized in Jupyter notebooks. Developed for the DATA605 course in Spring 2025, the pipeline analyzes public sentiment and its correlation with Bitcoin price trends. |
| 10 | + |
| 11 | +### Goals |
| 12 | +- **Data Collection**: Scrape tweets containing keywords "Bitcoin" and "BTC" to capture public sentiment. |
| 13 | +- **Sentiment Analysis**: Use VADER to analyze tweet sentiment, categorizing tweets as positive, negative, or neutral. |
| 14 | +- **Correlation Analysis**: Compute multiple correlation measures (Pearson, Spearman, Kendall, lagged Pearson, and rolling) between sentiment scores and Bitcoin prices. |
| 15 | +- **Visualization**: Generate insightful visualizations, including: |
| 16 | + - Line plot of sentiment vs. Bitcoin price over time. |
| 17 | + - Box plot of sentiment distribution. |
| 18 | + - Area plot of cumulative sentiment vs. Bitcoin price. |
| 19 | + - Correlation heatmap of sentiment, price, price change, and rolling correlation. |
| 20 | + - Rolling correlation plot over time. |
| 21 | +- **Usability**: Display visualizations inline in Jupyter notebooks for easy analysis and exploration. |
| 22 | + |
| 23 | +## Project Structure |
| 24 | + |
| 25 | +The project is organized within the `tutorials` repository under the `DATA605/Spring2025/projects` directory. Below is the project structure: |
| 26 | + |
| 27 | +``` |
| 28 | +TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium/ |
| 29 | +│ |
| 30 | +├── README.md # Project documentation and setup instructions |
| 31 | +├── spacy_selenium_API.md # Documentation for the sentiment analysis pipeline |
| 32 | +├── spacy_selenium_example.md # Example usage of the sentiment analysis pipeline |
| 33 | +├── spacy_selenium_utils.py # Core class for scraping, NLP, sentiment analysis, and visualization |
| 34 | +├── spacy_selenium_API.ipynb # Tutorial notebook demonstrating spaCy and Selenium APIs |
| 35 | +├── spacy_selenium_example.ipynb # End-to-end pipeline notebook |
| 36 | +├── requirements.txt # List of Python dependencies |
| 37 | +├── Dockerfile # Docker configuration for the project |
| 38 | +├── .gitignore # Specifies files and directories to ignore in Git |
| 39 | +├── docker_build.sh # Builds the Docker image |
| 40 | +├── docker_bash.sh # Launches Jupyter notebook server |
| 41 | +├── install_project_packages.sh # Installs pip dependencies inside the container |
| 42 | +└── bashrc, etc_sudoers, utils.sh # Helper configurations |
| 43 | +``` |
| 44 | + |
| 45 | +## How It Works |
| 46 | + |
| 47 | +The pipeline operates in the following steps: |
| 48 | + |
| 49 | +1. **Data Ingestion**: |
| 50 | + - Uses Selenium to scrape tweets from X for keywords "Bitcoin" and "BTC". |
| 51 | + - Handles X login requirements with provided credentials to access live search results. |
| 52 | + - Removes duplicate tweets based on text content. |
| 53 | + |
| 54 | +2. **Data Preprocessing**: |
| 55 | + - Cleans tweets using spaCy for tokenization, stop-word removal, lemmatization, and Named Entity Recognition (NER). |
| 56 | + - Matches extracted entities with cryptocurrencies using CoinGecko data. |
| 57 | + |
| 58 | +3. **Sentiment Analysis**: |
| 59 | + - Analyzes tweet sentiment using the VADER sentiment analyzer. |
| 60 | + - Categorizes tweets as positive, negative, or neutral based on compound scores. |
| 61 | + |
| 62 | +4. **Correlation with Bitcoin Prices**: |
| 63 | + - Fetches Bitcoin price data from the CoinGecko API over a 1-day period. |
| 64 | + - Computes multiple correlation measures: |
| 65 | + - **Pearson**: Linear relationship. |
| 66 | + - **Spearman**: Monotonic relationship. |
| 67 | + - **Kendall**: Rank-based correlation. |
| 68 | + - **Lagged Pearson**: Explores if past sentiment predicts price. |
| 69 | + |
| 70 | +5. **Visualizations**: |
| 71 | + - Generates inline plots in Jupyter notebooks: |
| 72 | + - **Sentiment vs. Price Over Time**: Line plot of sentiment scores and Bitcoin prices. |
| 73 | + - **Sentiment Distribution**: Box plot of sentiment score distribution. |
| 74 | + - **Cumulative Sentiment vs. Price**: Area plot comparing cumulative sentiment with price trends. |
| 75 | + - **Correlation Heatmap**: Heatmap of correlations between sentiment, price, price change, and rolling correlation. |
| 76 | + - **Rolling Correlation**: Line plot of rolling correlation over time. |
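
The categorization in step 3 can be sketched as follows. This is a minimal illustration, not the code in `spacy_selenium_utils.py`: the function names `label_sentiment` and `count_labels` are hypothetical, and the ±0.05 cutoffs are the conventional thresholds suggested in the VADER README, assumed here rather than confirmed from the project. The block takes precomputed compound scores so that it runs without `vaderSentiment` installed.

```python
from collections import Counter

# With vaderSentiment installed, a compound score is typically obtained as:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]


def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    The default threshold is an assumption (the conventional VADER cutoff).
    """
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def count_labels(compounds, threshold=0.05):
    """Count how many scores fall into each of the three categories."""
    counts = Counter(label_sentiment(c, threshold) for c in compounds)
    return {label: counts.get(label, 0)
            for label in ("positive", "negative", "neutral")}


if __name__ == "__main__":
    # Hypothetical compound scores, for illustration only.
    print(count_labels([0.6, -0.7, 0.01, 0.05]))
```

Keeping the threshold as a parameter makes it easy to see how sensitive the positive/negative/neutral split is to the chosen cutoff.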
| 77 | + |
| 78 | +## Getting Started |
| 79 | + |
| 80 | +### Prerequisites |
| 81 | +- **Python 3.9+**: Ensure Python is installed. |
| 82 | +- **Google Chrome and ChromeDriver**: ChromeDriver must match your Chrome version for Selenium. |
| 83 | +- **X (Twitter) Account**: Valid credentials (`x_username`, `x_password`) are required for scraping. |
| 84 | +- **Docker and Docker Compose** (optional): For containerized execution. |
| 85 | +- **Stable Internet Connection**: Required for scraping tweets and fetching price data. |
| 86 | +- **CoinGecko API Key** (optional): A paid-tier key helps avoid rate limits. |
| 87 | + |
| 88 | +### Setup Instructions (Local Development) |
| 89 | + |
| 90 | +1. **Clone the Repository**: |
| 91 | + ```bash |
| 92 | + git clone https://github.com/causify-ai/tutorials.git |
| 93 | + cd tutorials/DATA605/Spring2025/projects/TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium |
| 94 | + ``` |
| 95 | + |
| 96 | +2. **Create and Activate a Virtual Environment (Windows)**: |
| 97 | + ```bash |
| 98 | + python -m venv venv |
| 99 | + venv\Scripts\activate |
| 100 | + ``` |
| 101 | + |
| 102 | +3. **Create and Activate a Virtual Environment (macOS/Linux)**: |
| 103 | + ```bash |
| 104 | + python3 -m venv venv |
| 105 | + source venv/bin/activate |
| 106 | + ``` |
| 107 | + |
| 108 | +4. **Install Dependencies**: |
| 109 | + ```bash |
| 110 | + pip install -r requirements.txt |
| 111 | + ``` |
| 112 | + |
| 113 | +5. **Install ChromeDriver**: |
| 114 | + - Download ChromeDriver matching your Chrome version from [chromedriver.chromium.org](https://chromedriver.chromium.org/downloads). |
| 115 | + - Place `chromedriver` in the project directory or a directory in your `PATH`. |
| 116 | + - Update `spacy_selenium_example.ipynb` or `spacy_selenium_utils.py` with the correct ChromeDriver path if needed. |
| 117 | + |
| 118 | +6. **Update X Credentials**: |
| 119 | + - Open `spacy_selenium_example.ipynb` or `Bitcoin_Sentiment_Analysis.ipynb`. |
| 120 | + - Update the `x_username` and `x_password` fields in the relevant cell: |
| 121 | + ```python |
| 122 | + x_username="your_username" |
| 123 | + x_password="your_password" |
| 124 | + ``` |
| 125 | + - Ensure 2FA is disabled for your X account, as the automated Selenium login cannot complete 2FA prompts. |
| 126 | + |
| 127 | +7. **(Optional) Set Up CoinGecko API Key**: |
| 128 | + - Create a `.env` file in the project root: |
| 129 | + ```ini |
| 130 | + COINGECKO_API_KEY=your_key_here |
| 131 | + ``` |
| 132 | + |
| 133 | +8. **Run the Jupyter Notebook Locally**: |
| 134 | + ```bash |
| 135 | + jupyter notebook |
| 136 | + ``` |
| 137 | + - Open `spacy_selenium_example.ipynb` or `Bitcoin_Sentiment_Analysis.ipynb` in your browser and run the cells to execute the pipeline. |
| 138 | + |
| 139 | +### Setup Instructions (Docker) |
| 140 | +The `Dockerfile`, `docker_build.sh`, and `docker_bash.sh` were adjusted to fit this project's requirements. |
| 141 | +1. **Install Docker Desktop** for your operating system. |
| 142 | + |
| 143 | +2. **Build the Docker Image**: |
| 144 | + ```bash |
| 145 | + chmod +x docker_data605_style/docker_*.sh |
| 146 | + ./docker_data605_style/docker_build.sh |
| 147 | + ``` |
| 148 | + |
| 149 | +3. **Start the Jupyter Notebook Server**: |
| 150 | + - Start the container: |
| 151 | + ```bash |
| 152 | + ./docker_data605_style/docker_bash.sh |
| 153 | + ``` |
| 154 | + - The Jupyter Notebook server starts automatically, so you can open the project notebooks right away. |
| 155 | + |
| 156 | + |
| 157 | + |
| 158 | +## Usage |
| 159 | + |
| 160 | +### Run the API Functionality Demo |
| 161 | +```bash |
| 162 | +jupyter notebook spacy_selenium_API.ipynb |
| 163 | +``` |
| 164 | +This notebook demonstrates: |
| 165 | +- **spaCy API**: Tokenization, lemmatization, NER, and dependency parsing. |
| 166 | +- **Selenium API**: Scraping tweets from X with authenticated login. |
| 167 | +- Integration with `spacy_selenium_utils.py` for preprocessing and analysis. |
| 168 | + |
| 169 | +### Run the Full Pipeline |
| 170 | +```bash |
| 171 | +jupyter notebook spacy_selenium_example.ipynb |
| 172 | +``` |
| 173 | +The pipeline: |
| 174 | +- Scrapes tweets for "Bitcoin" and "BTC". |
| 175 | +- Preprocesses tweets with spaCy. |
| 176 | +- Analyzes sentiment with VADER. |
| 177 | +- Fetches Bitcoin price data from CoinGecko. |
| 178 | +- Computes correlations and generates visualizations. |
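
The correlation step can be illustrated with a small sketch. It uses only the Python standard library so that it runs without extra dependencies, and the project's own implementation may differ. The helper names `pearson`, `ranks`, `spearman`, and `lagged_pearson` are hypothetical, as are the sample numbers. Spearman is computed as Pearson on average ranks, and the lagged variant pairs sentiment at time `t` with price at time `t + lag`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation applied to ranks."""
    return pearson(ranks(xs), ranks(ys))


def lagged_pearson(sentiment, price, lag):
    """Correlate sentiment at time t with price at time t + lag (lag >= 0)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(sentiment, price)
    return pearson(sentiment[:-lag], price[lag:])


if __name__ == "__main__":
    # Hypothetical, evenly spaced samples for illustration only.
    sentiment = [0.1, 0.5, -0.2, 0.3, 0.0]
    price = [100.0, 101.0, 105.0, 98.0, 103.0]
    print(pearson(sentiment, price))
    print(spearman(sentiment, price))
    print(lagged_pearson(sentiment, price, lag=1))
```

In the sample data each price follows the previous sentiment value, so the lag-1 correlation is much stronger than the same-time correlation. That is the pattern the lagged measure is meant to detect.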
| 179 | + |
| 180 | +### Explore Interactively |
| 181 | +- Start with `spacy_selenium_API.ipynb` to understand the APIs. |
| 182 | +- Run `spacy_selenium_example.ipynb` or `Bitcoin_Sentiment_Analysis.ipynb` for the full pipeline. |
| 183 | +- Use **Restart & Run All** in JupyterLab for consistent results. |
| 184 | + |
| 185 | +## Troubleshooting |
| 186 | + |
| 187 | +- **X Login Failure**: |
| 188 | + - Verify `x_username` and `x_password` in `spacy_selenium_example.ipynb` or `spacy_selenium_utils.py`. |
| 189 | + - Disable 2FA on your X account. |
| 190 | + - Check for verification prompts (email/phone) and complete them manually; screenshots are saved as `*.png` files. |
| 191 | + |
| 192 | +- **Selenium TimeoutException**: |
| 193 | + - Ensure ChromeDriver matches your Chrome version. |
| 194 | + - Increase `WebDriverWait` timeouts in `spacy_selenium_utils.py` (e.g., from 15s to 30s). |
| 195 | + |
| 196 | +- **CoinGecko API Rate Limits**: |
| 197 | + - Use a paid API key in `.env` if rate-limited. |
| 198 | + - Reduce `max_tweets` in `spacy_selenium_example.ipynb`. |
| 199 | + |
| 200 | +- **Docker Port Issues**: |
| 201 | + - Confirm `-p 8888:8888` is included in Docker commands. |
| 202 | + - Ensure port 8888 is free on your host machine. |
| 203 | + |
| 204 | +- **Visualization Issues**: |
| 205 | + - Add `%matplotlib inline` to the first cell of the notebook. |
| 206 | + - Update `matplotlib`, `seaborn`, and `ipython`: |
| 207 | + ```bash |
| 208 | + pip install --upgrade matplotlib seaborn ipython |
| 209 | + ``` |
| 210 | + |
| 211 | +## References |
| 212 | +- [spaCy Documentation](https://spacy.io/usage) |
| 213 | +- [Selenium Documentation](https://www.selenium.dev/documentation/) |
| 214 | +- [VADER Sentiment Analysis](https://github.com/cjhutto/vaderSentiment) |
| 215 | +- [CoinGecko API Documentation](https://www.coingecko.com/en/api/documentation) |
| 216 | +- [Matplotlib Documentation](https://matplotlib.org/stable/contents.html) |
| 217 | +- [Seaborn Documentation](https://seaborn.pydata.org/) |