Commit 77afff1

Final Commit for this Project
1 parent ec62d7a commit 77afff1

12 files changed

Lines changed: 2706 additions & 445 deletions
Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
# Real-Time Bitcoin Sentiment Analysis with spaCy and Selenium

**Author**: Siddhi Rohan
**UID**: 121302823
**Course**: DATA605, Spring 2025

## Project Overview and Goals

This project, **TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium**, is a real-time sentiment analysis pipeline focused on Bitcoin-related tweets from X (Twitter). It uses Selenium for web scraping, spaCy for natural language processing, VADER for sentiment analysis, and the CoinGecko API for Bitcoin price data, with results visualized in Jupyter notebooks. Developed for the DATA605 course in Spring 2025, the pipeline analyzes public sentiment and its correlation with Bitcoin price trends.

### Goals
- **Data Collection**: Scrape tweets containing the keywords "Bitcoin" and "BTC" to capture public sentiment.
- **Sentiment Analysis**: Use VADER to analyze tweet sentiment, categorizing tweets as positive, negative, or neutral.
- **Correlation Analysis**: Compute multiple correlation measures (Pearson, Spearman, Kendall, lagged Pearson, and rolling) between sentiment scores and Bitcoin prices.
- **Visualization**: Generate visualizations, including:
  - Line plot of sentiment vs. Bitcoin price over time.
  - Box plot of sentiment distribution.
  - Area plot of cumulative sentiment vs. Bitcoin price.
  - Correlation heatmap of sentiment, price, price change, and rolling correlation.
  - Rolling correlation plot over time.
- **Usability**: Display visualizations inline in Jupyter notebooks for easy analysis and exploration.

## Project Structure

The project is organized within the `tutorials` repository under the `DATA605/Spring2025/projects` directory. Below is the project structure:

```
TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium/
├── README.md                      # Project documentation and setup instructions
├── spacy_selenium_API.md          # Documentation for the sentiment analysis pipeline
├── spacy_selenium_example.md      # Example usage of the sentiment analysis pipeline
├── spacy_selenium_utils.py        # Core class for scraping, NLP, sentiment analysis, and visualization
├── spacy_selenium_API.ipynb       # Tutorial notebook demonstrating spaCy and Selenium APIs
├── spacy_selenium_example.ipynb   # End-to-end pipeline notebook
├── requirements.txt               # List of Python dependencies
├── Dockerfile                     # Docker configuration for the project
├── .gitignore                     # Specifies files and directories to ignore in Git
├── docker_build.sh                # Builds the Docker container
├── docker_bash.sh                 # Launches Jupyter notebook server
├── install_project_packages.sh    # Installs pip dependencies inside the container
└── bashrc, etc_sudoers, utils.sh  # Helper configurations
```

## How It Works

The pipeline operates in the following steps:

1. **Data Ingestion**:
   - Uses Selenium to scrape tweets from X for the keywords "Bitcoin" and "BTC".
   - Handles X login requirements with provided credentials to access live search results.
   - Removes duplicate tweets based on text content.

2. **Data Preprocessing**:
   - Cleans tweets using spaCy for tokenization, stop-word removal, lemmatization, and Named Entity Recognition (NER).
   - Matches extracted entities with cryptocurrencies using CoinGecko data.

3. **Sentiment Analysis**:
   - Analyzes tweet sentiment using the VADER sentiment analyzer.
   - Categorizes tweets as positive, negative, or neutral based on compound scores.

4. **Correlation with Bitcoin Prices**:
   - Fetches Bitcoin price data from the CoinGecko API over a 1-day period.
   - Computes multiple correlation measures:
     - **Pearson**: Linear relationship.
     - **Spearman**: Monotonic relationship.
     - **Kendall**: Rank-based correlation.
     - **Lagged Pearson**: Explores whether past sentiment predicts price.
     - **Rolling**: Shows how the relationship evolves over time.

5. **Visualizations**:
   - Generates inline plots in Jupyter notebooks:
     - **Sentiment vs. Price Over Time**: Line plot of sentiment scores and Bitcoin prices.
     - **Sentiment Distribution**: Box plot of sentiment score distribution.
     - **Cumulative Sentiment vs. Price**: Area plot comparing cumulative sentiment with price trends.
     - **Correlation Heatmap**: Heatmap of correlations between sentiment, price, price change, and rolling correlation.
     - **Rolling Correlation**: Line plot of rolling correlation over time.
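
Steps 1 and 3 can be illustrated with a small, dependency-free sketch. It assumes tweets are plain strings, that duplicates are detected on whitespace-normalized, case-folded text, and that compound scores are mapped to labels with the ±0.05 cutoffs suggested in the VADER documentation. The helper names `dedupe_by_text` and `label_from_compound` are hypothetical and are not taken from `spacy_selenium_utils.py`.

```python
def dedupe_by_text(tweets):
    """Return tweets with duplicate text removed, keeping first occurrences."""
    seen = set()
    unique = []
    for text in tweets:
        # Assumption: two tweets are duplicates if they match after
        # collapsing whitespace and ignoring case.
        key = " ".join(text.split()).casefold()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative inputs only; the scores are not real VADER output.
tweets = ["BTC to the moon!", "btc  to the moon!", "Bitcoin is crashing"]
print(dedupe_by_text(tweets))
print([label_from_compound(score) for score in (0.62, -0.40, 0.01)])
```

Keeping the threshold as a parameter makes it easy to widen or narrow the neutral band if the default produces too many borderline labels.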

## Getting Started

### Prerequisites
- **Python 3.9+**: Ensure Python is installed.
- **Google Chrome and ChromeDriver**: ChromeDriver must match your Chrome version for Selenium.
- **X (Twitter) Account**: Valid credentials (`x_username`, `x_password`) are required for scraping.
- **Docker and Docker Compose** (optional): For containerized execution.
- **Stable Internet Connection**: Required for scraping tweets and fetching price data.
- **(Optional) CoinGecko API Key**: A paid-tier key helps avoid rate limits.
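
The ChromeDriver requirement can be checked before running the pipeline. The sketch below compares major version numbers parsed from strings in the format that `google-chrome --version` and `chromedriver --version` typically print. The function names are hypothetical and the version strings are illustrative examples, not values from this project.

```python
import re


def major_version(version_output):
    """Extract the major version number from a `--version` style string."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output, driver_output):
    """Return True when Chrome and ChromeDriver share a major version."""
    return major_version(chrome_output) == major_version(driver_output)


# Example strings in the format typically printed by the two binaries.
chrome = "Google Chrome 124.0.6367.91"
driver = "ChromeDriver 124.0.6367.91 (abcdef-refs/branch-heads/6367@{#100})"
print(versions_match(chrome, driver))
```

Comparing only the major version is a deliberate simplification; a stricter check could compare the full build number as well.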

### Setup Instructions (Local Development)

1. **Clone the Repository**:
   ```bash
   git clone https://github.com/causify-ai/tutorials.git
   cd tutorials/DATA605/Spring2025/projects/TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium
   ```

2. **Create and Activate a Virtual Environment (Windows)**:
   ```bash
   python -m venv venv
   venv\Scripts\activate
   ```

3. **Create and Activate a Virtual Environment (macOS/Linux)**:
   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

4. **Install Dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

5. **Install ChromeDriver**:
   - Download ChromeDriver matching your Chrome version from [chromedriver.chromium.org](https://chromedriver.chromium.org/downloads). For Chrome 115 and later, use the [Chrome for Testing](https://googlechromelabs.github.io/chromedriver/) page instead.
   - Place `chromedriver` in the project directory or a directory in your `PATH`.
   - Update `spacy_selenium_example.ipynb` or `spacy_selenium_utils.py` with the correct ChromeDriver path if needed.

6. **Update X Credentials**:
   - Open `spacy_selenium_example.ipynb` or `Bitcoin_Sentiment_Analysis.ipynb`.
   - Update the `x_username` and `x_password` fields in the relevant cell:
     ```python
     x_username="your_username"
     x_password="your_password"
     ```
   - Ensure 2FA is disabled for your X account, as Selenium cannot handle 2FA prompts automatically.

7. **(Optional) Set Up CoinGecko API Key**:
   - Create a `.env` file in the project root:
     ```ini
     COINGECKO_API_KEY=your_key_here
     ```

8. **Run the Jupyter Notebook Locally**:
   ```bash
   jupyter notebook
   ```
   - Open `spacy_selenium_example.ipynb` or `Bitcoin_Sentiment_Analysis.ipynb` in your browser and run the cells to execute the pipeline.
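
For the optional `.env` file in step 7, the following sketch shows one way the key could be read using only the standard library. It assumes the file holds simple `KEY=VALUE` lines; how the project itself loads the file is not shown in this README, and `read_env_file` and `get_coingecko_key` are hypothetical helper names.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_coingecko_key(path=".env"):
    """Prefer a real environment variable, then fall back to the .env file."""
    from_env = os.environ.get("COINGECKO_API_KEY")
    return from_env or read_env_file(path).get("COINGECKO_API_KEY")


api_key = get_coingecko_key()
print("API key configured" if api_key else "No API key found")
```

Checking the process environment first lets the same code work inside Docker, where variables are often passed to the container rather than stored in a file.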

### Setup Instructions (Docker)

The Docker-related files (`Dockerfile`, `docker_build.sh`, and `docker_bash.sh`) were adjusted to meet the project requirements.

1. **Install Docker Desktop** for your operating system.

2. **Build the Docker Image**:
   ```bash
   chmod +x docker_data605_style/docker_*.sh
   ./docker_data605_style/docker_build.sh
   ```

3. **Start the Jupyter Notebook Server**:
   - Start the container:
     ```bash
     ./docker_data605_style/docker_bash.sh
     ```
   - The Jupyter notebook server then starts inside the container, so you can open the project notebooks right away.

## Usage

### Run the API Functionality Demo
```bash
jupyter notebook spacy_selenium_API.ipynb
```
This notebook demonstrates:
- **spaCy API**: Tokenization, lemmatization, NER, and dependency parsing.
- **Selenium API**: Scraping tweets from X with authenticated login.
- Integration with `spacy_selenium_utils.py` for preprocessing and analysis.

### Run the Full Pipeline
```bash
jupyter notebook spacy_selenium_example.ipynb
```
The pipeline:
- Scrapes tweets for "Bitcoin" and "BTC".
- Preprocesses tweets with spaCy.
- Analyzes sentiment with VADER.
- Fetches Bitcoin price data from CoinGecko.
- Computes correlations and generates visualizations.

### Explore Interactively
- Start with `spacy_selenium_API.ipynb` to understand the APIs.
- Run `spacy_selenium_example.ipynb` or `Bitcoin_Sentiment_Analysis.ipynb` for the full pipeline.
- Use **Restart & Run All** in Jupyter so every cell runs in order from a clean kernel.

## Troubleshooting

- **X Login Failure**:
  - Verify `x_username` and `x_password` in `spacy_selenium_example.ipynb` or `spacy_selenium_utils.py`.
  - Disable 2FA on your X account.
  - Check for verification prompts (email or phone) and complete them manually; screenshots are saved as `*.png`.

- **Selenium TimeoutException**:
  - Ensure ChromeDriver matches your Chrome version.
  - Increase `WebDriverWait` timeouts in `spacy_selenium_utils.py` (e.g., from 15s to 30s).

- **CoinGecko API Rate Limits**:
  - Use a paid API key in `.env` if rate-limited.
  - Reduce `max_tweets` in `spacy_selenium_example.ipynb`.

- **Docker Port Issues**:
  - Confirm `-p 8888:8888` is included in Docker commands.
  - Ensure port 8888 is free on your host machine.

- **Visualization Issues**:
  - Add `%matplotlib inline` to the first cell of the notebook.
  - Update `matplotlib`, `seaborn`, and `ipython`:
    ```bash
    pip install --upgrade matplotlib seaborn ipython
    ```
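
For the rate-limit case, another common mitigation is to retry with exponential backoff. The sketch below is a generic pattern, not code from this project: `RateLimitError` is a hypothetical exception standing in for an HTTP 429 response, and `fake_fetch` is a stand-in for whatever function requests the price data.

```python
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the API answers with HTTP 429."""


def call_with_backoff(fetch, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), retrying with exponentially growing pauses on rate limits."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Pause base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)


# Usage with a stand-in fetcher that fails twice before succeeding.
attempts = {"count": 0}


def fake_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "ok"  # placeholder for a real response


print(call_with_backoff(fake_fetch, base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper easy to test, because a test can record the requested delays instead of actually waiting.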

## References
- [spaCy Documentation](https://spacy.io/usage)
- [Selenium Documentation](https://www.selenium.dev/documentation/)
- [VADER Sentiment Analysis](https://github.com/cjhutto/vaderSentiment)
- [CoinGecko API Documentation](https://www.coingecko.com/en/api/documentation)
- [Matplotlib Documentation](https://matplotlib.org/stable/contents.html)
- [Seaborn Documentation](https://seaborn.pydata.org/)
Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
# Real-Time Bitcoin Sentiment Analysis with spaCy and Selenium

## Project Overview and Goals

This project, **TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium**, is a real-time sentiment analysis pipeline focused on Bitcoin-related tweets. It scrapes tweets using Selenium, preprocesses them with spaCy, analyzes sentiment with VADER, correlates sentiment with Bitcoin prices from CoinGecko, and visualizes the results. The project was developed as part of the DATA605 course in Spring 2025.

### Goals
- **Data Collection**: Scrape tweets containing keywords "Bitcoin" and "BTC" to capture public sentiment.
- **Sentiment Analysis**: Analyze tweet sentiment using VADER and categorize tweets as positive, negative, or neutral.
- **Correlation Analysis**: Compute multiple correlation measures (Pearson, Spearman, Kendall, and lagged Pearson) between sentiment scores and Bitcoin prices.
- **Visualization**: Generate insightful visualizations, including:
  - Line plot of sentiment vs. Bitcoin price over time.
  - Box plot of sentiment distribution.
  - Area plot of cumulative sentiment vs. Bitcoin price.
  - Correlation heatmap of sentiment, price, and other metrics.
  - Rolling correlation plot over time.
- **Usability**: Display visualizations directly in the Jupyter notebook for easy analysis.

## Project Structure

The project is organized within the `tutorials` repository under the `DATA605/Spring2025/projects` directory. Below is a diagram of the project structure:

TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium/

├── README.md # Project documentation and setup instructions
├── `__init__.py` # Makes the directory a Python package
├── main.py # Main script to run the pipeline
├── Bitcoin_API.md # Documentation for the sentiment analysis pipeline
├── Bitcoin_API.py # Core script for scraping, preprocessing, and visualization
├── Bitcoin_example.md # Example usage of the sentiment analysis pipeline
├── Bitcoin_example.py # Example script for single-tweet sentiment analysis
├── Bitcoin_utils.py # Utility functions for logging
├── spacy_utils.py # Utility functions for NLP and CoinGecko API interactions
├── Bitcoin_Sentiment_Analysis.ipynb # Jupyter notebook demonstrating the pipeline
├── requirements.txt # List of Python dependencies
├── Dockerfile # Docker configuration for the project
├── docker-compose.yml # Docker Compose configuration for running the project
└── .gitignore # Specifies files and directories to ignore in Git

## How It Works

The pipeline operates in the following steps:

1. **Data Ingestion**:
   - Uses Selenium to scrape tweets from X (Twitter) for the keywords "Bitcoin" and "BTC".
   - Handles X login requirements with provided credentials, ensuring access to live search results.
   - Removes duplicate tweets based on text content.

2. **Data Preprocessing**:
   - Cleans tweets using spaCy for tokenization, stop-word removal, lemmatization, and Named Entity Recognition (NER).
   - Extracts entities and matches them with cryptocurrencies using CoinGecko data.

3. **Sentiment Analysis**:
   - Analyzes tweet sentiment using the VADER sentiment analyzer.
   - Categorizes tweets as positive, negative, or neutral based on compound scores.

4. **Correlation with Bitcoin Prices**:
   - Fetches Bitcoin price data from the CoinGecko API over a 1-day period.
   - Computes multiple correlation measures:
     - Pearson (linear relationship)
     - Spearman (monotonic relationship)
     - Kendall (rank-based)
     - Lagged Pearson (to explore if past sentiment predicts price)
     - Rolling correlation (to see how the relationship evolves over time)

5. **Visualizations**:
   - Generates plots displayed inline in the Jupyter notebook:
     - **Sentiment vs. Price Over Time**: A line plot showing sentiment scores and Bitcoin prices.
     - **Sentiment Distribution**: A box plot of sentiment scores across tweets.
     - **Cumulative Sentiment vs. Price**: An area plot comparing cumulative sentiment with price trends.
     - **Correlation Heatmap**: A heatmap of correlations between sentiment, price, price change, and rolling correlation.
     - **Rolling Correlation**: A line plot of the rolling correlation over time.
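
The correlation measures in step 4 can be made concrete with a small standard-library sketch. It assumes sentiment and price have already been aligned into two equal-length sequences sampled at the same times; the function names are hypothetical, and the sample series are made-up toy values, not real market data or project output.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def lagged_pearson(sentiment, price, lag):
    """Correlate sentiment at time t with price at time t + lag (lag >= 1)."""
    return pearson(sentiment[:-lag], price[lag:])


def rolling_pearson(xs, ys, window):
    """Pearson correlation over each consecutive window of the two series."""
    return [
        pearson(xs[i:i + window], ys[i:i + window])
        for i in range(len(xs) - window + 1)
    ]


# Toy series for illustration only.
sentiment = [0.1, 0.4, -0.2, 0.3, 0.5, -0.1]
price = [100.0, 101.0, 104.0, 102.0, 105.0, 107.0]
print(pearson(sentiment, price))
print(lagged_pearson(sentiment, price, lag=1))
print(rolling_pearson(sentiment, price, window=3))
```

A positive lagged value here would only suggest that earlier sentiment moves with later price in the sample; with a 1-day window of data it should not be read as evidence of predictive power.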

## Getting Started

### Prerequisites
- **Python 3.9+**: Ensure Python is installed on your system.
- **Google Chrome and ChromeDriver**: ChromeDriver must match your Chrome version for Selenium.
- **X (Twitter) Account**: Credentials are required for scraping tweets due to X's login requirements.
- **Docker and Docker Compose** (optional): For containerized execution.
- **Stable Internet Connection**: Needed for scraping tweets and fetching Bitcoin prices.

### Setup Instructions (Local Development)

1. Clone the Repository:
   - `git clone https://github.com/causify-ai/tutorials.git`
   - `cd tutorials/DATA605/Spring2025/projects/TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium`

2. Create and Activate a Virtual Environment (Windows):
   - `python -m venv venv`
   - `venv\Scripts\activate`

3. Create and Activate a Virtual Environment (macOS/Linux):
   - `python3 -m venv venv`
   - `source venv/bin/activate`

4. Install Dependencies:
   - `pip install -r requirements.txt`

5. Install ChromeDriver:
   - Download ChromeDriver matching your Chrome version from https://googlechromelabs.github.io/chromedriver/
   - Place the `chromedriver` executable in the project directory or a directory in your `PATH`
   - Update `Bitcoin_Sentiment_Analysis.ipynb` with the correct ChromeDriver path

6. Update X Credentials:
   - Open `Bitcoin_Sentiment_Analysis.ipynb`
   - Update the `x_username` and `x_password` fields in Cell 3 with your X credentials:
     - `x_username="your_username"`
     - `x_password="your_password"`
   - Ensure 2FA is disabled for your X account, as Selenium cannot handle 2FA prompts automatically

7. Run the Jupyter Notebook Locally:
   - `jupyter notebook`
   - Open `Bitcoin_Sentiment_Analysis.ipynb` in your browser and run the cells to execute the pipeline
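
Credentials typed directly into a notebook cell are easy to commit by accident. One possible alternative, sketched here as an assumption rather than as the project's approach, is to read them from environment variables and pass the result to the notebook's `x_username` and `x_password` fields. The variable names `X_USERNAME` and `X_PASSWORD` and the helper `load_x_credentials` are hypothetical.

```python
import os


def load_x_credentials(env=os.environ):
    """Return (username, password) from X_USERNAME / X_PASSWORD, or raise."""
    missing = [name for name in ("X_USERNAME", "X_PASSWORD") if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["X_USERNAME"], env["X_PASSWORD"]


# Usage with an explicit mapping so the example runs without real credentials.
x_username, x_password = load_x_credentials(
    {"X_USERNAME": "your_username", "X_PASSWORD": "your_password"}
)
print(x_username)
```

Failing early with the names of the missing variables tends to be easier to debug than a login timeout later in the Selenium session.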
