Commit a33351d

TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium_v2 (#508)
Co-authored-by: Krishna P Taduri <krishna.pratardan@gmail.com>
Co-authored-by: Krishna P Taduri <40231735+tkpratardan@users.noreply.github.com>
1 parent 6e8f708 commit a33351d

22 files changed

Lines changed: 2962 additions & 0 deletions
@@ -0,0 +1,70 @@
FROM ubuntu:20.04
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    wget curl unzip gnupg \
    sudo git vim \
    python3 python3-pip python3-dev \
    build-essential libffi-dev libssl-dev libpng-dev libjpeg-dev \
    libfreetype6-dev gfortran libopenblas-dev liblapack-dev \
    libnss3 libatk-bridge2.0-0 libgtk-3-0 libx11-xcb1 \
    libxcomposite1 libxcursor1 libxdamage1 libxi6 libxtst6 \
    libxrandr2 libasound2 libpangocairo-1.0-0 libpangoft2-1.0-0 \
    fonts-liberation libgbm1 xdg-utils ca-certificates \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Install Chrome
RUN wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add - && \
    echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" > /etc/apt/sources.list.d/google-chrome.list && \
    apt-get update && apt-get install -y google-chrome-stable && \
    rm -rf /var/lib/apt/lists/*

# Set ChromeDriver version matching Chrome 136.
# NOTE: google-chrome-stable above is not pinned, so update this value when
# the installed Chrome moves to a different major version.
ENV CHROMEDRIVER_VERSION=136.0.7103.113

# Install ChromeDriver manually
RUN wget -O /tmp/chromedriver.zip "https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/${CHROMEDRIVER_VERSION}/linux64/chromedriver-linux64.zip" && \
    unzip -o /tmp/chromedriver.zip -d /tmp/ && \
    mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver && \
    chmod +x /usr/local/bin/chromedriver && \
    rm -rf /tmp/chromedriver.zip /tmp/chromedriver-linux64

# Upgrade pip and install core Python dependencies
RUN python3 -m pip install --upgrade pip setuptools wheel && \
    pip3 install numpy==1.24.4 && \
    pip3 install "cython<3" "blis<0.8" "thinc<8.2" "murmurhash<1.1.0" "cymem<2.1.0" "preshed<3.1.0"

# Install spaCy, download its English model once, then install the
# high-level packages that depend on the pins above
RUN pip3 install spacy==3.5.4 && \
    python3 -m spacy download en_core_web_sm && \
    pip3 install \
        ipython \
        notebook \
        psycopg2-binary \
        yapf \
        selenium \
        pandas \
        matplotlib \
        requests \
        vaderSentiment \
        jupyter \
        seaborn \
        scipy

# Copy your local files into the container
COPY . /data
WORKDIR /data

# Optional scripts (handle gracefully if not present)
RUN bash /data/install_jupyter_extensions.sh || true
RUN bash /data/version.sh || true

# Expose Jupyter. Token and password are disabled, so use this image for local work only.
EXPOSE 8888
CMD ["jupyter-notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root", "--NotebookApp.token=''", "--NotebookApp.password=''"]
Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
# Real-Time Bitcoin Sentiment Analysis with spaCy and Selenium

**Author**: Siddhi Rohan
**UID**: 121302823
**Course**: DATA605, Spring 2025

## Project Overview and Goals

This project, **TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium**, is a real-time sentiment analysis pipeline for Bitcoin-related tweets from X (Twitter). It uses Selenium for web scraping, spaCy for natural language processing, VADER for sentiment analysis, and the CoinGecko API for Bitcoin price data, with results visualized in Jupyter notebooks. Developed for the DATA605 course in Spring 2025, the pipeline measures public sentiment and how it correlates with Bitcoin price trends.

### Goals
- **Data Collection**: Scrape tweets containing the keywords "Bitcoin" and "BTC" to capture public sentiment.
- **Sentiment Analysis**: Use VADER to score each tweet and categorize it as positive, negative, or neutral.
- **Correlation Analysis**: Compute several correlation measures (Pearson, Spearman, Kendall, lagged Pearson, and rolling) between sentiment scores and Bitcoin prices.
- **Visualization**: Generate the following visualizations:
  - Line plot of sentiment vs. Bitcoin price over time.
  - Box plot of sentiment distribution.
  - Area plot of cumulative sentiment vs. Bitcoin price.
  - Correlation heatmap of sentiment, price, price change, and rolling correlation.
  - Rolling correlation plot over time.
- **Usability**: Display visualizations inline in Jupyter notebooks for easy analysis and exploration.

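As a minimal sketch of the categorization goal above, the function below maps a VADER compound score to a label. The ±0.05 cutoffs are the thresholds suggested in the VADER documentation; whether this project uses exactly these values is an assumption, and `label_sentiment` is a hypothetical helper name, not a function from `spacy_selenium_utils.py`.

```python
def label_sentiment(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical compound scores for illustration, not real tweet data.
for score in (0.62, -0.31, 0.01):
    print(score, label_sentiment(score))
```

Keeping the cutoff as a parameter makes it easy to see how sensitive the positive/negative/neutral split is to that choice.
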
## Project Structure

The project lives in the `tutorials` repository under the `DATA605/Spring2025/projects` directory and is laid out as follows:

```
TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium/
│
├── README.md                       # Project documentation and setup instructions
├── spacy_selenium_API.md           # Documentation for the sentiment analysis pipeline
├── spacy_selenium_example.md       # Example usage of the sentiment analysis pipeline
├── spacy_selenium_utils.py         # Core class for scraping, NLP, sentiment analysis, and visualization
├── spacy_selenium_API.ipynb        # Tutorial notebook demonstrating spaCy and Selenium APIs
├── spacy_selenium_example.ipynb    # End-to-end pipeline notebook
├── requirements.txt                # List of Python dependencies
├── Dockerfile                      # Docker configuration for the project
├── .gitignore                      # Specifies files and directories to ignore in Git
├── docker_build.sh                 # Builds the Docker image
├── docker_bash.sh                  # Runs the container and launches the Jupyter Notebook server
├── install_project_packages.sh     # Installs pip dependencies inside the container
└── bashrc, etc_sudoers, utils.sh   # Helper configurations
```

## How It Works

The pipeline runs in five steps:

1. **Data Ingestion**:
   - Uses Selenium to scrape tweets from X for the keywords "Bitcoin" and "BTC".
   - Logs in to X with the provided credentials to access live search results.
   - Removes duplicate tweets based on text content.

2. **Data Preprocessing**:
   - Cleans tweets using spaCy for tokenization, stop-word removal, lemmatization, and Named Entity Recognition (NER).
   - Matches extracted entities with cryptocurrencies using CoinGecko data.

3. **Sentiment Analysis**:
   - Scores each tweet with the VADER sentiment analyzer.
   - Categorizes tweets as positive, negative, or neutral based on their compound scores.

4. **Correlation with Bitcoin Prices**:
   - Fetches Bitcoin price data from the CoinGecko API over a 1-day period.
   - Computes multiple correlation measures:
     - **Pearson**: Linear relationship.
     - **Spearman**: Monotonic relationship, computed on ranks.
     - **Kendall**: Ordinal association based on concordant and discordant pairs.
     - **Lagged Pearson**: Tests whether earlier sentiment tracks later price.

5. **Visualizations**:
   - Generates inline plots in Jupyter notebooks:
     - **Sentiment vs. Price Over Time**: Line plot of sentiment scores and Bitcoin prices.
     - **Sentiment Distribution**: Box plot of sentiment score distribution.
     - **Cumulative Sentiment vs. Price**: Area plot comparing cumulative sentiment with price trends.
     - **Correlation Heatmap**: Heatmap of correlations between sentiment, price, price change, and rolling correlation.
     - **Rolling Correlation**: Line plot of rolling correlation over time.

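To make the correlation step concrete, here is a standard-library sketch of Pearson, Spearman, and lagged Pearson. It is meant only to expose the arithmetic: the project itself may well rely on pandas or SciPy, the function names are hypothetical, and Kendall's tau and the rolling variant are left out for brevity. The sample series are made up.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance scaled by both standard deviations.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def lagged_pearson(sentiment: Sequence[float], price: Sequence[float], lag: int) -> float:
    """Correlate sentiment at step t with price at step t + lag (lag >= 1)."""
    return pearson(sentiment[:-lag], price[lag:])


# Made-up, equally spaced observations for illustration only.
sentiment = [0.10, 0.40, 0.20, 0.50, 0.30]
price = [100.0, 101.0, 104.0, 102.0, 105.0]
print("pearson:", pearson(sentiment, price))
print("spearman:", spearman(sentiment, price))
print("lag-1 pearson:", lagged_pearson(sentiment, price, lag=1))
```

A lag of 1 here means "one sampling interval", so the series must be resampled to a common, evenly spaced time grid before a lagged correlation is meaningful.
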
## Getting Started

### Prerequisites
- **Python 3.9+**: Ensure Python is installed.
- **Google Chrome and ChromeDriver**: ChromeDriver must match your Chrome version for Selenium.
- **X (Twitter) Account**: Valid credentials (`x_username`, `x_password`) are required for scraping.
- **Docker and Docker Compose** (optional): For containerized execution.
- **Stable Internet Connection**: Required for scraping tweets and fetching price data.
- **(Optional) CoinGecko API Key**: For the paid tier, to avoid rate limits.

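The Chrome/ChromeDriver requirement can be checked before launching Selenium. The sketch below compares major versions taken from strings such as those printed by `google-chrome --version` and `chromedriver --version`; the helper names and the sample strings are illustrative assumptions, not part of the project code.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version from text containing a four-part version number."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_compatible(chrome_output: str, driver_output: str) -> bool:
    """Chrome and ChromeDriver are expected to share the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


# Illustrative strings; substitute the real command output on your machine.
print(versions_compatible("Google Chrome 136.0.7103.113", "ChromeDriver 136.0.7103.113"))
print(versions_compatible("Google Chrome 137.0.7151.55", "ChromeDriver 136.0.7103.113"))
```
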
### Setup Instructions (Local Development)

1. **Clone the Repository**:
   ```bash
   git clone https://github.com/causify-ai/tutorials.git
   cd tutorials/DATA605/Spring2025/projects/TutorTask204_Spring2025_RealTime_Bitcoin_Sentiment_Analysis_spaCy_Selenium
   ```

2. **Create and Activate a Virtual Environment (Windows)**:
   ```bat
   python -m venv venv
   venv\Scripts\activate
   ```

3. **Create and Activate a Virtual Environment (macOS/Linux)**:
   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

4. **Install Dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

5. **Install ChromeDriver**:
   - Download ChromeDriver matching your Chrome version from [chromedriver.chromium.org](https://chromedriver.chromium.org/downloads). For Chrome 115 and later, drivers are published through Chrome for Testing, which is where the Dockerfile in this project downloads its driver.
   - Place `chromedriver` in the project directory or a directory in your `PATH`.
   - Update `spacy_selenium_example.ipynb` or `spacy_selenium_utils.py` with the correct ChromeDriver path if needed.

6. **Update X Credentials**:
   - Open `spacy_selenium_example.ipynb` or `Bitcoin_Sentiment_Analysis.ipynb`.
   - Update the `x_username` and `x_password` fields in the relevant cell:
     ```python
     x_username="your_username"
     x_password="your_password"
     ```
   - Ensure 2FA is disabled for your X account, because the automated login does not handle 2FA prompts.

7. **(Optional) Set Up CoinGecko API Key**:
   - Create a `.env` file in the project root:
     ```ini
     COINGECKO_API_KEY=your_key_here
     ```

8. **Run the Jupyter Notebook Locally**:
   ```bash
   jupyter notebook
   ```
   - Open `spacy_selenium_example.ipynb` or `Bitcoin_Sentiment_Analysis.ipynb` in your browser and run the cells to execute the pipeline.

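For the optional `.env` step, the sketch below shows one way the `COINGECKO_API_KEY` value could be read using only the standard library. How the project actually loads the file is not shown in this README (it may use a package such as `python-dotenv`), so `read_env_file` and `get_coingecko_api_key` are hypothetical helpers.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_coingecko_api_key(path: str = ".env") -> Optional[str]:
    """Prefer an exported environment variable, then fall back to the .env file."""
    return os.environ.get("COINGECKO_API_KEY") or read_env_file(path).get("COINGECKO_API_KEY")


# Report only whether a key is configured; never print the key itself.
print("API key configured:", get_coingecko_api_key() is not None)
```
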
### Setup Instructions (Docker)

The Docker-related files (`Dockerfile`, `docker_build.sh`, and `docker_bash.sh`) were adjusted to fit this project's requirements.

1. **Install Docker Desktop** for your operating system.

2. **Build the Docker Image**:
   ```bash
   chmod +x docker_data605_style/docker_*.sh
   ./docker_data605_style/docker_build.sh
   ```

3. **Start the Jupyter Notebook Server**:
   - Start the container:
     ```bash
     ./docker_data605_style/docker_bash.sh
     ```
   - The container launches the Jupyter Notebook server on port 8888, so the notebooks are available at `http://localhost:8888` as soon as it starts.

## Usage

### Run the API Functionality Demo
```bash
jupyter notebook spacy_selenium_API.ipynb
```
This notebook demonstrates:
- **spaCy API**: Tokenization, lemmatization, NER, and dependency parsing.
- **Selenium API**: Scraping tweets from X with authenticated login.
- Integration with `spacy_selenium_utils.py` for preprocessing and analysis.

### Run the Full Pipeline
```bash
jupyter notebook spacy_selenium_example.ipynb
```
The pipeline:
- Scrapes tweets for "Bitcoin" and "BTC".
- Preprocesses tweets with spaCy.
- Analyzes sentiment with VADER.
- Fetches Bitcoin price data from CoinGecko.
- Computes correlations and generates visualizations.

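The price-fetching step can be sketched as follows. The code builds a request URL for CoinGecko's public `market_chart` endpoint and parses the `prices` array it returns; that the project calls this exact endpoint is an assumption, the helper names are hypothetical, and the payload below is synthetic rather than a real API response. No network request is made.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List, Tuple
from urllib.parse import urlencode

COINGECKO_BASE_URL = "https://api.coingecko.com/api/v3"


def market_chart_url(coin_id: str = "bitcoin", vs_currency: str = "usd", days: int = 1) -> str:
    """Build the URL for the market_chart endpoint over the given number of days."""
    query = urlencode({"vs_currency": vs_currency, "days": days})
    return f"{COINGECKO_BASE_URL}/coins/{coin_id}/market_chart?{query}"


def parse_prices(payload: Dict[str, Any]) -> List[Tuple[datetime, float]]:
    """Convert [[timestamp_ms, price], ...] pairs into (UTC datetime, price) tuples."""
    return [
        (datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc), float(price))
        for ts_ms, price in payload.get("prices", [])
    ]


# Synthetic payload in the shape of a market_chart response.
sample_payload = {"prices": [[1700000000000, 100.0], [1700000300000, 101.5]]}

print(market_chart_url())
for timestamp, price in parse_prices(sample_payload):
    print(timestamp.isoformat(), price)
```

Converting timestamps to timezone-aware UTC values up front makes it easier to align prices with tweet times later in the pipeline.
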
### Explore Interactively
- Start with `spacy_selenium_API.ipynb` to understand the APIs.
- Run `spacy_selenium_example.ipynb` or `Bitcoin_Sentiment_Analysis.ipynb` for the full pipeline.
- Use **Restart & Run All** in Jupyter so that every cell runs in order from a clean kernel.

## Troubleshooting

- **X Login Failure**:
  - Verify `x_username` and `x_password` in `spacy_selenium_example.ipynb` or `spacy_selenium_utils.py`.
  - Disable 2FA on your X account.
  - Check for email or phone verification prompts and complete them manually (screenshots are saved as `*.png` for inspection).

- **Selenium TimeoutException**:
  - Ensure ChromeDriver matches your Chrome version.
  - Increase `WebDriverWait` timeouts in `spacy_selenium_utils.py` (e.g., from 15s to 30s).

- **CoinGecko API Rate Limits**:
  - Use a paid API key in `.env` if rate-limited.
  - Reduce `max_tweets` in `spacy_selenium_example.ipynb`.

- **Docker Port Issues**:
  - Confirm `-p 8888:8888` is included in the `docker run` command (`docker_bash.sh` sets it).
  - Ensure port 8888 is free on your host machine.

- **Visualization Issues**:
  - Add `%matplotlib inline` to the first cell of the notebook.
  - Update `matplotlib`, `seaborn`, and `ipython`:
    ```bash
    pip install --upgrade matplotlib seaborn ipython
    ```

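For the port check mentioned under Docker Port Issues, a quick standard-library probe is sketched below. `port_is_free` is a hypothetical helper; it treats a port as free when nothing accepts a TCP connection on it.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when a listener accepts the connection.
        return sock.connect_ex((host, port)) != 0


print("Port 8888 free:", port_is_free(8888))
```

If the port is taken, either stop the process using it or map a different host port in the `docker run` command.
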
## References
- [spaCy Documentation](https://spacy.io/usage)
- [Selenium Documentation](https://www.selenium.dev/documentation/)
- [VADER Sentiment Analysis](https://github.com/cjhutto/vaderSentiment)
- [CoinGecko API Documentation](https://www.coingecko.com/en/api/documentation)
- [Matplotlib Documentation](https://matplotlib.org/stable/contents.html)
- [Seaborn Documentation](https://seaborn.pydata.org/)
@@ -0,0 +1 @@
../../../../../docker_common/bashrc
@@ -0,0 +1,14 @@
#!/bin/bash -xe

REPO_NAME=umd_data605
IMAGE_NAME=umd_data605_template
FULL_IMAGE_NAME=$REPO_NAME/$IMAGE_NAME

# Show the image that is about to be run.
docker image ls "$FULL_IMAGE_NAME"

# Run the container with Jupyter published on port 8888 and the current
# directory mounted at /data (quoted so paths with spaces work).
CONTAINER_NAME=$IMAGE_NAME
docker run --rm -ti \
    --name "$CONTAINER_NAME" \
    -p 8888:8888 \
    -v "$(pwd)":/data \
    "$FULL_IMAGE_NAME"
@@ -0,0 +1,12 @@
#!/bin/bash -e

GIT_ROOT=$(git rev-parse --show-toplevel)
# build_container_image is not defined in this script; it is expected to come from utils.sh.
source "$GIT_ROOT/docker_common/utils.sh"

REPO_NAME=umd_data605
IMAGE_NAME=umd_data605_template

# Build container.
export DOCKER_BUILDKIT=1
#export DOCKER_BUILDKIT=0
build_container_image
