|
1 | | -<!-- toc --> |
| 1 | +# Real-Time Bitcoin Price Analysis Using Amazon EMR |
2 | 2 |
|
| 3 | +This project demonstrates a real-time data processing pipeline that collects Bitcoin price data from a public API, stores it in Amazon S3, and processes it using Apache Spark on Amazon EMR for time-series analysis. |
3 | 4 |
|
| 5 | +--- |
4 | 6 |
|
5 | | -<!-- tocstop --> |
| 7 | +## Technologies Used |
6 | 8 |
|
7 | | -- Author: <your name> |
8 | | -- Date: <date> |
| 9 | +- **CoinGecko API** – Source of the live Bitcoin price in USD |
| 10 | +- **Python** – Core scripting language |
| 11 | +- **Boto3** – AWS SDK used to interact with Amazon S3 |
| 12 | +- **Amazon S3** – Storage for raw and processed data |
| 13 | +- **Apache Spark (Structured Streaming)** – 1-minute windowed aggregation |
| 14 | +- **Amazon EMR** – Managed cluster that runs the Spark job at scale |
| 15 | +- **Docker** – Containerized environment for portability and reproducibility |
9 | 16 |
|
10 | | -<Describe all the files in the projects> |
| 17 | +--- |
11 | 18 |
|
12 | | -This project contains the following files |
| 19 | +## Project Structure |
13 | 20 |
|
14 | | -- `template`.API.ipynb: a notebook describing the native API of <Package> |
15 | | -- `template`.API.md: a description of the native API of <Package> |
16 | | -- `template`.API.py: code for using API of <Package> |
17 | | -- `template`.example.ipynb: a notebook implementing a project using <Package> |
18 | | -- `template`.example.md: a markdown description of the project |
19 | | -- `template`.example.py: code for implementing the project |
| 21 | +| File/Folder | Description | |
| 22 | +|-------------|-------------| |
| 23 | +| `bitcoin_producer.py` | Fetches real-time Bitcoin prices and writes records to S3 (`data_v2/`) | |
| 24 | +| `bitcoin_streaming_consumer_emr_debug.py` | Spark job that reads records from S3, computes a 1-minute windowed average price, and writes the results to `output/` | |
| 25 | +| `bitcoin_kafka/bitcoin_emr_utils.py` | Helper functions for API fetching, timestamping, and S3 upload | |
| 26 | +| `bitcoin_emr.API.ipynb` | Demonstrates utility API functions (with simulated S3 upload fallback) | |
| 27 | +| `bitcoin_emr.example.ipynb` | Simulates full pipeline with producer input and EMR output | |
| 28 | +| `bitcoin_emr.API.md` | Markdown documenting the API and helper layer | |
| 29 | +| `bitcoin_emr.example.md` | Markdown documenting the full pipeline example | |
| 30 | +| `requirements.txt` | Python package requirements | |
| 31 | +| `Dockerfile` + `*.sh` | Docker setup and run scripts | |
| 32 | + |
| 33 | +--- |
| 34 | + |
| 35 | +## Output Format |
| 36 | + |
| 37 | +### Input Record (stored in S3) |
| 38 | + |
| 39 | +```json |
| 40 | +{ |
| 41 | + "timestamp": "2025-05-17T09:58:00", |
| 42 | + "price_usd": 102723.12 |
| 43 | +} |
| 44 | +``` |
| 45 | + |
| 46 | +### Processed Output (via Spark on EMR) |
| 47 | + |
| 48 | +```json |
| 49 | +{ |
| 50 | + "window": { |
| 51 | + "start": "2025-05-17T09:58:00", |
| 52 | + "end": "2025-05-17T09:59:00" |
| 53 | + }, |
| 54 | + "avg_price": 102750.13 |
| 55 | +} |
| 56 | +``` |
| 57 | + |
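To make the relationship between the two record shapes concrete, here is a minimal pure-Python sketch of a 1-minute tumbling-window average. It is not the Spark job from `bitcoin_streaming_consumer_emr_debug.py`; the function name `window_averages` and the second and third sample prices are assumptions made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def window_averages(records, width=timedelta(minutes=1)):
    """Group records into tumbling windows and average price_usd per window."""
    buckets = defaultdict(list)
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        # Floor the timestamp to the start of the window that contains it.
        start = ts - ((ts - datetime.min) % width)
        buckets[start].append(rec["price_usd"])
    return [
        {
            "window": {
                "start": start.isoformat(),
                "end": (start + width).isoformat(),
            },
            "avg_price": round(sum(prices) / len(prices), 2),
        }
        for start, prices in sorted(buckets.items())
    ]


# Illustrative input: only the first record comes from the example above.
sample = [
    {"timestamp": "2025-05-17T09:58:00", "price_usd": 102723.12},
    {"timestamp": "2025-05-17T09:58:30", "price_usd": 102777.14},
    {"timestamp": "2025-05-17T09:59:10", "price_usd": 102800.00},
]
results = window_averages(sample)
```

Each input record lands in exactly one window, which is the behavior a tumbling (non-overlapping) window is expected to have.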
| 58 | +--- |
| 59 | + |
| 60 | +## AWS Credentials Note |
| 61 | + |
| 62 | +This project uses `boto3` to upload Bitcoin price records to Amazon S3. |
| 63 | + |
| 64 | +If valid AWS credentials are present, records will be uploaded to: |
| 65 | + |
| 66 | +```text |
| 67 | +s3://bitcoin-price-streaming-data/data_v2/ |
| 68 | +``` |
| 69 | + |
| 70 | +⚠️ If credentials are not present, the upload will be skipped gracefully, and the JSON record will be printed instead. |
| 71 | + |
| 72 | +This ensures the notebooks run end-to-end even without AWS setup. |
| 73 | + |
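The fallback described above could look roughly like the following sketch. It is an assumption about the shape of the logic, not the code in `bitcoin_emr_utils.py`: the names `make_key` and `upload_or_print` and the object key layout are hypothetical. Only `boto3.client("s3")` and `put_object` are real Boto3 calls.

```python
import json


def make_key(record, prefix="data_v2/"):
    # Hypothetical key layout: one object per record, named by timestamp.
    safe_ts = record["timestamp"].replace(":", "-")
    return f"{prefix}btc_{safe_ts}.json"


def upload_or_print(record, bucket="bitcoin-price-streaming-data", client=None):
    """Upload the record to S3, or print it if the upload cannot happen."""
    body = json.dumps(record)
    key = make_key(record)
    try:
        if client is None:
            import boto3  # imported lazily so the sketch runs without boto3

            client = boto3.client("s3")
        client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
        return f"s3://{bucket}/{key}"
    except Exception as exc:
        # Broad on purpose: covers a missing boto3 install, missing
        # credentials, and network errors alike.
        print(f"Upload skipped ({type(exc).__name__}); record: {body}")
        return None
```

Passing a `client` explicitly makes the function easy to exercise in a notebook with a stub object instead of a real S3 client.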
| 74 | +--- |
| 75 | + |
| 76 | +## Docker Setup Instructions |
| 77 | + |
| 78 | +You can run this project entirely in Docker without installing any local dependencies. |
| 79 | + |
| 80 | +### To Build the Image |
| 81 | + |
| 82 | +```bash |
| 83 | +bash docker_build.sh |
| 84 | +``` |
| 85 | + |
| 86 | +### To Run the Container |
| 87 | + |
| 88 | +```bash |
| 89 | +bash docker_bash.sh |
| 90 | +``` |
| 91 | + |
| 92 | +### Open Jupyter |
| 93 | + |
| 94 | +Once the container is running, open your browser and go to: |
| 95 | + |
| 96 | +```text |
| 97 | +http://localhost:8888 |
| 98 | +``` |
| 99 | + |
| 100 | +--- |
| 101 | + |
| 102 | +### Notebooks to Run |
| 103 | + |
| 104 | +- `bitcoin_emr.API.ipynb` – Test API functions, simulate S3 upload |
| 105 | +- `bitcoin_emr.example.ipynb` – Simulate full pipeline input + output |
| 106 | +- Corresponding Markdown Documentation: |
| 107 | + - `bitcoin_emr.API.md` |
| 108 | + - `bitcoin_emr.example.md` |
| 109 | + |
| 110 | +Both notebooks run without requiring cloud setup. |
| 111 | + |
| 112 | +--- |
| 113 | + |
| 114 | +## Running the Spark Job on Amazon EMR (Optional) |
| 115 | + |
| 116 | +To run the Spark job (`bitcoin_streaming_consumer_emr_debug.py`) on an actual Amazon EMR cluster and process the real-time Bitcoin data stored in S3: |
| 117 | + |
| 118 | +### 1. Upload Input Data |
| 119 | + |
| 120 | +Ensure the producer script or notebook has pushed data to: |
| 121 | + |
| 122 | +```text |
| 123 | +s3://bitcoin-price-streaming-data/data_v2/ |
| 124 | +``` |
| 125 | + |
| 126 | +This folder should contain timestamped `.json` records with the following structure: |
| 127 | + |
| 128 | +```json |
| 129 | +{ |
| 130 | + "timestamp": "YYYY-MM-DDTHH:MM:SS", |
| 131 | + "price_usd": FLOAT |
| 132 | +} |
| 133 | +``` |
| 134 | + |
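Before launching a cluster, it can help to confirm that records match this structure. The check below is a minimal sketch; the function name `validate_record` and the rule that prices must be positive are assumptions, not part of the project code.

```python
from datetime import datetime


def validate_record(record):
    """Return True if the record has exactly the two expected fields."""
    if set(record) != {"timestamp", "price_usd"}:
        return False
    try:
        # Matches the YYYY-MM-DDTHH:MM:SS layout shown above.
        datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%S")
    except (TypeError, ValueError):
        return False
    price = record["price_usd"]
    # bool is a subclass of int, so exclude it explicitly.
    is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
    return is_number and price > 0
```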
| 135 | +--- |
| 136 | + |
| 137 | +### 2. Launch and Configure EMR Cluster |
| 138 | + |
| 139 | +Navigate to the [EMR Console](https://console.aws.amazon.com/elasticmapreduce/) and create a cluster with the following configurations: |
| 140 | + |
| 141 | +#### Software Configuration |
| 142 | + |
| 143 | +- **Release version**: EMR 6.x (e.g., 6.13.0) |
| 144 | +- **Applications**: Spark (uncheck others if not needed) |
| 145 | + |
| 146 | +#### Hardware Configuration |
| 147 | + |
| 148 | +- **Instance type**: `m5.xlarge` (for both Master and Core) |
| 149 | +- **Core nodes**: At least 1 |
| 150 | +- **Auto-termination**: Enable if needed to save costs |
| 151 | + |
| 152 | +#### General Configuration |
| 153 | + |
| 154 | +- **Cluster name**: `bitcoin-emr-cluster` |
| 155 | +- **Logging**: Enable and set an S3 log path (e.g., `s3://your-bucket/emr-logs/`) |
| 156 | +- **EC2 key pair**: Select a key pair for SSH access (optional but recommended) |
| 157 | + |
| 158 | +#### Networking |
| 159 | + |
| 160 | +- **VPC**: Use the default or a custom one with public subnet |
| 161 | +- **Permissions**: |
| 162 | + - Use a service role with `AmazonS3FullAccess` and `AmazonEMRFullAccessPolicy_v2` |
| 163 | + - Ensure the EC2 instance profile also has access to S3 |
| 164 | + |
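The console settings above can also be expressed as arguments to Boto3's EMR `run_job_flow` call. The sketch below only builds the request dictionary and does not contact AWS. The helper name `build_cluster_request`, the role names `EMR_DefaultRole` and `EMR_EC2_DefaultRole` (the AWS default role names, which may differ in your account), and the one-hour idle timeout are assumptions.

```python
def build_cluster_request(log_uri, key_name=None, idle_timeout_s=3600):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    instances = {
        "InstanceGroups": [
            {
                "Name": "Primary",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
        ],
        # Keep the cluster up so steps can be added after launch.
        "KeepJobFlowAliveWhenNoSteps": True,
    }
    if key_name:
        instances["Ec2KeyName"] = key_name  # optional SSH key pair
    return {
        "Name": "bitcoin-emr-cluster",
        "ReleaseLabel": "emr-6.13.0",
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": instances,
        "ServiceRole": "EMR_DefaultRole",      # assumed default role name
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed instance profile name
        "AutoTerminationPolicy": {"IdleTimeout": idle_timeout_s},
    }


request = build_cluster_request("s3://your-bucket/emr-logs/", key_name="your-key")
# With credentials configured, this could be submitted as:
#   boto3.client("emr").run_job_flow(**request)
```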
| 165 | +--- |
| 166 | + |
| 167 | +### 3. Submit the Spark Job |
| 168 | + |
| 169 | +You can submit the job in one of two ways: |
| 170 | + |
| 171 | +#### (a) Add a Step from the Console |
| 172 | + |
| 173 | +- Upload `bitcoin_streaming_consumer_emr_debug.py` to S3 (e.g., `s3://your-bucket/scripts/`) |
| 174 | +- In the cluster’s "Steps" tab, add a new step: |
| 175 | + - **Type**: Spark |
| 176 | + - **Name**: `Run Bitcoin Streaming Job` |
| 177 | + - **Script location**: |
| 178 | + ```text |
| 179 | + s3://your-bucket/scripts/bitcoin_streaming_consumer_emr_debug.py |
| 180 | + ``` |
| 181 | + - **Arguments**: Leave blank |
| 182 | + |
| 183 | +#### (b) SSH and Run Manually |
| 184 | + |
| 185 | +1. SSH into the master node: |
| 186 | + ```bash |
| 187 | + ssh -i your-key.pem hadoop@<master-node-public-dns> |
| 188 | + ``` |
| 189 | + |
| 190 | +2. Run the script using: |
| 191 | + ```bash |
| 192 | + spark-submit --deploy-mode cluster --master yarn s3://your-bucket/scripts/bitcoin_streaming_consumer_emr_debug.py |
| 193 | + ``` |
| 194 | + |
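A third route is to add the step programmatically. The sketch below builds a step definition that wraps the same `spark-submit` command in `command-runner.jar`, which is how EMR runs command-line steps. The helper name `build_spark_step` and the `CONTINUE` failure action are assumptions chosen for illustration.

```python
def build_spark_step(script_uri, name="Run Bitcoin Streaming Job"):
    """Build one step definition for boto3's EMR add_job_flow_steps call."""
    return {
        "Name": name,
        # Assumed choice: leave the cluster running if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--master", "yarn",
                script_uri,
            ],
        },
    }


step = build_spark_step(
    "s3://your-bucket/scripts/bitcoin_streaming_consumer_emr_debug.py"
)
# With credentials configured and a real cluster ID, this could be submitted as:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="<cluster-id>", Steps=[step])
```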
| 195 | +--- |
| 196 | + |
| 197 | +### 4. Output Location |
| 198 | + |
| 199 | +After execution, check the results in your S3 bucket: |
| 200 | + |
| 201 | +```text |
| 202 | +s3://bitcoin-price-streaming-data/output/ |
| 203 | +``` |
| 204 | + |
| 205 | +Each output file contains JSON records, and each record holds the average price for one 1-minute window. |
| 206 | + |
| 207 | +--- |
| 208 | + |
| 209 | +### 📝 Tip |
| 210 | + |
| 211 | +To reduce costs and catch failures early: |
| 212 | +- Use **auto-termination** after job completion |
| 213 | +- Always **terminate idle clusters** |
| 214 | +- Monitor logs in `emr-logs/` for errors or debug output |
| 215 | + |
| 216 | +--- |
| 217 | + |
| 218 | +## Summary |
| 219 | + |
| 220 | +- Docker runs the entire project with no local dependencies other than Docker itself |
| 221 | +- AWS and EMR usage is optional but supported |
| 222 | +- Notebooks simulate output if cloud access is unavailable |
| 223 | +- Fully reproducible for grading or real deployment |
| 224 | + |
| 225 | +--- |
| 226 | + |
| 227 | +**Author:** Rithika Baskaran |
| 228 | +**Course:** DATA605 — Spring 2025 |