
Big Data Module

Overview

This coursework focuses on parallelisation and scalability in the cloud using PySpark and TensorFlow/Keras.

The main objectives are to:

  • Parallelise data preprocessing tasks with Spark.
  • Measure cloud performance for different data processing configurations.
  • Analyse and evaluate results.
  • Discuss theoretical implications of cloud-based data processing.

The work is structured into three main sections:

  1. Data Preprocessing

    • Preprocess a dataset of flower images (3600 images, 5 classes).
    • Re-implement TensorFlow preprocessing in Spark to take advantage of parallelisation.
    • Write processed data to the cloud as TFRecord files (see the first sketch after this list).
  2. Performance Measurement

    • Measure data reading speeds for different parameters in the cloud.
    • Parallelise speed tests with Spark to evaluate efficiency and throughput.
    • Analyse the results, fit a linear regression, and compare cloud vs. single-machine performance (see the second sketch after this list).
  3. Theoretical Discussion

    • Relate results to cloud configuration optimisation concepts (CherryPick; Alipourfard et al., 2017).
    • Discuss strategies for different workloads (batch vs. streaming).
    • Provide insights on practical implications for large-scale machine learning in the cloud.
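
The Spark re-implementation of the preprocessing can stay quite small. The sketch below is a minimal, hypothetical version (the bucket paths and image size are assumptions, not the coursework's actual values): it reads raw images as binary files, resizes them, serialises each as a tf.train.Example, and lets every Spark partition write its own TFRecord shard in parallel. It assumes TensorFlow and Pillow are installed on the executors.

```python
import io

import tensorflow as tf
from PIL import Image
from pyspark.sql import SparkSession

IMG_SIZE = 180                                       # assumed target size
INPUT_GLOB = "gs://my-bucket/flower_photos/*/*.jpg"  # hypothetical bucket
OUTPUT_DIR = "gs://my-bucket/tfrecords"              # hypothetical output

def to_example(path_and_bytes):
    """Resize one image and serialise it as a tf.train.Example."""
    path, content = path_and_bytes
    label = path.split("/")[-2]                      # class = parent folder name
    img = Image.open(io.BytesIO(content)).convert("RGB").resize((IMG_SIZE, IMG_SIZE))
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()])),
        "label": tf.train.Feature(bytes_list=tf.train.BytesList(value=[label.encode()])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

def write_shard(index, records):
    """Write one TFRecord shard per Spark partition, in parallel."""
    # tf.io.TFRecordWriter handles gs:// paths when TF's GCS support is available
    path = f"{OUTPUT_DIR}/shard-{index:04d}.tfrecord"
    with tf.io.TFRecordWriter(path) as writer:
        for record in records:
            writer.write(record)
    yield path

spark = SparkSession.builder.appName("flower-preprocessing").getOrCreate()
# gs:// input paths require the GCS connector (present on Dataproc clusters)
shards = (spark.sparkContext.binaryFiles(INPUT_GLOB)  # RDD of (path, bytes)
          .map(to_example)
          .mapPartitionsWithIndex(write_shard)
          .collect())
print(f"wrote {len(shards)} TFRecord shards")
```

Writing one shard per partition keeps the image data out of the driver entirely, which is the point of parallelising the preprocessing.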
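
For the speed tests, each configuration can be timed in its own Spark task. The sketch below is again hypothetical (the file pattern and batch sizes are placeholders): it times a full pass over the TFRecords for several batch sizes, one configuration per task, then fits an ordinary least-squares line to the timings with NumPy.

```python
import time

import numpy as np
import tensorflow as tf
from pyspark.sql import SparkSession

FILES = tf.io.gfile.glob("gs://my-bucket/tfrecords/*.tfrecord")  # hypothetical
BATCH_SIZES = [16, 32, 64, 128, 256]                             # placeholders

def time_read(batch_size):
    """Time one full pass over the TFRecords at the given batch size."""
    start = time.perf_counter()
    ds = tf.data.TFRecordDataset(FILES).batch(batch_size)
    batches = sum(1 for _ in ds)  # force the whole dataset to be read
    return batch_size, time.perf_counter() - start

spark = SparkSession.builder.appName("speed-tests").getOrCreate()
# one Spark task per configuration, so the tests run concurrently
results = (spark.sparkContext
           .parallelize(BATCH_SIZES, numSlices=len(BATCH_SIZES))
           .map(time_read)
           .collect())

# least-squares fit: elapsed time as a linear function of batch size
x = np.array([b for b, _ in results], dtype=float)
y = np.array([t for _, t in results], dtype=float)
slope, intercept = np.polyfit(x, y, 1)
print(f"time ~= {slope:.4f} * batch_size + {intercept:.2f} s")
```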

Project Structure

  • Notebook: Main .ipynb file with coding tasks, comments, and outputs.
  • Report: PDF document containing theoretical answers, analysis, tables, and screenshots from Google Cloud.
  • Scripts: PySpark scripts for preprocessing and performance tests (e.g., spark_job.py).
  • Cloud Resources: Storage buckets and Dataproc clusters for running parallelised tasks.

Running the Project

Local Development

  • Use Google Colab or a local Jupyter notebook.
  • Install Spark locally for testing before deploying to the cloud.
    • Mount Google Drive for persistent storage and create a project directory (see the setup sketch below).
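
A typical Colab setup cell might look like the following (the project directory name is an assumption; the `!pip` line uses notebook shell syntax):

```python
# Colab setup sketch: install PySpark and mount Drive for persistent storage.
!pip install -q pyspark

from google.colab import drive
drive.mount("/content/drive")

import os
PROJECT_DIR = "/content/drive/MyDrive/bigdata"  # hypothetical directory name
os.makedirs(PROJECT_DIR, exist_ok=True)
```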

Cloud Execution

  • Set up a Google Cloud project with Dataproc and Storage APIs enabled.
  • Create a storage bucket to hold processed data and results.
  • Deploy PySpark jobs on Dataproc clusters (single-node and multi-node) to run the preprocessing and speed tests (a job skeleton is sketched after this list).
  • Collect performance metrics from the Dataproc web interface for analysis.
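
A minimal spark_job.py skeleton, assuming a hypothetical bucket name, could look like this. The same file is submitted unchanged to single-node and multi-node clusters, so any performance difference reflects the cluster configuration alone:

```python
# spark_job.py -- minimal Dataproc job skeleton (bucket name is hypothetical).
# Submit with: gcloud dataproc jobs submit pyspark spark_job.py \
#                  --cluster=<cluster-name> --region=<region>
from pyspark.sql import SparkSession

BUCKET = "gs://my-bucket"  # hypothetical bucket

def main():
    spark = SparkSession.builder.appName("dataproc-speed-test").getOrCreate()
    # Dataproc images ship with the GCS connector, so gs:// paths work directly.
    files = spark.sparkContext.binaryFiles(f"{BUCKET}/flower_photos/*/*.jpg")
    print(f"images visible to the cluster: {files.count()}")
    spark.stop()

if __name__ == "__main__":
    main()
```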

Notes

  • Code includes clear comments and explanations.
  • Performance data from cloud runs is recorded for reporting and analysis.
  • Focus is on understanding high-level workflow, parallelisation, and scalability, rather than low-level TensorFlow internals.