# Big Data Module

## Overview

This coursework focuses on parallelisation and scalability in the cloud using PySpark and TensorFlow/Keras.

The main objectives are to:

- Parallelise data preprocessing tasks with Spark.
- Measure cloud performance under different data processing configurations.
- Analyse and evaluate the results.
- Discuss the theoretical implications of cloud-based data processing.

The work is structured into three main sections:

1. Data Preprocessing
   - Preprocess a dataset of flower images (3600 images, 5 classes).
   - Re-implement the TensorFlow preprocessing pipeline in Spark to take advantage of parallelisation (a sketch of this step follows the list).
   - Write the processed data to cloud storage as TFRecord files.
2. Performance Measurement
   - Measure data reading speeds in the cloud under different parameters.
   - Parallelise the speed tests with Spark to evaluate efficiency and throughput (see the second sketch below).
   - Analyse the results, perform linear regression, and compare cloud vs. single-machine performance.
3. Theoretical Discussion
   - Relate the results to cloud configuration optimisation concepts, in particular CherryPick (Alipourfard et al., 2017).
   - Discuss strategies for different workloads (batch vs. streaming).
   - Provide insights on the practical implications of large-scale machine learning in the cloud.
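
The preprocessing step, sketched below under stated assumptions: Spark reads the images in parallel, decodes and resizes them, and writes each partition out as its own TFRecord shard. The bucket paths, image size, and helper names are hypothetical, and TensorFlow and Pillow are assumed to be installed on the Dataproc workers.

```python
# A minimal sketch of the Spark-parallelised preprocessing step, assuming
# TensorFlow and Pillow are available on the workers. Bucket paths, the
# target image size, and all names are illustrative, not the actual code.
import io

import tensorflow as tf
from PIL import Image
from pyspark.sql import SparkSession

IMG_SIZE = (180, 180)                              # assumed target size
INPUT = "gs://my-bucket/flower_photos/*/*.jpg"     # hypothetical input path
OUTPUT = "gs://my-bucket/tfrecords"                # hypothetical output prefix

spark = SparkSession.builder.appName("flower-preprocessing").getOrCreate()

def to_example(path_and_bytes):
    """Decode, resize, and serialise one image as a tf.train.Example."""
    path, raw = path_and_bytes
    label = path.split("/")[-2]                    # class name = parent folder
    img = Image.open(io.BytesIO(raw)).convert("RGB").resize(IMG_SIZE)
    example = tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()])),
        "label": tf.train.Feature(bytes_list=tf.train.BytesList(value=[label.encode()])),
    }))
    return example.SerializeToString()

def write_partition(index, records):
    """Write one Spark partition to its own TFRecord shard."""
    shard = f"{OUTPUT}/part-{index:05d}.tfrecord"
    with tf.io.TFRecordWriter(shard) as writer:
        for rec in records:
            writer.write(rec)
    return iter([shard])

shards = (spark.sparkContext.binaryFiles(INPUT)    # (path, bytes) pairs
          .map(to_example)
          .mapPartitionsWithIndex(write_partition)
          .collect())
print(f"Wrote {len(shards)} TFRecord shards")
```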
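
A similarly hedged sketch of the speed tests: each configuration becomes one Spark task that times a full pass over the shards, and a simple linear regression is then fitted to the collected timings. The parameter grid and shard location are assumptions for illustration.

```python
# A hedged sketch of the parallelised speed tests: each Spark task times a
# full pass over the TFRecord shards for one (batch_size, num_parallel_reads)
# configuration; a linear fit is computed per parallelism level afterwards.
import time

import numpy as np
import tensorflow as tf
from pyspark.sql import SparkSession

SHARDS = "gs://my-bucket/tfrecords/part-*.tfrecord"   # hypothetical location
CONFIGS = [(b, r) for b in (32, 64, 128, 256) for r in (1, 2, 4)]

def time_read(config):
    """Time one full pass over the shards under the given configuration."""
    batch_size, parallel_reads = config
    files = tf.data.Dataset.list_files(SHARDS)
    ds = (tf.data.TFRecordDataset(files, num_parallel_reads=parallel_reads)
          .batch(batch_size)
          .prefetch(tf.data.AUTOTUNE))
    start = time.perf_counter()
    for _ in ds:                                      # force a full read
        pass
    return batch_size, parallel_reads, time.perf_counter() - start

spark = SparkSession.builder.appName("speed-tests").getOrCreate()
results = (spark.sparkContext
           .parallelize(CONFIGS, numSlices=len(CONFIGS))  # one task per config
           .map(time_read)
           .collect())

# Simple linear fit of elapsed time against batch size, per parallelism level
for r in sorted({pr for _, pr, _ in results}):
    xs = np.array([b for b, pr, _ in results if pr == r], dtype=float)
    ys = np.array([t for _, pr, t in results if pr == r])
    slope, intercept = np.polyfit(xs, ys, 1)
    print(f"num_parallel_reads={r}: time ≈ {slope:.4f} * batch_size + {intercept:.2f} s")
```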

## Project Structure

- Notebook: the main .ipynb file with coding tasks, comments, and outputs.
- Report: a PDF containing the theoretical answers, analysis, tables, and screenshots from Google Cloud.
- Scripts: PySpark scripts for the preprocessing and performance tests (e.g., spark_job.py).
- Cloud resources: storage buckets and Dataproc clusters for running the parallelised tasks.

## Running the Project

### Local Development

- Use Google Colab or a local Jupyter notebook.
- Install Spark locally for testing before deploying to the cloud.
- Mount Google Drive for persistent storage and create a project directory (see the setup sketch below).
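
A typical Colab setup cell might look like the following sketch; the project directory name is an assumption:

```python
# A sketch of a typical Colab setup cell; the project directory is an
# assumed name. Run `!pip install pyspark` in a cell first if Spark is
# not already available.
import os

from google.colab import drive
from pyspark.sql import SparkSession

drive.mount("/content/drive")                      # persistent storage
PROJECT_DIR = "/content/drive/MyDrive/big_data_module"   # hypothetical path
os.makedirs(PROJECT_DIR, exist_ok=True)

# Local SparkSession for testing before deploying to Dataproc
spark = (SparkSession.builder
         .master("local[*]")                       # use all local cores
         .appName("local-testing")
         .getOrCreate())
print(spark.version)
```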

### Cloud Execution

- Set up a Google Cloud project with the Dataproc and Cloud Storage APIs enabled.
- Create a storage bucket to hold the processed data and results.
- Deploy PySpark jobs on Dataproc clusters (single-node and multi-node) to run the preprocessing and speed tests (illustrative commands below).
- Collect performance metrics from the Dataproc web interface for analysis.
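
The commands below illustrate this workflow; the project ID, bucket, cluster names, and region are placeholders, not the coursework's real resources:

```bash
# Illustrative gcloud/gsutil commands; all names and the region are
# placeholders for whatever resources the project actually uses.
gcloud config set project my-project-id

# Bucket for the processed data and results
gsutil mb -l europe-west2 gs://my-bucket

# Single-node cluster for the baseline runs
gcloud dataproc clusters create cluster-single \
    --region=europe-west2 --single-node

# Multi-node cluster for the parallelised runs
gcloud dataproc clusters create cluster-multi \
    --region=europe-west2 --num-workers=2

# Submit the PySpark job (preprocessing or speed tests)
gcloud dataproc jobs submit pyspark spark_job.py \
    --cluster=cluster-multi --region=europe-west2
```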

## Notes

- The code includes clear comments and explanations.
- Performance data from the cloud tasks is recorded for reporting and analysis.
- The focus is on the high-level workflow, parallelisation, and scalability, rather than on low-level TensorFlow internals.