Uploading data to GCS:
gsutil -m cp -r pq/ gs://dtc_data_lake_de-zoomcamp-nytaxi/pq
Download the jar for connecting to GCS to any location (e.g. the lib folder):
Note: For other versions of the GCS connector for Hadoop, see Cloud Storage connector.
gsutil cp gs://hadoop-lib/gcs/gcs-connector-hadoop3-2.2.5.jar ./lib/
See the notebook with the configuration in 09_spark_gcs.ipynb
(Thanks Alvin Do for the instructions!)
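The notebook wires the connector into the session by pointing Spark at the jar and at a service-account key; a minimal sketch of that kind of configuration (the key path is hypothetical, assuming a local master):
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext

credentials_location = '/path/to/google_credentials.json'  # hypothetical key path

conf = SparkConf() \
    .setMaster('local[*]') \
    .setAppName('test') \
    .set('spark.jars', './lib/gcs-connector-hadoop3-2.2.5.jar') \
    .set('spark.hadoop.google.cloud.auth.service.account.enable', 'true') \
    .set('spark.hadoop.google.cloud.auth.service.account.json.keyfile', credentials_location)

sc = SparkContext(conf=conf)

# register the connector as the implementation behind the gs:// scheme
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('fs.AbstractFileSystem.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS')
hadoop_conf.set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
hadoop_conf.set('fs.gs.auth.service.account.enable', 'true')
hadoop_conf.set('fs.gs.auth.service.account.json.keyfile', credentials_location)

spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()

# read the parquet files uploaded earlier, straight from the bucket
df_green = spark.read.parquet('gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/*/*')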
Creating a stand-alone cluster (docs):
./sbin/start-master.sh
Creating a worker:
URL="spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077"
./sbin/start-slave.sh ${URL}
# for newer versions of Spark, use:
# ./sbin/start-worker.sh ${URL}
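Once the master and a worker are running, a notebook or script attaches to the cluster by using the master URL instead of local[*]; a minimal sketch:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077') \
    .appName('test') \
    .getOrCreate()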
Turn the notebook into a script:
jupyter nbconvert --to=script 06_spark_sql.ipynb
Edit the script and then run it:
python 06_spark_sql.py \
--input_green=data/pq/green/2020/*/ \
--input_yellow=data/pq/yellow/2020/*/ \
    --output=data/report-2020
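The edit consists mostly of replacing the hardcoded input and output paths with command-line arguments; a sketch of the resulting script structure, with the transformation logic elided:
import argparse
from pyspark.sql import SparkSession

parser = argparse.ArgumentParser()
parser.add_argument('--input_green', required=True)
parser.add_argument('--input_yellow', required=True)
parser.add_argument('--output', required=True)
args = parser.parse_args()

spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()

df_green = spark.read.parquet(args.input_green)
df_yellow = spark.read.parquet(args.input_yellow)
# ... build the report from the two datasets ...
# df_result.write.parquet(args.output)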
URL="spark://de-zoomcamp.europe-west1-b.c.de-zoomcamp-nytaxi.internal:7077"
spark-submit \
--master="${URL}" \
06_spark_sql.py \
--input_green=data/pq/green/2021/*/ \
--input_yellow=data/pq/yellow/2021/*/ \
    --output=data/report-2021
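Note that for this to work the script should not set the master itself; spark-submit passes it in through --master, so the same code runs locally, on the stand-alone cluster, or later on Dataproc:
from pyspark.sql import SparkSession

# no .master(...) here -- spark-submit supplies it
spark = SparkSession.builder \
    .appName('test') \
    .getOrCreate()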
Upload the script to GCS:
gsutil -m cp -r 06_spark_sql.py gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py
Params for the job:
--input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2021/*/
--input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2021/*/
--output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2021
Using the Google Cloud SDK for submitting to Dataproc (link):
gcloud dataproc jobs submit pyspark \
--cluster=de-zoomcamp-cluster \
--region=europe-west6 \
gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py \
-- \
--input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \
--input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \
    --output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2020
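The same job can also be submitted programmatically through the Dataproc Python client instead of the gcloud CLI; a rough sketch, assuming the google-cloud-dataproc package is installed and that the project id is de-zoomcamp-nytaxi (as in the master hostname above):
from google.cloud import dataproc_v1

region = 'europe-west6'
client = dataproc_v1.JobControllerClient(
    client_options={'api_endpoint': f'{region}-dataproc.googleapis.com:443'}
)

job = {
    'placement': {'cluster_name': 'de-zoomcamp-cluster'},
    'pyspark_job': {
        'main_python_file_uri': 'gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql.py',
        'args': [
            '--input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/',
            '--input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/',
            '--output=gs://dtc_data_lake_de-zoomcamp-nytaxi/report-2020',
        ],
    },
}

operation = client.submit_job_as_operation(
    request={'project_id': 'de-zoomcamp-nytaxi', 'region': region, 'job': job}
)
result = operation.result()  # block until the job finishes
print(result.status.state)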
Upload the script to GCS:
gsutil -m cp -r 06_spark_sql_big_query.py gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql_big_query.py
Write the results to BigQuery (docs):
gcloud dataproc jobs submit pyspark \
--cluster=de-zoomcamp-cluster \
--region=europe-west6 \
--jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
gs://dtc_data_lake_de-zoomcamp-nytaxi/code/06_spark_sql_big_query.py \
-- \
--input_green=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/green/2020/*/ \
--input_yellow=gs://dtc_data_lake_de-zoomcamp-nytaxi/pq/yellow/2020/*/ \
    --output=trips_data_all.reports-2020
There can be an issue with the latest Spark version and the BigQuery connector. Download links to the jar file for the respective Spark versions can be found at: Spark and BigQuery connector
Note: Dataproc on GCE 2.1+ images pre-install the Spark BigQuery connector (Dataproc Release 2.2), so there is no need to include the jar file in the job submission.
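Inside 06_spark_sql_big_query.py the report is written through the connector's bigquery format; a minimal sketch of the write step, where df_result is the final DataFrame, output holds the table name (e.g. trips_data_all.reports-2020), and the temporary bucket name is illustrative:
# the connector stages data in a GCS bucket before loading it into BigQuery
spark.conf.set('temporaryGcsBucket', 'dataproc-temp-europe-west6-example')  # hypothetical bucket

df_result.write.format('bigquery') \
    .option('table', output) \
    .save()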