We will start the Spark Streaming process in the DataProc cluster we created to communicate with the Kafka VM instance over the port 9092. Remember, we opened port 9092 for it to be able to accept connections.
-
Establish SSH connection to the master node
ssh streamify-spark
-
Clone git repo
git clone https://github.com/ankurchavda/streamify.git && \ cd streamify/spark_streaming
-
Set the evironment variables -
-
External IP of the Kafka VM so that spark can connect to it
-
Name of your GCS bucket. (What you gave during the terraform setup)
export KAFKA_ADDRESS=IP.ADD.RE.SS export GCP_GCS_BUCKET=bucket-name
Note: You will have to setup these env vars every time you create a new shell session. Or if you stop/start your cluster
-
-
Start reading messages
spark-submit \ --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \ stream_all_events.py
-
If all went right, you should see new
parquetfiles in your bucket! That is Spark writing a file every two minutes for each topic. -
Topics we are reading from
- listen_events
- page_view_events
- auth_events
