streamify/setup/spark.md at main · ankurchavda/streamify

Setup Spark Cluster

We will start the Spark Streaming process in the DataProc cluster we created to communicate with the Kafka VM instance over the port 9092. Remember, we opened port 9092 for it to be able to accept connections.

Establish SSH connection to the master node
```
ssh streamify-spark
```

Clone git repo

git clone https://github.com/ankurchavda/streamify.git && \
cd streamify/spark_streaming

Set the evironment variables -
- External IP of the Kafka VM so that spark can connect to it
- Name of your GCS bucket. (What you gave during the terraform setup)
```
export KAFKA_ADDRESS=IP.ADD.RE.SS
export GCP_GCS_BUCKET=bucket-name
```
  Note: You will have to setup these env vars every time you create a new shell session. Or if you stop/start your cluster

Start reading messages

spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
stream_all_events.py

If all went right, you should see new parquet files in your bucket! That is Spark writing a file every two minutes for each topic.
Topics we are reading from
- listen_events
- page_view_events
- auth_events

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Setup Spark Cluster

FilesExpand file tree

spark.md

Latest commit

History

spark.md

File metadata and controls

Setup Spark Cluster