Commit c3ed833

Add: Spark application, job, stage, task hierarchy explained
Closes #228

1 file changed, 135 additions & 0 deletions
---
id: fe61ce3be4
question: What is the difference between a Spark application, job, stage, and task?
sort_order: 67
---

One of the first places where Spark concepts appear is the graphical interface (the Spark UI). There we see terms like application, job, stage, or task, but at first it's not always clear how they relate to each other. Understanding this hierarchy is very useful because it allows us to interpret what Spark is doing internally, debug problems, and better understand the performance of our processes.

## Applications

A Spark application is the complete program we execute. It is the entire process that begins when we launch something like:

```bash
spark-submit script.py
```

or when we start a session in PySpark or a notebook.

```python
import pyspark
import os
from pyspark.sql import SparkSession

# Creating the session starts the driver of the application
spark = SparkSession.builder \
    .master(os.environ.get('SPARK_MASTER')) \
    .appName("csv-to-parquet") \
    .getOrCreate()
```

A Spark application includes:

* The driver program, which coordinates execution.
* The executors, which perform the distributed work.
* All operations executed by the program until it finishes.

We can think of the application as the complete execution of our program.

For example, if we run a PySpark script that 1) reads a dataset, 2) performs several transformations, and 3) writes a result; all of that forms a single Spark application.

The graphical interface shows one entry for each executed application.
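
As a minimal sketch (the file names and columns here are made up for illustration), such a script could look like this; everything in it, from reading the data to writing the result, belongs to the same application:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("single-application").getOrCreate()

# 1) read a dataset (hypothetical path)
df = spark.read.csv("rides.csv", header=True, inferSchema=True)

# 2) perform several transformations (lazy: nothing runs yet)
cleaned = (
    df
    .filter(F.col("passenger_count") > 0)
    .withColumn("trip_date", F.to_date("pickup_datetime"))
)

# 3) write a result (an action: this is where the distributed work happens)
cleaned.write.mode("overwrite").parquet("rides_clean.parquet")

# The application ends when the script finishes and the session stops
spark.stop()
```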

## Jobs

Within an application, Spark divides the work into jobs. A job is created every time we execute an action on a DataFrame. As we saw in previous chapters, in Spark there are two types of operations:

* Transformations: describe an operation on the data but are not executed immediately.
* Actions: trigger the immediate execution of the transformations described up to that point.

Some examples of actions are `show()`, `count()`, `collect()`, `write`, or `save`. Every time we call an action, Spark creates a new job.

In the script:

```python
df = spark.read.parquet("data.parquet")
df_filtered = df.filter("price > 10")  # transformation: nothing runs yet

df_filtered.count()  # action: job 1
df_filtered.show()   # action: job 2
```

... two separate jobs will be executed, one for `count()` and one for `show()`, even though both use the same _DataFrame_.

This happens because Spark evaluates transformations lazily and only executes the plan when a result is requested.
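
One way to see this laziness directly is `explain()`, which only prints the execution plan without running anything; jobs appear only when an action is called. A small illustrative check, reusing the same hypothetical `data.parquet`:

```python
df = spark.read.parquet("data.parquet")
df_filtered = df.filter("price > 10")

# explain() prints the plan Spark *would* run; it does not trigger the computation
df_filtered.explain()

# count() is the action that actually submits a job
df_filtered.count()
```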

## Stages

Each job is divided into stages. Stages represent groups of operations that can be executed without needing to redistribute data between nodes.

The reason a job is split into stages is usually an operation called a _shuffle_. A _shuffle_ occurs when data must be redistributed between partitions, for example in operations like `groupBy`, `join`, `distinct`, or `reduceByKey`.

When Spark detects that a _shuffle_ is needed, it divides the job into several stages.

```python
df.groupBy("city").count()
```

This typically generates an execution plan roughly like this:

* Stage 1: data reading and initial transformation
* Shuffle
* Stage 2: final aggregation

Within each stage, the work runs in parallel across multiple nodes; a stage that depends on a _shuffle_ must wait for the previous stage to finish.
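
A quick way to see where that boundary falls is to print the physical plan: the _shuffle_ shows up as an `Exchange` node between the two aggregation steps. This is just a rough check (the exact output depends on the Spark version, and a `data.parquet` file with a `city` column is only an assumed example):

```python
df = spark.read.parquet("data.parquet")  # hypothetical input with a "city" column

# The plan typically contains something like:
#   HashAggregate (final)
#   +- Exchange hashpartitioning(city#..., 200)   <- the shuffle / stage boundary
#      +- HashAggregate (partial)
df.groupBy("city").count().explain()
```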

## Tasks

A task is the smallest unit of work in Spark. Each stage is divided into multiple tasks, and each task processes one partition of data.

For example, if a dataset has 200 partitions in a given stage, Spark will launch 200 tasks for that stage, and each task will be executed by an executor.

In other words:

```
Stage
├─ Task 1: processes partition 1
├─ Task 2: processes partition 2
├─ Task 3: processes partition 3
...
```

The more partitions there are, the more tasks Spark can execute in parallel.
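
To connect tasks with partitions in practice, we can check how many partitions a DataFrame has and, for stages that come after a _shuffle_, control that number with the `spark.sql.shuffle.partitions` setting. A small sketch (the file name is again just an example):

```python
df = spark.read.parquet("rides.parquet")  # hypothetical input

# Number of partitions = number of tasks in the stage that processes this DataFrame
print(df.rdd.getNumPartitions())

# Stages created after a shuffle use spark.sql.shuffle.partitions partitions
# (200 by default), i.e. 200 tasks
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Lowering it reduces the number of tasks in post-shuffle stages
spark.conf.set("spark.sql.shuffle.partitions", "48")
```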

## Full Hierarchy in an Example

Imagine this code:

```python
df = spark.read.parquet("rides.parquet")

result = (
    df
    .filter("passenger_count > 2")
    .groupBy("PULocationID")
    .count()
)

result.show()  # the action that creates the job
```

The execution might look like this:

* Application: the complete script.
* Job: created by `show()`.
* Stages:
  * Reading and filtering.
  * _Shuffle_ and aggregation.
* Tasks: one per partition.
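
All of these levels can be inspected while the application is running. Assuming the session stays alive (for example in a notebook), the address of the graphical interface is available from the SparkContext:

```python
# URL of the web UI for this application (typically http://<driver-host>:4040).
# The Jobs tab lists each job; opening a job shows its stages, and each stage
# lists its tasks, one per partition.
print(spark.sparkContext.uiWebUrl)
```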

## Relevant Links

To get in-depth information about these concepts, check:

* [Job Scheduling](https://spark.apache.org/docs/latest/job-scheduling.html)
* [Resilient Distributed Dataset Programming Guide: Transformations](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations)
* [Resilient Distributed Dataset Programming Guide: Actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions)
