|
2 | 2 |
|
3 | 3 | = Spark |
4 | 4 |
|
5 | | -We will not use Hadoop (which offers both a file system and a computing framework with MapReduce), but a more modern version: Spark. |
| 5 | +A modern framework for big data processing is Spark, it does not provide a distributed file system, but only the *processing* framework. |
| 6 | +Storage needs to be handled by another technology (typically _Hadoop_). |
| 7 | +Main Spark concepts are: |
6 | 8 |
|
7 | | -Spark is essentially a workflow system with additional features, such as a more efficient way of coping with failures and grouping tasks among compute nodes. The central data abstraction of Spark is called the Resilient Distributed Dataset (RDD). |
| 9 | +/ Resilient Distributed Datasets (RDDs): the main *data* abstraction in Spark, representing a distributed collection of objects (with the same format) that can be processed in parallel. |
| 10 | + An RDD is partitioned in *chunks* that may be held at different compute nodes. |
| 11 | + RDDs are *immutable*, meaning that once created, they cannot be modified. Instead, *transformations* on RDDs produce new RDDs. |
8 | 12 |
|
9 | | -The adjective "resilient" symbolizes the capacity to recover from the loss of any or all chunks of an RDD. |
| 13 | + #note[ |
| 14 | + The adjective "resilient" symbolizes the capacity to recover from the loss of any or all chunks of an RDD. |
| 15 | + ] |
10 | 16 |
|
11 | | -#note[ An RDD is a collection of objects of the same type, similar to the files of key-value pairs used in MapReduce systems. ] |
| 17 | + #note[ |
| 18 | + Spark does *not* enforce key-value pairs as the data model for RDDs, but many operations (like reduceByKey) are designed for pair RDDs. |
| 19 | + ] |
12 | 20 |
|
13 | | -#warning[ RDDs are distributed in the sense that an RDD is normally broken into chunks that may be held at different compute nodes. ] |
| 21 | + #warning[ |
| 22 | + An RDD is *not* persistent: it is lost as soon as the machine turns off. |
| 23 | + Results must be stored in a distributed filesystem. |
| 24 | + ] |
14 | 25 |
|
15 | | -#warning[ Spark does not offer a distributed file system; it only provides the processing framework. Storage needs to be handled by another technology (typically Hadoop). ] |
| 26 | +/ Workflow: a sequence of *lazy transformations* and *actions* on RDDs that define a Spark program. |
16 | 27 |
|
17 | | -#note[ We refer to the machine that sends commands to Spark and displays output as the driver. This machine is not meant to handle big data itself. ] |
| 28 | + #warning[ |
| 29 | + Each operation on an RDD is lazy, meaning that it does not immediately compute a result. Instead, it builds up a graph of transformations that is executed when an action (collect, count, etc.) is called. |
18 | 30 |
|
19 | | -Basically, a Spark program is a sequence of transformations (operations) that modify an RDD and return a new RDD. |
| 31 | + Calling multiple times an action on the same RDD will trigger the execution of the entire graph of transformations each time, unless the RDD is *cached*. |
| 32 | + ] |
20 | 33 |
|
21 | | -There is also another type of operation called actions. These take an RDD and either save it to the surrounding filesystem or produce an output to pass back to the application that called the Spark program. |
| 34 | +/ Driver: the machine that runs the main program and coordinates the execution of tasks across the cluster. |
22 | 35 |
|
23 | | -#warning[ There is no restriction on the type of elements that comprise an RDD. ] |
| 36 | +The full #link("https://spark.apache.org/docs/latest/api/python/reference/pyspark.html")[RDD Spark API] is large, the essential operations are: |
24 | 37 |
|
25 | | -#warning[ An RDD is not persistent; it is lost as soon as the machine turns off. Results must be stored in a distributed filesystem. ] |
| 38 | +- *RDD creation*: |
| 39 | + - *parallelize*: creates an RDD from an in memory collection on the driver. |
| 40 | + - *textFile*: creates an RDD from a text file (typically in distributed storage). |
26 | 41 |
|
27 | | -While there are many possible operations in a Spark program, we can list the most essential ones: |
| 42 | +- *local transformations* (no shuffle, usually faster): |
| 43 | + - *map*: applies a function to each element, producing exactly one output element per input element. |
| 44 | + - *flatMap*: like `map`, but each input element can produce zero, one, or many output elements. |
| 45 | + - *filter*: keeps only elements that satisfy a predicate. |
| 46 | + - *union*: combines two RDDs into one RDD, concatenating their elements. |
28 | 47 |
|
29 | | -- *parallelize*: transform an object that lives in RAM to a RDD |
| 48 | +- *shuffle transformations* (network communication, usually slower): |
| 49 | + - *reduceByKey*: for pair RDDs `(K, V)`, aggregates values per key using an associative and commutative function. |
| 50 | + - *groupByKey*: for pair RDDs `(K, V)`, groups all values for each key into `(K, Iterable[V])`. |
| 51 | + - *join*: for pair RDDs, joins by key and returns `(K, (V1, V2))`. |
| 52 | + - *distinct*: removes duplicates. |
30 | 53 |
|
31 | | -- *textFile*: transform a file into a RDD |
| 54 | +- *persistence*: |
| 55 | + - *cache*: asks Spark to persist the RDD instead of recomputing it every time it is needed (usually in memory, possibly with fallback to disk depending on storage level). |
32 | 56 |
|
33 | | -- *map*: takes a parameter that is a function, and applies it to avery element of an RDD, producing an RDD. note that respect of the map reduce, in here a map function can apply to any object type, but it produced exactly one object as result. |
34 | | - |
35 | | -- *flatMap*: like map, but flatten the result (if multi-dimensional). |
36 | | - Is the analogous to the function Map of MapReduce, but without the requirement that all types be key-value pairs. |
37 | | - |
38 | | -- *filter*: select only elements that satisfy a predicate. |
39 | | - Takes as a parameter a predicate that applies to the type of objects in the input RDD, and returns `true` or `false` for each object. |
40 | | - The final output consists of only those objects for which the filter function returns `true` |
41 | | - |
42 | | -- *reduceByKey*: reduction like in functional programming. |
43 | | - Is an action. |
44 | | - Takes as input a function which takes two elements of some type `T` and return another element of type `T`. |
45 | | - The action is applied repeatedly to each pairs of consecutive elements until it remains a single element. |
46 | | - The operation should be _commutative_ and _associative_. |
47 | | - It does both _shuffling_ and _reducing_ steps. |
48 | | - Can be applied only if the working set is composed of pairs. |
49 | | - |
50 | | -- *groupByKey*: groups values by key without aggregating them. |
51 | | - Can be applied only if the working set is composed of pairs. |
52 | | - Takes in input an RDD whose type is a key-value pairs. |
53 | | - Produces key-value pairs where the value is a list of all values for that key. |
54 | | - |
55 | | -- *join*: joins two RDDs by key. The type of each RDD must be a key-value pair, and the key types of both relations must be the same. |
56 | | - Returns pairs of (key, (value1, value2)). |
57 | | - |
58 | | -- *count*: count the total number of elements in the RDD. |
59 | | - This is an action that returns a value to the driver. |
60 | | - |
61 | | -- *collect*: bring the data from the RDD into the RAM of the driver _(use with caution!)_ |
62 | | - |
63 | | -- *take*: like collect, but instead of getting all the data, only selects as many random records as specified (safe alternative to `collect`) |
64 | | - |
65 | | -- *distinct*: remove duplicate elements from the RDD |
66 | | - |
67 | | -- *union*: combine two RDDs into a single RDD |
68 | | - |
69 | | -- *cache*: cache the RDD, keeping it in the RAM instead of distributing over the whole system (applied only if possible) |
70 | | - |
71 | | -- *saveAsTextFile*: save the RDD as a text file in the distributed filesystem. |
72 | | - Each element is written on a separate line. |
73 | | - This is the standard way to persist results beyond the Spark session. |
| 57 | +- *actions* (trigger execution and return results or write output): |
| 58 | + - *count*: returns the number of elements to the driver. |
| 59 | + - *collect*: returns all elements to the driver memory (use with caution). |
| 60 | + - *take*: returns the first `n` elements to the driver (safer than `collect` for inspection). |
| 61 | + - *saveAsTextFile*: writes the RDD to text files in distributed storage (one element per line). |
0 commit comments