Skip to content

Commit 8a04321

Browse files
authored
Merge pull request #18 from Favo02/various-fixes
Various fixes
2 parents 79084e0 + 4c0c6fd commit 8a04321

4 files changed

Lines changed: 50 additions & 65 deletions

File tree

README.md

Lines changed: 4 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,11 +1,7 @@
11
# Handbook for Algorithms for Massive Datasets
22

3-
> [!WARNING]
4-
> This repository contains the notes for the course.
5-
> Our projects are in separate repos: [Luca](https://github.com/Favo02/recommendation-system), [Giacomo](https://github.com/comitanigiacomo/Comitani_85673A_Stream_Analysis).
6-
73
> [!CAUTION]
8-
> The notes are still unfinished (and will probably remain like that).
4+
> The notes are still unfinished (and will probably remain like that as we both passed the exam).
95
> All topics have an open Pull Request, that must be reviewed before merging.
106
>
117
> For a complete _temporary_ PDF, check [this](https://github.com/Favo02/algorithms-for-massive-datasets/pull/17) PR.
@@ -18,8 +14,9 @@ These notes are open source: [github.com/Favo02/algorithms-for-massive-datasets]
1814

1915
Contributions and corrections are welcome via Issues or Pull Requests.
2016

21-
> [!CAUTION]
22-
> Check `TODO` comments, open Issues and Pull Requests.
17+
> [!NOTE]
18+
> This repository contains the notes for the course.
19+
> Our projects are in separate repos: [Luca](https://github.com/Favo02/recommendation-system), [Giacomo](https://github.com/comitanigiacomo/Comitani_85673A_Stream_Analysis).
2320
2421
## Compiled PDF
2522

chapters/3-similarity.typ

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -685,7 +685,7 @@ The *cosine distance* can be used on vectors with a common origin (directions in
685685
The resulting distance will be in the range 0 to 180 degrees regardless of the dimension of the vector.
686686

687687
Given two vectors $x$ and $y$, the cosine distance is defined exploiting the dot product and normalization:
688-
$ d(x, y) = theta_(x y) = arccos (x y)/(||x|| ||y||) $
688+
$ d(x, y) = theta_(x y) = (x y)/(||x|| ||y||) $
689689

690690
#warning[
691691
Two vectors with the same direction but different *magnitude* are considered the same.

chapters/4-frequent-itemsets.typ

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -746,8 +746,8 @@ But what about FN?
746746
$
747747
"Supp"(I) & = sum_c "Supp"_(c)(I) \
748748
& < sum_c p s \
749-
& = k dot p s \
750-
& = s
749+
& < k dot p s \
750+
& < s
751751
$
752752

753753
Because $I$ is frequent, it must be $"Supp"(I) >= s$, which is a contradiction $qed$.

chapters/a-spark.typ

Lines changed: 43 additions & 55 deletions
Original file line numberDiff line numberDiff line change
@@ -2,72 +2,60 @@
22

33
= Spark
44

5-
We will not use Hadoop (which offers both a file system and a computing framework with MapReduce), but a more modern version: Spark.
5+
A modern framework for big data processing is Spark, it does not provide a distributed file system, but only the *processing* framework.
6+
Storage needs to be handled by another technology (typically _Hadoop_).
7+
Main Spark concepts are:
68

7-
Spark is essentially a workflow system with additional features, such as a more efficient way of coping with failures and grouping tasks among compute nodes. The central data abstraction of Spark is called the Resilient Distributed Dataset (RDD).
9+
/ Resilient Distributed Datasets (RDDs): the main *data* abstraction in Spark, representing a distributed collection of objects (with the same format) that can be processed in parallel.
10+
An RDD is partitioned in *chunks* that may be held at different compute nodes.
11+
RDDs are *immutable*, meaning that once created, they cannot be modified. Instead, *transformations* on RDDs produce new RDDs.
812

9-
The adjective "resilient" symbolizes the capacity to recover from the loss of any or all chunks of an RDD.
13+
#note[
14+
The adjective "resilient" symbolizes the capacity to recover from the loss of any or all chunks of an RDD.
15+
]
1016

11-
#note[ An RDD is a collection of objects of the same type, similar to the files of key-value pairs used in MapReduce systems. ]
17+
#note[
18+
Spark does *not* enforce key-value pairs as the data model for RDDs, but many operations (like reduceByKey) are designed for pair RDDs.
19+
]
1220

13-
#warning[ RDDs are distributed in the sense that an RDD is normally broken into chunks that may be held at different compute nodes. ]
21+
#warning[
22+
An RDD is *not* persistent: it is lost as soon as the machine turns off.
23+
Results must be stored in a distributed filesystem.
24+
]
1425

15-
#warning[ Spark does not offer a distributed file system; it only provides the processing framework. Storage needs to be handled by another technology (typically Hadoop). ]
26+
/ Workflow: a sequence of *lazy transformations* and *actions* on RDDs that define a Spark program.
1627

17-
#note[ We refer to the machine that sends commands to Spark and displays output as the driver. This machine is not meant to handle big data itself. ]
28+
#warning[
29+
Each operation on an RDD is lazy, meaning that it does not immediately compute a result. Instead, it builds up a graph of transformations that is executed when an action (collect, count, etc.) is called.
1830

19-
Basically, a Spark program is a sequence of transformations (operations) that modify an RDD and return a new RDD.
31+
Calling multiple times an action on the same RDD will trigger the execution of the entire graph of transformations each time, unless the RDD is *cached*.
32+
]
2033

21-
There is also another type of operation called actions. These take an RDD and either save it to the surrounding filesystem or produce an output to pass back to the application that called the Spark program.
34+
/ Driver: the machine that runs the main program and coordinates the execution of tasks across the cluster.
2235

23-
#warning[ There is no restriction on the type of elements that comprise an RDD. ]
36+
The full #link("https://spark.apache.org/docs/latest/api/python/reference/pyspark.html")[RDD Spark API] is large, the essential operations are:
2437

25-
#warning[ An RDD is not persistent; it is lost as soon as the machine turns off. Results must be stored in a distributed filesystem. ]
38+
- *RDD creation*:
39+
- *parallelize*: creates an RDD from an in memory collection on the driver.
40+
- *textFile*: creates an RDD from a text file (typically in distributed storage).
2641

27-
While there are many possible operations in a Spark program, we can list the most essential ones:
42+
- *local transformations* (no shuffle, usually faster):
43+
- *map*: applies a function to each element, producing exactly one output element per input element.
44+
- *flatMap*: like `map`, but each input element can produce zero, one, or many output elements.
45+
- *filter*: keeps only elements that satisfy a predicate.
46+
- *union*: combines two RDDs into one RDD, concatenating their elements.
2847

29-
- *parallelize*: transform an object that lives in RAM to a RDD
48+
- *shuffle transformations* (network communication, usually slower):
49+
- *reduceByKey*: for pair RDDs `(K, V)`, aggregates values per key using an associative and commutative function.
50+
- *groupByKey*: for pair RDDs `(K, V)`, groups all values for each key into `(K, Iterable[V])`.
51+
- *join*: for pair RDDs, joins by key and returns `(K, (V1, V2))`.
52+
- *distinct*: removes duplicates.
3053

31-
- *textFile*: transform a file into a RDD
54+
- *persistence*:
55+
- *cache*: asks Spark to persist the RDD instead of recomputing it every time it is needed (usually in memory, possibly with fallback to disk depending on storage level).
3256

33-
- *map*: takes a parameter that is a function, and applies it to avery element of an RDD, producing an RDD. note that respect of the map reduce, in here a map function can apply to any object type, but it produced exactly one object as result.
34-
35-
- *flatMap*: like map, but flatten the result (if multi-dimensional).
36-
Is the analogous to the function Map of MapReduce, but without the requirement that all types be key-value pairs.
37-
38-
- *filter*: select only elements that satisfy a predicate.
39-
Takes as a parameter a predicate that applies to the type of objects in the input RDD, and returns `true` or `false` for each object.
40-
The final output consists of only those objects for which the filter function returns `true`
41-
42-
- *reduceByKey*: reduction like in functional programming.
43-
Is an action.
44-
Takes as input a function which takes two elements of some type `T` and return another element of type `T`.
45-
The action is applied repeatedly to each pairs of consecutive elements until it remains a single element.
46-
The operation should be _commutative_ and _associative_.
47-
It does both _shuffling_ and _reducing_ steps.
48-
Can be applied only if the working set is composed of pairs.
49-
50-
- *groupByKey*: groups values by key without aggregating them.
51-
Can be applied only if the working set is composed of pairs.
52-
Takes in input an RDD whose type is a key-value pairs.
53-
Produces key-value pairs where the value is a list of all values for that key.
54-
55-
- *join*: joins two RDDs by key. The type of each RDD must be a key-value pair, and the key types of both relations must be the same.
56-
Returns pairs of (key, (value1, value2)).
57-
58-
- *count*: count the total number of elements in the RDD.
59-
This is an action that returns a value to the driver.
60-
61-
- *collect*: bring the data from the RDD into the RAM of the driver _(use with caution!)_
62-
63-
- *take*: like collect, but instead of getting all the data, only selects as many random records as specified (safe alternative to `collect`)
64-
65-
- *distinct*: remove duplicate elements from the RDD
66-
67-
- *union*: combine two RDDs into a single RDD
68-
69-
- *cache*: cache the RDD, keeping it in the RAM instead of distributing over the whole system (applied only if possible)
70-
71-
- *saveAsTextFile*: save the RDD as a text file in the distributed filesystem.
72-
Each element is written on a separate line.
73-
This is the standard way to persist results beyond the Spark session.
57+
- *actions* (trigger execution and return results or write output):
58+
- *count*: returns the number of elements to the driver.
59+
- *collect*: returns all elements to the driver memory (use with caution).
60+
- *take*: returns the first `n` elements to the driver (safer than `collect` for inspection).
61+
- *saveAsTextFile*: writes the RDD to text files in distributed storage (one element per line).

0 commit comments

Comments
 (0)