-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathIMD_D4_M2_Vectorization.Rmd
More file actions
109 lines (81 loc) · 5.39 KB
/
IMD_D4_M2_Vectorization.Rmd
File metadata and controls
109 lines (81 loc) · 5.39 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
#### Vectorization
On the first day of this class, we introduced vectors. Here are some examples:
```{r VectorExamplesDay4, echo=TRUE, eval=TRUE}
# Numeric vectors
numbers <- 1:10
other_numbers <- 6:15
# Character vectors
code_lang <- c("R", "C++", "Java", "Python", "C")
human_lang <- c("English", "Spanish", "Chinese", "Russian")
```
What if we want to multiply `numbers` by two? The brute-force method looks something like this:
```{r VectorLoop, echo=TRUE, eval=TRUE}
new_numbers <- c() # Create a new NULL vector
new_numbers[1] <- 2 * numbers[1]
new_numbers[2] <- 2 * numbers[2]
new_numbers[3] <- 2 * numbers[3]
new_numbers[4] <- 2 * numbers[4]
new_numbers[5] <- 2 * numbers[5]
new_numbers[6] <- 2 * numbers[6]
new_numbers[7] <- 2 * numbers[7]
new_numbers[8] <- 2 * numbers[8]
new_numbers[9] <- 2 * numbers[9]
new_numbers[10] <- 2 * numbers[10]
new_numbers
```
Obviously we're making this harder than it needs to be. Many functions and operations in R are *vectorized*, meaning that they will be applied in parallel to each element of a vector. In fact, you've already seen examples of vectorization earlier in this class. Because data frame columns are just vectors, much of the data wrangling you've done so far takes advantage of vectorization. Here, we'll talk about vectorization in a little more detail, but a lot of this may look familiar!
As you probably know, all we need to do to multiply each element in `numbers` by two is:
```{r VectorMult, echo=TRUE, eval=TRUE}
new_numbers <- numbers * 2
new_numbers
```
Much less tedious. In general, if you feel like you're doing something by brute force in R, that's a good indicator that there's a better way to do it.
As you might expect, division, addition, and subtraction are also vectorized. In fact, this even works with two or more vectors.
```{r TwoVectorOps, echo=TRUE, eval=TRUE}
numbers
other_numbers
other_numbers - numbers
other_numbers * numbers
```
So far we've only looked at operations between vectors and single values or two vectors of the same length. But what happens if we try to add two vectors of different lengths? Let's try it out.
```{r DiffVectorLengths, echo=TRUE, eval=TRUE, warning=TRUE}
ones <- rep(1, 6) # Vector of 1's, length 6
ones
add_this <- c(1, 2)
this_one_breaks <- c(1, 2, 3, 4)
ones + add_this # This works fine
ones + this_one_breaks # This generates a warning
```
What happened here? When we added the vector `[1, 2]` to the vector `[1, 1, 1, 1, 1, 1]`, we got `[2, 3, 2, 3, 2, 3]`. When we added `[1, 2, 3, 4]` to `[1, 1, 1, 1, 1, 1]`, we got a result, but we also got a warning along with it. The text of the warning actually gives a clue as to what R is doing. It says "longer object length is not a multiple of shorter object length" because when R tries to operate on vectors of different lengths, it recycles the shorter vector until it matches the length of the longer vector.
`ones` is a vector of length 6. Since `add_this` is length 2 and 2 is a multiple of 6, R adds `[1, 1, 1, 1, 1, 1] + [1, 2, 1, 2, 1, 2]`, repeating `add_this` exactly three times. However, `this_one_breaks` is a vector of length 4, which is not a multiple of 6. Therefore, we get a warning because R cannot fit `this_one_breaks` into a vector of length 6 without truncating it.
Operating on vectors of different lengths is often useful. However, it's easy to make mistakes if you forget that vectors in R work this way! Keep this in mind when you are debugging.
As you may remember from working with data frames, lots of operations in R are vectorized, not just basic arithmetic. This includes comparison operators (`<`, `>`, `==`, `<=`, `>=`, `!=`), logical operators (`&`, `|`, `!`), and many functions (`is.na()`, `gsub()`, `grep()`, and too many others to list).
Here's another example. It's fairly common to end up with data that was entered as text and has extra whitespace:
```{r MessyData}
messy_data <- tibble::tibble(plot_id = c("1 ", " 2 ", " ", " 4", "5."),
year_monitored = c(2019, 2019, 2019, 2020, 2020))
messy_data
```
The `dplyr` way to fix this might look something like this:
```{r FixMessyDataDplyr}
clean_data <- messy_data %>% dplyr::mutate(plot_id = trimws(plot_id), # Get rid of leading and trailing spaces
plot_id = gsub(pattern = ".", replacement = "", x = plot_id, fixed = TRUE)) %>% # Replace the stray "." with nothing
dplyr::filter(plot_id != "") %>%
dplyr::mutate(plot_id = as.integer(plot_id))
clean_data
```
`trimws`, `gsub`, and `!=` are all vectorized operations being applied to the column plot_id. Let's see what's happening to just the `plot_id` column step by step under the hood.
```{r FixMessyDataVector}
plot_id_vec <- messy_data$plot_id
plot_id_vec
plot_id_vec <- trimws(plot_id_vec)
plot_id_vec
plot_id_vec <- gsub(pattern = ".", replacement = "", x = plot_id_vec, fixed = TRUE)
plot_id_vec
id_is_valid <- plot_id_vec != ""
id_is_valid
plot_id_vec <- plot_id_vec[id_is_valid]
plot_id_vec
plot_id_vec <- as.integer(plot_id_vec)
```
Vectorization in R is a powerful tool. However, not everything is vectorized. One common example is `read.csv()`. `read.csv()` takes the path to a .csv file and returns a data frame of the data in the csv. But what if the data is stored across multiple .csv files? `read.csv(c(file_path_1, file_path_2, file_path_3))` doesn't work because `read.csv` isn't vectorized. Fortunately, vectorization isn't the only way to avoid repeating yourself in R.