```{r setup, echo=FALSE, warning=FALSE, purl=FALSE, message=FALSE}
options(repos = "http://cran.rstudio.com/")
pkgs <- c("tidyverse", "knitr")
# attach each package; the return value of lapply is not needed
invisible(lapply(pkgs, library, character.only = TRUE))
opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE)
```
# Wrangling and plotting data in R
## Lesson Outline
- [Goals and Motivation]
- [Data Wrangling with dplyr]
- [Joining datasets]
- [Plotting with ggplot2]
## Lesson Exercises
- [Exercise 1]
- [Exercise 2]
- [Exercise 3]
- [Exercise 4]
## Goals and Motivation
Data wrangling (manipulation, cleaning, ninjery, etc.) is the part of any data analysis that will take the most time. While it may not necessarily be fun, it is foundational to all the work that follows. I strongly believe that mastering these skills has more value than mastering a particular analysis. Check out [this article](https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html) if you don’t believe me. Creating a data visualization will always begin with wrangling, so we'll cover some core wrangling concepts before we start plotting.
After this session you should be able to answer the following:
* What are some basic wrangling functions from dplyr?
* How do I join different datasets?
* How do I make some basic plots with ggplot?
## Data wrangling with dplyr

The data wrangling process includes data import, tidying, and transformation. The process directly feeds into, and is not mutually exclusive with, the *understanding* or modelling side of data exploration. More generally, I consider data wrangling as the manipulation or combination of datasets for the purpose of analysis.
**All wrangling is based on a purpose.** No one wrangles for the sake of wrangling (usually), so the process always begins by answering the following two questions:
* What do my input data look like?
* What should my input data look like given what I want to do?
At the most basic level, going from what your data looks like to what it should look like will require a few key operations. Some common examples:
* Selecting specific variables
* Filtering observations by some criteria
* Adding or modifying existing variables
* Renaming variables
* Arranging rows by a variable
* Summarizing a variable conditional on others
The dplyr package provides easy tools for these common data manipulation tasks and is a core package from the [tidyverse](https://www.tidyverse.org/) suite of packages. The philosophy of dplyr is that one function does one thing and the name of the function says what it does. Some additional dplyr resources:
* [dplyr GitHub repo](https://github.com/hadley/dplyr)
* [CRAN page: vignettes here](http://cran.rstudio.com/web/packages/dplyr/)
* [Cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
### Exercise 1
We'll start this lesson by opening a clean script, loading the packages we need, and importing our data.
1. Open a new script from the file menu on the top left and save the script with an informative name, e.g., "wrangling_and_plotting.R".
1. Add in a comment line on the top and write some text about the script. It should look something like: `# Exercise 2: Wrangling and plotting bioassessment data`.
1. Below the comment line, add the code to load the tidyverse, i.e., `library(tidyverse)`.
1. In the next line, add the following to import our two datasets.
```{r}
cscidat <- read.csv('data/cscidat.csv', stringsAsFactors = FALSE)
ascidat <- read.csv('data/ascidat.csv', stringsAsFactors = FALSE)
```
1. When you're done, run all of the code in the console (highlight all then `ctrl+enter`)
Our goal with these data is to combine the bioassessment scores by each unique location, date, and replicate, while keeping only the data we need for our plots.
### Selecting
Let's begin using dplyr. The `select` function lets you retain or exclude columns.
```{r}
# first, select some columns
dplyr_sel1 <- select(cscidat, SampleID_old, New_Lat, New_Long, CSCI)
head(dplyr_sel1)
# select everything but CSCI and COMID
dplyr_sel2 <- select(cscidat, -CSCI, -COMID)
head(dplyr_sel2)
# select columns whose names contain the letter c (matches() ignores case by default)
dplyr_sel3 <- select(cscidat, matches('c'))
head(dplyr_sel3)
```
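Helpers such as `matches` can be compared side by side on a small made-up data frame. In this minimal sketch, `toy` and its values are hypothetical and not part of the lesson data; `starts_with` and `contains` are other selection helpers available through dplyr.

```{r}
library(dplyr)

# hypothetical data, for illustration only
toy <- data.frame(
  CSCI = c(0.5, 0.9),
  COMID = c(101, 102),
  New_Lat = c(34.1, 38.2),
  New_Long = c(-118.2, -121.5)
)

# regular expression match on column names
names(select(toy, matches('c')))

# literal prefix match
names(select(toy, starts_with('New')))

# literal substring match
names(select(toy, contains('Lat')))
```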
### Filtering
After selecting columns, you'll probably want to remove observations that don't fit some criteria. For example, maybe you want to remove CSCI scores less than some threshold, find stations above a certain latitude, or both.
```{r}
# get CSCI scores greater than 0.79
dplyr_filt1 <- filter(cscidat, CSCI > 0.79)
head(dplyr_filt1)
# get observations at latitudes above 37N
dplyr_filt2 <- filter(cscidat, New_Lat > 37)
head(dplyr_filt2)
# use both filters
dplyr_filt3 <- filter(cscidat, CSCI > 0.79 & New_Lat > 37)
head(dplyr_filt3)
```
Filtering can take a bit of time to master because there are several ways to tell R what you want. To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: `>`, `>=`, `<`, `<=`, `!=` (not equal), and `==` (equal). When you're starting out with R, the easiest mistake to make is to use `=` instead of `==` when testing for equality.
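The operators described above can be tried on a small made-up data frame (`toy` and its values are hypothetical). The sketch also uses `|` (or) and `%in%`, two base R operators not covered in the text, to show other ways of combining conditions.

```{r}
library(dplyr)

# hypothetical data, for illustration only
toy <- data.frame(
  id = c('a', 'b', 'c', 'd'),
  CSCI = c(0.50, 0.79, 0.95, 1.10),
  New_Lat = c(33.9, 36.5, 37.2, 38.4)
)

# test for equality with ==, not =
filter(toy, id == 'b')

# not equal
filter(toy, id != 'b')

# keep rows where either condition is true
filter(toy, CSCI > 1 | New_Lat < 34)

# keep rows matching any of several values
filter(toy, id %in% c('a', 'd'))
```

In recent versions of dplyr, writing `filter(toy, id = 'b')` stops with an error rather than returning a wrong result, so the `=` mistake is usually easy to spot.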
### Mutating
Now that we've seen how to filter observations and select columns of a data frame, maybe we want to add a new column. In dplyr, `mutate` allows us to add new columns. A new column can be a vector you supply or the result of an expression applied to existing columns. For instance, maybe we want to convert a numeric column into a categorical one using some criteria, or make a new column based on some arithmetic on other columns.
```{r}
# get observed taxa
dplyr_mut1 <- mutate(cscidat, observed = OE * E)
head(dplyr_mut1)
# add a column for lo/hi csci scores
dplyr_mut2 <- mutate(cscidat, CSCIcat = ifelse(CSCI <= 0.79, 'lo', 'hi'))
head(dplyr_mut2)
```
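With more than two categories, nested `ifelse` calls get hard to read. One alternative is dplyr's `case_when`, sketched here on made-up scores. The data frame `toy` and the lower cutoff of 0.63 are assumptions for illustration only, not thresholds from this lesson.

```{r}
library(dplyr)

# hypothetical scores, for illustration only
toy <- data.frame(CSCI = c(0.40, 0.70, 0.79, 1.05))

toy_cat <- mutate(toy,
  CSCIcat = case_when(
    CSCI <= 0.63 ~ 'lo',   # hypothetical lower cutoff
    CSCI <= 0.79 ~ 'mid',
    TRUE ~ 'hi'            # everything else
  ),
  # a column created earlier in the same call can be reused
  is_hi = CSCIcat == 'hi'
)
toy_cat
```

Conditions in `case_when` are checked in order, so each row gets the label of the first condition it satisfies.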
Some other useful dplyr functions include `arrange` to sort the observations (rows) by a column and `rename` to (you guessed it) rename a column.
```{r}
# arrange by CSCI scores
dplyr_arr <- arrange(cscidat, CSCI)
head(dplyr_arr)
# rename lat/lon (note the multiple arguments)
dplyr_rnm <- rename(cscidat,
  lat = New_Lat,
  lon = New_Long
)
head(dplyr_rnm)
```
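Summarizing a variable conditional on others, the last item in the earlier list of common operations, is not demonstrated in the examples above. A minimal sketch uses dplyr's `group_by` and `summarise` on a made-up data frame. The object `toy`, the `site_type` labels, and the scores are all hypothetical.

```{r}
library(dplyr)

# hypothetical data, for illustration only
toy <- data.frame(
  site_type = c('ref', 'ref', 'str', 'str', 'str'),
  CSCI = c(1.0, 0.9, 0.5, 0.6, 0.7)
)

# define the groups, then compute one row of summaries per group
toy_grp <- group_by(toy, site_type)
toy_sum <- summarise(toy_grp,
  n = n(),
  mean_CSCI = mean(CSCI)
)
toy_sum
```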
### Exercise 2
Let's clean up our CSCI dataset in preparation for joining it with our ASCI dataset. We'll select the columns we want and rename those with odd names.
1. Select the unique sample ID column (`SampleID_old`), latitude (`New_Lat`), longitude (`New_Long`), and `CSCI` columns. Reassign the `cscidat` data object at the same time.
```{r}
cscidat <- select(cscidat, SampleID_old, New_Lat, New_Long, CSCI)
```
1. Rename the `SampleID_old` column to `id`, `New_Lat` to `lat`, and `New_Long` to `lon`.
```{r}
cscidat <- rename(cscidat,
  id = SampleID_old,
  lat = New_Lat,
  lon = New_Long
)
```
## Joining datasets
Combining data is a common task of data wrangling. All joins require that each of the tables can be linked by shared identifiers. These are called 'keys' and are usually represented as a separate column that acts as a unique variable for the observations. Our example datasets include the `id` column that represents a unique identifier as a combination of station, sample date, and replicate.
The challenge with joins is that the two datasets may not represent the same observations for a given key. For example, you might have one table with all observations for every key, another with only some observations, or two tables with only a few shared keys. What you get back from a join will depend on what's shared between tables, in addition to the type of join you use.
For our data, we'll be using an **inner join**, which keeps only the rows whose keys appear in both datasets (for an overview of the other types of joins, see [here](http://r4ds.had.co.nz/relational-data.html#outer-join)).
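How the result depends on shared keys can be sketched with two made-up tables. The objects `toy_csci` and `toy_asci` and their values are hypothetical. `left_join` is included only as a contrast to the inner join used in this lesson.

```{r}
library(dplyr)

# hypothetical tables that share only some keys
toy_csci <- data.frame(id = c('s1', 's2', 's3'), CSCI = c(0.9, 0.6, 1.1))
toy_asci <- data.frame(id = c('s2', 's3', 's4'), ASCI = c(0.7, 1.0, 0.8))

# inner join: only keys found in both tables (s2, s3)
toy_inner <- inner_join(toy_csci, toy_asci, by = 'id')
toy_inner

# left join: all rows of the first table, NA where there is no ASCI match
toy_left <- left_join(toy_csci, toy_asci, by = 'id')
toy_left
```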

### Exercise 3
We'll join our ASCI data to our CSCI data in this exercise to make a single dataset that has scores for both bioassessment indices taken at the same place and time. This will help us plot and map the data later.
1. Before you start, check the dimensions of both datasets (e.g., `dim` or `nrow`). How many rows in each?
```{r, eval = F}
dim(cscidat)
dim(ascidat)
```
1. Using the `inner_join` function from dplyr, join `cscidat` with `ascidat` using the `id` column as the key.
```{r}
alldat <- inner_join(cscidat, ascidat, by = 'id')
```
1. What are the dimensions of the new dataset (i.e., how many unique bioassessment scores were collected at the same time and place)?
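If the joined data have fewer rows than either input, one way to see which samples were dropped is dplyr's `anti_join`, which returns the rows of the first table that have no match in the second. A sketch with made-up tables (names and values are hypothetical):

```{r}
library(dplyr)

# hypothetical tables that share only some keys
toy_csci <- data.frame(id = c('s1', 's2', 's3'), CSCI = c(0.9, 0.6, 1.1))
toy_asci <- data.frame(id = c('s2', 's3', 's4'), ASCI = c(0.7, 1.0, 0.8))

# CSCI samples with no ASCI score
only_csci <- anti_join(toy_csci, toy_asci, by = 'id')
only_csci

# ASCI samples with no CSCI score
only_asci <- anti_join(toy_asci, toy_csci, by = 'id')
only_asci
```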
## Plotting with ggplot2
The entire workflow of data exploration is enhanced through looking at your data, whether you’re exploring a dataset for the first time or creating publication-ready figures. Viewing your data provides insight into patterns that can help you explore different hypotheses. No analysis is complete without a solid graphic.
We'll only introduce some of the core concepts behind the popular [ggplot2](https://ggplot2.tidyverse.org/reference/ggplot2-package.html) package. This package follows a strict philosophy known as the *grammar of graphics* that was designed to make thinking, reasoning, and communicating about graphs easier by following a few simple rules. Like building a sentence in speech (aka grammar), all graphs start with a foundational component that is used for building other graph pieces.
With ggplot2, you begin a plot with the function `ggplot()`. `ggplot()` creates a coordinate system that you can add layers to. The first argument of `ggplot()` is the dataset to use in the graph. So `ggplot(data = alldat)` creates an empty base graph.
```{r, eval = F}
ggplot(data = alldat)
```
You complete your graph by adding one or more layers (aka `geoms`) to `ggplot()`. The function `geom_point()` adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot.
```{r, eval = F}
# incomplete on its own: geom_point() still needs x and y aesthetics
ggplot(data = alldat) +
  geom_point()
```
Each geom function in ggplot2 takes a `mapping` argument. This defines how variables in your dataset are mapped to visual properties. The `mapping` argument is defined with `aes()`, and the `x` and `y` arguments of `aes()` specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the `data` argument, in this case, `alldat`.
```{r, eval = F}
ggplot(data = alldat) +
  geom_point(mapping = aes(x = CSCI, y = ASCI))
```
Just remember these requirements:
* All ggplot plots start with the `ggplot` function
* It will need three pieces of information: the **data**, how the data are **mapped** to the plot **aesthetics**, and a **geom** layer
The core unit of every ggplot looks like this:
```{r eval = FALSE}
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
Applied to the data:
```{r}
ggplot(data = alldat) +
  geom_point(mapping = aes(x = CSCI, y = ASCI))
```
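A mapping can also be supplied to `ggplot()` itself, in which case the layers added afterwards inherit it. The sketch below uses a made-up data frame (`toy` and its values are hypothetical) and adds `geom_smooth`, a ggplot2 geom not otherwise used in this lesson, as a second layer.

```{r}
library(ggplot2)

# hypothetical paired scores, for illustration only
toy <- data.frame(
  CSCI = c(0.4, 0.6, 0.8, 1.0, 1.1),
  ASCI = c(0.5, 0.6, 0.9, 0.9, 1.2)
)

# mapping defined once in ggplot() and shared by both layers
p <- ggplot(data = toy, mapping = aes(x = CSCI, y = ASCI)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
p
```

Defining the mapping once avoids repeating `aes()` in every layer. A mapping given inside a single geom applies only to that layer.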
### Exercise 4
Let's make a quick ggplot using some of the guidance from above. In this example, we'll create some boxplots to show the distribution of CSCI scores at different site types (i.e., reference, intermediate, and stressed). This follows the same syntax as above but we'll use a categorical variable for the x aesthetic and use the `geom_boxplot` geometry.
1. Copy the code from the last example plot to your script.
1. Replace the `geom_point` function with `geom_boxplot`
1. Map the `site_type` column to the x aesthetic and the `CSCI` scores to the y aesthetic. The final code should look like this:
```{r, eval = F}
ggplot(data = alldat) +
  geom_boxplot(mapping = aes(x = site_type, y = CSCI))
```
1. When you're done, run the code in the console and view the plot. What does it tell you about the distribution of CSCI scores?
There's certainly much, much more we can do with ggplot2. Feel free to check out the official [ggplot2](https://ggplot2.tidyverse.org/reference/ggplot2-package.html) website for more information. The RStudio [cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf) is also very helpful.