<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="pandoc" />
<title>Wrangling and plotting data in R</title>
<script src="site_libs/jquery-1.11.3/jquery.min.js"></script>
<meta name="viewport" content="width=device-width, initial-scale=1" />
<link href="site_libs/bootstrap-3.3.5/css/cosmo.min.css" rel="stylesheet" />
<script src="site_libs/bootstrap-3.3.5/js/bootstrap.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/html5shiv.min.js"></script>
<script src="site_libs/bootstrap-3.3.5/shim/respond.min.js"></script>
<script src="site_libs/jqueryui-1.11.4/jquery-ui.min.js"></script>
<link href="site_libs/tocify-1.9.1/jquery.tocify.css" rel="stylesheet" />
<script src="site_libs/tocify-1.9.1/jquery.tocify.js"></script>
<script src="site_libs/navigation-1.1/tabsets.js"></script>
<link href="site_libs/highlightjs-9.12.0/textmate.css" rel="stylesheet" />
<script src="site_libs/highlightjs-9.12.0/highlight.js"></script>
<link href="site_libs/font-awesome-5.0.13/css/fa-svg-with-js.css" rel="stylesheet" />
<script src="site_libs/font-awesome-5.0.13/js/fontawesome-all.min.js"></script>
<script src="site_libs/font-awesome-5.0.13/js/fa-v4-shims.min.js"></script>
<style type="text/css">code{white-space: pre;}</style>
<style type="text/css">
pre:not([class]) {
background-color: white;
}
</style>
<script type="text/javascript">
if (window.hljs) {
hljs.configure({languages: []});
hljs.initHighlightingOnLoad();
if (document.readyState && document.readyState === "complete") {
window.setTimeout(function() { hljs.initHighlighting(); }, 0);
}
}
</script>
<style type="text/css">
h1 {
font-size: 34px;
}
h1.title {
font-size: 38px;
}
h2 {
font-size: 30px;
}
h3 {
font-size: 24px;
}
h4 {
font-size: 18px;
}
h5 {
font-size: 16px;
}
h6 {
font-size: 12px;
}
.table th:not([align]) {
text-align: left;
}
</style>
</head>
<body>
<style type = "text/css">
.main-container {
max-width: 940px;
margin-left: auto;
margin-right: auto;
}
code {
color: inherit;
background-color: rgba(0, 0, 0, 0.04);
}
img {
max-width:100%;
height: auto;
}
.tabbed-pane {
padding-top: 12px;
}
.html-widget {
margin-bottom: 20px;
}
button.code-folding-btn:focus {
outline: none;
}
</style>
<style type="text/css">
/* padding for bootstrap navbar */
body {
padding-top: 51px;
padding-bottom: 40px;
}
/* offset scroll position for anchor links (for fixed navbar) */
.section h1 {
padding-top: 56px;
margin-top: -56px;
}
.section h2 {
padding-top: 56px;
margin-top: -56px;
}
.section h3 {
padding-top: 56px;
margin-top: -56px;
}
.section h4 {
padding-top: 56px;
margin-top: -56px;
}
.section h5 {
padding-top: 56px;
margin-top: -56px;
}
.section h6 {
padding-top: 56px;
margin-top: -56px;
}
</style>
<script>
// manage active state of menu based on current page
$(document).ready(function () {
// active menu anchor
href = window.location.pathname
href = href.substr(href.lastIndexOf('/') + 1)
if (href === "")
href = "index.html";
var menuAnchor = $('a[href="' + href + '"]');
// mark it active
menuAnchor.parent().addClass('active');
// if it's got a parent navbar menu mark it active as well
menuAnchor.closest('li.dropdown').addClass('active');
});
</script>
<div class="container-fluid main-container">
<!-- tabsets -->
<script>
$(document).ready(function () {
window.buildTabsets("TOC");
});
</script>
<!-- code folding -->
<script>
$(document).ready(function () {
// move toc-ignore selectors from section div to header
$('div.section.toc-ignore')
.removeClass('toc-ignore')
.children('h1,h2,h3,h4,h5').addClass('toc-ignore');
// establish options
var options = {
selectors: "h1,h2,h3",
theme: "bootstrap3",
context: '.toc-content',
hashGenerator: function (text) {
return text.replace(/[.\\/?&!#<>]/g, '').replace(/\s/g, '_').toLowerCase();
},
ignoreSelector: ".toc-ignore",
scrollTo: 0
};
options.showAndHide = true;
options.smoothScroll = true;
// tocify
var toc = $("#TOC").tocify(options).data("toc-tocify");
});
</script>
<style type="text/css">
#TOC {
margin: 25px 0px 20px 0px;
}
@media (max-width: 768px) {
#TOC {
position: relative;
width: 100%;
}
}
.toc-content {
padding-left: 30px;
padding-right: 40px;
}
div.main-container {
max-width: 1200px;
}
div.tocify {
width: 20%;
max-width: 260px;
max-height: 85%;
}
@media (min-width: 768px) and (max-width: 991px) {
div.tocify {
width: 25%;
}
}
@media (max-width: 767px) {
div.tocify {
width: 100%;
max-width: none;
}
}
.tocify ul, .tocify li {
line-height: 20px;
}
.tocify-subheader .tocify-item {
font-size: 0.90em;
padding-left: 25px;
text-indent: 0;
}
.tocify .list-group-item {
border-radius: 0px;
}
</style>
<!-- setup 3col/9col grid for toc_float and main content -->
<div class="row-fluid">
<div class="col-xs-12 col-sm-4 col-md-3">
<div id="TOC" class="tocify">
</div>
</div>
<div class="toc-content col-xs-12 col-sm-8 col-md-9">
<div class="navbar navbar-inverse navbar-fixed-top" role="navigation">
<div class="container">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target="#navbar">
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="index.html">CABW 2018 R training</a>
</div>
<div id="navbar" class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<li>
<a href="index.html">Home</a>
</li>
<li>
<a href="usingr.html">What, why, and how of R</a>
</li>
<li>
<a href="wrangleplot.html">Wrangling and plotting</a>
</li>
<li>
<a href="mapping.html">Mapping</a>
</li>
<li>
<a href="resources.html">
<span class="fa fa-list"></span>
Resources
</a>
</li>
</ul>
<ul class="nav navbar-nav navbar-right">
</ul>
</div><!--/.nav-collapse -->
</div><!--/.container -->
</div><!--/.navbar -->
<div class="fluid-row" id="header">
</div>
<div id="wrangling-and-plotting-data-in-r" class="section level1">
<h1>Wrangling and plotting data in R</h1>
<div id="lesson-outline" class="section level2">
<h2>Lesson Outline</h2>
<ul>
<li><a href="#goals-and-motivation">Goals and Motivation</a></li>
<li><a href="#data-wrangling-with-dplyr">Data wrangling with dplyr</a></li>
<li><a href="#joining-datasets">Joining datasets</a></li>
<li><a href="#plotting-with-ggplot2">Plotting with ggplot2</a></li>
</ul>
</div>
<div id="lesson-exercises" class="section level2">
<h2>Lesson Exercises</h2>
<ul>
<li><a href="#exercise-1">Exercise 1</a></li>
<li><a href="#exercise-2">Exercise 2</a></li>
<li><a href="#exercise-3">Exercise 3</a></li>
<li><a href="#exercise-4">Exercise 4</a></li>
</ul>
</div>
<div id="goals-and-motivation" class="section level2">
<h2>Goals and Motivation</h2>
<p>Data wrangling (manipulation, cleaning, ninjery, etc.) is the part of any data analysis that will take the most time. While it may not necessarily be fun, it is foundational to all the work that follows. I strongly believe that mastering these skills has more value than mastering a particular analysis. Check out <a href="https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html">this article</a> if you don’t believe me. Creating a data visualization will always begin with wrangling, so we’ll cover some core wrangling concepts first before we start plotting.</p>
<p>After this session you should be able to answer the following:</p>
<ul>
<li>What are some basic wrangling functions from dplyr?</li>
<li>How do I join different datasets?</li>
<li>How do I make some basic plots with ggplot2?</li>
</ul>
</div>
<div id="data-wrangling-with-dplyr" class="section level2">
<h2>Data wrangling with dplyr</h2>
<p><img src="figure/data-science-wrangle.png" /></p>
<p>The data wrangling process includes data import, tidying, and transformation. The process directly feeds into, and is not mutually exclusive with, the <em>understanding</em> or modelling side of data exploration. More generally, I consider data wrangling as the manipulation or combination of datasets for the purpose of analysis.</p>
<p><strong>All wrangling is based on a purpose.</strong> No one wrangles for the sake of wrangling (usually), so the process always begins by answering the following two questions:</p>
<ul>
<li>What do my input data look like?</li>
<li>What should my input data look like given what I want to do?</li>
</ul>
<p>At the most basic level, going from what your data looks like to what it should look like will require a few key operations. Some common examples:</p>
<ul>
<li>Selecting specific variables</li>
<li>Filtering observations by some criteria</li>
<li>Adding new variables or modifying existing ones</li>
<li>Renaming variables</li>
<li>Arranging rows by a variable</li>
<li>Summarizing a variable conditional on others</li>
</ul>
<p>The dplyr package provides easy tools for these common data manipulation tasks and is a core package from the <a href="https://www.tidyverse.org/">tidyverse</a> suite of packages. The philosophy of dplyr is that one function does one thing and the name of the function says what it does. Some additional dplyr resources:</p>
<ul>
<li><a href="https://github.com/hadley/dplyr">dplyr GitHub repo</a></li>
<li><a href="http://cran.rstudio.com/web/packages/dplyr/">CRAN page: vignettes here</a></li>
<li><a href="https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf">Cheatsheet</a></li>
</ul>
<div id="exercise-1" class="section level3">
<h3>Exercise 1</h3>
<p>We’ll start this lesson by opening a clean script, loading the packages we need, and importing our data.</p>
<ol style="list-style-type: decimal">
<li><p>Open a new script from the file menu on the top left and save the script with an informative name, e.g., “wrangling_and_plotting.R”.</p></li>
<li><p>Add in a comment line on the top and write some text about the script. It should look something like: <code># Exercise 2: Wrangling and plotting bioassessment data</code>.</p></li>
<li><p>Below the comment line, add the code to load the tidyverse, i.e., <code>library(tidyverse)</code>.</p></li>
<li><p>In the next line, add the following to import our two datasets.</p>
<pre class="r"><code>cscidat <- read.csv('data/cscidat.csv', stringsAsFactors = FALSE)
ascidat <- read.csv('data/ascidat.csv', stringsAsFactors = FALSE)</code></pre></li>
<li><p>When you’re done, run all of the code in the console (highlight all, then press <code>Ctrl+Enter</code>).</p></li>
</ol>
<p>Our goal with these data is to combine the bioassessment scores by each unique location, date, and replicate, while keeping only the data we need for our plots.</p>
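<p>Before any wrangling, it helps to answer the first question from above (what do my input data look like?) right after import. The following is a minimal sketch in base R. The two-row table and the temporary file are made up so that the code runs without the workshop data; with the real files you would call the same functions on <code>cscidat</code> and <code>ascidat</code>.</p>
<pre class="r"><code># write a small made-up table to a temporary file so the example is self-contained
tmp <- tempfile(fileext = '.csv')
write.csv(
  data.frame(StationCode = c('A', 'B'), CSCI = c(0.99, 0.15)),
  tmp,
  row.names = FALSE
)

# import it the same way as the workshop data
toy <- read.csv(tmp, stringsAsFactors = FALSE)

dim(toy)    # number of rows and columns
names(toy)  # column names
str(toy)    # column types and the first few values</code></pre>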
</div>
<div id="selecting" class="section level3">
<h3>Selecting</h3>
<p>Let’s begin using dplyr. The <code>select</code> function lets you retain or exclude columns.</p>
<pre class="r"><code># first, select some columns
dplyr_sel1 <- select(cscidat, SampleID_old, New_Lat, New_Long, CSCI)
head(dplyr_sel1)</code></pre>
<pre><code>## SampleID_old New_Lat New_Long CSCI
## 1 000CAT148_8.10.10_1 39.07523 -119.8994 0.9879779
## 2 000CAT228_8.10.10_1 39.07307 -119.9201 0.9811505
## 3 102PS0139_8.9.10_1 41.99595 -122.9597 1.0715694
## 4 103CDCHHR_9.14.10_1 41.78890 -124.0778 1.0866419
## 5 103FC1106_7.15.14_1 41.93407 -124.1081 0.9971599
## 6 103FCA168_7.24.13_1 41.64962 -124.0912 1.0633122</code></pre>
<pre class="r"><code># select everything but CSCI and COMID
dplyr_sel2 <- select(cscidat, -CSCI, -COMID)
head(dplyr_sel2)</code></pre>
<pre><code>## SampleID_old StationCode New_Lat New_Long E OE
## 1 000CAT148_8.10.10_1 000CAT148 39.07523 -119.8994 16.05804 0.9309977
## 2 000CAT228_8.10.10_1 000CAT228 39.07307 -119.9201 16.08960 0.9726777
## 3 102PS0139_8.9.10_1 102PS0139 41.99595 -122.9597 15.46439 1.0896002
## 4 103CDCHHR_9.14.10_1 103CDCHHR 41.78890 -124.0778 21.10443 1.0898184
## 5 103FC1106_7.15.14_1 103FC1106 41.93407 -124.1081 16.83757 1.0779468
## 6 103FCA168_7.24.13_1 103FCA168 41.64962 -124.0912 19.07408 1.0931064
## pMMI SampleID_old.1
## 1 1.0449580 000CAT148_8.10.10_1
## 2 0.9896232 000CAT228_8.10.10_1
## 3 1.0535386 102PS0139_8.9.10_1
## 4 1.0834653 103CDCHHR_9.14.10_1
## 5 0.9163731 103FC1106_7.15.14_1
## 6 1.0335179 103FCA168_7.24.13_1</code></pre>
<pre class="r"><code># select columns whose names contain the letter c (matches() takes a regular expression and ignores case by default)
dplyr_sel3 <- select(cscidat, matches('c'))
head(dplyr_sel3)</code></pre>
<pre><code>## StationCode COMID CSCI
## 1 000CAT148 8942501 0.9879779
## 2 000CAT228 8942503 0.9811505
## 3 102PS0139 23936337 1.0715694
## 4 103CDCHHR 22226836 1.0866419
## 5 103FC1106 22226634 0.9971599
## 6 103FCA168 22226990 1.0633122</code></pre>
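<p>A few other selection helpers can be used inside <code>select</code>. The sketch below assumes a made-up data frame, <code>toy</code>, that borrows some column names from <code>cscidat</code>; the values are invented for illustration.</p>
<pre class="r"><code>library(dplyr)

# made-up stand-in for cscidat (values are invented)
toy <- data.frame(
  SampleID_old = c('A_1', 'B_1', 'C_1'),
  StationCode = c('A', 'B', 'C'),
  New_Lat = c(39.1, 41.9, 33.7),
  New_Long = c(-119.9, -122.9, -117.8),
  CSCI = c(0.99, 1.07, 0.15),
  stringsAsFactors = FALSE
)

# columns whose names start with "New_"
sel_new <- select(toy, starts_with('New_'))
names(sel_new)

# columns whose names contain the literal text "ID", respecting case
sel_id <- select(toy, contains('ID', ignore.case = FALSE))
names(sel_id)

# a contiguous range of columns, from New_Lat through CSCI
sel_rng <- select(toy, New_Lat:CSCI)
names(sel_rng)</code></pre>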
</div>
<div id="filtering" class="section level3">
<h3>Filtering</h3>
<p>After selecting columns, you’ll probably want to remove observations that don’t fit some criteria. For example, maybe you want to remove CSCI scores less than some threshold, find stations above a certain latitude, or both.</p>
<pre class="r"><code># get CSCI scores greater than 0.79
dplyr_filt1 <- filter(cscidat, CSCI > 0.79)
head(dplyr_filt1)</code></pre>
<pre><code>## SampleID_old StationCode New_Lat New_Long COMID E
## 1 000CAT148_8.10.10_1 000CAT148 39.07523 -119.8994 8942501 16.05804
## 2 000CAT228_8.10.10_1 000CAT228 39.07307 -119.9201 8942503 16.08960
## 3 102PS0139_8.9.10_1 102PS0139 41.99595 -122.9597 23936337 15.46439
## 4 103CDCHHR_9.14.10_1 103CDCHHR 41.78890 -124.0778 22226836 21.10443
## 5 103FC1106_7.15.14_1 103FC1106 41.93407 -124.1081 22226634 16.83757
## 6 103FCA168_7.24.13_1 103FCA168 41.64962 -124.0912 22226990 19.07408
## OE pMMI CSCI SampleID_old.1
## 1 0.9309977 1.0449580 0.9879779 000CAT148_8.10.10_1
## 2 0.9726777 0.9896232 0.9811505 000CAT228_8.10.10_1
## 3 1.0896002 1.0535386 1.0715694 102PS0139_8.9.10_1
## 4 1.0898184 1.0834653 1.0866419 103CDCHHR_9.14.10_1
## 5 1.0779468 0.9163731 0.9971599 103FC1106_7.15.14_1
## 6 1.0931064 1.0335179 1.0633122 103FCA168_7.24.13_1</code></pre>
<pre class="r"><code># get observations north of latitude 37N
dplyr_filt2 <- filter(cscidat, New_Lat > 37)
head(dplyr_filt2)</code></pre>
<pre><code>## SampleID_old StationCode New_Lat New_Long COMID E
## 1 000CAT148_8.10.10_1 000CAT148 39.07523 -119.8994 8942501 16.05804
## 2 000CAT228_8.10.10_1 000CAT228 39.07307 -119.9201 8942503 16.08960
## 3 102PS0139_8.9.10_1 102PS0139 41.99595 -122.9597 23936337 15.46439
## 4 103CDCHHR_9.14.10_1 103CDCHHR 41.78890 -124.0778 22226836 21.10443
## 5 103FC1106_7.15.14_1 103FC1106 41.93407 -124.1081 22226634 16.83757
## 6 103FCA168_7.24.13_1 103FCA168 41.64962 -124.0912 22226990 19.07408
## OE pMMI CSCI SampleID_old.1
## 1 0.9309977 1.0449580 0.9879779 000CAT148_8.10.10_1
## 2 0.9726777 0.9896232 0.9811505 000CAT228_8.10.10_1
## 3 1.0896002 1.0535386 1.0715694 102PS0139_8.9.10_1
## 4 1.0898184 1.0834653 1.0866419 103CDCHHR_9.14.10_1
## 5 1.0779468 0.9163731 0.9971599 103FC1106_7.15.14_1
## 6 1.0931064 1.0335179 1.0633122 103FCA168_7.24.13_1</code></pre>
<pre class="r"><code># use both filters
dplyr_filt3 <- filter(cscidat, CSCI > 0.79 & New_Lat > 37)
head(dplyr_filt3)</code></pre>
<pre><code>## SampleID_old StationCode New_Lat New_Long COMID E
## 1 000CAT148_8.10.10_1 000CAT148 39.07523 -119.8994 8942501 16.05804
## 2 000CAT228_8.10.10_1 000CAT228 39.07307 -119.9201 8942503 16.08960
## 3 102PS0139_8.9.10_1 102PS0139 41.99595 -122.9597 23936337 15.46439
## 4 103CDCHHR_9.14.10_1 103CDCHHR 41.78890 -124.0778 22226836 21.10443
## 5 103FC1106_7.15.14_1 103FC1106 41.93407 -124.1081 22226634 16.83757
## 6 103FCA168_7.24.13_1 103FCA168 41.64962 -124.0912 22226990 19.07408
## OE pMMI CSCI SampleID_old.1
## 1 0.9309977 1.0449580 0.9879779 000CAT148_8.10.10_1
## 2 0.9726777 0.9896232 0.9811505 000CAT228_8.10.10_1
## 3 1.0896002 1.0535386 1.0715694 102PS0139_8.9.10_1
## 4 1.0898184 1.0834653 1.0866419 103CDCHHR_9.14.10_1
## 5 1.0779468 0.9163731 0.9971599 103FC1106_7.15.14_1
## 6 1.0931064 1.0335179 1.0633122 103FCA168_7.24.13_1</code></pre>
<p>Filtering can take a bit of time to master because there are several ways to tell R what you want. To use filtering effectively, you have to know how to select the observations that you want using the comparison operators. R provides the standard suite: <code>></code>, <code>>=</code>, <code><</code>, <code><=</code>, <code>!=</code> (not equal), and <code>==</code> (equal). When you’re starting out with R, the easiest mistake to make is to use <code>=</code> instead of <code>==</code> when testing for equality.</p>
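<p>The operators can be combined in several ways. The sketch below uses a made-up data frame (station names and scores are invented) to show equality, “or” conditions, set membership with <code>%in%</code>, and what happens to missing values.</p>
<pre class="r"><code>library(dplyr)

# made-up data; station D has a missing score
toy <- data.frame(
  StationCode = c('A', 'B', 'C', 'D'),
  New_Lat = c(39.1, 41.9, 33.7, 36.2),
  CSCI = c(0.99, 1.07, 0.15, NA),
  stringsAsFactors = FALSE
)

# == tests for equality; in current dplyr versions a single = makes filter() stop with an error
filt_eq <- filter(toy, StationCode == 'A')

# | means "or": keep rows that meet either condition
filt_or <- filter(toy, CSCI > 1 | New_Lat > 39)

# %in% keeps rows whose value appears in a set
filt_in <- filter(toy, StationCode %in% c('B', 'C'))

# rows where the condition evaluates to NA (station D) are dropped
filt_na <- filter(toy, CSCI > 0.79)</code></pre>
<p>Because <code>filter</code> keeps only rows where the condition is <code>TRUE</code>, observations with missing scores silently disappear, which is worth checking when row counts look smaller than expected.</p>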
</div>
<div id="mutating" class="section level3">
<h3>Mutating</h3>
<p>Now that we’ve seen how to filter observations and select columns of a data frame, maybe we want to add a new column. In dplyr, <code>mutate</code> allows us to add new columns. A new column can be a vector that you supply or the result of an expression applied to existing columns. For instance, maybe we want to convert a numeric column into a categorical one using some criteria, or maybe we want to make a new column based on some arithmetic on other columns.</p>
<pre class="r"><code># get observed taxa
dplyr_mut1 <- mutate(cscidat, observed = OE * E)
head(dplyr_mut1)</code></pre>
<pre><code>## SampleID_old StationCode New_Lat New_Long COMID E
## 1 000CAT148_8.10.10_1 000CAT148 39.07523 -119.8994 8942501 16.05804
## 2 000CAT228_8.10.10_1 000CAT228 39.07307 -119.9201 8942503 16.08960
## 3 102PS0139_8.9.10_1 102PS0139 41.99595 -122.9597 23936337 15.46439
## 4 103CDCHHR_9.14.10_1 103CDCHHR 41.78890 -124.0778 22226836 21.10443
## 5 103FC1106_7.15.14_1 103FC1106 41.93407 -124.1081 22226634 16.83757
## 6 103FCA168_7.24.13_1 103FCA168 41.64962 -124.0912 22226990 19.07408
## OE pMMI CSCI SampleID_old.1 observed
## 1 0.9309977 1.0449580 0.9879779 000CAT148_8.10.10_1 14.95
## 2 0.9726777 0.9896232 0.9811505 000CAT228_8.10.10_1 15.65
## 3 1.0896002 1.0535386 1.0715694 102PS0139_8.9.10_1 16.85
## 4 1.0898184 1.0834653 1.0866419 103CDCHHR_9.14.10_1 23.00
## 5 1.0779468 0.9163731 0.9971599 103FC1106_7.15.14_1 18.15
## 6 1.0931064 1.0335179 1.0633122 103FCA168_7.24.13_1 20.85</code></pre>
<pre class="r"><code># add a column for lo/hi csci scores
dplyr_mut2 <- mutate(cscidat, CSCIcat = ifelse(CSCI <= 0.79, 'lo', 'hi'))
head(dplyr_mut2)</code></pre>
<pre><code>## SampleID_old StationCode New_Lat New_Long COMID E
## 1 000CAT148_8.10.10_1 000CAT148 39.07523 -119.8994 8942501 16.05804
## 2 000CAT228_8.10.10_1 000CAT228 39.07307 -119.9201 8942503 16.08960
## 3 102PS0139_8.9.10_1 102PS0139 41.99595 -122.9597 23936337 15.46439
## 4 103CDCHHR_9.14.10_1 103CDCHHR 41.78890 -124.0778 22226836 21.10443
## 5 103FC1106_7.15.14_1 103FC1106 41.93407 -124.1081 22226634 16.83757
## 6 103FCA168_7.24.13_1 103FCA168 41.64962 -124.0912 22226990 19.07408
## OE pMMI CSCI SampleID_old.1 CSCIcat
## 1 0.9309977 1.0449580 0.9879779 000CAT148_8.10.10_1 hi
## 2 0.9726777 0.9896232 0.9811505 000CAT228_8.10.10_1 hi
## 3 1.0896002 1.0535386 1.0715694 102PS0139_8.9.10_1 hi
## 4 1.0898184 1.0834653 1.0866419 103CDCHHR_9.14.10_1 hi
## 5 1.0779468 0.9163731 0.9971599 103FC1106_7.15.14_1 hi
## 6 1.0931064 1.0335179 1.0633122 103FCA168_7.24.13_1 hi</code></pre>
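<p>When more than two categories are needed, nested <code>ifelse</code> calls get hard to read. One alternative is dplyr’s <code>case_when</code>, sketched below on made-up scores. The 0.79 cut point comes from the example above; the 0.92 cut point and the <code>'mid'</code> label are assumptions made only for this illustration.</p>
<pre class="r"><code>library(dplyr)

# made-up scores
toy <- data.frame(
  StationCode = c('A', 'B', 'C'),
  CSCI = c(0.99, 0.85, 0.15),
  stringsAsFactors = FALSE
)

# conditions are checked in order and the first one that is TRUE wins
# (0.92 is a hypothetical threshold, not an official one)
toy_cat <- mutate(toy,
  CSCIcat = case_when(
    CSCI <= 0.79 ~ 'lo',
    CSCI <= 0.92 ~ 'mid',
    TRUE ~ 'hi'
  )
)
toy_cat$CSCIcat</code></pre>
<p>Note that the final catch-all line would also label a missing score as <code>'hi'</code>, so missing values may need their own condition first.</p>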
<p>Some other useful dplyr functions include <code>arrange</code> to sort the observations (rows) by a column and <code>rename</code> to (you guessed it) rename a column.</p>
<pre class="r"><code># arrange by CSCI scores
dplyr_arr <- arrange(cscidat, CSCI)
head(dplyr_arr)</code></pre>
<pre><code>## SampleID_old StationCode New_Lat New_Long COMID E
## 1 206PS0073_7.20.10_1 206PS0073 38.09831 -122.5670 1669863 13.344409
## 2 205R01390_5.23.13_1 205R01390 37.53080 -121.9702 17692585 13.117454
## 3 205R00878_4.24.13_1 205R00878 37.55460 -121.9870 17691075 14.005475
## 4 801S03971_6.16.14_1 801S03971 33.67551 -117.8277 20355412 8.229786
## 5 204R00383_6.11.12_1 204R00383 37.65910 -122.1368 2804015 10.678176
## 6 204R00583_6.13.12_1 204R00583 37.61910 -122.0593 2804187 13.292084
## OE pMMI CSCI SampleID_old.1
## 1 0.1498755 0.0663000 0.1080869 206PS0073_7.20.10_1
## 2 0.1524686 0.0852000 0.1188203 205R01390_5.23.13_1
## 3 0.2142019 0.0336000 0.1239177 205R00878_4.24.13_1
## 4 0.0729000 0.2306298 0.1517679 801S03971_6.16.14_1
## 5 0.2809468 0.0325000 0.1567290 204R00383_6.11.12_1
## 6 0.2783612 0.0448000 0.1616011 204R00583_6.13.12_1</code></pre>
<pre class="r"><code># rename lat/lon (note the multiple arguments)
dplyr_rnm <- rename(cscidat,
lat = New_Lat,
lon = New_Long
)
head(dplyr_rnm)</code></pre>
<pre><code>## SampleID_old StationCode lat lon COMID E
## 1 000CAT148_8.10.10_1 000CAT148 39.07523 -119.8994 8942501 16.05804
## 2 000CAT228_8.10.10_1 000CAT228 39.07307 -119.9201 8942503 16.08960
## 3 102PS0139_8.9.10_1 102PS0139 41.99595 -122.9597 23936337 15.46439
## 4 103CDCHHR_9.14.10_1 103CDCHHR 41.78890 -124.0778 22226836 21.10443
## 5 103FC1106_7.15.14_1 103FC1106 41.93407 -124.1081 22226634 16.83757
## 6 103FCA168_7.24.13_1 103FCA168 41.64962 -124.0912 22226990 19.07408
## OE pMMI CSCI SampleID_old.1
## 1 0.9309977 1.0449580 0.9879779 000CAT148_8.10.10_1
## 2 0.9726777 0.9896232 0.9811505 000CAT228_8.10.10_1
## 3 1.0896002 1.0535386 1.0715694 102PS0139_8.9.10_1
## 4 1.0898184 1.0834653 1.0866419 103CDCHHR_9.14.10_1
## 5 1.0779468 0.9163731 0.9971599 103FC1106_7.15.14_1
## 6 1.0931064 1.0335179 1.0633122 103FCA168_7.24.13_1</code></pre>
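<p>Two related operations are sorting in descending order and summarizing a variable conditional on another, the last item in the list of common operations at the start of this section. The sketch below uses made-up scores for two stations; the column names <code>meanCSCI</code> and <code>n</code> are chosen only for this example.</p>
<pre class="r"><code>library(dplyr)

# made-up data: two samples at each of two stations
toy <- data.frame(
  StationCode = c('A', 'A', 'B', 'B'),
  CSCI = c(0.90, 1.00, 0.20, 0.40),
  stringsAsFactors = FALSE
)

# desc() reverses the sort so the highest score comes first
toy_desc <- arrange(toy, desc(CSCI))

# one row per station with the mean score and the number of samples
toy_grp <- group_by(toy, StationCode)
toy_sum <- summarise(toy_grp, meanCSCI = mean(CSCI), n = n())
toy_sum</code></pre>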
</div>
<div id="exercise-2" class="section level3">
<h3>Exercise 2</h3>
<p>Let’s clean up our CSCI dataset in preparation for joining it with our ASCI dataset. We’ll select the columns we want and rename those with odd names.</p>
<ol style="list-style-type: decimal">
<li><p>Select the unique sample ID column (<code>SampleID_old</code>), latitude (<code>New_Lat</code>), longitude (<code>New_Long</code>), and <code>CSCI</code> columns. Reassign the <code>cscidat</code> data object at the same time.</p>
<pre class="r"><code>cscidat <- select(cscidat, SampleID_old, New_Lat, New_Long, CSCI) </code></pre></li>
<li><p>Rename the <code>SampleID_old</code> column to <code>id</code>, <code>New_Lat</code> to <code>lat</code>, and <code>New_Long</code> to <code>lon</code>.</p>
<pre class="r"><code>cscidat <- rename(cscidat,
id = SampleID_old,
lat = New_Lat,
lon = New_Long
)</code></pre></li>
</ol>
</div>
</div>
<div id="joining-datasets" class="section level2">
<h2>Joining datasets</h2>
<p>Combining data is a common task of data wrangling. All joins require that each of the tables can be linked by shared identifiers. These are called ‘keys’ and are usually represented as a separate column that acts as a unique variable for the observations. Our example datasets include the <code>id</code> column that represents a unique identifier as a combination of station, sample date, and replicate.</p>
<p>The challenge with joins is that the two datasets may not represent the same observations for a given key. For example, you might have one table with all observations for every key, another with only some observations, or two tables with only a few shared keys. What you get back from a join will depend on what’s shared between tables, in addition to the type of join you use.</p>
<p>For our data, we’ll be using an <strong>inner join</strong> that combines datasets by shared keys, keeping only the rows whose key appears in both tables (for an overview of the other types of joins, see <a href="http://r4ds.had.co.nz/relational-data.html#outer-join">here</a>).</p>
<p><img src="figure/join-inner.png" /></p>
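<p>To see how the result depends on which keys are shared, here is a minimal sketch with two made-up tables that overlap on only two of their three keys. The table names, keys, and scores are invented for illustration.</p>
<pre class="r"><code>library(dplyr)

# made-up tables that share the keys s2 and s3
csci_toy <- data.frame(
  id = c('s1', 's2', 's3'),
  CSCI = c(0.99, 0.85, 0.15),
  stringsAsFactors = FALSE
)
asci_toy <- data.frame(
  id = c('s2', 's3', 's4'),
  ASCI = c(0.70, 0.30, 1.10),
  stringsAsFactors = FALSE
)

# inner join: only keys found in both tables (s2, s3)
in_toy <- inner_join(csci_toy, asci_toy, by = 'id')

# left join: every row of the first table, with NA where the second has no match
left_toy <- left_join(csci_toy, asci_toy, by = 'id')

# anti join: rows of the first table with no match in the second (s1)
anti_toy <- anti_join(csci_toy, asci_toy, by = 'id')</code></pre>
<p>Comparing <code>nrow</code> before and after a join, or running <code>anti_join</code> as above, is a quick way to find out how many observations were dropped because their key had no partner.</p>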
<div id="exercise-3" class="section level3">
<h3>Exercise 3</h3>
<p>We’ll join our ASCI data to our CSCI data in this exercise to make a single dataset that has scores for both bioassessment indices taken at the same place and time. This will help us plot and map the data later.</p>
<ol style="list-style-type: decimal">
<li><p>Before you start, check the dimensions of both datasets (e.g., <code>dim</code> or <code>nrow</code>). How many rows in each?</p>
<pre class="r"><code>dim(cscidat)
dim(ascidat)</code></pre></li>
<li><p>Using the <code>inner_join</code> function from dplyr, join <code>cscidat</code> with <code>ascidat</code> using the <code>id</code> column as the key.</p>
<pre class="r"><code>alldat <- inner_join(cscidat, ascidat, by = 'id')</code></pre></li>
<li><p>What are the dimensions of the new dataset (i.e., how many unique bioassessment scores were collected at the same time and place)?</p></li>
</ol>
</div>
</div>
<div id="plotting-with-ggplot2" class="section level2">
<h2>Plotting with ggplot2</h2>
<p>The entire workflow of data exploration is enhanced through looking at your data, whether you’re exploring a dataset for the first time or creating publication-ready figures. Viewing your data provides insight into patterns that can help you explore different hypotheses. No analysis is complete without a solid graphic.</p>
<p>We’ll only introduce some of the core concepts behind the popular <a href="https://ggplot2.tidyverse.org/reference/ggplot2-package.html">ggplot2</a> package. This package follows a strict philosophy known as the <em>grammar of graphics</em> that was designed to make thinking, reasoning, and communicating about graphs easier by following a few simple rules. Like building a sentence in speech (aka grammar), all graphs start with a foundational component that is used for building other graph pieces.</p>
<p>With ggplot2, you begin a plot with the function <code>ggplot()</code>. <code>ggplot()</code> creates a coordinate system that you can add layers to. The first argument of <code>ggplot()</code> is the dataset to use in the graph. So <code>ggplot(data = alldat)</code> creates an empty base graph.</p>
<pre class="r"><code>ggplot(data = alldat)</code></pre>
<p>You complete your graph by adding one or more layers (aka <code>geoms</code>) to <code>ggplot()</code>. The function <code>geom_point()</code> adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot.</p>
<pre class="r"><code># incomplete on its own: geom_point() still needs x and y aesthetics (mapped in the next step)
ggplot(data = alldat) +
  geom_point()</code></pre>
<p>Each geom function in ggplot2 takes a <code>mapping</code> argument. This defines how variables in your dataset are mapped to visual properties. The <code>mapping</code> argument is defined with <code>aes()</code>, and the <code>x</code> and <code>y</code> arguments of <code>aes()</code> specify which variables to map to the x and y axes. ggplot2 looks for the mapped variable in the <code>data</code> argument, in this case, <code>alldat</code>.</p>
<pre class="r"><code>ggplot(data = alldat) +
geom_point(mapping = aes(x = CSCI, y = ASCI))</code></pre>
<p>Just remember these requirements:</p>
<ul>
<li>All ggplot plots start with the <code>ggplot</code> function</li>
<li>It will need three pieces of information: the <strong>data</strong>, how the data are <strong>mapped</strong> to the plot <strong>aesthetics</strong>, and a <strong>geom</strong> layer</li>
</ul>
<p>The core unit of every ggplot looks like this:</p>
<pre class="r"><code>ggplot(data = &lt;DATA&gt;) +
  &lt;GEOM_FUNCTION&gt;(mapping = aes(&lt;MAPPINGS&gt;))</code></pre>
<p>Applied to the data:</p>
<pre class="r"><code>ggplot(data = alldat) +
geom_point(mapping = aes(x = CSCI, y = ASCI))</code></pre>
<p><img src="wrangleplot_files/figure-html/unnamed-chunk-14-1.png" width="672" /></p>
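<p>Because a plot is built from layers, it can be stored in an object and extended with <code>+</code>. The sketch below uses a small made-up data frame so it runs without the joined workshop data; the 1:1 reference line is an addition for illustration and is not part of the plot above.</p>
<pre class="r"><code>library(ggplot2)

# made-up scores for four samples
toy <- data.frame(
  CSCI = c(0.99, 0.85, 0.15, 0.60),
  ASCI = c(1.05, 0.70, 0.30, 0.55)
)

# data + geom + aesthetic mapping, stored in an object instead of printed
p <- ggplot(data = toy) +
  geom_point(mapping = aes(x = CSCI, y = ASCI))

# a second layer is added the same way, here a 1:1 reference line
p2 <- p + geom_abline(intercept = 0, slope = 1)

# printing the object draws the plot
p2</code></pre>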
<div id="exercise-4" class="section level3">
<h3>Exercise 4</h3>
<p>Let’s make a quick ggplot using some of the guidance from above. In this example, we’ll create some boxplots to show the distribution of CSCI scores at different site types (i.e., reference, intermediate, and stressed). This follows the same syntax as above but we’ll use a categorical variable for the x aesthetic and use the <code>geom_boxplot</code> geometry.</p>
<ol style="list-style-type: decimal">
<li><p>Copy the code from the last example plot to your script.</p></li>
<li><p>Replace the <code>geom_point</code> function with <code>geom_boxplot</code>.</p></li>
<li><p>Map the <code>site_type</code> column to the x aesthetic and the <code>CSCI</code> scores to the y aesthetic. The final code should look like this:</p>
<pre class="r"><code>ggplot(data = alldat) +
geom_boxplot(mapping = aes(x = site_type, y = CSCI))</code></pre></li>
<li><p>When you’re done, run the code in the console and view the plot. What does it tell you about the distribution of CSCI scores?</p></li>
</ol>
<p>There’s certainly much, much more we can do with ggplot2. Feel free to check out the official <a href="https://ggplot2.tidyverse.org/reference/ggplot2-package.html">ggplot2</a> website for more information. The RStudio <a href="https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf">cheatsheet</a> is also very helpful.</p>
</div>
</div>
</div>
</div>
</div>
</div>
<script>
// add bootstrap table styles to pandoc tables
function bootstrapStylePandocTables() {
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
}
$(document).ready(function () {
bootstrapStylePandocTables();
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement("script");
script.type = "text/javascript";
script.src = "https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML";
document.getElementsByTagName("head")[0].appendChild(script);
})();
</script>
</body>
</html>