-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathREADME.Rmd
More file actions
88 lines (69 loc) · 3.04 KB
/
README.Rmd
File metadata and controls
88 lines (69 loc) · 3.04 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# phylotyprrefdata
<!-- badges: start -->
<!-- badges: end -->
A data package that provides users of the {phylotypr} package with the
reference data needed to classify 16S rRNA gene sequences using the RDP,
Silva, and greengenes taxonomic references. The data were originally posted
by each of the databases and reformatted for use with mothur. You can find links to those versions of the file on the [mothur wiki](https://mothur.org/wiki/taxonomy_outline/).
## Installation
You can install the development version of {phylotyprrefdata} from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("mothur/phylotyprrefdata")
```
## Information
This package provides reference data from the RDP, Silva, and greengenes databases. Note that in the current version, the Silva sequences were selected based on those sequences that aligned as full length sequences (not all were truly full length). In contrast, the greengenes and RDP sequences are generally long, but do not follow the same restriction. This is why there are fewer sequences in the Silva database versus the RDP and greengenes. You may want to recreate your own version of these references to include partial reads that fully overlap your region of interest. I'm of mixed opinions on whether the reads should be full length or allow shorter reads. For what it's worth, the RDP included all sequences in their database when their site was operational. This probably warrants further analysis.
If you notice that I don't have a version of the greengenes, RDP, or Silva databases that you are interested in, please [file an issue](https://github.com/mothur/phylotyprrefdata/issues). Similarly, if there is another database you would like me to include, [file an issue](https://github.com/mothur/phylotyprrefdata/issues).
This is a pretty basic data package. If you would like to see any additional features for the package, [let me know your ideas](https://github.com/mothur/phylotyprrefdata/issues).
## Statistics
Here is a profile of the available reference data:
```{r example}
library(phylotyprrefdata)
nrow(rdp_trainset10_pds)
nrow(rdp_trainset10)
nrow(rdp_trainset14_pds)
nrow(rdp_trainset14)
nrow(rdp_trainset16_pds)
nrow(rdp_trainset16)
nrow(rdp_trainset18_pds)
nrow(rdp_trainset18)
nrow(rdp_trainset19_pds)
nrow(rdp_trainset19)
nrow(rdp_trainset6_pds)
nrow(rdp_trainset6)
nrow(rdp_trainset7_pds)
nrow(rdp_trainset7)
nrow(rdp_trainset9_pds)
nrow(rdp_trainset9)
nrow(gg_12_1)
nrow(gg_13_5)
nrow(gg_13_8)
nrow(gg_20_10)
nrow(gg_24_09)
nrow(silva_v119_nr)
nrow(silva_v119_seed)
nrow(silva_v123_nr)
nrow(silva_v123_seed)
nrow(silva_v128_nr)
nrow(silva_v128_seed)
nrow(silva_v132_nr)
nrow(silva_v132_seed)
nrow(silva_v138_1_nr)
nrow(silva_v138_1_seed)
nrow(silva_v138_2_nr)
nrow(silva_v138_2_seed)
nrow(silva_v138_nr)
nrow(silva_v138_seed)
```