---
output: github_document
---

<!-- README.md is generated from README.Rmd. Edit README.Rmd -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(parcr)
```

<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/parcr)](https://cran.r-project.org/package=parcr)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/SystemsBioinformatics/parcr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/SystemsBioinformatics/parcr/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->

## Construct parser combinator functions for parsing character vectors

This R package contains tools to construct parser combinator functions:
higher-order functions that parse input. The main goal of this package is to simplify
the creation of *transparent* parsers for structured text files generated by
machines like laboratory instruments. Such files consist of lines of text
organized in higher-order structures like headers with metadata and blocks of
measured values. To read these data into R you first need to create a parser
that processes these files and creates R-objects as output. The `parcr` package
simplifies the task of creating such parsers.

This package was inspired by the package
["Ramble"](https://github.com/NoRaincheck/Ramble) by Chapman Siu and co-workers
and by the paper
["Higher-order functions for parsing"](https://doi.org/10.1017/S0956796800000411)
by [Graham Hutton](https://orcid.org/0000-0001-9584-5150) (1992).

## Installation

Install the stable version from CRAN:

```
install.packages("parcr")
```

To install the development version, including its vignette, run the following command:

```
# install_github() is provided by the remotes package: install.packages("remotes")
remotes::install_github("SystemsBioinformatics/parcr", build_vignettes = TRUE)
```

## Example application: a parser for *fasta* sequence files

As an example of a realistic application we write a parser for
fasta-formatted files for nucleotide and protein sequences. We use a few
simplifying assumptions about this format for the sake of the example. Real
fasta files are more complex than we pretend here.

*Please note that more background about the functions that we use here is
available in the package documentation. Here we only present a summary.*

A fasta file with mixed sequence types could look like the example below:

```{r, echo=FALSE, comment = NA}
data("fastafile")
cat(paste0(fastafile, collapse="\n"))
```

Since fasta files are text files we could read such a file using `readLines()`
into a character vector. The package provides the data set `fastafile` which
contains that character vector.

```{r, eval=FALSE}
data("fastafile")
```
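
For a file on disk, the `readLines()` route could look like the minimal sketch
below. The file content and the names `tmp` and `fasta_lines` are made up for
illustration and are not part of the package.

```{r}
# Write a small, made-up fasta file to a temporary location
tmp <- tempfile(fileext = ".fasta")
writeLines(
  c(">sequence_A", "GGTAAGTCCTCTAGTAC", "", ">sequence_B", "MTEYKLVVVG"),
  tmp
)

# Read it back: the result is a character vector with one element per line,
# including an empty string for the empty line
fasta_lines <- readLines(tmp)
length(fasta_lines)
fasta_lines[1]

# Clean up the temporary file
unlink(tmp)
```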

We can distinguish the following higher-order components in a fasta file:

- A **fasta** file: consists of one or more **sequence blocks** until the
  **end of the file**.
- A **sequence block**: consists of a **header** and a
  **nucleotide sequence** or a **protein sequence**. A sequence block could be
  preceded by zero or more **empty lines**.
- A **nucleotide sequence**: consists of one or more
  **nucleotide sequence strings**.
- A **protein sequence**: consists of one or more
  **protein sequence strings**.
- A **header** is a *string* that starts with a ">" immediately followed by
  a **title** without spaces.
- A **nucleotide sequence string** is a *string* without spaces that consists
  *entirely* of symbols from the set `{G,A,T,C}`.
- A **protein sequence string** is a *string* without spaces that consists
  *entirely* of symbols from the set `{A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}`.

It now becomes clear what we mean when we say that the package allows us
to write *transparent* parsers: the description above of the structure of fasta
files can be put straight into code for a `Fasta()` parser:

```{r}
Fasta <- function() {
  one_or_more(SequenceBlock()) %then%
    eof()
}

SequenceBlock <- function() {
  MaybeEmpty() %then%
    Header() %then%
    (NuclSequence() %or% ProtSequence()) %using%
    function(x) list(x)
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using%
    function(x) list(type = "Nucl", sequence = paste(x, collapse=""))
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using%
    function(x) list(type = "Prot", sequence = paste(x, collapse=""))
}
```

Functions like `one_or_more()`, `%then%`, `%or%`, `%using%`, `eof()` and
`MaybeEmpty()` are defined in the package and are the basic parsers with
which the package user can build complex parsers. The `%using%` operator uses
the function on its right-hand side to modify the output of the parser on its
left-hand side. Please see the vignette in the `parcr` package for an explanation
of why this is useful, or even necessary.

Notice that the new parser functions that we define above are higher-order
functions taking no input, hence the empty argument brackets `()` behind their
names.

Now we need to define the parsers `Header()`, `NuclSequenceString()`
and `ProtSequenceString()` that actually recognize and process the header line
string and strings of nucleotide or protein sequences in the character vector
`fastafile`. We use the function constructor `stringparser()` from the package
to construct helper functions that recognize and capture the desired matches,
and we use `match_s()` to create `parcr`-compliant parsers from these.

```{r}
Header <- function() {
  match_s(stringparser("^>(\\w+)")) %using%
    function(x) list(title = unlist(x))
}

NuclSequenceString <- function() {
  match_s(stringparser("^([GATC]+)$"))
}

ProtSequenceString <- function() {
  match_s(stringparser("^([ARNDBCEQZGHILKMFPSTWYV]+)$"))
}
```
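
To get a feel for what these regular expressions accept, the patterns can be
tried out with base R's `grepl()`, `regexec()` and `regmatches()`. This is only
a sketch of the matching step with made-up test strings; it does not show how
`stringparser()` or `match_s()` work internally.

```{r}
nucl_pattern <- "^([GATC]+)$"
prot_pattern <- "^([ARNDBCEQZGHILKMFPSTWYV]+)$"
header_pattern <- "^>(\\w+)"

# Made-up candidate lines: nucleotides, protein, a header, and a line with a space
candidates <- c("GATTACA", "MTEYKLVVVG", ">sequence_A", "GAT TACA")

# Which lines consist entirely of nucleotide symbols?
grepl(nucl_pattern, candidates)

# Which lines consist entirely of protein symbols? A pure nucleotide string
# also qualifies, because G, A, T and C are part of the protein alphabet.
grepl(prot_pattern, candidates)

# Extract the captured title from a header line
m <- regmatches(">sequence_A", regexec(header_pattern, ">sequence_A"))
m[[1]][2]
```

Because a nucleotide string also satisfies the protein pattern, the order of
the alternatives in `NuclSequence() %or% ProtSequence()` presumably matters:
the nucleotide parser is tried first.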

Now we have all the elements that we need to apply the `Fasta()` parser.

```{r}
Fasta()(fastafile)
```

The output of the parser consists of two elements, `L` and `R`, where `L`
contains the parsed and processed part of the input and `R` the remaining
un-parsed part of the input. Because the definition of the `Fasta()` parser
uses `eof()` to demand that parsing continues until the end of the file, the
`R` element contains an empty list, which signals that the parser did indeed
reach the end of the input. Please see the package documentation for more
examples and explanation.
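
The `L`/`R` idea can be illustrated with a hand-written toy parser in base R.
The function `toy_header()` below is hypothetical and is not how `parcr`
implements its parsers; it only mimics the shape of the result described above
by consuming one element of the input and returning the rest. Returning `NULL`
on failure is a convention of this sketch only.

```{r}
# Toy parser: succeeds if the first element of x is a header line
toy_header <- function(x) {
  if (length(x) > 0 && grepl("^>\\w+", x[1])) {
    list(
      L = list(title = sub("^>", "", x[1])),  # parsed and processed part
      R = x[-1]                               # remaining, un-parsed part
    )
  } else {
    NULL  # this sketch's own way of signalling failure
  }
}

toy_result <- toy_header(c(">sequence_A", "GGTAAGTCC"))
toy_result$L$title
toy_result$R
```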

Finally, let's present the result of the parse more concisely using the names
of the elements inside the `L` element:

```{r}
d <- Fasta()(fastafile)[["L"]]
invisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, "\n")}))
```

## Getting useful error messages when parsing

Basic error messaging is implemented in the function `reporter()`. You can wrap
a parser in the `reporter()` function to obtain an error message that reports
the line of the input in which the parser ultimately failed as well as some lines
around it to provide context. Suppose we have the following badly formatted
fasta file:

```{r}
bad_header <- c(
  "*sequence_A",
  "GGTAAGTCCTCTAGTACAAACACCCCCAAT",
  ">sequence_B",
  "ATTGTGATATAATTAAAATTATATTCATAT"
)
```

Note that the first header starts with `*` instead of a `>`. Upgrading the
`Fasta()` parser with the `reporter()` function to an *error reporting parser*
yields a basic error message:

```{r}
#| eval: false
reporter(Fasta())(bad_header)
```
```{r}
#| echo: false
try(reporter(Fasta())(bad_header))
```


We could, however, get better error messaging by upgrading the `Header()` parser
to a named parser:

```{r}
Header <- function() {
  named(
    match_s(stringparser("^>(\\w+)")) %using%
      function(x) list(title = unlist(x)),
    "FASTA header (>sequence_name)"
  )
}
```

where the first argument to the `named()` function is a parser body and the
second argument is a brief description of the parser. Now, the reporter yields
a more detailed message:

```{r}
#| eval: false
reporter(Fasta())(bad_header)
```


```{r}
#| echo: false
try(reporter(Fasta())(bad_header))
```

Suppose we have another badly formatted fasta file, in which the first header
is not followed by a sequence:

```{r}
missing_sequence <- c(
  ">sequence_A",
  ">sequence_B",
  "ATTGTGATATAATTAAAATTATATTCATAT"
)
```

Upgrading the `NuclSequence()` and `ProtSequence()` parsers to named parsers
yields a better error message:

```{r}
NuclSequence <- function() {
  named(
    one_or_more(NuclSequenceString()) %using%
      function(x) list(type = "Nucl", sequence = paste(x, collapse="")),
    "Nucleotide_Sequence"
  )
}

ProtSequence <- function() {
  named(
    one_or_more(ProtSequenceString()) %using%
      function(x) list(type = "Prot", sequence = paste(x, collapse="")),
    "Protein_Sequence"
  )
}
```

```{r}
#| eval: false
reporter(Fasta())(missing_sequence)
```

```{r}
#| echo: false
try(reporter(Fasta())(missing_sequence))
```