README.Rmd
Lines changed: 28 additions & 15 deletions
@@ -2,7 +2,7 @@
2
2
output: github_document
3
3
---
4
4
5
-
<!-- README.md is generated from README.Rmd. Please edit that file -->
5
+
<!-- README.md is generated from README.Rmd. Please edit that file -->
6
6
7
7
```{r, include = FALSE}
8
8
knitr::opts_chunk$set(
@@ -18,45 +18,58 @@ knitr::opts_chunk$set(
18
18
<!-- badges: start -->
19
19
<!-- badges: end -->
20
20
21
-
**excelDataGuide** is an R package designed to streamline the process of importing data from spreadsheet *data reporting templates* (DRTs) into R.
21
+
**excelDataGuide** is an R package that streamlines reading data from standardized Excel spreadsheet templates into R.
22
22
23
-
A *data reporting template* is a standardized spreadsheet file (in either xls or xlsx format) used for reporting and processing experimental data. These templates significantly reduce the time required for data analysis and encourage users to present their data in a structured format, minimizing errors and misinterpretations.
23
+
## The problem
24
24
25
-
The **excelDataGuide** package eliminates the need for data analysts to write and maintain complex code for reading data from various complex spreadsheet DRTs. Additionally, it offers a robust framework for validating data, ensuring that the correct data types are utilized, and facilitating data wrangling when necessary. This functionality supports *Interoperability* for DRTs, a key aspect of the [FAIR](https://www.go-fair.org/fair-principles/) principles.
25
+
Spreadsheet templates are widely used in laboratories to standardize data recording and reduce errors. However, extracting data from these templates into R typically requires writing custom, template-specific code. This is tedious and error-prone.
26
26
27
-
The package features a user-friendly interface for extracting data from Excel files and converting it into R objects. It accommodates three types of data structures: key-value pairs, tabular data, and microplate-formatted data. The locations of these structures within the Excel template are specified by a **data guide**, which is a YAML file — a structured format that is both human- and machine-readable.
27
+
## The solution
28
+
29
+
The **excelDataGuide** package eliminates this burden by:
30
+
31
+
1. **Defining a data guide**: a simple YAML file that describes where data are located in your template and how they should be interpreted
32
+
2. **Reading data with one command**: the `read_data()` function uses the guide to extract data correctly and automatically
33
+
34
+
The data guide approach also supports the [FAIR principles](https://www.go-fair.org/fair-principles/) by making your data structure explicit and machine-readable.
28
35
29
36
## Installation
30
37
31
-
You can install the development version of excelDataGuide in a recent version of R from GitHub with:
38
+
You can install the development version of excelDataGuide from GitHub with:
32
39
33
40
```r
34
41
# install.packages("pak")
35
42
pak::pak("SystemsBioinformatics/excelDataGuide")
36
43
```
37
44
38
-
## Usage
45
+
## Quick start
39
46
40
-
The basic usage of the package requires only one command with two file paths: the path to the Excel data file and the path to the data guide file. Here is an example:
47
+
Reading data from an Excel template requires just two files: the template itself and a data guide.
The output of the `read_data()` function is a list object the format of which is determined for a large part by the design of the data guide.
62
+
The output is a list containing the data organized according to your guide.
50
63
51
-
## How it works
64
+
## Next steps
52
65
53
-
When you design a template spreadsheet file for data reporting and analysis you also create a *data guide* file that specifies the structure and location of the data in the template. Examples of a template with data and of a data guide are provided in the package (`system.file("extdata", "example_data.xlsx", package = "excelDataGuide")` and `system.file("extdata", "example_guide.yml", package = "excelDataGuide")`).
66
+
For detailed guidance on using this package:
54
67
55
-
Once you have entered the data and metadata in a template you can use the `read_data()` function in the package to extract the data into R with a single command. The package will check and coerce the data types to the required formats.
68
+
- **[Designing templates](articles/writing_templates.html)**: Best practices for structuring your Excel templates (version numbers, protected cells, parameter sheets, *etc.*).
56
69
57
-
Details about the data guide format and how to write one as well as about how to design a template can be found in the package vignettes.
70
+
- **[Writing data guides](articles/writing_data_guides.html)**: Step-by-step instructions for creating YAML guides, with examples of all four data types (keyvalue, cells, table, platedata) and a complete working example.
58
71
59
72
## Future work
60
73
61
-
- Complete the vignette ([issue](https://github.com/SystemsBioinformatics/excelDataGuide/issues/2))
62
-
- Provide guide and template structures for data types without upper size limit, typically time series with no pre-determined length ([issue](https://github.com/SystemsBioinformatics/excelDataGuide/issues/1)).
74
+
- [Provide guide and template structures for unbounded data types (time series, *etc.*)](https://github.com/SystemsBioinformatics/excelDataGuide/issues/1)
We urge you to use the `NA()` function to represent missing values in your templates, in particular in calculations. The advantage of using `NA()` is that calculations in the sheets will automatically handle `NA()` and pass them on to subsequent calculations, avoiding errors and producing sensible results. A disadvantage of using `NA()` is that it requires special care to detect and handle missing values in formulas. One particularly weird problem is that you cannot use detection of the string "#N/A" in a cell as a way to generically detect missing values in formulas, even though this "solution" is often presented on internet fora, even in official documentation. The reason is that different language settings of Excel use different string representations for missing values. You have to consistently use the `ISNA()` function to detect `NA()` values throughout your entire template.
137
+
We strongly recommend using the `NA()` function to represent missing values in your templates, especially in calculations. Using `NA()` offers several benefits:
138
+
139
+
**Advantages:**
140
+
141
+
- Calculations automatically propagate `NA()` values through subsequent formulas, avoiding errors and producing sensible results
142
+
- Missing values are handled consistently and transparently
143
+
144
+
**Challenges:**
145
+
146
+
Missing values in Excel require special care in formulas. The main issue is that different language settings of Excel use different string representations for missing values (*e.g.*, `#N/A` in English, `#NV` in German). This creates problems:
147
+
148
+
- You cannot reliably detect missing values using string matching like `"<>#N/A"`, even though this approach is often suggested online
149
+
- Conditional aggregation functions (`SUMIF`, `COUNTIF`, *etc*.) do not work correctly with `NA()` values because they need a criterion like `<>#N/A`, which matches the literal string `#N/A` and therefore depends on the language setting
150
+
151
+
**Solutions:**
152
+
153
+
- Always use the `ISNA()` function in your formulas to detect `NA()` values in cells
154
+
- Always use `IFNA()` to handle `NA()` values in aggregation formulas. For example:
155
+
  - `=SUM(IFNA(A1:A10, 0))`: sums values, treating `NA()` as 0
156
+
  - `=PRODUCT(IFNA(A1:A10, 1))`: multiplies values, treating `NA()` as 1
157
+
  - `=AVERAGE(IFNA(A1:A10, ""))`: calculates the average, treating `NA()` as ""
158
+
  - `=COUNT(IFNA(A1:A10, ""))`: counts non-missing values, treating `NA()` as ""
159
+
160
+
These formulas work correctly regardless of Excel's language setting and handle `NA()` values properly.
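
For readers who think in R rather than Excel, the four formulas above have close base R counterparts. The following is only an illustrative sketch: the vector `x` is made-up data, and R's `NA` is used here as a stand-in for a cell containing Excel's `NA()`.

```r
# Made-up values standing in for a template column; NA plays the role of Excel's NA()
x <- c(2, NA, 5, NA, 10)

# Without special handling, the missing value propagates, as NA() does in Excel
propagated <- sum(x)

# Counterparts of the IFNA()-based aggregation formulas
total   <- sum(x, na.rm = TRUE)   # like =SUM(IFNA(A1:A10, 0))
product <- prod(x, na.rm = TRUE)  # like =PRODUCT(IFNA(A1:A10, 1))
average <- mean(x, na.rm = TRUE)  # like =AVERAGE(IFNA(A1:A10, ""))
n_valid <- sum(!is.na(x))         # like =COUNT(IFNA(A1:A10, ""))

print(c(total = total, product = product, average = average, n_valid = n_valid))
```

In both environments the point is the same: missing values are skipped explicitly in the aggregation step instead of being matched as text.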
161
+
162
+
### Flagged values (bad values)
163
+
164
+
Sometimes you have raw measurements that you want to exclude from analysis, but deleting them from the spreadsheet is not advisable. Flagging (rather than deleting) allows others—or your future self—to reconsider whether the measurement is truly "bad," since this judgment can be subjective.
165
+
166
+
**How to flag bad values:**
167
+
Add a marker symbol (typically a star or asterisk) before or after the value:
168
+
- `1000*` or `*1000`: marks a flagged measurement
169
+
170
+
**Documenting flagged values:**
171
+
In the same sheet, maintain a table documenting each flagged value with:
172
+
- Cell address of the flagged measurement
173
+
- Reason why it was flagged
174
+
175
+
This creates an audit trail and allows someone to revisit the decision later.
176
+
177
+
**Detecting flagged values in calculations:**
178
+
You can detect "starred" values in Excel using type-checking functions (`ISNUMBER()`, `ISTEXT()`, *etc*.) and convert them to `NA()`. For example:
179
+
180
+
`=IF(NOT(ISNUMBER(A1)), NA(), A1)` — returns `NA()` if the cell is not a number (i.e., contains a starred value like `1000*`)
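
The same "not a number, so treat as missing" rule can be sketched in base R. This is an illustration of the idea only, not the package's implementation; the `cells` vector is made-up data representing cell contents read as text.

```r
# Made-up cell contents read as text; two of them carry a star flag
cells <- c("980", "1000*", "1020", "*995")

# as.numeric() yields NA for anything that is not a clean number,
# mirroring =IF(NOT(ISNUMBER(A1)), NA(), A1); the coercion warning is expected
values <- suppressWarnings(as.numeric(cells))

# Flagged entries are now excluded from the calculation
clean_mean <- mean(values, na.rm = TRUE)
print(values)
print(clean_mean)
```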
178
181
179
-
### Labeled values (bad values)
182
+
**Visual indicators:**
183
+
Use conditional formatting to highlight flagged values with a distinct font color or cell background, making them visible at a glance.
180
184
181
-
You may have obtained raw measurements that you do not want to include in your analysis. Clearly, you should not delete these measurements from the spreadsheet, because labelling a value as a "bad" measurement is, to some degree, a subjective action with which another user or your future self may disagree. Instead, you can label them as "bad". An easy way to do this is by adding a star before or after the value, *e.g.* `1000*` or `*1000`. You should also add a note explaining why the value is bad in a table with columns of cell addresses and remarks at a logical position in the same sheet. You can detect such "starred" values in Excel by using for example the `ISERROR()`, `ISNUMBER()` or `ISNONTEXT()` functions in a clause in calculations with these values and set a calculated cell to `NA()` based on the result. For example, `=IF(NOT(ISNUMBER(A1)), NA(), A1)` will set the cell with this formula to `NA()` if the value is not a number. An additional visual aid to detect "starred" values is to use a different font color or cell background for such cells using conditional formatting.
185
+
**In the excelDataGuide package:**
186
+
The package provides two utility functions to work with flagged values:
182
187
183
-
In the excelDataGuide package we provide the functions `has_star()` and `star_to_number()` to detect "starred" values, convert them back to numbers, but label them as "bad" in a separate column in the template output.
188
+
- `star_to_number()`: Removes the star marker and converts the value back to a number
189
+
- `has_star()`: Detects which values are flagged (contain a star/asterisk). The output is a logical vector indicating which values are flagged. It can be used to create a column of flags next to the original values.
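
To make the described behavior concrete, here is a minimal base R sketch of what such helpers might do. The functions `has_star_sketch()` and `star_to_number_sketch()` are hypothetical stand-ins written for this illustration; they are not the package's `has_star()` and `star_to_number()`, whose exact signatures and edge-case handling may differ.

```r
# Hypothetical stand-in: TRUE where a value starts or ends with a star
has_star_sketch <- function(x) {
  x <- trimws(x)
  startsWith(x, "*") | endsWith(x, "*")
}

# Hypothetical stand-in: strip the star marker and convert back to a number
star_to_number_sketch <- function(x) {
  as.numeric(gsub("*", "", trimws(x), fixed = TRUE))
}

# Made-up raw values as they might be read from a template
raw <- c("980", "1000*", "*995", "1020")

# Keep the numbers and record the flags in a column next to them
out <- data.frame(
  value   = star_to_number_sketch(raw),
  flagged = has_star_sketch(raw)
)
print(out)
```

Keeping the flag in its own column preserves the original measurement while still letting an analysis filter on `flagged`.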