Commit 45c2f20

WIP
1 parent 26bc296 commit 45c2f20

3 files changed: +127 / -111 lines

README.Rmd

Lines changed: 28 additions & 15 deletions
Original file line number | Diff line number | Diff line change
@@ -2,7 +2,7 @@
22
output: github_document
33
---
44

5-
<!-- README.md is generated from README.Rmd. Please edit that file -->
5+
<!-- README.md is generated from README.html. Please edit that file -->
66

77
```{r, include = FALSE}
88
knitr::opts_chunk$set(
@@ -18,45 +18,58 @@ knitr::opts_chunk$set(
1818
<!-- badges: start -->
1919
<!-- badges: end -->
2020

21-
**excelDataGuide** is an R package designed to streamline the process of importing data from spreadsheet *data reporting templates* (DRTs) into R.
21+
**excelDataGuide** is an R package that streamlines reading data from standardized Excel spreadsheet templates into R.
2222

23-
A *data reporting template* is a standardized spreadsheet file (in either xls or xlsx format) used for reporting and processing experimental data. These templates significantly reduce the time required for data analysis and encourage users to present their data in a structured format, minimizing errors and misinterpretations.
23+
## The problem
2424

25-
The **excelDataGuide** package eliminates the need for data analysts to write and maintain complex code for reading data from various complex spreadsheet DRTs. Additionally, it offers a robust framework for validating data, ensuring that the correct data types are utilized, and facilitating data wrangling when necessary. This functionality supports *Interoperability* for DRTs, a key aspect of the [FAIR](https://www.go-fair.org/fair-principles/) principles.
25+
Spreadsheet templates are widely used in laboratories to standardize data recording and reduce errors. However, extracting data from these templates into R typically requires writing custom, template-specific code. This is tedious and error-prone.
2626

27-
The package features a user-friendly interface for extracting data from Excel files and converting it into R objects. It accommodates three types of data structures: key-value pairs, tabular data, and microplate-formatted data. The locations of these structures within the Excel template are specified by a **data guide**, which is a YAML file — a structured format that is both human- and machine-readable.
27+
## The solution
28+
29+
The **excelDataGuide** package eliminates this burden by:
30+
31+
1. **Defining a data guide** — a simple YAML file that describes where data are located in your template and how they should be interpreted
32+
2. **Reading data with one command** — the `read_data()` function uses the guide to extract data correctly and automatically
33+
34+
The data guide approach also supports the [FAIR principles](https://www.go-fair.org/fair-principles/) by making your data structure explicit and machine-readable.
2835

2936
## Installation
3037

31-
You can install the development version of excelDataGuide in a recent version of R from GitHub with:
38+
You can install the development version of excelDataGuide from GitHub with:
3239

3340
``` r
3441
# install.packages("pak")
3542
pak::pak("SystemsBioinformatics/excelDataGuide")
3643
```
3744

38-
## Usage
45+
## Quick start
3946

40-
The basic usage of the package requires only one command with two file paths: the path to the Excel data file and the path to the data guide file. Here is an example:
47+
Reading data from an Excel template requires just two files: the template itself and a data guide.
4148

4249
```{r example}
4350
library(excelDataGuide)
51+
52+
# Path to your Excel file
4453
datafile <- system.file("extdata", "example_data.xlsx", package = "excelDataGuide")
54+
55+
# Path to the data guide (YAML file)
4556
guidefile <- system.file("extdata", "example_guide.yml", package = "excelDataGuide")
57+
58+
# Read the data
4659
data <- read_data(datafile, guidefile)
4760
```
4861

49-
The output of the `read_data()` function is a list object the format of which is determined for a large part by the design of the data guide.
62+
The output is a list containing the data organized according to your guide.
5063

51-
## How it works
64+
## Next steps
5265

53-
When you design a template spreadsheet file for data reporting and analysis you also create a *data guide* file that specifies the structure and location of the data in the template. Examples of a template with data and of a data guide are provided in the package (`system.file("extdata", "example_data.xlsx", package = "excelDataGuide")` and `system.file("extdata", "example_guide.yml", package = "excelDataGuide")`).
66+
For detailed guidance on using this package:
5467

55-
Once you have entered the data and metadata in a template you can use the `read_data()` function in the package to extract the data into R with a single command. The package will check and coerce the data types to the required formats.
68+
- **[Designing templates](articles/writing_templates.html)** — Best practices for structuring your Excel templates (version numbers, protected cells, parameter sheets, *etc.*).
5669

57-
Details about the data guide format and how to write one as well as about how to design a template can be found in the package vignettes.
70+
- **[Writing data guides](articles/writing_data_guides.html)** — Step-by-step instructions for creating YAML guides, with examples of all four data types (keyvalue, cells, table, platedata) and a complete working example.
5871

5972
## Future work
6073

61-
- Complete the vignette ([issue](https://github.com/SystemsBioinformatics/excelDataGuide/issues/2))
62-
- Provide guide and template structures for data types without upper size limit, typically time series with no pre-determined length ([issue](https://github.com/SystemsBioinformatics/excelDataGuide/issues/1)).
74+
- [Provide guide and template structures for unbounded data types (time series, *etc.*)](https://github.com/SystemsBioinformatics/excelDataGuide/issues/1)
75+
```

README.md

Lines changed: 49 additions & 52 deletions
Original file line number | Diff line number | Diff line change
@@ -1,85 +1,82 @@
11

2-
<!-- README.md is generated from README.Rmd. Please edit that file -->
2+
<!-- README.md is generated from README.html. Please edit that file -->
33

44
# excelDataGuide
55

66
<!-- badges: start -->
77

88
<!-- badges: end -->
99

10-
**excelDataGuide** is an R package designed to streamline the process of
11-
importing data from spreadsheet *data reporting templates* (DRTs) into
12-
R.
13-
14-
A *data reporting template* is a standardized spreadsheet file (in
15-
either xls or xlsx format) used for reporting and processing
16-
experimental data. These templates significantly reduce the time
17-
required for data analysis and encourage users to present their data in
18-
a structured format, minimizing errors and misinterpretations.
19-
20-
The **excelDataGuide** package eliminates the need for data analysts to
21-
write and maintain complex code for reading data from various complex
22-
spreadsheet DRTs. Additionally, it offers a robust framework for
23-
validating data, ensuring that the correct data types are utilized, and
24-
facilitating data wrangling when necessary. This functionality supports
25-
*Interoperability* for DRTs, a key aspect of the
26-
[FAIR](https://www.go-fair.org/fair-principles/) principles.
27-
28-
The package features a user-friendly interface for extracting data from
29-
Excel files and converting it into R objects. It accommodates three
30-
types of data structures: key-value pairs, tabular data, and
31-
microplate-formatted data. The locations of these structures within the
32-
Excel template are specified by a **data guide**, which is a YAML file —
33-
a structured format that is both human- and machine-readable.
10+
**excelDataGuide** is an R package that streamlines reading data from
11+
standardized Excel spreadsheet templates into R.
12+
13+
## The problem
14+
15+
Spreadsheet templates are widely used in laboratories to standardize
16+
data recording and reduce errors. However, extracting data from these
17+
templates into R typically requires writing custom, template-specific
18+
code. This is tedious and error-prone.
19+
20+
## The solution
21+
22+
The **excelDataGuide** package eliminates this burden by:
23+
24+
1. **Defining a data guide** — a simple YAML file that describes where
25+
data are located in your template and how they should be interpreted
26+
2. **Reading data with one command** — the `read_data()` function uses
27+
the guide to extract data correctly and automatically
28+
29+
The data guide approach also supports the [FAIR
30+
principles](https://www.go-fair.org/fair-principles/) by making your
31+
data structure explicit and machine-readable.
3432

3533
## Installation
3634

37-
You can install the development version of excelDataGuide in a recent
38-
version of R from GitHub with:
35+
You can install the development version of excelDataGuide from GitHub
36+
with:
3937

4038
``` r
4139
# install.packages("pak")
4240
pak::pak("SystemsBioinformatics/excelDataGuide")
4341
```
4442

45-
## Example
43+
## Quick start
4644

47-
The basic usage of the package requires only one command with two file
48-
paths: the path to the Excel data file and the path to the data guide
49-
file. Here is an example:
45+
Reading data from an Excel template requires just two files: the
46+
template itself and a data guide.
5047

5148
``` r
5249
library(excelDataGuide)
50+
51+
# Path to your Excel file
5352
datafile <- system.file("extdata", "example_data.xlsx", package = "excelDataGuide")
53+
54+
# Path to the data guide (YAML file)
5455
guidefile <- system.file("extdata", "example_guide.yml", package = "excelDataGuide")
56+
57+
# Read the data
5558
data <- read_data(datafile, guidefile)
5659
```
5760

58-
The output of the `read_data()` function is a list object the format of
59-
which is determined for a large part by the design of the data guide.
61+
The output is a list containing the data organized according to your
62+
guide.
6063

61-
## How it works
64+
## Next steps
6265

63-
When you design a template spreadsheet file for data reporting and
64-
analysis you also create a *data guide* file that specifies the
65-
structure and location of the data in the template. Examples of a
66-
template with data and of a data guide are provided in the package
67-
(`system.file("extdata", "example_data.xlsx", package = "excelDataGuide")`
68-
and
69-
`system.file("extdata", "example_guide.yml", package = "excelDataGuide")`).
66+
For detailed guidance on using this package:
7067

71-
Once you have entered the data and metadata in a template you can use
72-
the `read_data()` function in the package to extract the data into R
73-
with a single command. The package will check and coerce the data types
74-
to the required formats.
68+
- **[Designing templates](articles/writing_templates.html)** — Best
69+
practices for structuring your Excel templates (version numbers,
70+
protected cells, parameter sheets, *etc.*).
7571

76-
Details about the data guide format and how to write one as well as
77-
about how to design a template can be found in the package vignettes.
72+
- **[Writing data guides](articles/writing_data_guides.html)**
73+
Step-by-step instructions for creating YAML guides, with examples of
74+
all four data types (keyvalue, cells, table, platedata) and a complete
75+
working example.
7876

7977
## Future work
8078

81-
- Complete the vignette
82-
([issue](https://github.com/SystemsBioinformatics/excelDataGuide/issues/2))
83-
- Provide guide and template structures for data types without upper
84-
size limit, typically time series with no pre-determined length
85-
([issue](https://github.com/SystemsBioinformatics/excelDataGuide/issues/1)).
79+
- [Provide guide and template structures for unbounded data types (time
80+
series,
81+
*etc.*)](https://github.com/SystemsBioinformatics/excelDataGuide/issues/1)
82+
\`\`\`

vignettes/writing_templates.Rmd

Lines changed: 50 additions & 44 deletions
Original file line number | Diff line number | Diff line change
@@ -19,46 +19,6 @@ knitr::opts_chunk$set(
1919
library(excelDataGuide)
2020
```
2121

22-
## Introduction
23-
24-
Spreadsheets are used in biochemical laboratories for both recording and
25-
analyzing experiments. In case of routine experiments, spreadsheet templates
26-
are often used to streamline workflows and ensure consistency.
27-
28-
The goal of the **excelDataGuide** package is to enable the use of Excel
29-
spreadsheets alongside scripting environments as effective data analysis tools.
30-
While scripting languages offer more flexibility and power — especially for
31-
analyzing large datasets across multiple workbooks — the spreadsheet is the
32-
*primary source of all data*.
33-
34-
This *single source of truth* approach ensures that both spreadsheet-based
35-
and script-based analyses rely on the same underlying data and parameters.
36-
This includes:
37-
38-
- **Metadata** (user data, instrument settings, *etc*.)
39-
- **Experimental parameters** (acceptance criteria, standard concentrations, *etc*.)
40-
- **Experimental data** (raw measurements, concentrations *etc*.)
41-
42-
Parameters such as acceptance criteria and standard concentration are typically defined by standard
43-
operating procedures (SOPs) and can be stored in single cells or ranges of cells in a
44-
spreadsheet. They can be referred to by absolute referencing or by using named cells or named ranges.
45-
Other values, such as raw measurements or fitted parameters, vary per
46-
experiment and are entered by the user.
47-
48-
Sometimes it may be beneficial for the script to also use *calculated data*
49-
from the spreadsheet — especially when those calculations are
50-
automatically triggered upon user input and these results should be compared to
51-
results from a script-based analysis.
52-
53-
## Spreadsheet templates and data guides
54-
55-
The solution that we propose here is to use a data guide, a file that describes
56-
in a concise manner where data can be found in a template. Furthermore, it
57-
also describes the expected type of data (*e.g*, numeric or text or date) so that
58-
the data can be read in the correct format. The data guide is a *yaml* file. We
59-
chose this format because it has a simple structure, and can be easily read and
60-
edited in a text editor.
61-
6222
## Constructing a spreadsheet template
6323

6424
To provide a link between the data structures of programming languages and
@@ -174,13 +134,59 @@ knitr::include_graphics("images/data.png")
174134

175135
### Missing values in spreadsheets
176136

177-
We urge you to use the `NA()` function to represent missing values in your templates, in particular in calculations. The advantage of using `NA()` is that calculations in the sheets will automatically handle `NA()` and pass them on to subsequent calculations, avoiding errors and producing sensible results. A disadvantage of using `NA()` is that it requires special care to detect and handle missing values in formulas. One particularly weird problem is that you cannot use detection of the string "#N/A" in a cell as a way to generically detect missing values in formulas, even though this "solution" is often presented on internet fora, even in official documentation. The reason is that different language settings of Excel use different string representations for missing values. You have to consistently use the `ISNA()` function to detect `NA()` values throughout your entire template.
137+
We strongly recommend using the `NA()` function to represent missing values in your templates, especially in calculations. Using `NA()` offers several benefits:
138+
139+
**Advantages:**
140+
141+
- Calculations automatically propagate `NA()` values through subsequent formulas, avoiding errors and producing sensible results
142+
- Missing values are handled consistently and transparently
143+
144+
**Challenges:**
145+
146+
Missing values in Excel require special care in formulas. The main issue is that different language settings of Excel use different string representations for missing values (*e.g.*, `#N/A` in English, `#NV` in German). This creates problems:
147+
148+
- You cannot reliably detect missing values using string matching like `"<>#N/A"`, even though this approach is often suggested online
149+
- Conditional aggregation functions (`SUMIF`, `COUNTIF`, *etc.*) do not work reliably with `NA()` values, because excluding those values requires a criterion such as `<>#N/A`, which matches the language-dependent string `#N/A` in cells
150+
151+
**Solutions:**
152+
153+
- Always use the `ISNA()` function in your formulas to detect `NA()` values in cells
154+
- Always use `IFNA()` to handle `NA()` values in aggregation formulas. For example:
155+
- `=SUM(IFNA(A1:A10, 0))` — sums values, treating `NA()` as 0
156+
- `=PRODUCT(IFNA(A1:A10, 1))` — multiplies values, treating `NA()` as 1
157+
- `=AVERAGE(IFNA(A1:A10, ""))` — calculates the average, treating `NA()` as ""
158+
- `=COUNT(IFNA(A1:A10, ""))` — counts non-missing values, treating `NA()` as ""
159+
160+
These formulas work correctly regardless of Excel's language setting and handle `NA()` values properly.
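
For comparison, the same four aggregations can be sketched in R. This assumes that the spreadsheet column has already been imported and that cells containing `NA()` arrive in R as `NA`, which is an assumption about the import step rather than something stated above. The vector `x` is made-up illustration data.

```r
# Illustration data: one missing value, standing in for a cell that held NA() in Excel.
# Assumption: the import step maps Excel's #N/A error to R's NA.
x <- c(1.5, NA, 2.5, 4)

total   <- sum(x, na.rm = TRUE)   # counterpart of =SUM(IFNA(A1:A10, 0))
product <- prod(x, na.rm = TRUE)  # counterpart of =PRODUCT(IFNA(A1:A10, 1))
average <- mean(x, na.rm = TRUE)  # counterpart of =AVERAGE(IFNA(A1:A10, ""))
n_valid <- sum(!is.na(x))         # counterpart of =COUNT(IFNA(A1:A10, ""))

print(c(total = total, product = product, average = average, n_valid = n_valid))
```

If the spreadsheet results are compared against a script-based analysis, both sides should skip missing values in the same way; the `na.rm = TRUE` arguments play the role that `IFNA()` plays in the formulas.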
161+
162+
### Flagged values (bad values)
163+
164+
Sometimes you have raw measurements that you want to exclude from analysis, but deleting them from the spreadsheet is not advisable. Flagging (rather than deleting) allows others—or your future self—to reconsider whether the measurement is truly "bad," since this judgment can be subjective.
165+
166+
**How to flag bad values:**
167+
Add a marker symbol (typically a star or asterisk) before or after the value:
168+
- `1000*` or `*1000` — marks a flagged measurement
169+
170+
**Documenting flagged values:**
171+
In the same sheet, maintain a table documenting each flagged value with:
172+
- Cell address of the flagged measurement
173+
- Reason why it was flagged
174+
175+
This creates an audit trail and allows someone to revisit the decision later.
176+
177+
**Detecting flagged values in calculations:**
178+
You can detect "starred" values in Excel using type-checking functions (`ISNUMBER()`, `ISTEXT()`, *etc*.) and convert them to `NA()`. For example:
179+
180+
`=IF(NOT(ISNUMBER(A1)), NA(), A1)` — returns `NA()` if the cell is not a number (i.e., contains a starred value like `1000*`)
178181

179-
### Labeled values (bad values)
182+
**Visual indicators:**
183+
Use conditional formatting to highlight flagged values with a distinct font color or cell background, making them visible at a glance.
180184

181-
You may have obtained raw measurements that you do not want to include in your analysis. Clearly, you should not delete these measurements from the spreadsheet, because labelling a value as a "bad" measurement is, to some degree, a subjective action with which another user or your future self may disagree. Instead, you can label them as "bad". An easy way to do this is by adding a star before or behind the value, *e.g.* `1000*` or `*1000`. You should also add a note explaining why the value is bad in a table with columns of cell addresses and remarks at a logical position in the same sheet. You can detect such "starred" values in Excel by using for example the `ISERROR()`, `ISNUMBER()` or `ISNONTEXT()` functions in a clause in calculations with these values and set a calculated cell to `NA()` based on the result. For example, `=IF(NOT(ISNUMBER(A1)), NA(), A1)` will set the cell with this formula to `NA()` if the value is not a number. An additional visual aid to detect "starred" values is to use a different font color or cell background for such cells using conditional formatting.
185+
**In the excelDataGuide package:**
186+
The package provides two utility functions to work with flagged values:
182187

183-
In the excelDataGuide package we provide the functions `has_star()` and `star_to_number()` to detect "starred" values, convert them back to numbers, but label them as "bad" in a separate column in the template output.
188+
- `star_to_number()` — Removes the star marker and converts the value back to a number
189+
- `has_star()` — Detects which values are flagged (contain a star/asterisk). The output is a logical vector indicating which values are flagged. It can be used to create a column of flags next to the original values.
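
The idea behind these two helpers can be sketched in base R. The functions `flag_star()` and `strip_star()` below are hypothetical stand-ins written for this illustration. They are not the package's `has_star()` and `star_to_number()`, whose exact arguments and return values should be checked in the package documentation.

```r
# Hypothetical stand-ins for illustration only; not the excelDataGuide implementation.

# TRUE where the raw cell text contains a star marker.
flag_star <- function(x) grepl("*", x, fixed = TRUE)

# Remove star markers and convert the remaining text to a number.
strip_star <- function(x) as.numeric(gsub("*", "", x, fixed = TRUE))

# Raw cell contents as text: two flagged values, one plain value, one missing value.
raw <- c("1000*", "980", "*1020", NA)

# Keep the numeric value and the flag side by side.
result <- data.frame(
  value   = strip_star(raw),
  flagged = flag_star(raw)
)
print(result)
```

Keeping the flag in its own column preserves the measurement while letting downstream code decide whether to exclude it, for example with `result$value[!result$flagged]`.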
184190

185191
### What else?
186192
