rc3helpers/README.Rmd at main · MoseleyBioinformaticsLab/rc3helpers · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# rc3helpers

<!-- badges: start -->
<!-- badges: end -->

rc3helpers is a little package designed around making it easier work with local samples in a few ways:

1. Running recount-pump and recount-unify;
1. Copying the output of recount-unify into the directory structure that allows importing local data into the [recount3](https://bioconductor.org/packages/release/bioc/html/recount3.html) R package.
1. Checking that something hasn't happened to your *.fq.gz files during download or copying between locations.

## Installation

You can install the development version of `rc3helpers` from [GitHub](https://github.com/) with:

``` r
# install.packages("pak")
pak::pak("MoseleyBioinformaticsLab/rc3helpers")
```

## Running Pump & Unify

This assumes you've already gotten everything setup for running pump and unify locally.
You can find instructions for that on the [monorail-external](https://github.com/langmead-lab/monorail-external) README.

I realize that the rc3_run_pump_unify function call takes a **lot** of arguments.
However, given that I'm horrible at writing shell scripts, and typically have more than a single sample to run on local hardware, not in the cloud
like the Langmead lab does, this saves me a lot of typing as I only have to give it the directory of samples, besides the other locations.

```{r}
#| label: running-pump-unify
#| eval: false
# NOT RUN, for example only
unify_output = rc3_run_pump_unify(
  fasta = "path/to/fasta.fq.gz",
  outputs = "path/to/outputs/dir",
  monorail = "path/to/monorail/repo",
  recount_pump = "path/to/recount-pump.XXX.sif",
  recount_unify = "path/to/recount-unify.XXX.sif",
  reference_path = "path/to/references",
  reference = "hg38",
  studyid = "studyid",
  shortid = "shortid",
  ncore = 2,
  run = "both",
  delete_oldpump = "yes",
  run_or_show = "run"
)
```

When you are processing **a larger** number of samples (I've seen it happen with 17 samples), one or more of the samples might fail during the **pump** stage.
`recount-unify` should die at that point from checking the samples, as `rc3_run_pump_unify` defaults to expecting the same samples in both the **pump** and **unify** stage.
You should verify **what** caused the sample to error by checking the log file, normally in `path/to/outputs/output/sampleid_att0/std.out`.
As long as it doesn't say *decompression* failed, you should be able to **delete** the sample directory, and then set `run = "pump", delete_oldpump = "no"` to just do the **pump** part on the samples that previously failed.
When that is complete, you can then do it again, with `run = "unify"`, and it will generate the **unify** results.


### Custom Organisms

For the `monorail` argument, this can either be the top level directory that is normally *monorail-external*, or a vector of length 2, giving the paths to both
of the shell scripts for orchestrating *recount-pump* and *recount-unify* respectively.
This is useful when you are working with a custom organism outside of **human** and **mouse**.
For example, I have been working with **zebrafish**, so I created a slightly modified script that handles the *recount-unify* step properly for zebrafish, and
I would define the `monorail` argument like this, so that the unify step calls the correct shell script:

```r
monorail = c("monorail-external/singularity/run_recount_pump.sh",
             "monorail-external/sngularity/run_recount_unify_mz11.sh")
```

### Samples and IDs

This script **currently** expects each sample to have two fasta files that are named:

* sample_id_1.fq.gz
* sample_id_2.fq.gz

*monorail-external* also assumes that the last two places of any identifier such as **sample_id**, **study_id** or **shortid** <span style="color: red">do not</span> contain an underscore or any other non-alphanumeric character, i.e. they should contain [a-z] and or [0-9].

## Copying Unify Outputs

Two important notes:

1. As of recount3 v 1.22.0, there is no option to have an organism outside of *human* or *mouse*, so we always use a directory of one of those two organisms.
1. This also means that only certain annotations / references are supported. Therefore, we create symbolic links to the files that `recount3` expects to be there.


I previously ran two human SRA samples in local mode through the pump and unify workflow, with the `short_id` **sratest**.

```{r}
#| label: setup-directories
library(rc3helpers)
tmp_dir = tempdir()
rc3_dirs = rc3_setup_directory(base_dir = tmp_dir, short_id = "sratest")
```

The directories created should look like this:

```
human
├── annotations
│   ├── exon_sums
│   └── gene_sums
└── data_sources
    └── sratest
        ├── base_sums
        ├── exon_sums
        ├── gene_sums
        ├── junctions
        └── metadata
```

The unify output looks like this:

```
├── exon_sums_per_study
│   └── st
│       └── sratest
│           ├── sratest.exon_sums.sratest.ERCC.gz
│           ├── sratest.exon_sums.sratest.F006.gz
│           ├── sratest.exon_sums.sratest.G026.gz
│           ├── sratest.exon_sums.sratest.G029.gz
│           ├── sratest.exon_sums.sratest.R109.gz
│           └── sratest.exon_sums.sratest.SIRV.gz
├── gene_sums_per_study
│   └── st
│       └── sratest
│           ├── sratest.gene_sums.sratest.ERCC.gz
│           ├── sratest.gene_sums.sratest.F006.gz
│           ├── sratest.gene_sums.sratest.G026.gz
│           ├── sratest.gene_sums.sratest.G029.gz
│           ├── sratest.gene_sums.sratest.R109.gz
│           └── sratest.gene_sums.sratest.SIRV.gz
├── ids.tsv
├── junction_counts_per_study
│   └── st
│       └── sratest
│           ├── sratest.junctions.sratest.ALL.ID.gz
│           ├── sratest.junctions.sratest.ALL.MM.gz
│           ├── sratest.junctions.sratest.ALL.RR.gz
│           ├── sratest.junctions.sratest.UNIQUE.ID.gz
│           ├── sratest.junctions.sratest.UNIQUE.MM.gz
│           └── sratest.junctions.sratest.UNIQUE.RR.gz
├── junctions.bgz
├── junctions.bgz.tbi
├── junctions.sqlite
|
|
|
```

Where this is a model organism, you are going to need the reference sets of things downloaded too.
We can download the gene and exon sum annotations from recount3, in our case for G026.

```
wget http://duffel.rail.bio/recount3/human/new_annotations/gene_sums/human.gene_sums.G026.gtf.gz
wget http://duffel.rail.bio/recount3/human/new_annotations/exon_sums/human.exon_sums.G026.gtf.gz
```

We can pass the list of directories created by `rc3_setup_directories` to `rc3_copy_unify_outputs`:

```{r}
#| label: copy-unify
#| eval: false
rc3_dirs |>
  rc3_copy_unify_outputs(
    short_id = "sratest",
    unify_directory = "/path/to/unify",
    references_directory = "/path/to/references/hg38"
  )
```

Which will create this directory structure. You may see an error when working with human data, as the function tries to link a file that actually already exists.
That's OK.

```
human
├── annotations
│   ├──exon_sums
│   │   ├── human.exon_sums.G026.gtf.gz
│   └── gene_sums
│       ├── human.gene_sums.G026.gtf.gz
└── data_sources
    └── sratest
        ├── base_sums
        ├── exon_sums
        │   └── st
        │       └── sratest
        │           ├── sratest.exon_sums.sratest.ERCC.gz
        │           ├── sratest.exon_sums.sratest.F006.gz
        │           ├── sratest.exon_sums.sratest.G026.gz
        │           ├── sratest.exon_sums.sratest.G029.gz
        │           ├── sratest.exon_sums.sratest.R109.gz
        │           └── sratest.exon_sums.sratest.SIRV.gz
        ├── gene_sums
        │   └── st
        │       └── sratest
        │           ├── sratest.gene_sums.sratest.ERCC.gz
        │           ├── sratest.gene_sums.sratest.F006.gz
        │           ├── sratest.gene_sums.sratest.G026.gz
        │           ├── sratest.gene_sums.sratest.G029.gz
        │           ├── sratest.gene_sums.sratest.R109.gz
        │           └── sratest.gene_sums.sratest.SIRV.gz
        ├── junctions
        │   └── st
        │       └── sratest
        │           ├── sratest.junctions.sratest.ALL.ID.gz
        │           ├── sratest.junctions.sratest.ALL.MM.gz
        │           ├── sratest.junctions.sratest.ALL.RR.gz
        │           ├── sratest.junctions.sratest.UNIQUE.ID.gz
        │           ├── sratest.junctions.sratest.UNIQUE.MM.gz
        │           └── sratest.junctions.sratest.UNIQUE.RR.gz
        └── metadata
            ├── sratest.recount_project.MD.gz
            └── st
                └── sratest
                    ├── sratest.recount_project.sratest.MD.gz
                    ├── sratest.recount_qc.sratest.MD.gz
                    └── sratest.sratest.sratest.MD.gz
```

Then we can load up recount3 and start working with this dataset:

```{r}
#| label: run-recount3
#| eval: false
library(recount3)
hp = available_projects(recount3_url = tmp_dir)
rse_genes = create_rse(hp[1, ], recount3_url = tmp_dir)
```

## Checking fq.gz

Sometimes, if there is an issue with your filesystem, or the disks where data is being stored, when you copy or move a file, something can go wrong with the file.
recount-pump depends on decompressing your fasta files into the docker container to run, so if anything gets corrupted with the fq.gz file, it just won't run.

So, assuming you have gzip installed on your system, we can check either an entire directory or particular files before running recount-pump and unify.

```{r}
#| label: gz-check
#| eval: false
# Not run
file_checks = rc3_check_gz(fasta)
```

This returns a data.frame, with any error messages emitted from gzip for each file.
Ideally, you see "NA" in every entry before running the pump and unify steps.