Commit c3b3c02

more docs info
1 parent a060766 commit c3b3c02

2 files changed

Lines changed: 174 additions & 34 deletions

File tree

README.md

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -37,11 +37,12 @@ reads in BAM format. These are provided to the samplesheet as input.
3737
2. [`bambu`](https://github.com/GoekeLab/bambu) - very minor read correction
3838
3. [`IsoQuant`](https://ablab.github.io/IsoQuant/) - allows read correction
3939
4. [`StringTie`](https://github.com/skovaka/stringtie2)
40-
7. Fusion gene detection [`JAFFA`](github.com/Oshlack/JAFFA)
41-
8. Transcriptome assessment [`gffutils`](https://ccb.jhu.edu/software/stringtie/gff.shtml)
42-
9. Transcript quantification ( [`TranSigner`](https://github.com/haydenji0731/TranSigner), [oarfish](https://github.com/COMBINE-lab/oarfish) )
43-
44-
Small test datasets for the pipeline are included in the [assets directory](https://github.com/number-25/LongTranscriptomics/assets/test_data).
40+
<!-- 7. Fusion gene detection [`JAFFA`](github.com/Oshlack/JAFFA) -->
41+
7. Transcriptome assessment [`gffcompare`](https://ccb.jhu.edu/software/stringtie/gff.shtml)
42+
8. Transcript quantification
43+
1. [oarfish](https://github.com/COMBINE-lab/oarfish)
44+
<!-- ( [`TranSigner`](https://github.com/haydenji0731/TranSigner),
45+
Small test datasets for the pipeline are included in the [assets directory](https://github.com/number-25/LongTranscriptomics/assets/test_data). -->
4546

4647
## Usage
4748

docs/output.md

Lines changed: 168 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -29,11 +29,10 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
2929
- [bambu](#bambu)
3030
- [IsoQuant](#IsoQuant)
3131
- [StringTie](#StringTie)
32-
<!-- 7. Fusion gene detection [`JAFFA`](github.com/Oshlack/JAFFA) -->
32+
<!-- 7. Fusion gene detection [`JAFFA`](github.com/Oshlack/JAFFA) -->
3333
- [Transcriptome assessment](#Transcriptome-assessment)
34-
- [gffutils](#gffutils)
34+
- [gffcompare](#gffcompare)
3535
- [Transcript quantification](#Transcript-quantification)
36-
- [TranSigner](#TranSigner)
3736
- [oarfish](#oarfish)
3837
- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
3938
- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
@@ -117,34 +116,32 @@ statistics in both ??
117116
<details markdown="1">
118117
<summary>Output files</summary>
119118

120-
- `multiqc/`
121-
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
122-
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
123-
- `multiqc_plots/`: directory containing static images from the report in various formats.
119+
- `mapping/`
120+
- `*_minimap2.sam`: the mapped output SAM file from minimap2.
124121

125122
</details>
126123

127-
[minimap2](https://github.com/lh3/minimap2) is perhaps the most popular
124+
[minimap2](https://github.com/lh3/minimap2) is perhaps the most popular
128125
long-read sequence aligner. In general, it aligns the sequence reads to the reference
129126
genome/transcriptome provided by the user. Taken directly from the developers
127+
130128
> Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA
131-
sequences against a large reference database. Typical use cases include: (1)
132-
mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2)
133-
finding overlaps between long reads with error rate up to ~15%; (3)
134-
splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads
135-
against a reference genome; (4) aligning Illumina single- or paired-end reads;
136-
(5) assembly-to-assembly alignment; (6) full-genome alignment between two
137-
closely related species with divergence below ~15%.
129+
> sequences against a large reference database. Typical use cases include: (1)
130+
> mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2)
131+
> finding overlaps between long reads with error rate up to ~15%; (3)
132+
> splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads
133+
> against a reference genome; (4) aligning Illumina single- or paired-end reads;
134+
> (5) assembly-to-assembly alignment; (6) full-genome alignment between two
135+
> closely related species with divergence below ~15%.
138136
139137
### samtools sort index
140138

141139
<details markdown="1">
142140
<summary>Output files</summary>
143141

144-
- `multiqc/`
145-
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
146-
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
147-
- `multiqc_plots/`: directory containing static images from the report in various formats.
142+
- `mapping/`
143+
- `*_minimap2.bam`: the mapped output BAM file that has been sorted by samtools.
144+
- `*_minimap2.bam.bai`: the index of the mapped output BAM file.
148145

149146
</details>
150147

@@ -160,10 +157,9 @@ this file.
160157
<details markdown="1">
161158
<summary>Output files</summary>
162159

163-
- `multiqc/`
164-
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
165-
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
166-
- `multiqc_plots/`: directory containing static images from the report in various formats.
160+
- `mapping_visualisation/`
161+
- `*_+.bedGraph`: the intermediary bedgraph file from the positive (+) strand.
162+
- `*_-.bedGraph`: the intermediary bedgraph file from the negative (-) strand.
167163

168164
</details>
169165

@@ -177,10 +173,9 @@ format, in preparation for conversion to BigWig.
177173
<details markdown="1">
178174
<summary>Output files</summary>
179175

180-
- `multiqc/`
181-
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
182-
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
183-
- `multiqc_plots/`: directory containing static images from the report in various formats.
176+
- `mapping_visualisation/`
177+
- `*_+.bigWig`: the bigWig file from the positive (+) strand.
178+
- `*_-.bigWig`: the bigWig file from the negative (-) strand.
184179

185180
</details>
186181

@@ -195,12 +190,27 @@ in a lightweight way.
195190

196191
### samtools flagstat
197192

193+
<details markdown="1">
194+
<summary>Output files</summary>
195+
196+
- `mapping_qc/samtools_flagstat/`
197+
- `*.flagstat.tsv`: the output of samtools flagstat in tsv format.
198+
</details>
199+
198200
[samtools](http://www.htslib.org/doc/#manual-pages) flagstat provides summary
199201
statistics on the mapped BAM file. Specifically, it counts the number of
200202
alignments for each FLAG type.
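
As a rough illustration of what counting alignments per FLAG type means, the sketch below tallies the bits of the SAM FLAG field for a handful of records. The bit values come from the SAM specification; the function name `count_flags` and the input list are hypothetical, and the real `samtools flagstat` report is more detailed (for example, it separates QC-passed from QC-failed reads).

```python
from collections import Counter

# Bit values of the FLAG field as defined by the SAM specification.
FLAG_BITS = {
    "paired": 0x1,
    "proper_pair": 0x2,
    "unmapped": 0x4,
    "mate_unmapped": 0x8,
    "reverse": 0x10,
    "mate_reverse": 0x20,
    "read1": 0x40,
    "read2": 0x80,
    "secondary": 0x100,
    "qc_fail": 0x200,
    "duplicate": 0x400,
    "supplementary": 0x800,
}


def count_flags(flags):
    """Count how many alignment records have each FLAG bit set.

    `flags` is an iterable of integer FLAG values (column 2 of a SAM record).
    This is a simplified tally, not a reimplementation of samtools flagstat.
    """
    counts = Counter({name: 0 for name in FLAG_BITS})
    total = 0
    for flag in flags:
        total += 1
        for name, bit in FLAG_BITS.items():
            if flag & bit:
                counts[name] += 1
    counts["total"] = total
    counts["mapped"] = total - counts["unmapped"]
    return dict(counts)


# Hypothetical FLAG values: forward, reverse, unmapped, secondary,
# supplementary, and supplementary on the reverse strand.
example = count_flags([0, 16, 4, 256, 2048, 2064])
print(example["total"], example["mapped"], example["supplementary"])
```

For long-read data, the secondary and supplementary counts are often the most informative, since chimeric or split alignments show up there.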
201203

202204
### cramino
203205

206+
<details markdown="1">
207+
<summary>Output files</summary>
208+
209+
- `mapping_qc/cramino/`
210+
- `*_cramino.stats`: the output of cramino.
211+
212+
</details>
213+
204214
[cramino](https://github.com/wdecoster/cramino) is a tool for quick quality assessment of CRAM and BAM files, intended for long-read sequencing.
205215

206216
```
@@ -219,22 +229,50 @@ Creation time 09/09/2022 10:53:36
219229

220230
### alfred
221231

232+
<details markdown="1">
233+
<summary>Output files</summary>
234+
235+
- `mapping_qc/alfred/`
236+
- `*_alfred.transposed.stats`: the transposed output of alfred.
237+
- `*_alfred.tsv.gz`: the output of alfred in gzipped tsv format.
238+
239+
</details>
240+
222241
[alfred](https://www.gear-genomics.com/docs/alfred/cli/) computes various
223242
alignment metrics and summary statistics by read group.
224243

225244
### ngs-bits
226245

246+
<details markdown="1">
247+
<summary>Output files</summary>
248+
249+
- `multiqc/`
250+
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
251+
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
252+
- `multiqc_plots/`: directory containing static images from the report in various formats.
253+
254+
</details>
255+
227256
[ngs-bits
228257
mappingQC](https://github.com/imgag/ngs-bits/blob/master/doc/tools/MappingQC/index.md)
229258
provides one more technique for quality control of the mapped BAM files. Its
230259
advantage is that it has an output that is compatible with
231260
[MultiQC](https://github.com/MultiQC/MultiQC/blob/main/docs/markdown/modules/ngsbits.md).
232261

233-
234262
## Transcriptome reconstruction
235263

236264
### FLAIR
237265

266+
<details markdown="1">
267+
<summary>Output files</summary>
268+
269+
- `multiqc/`
270+
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
271+
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
272+
- `multiqc_plots/`: directory containing static images from the report in various formats.
273+
274+
</details>
275+
238276
[FLAIR](https://github.com/BrooksLabUCSC/flair) **F**ull **L**ength
239277
**A**lternative **I**soform analysis of **R**NA is used for the correction,
240278
isoform definition, and alternative splicing analysis of noisy reads. FLAIR has
@@ -247,11 +285,56 @@ instead be putative variants which should not be corrected.
247285

248286
### bambu
249287

288+
<details markdown="1">
289+
<summary>Output files</summary>
290+
291+
- `multiqc/`
292+
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
293+
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
294+
- `multiqc_plots/`: directory containing static images from the report in various formats.
295+
296+
</details>
297+
298+
[bambu](https://github.com/GoekeLab/bambu) provides reference-guided transcript discovery and quantification for long-read RNA-Seq data. It performs only very slight splice site correction, and is among the most widely used long-read transcript reconstruction tools.
299+
250300
### IsoQuant
251301

302+
<details markdown="1">
303+
<summary>Output files</summary>
304+
305+
- `multiqc/`
306+
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
307+
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
308+
- `multiqc_plots/`: directory containing static images from the report in various formats.
309+
310+
</details>
311+
312+
[IsoQuant](https://github.com/ablab/IsoQuant) is a tool for the genome-based analysis of long RNA
313+
reads, such as PacBio or Oxford Nanopore. IsoQuant can reconstruct and
314+
quantify transcript models with high precision and decent recall. If the
315+
reference annotation is given, IsoQuant also assigns reads to the annotated
316+
isoforms based on their intron and exon structure. IsoQuant further performs
317+
annotated gene, isoform, exon and intron quantification. If reads are grouped
318+
(e.g. according to cell type), counts are reported according to the provided
319+
grouping. IsoQuant, like FLAIR, provides optional read correction capabilities, which should be used accordingly.
320+
252321
### StringTie
253322

323+
<details markdown="1">
324+
<summary>Output files</summary>
325+
326+
- `multiqc/`
327+
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
328+
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
329+
- `multiqc_plots/`: directory containing static images from the report in various formats.
330+
331+
</details>
254332

333+
[StringTie](https://ccb.jhu.edu/software/stringtie/) is a fast and highly
334+
efficient assembler of RNA-Seq alignments into potential transcripts. It uses a
335+
novel network flow algorithm as well as an optional de novo assembly step to
336+
assemble and quantitate full-length transcripts representing multiple splice
337+
variants for each gene locus. StringTie does not perform read correction.
255338

256339
<details markdown="1">
257340
<summary>Output files</summary>
@@ -263,11 +346,67 @@ instead be putative variants which should not be corrected.
263346

264347
</details>
265348

349+
## Transcriptome assessment
350+
351+
### gffcompare
352+
353+
<details markdown="1">
354+
<summary>Output files</summary>
355+
356+
- `multiqc/`
357+
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
358+
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
359+
- `multiqc_plots/`: directory containing static images from the report in various formats.
360+
361+
</details>
362+
363+
[gffcompare](https://ccb.jhu.edu/software/stringtie/gff.shtml#gffcompare) can be used to compare, merge, annotate and estimate
364+
accuracy of one or more GFF files (the "query" files), when compared with a
365+
reference annotation (also provided as GFF/GTF).
366+
367+
```
368+
#= Summary for dataset: stringtie_asm.gtf
369+
# Query mRNAs : 23555 in 17628 loci (17231 multi-exon transcripts)
370+
# (3731 multi-transcript loci, ~1.3 transcripts per locus)
371+
# Reference mRNAs : 16628 in 12062 loci (15850 multi-exon)
372+
# Super-loci w/ reference transcripts: 11552
373+
#-----------------| Sensitivity | Precision |
374+
Base level: 82.4 | 76.5 |
375+
Exon level: 81.2 | 82.9 |
376+
Intron level: 86.1 | 94.8 |
377+
Intron chain level: 56.9 | 52.4 |
378+
Transcript level: 55.2 | 38.9 |
379+
Locus level: 70.1 | 48.0 |
380+
```
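
The Sensitivity and Precision columns in a summary like the one above are percentages of matching features relative to the reference and to the query, respectively. The sketch below shows that arithmetic only; the function name and the counts are hypothetical, and gffcompare applies its own matching and filtering rules before computing these values.

```python
def sensitivity_precision(n_matching, n_reference, n_query):
    """Return (sensitivity, precision) as percentages rounded to one decimal.

    sensitivity = matching features / reference features
    precision   = matching features / query features
    A zero denominator yields 0.0 rather than raising an error.
    """
    sensitivity = 100.0 * n_matching / n_reference if n_reference else 0.0
    precision = 100.0 * n_matching / n_query if n_query else 0.0
    return round(sensitivity, 1), round(precision, 1)


# Hypothetical counts: 50 query transcripts match, out of
# 80 reference transcripts and 100 query transcripts.
sens, prec = sensitivity_precision(50, 80, 100)
print(sens, prec)
```

A reconstruction that reports many novel transcripts tends to lower precision without affecting sensitivity, which is why the two columns are best read together.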
381+
382+
## Transcript quantification
383+
384+
### Oarfish
385+
386+
<details markdown="1">
387+
<summary>Output files</summary>
388+
389+
- `multiqc/`
390+
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
391+
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
392+
- `multiqc_plots/`: directory containing static images from the report in various formats.
393+
394+
</details>
395+
396+
[oarfish](https://github.com/COMBINE-lab/oarfish) is a program for quantifying
397+
transcript-level expression from long-read (i.e. Oxford Nanopore cDNA and
398+
direct RNA and PacBio) sequencing technologies. oarfish requires a sample of
399+
sequencing reads aligned to the transcriptome (currently not to the genome). It
400+
handles multi-mapping reads through the use of probabilistic allocation via an
401+
expectation-maximization (EM) algorithm.
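
To make the idea of probabilistic allocation concrete, here is a minimal sketch of a plain EM loop that splits multi-mapping reads across transcripts in proportion to the current abundance estimates. It is not oarfish's actual model (which uses additional information beyond read compatibility); the function name `em_abundances` and the toy input are assumptions for illustration.

```python
def em_abundances(read_compat, n_transcripts, n_iter=100):
    """Estimate relative transcript abundances with a basic EM loop.

    read_compat: list with one entry per read, each a list of the
                 transcript indices that read aligns to.
    Returns (theta, counts): abundance fractions and expected read counts.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    counts = [0.0] * n_transcripts
    for _ in range(n_iter):
        # E-step: split each read across its compatible transcripts,
        # weighted by the current abundance estimates.
        counts = [0.0] * n_transcripts
        for compat in read_compat:
            denom = sum(theta[t] for t in compat)
            if denom == 0.0:
                continue
            for t in compat:
                counts[t] += theta[t] / denom
        # M-step: renormalise expected counts into abundance fractions.
        total = sum(counts)
        if total == 0.0:
            break
        theta = [c / total for c in counts]
    return theta, counts


# Toy input: three reads unique to transcript 0, one unique to
# transcript 1, and one read that maps to both.
theta, counts = em_abundances([[0], [0], [0], [1], [0, 1]], n_transcripts=2)
print(theta, counts)
```

In this toy case the shared read ends up mostly assigned to transcript 0, because the uniquely mapping reads already suggest it is the more abundant of the two.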
402+
403+
## MultiQC
404+
266405
[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
267406

268407
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
269408

270-
### Pipeline information
409+
## Pipeline information
271410

272411
<details markdown="1">
273412
<summary>Output files</summary>
