Commit a4cfb2c

adding final additions to docs/outputs
1 parent 02d4b13 commit a4cfb2c

4 files changed

Lines changed: 61 additions & 38 deletions

docs/output.md

Lines changed: 59 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -29,6 +29,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
2929
- [bambu](#bambu)
3030
- [IsoQuant](#IsoQuant)
3131
- [StringTie](#StringTie)
32+
- [TranscriptToFasta](#Transcripts-FASTA)
3233
<!-- 7. Fusion gene detection [`JAFFA`](github.com/Oshlack/JAFFA) -->
3334
- [Transcriptome assessment](#Transcriptome-assessment)
3435
- [gffcompare](#gffcompare)
@@ -254,20 +255,15 @@ alignment metrics and summary statistics by read group. The transposed output is
254255
<details markdown="1">
255256
<summary>Output files</summary>
256257

257-
#### TODO
258-
259-
- `mapping_qc/ngf-bits/`
260-
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
261-
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
262-
- `multiqc_plots/`: directory containing static images from the report in various formats.
258+
- `mapping_qc/ngs-bits/`
259+
- `*_ngsbits.qcML`: ngsbits summary file in XML format. Used as input to MultiQC.
263260

264261
</details>
265262

266-
[ngs-bits
267-
mappingQC](https://github.com/imgag/ngs-bits/blob/master/doc/tools/MappingQC/index.md)
263+
[ngs-bits mappingQC](https://github.com/imgag/ngs-bits/blob/master/doc/tools/MappingQC/index.md)
268264
provides one more technique for quality control of the mapped BAM files. Its
269265
advantage is that it has an output that is compatible with
270-
[MultiQC](https://github.com/MultiQC/MultiQC/blob/main/docs/markdown/modules/ngsbits.md).
266+
[MultiQC](https://github.com/MultiQC/MultiQC/blob/main/docs/markdown/modules/ngsbits.md). Output is written in XML, so it is not particularly human-readable.
271267

272268
## Transcriptome reconstruction
273269

@@ -276,20 +272,26 @@ advantage is that it has an output that is compatible with
276272
<details markdown="1">
277273
<summary>Output files</summary>
278274

279-
- `multiqc/`
280-
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
281-
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
282-
- `multiqc_plots/`: directory containing static images from the report in various formats.
275+
- `transcript_reconstruction/flair/bam_to_bed/`
276+
- `*_.bed`: genome alignment that has been converted from BAM to BED format, to be used as input to FLAIR.
277+
- `correct/` (optional)
278+
- `*_flair_correct_all_corrected.bed`: BED file of corrected reads that is used in subsequent steps.
279+
- `*_flair_correct_all_inconsistent.bed`: BED file of rejected alignments.
280+
- `*_flair_correct_cannot_verify.bed`: BED file of alignments that could not be verified (only produced if a chromosome is not found in the annotation).
281+
- `collapse/`
282+
- `*_collapsed_isoforms.bed`: BED file of high confidence isoforms.
283+
- `*_collapsed_isoforms.gtf`: as above but in GTF format.
284+
- `*_collapsed_isoforms.fa`: FASTA sequences of high confidence isoforms.
283285

284286
</details>
285287

286288
[FLAIR](https://github.com/BrooksLabUCSC/flair) **F**ull **L**ength
287289
**A**lternative **I**soform analysis of **R**NA is used for the correction,
288290
isoform definition, and alternative splicing analysis of noisy reads. FLAIR has
289291
primarily been used for nanopore cDNA, native RNA, and PacBio sequencing reads.
290-
FLAIR is able to be used with and without read correction, making it amenable to
292+
FLAIR can be run with or without read correction (splice-site correction), making it amenable to
291293
sensitive sample types, such as those coming from cancer where errors may
292-
instead be putative variants which should not be corrected.
294+
instead be putative variants which should not be corrected. FLAIR takes a BED file as input, so the aligned BAM file is always converted to BED format first.
293295

294296
![FLAIR - example schematic](images/flair_workflow_compartmentalized.png)
295297

@@ -298,10 +300,10 @@ instead be putative variants which should not be corrected.
298300
<details markdown="1">
299301
<summary>Output files</summary>
300302

301-
- `multiqc/`
302-
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
303-
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
304-
- `multiqc_plots/`: directory containing static images from the report in various formats.
303+
- `transcript_reconstruction/bambu/`
304+
- `counts_gene.txt`: gene level estimated counts.
305+
- `counts_transcript.txt`: transcript level estimated counts.
306+
- `extended_annotations.gtf`: contains all transcript models from the reference annotations and any novel high confidence transcript models (below NDR threshold).
305307

306308
</details>
307309

@@ -312,9 +314,20 @@ instead be putative variants which should not be corrected.
312314
<details markdown="1">
313315
<summary>Output files</summary>
314316

315-
- `multiqc/`
316-
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
317-
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
317+
- `transcript_reconstruction/isoquant/`
318+
- `*_isoquant.corrected_reads.bed.gz`: BED file with corrected read alignments (gzipped by default).
319+
- `*_isoquant.discovered_gene_counts.tsv`: raw read counts for discovered genes (corresponds to SAMPLE_ID.transcript_models.gtf).
320+
- `*_isoquant.discovered_gene_tpm.tsv`: expression of discovered genes in TPM (corresponds to SAMPLE_ID.transcript_models.gtf).
321+
- `*_isoquant.discovered_transcript_counts.tsv`: raw read counts for discovered transcript models (corresponds to SAMPLE_ID.transcript_models.gtf).
322+
- `*_isoquant.discovered_transcript_tpm.tsv`: expression of discovered transcript models in TPM (corresponds to SAMPLE_ID.transcript_models.gtf).
323+
- `*_isoquant.extended_annotation.gtf`: GTF file with the entire reference annotation plus all discovered novel transcripts.
324+
- `*_isoquant.gene_counts.tsv`: TSV file with raw read counts for reference genes.
325+
- `*_isoquant.gene_tpm.tsv`: TSV file with reference gene expression in TPM.
326+
- `*_isoquant.transcript_counts.tsv`: TSV file with raw read counts for reference transcripts.
327+
- `*_isoquant.transcript_tpm.tsv`: TSV file with reference transcript expression in TPM.
328+
- `*_isoquant.read_assignments.tsv.gz`: TSV file with read to isoform assignments (gzipped by default).
329+
- `*_isoquant.transcript_model_reads.tsv.gz`: TSV file indicating which reads contributed to transcript models (gzipped by default).
330+
- `*_isoquant.transcript_models.gtf`: GTF file with discovered expressed transcripts (both known and novel transcripts).
318331

319332
</details>
320333

@@ -332,39 +345,49 @@ grouping. IsoQuant, like FLAIR, provides optional read correction capabilities,
332345
<details markdown="1">
333346
<summary>Output files</summary>
334347

335-
- `transcript_reconstruction/stringtie`
336-
- `KCMF1.1.stringtie.coverage.gtf`: a standalone HTML file that can be viewed in your web browser.
337-
- `KCMF1.1.stringtie.transcripts.gtf`: directory containing parsed statistics from the different tools used in the pipeline.
338-
339-
</details>
348+
- `transcript_reconstruction/stringtie/`
349+
- `*_stringtie.transcripts.gtf`: main output GTF file containing the assembled transcripts.
350+
- `*_stringtie.coverage.gtf`: fully covered transcripts that match the reference annotation, in GTF format.
340351

341352
[StringTie](https://ccb.jhu.edu/software/stringtie/) is a fast and highly
342353
efficient assembler of RNA-Seq alignments into potential transcripts. It uses a
343354
novel network flow algorithm as well as an optional de novo assembly step to
344355
assemble and quantitate full-length transcripts representing multiple splice
345356
variants for each gene locus. StringTie does not perform read correction.
357+
i
358+
359+
</details>
360+
361+
### Transcripts FASTA
346362

347363
<details markdown="1">
348364
<summary>Output files</summary>
349365

350-
- `multiqc/`
351-
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
352-
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
353-
- `multiqc_plots/`: directory containing static images from the report in various formats.
366+
- `transcript_reconstruction/transcripts_fasta/`
367+
- `*_bambu_transcripts.fa`: FASTA sequences of all assembled transcripts from bambu.
368+
- `*_isoquant_transcripts.fa`: FASTA sequences of all assembled transcripts from isoquant.
369+
- `*_stringtie_transcripts.fa`: FASTA sequences of all assembled transcripts from stringtie.
354370

355371
</details>
356372

373+
This is a straightforward module that simply converts the
374+
reconstructed/assembled transcripts from the various tools into their FASTA
375+
sequences. If using FLAIR, the transcript sequences in FASTA format will be
376+
created by the tool itself.
377+
357378
## Transcriptome assessment
358379

359380
### gffcompare
360381

361382
<details markdown="1">
362383
<summary>Output files</summary>
363384

364-
- `multiqc/`
365-
- `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
366-
- `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
367-
- `multiqc_plots/`: directory containing static images from the report in various formats.
385+
- `transcriptome_assessment/{flair,isoquant,bambu,stringtie}/`
386+
- `*_{flair,isoquant,bambu,stringtie}.annotated.gtf`: input transcriptome GTF file annotated with the reference transcriptome provided.
387+
- `*_{flair,isoquant,bambu,stringtie}.gtf.refmap`: this tab-delimited file lists, for each reference transcript, which query transcript either fully or partially matches that reference transcript.
388+
- `*_{flair,isoquant,bambu,stringtie}.gtf.tmap`: this tab-delimited file lists the most closely matching reference transcript for each query transcript.
389+
- `*_{flair,isoquant,bambu,stringtie}.stats`: in this output file GffCompare reports various statistics related to the "accuracy" (or a measure of agreement) of the input transcripts when compared to the reference annotation. These accuracy measures are calculated under the assumption that the input GFF/GTF file(s) (the "query" transcripts, or transfrags, from one or multiple "samples") come from a transcript discovery/assembly pipeline (e.g. Cufflinks or StringTie) or from any other gene/transcript prediction pipeline. GffCompare can be used to assess the accuracy of such pipelines by comparing their results to a known reference annotation.
390+
- `*_{flair,isoquant,bambu,stringtie}.tracking`: this file matches transcripts up between samples. Each row represents a transcript structure that is preserved (structurally equivalent) across all the input GTF files. GffCompare considers transcripts "matching" (i.e. structurally equivalent) if all their introns are identical. Note that "matching" transcripts are allowed to differ on the length of the first and last exons, since these lengths can usually vary across samples for the same biological transcript.
368391

369392
</details>
370393

@@ -393,7 +416,7 @@ Intron chain level: 56.9 | 52.4 |
393416
<details markdown="1">
394417
<summary>Output files</summary>
395418

396-
- `transcript_quantification/oarfish/<samplename>/`
419+
- `transcript_quantification/oarfish/`
397420
- `*.quant.gz`: a tab separated file listing the quantified targets, as well as information about their length and other metadata. The num_reads column provides the estimate of the number of reads originating from each target.
398421
- `*.meta_info.json`: a JSON format file containing the parameters with which oarfish was run, and other relevant information from the processed sample apart from the actual transcript quantifications.
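
To illustrate the `*.quant.gz` description above, the following Python sketch loads the `num_reads` estimates from a gzipped, tab-separated quantification file. Only the `num_reads` column name comes from the documentation; treating the first column as the target name, and the function name `read_num_reads`, are assumptions made for this example.

```python
import csv
import gzip


def read_num_reads(path):
    """Return {target_name: num_reads} from a gzipped, tab-separated quant file.

    Assumptions: the file has a header row, the first column holds the
    target name, and a `num_reads` column holds the read estimate.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if not reader.fieldnames:
            return {}  # empty file: nothing to report
        name_column = reader.fieldnames[0]
        return {row[name_column]: float(row["num_reads"]) for row in reader}
```

The values are parsed as floats because the column is described as an estimate, so it may not be a whole number.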
399422

modules/local/flair/bam_to_bed12/nextflow.config

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
process {
22
withName: 'BAM_TO_BED12' {
33
publishDir = [
4-
path: { "${params.outdir}/transcript_reconstruction/flair/${meta.id}_${meta.replicate}" },
4+
path: { "${params.outdir}/transcript_reconstruction/flair/bam_to_bed/${meta.id}_${meta.replicate}" },
55
//mode: params.publish_dir_mode
66
]
77
}

modules/local/gffcompare/gffcompare/nextflow.config

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,7 @@
11
process {
22
withName: 'GFFCOMPARE' {
33
publishDir = [
4-
path: { "${params.outdir}/transcriptome_assessment/gffcompare/${meta.id}_${meta.replicate}" },
4+
path: { "${params.outdir}/transcriptome_assessment/gffcompare/${origin}/${meta.id}_${meta.replicate}" },
55
//mode: params.publish_dir_mode
66
]
77
}
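
To make the effect of the `publishDir` change above concrete, this small Python sketch mirrors how the new path template would expand. The sample values (`results`, `bambu`, `sampleA`, replicate `1`) are hypothetical, and the helper `gffcompare_publish_dir` is not part of the pipeline; where `origin` is defined is not shown in this hunk, so it is treated here as a plain string naming the reconstruction tool.

```python
from pathlib import PurePosixPath


def gffcompare_publish_dir(outdir, origin, sample_id, replicate):
    """Mirror the updated publishDir template:
    ${params.outdir}/transcriptome_assessment/gffcompare/${origin}/${meta.id}_${meta.replicate}
    """
    return (
        PurePosixPath(outdir)
        / "transcriptome_assessment"
        / "gffcompare"
        / origin
        / f"{sample_id}_{replicate}"
    )


# Hypothetical values, for illustration only.
print(gffcompare_publish_dir("results", "bambu", "sampleA", 1))
# prints: results/transcriptome_assessment/gffcompare/bambu/sampleA_1
```

With the extra `${origin}` level, results for the same sample from different reconstruction tools end up in separate directories.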
11.2 KB
Binary file not shown.
