Skip to content

Commit 7e374ed

Browse files
author
Rob Patro
committed
Merge branch 'develop-salmon' into develop
2 parents e4fd2a4 + 0efc774 commit 7e374ed

2 files changed

Lines changed: 32 additions & 48 deletions

File tree

README.md

Lines changed: 9 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -2,18 +2,11 @@
22

33
# What is RapMap?
44

5-
RapMap is a testing ground for ideas in quasi-mapping / (lightweight / pseudo) transcriptome alignment. That means that, at this point, it is somewhat experimental. The `develop` branch will have the latest improvements and additions, but is not guaranteed to be stable between commits. Breaking changes to the master branch will be accompanied by a tag to the version before the breaking change. Currently, RapMap is a stand-alone quasi-mapper that can be used with other tools. It is also being used as part of [Sailfish](https://github.com/kingsfordgroup/sailfish) and [Salmon](https://github.com/COMBINE-lab/salmon). Eventually, the hope is to create and stabilize an API so that it can be used as a library from other tools.
6-
7-
Quasi-mapping / (lightweight / pseudo)-alignment is the term we are using here for the type of information required for certain tasks (e.g.
8-
transcript quantification) that is less "heavyweight" than what is provided by traditional alignment. For example, one may
9-
only need to know the transcripts / contigs to which a read aligns and, perhaps, the position within those transcripts rather
10-
than the optimal alignment and base-for-base `CIGAR` string that aligns the read and substring of the transcript. For details on RapMap (quasi-mapping in particular), please check out the [associated paper](http://bioinformatics.oxfordjournals.org/content/32/12/i192.full.pdf). Note: RapMap implements both quasi-mapping and pseudo-alignment (originally introduced in [Bray et al. 2016](http://www.nature.com/nbt/journal/v34/n5/full/nbt.3519.html)), these two are not the same thing. They are distinct concepts, and RapMap simply happens to implement algorithms for computing both.
11-
12-
There are a number of different ways to collect such information, and the idea of RapMap (as the repository grows) will be to explore multiple different strategies in how to most rapidly determine all *feasible* / *compatible* locations for a read within the transcriptome. In this sense, it is like an *all-mapper*; the mappings it outputs are intended to be (eventually) disambiguated (*Really, it's more like an "all-best" mapper, since it returns all hits in the top "stratum" of quasi-mapping / (lightweight / pseudo)-alignments*). If there is a need for it, *best-mapper* functionality may be added in the future.
5+
RapMap is a testing ground for ideas in quasi-mapping and selective alignment. That means that, at this point, it is somewhat experimental. The `develop` branch will have the latest improvements and additions, but is not guaranteed to be stable between commits. Breaking changes to the master branch will be accompanied by a tag to the version before the breaking change. Currently, RapMap is a stand-alone quasi-mapper that can be used with other tools. It is also being used as part of [Salmon](https://github.com/COMBINE-lab/salmon) and [Sailfish](https://github.com/kingsfordgroup/sailfish). Eventually, the hope is to create and stabilize an API so that it can be used as a library from other tools.
136

147
# Building RapMap
158

16-
To build RapMap, you need a C++11 compliant compiler (g++ >= 4.7 and clang >= 3.4) and CMake. RapMap is built with the following steps (assuming that `path_to_rapmap` is the toplevel directory where you have cloned this repository):
9+
To build RapMap, you need a C++14 compliant compiler (g++ >= 4.9 and clang >= 3.4) and CMake (>= 3.9). RapMap is built with the following steps (assuming that `path_to_rapmap` is the toplevel directory where you have cloned this repository):
1710

1811
```
1912
[path_to_rapmap] > mkdir build && cd build
@@ -23,6 +16,7 @@ To build RapMap, you need a C++11 compliant compiler (g++ >= 4.7 and clang >= 3.
2316
[path_to_rapmap/build] > cd ../bin
2417
[path_to_rapmap/bin] > ./rapmap -h
2518
```
19+
2620
This should output the standard help message for rapmap.
2721

2822
# Using RapMap
@@ -44,48 +38,31 @@ the `-p` option enables the minimum perfect hash and `-x 4` tells RapMap to use
4438
The index itself will record whether it was built with the aid of minimum perfect hashing or not, so no extra information concerning this need be provided when mapping. For the purposes of this example, we'll assume that we wish to map paired-end reads with the first mates in the file `r1.fq.gz` and the second mates in the file `r2.fq.gz`. We can perform the mapping like so:
4539

4640
```
47-
> rapmap quasimap -i ref_index -1 <(gunzip -c r1.fq.gz) -2 <(gunzip -c r2.fq.gz) -t 8 -o mapped_reads.sam
41+
> rapmap quasimap -i ref_index -1 r1.fq.gz -2 r2.fq.gz -s -t 8 -o mapped_reads.sam
4842
```
4943

50-
This will tell RapMap to map the paired-end reads using 8 threads, and to write the resulting `SAM` records to the file `mapped_reads.sam`. The SAM format is rather verbose, and so such output files can be rather large (and slow to write) if you're mapping many reads. For that reason, we recommend that you use [samtools](http://www.htslib.org/) to convert the `SAM` file to a `BAM` file on-the-fly. Assuming `samtools` is installed an in your path, that can be accomplished with the following command:
44+
This will tell RapMap to map the paired-end reads using 8 threads, and to write the resulting `SAM` records to the file `mapped_reads.sam`. The `-s` flag tells RapMap to use selective alignment to generate better mappings and to validate the alignment _score_ of hits. The SAM format is rather verbose, and so such output files can be rather large (and slow to write) if you're mapping many reads. For that reason, we recommend that you use [samtools](http://www.htslib.org/) to convert the `SAM` file to a `BAM` file on-the-fly. Assuming `samtools` is installed an in your path, that can be accomplished with the following command:
5145

5246
```
53-
> rapmap quasimap -i ref_index -1 <(gunzip -c r1.fq.gz) -2 <(gunzip -c r2.fq.gz) -t 8 | samtools view -Sb -@ 4 - > mapped_reads.bam
47+
> rapmap quasimap -i ref_index -1 r1.fq.gz -2 r2.fq.gz -s -t 8 | samtools view -Sb -@ 4 - > mapped_reads.bam
5448
```
5549

5650
This will stream the output from RapMap to standard out, and then convert it into a `BAM` file (using up to an additional 4 threads for `BAM` compression) and write the resulting output to the file `mapped_reads.bam`. To reduce the amount that needs to be typed in the common case, and to prevent the user from having to remember invocations like the above, we inclde a simple wrapper script that simplifies this process. After installing RapMap, there should be a script called `RunRapMap.sh` in the `bin` directory of whereever you have chosen to install RapMap. You can issue a command equivalent to the above using this scrpt as follows:
5751

5852
```
59-
> RunRapMap.sh quasimap -i ref_index -1 <(gunzip -c r1.fq.gz) -2 <(gunzip -c r2.fq.gz) -t 8 --bamOut mapped_reads.sam --bamThreads 4
53+
> RunRapMap.sh quasimap -i ref_index -1 r1.fq.gz -2 r2.fq.gz -s -t 8 --bamOut mapped_reads.sam --bamThreads 4
6054
```
6155

6256
This will run RapMap with a command equivalent to the one mentioned above. If you leave out the `--bamThreads` argument, then a single thread will be used for compression. The `RunRapMap.sh` script can be used even if you don't wish to write the output to `BAM` format; in that case it is simply equivalent to running whichever command you pass with the `rapmap` executable itself.
6357

6458
# Can I use RapMap for genomic alignment?
6559

66-
No, at least not right now. The index and mapping strategy employed by RapMap are highly geared toward mapping to transcriptomes. It may be the case that some of these ideas can be successfully applied to genomic alignment, but
67-
this functionality is not currently suppored (and is not a high priority right now).
68-
69-
# How fast is RapMap?
70-
71-
Speed is relative, but we think it's very fast: On a synthetic test dataset comprised of 75 million 76bp paired-end reads, mapping to a human transcriptome with ~213,000 transcripts, RapMap takes ~ 10 minutes to align all of the reads *on a single core* (on an Intel Xeon E5-2690 @ 3.00 GHz) --- if you actually want to write out the alignments --- it depends on you disk speed, but for us it's ~15 minutes. Again, these mapping times are *on a single core* --- but RapMap is trivially parallelizable and can be run with multiple threads. Additionally, there are other optimizations we are currently exploring.
72-
73-
# OK, that's fast, but is it accurate?
74-
75-
Yes; quasi-mapping seems to provide accurate mapping results. In the above mentioned synthetic dataset (generated *with* sequencing errors), the true location of origin of the read appears in the hits returned by RapMap > 97% of the time. For more details, please refer to [the paper](http://bioinformatics.oxfordjournals.org/content/32/12/i192.full.pdf).
60+
The index and mapping strategy employed by RapMap are highly geared toward mapping to transcriptomes. This means that RapMap will likely use _a lot_ of memory when indexing and mapping to mammalian-sized genomes, though it's possible. We have succesfully applied RapMap to map reads to collections of baterial and viral genomes, however.
7661

7762
# Caveats
7863

7964
RapMap is experimental, and the code, at this point, is subject to me testing out new ideas (see the description above about the master vs. develop branch). This also means that limited effort has been put into size or speed optimizaiton. There are numerous ways that the code can be sped up and the memory footprint reduced, but that hasn't been the focus yet --- it will be eventualy. All of this being said --- RapMap is open to the community because I'd like feedback / help / thoughts. A contribution policy is forthcoming. So, if you're not scared off by any of the above, please *dig in*!
8065

81-
# External dependencies
82-
83-
[tclap](http://tclap.sourceforge.net/)
84-
85-
[cereal](https://github.com/USCiLab/cereal)
86-
87-
[jellyfish](https://github.com/gmarcais/Jellyfish)
88-
8966
# License
9067

91-
Since RapMap uses Jellyfish, it must be released under the GPL. However, this is currently the only GPL dependency. If it can be replaced, I'd like to re-license RapMap under the BSD license. I'd be happy to accept pull-requests that replace the Jellyfish components with a library released under a more liberal license (BSD-compatible), but note that I will *not* accept such pull requests if they reduce the speed or increase the memory consumption over the Jellyfish-based version.
68+
Since RapMap uses Vinga's [rank implementation](https://github.com/COMBINE-lab/RapMap/blob/master/include/rank9b.h), it must be released under the GPL. However, this is currently the only GPL dependency. If it can be replaced, I'd like to re-license RapMap under a BSD license. I'd be happy to accept pull-requests that replace this rank implementation with a library released under a more liberal license (BSD-compatible), but note that I will *not* accept such pull requests if they reduce the speed or increase the memory consumption over the current version.

src/RapMapSAMapper.cpp

Lines changed: 23 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -908,6 +908,15 @@ bool mapReads(RapMapIndexT& rmi,
908908
bool validateOpts(MappingOpts& mopts, spdlog::logger* log) {
909909
bool valid{true};
910910

911+
if (!mopts.selAln) {
912+
log->warn("\n\n"
913+
"NOTE: It appears you are running rapmap without the selective-alignment (`--selAln`) option.\n"
914+
"Selective alignment can generally improve both the sensitivity and specificity of mapping,\n"
915+
"with only a moderate increase in use of computational resources. \n"
916+
"Unless there is a specific reason to do this (e.g. testing on clean simulated data),\n"
917+
"`--selAln` is generally recommended.\n");
918+
}
919+
911920
if (mopts.maxMMPExtension < 1) {
912921
log->error("--maxMMPExtension must be at least 1, but {} was provided.", mopts.maxMMPExtension);
913922
valid = false;
@@ -1009,38 +1018,36 @@ int rapMapSAMap(int argc, char* argv[]) {
10091018
TCLAP::SwitchArg mimicStrictBT2("", "mimicStrictBT2", "[only with selAln]: mimic strict Bowtie2-like default params (e.g. like those recommended for running RSEM)", false);
10101019
TCLAP::ValueArg<int32_t> maxMMPExtension("", "maxMMPExtension", "[only with selAln or with chaining]: maximum allowable MMP extension", false, 7, "positive integer > 1");
10111020

1012-
1013-
cmd.add(index);
10141021
cmd.add(noout);
1015-
1016-
cmd.add(read1);
1017-
cmd.add(read2);
1018-
cmd.add(unmatedReads);
1019-
cmd.add(outname);
1020-
cmd.add(numThreads);
10211022
cmd.add(maxNumHits);
10221023
cmd.add(quasiCov);
10231024
cmd.add(nosensitive);
10241025
cmd.add(noStrict);
10251026
cmd.add(fuzzy);
10261027
cmd.add(chain);
1027-
cmd.add(compressedOutput);
10281028
cmd.add(quiet);
1029+
cmd.add(hardFilter);
10291030
cmd.add(recoverOrphans);
10301031
cmd.add(noDovetail);
10311032
cmd.add(noOrphans);
1032-
cmd.add(selAln);
1033-
cmd.add(gapOpenPen);
1033+
cmd.add(dpBandwidth);
10341034
cmd.add(gapExtendPen);
1035+
cmd.add(gapOpenPen);
10351036
cmd.add(mismatchPen);
10361037
cmd.add(matchScore);
1037-
cmd.add(dpBandwidth);
1038-
cmd.add(minScoreFrac);
10391038
cmd.add(consensusSlack);
1040-
cmd.add(hardFilter);
1041-
cmd.add(mimicBT2);
1042-
cmd.add(mimicStrictBT2);
10431039
cmd.add(maxMMPExtension);
1040+
cmd.add(minScoreFrac);
1041+
cmd.add(mimicStrictBT2);
1042+
cmd.add(mimicBT2);
1043+
cmd.add(selAln);
1044+
cmd.add(numThreads);
1045+
cmd.add(compressedOutput);
1046+
cmd.add(outname);
1047+
cmd.add(unmatedReads);
1048+
cmd.add(read2);
1049+
cmd.add(read1);
1050+
cmd.add(index);
10441051

10451052
auto consoleSink = std::make_shared<spdlog::sinks::ansicolor_stderr_sink_mt>();
10461053
auto consoleLog = spdlog::create("stderrLog", {consoleSink});

0 commit comments

Comments
 (0)