Better support for read groups in distmap integration

@robmaz would like to integrate the capabilities of ReadTools for any supported format (currently FASTQ/SAM/BAM/CRAM) into the distmap pipeline in a better way. The current pipeline is the following (every step is called from within the distmap software):

* `ReadTools ReadsToDistmap`: uploads the reads (with optional trimming) to HDFS in the compact and splittable distmap-format. The current implementation only keeps the barcodes in the read name, but if barcode de-multiplexing has already been performed, keeping the read groups (`@RG`) is desirable to mark the reads properly. One suggestion is to dump the header with the `@RG` lines for later use on download (see below and #510), but this will cause problems if multiple read groups are present, because reads cannot be re-assigned without a full de-multiplexing run.
* Then, distmap maps the reads. Internally, it converts the distmap-format into FASTQ, runs the mapper and outputs part-files in the SAM/BAM format. As each mapper has its own features, we cannot make any assumptions about what the header will look like (including `@RG` header lines). This is one of the limiting factors outside our control.
* `ReadTools DownloadDistmapResult`: downloads the part files (SAM/BAM) from HDFS and merges them into a combined file on the local path. It would be nice to provide a SAM header with read groups (or a master SAM header with more information) to be merged with the ones downloaded from the distmap run (requested in #511), but this is not trivial: the merge needs specific rules and requires re-assigning a read group to each read (as in the first step).
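
To make the header-dump suggestion from the first step concrete, here is a minimal Python sketch of pulling the `@RG` lines out of SAM header text so they could be stored on upload and consulted on download. It is not ReadTools code (ReadTools itself is a Java tool); the function name `extract_read_groups` and the example header are made up for illustration.

```python
def extract_read_groups(header_text):
    """Return one dict per @RG line of a SAM header, keyed by two-letter tag."""
    read_groups = []
    for line in header_text.splitlines():
        if not line.startswith("@RG\t"):
            continue  # skip @HD, @SQ, @PG, @CO and anything else
        fields = {}
        for field in line.split("\t")[1:]:
            # Split on the first ':' only, since values may contain colons.
            tag, _, value = field.partition(":")
            fields[tag] = value
        read_groups.append(fields)
    return read_groups


# Hypothetical header with two read groups.
header = (
    "@HD\tVN:1.6\tSO:unsorted\n"
    "@RG\tID:rg1\tSM:sample1\n"
    "@RG\tID:rg2\tSM:sample2\n"
)
groups = extract_read_groups(header)
```

With more than one entry in `groups`, the dumped header alone does not say which read belongs to which group, which is the problem described above.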

To make it possible to round-trip reads->distmap->reads and keep the read group information from the original reads, there are several proposals under discussion:

1. Only allow one read group on download (suggested here: https://github.com/magicDGS/ReadTools/issues/511#issuecomment-415712396) and fail otherwise. This can be odd, because we would allow uploading/transforming reads from multiple `@RG` but not downloading them if we want to retrieve that information. This is the option that requires the least effort, as it will just fail for multiple `@RG` and assign the single one otherwise. Still, it will need some rules to merge the rest of the header fields (unless the `@RG` is the only header line allowed, apart from the version line).
2. Integrate a new distmap-format which supports adding barcodes to the read name if no `@RG` is present (`@{{read_name}}#{{barcode_seq}}`) or the read-group id/index (`@{{read_name}}#{{rg_id}}` or `@{{read_name}}#{{rg_idx}}`), which can be parsed afterwards. Some complications might arise from this: 1) the same version of ReadTools is always required for upload and download; 2) `@RG` handling is unsupported for the legacy distmap format; 3) a header is required while downloading if the ID/idx was used; 4) the raw-barcode information is lost if only the RG is handled. Nevertheless, this is just a first draft and can be modified to address these issues and discussed with @robmaz.
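
The read-name scheme in option 2 could look roughly like the Python sketch below. This is only an illustration of the `name#token` idea, not the actual distmap-format or ReadTools code; the helper names `tag_read_name`, `split_read_name` and `resolve_read_group` are hypothetical, and it assumes the appended token itself never contains `#`.

```python
def tag_read_name(read_name, token):
    """Append a barcode sequence, read-group ID or read-group index."""
    return f"{read_name}#{token}"


def split_read_name(tagged_name):
    """Split on the LAST '#', so names that already contain '#' survive."""
    name, sep, token = tagged_name.rpartition("#")
    if not sep:
        return tagged_name, None  # no token was attached
    return name, token


def resolve_read_group(token, read_group_ids):
    """Map an index token back to a read-group ID using the header order."""
    return read_group_ids[int(token)]


# Hypothetical read groups, in the order they appear in the header.
rg_ids = ["rg1", "rg2"]
tagged = tag_read_name("read42", rg_ids.index("rg2"))
name, token = split_read_name(tagged)
restored = resolve_read_group(token, rg_ids)
```

`resolve_read_group` needs the original list of read groups, which is complication 3 above: an index (or ID) in the read name is only meaningful if the header is available at download time.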

I think that a quick implementation of option 1 would be good to have this support to some extent, with a warning on upload and an error on download for more than 1 RG in the header file (saying that this limitation might be removed in the future), and then evolve the new format for distmap (#404) to contain information for the read group and maybe some arbitrary information. Another option is to change distmap to use the map-reduce code from Hadoop-BAM to split the input file, and completely remove the need for the custom distmap format.
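
The warn-on-upload / fail-on-download behaviour proposed for option 1 might be sketched like this in Python. The names (`MultipleReadGroupsError`, `check_upload`, `read_group_for_download`) and the message wording are assumptions for illustration only; the real implementation would live in the Java code of ReadTools.

```python
import warnings


class MultipleReadGroupsError(ValueError):
    """Raised on download when the header holds more than one @RG."""


def check_upload(read_group_ids):
    """Warn, but do not fail, when uploading reads from several read groups."""
    if len(read_group_ids) > 1:
        warnings.warn(
            f"{len(read_group_ids)} read groups found; read-group information "
            "cannot be restored on download (this limitation might be removed "
            "in the future)"
        )


def read_group_for_download(read_group_ids):
    """Return the single read group to assign to every read, or None."""
    if len(read_group_ids) > 1:
        raise MultipleReadGroupsError(
            "only one read group is supported on download "
            "(this limitation might be removed in the future)"
        )
    return read_group_ids[0] if read_group_ids else None


single = read_group_for_download(["rg1"])
```

A header without any `@RG` returns `None` here, leaving the reads unassigned; whether that case should also be an error is an open design choice.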

Better support for read groups in distmap integration #518

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Better support for read groups in distmap integration #518

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions