Commit d54d02d

fixed typos and added acknowledgments
1 parent c6df79a commit d54d02d

1 file changed

Lines changed: 5 additions & 2 deletions

paper/paper.md

@@ -17,11 +17,11 @@ bibliography: paper.bib

 # Summary

-Next generation sequencing data is ubiquitous in medical and biological sciences. It has also become the primary tool in archaeogenetics, where ancient DNA is extracted from archaeological organic (often human skeletal) material, processed into DNA sequencing libraries and then sequenced [@Orlando2021]. As a testimony to the rapid and accellerating growth of the field, we today have close to ten thousand published ancient human genomes available in the public record [@Schmid2024; @Mallick2024], and many smaller datasets of other organisms. A key step in processing raw sequencing data is the estimation of genotypes at specific variable positions along the genome. Such positions are often pre-selected because they are informative about ancestry or of particular biological relevance [@Haak2015; @Mathieson2015; @Rohland2022]. While established tools exist for this task for high-quality modern sequencing data [@samtools; @gatk], these are often not appropriate for ancient DNA, which has often too low sequencing-coverage and a higher error rate due to post-mortem DNA damage. PileupCaller is a command-line tool written in Haskell, which randomly samples genotypes from raw alignment data at predefined bi-allelic positions. Several modes can be selected, geared towards specific input data features and research questions.
+Next generation sequencing data is ubiquitous in medical and biological sciences. It has also become the primary tool in archaeogenetics, where ancient DNA is extracted from archaeological organic (often human skeletal) material, processed into DNA sequencing libraries and then sequenced [@Orlando2021]. As a testimony to the rapid and accelerating growth of the field, we today have close to ten thousand published ancient human genomes available in the public record [@Schmid2024; @Mallick2024], and many smaller datasets of other organisms. A key step in processing raw sequencing data is the estimation of genotypes at specific variable positions along the genome. Such positions are often pre-selected because they are informative about ancestry or of particular biological relevance [@Haak2015; @Mathieson2015; @Rohland2022]. While established tools exist for this task for high-quality modern sequencing data [@samtools; @gatk], these are often not appropriate for ancient DNA, which has often too low sequencing-coverage and a higher error rate due to post-mortem DNA damage. PileupCaller is a command-line tool written in Haskell, which randomly samples genotypes from raw alignment data at predefined bi-allelic positions. Several modes can be selected, geared towards specific input data features and research questions.

 # Statement of need

-Present-day DNA, for example from medical studies results in raw sequencing data with relatively low per-base error rates and sequencing-coverages of at least several multiples of 1 (for example [@1000_Genomes_Project_Consortium2015]) but in fact up to 20-30x coverage. Dedicated tools to process such data include samtools/bcftools [@samtools] and GATK [@gatk] among many other tools. Ancient DNA seuqencing data often comes with substantially lower coverage and substantially higher error rates. In terms of coverage, most ancient genomes have genome-wide coverage often below 1x and in fact very often even below 0.1x. Such low coverage means that any given genomic site is more likely not covered by a sequencing read than covered. At the same time, the low fraction of sites that is actually covered has higher error rates than modern DNA, due to ancient-DNA damage. These two factors violate the assumptions behind statistical genotype callers like `bcftools call` or `HaplotypeCaller` from GATK.
+Present-day DNA, for example from medical studies results in raw sequencing data with relatively low per-base error rates and sequencing-coverages of at least several multiples of 1 (for example [@1000_Genomes_Project_Consortium2015]) but in fact up to 20-30x coverage. Dedicated tools to process such data include samtools/bcftools [@samtools] and GATK [@gatk] among many other tools. Ancient DNA sequencing data often comes with substantially lower coverage and substantially higher error rates. In terms of coverage, most ancient genomes have genome-wide coverage often below 1x and in fact very often even below 0.1x. Such low coverage means that any given genomic site is more likely not covered by a sequencing read than covered. At the same time, the low fraction of sites that is actually covered has higher error rates than modern DNA, due to ancient-DNA damage. These two factors violate the assumptions behind statistical genotype callers like `bcftools call` or `HaplotypeCaller` from GATK.

 As is widely used practice in the field, very low-coverage ancient DNA data is often "called", simply by randomly selecting reads at a given position of interest. PileupCaller is a command-line tool that does exactly that, by reading in a list of SNP positions and a stream of sequencing data, some optional filtering options, and then performs random samples at every position of interest for multiple individuals. Even before this paper, `pileupCaller` has been widely used since its creation in 2017, mostly because of its simple use and low-memory footprint thanks to streaming.

@@ -43,4 +43,7 @@ In terms of output formats, pileupCaller currently supports Eigenstrat, Plink (h

 PileupCaller is part of the "sequenceTools" package, which contains multiple other minor scripts and command-line tools, with pileupCaller being the central and most popular tool. The sequenceTools package makes key use of the "sequence-formats" Haskell library [@sequence-formats], which contains parsers for the Pileup-, the Plink-, the Eigenstrat and the VCF-Format.

+# Acknowledgments
+This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 851511). The author acknowledges core funding by the Max Planck Society.
+
 # References
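
Note on the unchanged "Statement of need" context in this diff: the practice it describes, calling a low-coverage site by randomly selecting one read, can be illustrated with a minimal Python sketch. This is not pileupCaller's code (the tool itself is written in Haskell). The function name `random_haploid_call`, the example sites, and the simplified input (read bases given as plain letters rather than real pileup symbols) are all assumptions made for illustration.

```python
import random


def random_haploid_call(bases, ref, alt, rng):
    """Draw one read base at a bi-allelic site and return it as the call.

    `bases` is a string of read bases observed at the site (hypothetical,
    simplified input). Returns `ref`, `alt`, or None when no usable read
    covers the site.
    """
    # Keep only bases that match one of the two expected alleles.
    usable = [b for b in bases.upper() if b in (ref, alt)]
    if not usable:
        return None  # site is missing for this individual
    # Each usable read is equally likely to be chosen.
    return rng.choice(usable)


# Hypothetical example sites: (chromosome, position, ref, alt, read bases).
sites = [
    ("1", 752566, "G", "A", "GGA"),  # three reads, two alleles observed
    ("1", 842013, "T", "G", ""),     # no coverage, so the call is missing
    ("2", 1000, "C", "T", "AAC"),    # bases matching neither allele are ignored
]

rng = random.Random(42)  # fixed seed so repeated runs give the same calls
for chrom, pos, ref, alt, bases in sites:
    call = random_haploid_call(bases, ref, alt, rng)
    print(chrom, pos, call if call is not None else "missing")
```

Because only one read is drawn per site, the result is a pseudo-haploid call: it makes no attempt to infer a diploid genotype, which is the reason the approach remains usable at coverages well below 1x.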
