paper/paper.md: 5 additions & 2 deletions
@@ -17,11 +17,11 @@ bibliography: paper.bib
# Summary
-Next generation sequencing data is ubiquitous in medical and biological sciences. It has also become the primary tool in archaeogenetics, where ancient DNA is extracted from archaeological organic (often human skeletal) material, processed into DNA sequencing libraries and then sequenced [@Orlando2021]. As a testimony to the rapid and accellerating growth of the field, we today have close to ten thousand published ancient human genomes available in the public record [@Schmid2024; @Mallick2024], and many smaller datasets of other organisms. A key step in processing raw sequencing data is the estimation of genotypes at specific variable positions along the genome. Such positions are often pre-selected because they are informative about ancestry or of particular biological relevance [@Haak2015; @Mathieson2015; @Rohland2022]. While established tools exist for this task for high-quality modern sequencing data [@samtools; @gatk], these are often not appropriate for ancient DNA, which has often too low sequencing-coverage and a higher error rate due to post-mortem DNA damage. PileupCaller is a command-line tool written in Haskell, which randomly samples genotypes from raw alignment data at predefined bi-allelic positions. Several modes can be selected, geared towards specific input data features and research questions.
+Next generation sequencing data is ubiquitous in medical and biological sciences. It has also become the primary tool in archaeogenetics, where ancient DNA is extracted from archaeological organic (often human skeletal) material, processed into DNA sequencing libraries and then sequenced [@Orlando2021]. As a testimony to the rapid and accelerating growth of the field, we today have close to ten thousand published ancient human genomes available in the public record [@Schmid2024; @Mallick2024], and many smaller datasets of other organisms. A key step in processing raw sequencing data is the estimation of genotypes at specific variable positions along the genome. Such positions are often pre-selected because they are informative about ancestry or of particular biological relevance [@Haak2015; @Mathieson2015; @Rohland2022]. While established tools exist for this task for high-quality modern sequencing data [@samtools; @gatk], these are often not appropriate for ancient DNA, which has often too low sequencing-coverage and a higher error rate due to post-mortem DNA damage. PileupCaller is a command-line tool written in Haskell, which randomly samples genotypes from raw alignment data at predefined bi-allelic positions. Several modes can be selected, geared towards specific input data features and research questions.
# Statement of need
-Present-day DNA, for example from medical studies results in raw sequencing data with relatively low per-base error rates and sequencing-coverages of at least several multiples of 1 (for example [@1000_Genomes_Project_Consortium2015]) but in fact up to 20-30x coverage. Dedicated tools to process such data include samtools/bcftools [@samtools] and GATK [@gatk] among many other tools. Ancient DNA seuqencing data often comes with substantially lower coverage and substantially higher error rates. In terms of coverage, most ancient genomes have genome-wide coverage often below 1x and in fact very often even below 0.1x. Such low coverage means that any given genomic site is more likely not covered by a sequencing read than covered. At the same time, the low fraction of sites that is actually covered has higher error rates than modern DNA, due to ancient-DNA damage. These two factors violate the assumptions behind statistical genotype callers like `bcftools call` or `HaplotypeCaller` from GATK.
+Present-day DNA, for example from medical studies results in raw sequencing data with relatively low per-base error rates and sequencing-coverages of at least several multiples of 1 (for example [@1000_Genomes_Project_Consortium2015]) but in fact up to 20-30x coverage. Dedicated tools to process such data include samtools/bcftools [@samtools] and GATK [@gatk] among many other tools. Ancient DNA sequencing data often comes with substantially lower coverage and substantially higher error rates. In terms of coverage, most ancient genomes have genome-wide coverage often below 1x and in fact very often even below 0.1x. Such low coverage means that any given genomic site is more likely not covered by a sequencing read than covered. At the same time, the low fraction of sites that is actually covered has higher error rates than modern DNA, due to ancient-DNA damage. These two factors violate the assumptions behind statistical genotype callers like `bcftools call` or `HaplotypeCaller` from GATK.
As is widely used practice in the field, very low-coverage ancient DNA data is often "called", simply by randomly selecting reads at a given position of interest. PileupCaller is a command-line tool that does exactly that, by reading in a list of SNP positions and a stream of sequencing data, some optional filtering options, and then performs random samples at every position of interest for multiple individuals. Even before this paper, `pileupCaller` has been widely used since its creation in 2017, mostly because of its simple use and low-memory footprint thanks to streaming.
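
The random-sampling idea in the paragraph above can be illustrated with a minimal Python sketch. It is not pileupCaller's implementation (the tool itself is written in Haskell), and several things here are assumptions made for illustration: the function names `random_haploid_call` and `call_individuals` are hypothetical, the input is a plain string of already-decoded base calls per individual rather than real pileup syntax, and bases matching neither expected allele are discarded before sampling, which may differ from how the tool handles them.

```python
import random


def random_haploid_call(bases, ref, alt, rng):
    """Sample one read base at a bi-allelic site (illustrative sketch only).

    bases: string of observed base calls at this site, e.g. "AAG"
           (assumed to be decoded already; not raw pileup syntax).
    ref, alt: the two expected alleles, given as uppercase letters.
    Returns the sampled allele, or None if no usable read covers the site.
    """
    # Assumption: reads showing neither expected allele are ignored.
    usable = [b for b in bases.upper() if b in (ref, alt)]
    if not usable:
        return None  # site is missing for this individual
    # Each usable read has the same chance of being chosen.
    return rng.choice(usable)


def call_individuals(pileups, ref, alt, rng):
    """Apply the random call to every individual at one SNP position."""
    return [random_haploid_call(bases, ref, alt, rng) for bases in pileups]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same calls
    # Three hypothetical individuals: 3 reads, no coverage, 1 read.
    print(call_individuals(["AAG", "", "g"], "A", "G", rng))
```

Because each site is handled independently, a sketch like this can process positions one at a time as they arrive, which matches the streaming, low-memory style of use the paragraph describes.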
@@ -43,4 +43,7 @@ In terms of output formats, pileupCaller currently supports Eigenstrat, Plink (h
PileupCaller is part of the "sequenceTools" package, which contains multiple other minor scripts and command-line tools, with pileupCaller being the central and most popular tool. The sequenceTools package makes key use of the "sequence-formats" Haskell library [@sequence-formats], which contains parsers for the Pileup-, the Plink-, the Eigenstrat and the VCF-Format.
+# Acknowledgments
+This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 851511). The author acknowledges core funding by the Max Planck Society.