word reduction, should be just under 1000

bioinfwithjudith · bioinfwithjudith · commit 8d1c109e4764 · 2026-03-02T10:45:34.000-05:00
diff --git a/joss/paper.md b/joss/paper.md
@@ -43,13 +43,13 @@ authors:
     corresponding: true 
     affiliation: "1, 2, 3"
 affiliations:
-  - name: School of Electrical Engineering and Computer Science, Pennsylvania State University, USA
+  - name: School of Electrical Engineering and Computer Science, Pennsylvania State University, United States of America
     index: 1
     ror: 04p491231
-  - name: Huck Institutes of the Life Sciences, Pennsylvania State University, USA
+  - name: Huck Institutes of the Life Sciences, Pennsylvania State University, United States of America
     index: 2
     ror: 04p491231
-  - name: Department of Biology, Pennsylvania State University, USA
+  - name: Department of Biology, Pennsylvania State University, United States of America
     index: 3
     ror: 04p491231
 date: 10 February 2025
@@ -58,25 +58,23 @@ bibliography: paper.bib
 
 # Summary
 
-In metagenomics, identifying genomes present in a sample is an important initial task, but is complicated by taxonomic profiling tools lacking uncertainty quantification and using incomplete reference databases missing exact genome matches. YACHT (**Y**es/No **A**nswers to **C**ommunity membership via **H**ypothesis **T**esting) [@koslicki2024yacht] introduces a statistical framework of taxonomic profiling that uses binomial hypothesis testing on exclusive k-mers to determine genome presence/absence in a metagenomic sample confidently. This paper describes the software implementation of YACHT, a command-line tool that translates the methodology into a practical, accessible tool for metagenomic analysis. YACHT assists in discovering rare microbiomes by identifying low-abundant species missed in other taxonomic profiling approaches while also controlling the false-negative rate. Its statistical model overcomes challenges in sequencing coverage and incomplete genomes, making it ideal for diverse metagenomic applications, including functional profiling, metatranscriptomics, and clinical microbiome analysis.
-
-YACHT presents a robust, $k$-mer sketching-based statistical framework for accurately detecting genetic similarity between the reference database and the metagenomic sample by incorporating evolutionary sequence divergence through the average nucleotide identity (ANI) and sequencing coverage to enable efficient detection of sampled genomes. The workflow for YACHT includes the following commands. To begin, `yacht sketch` creates reduced representation "sketches" of the reference and sample datasets enabling swift comparisons. Then, `yacht train` is used to find a representative of closely related reference genomes using ANI. Lastly, `yacht run` uses the YACHT algorithm to perform hypothesis testing and identify the presence or absence of species. YACHT is developed with C++ and Python and depends on `sourmash` [@irber2024sourmash], a program for extracting and managing $k$-mers.
+Identifying genomes in metagenomic samples can be complicated by taxonomic profiling tools that lack uncertainty quantification and rely on incomplete reference databases. YACHT (**Y**es/No **A**nswers to **C**ommunity membership via **H**ypothesis **T**esting) [@koslicki2024yacht] introduces a $k$-mer sketching-based statistical framework that incorporates average nucleotide identity (ANI) and sequencing coverage to detect genetic similarity between reference and sample genomes, using binomial hypothesis testing on exclusive $k$-mers to confidently determine genome presence/absence. This paper describes the software implementation of this methodology as a command-line tool that detects low-abundance species while controlling the false-negative rate, making it applicable to functional profiling, metatranscriptomics, and clinical microbiome analysis despite incomplete genomes and variable sequencing coverage. YACHT is developed with C++ and Python and depends on `sourmash` [@irber2024sourmash] for $k$-mer extraction and management.
 
 # Statement of need
 
-Accurately identifying and characterizing microbial communities with low relative abundance is a significant challenge in metagenomics. The current profiling-based practice involves setting arbitrary filter thresholds or discarding low-abundance data without robust justification, which can compromise profiling accuracy and lead to misinterpretations [@schloss2020removal; @jia2022sequencing]. Even with such filtering, the results remain inherently arbitrary because they are influenced by biological complexities such as sequencing errors and evolutionary processes. The lack of a systematic approach to establishing credibility in these results diminishes researchers' confidence in biologically informed methods for identifying rare microorganisms, thereby undermining metagenomic studies. Moreover, these difficulties are exacerbated by the incompleteness of reference databases and the variability in sequencing coverage depth, underscoring the need for statistically credible approaches.
+Accurately identifying low-abundance microbial communities remains a significant challenge in metagenomics. Current methods rely on arbitrary filter thresholds that, even when applied, produce results skewed by sequencing errors and evolutionary processes, compromising profiling accuracy and leading to misinterpretations [@schloss2020removal; @jia2022sequencing]. The lack of a systematic credibility framework can undermine researcher confidence, a problem compounded by incomplete reference databases and variable sequencing coverage depth.
 
-Metagenomic methods rely on existing genome references to detect and classify microbial organisms. However, these reference databases are often incomplete, and conventional metrics may not always align with traditional taxonomic frameworks that account for genomic changes. Consequently, microbes that carry mutations or have diverged evolutionarily can remain undetected, causing inaccuracies in microbial community profiling and misinterpretation of data [@kunin2008bioinformatician; @schlaberg2017validation; @loeffler2020improving; @marcelino2020ccmetagen]. Hence, analytical frameworks need to incorporate genome similarity metrics to capture the full breadth of microbial diversity and to provide accurate, interpretable microbiome dynamics. However, incomplete databases alone do not account for all metagenomic challenges; sequence coverage depth also contributes to the resolution and reliability of microbial detection and characterization.
+Metagenomic methods depend on reference databases that are often incomplete and misaligned with taxonomic frameworks, leaving evolutionarily diverged microbes undetected and causing profiling inaccuracies [@kunin2008bioinformatician; @schlaberg2017validation; @loeffler2020improving; @marcelino2020ccmetagen]. Addressing this requires analytical frameworks that incorporate genome similarity metrics, though sequencing coverage depth presents an additional challenge to reliable microbial detection.
 
-Sequence coverage depth, defined as the portion of a microbe’s genome detected in a sample, is crucial for detecting low-abundance microbes. However, sequencing processes often fail to achieve complete coverage of all genomes in a sample due to limited sequencing depth. As a result, rare or low-abundance taxa may exhibit low sequence coverage, leading to their misinterpretation as noise rather than genuine observations [@mande2012classification; @shakya2013comparative; @sczyrba2017critical; @meyer2022critical]. Furthermore, the lack of guidelines for establishing a biologically meaningful coverage depth threshold introduces subjectivity and inconsistency in the metagenomic analyses. Therefore, implementing dynamic coverage depth thresholds tailored to varying abundance levels is essential for delivering accurate metagenomic studies. Yet, even if we address coverage depth and incomplete genome reference problems, ensuring proper control over statistical errors remains another major challenge.
+Sequence coverage depth—the portion of a microbe’s genome detected in a sample—is crucial for detecting low-abundance microbes, which are often misinterpreted as noise due to limited sequencing depth [@mande2012classification; @shakya2013comparative; @sczyrba2017critical; @meyer2022critical]. The lack of guidelines for biologically meaningful coverage depth thresholds introduces subjectivity, making dynamic coverage depth thresholds essential. Yet even with adequate coverage and reliable genome references, controlling statistical errors remains a major challenge.
 
-Existing metagenomic methods lack the statistical rigor to control false positives and false negatives effectively. High false positive rates misrepresent microbial composition and lead to biased conclusions, undermining research reliability. Conversely, false negative rates cause researchers to overlook important taxa, especially those in low abundance that often carry significant biological importance [@jousset2017less]. Incomplete reference databases, sequencing errors, and evolutionary divergence between reference and sample genomes further complicate these challenges. Therefore, maintaining appropriate control over these statistical error rates is critical to ensure more confident, reliable biological inferences and minimize the risk of misinterpretation. While limitations in reference database, sequence coverage depth and balance of statistical error pose significant challenges, the complexity of metagenomic analysis demands a multifaceted approach to capture microbial profiling accurately.
+Existing metagenomic methods lack the statistical rigor to control false positives and false negatives effectively: high false positive rates misrepresent microbial composition, while high false negative rates cause researchers to overlook biologically important taxa [@jousset2017less]. Incomplete reference databases, sequencing errors, and evolutionary divergence between reference and sample genomes further complicate error control, making a multifaceted statistical approach essential for accurate microbial profiling.
 
-To address these challenges, YACHT offers a statistical framework that can accurately determine the presence or absence of microbial genome in a sample through hypothesis testing. The algorithm’s mathematical model accounts for evolutionary sequence divergence and incomplete sequencing depth by utilizing genome similarity and minimum sequencing depth parameters. It employs the FracMinHash sketching technique [@irber2020decentralizing;  @Irber2022FracMinHash], an alignment-free $k$-mer approach, facilitating fast and accurate genome detection that can efficiently process large datasets. YACHT ensures precise detection of low abundance taxa with a user-defined false negative rate, minimizing the risk of misinterpretation of the result. Our approach can be used for other metagenomic applications such as functional profiling, metatranscriptomic studies [@marcelino2019metatranscriptomics], metabolic potential analyses [@ward2018metapoap; @pereira2024metatranscriptomics], and the characterization of low abundant clinical metagenomic samples such as skin [@godlewska2020metagenomic]. YACHT enhances metagenomic analysis by offering reduced reliance on arbitrary thresholds, improving the interpretability of the result without compromising biological relevance, and allowing researchers to differentiate between genuine artifacts from “noise” with statistical confidence.
+YACHT addresses these challenges through hypothesis testing that accounts for evolutionary sequence divergence and incomplete sequencing depth using genome similarity and minimum sequencing depth parameters. It employs the FracMinHash sketching technique [@irber2020decentralizing; @Irber2022FracMinHash], an alignment-free $k$-mer approach, facilitating fast and accurate detection of low-abundance taxa with a user-defined false negative rate. YACHT is applicable to functional profiling, metatranscriptomic studies [@marcelino2019metatranscriptomics], metabolic potential analyses [@ward2018metapoap; @pereira2024metatranscriptomics], and the characterization of low-abundance clinical metagenomic samples such as skin [@godlewska2020metagenomic], reducing reliance on arbitrary thresholds and distinguishing genuine observations from “noise” with statistical confidence.
 
 # Workflow
 
-The YACHT workflow involves four primary steps. First, `yacht sketch` samples compact representations of reference genomes using `sourmash`. Second, `yacht train` preprocesses the reference genomes, merging those with high average nucleotide identity (ANI) into a single representative. Third, `yacht run` executes the core YACHT algorithm to perform hypothesis testing and determine the presence or absence of organisms. Finally, `yacht convert` transforms the results into popular output formats like CAMI, BIOM, and GraphPhlAn.
+The YACHT workflow involves four primary steps. First, `yacht sketch` samples compact representations of reference genomes. Second, `yacht train` preprocesses the reference genomes, merging those with high ANI into a single representative. Third, `yacht run` executes the core YACHT algorithm to perform hypothesis testing and determine the membership of organisms. Finally, `yacht convert` transforms the results into popular output formats like CAMI, BIOM, and GraphPhlAn.
 
 ![The YACHT workflow illustrated with the four primary stages: sketching, training, running, and converting. \label{fig:workflow}](workflow.png)
 
@@ -98,7 +96,7 @@ Natronobacterium    & TRUE  & 700  & 638 & 0.053534755 \\
 Echinicola          & FALSE & 244  & 978 & 0.052885411 \\
 \bottomrule
 \end{tabular}
-\caption{YACHT results for Sediminispirochaeta, Natronobacterium, and Echinicola are reported. For each species, the following are shown as a subset of the output: whether the organism passed the presence threshold (Presence), the number of exclusive $k$-mer matches (num\_matches), the expected minimum number of matches (acceptance\_threshold), and an alternative confidence estimate for the mutation rate (alt\_confidence\_mut\_rate) are shown. Note that Echinicola is not reported as present, while Sediminispirochaeta and Natronobacterium are present meeting the acceptance threshold. Results were generated using the MBARC-26 dataset (SRA: SRR6394747 by @Singer2016MockCommunity) with YACHT parameters: $k$-size of 31, minimum coverage of 0.05, and ANI threshold of 0.95. Please refer to Use Case Examples for more information.}
+\caption{YACHT results for Sediminispirochaeta, Natronobacterium, and Echinicola showing a subset of output columns: whether the organism passed the presence threshold (Presence), the number of exclusive $k$-mer matches (num\_matches), the expected minimum number of matches (acceptance\_threshold), and an alternative confidence estimate for the mutation rate (alt\_confidence\_mut\_rate). Note that Echinicola is not reported as present, while Sediminispirochaeta and Natronobacterium are reported as present, having met the acceptance threshold. Results were generated using the MBARC-26 dataset (SRA: SRR6394747 by @Singer2016MockCommunity) with YACHT parameters: $k$-size of 31, minimum coverage of 0.05, and ANI threshold of 0.95.}
 \end{table}
 
 
@@ -111,11 +109,11 @@ Echinicola          & FALSE & 244  & 978 & 0.052885411 \\
 
 We present the three use case examples to demonstrate the application of YACHT for identifying taxonomy in microbiome studies: (i) analyzing low-abundance metagenomic samples that are common in clinical settings, (ii) performing MAG fishing to detect specific metagenomic-assembled genomes, and (iii) evaluating synthetic microbial communities to identify the presence of specific organisms.
 
-**Low abundance samples:** YACHT can analyze metagenomic samples with low microbial DNA concentrations, which are common in clinical and environmental studies. In this use case example, we adjust the ANI threshold and $k$-size to balance sensitivity and specificy, with higher values increasing stringency and refining species resolution. Using a human skin metagenomic sample, we show that these parameters markedly influence species reporting highlighting the need for careful threshold selection. For more information, refer to [Low abundance samples](https://github.com/KoslickiLab/YACHT/tree/main/use_case_examples/low_abundance_samples).
+**Low abundance samples:** YACHT can analyze metagenomic samples with low microbial DNA concentrations common in clinical and environmental studies. Using a human skin metagenomic sample, we show that ANI threshold and $k$-size markedly influence species specificity. See [Low abundance samples](https://github.com/KoslickiLab/YACHT/tree/main/use_case_examples/low_abundance_samples).
 
-**Metagenomic-assembled genome (MAG) fishing:** YACHT can be employed to search for specific MAGs of interest within a sample by using a single MAG as the training reference database. Applying this approach to two skin metagenomic samples shows that detection strength varies with sequencing depths and coverage. This use case example illustrates how MAG fishing with YACHT is sensitive to coverage and parameter choice, emphasizing the importance of sequencing depth when assessing MAG presence. Find further detail in [MAG fishing](https://github.com/KoslickiLab/YACHT/tree/main/use_case_examples/MAG_fishing).
+**Metagenomic-assembled genome (MAG) fishing:** Using a single MAG as a training reference database, YACHT searches for specific MAGs within a sample. Applied to two skin metagenomic samples, the results show that detection is sensitive to sequencing depth, coverage, and parameter choice. See [MAG fishing](https://github.com/KoslickiLab/YACHT/tree/main/use_case_examples/MAG_fishing).
 
-**Synthetic metagenomes:** YACHT can assess the construction of mock or synthetic microbial communities to verify that the designed microbes are present. Using a synthetic community from the literature, we show that ANI thresholds can influence accuracy where higher ANI thresholds recover most expected genomes, while lower ones can introduce false positives further highlighitng how parameter choice—particularly ANI and minimum coverage—affect sensitivity and specificity when validating synthetic community composition. For additional information, refer to [Synthetic metagenomes](https://github.com/KoslickiLab/YACHT/tree/main/use_case_examples/synthetic_metagenome)
+**Synthetic metagenomes:** YACHT verifies the presence of designed microbes in mock microbial communities. Higher ANI thresholds recover expected genomes while lower thresholds introduce false positives, demonstrating how ANI and minimum coverage parameters affect sensitivity and specificity. See [Synthetic metagenomes](https://github.com/KoslickiLab/YACHT/tree/main/use_case_examples/synthetic_metagenome).
 
 # Acknowledgements
 We thank the contributors and collaborators who supported the development of YACHT. This work was supported in part by the National Institutes of Health (NIH) under grant number 5R01GM146462-03.