Commit c3a4a12

combine gtdb and user data
1 parent 35e1020 commit c3a4a12

1 file changed

Lines changed: 1 addition & 4 deletions

generate_customdatabase.md

@@ -2,9 +2,6 @@
Genome dereplication is not always perfect due to inherent limitations of hierarchical clustering algorithms used in dereplication tools (dRep and Galah). Alternatively, taxonomic classification using GTDBtk followed by grouping genomes by taxonomy assignment is another option for dereplication, but it has limitations too: 1) ANI radius of under-represented species may be inaccurate, causing wrong taxonomy labeling; 2) novel species cannot be assigned. Combining dereplication and taxonomic classification can enhance the discovery of novel species with improved accuracy.

----
-
-### Overview

The `magmax customdb` subcommand builds a species-level non-redundant genome database by combining two complementary strategies:

@@ -209,7 +206,7 @@ Remaining bins are clustered by pairwise ANI (default 95%, aligned fraction ≥
5. **The `unclassified_clusterrepresentatives_gtdbtkspecies_ani_connections.tsv` file is a diagnostic resource.** It lists novel-cluster representatives whose ANI to a known GTDB-Tk species representative meets or exceeds that species' ANI radius. This happens when unclassified cluster representatives have lower ANI to the GTDB reference species than the representatives selected from the user's input dataset.


-## Tutorial: Building a unified species-level database: integrating MAGmax dereplication results with GTDB reference genomes
+## Building a unified species-level database: integrating MAGmax dereplication results with GTDB reference genomes

The `unifygtdb.sh` script combines `magmax customdb` output with GTDB reference genomes. This is useful when users want to create a complete species-level genome reference database that includes all known species as well as the unknown species covered in the input data.
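
As a rough illustration of the diagnostic described in item 5 above, the check "ANI meets or exceeds that species' ANI radius" can be sketched in a few lines of Python. This is not MAGmax code: the column names (`representative`, `gtdb_species`, `ani`, `ani_radius`) and the example rows are assumptions made for the sketch, and the real `unclassified_clusterrepresentatives_gtdbtkspecies_ani_connections.tsv` layout may differ.

```python
import csv
import io


def flag_connections(rows):
    """Keep rows whose ANI meets or exceeds the species' ANI radius.

    Column names are hypothetical; adjust them to the real TSV header.
    """
    return [r for r in rows if float(r["ani"]) >= float(r["ani_radius"])]


# Made-up example data standing in for the diagnostic TSV.
EXAMPLE_TSV = (
    "representative\tgtdb_species\tani\tani_radius\n"
    "bin_A\ts__Species_one\t96.2\t95.0\n"
    "bin_B\ts__Species_two\t93.1\t95.0\n"
    "bin_C\ts__Species_three\t95.0\t95.0\n"
)

rows = list(csv.DictReader(io.StringIO(EXAMPLE_TSV), delimiter="\t"))
flagged = flag_connections(rows)

for row in flagged:
    print(row["representative"], row["gtdb_species"], row["ani"])
```

Rows that pass this filter are the ones worth inspecting by hand, since they are candidate novel-cluster representatives that sit within the ANI radius of a known species.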
