4. basic usage examples
prepTG formats and parses information from the provided GenBank files. If given FASTA files instead, it can run prodigal (for bacteria only!) for gene calling and then create GenBank files.
prepTG -i Folder_with_Target_Genomes/ -o prepTG_Database/
For additional details on prepTG (e.g. how to download genomes from NCBI), please check out the "1. more info on prepTG" wiki page.
fai uses a simple "gene-clumping" approach, an "HMM-based" approach, or a combination of both to identify homologous instances of a gene cluster or of a known set of homologous gene clusters:
- Provide GenBank(s) of known instance(s) of the gene cluster
fai -i Known_GeneCluster.gbk -tg prepTG_Database/ -o fai_Results/
- Provide gene cluster coordinates along a FASTA reference genome
fai -r Reference.fasta -rc scaffold01 -rs 40201 -re 45043 -tg prepTG_Database/ -o fai_Results/
- Provide a set of query proteins that should be co-clustered (similar to cblaster)
fai -pq Gene-Cluster_Query_Proteins.faa -tg prepTG_Database/ -o fai_Results/
- Provide a single query protein and use it to extract the surrounding +/-20 kb around homologs in target genomes (inspired by CORASON; implementation still experimental)
# Note: this option is still experimental. The concept of looking at variability
# in the context of a focal gene stems from CORASON, but we don't use reciprocal
# best hits (RBH), only an adjustable E-value threshold, to identify homologs in
# target genomes. Unlike the other three ways to run fai to identify gene clusters,
# where syntenic support can be used to better infer orthology, here we are more
# limited and can only infer homology. We might eventually pair the -sq argument
# with another to provide a reference genome for the single query protein.
fai -sq Single_Query_Protein.faa -tg prepTG_Database/ -o fai_Results/ -f 20000
For additional details on fai (e.g. how it relates to cblaster and lsaBGC-Expansion, plots it can create to assess homologous gene clusters detected), please check out the "2. more info on fai" wiki page.
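To make the +/-20 kb flank concrete, the window arithmetic can be sketched as below. This is only an illustration under stated assumptions, not fai's implementation: `flank_window` is a hypothetical helper, coordinates are assumed to be 1-based, and clipping the window at the scaffold ends is an assumption about how edge cases would reasonably be handled.

```shell
# Hypothetical helper (not part of fai): compute a flanking window around a focal gene.
# Arguments: gene_start gene_end flank scaffold_length (1-based coordinates assumed).
flank_window() (
    start=$1
    end=$2
    flank=$3
    scaffold_len=$4
    lo=$(( start - flank ))
    hi=$(( end + flank ))
    # Clip the window so it never extends past either end of the scaffold.
    if [ "$lo" -lt 1 ]; then
        lo=1
    fi
    if [ "$hi" -gt "$scaffold_len" ]; then
        hi=$scaffold_len
    fi
    echo "$lo $hi"
)
```

For example, `flank_window 40201 45043 20000 100000` prints `20201 65043`, while a gene near the start of a short scaffold yields a window truncated at position 1.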
zol -i Genbanks_Directory/ -o zol_Results/
If running after fai, the input directory would be the Homologous_GenBanks_Directory/ subdirectory.
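When chaining fai into zol in a script, the subdirectory can be looked up rather than hard-coded. The sketch below assumes only that a directory named Homologous_GenBanks_Directory exists somewhere beneath the fai output folder; its exact location is not stated here, and `find_homolog_dir` is a hypothetical helper.

```shell
# Hypothetical helper: print the first directory named Homologous_GenBanks_Directory
# found beneath a fai output folder (prints nothing if none exists).
find_homolog_dir() (
    fai_results=$1
    find "$fai_results" -type d -name 'Homologous_GenBanks_Directory' | head -n 1
)

# Possible usage (left commented out because it requires fai output and zol):
# homolog_dir=$(find_homolog_dir fai_Results)
# if [ -n "$homolog_dir" ]; then
#     zol -i "$homolog_dir" -o zol_Results/
# fi
```

Checking that the result is non-empty before calling zol avoids launching a run on a missing input directory.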
zol produces an XLSX spreadsheet report (within the sub-directory Final_Results/) where rows correspond to individual ortholog groups/homolog groups and columns provide basic stats, consensus order, annotation information from multiple databases, and evolutionary/selection-inference statistics. Coloring is automatically applied to select quantitative fields so that users can more easily assess trends. I strongly recommend providing a custom-annotation database via the -cd argument, as a FASTA file of protein sequences whose headers are unique identifiers. This makes it much easier to link the ortholog groups to known genes from a well-studied instance of the gene cluster, if one exists!
Annotation databases include: KEGG, NCBI's PGAP, PaperBLAST, VOGs (phage related genes), MIBiG (genes from characterized BGCs), VFDB (virulence factors), CARD (antibiotic resistance), ISfinder (transposons/insertion-sequences).
For details on the stats/annotations zol infers, please refer to the zol wiki page.
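Since the custom-annotation FASTA passed via -cd should have unique identifiers as headers, a quick check for repeats before running zol may save a confusing report. The sketch below is not part of zol: `duplicate_fasta_ids` is a hypothetical helper, and treating the identifier as the header text up to the first whitespace is an assumption.

```shell
# Hypothetical helper: print each FASTA identifier that occurs more than once.
# The identifier is assumed to be the header text up to the first whitespace.
duplicate_fasta_ids() (
    fasta=$1
    awk '/^>/ {
             id = substr($1, 2)          # drop the leading ">"
             if (seen[id]++ == 1) {      # report on the second occurrence only
                 print id
             }
         }' "$fasta"
)

# Possible usage (Custom_Proteins.faa is a placeholder file name):
# duplicate_fasta_ids Custom_Proteins.faa
```

An empty result means every header identifier is unique under that assumption.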

Another application of zol is to use it for preliminary dereplication for visualization with clinker, CORASON, etc.
zol uses skani to perform dereplication with adjustable options (see zol --help).
Note that skani's estimates of average nucleotide identity (ANI) and aligned fraction (AF) become less reliable when working with contigs <10 kb, so zol-based dereplication should only be used for gene clusters of 10 kb or larger.
# Run zol with dereplication requested
zol -i GenBanks_Directory/ -o zol_Results/ -d
# Reference dereplicated representative GenBanks/gene clusters as input for clinker analysis
clinker zol_Results/Dereplicated_GenBanks/*.gbk -p clinker_visualization.html
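Given the 10 kb caveat above, it may be worth listing any short gene clusters before requesting dereplication. The sketch below is a rough check and not part of zol. Both helper names are hypothetical, the length is read from the GenBank LOCUS line (the field preceding "bp") and summed over all records in a file, and the `.gbk` extension is an assumption about how the input files are named.

```shell
# Hypothetical helper: total sequence length in bp across all records of a GenBank
# file, taken from the LOCUS line(s) as the field immediately before "bp".
genbank_length() (
    awk '$1 == "LOCUS" {
             for (i = 2; i <= NF; i++) {
                 if ($i == "bp") {
                     total += $(i - 1)
                 }
             }
         }
         END { print total + 0 }' "$1"
)

# Hypothetical helper: list *.gbk files in a directory shorter than a minimum
# length (default 10000 bp), as tab-separated "path<TAB>length" lines.
short_genbanks() (
    dir=$1
    min_len=${2:-10000}
    for gbk in "$dir"/*.gbk; do
        [ -e "$gbk" ] || continue
        len=$(genbank_length "$gbk")
        if [ "$len" -lt "$min_len" ]; then
            printf '%s\t%s\n' "$gbk" "$len"
        fi
    done
)

# Possible usage:
# short_genbanks GenBanks_Directory/
```

If this prints any files, those gene clusters fall below the size at which skani-based dereplication is considered reliable.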