40 commits
e09856e
example_1 generation
bioinfwithjudith Sep 27, 2023
f0d3392
example 1 update
bioinfwithjudith Oct 5, 2023
7c425a2
Merge branch 'main' of https://github.com/KoslickiLab/YACHT into use_…
bioinfwithjudith Oct 5, 2023
eed232f
more use case examples
bioinfwithjudith Oct 5, 2023
6e1cb4d
use-case-example.md
bioinfwithjudith Oct 12, 2023
d40b381
use-case-example.md
bioinfwithjudith Oct 12, 2023
7a343fd
use-case-example.md
bioinfwithjudith Oct 12, 2023
a811474
use-case-example.md
bioinfwithjudith Oct 12, 2023
64d9316
use-case-example.md
bioinfwithjudith Oct 12, 2023
7fc01b9
use-case-example.md
bioinfwithjudith Oct 12, 2023
fbff66d
use-case-example.md
bioinfwithjudith Oct 12, 2023
32bddd4
use-case-example.md
bioinfwithjudith Oct 12, 2023
8e3caca
use-case-example.md
bioinfwithjudith Oct 12, 2023
379493f
use-case-example.md
bioinfwithjudith Oct 12, 2023
0da7e10
pathogen detection literature update
bioinfwithjudith Oct 17, 2023
f044cc7
update
bioinfwithjudith Oct 19, 2023
961ce4c
update
bioinfwithjudith Oct 19, 2023
ea440de
update on pathogen detection description
bioinfwithjudith Oct 30, 2023
bc8afec
pathogen detection use case example prep
bioinfwithjudith Nov 1, 2023
d437966
create toy dataset for lung sample
bioinfwithjudith Nov 7, 2023
2f20f8b
pathogen_detection_example.md
bioinfwithjudith Nov 7, 2023
4cceca2
updating example
bioinfwithjudith Nov 8, 2023
7ef5e71
update
bioinfwithjudith Nov 8, 2023
bab2191
update
bioinfwithjudith Nov 8, 2023
f16ebab
update
bioinfwithjudith Nov 8, 2023
17b3eb7
reduced ani on k15 example
bioinfwithjudith Nov 8, 2023
7c9fdd1
update
bioinfwithjudith Nov 9, 2023
6240cff
adding example for cross contamination
bioinfwithjudith Nov 17, 2023
3936e2a
updated title of markdown
bioinfwithjudith Nov 17, 2023
b0286aa
updated title of markdown
bioinfwithjudith Nov 17, 2023
772c459
add figure for well tray
bioinfwithjudith Nov 17, 2023
c3cc89b
update
bioinfwithjudith Nov 28, 2023
956d661
update
bioinfwithjudith Nov 28, 2023
f356dbd
Merge branch 'main' into use_case_examples
mfl15 Dec 19, 2023
b0ac652
Merge branch 'main' into use_case_examples
dkoslicki Jan 25, 2024
ab6e32e
update commands
bioinfwithjudith Feb 14, 2024
711fb6d
update commands
bioinfwithjudith Feb 16, 2024
6cd7c8f
tutorial to get these MAG fishing
bioinfwithjudith Jun 26, 2024
2a799c3
README.md update description
bioinfwithjudith Jun 27, 2024
edb95cf
Merge branch 'main' into PR
bioinfwithjudith Jun 27, 2024
77 changes: 77 additions & 0 deletions use_case_examples/MAG_fishing/README.md
@@ -0,0 +1,77 @@
# YACHT for MAG Fishing

## Description of use-case-example

Metagenome-assembled genome (MAG) fishing is the process of reporting the assembled genomes within a metagenomic sample.

Metagenomics is an important tool for exploring the microbial communities of specific environments, especially environments that contain unculturable microbes. However, reference genomes remain persistently underrepresented, which makes it hard to produce a high-resolution taxonomic profile. Consequently, many microbial communities are still understudied. Efforts have been made to increase our knowledge of these environments, such as the study we highlight here. One of the goals of the study by Banchi and colleagues (link to paper) was to unveil a more resolved taxonomic composition of marine sediments in the Venice Lagoon for further functional analyses of these microbial communities. The dataset from this study (NCBI accession: PRJNA924243) has 58 MAGs and serves as a use-case example of using YACHT to resolve taxonomic composition.

According to their study, we should expect YACHT to report species from the phylum Proteobacteria, in the classes Alphaproteobacteria, Gammaproteobacteria, and Deltaproteobacteria.

Banchi, E., Corre, E., Del Negro, P., Celussi, M., & Malfatti, F. (2024). Genome-resolved metagenomics of Venice Lagoon surface sediment bacteria reveals high biosynthetic potential and metabolic plasticity as successful strategies in an impacted environment. Marine Life Science & Technology, 6(1), 126-142.

## Install the following programs

**datasets**

For more information, see [datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/).

**YACHT**

For more information, see [YACHT](https://github.com/KoslickiLab/YACHT).


## Download MAG samples

The following command was used to download the MAG sample of interest:

```
datasets download genome accession PRJNA924243
```

Downloading this MAG project produces a directory tree with a separate path for each FASTA file in the project. However, `yacht sketch sample` expects a single FASTA file (or two files for paired-end reads). If it is given more, YACHT reports the following error:

```
ValueError: Please provide either one file for single-end reads or two files for paired-end reads.
```

A workaround is to run the following commands.

```
cd MAG_data
cp data/ncbi_dataset/data/GCA_02928*/*fna MAG_data/.
```
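
The next step passes a single file, `MAG_sample.fna`, to `yacht sketch sample`, while the commands above only copy the individual `.fna` files. The README does not say how `MAG_sample.fna` was produced. One possible way to bridge that gap, assuming that simply concatenating all MAG FASTA files is what was intended, is sketched below. The toy files and the temporary directory are placeholders so that the snippet runs anywhere; with real data, point the `cat` at the directory holding the copied `.fna` files.

```bash
# Toy stand-ins for the downloaded MAG FASTA files (hypothetical names).
workdir="$(mktemp -d)"
mkdir -p "$workdir/MAG_data"
printf '>contig_1\nACGTACGT\n' > "$workdir/MAG_data/GCA_toy_1.fna"
printf '>contig_2\nGGCCTTAA\n' > "$workdir/MAG_data/GCA_toy_2.fna"
printf '>contig_3\nTTTTAAAA\n' > "$workdir/MAG_data/GCA_toy_3.fna"

# Concatenate every .fna file into the single FASTA file the sketch step expects.
cat "$workdir"/MAG_data/*.fna > "$workdir/MAG_sample.fna"

# Sanity check: count the FASTA header lines in the combined file.
n_records="$(grep -c '^>' "$workdir/MAG_sample.fna")"
echo "MAG_sample.fna contains $n_records FASTA records"
```

Concatenation keeps every contig header, which may be why the resulting sketch holds more than one signature.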

## yacht sketch MAG of interest

```
yacht sketch sample --infile MAG_sample.fna --kmer 31 --scaled 1000 --outfile sample.sig.zip
```

Be aware that `yacht sketch sample` will create a sketch containing more than one signature, whereas `yacht run` expects a sample with a single signature and will direct you to merge the signatures with `sourmash sig merge`. Please execute the following command:

```
sourmash sig merge sample.sig.zip -k 31 -o sample_merge.sig.zip
```

## Using yacht download pretrained_ref_db

In our attempt, the following command did not appear to download anything:

```
yacht download pretrained_ref_db --database gtdb --db_version rs214 --k 31 --ani_thresh 0.9995 --outfolder ./
```

Running the following command, on the other hand, appeared to download everything:

```
yacht download default_ref_db --database gtdb --db_version rs214 --gtdb_type reps --k 31 --outfolder ./
```

It seems that `yacht download pretrained_ref_db` does not show its progress. It never visibly completed on its own, but the download finished once `yacht download default_ref_db` was run.

## yacht run
```
yacht run --json 'gtdb-rs214-reps.k31_0.9995_pretrained/gtdb-rs214-reps.k31_0.9995_config.json' --sample_file 'sample_merge.sig.zip' --num_threads 32 --keep_raw --significance 0.99 --min_coverage_list 1 0.5 0.1 0.05 0.01 --out ./result.xlsx
```

Binary file added use_case_examples/MAG_fishing/result.xlsx
Binary file not shown.
@@ -0,0 +1,17 @@
# Download SRR25626360 which represents WGS of Haemophilus influenzae
nohup fastq-dump --fasta 60 SRR25626360 2>&1 &

### Download SRR24210460 which represents WGS of Mycoplasma pneumoniae from library MDY
nohup fastq-dump --fasta 60 SRR24210460 2>&1 &

### Download SRR7217470 which represents WGS of Chlamydia pneumoniae
nohup fastq-dump --fasta 60 SRR7217470 2>&1 &

### Download SRR5962942 which represents WGS of Streptococcus pneumoniae
nohup fastq-dump --fasta 60 SRR5962942 2>&1 &

### Download SRR26202532 which represents WGS of Bordetella pertussis
nohup fastq-dump --fasta 60 SRR26202532 2>&1 &

### Download SRR2830253, reads of a healthy human lung microbiome
nohup fastq-dump --fasta 60 SRR2830253 2>&1 &
@@ -0,0 +1,30 @@
# Create the example sample data for a patient with respiratory symptoms who seeks to find out which pathogen is causing these symptoms.

# Before moving on, make sure the reads needed to create the sample dataset are available. Please reference create_reference_database.md

# Create samples that will be loaded to the 96-well tray

# Negative control, so just reads from a healthy lung
cat SRR2830253.fasta > negative_control_well_11.fasta

# Positive control with H. influenzae
cat SRR25626360.fasta SRR2830253.fasta > positive_control_well_23.fasta

# Sample 1
cat SRR25626360.fasta SRR2830253.fasta SRR25626360.fasta > positive_control_well_64.fasta

# Sample 2
cat SRR24210460.fasta SRR2830253.fasta SRR25626360.fasta > sample_well_80.fasta

# I check one of my negative controls, which is a healthy lung example and we should not detect any bacteria here
# no contamination

# I check one of my positive controls for M. pneumoniae which should not have H. influenzae
# no contamination

# I check one of my positive controls for H. influenzae which should not have M. pneumoniae
# contamination

# I check one of my samples for H. influenzae which should not have M. pneumoniae
# contamination

@@ -0,0 +1,95 @@
# Contamination Detection Example
This example studies how microbial communities are shaped across different types of respiratory disease. Samples were collected from two patients: one is M. pneumoniae positive and the other is H. influenzae positive. To save time and money, a single 96-well tray is used for both samples. Before downstream analysis can be performed, we want to know whether cross-contamination between samples occurred while loading the 96-well tray, so we randomly choose wells 11, 23, 64, and 80 to check.

Make sure all bacterial reads needed to create your reference dataset (also known as a training dataset) are available.
```bash
bash 1_before_starting.sh
```
```bash
bash 2_before_starting.sh
```
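
Before sketching, it can help to confirm that the FASTA files the two scripts are supposed to produce actually exist and are non-empty. The following is a minimal sketch of such a check, not part of YACHT: `check_reads` is a hypothetical helper, and the temporary directory with toy files only stands in for the real download directory.

```bash
# check_reads DIR FILE...: print each expected file that is missing or empty.
check_reads() {
  local dir="$1" f
  shift
  for f in "$@"; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $f"
    fi
  done
}

# Toy directory: one good file, one empty file, one file absent altogether.
reads_dir="$(mktemp -d)"
printf '>read_1\nACGT\n' > "$reads_dir/SRR25626360.fasta"
: > "$reads_dir/SRR24210460.fasta"

report="$(check_reads "$reads_dir" SRR25626360.fasta SRR24210460.fasta SRR2830253.fasta)"
echo "$report"
```

With real data, pass the directory that holds the downloads together with the file names used in this example.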

### Sketch your training dataset and sample to your preference.

#### Using k=31
Note: the training and sample datasets must use the same ksize. Because we are sketching from a list of genomes, we can use the following `sourmash sketch fromfile` command:
```bash
sourmash sketch fromfile genome_list.csv -p dna,k=31,scaled=1000,abund -o training_database.k31.sig.zip
```
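
The manifest passed to `sourmash sketch fromfile` can be generated rather than typed by hand. The snippet below is a sketch under the assumption that sourmash looks for the columns `name`, `genome_filename`, and `protein_filename`; it omits the leading index column that appears in this example's `genome_list.csv`, and whether sourmash needs or tolerates that extra column has not been verified here. The output path is a temporary file chosen for illustration.

```bash
# Accessions used for the training database in this example.
accessions="SRR25626360 SRR24210460 SRR7217470 SRR5962942 SRR26202532"
manifest="$(mktemp)"   # hypothetical output path

# Header row, then one row per accession with an empty protein_filename.
echo "name,genome_filename,protein_filename" > "$manifest"
for acc in $accessions; do
  echo "${acc},${acc}.fasta," >> "$manifest"
done

cat "$manifest"
```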

Sketch the negative control reads from well 11
```bash
yacht sketch sample --infile ./negative_control_well_11.fasta --kmer 31 --scaled 1000 --outfile negative_control_well_11.k31.sig.zip
```

Sketch the positive control from well 23
```bash
yacht sketch sample --infile ./positive_control_well_23.fasta --kmer 31 --scaled 1000 --outfile positive_control_well_23.k31.sig.zip
```

Sketch the positive control from well 64
```bash
yacht sketch sample --infile ./positive_control_well_64.fasta --kmer 31 --scaled 1000 --outfile positive_control_well_64.k31.sig.zip
```

Sketch the sample from well 80
```bash
yacht sketch sample --infile ./sample_well_80.fasta --kmer 31 --scaled 1000 --outfile sample_well_80.k31.sig.zip
```

### Make training data for k=31
```bash
yacht train --ref_file training_database.k31.sig.zip --ksize 31 --num_threads 64 --ani_thresh 0.95 --prefix 'training_database.k31' --outdir ./ --force
```

### Identify which pathogens are present in each well to check for cross-contamination.
```bash
yacht run --json training_database.k31_config.json --sample_file negative_control_well_11.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./negative_control_well_11_k31_result.xlsx
```

```bash
yacht run --json training_database.k31_config.json --sample_file positive_control_well_23.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./positive_control_well_23_k31_result.xlsx
```

```bash
yacht run --json training_database.k31_config.json --sample_file positive_control_well_64.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./positive_control_well_64_k31_result.xlsx
```

```bash
yacht run --json training_database.k31_config.json --sample_file sample_well_80.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./sample_well_80.xlsx
```
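
The four `yacht run` commands above differ only in the well name, so they can be generated in a loop. The snippet below is a sketch that only prints the commands and executes nothing; `print_run_commands` is a hypothetical helper, and it names every output `<well>_k31_result.xlsx`, which differs slightly from the `./sample_well_80.xlsx` name used above for well 80.

```bash
# Print one `yacht run` command per well instead of writing each out by hand.
print_run_commands() {
  local config="training_database.k31_config.json"
  local well
  for well in negative_control_well_11 positive_control_well_23 \
              positive_control_well_64 sample_well_80; do
    echo "yacht run --json ${config} --sample_file ${well}.k31.sig.zip" \
         "--significance 0.99 --num_threads 64" \
         "--min_coverage_list 1 0.6 0.2 0.1 --out ./${well}_k31_result.xlsx"
  done
}

commands="$(print_run_commands)"
echo "$commands"
```

Once the printed commands look right, they can be run by piping the output to `bash`.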

### Results
Using a ksize of 31 at ANI 0.95, YACHT finds XYZ

## Let's decrease ANI to 0.50

### Make training data for k=31
```bash
yacht train --ref_file training_database.k31.sig.zip --ksize 31 --num_threads 64 --ani_thresh 0.50 --prefix 'training_database.k31_ani0.50' --outdir ./ --force
```

### Pathogen Detection using YACHT
Identify whether the patient has an infection and which pathogen is causing the disease.
```bash
yacht run --json training_database.k31_ani0.50_config.json --sample_file negative_control_well_11.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k31_ani0.50_result_negative_control_well_11.xlsx
```

Identify whether the patient has an infection and which pathogen is causing the disease.
```bash
yacht run --json training_database.k31_ani0.50_config.json --sample_file positive_control_well_23.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k31_ani0.50_result_positive_control_well_23.xlsx
```

Identify whether the patient has an infection and which pathogen is causing the disease.
```bash
yacht run --json training_database.k31_ani0.50_config.json --sample_file positive_control_well_64.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k31_ani0.50_result_positive_control_well_64.xlsx
```

Identify whether the patient has an infection and which pathogen is causing the disease.
```bash
yacht run --json training_database.k31_ani0.50_config.json --sample_file sample_well_80.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k31_ani0.50_result_sample_well_80.xlsx
```


### Results
Decreasing ANI to 0.50 and using a ksize of 31, YACHT finds XYZ
@@ -0,0 +1,6 @@
0,name,genome_filename,protein_filename
1,SRR25626360,SRR25626360.fasta,
2,SRR24210460,SRR24210460.fasta,
3,SRR7217470,SRR7217470.fasta,
4,SRR5962942,SRR5962942.fasta,
5,SRR26202532,SRR26202532.fasta,
17 changes: 17 additions & 0 deletions use_case_examples/pathogen_detection_example/1_before_starting.sh
@@ -0,0 +1,17 @@
# Download SRR25626360 which represents WGS of Haemophilus influenzae
nohup fastq-dump --fasta 60 SRR25626360 2>&1 &

### Download SRR24210460 which represents WGS of Mycoplasma pneumoniae from library MDY
nohup fastq-dump --fasta 60 SRR24210460 2>&1 &

### Download SRR7217470 which represents WGS of Chlamydia pneumoniae
nohup fastq-dump --fasta 60 SRR7217470 2>&1 &

### Download SRR5962942 which represents WGS of Streptococcus pneumoniae
nohup fastq-dump --fasta 60 SRR5962942 2>&1 &

### Download SRR26202532 which represents WGS of Bordetella pertussis
nohup fastq-dump --fasta 60 SRR26202532 2>&1 &

### Download SRR2830253, reads of a healthy human lung microbiome
nohup fastq-dump --fasta 60 SRR2830253 2>&1 &
11 changes: 11 additions & 0 deletions use_case_examples/pathogen_detection_example/2_before_starting.sh
@@ -0,0 +1,11 @@
# Create the example sample data for a patient with respiratory symptoms who seeks to find out which pathogen is causing these symptoms.

# Before moving on, make sure the reads needed to create the sample dataset are available. Please reference create_reference_database.md

# Sketch sample to your preference. Note: training and sample datasets are required to have the same ksize.

## Using k=31
nohup sourmash sketch fromfile lung_list.csv -p dna,k=31,scaled=1000,abund -o lung_sample.k31.sig.zip > k31_sample.log 2>&1 &

## Using k=15
nohup sourmash sketch fromfile lung_list.csv -p dna,k=15,scaled=1000,abund -o lung_sample.k15.sig.zip > k15_sample.log 2>&1 &
6 changes: 6 additions & 0 deletions use_case_examples/pathogen_detection_example/genome_list.csv
@@ -0,0 +1,6 @@
0,name,genome_filename,protein_filename
1,SRR25626360,SRR25626360.fasta,
2,SRR24210460,SRR24210460.fasta,
3,SRR7217470,SRR7217470.fasta,
4,SRR5962942,SRR5962942.fasta,
5,SRR26202532,SRR26202532.fasta,
3 changes: 3 additions & 0 deletions use_case_examples/pathogen_detection_example/lung_list.csv
@@ -0,0 +1,3 @@
0,name,genome_filename,protein_filename
1,SRR24210460,SRR24210460.fasta,
2,SRR26202532,SRR26202532.fasta,
@@ -0,0 +1,77 @@
# Pathogen Detection Example
A patient with respiratory symptoms wants to find out which pathogen is causing these symptoms.

Make sure all bacterial reads needed to create your reference dataset (also known as a training dataset) are available.
```bash
bash 1_before_starting.sh
```
```bash
bash 2_before_starting.sh
```
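
Note that `1_before_starting.sh` starts every `fastq-dump` call in the background (`nohup ... &`), so the script returns before the downloads have finished, and the second script may then run on incomplete files. One possible alternative, sketched here as an assumption rather than as the authors' method, is to launch the downloads from a loop and `wait` for all of them. The snippet defaults to a dry run that only prints the commands; `download_reads` and `DRY_RUN` are hypothetical names.

```bash
accessions="SRR25626360 SRR24210460 SRR7217470 SRR5962942 SRR26202532 SRR2830253"

# Set DRY_RUN=0 to really call fastq-dump (needs the SRA Toolkit and network access).
DRY_RUN="${DRY_RUN:-1}"

download_reads() {
  local acc
  for acc in $accessions; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "fastq-dump --fasta 60 $acc"
    else
      # Run each download in the background, logging to its own file.
      fastq-dump --fasta 60 "$acc" > "${acc}.log" 2>&1 &
    fi
  done
  # Block until every background download has exited.
  wait
}

planned="$(download_reads)"
echo "$planned"
```

With `DRY_RUN=0`, the function returns only after all six downloads have exited, so the sketching steps that follow start from complete files.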

### Sketch your training dataset and sample to your preference.

#### Using k=31
Note: the training and sample datasets must use the same ksize. Because we are sketching from a list of genomes, we can use the following `sourmash sketch fromfile` command:
```bash
sourmash sketch fromfile genome_list.csv -p dna,k=31,scaled=1000,abund -o training_database.k31.sig.zip
```

Sketch your sample fasta file
```bash
yacht sketch sample --infile ./lung_sample.fasta --kmer 31 --scaled 1000 --outfile lung_sample.k31.sig.zip
```

### Make training data for k=31
```bash
yacht train --ref_file training_database.k31.sig.zip --ksize 31 --num_threads 64 --ani_thresh 0.95 --prefix 'training_database.k31' --outdir ./ --force
```

### Identify whether the patient has an infection and which pathogen is causing the disease.
```bash
yacht run --json training_database.k31_config.json --sample_file lung_sample.k31.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k31_result.xlsx
```

### Results
Using a ksize of 31, YACHT finds that M. pneumoniae is present in the lung sample.

## What if we decrease ksize to 15?
If we use a small ksize such as 15, we would expect YACHT not to find that the patient is infected by M. pneumoniae. Let's set up the experiment. Note that a ksize below 7 may not produce results and is not recommended.

### Sketch the training database and lung sample using k=15
```bash
sourmash sketch fromfile genome_list.csv -p dna,k=15,scaled=1000,abund -o training_database.k15.sig.zip
```

Sketch your sample fasta file
```bash
yacht sketch sample --infile ./lung_sample.fasta --kmer 15 --scaled 1000 --outfile lung_sample.k15.sig.zip
```

### Make training data for k=15
```bash
yacht train --ref_file training_database.k15.sig.zip --ksize 15 --num_threads 64 --ani_thresh 0.95 --prefix 'training_database.k15' --outdir ./ --force
```

### Pathogen Detection using YACHT
Identify whether the patient has an infection and which pathogen is causing the disease.
```bash
yacht run --json training_database.k15_config.json --sample_file lung_sample.k15.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k15_result.xlsx
```
### Results
Using a ksize of 15, YACHT finds/does not find that M. pneumoniae is present in the lung sample.

## Let's decrease ANI to 0.85

### Make training data for k=15
```bash
yacht train --ref_file training_database.k15.sig.zip --ksize 15 --num_threads 64 --ani_thresh 0.85 --prefix 'training_database.k15_ani0.85' --outdir ./ --force
```

### Pathogen Detection using YACHT
Identify whether the patient has an infection and which pathogen is causing the disease.
```bash
yacht run --json training_database.k15_ani0.85_config.json --sample_file lung_sample.k15.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k15_ani0.85_result.xlsx
```
### Results
Using a ksize of 15 and an ANI threshold of 0.85, YACHT finds/does not find that M. pneumoniae is present in the lung sample.
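
The three configurations in this example differ only in ksize and ANI threshold, so the `yacht train` and `yacht run` commands can be generated from a small list of settings. The snippet below is a sketch that only prints the commands. `print_sweep_commands` is a hypothetical helper, and it uses a uniform `training_database.k<k>_ani<ani>` prefix and matching output names, which is an assumption that differs from the prefixes used above.

```bash
# Print the train and run commands for each (ksize, ANI threshold) setting.
print_sweep_commands() {
  local setting k ani prefix
  # Each entry is "ksize:ani_thresh", mirroring the settings used in this example.
  for setting in 31:0.95 15:0.95 15:0.85; do
    k="${setting%%:*}"
    ani="${setting##*:}"
    prefix="training_database.k${k}_ani${ani}"   # hypothetical uniform prefix
    echo "yacht train --ref_file training_database.k${k}.sig.zip --ksize ${k} --num_threads 64 --ani_thresh ${ani} --prefix '${prefix}' --outdir ./ --force"
    echo "yacht run --json ${prefix}_config.json --sample_file lung_sample.k${k}.sig.zip --significance 0.99 --num_threads 64 --min_coverage_list 1 0.6 0.2 0.1 --out ./k${k}_ani${ani}_result.xlsx"
  done
}

sweep="$(print_sweep_commands)"
echo "$sweep"
```

Keeping the ksize and ANI threshold in both the prefix and the output name makes it easier to tell the result files apart afterwards.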