GCP-Centeny/description/README.md at main · GiuntaLab/GCP-Centeny

GCP - tutorial

Install

To install and run GCP_Centeny, please clone the repository and then create and activate the GCP conda environment as follows:

git clone https://github.com/GiuntaLab/GCP-Centeny.git
cd GCP-Centeny/

#Creates GCP environment
conda env create -f environment.yml
conda activate GCP

All the scripts to run the main program (GCP.R) are in the folder named R/.

cd R/

To print all the available GCP commands, please run:

Rscript GCP.R  --help

usage: GCP.R [-h] {fuzznuc,centeny,model1,model2,model3,reads_retrieval} ...

This tool handles the annotation of DNA motif (CENP-B box by default)
occurrences along the genome to characterize the chromosome-level organization

positional arguments:
  {fuzznuc,centeny,model1,model2,model3,reads_retrieval}
                        Available commands
    fuzznuc             Fuzznuc searches for a specific DNA motif within a
                        FASTA file. This step produces the input for all the
                        other commands
    centeny             Centeny generates a BED file highlighting the strand
                        of DNA motif occurrences
    model1              Model1 generates a BED file highlighting the
                        conservation of bp-distances with respect to CHM13
    model2              Model2 generates a BED file highlighting the
                        conservation of the motif organization with respect to
                        alpha-satellite monomers
    model3              Model3 handles the annotation of motif occurrences
                        along the genome by annotating their organization
                        using k-pattern
    reads_retrieval     Reads_retrieval isolates chromosome-specific
                        centromeric reads from a FASTA/FASTQ file by using
                        chromosome-specific kpatterns from model3

optional arguments:
  -h, --help            show this help message and exit

Fuzznuc

This is the first step of the pipeline, which creates the input file for all the other commands.

Rscript GCP.R fuzznuc --help 

usage: GCP.R fuzznuc [-h] -fa FILE [-seq SEQUENCE] [-o DIR_NAME]

optional arguments:
  -h, --help            show this help message and exit
  -fa FILE, --fasta_file FILE
                        FASTA file where to search for the DNA motif
  -seq SEQUENCE, --sequence SEQUENCE
                        DNA motif to search within the FASTA. By default it
                        searches for the CENP-B box [default
                        NTTCGNNNNANNCGGGN]
  -o DIR_NAME, --output_dir DIR_NAME
                        Path of the output directory [default .]

The fuzznuc module searches for a specific DNA motif within a nucleotide sequence (FASTA format required) and returns the coordinates of its occurrences. Depending on your goal, input sequences can be either a set of a sequencing reads (ONT or PacBio) or a FASTA file of a genome assembly.

Rscript GCP.R \
    fuzznuc \
    -fa ../test/chr17.t2t-chm13-v2.0.fa.gz \
    -seq NTTCGNNNNANNCGGGN \
    -o ../output

This script outputs a BED file (0-based [start], 1-based [end]) in the specified folder that contains the annotation of the DNA motif occurrences. It also outputs a tsv file with name and length of each chromosome/contigs/reads useful for the centeny plot.

Centeny

This step uses the annotation file from fuzznuc to produce a BED file with the orientation of the motifs highlighted in colour (blue to forward and red to complement reverse) and viewable on a genome browser such as IGV. Moreover, the code uses the R package karyoploteR to plot the length of the chromosome/contig/read and the relative position and orientation of DNA motif occurrences, define as centeny map when looking at CENP-B boxes. The centeny map is useful to evaluate the conservation of the position and orientation of CENP-B boxes either in the centromeres and chromosome arms.

Rscript GCP.R centeny --help

usage: GCP.R centeny [-h] -i FILE --length FILE [--n_box INT] [-o FILE]

optional arguments:
  -h, --help            show this help message and exit
  -i FILE, --input FILE
                        DNA motif annotation file, generated by fuzznuc module
  --length FILE         Input file with the length of each chr/contigs from
                        fuzznuc module
  --n_box INT           Filter out sequences with number of boxes lower than
                        n_box [default 50]
  -o FILE, --output_file FILE
                        Name of the output file [default centeny.bed]

The inputs for this step are the two files produced by fuzznuc module, which contain the annotation of motif occurrences and the length of each sequence of the FASTA file. The code allows users to filter out chromosomes/contigs/reads with less than 50 (by default) DNA motif occurrences.

Rscript GCP.R \
    centeny \
    --input ../output/chr17.t2t-chm13-v2.0.fa_NTTCGNNNNANNCGGGN.bed \
    --length ../output/chr17.t2t-chm13-v2.0.fa_NTTCGNNNNANNCGGGN_length.tsv \
    --n_box 50 \
    --output_file ../output/centeny.bed

The code generates a BED file and a plot of the centeny map for each retained chromosome/contig/read (centeny.tiff) in the same folder as the specified output.

Model1

Model1 generates a BED file highlighting the conservation of 42 distance values found in the CHM13 genome with a percentage of occurrence > 1% in at least one chromosome. The code generates a BED file, viewable on IGV, where only the distances found in CHM13 are highlighted with a specific color, while the other distance values are in black. It is useful to evaluate the conservation of those values within the user-provided genome.

Rscript GCP.R model1 --help

usage: GCP.R model1 [-h] -i FILE -d FILE [-o FILE]

optional arguments:
  -h, --help            show this help message and exit
  -i FILE, --input FILE
                        DNA motif annotation file, generated by fuzznuc module
  -d FILE, --distance FILE
                        Tab-separated file with distance values represented
                        more than 1 percent in at least one chromosome of the
                        CHM13 (distance column) and associated colors (color
                        column)
  -o FILE, --output_file FILE
                        Name of the output file [default model1.bed]

The inputs for this step are the BED file with the annotation of the DNA motif occurrences and a tab-separated file with distance values reported in the distance column and associated colors reported in the color column. The output is a BED file where the fourth column reports the distance in nucleotides (prefixed by "D") and the last column reports the associated color.

Rscript GCP.R \
    model1 \
    -i ../output/chr17.t2t-chm13-v2.0.fa_NTTCGNNNNANNCGGGN.bed \
    -d ../file/chm13_model1.txt \
    -o ../output/model1.bed

Model2

Model2 considers the relationship between two consecutive DNA motif occurrences and the number of alpha-satellite monomers between them.

The code divides each bp-distance value by 171 and subtracts 1 to remove the monomer in which the motif falls. Then, it rounds each value to the nearest whole number ensuring the result has no decimal places. The resulting number reflects the number of monomers without motif between two consecutive motif occurrences. It is useful to evaluate the chromosome-specific conservation of the motif organization with respect to alpha-satellite monomers.

Rscript GCP.R model2 --help

usage: GCP.R model2 [-h] -i FILE -d FILE [-o FILE]

optional arguments:
  -h, --help            show this help message and exit
  -i FILE, --input FILE
                        DNA motif annotation file, generated by fuzznuc module
  -d FILE, --distance FILE
                        Tab-separated file where the distance column contains
                        the number of monomers between two monomers with the
                        DNA motif and the color column reports the associated
                        color
  -o FILE, --output_file FILE
                        Name of the output file [default model2.bed]

This module takes as input the BED file with the annotation of the DNA motif occurrences and a tab-separated file that maps each distance value (expressed in monomer units) to a corresponding color. The output is a BED file where the fourth column reports the distance in monomers (prefixed by “M”) and the last column reports the associated color.

Rscript GCP.R \
    model2 \
    -i ../output/chr17.t2t-chm13-v2.0.fa_NTTCGNNNNANNCGGGN.bed \
    -d ../file/model2.txt \
    -o ../output/model2.bed

Model3

This module retrieves and annotates the possible combinations of consecutive distance values, named k-pattern (where k is equal to the number of consecutive distance values considered).

Rscript GCP.R model3 --help

usage: GCP.R model3 [-h] -i FILE [-k INT] [-fq INT] [-o FILE]

optional arguments:
  -h, --help            show this help message and exit
  -i FILE, --input FILE
                        DNA motif annotation file, generated by fuzznuc module
  -k INT, --klength INT
                        Number of distance values includeed in each pattern
                        [default 4]
  -fq INT, --frequency INT
                        Minimum frequency required to color a k-pattern
                        [default 5]
  -o FILE, --output_file FILE
                        Name of the output file [default model3.bed]

The input of model3 is the BED file generated by the fuzznuc module. In this step users must decide the number of boxes (default 4) of the patterns and which patterns to color (only those with frequency higher than 5 to avoid variants less represented that are colored in black). The module produces two output files. The first is a BED file where non-overlapping (step equal to k) blocks of consecutive k distance values are reported, along with the color associated to the k-pattern (k-patterns with a frequency lower than a user-specified threshold are colored in black). Visualization of this file using a genome browser allows to highlight chromosome-specific k-patterns. The second tab-separated file, created in the same folder as the output BED file (pattern_counts_sliding1_k[INT].txt, where INT is equal to the value of k), reports the frequency in each chromosome of all the k-patterns found with a sliding window step of 1. This last file is useful for the reads retrieval step.

Rscript GCP.R \
    model3 \
    -i ../output/chr17.t2t-chm13-v2.0.fa_NTTCGNNNNANNCGGGN.bed \
    -k 4 \
    -fq 5 \
    -o ../output/model3.bed

Reads retrieval

This step allows to isolate chromosome-specific centromeric reads based on their match with k-patterns found in the genome.

Rscript GCP.R reads_retrieval --help 

usage: GCP.R reads_retrieval [-h] -i FILE -p FILE -c CHR_NAME --n_box INT
                             [--n_pattern INT] [--n_matches INT] [-t INT]
                             [-o FILE]

optional arguments:
  -h, --help            show this help message and exit
  -i FILE, --input FILE
                        DNA motif annotation file, generated by applying the
                        fuzznuc module to long sequencing reads
  -p FILE, --pattern_table FILE
                        Input table from model3 which stores all the
                        k-patterns with a sliding step of 1
  -c CHR_NAME, --chr CHR_NAME
                        Chromosome of interest (names are stored in column 3
                        of the file in pattern_table)
  --n_box INT           Filter out reads with number of motif occurrences
                        lower than INT (this parameter must to be equal
                        to/higher than the number of distances in the
                        k-pattern)
  --n_pattern INT       Consider only k-patterns with frequency higher than
                        INT within the selected chromosome [default 10]
  --n_matches INT       Assign the read to the selected chromosome if it
                        contains at least INT considered k-patterns [default
                        1]
  -t INT, --n_cores INT
                        Number of threads [default 5]
  -o FILE, --output_file FILE
                        Name of the output file [default
                        chr_specific_reads.txt]

The inputs for this step are the BED file with the annotation of the DNA motif occurrences found within the reads, generated by fuzznuc module, and the list of chromosome-specific k-patterns (with sliding step of 1) from model3.

To convert FASTQ to FASTA in order to run fuzznuc we suggest to use seqtk:

seqtk seq -a <file.fastq> > <file.fasta>

The BED annotation file used in this tutorial was created by applying the fuzznuc module to ~33,000 ONT reads with a minimum length of 100kb.

Rscript GCP.R \
    reads_retrieval \
    -i ../test/ONT_100kb_2023_07_28_all_fasta_NTTCGNNNNANNCGGGN.bed \
    -p ../output/pattern_counts_sliding1_k4.txt \
    --chr chr17 \
    --n_box 4 \
    --n_pattern 10 \ 
    --n_matches 1 \
    --n_cores 5 \
    -o ../output/chr17_specific_reads.txt

The output of this step is a text file that contains IDs of the reads that were assigned to the chromosome of interest based on k-pattern matching. By using this file and seqtk it is possible to retrieve those reads from the FASTQ/FASTA:

seqtk subseq <file.fastq> chr17_specific_reads.txt > chr17_specific_reads.fastq

Author: Luca Corda
For private issues on the code: luca.corda@uniroma1.it

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GCP - tutorial

Install

Fuzznuc

Centeny

Model1

Model2

Model3

Reads retrieval

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

GCP - tutorial

Install

Fuzznuc

Centeny

Model1

Model2

Model3

Reads retrieval