To install and run GCP_Centeny, please clone the repository and then create and activate the GCP conda environment as follows:
git clone https://github.com/GiuntaLab/GCP-Centeny.git
cd GCP-Centeny/
#Creates GCP environment
conda env create -f environment.yml
conda activate GCP
All the scripts to run the main program (GCP.R) are in the folder named R/.
cd R/
To print all the available GCP commands, please run:
Rscript GCP.R --help
usage: GCP.R [-h] {fuzznuc,centeny,model1,model2,model3,reads_retrieval} ...
This tool handles the annotation of DNA motif (CENP-B box by default)
occurrences along the genome to characterize the chromosome-level organization
positional arguments:
{fuzznuc,centeny,model1,model2,model3,reads_retrieval}
Available commands
fuzznuc Fuzznuc searches for a specific DNA motif within a
FASTA file. This step produces the input for all the
other commands
centeny Centeny generates a BED file highlighting the strand
of DNA motif occurrences
model1 Model1 generates a BED file highlighting the
conservation of bp-distances with respect to CHM13
model2 Model2 generates a BED file highlighting the
conservation of the motif organization with respect to
alpha-satellite monomers
model3 Model3 handles the annotation of motif occurrences
along the genome by annotating their organization
using k-pattern
reads_retrieval Reads_retrieval isolates chromosome-specific
centromeric reads from a FASTA/FASTQ file by using
chromosome-specific kpatterns from model3
optional arguments:
-h, --help show this help message and exit
This is the first step of the pipeline, which creates the input file for all the other commands.
Rscript GCP.R fuzznuc --help
usage: GCP.R fuzznuc [-h] -fa FILE [-seq SEQUENCE] [-o DIR_NAME]
optional arguments:
-h, --help show this help message and exit
-fa FILE, --fasta_file FILE
FASTA file where to search for the DNA motif
-seq SEQUENCE, --sequence SEQUENCE
DNA motif to search within the FASTA. By default it
searches for the CENP-B box [default
NTTCGNNNNANNCGGGN]
-o DIR_NAME, --output_dir DIR_NAME
Path of the output directory [default .]
The fuzznuc module searches for a specific DNA motif within a nucleotide sequence (FASTA format required) and returns the coordinates of its occurrences. Depending on your goal, input sequences can be either a set of a sequencing reads (ONT or PacBio) or a FASTA file of a genome assembly.
Rscript GCP.R \
fuzznuc \
-fa ../test/chr17.t2t-chm13-v2.0.fa.gz \
-seq NTTCGNNNNANNCGGGN \
-o ../output
This script outputs a BED file (0-based [start], 1-based [end]) in the specified folder that contains the annotation of the DNA motif occurrences. It also outputs a tsv file with name and length of each chromosome/contigs/reads useful for the centeny plot.
This step uses the annotation file from fuzznuc to produce a BED file with the orientation of the motifs highlighted in colour (blue to forward and red to complement reverse) and viewable on a genome browser such as IGV. Moreover, the code uses the R package karyoploteR to plot the length of the chromosome/contig/read and the relative position and orientation of DNA motif occurrences, define as centeny map when looking at CENP-B boxes. The centeny map is useful to evaluate the conservation of the position and orientation of CENP-B boxes either in the centromeres and chromosome arms.
Rscript GCP.R centeny --help
usage: GCP.R centeny [-h] -i FILE --length FILE [--n_box INT] [-o FILE]
optional arguments:
-h, --help show this help message and exit
-i FILE, --input FILE
DNA motif annotation file, generated by fuzznuc module
--length FILE Input file with the length of each chr/contigs from
fuzznuc module
--n_box INT Filter out sequences with number of boxes lower than
n_box [default 50]
-o FILE, --output_file FILE
Name of the output file [default centeny.bed]
The inputs for this step are the two files produced by fuzznuc module, which contain the annotation of motif occurrences and the length of each sequence of the FASTA file. The code allows users to filter out chromosomes/contigs/reads with less than 50 (by default) DNA motif occurrences.
Rscript GCP.R \
centeny \
--input ../output/chr17.t2t-chm13-v2.0.fa_NTTCGNNNNANNCGGGN.bed \
--length ../output/chr17.t2t-chm13-v2.0.fa_NTTCGNNNNANNCGGGN_length.tsv \
--n_box 50 \
--output_file ../output/centeny.bed
The code generates a BED file and a plot of the centeny map for each retained chromosome/contig/read (centeny.tiff) in the same folder as the specified output.
Model1 generates a BED file highlighting the conservation of 42 distance values found in the CHM13 genome with a percentage of occurrence > 1% in at least one chromosome. The code generates a BED file, viewable on IGV, where only the distances found in CHM13 are highlighted with a specific color, while the other distance values are in black. It is useful to evaluate the conservation of those values within the user-provided genome.
Rscript GCP.R model1 --help
usage: GCP.R model1 [-h] -i FILE -d FILE [-o FILE]
optional arguments:
-h, --help show this help message and exit
-i FILE, --input FILE
DNA motif annotation file, generated by fuzznuc module
-d FILE, --distance FILE
Tab-separated file with distance values represented
more than 1 percent in at least one chromosome of the
CHM13 (distance column) and associated colors (color
column)
-o FILE, --output_file FILE
Name of the output file [default model1.bed]
The inputs for this step are the BED file with the annotation of the DNA motif occurrences and a tab-separated file with distance values reported in the distance column and associated colors reported in the color column. The output is a BED file where the fourth column reports the distance in nucleotides (prefixed by "D") and the last column reports the associated color.
Rscript GCP.R \
model1 \
-i ../output/chr17.t2t-chm13-v2.0.fa_NTTCGNNNNANNCGGGN.bed \
-d ../file/chm13_model1.txt \
-o ../output/model1.bed
Model2 considers the relationship between two consecutive DNA motif occurrences and the number of alpha-satellite monomers between them.
The code divides each bp-distance value by 171 and subtracts 1 to remove the monomer in which the motif falls. Then, it rounds each value to the nearest whole number ensuring the result has no decimal places. The resulting number reflects the number of monomers without motif between two consecutive motif occurrences. It is useful to evaluate the chromosome-specific conservation of the motif organization with respect to alpha-satellite monomers.
Rscript GCP.R model2 --help
usage: GCP.R model2 [-h] -i FILE -d FILE [-o FILE]
optional arguments:
-h, --help show this help message and exit
-i FILE, --input FILE
DNA motif annotation file, generated by fuzznuc module
-d FILE, --distance FILE
Tab-separated file where the distance column contains
the number of monomers between two monomers with the
DNA motif and the color column reports the associated
color
-o FILE, --output_file FILE
Name of the output file [default model2.bed]
This module takes as input the BED file with the annotation of the DNA motif occurrences and a tab-separated file that maps each distance value (expressed in monomer units) to a corresponding color. The output is a BED file where the fourth column reports the distance in monomers (prefixed by “M”) and the last column reports the associated color.
Rscript GCP.R \
model2 \
-i ../output/chr17.t2t-chm13-v2.0.fa_NTTCGNNNNANNCGGGN.bed \
-d ../file/model2.txt \
-o ../output/model2.bed
This module retrieves and annotates the possible combinations of consecutive distance values, named k-pattern (where k is equal to the number of consecutive distance values considered).
Rscript GCP.R model3 --help
usage: GCP.R model3 [-h] -i FILE [-k INT] [-fq INT] [-o FILE]
optional arguments:
-h, --help show this help message and exit
-i FILE, --input FILE
DNA motif annotation file, generated by fuzznuc module
-k INT, --klength INT
Number of distance values includeed in each pattern
[default 4]
-fq INT, --frequency INT
Minimum frequency required to color a k-pattern
[default 5]
-o FILE, --output_file FILE
Name of the output file [default model3.bed]
The input of model3 is the BED file generated by the fuzznuc module. In this step users must decide the number of boxes (default 4) of the patterns and which patterns to color (only those with frequency higher than 5 to avoid variants less represented that are colored in black). The module produces two output files. The first is a BED file where non-overlapping (step equal to k) blocks of consecutive k distance values are reported, along with the color associated to the k-pattern (k-patterns with a frequency lower than a user-specified threshold are colored in black). Visualization of this file using a genome browser allows to highlight chromosome-specific k-patterns. The second tab-separated file, created in the same folder as the output BED file (pattern_counts_sliding1_k[INT].txt, where INT is equal to the value of k), reports the frequency in each chromosome of all the k-patterns found with a sliding window step of 1. This last file is useful for the reads retrieval step.
Rscript GCP.R \
model3 \
-i ../output/chr17.t2t-chm13-v2.0.fa_NTTCGNNNNANNCGGGN.bed \
-k 4 \
-fq 5 \
-o ../output/model3.bed
This step allows to isolate chromosome-specific centromeric reads based on their match with k-patterns found in the genome.
Rscript GCP.R reads_retrieval --help
usage: GCP.R reads_retrieval [-h] -i FILE -p FILE -c CHR_NAME --n_box INT
[--n_pattern INT] [--n_matches INT] [-t INT]
[-o FILE]
optional arguments:
-h, --help show this help message and exit
-i FILE, --input FILE
DNA motif annotation file, generated by applying the
fuzznuc module to long sequencing reads
-p FILE, --pattern_table FILE
Input table from model3 which stores all the
k-patterns with a sliding step of 1
-c CHR_NAME, --chr CHR_NAME
Chromosome of interest (names are stored in column 3
of the file in pattern_table)
--n_box INT Filter out reads with number of motif occurrences
lower than INT (this parameter must to be equal
to/higher than the number of distances in the
k-pattern)
--n_pattern INT Consider only k-patterns with frequency higher than
INT within the selected chromosome [default 10]
--n_matches INT Assign the read to the selected chromosome if it
contains at least INT considered k-patterns [default
1]
-t INT, --n_cores INT
Number of threads [default 5]
-o FILE, --output_file FILE
Name of the output file [default
chr_specific_reads.txt]
The inputs for this step are the BED file with the annotation of the DNA motif occurrences found within the reads, generated by fuzznuc module, and the list of chromosome-specific k-patterns (with sliding step of 1) from model3.
To convert FASTQ to FASTA in order to run fuzznuc we suggest to use seqtk:
seqtk seq -a <file.fastq> > <file.fasta>
The BED annotation file used in this tutorial was created by applying the fuzznuc module to ~33,000 ONT reads with a minimum length of 100kb.
Rscript GCP.R \
reads_retrieval \
-i ../test/ONT_100kb_2023_07_28_all_fasta_NTTCGNNNNANNCGGGN.bed \
-p ../output/pattern_counts_sliding1_k4.txt \
--chr chr17 \
--n_box 4 \
--n_pattern 10 \
--n_matches 1 \
--n_cores 5 \
-o ../output/chr17_specific_reads.txt
The output of this step is a text file that contains IDs of the reads that were assigned to the chromosome of interest based on k-pattern matching. By using this file and seqtk it is possible to retrieve those reads from the FASTQ/FASTA:
seqtk subseq <file.fastq> chr17_specific_reads.txt > chr17_specific_reads.fastq
Author: Luca Corda
For private issues on the code: luca.corda@uniroma1.it