Generating predictions on new data

Step 1: ATAC-seq Preprocessing

ATAC-seq signals are normalized across cell types, using the GM12878 cell line as the reference.

Ensure that the GM12878 ATAC-seq bigWig file is present in data/atac/raw.
Enter the directory containing the preprocessing scripts:
```
cd preprocessing/atac
```

Run the normalization script.

If you have BAM files as input:
```
python normalize_atac.py -p ../../data/atac/raw/ --input_bam ../../data/atac/raw/<cell_line1>.bam ../../data/atac/raw/<cell_line2>.bam
```

If you have bigWig and peak files as input:
```
python normalize_atac.py -p ../../data/atac/raw/ --input_bw ../../data/atac/raw/<cell_line1>.bigWig ../../data/atac/raw/<cell_line2>.bigWig --input_bed ../../data/atac/raw/<cell_line1>.bed ../../data/atac/raw/<cell_line2>.bed
```

where <cell_line1> and <cell_line2> are the names of your cell lines/conditions.

This creates the normalized bigWig files data/atac/raw/<cell_line1>_normalized.bw and data/atac/raw/<cell_line2>_normalized.bw, and the deduplicated peak files data/atac/raw/<cell_line1>_dedup.bed and data/atac/raw/<cell_line2>_dedup.bed.

The example above uses two cell lines, but the script can be run for any number of cell lines.
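
Since the script takes any number of inputs, the command can be assembled programmatically. The following is a minimal sketch, assuming only the flags shown above (-p, --input_bam, --input_bw, --input_bed) and input files named after each cell line. The helper name build_normalize_command and the cell line names in the example are hypothetical and not part of the repository.

```python
from pathlib import Path


def build_normalize_command(cell_lines, raw_dir="../../data/atac/raw", use_bam=True):
    """Assemble the normalize_atac.py argument list for any number of cell lines.

    Hypothetical helper: it only builds the command shown above, it does not run it.
    """
    if not cell_lines:
        raise ValueError("at least one cell line is required")

    raw = Path(raw_dir)
    cmd = ["python", "normalize_atac.py", "-p", f"{raw.as_posix()}/"]

    if use_bam:
        # One BAM file per cell line.
        cmd.append("--input_bam")
        cmd += [(raw / f"{name}.bam").as_posix() for name in cell_lines]
    else:
        # One bigWig and one peak (BED) file per cell line, in matching order.
        cmd.append("--input_bw")
        cmd += [(raw / f"{name}.bigWig").as_posix() for name in cell_lines]
        cmd.append("--input_bed")
        cmd += [(raw / f"{name}.bed").as_posix() for name in cell_lines]
    return cmd


if __name__ == "__main__":
    # Example cell line names are placeholders.
    print(" ".join(build_normalize_command(["cell_a", "cell_b", "cell_c"])))
```

The returned list can be passed to subprocess.run(cmd, check=True) from within preprocessing/atac, which avoids shell quoting issues with unusual file names.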

Step 2: Extract Genomic Features from Stage 1

Create a new config file for your cell line or condition in Stage1/. See ./Stage1/ for more details.
Store the genomic inputs for each cell line <cell_line>:
```
python ./Stage1/store_inputs.py --cell_line <cell_line>
```
This stores parquet files containing DNA-sequence, ATAC-seq, and mappability data in data/stage1_outputs/predict_<cell_line>/. By default, all chromosomes are used. To use a subset of chromosomes, list them under "chromosome: predict:" in ./Stage1/configs/datamodule/validation/cross_cell.yaml.
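
Before moving on to Stage 2, it can help to confirm that the Stage 1 outputs were actually written. The following is a minimal sketch, assuming the files carry a .parquet extension somewhere under data/stage1_outputs/predict_<cell_line>/. The helper name list_stage1_outputs is hypothetical, and the exact file names and layout inside the directory are not specified here.

```python
from pathlib import Path


def list_stage1_outputs(cell_line, root="data/stage1_outputs"):
    """Return the sorted parquet files found for one cell line.

    Hypothetical helper: assumes outputs end in '.parquet' and live under
    <root>/predict_<cell_line>/ (possibly in subdirectories).
    """
    out_dir = Path(root) / f"predict_{cell_line}"
    if not out_dir.is_dir():
        raise FileNotFoundError(f"missing Stage 1 output directory: {out_dir}")

    files = sorted(out_dir.rglob("*.parquet"))
    if not files:
        raise FileNotFoundError(f"no parquet files found in {out_dir}")
    return files
```

A call such as list_stage1_outputs("<cell_line>") raises early if store_inputs.py did not produce any output, rather than letting the failure surface later during prediction.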

Step 3: Generate Hi-C Predictions from Stage 2

Ensure that the atac_path (data/stage1_outputs/) in ./Stage2/configs/configs.yaml is correctly set. Then, for each cell line <cell_line>, run
```
python ./Stage2/predict.py --config_dir ./Stage2/configs/configs.yaml --cell_line_predict <cell_line>
```
To select a subset of chromosomes for prediction, use
```
python ./Stage2/predict.py --config_dir ./Stage2/configs/configs.yaml --cell_line_predict <cell_line> --chroms_predict 2 6 19
```
This generates ./results/<cell_line>/paper-hg38-map-concat-stage1024-rf-lrelu-eval-stg-newsplit-newdata-atac-var-beta-neg-s1337/results.npz, which stores the following information:
- chr (chromosome)
- pos1 (position of ATAC-seq peak 1)
- pos2 (position of ATAC-seq peak 2)
- predictions (log Hi-C between peaks 1 and 2)
- variance (aleatoric uncertainty associated with the prediction)

To obtain the epistemic uncertainty, repeat Step 2 for each of the ten model checkpoints and take the variance of the predictions across the runs (as described in ./Stage2/plot_scores.ipynb).
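
The variance across runs can be computed with NumPy once each checkpoint has produced its own results.npz. The following is a minimal sketch, not the procedure from ./Stage2/plot_scores.ipynb: it assumes the keys listed above, that pos1 and pos2 are numeric arrays, and that every run lists the same peak pairs in the same order. The helper name combine_runs and the returned dictionary keys are hypothetical.

```python
import numpy as np


def combine_runs(paths):
    """Combine predictions from several results.npz files (one per checkpoint).

    Hypothetical helper: assumes all files contain the same peak pairs
    in the same order, which is checked via pos1 and pos2.
    """
    preds, alea = [], []
    ref_pos1 = ref_pos2 = None

    for path in paths:
        with np.load(path) as data:
            pos1, pos2 = data["pos1"], data["pos2"]
            if ref_pos1 is None:
                ref_pos1, ref_pos2 = pos1, pos2
            elif not (np.array_equal(pos1, ref_pos1) and np.array_equal(pos2, ref_pos2)):
                raise ValueError(f"peak pairs in {path} do not match the first run")
            preds.append(data["predictions"])
            alea.append(data["variance"])

    preds = np.stack(preds)  # shape: (n_runs, n_pairs)
    alea = np.stack(alea)
    return {
        "mean_prediction": preds.mean(axis=0),
        # Spread of the predictions across checkpoints.
        "epistemic_variance": preds.var(axis=0),
        # Average of the per-run aleatoric variances.
        "mean_aleatoric_variance": alea.mean(axis=0),
    }
```

Here paths would be the list of results.npz files from the individual checkpoint runs. np.var uses the population variance (ddof=0) by default; pass ddof=1 to preds.var if the sample variance is preferred.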
