Generating predictions on new data
Aayush Grover edited this page Apr 1, 2025
·
4 revisions
Before generating predictions, the ATAC-seq signal must be normalized across cell types, using the GM12878 cell line as the reference.
- Ensure that GM12878 ATAC-seq bigwig is present in data/atac/raw.
- Enter the directory with preprocessing scripts
    cd preprocessing/atac
- Run the normalization script
  - If you have bam files as input
      python normalize_atac.py -p ../../data/atac/raw/ --input_bam ../../data/atac/raw/<cell_line1>.bam ../../data/atac/raw/<cell_line2>.bam
  - If you have bigwig and peak files as input
      python normalize_atac.py -p ../../data/atac/raw/ --input_bw ../../data/atac/raw/<cell_line1>.bigWig ../../data/atac/raw/<cell_line2>.bigWig --input_bed ../../data/atac/raw/<cell_line1>.bed ../../data/atac/raw/<cell_line2>.bed
This will create the normalized bigwig files data/atac/raw/<cell_line1>_normalized.bw, data/atac/raw/<cell_line2>_normalized.bw and deduplicated peak files data/atac/raw/<cell_line1>_dedup.bed, data/atac/raw/<cell_line2>_dedup.bed.
While the above example shows how to run the script when you have two cell lines, the script can be run for any number of cell lines.
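Because the script accepts any number of cell lines, the command line can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_normalize_command` is a hypothetical helper, and it assumes the input files are named `<cell_line>.bam`, `<cell_line>.bigWig`, and `<cell_line>.bed` as in the example above. It uses only the flags shown above.

```python
from pathlib import PurePosixPath


def build_normalize_command(raw_dir, cell_lines, from_bam=True):
    """Build the argument list for normalize_atac.py (hypothetical helper).

    Assumes inputs are named <cell_line>.bam, or <cell_line>.bigWig plus
    <cell_line>.bed, inside raw_dir, mirroring the two-cell-line example.
    """
    if not cell_lines:
        raise ValueError("at least one cell line is required")
    raw = PurePosixPath(raw_dir)
    cmd = ["python", "normalize_atac.py", "-p", f"{raw}/"]
    if from_bam:
        cmd.append("--input_bam")
        cmd += [str(raw / f"{name}.bam") for name in cell_lines]
    else:
        cmd.append("--input_bw")
        cmd += [str(raw / f"{name}.bigWig") for name in cell_lines]
        cmd.append("--input_bed")
        cmd += [str(raw / f"{name}.bed") for name in cell_lines]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, cwd="preprocessing/atac", check=True)` from the repository root, which matches the `cd preprocessing/atac` step above.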
- Create a new config file for your cell line or condition in Stage1/. See ./Stage1/ for more details.
- Store the genomic inputs for each cell line <cell_line>
    python ./Stage1/store_inputs.py --cell_line <cell_line>
  This will store parquet files containing DNA-sequence, ATAC-seq, and mappability data at data/stage1_outputs/predict_<cell_line>/. By default, all chromosomes will be used. To use a subset of chromosomes, mention the chromosomes under "chromosome: predict:" in ./Stage1/configs/datamodule/validation/cross_cell.yaml.
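Before moving on, it can help to confirm that the Stage 1 outputs exist. The sketch below is only an illustration: `stage1_output_dir` and `list_parquet_files` are hypothetical helpers, and the assumption that the stored files carry a `.parquet` extension (possibly in subdirectories) is not confirmed by the text above.

```python
from pathlib import Path


def stage1_output_dir(cell_line, root="data/stage1_outputs"):
    """Return the Stage 1 output directory for one cell line."""
    return Path(root) / f"predict_{cell_line}"


def list_parquet_files(cell_line, root="data/stage1_outputs"):
    """List parquet files under the Stage 1 output directory.

    Assumes the files end in ".parquet"; searches subdirectories too.
    """
    out_dir = stage1_output_dir(cell_line, root)
    if not out_dir.is_dir():
        raise FileNotFoundError(f"missing Stage 1 output directory: {out_dir}")
    return sorted(out_dir.rglob("*.parquet"))
```

An empty list for a cell line suggests that `store_inputs.py` has not been run for it, or that the files use a different naming scheme than assumed here.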
- Ensure that the atac_path (data/stage1_outputs/) in ./Stage2/configs/configs.yaml is correctly set. Then, for each cell line <cell_line>, run
    python ./Stage2/predict.py --config_dir ./Stage2/configs/configs.yaml --cell_line_predict <cell_line>
  To select a subset of chromosomes for prediction, use
    python ./Stage2/predict.py --config_dir ./Stage2/configs/configs.yaml --cell_line_predict <cell_line> --chroms_predict 2 6 19
  This generates ./results/<cell_line>/paper-hg38-map-concat-stage1024-rf-lrelu-eval-stg-newsplit-newdata-atac-var-beta-neg-s1337/results.npz which stores the following information:
  - chr (chromosome)
  - pos1 (position of ATAC-seq peak 1)
  - pos2 (position of ATAC-seq peak 2)
  - predictions (log Hi-C between peaks 1 and 2)
  - variance (aleatoric uncertainty associated with the prediction)
- To obtain epistemic uncertainty, repeat Step 2 for each of the ten model checkpoints and take the variance of the predictions across the runs (as described in ./Stage2/plot_scores.ipynb).
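The results file and the across-checkpoint variance can be handled with NumPy. The following is a minimal sketch rather than the notebook's code: `load_results` and `epistemic_variance` are hypothetical helpers, the five keys are taken from the list above, and it assumes every run stores its rows in the same order (same chr, pos1, pos2) and that the arrays load without pickling. The population variance (`ddof=0`) is an assumption; see ./Stage2/plot_scores.ipynb for the authors' procedure.

```python
import numpy as np

# Keys listed above as the contents of results.npz.
RESULT_KEYS = ("chr", "pos1", "pos2", "predictions", "variance")


def load_results(path):
    """Load one results.npz into a dict of arrays.

    Assumes the arrays are plain (non-object) dtypes, so NumPy's default
    allow_pickle=False is sufficient.
    """
    with np.load(path) as data:
        return {key: data[key] for key in RESULT_KEYS}


def epistemic_variance(prediction_runs):
    """Variance of predictions across runs, one value per peak pair.

    prediction_runs: sequence of 1-D arrays of equal length, one per
    checkpoint, with rows in the same order. Uses ddof=0 (an assumption).
    """
    stacked = np.stack([np.asarray(run, dtype=float) for run in prediction_runs])
    return stacked.var(axis=0)
```

With one results file saved per checkpoint, the epistemic uncertainty would then be `epistemic_variance([load_results(p)["predictions"] for p in paths])`, where `paths` is a user-supplied list of those files.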