Generating predictions on new data

Aayush Grover edited this page Apr 1, 2025 · 4 revisions

Step 1: ATAC-seq Preprocessing

The ATAC-seq signal must be normalized across cell types, using the GM12878 cell line as the reference.

  1. Ensure that the GM12878 ATAC-seq bigwig file is present in data/atac/raw.
  2. Enter the directory with the preprocessing scripts
    cd preprocessing/atac
    
  3. Run the normalization script
    • If you have bam files as input
      python normalize_atac.py -p ../../data/atac/raw/ --input_bam ../../data/atac/raw/<cell_line1>.bam ../../data/atac/raw/<cell_line2>.bam 
      
    • If you have bigwig and peak files as input
      python normalize_atac.py -p ../../data/atac/raw/ --input_bw ../../data/atac/raw/<cell_line1>.bigWig ../../data/atac/raw/<cell_line2>.bigWig --input_bed ../../data/atac/raw/<cell_line1>.bed ../../data/atac/raw/<cell_line2>.bed
      
    where <cell_line1> and <cell_line2> are the names of your cell lines/conditions.

This creates the normalized bigwig files data/atac/raw/<cell_line1>_normalized.bw and data/atac/raw/<cell_line2>_normalized.bw, and the deduplicated peak files data/atac/raw/<cell_line1>_dedup.bed and data/atac/raw/<cell_line2>_dedup.bed.

The example above uses two cell lines, but the script can be run for any number of cell lines.
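
For more than two cell lines, the command can be assembled programmatically. The following is a minimal sketch that only builds the argument list from the flags shown above (-p, --input_bam, --input_bw, --input_bed). The helper name build_normalize_command and the cell-line names are hypothetical, and it assumes every input file follows the <cell_line>.bam / <cell_line>.bigWig / <cell_line>.bed naming used in the example.

```python
def build_normalize_command(cell_lines, raw_dir="../../data/atac/raw", use_bam=True):
    """Build the normalize_atac.py argument list for any number of cell lines.

    Hypothetical helper: it only mirrors the flags shown in the commands above
    and assumes inputs are named <cell_line>.bam, <cell_line>.bigWig, <cell_line>.bed.
    """
    raw = raw_dir.rstrip("/")
    cmd = ["python", "normalize_atac.py", "-p", raw + "/"]
    if use_bam:
        # BAM input: one file per cell line after --input_bam
        cmd.append("--input_bam")
        cmd += [f"{raw}/{name}.bam" for name in cell_lines]
    else:
        # bigwig + peak input: both lists must be given in the same order
        cmd.append("--input_bw")
        cmd += [f"{raw}/{name}.bigWig" for name in cell_lines]
        cmd.append("--input_bed")
        cmd += [f"{raw}/{name}.bed" for name in cell_lines]
    return cmd


# Placeholder cell-line names, used for illustration only.
print(" ".join(build_normalize_command(["cell_a", "cell_b", "cell_c"])))
```

The returned list can be passed to subprocess.run(cmd, check=True) from inside preprocessing/atac once the printed command looks right.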


Step 2: Extract Genomic Features from Stage 1

  1. Create a new config file for your cell line or condition in Stage1/. See ./Stage1/ for more details.
  2. Store the genomic inputs for each cell line <cell_line>
    python ./Stage1/store_inputs.py --cell_line <cell_line>
    
    This stores parquet files containing DNA sequence, ATAC-seq, and mappability data in data/stage1_outputs/predict_<cell_line>/. By default, all chromosomes are used. To use a subset of chromosomes, list them under "chromosome: predict:" in ./Stage1/configs/datamodule/validation/cross_cell.yaml
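
Before moving on to Stage 2, it can help to confirm that the Stage 1 outputs exist for every cell line. The sketch below only checks for parquet files under data/stage1_outputs/predict_<cell_line>/. The helper names are hypothetical, and because the file names and any subdirectory layout inside that folder are not documented here, it searches recursively for anything ending in .parquet.

```python
from pathlib import Path


def stage1_output_dir(cell_line, root="data/stage1_outputs"):
    """Directory where store_inputs.py is described as writing its outputs."""
    return Path(root) / f"predict_{cell_line}"


def list_parquet_outputs(cell_line, root="data/stage1_outputs"):
    """Return all parquet files found for one cell line (empty list if none)."""
    out_dir = stage1_output_dir(cell_line, root)
    if not out_dir.is_dir():
        return []
    # The internal layout is an unknown, so search recursively.
    return sorted(out_dir.rglob("*.parquet"))


# Placeholder cell-line names, used for illustration only.
for name in ["cell_a", "cell_b"]:
    files = list_parquet_outputs(name)
    status = f"{len(files)} parquet file(s)" if files else "no outputs found"
    print(f"{name}: {status}")
```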

Step 3: Generate Hi-C Predictions from Stage 2

  1. Ensure that the atac_path (data/stage1_outputs/) in ./Stage2/configs/configs.yaml is correctly set. Then, for each cell line <cell_line>, run
    python ./Stage2/predict.py --config_dir ./Stage2/configs/configs.yaml --cell_line_predict <cell_line>
    
    To select a subset of chromosomes for prediction, use
    python ./Stage2/predict.py --config_dir ./Stage2/configs/configs.yaml --cell_line_predict <cell_line> --chroms_predict 2 6 19
    
    This generates ./results/<cell_line>/paper-hg38-map-concat-stage1024-rf-lrelu-eval-stg-newsplit-newdata-atac-var-beta-neg-s1337/results.npz which stores the following information:
    • chr (chromosome)
    • pos1 (position of ATAC-seq peak 1)
    • pos2 (position of ATAC-seq peak 2)
    • predictions (log Hi-C between peaks 1 and 2)
    • variance (aleatoric uncertainty associated with the prediction)
  2. To obtain epistemic uncertainty, repeat Step 2 for each of the ten model checkpoints and take the variance of the predictions across the runs (as described in ./Stage2/plot_scores.ipynb).
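
The aggregation across checkpoint runs can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the procedure from ./Stage2/plot_scores.ipynb: it assumes each run's results.npz has been saved to its own path, that every file lists the same peak pairs in the same order, and that the population variance (ddof=0) is the intended estimator. The function name ensemble_uncertainty is hypothetical. Only the array keys predictions and variance, listed above, are read.

```python
import numpy as np


def ensemble_uncertainty(result_paths):
    """Combine results.npz files from several checkpoint runs.

    Assumes all files contain the same peak pairs in the same order, so that
    predictions can be compared element by element across runs.
    """
    preds, aleatoric = [], []
    for path in result_paths:
        with np.load(path) as data:
            preds.append(np.asarray(data["predictions"], dtype=float))
            aleatoric.append(np.asarray(data["variance"], dtype=float))
    preds = np.stack(preds)          # shape: (n_runs, n_pairs)
    aleatoric = np.stack(aleatoric)  # shape: (n_runs, n_pairs)
    return {
        "mean_prediction": preds.mean(axis=0),
        # Spread of the predictions across checkpoints (ddof=0 is an assumption).
        "epistemic_variance": preds.var(axis=0),
        # Average of the per-run predicted (aleatoric) variances.
        "mean_aleatoric_variance": aleatoric.mean(axis=0),
    }
```

Calling ensemble_uncertainty with a list of the per-checkpoint results.npz paths returns one value per peak pair for each of the three quantities. The chr, pos1, and pos2 arrays from any single file can then be used to label the rows.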