Commit ca2fc5e

Merge pull request #154 from ncsa/144-break-uprecombine-larger-genomes
Parallelization - 144 break uprecombine larger genomes
2 parents fb83403 + cd76ec3 commit ca2fc5e

12 files changed

Lines changed: 1093 additions & 15 deletions

README.md

Lines changed: 52 additions & 15 deletions
@@ -21,7 +21,7 @@ Table of Contents
 * [neat-genreads](#neat-genreads)
 * [Table of Contents](#table-of-contents)
 * [Requirements](#requirements)
-* [Installation] (#installation)
+* [Installation](#installation)
 * [Usage](#usage)
 * [Functionality](#functionality)
 * [Examples](#examples)
@@ -32,18 +32,19 @@ Table of Contents
 * [Large single end reads](#large-single-end-reads)
 * [Parallelizing simulation](#parallelizing-simulation)
 * [Utilities](#utilities)
+* [Parallelization](#parallelization)
 * [model_fragment_lengths](#modelfraglen)
 * [gen_mut_model](#genmutmodel)
 * [model_sequencing_error](#modelseqerror)
-* [Note on Sensitive Patient Data](#note-on-sensitive-patient-data)
-
+* [Note on Sensitive Patient Data](#note-on-sensitive-patient-data)
 
 ## Requirements (the most up-to-date requirements are found in the environment.yml file)
 
 * Some version of Anaconda to set up the environment
 * Python == 3.10.*
 * poetry == 1.3.*
 * biopython == 1.79
+* samtools == 1.20
 * pkginfo
 * matplotlib
 * numpy
@@ -103,6 +104,8 @@ A config file is required. The config is a yml file specifying the input paramet
 description of the potential inputs in the config file. See NEAT/config_template/template_neat_config.yml for a
 template config file to copy and use for your runs.
 
+To run the simulator in parallel with the same config file and significantly speed up runtime, please see the [Parallelization](#parallelization) section.
+
 reference: full path to a fasta file to generate reads from
 read_len: The length of the reads for the fastq (if using). Integer value, default 101.
 coverage: desired coverage value. Float or int, default = 10
@@ -283,6 +286,51 @@ neat read-simulator \
 # Utilities
 Several scripts are distributed with gen_reads that are used to generate the models used for simulation.
 
+## neat parallel
+
+Runs NEAT’s read simulator across a split reference (by contig or by fixed chunk size), in parallel, and stitches the outputs into final FASTQ/BAM/VCF.
+
+### Commands:
+
+Minimal: all settings come from a single YAML config
+```
+neat parallel -c /path/to/config.yml
+```
+
+Override or supplement a few options on the CLI
+```
+neat parallel -c /path/to/config.yml \
+--outdir run1 --by size --size 500000 --jobs 8
+```
+
+neat parallel reads the same config you use for neat read-simulator and also looks for these parallelization keys at the top level:
+
+```
+# required unless you pass --outdir on the CLI
+outdir: /absolute/or/relative/path/for/this_run
+
+# stitched outputs live under outdir; relative values are resolved under outdir
+final_prefix: stitched/final # default if omitted: stitched/final
+
+# how to split the reference (size recommended)
+by: contig # values: contig | size
+size: 1000000 # used only when by: size
+
+# parallel execution
+jobs: 8 # default: CPU count
+
+# how to invoke the simulator
+neat_cmd: neat read-simulator # default
+
+# external tool for stitching BAMs
+samtools: samtools # default, must be on PATH
+
+# organization
+cleanup_splits: false # delete outdir/splits after stitch
+reuse_splits: false # reuse existing splits if present
+```
+
+
 ## neat model-fraglen
 
 Computes empirical fragment length distribution from sample data.
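
Reviewer note on the hunk above: the README text implies a precedence order in which a value passed on the CLI overrides the same key in the config file, and a documented default applies when neither is set. A minimal sketch of that merge follows. The helper `resolve_settings` and the `DEFAULTS` table are hypothetical illustrations, not NEAT code; the default values are the ones stated in this commit's README and config template.

```python
import os
from typing import Any, Mapping

# Parallelization keys listed in the README hunk.
PARALLEL_KEYS = (
    "outdir", "final_prefix", "by", "size", "jobs",
    "neat_cmd", "samtools", "cleanup_splits", "reuse_splits",
)

# Defaults as documented in this commit (assumption: no others apply).
DEFAULTS = {
    "final_prefix": "stitched/final",
    "by": "contig",
    "jobs": os.cpu_count() or 1,
    "neat_cmd": "neat read-simulator",
    "samtools": "samtools",
    "cleanup_splits": False,
    "reuse_splits": False,
}


def resolve_settings(config: Mapping[str, Any], cli: Mapping[str, Any]) -> dict:
    """Hypothetical helper: CLI value wins, then config value, then default."""
    resolved = {}
    for key in PARALLEL_KEYS:
        if cli.get(key) is not None:        # flag was given on the CLI
            resolved[key] = cli[key]
        elif config.get(key) is not None:   # key present in the YAML config
            resolved[key] = config[key]
        elif key in DEFAULTS:               # fall back to documented default
            resolved[key] = DEFAULTS[key]
    return resolved
```

With a config containing `by: contig` and `jobs: 4`, passing `--by size --size 500000` on the CLI would, under this sketch, yield `by == "size"`, `size == 500000`, and `jobs == 4`.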
@@ -344,17 +392,6 @@ neat model-seq-err \
 
 Please note that -i2 can be used in place of -i to produce paired data.
 
-## neat plot_mutation_model
-
-Performs plotting and comparison of mutation models generated from genMutModel.py (Not yet implemented in NEAT 4.0).
-
-```
-neat plot_mutation_model \
--i model1.pickle.gz [model2.pickle.gz] [model3.pickle.gz]... \
--l legend_label1 [legend_label2] [legend_label3]... \
--o path/to/pdf_plot_prefix
-```
-
 ## neat vcf_compare
 
 Tool for comparing VCF files (Not yet implemented in NEAT 4.0).
@@ -380,4 +417,4 @@ neat vcf_compare
 Mappability track examples: https://github.com/zstephens/neat-repeat/tree/master/example_mappabilityTracks
 
 ### Note on Sensitive Patient Data
-ICGC's "Access Controlled Data" documentation can be found at <a href = https://docs.icgc.org/portal/access/ target="_blank">https://docs.icgc.org/portal/access/</a>. To have access to controlled germline data, a DACO must be submitted. Open tier data can be obtained without a DACO, but germline alleles that do not match the reference genome are masked and replaced with the reference allele. Controlled data includes unmasked germline alleles.
+ICGC's "Access Controlled Data" documentation can be found at <a href = https://docs.icgc.org/portal/access/ target="_blank">https://docs.icgc.org/portal/access/</a>. To have access to controlled germline data, a DACO must be submitted. Open tier data can be obtained without a DACO, but germline alleles that do not match the reference genome are masked and replaced with the reference allele. Controlled data includes unmasked germline alleles.

config_template/simple_template.yml

Lines changed: 9 additions & 0 deletions
@@ -26,3 +26,12 @@ rng_seed: .
 min_mutations: .
 overwrite_output: .
 
+outdir: .
+final_prefix: .
+by: .
+size: .
+jobs: .
+neat_cmd: .
+samtools: .
+cleanup_splits: .
+reuse_splits: .
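
Reviewer note: the template keys above are all set to `.`, which the fuller template in this commit describes as equivalent to omitting `outdir`. The same commit documents that an unset `outdir` defaults to `<cwd>/<config_stem>_parallel`, that a relative `outdir` is interpreted against the current working directory, and that a relative `final_prefix` is resolved under `outdir`. A minimal sketch of those rules, assuming hypothetical helpers `unset_to_none` and `resolve_paths` (neither is shown in this diff, and the sketch assumes `.` means "unset" for `final_prefix` as well):

```python
from pathlib import Path
from typing import Any, Optional


def unset_to_none(value: Any) -> Optional[Any]:
    """Treat a missing key or the '.' placeholder as 'not set'."""
    return None if value is None or value == "." else value


def resolve_paths(
    config_path: Path, outdir: Any, final_prefix: Any, cwd: Path
) -> tuple[Path, Path]:
    """Return (outdir, final_prefix) following the documented rules."""
    outdir = unset_to_none(outdir)
    final_prefix = unset_to_none(final_prefix)

    if outdir is None:
        # Documented default: <cwd>/<config_stem>_parallel
        out = cwd / f"{config_path.stem}_parallel"
    else:
        out = Path(outdir)
        if not out.is_absolute():
            out = cwd / out  # relative outdir is taken against the cwd

    # Documented default prefix: stitched/final
    prefix = Path(final_prefix) if final_prefix is not None else Path("stitched/final")
    if not prefix.is_absolute():
        prefix = out / prefix  # relative prefix is resolved under outdir
    return out, prefix
```

For a config named `my_run.yml` with both keys left at `.`, this sketch places stitched outputs at `<cwd>/my_run_parallel/stitched/final*`.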

config_template/template_neat_config.yml

Lines changed: 40 additions & 0 deletions
@@ -140,3 +140,43 @@ min_mutations: .
 # type: bool | required = no | default = false
 overwrite_output: .
 
+# Top-level output directory for splits, per-chunk outputs, and stitched results.
+# Relative paths are interpreted against the CURRENT WORKING DIRECTORY.
+# If omitted (or set to .), it defaults to: <cwd>/<config_stem>_parallel
+# type = string | required: no
+outdir: .
+
+# Location (prefix, no extension) for stitched outputs.
+# If relative, it is resolved under outdir (i.e., <outdir>/<final_prefix>*).
+# Default is "stitched/final".
+# type = string | required: no | default = stitched/final
+final_prefix: .
+
+# How to split the input reference for parallelization
+# type = string | required: no | default = contig | values: contig, size
+by: .
+
+# Target chunk size if by = size (overlap = read_len * 2).
+# Default is 500000 when by = size.
+# type = int | required: no | default = 500000 (when by=size)
+size: .
+
+# Maximum number of concurrent NEAT jobs
+# type = int | required: no | default = (CPU count)
+jobs: .
+
+# Command used to launch the simulator (CLI mode)
+# type = string | required: no | default = "neat read-simulator"
+neat_cmd: .
+
+# Path to samtools (binary name if on PATH)
+# type = string | required: no | default = samtools
+samtools: .
+
+# Delete the 'splits' directory after stitching completes
+# type = bool | required: no | default = false
+cleanup_splits: .
+
+# Reuse existing files in 'splits' and skip the split step
+# type = bool | required: no | default = false
+reuse_splits: .
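
Reviewer note: the `size` comment above states that chunks overlap by `read_len * 2`, but the splitting code itself is not shown in this part of the diff. The sketch below is one possible reading, in which every chunk after the first starts `overlap` bases before the previous chunk ends. The function `chunk_intervals` is hypothetical; the defaults of 500000 and 101 are the documented defaults for `size` and `read_len`.

```python
from typing import Iterator


def chunk_intervals(
    contig_len: int, size: int = 500_000, read_len: int = 101
) -> Iterator[tuple[int, int]]:
    """Yield 0-based, half-open (start, end) chunks over one contig.

    Assumption: consecutive chunks share read_len * 2 bases.
    """
    overlap = read_len * 2
    if size <= overlap:
        raise ValueError("size must be larger than the overlap (read_len * 2)")

    start = 0
    while start < contig_len:
        end = min(start + size, contig_len)
        yield start, end
        if end == contig_len:
            break  # last chunk reached the end of the contig
        start = end - overlap  # step back so adjacent chunks overlap
```

Under this reading, a contig shorter than `size` produces a single chunk, so `by: size` behaves like `by: contig` for small contigs.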

neat/cli/commands/parallel.py

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
+"""
+Command line interface for parallelized wrapper of NEAT.
+"""
+
+import argparse
+from pathlib import Path
+from typing import List
+
+from .base import BaseCommand
+from ...parallel_read_simulator.parallelize import parallelize_main as pipeline_main
+
+
+class Command(BaseCommand):
+    """
+    Split the reference, run read simulator, and stitch outputs together.
+    """
+    name = "parallel"
+    description = (
+        "Split the reference, run read-simulator in parallel, and stitch outputs together."
+    )
+
+    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
+        """
+        Register CLI arguments for the parallel read simulator.
+        """
+
+        parser.add_argument(
+            "-c",
+            "--config",
+            type=Path,
+            required=True,
+            help="NEAT YAML/YML config containing the 'reference:' field",
+        )
+
+        parser.add_argument(
+            "--outdir",
+            type=Path,
+            required=False,
+            default=None,
+            help="Top-level directory for splits and stitched results (optional)",
+        )
+
+        # Splitting options
+        split = parser.add_argument_group("splitting options")
+        split.add_argument(
+            "--by",
+            choices=["contig", "size"],
+            default=None,
+            help="Split mode",
+        )
+        split.add_argument(
+            "--size",
+            type=int,
+            default=None,
+            help="Target chunk size when --by size",
+        )
+        split.add_argument(
+            "--cleanup-splits",
+            action=argparse.BooleanOptionalAction,
+            default=None,
+            help="Delete the 'splits' directory after stitching completes",
+        )
+        split.add_argument(
+            "--reuse-splits",
+            action=argparse.BooleanOptionalAction,
+            default=None,
+            help="Skip splitting and reuse existing YAML/FASTA files in 'splits'",
+        )
+
+        # Simulation options
+        sim = parser.add_argument_group("simulation options")
+        sim.add_argument(
+            "--jobs",
+            type=int,
+            default=None,
+            help="Maximum number of parallel NEAT jobs",
+        )
+        sim.add_argument(
+            "--neat-cmd",
+            default=None,
+            help="Command used to launch the read simulator (e.g. 'neat read-simulator')",
+        )
+
+        # Stitching options
+        stitch = parser.add_argument_group("stitching options")
+        stitch.add_argument(
+            "--samtools",
+            default=None,
+            help="Path to samtools executable used by stitch_outputs.py",
+        )
+        stitch.add_argument(
+            "--final-prefix",
+            type=Path,
+            default=None,
+            help="Prefix (no extension) for stitched outputs",
+        )
+
+        # Optional YAML/JSON describing parallel settings
+        parser.add_argument(
+            "--parallel-config",
+            type=Path,
+            help="Optional YAML/JSON file with parallelization settings (jobs, by, size, etc.)",
+        )
+
+    def execute(self, arguments: argparse.Namespace) -> None:
+        # Optionally overlay values from a parallel-config file
+        if arguments.parallel_config and arguments.parallel_config.is_file():
+            import json, yaml
+            ext = arguments.parallel_config.suffix.lower()
+            with open(arguments.parallel_config, "r") as fh:
+                overrides = yaml.safe_load(fh) if ext in (".yml", ".yaml") else json.load(fh)
+            for k, v in overrides.items():
+                if hasattr(arguments, k):
+                    setattr(arguments, k, v)
+
+        argv: List[str] = [str(arguments.config)]
+
+        # Only forward flags the user actually set
+        if arguments.outdir is not None:
+            argv += ["--outdir", str(arguments.outdir)]
+        if arguments.by is not None:
+            argv += ["--by", arguments.by]
+        if arguments.size is not None and arguments.by == "size":
+            argv += ["--size", str(arguments.size)]
+
+        # Handle booleans
+        if arguments.cleanup_splits is not None:
+            argv += ["--cleanup-splits"] if arguments.cleanup_splits else ["--no-cleanup-splits"]
+        if arguments.reuse_splits is not None:
+            argv += ["--reuse-splits"] if arguments.reuse_splits else ["--no-reuse-splits"]
+
+        # Other parameters
+        if arguments.jobs is not None:
+            argv += ["--jobs", str(arguments.jobs)]
+        if arguments.neat_cmd is not None:
+            argv += ["--neat-cmd", arguments.neat_cmd]
+        if arguments.samtools is not None:
+            argv += ["--samtools", arguments.samtools]
+        if arguments.final_prefix is not None:
+            argv += ["--final-prefix", str(arguments.final_prefix)]
+
+        pipeline_main(argv)
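
Reviewer note: `execute` relies on every option defaulting to `None` so that only flags the user actually passed are forwarded, leaving the config file to decide the rest. For the boolean options this works because `argparse.BooleanOptionalAction` with `default=None` gives three states: `None` (flag absent), `True` (`--flag`), and `False` (`--no-flag`). A small standalone sketch of that pattern; the `demo` parser and `forwarded_argv` helper are illustrative and not part of NEAT:

```python
import argparse
from typing import List, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Toy parser mirroring two of the options registered in parallel.py."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument(
        "--reuse-splits",
        action=argparse.BooleanOptionalAction,  # also registers --no-reuse-splits
        default=None,                           # None means "not given on the CLI"
    )
    parser.add_argument("--jobs", type=int, default=None)
    return parser


def forwarded_argv(argv: Sequence[str]) -> List[str]:
    """Rebuild only the flags that were explicitly set."""
    args = build_parser().parse_args(argv)
    out: List[str] = []
    if args.reuse_splits is not None:
        out += ["--reuse-splits"] if args.reuse_splits else ["--no-reuse-splits"]
    if args.jobs is not None:
        out += ["--jobs", str(args.jobs)]
    return out
```

Using `default=False` instead would make an absent flag indistinguishable from an explicit `--no-reuse-splits`, and the wrapper would then always override the value in the config file.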
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+"""
+Load modules needed for other parts of the program
+"""
+from .parallelize import parallelize_main
+__all__ = ["parallelize_main"]
