ncsa
diff --git a/README.md b/README.md
Lines changed: 52 additions & 15 deletions
diff --git a/config_template/simple_template.yml b/config_template/simple_template.yml
Lines changed: 9 additions & 0 deletions
diff --git a/config_template/template_neat_config.yml b/config_template/template_neat_config.yml
Lines changed: 40 additions & 0 deletions
diff --git a/neat/cli/commands/parallel.py b/neat/cli/commands/parallel.py
Lines changed: 142 additions & 0 deletions
diff --git a/neat/parallel_read_simulator/__init__.py b/neat/parallel_read_simulator/__init__.py
Lines changed: 5 additions & 0 deletions
@@ -21,7 +21,7 @@ Table of Contents
   * [neat-genreads](#neat-genreads)
   * [Table of Contents](#table-of-contents)
     * [Requirements](#requirements)
-    * [Installation] (#installation)
+    * [Installation](#installation)
     * [Usage](#usage)
     * [Functionality](#functionality)
     * [Examples](#examples)
@@ -32,18 +32,19 @@ Table of Contents
       * [Large single end reads](#large-single-end-reads)
       * [Parallelizing simulation](#parallelizing-simulation)
   * [Utilities](#utilities)
+    * [Parallelization](#neat-parallel)
     * [model_fragment_lengths](#modelfraglen)
     * [gen_mut_model](#genmutmodel)
     * [model_sequencing_error](#modelseqerror)
-      * [Note on Sensitive Patient Data](#note-on-sensitive-patient-data)
-
+    * [Note on Sensitive Patient Data](#note-on-sensitive-patient-data)
 
 ## Requirements (the most up-to-date requirements are found in the environment.yml file)
 
 * Some version of Anaconda to set up the environment
 * Python == 3.10.*
 * poetry == 1.3.*
 * biopython == 1.79
+* samtools == 1.20
 * pkginfo
 * matplotlib
 * numpy
@@ -103,6 +104,8 @@ A config file is required. The config is a yml file specifying the input paramet
 description of the potential inputs in the config file. See NEAT/config_template/template_neat_config.yml for a
 template config file to copy and use for your runs.
 
+To run the simulator in parallel with the same config file, which can substantially reduce runtime, see the [neat parallel](#neat-parallel) section.
+
 reference: full path to a fasta file to generate reads from
 read_len: The length of the reads for the fastq (if using). Integer value, default 101.
 coverage: desired coverage value. Float or int, default = 10
@@ -283,6 +286,51 @@ neat read-simulator                 \
 # Utilities	
 Several scripts are distributed with gen_reads that are used to generate the models used for simulation.
 
+## neat parallel
+
+Runs NEAT’s read simulator in parallel across a split reference (split by contig or into fixed-size chunks) and stitches the per-chunk outputs into final FASTQ/BAM/VCF files.
+
+### Commands:
+
+Minimal: all settings come from a single YAML config
+```
+neat parallel -c /path/to/config.yml
+```
+
+Override or supplement a few options on the CLI
+```
+neat parallel -c /path/to/config.yml \
+  --outdir run1 --by size --size 500000 --jobs 8
+```
+
+`neat parallel` reads the same config you use for `neat read-simulator` and also looks for these parallelization keys at the top level:
+
+```
+# optional; defaults to <cwd>/<config_stem>_parallel if omitted (can also be set with --outdir)
+outdir: /absolute/or/relative/path/for/this_run
+
+# stitched outputs live under outdir; relative values are resolved under outdir
+final_prefix: stitched/final         # default if omitted: stitched/final
+
+# how to split the reference (size recommended)
+by: contig                           # values: contig | size
+size: 1000000                        # used only when by: size
+
+# parallel execution
+jobs: 8                              # default: CPU count
+
+# how to invoke the simulator
+neat_cmd: neat read-simulator        # default
+
+# external tool for stitching BAMs
+samtools: samtools                   # default, must be on PATH
+
+# organization
+cleanup_splits: false                # delete outdir/splits after stitch
+reuse_splits: false                  # reuse existing splits if present
+```
+
+
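The defaults documented for these keys (here and in `config_template/template_neat_config.yml`) can be summarized in a short sketch. This is not the code in `parallelize.py`, which is not shown in this diff; `resolve_parallel_settings` and `_unset` are hypothetical names, and treating `.` as "unset" is an assumption based on the template's placeholder convention.

```python
import os
from pathlib import Path
from typing import Any, Dict, Optional


def _unset(value: Any) -> bool:
    """Assumption: a missing key or the template placeholder '.' means 'not set'."""
    return value is None or value == "."


def resolve_parallel_settings(
    config_path: Path, cfg: Dict[str, Any], cwd: Optional[Path] = None
) -> Dict[str, Any]:
    """Hypothetical helper that applies the documented defaults to the parallel keys."""
    cwd = Path.cwd() if cwd is None else cwd

    # outdir: relative paths are taken against the working directory;
    # if unset, fall back to <cwd>/<config_stem>_parallel
    if _unset(cfg.get("outdir")):
        outdir = cwd / f"{config_path.stem}_parallel"
    else:
        outdir = Path(cfg["outdir"])
        if not outdir.is_absolute():
            outdir = cwd / outdir

    # final_prefix: relative values are resolved under outdir
    prefix = Path("stitched/final") if _unset(cfg.get("final_prefix")) else Path(cfg["final_prefix"])
    if not prefix.is_absolute():
        prefix = outdir / prefix

    by = "contig" if _unset(cfg.get("by")) else cfg["by"]
    size = cfg.get("size")
    if by == "size" and _unset(size):
        size = 500000  # documented default chunk size when by = size

    jobs = (os.cpu_count() or 1) if _unset(cfg.get("jobs")) else int(cfg["jobs"])

    return {"outdir": outdir, "final_prefix": prefix, "by": by, "size": size, "jobs": jobs}


settings = resolve_parallel_settings(
    Path("/cfg/run1.yml"), {"outdir": ".", "by": "size"}, cwd=Path("/work")
)
print(settings["outdir"], settings["final_prefix"], settings["size"])
```

With an unset `outdir`, everything for the run ends up under a directory named after the config file, so two configs launched from the same working directory would not collide.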
 ## neat model-fraglen
 
 Computes empirical fragment length distribution from sample data.
@@ -344,17 +392,6 @@ neat model-seq-err                                    \
 
 Please note that -i2 can be used in place of -i to produce paired data.
 
-## neat plot_mutation_model
-
-Performs plotting and comparison of mutation models generated from genMutModel.py (Not yet implemented in NEAT 4.0).
-
-```
-neat plot_mutation_model                                                \
-        -i model1.pickle.gz [model2.pickle.gz] [model3.pickle.gz]...    \
-        -l legend_label1 [legend_label2] [legend_label3]...             \
-        -o path/to/pdf_plot_prefix
-```
-
 ## neat vcf_compare
 
 Tool for comparing VCF files (Not yet implemented in NEAT 4.0).
@@ -380,4 +417,4 @@ neat vcf_compare
 Mappability track examples: https://github.com/zstephens/neat-repeat/tree/master/example_mappabilityTracks
 
 ### Note on Sensitive Patient Data
-ICGC's "Access Controlled Data" documentation can be found at <a href = https://docs.icgc.org/portal/access/ target="_blank">https://docs.icgc.org/portal/access/</a>. To have access to controlled germline data, a DACO must be submitted. Open tier data can be obtained without a DACO, but germline alleles that do not match the reference genome are masked and replaced with the reference allele. Controlled data includes unmasked germline alleles.
+ICGC's "Access Controlled Data" documentation can be found at <a href = https://docs.icgc.org/portal/access/ target="_blank">https://docs.icgc.org/portal/access/</a>. To have access to controlled germline data, a DACO must be submitted. Open tier data can be obtained without a DACO, but germline alleles that do not match the reference genome are masked and replaced with the reference allele. Controlled data includes unmasked germline alleles.
@@ -26,3 +26,12 @@ rng_seed: .
 min_mutations: .
 overwrite_output: .
 
+outdir: .
+final_prefix: .
+by: .
+size: .
+jobs: .
+neat_cmd: .
+samtools: .
+cleanup_splits: .
+reuse_splits: .
@@ -140,3 +140,43 @@ min_mutations: .
 # type: bool | required = no | default = false
 overwrite_output: .
 
+# Top-level output directory for splits, per-chunk outputs, and stitched results.
+# Relative paths are interpreted against the CURRENT WORKING DIRECTORY.
+# If omitted (or set to .), it defaults to: <cwd>/<config_stem>_parallel
+# type: string | required = no
+outdir: .
+
+# Location (prefix, no extension) for stitched outputs.
+# If relative, it is resolved under outdir (i.e., <outdir>/<final_prefix>*).
+# Default is "stitched/final".
+# type: string | required = no | default = stitched/final
+final_prefix: .
+
+# How to split the input reference for parallelization
+# type: string | required = no | default = contig | values = contig, size
+by: .
+
+# Target chunk size if by = size (overlap = read_len * 2).
+# Default is 500000 when by = size.
+# type: int | required = no | default = 500000 (when by = size)
+size: .
+
+# Maximum number of concurrent NEAT jobs
+# type: int | required = no | default = (CPU count)
+jobs: .
+
+# Command used to launch the simulator (CLI mode)
+# type: string | required = no | default = "neat read-simulator"
+neat_cmd: .
+
+# Path to samtools (binary name if on PATH)
+# type: string | required = no | default = samtools
+samtools: .
+
+# Delete the 'splits' directory after stitching completes
+# type: bool | required = no | default = false
+cleanup_splits: .
+
+# Reuse existing files in 'splits' and skip the split step
+# type: bool | required = no | default = false
+reuse_splits: .
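The `size` comment above mentions an overlap of `read_len * 2` between chunks. The actual splitting logic lives in `parallelize.py`, which is not part of this diff, so the following is only one plausible reading: chunks start every `size` bases and each one extends `overlap` bases into the next. The function name `chunk_intervals` is hypothetical.

```python
from typing import List, Tuple


def chunk_intervals(
    contig_len: int, size: int = 500_000, read_len: int = 101
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals covering [0, contig_len).

    Assumption: consecutive chunks share read_len * 2 bases, so reads that
    straddle a chunk boundary can still be drawn from one of the two chunks.
    """
    if contig_len < 0 or size <= 0 or read_len <= 0:
        raise ValueError("contig_len must be >= 0; size and read_len must be > 0")

    overlap = read_len * 2
    intervals: List[Tuple[int, int]] = []
    start = 0
    while start < contig_len:
        end = min(start + size + overlap, contig_len)
        intervals.append((start, end))
        if end == contig_len:
            # The current chunk already reaches the contig end; stop here so we
            # do not emit a trailing chunk that lies entirely inside the overlap.
            break
        start += size
    return intervals


print(chunk_intervals(1200, size=500, read_len=50))
# prints [(0, 600), (500, 1100), (1000, 1200)]
```

How the stitching step treats reads that fall inside the shared region is not described in this diff, so the sketch only computes coordinates and makes no claim about deduplication.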
@@ -0,0 +1,142 @@
+"""
+Command line interface for the parallelized wrapper around NEAT's read simulator.
+"""
+
+import argparse
+from pathlib import Path
+from typing import List
+
+from .base import BaseCommand
+from ...parallel_read_simulator.parallelize import parallelize_main as pipeline_main
+
+
+class Command(BaseCommand):
+    """
+    Split the reference, run the read simulator in parallel, and stitch outputs together.
+    """
+    name = "parallel"
+    description = (
+        "Split the reference, run read-simulator in parallel, and stitch outputs together."
+    )
+
+    def add_arguments(self, parser: argparse.ArgumentParser) -> None:
+        """
+        Register CLI arguments for the parallel read simulator.
+        """
+
+        parser.add_argument(
+            "-c",
+            "--config",
+            type=Path,
+            required=True,
+            help="NEAT YAML/YML config containing the 'reference:' field",
+        )
+
+        parser.add_argument(
+            "--outdir",
+            type=Path,
+            required=False,
+            default=None,
+            help="Top-level directory for splits and stitched results (optional)",
+        )
+
+        # Splitting options
+        split = parser.add_argument_group("splitting options")
+        split.add_argument(
+            "--by",
+            choices=["contig", "size"],
+            default=None,
+            help="Split mode",
+        )
+        split.add_argument(
+            "--size",
+            type=int,
+            default=None,
+            help="Target chunk size when --by size",
+        )
+        split.add_argument(
+            "--cleanup-splits",
+            action=argparse.BooleanOptionalAction,
+            default=None,
+            help="Delete the 'splits' directory after stitching completes",
+        )
+        split.add_argument(
+            "--reuse-splits",
+            action=argparse.BooleanOptionalAction,
+            default=None,
+            help="Skip splitting and reuse existing YAML/FASTA files in 'splits'",
+        )
+
+        # Simulation options
+        sim = parser.add_argument_group("simulation options")
+        sim.add_argument(
+            "--jobs",
+            type=int,
+            default=None,
+            help="Maximum number of parallel NEAT jobs",
+        )
+        sim.add_argument(
+            "--neat-cmd",
+            default=None,
+            help="Command used to launch the read simulator (e.g. 'neat read-simulator')",
+        )
+
+        # Stitching options
+        stitch = parser.add_argument_group("stitching options")
+        stitch.add_argument(
+            "--samtools",
+            default=None,
+            help="Path to samtools executable used by stitch_outputs.py",
+        )
+        stitch.add_argument(
+            "--final-prefix",
+            type=Path,
+            default=None,
+            help="Prefix (no extension) for stitched outputs",
+        )
+
+        # Optional YAML/JSON describing parallel settings
+        parser.add_argument(
+            "--parallel-config",
+            type=Path,
+            help="Optional YAML/JSON file with parallelization settings (jobs, by, size, etc.)",
+        )
+
+    def execute(self, arguments: argparse.Namespace) -> None:
+        # Optionally overlay values from a parallel-config file
+        if arguments.parallel_config and arguments.parallel_config.is_file():
+            import json, yaml
+            ext = arguments.parallel_config.suffix.lower()
+            with open(arguments.parallel_config, "r") as fh:
+                overrides = yaml.safe_load(fh) if ext in (".yml", ".yaml") else json.load(fh)
+            for k, v in (overrides or {}).items():  # an empty YAML file loads as None
+                if hasattr(arguments, k):
+                    setattr(arguments, k, v)
+
+        argv: List[str] = [str(arguments.config)]
+
+        # Only forward flags the user actually set
+        if arguments.outdir is not None:
+            argv += ["--outdir", str(arguments.outdir)]
+        if arguments.by is not None:
+            argv += ["--by", arguments.by]
+        if arguments.size is not None and arguments.by != "contig":  # 'by: size' may come from the config
+            argv += ["--size", str(arguments.size)]
+
+        # Handle booleans
+        if arguments.cleanup_splits is not None:
+            argv += ["--cleanup-splits"] if arguments.cleanup_splits else ["--no-cleanup-splits"]
+        if arguments.reuse_splits is not None:
+            argv += ["--reuse-splits"] if arguments.reuse_splits else ["--no-reuse-splits"]
+
+        # Other parameters
+        if arguments.jobs is not None:
+            argv += ["--jobs", str(arguments.jobs)]
+        if arguments.neat_cmd is not None:
+            argv += ["--neat-cmd", arguments.neat_cmd]
+        if arguments.samtools is not None:
+            argv += ["--samtools", arguments.samtools]
+        if arguments.final_prefix is not None:
+            argv += ["--final-prefix", str(arguments.final_prefix)]
+
+        pipeline_main(argv)
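The boolean handling in `execute` relies on `argparse.BooleanOptionalAction` with `default=None`, which gives each flag three states: unset, on, and off. A minimal standalone sketch of that pattern follows; `build_parser` and `forward_flag` are illustrative names, not part of NEAT.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Tiny parser with one tri-state flag, mirroring --cleanup-splits above."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument(
        "--cleanup-splits",
        action=argparse.BooleanOptionalAction,  # also registers --no-cleanup-splits
        default=None,  # None means "the user said nothing"
    )
    return parser


def forward_flag(value: Optional[bool], name: str) -> List[str]:
    """Translate a tri-state value back into argv tokens for a downstream parser."""
    if value is None:
        return []  # leave the decision to the config file or the downstream default
    return [f"--{name}"] if value else [f"--no-{name}"]


parser = build_parser()
for cli in ([], ["--cleanup-splits"], ["--no-cleanup-splits"]):
    ns = parser.parse_args(cli)
    print(cli, "->", forward_flag(ns.cleanup_splits, "cleanup-splits"))
```

Keeping `None` distinct from `False` is what lets a value in the YAML config survive when the flag is absent from the command line; a plain `store_true` flag would always forward `False` and override it.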
@@ -0,0 +1,5 @@
+"""
+Public entry point for the parallel read simulator.
+"""
+from .parallelize import parallelize_main
+__all__ = ["parallelize_main"]