ncsa · joshfactorial · May 20, 2026
diff --git a/README.md b/README.md
@@ -1,12 +1,12 @@
-# The NEAT Project v4.4
+# The NEAT Project v4.5
 
-Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 4.4. NEAT 4.4 is the official release of NEAT 4.0. It represents a lot of hard work from several contributors at NCSA and beyond. With the addition of parallel processing, we feel that the code is ready for production, and future releases will focus on compatibility, bug fixes, and testing. Future releases for the time being will be enumerations of 4.4.X.
+Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 4.5. See the [ChangeLog](ChangeLog.md) for the full version history.
 
-## NEAT v4.4
+## NEAT v4.5
 
-NEAT 4.4 is the current official release of NEAT 4.0, including parallel processing support and significant bug fixes to the sequencing error model. See the [ChangeLog](ChangeLog.md) for details. 
+NEAT 4.5 is the current release. It adds the `neat compare-vcfs` subcommand, removes the deprecated `cleanup_splits`/`reuse_splits` config keys, and ships a number of performance and correctness fixes accumulated across the 4.4.x line. See the [ChangeLog](ChangeLog.md) for details.
 
-We have completed major revisions on NEAT since 3.4 and consider NEAT 4.4 to be a stable release, in that we will continue to update and provide bug fixes and support. We will consider new features and pull requests. Please include justification for major changes. See [contribute](CONTRIBUTING.md) for more information. If you'd like to use some of our code in your own, no problem! Just review the [license](LICENSE.md), first.
+We consider NEAT 4.x to be a stable release line and will continue to provide bug fixes and support. We will consider new features and pull requests. Please include justification for major changes. See [contribute](CONTRIBUTING.md) for more information. If you'd like to use some of our code in your own, please review the [license](LICENSE.md) first.
 
 We've deprecated NEAT's command-line interface options for the most part, opting to simplify things with configuration files. If you require the CLI for legacy purposes, NEAT 3.4 was our last release to be fully supported via command-line interface. Please convert your CLI commands to the corresponding configuration file for future runs.
 
@@ -26,8 +26,8 @@ To cite this work, please use both of the following:
 
 ## Table of Contents
 
-* [The NEAT Project v4.4](#the-neat-project-v44)
-* [NEAT v4.4](#neat-v44)
+* [The NEAT Project v4.5](#the-neat-project-v45)
+* [NEAT v4.5](#neat-v45)
 * [Table of Contents](#table-of-contents)
   * [Prerequisites](#prerequisites)
   * [Installation](#installation)
@@ -48,6 +48,8 @@ To cite this work, please use both of the following:
     * [`neat model-qual-score`](#neat-model-qual-score)
     * [`neat model-gc-bias`](#neat-model-gc-bias)
     * [`neat compare-vcfs`](#neat-compare-vcfs)
+    * [`neat bacterial-wrapper`](#neat-bacterial-wrapper)
+  * [Validating NEAT Outputs](#validating-neat-outputs)
   * [Tests](#tests)
     * [Guide to run locally](#guide-to-run-locally)
     * [Note on Sensitive Patient Data](#note-on-sensitive-patient-data)
@@ -165,9 +167,9 @@ To run the simulator in multithreaded mode, set the `threads` value in the confi
 
 `reference`: Full path to a FASTA file to generate reads from.  
 
-`read_len`: The length of the reads for the FASTQ (if using). _Integer value, default 101._
+`read_len`: The length of the reads for the FASTQ (if using). _Integer value, default 151._
 
-`coverage`: Desired coverage value. _Float or integer, default = 10._
+`coverage`: Desired coverage depth. _Integer value, default = 10._
 
 `ploidy`: Desired value for ploidy (# of copies of each chromosome in the organism, where if ploidy > 2, "heterozygous"  mutates floor(ploidy / 2) chromosomes). _Default is 2._
 
@@ -193,16 +195,18 @@ More parameters are below:
 | `mutation_model`    | Full path to a mutation model generated by NEAT. Leave empty to use a default model (default model based on human data sequenced by Illumina).                                                                |
 | `fragment_model`    | Full path to fragment length model generated by NEAT. Leave empty to use default model (default model based on human data sequenced by Illumina).                                                             |
 | `gc_model`          | Full path to GC-bias model generated by NEAT. Leave empty for no GC bias.                                                                                                                                     |
-| `threads`           | The number of threads for NEAT to use. Increasing the number will speed up read generation.                                                                                                                   |
 | `avg_seq_error`     | Average sequencing error rate for the sequencing machine. Use to increase or decrease the rate of errors in the reads. Float between 0 and 0.3. Default is set by the error model.                            |
 | `rescale_qualities` | Rescale the quality scores to reflect the `avg_seq_error` rate above. Set `True` to activate if you notice issues with the sequencing error rates in your dataset.                                            |
+| `quality_offset`    | ASCII offset for quality score encoding. Default `33` (Sanger/Illumina 1.8+ Phred+33). Only change this if your data uses a non-standard encoding (valid range 33–64).                                       |
 | `include_vcf`       | Full path to list of variants in VCF format to include in the simulation. These will be inserted as they appear in the input VCF into the final VCF, and the corresponding FASTQ and BAM files, if requested. |
 | `target_bed`        | Full path to list of regions in BED format to target. All areas outside these regions will have coverage of 0.                                                                                                |
 | `discard_bed`       | Full path to a list of regions to discard, in BED format.                                                                                                                                                     |
 | `mutation_rate`     | Desired rate of mutation for the dataset. Float between 0.0 and 0.3 (default is determined by the mutation model).                                                                                            |
 | `mutation_bed`      | Full path to a list of regions with a column describing the mutation rate of that region, as a float with values between 0 and 0.3. The mutation rate must be in the third column as, e.g., `mut_rate`=0.00.  |
 | `rng_seed`          | Manually enter a seed for the random number generator. Used for repeating runs. Must be an integer.                                                                                                           |
 | `min_mutations`     | Set the minimum number of mutations that NEAT should add, per contig. Default is 0. We recommend setting this to at least one for small chromosomes, so NEAT will produce at least one mutation per contig.   |
+| `overwrite_output`  | If `true`, existing output files with the same name will be overwritten. Default `false` (NEAT exits with an error if output files already exist).                                                            |
+| `no_coverage_bias`  | If `true`, disables statistical coverage models and forces uniform coverage across the genome. Intended for debugging. Default `false`.                                                                        |
 | `threads`           | Number of threads to use. More than 1 will use multi-threading to speed up processing. With `threads > 1`, NEAT splits each contig into chunks; with `threads == 1`, one chunk per contig is used.            |
 | `parallel_block_size` | Per-chunk size in bases when `threads > 1`. Default `0` (auto-tune from total genome length and thread count, targeting ~8 chunks per thread). Set to a positive integer to override. Ignored when `threads == 1`. |
 
@@ -664,6 +668,79 @@ then point `--happy-bin` at `/path/to/hap_py_env/bin/hap.py` (or put that
 directory on `$PATH`). Without `hap.py` available, `neat compare-vcfs` exits
 with a clear install hint.
 
+### `neat bacterial-wrapper`
+
+Runs NEAT's read simulator twice on the same reference — once with a standard
+mutation model ("Regular") and once with a higher mutation rate model
+("Wrapped") representing a bacterium under selection pressure — then stitches
+the two output sets together for downstream comparison.
+
+```bash
+neat bacterial-wrapper reference.fa bacteria_name \
+        -c config.yml                             \
+        -o /path/to/output/dir
+```
+
+Outputs are written into `<output_dir>/Regular/` and `<output_dir>/Wrapped/`
+subdirectories, then stitched into the parent output directory. The config file
+follows the same format as `neat read-simulator`.
+
+## Validating NEAT Outputs
+
+NEAT does not ship its own validation utilities. Use the standard bioinformatics
+tools below, which handle gzipped inputs, are actively maintained, and can be
+installed from bioconda.
+
+### FASTQ validation
+
+**FastQC** — format validation plus quality metrics (GC content, adapter
+contamination, per-base quality scores):
+
+```bash
+fastqc read1.fastq.gz            # single-end
+fastqc read1.fastq.gz read2.fastq.gz   # paired-end
+```
+
+**fastp** — lightweight format check without trimming or filtering:
+
+```bash
+fastp --in1 read1.fastq.gz --in2 read2.fastq.gz \
+      --disable_adapter_trimming --disable_quality_filtering \
+      --disable_length_filtering --thread 4
+```
+
+### BAM validation
+
+**samtools quickcheck** — fast EOF and header sanity check; exits non-zero on
+failure, making it CI-friendly:
+
+```bash
+samtools quickcheck output.bam
+```
+
+**samtools flagstat** — alignment statistics; a non-zero exit or obviously wrong
+counts (e.g., 0 mapped reads when coverage > 0) indicate a problem:
+
+```bash
+samtools flagstat output.bam
+```
+
+**Picard ValidateSamFile** — comprehensive structural validation (CIGAR
+consistency, mate-pair pairing, flag conflicts):
+
+```bash
+picard ValidateSamFile -I output.bam -MODE SUMMARY
+```
+
+### VCF validation
+
+**bcftools stats** — reports site counts, ts/tv ratio, and indel length
+distribution; useful for a sanity check against expected mutation rates:
+
+```bash
+bcftools stats output_golden.vcf.gz | grep "^SN"
+```
+
 ## Tests
 
 We provide unit tests (e.g., mutation and sequencing error models) and basic integration tests for the CLI.
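Note on the `parallel_block_size` row in the README table above: the README only says the default `0` auto-tunes "from total genome length and thread count, targeting ~8 chunks per thread". The sketch below is one plausible reading of that sentence, not NEAT's actual implementation. The function name `auto_block_size`, the constant `CHUNKS_PER_THREAD`, and the ceiling division are all assumptions made for illustration.

```python
import math
from typing import Optional

# Assumption: taken from the "~8 chunks per thread" wording in the README.
CHUNKS_PER_THREAD = 8


def auto_block_size(genome_length: int, threads: int,
                    parallel_block_size: int = 0) -> Optional[int]:
    """Return a per-chunk size in bases, or None when chunking is per contig.

    Hypothetical helper; it mirrors the README description only.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    if threads == 1:
        # README: with threads == 1, one chunk per contig is used and
        # parallel_block_size is ignored.
        return None
    if parallel_block_size > 0:
        # README: a positive integer overrides the auto-tuned value.
        return parallel_block_size
    # Auto-tune: aim for roughly CHUNKS_PER_THREAD chunks per thread.
    target_chunks = threads * CHUNKS_PER_THREAD
    return max(1, math.ceil(genome_length / target_chunks))
```

Under these assumptions, a 3.2 Mb genome with `threads: 4` would be split into 32 chunks of 100,000 bases each, while an explicit `parallel_block_size: 250` would be used as given.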

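Note on the `quality_offset` row in the README table above: in Phred-scaled FASTQ encoding, each quality score is stored as the single ASCII character whose code point is the score plus the offset. The sketch below illustrates that arithmetic for the default offset of `33`. The helper names `encode_quals` and `decode_quals` are hypothetical and are not part of NEAT; the 33 to 64 range check simply repeats the range stated in the README.

```python
from typing import Iterable, List


def encode_quals(scores: Iterable[int], offset: int = 33) -> str:
    """Encode Phred scores as a FASTQ quality string (hypothetical helper)."""
    if not 33 <= offset <= 64:
        # Range taken from the README's "valid range 33-64" note.
        raise ValueError("offset must be between 33 and 64")
    chars = []
    for q in scores:
        if q < 0:
            raise ValueError("quality scores must be non-negative")
        chars.append(chr(q + offset))
    return "".join(chars)


def decode_quals(qual_string: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string back into Phred scores."""
    return [ord(c) - offset for c in qual_string]
```

With the default offset, scores 0, 30, and 40 map to `!`, `?`, and `I`; with an offset of 64 the same scores map to `@`, `^`, and `h`, which is why reading a file with the wrong offset shifts every score by 31.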
diff --git a/config_template/template_neat_config.yml b/config_template/template_neat_config.yml
@@ -57,12 +57,6 @@ include_vcf: .
 # type = string | required = no
 target_bed: .
 
-# Scalar value for coverage in regions outside the targeted BED. Example: 0.5
-# would get you roughly half the coverage as the on target areas. Default is
-# 0 coverage in off-target regions. Number should be a float in decimal
-# type: float | required = no | default = 0.00
-off_target_scalar: .
-
 # Absolute path to BED file containing reference regions that the simulation should discard
 # type = string | required = no
 discard_bed: .
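
Note on the config changes in this diff: the README states that `cleanup_splits` and `reuse_splits` are removed in 4.5, and this hunk drops `off_target_scalar` from the template. Whether NEAT rejects or silently ignores these keys in an existing config is not stated here, so the sketch below only flags them for manual review. It is a plain line scan rather than a YAML parse, and the names `REMOVED_KEYS` and `find_removed_keys` are hypothetical, not part of NEAT.

```python
import re
from typing import List, Sequence, Tuple

# Assumption: keys gathered from this diff; adjust to match the ChangeLog.
REMOVED_KEYS = ("cleanup_splits", "reuse_splits", "off_target_scalar")

# Matches an uncommented top-level "key:" at the start of a line.
_KEY_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:")


def find_removed_keys(config_text: str,
                      removed: Sequence[str] = REMOVED_KEYS) -> List[Tuple[int, str]]:
    """Return (line_number, key) pairs for removed keys still set in a config."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        match = _KEY_RE.match(line)
        if match and match.group(1) in removed:
            hits.append((lineno, match.group(1)))
    return hits
```

A typical use would be to read an older config file into a string, pass it to `find_removed_keys`, and delete or update the reported lines before running NEAT 4.5. Commented-out keys are skipped because the pattern only matches keys that begin a line.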