97 changes: 87 additions & 10 deletions README.md
@@ -1,12 +1,12 @@
# The NEAT Project v4.4
# The NEAT Project v4.5

Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 4.4. NEAT 4.4 is the official release of NEAT 4.0. It represents a lot of hard work from several contributors at NCSA and beyond. With the addition of parallel processing, we feel that the code is ready for production, and future releases will focus on compatibility, bug fixes, and testing. Future releases for the time being will be enumerations of 4.4.X.
Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 4.5. See the [ChangeLog](ChangeLog.md) for the full version history.

## NEAT v4.4
## NEAT v4.5

NEAT 4.4 is the current official release of NEAT 4.0, including parallel processing support and significant bug fixes to the sequencing error model. See the [ChangeLog](ChangeLog.md) for details.
NEAT 4.5 is the current release. It adds the `neat compare-vcfs` subcommand, removes the deprecated `cleanup_splits`/`reuse_splits` config keys, and ships a number of performance and correctness fixes accumulated across the 4.4.x line. See the [ChangeLog](ChangeLog.md) for details.

We have completed major revisions on NEAT since 3.4 and consider NEAT 4.4 to be a stable release, in that we will continue to update and provide bug fixes and support. We will consider new features and pull requests. Please include justification for major changes. See [contribute](CONTRIBUTING.md) for more information. If you'd like to use some of our code in your own, no problem! Just review the [license](LICENSE.md), first.
We consider NEAT 4.x to be a stable release line and will continue to provide bug fixes and support. We will consider new features and pull requests. Please include justification for major changes. See [contribute](CONTRIBUTING.md) for more information. If you'd like to use some of our code in your own, please review the [license](LICENSE.md) first.

We've deprecated NEAT's command-line interface options for the most part, opting to simplify things with configuration files. If you require the CLI for legacy purposes, NEAT 3.4 was our last release to be fully supported via command-line interface. Please convert your CLI commands to the corresponding configuration file for future runs.

@@ -26,8 +26,8 @@ To cite this work, please use both of the following:

## Table of Contents

* [The NEAT Project v4.4](#the-neat-project-v44)
* [NEAT v4.4](#neat-v44)
* [The NEAT Project v4.5](#the-neat-project-v45)
* [NEAT v4.5](#neat-v45)
* [Table of Contents](#table-of-contents)
* [Prerequisites](#prerequisites)
* [Installation](#installation)
@@ -48,6 +48,8 @@ To cite this work, please use both of the following:
* [`neat model-qual-score`](#neat-model-qual-score)
* [`neat model-gc-bias`](#neat-model-gc-bias)
* [`neat compare-vcfs`](#neat-compare-vcfs)
* [`neat bacterial-wrapper`](#neat-bacterial-wrapper)
* [Validating NEAT Outputs](#validating-neat-outputs)
* [Tests](#tests)
* [Guide to run locally](#guide-to-run-locally)
* [Note on Sensitive Patient Data](#note-on-sensitive-patient-data)
@@ -165,9 +167,9 @@ To run the simulator in multithreaded mode, set the `threads` value in the confi

`reference`: Full path to a FASTA file to generate reads from.

`read_len`: The length of the reads for the FASTQ (if using). _Integer value, default 101._
`read_len`: The length of the reads for the FASTQ (if using). _Integer value, default 151._

`coverage`: Desired coverage value. _Float or integer, default = 10._
`coverage`: Desired coverage depth. _Integer value, default = 10._

`ploidy`: Number of copies of each chromosome in the organism. If ploidy > 2, a "heterozygous" mutation is applied to floor(ploidy / 2) of the copies. _Default is 2._

@@ -193,16 +195,18 @@ More parameters are below:
| `mutation_model` | Full path to a mutation model generated by NEAT. Leave empty to use a default model (default model based on human data sequenced by Illumina). |
| `fragment_model` | Full path to fragment length model generated by NEAT. Leave empty to use default model (default model based on human data sequenced by Illumina). |
| `gc_model` | Full path to GC-bias model generated by NEAT. Leave empty for no GC bias. |
| `threads` | The number of threads for NEAT to use. Increasing the number will speed up read generation. |
| `avg_seq_error` | Average sequencing error rate for the sequencing machine. Use to increase or decrease the rate of errors in the reads. Float between 0 and 0.3. Default is set by the error model. |
| `rescale_qualities` | Rescale the quality scores to reflect the `avg_seq_error` rate above. Set `True` to activate if you notice issues with the sequencing error rates in your dataset. |
| `quality_offset` | ASCII offset for quality score encoding. Default `33` (Sanger/Illumina 1.8+ Phred+33). Only change this if your data uses a non-standard encoding (valid range 33–64). |
| `include_vcf` | Full path to list of variants in VCF format to include in the simulation. These will be inserted as they appear in the input VCF into the final VCF, and the corresponding FASTQ and BAM files, if requested. |
| `target_bed` | Full path to list of regions in BED format to target. All areas outside these regions will have coverage of 0. |
| `discard_bed` | Full path to a list of regions to discard, in BED format. |
| `mutation_rate` | Desired rate of mutation for the dataset. Float between 0.0 and 0.3 (default is determined by the mutation model). |
| `mutation_bed` | Full path to a list of regions with a column describing the mutation rate of that region, as a float with values between 0 and 0.3. The mutation rate must be in the third column as, e.g., `mut_rate`=0.00. |
| `rng_seed` | Manually enter a seed for the random number generator. Used for repeating runs. Must be an integer. |
| `min_mutations` | Set the minimum number of mutations that NEAT should add, per contig. Default is 0. We recommend setting this to at least one for small chromosomes, so NEAT will produce at least one mutation per contig. |
| `overwrite_output` | If `true`, existing output files with the same name will be overwritten. Default `false` (NEAT exits with an error if output files already exist). |
| `no_coverage_bias` | If `true`, disables statistical coverage models and forces uniform coverage across the genome. Intended for debugging. Default `false`. |
| `threads` | Number of threads to use. More than 1 will use multi-threading to speed up processing. With `threads > 1`, NEAT splits each contig into chunks; with `threads == 1`, one chunk per contig is used. |
| `parallel_block_size` | Per-chunk size in bases when `threads > 1`. Default `0` (auto-tune from total genome length and thread count, targeting ~8 chunks per thread). Set to a positive integer to override. Ignored when `threads == 1`. |
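
As a rough illustration of how the keys in the table fit together, the sketch below writes a small configuration file. Only the key names come from the table above; the paths and values are placeholders, not recommendations. The `neat read-simulator -c ... -o ...` call in the trailing comment is an assumption that mirrors the flags shown for `neat bacterial-wrapper`, so check the CLI help before relying on it.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical working directory; replace with your own location.
workdir="$(mktemp -d)"
config="$workdir/neat_config.yml"

# Placeholder values: only the key names are taken from the parameter table.
cat > "$config" <<'EOF'
reference: /path/to/reference.fa
read_len: 151
coverage: 10
ploidy: 2
threads: 4
parallel_block_size: 0
rng_seed: 42
overwrite_output: false
EOF

echo "Wrote $config"
# Assumed invocation (not run here):
#   neat read-simulator -c "$config" -o /path/to/output/dir
```

Leaving `parallel_block_size` at `0` keeps the auto-tuned chunk size, and fixing `rng_seed` makes a run repeatable.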

@@ -664,6 +668,79 @@ then point `--happy-bin` at `/path/to/hap_py_env/bin/hap.py` (or put that
directory on `$PATH`). Without `hap.py` available, `neat compare-vcfs` exits
with a clear install hint.

### `neat bacterial-wrapper`

Runs NEAT's read simulator twice on the same reference: once with a standard
mutation model ("Regular") and once with a higher-mutation-rate model
("Wrapped") representing a bacterium under selection pressure. It then stitches
the two output sets together for downstream comparison.

```bash
neat bacterial-wrapper reference.fa bacteria_name \
-c config.yml \
-o /path/to/output/dir
```

Outputs are written into `<output_dir>/Regular/` and `<output_dir>/Wrapped/`
subdirectories, then stitched into the parent output directory. The config file
follows the same format as `neat read-simulator`.
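
After a run, it can be useful to confirm that both passes actually produced their subdirectories. The helper below is a minimal sketch: `check_wrapper_outputs` is a hypothetical name, and it assumes the `Regular/` and `Wrapped/` directories are left in place once stitching finishes.

```bash
# Return 0 only when both per-run subdirectories exist under the output dir.
# Assumption: NEAT does not delete Regular/ and Wrapped/ after stitching.
check_wrapper_outputs() {
    local out_dir="$1" sub status=0
    for sub in Regular Wrapped; do
        if [ ! -d "$out_dir/$sub" ]; then
            echo "missing: $out_dir/$sub" >&2
            status=1
        fi
    done
    return "$status"
}

# Example (hypothetical path):
#   check_wrapper_outputs /path/to/output/dir && echo "both runs present"
```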

## Validating NEAT Outputs

NEAT does not ship its own validation utilities. Use the standard bioinformatics
tools below, which handle gzipped inputs, are actively maintained, and are
commonly available in conda/bioconda-based environments.

### FASTQ validation

**FastQC** — format validation plus quality metrics (GC content, adapter
contamination, per-base quality scores):

```bash
fastqc read1.fastq.gz # single-end
fastqc read1.fastq.gz read2.fastq.gz # paired-end
```

**fastp** — lightweight format check without trimming or filtering:

```bash
fastp --in1 read1.fastq.gz --in2 read2.fastq.gz \
--disable_adapter_trimming --disable_quality_filtering \
--disable_length_filtering --thread 4
```

### BAM validation

**samtools quickcheck** — fast EOF and header sanity check; exits non-zero on
failure, making it CI-friendly:

```bash
samtools quickcheck output.bam
```
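
When a run produces several BAM files, the exit status can be used to check them all in one CI step. The following is a sketch only: `quickcheck_all` is a hypothetical helper, and it relies on nothing beyond the non-zero exit status described above.

```bash
# Run `samtools quickcheck` on each BAM and report every failure,
# rather than stopping at the first one.
quickcheck_all() {
    local bam failed=0
    for bam in "$@"; do
        if ! samtools quickcheck "$bam"; then
            echo "FAILED: $bam" >&2
            failed=$((failed + 1))
        fi
    done
    echo "$failed of $# file(s) failed quickcheck"
    [ "$failed" -eq 0 ]
}

# Example (hypothetical file names):
#   quickcheck_all sample1.bam sample2.bam
```

The function returns non-zero if any file fails, so it can gate a pipeline step directly.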

**samtools flagstat** — alignment statistics; a non-zero exit or obviously wrong
counts (e.g., 0 mapped reads when coverage > 0) indicate a problem:

```bash
samtools flagstat output.bam
```
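
The "0 mapped reads" check can be automated by parsing the default text output of `samtools flagstat`. This is a rough sketch with hypothetical function names; it assumes the default output format, where the mapped-read line reads `N + M mapped (...)`, and it would need adjusting for the JSON or TSV output modes.

```bash
# Print the QC-passed mapped read count from `samtools flagstat`.
# The "primary mapped" line is skipped because its fourth field is "primary".
mapped_reads() {
    samtools flagstat "$1" | awk '$4 == "mapped" { print $1; exit }'
}

# Fail when a BAM that should contain coverage reports zero mapped reads.
require_mapped_reads() {
    local n
    n="$(mapped_reads "$1")"
    if [ -z "$n" ] || [ "$n" -eq 0 ]; then
        echo "no mapped reads in $1" >&2
        return 1
    fi
    echo "$1: $n mapped reads"
}

# Example: require_mapped_reads output.bam
```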

**Picard ValidateSamFile** — comprehensive structural validation (CIGAR
consistency, mate-pair pairing, flag conflicts):

```bash
picard ValidateSamFile -I output.bam -MODE SUMMARY
```

### VCF validation

**bcftools stats** — reports site counts, ts/tv ratio, and indel length
distribution; useful for a sanity check against expected mutation rates:

```bash
bcftools stats output_golden.vcf.gz | grep "^SN"
```
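
To compare against an expected mutation rate, the summary numbers can be pulled out of the tab-separated `SN` lines. The sketch below uses hypothetical helper names and treats records divided by reference length as a crude per-base estimate; that definition is an assumption for illustration and may not match how NEAT defines `mutation_rate` internally.

```bash
# Extract one summary-number (SN) value from `bcftools stats` output.
# SN lines are tab-separated: SN, id, key, value.
sn_value() {
    local vcf="$1" key="$2"
    bcftools stats "$vcf" |
        awk -F'\t' -v key="$key" '$1 == "SN" && $3 == key { print $4 }'
}

# Crude per-base variant rate: VCF records divided by reference length in bases.
observed_rate() {
    local records
    records="$(sn_value "$1" "number of records:")"
    awk -v n="$records" -v len="$2" 'BEGIN { printf "%.6f\n", n / len }'
}

# Example (reference length supplied by the caller):
#   sn_value output_golden.vcf.gz "number of SNPs:"
#   observed_rate output_golden.vcf.gz 4600000
```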

## Tests

We provide unit tests (e.g., mutation and sequencing error models) and basic integration tests for the CLI.
6 changes: 0 additions & 6 deletions config_template/template_neat_config.yml
@@ -57,12 +57,6 @@ include_vcf: .
# type = string | required = no
target_bed: .

# Scalar value for coverage in regions outside the targeted BED. Example: 0.5
# would get you roughly half the coverage as the on target areas. Default is
# 0 coverage in off-target regions. Number should be a float in decimal
# type: float | required = no | default = 0.00
off_target_scalar: .

# Absolute path to BED file containing reference regions that the simulation should discard
# type = string | required = no
discard_bed: .