Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
103 changes: 103 additions & 0 deletions ChangeLog.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,106 @@
# NEAT v4.4.4
Follow-up release on top of v4.4.3 bundling three lines of work: another perf
pass over the remaining single-thread hot paths (variant overlap checks,
trinucleotide bias mapping, BAM record encoding), a coverage-semantics fix
for the GC bias model added earlier in the 4.4 line, and removal of the
`parallel_mode` config key. The perf changes are behavior-preserving for a
given seed; the GC bias fix changes output bytes when a non-uniform `gc_model`
is configured (a typical model with mean weight ~0.7 previously delivered
~70 % of the requested coverage; v4.4.4 delivers the full coverage).

**Benchmark (ecoli 10× coverage, 4 threads, identical configs):**

| Metric | v4.4.3 | v4.4.4 | Improvement |
|------------------------------|----------|----------|-----------------|
| ecoli SE wall time | 1:35 | **1:07** | 1.42× faster |
| ecoli PE wall time | 1:34 | **1:06** | 1.43× faster |
| ecoli SE total CPU | 331 s | 244 s | 26% less |
| ecoli PE total CPU | 338 s | 248 s | 27% less |
| Worker CPU utilization (PE) | 359%/400 | **378%** | 94.5% of theory |
| Peak resident memory | 175 MB | 175 MB | unchanged |

Cumulative vs the v4.4.2 baseline at the start of this performance arc:
SE 14:55 → 1:07 (**13.3× faster**), PE 14:46 → 1:06 (**13.4× faster**).

**Hot-path fixes:**

- `ContigVariants.check_if_del` / `check_if_ins` — replace O(N) linear scans
of `all_dels` / `all_ins` with O(1) lookups into per-position bucket dicts.
Each indel is indexed at every reference position it covers when added;
total memory cost is O(sum of indel lengths). Single-thread profile on
14 k calls: `check_if_del` 3.62 s → 0.011 s (325×); `check_if_ins` 3.70 s →
0.007 s (530×). Removes O(N²) scaling that would have hurt larger genomes.

- `variant_models.map_local_trinuc_bias` — replace 64 regex scans per call
(one `re.finditer` per trinucleotide) with a single vectorized numpy
expression. The cache check is also tightened from `==` (O(N) byte
compare) to `is` (O(1) identity), since callers pass different slice
objects each time anyway. Single-thread profile on 14 k calls: 13.5 s →
0.81 s (16×). The entire `generate_variants` call dropped from 24.9 s
cumulative to 4.2 s.

- `OutputFileWriter.write_bam_record` — vectorize BAM record encoding:
- Sequence encoding replaces the per-byte Python loop (28 M
`Seq.__getitem__` calls in the old code) with a 256-entry numpy lookup
plus a single `((codes[0::2] << 4) | codes[1::2]).tobytes()`.
- Quality encoding replaces `"".join([chr(n) for n in quality_array])`
with one `np.asarray(...).astype(np.uint8).tobytes()` call.
- CIGAR ops are now packed in one `struct.pack(f'<{n}I', ...)` instead of
one pack-per-op + `bytearray.extend`. The CIGAR string is parsed in a
single linear pass (the previous code did `re.split` + `re.findall`).
- The fixed-size BAM record header is now one
`struct.pack('<iiiIIiiii', ...)` instead of nine separate pack calls
glued with `+`.
Single-thread profile: self time 12.0 s → 3.2 s, cumtime 58.1 s → 8.1 s
(7.2× cumulative). This was the single biggest hot path of the release.

The vectorized sequence and CIGAR encodings produce byte-identical BAM records
to the prior implementation (verified by `pysam.AlignmentFile` read-back and
sort-order checks on the 4-thread ecoli benchmark).

**Config cleanup:** `parallel_mode` has been removed from the YAML schema.
The splitting strategy is now derived from `threads` (single-chunk-per-
contig when `threads == 1`, size-based chunking when `threads > 1`).
Existing configs with `parallel_mode: ...` keep parsing — the key is
silently ignored with a one-line deprecation warning. The option never
actually let users change effective behavior under the old logic (it was
force-overridden to `'contig'` when `threads == 1`, and `'size'` was a
strict superset for `threads > 1`), so this is API surface cleanup rather
than a behavior change.

**GC bias coverage fix.** When a non-uniform `gc_model` was configured,
`cover_dataset` had been scaling per-chunk reads by `chunk_mean / max_weight`.
That made `coverage` the *peak* coverage at the GC bias maximum, undercounting
total reads by `mean_weight_genome / max_weight` (a typical model with
mean ~0.7, max 1.0 delivered ~70 % of the requested coverage). The user-facing
config (`template_neat_config.yml:18`) documents `coverage` as "average coverage
for the entire genome," so this was a documentation/behavior mismatch.

The fix:

- New `compute_genome_wide_gc_mean_weight(reference, gc_model)` helper
(`neat/model_gc_bias/utils.py`) — single-pass vectorized scan over every
length-`window_size` window in the reference, reusing the same prefix-sum
windowing as `cover_dataset`. Uniform models short-circuit without scanning.
- `read_simulator_runner` precomputes this once and stores it on `options` so
all worker chunks see the same denominator (no redundant per-chunk genome
scans).
- `cover_dataset` now scales by `chunk_mean / global_mean`, which averages to
1.0 across chunks and preserves `options.coverage` as the genome-wide average.
- When `cover_dataset` is called directly outside the runner (e.g., unit tests
with an in-memory `SeqRecord`), `gc_global_mean_weight` is unset and the
scaling falls back to chunk-local mean — i.e., no rescaling, which is the
correct single-chunk behavior.

The stale `TODO use gc bias to skew this number. Calculate at the runner level.`
at `generate_reads.py:72` is removed; that work is now complete.

All 603 unit/integration tests pass (up from 599 with three new
`test_compute_genome_wide_gc_mean_*` unit tests plus
`test_gc_bias_preserves_average_coverage`, which runs the full pipeline against
a half-AT / half-GC reference under a non-uniform model and asserts delivered
reads are within 20 % of `coverage × genome_length / read_len`).

# NEAT v4.4.3
Major performance and memory overhaul focused on making NEAT viable for large
genomes on supercomputing hardware. No user-visible API changes (other than
Expand Down
86 changes: 30 additions & 56 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,7 +47,6 @@ To cite this work, please use both of the following:
* [`neat model-seq-err`](#neat-model-seq-err)
* [`neat model-qual-score`](#neat-model-qual-score)
* [`neat model-gc-bias`](#neat-model-gc-bias)
* [`neat vcf_compare`](#neat-vcf_compare)
* [Tests](#tests)
* [Guide to run locally](#guide-to-run-locally)
* [Note on Sensitive Patient Data](#note-on-sensitive-patient-data)
Expand All @@ -57,15 +56,15 @@ To cite this work, please use both of the following:
The most up-to-date requirements are found in `environment.yml`.

* Some version of Anaconda to set up the environment
* `python` == 3.10.*
* `poetry` == 1.3.*
* `python` >= 3.10
* `poetry` >= 2.0
* `biopython` == 1.85
* `pkginfo`
* `matplotlib`
* `numpy`
* `numpy` >= 2.2
* `pyyaml`
* `scipy`
* `pysam`
* `pysam` >= 0.23
* `frozendict`

NEAT assumes a Linux environment and was tested primarily on Ubuntu Linux. It should work on most Linux systems. If you
Expand Down Expand Up @@ -109,7 +108,7 @@ and run:

Assuming you have installed `conda`, run `source activate` or `conda activate`.

Please note that these installation instructions support MacOS, Windows, and Linux.
NEAT is tested on Linux. macOS may work but is not regularly tested; on Windows, use WSL.

Alternatively, if you wish to work with NEAT in the development-only environment, you can use `poetry install` within
the NEAT repo, after creating the `conda` environment:
Expand Down Expand Up @@ -203,9 +202,8 @@ More parameters are below:
| `mutation_bed` | Full path to a list of regions with a column describing the mutation rate of that region, as a float with values between 0 and 0.3. The mutation rate must be in the third column as, e.g., `mut_rate`=0.00. |
| `rng_seed` | Manually enter a seed for the random number generator. Used for repeating runs. Must be an integer. |
| `min_mutations` | Set the minimum number of mutations that NEAT should add, per contig. Default is 0. We recommend setting this to at least one for small chromosomes, so NEAT will produce at least one mutation per contig. |
| `threads` | Number of threads to use. More than 1 will use multi-threading to speed up processing. |
| `mode` | `size` or `contig` whether to divide the contigs into blocks or just by contig. By `contig` is the default, but division by `size` may speed up your run. |
| `size` | Default value of 500,000. |
| `threads` | Number of threads to use. More than 1 will use multi-threading to speed up processing. With `threads > 1`, NEAT splits each contig into chunks; with `threads == 1`, one chunk per contig is used. |
| `parallel_block_size` | Per-chunk size in bases when `threads > 1`. Default `0` (auto-tune from total genome length and thread count, targeting ~8 chunks per thread). Set to a positive integer to override. Ignored when `threads == 1`. |
| `cleanup_splits` | If running more than one simulation on the same input fasta, you can reuse splits files. By default, this will be set to `False`, and splits files will be deleted at the end of the run. |
| `reuse_splits` | If an existing splits file exists in the output folder, it will use those splits, if this value is set to `True`. |

Expand Down Expand Up @@ -268,7 +266,7 @@ seqkit shuffle -s 42 reads_r2.fastq.gz -o reads_r2.shuffled.fastq.gz

### Estimated runtimes

> **Note:** The tables below are from the original NEAT 4.4 (v4.4.0) benchmark. NEAT v4.4.3 is roughly **9× faster on multi-threaded ecoli** and uses **~3× less memory** thanks to the performance work landed in that release. The relative shape of the tables (size scaling, contig vs. size mode tradeoffs) remains accurate, but absolute runtimes should be divided by ~5–10× for v4.4.3+. See ChangeLog v4.4.3 for a detailed before/after table on ecoli and a c_elegans scale-test.
> **Note:** The tables below are from the original NEAT 4.4 (v4.4.0) benchmark. NEAT v4.4.4 is roughly **13× faster on multi-threaded ecoli** and uses **~3× less memory** thanks to the performance work in v4.4.3 and v4.4.4. The relative shape of the tables (size scaling, contig vs. size mode tradeoffs) remains accurate, but absolute runtimes should be divided by ~10–15× for v4.4.4+. See `ChangeLog.md` (v4.4.3 and v4.4.4 entries) for a detailed before/after table on ecoli and a c_elegans scale-test.

To give users a sense of how long `neat read-simulator` runs may take, we benchmarked NEAT 4.4 on several reference genomes. All runs were paired-end, with read length of 150 bp, coverage of 10, fragment mean of 300 bp, and fragment standard deviation of 50 bp. Runtimes are reported as the average across three unique runs (`Avg. time (ms)`) and the corresponding runtime in minutes. Cells marked with N/A indicate that NEAT was not able to run to completion.

Expand All @@ -280,8 +278,7 @@ The inputs in single-threaded, contig-based mode most closely replicate the beha

The configuration used:

- `mode: contig`
- `threads: 1`
- `threads: 1` (NEAT processes one contig per chunk in single-thread mode)
- `cleanup_splits: True`

| Organism | File size (bytes) | Avg. runtime (ms) | Avg. runtime (min) |
Expand All @@ -299,10 +296,9 @@ For small genomes, single-threaded performance is already fast (on the order of

#### NEAT Multi-Threaded (size mode)

Here we enabled NEAT’s parallelized mode (“small filtering”), which splits contigs into size-based blocks and processes them concurrently. The configuration used:
Here we enabled NEAT’s multi-threaded mode, which splits contigs into size-based blocks and processes them concurrently. The configuration used:

- `mode: size`
- `size: 500000`
- `parallel_block_size: 500000`
- `produce_bam: false`
- `threads: 7`
- `cleanup_splits: True`
Expand All @@ -329,16 +325,17 @@ The following commands are examples for common types of data to be generated. Th
### Whole genome simulation
Simulate whole genome dataset with random variants inserted according to the default model:

Contents of neat_config.yml:
```yml
[contents of neat_config.yml]
reference: hg19.fa
read_len: 126
produce_bam: True
produce_vcf: True
paired_ended: True
fragment_mean: 300
fragment_st_dev: 30

```
```bash
neat read-simulator \
-c neat_config.yml \
-o /home/me/simulated_reads/
Expand All @@ -347,8 +344,8 @@ neat read-simulator \
### Targeted region simulation
Simulate a targeted region of a genome (e.g., an exome) with a targeted bed:

Contents of neat_config.yml:
```yml
[contents of neat_config.yml]
reference: hg19.fa
read_len: 126
produce_bam: True
Expand All @@ -357,11 +354,11 @@ paired_ended: True
fragment_mean: 300
fragment_st_dev: 30
targed_bed: hg19_exome.bed

```
```bash
neat read-simulator \
-c neat_config \
-o /home/me/simulated_reads/

```
### Parallelizing simulation
Split the contig into blocks rather than reading by contig. Even in single-threaded mode this is likely to be faster. The chunk size auto-tunes from total genome length and thread count, targeting roughly 8 chunks per thread for good load balancing — on small bacterial genomes you get ~500 kb chunks (similar to NEAT's old hardcoded default), on human-scale genomes you get ~6 Mb chunks (a few hundred total instead of thousands). Specify `parallel_block_size` explicitly if you want to override.
Expand All @@ -377,7 +374,6 @@ paired_ended: True
fragment_mean: 350
fragment_st_dev: 50
threads: 12
parallel_mode: size
# parallel_block_size omitted: auto-tuned from genome length and thread count.
# Set explicitly (e.g. parallel_block_size: 1000000) only if you have a reason.
```
Expand Down Expand Up @@ -417,7 +413,6 @@ produce_bam: true
produce_fastq: true
produce_vcf: true
threads: ${SLURM_CPUS_PER_TASK}
parallel_mode: size
YAML

neat read-simulator -c "config_${CHROM}.yml" \
Expand All @@ -429,8 +424,7 @@ After the array completes, concatenate the per-chromosome outputs:

```bash
# BAMs are already coordinate-sorted within each chromosome; cat in chromosome order
# produces a globally-sorted whole-genome BAM with no further sort needed.
samtools cat -o all.bam out/chr1/chr1_golden.bam out/chr2/chr2_golden.bam ... out/chrY/chrY_golden.bam
# produces a globally-sorted whole-genome BAM with no further 2_golden.bam ... out/chrY/chrY_golden.bam
samtools index all.bam

# FASTQs: gzip-concat preserves a valid gzip stream
Expand All @@ -445,10 +439,10 @@ bcftools index all.vcf.gz
This gives you whole-genome simulation in roughly `single-chromosome-time / nodes` wall time. For a 30× human genome with 24 nodes (each 64 cores), a per-chromosome run takes ~10 min and the array finishes in ~10 min wall, plus a few minutes for the final concat.

### Insert specific variants
Simulate a whole genome dataset with only the variants in the provided VCF file using `-v` and setting mutation rate to 0 with `-M`.
Simulate a whole genome dataset with only the variants in the provided VCF file by setting `include_vcf` to the VCF path and `mutation_rate: 0`:

Contents of neat_config.yml:
```yml
[contents of neat_config.yml]
reference: hg19.fa
read_len: 126
produce_bam: True
Expand All @@ -458,7 +452,8 @@ fragment_mean: 300
fragment_st_dev: 30
include_vcf: NA12878.vcf
mutation_rate: 0

```
```bash
neat read-simulator \
-c neat_config.yml \
-o /home/me/simulated_reads/
Expand All @@ -467,29 +462,31 @@ neat read-simulator \
### Single-end reads
Simulate single-end reads by omitting paired-ended options:

Contents of neat_config.yml:
```yml
[contents of neat_config.yml]
reference: hg18.fa
read_len: 126
produce_bam: True
produce_vcf: True

```
```bash
neat read-simulator \
-c neat_config.yml \
-o /home/me/simulated_read/
-o /home/me/simulated_read/ \
-p 126_frags
```

### Large single-end reads
Simulate PacBio-like reads by providing an error model (note that this is not yet implemented in NEAT 4.0):
Simulate PacBio-like reads by providing an error model. Note: full long-read support is planned for a separate future release and is not yet fully validated in NEAT 4.4.

Contents of neat_config.yml:
```yml
[contents of neat-config.yml]
reference: hg19.fa
read_len: 5000
error_model: errorModel_pacbio.pickle.gz
avg_seq_error: 0.1

```
```bash
neat read-simulator \
-c neat_config.yml \
-o /home/me/simulated_reads
Expand Down Expand Up @@ -601,29 +598,6 @@ neat model-gc-bias \

and creates `gc_model.pickle.gz` model in the working directory.

### `neat vcf_compare`

Tool for comparing VCF files (Not yet implemented in NEAT 4.4).

```bash
neat vcf_compare
-r <ref.fa> * Reference Fasta \
-g <golden.vcf> * Golden VCF \
-w <workflow.vcf> * Workflow VCF \
-o <prefix> * Output Prefix \
-m <track.bed> Mappability Track \
-M <int> Maptrack Min Len \
-t <regions.bed> Targetted Regions \
-T <int> Min Region Len \
-c <int> Coverage Filter Threshold [15] \
-a <float> Allele Freq Filter Threshold [0.3] \
--vcf-out Output Match/FN/FP variants [False] \
--no-plot No plotting [False] \
--incl-homs Include homozygous ref calls [False] \
--incl-fail Include calls that failed filters [False] \
--fast No equivalent variant detection [False]
```

## Tests

We provide unit tests (e.g., mutation and sequencing error models) and basic integration tests for the CLI.
Expand Down
1 change: 0 additions & 1 deletion config_template/simple_template.yml
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,6 @@ rng_seed: .
min_mutations: .
overwrite_output: .

parallel_mode: .
parallel_block_size: .
threads: .
cleanup_splits: .
Expand Down
13 changes: 5 additions & 8 deletions config_template/template_neat_config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -142,14 +142,11 @@ min_mutations: .
# type: bool | required = no | default = false
overwrite_output: .

# How to split the input reference for parallelization
# Note: If threads == 1, this option has no effect.
# type = string | required: no | default = contig | values: contig, size
parallel_mode: .

# Target block size if by = size (overlap = read_len * 2).
# Default is 500000 when by = size. Not used for by = contig
# type = int | required: no | default = 500000 (when by=size)
# Per-chunk size in bases used when threads > 1.
# Default 0 = auto-tune from total genome length and thread count (targeting ~8
# chunks per thread). Set to a positive integer to override. When threads == 1
# one chunk per contig is used and this option has no effect.
# type = int | required: no | default = 0 (auto)
parallel_block_size: .

# Maximum number of concurrent NEAT jobs (threads or hyperthreads) to run
Expand Down
1 change: 0 additions & 1 deletion neat/common/ploid_functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,6 @@ def pick_ploids(ploidy: int,
how_many = 1
else:
# if it's polyploid, we'll consider it to be on roughly half the ploids
# TODO may need to improve the modeling for polyploid, maybe
how_many = ploidy//2

# wp is just the temporary genotype list, a hat tip to the old version of NEAT
Expand Down
Loading
Loading