Skip to content

Commit 799a67e

Browse files
Merge pull request #286 from ncsa/develop
Develop
2 parents 349f9ca + 34a2e7e commit 799a67e

13 files changed

Lines changed: 697 additions & 256 deletions

File tree

ChangeLog.md

Lines changed: 97 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,100 @@
1+
# NEAT v4.4.3
2+
Major performance and memory overhaul focused on making NEAT viable for large
3+
genomes on supercomputing hardware. No user-visible API changes (other than
4+
chunk size now auto-tuning by default).
5+
6+
**Benchmark (ecoli 10× coverage, 4 threads, identical configs):**
7+
8+
| Metric | v4.4.2 | v4.4.3 | Improvement |
9+
|-------------------------|------------|------------|-----------------|
10+
| ecoli SE wall time | 14:55 | 1:35 | 9.4× faster |
11+
| ecoli PE wall time | 14:46 | 1:35 | 9.4× faster |
12+
| ecoli SE total CPU | 3,227 s | 331 s | 9.7× less |
13+
| ecoli PE total CPU | 3,168 s | 338 s | 9.4× less |
14+
| Peak resident memory | 549 MB | 175 MB | 3.1× less |
15+
| Peak heap (memray) | 1.27 GB | 0.32 GB | 4× less |
16+
| Per-worker memory | O(N×cov) | O(1) | bounded |
17+
| `pysam.sort` calls | 2 | 0 | gone |
18+
| BAM correctness | 0.06% dups | strict | fixed |
19+
20+
**Versus NEAT 2.1 (single-threaded baseline):**
21+
- SE: 12:28 → 1:35 (7.9× faster, 56% less CPU)
22+
- PE: 20:12 → 1:35 (12.8× faster, 72% less CPU)
23+
24+
**Scale-test (c_elegans 10× coverage, 4 threads, 100 Mb genome — ~7× the
25+
ecoli reference):**
26+
27+
| Metric | Value |
28+
|-------------------------|---------------------------------------------------------|
29+
| Wall time | 19:16 |
30+
| Total CPU | 4,085 s |
31+
| Peak resident memory | 304 MB |
32+
| BAM records | 6,685,764 |
33+
| BAM sort violations | 0 |
34+
| Stitch step (parallel) | 5.3 s |
35+
| Auto-tuned chunk size | 3.1 Mb (35 chunks) |
36+
37+
Scaling behavior vs ecoli is ~linear in genome size as expected. The stitch
38+
step is bounded by raw disk I/O via `pysam.cat`, so it stays at single-digit
39+
seconds even as the BAM grows. Per-worker peak RSS is 304 MB ÷ 4 ≈ 76 MB,
40+
which is the reference segment + models — independent of chunk size and
41+
coverage.
42+
43+
**What changed in the hot path:**
44+
- Vectorized error sampling in `get_sequencing_errors` — replaced a ~150-iteration
45+
per-read Python loop with batched numpy. Eliminated 28M `np.prod` calls per
46+
185k-read run.
47+
- Vectorized `get_quality_scores` — replaced per-base scalar `rng.normal` with
48+
one batched call.
49+
- Replaced per-read `PairwiseAligner.align()` in `make_cigar` with a direct
50+
walker that builds the CIGAR from known error/mutation positions in O(L).
51+
99% of reads now skip alignment entirely.
52+
- Rewrote `apply_errors` as a single ascending-position pass. The previous
53+
implementation did one `np.concatenate` and one `MutableSeq` slice/concat
54+
per error — quadratic in errors-per-read. The new pass is linear regardless
55+
of error count.
56+
- Removed redundant `deepcopy(self.reference_segment)` calls in
57+
`convert_masking` and `finalize_read_and_write`. Biopython `Seq` is
58+
immutable; the downstream operations make their own working copies.
59+
60+
**What changed in the I/O path:**
61+
- Removed both `pysam.sort` calls. Per-worker BAMs are emitted coordinate-sorted
62+
by construction; `pysam.merge` of sorted inputs already produces sorted output.
63+
The final sort allocated a 1 GB buffer that dominated peak memory.
64+
- Replaced `pysam.merge` with `pysam.cat` for the final stitch. cat does a raw
65+
BGZF concatenation (no decompression / re-encode), bounded by raw disk I/O
66+
instead of BGZF rate. At human-30× scale this is the difference between a
67+
multi-hour stitch and a multi-minute one.
68+
- Each chunk now owns a non-overlapping reference range for read1 placement
69+
(`responsibility_length`), enabling the cat-based stitch and eliminating
70+
~0.06% over-coverage in chunk-overlap regions.
71+
- Streamed FASTQ and BAM records directly to output during read generation.
72+
Workers no longer accumulate `reads_to_write` — per-worker memory is now
73+
bounded by reference segment + models, not by chunk size × coverage.
74+
- Stitch steps (FASTQ concat, VCF dedup, BAM cat) now run concurrently in
75+
threads. On a single-disk system the wall is bounded by the BAM cat alone;
76+
on parallel filesystems the overlap is more pronounced.
77+
- FASTQ stitch is now byte-level: per-chunk gzip streams are concatenated
78+
without decompression / re-encode (concatenated gzip streams form a valid
79+
gzip file per the spec).
80+
81+
**Defaults and ergonomics:**
82+
- `parallel_block_size` now auto-tunes from genome length and thread count
83+
(target: ~8 chunks per thread). For small bacterial genomes this matches the
84+
old hardcoded 500 kb; for human-scale genomes it produces ~6 Mb chunks
85+
instead of ~500 kb, dramatically reducing stitch overhead. Specify the option
86+
explicitly to override.
87+
- FASTQ output is no longer shuffled; reads come out in the natural sampling
88+
order. Pipe through `seqkit shuffle` if you need a uniform shuffle (documented
89+
in README).
90+
- Added a "Multi-node deployment on HPC clusters" section to the README
91+
showing a SLURM array-job pattern for whole-genome simulation across nodes.
92+
93+
**Caveats:**
94+
- Several of the vectorization fixes change how the PRNG stream is consumed.
95+
Same seed will produce statistically equivalent reads, but not bit-identical
96+
to v4.4.2. Re-baseline any regression tests that compared exact output.
97+
198
# NEAT v4.4.2
299
- Added GC bias modeling to generate reads and a function to create a GC bias model from real data.
3100
- Added improvements and efficiency upgrades to generate-reads.

README.md

Lines changed: 75 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -251,8 +251,25 @@ Features:
251251
- Output a BAM file with the 'golden' set of aligned reads. These indicate where each read originated and how it should be aligned with the reference
252252
- Create paired tumour/normal datasets using characteristics learned from real tumour data
253253

254+
### Output ordering
255+
256+
The BAM file NEAT produces is coordinate-sorted (by construction at write time — no separate sort pass is run, which used to allocate ~1 GB of sort buffer at stitch time).
257+
258+
The FASTQ files are written in the natural fragment-sampling order. This is *roughly* random — fragments are drawn from batched random positions — but is not a strict uniform shuffle. If your downstream tooling assumes a real-sequencer-style shuffle, pipe the FASTQ through [`seqkit shuffle`](https://bioinf.shenwei.me/seqkit/usage/#shuffle):
259+
260+
```sh
261+
# Single-end
262+
seqkit shuffle reads.fastq.gz -o reads.shuffled.fastq.gz
263+
264+
# Paired-end — use a shared seed so the two files stay mate-aligned
265+
seqkit shuffle -s 42 reads_r1.fastq.gz -o reads_r1.shuffled.fastq.gz
266+
seqkit shuffle -s 42 reads_r2.fastq.gz -o reads_r2.shuffled.fastq.gz
267+
```
268+
254269
### Estimated runtimes
255270

271+
> **Note:** The tables below are from the original NEAT 4.4 (v4.4.0) benchmark. NEAT v4.4.3 is roughly **9× faster on multi-threaded ecoli** and uses **~3× less memory** thanks to the performance work landed in that release. The relative shape of the tables (size scaling, contig vs. size mode tradeoffs) remains accurate, but absolute runtimes should be divided by ~5–10× for v4.4.3+. See ChangeLog v4.4.3 for a detailed before/after table on ecoli and a c_elegans scale-test.
272+
256273
To give users a sense of how long `neat read-simulator` runs may take, we benchmarked NEAT 4.4 on several reference genomes. All runs were paired-end, with read length of 150 bp, coverage of 10, fragment mean of 300 bp, and fragment standard deviation of 50 bp. Runtimes are reported as the average across three unique runs (`Avg. time (ms)`) and the corresponding runtime in minutes. Cells marked with N/A indicate that NEAT was not able to run to completion.
257274

258275
Benchmarks were run on a System76 Meerkat with a 13th Gen Intel Core i3-1315U (8 logical cores, up to 4.50 GHz) and 16 GiB RAM, using a 512 GB SSD and Ubuntu 24.04.3 LTS (Linux kernel 6.14). Actual runtimes will vary depending on your hardware.
@@ -347,7 +364,7 @@ neat read-simulator \
347364

348365
```
349366
### Parallelizing simulation
350-
In this case, you would want to split the contig into blocks, rather than reading by contig. Even in single-threaded mode, this is likely to be faster. The default block size of 500,000 yields results quickly on a variety of datasets and can be easily modified to meet your requirements.
367+
Split the contig into blocks rather than reading by contig. Even in single-threaded mode this is likely to be faster. The chunk size auto-tunes from total genome length and thread count, targeting roughly 8 chunks per thread for good load balancing — on small bacterial genomes you get ~500 kb chunks (similar to NEAT's old hardcoded default), on human-scale genomes you get ~6 Mb chunks (a few hundred total instead of thousands). Specify `parallel_block_size` explicitly if you want to override.
351368

352369
Also, we demonstrate the situation where you do not want any logs produced:
353370

@@ -361,7 +378,8 @@ fragment_mean: 350
361378
fragment_st_dev: 50
362379
threads: 12
363380
parallel_mode: size
364-
parallel_block_size: 500000
381+
# parallel_block_size omitted: auto-tuned from genome length and thread count.
382+
# Set explicitly (e.g. parallel_block_size: 1000000) only if you have a reason.
365383
```
366384
Then run with the command:
367385
```
@@ -371,6 +389,61 @@ neat read-simulator \
371389
-o /home/me/simulated_reads/
372390
```
373391

392+
### Multi-node deployment on HPC clusters
393+
394+
NEAT runs on a single node using Python's `multiprocessing`. To use multiple nodes on a supercomputer, dispatch one NEAT job per contig (or contig group) as a job-array element and concatenate the outputs afterwards. Each array task gets its own node and uses all available cores on it. NEAT itself doesn't need to know about the cluster.
395+
396+
SLURM example with one task per human chromosome (24 tasks):
397+
398+
```bash
399+
#!/bin/bash
400+
#SBATCH --job-name=neat-array
401+
#SBATCH --array=1-24
402+
#SBATCH --cpus-per-task=64
403+
#SBATCH --mem=64G
404+
#SBATCH --time=04:00:00
405+
406+
CHROM=$(awk "NR==$SLURM_ARRAY_TASK_ID" chroms.txt) # e.g. chr1, chr2, ...
407+
samtools faidx hg38.fa "$CHROM" > "ref_${CHROM}.fa" # one chromosome per task
408+
409+
cat > "config_${CHROM}.yml" <<YAML
410+
reference: ref_${CHROM}.fa
411+
read_len: 150
412+
coverage: 30
413+
paired_ended: true
414+
fragment_mean: 400
415+
fragment_st_dev: 50
416+
produce_bam: true
417+
produce_fastq: true
418+
produce_vcf: true
419+
threads: ${SLURM_CPUS_PER_TASK}
420+
parallel_mode: size
421+
YAML
422+
423+
neat read-simulator -c "config_${CHROM}.yml" \
424+
-o "out/${CHROM}/" \
425+
-p "${CHROM}"
426+
```
427+
428+
After the array completes, concatenate the per-chromosome outputs:
429+
430+
```bash
431+
# BAMs are already coordinate-sorted within each chromosome; cat in chromosome order
432+
# produces a globally-sorted whole-genome BAM with no further sort needed.
433+
samtools cat -o all.bam out/chr1/chr1_golden.bam out/chr2/chr2_golden.bam ... out/chrY/chrY_golden.bam
434+
samtools index all.bam
435+
436+
# FASTQs: gzip-concat preserves a valid gzip stream
437+
cat out/*/chr*_r1.fastq.gz > all_r1.fastq.gz
438+
cat out/*/chr*_r2.fastq.gz > all_r2.fastq.gz
439+
440+
# VCFs: use bcftools concat for proper merging
441+
bcftools concat -o all.vcf.gz -Oz out/*/chr*_golden.vcf.gz
442+
bcftools index all.vcf.gz
443+
```
444+
445+
This gives you whole-genome simulation in roughly `single-chromosome-time / nodes` wall time. For a 30× human genome with 24 nodes (each 64 cores), a per-chromosome run takes ~10 min and the array finishes in ~10 min wall, plus a few minutes for the final concat.
446+
374447
### Insert specific variants
375448
Simulate a whole genome dataset with only the variants in the provided VCF file using `-v` and setting mutation rate to 0 with `-M`.
376449

neat/models/error_models.py

Lines changed: 40 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -81,32 +81,27 @@ def get_quality_scores(
8181
:return: An array of quality scores.
8282
"""
8383
if self.uniform_quality_score:
84-
return np.array([self.uniform_quality_score] * length)
85-
else:
86-
if length == model_read_length:
87-
quality_index_map = np.arange(model_read_length)
88-
else:
89-
# This is basically a way to evenly spread the distribution across the number of bases in the read
90-
quality_index_map = np.array(
91-
[max([0, model_read_length * n // length]) for n in range(length)]
92-
)
93-
94-
temp_qual_array = []
95-
for i in quality_index_map:
96-
score = rng.normal(
97-
self.quality_score_probabilities[i][0],
98-
scale=self.quality_score_probabilities[i][1]
99-
)
100-
# make sure score is in range and an int
101-
score = round(score)
102-
if score > 42:
103-
score = 42
104-
if score < 1:
105-
score = 1
84+
return np.full(length, self.uniform_quality_score, dtype=int)
10685

107-
temp_qual_array.append(score)
86+
# Map each position in the read onto a row of the (model_read_length, 2) score
87+
# parameters table. When read length matches the model exactly this is just an
88+
# identity; otherwise we evenly spread the model distribution across the read.
89+
if length == model_read_length:
90+
quality_index_map = np.arange(model_read_length, dtype=np.int64)
91+
else:
92+
quality_index_map = np.maximum(
93+
0, (model_read_length * np.arange(length, dtype=np.int64)) // length
94+
)
10895

109-
return np.array(temp_qual_array)
96+
# Batched per-position normal draws — one rng.normal call returning `length` draws
97+
# with element-wise (mean, scale) parameters. Replaces a ~150-iteration Python
98+
# loop calling rng.normal scalar-per-base. Same statistical distribution but the
99+
# PRNG stream is consumed in a different order, so seeded outputs are not
100+
# bit-identical to the prior scalar-loop implementation.
101+
means = self.quality_score_probabilities[quality_index_map, 0]
102+
scales = self.quality_score_probabilities[quality_index_map, 1]
103+
scores = rng.normal(means, scales)
104+
return np.clip(np.rint(scores).astype(int), 1, 42)
110105

111106

112107
class MarkovQualityModel:
@@ -182,30 +177,34 @@ def get_sequencing_errors(
182177
:return: Modified sequence and associated quality scores
183178
"""
184179

185-
error_indexes = []
186180
introduced_errors = []
187-
# pre-compute the error rate for each quality score. This is the inverse of the phred score equation
188-
quality_score_error_rate: dict[int, float] = {x: 10. ** (-x / 10) for x in quality_scores}
189181

190182
# The use case here would be someone running a simulation where they want no sequencing errors.
191183
# No need to run any loops in this case.
192184
if self.average_error == 0:
193185
return introduced_errors
194-
else:
195-
i = len(quality_scores)
196-
while len(error_indexes) < num_errors and i > 0:
197-
index = rng.choice(list(range(len(quality_scores))))
198-
if rng.random() < quality_score_error_rate[quality_scores[index]]:
199-
error_indexes.append(index)
200-
i -= 1
201-
# Fallback: if quality scores are too high to naturally reach num_errors,
202-
# force errors at positions with at-or-below-median quality scores.
203-
# Using <= so that uniform quality arrays (all scores equal) always make progress.
186+
187+
n = len(quality_scores)
188+
# Batched rejection sampling: draw n candidate indices and n uniform deviates in two numpy
189+
# calls, then accept the first num_errors candidates where the deviate is below the
190+
# quality-derived error rate. Equivalent in distribution to the per-iteration scalar loop
191+
# but ~150x cheaper in Python overhead. Statistical caveat: this changes the order in
192+
# which the underlying PRNG stream is consumed, so seeded runs are not bit-identical to
193+
# the prior interleaved-draw implementation.
194+
candidate_indices = rng.integers(n, size=n)
195+
candidate_randoms = rng.random(size=n)
196+
rates_at_candidates = 10.0 ** (-quality_scores[candidate_indices].astype(float) / 10.0)
197+
accepted = candidate_indices[candidate_randoms < rates_at_candidates]
198+
error_indexes = accepted[:num_errors].tolist()
199+
200+
if len(error_indexes) < num_errors:
201+
# Fallback: if quality scores are too high to naturally reach num_errors, force errors
202+
# at positions with at-or-below-median quality scores. Using <= so that uniform
203+
# quality arrays (all scores equal) always make progress.
204204
median_score = median(quality_scores)
205-
while len(error_indexes) < num_errors:
206-
index = rng.integers(len(quality_scores))
207-
if quality_scores[index] <= median_score:
208-
error_indexes.append(index)
205+
eligible = np.flatnonzero(quality_scores <= median_score)
206+
needed = num_errors - len(error_indexes)
207+
error_indexes.extend(rng.choice(eligible, size=needed, replace=True).tolist())
209208

210209
total_indel_length = 0
211210
# To prevent deletion collisions

0 commit comments

Comments
 (0)