Develop by joshfactorial · Pull Request #296 · ncsa/NEAT

joshfactorial · 2026-05-19T17:29:27Z

Further refinements and enhancements related to processing time and memory. Release 4.4.4.

…erlap check Two hot-path improvements in the variant-generation phase, identified by a post-v4.4.3 cProfile run on ecoli 2x SE (single-thread baseline 158.6 s). 1. ContigVariants.check_if_del / check_if_ins — replace O(N) linear scans over `all_dels` / `all_ins` with O(1) lookups into per-position bucket dicts. Each indel is indexed at every reference position it covers when added; the total memory cost is O(sum of indel lengths), single-digit MB at human scale. Profile on 14,052 calls: check_if_del: 3.624 s -> 0.011 s (~325x) check_if_ins: 3.700 s -> 0.007 s (~530x) The previous representation was O(N**2) over a single variant-generation pass and would scale poorly on larger genomes. 2. variant_models.map_local_trinuc_bias — replace 64 regex scans per call (one re.finditer per trinucleotide string) with a single vectorized numpy expression that encodes each base 0-3, builds a trinuc-index array via `left*16 + mid*4 + right`, and looks up biases in one table read. The cache check is also tightened from `==` (O(N) byte compare) to `is` (O(1) object identity) since slices are different objects anyway. Profile on 14,100 calls: map_local_trinuc_bias: 13.491 s -> 0.811 s (~16x) Combined effect on the single-thread ecoli 2x SE cProfile: Total wall: 158.6 s -> 135.3 s (-15%, -23.3 s) generate_variants cumtime: 24.9 s -> 4.2 s (~5.9x) Variant generation is no longer in the top-5 hot path. Tests: all 599 pass. The `import re` in variant_models.py is no longer used and has been removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

write_bam_record was the single biggest hot path after the v4.4.3 variant work — 28 M Seq.__getitem__ calls per 185 k-read run, plus a per-base list comp for the quality string and 9 small struct.pack calls per record. Replace with bulk numpy / single struct.pack operations: - Sequence encoding: replace the per-byte Python loop for i in range(encoded_len): encoded_seq.extend(pack('<B', (SEQ_PACKED[alt_sequence[2*i]] << 4) + SEQ_PACKED[alt_sequence[2*i+1]])) with a vectorized lookup through a 256-entry table: codes = SEQ_PACKED_LOOKUP[np.frombuffer(seq_str.encode('ascii'), ...)] encoded_seq = ((codes[0::2] << 4) | codes[1::2]).tobytes() Eliminates the 28 M Biopython Seq.__getitem__ calls that dominated cumtime. - Quality encoding: replace encoded_qual = "".join([chr(n) for n in read.quality_array]).encode(...) with one `np.asarray(quality_array, dtype=np.uint8).tobytes()` call. - CIGAR ops: pack all in a single `struct.pack(f'<{n}I', ...)` instead of one pack per op + bytearray.extend. Also drop the regex CIGAR re-parse in favour of a single linear scan over the CIGAR string (no regex import needed; `re` removed). - BAM record header: one `struct.pack('<iiiIIiiii', ...)` for the nine fixed-size fields instead of nine separate pack calls glued with `+`. Effect (clean machine, ecoli 10x 4-thread untimed): SE wall: 1:35.3 (v4.4.3) -> 1:07.1 (1.42x faster, -30%) PE wall: 1:34.5 (v4.4.3) -> 1:05.9 (1.43x faster, -30%) SE CPU%: 348 -> 364 PE CPU%: 359 -> 378 (workers near-saturated) Single-thread cProfile of write_bam_record itself: self time: 12.0 s -> 3.2 s (3.75x) cumtime: 58.1 s -> 8.1 s (7.2x) Combined with the v4.4.4 variant-generation fixes (committed separately) and the v4.4.3 architectural work, NEAT is now 13.3-13.4x faster than v4.4.2 on the ecoli benchmark, with strictly coordinate-sorted output and bounded per-worker memory. All 599 tests pass. Bumps version to 4.4.4 for the release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Records the three hot-path fixes in this release with before/after numbers: - ContigVariants overlap-check O(1) via position-bucket dicts - map_local_trinuc_bias vectorization (16x) - write_bam_record vectorization (7.2x cumtime) Total release effect on ecoli 10x 4-thread: SE 1:35 -> 1:07 (1.42x) PE 1:34 -> 1:06 (1.43x) 13.3-13.4x faster than v4.4.2 cumulative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Low-risk audit items applied across the README: - Prerequisites: poetry pinned to >= 2.0 (was 1.3.*, no longer accurate against pyproject.toml's poetry-core>=2.0 requirement). Added version constraints for numpy (>=2.2) and pysam (>=0.23) to match pyproject. - Removed the OS-support contradiction. The Prerequisites section already notes "NEAT assumes a Linux environment ... install WSL on other OSes"; the installation section now agrees ("Linux tested; macOS may work; Windows via WSL") instead of contradicting it with the old "supports MacOS, Windows, and Linux" claim. - Usage table: corrected `mode`/`size` to the actual config keys `parallel_mode`/`parallel_block_size`. Updated descriptions for the new defaults (`parallel_mode: size` for threads>1, forced to `contig` for threads==1; `parallel_block_size: 0` = auto-tune). - Estimated-runtimes banner: updated to point at v4.4.4 numbers (~13× faster than v4.4.2 on ecoli) and to reference both v4.4.3 and v4.4.4 ChangeLog entries. - Estimated-runtimes config snippets: same `mode`/`size` → `parallel_mode`/ `parallel_block_size` correction. - "Insert specific variants" example: dropped the `-v`/`-M` CLI flag prose (those flags were removed when NEAT switched to config-file-only). - Three remaining Examples sections (Insert variants, Single-end, Large single-end): matched the fence formatting pattern (separate yml + bash blocks, "Contents of neat_config.yml:" label) that was already applied to the first two examples. - Long-read caveat: reworded "not yet implemented in NEAT 4.0" to "planned for a separate future release; not yet fully validated in NEAT 4.4". - Removed the `neat vcf_compare` section entirely. The subcommand does not exist (confirmed via `neat --help`) and the section was documenting an unimplemented interface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…om threads `parallel_mode` was a YAML option with two values, `'size'` and `'contig'`, which never actually let the user change behavior: - When `threads == 1`, Options force-set it to `'contig'` regardless of user input. - When `threads > 1`, `'size'` was the default; `'contig'` would have yielded one worker per reference contig, but `chunk_record` in split_inputs already produces a single chunk for contigs ≤ chunk size, so `'size'` is a strict behavioral superset. Removing it cleans up dead config and eliminates a footgun (a user setting `parallel_mode: contig` with `threads: 8` thinking they'd get one-worker- per-contig would in fact just disable sub-contig splitting on large contigs and silently reduce parallelism). Changes: - options.py: drop `parallel_mode` from the Options __init__ signature, schema, and stored state. Replace the post-parse log block with a threads-keyed one. Add a `DEPRECATED_KEYS` table and a deprecation warning in `read_yaml` so old configs with `parallel_mode: ...` keep parsing (the key is silently ignored after a one-line warning). - split_inputs.py: branch on `options.threads == 1` instead of `options.parallel_mode == "contig"` for one-chunk-per-contig output. - runner.py: simplify the auto-tune gate; drop the `options.parallel_mode == "size"` predicate (always true when `threads > 1`). - config_template/template_neat_config.yml, config_template/simple_template.yml: drop the `parallel_mode: .` line and tighten the comment for `parallel_block_size`. - README.md: drop the `parallel_mode` table row; clean up references in the Estimated runtimes section and HPC examples; tighten the parallel_block_size description. - tests/test_models/test_split_inputs.py: the `mode=` parameter in the test helper now maps to `opts.threads` (1 for contig-style, 4 for size-style) — callsites unchanged. - tests/test_read_simulator/test_options.py: - drop `parallel_mode` lines from the embedded YAML samples - drop the `assert opts.parallel_mode == "contig"` default-values assertion - swap the `parallel_mode` example in the choice-validator tests for a synthetic key (`_test_choice_key`) since no production option uses the `choice` validator anymore - replace `test_log_configuration_threads_one_forces_contig` (which was checking a force-override that no longer exists) with `test_read_yaml_deprecated_parallel_mode_warns` (verifies the deprecation warning + silent-ignore path works end-to-end via `Options.from_cli`) End-to-end check: an old YAML with `parallel_mode: size` still runs to completion and emits a single warning line: WARNING: `parallel_mode` is deprecated and ignored; splitting strategy is now derived from `threads`. All 599 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The removal is a YAML-schema cleanup; existing configs keep working via a deprecation warning path. Behavior is unchanged — parallel_mode never actually let users change anything under the old logic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per-chunk read counts in cover_dataset were scaled by chunk_mean/max_weight when a non-uniform gc_model was configured, making `coverage` the peak coverage at the GC bias maximum rather than the documented genome-wide average. The fix precomputes the genome-wide mean weight once at the runner level and divides per-chunk reads by that, so the chunk-scaling ratio averages to 1.0 and total reads match coverage * genome_length / read_len. - New compute_genome_wide_gc_mean_weight helper (vectorized, reuses the same prefix-sum windowing as cover_dataset; uniform models short-circuit). - Runner stores the result on options.gc_global_mean_weight so workers don't redundantly scan the reference. - cover_dataset falls back to chunk-local mean (no rescaling) when the field is unset, preserving single-chunk unit-test behavior. Drive-by: drop dead MarkovQualityModel stub and stale TODO in test_contig_variants.py (the remove_variant fix landed separately). ChangeLog rewritten as a single v4.5.0 entry covering the perf work already shipped in v4.4.4, the parallel_mode removal, and this fix. All 603 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five comments were either factually wrong, idle speculation, or dead work-in-progress with no path back into the code: - model_sequencing_error/utils.py: drop "TODO implement plotting" — plt is in fact used heavily later in the file. Remove the commented-out shape_curves.append() + the 8-line shape_curves DataFrame scaffolding; both referenced pandas/seaborn that aren't imported and have sat dead for years. - model_fragment_lengths/utils.py: drop speculative docstring TODO about filter_lengths "maybe" being useful elsewhere. - common/ploid_functions.py: drop "TODO may need to improve the modeling for polyploid, maybe" — pure musing, not actionable as written. No behavior change; 603 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both lists were built up in gen_mut_model/runner.py but never read or returned anywhere in the codebase — scaffolding for two features that were never wired into the saved mutation model: - common_variants: 95th-percentile-frequency variants from the input VCF. - high_mut_regions: variant-density-clustered hotspots, top 3% by density, with overlapping regions merged. The design intent (pipe both into the saved mutation model so the simulator can bias generation accordingly) is now captured in #295. If the work is revived, the cluster_list and find_caf helpers are still available in gen_mut_model/utils.py. Collateral cleanup: drop the now-unused vcf_common dict and the cluster_list / find_caf imports from runner.py. No behavior change in the produced model; 603 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The release that adds the GC bias coverage fix, the follow-up perf pass, and the parallel_mode removal will be tagged 4.4.4 instead of 4.5.0. The GC bias fix does change output bytes when a non-uniform gc_model is configured, but the user-facing config has always documented `coverage` as the genome-wide average — the prior behavior was the bug. Shipping as a patch release is consistent with framing this as a fix rather than a deliberate behavior change. Updates: - pyproject.toml: 4.5.0 -> 4.4.4 - ChangeLog.md: rename v4.5.0 entry -> v4.4.4; benchmark column relabeled; drop the "reason for the minor version bump" sentence and replace it with concrete coverage-shortfall numbers (~70 % delivered pre-fix). - README v4.4.4 references are already correct; no edit needed. 603 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

joshfactorial and others added 11 commits May 18, 2026 22:13

Merge branch 'main' into develop

72b6c3b

joshfactorial self-assigned this May 19, 2026

joshfactorial added the enhancement New feature or request label May 19, 2026

joshfactorial linked an issue May 19, 2026 that may be closed by this pull request

GC bias #282

Closed

joshfactorial removed a link to an issue May 19, 2026

GC bias #282

Closed

joshfactorial linked an issue May 19, 2026 that may be closed by this pull request

Fix vcf compare #292

Closed

joshfactorial merged commit d44f314 into main May 19, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Develop#296

Develop#296
joshfactorial merged 11 commits into
mainfrom
develop

joshfactorial commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

joshfactorial commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant