Skip to content

Develop#296

Merged
joshfactorial merged 11 commits into
mainfrom
develop
May 19, 2026
Merged

Develop#296
joshfactorial merged 11 commits into
mainfrom
develop

Conversation

@joshfactorial
Copy link
Copy Markdown
Collaborator

Further refinements and enhancements related to processing time and memory. Release 4.4.4.

joshfactorial and others added 11 commits May 18, 2026 22:13
…erlap check

Two hot-path improvements in the variant-generation phase, identified by a
post-v4.4.3 cProfile run on ecoli 2x SE (single-thread baseline 158.6 s).

1. ContigVariants.check_if_del / check_if_ins — replace O(N) linear scans
   over `all_dels` / `all_ins` with O(1) lookups into per-position bucket
   dicts. Each indel is indexed at every reference position it covers when
   added; the total memory cost is O(sum of indel lengths), single-digit
   MB at human scale. Profile on 14,052 calls:
     check_if_del: 3.624 s -> 0.011 s  (~325x)
     check_if_ins: 3.700 s -> 0.007 s  (~530x)
   The previous representation was O(N**2) over a single variant-generation
   pass and would scale poorly on larger genomes.

2. variant_models.map_local_trinuc_bias — replace 64 regex scans per call
   (one re.finditer per trinucleotide string) with a single vectorized
   numpy expression that encodes each base 0-3, builds a trinuc-index
   array via `left*16 + mid*4 + right`, and looks up biases in one table
   read. The cache check is also tightened from `==` (O(N) byte compare)
   to `is` (O(1) object identity) since slices are different objects
   anyway. Profile on 14,100 calls:
     map_local_trinuc_bias: 13.491 s -> 0.811 s  (~16x)

Combined effect on the single-thread ecoli 2x SE cProfile:
  Total wall: 158.6 s -> 135.3 s  (-15%, -23.3 s)
  generate_variants cumtime: 24.9 s -> 4.2 s  (~5.9x)
  Variant generation is no longer in the top-5 hot path.

Tests: all 599 pass. The `import re` in variant_models.py is no longer used
and has been removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
write_bam_record was the single biggest hot path after the v4.4.3 variant
work — 28 M Seq.__getitem__ calls per 185 k-read run, plus a per-base list
comp for the quality string and 9 small struct.pack calls per record.
Replace with bulk numpy / single struct.pack operations:

- Sequence encoding: replace the per-byte Python loop
    for i in range(encoded_len):
        encoded_seq.extend(pack('<B',
            (SEQ_PACKED[alt_sequence[2*i]] << 4) +
            SEQ_PACKED[alt_sequence[2*i+1]]))
  with a vectorized lookup through a 256-entry table:
    codes = SEQ_PACKED_LOOKUP[np.frombuffer(seq_str.encode('ascii'), ...)]
    encoded_seq = ((codes[0::2] << 4) | codes[1::2]).tobytes()
  Eliminates the 28 M Biopython Seq.__getitem__ calls that dominated cumtime.

- Quality encoding: replace
    encoded_qual = "".join([chr(n) for n in read.quality_array]).encode(...)
  with one `np.asarray(quality_array, dtype=np.uint8).tobytes()` call.

- CIGAR ops: pack all in a single `struct.pack(f'<{n}I', ...)` instead of
  one pack per op + bytearray.extend. Also drop the regex CIGAR re-parse in
  favour of a single linear scan over the CIGAR string (no regex import
  needed; `re` removed).

- BAM record header: one `struct.pack('<iiiIIiiii', ...)` for the nine
  fixed-size fields instead of nine separate pack calls glued with `+`.

Effect (clean machine, ecoli 10x 4-thread untimed):
  SE wall: 1:35.3 (v4.4.3) -> 1:07.1  (1.42x faster, -30%)
  PE wall: 1:34.5 (v4.4.3) -> 1:05.9  (1.43x faster, -30%)
  SE CPU%: 348 -> 364   PE CPU%: 359 -> 378   (workers near-saturated)

Single-thread cProfile of write_bam_record itself:
  self time:   12.0 s -> 3.2 s   (3.75x)
  cumtime:     58.1 s -> 8.1 s   (7.2x)

Combined with the v4.4.4 variant-generation fixes (committed separately) and
the v4.4.3 architectural work, NEAT is now 13.3-13.4x faster than v4.4.2 on
the ecoli benchmark, with strictly coordinate-sorted output and bounded
per-worker memory.

All 599 tests pass. Bumps version to 4.4.4 for the release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Records the three hot-path fixes in this release with before/after numbers:
  - ContigVariants overlap-check O(1) via position-bucket dicts
  - map_local_trinuc_bias vectorization (16x)
  - write_bam_record vectorization (7.2x cumtime)

Total release effect on ecoli 10x 4-thread:
  SE 1:35 -> 1:07 (1.42x)
  PE 1:34 -> 1:06 (1.43x)
13.3-13.4x faster than v4.4.2 cumulative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Low-risk audit items applied across the README:

- Prerequisites: poetry pinned to >= 2.0 (was 1.3.*, no longer accurate against
  pyproject.toml's poetry-core>=2.0 requirement). Added version constraints
  for numpy (>=2.2) and pysam (>=0.23) to match pyproject.
- Removed the OS-support contradiction. The Prerequisites section already
  notes "NEAT assumes a Linux environment ... install WSL on other OSes"; the
  installation section now agrees ("Linux tested; macOS may work; Windows via
  WSL") instead of contradicting it with the old "supports MacOS, Windows,
  and Linux" claim.
- Usage table: corrected `mode`/`size` to the actual config keys
  `parallel_mode`/`parallel_block_size`. Updated descriptions for the new
  defaults (`parallel_mode: size` for threads>1, forced to `contig` for
  threads==1; `parallel_block_size: 0` = auto-tune).
- Estimated-runtimes banner: updated to point at v4.4.4 numbers (~13× faster
  than v4.4.2 on ecoli) and to reference both v4.4.3 and v4.4.4 ChangeLog
  entries.
- Estimated-runtimes config snippets: same `mode`/`size` → `parallel_mode`/
  `parallel_block_size` correction.
- "Insert specific variants" example: dropped the `-v`/`-M` CLI flag prose
  (those flags were removed when NEAT switched to config-file-only).
- Three remaining Examples sections (Insert variants, Single-end, Large
  single-end): matched the fence formatting pattern (separate yml + bash
  blocks, "Contents of neat_config.yml:" label) that was already applied to
  the first two examples.
- Long-read caveat: reworded "not yet implemented in NEAT 4.0" to "planned
  for a separate future release; not yet fully validated in NEAT 4.4".
- Removed the `neat vcf_compare` section entirely. The subcommand does not
  exist (confirmed via `neat --help`) and the section was documenting an
  unimplemented interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…om threads

`parallel_mode` was a YAML option with two values, `'size'` and `'contig'`,
which never actually let the user change behavior:

  - When `threads == 1`, Options force-set it to `'contig'` regardless of
    user input.
  - When `threads > 1`, `'size'` was the default; `'contig'` would have
    yielded one worker per reference contig, but `chunk_record` in
    split_inputs already produces a single chunk for contigs ≤ chunk size,
    so `'size'` is a strict behavioral superset.

Removing it cleans up dead config and eliminates a footgun (a user setting
`parallel_mode: contig` with `threads: 8` thinking they'd get one-worker-
per-contig would in fact just disable sub-contig splitting on large
contigs and silently reduce parallelism).

Changes:

- options.py: drop `parallel_mode` from the Options __init__ signature,
  schema, and stored state. Replace the post-parse log block with a
  threads-keyed one. Add a `DEPRECATED_KEYS` table and a deprecation
  warning in `read_yaml` so old configs with `parallel_mode: ...` keep
  parsing (the key is silently ignored after a one-line warning).
- split_inputs.py: branch on `options.threads == 1` instead of
  `options.parallel_mode == "contig"` for one-chunk-per-contig output.
- runner.py: simplify the auto-tune gate; drop the
  `options.parallel_mode == "size"` predicate (always true when
  `threads > 1`).
- config_template/template_neat_config.yml,
  config_template/simple_template.yml: drop the `parallel_mode: .` line
  and tighten the comment for `parallel_block_size`.
- README.md: drop the `parallel_mode` table row; clean up references in
  the Estimated runtimes section and HPC examples; tighten the
  parallel_block_size description.
- tests/test_models/test_split_inputs.py: the `mode=` parameter in the
  test helper now maps to `opts.threads` (1 for contig-style, 4 for
  size-style) — callsites unchanged.
- tests/test_read_simulator/test_options.py:
  - drop `parallel_mode` lines from the embedded YAML samples
  - drop the `assert opts.parallel_mode == "contig"` default-values
    assertion
  - swap the `parallel_mode` example in the choice-validator tests for a
    synthetic key (`_test_choice_key`) since no production option uses
    the `choice` validator anymore
  - replace `test_log_configuration_threads_one_forces_contig` (which
    was checking a force-override that no longer exists) with
    `test_read_yaml_deprecated_parallel_mode_warns` (verifies the
    deprecation warning + silent-ignore path works end-to-end via
    `Options.from_cli`)

End-to-end check: an old YAML with `parallel_mode: size` still runs to
completion and emits a single warning line:
  WARNING: `parallel_mode` is deprecated and ignored; splitting strategy
  is now derived from `threads`.

All 599 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The removal is a YAML-schema cleanup; existing configs keep working via a
deprecation warning path. Behavior is unchanged — parallel_mode never
actually let users change anything under the old logic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-chunk read counts in cover_dataset were scaled by chunk_mean/max_weight
when a non-uniform gc_model was configured, making `coverage` the peak
coverage at the GC bias maximum rather than the documented genome-wide
average. The fix precomputes the genome-wide mean weight once at the
runner level and divides per-chunk reads by that, so the chunk-scaling
ratio averages to 1.0 and total reads match coverage * genome_length / read_len.

- New compute_genome_wide_gc_mean_weight helper (vectorized, reuses the
  same prefix-sum windowing as cover_dataset; uniform models short-circuit).
- Runner stores the result on options.gc_global_mean_weight so workers
  don't redundantly scan the reference.
- cover_dataset falls back to chunk-local mean (no rescaling) when the
  field is unset, preserving single-chunk unit-test behavior.

Drive-by: drop dead MarkovQualityModel stub and stale TODO in
test_contig_variants.py (the remove_variant fix landed separately).

ChangeLog rewritten as a single v4.5.0 entry covering the perf work
already shipped in v4.4.4, the parallel_mode removal, and this fix.
All 603 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five comments were either factually wrong, idle speculation, or dead
work-in-progress with no path back into the code:

- model_sequencing_error/utils.py: drop "TODO implement plotting" — plt
  is in fact used heavily later in the file. Remove the commented-out
  shape_curves.append() + the 8-line shape_curves DataFrame scaffolding;
  both referenced pandas/seaborn that aren't imported and have sat dead
  for years.
- model_fragment_lengths/utils.py: drop speculative docstring TODO about
  filter_lengths "maybe" being useful elsewhere.
- common/ploid_functions.py: drop "TODO may need to improve the modeling
  for polyploid, maybe" — pure musing, not actionable as written.

No behavior change; 603 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both lists were built up in gen_mut_model/runner.py but never read or
returned anywhere in the codebase — scaffolding for two features that
were never wired into the saved mutation model:

- common_variants: 95th-percentile-frequency variants from the input VCF.
- high_mut_regions: variant-density-clustered hotspots, top 3% by density,
  with overlapping regions merged.

The design intent (pipe both into the saved mutation model so the
simulator can bias generation accordingly) is now captured in #295. If
the work is revived, the cluster_list and find_caf helpers are still
available in gen_mut_model/utils.py.

Collateral cleanup: drop the now-unused vcf_common dict and the
cluster_list / find_caf imports from runner.py.

No behavior change in the produced model; 603 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The release that adds the GC bias coverage fix, the follow-up perf pass,
and the parallel_mode removal will be tagged 4.4.4 instead of 4.5.0.

The GC bias fix does change output bytes when a non-uniform gc_model is
configured, but the user-facing config has always documented `coverage`
as the genome-wide average — the prior behavior was the bug. Shipping
as a patch release is consistent with framing this as a fix rather than
a deliberate behavior change.

Updates:
- pyproject.toml: 4.5.0 -> 4.4.4
- ChangeLog.md: rename v4.5.0 entry -> v4.4.4; benchmark column relabeled;
  drop the "reason for the minor version bump" sentence and replace it
  with concrete coverage-shortfall numbers (~70 % delivered pre-fix).
- README v4.4.4 references are already correct; no edit needed.

603 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joshfactorial joshfactorial self-assigned this May 19, 2026
@joshfactorial joshfactorial added the enhancement New feature or request label May 19, 2026
@joshfactorial joshfactorial linked an issue May 19, 2026 that may be closed by this pull request
@joshfactorial joshfactorial removed a link to an issue May 19, 2026
@joshfactorial joshfactorial linked an issue May 19, 2026 that may be closed by this pull request
@joshfactorial joshfactorial merged commit d44f314 into main May 19, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix vcf compare

1 participant