Skip to content

Commit 765b5ec

Browse files
Merge pull request #300 from ncsa/audit-splits-options
Drop cleanup_splits and reuse_splits; bump to v4.5.1
2 parents 8aeb765 + 0a7247b commit 765b5ec

11 files changed

Lines changed: 44 additions & 82 deletions

File tree

ChangeLog.md

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,18 @@
1+
# NEAT v4.5.1
2+
3+
Removed the `cleanup_splits` and `reuse_splits` config options.
4+
5+
`reuse_splits` was fully broken (it raised `FileNotFoundError` unconditionally
6+
regardless of whether the splits directory existed). `cleanup_splits` existed
7+
solely to support `reuse_splits`; without it the option has no value — the
8+
splits directory always lives in a `TemporaryDirectory` that is cleaned up
9+
automatically when the run exits.
10+
11+
Both keys are now in `DEPRECATED_KEYS`: configs that still include them
12+
receive a one-line deprecation warning and continue parsing cleanly, so no
13+
user config breaks. The README examples and all internal test fixtures have
14+
been updated to drop these keys.
15+
116
# NEAT v4.5.0
217

318
New `neat compare-vcfs` subcommand: compares a downstream variant caller's VCF

README.md

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -205,8 +205,6 @@ More parameters are below:
205205
| `min_mutations` | Set the minimum number of mutations that NEAT should add, per contig. Default is 0. We recommend setting this to at least one for small chromosomes, so NEAT will produce at least one mutation per contig. |
206206
| `threads` | Number of threads to use. More than 1 will use multi-threading to speed up processing. With `threads > 1`, NEAT splits each contig into chunks; with `threads == 1`, one chunk per contig is used. |
207207
| `parallel_block_size` | Per-chunk size in bases when `threads > 1`. Default `0` (auto-tune from total genome length and thread count, targeting ~8 chunks per thread). Set to a positive integer to override. Ignored when `threads == 1`. |
208-
| `cleanup_splits` | If running more than one simulation on the same input fasta, you can reuse splits files. By default, this will be set to `False`, and splits files will be deleted at the end of the run. |
209-
| `reuse_splits` | If an existing splits file exists in the output folder, it will use those splits, if this value is set to `True`. |
210208

211209
The command line options for NEAT are as follows:
212210

@@ -280,7 +278,6 @@ The inputs in single-threaded, contig-based mode most closely replicate the beha
280278
The configuration used:
281279

282280
- `threads: 1` (NEAT processes one contig per chunk in single-thread mode)
283-
- `cleanup_splits: True`
284281

285282
| Organism | File size (bytes) | Avg. runtime (ms) | Avg. runtime (min) |
286283
|-----------------|-------------------|-------------------|--------------------|
@@ -302,7 +299,6 @@ Here we enabled NEAT’s multi-threaded mode, which splits contigs into size-bas
302299
- `parallel_block_size: 500000`
303300
- `produce_bam: false`
304301
- `threads: 7`
305-
- `cleanup_splits: True`
306302

307303
| Organism | File size (bytes) | Avg. runtime (ms) | Avg. runtime (min) |
308304
|-----------------|-------------------|-------------------|--------------------|

config_template/simple_template.yml

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -29,5 +29,3 @@ overwrite_output: .
2929

3030
parallel_block_size: .
3131
threads: .
32-
cleanup_splits: .
33-
reuse_splits: .

config_template/template_neat_config.yml

Lines changed: 1 addition & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
## Template for NEAT's read-simulator (as of version 4.3.5, parallelization-friendly)
1+
## Template for NEAT's read-simulator (as of version 4.5.1, parallelization-friendly)
22
## Any parameter that is not required but has a default value will use the
33
## default value even if the variable is not included in the config. For
44
## required items, they must be included in the config and they must be given a value.
@@ -152,14 +152,3 @@ parallel_block_size: .
152152
# Maximum number of concurrent NEAT jobs (threads or hyperthreads) to run
153153
# type = int | required: no | default = all available
154154
threads: .
155-
156-
# Delete the 'splits' directory after stitching completes
157-
# Note: If threads == 1, this option has no effect.
158-
# type = bool | required: no | default = true
159-
cleanup_splits: .
160-
161-
# Reuse existing files in '<out_dir>/splits' and skip the split step.
162-
# The directory must contain NEAT-generated files and must be in the output directory within "splits"
163-
# Note: If threads == 1, this option has no effect.
164-
# type = bool | required: no | default = False
165-
reuse_splits: .

neat/read_simulator/utils/options.py

Lines changed: 4 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -82,9 +82,7 @@ def __init__(self,
8282
produce_fastq: bool = True,
8383
min_mutations: int = 0,
8484
parallel_block_size: int = 0,
85-
cleanup_splits: bool = True,
8685
splits_dir: Path | None = None,
87-
reuse_splits: bool = False,
8886
gc_model: Path | None = None,
8987
**kwargs: Any
9088
):
@@ -136,9 +134,7 @@ def __init__(self,
136134
chunks of this size). Default 0 auto-tunes from total genome length and thread count, targeting
137135
~8 chunks per thread. Specify a positive integer to override. Ignored when threads == 1, where
138136
one chunk per contig is used.
139-
:param cleanup_splits: Set to False in order to preserve splits after run
140-
:param reuse_splits: Attempts to reuse existing splits file
141-
"""
137+
"""
142138
super().__init__(**kwargs)
143139
self.reference: Path = reference
144140
self.output_dir: Path = output_dir
@@ -173,9 +169,7 @@ def __init__(self,
173169
# threads > 1, contigs are split into chunks of `parallel_block_size`; with
174170
# threads == 1, each contig is processed as a single chunk.
175171
self.parallel_block_size: int = parallel_block_size
176-
self.cleanup_splits: bool = cleanup_splits
177172
self.splits_dir: Path | None = splits_dir
178-
self.reuse_splits: bool = reuse_splits
179173
self.gc_model: Path | None = Path(gc_model) if gc_model else None
180174
# Genome-wide mean GC bias weight, computed once at the runner level when
181175
# gc_model is loaded. cover_dataset divides per-chunk reads by this rather
@@ -251,8 +245,6 @@ def from_cli(output_dir: Path,
251245
'overwrite_output': (bool, False, None, None),
252246
'parallel_block_size': (int, 0, None, None),
253247
'threads': (int, 1, 1, 1000),
254-
'cleanup_splits': (bool, True, None, None),
255-
'reuse_splits': (bool, False, None, None),
256248
'gc_model': (Path, None, 'exists', None)
257249
}
258250

@@ -312,6 +304,8 @@ def check_and_log_error(keyname, value_to_check, crit1, crit2):
312304
# Map: deprecated key -> short reason shown to the user.
313305
DEPRECATED_KEYS = {
314306
"parallel_mode": "splitting strategy is now derived from `threads`",
307+
"cleanup_splits": "splits are always written to a temporary directory and cleaned up automatically",
308+
"reuse_splits": "removed; splits are regenerated on each run",
315309
}
316310

317311
def read_yaml(self, config_yaml: Path, args: dict):
@@ -467,23 +461,7 @@ def log_configuration(self):
467461
else:
468462
_LOG.info('Single threading - 1 thread.')
469463
_LOG.info('Splitting input by contig.')
470-
if self.reuse_splits:
471-
splits_dir = Path(f'{self.output_dir}/splits/')
472-
_LOG.info(f'Reusing existing splits {splits_dir}.')
473-
if not splits_dir.is_dir():
474-
raise FileNotFoundError(f"reuse_splits=True but splits dir not found: {splits_dir}")
475-
else:
476-
if self.reuse_splits:
477-
raise FileNotFoundError(f'reuse_splits=True')
478-
else:
479-
_LOG.warning(f'Reused splits set to True, but splits dir not found: {splits_dir}. Creating new splits')
480-
_LOG.info(f'Preserving splits for next run in directory {self.splits_dir}.')
481-
elif not self.cleanup_splits:
482-
splits_dir = Path(f'{self.output_dir}/splits/')
483-
_LOG.info(f'Preserving splits for next run in directory {self.splits_dir}.')
484-
else:
485-
splits_dir = self.temp_dir_path / "splits"
486-
464+
splits_dir = self.temp_dir_path / "splits"
487465
validate_output_path(splits_dir, False)
488466
self.splits_dir = splits_dir
489467

pyproject.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[tool.poetry]
22
name = "neat-genreads"
3-
version = "4.5.0"
3+
version = "4.5.1"
44
description = "NGS Simulation toolkit"
55
readme = "README.md"
66
authors = ["Joshua Allen <jallen17@illinois.edu>", "Keshav Gandhi <krg3@illinois.edu>"]

tests/test_compare_vcfs/test_integration.py

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -71,7 +71,6 @@ def _write_config(path: Path, ref_path: Path) -> Path:
7171
"rng_seed: 42\n"
7272
"mutation_rate: 0.01\n"
7373
"overwrite_output: true\n"
74-
"cleanup_splits: true\n"
7574
)
7675
path.write_text(cfg, encoding="utf-8")
7776
return path
@@ -250,8 +249,7 @@ def test_compare_vcfs_real_happy_with_chrom_mismatched_bed(tmp_path, happy_bin,
250249
"rng_seed: 42\n"
251250
"mutation_rate: 0.01\n"
252251
f"mutation_bed: {bed}\n"
253-
"overwrite_output: true\n"
254-
"cleanup_splits: true\n",
252+
"overwrite_output: true\n",
255253
encoding="utf-8",
256254
)
257255
read_simulator_runner(str(cfg_path), str(sim_out), "run")

tests/test_read_simulator/test_gc_bias_pipeline.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -27,7 +27,6 @@ def _write_config(path: Path, ref_path: Path, **overrides) -> Path:
2727
"coverage": 2,
2828
"rng_seed": 42,
2929
"overwrite_output": "true",
30-
"cleanup_splits": "true",
3130
}
3231
defaults.update(overrides)
3332
lines = "\n".join(f"{k}: {v}" for k, v in defaults.items())

tests/test_read_simulator/test_models_pipeline.py

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,6 @@ def _write_config(path: Path, ref_path: Path, **overrides) -> Path:
2626
"coverage": 2,
2727
"rng_seed": 42,
2828
"overwrite_output": "true",
29-
"cleanup_splits": "true",
3029
}
3130
defaults.update(overrides)
3231
lines = "\n".join(f"{k}: {v}" for k, v in defaults.items())

tests/test_read_simulator/test_options.py

Lines changed: 22 additions & 31 deletions
Original file line numberDiff line numberDiff line change
@@ -75,8 +75,6 @@ def test_from_cli_single_end_with_threads_and_splits(tmp_path: _PathAlias):
7575
7676
parallel_block_size: 500000
7777
threads: 2
78-
cleanup_splits: false
79-
reuse_splits: false
8078
"""
8179
).strip() + "\n"
8280

@@ -102,8 +100,8 @@ def test_from_cli_single_end_with_threads_and_splits(tmp_path: _PathAlias):
102100
assert opts.vcf is None
103101

104102
assert opts.threads == 2
105-
assert opts.splits_dir == outdir / "splits"
106103
assert opts.splits_dir.is_dir()
104+
assert opts.splits_dir.name == "splits"
107105

108106

109107
def test_from_cli_paired_end_fragments(tmp_path: _PathAlias):
@@ -125,8 +123,6 @@ def test_from_cli_paired_end_fragments(tmp_path: _PathAlias):
125123
overwrite_output: true
126124
127125
threads: 1
128-
cleanup_splits: true
129-
reuse_splits: false
130126
"""
131127
).strip() + "\n"
132128

@@ -160,8 +156,6 @@ def test_default_values():
160156
# auto-tune logic. The splitting strategy itself is no longer a user-facing option —
161157
# it's derived from `threads` at runtime.
162158
assert opts.parallel_block_size == 0
163-
assert opts.cleanup_splits is True
164-
assert opts.reuse_splits is False
165159
assert opts.overwrite_output is False
166160
assert opts.rescale_qualities is False
167161
assert opts.min_mutations == 0
@@ -286,6 +280,27 @@ def test_read_yaml_deprecated_parallel_mode_warns(tmp_path: _PathAlias, caplog):
286280
)
287281

288282

283+
def test_deprecated_cleanup_splits_warns(tmp_path: _PathAlias, caplog):
284+
"""cleanup_splits and reuse_splits must fire a deprecation warning and not crash."""
285+
ref = _project_root() / "data" / "H1N1.fa"
286+
cfg = _textwrap.dedent(
287+
f"""
288+
reference: {ref}
289+
cleanup_splits: true
290+
reuse_splits: false
291+
threads: 1
292+
"""
293+
).strip() + "\n"
294+
yml_path = tmp_path / "old_config.yml"
295+
yml_path.write_text(cfg, encoding="utf-8")
296+
import logging as _logging
297+
with caplog.at_level(_logging.WARNING):
298+
Options.from_cli(tmp_path, "out", yml_path)
299+
keys_warned = {rec.message.split("`")[1] for rec in caplog.records if "deprecated" in rec.message}
300+
assert "cleanup_splits" in keys_warned
301+
assert "reuse_splits" in keys_warned
302+
303+
289304
def test_log_configuration_fragment_mean_less_than_read_len_exits(tmp_path: _PathAlias):
290305
ref = _project_root() / "data" / "H1N1.fa"
291306
opts = Options(reference=ref, output_dir=tmp_path, output_prefix="out",
@@ -313,27 +328,3 @@ def test_log_configuration_paired_without_model_or_mean_exits(tmp_path: _PathAli
313328
opts.log_configuration()
314329

315330

316-
def test_from_cli_reuse_splits_missing_dir_raises(tmp_path: _PathAlias):
317-
cfg = _textwrap.dedent(
318-
f"""
319-
reference: {(_project_root() / 'data' / 'H1N1.fa').as_posix()}
320-
paired_ended: false
321-
produce_fastq: true
322-
produce_bam: false
323-
produce_vcf: false
324-
threads: 4
325-
parallel_block_size: 500000
326-
cleanup_splits: true
327-
reuse_splits: true
328-
overwrite_output: true
329-
"""
330-
).strip() + "\n"
331-
332-
yml_path = tmp_path / "neat_from_cli_reuse.yml"
333-
yml_path.write_text(cfg, encoding="utf-8")
334-
335-
outdir = tmp_path / "out"
336-
outdir.mkdir(parents=True, exist_ok=True)
337-
338-
with _pytest.raises(FileNotFoundError, match=r"reuse_splits=True"):
339-
Options.from_cli(outdir, "reuse", yml_path)

0 commit comments

Comments
 (0)