Commit 87e0a41

Merge pull request #174 from ncsa/172-fix-vcf-headers
172 fix vcf headers
2 parents b97ea10 + cdb1abb commit 87e0a41

22 files changed

Lines changed: 2079 additions & 2085 deletions

ChangeLog.md

Lines changed: 6 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,6 +1,12 @@
11
# NEAT has a new home
22
NEAT is now a part of the NCSA github and active development will continue here. Please direct issues, comments, and requests to the NCSA issue tracker. Submit pull requests here insead of the old repo.
33

4+
# NEAT v4.3.2
5+
- Bug fixes for parallel processing, which was causing some of the headers to be printed incorrectly. To fix that, we had to rewrite a bunch of the code and integrate parallelism more directly into NEAT.
6+
7+
# NEAT v4.3.1
8+
- Bug fixes (see issue #160) having to do with output files.
9+
410
# NEAT v4.3.1
511
- Updated parallel module to integrate it into the code more fluidly. We also updated the options section to revise the process and allow for copying of options objects for parallelism run.
612

README.md

Lines changed: 8 additions & 8 deletions
@@ -138,15 +138,15 @@ The default is given:
138138
`mutation_bed`: full path to a list of regions with a column describing the mutation rate of that region, as a float with values between 0 and 0.3. The mutation rate must be in the third column as, e.g., mut_rate=0.00.
139139
`rng_seed`: Manually enter a seed for the random number generator. Used for repeating runs. _Must be an integer._
140140
`min_mutations`: Set the minimum number of mutations that NEAT should add, per contig. _Default is 0._ We recommend setting this to at least one for small chromosomes, so NEAT will produce at least one mutation per contig.
141-
'threads': Number of threads to use. More than 1 will activate parallel mode and perform part of the calclutations in parallel then recombine into the desired output files.
142-
'parallel_mode': 'size' or 'contig' whether to divide the contigs into blocks or just by contig. By contig is the default, try by size. Varying the parallel_block_size parameter may help if default values are not sufficient.
143-
'parallel_block_size': Default value of 500,000.
144-
'cleanup_splits': If running more than one simulation on the same input fasta, you can reuse splits files. By default, this will be set to False, and splits files will be deleted at the end of the run.
145-
'reuse_splits': If an existing splits file exists in the output folder, it will use those splits, if this value is set to True.
141+
`threads`: Number of threads to use. More than 1 will activate parallel mode and perform part of the calclutations in parallel then recombine into the desired output files.
142+
`parallel_mode`: 'size' or 'contig' whether to divide the contigs into blocks or just by contig. By contig is the default, try by size. Varying the parallel_block_size parameter may help if default values are not sufficient.
143+
`parallel_block_size`: Default value of 500,000.
144+
`cleanup_splits`: If running more than one simulation on the same input fasta, you can reuse splits files. By default, this will be set to False, and splits files will be deleted at the end of the run.
145+
`reuse_splits`: If an existing splits file exists in the output folder, it will use those splits, if this value is set to True.
146146

147147
The command line options for NEAT are as follows:
148148

149-
Universal options can be applied to any subfunction. The commands should come before the function name (e.g., neat --log-level DEBUG read-simulator ...), excetp -h or --help, which can appear anywhere in the command.
149+
Universal options can be applied to any subfunction. The commands should come before the function name (e.g., neat --log-level DEBUG read-simulator ...), except -h or --help, which can appear anywhere in the command.
150150
| Universal Options | Description |
151151
|---------------------|--------------------------------------|
152152
| -h, --help | Displays usage information |
@@ -161,7 +161,7 @@ read-simulator command line options
161161
|---------------------|-------------------------------------|
162162
| -c VALUE, --config VALUE | The VALUE should be the name of the config file to use for this run |
163163
| -o OUTPUT_DIR, --output_dir OUTPUT_DIR | The path to the directory to write the output files |
164-
| -p PREFIX, --prefix PREFIX | The prefix for file names |
164+
| -p PREFIX, --prefix String | The prefix for file names |
165165

166166
## Functionality
167167

@@ -188,7 +188,7 @@ Features:
188188

189189
## Examples
190190

191-
The following commands are examples for common types of data to be generated. The simulation uses a reference genome in fasta format to generate reads of 126 bases with default 10X coverage. Outputs paired fastq files, a BAM file and a VCF file. The random variants inserted into the sequence will be present in the VCF and all of the reads will show their proper alignment in the BAM. Unless specified, the simulator will also insert some "sequencing error" -- random variants in some reads that represents false positive results from sequencing.
191+
The following commands are examples for common types of data to be generated. The simulation uses a reference genome in fasta format to generate reads of 126 bases with default 10X coverage. Outputs paired fastq files, a BAM file and a VCF file. The random variants inserted into the sequence will be present in the VCF and the reads will show their proper alignment in the BAM. Unless specified, the simulator will also insert some "sequencing error" -- random variants in some reads that represents false positive results from sequencing.
192192

193193
### Whole genome simulation
194194
Simulate whole genome dataset with random variants inserted according to the default model.
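
The README hunk above describes `parallel_mode` (`'size'` or `'contig'`) and `parallel_block_size` (default 500,000) without showing how a contig is divided. The following is a minimal sketch of one plausible reading, assuming half-open, fixed-width blocks with a shorter final block. The function names `split_contig` and `plan_blocks` are hypothetical and are not taken from NEAT.

```python
from typing import Dict, Iterator, List, Tuple


def split_contig(length: int, block_size: int = 500_000) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) blocks that cover a contig of `length` bases."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, length, block_size):
        # The last block is truncated at the end of the contig.
        yield start, min(start + block_size, length)


def plan_blocks(
    contig_lengths: Dict[str, int],
    mode: str = "contig",
    block_size: int = 500_000,
) -> List[Tuple[str, int, int]]:
    """Return (contig, start, end) work units for the chosen mode (hypothetical helper)."""
    if mode not in ("contig", "size"):
        raise ValueError("mode must be 'contig' or 'size'")
    units = []
    for name, length in contig_lengths.items():
        if mode == "contig":
            # One work unit per contig, regardless of its length.
            units.append((name, 0, length))
        else:
            # 'size' mode: break each contig into fixed-width blocks.
            units.extend((name, s, e) for s, e in split_contig(length, block_size))
    return units


if __name__ == "__main__":
    lengths = {"chr1": 1_200_000, "chrM": 16_569}
    for unit in plan_blocks(lengths, mode="size"):
        print(unit)
```

Under this assumption, a smaller `parallel_block_size` produces more, shorter work units, which is presumably why the README suggests varying it when the defaults are not sufficient.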

environment.yml

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ channels:
55

66
dependencies:
77
- python=3.10.*
8-
- biopython=1.79
8+
- biopython=1.85
99
- pkginfo
1010
- matplotlib
1111
- numpy

neat/common/io.py

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ def open_input(path: str | Path) -> Iterator[TextIO]:
6363
# - https://github.com/python/mypy/issues/12053
6464
open_: Callable[..., TextIO]
6565
if is_compressed(path):
66-
open_ = gzip.open
66+
open_ = bgzf.open
6767
else:
6868
open_ = open
6969
handle = open_(path, "rt", encoding="utf-8")
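
The hunk above selects an opener based on whether the input is compressed and then opens it in text mode. A minimal standard-library sketch of the same pattern is shown below. It uses `gzip.open` rather than `bgzf.open` so it runs without Biopython (BGZF files are also valid gzip files), and the helper names `looks_gzipped` and `open_text` are hypothetical stand-ins for NEAT's `is_compressed` and `open_input`, whose bodies are not shown in this diff. It may also be worth checking that the replacement opener accepts the `encoding` keyword that the unchanged `open_(path, "rt", encoding="utf-8")` line passes.

```python
import gzip
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, TextIO, Union

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip (and BGZF) file


def looks_gzipped(path: Union[str, Path]) -> bool:
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as raw:
        return raw.read(2) == GZIP_MAGIC


@contextmanager
def open_text(path: Union[str, Path]) -> Iterator[TextIO]:
    """Open a plain or gzip-compressed file as UTF-8 text and close it on exit."""
    opener = gzip.open if looks_gzipped(path) else open
    handle = opener(path, "rt", encoding="utf-8")
    try:
        yield handle
    finally:
        handle.close()
```

Callers would use it as `with open_text("variants.vcf.gz") as handle: ...`, without needing to know whether the file is compressed.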

neat/models/error_models.py

Lines changed: 2 additions & 2 deletions
@@ -269,8 +269,8 @@ def __init__(self,
269269
error_type: VariantTypes,
270270
location: int,
271271
length: int,
272-
ref: str or Seq,
273-
alt: str or Seq):
272+
ref: str | Seq,
273+
alt: str | Seq):
274274
self.error_type = error_type
275275
self.location = location
276276
self.length = length

neat/models/variant_models.py

Lines changed: 9 additions & 4 deletions
@@ -2,7 +2,7 @@
22
Classes for the variant models included in NEAT.
33
Every Variant type in variants > variant_types must have a corresponding model in order to be fully implemented.
44
"""
5-
5+
import pdb
66
import re
77
import logging
88
import abc
@@ -78,7 +78,7 @@ class DeletionModel(VariantModel):
7878
_type = Deletion
7979
_description = "A deletion of a random number of bases"
8080

81-
def __init__(self, deletion_len_model: dict[int: float, ...]):
81+
def __init__(self, deletion_len_model: dict[int, float, ...]):
8282
# Creating probabilities from the weights
8383
tot = sum(deletion_len_model.values())
8484
self.deletion_len_model = {key: val/tot for key, val in deletion_len_model.items()}
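
The constructor above turns length weights into probabilities by dividing each weight by the total. A minimal standalone sketch of that step follows, annotated with the usual two-parameter form `dict[int, float]` (static type checkers generally expect exactly a key type and a value type for `dict`). The function name `normalize_weights` and the zero-total guard are assumptions added for illustration and are not part of the diff.

```python
from typing import Dict


def normalize_weights(weights: Dict[int, float]) -> Dict[int, float]:
    """Convert {deletion_length: weight} into {deletion_length: probability}."""
    total = sum(weights.values())
    if total <= 0:
        # Guard added for the sketch: dividing by zero would otherwise raise.
        raise ValueError("weights must sum to a positive number")
    return {length: weight / total for length, weight in weights.items()}


if __name__ == "__main__":
    # Deletions of length 1 are three times as likely as deletions of length 2.
    print(normalize_weights({1: 3.0, 2: 1.0}))
```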
@@ -133,8 +133,8 @@ def __init__(
133133
self.trinuc_bias_map = None
134134

135135
# Some local variables for modeling
136-
self.local_trinuc_bias: np.array = None
137-
self.local_sequence: Seq or None = None
136+
self.local_trinuc_bias: np.ndarray | None = None
137+
self.local_sequence: Seq | None = None
138138

139139
def map_local_trinuc_bias(
140140
self,
@@ -163,7 +163,12 @@ def map_local_trinuc_bias(
163163
# Update the map bias at the central position for that trinuc
164164
for trinuc in ALL_TRINUCS:
165165
for match in re.finditer(trinuc, str(sequence)):
166+
# match.start() + 1 puts us at the center of the trinuc
167+
if match.start() + 1 > len(self.local_trinuc_bias):
168+
print("???")
166169
self.local_trinuc_bias[match.start() + 1] = self.trinuc_mutation_bias[TRINUC_IND[trinuc]]
170+
if len(self.local_trinuc_bias) != len(sequence):
171+
print("???")
167172

168173
# Now we normalize the bias
169174
self.local_trinuc_bias = self.local_trinuc_bias / sum(self.local_trinuc_bias)
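
The loop above writes each trinucleotide's bias at the central base of every match and then normalizes the array so it sums to one. The sketch below reproduces that idea in plain Python under several assumptions: `ALL_TRINUCS` is taken to be the 64 combinations of A, C, G and T, the bias table is a simple dict rather than the `TRINUC_IND` lookup used in NEAT, unmatched positions default to zero, and a lookahead pattern is used so that overlapping occurrences are all found (the diff uses a plain `re.finditer(trinuc, ...)`, which returns only non-overlapping matches). The function name `local_trinuc_bias` is hypothetical.

```python
import re
from itertools import product
from typing import Dict, List

# Assumption: every 3-mer over the DNA alphabet.
ALL_TRINUCS = ["".join(bases) for bases in product("ACGT", repeat=3)]


def local_trinuc_bias(sequence: str, bias_by_trinuc: Dict[str, float]) -> List[float]:
    """Return a per-base bias vector that sums to 1.0."""
    bias = [0.0] * len(sequence)
    for trinuc in ALL_TRINUCS:
        # Zero-width lookahead: match.start() is the first base of the trinuc,
        # and overlapping occurrences such as "AAA" in "AAAA" are all reported.
        for match in re.finditer(f"(?={trinuc})", sequence):
            center = match.start() + 1
            # A full trinuc fits at match.start(), so center is always in range.
            bias[center] = bias_by_trinuc.get(trinuc, 0.0)
    total = sum(bias)
    if total == 0:
        raise ValueError("no trinucleotide in the sequence has a non-zero bias")
    return [value / total for value in bias]


if __name__ == "__main__":
    print(local_trinuc_bias("ACGT", {"ACG": 1.0, "CGT": 3.0}))
```

Raising an exception on an inconsistent state, rather than printing a placeholder message, makes a length mismatch between the bias array and the sequence visible immediately.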

neat/read_simulator/__init__.py

Lines changed: 1 addition & 2 deletions
@@ -1,5 +1,4 @@
11
"""
22
Modules to generate reads
33
"""
4-
from .runner import *
5-
from .parallel_runner import main
4+
from .runner import *

neat/read_simulator/parallel_runner.py

Lines changed: 0 additions & 116 deletions
This file was deleted.
