Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 7 additions & 1 deletion .github/workflows/python-app.yml
Original file line number Diff line number Diff line change
Expand Up @@ -28,9 +28,15 @@ jobs:
- name: Run unit tests
shell: bash -l {0}
run: |
micromamba run -n neat pytest -q tests
micromamba run -n neat pytest --cov=neat --cov-report=xml --cov-report=term tests

- name: Run CLI integration test
shell: bash -l {0}
run: |
micromamba run -n neat pytest -q tests/test_cli

- name: Upload coverage to Codecov
uses: codecov/codecov-action@v4
with:
file: ./coverage.xml
fail_ci_if_error: false
43 changes: 43 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
<!-- gitnexus:start -->
# GitNexus — Code Intelligence

This project is indexed by GitNexus as **NEAT** (2553 symbols, 5138 relationships, 57 execution flows). Use the GitNexus MCP tools to understand code, assess impact, and navigate safely.

> If any GitNexus tool warns the index is stale, run `npx gitnexus analyze` in terminal first.

## Always Do

- **MUST run impact analysis before editing any symbol.** Before modifying a function, class, or method, run `gitnexus_impact({target: "symbolName", direction: "upstream"})` and report the blast radius (direct callers, affected processes, risk level) to the user.
- **MUST run `gitnexus_detect_changes()` before committing** to verify your changes only affect expected symbols and execution flows.
- **MUST warn the user** if impact analysis returns HIGH or CRITICAL risk before proceeding with edits.
- When exploring unfamiliar code, use `gitnexus_query({query: "concept"})` to find execution flows instead of grepping. It returns process-grouped results ranked by relevance.
- When you need full context on a specific symbol — callers, callees, which execution flows it participates in — use `gitnexus_context({name: "symbolName"})`.

## Never Do

- NEVER edit a function, class, or method without first running `gitnexus_impact` on it.
- NEVER ignore HIGH or CRITICAL risk warnings from impact analysis.
- NEVER rename symbols with find-and-replace — use `gitnexus_rename` which understands the call graph.
- NEVER commit changes without running `gitnexus_detect_changes()` to check affected scope.

## Resources

| Resource | Use for |
|----------|---------|
| `gitnexus://repo/NEAT/context` | Codebase overview, check index freshness |
| `gitnexus://repo/NEAT/clusters` | All functional areas |
| `gitnexus://repo/NEAT/processes` | All execution flows |
| `gitnexus://repo/NEAT/process/{name}` | Step-by-step execution trace |

## CLI

| Task | Read this skill file |
|------|---------------------|
| Understand architecture / "How does X work?" | `.claude/skills/gitnexus/gitnexus-exploring/SKILL.md` |
| Blast radius / "What breaks if I change X?" | `.claude/skills/gitnexus/gitnexus-impact-analysis/SKILL.md` |
| Trace bugs / "Why is X failing?" | `.claude/skills/gitnexus/gitnexus-debugging/SKILL.md` |
| Rename / extract / split / refactor | `.claude/skills/gitnexus/gitnexus-refactoring/SKILL.md` |
| Tools, resources, schema reference | `.claude/skills/gitnexus/gitnexus-guide/SKILL.md` |
| Index, status, clean, wiki CLI commands | `.claude/skills/gitnexus/gitnexus-cli/SKILL.md` |

<!-- gitnexus:end -->
43 changes: 43 additions & 0 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
<!-- gitnexus:start -->
# GitNexus — Code Intelligence

This project is indexed by GitNexus as **NEAT** (2553 symbols, 5138 relationships, 57 execution flows). Use the GitNexus MCP tools to understand code, assess impact, and navigate safely.

> If any GitNexus tool warns the index is stale, run `npx gitnexus analyze` in terminal first.

## Always Do

- **MUST run impact analysis before editing any symbol.** Before modifying a function, class, or method, run `gitnexus_impact({target: "symbolName", direction: "upstream"})` and report the blast radius (direct callers, affected processes, risk level) to the user.
- **MUST run `gitnexus_detect_changes()` before committing** to verify your changes only affect expected symbols and execution flows.
- **MUST warn the user** if impact analysis returns HIGH or CRITICAL risk before proceeding with edits.
- When exploring unfamiliar code, use `gitnexus_query({query: "concept"})` to find execution flows instead of grepping. It returns process-grouped results ranked by relevance.
- When you need full context on a specific symbol — callers, callees, which execution flows it participates in — use `gitnexus_context({name: "symbolName"})`.

## Never Do

- NEVER edit a function, class, or method without first running `gitnexus_impact` on it.
- NEVER ignore HIGH or CRITICAL risk warnings from impact analysis.
- NEVER rename symbols with find-and-replace — use `gitnexus_rename` which understands the call graph.
- NEVER commit changes without running `gitnexus_detect_changes()` to check affected scope.

## Resources

| Resource | Use for |
|----------|---------|
| `gitnexus://repo/NEAT/context` | Codebase overview, check index freshness |
| `gitnexus://repo/NEAT/clusters` | All functional areas |
| `gitnexus://repo/NEAT/processes` | All execution flows |
| `gitnexus://repo/NEAT/process/{name}` | Step-by-step execution trace |

## CLI

| Task | Read this skill file |
|------|---------------------|
| Understand architecture / "How does X work?" | `.claude/skills/gitnexus/gitnexus-exploring/SKILL.md` |
| Blast radius / "What breaks if I change X?" | `.claude/skills/gitnexus/gitnexus-impact-analysis/SKILL.md` |
| Trace bugs / "Why is X failing?" | `.claude/skills/gitnexus/gitnexus-debugging/SKILL.md` |
| Rename / extract / split / refactor | `.claude/skills/gitnexus/gitnexus-refactoring/SKILL.md` |
| Tools, resources, schema reference | `.claude/skills/gitnexus/gitnexus-guide/SKILL.md` |
| Index, status, clean, wiki CLI commands | `.claude/skills/gitnexus/gitnexus-cli/SKILL.md` |

<!-- gitnexus:end -->
16 changes: 16 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@ To cite this work, please use both of the following:
* [`neat gen-mut-model`](#neat-gen-mut-model)
* [`neat model-seq-err`](#neat-model-seq-err)
* [`neat model-qual-score`](#neat-model-qual-score)
* [`neat model-gc-bias`](#neat-model-gc-bias)
* [`neat vcf_compare`](#neat-vcf_compare)
* [Tests](#tests)
* [Guide to run locally](#guide-to-run-locally)
Expand Down Expand Up @@ -191,6 +192,7 @@ More parameters are below:
| `error_model` | Full path to an error model or quality score model generated by NEAT. Leave empty to use default model (default model based on human, sequenced by Illumina). |
| `mutation_model` | Full path to a mutation model generated by NEAT. Leave empty to use a default model (default model based on human data sequenced by Illumina). |
| `fragment_model` | Full path to fragment length model generated by NEAT. Leave empty to use default model (default model based on human data sequenced by Illumina). |
| `gc_model` | Full path to GC-bias model generated by NEAT. Leave empty for no GC bias. |
| `threads` | The number of threads for NEAT to use. Increasing the number will speed up read generation. |
| `avg_seq_error` | Average sequencing error rate for the sequencing machine. Use to increase or decrease the rate of errors in the reads. Float between 0 and 0.3. Default is set by the error model. |
| `rescale_qualities` | Rescale the quality scores to reflect the `avg_seq_error` rate above. Set `True` to activate if you notice issues with the sequencing error rates in your dataset. |
Expand Down Expand Up @@ -243,6 +245,7 @@ Features:
- Can accurately simulate large, single-end reads with high indel error rates (PacBio-like) given a model
- Specify simple fragment length model with mean and standard deviation or an empirically learned fragment distribution
- Simulates quality scores using either the default model or empirically learned quality scores using `neat gen_mut_model`
- Introduces GC-bias using an empirically learned model from `neat model-gc-bias`
- Introduces sequencing substitution errors using either the default model or empirically learned in `utilities`
- Output a VCF file with the 'golden' set of true positive variants. These can be compared to bioinformatics workflow output (includes coverage and allele balance information)
- Output a BAM file with the 'golden' set of aligned reads. These indicate where each read originated and how it should be aligned with the reference
Expand Down Expand Up @@ -512,6 +515,19 @@ Similarly, use `-i2` to produce a model for paired-ended data. `-q` denotes the

Finally, `-o` is the output directory for the model file and `-p` is the prefix for the output model, such that the file will be written as `<prefix>.p.gz` inside the output folder.

### `neat model-gc-bias`

Computes GC-bias model from a BAM file and reference genome. It calculates the relative weight of fragments based on their GC content.

```bash
neat model-gc-bias \
-i input.bam \
-r reference.fa \
-o /path/to/prefix
```

and creates `gc_model.pickle.gz` model in the working directory.

### `neat vcf_compare`

Tool for comparing VCF files (Not yet implemented in NEAT 4.4).
Expand Down
1 change: 1 addition & 0 deletions config_template/simple_template.yml
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ produce_vcf: .
produce_fastq: .

error_model: .
gc_model: .
mutation_model: .
fragment_model: .

Expand Down
5 changes: 5 additions & 0 deletions config_template/template_neat_config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,11 @@ coverage: .
# type = string | required: no | default: <NEAT_DIR>/neat/models/defaults/default_error_model.pickle.gz
error_model: .

# Absolute path to file containing GC-bias model.
# GC-bias models can be produced by neat model-gc-bias (learned from BAM alignments and reference)
# type = string | required: no | default: None (no GC bias)
gc_model: .

# Average sequencing error rate for the sequencing machine
# type = float | required = no | must be between 0.0 and 0.3
avg_seq_error: .
Expand Down
62 changes: 62 additions & 0 deletions neat/cli/commands/model_gc_bias.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
"""
Command line interface for NEAT's compute GC bias function
"""

import argparse

from ...model_gc_bias import compute_gc_bias_runner
from .base import BaseCommand
from .options import output_group

class Command(BaseCommand):
"""
Class that generates a model of GC bias, derived from real data
"""
name = "model-gc-bias"
description = "Generate GC bias model from a BAM file and a reference FASTA."

def add_arguments(self, parser: argparse.ArgumentParser):
"""
Add the command's arguments to its parser
"""
parser.add_argument('-i',
type=str,
metavar="input_bam",
dest="input_bam",
required=True,
help="BAM input file.")

parser.add_argument('-r',
type=str,
metavar="reference",
dest="reference",
required=True,
help="Reference FASTA file.")

parser.add_argument('-w',
type=int,
metavar="window_size",
dest="window_size",
required=False,
default=100,
help="Window size for GC bias calculation. Default is 100.")

parser.add_argument('--overwrite',
action='store_true',
required=False,
default=False,
help="Set this flag to overwrite existing output.")
output_group.add_to_parser(parser)

def execute(self, arguments: argparse.Namespace):
"""
Execute the command
"""
compute_gc_bias_runner(
arguments.input_bam,
arguments.reference,
arguments.output_dir,
arguments.prefix,
arguments.window_size,
arguments.overwrite
)
8 changes: 7 additions & 1 deletion neat/common/io.py
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,13 @@ def open_input(path: str | Path) -> Iterator[TextIO]:
# - https://github.com/python/mypy/issues/12053
open_: Callable[..., TextIO]
if is_compressed(path):
handle = bgzf.BgzfReader(path, 'r')
try:
handle = bgzf.BgzfReader(path, 'r')
# Trigger reading a block to verify it IS actually a BGZF file
handle.read(0)
except ValueError:
# Fall back to regular gzip if it's not BGZF
handle = gzip.open(path, 'rt', encoding='utf-8')
else:
handle = open(path, 'rt', encoding='utf-8')
try:
Expand Down
1 change: 1 addition & 0 deletions neat/model_gc_bias/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
from .runner import compute_gc_bias_runner
61 changes: 61 additions & 0 deletions neat/model_gc_bias/runner.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
"""
Creates a model of GC bias in a dataset
"""

import gzip
import pickle
import numpy as np
import logging
import sys
import pysam
from pathlib import Path
from Bio import SeqIO

from .utils import calculate_gc_content, get_gc_bias_weights
from ..models import GCBiasModel
from ..common import validate_output_path

__all__ = [
"compute_gc_bias_runner"
]

_LOG = logging.getLogger(__name__)

def compute_gc_bias_runner(
bam_file: str | Path,
reference_file: str | Path,
output_dir: str | Path,
output_prefix: str,
window_size: int = 100,
overwrite: bool = False):
"""
Main function for computing GC bias model.
"""
_LOG.info("Generating GC bias model")

output_file = Path(output_dir) / (output_prefix + ".pickle.gz")
validate_output_path(output_file, True, overwrite)

_LOG.debug(f"BAM: {bam_file}")
_LOG.debug(f"Reference: {reference_file}")
_LOG.debug(f"Window size: {window_size}")
_LOG.debug(f"Output: {output_file}")

_LOG.info("Calculating GC bias")

# Simple estimation:
# 1. Calculate GC content for many windows across the reference.
# 2. Count number of reads mapping to those windows.
# 3. Normalize to get weights.

weights = get_gc_bias_weights(bam_file, reference_file, window_size)

model = GCBiasModel(weights, window_size)

_LOG.info(f"Saving model: {output_file}")
with gzip.open(output_file, 'wb') as outfile:
# Saving in NEAT 2.1 compatible format [GC_SCALE_COUNT, GC_SCALE_VAL]
gc_scale_count = list(range(1, 101)) + [window_size]
pickle.dump([gc_scale_count, model.weights.tolist()], outfile)

_LOG.info("Modeling complete.")
86 changes: 86 additions & 0 deletions neat/model_gc_bias/utils.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
"""
Utilities for modeling GC bias
"""

import numpy as np
import pysam
from Bio import SeqIO
from pathlib import Path

def calculate_gc_content(sequence: str) -> float:
"""
Calculates GC content of a sequence.
"""
if not sequence:
return 0.0
sequence = sequence.upper()
gc_count = sequence.count('G') + sequence.count('C')
total_count = sum(1 for base in sequence if base in 'ATGC')
if total_count == 0:
return 0.0
return gc_count / total_count

def get_gc_bias_weights(bam_file: str | Path, reference_file: str | Path, window_size: int = 100) -> list[float]:
"""
Estimates GC bias weights from a BAM file and a reference.
"""
# 1. Sample windows across the genome
# 2. For each window, calculate GC content and count reads
# 3. Aggregate counts by GC percentage
# 4. Normalize by the number of windows at each GC percentage

gc_bins = [0] * 101
gc_counts = [0.0] * 101

with pysam.AlignmentFile(str(bam_file), "rb") as bam:
ref_index = SeqIO.index(str(reference_file), "fasta")

for contig in bam.references:
if contig not in ref_index:
continue

contig_seq = ref_index[contig].seq
contig_len = len(contig_seq)

# Sample windows (not all windows to be faster)
step = max(window_size, contig_len // 1000)
for start in range(0, contig_len - window_size, step):
end = start + window_size
window_seq = contig_seq[start:end]
gc_fraction = calculate_gc_content(str(window_seq))
gc_percent = int(round(gc_fraction * 100))

# Count reads in this window
# Note: count() returns number of alignments overlapping the region
read_count = bam.count(contig, start, end)

gc_bins[gc_percent] += 1
gc_counts[gc_percent] += read_count

# Normalize
weights = [0.0] * 101
for i in range(101):
if gc_bins[i] > 0:
weights[i] = gc_counts[i] / gc_bins[i]
else:
weights[i] = 0.0

# Rescale so max weight is 1.0
max_w = max(weights) if any(weights) else 1.0
if max_w > 0:
weights = [w / max_w for w in weights]
else:
weights = [1.0] * 101

# Smoothing might be good here, but keeping it simple for now
# Fill in zeros with neighbors or global average if needed
avg_w = sum(weights) / sum(1 for w in weights if w > 0) if any(w > 0 for w in weights) else 1.0
for i in range(101):
if weights[i] == 0:
weights[i] = avg_w

# Rescale again
max_w = max(weights)
weights = [w / max_w for w in weights]

return weights
Loading
Loading