ncsa · joshfactorial · May 13, 2026 · May 11, 2026 · May 11, 2026 · May 13, 2026
diff --git a/.github/workflows/python-app.yml b/.github/workflows/python-app.yml
@@ -28,9 +28,15 @@ jobs:
       - name: Run unit tests
         shell: bash -l {0}
         run: |
-          micromamba run -n neat pytest -q tests
+          micromamba run -n neat pytest --cov=neat --cov-report=xml --cov-report=term tests
 
       - name: Run CLI integration test
         shell: bash -l {0}
         run: |
           micromamba run -n neat pytest -q tests/test_cli
+
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v4
+        with:
+          file: ./coverage.xml
+          fail_ci_if_error: false
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,43 @@
+<!-- gitnexus:start -->
+# GitNexus — Code Intelligence
+
+This project is indexed by GitNexus as **NEAT** (2553 symbols, 5138 relationships, 57 execution flows). Use the GitNexus MCP tools to understand code, assess impact, and navigate safely.
+
+> If any GitNexus tool warns the index is stale, run `npx gitnexus analyze` in terminal first.
+
+## Always Do
+
+- **MUST run impact analysis before editing any symbol.** Before modifying a function, class, or method, run `gitnexus_impact({target: "symbolName", direction: "upstream"})` and report the blast radius (direct callers, affected processes, risk level) to the user.
+- **MUST run `gitnexus_detect_changes()` before committing** to verify your changes only affect expected symbols and execution flows.
+- **MUST warn the user** if impact analysis returns HIGH or CRITICAL risk before proceeding with edits.
+- When exploring unfamiliar code, use `gitnexus_query({query: "concept"})` to find execution flows instead of grepping. It returns process-grouped results ranked by relevance.
+- When you need full context on a specific symbol — callers, callees, which execution flows it participates in — use `gitnexus_context({name: "symbolName"})`.
+
+## Never Do
+
+- NEVER edit a function, class, or method without first running `gitnexus_impact` on it.
+- NEVER ignore HIGH or CRITICAL risk warnings from impact analysis.
+- NEVER rename symbols with find-and-replace — use `gitnexus_rename` which understands the call graph.
+- NEVER commit changes without running `gitnexus_detect_changes()` to check affected scope.
+
+## Resources
+
+| Resource | Use for |
+|----------|---------|
+| `gitnexus://repo/NEAT/context` | Codebase overview, check index freshness |
+| `gitnexus://repo/NEAT/clusters` | All functional areas |
+| `gitnexus://repo/NEAT/processes` | All execution flows |
+| `gitnexus://repo/NEAT/process/{name}` | Step-by-step execution trace |
+
+## CLI
+
+| Task | Read this skill file |
+|------|---------------------|
+| Understand architecture / "How does X work?" | `.claude/skills/gitnexus/gitnexus-exploring/SKILL.md` |
+| Blast radius / "What breaks if I change X?" | `.claude/skills/gitnexus/gitnexus-impact-analysis/SKILL.md` |
+| Trace bugs / "Why is X failing?" | `.claude/skills/gitnexus/gitnexus-debugging/SKILL.md` |
+| Rename / extract / split / refactor | `.claude/skills/gitnexus/gitnexus-refactoring/SKILL.md` |
+| Tools, resources, schema reference | `.claude/skills/gitnexus/gitnexus-guide/SKILL.md` |
+| Index, status, clean, wiki CLI commands | `.claude/skills/gitnexus/gitnexus-cli/SKILL.md` |
+
+<!-- gitnexus:end -->
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,43 @@
+<!-- gitnexus:start -->
+# GitNexus — Code Intelligence
+
+This project is indexed by GitNexus as **NEAT** (2553 symbols, 5138 relationships, 57 execution flows). Use the GitNexus MCP tools to understand code, assess impact, and navigate safely.
+
+> If any GitNexus tool warns the index is stale, run `npx gitnexus analyze` in terminal first.
+
+## Always Do
+
+- **MUST run impact analysis before editing any symbol.** Before modifying a function, class, or method, run `gitnexus_impact({target: "symbolName", direction: "upstream"})` and report the blast radius (direct callers, affected processes, risk level) to the user.
+- **MUST run `gitnexus_detect_changes()` before committing** to verify your changes only affect expected symbols and execution flows.
+- **MUST warn the user** if impact analysis returns HIGH or CRITICAL risk before proceeding with edits.
+- When exploring unfamiliar code, use `gitnexus_query({query: "concept"})` to find execution flows instead of grepping. It returns process-grouped results ranked by relevance.
+- When you need full context on a specific symbol — callers, callees, which execution flows it participates in — use `gitnexus_context({name: "symbolName"})`.
+
+## Never Do
+
+- NEVER edit a function, class, or method without first running `gitnexus_impact` on it.
+- NEVER ignore HIGH or CRITICAL risk warnings from impact analysis.
+- NEVER rename symbols with find-and-replace — use `gitnexus_rename` which understands the call graph.
+- NEVER commit changes without running `gitnexus_detect_changes()` to check affected scope.
+
+## Resources
+
+| Resource | Use for |
+|----------|---------|
+| `gitnexus://repo/NEAT/context` | Codebase overview, check index freshness |
+| `gitnexus://repo/NEAT/clusters` | All functional areas |
+| `gitnexus://repo/NEAT/processes` | All execution flows |
+| `gitnexus://repo/NEAT/process/{name}` | Step-by-step execution trace |
+
+## CLI
+
+| Task | Read this skill file |
+|------|---------------------|
+| Understand architecture / "How does X work?" | `.claude/skills/gitnexus/gitnexus-exploring/SKILL.md` |
+| Blast radius / "What breaks if I change X?" | `.claude/skills/gitnexus/gitnexus-impact-analysis/SKILL.md` |
+| Trace bugs / "Why is X failing?" | `.claude/skills/gitnexus/gitnexus-debugging/SKILL.md` |
+| Rename / extract / split / refactor | `.claude/skills/gitnexus/gitnexus-refactoring/SKILL.md` |
+| Tools, resources, schema reference | `.claude/skills/gitnexus/gitnexus-guide/SKILL.md` |
+| Index, status, clean, wiki CLI commands | `.claude/skills/gitnexus/gitnexus-cli/SKILL.md` |
+
+<!-- gitnexus:end -->
diff --git a/README.md b/README.md
@@ -46,6 +46,7 @@ To cite this work, please use both of the following:
     * [`neat gen-mut-model`](#neat-gen-mut-model)
     * [`neat model-seq-err`](#neat-model-seq-err)
     * [`neat model-qual-score`](#neat-model-qual-score)
+    * [`neat model-gc-bias`](#neat-model-gc-bias)
     * [`neat vcf_compare`](#neat-vcf_compare)
   * [Tests](#tests)
     * [Guide to run locally](#guide-to-run-locally)
@@ -191,6 +192,7 @@ More parameters are below:
 | `error_model`       | Full path to an error model or quality score model generated by NEAT. Leave empty to use default model (default model based on human, sequenced by Illumina).                                                 |
 | `mutation_model`    | Full path to a mutation model generated by NEAT. Leave empty to use a default model (default model based on human data sequenced by Illumina).                                                                |
 | `fragment_model`    | Full path to fragment length model generated by NEAT. Leave empty to use default model (default model based on human data sequenced by Illumina).                                                             |
+| `gc_model`          | Full path to GC-bias model generated by NEAT. Leave empty for no GC bias.                                                                                                                                     |
 | `threads`           | The number of threads for NEAT to use. Increasing the number will speed up read generation.                                                                                                                   |
 | `avg_seq_error`     | Average sequencing error rate for the sequencing machine. Use to increase or decrease the rate of errors in the reads. Float between 0 and 0.3. Default is set by the error model.                            |
 | `rescale_qualities` | Rescale the quality scores to reflect the `avg_seq_error` rate above. Set `True` to activate if you notice issues with the sequencing error rates in your dataset.                                            |
@@ -243,6 +245,7 @@ Features:
 - Can accurately simulate large, single-end reads with high indel error rates (PacBio-like) given a model
 - Specify simple fragment length model with mean and standard deviation or an empirically learned fragment distribution
 - Simulates quality scores using either the default model or empirically learned quality scores using `neat gen_mut_model`
+- Introduces GC-bias using an empirically learned model from `neat model-gc-bias`
 - Introduces sequencing substitution errors using either the default model or empirically learned in `utilities`
 - Output a VCF file with the 'golden' set of true positive variants. These can be compared to bioinformatics workflow output (includes coverage and allele balance information)
 - Output a BAM file with the 'golden' set of aligned reads. These indicate where each read originated and how it should be aligned with the reference
@@ -512,6 +515,19 @@ Similarly, use `-i2` to produce a model for paired-ended data. `-q` denotes the
 
 Finally, `-o` is the output directory for the model file and `-p` is the prefix for the output model, such that the file will be written as `<prefix>.p.gz` inside the output folder.
 
+### `neat model-gc-bias`
+
+Computes GC-bias model from a BAM file and reference genome. It calculates the relative weight of fragments based on their GC content.
+
+```bash
+neat model-gc-bias    \
+    -i input.bam      \
+    -r reference.fa   \
+    -o /path/to/prefix
+```
+
+and creates `gc_model.pickle.gz` model in the working directory.
+
 ### `neat vcf_compare`
 
 Tool for comparing VCF files (Not yet implemented in NEAT 4.4).

diff --git a/config_template/simple_template.yml b/config_template/simple_template.yml
@@ -11,6 +11,7 @@ produce_vcf: .
 produce_fastq: .
 
 error_model: .
+gc_model: .
 mutation_model: .
 fragment_model: .
 

diff --git a/config_template/template_neat_config.yml b/config_template/template_neat_config.yml
@@ -25,6 +25,11 @@ coverage: .
 # type = string | required: no | default: <NEAT_DIR>/neat/models/defaults/default_error_model.pickle.gz
 error_model: .
 
+# Absolute path to file containing GC-bias model.
+# GC-bias models can be produced by neat model-gc-bias (learned from BAM alignments and reference)
+# type = string | required: no | default: None (no GC bias)
+gc_model: .
+
 # Average sequencing error rate for the sequencing machine
 # type = float | required = no | must be between 0.0 and 0.3
 avg_seq_error: .

diff --git a/neat/cli/commands/model_gc_bias.py b/neat/cli/commands/model_gc_bias.py
@@ -0,0 +1,62 @@
+"""
+Command line interface for NEAT's compute GC bias function
+"""
+
+import argparse
+
+from ...model_gc_bias import compute_gc_bias_runner
+from .base import BaseCommand
+from .options import output_group
+
+class Command(BaseCommand):
+    """
+    Class that generates a model of GC bias, derived from real data
+    """
+    name = "model-gc-bias"
+    description = "Generate GC bias model from a BAM file and a reference FASTA."
+
+    def add_arguments(self, parser: argparse.ArgumentParser):
+        """
+        Add the command's arguments to its parser
+        """
+        parser.add_argument('-i',
+                            type=str,
+                            metavar="input_bam",
+                            dest="input_bam",
+                            required=True,
+                            help="BAM input file.")
+
+        parser.add_argument('-r',
+                            type=str,
+                            metavar="reference",
+                            dest="reference",
+                            required=True,
+                            help="Reference FASTA file.")
+
+        parser.add_argument('-w',
+                            type=int,
+                            metavar="window_size",
+                            dest="window_size",
+                            required=False,
+                            default=100,
+                            help="Window size for GC bias calculation. Default is 100.")
+
+        parser.add_argument('--overwrite',
+                            action='store_true',
+                            required=False,
+                            default=False,
+                            help="Set this flag to overwrite existing output.")
+        output_group.add_to_parser(parser)
+
+    def execute(self, arguments: argparse.Namespace):
+        """
+        Execute the command
+        """
+        compute_gc_bias_runner(
+            arguments.input_bam,
+            arguments.reference,
+            arguments.output_dir,
+            arguments.prefix,
+            arguments.window_size,
+            arguments.overwrite
+        )
diff --git a/neat/common/io.py b/neat/common/io.py
@@ -63,7 +63,13 @@ def open_input(path: str | Path) -> Iterator[TextIO]:
     # - https://github.com/python/mypy/issues/12053
     open_: Callable[..., TextIO]
     if is_compressed(path):
-        handle = bgzf.BgzfReader(path, 'r')
+        try:
+            handle = bgzf.BgzfReader(path, 'r')
+            # Trigger reading a block to verify it IS actually a BGZF file
+            handle.read(0)
+        except ValueError:
+            # Fall back to regular gzip if it's not BGZF
+            handle = gzip.open(path, 'rt', encoding='utf-8')
     else:
         handle = open(path, 'rt', encoding='utf-8')
     try:

diff --git a/neat/model_gc_bias/__init__.py b/neat/model_gc_bias/__init__.py
@@ -0,0 +1 @@
+from .runner import compute_gc_bias_runner
diff --git a/neat/model_gc_bias/runner.py b/neat/model_gc_bias/runner.py
@@ -0,0 +1,61 @@
+"""
+Creates a model of GC bias in a dataset
+"""
+
+import gzip
+import pickle
+import numpy as np
+import logging
+import sys
+import pysam
+from pathlib import Path
+from Bio import SeqIO
+
+from .utils import calculate_gc_content, get_gc_bias_weights
+from ..models import GCBiasModel
+from ..common import validate_output_path
+
+__all__ = [
+    "compute_gc_bias_runner"
+]
+
+_LOG = logging.getLogger(__name__)
+
+def compute_gc_bias_runner(
+        bam_file: str | Path,
+        reference_file: str | Path,
+        output_dir: str | Path,
+        output_prefix: str,
+        window_size: int = 100,
+        overwrite: bool = False):
+    """
+    Main function for computing GC bias model.
+    """
+    _LOG.info("Generating GC bias model")
+
+    output_file = Path(output_dir) / (output_prefix + ".pickle.gz")
+    validate_output_path(output_file, True, overwrite)
+
+    _LOG.debug(f"BAM: {bam_file}")
+    _LOG.debug(f"Reference: {reference_file}")
+    _LOG.debug(f"Window size: {window_size}")
+    _LOG.debug(f"Output: {output_file}")
+
+    _LOG.info("Calculating GC bias")
+
+    # Simple estimation:
+    # 1. Calculate GC content for many windows across the reference.
+    # 2. Count number of reads mapping to those windows.
+    # 3. Normalize to get weights.
+
+    weights = get_gc_bias_weights(bam_file, reference_file, window_size)
+
+    model = GCBiasModel(weights, window_size)
+
+    _LOG.info(f"Saving model: {output_file}")
+    with gzip.open(output_file, 'wb') as outfile:
+        # Saving in NEAT 2.1 compatible format [GC_SCALE_COUNT, GC_SCALE_VAL]
+        gc_scale_count = list(range(1, 101)) + [window_size]
+        pickle.dump([gc_scale_count, model.weights.tolist()], outfile)
+
+    _LOG.info("Modeling complete.")
diff --git a/neat/model_gc_bias/utils.py b/neat/model_gc_bias/utils.py
@@ -0,0 +1,86 @@
+"""
+Utilities for modeling GC bias
+"""
+
+import numpy as np
+import pysam
+from Bio import SeqIO
+from pathlib import Path
+
+def calculate_gc_content(sequence: str) -> float:
+    """
+    Calculates GC content of a sequence.
+    """
+    if not sequence:
+        return 0.0
+    sequence = sequence.upper()
+    gc_count = sequence.count('G') + sequence.count('C')
+    total_count = sum(1 for base in sequence if base in 'ATGC')
+    if total_count == 0:
+        return 0.0
+    return gc_count / total_count
+
+def get_gc_bias_weights(bam_file: str | Path, reference_file: str | Path, window_size: int = 100) -> list[float]:
+    """
+    Estimates GC bias weights from a BAM file and a reference.
+    """
+    # 1. Sample windows across the genome
+    # 2. For each window, calculate GC content and count reads
+    # 3. Aggregate counts by GC percentage
+    # 4. Normalize by the number of windows at each GC percentage
+
+    gc_bins = [0] * 101
+    gc_counts = [0.0] * 101
+
+    with pysam.AlignmentFile(str(bam_file), "rb") as bam:
+        ref_index = SeqIO.index(str(reference_file), "fasta")
+
+        for contig in bam.references:
+            if contig not in ref_index:
+                continue
+
+            contig_seq = ref_index[contig].seq
+            contig_len = len(contig_seq)
+
+            # Sample windows (not all windows to be faster)
+            step = max(window_size, contig_len // 1000)
+            for start in range(0, contig_len - window_size, step):
+                end = start + window_size
+                window_seq = contig_seq[start:end]
+                gc_fraction = calculate_gc_content(str(window_seq))
+                gc_percent = int(round(gc_fraction * 100))
+
+                # Count reads in this window
+                # Note: count() returns number of alignments overlapping the region
+                read_count = bam.count(contig, start, end)
+
+                gc_bins[gc_percent] += 1
+                gc_counts[gc_percent] += read_count
+
+    # Normalize
+    weights = [0.0] * 101
+    for i in range(101):
+        if gc_bins[i] > 0:
+            weights[i] = gc_counts[i] / gc_bins[i]
+        else:
+            weights[i] = 0.0
+
+    # Rescale so max weight is 1.0
+    max_w = max(weights) if any(weights) else 1.0
+    if max_w > 0:
+        weights = [w / max_w for w in weights]
+    else:
+        weights = [1.0] * 101
+
+    # Smoothing might be good here, but keeping it simple for now
+    # Fill in zeros with neighbors or global average if needed
+    avg_w = sum(weights) / sum(1 for w in weights if w > 0) if any(w > 0 for w in weights) else 1.0
+    for i in range(101):
+        if weights[i] == 0:
+            weights[i] = avg_w
+
+    # Rescale again
+    max_w = max(weights)
+    weights = [w / max_w for w in weights]
+
+    return weights