docs + tooling: CHANGELOG.md (Unreleased) + §2.5 measurement helper

CodingBash · claude · CodingBash · commit 4efdc375edb0 · 2026-04-22T02:20:47.000Z
Phase 7 Q.2 + Q.3 wrap-up. No version bump.

CHANGELOG.md (new, repo root, under "Unreleased"):
- Comprehensive notes on the Phase 1-7 API additions (api.map_fastq/count/alleles/save/load, ParsingConfig, MatchTier, crispr_correct alias, CLI with map/count/save/load/alleles, CI-locked sim regression).
- Changed + Removed sections covering every kwarg rename, memory/perf fix, and deprecated-symbol deletion.
- Cumulative perf deltas: AVITI -43% wall / -33% RSS; chrX -36% wall / -57% RSS; scCRISPR -43%; default pickle -93%.
- Dependency additions: click, pyarrow.

Pointers: README migration banner + USAGE.md top-level note pointing at CHANGELOG.

§2.5 measurement helper (Q.3): new tests/benchmarks/measure_series_overhead.py that dumps per-Series byte cost from a result pickle. The sim result measures 0.10 MB total across 48 Series, the same as the Phase 6 P.1 deferral signal. Users can run it on a real 11k-guide chrX or 45M-read AVITI pickle to decide whether the 54-Series -> DataFrame dedup refactor is justified (threshold: ship if total > 100 MB, close as not worth it if < 10 MB; in between is a judgment call).
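
For reviewers who want the gist of the measurement without opening the helper, here is a minimal sketch of the per-Series byte estimate and the decision thresholds described above. It is an illustration, not the helper itself: `series_bytes`, `verdict`, and the toy `demo` Series are hypothetical names, and the estimate only sums the raw value and index buffers.

```python
import pandas as pd

MB = 1024 * 1024


def series_bytes(s: pd.Series) -> int:
    """Approximate bytes held by a Series: value buffer plus index buffers."""
    total = int(s.to_numpy().nbytes)
    idx = s.index
    if isinstance(idx, pd.MultiIndex):
        # Count each level's unique values once; the integer codes are ignored.
        total += sum(int(lv.to_numpy().nbytes) for lv in idx.levels)
    else:
        total += int(idx.to_numpy().nbytes)
    return total


def verdict(total_bytes: int) -> str:
    """Apply the thresholds from this commit: > 100 MB ship, < 10 MB close."""
    if total_bytes > 100 * MB:
        return "ship refactor"
    if total_bytes < 10 * MB:
        return "close as not worth it"
    return "judgment call"


# Hypothetical stand-in for one count Series (3 guides, int64 counts).
demo = pd.Series([5, 0, 2], index=pd.Index([10, 20, 30], dtype="int64"), dtype="int64")
demo_bytes = series_bytes(demo)
print(demo_bytes, verdict(demo_bytes))
```

Object-dtype buffers only contribute pointer sizes under this estimate, so string-heavy indexes are undercounted; treat the total as a lower bound.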

Gate: 13 smoke + 87 fast sim + scCRISPR bit-identical. No package-source changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -0,0 +1,69 @@
+# Changelog
+
+Entries here are not yet assigned a version — the user reviews accumulated changes and picks the next release number.
+
+## [Unreleased]
+
+### Added — public API
+
+- **`crispr_ambiguous_mapping.api`** with three stage-aligned entry points:
+  - `map_fastq(library, fastq_r1_fns, fastq_r2_fns, *, config=ParsingConfig(...), **overrides)` — thin wrapper over the legacy `get_whitelist_reporter_counts_from_fastq`.
+  - `count(result)` — returns the per-tier count-Series container.
+  - `alleles(result, tier, *, contains_guide_surrogate, contains_guide_barcode, contains_guide_umi)` — wraps `get_matchset_alleleseries`; raises a clear `ValueError` on slim results.
+- **`ParsingConfig`** dataclass — IDE-friendly bundle of the 50 parsing + threshold kwargs.
+- **`MatchTier`** `str` enum (`PM`, `PM_SM`, `PM_BM`, `PM_SM_BM`, `PM_MISMATCH_SM`, `PM_MISMATCH_SM_BM`) — backward-compatible via `str` base class.
+- **`save(result, directory)` / `load(directory)`** — parquet + JSON cross-language durable serialization (§7.4). Count series, QC summary, and `CountInput` round-trip; the per-observation inference dict stays pickle-only.
+- **`crispr_correct`** top-level package alias — `import crispr_correct as cc` forward-looking name.
+
+### Added — CLI
+
+Flat-arg command-line entry point `crispr-correct` (§4.5). Every `ParsingConfig` field maps to `--flag value` (field names with `_` → `-`); no YAML, no config file.
+
+- `crispr-correct map` — run mapping from FASTQs.
+- `crispr-correct count` — emit one tier's count Series as TSV.
+- `crispr-correct save` / `load` — round-trip a mapping result through a parquet/JSON directory.
+- `crispr-correct alleles` — post-processing allele extraction to parquet.
+
+### Added — testing / CI
+
+- GitHub Actions workflow `.github/workflows/ci.yml` runs smoke tests + 135-mode simulation regression (§7.7). Fast subset (~30 s) on push, full matrix on PR.
+- `tests/fixtures/` checked in (library + 8k-read simulated FASTQs + truth parquet) so CI runs without external data.
+- `tests/test_smoke.py` (13 tests) covers API surface, CLI, save/load round-trip.
+- `tests/test_simulation.py` (135 parametrized comparisons across 8 parse modes × 6 tiers × 9 strategies).
+
+### Changed
+
+- **`retain_inference_results: bool = False` is the default** on `get_whitelist_reporter_counts_from_fastq` (§2.2). The default result pickle shrinks ~15× (~93%) at sim scale. Post-processing functions raise `ValueError` with an actionable message when called on a slim result.
+- **`contains_surrogate` → `contains_guide_surrogate`** on `CountInput` and throughout post-processing kwargs (§4.3). Legacy `contains_surrogate` remains as a deprecated `@property` alias through the next release; removed after.
+- **`contains_barcode` / `contains_umi` → `contains_guide_barcode` / `contains_guide_umi`** on `get_matchset_alleleseries`, `get_mutation_profile`, `tally_linked_mutation_count_per_sequence` (§1.4). No compat shim.
+- Whitelist DataFrame columns `surrogate` / `barcode` are now accepted as `guide_surrogate` / `guide_barcode` (auto-renamed internally).
+- `print()` statements replaced with module-level `logging.Logger` instances (§4.7 / §7.6). `logging.basicConfig(level=logging.INFO)` enables the verbose trace.
+- Optional `tqdm` progress bar around the inference `pool.imap` (§4.8).
+- Per-observation `pd.Series` construction replaced with a plain dict (§3.5).
+- Redundant per-observation Hamming computations deduped — surrogate encoded once, full-whitelist Hamming computed once per read; subsets index-gather (§3.3).
+- LUT-based DNA encoding replaces `np.vectorize` (§3.1).
+- `pd.merge` in the per-observed-sequence loop replaced with `set.intersection` on tuple-of-guide-tuples (§3.4).
+- O(N²) `.apply(axis=1)` cross-product in counter-series build replaced with `pd.DataFrame.from_records(counterdict.items())` — the counter-series stage went from ~21 317 s to ~53 s (~400×) on the AVITI 100k-read profile.
+- `parse_fastq` collapsed from ~700 lines / 32 nested branches to ~150 lines / one generic component loop (§4.6).
+- Default `surrogate_hamming_threshold_strict` corrected from 2 to 10.
+- **Fixed** surrogate-length truncation bug (§1.1): observed surrogate is now clamped to the surrogate library length (32 bp) instead of the protospacer library length (20 bp). Output values change for any surrogate-involving tier — the scCRISPR golden pickle was re-baselined.
+
+### Removed
+
+- Deprecated `get_whitelist_reporter_counts_from_umitools_output` entry point and the two deprecated parsing modules (`reporter_umitools_fastq_parsing`, `guide_raw_fastq_parsing`) (§8). In-tree drivers must migrate to `get_whitelist_reporter_counts_from_fastq`.
+- `non_error_dict` field on `QualityControlResult` tier objects (§2.1) — duplicated ~78% of the result pickle.
+- `original_df` / `hamming_min_match_df` DataFrame fields on `HammingThresholdGuideCountError` subclasses (§2.3) — replaced with lightweight `n_whitelist_candidates` / `n_hamming_min_match` ints.
+- `store_intermediates` flag — unused (§4.9).
+- Legacy encoding helpers `encode_DNA_base_{whitelist,observed}{,_vectorized}`, `numpify_string{,_vectorized}` (§5.4) — superseded by the LUT-based `encode_DNA_sequence_*` / `encode_guide_series_*`.
+
+### Performance deltas (cumulative Phase 1-5)
+
+- **AVITI TCG** (100k reads, 1186-guide HBG library, full triplet + UMI): 272.3 s / 1461 MB → 155.3 s / 982 MB (**−43% wall, −33% peak RSS**).
+- **chrX TGC** (100k reads, 11035-guide library, no UMI): 784.4 s / 2579 MB → 505.7 s / 1103 MB (**−36% wall, −57% peak RSS**).
+- **scCRISPR pytest**: 75 s → 43 s (**−43% wall**).
+- **Default result pickle**: 2.40 MB → 0.16 MB (**−93%**) at simulation scale.
+
+### Dependencies
+
+- Added `click ^8.1` (CLI).
+- Added `pyarrow >=11,<22` (parquet save/load).
diff --git a/README.md b/README.md
@@ -12,6 +12,8 @@ CRISPR-Correct also handles guide-RNA **sensor / surrogate constructs**, **UMIs*
 
 If you are mapping many large samples that would take too long on a personal computer, CRISPR-Correct can also run on the [Broad Institute's Terra Platform](https://terra.bio/). The workflow file is at the [Terra Firecloud repository](https://portal.firecloud.org/?return=terra#methods/pinellolab/CrisprSelfEditMappingOrchestratorWorkflowSampleEntity/2).
 
+> **Migrating from 0.0.x?** See `../USAGE.md §6` for the CountInput rename + post-processing kwarg rename table, and `../CHANGELOG.md` for the accumulated 0.0.236 → `Unreleased` changes (new API surface, CLI, parquet save/load, memory + performance deltas).
+
 ## Installation
 
 ```bash
@@ -319,6 +321,40 @@ crispr-correct count \
 
 Repeat `--r1` / `--r2` for multi-file input. Boolean flags use `--flag/--no-flag` convention. Run `crispr-correct map --help` for the full flag list.
 
+### Save/load — parquet + JSON (cross-language durable)
+
+```bash
+# Convert a pickle into a parquet directory (portable across Python/R/Julia)
+crispr-correct save --in result.pickle --out-dir result/
+# Inspect in pandas:
+#   pd.read_parquet("result/counts_protospacer_match_surrogate_match_barcode_match.parquet")
+
+# Reconstruct a pickle from the directory
+crispr-correct load --in-dir result/ --out result.pickle
+```
+
+Save/load via Python:
+
+```python
+import crispr_ambiguous_mapping as cam
+cam.save(result, "result/")      # writes parquet + manifest.json + qc.json + count_input.json
+result2 = cam.load("result/")    # reconstructs the result (without the inference dict)
+```
+
+The per-observation inference dict (`observed_guide_reporter_umi_counts_inferred`) is not round-tripped through parquet — pickle it if you need it.
+
+### Post-processing — `alleles` subcommand
+
+```bash
+crispr-correct alleles \
+    --in result.pickle \
+    --tier protospacer_match_surrogate_match_barcode_match \
+    --ambiguity accepted --umi-strategy noncollapsed \
+    --out alleles.parquet
+```
+
+Requires the source pickle to have been produced with `--retain-inference-results`; otherwise emits a clear error pointing at the flag.
+
 ---
 
 ## Testing
diff --git a/crispr-ambiguous-mapping/tests/benchmarks/__init__.py b/crispr-ambiguous-mapping/tests/benchmarks/__init__.py
diff --git a/crispr-ambiguous-mapping/tests/benchmarks/measure_series_overhead.py b/crispr-ambiguous-mapping/tests/benchmarks/measure_series_overhead.py
@@ -0,0 +1,89 @@
+"""§2.5 measurement helper — profile per-Series byte footprint on a real
+mapping-result pickle.
+
+Usage:
+
+    python measure_series_overhead.py /path/to/result.pickle
+
+Prints a table of (tier, strategy) → bytes (values + index), the total over
+all non-None Series, and the pickle size. Point this at the 11 k-guide chrX
+result (or the 45 M-read AVITI result, if available) to decide whether the
+54-Series → DataFrame dedup proposed in IMPROVEMENTS.md §2.5 is worth the
+shape-changing refactor.
+
+Decision threshold suggested by the Phase 6 measurement (~100 KB for a
+45-guide sim):
+- If total < 10 MB on real-scale data: close §2.5 as not worth the refactor.
+- If total > 100 MB: schedule the refactor in Phase 8.
+- In between: judgment call based on per-user RSS headroom.
+"""
+from __future__ import annotations
+
+import argparse
+import pickle
+import sys
+from pathlib import Path
+
+
+def _series_bytes(s) -> int:
+    """Approximate in-memory byte cost of a pandas Series (values + index)."""
+    total = int(getattr(getattr(s, "values", None), "nbytes", 0) or 0)
+    idx = getattr(s, "index", None)
+    if idx is None:
+        return total
+    if hasattr(idx, "levels"):
+        for lv in idx.levels:
+            total += int(lv.values.nbytes)
+    else:
+        total += int(idx.values.nbytes)
+    return total
+
+
+def main():
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("pickle_path", type=Path, help="Path to a WhitelistReporterCountsResult pickle.")
+    args = ap.parse_args()
+
+    path = args.pickle_path
+    with path.open("rb") as fh:
+        result = pickle.load(fh)
+
+    allw = result.all_match_set_whitelist_reporter_counter_series_results
+    tiers = [
+        "protospacer_match",
+        "protospacer_match_surrogate_match",
+        "protospacer_match_barcode_match",
+        "protospacer_match_surrogate_match_barcode_match",
+        "protospacer_mismatch_surrogate_match",
+        "protospacer_mismatch_surrogate_match_barcode_match",
+    ]
+
+    print(f"{'tier':50s} {'strategy':45s} {'rows':>8s} {'bytes':>12s}")
+    print("-" * 120)
+    total = 0
+    n_series = 0
+    for tn in tiers:
+        t = getattr(allw, tn, None)
+        if t is None:
+            continue
+        for attr in sorted(a for a in dir(t) if a.endswith("counterseries") and not a.startswith("_")):
+            s = getattr(t, attr, None)
+            if s is None:
+                continue
+            b = _series_bytes(s)
+            total += b
+            n_series += 1
+            print(f"{tn:50s} {attr:45s} {len(s):>8d} {b:>12,d}")
+
+    print("-" * 120)
+    print(f"{n_series} non-None Series; total = {total/1024/1024:.2f} MB")
+
+    try:
+        pkl_size = path.stat().st_size
+        print(f"Source pickle size on disk: {pkl_size/1024/1024:.2f} MB")
+    except OSError:
+        pass
+
+
+if __name__ == "__main__":
+    sys.exit(main() or 0)