Commit 4efdc37

CodingBash and claude committed
docs + tooling: CHANGELOG.md (Unreleased) + §2.5 measurement helper
Phase 7 Q.2 + Q.3 wrap-up. No version bump.

CHANGELOG.md (new, repo root, under "Unreleased"):
- Comprehensive notes on the Phase 1-7 API additions (api.map_fastq/count/alleles/save/load, ParsingConfig, MatchTier, crispr_correct alias, CLI with map/count/save/load/alleles, CI-locked sim regression).
- Changed + Removed sections covering every kwarg rename, memory/perf fix, and deprecated-symbol deletion.
- Cumulative perf deltas: AVITI -43% wall / -33% RSS; chrX -36% wall / -57% RSS; scCRISPR -43%; default pickle -93%.
- Dependency additions: click, pyarrow.

Pointers: README migration banner + USAGE.md top-level note pointing at CHANGELOG.

§2.5 measurement helper (Q.3): new tests/benchmarks/measure_series_overhead.py that dumps per-Series byte cost from a result pickle. Sim result measures at 0.10 MB total across 48 Series — same as Phase 6 P.1 deferral signal. Users can run it on a real 11k-guide chrX or 45M-read AVITI pickle to decide whether the 54-Series -> DataFrame dedup refactor is justified (threshold: ship if total > 100 MB, close as not-worth if < 10 MB).

Gate: 13 smoke + 87 fast sim + scCRISPR bit-identical. No package-source changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 1940438 commit 4efdc37

4 files changed

Lines changed: 194 additions & 0 deletions

CHANGELOG.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# Changelog

Entries here are not yet assigned a version — the user reviews accumulated changes and picks the next release number.

## [Unreleased]

### Added — public API

- **`crispr_ambiguous_mapping.api`** with three stage-aligned entry points:
  - `map_fastq(library, fastq_r1_fns, fastq_r2_fns, *, config=ParsingConfig(...), **overrides)` — thin wrapper over the legacy `get_whitelist_reporter_counts_from_fastq`.
  - `count(result)` — returns the per-tier count-Series container.
  - `alleles(result, tier, *, contains_guide_surrogate, contains_guide_barcode, contains_guide_umi)` — wraps `get_matchset_alleleseries`; raises a clear `ValueError` on slim results.
- **`ParsingConfig`** dataclass — IDE-friendly bundle of the 50 parsing + threshold kwargs.
- **`MatchTier`** `str` enum (`PM`, `PM_SM`, `PM_BM`, `PM_SM_BM`, `PM_MISMATCH_SM`, `PM_MISMATCH_SM_BM`) — backward-compatible via `str` base class.
- **`save(result, directory)` / `load(directory)`** — parquet + JSON cross-language durable serialization (§7.4). Count series, QC summary, and `CountInput` round-trip; the per-observation inference dict stays pickle-only.
- **`crispr_correct`** top-level package alias — the forward-looking name (`import crispr_correct as cc`).

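To illustrate why a `str`-based enum stays backward-compatible with callers that still pass plain strings, here is a minimal sketch. The member names come from the changelog entry above; the member values are an assumption (they reuse the tier attribute names that appear elsewhere in this commit), and `tier_attribute` is a hypothetical helper, not part of the package.

```python
from enum import Enum


class MatchTier(str, Enum):
    """Illustrative stand-in; the member values are assumed, not copied from the package."""

    PM = "protospacer_match"
    PM_SM = "protospacer_match_surrogate_match"
    PM_BM = "protospacer_match_barcode_match"
    PM_SM_BM = "protospacer_match_surrogate_match_barcode_match"
    PM_MISMATCH_SM = "protospacer_mismatch_surrogate_match"
    PM_MISMATCH_SM_BM = "protospacer_mismatch_surrogate_match_barcode_match"


def tier_attribute(tier: str) -> str:
    """Normalize a MatchTier member or a legacy plain string to the tier name."""
    # MatchTier(...) accepts both an existing member and a matching raw value,
    # and raises ValueError for anything that is not a known tier.
    return MatchTier(tier).value


legacy = "protospacer_match_surrogate_match"
same_as_string = MatchTier.PM_SM == legacy  # str base class makes this compare equal
```

Because each member is also a `str`, existing call sites that compare against or pass string literals keep working, while new code gets autocompletion and validation from the enum.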
### Added — CLI

Flat-arg command-line entry point `crispr-correct` (§4.5). Every `ParsingConfig` field maps to `--flag value` (field names with `_` → `-`); no YAML, no config file.

- `crispr-correct map` — run mapping from FASTQs.
- `crispr-correct count` — emit one tier's count Series as TSV.
- `crispr-correct save` / `load` — round-trip a mapping result through a parquet/JSON directory.
- `crispr-correct alleles` — post-processing allele extraction to parquet.

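The field-to-flag mapping described above can be sketched with the standard library alone. This is an illustration under stated assumptions: `DemoConfig` and `cli_flags` are hypothetical names, only two fields mentioned in this changelog are modelled, and the way the real CLI builds its options is not shown in this commit.

```python
from dataclasses import dataclass, fields


@dataclass
class DemoConfig:
    """Hypothetical stand-in for ParsingConfig with two fields named in the changelog."""

    surrogate_hamming_threshold_strict: int = 10
    retain_inference_results: bool = False


def cli_flags(config_cls):
    """Derive flat CLI flag names from dataclass field names (`_` becomes `-`)."""
    flags = []
    for f in fields(config_cls):
        dashed = f.name.replace("_", "-")
        if isinstance(f.default, bool):
            # Boolean fields follow the --flag/--no-flag convention.
            flags.append(f"--{dashed}/--no-{dashed}")
        else:
            flags.append(f"--{dashed}")
    return flags


demo_flags = cli_flags(DemoConfig)
```

Deriving flags from the dataclass keeps the CLI and the Python API in step: adding a field to the config yields a new flag without a separate declaration.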
### Added — testing / CI

- GitHub Actions workflow `.github/workflows/ci.yml` runs smoke tests + 135-mode simulation regression (§7.7). Fast subset (~30 s) on push, full matrix on PR.
- `tests/fixtures/` checked in (library + 8k-read simulated FASTQs + truth parquet) so CI runs without external data.
- `tests/test_smoke.py` (13 tests) covers API surface, CLI, save/load round-trip.
- `tests/test_simulation.py` (135 parametrized comparisons across 8 parse modes × 6 tiers × 9 strategies).

### Changed

- **`retain_inference_results: bool = False` is the default** on `get_whitelist_reporter_counts_from_fastq` (§2.2). The default result pickle shrinks ~15× (~93%) at sim scale. Post-processing functions raise `ValueError` with an actionable message when called on a slim result.
- **`contains_surrogate` → `contains_guide_surrogate`** on `CountInput` and throughout post-processing kwargs (§4.3). Legacy `contains_surrogate` remains as a deprecated `@property` alias through the next release and is removed after that.
- **`contains_barcode` / `contains_umi` → `contains_guide_barcode` / `contains_guide_umi`** on `get_matchset_alleleseries`, `get_mutation_profile`, `tally_linked_mutation_count_per_sequence` (§1.4). No compat shim.
- Whitelist DataFrame columns `surrogate` / `barcode` are now accepted as `guide_surrogate` / `guide_barcode` (auto-renamed internally).
- `print()` statements replaced with module-level `logging.Logger` instances (§4.7 / §7.6). `logging.basicConfig(level=logging.INFO)` enables the verbose trace.
- Optional `tqdm` progress bar around the inference `pool.imap` (§4.8).
- Per-observation `pd.Series` construction replaced with a plain dict (§3.5).
- Redundant per-observation Hamming computations deduped — surrogate encoded once, full-whitelist Hamming computed once per read; subsets index-gather (§3.3).
- LUT-based DNA encoding replaces `np.vectorize` (§3.1).
- `pd.merge` in the per-observed-sequence loop replaced with `set.intersection` on tuple-of-guide-tuples (§3.4).
- O(N²) `.apply(axis=1)` cross-product in counter-series build replaced with `pd.DataFrame.from_records(counterdict.items())` — the counter-series stage went from ~21 317 s to ~53 s (~400×) on the AVITI 100k-read profile.
- `parse_fastq` collapsed from ~700 lines / 32 nested branches to ~150 lines / one generic component loop (§4.6).
- Default `surrogate_hamming_threshold_strict` corrected from 2 to 10.
- **Fixed** surrogate-length truncation bug (§1.1): observed surrogate is now clamped to the surrogate library length (32 bp) instead of the protospacer library length (20 bp). Output values change for any surrogate-involving tier — the scCRISPR golden pickle was re-baselined.

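The deprecated `@property` alias mentioned for the `contains_surrogate` rename can be sketched as follows. `CountInputSketch` is a hypothetical stand-in that models only the renamed field; the warning text and the use of `DeprecationWarning` are assumptions, since the package's actual implementation is not part of this diff.

```python
import warnings
from dataclasses import dataclass


@dataclass
class CountInputSketch:
    """Hypothetical stand-in for CountInput; only the renamed field is modelled."""

    contains_guide_surrogate: bool = False

    @property
    def contains_surrogate(self) -> bool:
        """Deprecated read-only alias for contains_guide_surrogate."""
        warnings.warn(
            "contains_surrogate is deprecated; use contains_guide_surrogate",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return self.contains_guide_surrogate


# Reading the legacy name still works but records a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_value = CountInputSketch(contains_guide_surrogate=True).contains_surrogate
```

A read-only property keeps old attribute reads working for one release while steering callers toward the new name; it does not accept the old name as a constructor keyword.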
### Removed

- Deprecated `get_whitelist_reporter_counts_from_umitools_output` entry point and the two deprecated parsing modules (`reporter_umitools_fastq_parsing`, `guide_raw_fastq_parsing`) (§8). In-tree drivers must migrate to `get_whitelist_reporter_counts_from_fastq`.
- `non_error_dict` field on `QualityControlResult` tier objects (§2.1) — duplicated ~78% of the result pickle.
- `original_df` / `hamming_min_match_df` DataFrame fields on `HammingThresholdGuideCountError` subclasses (§2.3) — replaced with lightweight `n_whitelist_candidates` / `n_hamming_min_match` ints.
- `store_intermediates` flag — unused (§4.9).
- Legacy encoding helpers `encode_DNA_base_{whitelist,observed}{,_vectorized}`, `numpify_string{,_vectorized}` (§5.4) — superseded by the LUT-based `encode_DNA_sequence_*` / `encode_guide_series_*`.

### Performance deltas (cumulative Phase 1-5)
60+
61+
- **AVITI TCG** (100k reads, 1186-guide HBG library, full triplet + UMI): 272.3 s / 1461 MB → 155.3 s / 982 MB (**−43% wall, −33% peak RSS**).
62+
- **chrX TGC** (100k reads, 11035-guide library, no UMI): 784.4 s / 2579 MB → 505.7 s / 1103 MB (**−36% wall, −57% peak RSS**).
63+
- **scCRISPR pytest**: 75 s → 43 s (**−43% wall**).
64+
- **Default result pickle**: 2.40 MB → 0.16 MB (**−93%**) at simulation scale.
65+
66+
### Dependencies
67+
68+
- Added `click ^8.1` (CLI).
69+
- Added `pyarrow >=11,<22` (parquet save/load).

README.md

Lines changed: 36 additions & 0 deletions
@@ -12,6 +12,8 @@ CRISPR-Correct also handles guide-RNA **sensor / surrogate constructs**, **UMIs*

If you are mapping many large samples that would take too long on a personal computer, CRISPR-Correct can also run on the [Broad Institute's Terra Platform](https://terra.bio/). The workflow file is at the [Terra Firecloud repository](https://portal.firecloud.org/?return=terra#methods/pinellolab/CrisprSelfEditMappingOrchestratorWorkflowSampleEntity/2).

> **Migrating from 0.0.x?** See `../USAGE.md §6` for the CountInput rename + post-processing kwarg rename table, and `../CHANGELOG.md` for the accumulated 0.0.236 → `Unreleased` changes (new API surface, CLI, parquet save/load, memory + performance deltas).

## Installation

```bash
@@ -319,6 +321,40 @@ crispr-correct count \

Repeat `--r1` / `--r2` for multi-file input. Boolean flags use `--flag/--no-flag` convention. Run `crispr-correct map --help` for the full flag list.

### Save/load — parquet + JSON (cross-language durable)

```bash
# Convert a pickle into a parquet directory (portable across Python/R/Julia)
crispr-correct save --in result.pickle --out-dir result/
# Inspect in pandas:
#   pd.read_parquet("result/counts_protospacer_match_surrogate_match_barcode_match.parquet")

# Reconstruct a pickle from the directory
crispr-correct load --in-dir result/ --out result.pickle
```

Save/load via Python:

```python
import crispr_ambiguous_mapping as cam
cam.save(result, "result/")     # writes parquet + manifest.json + qc.json + count_input.json
result2 = cam.load("result/")   # reconstructs the result (without the inference dict)
```

The per-observation inference dict (`observed_guide_reporter_umi_counts_inferred`) is not round-tripped through parquet — pickle it if you need it.

### Post-processing — `alleles` subcommand

```bash
crispr-correct alleles \
    --in result.pickle \
    --tier protospacer_match_surrogate_match_barcode_match \
    --ambiguity accepted --umi-strategy noncollapsed \
    --out alleles.parquet
```

This subcommand requires the source pickle to have been produced with `--retain-inference-results`; otherwise it emits a clear error pointing at the flag.

---

## Testing

crispr-ambiguous-mapping/tests/benchmarks/__init__.py

Whitespace-only changes.
tests/benchmarks/measure_series_overhead.py

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
"""§2.5 measurement helper — profile per-Series byte footprint on a real
mapping-result pickle.

Usage:

    python measure_series_overhead.py /path/to/result.pickle

Prints a table of (tier, strategy) → bytes (values + index), the total over
all non-None Series, and the pickle size. Point this at the 11 k-guide chrX
result (or the 45 M-read AVITI result, if available) to decide whether the
54-Series → DataFrame dedup proposed in IMPROVEMENTS.md §2.5 is worth the
shape-changing refactor.

Decision threshold suggested by the Phase 6 measurement (~100 KB for a
45-guide sim):
  - If total < 10 MB on real-scale data: close §2.5 as not worth the refactor.
  - If total > 100 MB: schedule the refactor in Phase 8.
  - In between: judgment call based on per-user RSS headroom.
"""
from __future__ import annotations

import argparse
import pickle
import sys
from pathlib import Path


def _series_bytes(s) -> int:
    """Approximate in-memory byte cost of a pandas Series (values + index)."""
    # Only array buffers are counted: object-dtype payloads and MultiIndex
    # codes are not included. Objects without an ndarray-like ``values``
    # contribute 0 instead of raising AttributeError.
    total = int(getattr(getattr(s, "values", None), "nbytes", 0) or 0)
    idx = getattr(s, "index", None)
    if idx is None:
        return total
    if hasattr(idx, "levels"):
        # MultiIndex: sum the unique values stored in each level.
        for lv in idx.levels:
            total += int(lv.values.nbytes)
    else:
        total += int(idx.values.nbytes)
    return total


def main():
    ap = argparse.ArgumentParser(description=__doc__)
    ap.add_argument("pickle_path", type=Path, help="Path to a WhitelistReporterCountsResult pickle.")
    args = ap.parse_args()

    path = args.pickle_path
    with path.open("rb") as fh:
        result = pickle.load(fh)

    allw = result.all_match_set_whitelist_reporter_counter_series_results
    tiers = [
        "protospacer_match",
        "protospacer_match_surrogate_match",
        "protospacer_match_barcode_match",
        "protospacer_match_surrogate_match_barcode_match",
        "protospacer_mismatch_surrogate_match",
        "protospacer_mismatch_surrogate_match_barcode_match",
    ]

    print(f"{'tier':50s} {'strategy':45s} {'rows':>8s} {'bytes':>12s}")
    print("-" * 120)
    total = 0
    n_series = 0
    for tn in tiers:
        t = getattr(allw, tn, None)
        if t is None:
            continue
        # Every public attribute ending in "counterseries" is one counting strategy.
        for attr in sorted(a for a in dir(t) if a.endswith("counterseries") and not a.startswith("_")):
            s = getattr(t, attr, None)
            if s is None:
                continue
            b = _series_bytes(s)
            total += b
            n_series += 1
            print(f"{tn:50s} {attr:45s} {len(s):>8d} {b:>12,d}")

    print("-" * 120)
    print(f"{n_series} non-None Series; total = {total/1024/1024:.2f} MB")

    try:
        pkl_size = path.stat().st_size
        print(f"Source pickle size on disk: {pkl_size/1024/1024:.2f} MB")
    except OSError:
        pass


if __name__ == "__main__":
    sys.exit(main() or 0)
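
The decision rule from the docstring above (close below 10 MB, schedule above 100 MB, judgment call in between) could be applied to the script's printed total as in the following sketch. `refactor_decision` and its return strings are hypothetical and not part of the committed script; the MB conversion mirrors the script's own `total/1024/1024`.

```python
MB = 1024 * 1024  # same conversion the script uses when printing totals


def refactor_decision(total_bytes: int) -> str:
    """Map a measured Series total onto the thresholds stated in the docstring."""
    total_mb = total_bytes / MB
    if total_mb < 10:
        return "close: not worth the refactor"
    if total_mb > 100:
        return "schedule: refactor in Phase 8"
    return "judgment call: depends on per-user RSS headroom"


# The simulated result reported in the commit message measures about 0.10 MB.
sim_decision = refactor_decision(int(0.10 * MB))
```

Under these thresholds the simulated result falls well inside the "close" band, which is why the commit asks for a measurement on real-scale data before deciding.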
