| 1 | +# Changelog |
| 2 | + |
| 3 | +Entries here are not yet assigned a version — the user reviews accumulated changes and picks the next release number. |
| 4 | + |
| 5 | +## [Unreleased] |
| 6 | + |
| 7 | +### Added — public API |
| 8 | + |
| 9 | +- **`crispr_ambiguous_mapping.api`** with three stage-aligned entry points: |
| 10 | + - `map_fastq(library, fastq_r1_fns, fastq_r2_fns, *, config=ParsingConfig(...), **overrides)` — thin wrapper over the legacy `get_whitelist_reporter_counts_from_fastq`. |
| 11 | + - `count(result)` — returns the per-tier count-Series container. |
| 12 | + - `alleles(result, tier, *, contains_guide_surrogate, contains_guide_barcode, contains_guide_umi)` — wraps `get_matchset_alleleseries`; raises a clear `ValueError` on slim results. |
| 13 | +- **`ParsingConfig`** dataclass — IDE-friendly bundle of the 50 parsing + threshold kwargs. |
| 14 | +- **`MatchTier`** `str` enum (`PM`, `PM_SM`, `PM_BM`, `PM_SM_BM`, `PM_MISMATCH_SM`, `PM_MISMATCH_SM_BM`) — backward-compatible via `str` base class. |
| 15 | +- **`save(result, directory)` / `load(directory)`** — parquet + JSON cross-language durable serialization (§7.4). Count series, QC summary, and `CountInput` round-trip; the per-observation inference dict stays pickle-only. |
| 16 | +- **`crispr_correct`** top-level package alias (`import crispr_correct as cc`), the forward-looking import name. |
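The backward-compatibility claim for `MatchTier` rests on the `str` base class: each member is itself a string, so code that compares against or indexes by plain strings keeps working. A minimal sketch, assuming each member's value equals its name (the changelog lists only the member names, so the values here are placeholders):

```python
from enum import Enum


class MatchTier(str, Enum):
    # Member names are taken from the changelog entry above.
    # The string values are ASSUMED for this sketch and may differ
    # from the values the package actually uses.
    PM = "PM"
    PM_SM = "PM_SM"
    PM_BM = "PM_BM"
    PM_SM_BM = "PM_SM_BM"
    PM_MISMATCH_SM = "PM_MISMATCH_SM"
    PM_MISMATCH_SM_BM = "PM_MISMATCH_SM_BM"


# Every member is a real str instance, so string comparisons still hold.
is_plain_string = isinstance(MatchTier.PM, str)
equals_legacy_literal = MatchTier.PM_SM == "PM_SM"

# Dicts keyed by the old string literals accept enum members as keys.
legacy_counts = {"PM_SM": 42}
looked_up = legacy_counts[MatchTier.PM_SM]

# Old string values can be promoted to enum members by value lookup.
promoted = MatchTier("PM_BM")
```

Callers that still pass string literals can therefore migrate to the enum gradually.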
| 17 | + |
| 18 | +### Added — CLI |
| 19 | + |
| 20 | +Flat-arg command-line entry point `crispr-correct` (§4.5). Every `ParsingConfig` field maps to `--flag value` (field names with `_` → `-`); no YAML, no config file. |
| 21 | + |
| 22 | +- `crispr-correct map` — run mapping from FASTQs. |
| 23 | +- `crispr-correct count` — emit one tier's count Series as TSV. |
| 24 | +- `crispr-correct save` / `load` — round-trip a mapping result through a parquet/JSON directory. |
| 25 | +- `crispr-correct alleles` — post-processing allele extraction to parquet. |
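The field-to-flag rule (`_` becomes `-`) can be illustrated with a small self-contained sketch. It uses the standard-library `argparse` so it runs without extra dependencies, whereas the package itself lists `click` as its CLI dependency. `DemoConfig`, `flag_name`, and `build_parser` are hypothetical names for illustration, and `example_label` is an invented field:

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class DemoConfig:
    # Hypothetical stand-in for ParsingConfig. The first field name appears
    # in the changelog; treating it as a config field is an assumption.
    surrogate_hamming_threshold_strict: int = 10
    example_label: str = "demo"  # invented field, for illustration only


def flag_name(field_name: str) -> str:
    """Turn a dataclass field name into a flat CLI flag."""
    return "--" + field_name.replace("_", "-")


def build_parser(config_cls) -> argparse.ArgumentParser:
    """Register one `--flag value` option per dataclass field."""
    parser = argparse.ArgumentParser(prog="demo")
    for f in fields(config_cls):
        parser.add_argument(
            flag_name(f.name), dest=f.name, type=f.type, default=f.default
        )
    return parser


parser = build_parser(DemoConfig)
args = parser.parse_args(["--surrogate-hamming-threshold-strict", "3"])
config = DemoConfig(**vars(args))
```

Because the flags are derived from the dataclass, adding a field adds a flag without a separate config file.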
| 26 | + |
| 27 | +### Added — testing / CI |
| 28 | + |
| 29 | +- GitHub Actions workflow `.github/workflows/ci.yml` runs smoke tests + 135-mode simulation regression (§7.7). Fast subset (~30 s) on push, full matrix on PR. |
| 30 | +- `tests/fixtures/` checked in (library + 8k-read simulated FASTQs + truth parquet) so CI runs without external data. |
| 31 | +- `tests/test_smoke.py` (13 tests) covers API surface, CLI, save/load round-trip. |
| 32 | +- `tests/test_simulation.py` (135 parametrized comparisons across 8 parse modes × 6 tiers × 9 strategies). |
| 33 | + |
| 34 | +### Changed |
| 35 | + |
| 36 | +- **`retain_inference_results: bool = False` is the default** on `get_whitelist_reporter_counts_from_fastq` (§2.2). The default result pickle shrinks ~15× (~93%) at simulation scale. Post-processing functions raise `ValueError` with an actionable message when called on a slim result. |
| 37 | +- **`contains_surrogate` → `contains_guide_surrogate`** on `CountInput` and throughout post-processing kwargs (§4.3). Legacy `contains_surrogate` remains as a deprecated `@property` alias through the next release; removed after. |
| 38 | +- **`contains_barcode` / `contains_umi` → `contains_guide_barcode` / `contains_guide_umi`** on `get_matchset_alleleseries`, `get_mutation_profile`, `tally_linked_mutation_count_per_sequence` (§1.4). No compat shim. |
| 39 | +- Whitelist DataFrame columns `surrogate` / `barcode` are now accepted as `guide_surrogate` / `guide_barcode` (auto-renamed internally). |
| 40 | +- `print()` statements replaced with module-level `logging.Logger` instances (§4.7 / §7.6). `logging.basicConfig(level=logging.INFO)` enables the verbose trace. |
| 41 | +- Optional `tqdm` progress bar around the inference `pool.imap` (§4.8). |
| 42 | +- Per-observation `pd.Series` construction replaced with a plain dict (§3.5). |
| 43 | +- Redundant per-observation Hamming computations deduplicated: the surrogate is encoded once and the full-whitelist Hamming distance is computed once per read; subsets are then gathered by index (§3.3). |
| 44 | +- LUT-based DNA encoding replaces `np.vectorize` (§3.1). |
| 45 | +- `pd.merge` in the per-observed-sequence loop replaced with `set.intersection` on tuple-of-guide-tuples (§3.4). |
| 46 | +- O(N²) `.apply(axis=1)` cross-product in counter-series build replaced with `pd.DataFrame.from_records(counterdict.items())` — the counter-series stage went from ~21 317 s to ~53 s (~400×) on the AVITI 100k-read profile. |
| 47 | +- `parse_fastq` collapsed from ~700 lines / 32 nested branches to ~150 lines / one generic component loop (§4.6). |
| 48 | +- Default `surrogate_hamming_threshold_strict` corrected from 2 to 10. |
| 49 | +- **Fixed** surrogate-length truncation bug (§1.1): observed surrogate is now clamped to the surrogate library length (32 bp) instead of the protospacer library length (20 bp). Output values change for any surrogate-involving tier — the scCRISPR golden pickle was re-baselined. |
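The deprecated `@property` alias described for `contains_surrogate` could look roughly like the sketch below. `CountInputSketch` is a hypothetical stand-in, not the package's `CountInput`, and the warning text is invented:

```python
import warnings
from dataclasses import dataclass


@dataclass
class CountInputSketch:
    # Hypothetical stand-in for CountInput; only the renamed field is modeled.
    contains_guide_surrogate: bool = False

    @property
    def contains_surrogate(self) -> bool:
        """Legacy read-only alias that forwards to the renamed field."""
        warnings.warn(
            "contains_surrogate is deprecated; use contains_guide_surrogate",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.contains_guide_surrogate


# Reading the legacy name still works, but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_value = CountInputSketch(contains_guide_surrogate=True).contains_surrogate
```

In this sketch the alias is read-only, so code that assigns to the old name fails immediately. Whether the real alias also accepts writes is not stated in the changelog.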
| 50 | + |
| 51 | +### Removed |
| 52 | + |
| 53 | +- Deprecated `get_whitelist_reporter_counts_from_umitools_output` entry point and the two deprecated parsing modules (`reporter_umitools_fastq_parsing`, `guide_raw_fastq_parsing`) (§8). In-tree drivers must migrate to `get_whitelist_reporter_counts_from_fastq`. |
| 54 | +- `non_error_dict` field on `QualityControlResult` tier objects (§2.1) — duplicated ~78% of the result pickle. |
| 55 | +- `original_df` / `hamming_min_match_df` DataFrame fields on `HammingThresholdGuideCountError` subclasses (§2.3) — replaced with lightweight `n_whitelist_candidates` / `n_hamming_min_match` ints. |
| 56 | +- `store_intermediates` flag — unused (§4.9). |
| 57 | +- Legacy encoding helpers `encode_DNA_base_{whitelist,observed}{,_vectorized}`, `numpify_string{,_vectorized}` (§5.4) — superseded by the LUT-based `encode_DNA_sequence_*` / `encode_guide_series_*`. |
| 58 | + |
| 59 | +### Performance deltas (cumulative Phase 1-5) |
| 60 | + |
| 61 | +- **AVITI TCG** (100k reads, 1186-guide HBG library, full triplet + UMI): 272.3 s / 1461 MB → 155.3 s / 982 MB (**−43% wall, −33% peak RSS**). |
| 62 | +- **chrX TGC** (100k reads, 11035-guide library, no UMI): 784.4 s / 2579 MB → 505.7 s / 1103 MB (**−36% wall, −57% peak RSS**). |
| 63 | +- **scCRISPR pytest**: 75 s → 43 s (**−43% wall**). |
| 64 | +- **Default result pickle**: 2.40 MB → 0.16 MB (**−93%**) at simulation scale. |
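The percentages above are relative changes, `(after - before) / before`, rounded to the nearest whole percent. The short check below recomputes them from the before/after figures quoted in this section. The helper name `percent_change` and the dictionary labels are illustrative:

```python
def percent_change(before: float, after: float) -> int:
    """Relative change from `before` to `after`, as a rounded percentage."""
    return round((after - before) / before * 100)


# (before, after) pairs copied from the entries in this section.
measurements = {
    "AVITI wall (s)": (272.3, 155.3),
    "AVITI peak RSS (MB)": (1461, 982),
    "chrX wall (s)": (784.4, 505.7),
    "chrX peak RSS (MB)": (2579, 1103),
    "scCRISPR pytest wall (s)": (75, 43),
    "default result pickle (MB)": (2.40, 0.16),
}

deltas = {name: percent_change(b, a) for name, (b, a) in measurements.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+d}%")
```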
| 65 | + |
| 66 | +### Dependencies |
| 67 | + |
| 68 | +- Added `click ^8.1` (CLI). |
| 69 | +- Added `pyarrow >=11,<22` (parquet save/load). |