|
| 1 | +# Changelog |
| 2 | + |
| 3 | +Entries here are not yet assigned a version — the user reviews accumulated changes and picks the next release number. |
| 4 | + |
| 5 | +## [Unreleased] |
| 6 | + |
| 7 | +### Added — public API |
| 8 | + |
| 9 | +- **`crispr_ambiguous_mapping.api`** with three stage-aligned entry points: |
| 10 | + - `map_fastq(library, fastq_r1_fns, fastq_r2_fns, *, config=ParsingConfig(...), **overrides)` — thin wrapper over the legacy `get_whitelist_reporter_counts_from_fastq`. |
| 11 | + - `count(result)` — returns the per-tier count-Series container. |
| 12 | + - `alleles(result, tier, *, contains_guide_surrogate, contains_guide_barcode, contains_guide_umi)` — wraps `get_matchset_alleleseries`; raises a clear `ValueError` on slim results. |
| 13 | +- **`ParsingConfig`** dataclass — IDE-friendly bundle of the 50 parsing + threshold kwargs. |
| 14 | +- **`MatchTier`** `str` enum (`PM`, `PM_SM`, `PM_BM`, `PM_SM_BM`, `PM_MISMATCH_SM`, `PM_MISMATCH_SM_BM`) — backward-compatible via `str` base class. |
| 15 | +- **`save(result, directory)` / `load(directory)`** — parquet + JSON cross-language durable serialization (§7.4). Count series, QC summary, and `CountInput` round-trip; the per-observation inference dict stays pickle-only. |
| 16 | +- **`crispr_correct`** top-level package alias — `import crispr_correct as cc` forward-looking name. |
| 17 | + |
| 18 | +### Added — CLI |
| 19 | + |
| 20 | +Flat-arg command-line entry point `crispr-correct` (§4.5). Every `ParsingConfig` field maps to `--flag value` (field names with `_` → `-`); no YAML, no config file. |
| 21 | + |
| 22 | +- `crispr-correct map` — run mapping from FASTQs. |
| 23 | +- `crispr-correct count` — emit one tier's count Series as TSV. |
| 24 | +- `crispr-correct save` / `load` — round-trip a mapping result through a parquet/JSON directory. |
| 25 | +- `crispr-correct alleles` — post-processing allele extraction to parquet. |
| 26 | + |
| 27 | +### Added — validation + demo |
| 28 | + |
| 29 | +- `tests/test_simulation_e2e.py` — 11-stage end-to-end sweep exercising every public surface (Python `map_fastq` / `count` / `alleles` / `save` / `load` + the five CLI subcommands) on the simulation fixtures with bit-equality and ground-truth assertions at every stage. |
| 30 | +- `tests/simulation/phase8_validation_and_demo.ipynb` (wrapper folder) — 35-cell validation report + feature demo + guided tour of Phase 1-7 improvements with `git show --stat` snippets. Runs in ~90 s on the sim fixture. Outputs committed so it renders as a report on GitHub without running locally. |
| 31 | +- **Fixed**: `MatchTier.__str__` was returning the Enum repr (`"MatchTier.PM_SM_BM"`) instead of the underlying string value, breaking `getattr(result, str(tier))`. Added `__str__` override returning `self.value`. Equality with plain strings via the `str` subclass is unaffected. |
| 32 | +- **Fixed**: `crispr-correct alleles` built the strategy attribute name without the `ambiguous_` prefix, causing subprocess failures. |
| 33 | + |
| 34 | +### Added — testing / CI |
| 35 | + |
| 36 | +- GitHub Actions workflow `.github/workflows/ci.yml` runs smoke tests + 135-mode simulation regression (§7.7). Fast subset (~30 s) on push, full matrix on PR. |
| 37 | +- `tests/fixtures/` checked in (library + 8k-read simulated FASTQs + truth parquet) so CI runs without external data. |
| 38 | +- `tests/test_smoke.py` (13 tests) covers API surface, CLI, save/load round-trip. |
| 39 | +- `tests/test_simulation.py` (135 parametrized comparisons across 8 parse modes × 6 tiers × 9 strategies). |
| 40 | + |
| 41 | +### Changed |
| 42 | + |
| 43 | +- **`MatchSetSingleInferenceMatchResultValue.matches`** is now `Optional[Tuple[Tuple[Any, ...], ...]]` (§2.4). Was `Optional[pd.DataFrame]` holding a per-observation iloc slice of the whitelist; now stores a plain tuple of positional row tuples. Per-match footprint drops ~0.8 KB → ~100 B; only matters when `retain_inference_results=True`. Column order matches whitelist registration (`protospacer` [, `surrogate`] [, `barcode`]). No compat shim — readers of `result.<tier>.matches` as a DataFrame must adapt. |
| 44 | +- Padded whitelist DataFrame is released immediately after encoding (§2.7); was kept alive through the whole inference loop, duplicating the whitelist for the multiprocessing pool. |
| 45 | +- **`retain_inference_results: bool = False` is the default** on `get_whitelist_reporter_counts_from_fastq` (§2.2). Result pickle shrinks ~15× (default) / ~93% at sim scale. Post-processing functions raise `ValueError` with an actionable message when called on a slim result. |
| 46 | +- **`contains_surrogate` → `contains_guide_surrogate`** on `CountInput` and throughout post-processing kwargs (§4.3). Legacy `contains_surrogate` remains as a deprecated `@property` alias through the next release; removed after. |
| 47 | +- **`contains_barcode` / `contains_umi` → `contains_guide_barcode` / `contains_guide_umi`** on `get_matchset_alleleseries`, `get_mutation_profile`, `tally_linked_mutation_count_per_sequence` (§1.4). No compat shim. |
| 48 | +- Whitelist DataFrame columns `surrogate` / `barcode` are now accepted as `guide_surrogate` / `guide_barcode` (auto-renamed internally). |
| 49 | +- `print()` statements replaced with module-level `logging.Logger` instances (§4.7 / §7.6). `logging.basicConfig(level=logging.INFO)` enables the verbose trace. |
| 50 | +- Optional `tqdm` progress bar around the inference `pool.imap` (§4.8). |
| 51 | +- Per-observation `pd.Series` construction replaced with a plain dict (§3.5). |
| 52 | +- Redundant per-observation Hamming computations deduped — surrogate encoded once, full-whitelist Hamming computed once per read; subsets index-gather (§3.3). |
| 53 | +- LUT-based DNA encoding replaces `np.vectorize` (§3.1). |
| 54 | +- `pd.merge` in the per-observed-sequence loop replaced with `set.intersection` on tuple-of-guide-tuples (§3.4). |
| 55 | +- O(N²) `.apply(axis=1)` cross-product in counter-series build replaced with `pd.DataFrame.from_records(counterdict.items())` — the counter-series stage went from ~21 317 s to ~53 s (~400×) on the AVITI 100k-read profile. |
| 56 | +- `parse_fastq` collapsed from ~700 lines / 32 nested branches to ~150 lines / one generic component loop (§4.6). |
| 57 | +- Default `surrogate_hamming_threshold_strict` corrected from 2 to 10. |
| 58 | +- **Fixed** surrogate-length truncation bug (§1.1): observed surrogate is now clamped to the surrogate library length (32 bp) instead of the protospacer library length (20 bp). Output values change for any surrogate-involving tier — the scCRISPR golden pickle was re-baselined. |
| 59 | + |
| 60 | +### Removed |
| 61 | + |
| 62 | +- Deprecated `get_whitelist_reporter_counts_from_umitools_output` entry point and the two deprecated parsing modules (`reporter_umitools_fastq_parsing`, `guide_raw_fastq_parsing`) (§8). In-tree drivers must migrate to `get_whitelist_reporter_counts_from_fastq`. |
| 63 | +- `non_error_dict` field on `QualityControlResult` tier objects (§2.1) — duplicated ~78% of the result pickle. |
| 64 | +- `original_df` / `hamming_min_match_df` DataFrame fields on `HammingThresholdGuideCountError` subclasses (§2.3) — replaced with lightweight `n_whitelist_candidates` / `n_hamming_min_match` ints. |
| 65 | +- `store_intermediates` flag — unused (§4.9). |
| 66 | +- Legacy encoding helpers `encode_DNA_base_{whitelist,observed}{,_vectorized}`, `numpify_string{,_vectorized}` (§5.4) — superseded by the LUT-based `encode_DNA_sequence_*` / `encode_guide_series_*`. |
| 67 | + |
| 68 | +### Performance deltas (cumulative Phase 1-5) |
| 69 | + |
| 70 | +- **AVITI TCG** (100k reads, 1186-guide HBG library, full triplet + UMI): 272.3 s / 1461 MB → 155.3 s / 982 MB (**−43% wall, −33% peak RSS**). |
| 71 | +- **chrX TGC** (100k reads, 11035-guide library, no UMI): 784.4 s / 2579 MB → 505.7 s / 1103 MB (**−36% wall, −57% peak RSS**). |
| 72 | +- **scCRISPR pytest**: 75 s → 43 s (**−43% wall**). |
| 73 | +- **Default result pickle**: 2.40 MB → 0.16 MB (**−93%**) at simulation scale. |
| 74 | + |
| 75 | +### Dependencies |
| 76 | + |
| 77 | +- Added `click ^8.1` (CLI). |
| 78 | +- Added `pyarrow >=11,<22` (parquet save/load). |
0 commit comments