Skip to content

Commit ce8fc76

Browse files
CodingBashclaude
andcommitted
Merge perf/all-phases-review: Phase 1-9 optimization roadmap
Integrates 36 commits across 9 stacked branches (perf/roadmap-phase1 → perf/roadmap-phase9) covering the IMPROVEMENTS.md tactical roadmap. Bit-identical output to the v0.0.236 baseline except for the §1.1 surrogate-truncation correctness fix. Cumulative wins vs v0.0.236 baseline (downsampled real data, 4 cores): - AVITI TCG: 272.3s/1461MB -> 161.7s/916MB (-40.6% wall, -37.3% peak RSS) - chrX TGC: 784.4s/2579MB -> 567.0s/933MB (-27.7% wall, -63.8% peak RSS) - scCRISPR pytest: ~75s -> ~40s (-47%) - Default result pickle: 2.40MB -> 0.16MB (-93.3%) Added public surfaces: api.map_fastq/count/alleles/save/load + ParsingConfig, MatchTier enum, crispr-correct CLI, parquet save/load, crispr_correct alias package, GitHub Actions CI, 159 in-repo tests, Phase 8 validation notebook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 parents df3a3df + 3d4ca50 commit ce8fc76

39 files changed

Lines changed: 72431 additions & 1242 deletions

.github/workflows/ci.yml

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
1+
name: CI
2+
3+
on:
4+
push:
5+
branches: [master, multi-sample-support, "perf/**"]
6+
pull_request:
7+
branches: [master, multi-sample-support]
8+
9+
jobs:
10+
test:
11+
runs-on: ubuntu-latest
12+
strategy:
13+
fail-fast: false
14+
matrix:
15+
python-version: ["3.10", "3.11"]
16+
defaults:
17+
run:
18+
working-directory: crispr-ambiguous-mapping
19+
steps:
20+
- uses: actions/checkout@v4
21+
22+
- name: Set up Python ${{ matrix.python-version }}
23+
uses: actions/setup-python@v5
24+
with:
25+
python-version: ${{ matrix.python-version }}
26+
cache: "pip"
27+
28+
- name: Install package + test deps
29+
run: |
30+
python -m pip install --upgrade pip
31+
pip install -e .
32+
pip install pytest
33+
34+
- name: Run smoke + CI-fast simulation regression (push)
35+
if: github.event_name == 'push'
36+
run: pytest tests/ -v -m "not slow"
37+
38+
- name: Run smoke + full 135-mode simulation regression (PR)
39+
if: github.event_name == 'pull_request'
40+
run: pytest tests/ -v

.gitignore

Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
# §5.5: stop committing generated wheels — 26 version-bumped artifacts were
2+
# being carried in `dist/` on multi-sample-support. Let a release pipeline
3+
# publish them to PyPI; they don't belong in git.
4+
crispr-ambiguous-mapping/dist/
5+
**/__pycache__/
6+
*.py[cod]
7+
*.egg-info/
8+
.pytest_cache/
9+
.mypy_cache/
10+
.ipynb_checkpoints/
11+
tests/simulation/test_local.run.ipynb
12+
tests/simulation/bench_simulation.run.ipynb
13+
tests/benchmarks/benchmark_demos.run.ipynb

CHANGELOG.md

Lines changed: 78 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,78 @@
1+
# Changelog
2+
3+
Entries here are not yet assigned a version — the user reviews accumulated changes and picks the next release number.
4+
5+
## [Unreleased]
6+
7+
### Added — public API
8+
9+
- **`crispr_ambiguous_mapping.api`** with three stage-aligned entry points:
10+
- `map_fastq(library, fastq_r1_fns, fastq_r2_fns, *, config=ParsingConfig(...), **overrides)` — thin wrapper over the legacy `get_whitelist_reporter_counts_from_fastq`.
11+
- `count(result)` — returns the per-tier count-Series container.
12+
- `alleles(result, tier, *, contains_guide_surrogate, contains_guide_barcode, contains_guide_umi)` — wraps `get_matchset_alleleseries`; raises a clear `ValueError` on slim results.
13+
- **`ParsingConfig`** dataclass — IDE-friendly bundle of the 50 parsing + threshold kwargs.
14+
- **`MatchTier`** `str` enum (`PM`, `PM_SM`, `PM_BM`, `PM_SM_BM`, `PM_MISMATCH_SM`, `PM_MISMATCH_SM_BM`) — backward-compatible via `str` base class.
15+
- **`save(result, directory)` / `load(directory)`** — parquet + JSON cross-language durable serialization (§7.4). Count series, QC summary, and `CountInput` round-trip; the per-observation inference dict stays pickle-only.
16+
- **`crispr_correct`** top-level package alias — `import crispr_correct as cc` forward-looking name.
17+
18+
### Added — CLI
19+
20+
Flat-arg command-line entry point `crispr-correct` (§4.5). Every `ParsingConfig` field maps to `--flag value` (field names with `_``-`); no YAML, no config file.
21+
22+
- `crispr-correct map` — run mapping from FASTQs.
23+
- `crispr-correct count` — emit one tier's count Series as TSV.
24+
- `crispr-correct save` / `load` — round-trip a mapping result through a parquet/JSON directory.
25+
- `crispr-correct alleles` — post-processing allele extraction to parquet.
26+
27+
### Added — validation + demo
28+
29+
- `tests/test_simulation_e2e.py` — 11-stage end-to-end sweep exercising every public surface (Python `map_fastq` / `count` / `alleles` / `save` / `load` + the five CLI subcommands) on the simulation fixtures with bit-equality and ground-truth assertions at every stage.
30+
- `tests/simulation/phase8_validation_and_demo.ipynb` (wrapper folder) — 35-cell validation report + feature demo + guided tour of Phase 1-7 improvements with `git show --stat` snippets. Runs in ~90 s on the sim fixture. Outputs committed so it renders as a report on GitHub without running locally.
31+
- **Fixed**: `MatchTier.__str__` was returning the Enum repr (`"MatchTier.PM_SM_BM"`) instead of the underlying string value, breaking `getattr(result, str(tier))`. Added `__str__` override returning `self.value`. Equality with plain strings via the `str` subclass is unaffected.
32+
- **Fixed**: `crispr-correct alleles` built the strategy attribute name without the `ambiguous_` prefix, causing subprocess failures.
33+
34+
### Added — testing / CI
35+
36+
- GitHub Actions workflow `.github/workflows/ci.yml` runs smoke tests + 135-mode simulation regression (§7.7). Fast subset (~30 s) on push, full matrix on PR.
37+
- `tests/fixtures/` checked in (library + 8k-read simulated FASTQs + truth parquet) so CI runs without external data.
38+
- `tests/test_smoke.py` (13 tests) covers API surface, CLI, save/load round-trip.
39+
- `tests/test_simulation.py` (135 parametrized comparisons across 8 parse modes × 6 tiers × 9 strategies).
40+
41+
### Changed
42+
43+
- **`MatchSetSingleInferenceMatchResultValue.matches`** is now `Optional[Tuple[Tuple[Any, ...], ...]]` (§2.4). Was `Optional[pd.DataFrame]` holding a per-observation iloc slice of the whitelist; now stores a plain tuple of positional row tuples. Per-match footprint drops ~0.8 KB → ~100 B; only matters when `retain_inference_results=True`. Column order matches whitelist registration (`protospacer` [, `surrogate`] [, `barcode`]). No compat shim — readers of `result.<tier>.matches` as a DataFrame must adapt.
44+
- Padded whitelist DataFrame is released immediately after encoding (§2.7); was kept alive through the whole inference loop, duplicating the whitelist for the multiprocessing pool.
45+
- **`retain_inference_results: bool = False` is the default** on `get_whitelist_reporter_counts_from_fastq` (§2.2). Result pickle shrinks ~15× (default) / ~93% at sim scale. Post-processing functions raise `ValueError` with an actionable message when called on a slim result.
46+
- **`contains_surrogate``contains_guide_surrogate`** on `CountInput` and throughout post-processing kwargs (§4.3). Legacy `contains_surrogate` remains as a deprecated `@property` alias through the next release; removed after.
47+
- **`contains_barcode` / `contains_umi``contains_guide_barcode` / `contains_guide_umi`** on `get_matchset_alleleseries`, `get_mutation_profile`, `tally_linked_mutation_count_per_sequence` (§1.4). No compat shim.
48+
- Whitelist DataFrame columns `surrogate` / `barcode` are now accepted as `guide_surrogate` / `guide_barcode` (auto-renamed internally).
49+
- `print()` statements replaced with module-level `logging.Logger` instances (§4.7 / §7.6). `logging.basicConfig(level=logging.INFO)` enables the verbose trace.
50+
- Optional `tqdm` progress bar around the inference `pool.imap` (§4.8).
51+
- Per-observation `pd.Series` construction replaced with a plain dict (§3.5).
52+
- Redundant per-observation Hamming computations deduped — surrogate encoded once, full-whitelist Hamming computed once per read; subsets index-gather (§3.3).
53+
- LUT-based DNA encoding replaces `np.vectorize` (§3.1).
54+
- `pd.merge` in the per-observed-sequence loop replaced with `set.intersection` on tuple-of-guide-tuples (§3.4).
55+
- O(N²) `.apply(axis=1)` cross-product in counter-series build replaced with `pd.DataFrame.from_records(counterdict.items())` — the counter-series stage went from ~21 317 s to ~53 s (~400×) on the AVITI 100k-read profile.
56+
- `parse_fastq` collapsed from ~700 lines / 32 nested branches to ~150 lines / one generic component loop (§4.6).
57+
- Default `surrogate_hamming_threshold_strict` corrected from 2 to 10.
58+
- **Fixed** surrogate-length truncation bug (§1.1): observed surrogate is now clamped to the surrogate library length (32 bp) instead of the protospacer library length (20 bp). Output values change for any surrogate-involving tier — the scCRISPR golden pickle was re-baselined.
59+
60+
### Removed
61+
62+
- Deprecated `get_whitelist_reporter_counts_from_umitools_output` entry point and the two deprecated parsing modules (`reporter_umitools_fastq_parsing`, `guide_raw_fastq_parsing`) (§8). In-tree drivers must migrate to `get_whitelist_reporter_counts_from_fastq`.
63+
- `non_error_dict` field on `QualityControlResult` tier objects (§2.1) — duplicated ~78% of the result pickle.
64+
- `original_df` / `hamming_min_match_df` DataFrame fields on `HammingThresholdGuideCountError` subclasses (§2.3) — replaced with lightweight `n_whitelist_candidates` / `n_hamming_min_match` ints.
65+
- `store_intermediates` flag — unused (§4.9).
66+
- Legacy encoding helpers `encode_DNA_base_{whitelist,observed}{,_vectorized}`, `numpify_string{,_vectorized}` (§5.4) — superseded by the LUT-based `encode_DNA_sequence_*` / `encode_guide_series_*`.
67+
68+
### Performance deltas (cumulative Phase 1-5)
69+
70+
- **AVITI TCG** (100k reads, 1186-guide HBG library, full triplet + UMI): 272.3 s / 1461 MB → 155.3 s / 982 MB (**−43% wall, −33% peak RSS**).
71+
- **chrX TGC** (100k reads, 11035-guide library, no UMI): 784.4 s / 2579 MB → 505.7 s / 1103 MB (**−36% wall, −57% peak RSS**).
72+
- **scCRISPR pytest**: 75 s → 43 s (**−43% wall**).
73+
- **Default result pickle**: 2.40 MB → 0.16 MB (**−93%**) at simulation scale.
74+
75+
### Dependencies
76+
77+
- Added `click ^8.1` (CLI).
78+
- Added `pyarrow >=11,<22` (parquet save/load).

Dockerfile

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,4 +15,4 @@ ENV PATH="${VENV}/bin:$PATH"
1515

1616
# Install from PyPI
1717
RUN pip install --upgrade pip
18-
RUN pip install crispr-ambiguous-mapping==0.0.156
18+
RUN pip install crispr-ambiguous-mapping==0.0.236

0 commit comments

Comments
 (0)