Commit ade0b93

CodingBash and claude committed
docs §5.7 + §5.6 + §3.14: public docstrings + Dockerfile version bump
§5.7: NumPy-style docstrings on the 3 main public entry points:
- mapping.get_whitelist_reporter_counts_from_fastq (params, return fields, raises, see-also)
- processing.get_matchset_alleleseries
- processing.get_mutation_profile

§3.14: documents that *_hamming_threshold_strict is strict-less-than (value 7 => dist <= 6) in the main entry docstring, resolving the long-standing ambiguity.

§5.6: Dockerfile pin bumped 0.0.156 -> 0.0.236 (current multi-sample-support).

Gate: scCRISPR + smoke (7 tests) pass (44s); simulation 135/135.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
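The strict-less-than convention from §3.14 can be illustrated with a small sketch. The helper names `hamming_distance` and `is_match` are hypothetical and are not taken from the package; the only behaviour assumed from the commit is that a threshold of 7 accepts distances up to and including 6.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_match(observed: str, whitelist: str, threshold_strict: int) -> bool:
    """Hypothetical helper: a match requires distance < threshold_strict."""
    return hamming_distance(observed, whitelist) < threshold_strict


guide = "ACGTACGTACGTACGTACGT"        # 20 bp whitelist protospacer
six_off = "TGCATG" + guide[6:]        # first 6 bases changed
seven_off = "TGCATGC" + guide[7:]     # first 7 bases changed

print(is_match(six_off, guide, threshold_strict=7))    # distance 6 is accepted
print(is_match(seven_off, guide, threshold_strict=7))  # distance 7 is rejected
```

Under this reading, callers who want to accept up to `k` mismatches would pass `k + 1`.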
1 parent f6c17e7 commit ade0b93

3 files changed

Lines changed: 129 additions & 3 deletions

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@ ENV PATH="${VENV}/bin:$PATH"
 
 # Install from PyPI
 RUN pip install --upgrade pip
-RUN pip install crispr-ambiguous-mapping==0.0.156
+RUN pip install crispr-ambiguous-mapping==0.0.236

crispr-ambiguous-mapping/crispr_ambiguous_mapping/mapping/main_mapping.py

Lines changed: 68 additions & 0 deletions
@@ -107,6 +107,74 @@ def get_whitelist_reporter_counts_from_fastq(whitelist_guide_reporter_df: Option
 
     retain_inference_results: bool = False,
     cores: int=1) -> WhitelistReporterCountsResult:
+    """Map observed CRISPR reads from FASTQs to a whitelist guide library via per-base Hamming distance.
+
+    This is the canonical entry point for the multi-sample-support branch. It
+    parses the configured components (protospacer, optional surrogate, guide
+    barcode, guide UMI, sample/cell barcode) from R1/R2/header, runs Hamming
+    inference in parallel, builds the per-tier count series, and returns a
+    `WhitelistReporterCountsResult` dataclass.
+
+    Parameters
+    ----------
+    whitelist_guide_reporter_df
+        DataFrame with one row per guide. Required column: ``protospacer``. If
+        ``contains_guide_surrogate`` is inferred from the parsing kwargs, add a
+        ``surrogate`` column; if ``contains_guide_barcode`` is inferred, add a
+        ``barcode`` column.
+    fastq_r1_fns
+        List of R1 FASTQ paths (gzipped accepted). Single-end calls still pass
+        a single-element list.
+    fastq_r2_fns
+        List of R2 FASTQ paths or ``None`` for single-end.
+    protospacer_* / surrogate_* / guide_barcode_* / guide_umi_* / sample_barcode_*
+        Per-component extraction knobs. Provide one of:
+        ``*_pattern_regex`` (capture-group-1 parsed from sequence or header),
+        ``*_left_flank`` / ``*_right_flank`` (flank-based extraction), or
+        ``*_start_position`` + ``*_length`` / ``*_end_position`` (fixed offset).
+        ``is_*_r1`` / ``is_*_header`` selects source; ``revcomp_*`` reverse-
+        complements the extracted fragment.
+    protospacer_hamming_threshold_strict, surrogate_hamming_threshold_strict, guide_barcode_hamming_threshold_strict
+        Strict-less-than thresholds. A value of 7 means distances ``<= 6`` are
+        matches; the ``_strict`` suffix is deliberate. Typical: 7 for 20bp
+        protospacers, 10 for 32bp surrogates, 2 for 4bp barcodes. Pass ``None``
+        to auto-determine from the library (5th percentile of pairwise Hamming
+        distances, sample=100).
+    retain_inference_results
+        Default ``False``: the slim result drops the per-observation
+        inference dict (15x smaller pickle, ~45% smaller peak RSS). Set to
+        ``True`` if you plan to call ``get_matchset_alleleseries`` /
+        ``get_mutation_profile`` / ``tally_linked_mutation_count_per_sequence``
+        downstream (they raise ``ValueError`` on a slim result with a clear
+        remediation message).
+    cores
+        Number of worker processes for inference. FASTQ parsing is single-
+        threaded (§3.12 streaming is a future upgrade).
+
+    Returns
+    -------
+    WhitelistReporterCountsResult
+        Fields of note:
+        - ``all_match_set_whitelist_reporter_counter_series_results``: six tiers
+          (protospacer_match, PM+SM, PM+BM, PM+SM+BM, PM_mismatch_SM, PM_mismatch_SM_BM),
+          each with 9 Series (3 ambiguity strategies x 3 UMI strategies).
+        - ``quality_control_result``: per-tier error counts (``num_total_*``, ``num_non_error_*``).
+        - ``count_input``: echo of parsing flags (``contains_guide_surrogate``, etc.).
+        - ``observed_guide_reporter_umi_counts_inferred``: raw per-observation
+          inference dict, present only when ``retain_inference_results=True``.
+
+    Raises
+    ------
+    ValueError
+        If ``whitelist_guide_reporter_df`` is missing required columns for
+        the configured components.
+
+    See Also
+    --------
+    crispr_ambiguous_mapping.processing.get_matchset_alleleseries
+    crispr_ambiguous_mapping.processing.get_mutation_profile
+    crispr_ambiguous_mapping.models.MatchTier
+    """
     # Input parameter validation checks
 
     protospacer_pattern_regex = None if ((protospacer_pattern_regex is not None) and (protospacer_pattern_regex.strip() == "")) else protospacer_pattern_regex
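The three extraction modes named in the docstring (regex capture group 1, flank-based, fixed offset) and the `revcomp_*` option can be sketched roughly as below. This is an illustration only, not the package's parser: every function name is hypothetical, the read layout is invented, and a 0-based `start_position` is an assumption that the diff does not confirm.

```python
import re
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def extract_by_regex(sequence: str, pattern: str) -> Optional[str]:
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, sequence)
    return match.group(1) if match else None


def extract_by_flanks(sequence: str, left_flank: str, right_flank: str) -> Optional[str]:
    """Return the fragment between the first left flank and the next right flank."""
    start = sequence.find(left_flank)
    if start == -1:
        return None
    start += len(left_flank)
    end = sequence.find(right_flank, start)
    return None if end == -1 else sequence[start:end]


def extract_by_position(sequence: str, start_position: int, length: int) -> str:
    """Return a fixed-offset slice (0-based start is an assumption here)."""
    return sequence[start_position:start_position + length]


def reverse_complement(fragment: str) -> str:
    """Reverse-complement a DNA fragment."""
    return fragment.translate(_COMPLEMENT)[::-1]


# Invented read layout: 4 nt pad, left flank, 20 nt protospacer, right flank, pad.
protospacer = "GATTACAGATTACAGATTAC"
read = "TTTT" + "CACCG" + protospacer + "GTTTT" + "AAAA"

by_regex = extract_by_regex(read, r"CACCG([ACGT]{20})GTTT")
by_flanks = extract_by_flanks(read, "CACCG", "GTTTT")
by_position = extract_by_position(read, start_position=9, length=20)
print(by_regex == by_flanks == by_position == protospacer)
```

All three modes recover the same fragment on a clean read; they differ in how they tolerate indels or shifted reads, which is presumably why the entry point exposes all of them.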

crispr-ambiguous-mapping/crispr_ambiguous_mapping/processing/crispr_editing_processing.py

Lines changed: 60 additions & 2 deletions
@@ -54,6 +54,36 @@ def _require_inference_dict(observed_guide_reporter_umi_counts_inferred, caller:
 
 
 def get_matchset_alleleseries(observed_guide_reporter_umi_counts_inferred: GeneralMappingInferenceDict, attribute_name: str, contains_surrogate: bool, contains_guide_barcode: bool, contains_guide_umi: bool):
+    """Build per-tier observed-allele count Series from a full (retained) mapping result.
+
+    For each whitelist guide, aggregates the observed protospacer/surrogate/barcode
+    alleles that mapped to it under the chosen tier, across nine (ambiguity x UMI)
+    counting strategies. Required input for downstream mutation profiling.
+
+    Parameters
+    ----------
+    observed_guide_reporter_umi_counts_inferred
+        The per-observation inference dict, i.e. ``result.observed_guide_reporter_umi_counts_inferred``
+        from a mapping call made with ``retain_inference_results=True``. A slim
+        result (default) passes ``None`` here and raises ``ValueError``.
+    attribute_name
+        Match tier to extract. Pass a ``MatchTier`` enum member (or its string
+        value). Typical: ``MatchTier.PM_SM_BM`` for full-triplet screens.
+    contains_surrogate, contains_guide_barcode, contains_guide_umi
+        Must match what was configured during mapping (these drive the output
+        DataFrame column shape).
+
+    Returns
+    -------
+    MatchSetWhitelistReporterObservedSequenceCounterSeriesResults
+        Dataclass with 9 alleledict + 9 alleleseries_dict + 9 allele_df fields,
+        one per (ambiguity, UMI) strategy.
+
+    Raises
+    ------
+    ValueError
+        If called on a slim mapping result (re-run with ``retain_inference_results=True``).
+    """
     _require_inference_dict(observed_guide_reporter_umi_counts_inferred, "get_matchset_alleleseries")
     #
     # DEFINE THE DEFAULTDICTS FOR COUNTING
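The slim-result guard documented above might look something like the sketch below. The body of the private `_require_inference_dict` is not shown in this diff, so `require_inference_dict_sketch` is a hypothetical stand-in and its error message is invented; only the `ValueError` on a `None` input and the `retain_inference_results=True` remediation come from the docstring.

```python
from typing import Any, Optional


def require_inference_dict_sketch(inference: Optional[Any], caller: str) -> None:
    """Raise ValueError when the mapping result was slim (inference dict is None)."""
    if inference is None:
        raise ValueError(
            f"{caller} needs the per-observation inference dict; "
            "re-run the mapping with retain_inference_results=True."
        )


try:
    # A slim result carries None where the inference dict would be.
    require_inference_dict_sketch(None, "get_matchset_alleleseries")
except ValueError as exc:
    print(exc)
```

Failing at the top of the function, with the fix named in the message, keeps the error close to its cause instead of surfacing later as an attribute or key error.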
@@ -264,8 +294,36 @@ def determine_mutations_in_sequence(true_sequence, observed_sequence):
     return observed_sequence_mutation_df
 
 
-def get_mutation_profile(match_set_whitelist_reporter_observed_sequence_counter_series_results: MatchSetWhitelistReporterObservedSequenceCounterSeriesResults, whitelist_reporter_df: pd.DataFrame, contains_surrogate: bool, contains_guide_barcode: bool) -> MatchSetWhitelistReporterObservedSequenceMutationProfiles:
-
+def get_mutation_profile(match_set_whitelist_reporter_observed_sequence_counter_series_results: MatchSetWhitelistReporterObservedSequenceCounterSeriesResults, whitelist_reporter_df: pd.DataFrame, contains_surrogate: bool, contains_guide_barcode: bool) -> MatchSetWhitelistReporterObservedSequenceMutationProfiles:
+    """Compute per-position mutation profiles from allele count series.
+
+    Given the allele Series built by ``get_matchset_alleleseries``, this walks
+    each (whitelist, observed_allele) pair and records per-base mutations
+    against the whitelist reference, producing both linked (allele-level) and
+    unlinked (position-level) mutation tables for all nine counting strategies.
+
+    Parameters
+    ----------
+    match_set_whitelist_reporter_observed_sequence_counter_series_results
+        Return value of ``get_matchset_alleleseries``.
+    whitelist_reporter_df
+        The same DataFrame that was passed into the mapping call. Used as the
+        reference sequence for computing mutations.
+    contains_surrogate, contains_guide_barcode
+        Must match the mapping configuration.
+
+    Returns
+    -------
+    MatchSetWhitelistReporterObservedSequenceMutationProfiles
+        Mutation tables per strategy. Consume via
+        ``tally_linked_mutation_count_per_sequence`` for aggregate counters, or
+        drive ``visualization.plot_mutation_count_histogram`` /
+        ``plot_trinucleotide_mutational_signature`` directly.
+
+    See Also
+    --------
+    tally_linked_mutation_count_per_sequence
+    """
     # Function to generate unlinked mutations for particular count type
     def generate_mutations_results(alleleseries: Optional[GeneralAlleleCountSeriesDict], whitelist_reporter_df: pd.DataFrame, contains_surrogate: bool, contains_guide_barcode: bool) -> Optional[ObservedSequenceMutationProfile]:
         if alleleseries is not None:
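The distinction between linked (allele-level) and unlinked (position-level) tables can be sketched with plain Python containers. This is a simplified reading of the docstring, not the package's implementation: the function names, the tuple layout, and the 0-based positions are all assumptions, and the real code returns DataFrames rather than dicts and counters.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (0-based position, reference base, observed base); layout is an assumption.
Mutation = Tuple[int, str, str]


def per_base_mutations(true_sequence: str, observed_sequence: str) -> List[Mutation]:
    """List every position where the observed base differs from the reference."""
    if len(true_sequence) != len(observed_sequence):
        raise ValueError("sequences must have equal length")
    return [
        (i, ref, alt)
        for i, (ref, alt) in enumerate(zip(true_sequence, observed_sequence))
        if ref != alt
    ]


def linked_and_unlinked(reference: str, allele_counts: Dict[str, int]):
    """Build allele-level (linked) and position-level (unlinked) summaries."""
    linked: Dict[str, Tuple[Tuple[Mutation, ...], int]] = {}
    unlinked: Counter = Counter()
    for allele, count in allele_counts.items():
        mutations = tuple(per_base_mutations(reference, allele))
        linked[allele] = (mutations, count)   # mutations stay grouped per allele
        for mutation in mutations:
            unlinked[mutation] += count       # mutations pooled across alleles
    return linked, unlinked


reference = "ACGT"
allele_counts = {"ACGT": 10, "ATGT": 3, "ATGA": 2}
linked, unlinked = linked_and_unlinked(reference, allele_counts)
print(linked["ATGA"])
print(unlinked[(1, "C", "T")])
```

In this toy input the C-to-T change at position 1 appears in two alleles, so the unlinked tally pools their counts while the linked view keeps the co-occurrence with the T-to-A change visible.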
