
Commit 718ea1e

Merge pull request #114 from MannLabs/developer_readme
Developer readme
2 parents e76f5db + b576247 commit 718ea1e

8 files changed

Lines changed: 386 additions & 111 deletions


DEVELOPERS.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
1+
# For Developers: Modifying AlphaQuant
2+
3+
AlphaQuant is designed with modularity in mind to allow practitioners to introduce alternative numerical methods for each module. The codebase follows clear interfaces that make it straightforward to extend or replace statistical methods at different levels of the analysis pipeline.
4+
5+
## ⚠️ Important: Benchmarking and Validation
6+
7+
**Any changes to statistical methods should be thoroughly benchmarked and fine-tuned before use in production analyses.** The default methods in AlphaQuant have been extensively tested and validated on diverse proteomics datasets. When implementing alternative approaches, ensure you carry out appropriate benchmarking using ground truth datasets (e.g., spike-in experiments, mixed-species samples) and evaluate key performance metrics (sensitivity, specificity, false discovery rates, reproducibility).
8+
9+
10+
## 1. Ion-Level Statistical Testing
11+
12+
**Where to modify:** `alphaquant/diffquant/diff_analysis.py`
13+
14+
**How it works:** Each ion (fragment, peptide, etc.) is tested independently for differential expression. The test produces three key outputs: `p_val` (p-value), `fc` (log2 fold change), and `z_val` (z-score for aggregation).
15+
16+
**Main class:**
17+
- **`DifferentialIon`** - The default method that uses intensity-dependent empirical background distributions to compute p-values and z-scores. It accounts for technical variation by comparing observed fold changes against distributions derived from similarly abundant ions in the dataset. The core statistical logic is in the `_calc_diffreg_peptide()` method.
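To illustrate the general idea of scoring a fold change against an empirical background, here is a minimal sketch. It assumes a flat list of background fold changes. The function names `empirical_two_sided_pval` and `signed_z_from_pval` are hypothetical, and the code is not the logic of `_calc_diffreg_peptide()`, which additionally selects the background by ion intensity.

```python
from statistics import NormalDist


def empirical_two_sided_pval(observed_fc, background_fcs):
    """Fraction of background fold changes at least as extreme as the observed one.

    Hypothetical helper for illustration only. A pseudo-count of 1 keeps the
    p-value strictly above zero for finite backgrounds.
    """
    n_extreme = sum(1 for fc in background_fcs if abs(fc) >= abs(observed_fc))
    return (n_extreme + 1) / (len(background_fcs) + 1)


def signed_z_from_pval(p_val, fc):
    """Map a two-sided p-value to a z-score that carries the sign of the fold change."""
    z = NormalDist().inv_cdf(1.0 - p_val / 2.0)
    return z if fc >= 0 else -z


# Usage: the background stands in for fold changes of similarly abundant ions.
background = [-0.3, -0.1, 0.0, 0.1, 0.2, 0.4, -0.2, 0.05, 0.15]
p_val = empirical_two_sided_pval(1.0, background)
z_val = signed_z_from_pval(p_val, fc=1.0)
```

The signed z-score matters because downstream aggregation sums z-values, so ions regulated in opposite directions should cancel rather than reinforce each other.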
18+
19+
**How to extend:** We've included `DifferentialIonTTest` in the same file as example code demonstrating how to implement alternative tests. This variant uses Welch's t-test with robust variance estimation. Note that this example has not been extensively benchmarked and is included for educational purposes to demonstrate the interface.
20+
21+
1. Create a new class (e.g., `DifferentialIonMyMethod`) with the same interface:
22+
- `__init__()` should accept `(noNanvals_from, noNanvals_to, ...)` and any method-specific parameters
23+
- Set attributes: `name`, `p_val`, `fc`, `z_val`, `usable`
24+
2. Implement your statistical test in a method (e.g., `_calc_mymethod()`)
25+
3. Modify `alphaquant/diffquant/condpair_analysis.py` (lines 67-70) to instantiate your class
26+
4. Optionally, add a parameter to `run_pipeline()` to select between methods
27+
28+
The key requirement is that your class must output `p_val`, `fc`, and `z_val` for each ion, because the tree aggregation framework consumes these three values.
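The interface described above can be sketched as follows. This is a hypothetical skeleton, not code from AlphaQuant: the `name` and `min_reps` parameters, the fold-change direction (`from` minus `to`), and the normal approximation to the Welch statistic are all assumptions made for illustration, and the method has not been benchmarked.

```python
import math
from statistics import NormalDist, mean, variance


class DifferentialIonMyMethod:
    """Hypothetical ion-level test exposing name, p_val, fc, z_val, and usable."""

    def __init__(self, noNanvals_from, noNanvals_to, name="my_ion", min_reps=2):
        self.name = name
        self.p_val = None
        self.fc = None
        self.z_val = None
        # Assumption: an ion needs at least `min_reps` values per condition.
        self.usable = len(noNanvals_from) >= min_reps and len(noNanvals_to) >= min_reps
        if self.usable:
            self._calc_mymethod(noNanvals_from, noNanvals_to)

    def _calc_mymethod(self, vals_from, vals_to):
        # Inputs are assumed to be log2 intensities, so a difference is a log2 fold change.
        self.fc = mean(vals_from) - mean(vals_to)
        std_err = math.sqrt(
            variance(vals_from) / len(vals_from) + variance(vals_to) / len(vals_to)
        )
        if std_err == 0:
            self.usable = False  # no variance estimate, so no test is possible
            return
        stat = self.fc / std_err
        # Normal approximation to the Welch statistic; anti-conservative for few replicates.
        self.p_val = 2.0 * (1.0 - NormalDist().cdf(abs(stat)))
        self.z_val = stat


ion = DifferentialIonMyMethod([10.0, 10.2, 9.8], [8.0, 8.1, 7.9], name="PEPTIDE_frag_y5")
```

Marking an ion as not `usable` instead of raising keeps the decision about skipping ions with the caller.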
29+
30+
## 2. Tree-Based Ion Propagation
31+
32+
**Where to modify:** `alphaquant/cluster/cluster_utils.py` and `alphaquant/cluster/cluster_ions.py`
33+
34+
**How it works:** Statistics from child nodes are aggregated to their parent nodes in a hierarchical tree (e.g., fragments → peptides → proteins). Z-values are combined using Stouffer's method, and fold changes are summarized using medians.
35+
36+
**Key functions:**
37+
- **`aggregate_node_properties()`** - The core function that propagates statistics up the tree. It combines z-values, fold changes, and quality metrics from children to parents.
38+
- **`sum_and_re_scale_zvalues()`** - Implements Stouffer's Z-score method: sums z-values and divides by sqrt(n), then rescales to maintain standard normal distribution.
39+
- **`transform_znormed_to_pval()`** - Converts aggregated z-scores back to two-sided p-values.
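As a worked illustration of the two statistical steps named above, the following sketch combines z-values with the unweighted Stouffer formula and converts the result to a two-sided p-value. The function names `stouffer_z` and `two_sided_pval` are hypothetical stand-ins. The actual `sum_and_re_scale_zvalues()` may include additional rescaling that is not reproduced here.

```python
import math
from statistics import NormalDist


def stouffer_z(zvals):
    """Unweighted Stouffer combination: sum of z-values divided by sqrt(n)."""
    zvals = list(zvals)
    if len(zvals) == 1:
        return zvals[0]  # nothing to combine
    return sum(zvals) / math.sqrt(len(zvals))


def two_sided_pval(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Four fragments that each give moderate evidence (z = 2) combine into strong evidence.
combined = stouffer_z([2.0, 2.0, 2.0, 2.0])
p_val = two_sided_pval(combined)
```

Under the null hypothesis and with independent inputs, the combined value is again standard normal, which is why the p-value conversion can reuse the normal CDF.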
40+
41+
**How to extend:** If you want to use different aggregation methods:
42+
1. Modify `sum_and_re_scale_zvalues()` to implement your preferred meta-analysis method (e.g., Fisher's method, weighted Z-scores, etc.)
43+
2. If your method changes the distribution, update `transform_znormed_to_pval()` accordingly
44+
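3. For fold-change aggregation, modify the `node.fc = np.median(fcs)` assignment in `aggregate_node_properties()` (locate it by the statement rather than by line number, because the line number shifts when docstrings are edited)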
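One alternative mentioned in step 1 is a weighted Z-score. A minimal sketch of the weighted Stouffer combination is shown below. The function name `weighted_stouffer_z` and the choice of weights are assumptions, and AlphaQuant does not prescribe how weights should be derived.

```python
import math


def weighted_stouffer_z(zvals, weights):
    """Weighted Stouffer combination: sum(w_i * z_i) / sqrt(sum(w_i ** 2))."""
    if len(zvals) != len(weights):
        raise ValueError("zvals and weights must have the same length")
    denom = math.sqrt(sum(w * w for w in weights))
    if denom == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(w * z for w, z in zip(weights, zvals)) / denom


# Example: weight the second child twice as strongly as the first.
combined = weighted_stouffer_z([1.0, 3.0], [1.0, 2.0])
```

Because the result is still standard normal under the null, `transform_znormed_to_pval()` would not need to change for this variant. A method such as Fisher's, which yields a chi-squared statistic, would require a different conversion.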
45+
46+
The tree traversal itself is in `cluster_ions.py`:
47+
- **`cluster_along_specified_levels()`** - Iterates through tree levels bottom-to-top
48+
- **`get_scored_clusterselected_ions()`** - Entry point for the hierarchical workflow
49+
50+
## 3. Multiple Testing Correction
51+
52+
**Where to modify:** `alphaquant/tables/diffquant_table.py` and `alphaquant/tables/proteoformtable.py`
53+
54+
**How it works:** FDR correction is applied separately to each result table during output generation. Every output table also includes the uncorrected p-values, so you can always recalculate q-values from the output files.
55+
56+
**Key functions:**
57+
- **Protein results** (`alphaquant/tables/diffquant_table.py`):
58+
- `_add_fdr_fc_based_set()` - Applies Benjamini-Hochberg to intensity-based proteins
59+
- `_add_fdr_counting_based_set()` - Applies adjusted Benjamini-Hochberg to proteins detected only via missing values
60+
61+
- **Proteoform results** (`alphaquant/tables/proteoformtable.py`):
62+
- `_annotate_fdr_column()` - Applies Benjamini-Hochberg to test if alternative proteoforms differ from the reference
63+
64+
**How to extend:**
65+
1. Modify the relevant function to use a different method (e.g., Bonferroni, Storey's q-value, etc.)
66+
2. Replace the `mt.multipletests(..., method='fdr_bh', ...)` call with your preferred correction
67+
3. Alternatively, use the p-values from output tables and apply your own correction externally
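For the external route in step 3, the sketch below recomputes Benjamini-Hochberg q-values and Bonferroni-adjusted p-values from a plain list of p-values using only the standard library. The function names are hypothetical. Reading the p-value column from an AlphaQuant output table is left out because the column names are not specified here.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices from smallest to largest p
    qvals = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        qvals[idx] = running_min
    return qvals


def bonferroni(pvals):
    """Return Bonferroni adjusted p-values, capped at 1."""
    n = len(pvals)
    return [min(1.0, p * n) for p in pvals]


pvals = [0.01, 0.04, 0.03, 0.005]
qvals_bh = benjamini_hochberg(pvals)
pvals_bonf = bonferroni(pvals)
```

Bonferroni controls the family-wise error rate and is therefore stricter than Benjamini-Hochberg, which controls the false discovery rate.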
68+
69+
## 4. Outlier Robustness
70+
71+
**Where to modify:** `alphaquant/diffquant/diff_analysis.py` and `alphaquant/cluster/cluster_utils.py`
72+
73+
**How it works:** AlphaQuant applies outlier correction at two levels to make results robust to technical variation and biological heterogeneity.
74+
75+
**Key functions:**
76+
- **`calc_outlier_scaling_factor()`** (in `diff_analysis.py`) - Compares between-replicate variance to expected technical variance and inflates estimates when replicates show unusual variability
77+
- **`remove_outlier_fragion_childs()`** (in `cluster_utils.py`) - Filters extreme fragments before aggregating to peptides (keeps the 5 most central fragments when >4 are available)
78+
79+
**How to extend:**
80+
1. Modify the scaling logic in `calc_outlier_scaling_factor()` to use different robust estimators
81+
2. Adjust `remove_outlier_fragion_childs()` to change how many fragments are retained or which criteria are used for selection
82+
3. Set `outlier_correction=False` in `run_pipeline()` to disable this feature entirely
83+
84+
## 5. Main Workflow Orchestration
85+
86+
**Where to modify:** `alphaquant/diffquant/condpair_analysis.py`
87+
88+
**How it works:** The `analyze_condpair()` function coordinates the complete pipeline for comparing two conditions.
89+
90+
**Pipeline steps:**
91+
1. Load and filter data for the two conditions
92+
2. Perform normalization (within and between conditions)
93+
3. Create empirical background distributions
94+
4. Compute ion-level differential statistics (`DifferentialIon` or `DifferentialIonTTest`)
95+
5. Build hierarchical trees and perform clustering to identify proteoforms
96+
6. Apply machine learning quality scoring (if enabled)
97+
7. Filter outlier peptides (if enabled)
98+
8. Generate output tables with FDR correction
99+
9. Create visualization plots
100+
101+
**How to extend:** This file shows how all components connect. To add custom preprocessing, normalization, or post-processing steps, modify this function or create a wrapper that calls it with modified data.
102+
103+
---
104+
105+
## Additional Resources
106+
107+
For general contribution guidelines, code style, and how to submit pull requests, please see [CONTRIBUTING.md](CONTRIBUTING.md).
108+
109+
For questions or discussions about extending AlphaQuant, please use the [GitHub Discussions](https://github.com/MannLabs/alphaquant/discussions) forum.
110+

README.md

Lines changed: 14 additions & 0 deletions
@@ -293,6 +293,20 @@ A manuscript has been submitted to bioRxiv:
293293
> Constantin Ammar, Marvin Thielert, Caroline A M Weiss, Edwin H Rodriguez, Maximilian T Strauss, Florian A Rosenberger, Wen-Feng Zeng, Matthias Mann
294294
> bioRxiv 2025.03.06.641844; doi: https://doi.org/10.1101/2025.03.06.641844
295295
296+
---
297+
## For Developers: Modifying AlphaQuant
298+
299+
AlphaQuant is designed with modularity in mind. If you want to implement alternative statistical methods, modify the tree-based propagation, or adjust multiple testing correction approaches, we provide clear interfaces at each level of the analysis pipeline.
300+
301+
For detailed documentation on how to extend or replace:
302+
- Ion-level statistical testing methods
303+
- Tree-based aggregation and z-value propagation
304+
- Multiple testing correction procedures
305+
- Outlier robustness filtering
306+
- Main workflow orchestration
307+
308+
Please see **[DEVELOPERS.md](DEVELOPERS.md)** for comprehensive guidance with code examples.
309+
296310
---
297311
## How to contribute
298312

alphaquant/cluster/cluster_ions.py

Lines changed: 47 additions & 0 deletions
@@ -29,6 +29,29 @@
2929

3030

3131
def get_scored_clusterselected_ions(gene_name, diffions, normed_c1, normed_c2, ion2diffDist, p2z, deedpair2doublediffdist, pval_threshold_basis, fcfc_threshold, take_median_ion, fcdiff_cutoff_clustermerge, fragment_outlier_filtering=True):
32+
"""Main entry point for hierarchical clustering and tree-based quantification of a protein.
33+
34+
This function creates a hierarchical tree structure from fragment ions up to the protein level
35+
(fragments → peptides → modified peptides → unmodified peptides → protein), performs statistical
36+
clustering at each level to identify proteoforms, and computes aggregated statistics.
37+
38+
Args:
39+
gene_name: Protein/gene identifier
40+
diffions: List of DifferentialIon objects for all ions belonging to this protein
41+
normed_c1: ConditionBackgrounds object for condition 1
42+
normed_c2: ConditionBackgrounds object for condition 2
43+
ion2diffDist: Dictionary mapping ion pairs to differential background distributions
44+
p2z: Cache dictionary for p-value to z-value conversions
45+
deedpair2doublediffdist: Cache for double-differential distributions used in clustering
46+
pval_threshold_basis: P-value threshold for determining if ions differ significantly
47+
fcfc_threshold: Fold-change difference threshold for clustering
48+
take_median_ion: If True, use median-centered ions for clustering
49+
fcdiff_cutoff_clustermerge: Fold-change threshold for merging similar clusters
50+
fragment_outlier_filtering: Whether to filter outlier fragments when aggregating to peptides
51+
52+
Returns:
53+
anytree.Node: Root node of the hierarchical tree containing all statistics and clustering results
54+
"""
3255
#typefilter = TypeFilter('successive')
3356

3457
global FCDIFF_CUTOFF_CLUSTERMERGE
@@ -92,6 +115,30 @@ def add_reduced_names_to_root(node):
92115

93116
import pandas as pd
94117
def cluster_along_specified_levels(root_node, ionname2diffion, normed_c1, normed_c2, ion2diffDist, p2z, deedpair2doublediffdist, pval_threshold_basis, fcfc_threshold, take_median_ion, fragment_outlier_filtering=True):#~60% of overall runtime
118+
"""Performs hierarchical clustering at each level of the tree from bottom to top.
119+
120+
Starting from base ions (fragments/MS1), this function iterates through each level
121+
of the tree hierarchy and performs statistical clustering to identify groups of ions
122+
with similar quantitative behavior (proteoforms). At each level, ions are tested
123+
pairwise for consistent fold-change differences, clustered hierarchically, and
124+
statistics are aggregated to parent nodes.
125+
126+
Args:
127+
root_node: Root of the hierarchical tree (protein level)
128+
ionname2diffion: Dictionary mapping ion names to DifferentialIon objects
129+
normed_c1: ConditionBackgrounds for condition 1
130+
normed_c2: ConditionBackgrounds for condition 2
131+
ion2diffDist: Dictionary of differential background distributions
132+
p2z: Cache for p-value to z-value conversions
133+
deedpair2doublediffdist: Cache for double-differential distributions
134+
pval_threshold_basis: P-value threshold for clustering decisions
135+
fcfc_threshold: Fold-change threshold for clustering
136+
take_median_ion: Whether to use median-centered ions
137+
fragment_outlier_filtering: Whether to filter fragment outliers
138+
139+
Returns:
140+
anytree.Node: The root node with all clustering annotations and aggregated statistics
141+
"""
95142
#typefilter object specifies filtering and clustering of the nodes
96143
aqcluster_utils.assign_properties_to_base_ions(root_node, ionname2diffion, normed_c1, normed_c2)
97144

alphaquant/cluster/cluster_utils.py

Lines changed: 52 additions & 9 deletions
@@ -20,13 +20,26 @@
2020

2121

2222
def aggregate_node_properties(node, only_use_mainclust, peptide_outlier_filtering=False, fragment_outlier_filtering=True):
23-
"""Goes through the children and summarizes their properties to the node
23+
"""Aggregates statistical properties from child nodes to a parent node in the tree.
24+
25+
This is the core function for propagating statistics up the hierarchical tree structure.
26+
It combines z-values, fold changes, and quality metrics from child nodes (e.g., peptides)
27+
into parent node (e.g., protein) statistics. The aggregation can optionally exclude
28+
proteoforms (non-main clusters) and filter outlier children.
2429
2530
Args:
26-
node ([type]): [description]
27-
only_use_mainclust (bool, optional): [description]. Defaults to True.
28-
peptide_outlier_filtering (bool, optional): Whether to filter outlier peptides. Defaults to False.
29-
fragment_outlier_filtering (bool, optional): Whether to filter outlier fragments. Defaults to True.
31+
node: The parent node whose properties will be computed from its children
32+
only_use_mainclust: If True, only use children in the main cluster (cluster==0),
33+
excluding proteoform variants
34+
peptide_outlier_filtering: If True and node is a protein, exclude peptides
35+
identified as statistical outliers (default: False)
36+
fragment_outlier_filtering: If True and node is a peptide, exclude extreme
37+
fragment ions before aggregation (default: True)
38+
39+
Side effects:
40+
Sets node.z_val, node.p_val, node.fc, node.cv, node.min_intensity,
41+
node.total_intensity, node.min_reps, node.fraction_consistent, and
42+
optionally node.ml_score based on aggregated child values.
3043
"""
3144
if only_use_mainclust:
3245
childs = [x for x in node.children if x.is_included & (x.cluster ==0)]
@@ -61,11 +74,8 @@ def aggregate_node_properties(node, only_use_mainclust, peptide_outlier_filterin
6174
node.z_val = z_normed
6275
node.p_val = p_val
6376

64-
# if node.type == "frgion":
65-
# node.fc = calc_weighted_fold_change_from_included_leaves_fcs(node)
66-
# else:
77+
6778
node.fc = np.median(fcs)
68-
#calc_fold_change_from_included_leaves_fcs(node) ## #np.median(fcs)#
6979
node.fraction_consistent = fraction_consistent
7080
node.cv = min(cvs)
7181
node.min_intensity = min_intensity
@@ -210,6 +220,18 @@ def get_median_peptides(pepnode2zval2numleaves): #least significant peptides are
210220
return [x[0] for x in pepnode2zval2numleaves[:median_idx+1]]
211221

212222
def remove_outlier_fragion_childs(childs):
223+
"""Filters extreme fragment ions before aggregating to peptide level.
224+
225+
When a peptide has many fragment ions, this function selects a subset to avoid
226+
bias from extreme outliers. For >4 fragments, it keeps the 5 most central fragments
227+
(ranked by z-value). For ≤4 fragments, all are retained.
228+
229+
Args:
230+
childs: List of fragment ion nodes (children of a peptide node)
231+
232+
Returns:
233+
list: Filtered subset of fragment ion nodes to use for aggregation
234+
"""
213235
zvals = get_feature_numpy_array_from_nodes(nodes=childs, feature_name="z_val")
214236
if aqvariables.PTM_FRAGMENT_SELECTION:
215237
sorted_idxs_zvals = np.argsort(np.abs(zvals))
@@ -235,6 +257,19 @@ def remove_outlier_fragion_childs(childs):
235257

236258

237259
def sum_and_re_scale_zvalues(zvals):
260+
"""Combines multiple z-values into a single aggregated z-value using Stouffer's method.
261+
262+
This implements Stouffer's Z-score method for meta-analysis: z-values are summed
263+
and divided by sqrt(n) to account for the number of tests. The result is then
264+
rescaled back to a standard normal distribution. This allows combining evidence
265+
from multiple ions/peptides while maintaining proper statistical interpretation.
266+
267+
Args:
268+
zvals: Array or list of z-values to combine
269+
270+
Returns:
271+
float: Combined z-value following a standard normal distribution under the null
272+
"""
238273
if len(zvals) == 1:
239274
return zvals[0] # No aggregation needed for single values - avoids floating-point precision errors
240275

@@ -245,6 +280,14 @@ def sum_and_re_scale_zvalues(zvals):
245280
return z_normed
246281

247282
def transform_znormed_to_pval(z_normed):
283+
"""Converts a z-score to a two-sided p-value.
284+
285+
Args:
286+
z_normed: Z-score from a standard normal distribution
287+
288+
Returns:
289+
float: Two-sided p-value. For z=0 returns 1.0, for large |z| returns small p-value.
290+
"""
248291
return 2.0 * (1.0 - NormalDist().cdf(abs(z_normed))) #we take the abs of the z_normed (normed means it belongs to a ND(0,1)), which means the cdf will return values between 0.5 and 1, and closer to 1 with increasing z_normed.
249292

250293

alphaquant/diffquant/condpair_analysis.py

Lines changed: 20 additions & 0 deletions
@@ -25,6 +25,26 @@
2525
LOGGER = logging.getLogger(__name__)
2626

2727
def analyze_condpair(*,runconfig, condpair):
28+
"""Main workflow orchestration for differential analysis of a condition pair.
29+
30+
This function coordinates the complete analysis pipeline for comparing two conditions:
31+
1. Loads and filters data for the two conditions
32+
2. Performs normalization (within and between conditions)
33+
3. Creates empirical background distributions
34+
4. Computes ion-level differential statistics
35+
5. Builds hierarchical trees and performs clustering to identify proteoforms
36+
6. Applies machine learning quality scoring (if enabled)
37+
7. Filters outlier peptides (if enabled)
38+
8. Generates output tables with FDR correction
39+
9. Creates visualization plots
40+
41+
Args:
42+
runconfig: Configuration object containing all analysis parameters (see run_pipeline docstring)
43+
condpair: Tuple of (condition1_name, condition2_name) to compare
44+
45+
Returns:
46+
tuple: (results_df, peptide_df) - DataFrames with protein and peptide-level results
47+
"""
2848
LOGGER.info(f"start processing condpair {condpair}")
2949
prot2diffions = defaultdict(list) #per default maps any key to empty list
3050
prot2missingval_diffions = defaultdict(list)
