MannLabs
diff --git a/DEVELOPERS.md b/DEVELOPERS.md
Lines changed: 110 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 14 additions & 0 deletions
diff --git a/alphaquant/cluster/cluster_ions.py b/alphaquant/cluster/cluster_ions.py
Lines changed: 47 additions & 0 deletions
diff --git a/alphaquant/cluster/cluster_utils.py b/alphaquant/cluster/cluster_utils.py
Lines changed: 52 additions & 9 deletions
diff --git a/alphaquant/diffquant/condpair_analysis.py b/alphaquant/diffquant/condpair_analysis.py
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,110 @@
+# For Developers: Modifying AlphaQuant
+
+AlphaQuant is designed with modularity in mind to allow practitioners to introduce alternative numerical methods for each module. The codebase follows clear interfaces that make it straightforward to extend or replace statistical methods at different levels of the analysis pipeline.
+
+## ⚠️ Important: Benchmarking and Validation
+
+**Any changes to statistical methods should be thoroughly benchmarked and fine-tuned before use in production analyses.** The default methods in AlphaQuant have been extensively tested and validated on diverse proteomics datasets. When implementing alternative approaches, ensure you carry out appropriate benchmarking using ground truth datasets (e.g., spike-in experiments, mixed-species samples) and evaluate key performance metrics (sensitivity, specificity, false discovery rates, reproducibility).
+
+
+## 1. Ion-Level Statistical Testing
+
+**Where to modify:** `alphaquant/diffquant/diff_analysis.py`
+
+**How it works:** Each ion (fragment, peptide, etc.) is tested independently for differential expression. The test produces three key outputs: `p_val` (p-value), `fc` (log2 fold change), and `z_val` (z-score for aggregation).
+
+**Main class:**
+- **`DifferentialIon`** - The default method that uses intensity-dependent empirical background distributions to compute p-values and z-scores. It accounts for technical variation by comparing observed fold changes against distributions derived from similarly abundant ions in the dataset. The core statistical logic is in the `_calc_diffreg_peptide()` method.
+
+**How to extend:** We've included `DifferentialIonTTest` in the same file as example code demonstrating how to implement alternative tests. This variant uses Welch's t-test with robust variance estimation. Note that this example has not been extensively benchmarked and is included for educational purposes to demonstrate the interface.
+
+1. Create a new class (e.g., `DifferentialIonMyMethod`) with the same interface:
+   - `__init__()` should accept `(noNanvals_from, noNanvals_to, ...)` and any method-specific parameters
+   - Set attributes: `name`, `p_val`, `fc`, `z_val`, `usable`
+2. Implement your statistical test in a method (e.g., `_calc_mymethod()`)
+3. Modify `alphaquant/diffquant/condpair_analysis.py` (lines 67-70) to instantiate your class
+4. Optionally, add a parameter to `run_pipeline()` to select between methods
+
+The key requirement is that your class must output `p_val`, `fc`, and `z_val` for each ion—these are used by the tree aggregation framework.
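The interface described above could be sketched as follows. This is a minimal illustration, not AlphaQuant code: the class name `DifferentialIonZTest`, the `min_reps` parameter, the exact constructor signature, and the sign convention of `fc` (here "from" minus "to") are assumptions, and the test itself is a plain normal approximation on the difference of means rather than the empirical-background method used by `DifferentialIon`. The inputs are assumed to be log2 intensities with missing values already removed.

```python
import math
from statistics import NormalDist, mean, variance


class DifferentialIonZTest:
    """Hypothetical ion-level test exposing name, p_val, fc, z_val and usable."""

    def __init__(self, noNanvals_from, noNanvals_to, name, min_reps=2):
        self.name = name
        self.p_val = None
        self.fc = None
        self.z_val = None
        # A sample variance needs at least two values per condition.
        required = max(min_reps, 2)
        self.usable = len(noNanvals_from) >= required and len(noNanvals_to) >= required
        if self.usable:
            self._calc_ztest(noNanvals_from, noNanvals_to)

    def _calc_ztest(self, vals_from, vals_to):
        # Assumption: values are log2 intensities, so a difference of means
        # is a log2 fold change.
        fold_change = mean(vals_from) - mean(vals_to)
        std_err = math.sqrt(
            variance(vals_from) / len(vals_from) + variance(vals_to) / len(vals_to)
        )
        if std_err == 0:
            # No variation at all: the statistic is undefined, mark as unusable.
            self.usable = False
            return
        self.fc = fold_change
        self.z_val = fold_change / std_err
        # Two-sided p-value under a standard normal null.
        self.p_val = 2.0 * (1.0 - NormalDist().cdf(abs(self.z_val)))


ion = DifferentialIonZTest([10.0, 11.0, 12.0], [8.0, 9.0, 10.0], name="ion_a")
print(ion.name, ion.usable, ion.fc, ion.z_val, ion.p_val)
```

Whatever test is plugged in, the signed `z_val` is the quantity that later aggregation steps consume, so its sign should agree with the sign of `fc`.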
+
+## 2. Tree-Based Ion Propagation
+
+**Where to modify:** `alphaquant/cluster/cluster_utils.py` and `alphaquant/cluster/cluster_ions.py`
+
+**How it works:** Statistics from child nodes are aggregated to their parent nodes in a hierarchical tree, level by level (e.g., fragments → peptides → proteins). Z-values are combined using Stouffer's method, and fold changes are summarized by their median.
+
+**Key functions:**
+- **`aggregate_node_properties()`** - The core function that propagates statistics up the tree. It combines z-values, fold changes, and quality metrics from children to parents.
+- **`sum_and_re_scale_zvalues()`** - Implements Stouffer's Z-score method: sums the z-values and divides by sqrt(n), then rescales the result so that it follows a standard normal distribution.
+- **`transform_znormed_to_pval()`** - Converts aggregated z-scores back to two-sided p-values.
+
+**How to extend:** If you want to use different aggregation methods:
+1. Modify `sum_and_re_scale_zvalues()` to implement your preferred meta-analysis method (e.g., Fisher's method, weighted Z-scores, etc.)
+2. If your method changes the distribution, update `transform_znormed_to_pval()` accordingly
+3. For fold-change aggregation, modify line 67 in `aggregate_node_properties()` where `node.fc = np.median(fcs)` is set
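As a rough illustration of the aggregation described in this section, the sketch below implements textbook Stouffer combination, an optional weighted variant, and the two-sided p-value conversion. The function names `stouffer_combine` and `two_sided_pval` are hypothetical, and the sketch deliberately omits the additional rescaling step that `sum_and_re_scale_zvalues()` performs, so it should not be read as a copy of the AlphaQuant implementation.

```python
import math
from statistics import NormalDist


def stouffer_combine(zvals, weights=None):
    """Combine z-values as sum(w_i * z_i) / sqrt(sum(w_i ** 2)).

    With equal weights this reduces to sum(z) / sqrt(n).
    """
    if len(zvals) == 0:
        raise ValueError("at least one z-value is required")
    if weights is None:
        weights = [1.0] * len(zvals)
    if len(weights) != len(zvals):
        raise ValueError("weights and zvals must have the same length")
    numerator = sum(w * z for w, z in zip(weights, zvals))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator


def two_sided_pval(z):
    """Two-sided p-value of a z-score under a standard normal null."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


combined = stouffer_combine([1.0, 1.0, 1.0, 1.0])
print(combined, two_sided_pval(combined))
```

The plain formula assumes independent z-values. Fragments of the same peptide are usually correlated, which is one reason any replacement method should be benchmarked as recommended at the top of this document.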
+
+The tree traversal itself is in `cluster_ions.py`:
+- **`cluster_along_specified_levels()`** - Iterates through tree levels bottom-to-top
+- **`get_scored_clusterselected_ions()`** - Entry point for the hierarchical workflow
+
+## 3. Multiple Testing Correction
+
+**Where to modify:** `alphaquant/tables/diffquant_table.py` and `alphaquant/tables/proteoformtable.py`
+
+**How it works:** FDR correction is applied separately to each result table during output generation. Every output table also contains the uncorrected p-values, so q-values can always be recalculated from the output files.
+
+**Key functions:**
+- **Protein results** (`alphaquant/tables/diffquant_table.py`):
+  - `_add_fdr_fc_based_set()` - Applies Benjamini-Hochberg to intensity-based proteins
+  - `_add_fdr_counting_based_set()` - Applies adjusted Benjamini-Hochberg to proteins detected only via missing values
+
+- **Proteoform results** (`alphaquant/tables/proteoformtable.py`):
+  - `_annotate_fdr_column()` - Applies Benjamini-Hochberg to test if alternative proteoforms differ from the reference
+
+**How to extend:**
+1. Modify the relevant function to use a different method (e.g., Bonferroni, Storey's q-value, etc.)
+2. Replace the `mt.multipletests(..., method='fdr_bh', ...)` call with your preferred correction
+3. Alternatively, use the p-values from output tables and apply your own correction externally
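For the external route in step 3, a Benjamini-Hochberg adjustment can be written in a few lines of plain Python. The sketch below is a generic implementation of the standard procedure; the function name `benjamini_hochberg` is hypothetical, and reading the p-value column from an AlphaQuant output table is left out because the column names are not specified here.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    n = len(pvals)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: pvals[i])
    qvals = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        qvals[idx] = running_min
    return qvals


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Swapping in another correction (for example Bonferroni, which multiplies each p-value by `n` and caps at 1) only requires changing the body of this one function.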
+
+## 4. Outlier Robustness
+
+**Where to modify:** `alphaquant/diffquant/diff_analysis.py` and `alphaquant/cluster/cluster_utils.py`
+
+**How it works:** AlphaQuant applies outlier correction at two levels to make results robust to technical variation and biological heterogeneity.
+
+**Key functions:**
+- **`calc_outlier_scaling_factor()`** (in `diff_analysis.py`) - Compares between-replicate variance to expected technical variance and inflates estimates when replicates show unusual variability
+- **`remove_outlier_fragion_childs()`** (in `cluster_utils.py`) - Filters extreme fragments before aggregating to peptides (keeps the 5 most central fragments when >4 are available)
+
+**How to extend:**
+1. Modify the scaling logic in `calc_outlier_scaling_factor()` to use different robust estimators
+2. Adjust `remove_outlier_fragion_childs()` to change how many fragments are retained or which criteria are used for selection
+3. Set `outlier_correction=False` in `run_pipeline()` to disable this feature entirely
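To make the fragment-selection idea concrete, here is a small sketch that keeps the values closest to the median. The function name `keep_most_central`, the `num_keep` parameter, and the distance-to-median criterion are assumptions chosen for illustration; the actual ranking in `remove_outlier_fragion_childs()` may differ and should be checked in the source before modifying it.

```python
from statistics import median


def keep_most_central(zvals, num_keep=5):
    """Keep the num_keep z-values closest to the median, in original order."""
    if len(zvals) <= num_keep:
        # Too few values to filter: retain everything.
        return list(zvals)
    center = median(zvals)
    # Rank indices by distance from the median (closest first).
    ranked = sorted(range(len(zvals)), key=lambda i: abs(zvals[i] - center))
    kept = sorted(ranked[:num_keep])
    return [zvals[i] for i in kept]


print(keep_most_central([-5.0, -1.0, 0.0, 0.5, 1.0, 2.0, 9.0]))
```

Changing `num_keep` or replacing the median with another robust center (such as a trimmed mean) are the two most direct ways to adjust this kind of filter.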
+
+## 5. Main Workflow Orchestration
+
+**Where to modify:** `alphaquant/diffquant/condpair_analysis.py`
+
+**How it works:** The `analyze_condpair()` function coordinates the complete pipeline for comparing two conditions.
+
+**Pipeline steps:**
+1. Load and filter data for the two conditions
+2. Perform normalization (within and between conditions)
+3. Create empirical background distributions
+4. Compute ion-level differential statistics (`DifferentialIon` or `DifferentialIonTTest`)
+5. Build hierarchical trees and perform clustering to identify proteoforms
+6. Apply machine learning quality scoring (if enabled)
+7. Filter outlier peptides (if enabled)
+8. Generate output tables with FDR correction
+9. Create visualization plots
+
+**How to extend:** This file shows how all components connect. To add custom preprocessing, normalization, or post-processing steps, modify this function or create a wrapper that calls it with modified data.
+
+---
+
+## Additional Resources
+
+For general contribution guidelines, code style, and how to submit pull requests, please see [CONTRIBUTING.md](CONTRIBUTING.md).
+
+For questions or discussions about extending AlphaQuant, please use the [GitHub Discussions](https://github.com/MannLabs/alphaquant/discussions) forum.
+
@@ -293,6 +293,20 @@ A manuscript has been submitted to bioRxiv:
 > Constantin Ammar, Marvin Thielert, Caroline A M Weiss, Edwin H Rodriguez, Maximilian T Strauss, Florian A Rosenberger, Wen-Feng Zeng, Matthias Mann
 > bioRxiv 2025.03.06.641844; doi: https://doi.org/10.1101/2025.03.06.641844
 
+---
+## For Developers: Modifying AlphaQuant
+
+AlphaQuant is designed with modularity in mind. If you want to implement alternative statistical methods, modify the tree-based propagation, or adjust multiple testing correction approaches, we provide clear interfaces at each level of the analysis pipeline.
+
+For detailed documentation on how to extend or replace:
+- Ion-level statistical testing methods
+- Tree-based aggregation and z-value propagation
+- Multiple testing correction procedures
+- Outlier robustness filtering
+- Main workflow orchestration
+
+Please see **[DEVELOPERS.md](DEVELOPERS.md)** for comprehensive guidance with code examples.
+
 ---
 ## How to contribute
 
 
@@ -29,6 +29,29 @@
 
 
 def get_scored_clusterselected_ions(gene_name, diffions, normed_c1, normed_c2, ion2diffDist, p2z, deedpair2doublediffdist, pval_threshold_basis, fcfc_threshold, take_median_ion, fcdiff_cutoff_clustermerge, fragment_outlier_filtering=True):
+    """Main entry point for hierarchical clustering and tree-based quantification of a protein.
+
+    This function creates a hierarchical tree structure from fragment ions up to the protein level
+    (fragments → peptides → modified peptides → unmodified peptides → protein), performs statistical
+    clustering at each level to identify proteoforms, and computes aggregated statistics.
+
+    Args:
+        gene_name: Protein/gene identifier
+        diffions: List of DifferentialIon objects for all ions belonging to this protein
+        normed_c1: ConditionBackgrounds object for condition 1
+        normed_c2: ConditionBackgrounds object for condition 2
+        ion2diffDist: Dictionary mapping ion pairs to differential background distributions
+        p2z: Cache dictionary for p-value to z-value conversions
+        deedpair2doublediffdist: Cache for double-differential distributions used in clustering
+        pval_threshold_basis: P-value threshold for determining if ions differ significantly
+        fcfc_threshold: Fold-change difference threshold for clustering
+        take_median_ion: If True, use median-centered ions for clustering
+        fcdiff_cutoff_clustermerge: Fold-change threshold for merging similar clusters
+        fragment_outlier_filtering: Whether to filter outlier fragments when aggregating to peptides
+
+    Returns:
+        anytree.Node: Root node of the hierarchical tree containing all statistics and clustering results
+    """
     #typefilter = TypeFilter('successive')
 
     global FCDIFF_CUTOFF_CLUSTERMERGE
@@ -92,6 +115,30 @@ def add_reduced_names_to_root(node):
 
 import pandas as pd
 def cluster_along_specified_levels(root_node, ionname2diffion, normed_c1, normed_c2, ion2diffDist, p2z, deedpair2doublediffdist, pval_threshold_basis, fcfc_threshold, take_median_ion, fragment_outlier_filtering=True):#~60% of overall runtime
+    """Performs hierarchical clustering at each level of the tree from bottom to top.
+
+    Starting from base ions (fragments/MS1), this function iterates through each level
+    of the tree hierarchy and performs statistical clustering to identify groups of ions
+    with similar quantitative behavior (proteoforms). At each level, ions are tested
+    pairwise for consistent fold-change differences, clustered hierarchically, and
+    statistics are aggregated to parent nodes.
+
+    Args:
+        root_node: Root of the hierarchical tree (protein level)
+        ionname2diffion: Dictionary mapping ion names to DifferentialIon objects
+        normed_c1: ConditionBackgrounds for condition 1
+        normed_c2: ConditionBackgrounds for condition 2
+        ion2diffDist: Dictionary of differential background distributions
+        p2z: Cache for p-value to z-value conversions
+        deedpair2doublediffdist: Cache for double-differential distributions
+        pval_threshold_basis: P-value threshold for clustering decisions
+        fcfc_threshold: Fold-change threshold for clustering
+        take_median_ion: Whether to use median-centered ions
+        fragment_outlier_filtering: Whether to filter fragment outliers
+
+    Returns:
+        anytree.Node: The root node with all clustering annotations and aggregated statistics
+    """
     #typefilter object specifies filtering and clustering of the nodes
     aqcluster_utils.assign_properties_to_base_ions(root_node, ionname2diffion, normed_c1, normed_c2)
 
 
@@ -20,13 +20,26 @@
 
 
 def aggregate_node_properties(node, only_use_mainclust, peptide_outlier_filtering=False, fragment_outlier_filtering=True):
-    """Goes through the children and summarizes their properties to the node
+    """Aggregates statistical properties from child nodes to a parent node in the tree.
+
+    This is the core function for propagating statistics up the hierarchical tree structure.
+    It combines z-values, fold changes, and quality metrics from child nodes (e.g., peptides)
+    into parent node (e.g., protein) statistics. The aggregation can optionally exclude
+    proteoforms (non-main clusters) and filter outlier children.
 
     Args:
-        node ([type]): [description]
-        only_use_mainclust (bool, optional): [description]. Defaults to True.
-        peptide_outlier_filtering (bool, optional): Whether to filter outlier peptides. Defaults to False.
-        fragment_outlier_filtering (bool, optional): Whether to filter outlier fragments. Defaults to True.
+        node: The parent node whose properties will be computed from its children
+        only_use_mainclust: If True, only use children in the main cluster (cluster==0),
+                          excluding proteoform variants
+        peptide_outlier_filtering: If True and node is a protein, exclude peptides
+                                  identified as statistical outliers (default: False)
+        fragment_outlier_filtering: If True and node is a peptide, exclude extreme
+                                   fragment ions before aggregation (default: True)
+
+    Side effects:
+        Sets node.z_val, node.p_val, node.fc, node.cv, node.min_intensity,
+        node.total_intensity, node.min_reps, node.fraction_consistent, and
+        optionally node.ml_score based on aggregated child values.
     """
     if only_use_mainclust:
         childs = [x for x in node.children if x.is_included & (x.cluster ==0)]
@@ -61,11 +74,8 @@ def aggregate_node_properties(node, only_use_mainclust, peptide_outlier_filterin
     node.z_val = z_normed
     node.p_val = p_val
 
-    # if node.type == "frgion":
-    #     node.fc = calc_weighted_fold_change_from_included_leaves_fcs(node)
-    # else:
+
     node.fc = np.median(fcs)
-    #calc_fold_change_from_included_leaves_fcs(node) ##  #np.median(fcs)#
     node.fraction_consistent = fraction_consistent
     node.cv = min(cvs)
     node.min_intensity = min_intensity
@@ -210,6 +220,18 @@ def get_median_peptides(pepnode2zval2numleaves): #least significant peptides are
         return [x[0] for x in pepnode2zval2numleaves[:median_idx+1]]
 
 def remove_outlier_fragion_childs(childs):
+    """Filters extreme fragment ions before aggregating to peptide level.
+
+    When a peptide has many fragment ions, this function selects a subset to avoid
+    bias from extreme outliers. For >4 fragments, it keeps the 5 most central fragments
+    (ranked by z-value). For ≤4 fragments, all are retained.
+
+    Args:
+        childs: List of fragment ion nodes (children of a peptide node)
+
+    Returns:
+        list: Filtered subset of fragment ion nodes to use for aggregation
+    """
     zvals = get_feature_numpy_array_from_nodes(nodes=childs, feature_name="z_val")
     if aqvariables.PTM_FRAGMENT_SELECTION:
         sorted_idxs_zvals = np.argsort(np.abs(zvals))
@@ -235,6 +257,19 @@ def remove_outlier_fragion_childs(childs):
 
 
 def sum_and_re_scale_zvalues(zvals):
+    """Combines multiple z-values into a single aggregated z-value using Stouffer's method.
+
+    This implements Stouffer's Z-score method for meta-analysis: z-values are summed
+    and divided by sqrt(n) to account for the number of tests. The result is then
+    rescaled back to a standard normal distribution. This allows combining evidence
+    from multiple ions/peptides while maintaining proper statistical interpretation.
+
+    Args:
+        zvals: Array or list of z-values to combine
+
+    Returns:
+        float: Combined z-value following a standard normal distribution under the null
+    """
     if len(zvals) == 1:
         return zvals[0]  # No aggregation needed for single values - avoids floating-point precision errors
 
@@ -245,6 +280,14 @@ def sum_and_re_scale_zvalues(zvals):
     return z_normed
 
 def transform_znormed_to_pval(z_normed):
+    """Converts a z-score to a two-sided p-value.
+
+    Args:
+        z_normed: Z-score from a standard normal distribution
+
+    Returns:
+        float: Two-sided p-value. For z=0 returns 1.0, for large |z| returns small p-value.
+    """
     return 2.0 * (1.0 - NormalDist().cdf(abs(z_normed))) #we take the abs of the z_normed (normed means it belongs to a ND(0,1)), which means the cdf will return values between 0.5 and 1, and closer to 1 with increasing z_normed.
 
 
 
@@ -25,6 +25,26 @@
 LOGGER = logging.getLogger(__name__)
 
 def analyze_condpair(*,runconfig, condpair):
+    """Main workflow orchestration for differential analysis of a condition pair.
+
+    This function coordinates the complete analysis pipeline for comparing two conditions:
+    1. Loads and filters data for the two conditions
+    2. Performs normalization (within and between conditions)
+    3. Creates empirical background distributions
+    4. Computes ion-level differential statistics
+    5. Builds hierarchical trees and performs clustering to identify proteoforms
+    6. Applies machine learning quality scoring (if enabled)
+    7. Filters outlier peptides (if enabled)
+    8. Generates output tables with FDR correction
+    9. Creates visualization plots
+
+    Args:
+        runconfig: Configuration object containing all analysis parameters (see run_pipeline docstring)
+        condpair: Tuple of (condition1_name, condition2_name) to compare
+
+    Returns:
+        tuple: (results_df, peptide_df) - DataFrames with protein and peptide-level results
+    """
     LOGGER.info(f"start processing condpair {condpair}")
     prot2diffions = defaultdict(list) #per default maps any key to empty list
     prot2missingval_diffions = defaultdict(list)