CLUES.xMSannotator is a network-based annotation pipeline for LC-MS metabolomics data. Starting from a peak table (m/z, RT, intensities) and a compound database, it assigns putative identities to detected features through mass matching, co-abundance network analysis, isotope pattern verification, chemical scoring, pathway enrichment, confidence assignment, and redundancy filtering.
This pipeline is a customized fork of RECETOX/recetox-xMSannotator, which is itself a refactored version of the original xMSannotator R package by Karan Uppal (Emory University). Key modifications in this fork include:
- Evidence ceiling (Approach C) for post-hoc confidence adjustment
- WGCNA module and RT coherence filtering
- Support for
boosted_compounds(user-verified standards → Confidence 4) - Multi-evidence chemical scoring with isotope, adduct, and correlation weighting
- Intermediate stage output files (
Stage1_*.txtthroughStage5_*.txt) for inspection
The entry point is advanced_annotation() in R/advanced_annotation.R.
Citation: Uppal K, Walker DI, Jones DP. xMSannotator: an R package for network-based annotation of high-resolution metabolomics data. Analytical Chemistry. 2017;89(2):1063-1067.
A minimal annotation run requires a peak table (m/z, RT, and sample intensity columns) and a compound database (compound_id, name, molecular_formula, monoisotopic_mass). The following example shows a positive-mode run with the most important parameters:
library(CLUES.xMSannotator)
# Load your data
peak_table <- data.table::fread("feature_table.txt", data.table = FALSE)
compound_table <- readxl::read_xlsx("compound_database.xlsx")
# Filter adduct table to positive-mode adducts
adduct_table <- CLUES.xMSannotator:::sample_adduct_table
adduct_table <- adduct_table[adduct_table$adduct %in%
c("M+H", "M+2H", "M+NH4", "M+Na", "M+ACN+H", "2M+H", "M+H-H2O"), ]
# Build adduct weights (all selected adducts, equal weight)
adduct_weights <- data.frame(adduct = adduct_table$adduct, weight = rep(1, nrow(adduct_table)))
annotation <- advanced_annotation(
peak_table = peak_table, # features × samples intensity matrix
compound_table = compound_table, # database of candidate compounds
adduct_table = adduct_table, # adduct definitions for positive mode
adduct_weights = adduct_weights, # which adducts to prioritize in scoring
mass_tolerance = 5e-6, # 5 ppm (fractional form)
time_tolerance = 10, # RT grouping window (seconds)
filter_by = "M+H", # expected primary adduct
pathway_mode = "skip", # "HMDB", "custom", or "skip"
outloc = "xMSannotator_output" # output directory
)After a successful run, the output directory will contain tab-delimited text files for each pipeline stage: Stage1_mass_matched.txt, Stage1_peak_clusters.txt, Stage2_isotope_detection.txt, Stage3_chemical_scores.txt, a Stage 3 pathway file, Stage4a_confidence_levels.txt, Stage4b_confidence_levels.txt, and Stage5_curated_results.txt. The final curated result in Stage5_curated_results.txt is the primary output — each row is a putative annotation with a confidence level (0–3), a chemical score, and a MatchCategory indicating whether the feature matched one or multiple compounds.
The pipeline runs seven processing steps in sequence. Each step consumes upstream results, adds new evidence, and writes an intermediate output file:
-
Stage 1 — Mass Matching. Finds all compound–adduct combinations whose expected m/z falls within tolerance of each detected feature. Consumes the peak table, compound database, and adduct table. Produces
Stage1_mass_matched.txtandStage1_peak_clusters.txt. -
Stage 1.5 — Peak Correlation & Module Detection. Groups features into co-abundance modules using WGCNA, then sub-clusters each module by retention time. Consumes the peak intensity matrix. Module and RT cluster assignments are written to
Stage1_peak_clusters.txt. -
Stage 2 — Isotope Detection. Searches for isotopic peaks (M+1, M+2, ...) matching each annotation's theoretical isotope envelope. Consumes Stage 1 annotations joined with module/RT assignments. Produces
Stage2_isotope_detection.txt. -
Stage 3 — Chemical Scoring. Computes a multi-evidence score for each annotation based on adduct count, correlation strength, isotope confirmation, and RT coherence. Consumes Stage 2 output and the global correlation matrix. Produces
Stage3_chemical_scores.txt. -
Stage 3b — Pathway Enrichment. Tests whether annotated compounds are enriched in known biological pathways using Fisher's exact test, and boosts scores for pathway-supported annotations. Consumes Stage 3 scores and a pathway database. Produces
Stage3_HMDB_pathways.txt(orStage3_custom_pathways.txtorStage3_pathway_skipped.txt). -
Stage 4 — Confidence Assignment. Translates continuous scores into discrete confidence levels (0–3) based on evidence thresholds, then filters to coherent annotations. Consumes Stage 3b output. Produces
Stage4a_confidence_levels.txt(all rows) andStage4b_confidence_levels.txt(coherent rows only). -
Stage 5 — Redundancy Filtering. Resolves features matched to multiple compounds by selecting the best-supported annotation at each m/z. Consumes Stage 4b output. Produces
Stage5_curated_results.txt.
Each stage adds an independent layer of evidence to the annotation. Stage 1 provides the initial mass-match candidates. Stage 1.5 adds co-abundance structure (which features behave similarly across samples). Stage 2 adds isotope-pattern confirmation. Stage 3 synthesizes these signals into a single score. Stage 3b adds biological context from pathway databases. Stages 4 and 5 then use the accumulated evidence to assign confidence and resolve ambiguity. A compound that survives all stages with high confidence has been validated by mass accuracy, co-elution, isotope pattern, correlation network membership, and (optionally) pathway co-occurrence — multiple independent lines of evidence converging on the same identification.
Files: R/simple_annotation.R, src/match_by_mass.cpp
Key questions this stage answers: Which compounds in the database could produce this observed m/z? Under what adduct form, charge state, and multimer configuration?
Compares every detected m/z against a compound database to find putative adduct matches within a mass tolerance window. This is a brute-force search accelerated by a C++ inner loop.
The C++ engine (match_by_mass.cpp, line 39) computes the expected m/z for each compound–adduct combination:
expected_mz = (factor × M + adduct_mass) / charge
- M = compound monoisotopic mass
- factor = number of molecules in the adduct species (1 for
M+H, 2 for2M+H, 3 for3M+H) - adduct_mass = mass contribution of the adduct ion (e.g., +1.007276 for protonation)
- charge = charge state (1, 2, or 3)
Applied as relative (ppm) tolerance via almost_equal() (match_by_mass.cpp):
|observed_mz − expected_mz| ≤ max(|observed_mz|, |expected_mz|) × tolerance
Default: mass_tolerance = 5e-6 (5 ppm in fractional form).
Before output, every match is screened by check_golden_rules() (lines 13–38 of simple_annotation.R). These are elemental ratio heuristics that reject chemically implausible molecular formulas:
Elemental ratio checks (all must pass, requires C > 0):
| Ratio | Allowed range |
|---|---|
| H/C | 0.1–6 |
| F/C | ≤ 6 |
| N/C | ≤ 4 |
| O/C | ≤ 3 |
| P/C | ≤ 2 |
| S/C | ≤ 3 |
Combined element restrictions (upper bounds on element counts when multiple heteroatoms co-occur):
| Elements | Must satisfy (if all > threshold) |
|---|---|
| N,O,P,S | N<10 AND O<20 AND P<4 AND S<3 |
| N,O,P | N<11 AND O<22 AND P<6 |
| O,P,S | O<14 AND P<3 AND S<3 |
| P,S,N | N<4 AND P<3 AND S<3 |
| N,O,S | N<19 AND O<14 AND S<8 |
Water-loss adduct rule (forms_valid_adduct_pair): Adducts involving water loss (M+H−H₂O, M+H−2H₂O, M−H₂O−H) are only allowed if the molecular formula contains oxygen.
The default sample_adduct_table contains 110 adducts spanning both positive and negative ionization modes, including:
- Single-charge primary ions: M+H, M−H, M+Na, M+K, M+NH₄, M+Cl, M+Br
- Multi-charge ions: M+2H (z=2), M+3H (z=3), M−2H (z=2), M−3H (z=3)
- Multimers: 2M+H, 3M+H, 2M−H, 3M−H
- Adducts with common mobile-phase modifiers: formic acid, acetic acid, acetonitrile, TFA, DMSO, isopropanol
Stage1_mass_matched.txt — all candidate annotations with columns: peak, mz, rt, compound, adduct, expected_mass, molecular_formula, name, plus inherited compound metadata.
Files: R/compute_peak_modules.R, R/compute_rt_modules.R
Key questions this stage answers: Which features co-vary in intensity across samples and therefore likely originate from the same compound or biological process? Within each co-abundance module, which features co-elute?
Metabolites entering the mass spectrometer generate multiple related ions: adducts, in-source fragments, and isotopes. These co-eluting, co-abundant species will show correlated intensity patterns across samples. By building a co-abundance network from the peak intensity matrix, the pipeline groups peaks into modules — clusters of features likely originating from the same or related metabolites. This biological signal is later used to boost annotation confidence.
Function: compute_peak_correlations(peak_intensity_matrix, correlation_method = "p")
Computes a pairwise Pearson correlation matrix across all peaks using WGCNA::cor(). Input is a samples × peaks intensity matrix. Output is a symmetric peaks × peaks correlation matrix.
Function: compute_peak_modules()
- Soft-threshold power selection —
WGCNA::pickSoftThreshold()finds the power that best approximates a scale-free network topology. Fallback: power = 6 if estimation fails. - Hierarchical clustering —
flashClust::flashClust()clusters peaks by correlation distance (1 − correlation). - Dynamic tree cutting —
dynamicTreeCut::cutreeDynamic()identifies modules with adaptive branch cutting (parameters:deep_split = 2,min_cluster_size = 10). - Blockwise module detection —
WGCNA::blockwiseModules()refines modules using the soft-threshold power, then merges modules with correlation above the threshold (mergeCutHeight = 1 − correlation_threshold, i.e., 0.3 at default). - Fallback — if
blockwiseModules()fails, the pipeline computes a Topological Overlap Matrix (TOM) and usesmergeCloseModules().
Key defaults:
| Parameter | Default | Description |
|---|---|---|
correlation_threshold |
0.7 | Module merge threshold (merge if r > 0.7) |
deep_split |
2 | Tree cutting depth (0=conservative, 4=aggressive) |
min_cluster_size |
10 | Minimum peaks per module |
network_type |
"unsigned" | Considers both positive and negative correlations |
Function: compute_rt_modules(peak_table, peak_width = 1)
Within each module, peaks are further sub-grouped by retention time using kernel density estimation:
- KDE — Gaussian kernel density with bandwidth =
peak_width(default: 1 second), minimum 2048 evaluation points. - Peak detection —
pastecs::turnpoints()finds local maxima in the density curve (inflection points above machine epsilon). - Nearest-neighbor assignment —
RANN::nn2()assigns each peak to its nearest RT cluster center.
Output: Stage1_peak_clusters.txt — each peak tagged with module, rt_cluster, and the combined identifier Module_RTclust (e.g., "110_3").
File: R/compute_isotopes.R
Key questions this stage answers: Does the observed isotopic envelope (M+1, M+2, ...) match the theoretical pattern predicted by this compound's molecular formula? Are the isotope peaks present at the expected masses and intensities?
Each molecular formula produces a characteristic isotopic envelope due to the natural abundance of heavier isotopes (¹³C, ²H, ¹⁸O, etc.). For example, a compound with 10 carbons will show an M+1 peak at ~11% of the monoisotopic intensity (from ¹³C). By matching the observed isotopic pattern against the theoretical pattern, the pipeline gains additional evidence for or against an annotation.
For each annotated peak (the putative monoisotopic ion):
-
Compute theoretical pattern —
enviPat::isopattern()generates the expected isotope envelope from the molecular formula. Peaks below 0.1% relative abundance (minAbund = 0.001) are discarded. -
Filter candidates — Nearby peaks are screened by:
- Same
rt_cluster(must be co-eluting) - RT within
rt_tolerance(default: 1 second incompute_isotopes, but receivestime_tolerance= 10 fromadvanced_annotation) - Same
mass_defectbin (tolerance: 0, exact match by default) - Not the same peak
- Same
-
Mass matching — Each candidate's m/z is compared to the expected isotope m/z:
expected_isotope_mz = monoisotopic_expected_mass + exact_mass_diff mass_error_ppm = |observed_mz − expected_isotope_mz| / expected_isotope_mz × 10⁶Filtered by
isotope_mass_tolerance_ppm(defaults tomass_tolerance × 10⁶, i.e., 5 ppm). -
Intensity matching — The observed intensity ratio (isotope/monoisotopic) is compared to the theoretical relative abundance:
|observed_ratio − theoretical_abundance| ≤ observed_ratio × intensity_deviation_toleranceDefault tolerance: 0.1 (10% relative deviation).
| Parameter | Default | Description |
|---|---|---|
intensity_deviation_tolerance |
0.1 | 10% allowed deviation in intensity ratio |
mass_defect_tolerance |
0.1 | Mass defect window (Da) |
rt_tolerance |
receives time_tolerance |
RT window (seconds) |
isotope_mass_tolerance_ppm |
mass_tolerance × 10⁶ |
PPM filter for isotope mass matching |
minAbund (enviPat) |
0.001 | 0.1% relative abundance cutoff |
Stage2_isotope_detection.txt — the annotation table expanded with isotope rows. A new column mass_number_difference indicates: 0 = monoisotopic, 1/2/3... = M+1/M+2/M+3 isotope.
Files: R/get_chemscore.R, R/chemscore_helpers.R
Key questions this stage answers: How much independent evidence supports this annotation? Is it a lone mass match, or do multiple adducts, isotope patterns, and intensity correlations converge on the same compound?
Not all mass matches are equally credible. A compound with multiple corroborating adducts in the same RT window, strong intensity correlations, and matching isotope patterns is far more likely to be a true identification than a lone hit. Chemical scoring quantifies this multi-evidence support into a single numeric score.
Function: compute_hub_mz() (line 346, chemscore_helpers.R)
The "hub" is the most informative feature for a given compound — the anchor around which other evidence is evaluated. Selection follows a hierarchical criterion:
- Primary: Features with adducts in the
filter.bylist AND positive correlation with other features - Secondary: Features with weighted adducts (in
adduct_weights) AND no isotope annotations - Tertiary: Any feature with correlation > 0 and no isotope annotations
Among candidates, the hub is the feature with the highest mean intensity. Ties are broken by the feature with the most time neighbors within max_diff_rt.
Function: calc_base_score() (line 363, chemscore_helpers.R)
chemical_score = n_unique_adducts × n_weighted_adducts × max_correlation × 10^max_adduct_weight
- n_unique_adducts — count of distinct adduct types (e.g., M+H, M+Na = 2)
- n_weighted_adducts — count of adducts matching the
adduct_weightstable - max_correlation — highest Pearson correlation between the hub and any other feature for this compound (from the upper triangle of the correlation submatrix)
- 10^max_adduct_weight — exponential encoding of the highest adduct weight (default weights typically 0–5; e.g., weight=2 → ×100)
After monoisotopic scoring, if isotopic peaks are detected for the compound, the score is multiplied by 100:
if (has_isotopes) chemical_score <- chemical_score * 100This 100× multiplier strongly rewards annotations with isotope confirmation.
Function: apply_rt_scaling() (line 785, chemscore_helpers.R)
The score is penalized based on the RT spread of the compound's features:
chemical_score = chemical_score × 1 / ((diff_rt × 0.1) + 1)^k
diff_rt= max(RT) − min(RT) across all features for this compoundk = 1if diff_rt ≤max_diff_rt(gentle penalty)k = 10if diff_rt >max_diff_rt(severe penalty — effectively eliminates scattered annotations)
Example at max_diff_rt = 10s:
- diff_rt = 5s → scaling ≈ 0.67 (mild penalty)
- diff_rt = 15s → scaling ≈ 0.00003 (effectively zero)
After all other adjustments, scores above the minimum threshold receive an exponential boost based on the preliminary confidence level:
chemical_score = chemical_score × conf_level^conf_level
| Confidence level | Boost factor |
|---|---|
| 0 (None) | 0 (zeroed) |
| 1 (Low) | 1× (no change) |
| 2 (Medium) | 4× |
| 3 (High) | 27× |
Function: get_min_chemscore() (line 781, chemscore_helpers.R)
min_score = 100 × 2 × corthresh × (1 / ((max_diff_rt × 0.1) + 1)^3)
At defaults (corthresh=0.7, max_diff_rt=10): min_score ≈ 17.5.
Consider a hypothetical compound detected with the following evidence:
- 2 unique adducts: M+H and M+Na
- 1 of those adducts is weighted (max weight = 2)
- Highest Pearson correlation between hub and another feature: 0.85
- Isotope peaks detected (M+1 confirmed)
- RT spread across features: diff_rt = 5 seconds (max_diff_rt = 10)
- Preliminary confidence level: 2 (Medium)
The score is computed in five steps:
| Step | Calculation | Result |
|---|---|---|
| 1. Base score | 2 × 1 × 0.85 × 10^2 |
170 |
| 2. Isotope boost | 170 × 100 |
17,000 |
| 3. RT scaling (k=1, since 5 ≤ 10) | 17,000 × 1/((5 × 0.1) + 1)^1 = 17,000 × 1/1.5 |
11,333 |
| 4. Min score check | 100 × 2 × 0.7 × 1/((10 × 0.1) + 1)^3 = 140 × 1/8 = 17.5 |
11,333 > 17.5: passes |
| 5. Confidence boost | 11,333 × 2^2 |
45,333 |
The final chemical score of ~45,333 places this annotation firmly in the "strong multi-evidence" range (>10,000). Without isotope confirmation (skip step 2), the score would be ~453 — still meaningful but two orders of magnitude lower, reflecting the reduced certainty.
Stage3_chemical_scores.txt — annotations with cur_chem_score column representing the multi-evidence score.
Files: R/multilevelannotationstep3.R, R/compute_pathways.R
Key questions this stage answers: Are the annotated compounds over-represented in known biological pathways? Does the co-location of pathway members within the same WGCNA module suggest a real biological signal rather than random matches?
If multiple annotated metabolites map to the same biological pathway, this co-occurrence is unlikely to be coincidental and supports the annotations' validity. Pathway enrichment tests whether the overlap between annotated compounds and known pathway members exceeds what would be expected by chance.
Level 1: Global pathway enrichment (line 72–77)
For each pathway, constructs a 2×2 contingency table:
| In pathway | Not in pathway | |
|---|---|---|
| Significant hit | a | b |
| Not significant | c | d |
"Significant" = annotation has score ≥ scorethresh (default: 0.1) AND adduct is in adduct_weights.
Fisher's exact test is applied. Threshold: p ≤ 0.05.
Level 2: Module-pathway overlap (line 115–124)
For pathways passing Level 1, tests whether pathway compounds are co-located within the same WGCNA/RT module more than expected by chance. This is a second 2×2 contingency table comparing the dominant module against all other modules.
Threshold: p ≤ 0.2 (more permissive, since modules are smaller subsets).
All of the following must be met for a compound to receive a pathway boost:
- Pathway passes Level 1 Fisher's test (p ≤ 0.05)
- Module passes Level 2 Fisher's test (p ≤ 0.2)
- ≥ 3 pathway compounds in the module
- Compound's chemical score ≥ 0.1
- Compound has a valid adduct (in
adduct_weights)
Boost amount: score += n_pathway_compounds (additive, per qualifying pathway).
| Mode | Source | Selection |
|---|---|---|
"HMDB" (default) |
Internal hmdbCompMZ dataset (HMDB v3.5) |
SMPDB pathway IDs parsed from column 14 |
"custom" |
User-provided pathway_data data frame |
Must have compound and pathway columns |
"skip" |
None | Pathway step is bypassed |
Stage3_HMDB_pathways.txt(HMDB mode)Stage3_custom_pathways.txt(custom mode)Stage3_pathway_skipped.txt(skip mode)
The cur_chem_score column is renamed to score at this stage.
File: R/multilevelannotationstep4.R
Key questions this stage answers: How confident should we be in this annotation? Is the evidence strong enough for the annotation to be reported, or is it a speculative mass match that should be flagged as low-confidence?
Chemical scores are continuous values that resist intuitive interpretation. Confidence levels translate multi-evidence support into a discrete 0–3 scale that communicates the strength of annotation evidence at a glance. Higher confidence requires convergent evidence from multiple independent sources (adducts, isotopes, module co-membership, RT coherence).
| Level | Label | Interpretation |
|---|---|---|
| 0 | None | Insufficient evidence; mass match only |
| 1 | Low | Single primary adduct match (M+H or M−H) |
| 2 | Medium | Multiple distinct adducts or isotope evidence with moderate coherence |
| 3 | High | Multiple adducts + isotope evidence + module coherence + RT coherence |
| 4 | Confirmed | User-verified compound (via boosted_compounds parameter) |
Function: get_confidence_stage4() (lines 442–542)
The algorithm evaluates each compound's annotation rows through a decision tree. The following pseudocode traces the major decision points. To determine the confidence for any annotation, start at the top and follow the first matching branch:
IF score < 0.1
→ CONFIDENCE 0 (None)
IF adduct not recognized in adduct_table
→ CONFIDENCE 0 (None)
IF only one unique adduct type:
IF score < 10 AND no filter_by match
→ CONFIDENCE 0 (None)
IF score < 10 AND has filter_by match
→ CONFIDENCE 1 (Low)
IF nrow == 2 (single adduct but two rows, suggesting isotope)
→ CONFIDENCE 2 (Medium)
IF nrow > 2 AND no filter_by match
→ CONFIDENCE 0 (None)
IF RT spread > max.rt.diff:
Re-cluster rows by RT → keep the best cluster
(continue evaluation with filtered rows)
IF features span multiple WGCNA modules:
IF score < 10
→ CONFIDENCE 0 (None)
IF score > 10 AND has filter_by match
→ CONFIDENCE 2 (Medium)
IF only one row remains:
IF adduct not in adduct_weights
→ CONFIDENCE 1 (Low)
IF adduct in adduct_weights AND score > 10 AND has filter_by match
→ CONFIDENCE 2 (Medium)
ELSE
→ CONFIDENCE 0 (None)
IF multimer (2M, 3M) is more abundant than monomer
→ downgrade confidence
IF charge > 1
→ CONFIDENCE 1 (Low)
IF carbon count < 1 in molecular formula
→ CONFIDENCE 0 (None)
Otherwise: assign based on adduct count and score thresholds
Two score thresholds drive most decisions: 0.1 (minimum to be considered at all) and 10 (threshold for medium-confidence promotion). The filter_by parameter acts as a gatekeeper — annotations matching the expected primary adduct (e.g., M+H in positive mode) can reach Confidence 1 even with low scores, while non-filter adducts require stronger evidence.
Function: upgrade_confidence_with_evidence() (lines 869–932)
Re-evaluates compounds below Confidence 3 using isotope and adduct evidence. Can only upgrade, never downgrade.
| Evidence pattern | With filter_by active | With filter_by = NULL |
|---|---|---|
| Isotope rows + ≥1 base row + ≥2 base adducts | → Conf 2 | → Conf 3 |
| Isotope rows + ≥1 base row (single adduct) | → Conf 1 | → Conf 2 |
| ≥2 base adducts (no isotopes) | → Conf 1 | → Conf 2 |
| No evidence | No upgrade | No upgrade |
Requirements: actual base rows must exist (not just isotope inference), and module + RT coherence must pass.
Function: cap_confidence_with_evidence() (lines 959–1007)
Enforces hard evidence ceilings. Can both raise (rescue Level 0) and lower (prevent over-confidence).
| Max justified confidence | Required evidence |
|---|---|
| 3 (High) | Isotope rows + ≥1 base row + ≥1 adduct + module + RT coherence |
| 2 (Medium) | ≥2 base adducts + module + RT coherence |
| 1 (Low) | Single primary adduct (M+H or M−H) from an actual base row |
| 0 (None) | Insufficient evidence |
Exceptions: Confidence 4 (user-confirmed) is never modified.
Function: enforce_compound_coherence() (lines 114–122)
Two requirements for a compound's rows to be "coherent":
- Module coherence — all rows must share the same WGCNA module
- RT coherence — RT range (max − min) must be ≤
max.rt.diffseconds
For compounds split across modules, only the rows from the largest module group are kept. This prevents minority stray rows from degrading confidence.
| Stage 4a | Stage 4b | |
|---|---|---|
| Content | All rows with confidence assigned | Coherent rows only (largest module group per compound) |
| Purpose | Audit trail — inspect all annotations including dropped rows | Production data — feeds Stage 5 |
| File | Stage4a_confidence_levels.txt |
Stage4b_confidence_levels.txt |
Function: boost_confidence_of_IDs() (lines 664–703)
If boosted_compounds is provided, matching annotations are elevated to Confidence 4 with a 100× score multiplier. Matching modes:
- ID-only: exact compound ID match
- Proximity-based: compound ID + m/z tolerance (
boost_mass_tolerance) + RT tolerance (boost_time_tolerance)
File: R/multilevelannotationstep5.R
Key questions this stage answers: When a single m/z feature matches multiple compounds, which annotation is best supported? After resolving conflicts, is each remaining annotation a unique match or a winner among competitors?
A single m/z feature may match multiple compounds in the database. Stage 5 resolves these conflicts by selecting the best-supported annotation for each feature, using adduct priority and chemical score as the arbitration criteria.
Function: multilevelannotationstep5() (lines 53–85)
- Identify m/z values matching multiple compounds (
get_features(mz, "multiple")) - For each multi-match m/z:
- Boost weighted adducts: annotations with adducts in
adduct_weightsreceive a 100× score multiplier - Select winner: annotation with the highest (boosted) score
- Remove losers at this m/z: other compounds' rows at this specific m/z are dropped
- Recalculate loser scores:
score = (n_competing − 1) / n_competing × winning_score
- Boost weighted adducts: annotations with adducts in
Function: increase_multimatches_score() (lines 14–20)
Annotations with adducts matching adduct_weights[, 1] receive a 100× score multiplier during conflict resolution. Default weighted adducts: M+H, M−H (when adduct_weights = NULL, all adducts get weight 5).
This strongly favors primary adducts (protonation/deprotonation) over exotic adduct assignments when multiple compounds compete for the same feature.
Each feature in the final output receives a MatchCategory label:
| Category | Meaning |
|---|---|
| Unique | This m/z matched exactly one compound |
| Multiple | This m/z initially matched multiple compounds (even if only the winner remains after filtering) |
MatchCategory is assigned after redundancy removal, so it reflects the initial ambiguity, not the final state.
Stage5_curated_results.txt — the final curated annotation table. Contains the same columns as Stage 4b plus MatchCategory.
| Stage | File | Description |
|---|---|---|
| 1 | Stage1_mass_matched.txt |
Raw mass matching results (all putative adduct annotations) |
| 1 | Stage1_peak_clusters.txt |
Peak module and RT cluster assignments |
| 2 | Stage2_isotope_detection.txt |
Annotations expanded with detected isotope rows |
| 3 | Stage3_chemical_scores.txt |
Chemical scores from multi-evidence evaluation |
| 3b | Stage3_HMDB_pathways.txt |
Pathway-enriched scores (HMDB mode) |
| 3b | Stage3_custom_pathways.txt |
Pathway-enriched scores (custom mode) |
| 3b | Stage3_pathway_skipped.txt |
Scores without pathway enrichment (skip mode) |
| 4a | Stage4a_confidence_levels.txt |
All rows with confidence levels (audit trail) |
| 4b | Stage4b_confidence_levels.txt |
Coherent rows only (production output) |
| 5 | Stage5_curated_results.txt |
Final curated results after redundancy filtering |
All files are tab-delimited text. Columns vary by stage but always include: mz, time/rt, compound_id, Adduct, score (from Stage 3 onward), Confidence and Confidence_Level (from Stage 4 onward).
Before diving into the full parameter reference table, use the sections below to identify which parameters to adjust for your specific situation.
mass_tolerance— Decrease to 2–3 ppm for Orbitrap and other high-resolution instruments where mass accuracy is tight. Increase to 10–15 ppm for QTOF or lower-resolution instruments where wider windows are needed to capture true matches.adduct_table— Filter the default 110-adduct table down to adducts relevant to your ionization mode and chromatographic conditions. Running positive-mode C18 data against all 110 adducts (including negative-mode and exotic species) generates unnecessary false matches.
n_workers— Set to the number of available CPU cores minus one; more cores speed up the WGCNA module detection step, which is the most computationally expensive part of the pipeline.pathway_mode— Use"HMDB"for human metabolomics studies where compound IDs are HMDB identifiers. Use"custom"with your own pathway table for non-HMDB databases (e.g., pesticide libraries, plant metabolite databases). Use"skip"for exposomics or targeted panels where pathway enrichment is not meaningful.boosted_compounds— Provide a data frame of verified standards (compound IDs with known m/z and RT) to anchor confidence calibration; these are elevated to Confidence 4 and serve as positive controls in your results.
correlation_threshold— Decrease to 0.5–0.6 if WGCNA produces too few modules (most features land in a single module). Increase to 0.8–0.9 if modules are too large and mix unrelated features.min_cluster_size— Decrease below 10 if small datasets produce too few modules. Increase above 10 to reduce noise modules in large studies with thousands of features.time_tolerance— Decrease below 10 seconds for narrow chromatographic peaks (e.g., UHPLC C18). Increase above 10 seconds for broad peaks or HILIC separations where co-eluting adducts span a wider RT range.intensity_deviation_tolerance— Increase to 0.2–0.3 for noisy or low-abundance data where isotope intensity ratios are imprecise. Decrease to 0.05 for clean data from standards or high-abundance metabolites.filter_by— Set to the expected primary adduct for your ionization mode ("M+H"for positive,"M-H"for negative) to gate confidence levels. Set toNULLfor unbiased scoring where no single adduct is expected to dominate.
| Parameter | Default | Stage | Description |
|---|---|---|---|
mass_tolerance |
5e-6 (5 ppm) | 1 | Relative mass tolerance for adduct matching |
adduct_table |
110 built-in adducts | 1 | Adduct definitions (charge, factor, mass) |
feature_id_column |
NULL | all | Column name for custom feature IDs to preserve in all stage outputs |
correlation_threshold |
0.7 | 1.5, 3 | Module merge threshold; minimum correlation for scoring evidence |
deep_split |
2 | 1.5 | WGCNA tree cutting depth (0=conservative, 4=aggressive) |
min_cluster_size |
10 | 1.5 | Minimum peaks per WGCNA module |
network_type |
"unsigned" | 1.5 | WGCNA network topology type |
peak_rt_width |
1 | 1.5 | Gaussian KDE bandwidth for RT clustering (seconds) |
intensity_deviation_tolerance |
0.1 | 2 | Allowed deviation in isotope intensity ratio (10%) |
mass_defect_tolerance |
0.1 | 2 | Mass defect window (Da) |
mass_defect_precision |
0.01 | 2 | Bin width for mass defect calculation |
isotope_mass_tolerance |
NULL (→ mass_tolerance) |
2 | PPM tolerance for isotope mass matching |
maximum_isotopes |
10 | 4 | Maximum isotope peaks per compound |
time_tolerance |
10 | 2, 3, 3b, 4 | RT window for grouping features (seconds) |
MplusH_abundance_ratio_check |
TRUE | 3 | Require M+H/M−H to be most abundant adduct |
multimer_abundance_check |
TRUE | 4 | Penalize multimers (2M, 3M) more abundant than monomer |
adduct_weights |
NULL (all adducts, weight=5) | 3, 3b, 4, 5 | Adduct priority weights for scoring and conflict resolution |
filter_by |
c("M−H", "M+H") | 3, 4 | Expected primary adducts (gates certain confidence levels) |
min_ions_per_chemical |
2 | 4 | Minimum annotation rows per compound |
level1_primary_adducts |
c("M+H", "M−H") | 4 | Adducts that qualify for Confidence Level 1 |
pathway_mode |
"HMDB" | 3b | Pathway analysis mode: "HMDB", "custom", or "skip" |
pathway_data |
NULL | 3b | Custom pathway table (required if pathway_mode = "custom") |
excluded_pathways |
NULL | 3b | Pathway IDs to exclude from enrichment analysis |
excluded_pathway_compounds |
NULL | 3b | Compound IDs to exclude from pathway analysis |
redundancy_filtering |
TRUE | 5 | Enable/disable Stage 5 multi-match resolution |
boosted_compounds |
NULL | 4 | User-verified compound IDs (elevated to Confidence 4) |
boost_match_by |
c("mz", "rt") | 4 | How to match boosted compounds: by mz, rt, or both |
boost_mass_tolerance |
NULL (→ mass_tolerance) |
4 | Mass tolerance for boost matching |
boost_time_tolerance |
NULL (→ time_tolerance) |
4 | RT tolerance for boost matching |
identify_isotopologues_flag |
TRUE | 4 | Add isotopologue identification labels |
outloc |
tempdir() |
all | Output directory for stage files |
n_workers |
parallel::detectCores() |
1.5 | Number of parallel workers for WGCNA |
| Score range | Typical evidence |
|---|---|
| 0 | No corroborating evidence; mass match only |
| 1–10 | Single adduct, low correlation |
| 10–100 | Multiple adducts or weighted adduct with good correlation |
| 100–1,000 | Strong multi-adduct evidence or isotope confirmation |
| 1,000–10,000 | Multiple weighted adducts + isotopes + high correlation |
| >10,000 | Full evidence: weighted adducts + isotopes + high confidence level boost |
Symptom: Most annotations have Confidence 0 (None), even for features you expect to annotate well.
Likely cause: The filter_by parameter does not include adducts present in your adduct_table, so the Stage 4 decision tree hits the "no filter match" branch and assigns Confidence 0. Alternatively, mass_tolerance may be too tight, producing too few mass matches for multi-evidence scoring to work.
Fix: Verify that filter_by contains an adduct that is also in your adduct_table (e.g., filter_by = "M+H" for positive mode). If mass matches are sparse, widen mass_tolerance (e.g., from 5e-6 to 10e-6).
Symptom: All or most features are assigned to a single WGCNA module, and Stage1_peak_clusters.txt shows only one or two module numbers.
Likely cause: correlation_threshold is too low (modules that should be separate get merged) or min_cluster_size is too large (small modules are absorbed into larger ones).
Fix: Increase correlation_threshold to 0.8 or 0.9 so only highly correlated features merge. Decrease min_cluster_size below 10 if your dataset has fewer features. Also check that the peak table has enough samples (columns) for meaningful correlation — WGCNA needs at least ~15–20 samples to produce stable modules.
Symptom: No pathway enrichment is occurring — scores in Stage3_HMDB_pathways.txt are identical to Stage3_chemical_scores.txt.
Likely cause: pathway_mode is set to "skip", or compound IDs in your database do not match the HMDB identifier format (e.g., "HMDB0000001") used by the internal pathway table.
Fix: Set pathway_mode = "HMDB" and ensure your compound table's compound_id column uses HMDB-format IDs. If using a non-HMDB database, set pathway_mode = "custom" and provide a pathway_data data frame with compound and pathway columns mapping your compound IDs to pathway identifiers.
Symptom: Chemical scores are uniformly low (all below 10), resulting in most annotations stuck at Confidence 0 or 1.
Likely cause: Few adducts are being detected per compound (most compounds match only one adduct), correlations are weak, or time_tolerance is too narrow for features to be grouped together. When compounds have only a single adduct and no isotope confirmation, the base score formula produces small values.
Fix: Widen time_tolerance (e.g., from 10 to 15–20 seconds) to allow more features to be grouped. Check that your adduct_table includes the adducts expected for your chromatographic conditions. If correlations are weak, verify that your peak table has sufficient biological replicates (not just technical replicates or QC samples).
Symptom: Stage5_curated_results.txt has far fewer rows than Stage4b_confidence_levels.txt.
Likely cause: This is expected behavior when many features match multiple compounds. Stage 5 resolves each multi-match feature to a single winner, dropping the competing annotations. The MatchCategory column in the Stage 5 output distinguishes features that were unique matches from those that had competition.
Fix: No fix needed — this is the pipeline working as intended. To see what was dropped, compare Stage4b_confidence_levels.txt against Stage5_curated_results.txt. Features marked MatchCategory = "Multiple" in Stage 5 had competing annotations that were resolved in favor of the retained compound.