CLUES.xMSannotator Workflow Reference

Overview

CLUES.xMSannotator is a network-based annotation pipeline for LC-MS metabolomics data. Starting from a peak table (m/z, RT, intensities) and a compound database, it assigns putative identities to detected features through mass matching, co-abundance network analysis, isotope pattern verification, chemical scoring, pathway enrichment, confidence assignment, and redundancy filtering.

This pipeline is a customized fork of RECETOX/recetox-xMSannotator, which is itself a refactored version of the original xMSannotator R package by Karan Uppal (Emory University). Key modifications in this fork include:

Evidence ceiling (Approach C) for post-hoc confidence adjustment
WGCNA module and RT coherence filtering
Support for boosted_compounds (user-verified standards → Confidence 4)
Multi-evidence chemical scoring with isotope, adduct, and correlation weighting
Intermediate stage output files (Stage1_*.txt through Stage5_*.txt) for inspection

The entry point is advanced_annotation() in R/advanced_annotation.R.

Citation: Uppal K, Walker DI, Jones DP. xMSannotator: an R package for network-based annotation of high-resolution metabolomics data. Analytical Chemistry. 2017;89(2):1063-1067.

Quick Start

A minimal annotation run requires a peak table (m/z, RT, and sample intensity columns) and a compound database (compound_id, name, molecular_formula, monoisotopic_mass). The following example shows a positive-mode run with the most important parameters:

library(CLUES.xMSannotator)

# Load your data
peak_table     <- data.table::fread("feature_table.txt", data.table = FALSE)
compound_table <- readxl::read_xlsx("compound_database.xlsx")

# Filter adduct table to positive-mode adducts
adduct_table <- CLUES.xMSannotator:::sample_adduct_table
adduct_table <- adduct_table[adduct_table$adduct %in%
  c("M+H", "M+2H", "M+NH4", "M+Na", "M+ACN+H", "2M+H", "M+H-H2O"), ]

# Build adduct weights (all selected adducts, equal weight)
adduct_weights <- data.frame(adduct = adduct_table$adduct, weight = rep(1, nrow(adduct_table)))

annotation <- advanced_annotation(
  peak_table        = peak_table,            # features × samples intensity matrix
  compound_table    = compound_table,        # database of candidate compounds
  adduct_table      = adduct_table,          # adduct definitions for positive mode
  adduct_weights    = adduct_weights,        # which adducts to prioritize in scoring
  mass_tolerance    = 5e-6,                  # 5 ppm (fractional form)
  time_tolerance    = 10,                    # RT grouping window (seconds)
  filter_by         = "M+H",                # expected primary adduct
  pathway_mode      = "skip",               # "HMDB", "custom", or "skip"
  outloc            = "xMSannotator_output"  # output directory
)

After a successful run, the output directory will contain tab-delimited text files for each pipeline stage: Stage1_mass_matched.txt, Stage1_peak_clusters.txt, Stage2_isotope_detection.txt, Stage3_chemical_scores.txt, a Stage 3 pathway file, Stage4a_confidence_levels.txt, Stage4b_confidence_levels.txt, and Stage5_curated_results.txt. The final curated result in Stage5_curated_results.txt is the primary output — each row is a putative annotation with a confidence level (0–3), a chemical score, and a MatchCategory indicating whether the feature matched one or multiple compounds.

Pipeline Data Flow

The pipeline runs seven processing steps in sequence. Each step consumes upstream results, adds new evidence, and writes an intermediate output file:

Stage 1 — Mass Matching. Finds all compound–adduct combinations whose expected m/z falls within tolerance of each detected feature. Consumes the peak table, compound database, and adduct table. Produces Stage1_mass_matched.txt and Stage1_peak_clusters.txt.
Stage 1.5 — Peak Correlation & Module Detection. Groups features into co-abundance modules using WGCNA, then sub-clusters each module by retention time. Consumes the peak intensity matrix. Module and RT cluster assignments are written to Stage1_peak_clusters.txt.
Stage 2 — Isotope Detection. Searches for isotopic peaks (M+1, M+2, ...) matching each annotation's theoretical isotope envelope. Consumes Stage 1 annotations joined with module/RT assignments. Produces Stage2_isotope_detection.txt.
Stage 3 — Chemical Scoring. Computes a multi-evidence score for each annotation based on adduct count, correlation strength, isotope confirmation, and RT coherence. Consumes Stage 2 output and the global correlation matrix. Produces Stage3_chemical_scores.txt.
Stage 3b — Pathway Enrichment. Tests whether annotated compounds are enriched in known biological pathways using Fisher's exact test, and boosts scores for pathway-supported annotations. Consumes Stage 3 scores and a pathway database. Produces Stage3_HMDB_pathways.txt (or Stage3_custom_pathways.txt or Stage3_pathway_skipped.txt).
Stage 4 — Confidence Assignment. Translates continuous scores into discrete confidence levels (0–3) based on evidence thresholds, then filters to coherent annotations. Consumes Stage 3b output. Produces Stage4a_confidence_levels.txt (all rows) and Stage4b_confidence_levels.txt (coherent rows only).
Stage 5 — Redundancy Filtering. Resolves features matched to multiple compounds by selecting the best-supported annotation at each m/z. Consumes Stage 4b output. Produces Stage5_curated_results.txt.

Each stage adds an independent layer of evidence to the annotation. Stage 1 provides the initial mass-match candidates. Stage 1.5 adds co-abundance structure (which features behave similarly across samples). Stage 2 adds isotope-pattern confirmation. Stage 3 synthesizes these signals into a single score. Stage 3b adds biological context from pathway databases. Stages 4 and 5 then use the accumulated evidence to assign confidence and resolve ambiguity. A compound that survives all stages with high confidence has been validated by mass accuracy, co-elution, isotope pattern, correlation network membership, and (optionally) pathway co-occurrence — multiple independent lines of evidence converging on the same identification.

Stage 1: Mass Matching

Files: R/simple_annotation.R, src/match_by_mass.cpp

Key questions this stage answers: Which compounds in the database could produce this observed m/z? Under what adduct form, charge state, and multimer configuration?

What it does

Compares every detected m/z against a compound database to find putative adduct matches within a mass tolerance window. This is a brute-force search accelerated by a C++ inner loop.

Adduct equation

The C++ engine (match_by_mass.cpp, line 39) computes the expected m/z for each compound–adduct combination:

expected_mz = (factor × M + adduct_mass) / charge

M = compound monoisotopic mass
factor = number of molecules in the adduct species (1 for M+H, 2 for 2M+H, 3 for 3M+H)
adduct_mass = mass contribution of the adduct ion (e.g., +1.007276 for protonation)
charge = charge state (1, 2, or 3)

Mass tolerance

Applied as relative (ppm) tolerance via almost_equal() (match_by_mass.cpp):

|observed_mz − expected_mz| ≤ max(|observed_mz|, |expected_mz|) × tolerance

Default: mass_tolerance = 5e-6 (5 ppm in fractional form).

Golden Rules filters

Before output, every match is screened by check_golden_rules() (lines 13–38 of simple_annotation.R). These are elemental ratio heuristics that reject chemically implausible molecular formulas:

Elemental ratio checks (all must pass, requires C > 0):

Ratio	Allowed range
H/C	0.1–6
F/C	≤ 6
N/C	≤ 4
O/C	≤ 3
P/C	≤ 2
S/C	≤ 3

Combined element restrictions (upper bounds on element counts when multiple heteroatoms co-occur):

Elements	Must satisfy (if all > threshold)
N,O,P,S	N<10 AND O<20 AND P<4 AND S<3
N,O,P	N<11 AND O<22 AND P<6
O,P,S	O<14 AND P<3 AND S<3
P,S,N	N<4 AND P<3 AND S<3
N,O,S	N<19 AND O<14 AND S<8

Water-loss adduct rule (forms_valid_adduct_pair): Adducts involving water loss (M+H−H₂O, M+H−2H₂O, M−H₂O−H) are only allowed if the molecular formula contains oxygen.

Adduct table

The default sample_adduct_table contains 110 adducts spanning both positive and negative ionization modes, including:

Single-charge primary ions: M+H, M−H, M+Na, M+K, M+NH₄, M+Cl, M+Br
Multi-charge ions: M+2H (z=2), M+3H (z=3), M−2H (z=2), M−3H (z=3)
Multimers: 2M+H, 3M+H, 2M−H, 3M−H
Adducts with common mobile-phase modifiers: formic acid, acetic acid, acetonitrile, TFA, DMSO, isopropanol

Output

Stage1_mass_matched.txt — all candidate annotations with columns: peak, mz, rt, compound, adduct, expected_mass, molecular_formula, name, plus inherited compound metadata.

Stage 1.5: Peak Correlation & Module Detection

Files: R/compute_peak_modules.R, R/compute_rt_modules.R

Key questions this stage answers: Which features co-vary in intensity across samples and therefore likely originate from the same compound or biological process? Within each co-abundance module, which features co-elute?

Theory

Metabolites entering the mass spectrometer generate multiple related ions: adducts, in-source fragments, and isotopes. These co-eluting, co-abundant species will show correlated intensity patterns across samples. By building a co-abundance network from the peak intensity matrix, the pipeline groups peaks into modules — clusters of features likely originating from the same or related metabolites. This biological signal is later used to boost annotation confidence.

Step A: Pearson correlation matrix

Function: compute_peak_correlations(peak_intensity_matrix, correlation_method = "p")

Computes a pairwise Pearson correlation matrix across all peaks using WGCNA::cor(). Input is a samples × peaks intensity matrix. Output is a symmetric peaks × peaks correlation matrix.

Step B: WGCNA module detection

Function: compute_peak_modules()

Soft-threshold power selection — WGCNA::pickSoftThreshold() finds the power that best approximates a scale-free network topology. Fallback: power = 6 if estimation fails.
Hierarchical clustering — flashClust::flashClust() clusters peaks by correlation distance (1 − correlation).
Dynamic tree cutting — dynamicTreeCut::cutreeDynamic() identifies modules with adaptive branch cutting (parameters: deep_split = 2, min_cluster_size = 10).
Blockwise module detection — WGCNA::blockwiseModules() refines modules using the soft-threshold power, then merges modules with correlation above the threshold (mergeCutHeight = 1 − correlation_threshold, i.e., 0.3 at default).
Fallback — if blockwiseModules() fails, the pipeline computes a Topological Overlap Matrix (TOM) and uses mergeCloseModules().

Key defaults:

Parameter	Default	Description
`correlation_threshold`	0.7	Module merge threshold (merge if r > 0.7)
`deep_split`	2	Tree cutting depth (0=conservative, 4=aggressive)
`min_cluster_size`	10	Minimum peaks per module
`network_type`	"unsigned"	Considers both positive and negative correlations

Step C: RT clustering within modules

Function: compute_rt_modules(peak_table, peak_width = 1)

Within each module, peaks are further sub-grouped by retention time using kernel density estimation:

KDE — Gaussian kernel density with bandwidth = peak_width (default: 1 second), minimum 2048 evaluation points.
Peak detection — pastecs::turnpoints() finds local maxima in the density curve (inflection points above machine epsilon).
Nearest-neighbor assignment — RANN::nn2() assigns each peak to its nearest RT cluster center.

Output: Stage1_peak_clusters.txt — each peak tagged with module, rt_cluster, and the combined identifier Module_RTclust (e.g., "110_3").

Stage 2: Isotope Detection

File: R/compute_isotopes.R

Key questions this stage answers: Does the observed isotopic envelope (M+1, M+2, ...) match the theoretical pattern predicted by this compound's molecular formula? Are the isotope peaks present at the expected masses and intensities?

Theory

Each molecular formula produces a characteristic isotopic envelope due to the natural abundance of heavier isotopes (¹³C, ²H, ¹⁸O, etc.). For example, a compound with 10 carbons will show an M+1 peak at ~11% of the monoisotopic intensity (from ¹³C). By matching the observed isotopic pattern against the theoretical pattern, the pipeline gains additional evidence for or against an annotation.

Algorithm

For each annotated peak (the putative monoisotopic ion):

Compute theoretical pattern — enviPat::isopattern() generates the expected isotope envelope from the molecular formula. Peaks below 0.1% relative abundance (minAbund = 0.001) are discarded.
Filter candidates — Nearby peaks are screened by:
- Same rt_cluster (must be co-eluting)
- RT within rt_tolerance (default: 1 second in compute_isotopes, but receives time_tolerance = 10 from advanced_annotation)
- Same mass_defect bin (tolerance: 0, exact match by default)
- Not the same peak
Mass matching — Each candidate's m/z is compared to the expected isotope m/z:
```
expected_isotope_mz = monoisotopic_expected_mass + exact_mass_diff
mass_error_ppm = |observed_mz − expected_isotope_mz| / expected_isotope_mz × 10⁶
```
Filtered by isotope_mass_tolerance_ppm (defaults to mass_tolerance × 10⁶, i.e., 5 ppm).
Intensity matching — The observed intensity ratio (isotope/monoisotopic) is compared to the theoretical relative abundance:
```
|observed_ratio − theoretical_abundance| ≤ observed_ratio × intensity_deviation_tolerance
```
Default tolerance: 0.1 (10% relative deviation).

Key defaults

Parameter	Default	Description
`intensity_deviation_tolerance`	0.1	10% allowed deviation in intensity ratio
`mass_defect_tolerance`	0.1	Mass defect window (Da)
`rt_tolerance`	receives `time_tolerance`	RT window (seconds)
`isotope_mass_tolerance_ppm`	`mass_tolerance × 10⁶`	PPM filter for isotope mass matching
`minAbund` (enviPat)	0.001	0.1% relative abundance cutoff

Output

Stage2_isotope_detection.txt — the annotation table expanded with isotope rows. A new column mass_number_difference indicates: 0 = monoisotopic, 1/2/3... = M+1/M+2/M+3 isotope.

Stage 3: Chemical Scoring

Files: R/get_chemscore.R, R/chemscore_helpers.R

Key questions this stage answers: How much independent evidence supports this annotation? Is it a lone mass match, or do multiple adducts, isotope patterns, and intensity correlations converge on the same compound?

Theory

Not all mass matches are equally credible. A compound with multiple corroborating adducts in the same RT window, strong intensity correlations, and matching isotope patterns is far more likely to be a true identification than a lone hit. Chemical scoring quantifies this multi-evidence support into a single numeric score.

Hub m/z identification

Function: compute_hub_mz() (line 346, chemscore_helpers.R)

The "hub" is the most informative feature for a given compound — the anchor around which other evidence is evaluated. Selection follows a hierarchical criterion:

Primary: Features with adducts in the filter.by list AND positive correlation with other features
Secondary: Features with weighted adducts (in adduct_weights) AND no isotope annotations
Tertiary: Any feature with correlation > 0 and no isotope annotations

Among candidates, the hub is the feature with the highest mean intensity. Ties are broken by the feature with the most time neighbors within max_diff_rt.

Scoring formula

Function: calc_base_score() (line 363, chemscore_helpers.R)

chemical_score = n_unique_adducts × n_weighted_adducts × max_correlation × 10^max_adduct_weight

n_unique_adducts — count of distinct adduct types (e.g., M+H, M+Na = 2)
n_weighted_adducts — count of adducts matching the adduct_weights table
max_correlation — highest Pearson correlation between the hub and any other feature for this compound (from the upper triangle of the correlation submatrix)
10^max_adduct_weight — exponential encoding of the highest adduct weight (default weights typically 0–5; e.g., weight=2 → ×100)

Isotope evidence boost

After monoisotopic scoring, if isotopic peaks are detected for the compound, the score is multiplied by 100:

if (has_isotopes) chemical_score <- chemical_score * 100

This 100× multiplier strongly rewards annotations with isotope confirmation.

RT scaling

Function: apply_rt_scaling() (line 785, chemscore_helpers.R)

The score is penalized based on the RT spread of the compound's features:

chemical_score = chemical_score × 1 / ((diff_rt × 0.1) + 1)^k

diff_rt = max(RT) − min(RT) across all features for this compound
k = 1 if diff_rt ≤ max_diff_rt (gentle penalty)
k = 10 if diff_rt > max_diff_rt (severe penalty — effectively eliminates scattered annotations)

Example at max_diff_rt = 10s:

diff_rt = 5s → scaling ≈ 0.67 (mild penalty)
diff_rt = 15s → scaling ≈ 0.00003 (effectively zero)

Confidence-level boost

After all other adjustments, scores above the minimum threshold receive an exponential boost based on the preliminary confidence level:

chemical_score = chemical_score × conf_level^conf_level

Confidence level	Boost factor
0 (None)	0 (zeroed)
1 (Low)	1× (no change)
2 (Medium)	4×
3 (High)	27×

Minimum score threshold

Function: get_min_chemscore() (line 781, chemscore_helpers.R)

min_score = 100 × 2 × corthresh × (1 / ((max_diff_rt × 0.1) + 1)^3)

At defaults (corthresh=0.7, max_diff_rt=10): min_score ≈ 17.5.

Worked example

Consider a hypothetical compound detected with the following evidence:

2 unique adducts: M+H and M+Na
1 of those adducts is weighted (max weight = 2)
Highest Pearson correlation between hub and another feature: 0.85
Isotope peaks detected (M+1 confirmed)
RT spread across features: diff_rt = 5 seconds (max_diff_rt = 10)
Preliminary confidence level: 2 (Medium)

The score is computed in five steps:

Step	Calculation	Result
1. Base score	`2 × 1 × 0.85 × 10^2`	170
2. Isotope boost	`170 × 100`	17,000
3. RT scaling (k=1, since 5 ≤ 10)	`17,000 × 1/((5 × 0.1) + 1)^1` = `17,000 × 1/1.5`	11,333
4. Min score check	`100 × 2 × 0.7 × 1/((10 × 0.1) + 1)^3` = `140 × 1/8` = 17.5	11,333 > 17.5: passes
5. Confidence boost	`11,333 × 2^2`	45,333

The final chemical score of ~45,333 places this annotation firmly in the "strong multi-evidence" range (>10,000). Without isotope confirmation (skip step 2), the score would be ~453 — still meaningful but two orders of magnitude lower, reflecting the reduced certainty.

Output

Stage3_chemical_scores.txt — annotations with cur_chem_score column representing the multi-evidence score.

Stage 3b: Pathway Enrichment

Files: R/multilevelannotationstep3.R, R/compute_pathways.R

Key questions this stage answers: Are the annotated compounds over-represented in known biological pathways? Does the co-location of pathway members within the same WGCNA module suggest a real biological signal rather than random matches?

Theory

If multiple annotated metabolites map to the same biological pathway, this co-occurrence is unlikely to be coincidental and supports the annotations' validity. Pathway enrichment tests whether the overlap between annotated compounds and known pathway members exceeds what would be expected by chance.

Fisher's exact test — two levels

Level 1: Global pathway enrichment (line 72–77)

For each pathway, constructs a 2×2 contingency table:

	In pathway	Not in pathway
Significant hit	a	b
Not significant	c	d

"Significant" = annotation has score ≥ scorethresh (default: 0.1) AND adduct is in adduct_weights.

Fisher's exact test is applied. Threshold: p ≤ 0.05.

Level 2: Module-pathway overlap (line 115–124)

For pathways passing Level 1, tests whether pathway compounds are co-located within the same WGCNA/RT module more than expected by chance. This is a second 2×2 contingency table comparing the dominant module against all other modules.

Threshold: p ≤ 0.2 (more permissive, since modules are smaller subsets).

Score boost conditions

All of the following must be met for a compound to receive a pathway boost:

Pathway passes Level 1 Fisher's test (p ≤ 0.05)
Module passes Level 2 Fisher's test (p ≤ 0.2)
≥ 3 pathway compounds in the module
Compound's chemical score ≥ 0.1
Compound has a valid adduct (in adduct_weights)

Boost amount: score += n_pathway_compounds (additive, per qualifying pathway).

Pathway databases

Mode	Source	Selection
`"HMDB"` (default)	Internal `hmdbCompMZ` dataset (HMDB v3.5)	SMPDB pathway IDs parsed from column 14
`"custom"`	User-provided `pathway_data` data frame	Must have `compound` and `pathway` columns
`"skip"`	None	Pathway step is bypassed

Output

Stage3_HMDB_pathways.txt (HMDB mode)
Stage3_custom_pathways.txt (custom mode)
Stage3_pathway_skipped.txt (skip mode)

The cur_chem_score column is renamed to score at this stage.

Stage 4: Confidence Assignment

File: R/multilevelannotationstep4.R

Key questions this stage answers: How confident should we be in this annotation? Is the evidence strong enough for the annotation to be reported, or is it a speculative mass match that should be flagged as low-confidence?

Theory

Chemical scores are continuous values that resist intuitive interpretation. Confidence levels translate multi-evidence support into a discrete 0–3 scale that communicates the strength of annotation evidence at a glance. Higher confidence requires convergent evidence from multiple independent sources (adducts, isotopes, module co-membership, RT coherence).

Confidence levels

Level	Label	Interpretation
0	None	Insufficient evidence; mass match only
1	Low	Single primary adduct match (M+H or M−H)
2	Medium	Multiple distinct adducts or isotope evidence with moderate coherence
3	High	Multiple adducts + isotope evidence + module coherence + RT coherence
4	Confirmed	User-verified compound (via `boosted_compounds` parameter)

Assignment algorithm

Function: get_confidence_stage4() (lines 442–542)

The algorithm evaluates each compound's annotation rows through a decision tree. The following pseudocode traces the major decision points. To determine the confidence for any annotation, start at the top and follow the first matching branch:

IF score < 0.1
  → CONFIDENCE 0 (None)

IF adduct not recognized in adduct_table
  → CONFIDENCE 0 (None)

IF only one unique adduct type:
  IF score < 10 AND no filter_by match
    → CONFIDENCE 0 (None)
  IF score < 10 AND has filter_by match
    → CONFIDENCE 1 (Low)
  IF nrow == 2 (single adduct but two rows, suggesting isotope)
    → CONFIDENCE 2 (Medium)
  IF nrow > 2 AND no filter_by match
    → CONFIDENCE 0 (None)

IF RT spread > max.rt.diff:
  Re-cluster rows by RT → keep the best cluster
  (continue evaluation with filtered rows)

IF features span multiple WGCNA modules:
  IF score < 10
    → CONFIDENCE 0 (None)
  IF score > 10 AND has filter_by match
    → CONFIDENCE 2 (Medium)

IF only one row remains:
  IF adduct not in adduct_weights
    → CONFIDENCE 1 (Low)
  IF adduct in adduct_weights AND score > 10 AND has filter_by match
    → CONFIDENCE 2 (Medium)
  ELSE
    → CONFIDENCE 0 (None)

IF multimer (2M, 3M) is more abundant than monomer
  → downgrade confidence

IF charge > 1
  → CONFIDENCE 1 (Low)

IF carbon count < 1 in molecular formula
  → CONFIDENCE 0 (None)

Otherwise: assign based on adduct count and score thresholds

Two score thresholds drive most decisions: 0.1 (minimum to be considered at all) and 10 (threshold for medium-confidence promotion). The filter_by parameter acts as a gatekeeper — annotations matching the expected primary adduct (e.g., M+H in positive mode) can reach Confidence 1 even with low scores, while non-filter adducts require stronger evidence.

Confidence upgrade (post-hoc)

Function: upgrade_confidence_with_evidence() (lines 869–932)

Re-evaluates compounds below Confidence 3 using isotope and adduct evidence. Can only upgrade, never downgrade.

Evidence pattern	With filter_by active	With filter_by = NULL
Isotope rows + ≥1 base row + ≥2 base adducts	→ Conf 2	→ Conf 3
Isotope rows + ≥1 base row (single adduct)	→ Conf 1	→ Conf 2
≥2 base adducts (no isotopes)	→ Conf 1	→ Conf 2
No evidence	No upgrade	No upgrade

Requirements: actual base rows must exist (not just isotope inference), and module + RT coherence must pass.

Confidence cap (post-hoc)

Function: cap_confidence_with_evidence() (lines 959–1007)

Enforces hard evidence ceilings. Can both raise (rescue Level 0) and lower (prevent over-confidence).

Max justified confidence	Required evidence
3 (High)	Isotope rows + ≥1 base row + ≥1 adduct + module + RT coherence
2 (Medium)	≥2 base adducts + module + RT coherence
1 (Low)	Single primary adduct (M+H or M−H) from an actual base row
0 (None)	Insufficient evidence

Exceptions: Confidence 4 (user-confirmed) is never modified.

Coherence filtering

Function: enforce_compound_coherence() (lines 114–122)

Two requirements for a compound's rows to be "coherent":

Module coherence — all rows must share the same WGCNA module
RT coherence — RT range (max − min) must be ≤ max.rt.diff seconds

For compounds split across modules, only the rows from the largest module group are kept. This prevents minority stray rows from degrading confidence.

Stage 4a vs 4b

	Stage 4a	Stage 4b
Content	All rows with confidence assigned	Coherent rows only (largest module group per compound)
Purpose	Audit trail — inspect all annotations including dropped rows	Production data — feeds Stage 5
File	`Stage4a_confidence_levels.txt`	`Stage4b_confidence_levels.txt`

Confidence boosting (user-confirmed compounds)

Function: boost_confidence_of_IDs() (lines 664–703)

If boosted_compounds is provided, matching annotations are elevated to Confidence 4 with a 100× score multiplier. Matching modes:

ID-only: exact compound ID match
Proximity-based: compound ID + m/z tolerance (boost_mass_tolerance) + RT tolerance (boost_time_tolerance)

Stage 5: Redundancy Filtering

File: R/multilevelannotationstep5.R

Key questions this stage answers: When a single m/z feature matches multiple compounds, which annotation is best supported? After resolving conflicts, is each remaining annotation a unique match or a winner among competitors?

Theory

A single m/z feature may match multiple compounds in the database. Stage 5 resolves these conflicts by selecting the best-supported annotation for each feature, using adduct priority and chemical score as the arbitration criteria.

Multi-match resolution algorithm

Function: multilevelannotationstep5() (lines 53–85)

Identify m/z values matching multiple compounds (get_features(mz, "multiple"))
For each multi-match m/z:
- Boost weighted adducts: annotations with adducts in adduct_weights receive a 100× score multiplier
- Select winner: annotation with the highest (boosted) score
- Remove losers at this m/z: other compounds' rows at this specific m/z are dropped
- Recalculate loser scores: score = (n_competing − 1) / n_competing × winning_score

Adduct weighting

Function: increase_multimatches_score() (lines 14–20)

Annotations with adducts matching adduct_weights[, 1] receive a 100× score multiplier during conflict resolution. Default weighted adducts: M+H, M−H (when adduct_weights = NULL, all adducts get weight 5).

This strongly favors primary adducts (protonation/deprotonation) over exotic adduct assignments when multiple compounds compete for the same feature.

MatchCategory

Each feature in the final output receives a MatchCategory label:

Category	Meaning
Unique	This m/z matched exactly one compound
Multiple	This m/z initially matched multiple compounds (even if only the winner remains after filtering)

MatchCategory is assigned after redundancy removal, so it reflects the initial ambiguity, not the final state.

Output

Stage5_curated_results.txt — the final curated annotation table. Contains the same columns as Stage 4b plus MatchCategory.

Output Files Summary

Stage	File	Description
1	`Stage1_mass_matched.txt`	Raw mass matching results (all putative adduct annotations)
1	`Stage1_peak_clusters.txt`	Peak module and RT cluster assignments
2	`Stage2_isotope_detection.txt`	Annotations expanded with detected isotope rows
3	`Stage3_chemical_scores.txt`	Chemical scores from multi-evidence evaluation
3b	`Stage3_HMDB_pathways.txt`	Pathway-enriched scores (HMDB mode)
3b	`Stage3_custom_pathways.txt`	Pathway-enriched scores (custom mode)
3b	`Stage3_pathway_skipped.txt`	Scores without pathway enrichment (skip mode)
4a	`Stage4a_confidence_levels.txt`	All rows with confidence levels (audit trail)
4b	`Stage4b_confidence_levels.txt`	Coherent rows only (production output)
5	`Stage5_curated_results.txt`	Final curated results after redundancy filtering

All files are tab-delimited text. Columns vary by stage but always include: mz, time/rt, compound_id, Adduct, score (from Stage 3 onward), Confidence and Confidence_Level (from Stage 4 onward).

Parameter Decision Guide

Before diving into the full parameter reference table, use the sections below to identify which parameters to adjust for your specific situation.

Adjust for your instrument

mass_tolerance — Decrease to 2–3 ppm for Orbitrap and other high-resolution instruments where mass accuracy is tight. Increase to 10–15 ppm for QTOF or lower-resolution instruments where wider windows are needed to capture true matches.
adduct_table — Filter the default 110-adduct table down to adducts relevant to your ionization mode and chromatographic conditions. Running positive-mode C18 data against all 110 adducts (including negative-mode and exotic species) generates unnecessary false matches.

Adjust for your study design

n_workers — Set to the number of available CPU cores minus one; more cores speed up the WGCNA module detection step, which is the most computationally expensive part of the pipeline.
pathway_mode — Use "HMDB" for human metabolomics studies where compound IDs are HMDB identifiers. Use "custom" with your own pathway table for non-HMDB databases (e.g., pesticide libraries, plant metabolite databases). Use "skip" for exposomics or targeted panels where pathway enrichment is not meaningful.
boosted_compounds — Provide a data frame of verified standards (compound IDs with known m/z and RT) to anchor confidence calibration; these are elevated to Confidence 4 and serve as positive controls in your results.

Tune if results look wrong

correlation_threshold — Decrease to 0.5–0.6 if WGCNA produces too few modules (most features land in a single module). Increase to 0.8–0.9 if modules are too large and mix unrelated features.
min_cluster_size — Decrease below 10 if small datasets produce too few modules. Increase above 10 to reduce noise modules in large studies with thousands of features.
time_tolerance — Decrease below 10 seconds for narrow chromatographic peaks (e.g., UHPLC C18). Increase above 10 seconds for broad peaks or HILIC separations where co-eluting adducts span a wider RT range.
intensity_deviation_tolerance — Increase to 0.2–0.3 for noisy or low-abundance data where isotope intensity ratios are imprecise. Decrease to 0.05 for clean data from standards or high-abundance metabolites.
filter_by — Set to the expected primary adduct for your ionization mode ("M+H" for positive, "M-H" for negative) to gate confidence levels. Set to NULL for unbiased scoring where no single adduct is expected to dominate.

Key Parameters Reference

`advanced_annotation()` parameters

Parameter	Default	Stage	Description
`mass_tolerance`	5e-6 (5 ppm)	1	Relative mass tolerance for adduct matching
`adduct_table`	110 built-in adducts	1	Adduct definitions (charge, factor, mass)
`feature_id_column`	NULL	all	Column name for custom feature IDs to preserve in all stage outputs
`correlation_threshold`	0.7	1.5, 3	Module merge threshold; minimum correlation for scoring evidence
`deep_split`	2	1.5	WGCNA tree cutting depth (0=conservative, 4=aggressive)
`min_cluster_size`	10	1.5	Minimum peaks per WGCNA module
`network_type`	"unsigned"	1.5	WGCNA network topology type
`peak_rt_width`	1	1.5	Gaussian KDE bandwidth for RT clustering (seconds)
`intensity_deviation_tolerance`	0.1	2	Allowed deviation in isotope intensity ratio (10%)
`mass_defect_tolerance`	0.1	2	Mass defect window (Da)
`mass_defect_precision`	0.01	2	Bin width for mass defect calculation
`isotope_mass_tolerance`	NULL (→ `mass_tolerance`)	2	PPM tolerance for isotope mass matching
`maximum_isotopes`	10	4	Maximum isotope peaks per compound
`time_tolerance`	10	2, 3, 3b, 4	RT window for grouping features (seconds)
`MplusH_abundance_ratio_check`	TRUE	3	Require M+H/M−H to be most abundant adduct
`multimer_abundance_check`	TRUE	4	Penalize multimers (2M, 3M) more abundant than monomer
`adduct_weights`	NULL (all adducts, weight=5)	3, 3b, 4, 5	Adduct priority weights for scoring and conflict resolution
`filter_by`	c("M−H", "M+H")	3, 4	Expected primary adducts (gates certain confidence levels)
`min_ions_per_chemical`	2	4	Minimum annotation rows per compound
`level1_primary_adducts`	c("M+H", "M−H")	4	Adducts that qualify for Confidence Level 1
`pathway_mode`	"HMDB"	3b	Pathway analysis mode: "HMDB", "custom", or "skip"
`pathway_data`	NULL	3b	Custom pathway table (required if `pathway_mode = "custom"`)
`excluded_pathways`	NULL	3b	Pathway IDs to exclude from enrichment analysis
`excluded_pathway_compounds`	NULL	3b	Compound IDs to exclude from pathway analysis
`redundancy_filtering`	TRUE	5	Enable/disable Stage 5 multi-match resolution
`boosted_compounds`	NULL	4	User-verified compound IDs (elevated to Confidence 4)
`boost_match_by`	c("mz", "rt")	4	How to match boosted compounds: by mz, rt, or both
`boost_mass_tolerance`	NULL (→ `mass_tolerance`)	4	Mass tolerance for boost matching
`boost_time_tolerance`	NULL (→ `time_tolerance`)	4	RT tolerance for boost matching
`identify_isotopologues_flag`	TRUE	4	Add isotopologue identification labels
`outloc`	`tempdir()`	all	Output directory for stage files
`n_workers`	`parallel::detectCores()`	1.5	Number of parallel workers for WGCNA

Score interpretation guide

Score range	Typical evidence
0	No corroborating evidence; mass match only
1–10	Single adduct, low correlation
10–100	Multiple adducts or weighted adduct with good correlation
100–1,000	Strong multi-adduct evidence or isotope confirmation
1,000–10,000	Multiple weighted adducts + isotopes + high correlation
>10,000	Full evidence: weighted adducts + isotopes + high confidence level boost

Common Issues

Symptom: Most annotations have Confidence 0 (None), even for features you expect to annotate well. Likely cause: The filter_by parameter does not include adducts present in your adduct_table, so the Stage 4 decision tree hits the "no filter match" branch and assigns Confidence 0. Alternatively, mass_tolerance may be too tight, producing too few mass matches for multi-evidence scoring to work. Fix: Verify that filter_by contains an adduct that is also in your adduct_table (e.g., filter_by = "M+H" for positive mode). If mass matches are sparse, widen mass_tolerance (e.g., from 5e-6 to 10e-6).

Symptom: All or most features are assigned to a single WGCNA module, and Stage1_peak_clusters.txt shows only one or two module numbers. Likely cause: correlation_threshold is too low (modules that should be separate get merged) or min_cluster_size is too large (small modules are absorbed into larger ones). Fix: Increase correlation_threshold to 0.8 or 0.9 so only highly correlated features merge. Decrease min_cluster_size below 10 if your dataset has fewer features. Also check that the peak table has enough samples (columns) for meaningful correlation — WGCNA needs at least ~15–20 samples to produce stable modules.

Symptom: No pathway enrichment is occurring — scores in Stage3_HMDB_pathways.txt are identical to Stage3_chemical_scores.txt. Likely cause: pathway_mode is set to "skip", or compound IDs in your database do not match the HMDB identifier format (e.g., "HMDB0000001") used by the internal pathway table. Fix: Set pathway_mode = "HMDB" and ensure your compound table's compound_id column uses HMDB-format IDs. If using a non-HMDB database, set pathway_mode = "custom" and provide a pathway_data data frame with compound and pathway columns mapping your compound IDs to pathway identifiers.

Symptom: Chemical scores are uniformly low (all below 10), resulting in most annotations stuck at Confidence 0 or 1. Likely cause: Few adducts are being detected per compound (most compounds match only one adduct), correlations are weak, or time_tolerance is too narrow for features to be grouped together. When compounds have only a single adduct and no isotope confirmation, the base score formula produces small values. Fix: Widen time_tolerance (e.g., from 10 to 15–20 seconds) to allow more features to be grouped. Check that your adduct_table includes the adducts expected for your chromatographic conditions. If correlations are weak, verify that your peak table has sufficient biological replicates (not just technical replicates or QC samples).

Symptom: Stage5_curated_results.txt has far fewer rows than Stage4b_confidence_levels.txt. Likely cause: This is expected behavior when many features match multiple compounds. Stage 5 resolves each multi-match feature to a single winner, dropping the competing annotations. The MatchCategory column in the Stage 5 output distinguishes features that were unique matches from those that had competition. Fix: No fix needed — this is the pipeline working as intended. To see what was dropped, compare Stage4b_confidence_levels.txt against Stage5_curated_results.txt. Features marked MatchCategory = "Multiple" in Stage 5 had competing annotations that were resolved in favor of the retained compound.

FilesExpand file tree

xMSannotator_Workflow.md

Latest commit

History

xMSannotator_Workflow.md

File metadata and controls

CLUES.xMSannotator Workflow Reference

Overview

Quick Start

Pipeline Data Flow

Stage 1: Mass Matching

What it does

Adduct equation

Mass tolerance

Golden Rules filters

Adduct table

Output

Stage 1.5: Peak Correlation & Module Detection

Theory

Step A: Pearson correlation matrix

Step B: WGCNA module detection

Step C: RT clustering within modules

Stage 2: Isotope Detection

Theory

Algorithm

Key defaults

Output

Stage 3: Chemical Scoring

Theory

Hub m/z identification

Scoring formula

Isotope evidence boost

RT scaling

Confidence-level boost

Minimum score threshold

Worked example

Output

Stage 3b: Pathway Enrichment

Theory

Fisher's exact test — two levels

Score boost conditions

Pathway databases

Output

Stage 4: Confidence Assignment

Theory

Confidence levels

Assignment algorithm

Confidence upgrade (post-hoc)

Confidence cap (post-hoc)

Coherence filtering

Stage 4a vs 4b

Confidence boosting (user-confirmed compounds)

Stage 5: Redundancy Filtering

Theory

Multi-match resolution algorithm

Adduct weighting

MatchCategory

Output

Output Files Summary

Parameter Decision Guide

Adjust for your instrument

Adjust for your study design

Tune if results look wrong

Key Parameters Reference

advanced_annotation() parameters

Score interpretation guide

Common Issues

`advanced_annotation()` parameters