CLUES.xMSannotator

An updated fork of RECETOX/recetox-xMSannotator for automated annotation of untargeted mass spectrometry data, maintained by CLUES-Emory.

Lineage: kuppal2/xMSannotator (original) → RECETOX/recetox-xMSannotator (refactored) → CLUES-Emory/CLUES-xMSannotator (this repo)

Documentation & Examples

Guides

Document	Description
Pipeline Workflow Reference	Complete 5-stage pipeline reference with scoring formulas, parameter guide, and troubleshooting
Input File Formats	API-level specification for all input tables, parameters, and working examples
Input Data Pre-Processing	XCMS feature table, sample mapfile, and compound database formats with pre-processing steps
Stage 5 Output Reference	Column definitions for the primary output file
Changelog	All notable changes, bug fixes, and new features

Example Scripts

Script	Description
Example Runscript	Complete single-database annotation workflow
Multi-DB Runscript	SLURM array job script for multi-database annotation
SLURM Wrapper	Bash/SLURM submission script for dual-polarity batch jobs

recetox-xMSannotator Developer & Legacy Docs

Document	Description
Developer Documentation	Setup, code style, and testing framework
Testing	testthat/patrick testing and code coverage
Modifications	Changes vs. original xMSannotator
Refactoring Patterns	Design patterns used during refactoring
Possible Issues	Known issues from the original codebase
Research Reproducibility	Online data dependencies affecting reproducibility

Pipeline Overview

Figure 1. Overview of the xMSannotator annotation workflow. The pipeline takes a peak table, compound database, and adduct table as inputs and proceeds through five stages and two intermediate steps (1.5 and 3b): (1) brute-force mass matching within ppm tolerance, (1.5) WGCNA co-abundance network analysis with RT sub-clustering, (2) theoretical isotope envelope matching via enviPat, (3) multi-evidence chemical scoring combining adduct count, correlation, isotope confirmation, and RT coherence, (3b) pathway enrichment using Fisher's exact test, (4) confidence assignment (0–3) via decision-tree classification with module and RT coherence filtering, and (5) redundancy filtering to resolve multi-compound matches per feature. User-verified compounds supplied via the boosted_compounds parameter receive Confidence 4. The final output is a curated annotation table with confidence levels, chemical scores, and match categories.

The annotation pipeline runs through five stages via advanced_annotation(). See the Pipeline Workflow Reference for full algorithmic details, scoring formulas, and parameter guidance.

Stage 1 - Mass Matching: Matches observed m/z values to compound databases across all specified adducts. Assigns peaks to correlation-based modules and RT clusters.
Stage 2 - Isotope Detection: Identifies isotopic peaks (M+1, M+2, etc.) based on mass differences, intensity ratios, and RT agreement.
Stage 3 - Chemical Scoring: Scores each compound annotation using adduct correlation evidence, module membership, and isotope support. Optionally integrates pathway enrichment (HMDB or custom).
Stage 4 - Confidence Assignment: Assigns confidence levels (0-4) based on adduct evidence, isotope detection, RT coherence, and module coherence. Enforces hard evidence requirements via a post-hoc cap (see Confidence Levels). Identifies isotopologues (e.g., 13C, 15N substitutions). Adds Confidence_Level text labels to output. Outputs Stage4a (all rows) and Stage4b (coherent rows only); Stage 5 uses Stage4b.
Stage 5 - Redundancy Filtering: Curates annotations by removing redundant entries, keeping the highest-confidence annotation per feature.
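
The ppm-tolerance matching in Stage 1 can be sketched in a few lines of base R. This is a minimal illustration, not the package's implementation: the function name `match_mass()`, the column names, the adduct mass shifts, and the 10 ppm default are all assumptions, and only singly charged, non-multimer adducts are handled.

```r
# Hypothetical inputs; names and values are illustrative only.
peaks <- data.frame(mz = c(181.0707, 203.0526, 300.1000),
                    time = c(120, 121, 400))
compounds <- data.frame(compound_id = "C1",            # placeholder ID
                        monoisotopic_mass = 180.0634)  # C6H12O6
adducts <- data.frame(adduct = c("M+H", "M+Na"),
                      mass_shift = c(1.007276, 22.989218))

match_mass <- function(peaks, compounds, adducts, ppm = 10) {
  # One theoretical m/z per compound x adduct combination (Cartesian product)
  grid <- merge(compounds, adducts, by = NULL)
  grid$expected_mz <- grid$monoisotopic_mass + grid$mass_shift
  # Compare every peak with every theoretical m/z
  hits <- merge(peaks, grid, by = NULL)
  hits$ppm_error <- (hits$mz - hits$expected_mz) / hits$expected_mz * 1e6
  # Keep only the pairs inside the ppm window
  hits[abs(hits$ppm_error) <= ppm, ]
}

res <- match_mass(peaks, compounds, adducts)
print(res[, c("mz", "compound_id", "adduct", "ppm_error")])
```

The full cross join is the "brute-force" part: it is simple to reason about but grows with peaks x compounds x adducts, so a production implementation would likely restrict candidates by m/z range first.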

All intermediate results are saved as tab-delimited text files (Stage1_*.txt through Stage5_*.txt) for inspection.
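
To inspect the intermediate files after a run, something like the following may be enough. It is a sketch that assumes the files sit directly in the output directory and follow the Stage1_*.txt through Stage5_*.txt naming described above; the helper name `read_stage_outputs()` and the `outloc` argument are hypothetical.

```r
# Read every Stage*.txt file in an output directory into a named list.
read_stage_outputs <- function(outloc) {
  files <- list.files(outloc, pattern = "^Stage[0-9].*\\.txt$",
                      full.names = TRUE)
  # Tab-delimited with a header row; keep column names exactly as written
  tables <- lapply(files, read.delim,
                   stringsAsFactors = FALSE, check.names = FALSE)
  names(tables) <- basename(files)
  tables
}

# Usage (path is a placeholder):
# stages <- read_stage_outputs("path/to/output")
# vapply(stages, nrow, integer(1))  # row counts per stage file
```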

Confidence Levels

Figure 2. Overview of confidence level requirements. Annotations are first classified into Confidence 0–3 via an initial decision tree (A), optionally upgraded based on corroborating isotope and adduct evidence (B), then subject to a post-hoc evidence ceiling (C) that can raise or lower the final assignment. Note: Level 4 corresponds to Schymanski Level 1. Levels 3 and 2 can be considered Schymanski Level 4. Level 1 and lower are consistent with Schymanski Level 5.

Each annotation receives a numeric Confidence level (0-4) and a human-readable Confidence_Level label in all output files. The final confidence is determined by hard evidence requirements enforced after all internal scoring steps.

Level	Label	Evidence Required
4	Confirmed	User-confirmed compound (via `boosted_compounds` parameter)
3	High	Isotope rows (M+1, M+2, etc.) + 1 or more base adducts + module/RT coherent
2	Medium	2+ distinct base adducts + module/RT coherent (no isotopes required)
1	Low	Single primary adduct match (default: M+H or M-H)
0	None	Single non-primary adduct, incoherent multi-row evidence, or no match
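
The table above can be read as a small decision function. The sketch below is a simplified reading of the table only, not the pipeline's code: the real assignment also involves the initial decision tree and upgrade steps, and the function and argument names here are hypothetical.

```r
# Simplified mapping from summarised evidence to a confidence level.
assign_confidence <- function(n_base_adducts, has_isotopes, coherent,
                              single_adduct = NA_character_,
                              user_confirmed = FALSE,
                              level1_primary_adducts = c("M+H", "M-H")) {
  if (user_confirmed) return(4L)                       # Confirmed
  if (coherent && has_isotopes && n_base_adducts >= 1) return(3L)  # High
  if (coherent && n_base_adducts >= 2) return(2L)      # Medium
  # Low: exactly one row of evidence, and it is a primary adduct
  single_row <- n_base_adducts == 1 && !has_isotopes
  if (single_row && single_adduct %in% level1_primary_adducts) return(1L)
  0L                                                   # None
}

# Numeric level -> Confidence_Level text label
confidence_labels <- c("None", "Low", "Medium", "High", "Confirmed")
label_confidence <- function(level) confidence_labels[level + 1L]

label_confidence(assign_confidence(2, FALSE, TRUE))  # two coherent adducts
```

Note that incoherent multi-row evidence falls through to level 0 in this reading, matching the last row of the table.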

Module coherence: When a compound has annotations in multiple peak modules, only the rows from the most-represented module are used for evidence evaluation. This prevents stray matches in unrelated modules from inflating confidence.
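
As an illustration of the module coherence rule, the sketch below keeps only the rows from a compound's most-represented module. The column names (`compound_id`, `module`) and the function name are assumptions, and ties are broken by taking the first module, which may differ from the package's behaviour.

```r
# Keep only the rows belonging to the most-represented module
# among one compound's annotation rows.
keep_dominant_module <- function(rows) {
  counts <- table(rows$module)
  dominant <- names(counts)[which.max(counts)]  # first module wins ties
  rows[as.character(rows$module) == dominant, , drop = FALSE]
}

# Hypothetical annotation rows for one compound
rows <- data.frame(compound_id = "C1",
                   adduct = c("M+H", "M+Na", "M+K"),
                   module = c("A", "A", "B"))
coherent_rows <- keep_dominant_module(rows)  # drops the stray "B" row
```

For a full table, the same function could be applied per compound, for example with `do.call(rbind, lapply(split(df, df$compound_id), keep_dominant_module))`.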

RT coherence: All adduct and isotope rows for a compound must fall within time_tolerance of each other to qualify for Confidence 2 or 3.
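
One way to express this check: if every row must be within time_tolerance of every other row, then the spread between the earliest and latest retention time must not exceed the tolerance. The helper below is a sketch under that reading; its name is hypothetical and the units are assumed to match those of the peak table.

```r
# TRUE when all retention times lie within time_tolerance of each other.
is_rt_coherent <- function(rt, time_tolerance) {
  diff(range(rt)) <= time_tolerance
}

is_rt_coherent(c(120, 121, 123), time_tolerance = 5)   # spread of 3
is_rt_coherent(c(120, 140), time_tolerance = 10)       # spread of 20
```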

Primary adducts: The level1_primary_adducts parameter (default: c("M+H", "M-H")) controls which adducts qualify for Confidence 1 as a single match. This is independent of filter_by, allowing filter_by = NULL for equal scoring while still requiring primary ion evidence for Level 1.

For the complete confidence decision tree with worked examples, see the Pipeline Workflow Reference.

Summary of CLUES-Emory Changes

Bug fixes, new features, and code cleanup were completed using Claude Code with Claude Opus.

New Features

Custom pathways: unified pathway scoring supports both HMDB and user-provided pathway databases via pathway_mode parameter
Compound ID support: string compound IDs (e.g., "HMDB0000001") flow through the entire pipeline and appear in all output files
Feature ID passthrough: feature_id_column parameter preserves custom feature identifiers (e.g., "C0001") through all stages
Confirmed compounds: boosted_compounds parameter boosts confidence of confirmed annotations to level 4 (Confirmed), with flexible mz/rt proximity matching
Isotope mass tolerance: separate isotope_mass_tolerance parameter for ppm-based filtering of isotope matches
Isotopologue identification: post-confidence step uses enviPat to identify which specific isotope substitution each isotope peak corresponds to (e.g., 13C:1 vs 15N:1 for M+1 peaks), adding isotopologue and isotopologue_quality columns to output
Evidence-based confidence: hard evidence requirements enforced via post-hoc cap: Confidence 3 requires isotope evidence, Confidence 2 requires multiple adducts, Confidence 1 requires a primary ion match (see Confidence Levels)
Confidence labels: Confidence_Level text column (None/Low/Medium/High/Confirmed) added to all output files
Primary adduct parameter: level1_primary_adducts controls which single-adduct matches qualify for Confidence 1 (default: M+H, M-H), independent of filter_by
Module coherence filtering: compounds spanning multiple peak modules are filtered to the largest module group before confidence evaluation, preventing stray matches from inflating confidence. Stage 4 outputs split into Stage4a (all rows) and Stage4b (coherent only); Stage 5 uses Stage4b
Stage outputs: all intermediate results saved as tab-delimited text files (Stage1 through Stage5) for inspection
Adduct/isotope summaries: console output summarizing adduct detection and isotope detection after each step
Abundance checks: configurable multimer_abundance_check and MplusH_abundance_ratio_check parameters
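
The mz/rt proximity matching mentioned for confirmed compounds could look roughly like the sketch below. Everything here is an assumption made for illustration: the layout of the boosted table, the column names, the tolerances, and the rule that a missing mz or time in the boosted table means "do not constrain on that value". Consult the Input File Formats guide for the actual `boosted_compounds` format.

```r
# Flag annotation rows that match a user-confirmed compound.
# A boosted entry matches when the compound ID agrees and, where the
# boosted table supplies mz and/or time, those are within tolerance.
flag_boosted <- function(annotations, boosted, mz_ppm = 10, rt_tol = 30) {
  vapply(seq_len(nrow(annotations)), function(i) {
    same_id <- boosted$compound_id == annotations$compound_id[i]
    mz_ok <- is.na(boosted$mz) |
      abs(annotations$mz[i] - boosted$mz) / boosted$mz * 1e6 <= mz_ppm
    rt_ok <- is.na(boosted$time) |
      abs(annotations$time[i] - boosted$time) <= rt_tol
    any(same_id & mz_ok & rt_ok)
  }, logical(1))
}

# Hypothetical data
annotations <- data.frame(compound_id = c("C1", "C1", "C2"),
                          mz = c(181.0707, 203.0526, 181.0707),
                          time = c(120, 121, 120))
boosted <- data.frame(compound_id = "C1", mz = 181.0707, time = NA)

is_boosted <- flag_boosted(annotations, boosted)
annotations$Confidence <- ifelse(is_boosted, 4L, NA_integer_)
```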

Bug Fixes

Replaced rcdk (Java/rJava dependency) with enviPat for isotope pattern calculations, eliminating the rJava dependency entirely (no JDK, JAVA_HOME, or R CMD javareconf required)
Removed false Confidence 2 assignments for compounds without corroborating evidence (caused by an unconditional score boost and score-proxy isotope detection)
Fixed confidence assignment bugs: dead code in non-filter adduct path (always-FALSE condition), score zeroing preventing confidence boosts, forward-iterating row deletion in apply_multimer_rules(), NULL adduct weights, unreachable guards, fragile column deletion by position, hardcoded column indices, cbind() type coercion, inconsistent early-return confidence types
Fixed chemical scoring: hardcoded RT tolerance now uses parameter, NA rows from empty results, duplicate rows from per-row instead of per-compound processing, isotope rows incorrectly removed by na.omit()
Fixed isotope handling: isotopes are now preserved through chemical scoring with a 100x score boost; also fixed Stage 2 output column headers and file creation
Fixed crash on charged molecular formulas (e.g., C12H14N2+2 for Paraquat) after the rcdk-to-enviPat migration
Fixed feature_id_column validation error with non-numeric IDs, duplicate feature_id columns in stage outputs, Stage 5 output creation, rm() warnings in get_confidence_stage4()
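
For readers unfamiliar with the forward-iterating row deletion class of bug, the generic example below shows why it silently keeps rows that should be removed. It is not the code from apply_multimer_rules(); the data frame and column names are made up for illustration.

```r
df <- data.frame(id = 1:5, drop = c(FALSE, TRUE, TRUE, FALSE, FALSE))

# Buggy pattern: deleting row i inside a forward loop shifts the
# remaining rows up, so the row that follows each deletion is skipped.
buggy <- df
i <- 1
while (i <= nrow(buggy)) {
  if (buggy$drop[i]) buggy <- buggy[-i, ]
  i <- i + 1
}
buggy$id  # id 3 survives even though drop is TRUE

# Safer pattern: decide once with a logical mask, then subset once.
fixed <- df[!df$drop, ]
fixed$id
```

Iterating backwards would also avoid the skipped row, but the vectorised subset is shorter and does not depend on loop order.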

Code Cleanup

Removed ~1500 lines of dead code not used by advanced_annotation() workflow (get_confidence_stage2.R, multilevelannotationstep2.R, get_chemscorev1.6.71.R, group_by_rt_histv2.R, compute_confidence_levels.R)
Removed experimental permutation-based p-value testing (~650 lines, never production-ready)
Removed all setwd() calls from pipeline functions, replaced with file.path() absolute paths
Removed unused ISgroup column, redundant time.y column, redundant forms_valid_adduct_pair filter, and remove_tmp_files() auto-cleanup
Added roxygen2 documentation to 13 exported functions that lacked man pages
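
The setwd() removal follows a common pattern: build the full output path and pass it to the writer, leaving the caller's working directory untouched. The helper below sketches that pattern; its name and arguments are hypothetical and do not correspond to a function in the package.

```r
# Write a stage table without changing the working directory.
write_stage_output <- function(result, outloc, file_name) {
  dir.create(outloc, recursive = TRUE, showWarnings = FALSE)
  # Resolve to an absolute path so later code does not depend on getwd()
  out_path <- file.path(normalizePath(outloc), file_name)
  write.table(result, out_path, sep = "\t",
              row.names = FALSE, quote = FALSE)
  out_path
}

# Usage with a temporary directory and a made-up file name
wd_before <- getwd()
out_dir <- file.path(tempdir(), "xmsannotator_demo")
path <- write_stage_output(data.frame(mz = 181.0707, time = 120),
                           out_dir, "Stage1_demo.txt")
```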

Installation

The package can be installed from GitHub:

devtools::install_github("CLUES-Emory/CLUES-xMSannotator")

See the Example Runscript for a complete working example, and the Input File Formats guide for all parameter details.

Reference

When using this tool, please cite the original work:

Uppal, Karan, et al. "XMSannotator: An R Package for Network-Based Annotation of High-Resolution Metabolomics Data." Analytical Chemistry, vol. 89, no. 2, Jan. 2017, pp. 1063-67, doi:10.1021/acs.analchem.6b01214.
