An updated fork of RECETOX/recetox-xMSannotator for automated annotation of untargeted mass spectrometry data, maintained by CLUES-Emory.
Lineage: kuppal2/xMSannotator (original) → RECETOX/recetox-xMSannotator (refactored) → CLUES-Emory/CLUES-xMSannotator (this repo)
| Document | Description |
|---|---|
| Pipeline Workflow Reference | Complete 5-stage pipeline reference with scoring formulas, parameter guide, and troubleshooting |
| Input File Formats | API-level specification for all input tables, parameters, and working examples |
| Input Data Pre-Processing | XCMS feature table, sample mapfile, and compound database formats with pre-processing steps |
| Stage 5 Output Reference | Column definitions for the primary output file |
| Changelog | All notable changes, bug fixes, and new features |
| Script | Description |
|---|---|
| Example Runscript | Complete single-database annotation workflow |
| Multi-DB Runscript | SLURM array job script for multi-database annotation |
| SLURM Wrapper | Bash/SLURM submission script for dual-polarity batch jobs |
| Document | Description |
|---|---|
| Developer Documentation | Setup, code style, and testing framework |
| Testing | testthat/patrick testing and code coverage |
| Modifications | Changes vs. original xMSannotator |
| Refactoring Patterns | Design patterns used during refactoring |
| Possible Issues | Known issues from the original codebase |
| Research Reproducibility | Online data dependencies affecting reproducibility |
Figure 1. Overview of the xMSannotator annotation workflow. The pipeline takes a peak table, compound database, and adduct table as inputs and proceeds through five stages with two sub-steps (1.5 and 3b): (1) brute-force mass matching within ppm tolerance, (1.5) WGCNA co-abundance network analysis with RT sub-clustering, (2) theoretical isotope envelope matching via enviPat, (3) multi-evidence chemical scoring combining adduct count, correlation, isotope confirmation, and RT coherence, (3b) pathway enrichment using Fisher's exact test, (4) confidence assignment (0–3) via decision-tree classification with module and RT coherence filtering, and (5) redundancy filtering to resolve multi-compound matches per feature. User-verified compounds supplied via the boosted_compounds parameter receive Confidence 4. The final output is a curated annotation table with confidence levels, chemical scores, and match categories.
The annotation pipeline runs through five stages via advanced_annotation(). See the Pipeline Workflow Reference for full algorithmic details, scoring formulas, and parameter guidance.
- Stage 1 - Mass Matching: Matches observed m/z values to compound databases across all specified adducts. Assigns peaks to correlation-based modules and RT clusters.
- Stage 2 - Isotope Detection: Identifies isotopic peaks (M+1, M+2, etc.) based on mass differences, intensity ratios, and RT agreement.
- Stage 3 - Chemical Scoring: Scores each compound annotation using adduct correlation evidence, module membership, and isotope support. Optionally integrates pathway enrichment (HMDB or custom).
- Stage 4 - Confidence Assignment: Assigns confidence levels (0-4) based on adduct evidence, isotope detection, RT coherence, and module coherence. Enforces hard evidence requirements via a post-hoc cap (see Confidence Levels). Identifies isotopologues (e.g., 13C, 15N substitutions). Adds `Confidence_Level` text labels to output. Outputs Stage4a (all rows) and Stage4b (coherent rows only); Stage 5 uses Stage4b.
- Stage 5 - Redundancy Filtering: Curates annotations by removing redundant entries, keeping the highest-confidence annotation per feature.
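As a rough illustration of the Stage 1 idea, the sketch below matches observed m/z values against theoretical adduct m/z values within a ppm tolerance. The helper name `match_mz_ppm`, the toy tables, and their column names are assumptions made for this example; it is not the package's implementation and ignores charge states and multimers.

```r
# Hypothetical helper (not part of the package): brute-force ppm matching.
match_mz_ppm <- function(observed_mz, compounds, adducts, ppm = 10) {
  # Cartesian product: every compound paired with every adduct
  grid <- merge(compounds, adducts, by = NULL)
  grid$theoretical_mz <- grid$monoisotopic_mass + grid$mass_shift
  hits <- lapply(observed_mz, function(mz) {
    ppm_error <- abs(mz - grid$theoretical_mz) / grid$theoretical_mz * 1e6
    keep <- ppm_error <= ppm
    if (!any(keep)) return(NULL)
    data.frame(mz = mz, grid[keep, c("compound_id", "adduct")],
               ppm_error = ppm_error[keep], stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)
}

# Toy inputs: glucose with two singly charged positive-mode adducts
compounds <- data.frame(compound_id = "HMDB0000122",
                        monoisotopic_mass = 180.063388,
                        stringsAsFactors = FALSE)
adducts <- data.frame(adduct = c("M+H", "M+Na"),
                      mass_shift = c(1.007276, 22.989218),
                      stringsAsFactors = FALSE)

hits <- match_mz_ppm(c(181.0707, 203.0526, 250.0), compounds, adducts, ppm = 10)
print(hits)
```

The third observed value (250.0) has no candidate within tolerance and is therefore absent from `hits`.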
All intermediate results are saved as tab-delimited text files (Stage1_*.txt through Stage5_*.txt) for inspection.
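For inspection, the stage files can be loaded into a named list. This is a minimal sketch: the helper name `read_stage_tables` and the `out_dir` argument are assumptions, and the file pattern is inferred only from the `Stage1_*.txt` through `Stage5_*.txt` naming described above.

```r
# Hypothetical helper: read every tab-delimited Stage*.txt file in a directory.
read_stage_tables <- function(out_dir) {
  files <- list.files(out_dir, pattern = "^Stage[0-9].*\\.txt$",
                      full.names = TRUE)
  # check.names = FALSE keeps column headers exactly as written in the files
  tables <- lapply(files, read.delim, stringsAsFactors = FALSE,
                   check.names = FALSE)
  names(tables) <- basename(files)
  tables
}
```

Calling `names(read_stage_tables(out_dir))` then lists which stage files a run actually produced.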
Figure 2. Overview of confidence level requirements. Annotations are first classified into Confidence 0–3 via an initial decision tree (A), optionally upgraded based on corroborating isotope and adduct evidence (B), then subject to a post-hoc evidence ceiling (C) that can raise or lower the final assignment. Note: Level 4 assignments correspond to Schymanski Level 1 confidence levels. Levels 3 and 2 can be considered Schymanski Level 4. Level 1 and lower are consistent with Schymanski Level 5.
Each annotation receives a numeric Confidence level (0-4) and a human-readable Confidence_Level label in all output files. The final confidence is determined by hard evidence requirements enforced after all internal scoring steps.
| Level | Label | Evidence Required |
|---|---|---|
| 4 | Confirmed | User-confirmed compound (via boosted_compounds parameter) |
| 3 | High | Isotope rows (M+1, M+2, etc.) + 1 or more base adducts + module/RT coherent |
| 2 | Medium | 2+ distinct base adducts + module/RT coherent (no isotopes required) |
| 1 | Low | Single primary adduct match (default: M+H or M-H) |
| 0 | None | Single non-primary adduct, incoherent multi-row evidence, or no match |
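
The table can be read as a small decision function. The sketch below is a simplified, literal reading of the table only; the function name and all argument names are hypothetical, and the pipeline's actual decision tree involves additional steps described in the Pipeline Workflow Reference.

```r
# Hypothetical summary of the evidence table; not the package's code.
evidence_confidence <- function(n_base_adducts, has_isotopes, coherent,
                                has_primary_adduct, user_confirmed = FALSE) {
  if (user_confirmed) return(4L)                      # Confirmed
  # "Multi-row" evidence: several adducts, or an adduct plus isotope rows
  multi_row <- n_base_adducts >= 2 || (n_base_adducts >= 1 && has_isotopes)
  if (multi_row) {
    if (!coherent) return(0L)                         # incoherent evidence
    return(if (has_isotopes) 3L else 2L)              # High or Medium
  }
  if (n_base_adducts == 1 && has_primary_adduct) return(1L)  # Low
  0L                                                  # None
}

# Map the numeric level to its text label
confidence_label <- function(level) {
  c("None", "Low", "Medium", "High", "Confirmed")[level + 1L]
}

confidence_label(evidence_confidence(1, TRUE, TRUE, TRUE))
```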
Module coherence: When a compound has annotations in multiple peak modules, only the rows from the most-represented module are used for evidence evaluation. This prevents stray matches in unrelated modules from inflating confidence.
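
A minimal sketch of this filter for the rows of a single compound is shown below. The column name `module` and the helper name are assumptions, and breaking ties by taking the first module is a choice made for the example, not documented behavior.

```r
# Hypothetical helper: keep only rows from the most-represented module.
keep_dominant_module <- function(rows) {
  counts <- table(rows$module)
  dominant <- names(counts)[which.max(counts)]  # first module wins a tie
  rows[as.character(rows$module) == dominant, , drop = FALSE]
}

rows <- data.frame(compound_id = "C1",
                   module = c("blue", "blue", "red"),
                   adduct = c("M+H", "M+Na", "M+K"),
                   stringsAsFactors = FALSE)
kept <- keep_dominant_module(rows)  # the stray "red" row is dropped
```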
RT coherence: All adduct and isotope rows for a compound must fall within time_tolerance of each other to qualify for Confidence 2 or 3.
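
One way to express this requirement is that the spread between the earliest and latest retention time must not exceed the tolerance. The helper below is a sketch under that interpretation; its name is hypothetical and the RT units are whatever the peak table uses.

```r
# Hypothetical helper: TRUE when all rows fall within time_tolerance of each other.
is_rt_coherent <- function(rt, time_tolerance) {
  diff(range(rt)) <= time_tolerance
}

is_rt_coherent(c(100, 103, 108), time_tolerance = 10)  # spread of 8
is_rt_coherent(c(100, 125), time_tolerance = 10)       # spread of 25
```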
Primary adducts: The level1_primary_adducts parameter (default: c("M+H", "M-H")) controls which adducts qualify for Confidence 1 as a single match. This is independent of filter_by, allowing filter_by = NULL for equal scoring while still requiring primary ion evidence for Level 1.
For the complete confidence decision tree with worked examples, see the Pipeline Workflow Reference.
Bug fixes, new features, and code cleanup were completed using Claude Code with Claude Opus.
- Custom pathways: unified pathway scoring supports both HMDB and user-provided pathway databases via `pathway_mode` parameter
- Compound ID support: string compound IDs (e.g., "HMDB0000001") flow through the entire pipeline and appear in all output files
- Feature ID passthrough: `feature_id_column` parameter preserves custom feature identifiers (e.g., "C0001") through all stages
- Confirmed compounds: `boosted_compounds` parameter boosts confidence of confirmed annotations to level 4 (Confirmed), with flexible mz/rt proximity matching
- Isotope mass tolerance: separate `isotope_mass_tolerance` parameter for ppm-based filtering of isotope matches
- Isotopologue identification: post-confidence step uses `enviPat` to identify which specific isotope substitution each isotope peak corresponds to (e.g., 13C:1 vs 15N:1 for M+1 peaks), adding `isotopologue` and `isotopologue_quality` columns to output
- Evidence-based confidence: hard evidence requirements enforced via post-hoc cap: Confidence 3 requires isotope evidence, Confidence 2 requires multiple adducts, Confidence 1 requires a primary ion match (see Confidence Levels)
- Confidence labels: `Confidence_Level` text column (None/Low/Medium/High/Confirmed) added to all output files
- Primary adduct parameter: `level1_primary_adducts` controls which single-adduct matches qualify for Confidence 1 (default: M+H, M-H), independent of `filter_by`
- Module coherence filtering: compounds spanning multiple peak modules are filtered to the largest module group before confidence evaluation, preventing stray matches from inflating confidence. Stage 4 outputs split into Stage4a (all rows) and Stage4b (coherent only); Stage 5 uses Stage4b
- Stage outputs: all intermediate results saved as tab-delimited text files (Stage1 through Stage5) for inspection
- Adduct/isotope summaries: console output summarizing adduct detection and isotope detection after each step
- Abundance checks: configurable `multimer_abundance_check` and `MplusH_abundance_ratio_check` parameters
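
To illustrate what m/z and RT proximity matching for confirmed compounds can look like, the sketch below flags features that fall within a ppm window and an RT window of any confirmed entry. The function name, the `mz` and `rt` column names, and the default windows are all assumptions; the actual input format of `boosted_compounds` is specified in the Input File Formats guide.

```r
# Hypothetical sketch: flag features near a confirmed (mz, rt) pair.
flag_confirmed <- function(features, confirmed, ppm = 10, rt_window = 30) {
  vapply(seq_len(nrow(features)), function(i) {
    ppm_err <- abs(features$mz[i] - confirmed$mz) / confirmed$mz * 1e6
    rt_diff <- abs(features$rt[i] - confirmed$rt)
    any(ppm_err <= ppm & rt_diff <= rt_window)
  }, logical(1))
}

features <- data.frame(mz = c(181.0707, 181.0707, 203.0526),
                       rt = c(60, 300, 61))
confirmed <- data.frame(mz = 181.070664, rt = 62)

# Only the first feature matches on both m/z and RT
features$Confidence <- ifelse(flag_confirmed(features, confirmed),
                              4L, NA_integer_)
```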
- Replaced `rcdk` (Java/rJava dependency) with `enviPat` for isotope pattern calculations, eliminating the rJava dependency entirely (no JDK, `JAVA_HOME`, or `R CMD javareconf` required)
- Removed false Confidence 2 assignments for compounds without corroborating evidence (unconditional score boost and score-proxy isotope detection)
- Fixed confidence assignment bugs: dead code in non-filter adduct path (always-FALSE condition), score zeroing preventing confidence boosts, forward-iterating row deletion in `apply_multimer_rules()`, NULL adduct weights, unreachable guards, fragile column deletion by position, hardcoded column indices, `cbind()` type coercion, inconsistent early-return confidence types
- Fixed chemical scoring: hardcoded RT tolerance now uses parameter, NA rows from empty results, duplicate rows from per-row instead of per-compound processing, isotope rows incorrectly removed by `na.omit()`
- Fixed isotope handling: isotopes now preserved through chemical scoring with 100x score boost, Stage 2 output column headers and file creation
- Fixed crash on charged molecular formulas (e.g., `C12H14N2+2` for Paraquat) after rcdk to enviPat migration
- Fixed `feature_id_column` validation error with non-numeric IDs, duplicate feature_id columns in stage outputs, Stage 5 output creation, `rm()` warnings in `get_confidence_stage4()`
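
Two of the bug classes named above are easy to reproduce in isolation. The sketch below is a generic illustration on toy data, not code taken from `apply_multimer_rules()` or any other package function, and the fixes shown are common remedies rather than a record of what the fork changed.

```r
df <- data.frame(id = 1:5, drop = c(FALSE, TRUE, TRUE, FALSE, FALSE))

# Buggy pattern: deleting rows inside a forward loop shifts later rows up,
# so the row right after a deleted one is never examined.
buggy <- df
for (i in seq_len(nrow(buggy))) {
  if (i <= nrow(buggy) && buggy$drop[i]) buggy <- buggy[-i, ]
}
buggy$id   # id 3 survives even though it was marked for deletion

# One common fix: filter with a logical vector in a single step
fixed <- df[!df$drop, ]

# cbind() on mixed types builds a character matrix, silently coercing numbers
coerced <- cbind(score = c(1.5, 2), adduct = c("M+H", "M-H"))
# data.frame() keeps each column's own type
kept <- data.frame(score = c(1.5, 2), adduct = c("M+H", "M-H"),
                   stringsAsFactors = FALSE)
```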
- Removed ~1500 lines of dead code not used by `advanced_annotation()` workflow (`get_confidence_stage2.R`, `multilevelannotationstep2.R`, `get_chemscorev1.6.71.R`, `group_by_rt_histv2.R`, `compute_confidence_levels.R`)
- Removed experimental permutation-based p-value testing (~650 lines, never production-ready)
- Removed all `setwd()` calls from pipeline functions, replaced with `file.path()` absolute paths
- Removed unused `ISgroup` column, redundant `time.y` column, redundant `forms_valid_adduct_pair` filter, and `remove_tmp_files()` auto-cleanup
- Added roxygen2 documentation to 13 exported functions that lacked man pages
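
The `setwd()` removal follows a general pattern: build the full output path explicitly instead of changing the working directory. The sketch below shows that pattern with a hypothetical helper name and arguments; it is not the package's own writer function.

```r
# Hypothetical helper: write a stage table without touching the working directory.
write_stage_output <- function(result, out_dir, file_name) {
  # normalizePath() resolves out_dir to an absolute path and errors if it is missing
  out_path <- file.path(normalizePath(out_dir, mustWork = TRUE), file_name)
  write.table(result, out_path, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(out_path)
}
```

Because the caller's working directory never changes, an error partway through a run cannot leave the R session stranded in an output folder.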
The package can be installed from GitHub:
```r
devtools::install_github("CLUES-Emory/CLUES-xMSannotator")
```

See the Example Runscript for a complete working example, and the Input File Formats guide for all parameter details.
When using this tool, please cite the original work:
Uppal, Karan, et al. "XMSannotator: An R Package for Network-Based Annotation of High-Resolution Metabolomics Data." Analytical Chemistry, vol. 89, no. 2, Jan. 2017, pp. 1063-67, doi:10.1021/acs.analchem.6b01214.

