xMSannotator Input File Formats Guide

This document describes the required input file formats for the CLUES.xMSannotator package.

Compound Table
Peak Table
Adduct Table
Adduct Weights
Pathway Database
Boosted Compounds
Expected Adducts
File Format Support
Converting Existing Databases
Advanced Annotation Function Parameters
Complete Working Example

1. Compound Table (Required)

The compound database containing metabolites to match against your peaks.

Required Columns

Column	Type	Description	Validation
`monoisotopic_mass`	numeric	Exact monoisotopic mass (Da)	Must be numeric
`molecular_formula`	character	Molecular formula	Must be character string
`name`	character	Compound name	Must be character string

Plus ONE of the following identifier columns:

Column	Type	Description	Validation
`compound_id`	character	Recommended. User-defined compound identifier (e.g., "HMDB0000001", "C00001")	Must be unique
`compound`	numeric	Legacy integer compound identifier	Must be unique, integers only

Identifier Behavior

If compound_id is provided: Your identifiers flow through the entire pipeline and appear as compound_id in all output files. An internal integer compound column is auto-generated.
If only compound is provided: Legacy mode. The compound_id in outputs will be formatted as "Formula_1", "Formula_2", etc.

Example: With compound_id (Recommended)

compound_table <- data.frame(
  compound_id = c("HMDB0000122", "HMDB0000243", "HMDB0000169", "HMDB0000044", "HMDB0000094"),
  monoisotopic_mass = c(180.0634, 132.0423, 146.0579, 176.0321, 192.0270),
  molecular_formula = c("C6H12O6", "C4H8O5", "C5H10O5", "C6H8O6", "C6H8O7"),
  name = c("Glucose", "Threonic acid", "Ribose", "Ascorbic acid", "Citric acid")
)

Output compound_id column will contain: "HMDB0000122", "HMDB0000243", etc.

Example: Legacy Mode (numeric compound)

compound_table <- data.frame(
  compound = c(1, 2, 3, 4, 5),
  monoisotopic_mass = c(180.0634, 132.0423, 146.0579, 176.0321, 192.0270),
  molecular_formula = c("C6H12O6", "C4H8O5", "C5H10O5", "C6H8O6", "C6H8O7"),
  name = c("Glucose", "Threonic acid", "Ribose", "Ascorbic acid", "Citric acid")
)

Output compound_id column will contain: "Formula_1", "Formula_2", etc.

Example: CSV File with compound_id

compound_id,monoisotopic_mass,molecular_formula,name
HMDB0000122,180.0634,C6H12O6,Glucose
HMDB0000243,132.0423,C4H8O5,Threonic acid
HMDB0000169,146.0579,C5H10O5,Ribose
HMDB0000044,176.0321,C6H8O6,Ascorbic acid
HMDB0000094,192.0270,C6H8O7,Citric acid

Example: Parquet Loading

compound_table <- load_compound_table_parquet("compounds.parquet")

Notes

Use compound_id for meaningful identifiers that will appear in your output files
monoisotopic_mass is the neutral exact mass, NOT the m/z value
molecular_formula must follow standard notation (e.g., "C6H12O6", not "C6 H12 O6")
The formula is used for:
- Golden rules validation (element ratio checks)
- Isotope pattern calculation
- Water-loss adduct validation

2. Peak Table (Required)

Your LC-MS peak data with m/z values, retention times, and intensities.

Required Columns

Column	Type	Description	Validation
`mz`	numeric	Measured m/z value	Required
`rt`	numeric	Retention time (seconds)	Required
`peak`	integer	Unique peak identifier	Auto-generated if missing

Optional Columns (for advanced annotation)

Column	Type	Description
`<sample1>`	numeric	Intensity for sample 1
`<sample2>`	numeric	Intensity for sample 2
...	numeric	Additional sample intensities

Example: Minimal Peak Table

peak_table <- data.frame(
  mz = c(181.0707, 133.0496, 147.0652, 177.0394, 193.0343),
  rt = c(120.5, 85.3, 95.2, 142.8, 156.1)
)
# 'peak' column will be auto-generated

Example: Full Peak Table with Intensities

peak_table <- data.frame(
  mz = c(181.0707, 133.0496, 147.0652, 177.0394, 193.0343),
  rt = c(120.5, 85.3, 95.2, 142.8, 156.1),
  sample_ctrl_1 = c(50000, 32000, 18000, 45000, 28000),
  sample_ctrl_2 = c(48000, 35000, 19500, 43000, 30000),
  sample_ctrl_3 = c(52000, 30000, 17500, 47000, 26000),
  sample_treat_1 = c(25000, 38000, 22000, 42000, 35000),
  sample_treat_2 = c(27000, 40000, 20000, 44000, 33000),
  sample_treat_3 = c(24000, 36000, 23000, 40000, 37000)
)

Example: CSV File

mz,rt,sample_ctrl_1,sample_ctrl_2,sample_treat_1,sample_treat_2
181.0707,120.5,50000,48000,25000,27000
133.0496,85.3,32000,35000,38000,40000
147.0652,95.2,18000,19500,22000,20000
177.0394,142.8,45000,43000,42000,44000
193.0343,156.1,28000,30000,35000,33000

Notes

Intensities are required for advanced_annotation() (used for correlation analysis)
Intensities are NOT required for simple_annotation()
All columns must be numeric
The peak identifier is auto-generated using lexicographic ordering of (mz, rt)

Optional: Custom Feature ID Column

If your peak table includes a custom feature identifier column (e.g., "FeatureID" with values like "C0001", "C0005"), you can preserve it through the annotation pipeline using the feature_id_column parameter in advanced_annotation().

Column	Type	Description
`<custom_id>`	character/numeric	Custom feature identifier (any column name)

Example peak table with custom feature ID:

peak_table <- data.frame(
  FeatureID = c("C0001", "C0002", "C0003", "C0004", "C0005"),
  mz = c(181.0707, 133.0496, 147.0652, 177.0394, 193.0343),
  rt = c(120.5, 85.3, 95.2, 142.8, 156.1),
  sample_1 = c(50000, 32000, 18000, 45000, 28000),
  sample_2 = c(48000, 35000, 19500, 43000, 30000)
)

When you specify feature_id_column = "FeatureID" in advanced_annotation(), this column will be included in all stage output files (Stage 1-5).

3. Adduct Table (Optional)

Defines the adducts to consider during annotation. A default table is provided.

Required Columns

Column	Type	Description	Example
`adduct`	character	Adduct name	"M+H"
`charge`	integer	Charge state	1, -1, 2
`factor`	integer	Number of molecules	1 for M+H, 2 for 2M+H
`mass`	numeric	Mass shift from neutral	1.007276 for +H

m/z Calculation Formula

expected_mz = (factor × monoisotopic_mass + mass) / |charge|

Example: Common Adducts

adduct_table <- data.frame(
  adduct = c("M+H", "M+Na", "M+K", "M+NH4", "M-H", "M+Cl", "M+FA-H", "2M+H", "M-H2O+H"),
  charge = c(1L, 1L, 1L, 1L, -1L, -1L, -1L, 1L, 1L),
  factor = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L),
  mass = c(
    1.007276,    # M+H: +proton
    22.989218,   # M+Na: +sodium
    38.963158,   # M+K: +potassium
    18.033823,   # M+NH4: +ammonium
    -1.007276,   # M-H: -proton
    34.969402,   # M+Cl: +chloride
    44.998201,   # M+FA-H: +formate -H
    1.007276,    # 2M+H: dimer + proton
    -17.026549   # M-H2O+H: water loss + proton
  )
)

Example: Positive Mode Only

adduct_table_pos <- data.frame(
  adduct = c("M+H", "M+Na", "M+K", "M+NH4", "2M+H", "M-H2O+H"),
  charge = c(1L, 1L, 1L, 1L, 1L, 1L),
  factor = c(1L, 1L, 1L, 1L, 2L, 1L),
  mass = c(1.007276, 22.989218, 38.963158, 18.033823, 1.007276, -17.026549)
)

Example: Negative Mode Only

adduct_table_neg <- data.frame(
  adduct = c("M-H", "M+Cl", "M+FA-H", "M-H2O-H", "2M-H"),
  charge = c(-1L, -1L, -1L, -1L, -1L),
  factor = c(1L, 1L, 1L, 1L, 2L),
  mass = c(-1.007276, 34.969402, 44.998201, -19.01839, -1.007276)
)

Common Adduct Mass Shifts Reference

Adduct	Mass Shift	Mode
M+H	+1.007276	Positive
M+Na	+22.989218	Positive
M+K	+38.963158	Positive
M+NH4	+18.033823	Positive
M-H	-1.007276	Negative
M+Cl	+34.969402	Negative
M+FA-H	+44.998201	Negative
M+ACN+H	+42.033823	Positive
M-H2O+H	-17.026549	Positive
M-H2O-H	-19.01839	Negative

4. Adduct Weights (Optional)

Assigns priority weights to adducts for confidence scoring and filtering.

Required Columns

Column	Type	Description
`Adduct` or `adduct`	character	Adduct name
`Weight` or `weight`	numeric	Priority weight (higher = more important)

Example: Standard Weights

adduct_weights <- data.frame(
  Adduct = c("M+H", "M-H", "M+Na", "M+K", "M+NH4", "M+Cl"),
  Weight = c(5, 5, 3, 2, 3, 2)
)

Example: Prioritize Protonated Forms

adduct_weights <- data.frame(
  Adduct = c("M+H", "M-H"),
  Weight = c(10, 10)
)

Notes

Higher weights give higher priority during redundancy filtering
Compounds with higher-weighted adducts are retained preferentially
Default if not provided: M+H and M-H with weight 1

5. Pathway Database (Optional)

For pathway enrichment scoring. The advanced_annotation() function supports three pathway modes:

HMDB mode (default): Uses built-in HMDB pathway data
Custom mode: Uses user-provided pathway data
Skip mode: Bypasses pathway matching entirely

Required Columns for Custom Pathway Data

Column	Type	Description
`compound`	character	Compound identifier (must match `compound_table$compound`)
`pathway`	character	Pathway identifier

Example: Custom Pathway Mapping

# Custom pathway data for use with pathway_mode = "custom"
pathway_data <- data.frame(
  compound = c("1", "1", "2", "2", "3"),  # Must match your compound_table IDs
  pathway = c("glycolysis", "tca_cycle", "glycolysis", "pentose_phosphate", "tca_cycle")
)

Example: KEGG-style Mapping

pathway_data <- data.frame(
  compound = c("1", "1", "2", "2", "3"),
  pathway = c("map00010", "map00500", "map00010", "map00020", "map00020")
)

Using Custom Pathways in advanced_annotation()

# Option 1: Skip pathway matching entirely (simplest for custom databases)
result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = my_compounds,
  pathway_mode = "skip"
)

# Option 2: Provide custom pathway data
my_pathways <- data.frame(
  compound = c("1", "1", "2", "3"),
  pathway = c("glycolysis", "tca_cycle", "glycolysis", "pentose")
)

result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = my_compounds,
  pathway_mode = "custom",
  pathway_data = my_pathways
)

# Option 3: Use HMDB pathways (default, requires HMDB compound IDs)
result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = hmdb_compounds  # Must have HMDB IDs
)

Filtering Pathways

You can exclude specific pathways or compounds from the analysis:

result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = my_compounds,
  pathway_mode = "custom",
  pathway_data = my_pathways,
  excluded_pathways = c("pathway_to_skip"),           # Pathways to exclude
  excluded_pathway_compounds = c("compound_to_skip")  # Compounds to exclude
)

Notes

Same compound can appear in multiple pathways (multiple rows)
The compound column values must match your compound_table$compound values (converted to character)
The enrichment algorithm boosts scores for compounds in shared pathways
Use pathway_mode = "skip" when using custom compound databases without pathway information

6. Boosted Compounds (Optional)

List of known/validated compounds to boost to confidence level 4. Boosted compounds receive Confidence=4 (labeled "Confirmed") and their scores are multiplied by 100.

Required Columns

Column	Type	Required	Description
`compound_id`	character	Yes	Must match `compound_id` in compound_table (appears as `compound_id` in outputs)
`mz`	numeric	Conditional	m/z value for proximity matching (required if "mz" in `boost_match_by`)
`rt`	numeric	Conditional	Retention time in seconds (required if "rt" in `boost_match_by`)

Related Parameters in advanced_annotation()

Parameter	Type	Default	Description
`boosted_compounds`	data.frame	`NULL`	Table of compounds to boost
`boost_match_by`	character	`c("mz", "rt")`	Which columns to use for matching: `c("mz")`, `c("rt")`, or `c("mz", "rt")`
`boost_mass_tolerance`	numeric	same as `mass_tolerance`	Fractional tolerance for mz matching (e.g., `5e-6` = 5 ppm)
`boost_time_tolerance`	numeric	same as `time_tolerance`	Seconds tolerance for RT matching

Matching Behavior

mz + rt matching (boost_match_by = c("mz", "rt")): Annotation must match compound_id AND be within mz tolerance AND within RT tolerance
rt-only matching (boost_match_by = c("rt")): Annotation must match compound_id AND be within RT tolerance
mz-only matching (boost_match_by = c("mz")): Annotation must match compound_id AND be within mz tolerance

Example: With mz and RT Matching (Default)

# Boost compounds that match by ID, mz, and RT
boosted_compounds <- data.frame(
  compound_id = c("HMDB0000122", "HMDB0000158", "HMDB0000167"),
  mz = c(181.0707, 147.0652, 133.0496),
  rt = c(120.5, 95.2, 85.3)
)

result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = my_compounds,
  boosted_compounds = boosted_compounds,
  boost_match_by = c("mz", "rt"),    # default
  boost_mass_tolerance = 5e-6,       # 5 ppm (defaults to mass_tolerance if not specified)
  boost_time_tolerance = 10          # 10 seconds (defaults to time_tolerance if not specified)
)

Example: RT-Only Matching

# Boost compounds that match by ID and RT only (ignore mz differences)
boosted_compounds <- data.frame(
  compound_id = c("HMDB0000122", "HMDB0000158", "HMDB0000167"),
  rt = c(120.5, 95.2, 85.3)  # mz column not needed
)

result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = my_compounds,
  boosted_compounds = boosted_compounds,
  boost_match_by = c("rt"),          # RT-only matching
  boost_time_tolerance = 15          # 15 second tolerance
)

Example: mz-Only Matching

# Boost compounds that match by ID and mz only (ignore RT differences)
boosted_compounds <- data.frame(
  compound_id = c("HMDB0000122", "HMDB0000158", "HMDB0000167"),
  mz = c(181.0707, 147.0652, 133.0496)  # rt column not needed
)

result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = my_compounds,
  boosted_compounds = boosted_compounds,
  boost_match_by = c("mz"),          # mz-only matching
  boost_mass_tolerance = 10e-6       # 10 ppm tolerance
)

Notes

The compound_id column must match the compound_id values in your compound_table
Boosted annotations receive Confidence=4 ("Confirmed", highest level) and score×100
Tolerance parameters use the same format as main parameters: fractional for mass (e.g., 5e-6 = 5 ppm), seconds for time

7. Expected Adducts (Optional)

List of adducts that are expected/required for high confidence annotations.

Required Columns

Column	Type	Description
`adduct`	character	Adduct name

Example

expected_adducts <- data.frame(
  adduct = c("M+H", "M-H")
)

Loading from CSV

expected_adducts <- load_expected_adducts_csv("expected_adducts.csv")

8. File Format Support

Supported Formats

Format	Read Function	Write Function	Notes
Parquet	`load_*_parquet()`	`save_parquet()`	Recommended for large files
CSV	`read.csv()` / `readr::read_csv()`	`write.csv()`	Universal compatibility
RDA	`load()`	`save()`	Native R format
Data Frame	Direct use	-	In-memory

Parquet Loading Functions

# Compound table
compound_table <- load_compound_table_parquet("compounds.parquet")

# Peak table
peak_table <- load_peak_table_parquet("peaks.parquet")

# Adduct table
adduct_table <- load_adduct_table_parquet("adducts.parquet")

Saving to Parquet

save_parquet(compound_table, "compounds.parquet")
save_parquet(peak_table, "peaks.parquet")
save_parquet(annotation_results, "results.parquet")

9. Converting Existing Databases

From HMDB (hmdbAllinf.rda)

# Load original HMDB data
load("xMSannotator-master/data/hmdbAllinf.rda")

# Convert to required format
compound_table <- data.frame(
  compound = seq_len(nrow(hmdbAllinf)),
  monoisotopic_mass = as.numeric(hmdbAllinf$MonoisotopicMass),
  molecular_formula = as.character(hmdbAllinf$Formula),
  name = as.character(hmdbAllinf$Name)
)

# Save as Parquet
save_parquet(compound_table, "hmdb_compounds.parquet")

From KEGG

# Assuming KEGG data with columns: KEGG_ID, ExactMass, Formula, Name
kegg_data <- read.csv("kegg_compounds.csv")

compound_table <- data.frame(
  compound = seq_len(nrow(kegg_data)),
  monoisotopic_mass = kegg_data$ExactMass,
  molecular_formula = kegg_data$Formula,
  name = kegg_data$Name
)

From LipidMaps

# Assuming LipidMaps data
lipidmaps_data <- read.csv("lipidmaps.csv")

compound_table <- data.frame(
  compound = seq_len(nrow(lipidmaps_data)),
  monoisotopic_mass = lipidmaps_data$EXACT_MASS,
  molecular_formula = lipidmaps_data$FORMULA,
  name = lipidmaps_data$COMMON_NAME
)

From Custom CSV

# Your CSV with any column names
my_data <- read.csv("my_database.csv")

compound_table <- data.frame(
  compound = seq_len(nrow(my_data)),
  monoisotopic_mass = my_data$mass,  # Map your column name
  molecular_formula = my_data$formula,
  name = my_data$compound_name
)

From MassBank/mzCloud Export

# Typical spectral library export format
massbank <- read.csv("massbank_export.csv")

compound_table <- data.frame(
  compound = seq_len(nrow(massbank)),
  monoisotopic_mass = massbank$EXACT_MASS,
  molecular_formula = massbank$MOLECULAR_FORMULA,
  name = massbank$COMPOUND_NAME
)

10. Advanced Annotation Function Parameters

The advanced_annotation() function accepts the following parameters:

Required Parameters

Parameter	Type	Description
`peak_table`	data.frame	Feature table with m/z, retention time, and intensity columns
`compound_table`	data.frame	Database of compounds with names, formulas, and monoisotopic masses

Optional Parameters

Parameter	Type	Default	Description
`adduct_table`	data.frame	`NULL`	Table of adducts with name, factor, charge, and mass columns. Uses `sample_adduct_table` if NULL
`adduct_weights`	data.frame	`NULL`	Weights for prioritizing specific adducts. Auto-generated with weight=5 if NULL
`feature_id_column`	character	`NULL`	Name of column in peak_table containing custom feature identifiers to preserve in all stage outputs
`intensity_deviation_tolerance`	numeric	`0.1`	Tolerance for intensity deviation in isotope matching (10%)
`mass_tolerance`	numeric	`5e-6`	Mass accuracy as fractional (relative) tolerance. Use `5e-6` for 5 ppm, `10e-6` for 10 ppm. Do NOT enter direct ppm values.
`isotope_mass_tolerance`	numeric	`NULL`	Fractional tolerance for isotope m/z matching. Defaults to `mass_tolerance`. Use `5e-6` for 5 ppm.
`mass_defect_tolerance`	numeric	`0.1`	Tolerance for mass defect matching
`mass_defect_precision`	numeric	`0.01`	Precision for binning mass defects
`time_tolerance`	numeric	`10`	Retention time tolerance in seconds
`peak_rt_width`	numeric	`1`	Expected chromatographic peak width for RT clustering
`correlation_threshold`	numeric	`0.7`	Minimum correlation for peak grouping
`MplusH_abundance_ratio_check`	logical	`TRUE`	Requires secondary adducts to have lower intensity than M+H/M-H during chemical scoring. Set to FALSE to disable.
`multimer_abundance_check`	logical	`TRUE`	Checks that multimer adducts (2M, 3M) have lower intensity than the monomer during confidence assignment. Set to FALSE to disable.
`deep_split`	integer	`2`	WGCNA parameter controlling cluster splitting (0-4, higher = more clusters)
`min_cluster_size`	integer	`10`	Minimum peaks per module in WGCNA clustering
`maximum_isotopes`	integer	`10`	Maximum isotope peaks to consider per compound
`min_ions_per_chemical`	integer	`2`	Minimum ions required to annotate a chemical
`filter_by`	character vector	`c("M-H", "M+H")`	Primary adducts for confidence scoring
`level1_primary_adducts`	character vector	`c("M+H", "M-H")`	Adducts that qualify for Confidence 1 as a single match in the evidence cap. Independent of `filter_by`.
`network_type`	character	`"unsigned"`	WGCNA network type ("unsigned", "signed", or "signed hybrid")
`redundancy_filtering`	logical	`TRUE`	Whether to remove redundant annotations
`identify_isotopologues_flag`	logical	`TRUE`	Use enviPat to identify specific isotopologue substitutions (e.g., 13C:1 vs 15N:1) for isotope peaks. Adds `isotopologue` and `isotopologue_quality` columns to output. Requires `enviPat` package; gracefully skips if not installed.
`pathway_mode`	character	`"HMDB"`	Pathway matching mode: "HMDB" (default), "custom", or "skip"
`pathway_data`	data.frame	`NULL`	Custom pathway-compound mappings (required if pathway_mode = "custom")
`excluded_pathways`	character vector	`NULL`	Pathways to exclude from analysis
`excluded_pathway_compounds`	character vector	`NULL`	Compounds to exclude from pathway analysis
`boosted_compounds`	data.frame	`NULL`	Table of confirmed compounds to boost to Confidence=4 (see Section 6)
`boost_match_by`	character vector	`c("mz", "rt")`	Which columns to use for boost matching: `c("mz")`, `c("rt")`, or `c("mz", "rt")`
`boost_mass_tolerance`	numeric	same as `mass_tolerance`	Fractional tolerance for boost mz matching (e.g., `5e-6` = 5 ppm)
`boost_time_tolerance`	numeric	same as `time_tolerance`	Seconds tolerance for boost RT matching
`outloc`	character	`tempdir()`	Output directory for intermediate files
`n_workers`	integer	`detectCores()`	Number of parallel workers for WGCNA

Note on mass tolerance format: All mass tolerance parameters (mass_tolerance, isotope_mass_tolerance) use fractional (relative) tolerance notation:

5e-6 = 5 ppm (5 parts per million)
10e-6 = 10 ppm
The matching algorithm uses: |observed - expected| ≤ max(|observed|, |expected|) × tolerance
Do NOT enter ppm values directly (e.g., do NOT use 5 for 5 ppm)

Example Usage

library(CLUES.xMSannotator)

# Run with default parameters
result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = my_compounds
)

# Run with custom parameters
result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = my_compounds,
  adduct_table = my_adducts,
  mass_tolerance = 10e-6,           # 10 ppm (fractional: 10 × 10^-6)
  isotope_mass_tolerance = 5e-6,    # 5 ppm for isotope matching
  time_tolerance = 15,              # 15 seconds
  correlation_threshold = 0.8,      # stricter correlation
  min_ions_per_chemical = 3,        # require more ions
  filter_by = c("M+H"),             # positive mode only
  n_workers = 4                     # limit parallelization
)

# Preserve custom feature IDs through the pipeline
result <- advanced_annotation(
  peak_table = my_peaks,           # has "FeatureID" column
  compound_table = my_compounds,
  feature_id_column = "FeatureID", # preserve this column in all outputs
  outloc = "output/"
)
# All Stage 1-5 output files will include the "FeatureID" column

# Skip pathway matching (for custom compound databases)
result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = custom_compounds,
  pathway_mode = "skip"            # bypass HMDB pathway matching
)

# Use custom pathway database
my_pathways <- data.frame(
  compound = c("1", "1", "2", "3"),
  pathway = c("glycolysis", "tca_cycle", "glycolysis", "pentose")
)

result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = my_compounds,
  pathway_mode = "custom",
  pathway_data = my_pathways,
  excluded_pathways = c("unwanted_pathway")  # optional filtering
)

# Boost confidence of confirmed/validated compounds
validated_compounds <- data.frame(
  compound_id = c("HMDB0000122", "HMDB0000243"),  # Must match compound_table$compound_id
  mz = c(181.0707, 133.0496),                     # Observed m/z values
  rt = c(120.5, 85.3)                             # Observed retention times (seconds)
)

result <- advanced_annotation(
  peak_table = my_peaks,
  compound_table = my_compounds,
  boosted_compounds = validated_compounds,
  boost_match_by = c("rt"),           # Match by compound_id + RT only
  boost_time_tolerance = 15,          # 15 second RT tolerance
  pathway_mode = "skip"
)
# Annotations matching validated compounds will have Confidence=4 ("Confirmed") and score×100

11. Complete Working Example

Simple Annotation

library(CLUES.xMSannotator)

# 1. Create compound database
compound_table <- data.frame(
  compound = 1:5,
  monoisotopic_mass = c(180.0634, 132.0423, 146.0579, 176.0321, 192.0270),
  molecular_formula = c("C6H12O6", "C4H8O5", "C5H10O5", "C6H8O6", "C6H8O7"),
  name = c("Glucose", "Threonic acid", "Ribose", "Ascorbic acid", "Citric acid")
)

# 2. Create peak table
peak_table <- data.frame(
  mz = c(181.0707, 133.0496, 147.0652, 177.0394, 193.0343, 203.0526),
  rt = c(120.5, 85.3, 95.2, 142.8, 156.1, 165.4)
)

# 3. Define adducts (optional - has defaults)
adduct_table <- data.frame(
  adduct = c("M+H", "M+Na", "M-H"),
  charge = c(1L, 1L, -1L),
  factor = c(1L, 1L, 1L),
  mass = c(1.007276, 22.989218, -1.007276)
)

# 4. Run simple annotation
annotation <- simple_annotation(
  peak_table = peak_table,
  compound_table = compound_table,
  adduct_table = adduct_table,
  mass_tolerance = 5e-6  # 5 ppm
)

# 5. View results
print(annotation)

Advanced Annotation

library(CLUES.xMSannotator)

# 1. Load compound database from Parquet
compound_table <- load_compound_table_parquet("my_compounds.parquet")

# 2. Load peak table with intensities
peak_table <- load_peak_table_parquet("my_peaks.parquet")

# 3. Define adducts and weights
adduct_table <- data.frame(
  adduct = c("M+H", "M+Na", "M+K", "M+NH4"),
  charge = c(1L, 1L, 1L, 1L),
  factor = c(1L, 1L, 1L, 1L),
  mass = c(1.007276, 22.989218, 38.963158, 18.033823)
)

adduct_weights <- data.frame(
  adduct = c("M+H", "M+Na", "M+K", "M+NH4"),
  weight = c(5, 3, 2, 3)
)

# 4. Run advanced annotation with custom compound database
# Use pathway_mode = "skip" to bypass HMDB pathway matching
result <- advanced_annotation(
  peak_table = peak_table,
  compound_table = compound_table,
  adduct_table = adduct_table,
  adduct_weights = adduct_weights,
  mass_tolerance = 5e-6,              # 5 ppm
  time_tolerance = 10,                # 10 seconds
  correlation_threshold = 0.7,
  filter_by = c("M+H"),               # positive mode
  pathway_mode = "skip",              # skip HMDB pathway matching
  outloc = "output/"
)

# 5. Or use custom pathway data
my_pathways <- data.frame(
  compound = c("1", "1", "2", "2", "3"),
  pathway = c("glycolysis", "tca_cycle", "glycolysis", "pentose_phosphate", "tca_cycle")
)

result_with_pathways <- advanced_annotation(
  peak_table = peak_table,
  compound_table = compound_table,
  adduct_table = adduct_table,
  adduct_weights = adduct_weights,
  pathway_mode = "custom",
  pathway_data = my_pathways,
  outloc = "output_with_pathways/"
)

# 6. Save results
save_parquet(result, "annotation_results.parquet")
write.csv(result, "annotation_results.csv", row.names = FALSE)

Validation Functions Reference

The package includes validation functions that check your data:

Function	Purpose
`as_peak_table()`	Validates peak table format
`as_compound_table()`	Validates compound table format
`as_adduct_table()`	Validates adduct table format
`as_pathway_table()`	Validates custom pathway table format
`as_expected_adducts_table()`	Validates expected adducts
`load_boost_compounds_csv()`	Loads and validates boosted compounds from CSV
`as_boosted_compounds_table()`	Validates boosted compounds (internal, called by `load_boost_compounds_csv()`)

Manual Validation Example

# Check your data before annotation
tryCatch({
  validated_compounds <- as_compound_table(compound_table)
  validated_peaks <- as_peak_table(peak_table, intensities = TRUE)
  validated_adducts <- as_adduct_table(adduct_table)
  print("All data validated successfully!")
}, error = function(e) {
  print(paste("Validation error:", e$message))
})

Document created: 2026-01-22 Last updated: 2026-03-14 For use with CLUES.xMSannotator v1.0.0

FilesExpand file tree

xMSannotator_Input_Formats.md

Latest commit

History

xMSannotator_Input_Formats.md

File metadata and controls

xMSannotator Input File Formats Guide

Table of Contents

1. Compound Table (Required)

Required Columns

Identifier Behavior

Example: With compound_id (Recommended)

Example: Legacy Mode (numeric compound)

Example: CSV File with compound_id

Example: Parquet Loading

Notes

2. Peak Table (Required)

Required Columns

Optional Columns (for advanced annotation)

Example: Minimal Peak Table

Example: Full Peak Table with Intensities

Example: CSV File

Notes

Optional: Custom Feature ID Column

3. Adduct Table (Optional)

Required Columns

m/z Calculation Formula

Example: Common Adducts

Example: Positive Mode Only

Example: Negative Mode Only

Common Adduct Mass Shifts Reference

4. Adduct Weights (Optional)

Required Columns

Example: Standard Weights

Example: Prioritize Protonated Forms

Notes

5. Pathway Database (Optional)

Required Columns for Custom Pathway Data

Example: Custom Pathway Mapping

Example: KEGG-style Mapping

Using Custom Pathways in advanced_annotation()

Filtering Pathways

Notes

6. Boosted Compounds (Optional)

Required Columns

Related Parameters in advanced_annotation()

Matching Behavior

Example: With mz and RT Matching (Default)

Example: RT-Only Matching

Example: mz-Only Matching

Notes

7. Expected Adducts (Optional)

Required Columns

Example

Loading from CSV

8. File Format Support

Supported Formats

Parquet Loading Functions

Saving to Parquet

9. Converting Existing Databases

From HMDB (hmdbAllinf.rda)

From KEGG

From LipidMaps

From Custom CSV

From MassBank/mzCloud Export

10. Advanced Annotation Function Parameters

Required Parameters

Optional Parameters

Example Usage

11. Complete Working Example

Simple Annotation

Advanced Annotation

Validation Functions Reference

Manual Validation Example