This document describes the required input file formats for the CLUES.xMSannotator package.
- Compound Table
- Peak Table
- Adduct Table
- Adduct Weights
- Pathway Database
- Boosted Compounds
- Expected Adducts
- File Format Support
- Converting Existing Databases
- Advanced Annotation Function Parameters
- Complete Working Example
The compound database containing metabolites to match against your peaks.
| Column | Type | Description | Validation |
|---|---|---|---|
| `monoisotopic_mass` | numeric | Exact monoisotopic mass (Da) | Must be numeric |
| `molecular_formula` | character | Molecular formula | Must be character string |
| `name` | character | Compound name | Must be character string |
Plus ONE of the following identifier columns:
| Column | Type | Description | Validation |
|---|---|---|---|
| `compound_id` | character | Recommended. User-defined compound identifier (e.g., "HMDB0000001", "C00001") | Must be unique |
| `compound` | numeric | Legacy integer compound identifier | Must be unique, integers only |
- If `compound_id` is provided: your identifiers flow through the entire pipeline and appear as `compound_id` in all output files. An internal integer `compound` column is auto-generated.
- If only `compound` is provided: legacy mode. The `compound_id` in outputs will be formatted as "Formula_1", "Formula_2", etc.
compound_table <- data.frame(
compound_id = c("HMDB0000122", "HMDB0000243", "HMDB0000169", "HMDB0000044", "HMDB0000094"),
monoisotopic_mass = c(180.0634, 132.0423, 146.0579, 176.0321, 192.0270),
molecular_formula = c("C6H12O6", "C4H8O5", "C5H10O5", "C6H8O6", "C6H8O7"),
name = c("Glucose", "Threonic acid", "Ribose", "Ascorbic acid", "Citric acid")
)

The output `compound_id` column will contain "HMDB0000122", "HMDB0000243", etc.
compound_table <- data.frame(
compound = c(1, 2, 3, 4, 5),
monoisotopic_mass = c(180.0634, 132.0423, 146.0579, 176.0321, 192.0270),
molecular_formula = c("C6H12O6", "C4H8O5", "C5H10O5", "C6H8O6", "C6H8O7"),
name = c("Glucose", "Threonic acid", "Ribose", "Ascorbic acid", "Citric acid")
)

The output `compound_id` column will contain "Formula_1", "Formula_2", etc.
compound_id,monoisotopic_mass,molecular_formula,name
HMDB0000122,180.0634,C6H12O6,Glucose
HMDB0000243,132.0423,C4H8O5,Threonic acid
HMDB0000169,146.0579,C5H10O5,Ribose
HMDB0000044,176.0321,C6H8O6,Ascorbic acid
HMDB0000094,192.0270,C6H8O7,Citric acid

compound_table <- load_compound_table_parquet("compounds.parquet")

- Use `compound_id` for meaningful identifiers that will appear in your output files
- `monoisotopic_mass` is the neutral exact mass, NOT the m/z value
- `molecular_formula` must follow standard notation (e.g., "C6H12O6", not "C6 H12 O6")
- The formula is used for:
  - Golden rules validation (element ratio checks)
  - Isotope pattern calculation
  - Water-loss adduct validation
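
The notes above can be turned into a quick pre-flight check before handing a table to the package. The sketch below uses base R only; `check_compound_table` is a hypothetical helper written for this illustration, not the package's own validator, and it covers only the column, uniqueness, and formula-notation rules stated in this section.

```r
# Hypothetical pre-check (not part of CLUES.xMSannotator): returns a character
# vector describing problems found; an empty vector means none were found.
check_compound_table <- function(tbl) {
  problems <- character(0)

  # Required descriptive columns
  required <- c("monoisotopic_mass", "molecular_formula", "name")
  missing_cols <- setdiff(required, names(tbl))
  if (length(missing_cols) > 0) {
    problems <- c(problems,
                  paste("missing columns:", paste(missing_cols, collapse = ", ")))
  }

  # One identifier column, with unique values
  id_col <- intersect(c("compound_id", "compound"), names(tbl))
  if (length(id_col) == 0) {
    problems <- c(problems, "need a compound_id or compound column")
  } else if (anyDuplicated(tbl[[id_col[1]]]) > 0) {
    problems <- c(problems, paste(id_col[1], "values are not unique"))
  }

  # Formulas such as "C6 H12 O6" contain whitespace and are rejected here
  if ("molecular_formula" %in% names(tbl) &&
      any(grepl("\\s", tbl$molecular_formula))) {
    problems <- c(problems, "molecular_formula contains whitespace")
  }

  if ("monoisotopic_mass" %in% names(tbl) && !is.numeric(tbl$monoisotopic_mass)) {
    problems <- c(problems, "monoisotopic_mass is not numeric")
  }
  problems
}

bad <- data.frame(
  compound_id = c("HMDB0000122", "HMDB0000122"),
  monoisotopic_mass = c(180.0634, 176.0321),
  molecular_formula = c("C6 H12 O6", "C6H8O6"),
  name = c("Glucose", "Ascorbic acid")
)
print(check_compound_table(bad))
```

For the `bad` table, the check reports the duplicated identifier and the formula containing spaces.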
Your LC-MS peak data with m/z values, retention times, and intensities.
| Column | Type | Description | Validation |
|---|---|---|---|
| `mz` | numeric | Measured m/z value | Required |
| `rt` | numeric | Retention time (seconds) | Required |
| `peak` | integer | Unique peak identifier | Auto-generated if missing |
| Column | Type | Description |
|---|---|---|
| `<sample1>` | numeric | Intensity for sample 1 |
| `<sample2>` | numeric | Intensity for sample 2 |
| ... | numeric | Additional sample intensities |
peak_table <- data.frame(
mz = c(181.0707, 133.0496, 147.0652, 177.0394, 193.0343),
rt = c(120.5, 85.3, 95.2, 142.8, 156.1)
)
# 'peak' column will be auto-generated

peak_table <- data.frame(
mz = c(181.0707, 133.0496, 147.0652, 177.0394, 193.0343),
rt = c(120.5, 85.3, 95.2, 142.8, 156.1),
sample_ctrl_1 = c(50000, 32000, 18000, 45000, 28000),
sample_ctrl_2 = c(48000, 35000, 19500, 43000, 30000),
sample_ctrl_3 = c(52000, 30000, 17500, 47000, 26000),
sample_treat_1 = c(25000, 38000, 22000, 42000, 35000),
sample_treat_2 = c(27000, 40000, 20000, 44000, 33000),
sample_treat_3 = c(24000, 36000, 23000, 40000, 37000)
)

mz,rt,sample_ctrl_1,sample_ctrl_2,sample_treat_1,sample_treat_2
181.0707,120.5,50000,48000,25000,27000
133.0496,85.3,32000,35000,38000,40000
147.0652,95.2,18000,19500,22000,20000
177.0394,142.8,45000,43000,42000,44000
193.0343,156.1,28000,30000,35000,33000

- Intensities are required for `advanced_annotation()` (used for correlation analysis)
- Intensities are NOT required for `simple_annotation()`
- The `mz`, `rt`, and intensity columns must all be numeric
- The `peak` identifier is auto-generated using lexicographic ordering of (mz, rt)
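
To make the ordering rule concrete, here is a small base R sketch. It assumes that identifiers run from 1 to n in ascending `mz` order with `rt` breaking ties; that numbering is an assumption made for illustration, and the package's exact scheme may differ.

```r
# Assumption: peak IDs are 1..n in ascending order of mz, with rt breaking ties.
peaks <- data.frame(
  mz = c(181.0707, 133.0496, 181.0707),
  rt = c(120.5, 85.3, 95.2)
)

# order() sorts by the first key, then by the second within ties
ord <- order(peaks$mz, peaks$rt)

peaks$peak <- integer(nrow(peaks))
peaks$peak[ord] <- seq_along(ord)
print(peaks)
```

The two rows sharing `mz = 181.0707` are ranked by retention time, so the row with `rt = 95.2` receives the lower identifier.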
If your peak table includes a custom feature identifier column (e.g., "FeatureID" with values like "C0001", "C0005"), you can preserve it through the annotation pipeline using the feature_id_column parameter in advanced_annotation().
| Column | Type | Description |
|---|---|---|
| `<custom_id>` | character/numeric | Custom feature identifier (any column name) |
Example peak table with custom feature ID:
peak_table <- data.frame(
FeatureID = c("C0001", "C0002", "C0003", "C0004", "C0005"),
mz = c(181.0707, 133.0496, 147.0652, 177.0394, 193.0343),
rt = c(120.5, 85.3, 95.2, 142.8, 156.1),
sample_1 = c(50000, 32000, 18000, 45000, 28000),
sample_2 = c(48000, 35000, 19500, 43000, 30000)
)

When you specify `feature_id_column = "FeatureID"` in `advanced_annotation()`, this column will be included in all stage output files (Stage 1-5).
Defines the adducts to consider during annotation. A default table is provided.
| Column | Type | Description | Example |
|---|---|---|---|
| `adduct` | character | Adduct name | "M+H" |
| `charge` | integer | Charge state | 1, -1, 2 |
| `factor` | integer | Number of molecules | 1 for M+H, 2 for 2M+H |
| `mass` | numeric | Mass shift from neutral | 1.007276 for +H |
expected_mz = (factor × monoisotopic_mass + mass) / |charge|
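
As a quick check of the formula above, the following base R sketch computes expected m/z values for glucose under a few adducts. The helper name `expected_mz` is an illustrative assumption and is not a function exported by the package.

```r
# Hypothetical helper (not part of CLUES.xMSannotator): applies the
# expected_mz formula to one neutral mass and every row of an adduct table.
expected_mz <- function(monoisotopic_mass, adduct_table) {
  (adduct_table$factor * monoisotopic_mass + adduct_table$mass) /
    abs(adduct_table$charge)
}

adducts <- data.frame(
  adduct = c("M+H", "M+Na", "2M+H", "M-H"),
  charge = c(1L, 1L, 1L, -1L),
  factor = c(1L, 1L, 2L, 1L),
  mass   = c(1.007276, 22.989218, 1.007276, -1.007276)
)

# Glucose, neutral monoisotopic mass 180.0634 Da
glucose_mz <- setNames(expected_mz(180.0634, adducts), adducts$adduct)
print(round(glucose_mz, 4))
```

The M+H result, about 181.0707, corresponds to the first `mz` value used in the example peak tables. Note that the dimer (2M+H) doubles the neutral mass before the proton is added.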
adduct_table <- data.frame(
adduct = c("M+H", "M+Na", "M+K", "M+NH4", "M-H", "M+Cl", "M+FA-H", "2M+H", "M-H2O+H"),
charge = c(1L, 1L, 1L, 1L, -1L, -1L, -1L, 1L, 1L),
factor = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L),
mass = c(
1.007276, # M+H: +proton
22.989218, # M+Na: +sodium
38.963158, # M+K: +potassium
18.033823, # M+NH4: +ammonium
-1.007276, # M-H: -proton
34.969402, # M+Cl: +chloride
44.998201, # M+FA-H: +formate -H
1.007276, # 2M+H: dimer + proton
-17.003289 # M-H2O+H: water loss + proton (1.007276 - 18.010565)
)
)

adduct_table_pos <- data.frame(
adduct = c("M+H", "M+Na", "M+K", "M+NH4", "2M+H", "M-H2O+H"),
charge = c(1L, 1L, 1L, 1L, 1L, 1L),
factor = c(1L, 1L, 1L, 1L, 2L, 1L),
mass = c(1.007276, 22.989218, 38.963158, 18.033823, 1.007276, -17.003289)
)

adduct_table_neg <- data.frame(
adduct = c("M-H", "M+Cl", "M+FA-H", "M-H2O-H", "2M-H"),
charge = c(-1L, -1L, -1L, -1L, -1L),
factor = c(1L, 1L, 1L, 1L, 2L),
mass = c(-1.007276, 34.969402, 44.998201, -19.01839, -1.007276)
)

| Adduct | Mass Shift | Mode |
|---|---|---|
| M+H | +1.007276 | Positive |
| M+Na | +22.989218 | Positive |
| M+K | +38.963158 | Positive |
| M+NH4 | +18.033823 | Positive |
| M-H | -1.007276 | Negative |
| M+Cl | +34.969402 | Negative |
| M+FA-H | +44.998201 | Negative |
| M+ACN+H | +42.033823 | Positive |
| M-H2O+H | -17.003289 | Positive |
| M-H2O-H | -19.01839 | Negative |
Assigns priority weights to adducts for confidence scoring and filtering.
| Column | Type | Description |
|---|---|---|
| `Adduct` or `adduct` | character | Adduct name |
| `Weight` or `weight` | numeric | Priority weight (higher = more important) |
adduct_weights <- data.frame(
Adduct = c("M+H", "M-H", "M+Na", "M+K", "M+NH4", "M+Cl"),
Weight = c(5, 5, 3, 2, 3, 2)
)

adduct_weights <- data.frame(
Adduct = c("M+H", "M-H"),
Weight = c(10, 10)
)

- Higher weights give higher priority during redundancy filtering
- Compounds with higher-weighted adducts are retained preferentially
- Default if not provided: M+H and M-H with weight 1
For pathway enrichment scoring. The advanced_annotation() function supports three pathway modes:
- HMDB mode (default): Uses built-in HMDB pathway data
- Custom mode: Uses user-provided pathway data
- Skip mode: Bypasses pathway matching entirely
| Column | Type | Description |
|---|---|---|
| `compound` | character | Compound identifier (must match `compound_table$compound`) |
| `pathway` | character | Pathway identifier |
# Custom pathway data for use with pathway_mode = "custom"
pathway_data <- data.frame(
compound = c("1", "1", "2", "2", "3"), # Must match your compound_table IDs
pathway = c("glycolysis", "tca_cycle", "glycolysis", "pentose_phosphate", "tca_cycle")
)

pathway_data <- data.frame(
compound = c("1", "1", "2", "2", "3"),
pathway = c("map00010", "map00500", "map00010", "map00020", "map00020")
)

# Option 1: Skip pathway matching entirely (simplest for custom databases)
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = my_compounds,
pathway_mode = "skip"
)
# Option 2: Provide custom pathway data
my_pathways <- data.frame(
compound = c("1", "1", "2", "3"),
pathway = c("glycolysis", "tca_cycle", "glycolysis", "pentose")
)
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = my_compounds,
pathway_mode = "custom",
pathway_data = my_pathways
)
# Option 3: Use HMDB pathways (default, requires HMDB compound IDs)
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = hmdb_compounds # Must have HMDB IDs
)

You can exclude specific pathways or compounds from the analysis:
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = my_compounds,
pathway_mode = "custom",
pathway_data = my_pathways,
excluded_pathways = c("pathway_to_skip"), # Pathways to exclude
excluded_pathway_compounds = c("compound_to_skip") # Compounds to exclude
)

- The same compound can appear in multiple pathways (multiple rows)
- The `compound` column values must match your `compound_table$compound` values (converted to character)
- The enrichment algorithm boosts scores for compounds in shared pathways
- Use `pathway_mode = "skip"` when using custom compound databases without pathway information
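
Because the pathway table stores identifiers as character while a legacy compound table stores integers, a mismatch is easy to miss. The base R sketch below compares the two after converting to character and lists pathway entries with no corresponding compound; the tables are made-up illustration data.

```r
compound_table <- data.frame(
  compound = 1:3,
  name = c("Glucose", "Threonic acid", "Ribose")
)
pathway_data <- data.frame(
  compound = c("1", "2", "7"),
  pathway = c("glycolysis", "glycolysis", "tca_cycle")
)

# Compare as character, since pathway_data$compound is a character column
known_ids <- as.character(compound_table$compound)
unmatched <- setdiff(as.character(pathway_data$compound), known_ids)
print(unmatched)
```

Here the identifier "7" appears in the pathway table but not in the compound table, so that row would need to be corrected or removed.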
List of known/validated compounds to boost to confidence level 4. Boosted compounds receive Confidence=4 (labeled "Confirmed") and their scores are multiplied by 100.
| Column | Type | Required | Description |
|---|---|---|---|
| `compound_id` | character | Yes | Must match `compound_id` in compound_table (appears as `compound_id` in outputs) |
| `mz` | numeric | Conditional | m/z value for proximity matching (required if "mz" in `boost_match_by`) |
| `rt` | numeric | Conditional | Retention time in seconds (required if "rt" in `boost_match_by`) |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `boosted_compounds` | data.frame | `NULL` | Table of compounds to boost |
| `boost_match_by` | character | `c("mz", "rt")` | Which columns to use for matching: `c("mz")`, `c("rt")`, or `c("mz", "rt")` |
| `boost_mass_tolerance` | numeric | same as `mass_tolerance` | Fractional tolerance for mz matching (e.g., 5e-6 = 5 ppm) |
| `boost_time_tolerance` | numeric | same as `time_tolerance` | Seconds tolerance for RT matching |
- mz + rt matching (`boost_match_by = c("mz", "rt")`): Annotation must match `compound_id` AND be within mz tolerance AND within RT tolerance
- rt-only matching (`boost_match_by = c("rt")`): Annotation must match `compound_id` AND be within RT tolerance
- mz-only matching (`boost_match_by = c("mz")`): Annotation must match `compound_id` AND be within mz tolerance
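
The three modes above can be sketched as a single predicate. This is an illustration in base R, not the package's internal code: `is_boost_match` is a hypothetical name, and the mass comparison assumes the same fractional rule that the main tolerance parameters use.

```r
# Hypothetical illustration of the three matching modes; the package's
# internal implementation may differ.
is_boost_match <- function(annotation, boost, match_by = c("mz", "rt"),
                           mass_tolerance = 5e-6, time_tolerance = 10) {
  # The compound identifier must always agree
  ok <- annotation$compound_id == boost$compound_id

  if ("mz" %in% match_by) {
    # Fractional tolerance, scaled by the larger of the two m/z values
    limit <- max(abs(annotation$mz), abs(boost$mz)) * mass_tolerance
    ok <- ok && abs(annotation$mz - boost$mz) <= limit
  }
  if ("rt" %in% match_by) {
    ok <- ok && abs(annotation$rt - boost$rt) <= time_tolerance
  }
  ok
}

annotation <- list(compound_id = "HMDB0000122", mz = 181.0707, rt = 131.0)
boost      <- list(compound_id = "HMDB0000122", mz = 181.0707, rt = 120.5)

# RT differs by 10.5 s, which exceeds the 10 s tolerance
print(is_boost_match(annotation, boost, match_by = c("mz", "rt")))
# With mz-only matching the RT difference is ignored
print(is_boost_match(annotation, boost, match_by = "mz"))
```

The same annotation fails the combined check but passes the mz-only check, which is the practical reason to relax `boost_match_by` when retention times drift between runs.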
# Boost compounds that match by ID, mz, and RT
boosted_compounds <- data.frame(
compound_id = c("HMDB0000122", "HMDB0000158", "HMDB0000167"),
mz = c(181.0707, 147.0652, 133.0496),
rt = c(120.5, 95.2, 85.3)
)
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = my_compounds,
boosted_compounds = boosted_compounds,
boost_match_by = c("mz", "rt"), # default
boost_mass_tolerance = 5e-6, # 5 ppm (defaults to mass_tolerance if not specified)
boost_time_tolerance = 10 # 10 seconds (defaults to time_tolerance if not specified)
)

# Boost compounds that match by ID and RT only (ignore mz differences)
boosted_compounds <- data.frame(
compound_id = c("HMDB0000122", "HMDB0000158", "HMDB0000167"),
rt = c(120.5, 95.2, 85.3) # mz column not needed
)
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = my_compounds,
boosted_compounds = boosted_compounds,
boost_match_by = c("rt"), # RT-only matching
boost_time_tolerance = 15 # 15 second tolerance
)

# Boost compounds that match by ID and mz only (ignore RT differences)
boosted_compounds <- data.frame(
compound_id = c("HMDB0000122", "HMDB0000158", "HMDB0000167"),
mz = c(181.0707, 147.0652, 133.0496) # rt column not needed
)
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = my_compounds,
boosted_compounds = boosted_compounds,
boost_match_by = c("mz"), # mz-only matching
boost_mass_tolerance = 10e-6 # 10 ppm tolerance
)

- The `compound_id` column must match the `compound_id` values in your compound_table
- Boosted annotations receive Confidence=4 ("Confirmed", highest level) and score×100
- Tolerance parameters use the same format as main parameters: fractional for mass (e.g., `5e-6` = 5 ppm), seconds for time
List of adducts that are expected/required for high confidence annotations.
| Column | Type | Description |
|---|---|---|
| `adduct` | character | Adduct name |
expected_adducts <- data.frame(
adduct = c("M+H", "M-H")
)

expected_adducts <- load_expected_adducts_csv("expected_adducts.csv")

| Format | Read Function | Write Function | Notes |
|---|---|---|---|
| Parquet | `load_*_parquet()` | `save_parquet()` | Recommended for large files |
| CSV | `read.csv()` / `readr::read_csv()` | `write.csv()` | Universal compatibility |
| RDA | `load()` | `save()` | Native R format |
| Data Frame | Direct use | - | In-memory |
# Compound table
compound_table <- load_compound_table_parquet("compounds.parquet")
# Peak table
peak_table <- load_peak_table_parquet("peaks.parquet")
# Adduct table
adduct_table <- load_adduct_table_parquet("adducts.parquet")

save_parquet(compound_table, "compounds.parquet")
save_parquet(peak_table, "peaks.parquet")
save_parquet(annotation_results, "results.parquet")

# Load original HMDB data
load("xMSannotator-master/data/hmdbAllinf.rda")
# Convert to required format
compound_table <- data.frame(
compound = seq_len(nrow(hmdbAllinf)),
monoisotopic_mass = as.numeric(hmdbAllinf$MonoisotopicMass),
molecular_formula = as.character(hmdbAllinf$Formula),
name = as.character(hmdbAllinf$Name)
)
# Save as Parquet
save_parquet(compound_table, "hmdb_compounds.parquet")

# Assuming KEGG data with columns: KEGG_ID, ExactMass, Formula, Name
kegg_data <- read.csv("kegg_compounds.csv")
compound_table <- data.frame(
compound = seq_len(nrow(kegg_data)),
monoisotopic_mass = kegg_data$ExactMass,
molecular_formula = kegg_data$Formula,
name = kegg_data$Name
)

# Assuming LipidMaps data
lipidmaps_data <- read.csv("lipidmaps.csv")
compound_table <- data.frame(
compound = seq_len(nrow(lipidmaps_data)),
monoisotopic_mass = lipidmaps_data$EXACT_MASS,
molecular_formula = lipidmaps_data$FORMULA,
name = lipidmaps_data$COMMON_NAME
)

# Your CSV with any column names
my_data <- read.csv("my_database.csv")
compound_table <- data.frame(
compound = seq_len(nrow(my_data)),
monoisotopic_mass = my_data$mass, # Map your column name
molecular_formula = my_data$formula,
name = my_data$compound_name
)

# Typical spectral library export format
massbank <- read.csv("massbank_export.csv")
compound_table <- data.frame(
compound = seq_len(nrow(massbank)),
monoisotopic_mass = massbank$EXACT_MASS,
molecular_formula = massbank$MOLECULAR_FORMULA,
name = massbank$COMPOUND_NAME
)

The `advanced_annotation()` function accepts the following parameters:
| Parameter | Type | Description |
|---|---|---|
| `peak_table` | data.frame | Feature table with m/z, retention time, and intensity columns |
| `compound_table` | data.frame | Database of compounds with names, formulas, and monoisotopic masses |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `adduct_table` | data.frame | `NULL` | Table of adducts with name, factor, charge, and mass columns. Uses `sample_adduct_table` if NULL |
| `adduct_weights` | data.frame | `NULL` | Weights for prioritizing specific adducts. Auto-generated with weight=5 if NULL |
| `feature_id_column` | character | `NULL` | Name of column in peak_table containing custom feature identifiers to preserve in all stage outputs |
| `intensity_deviation_tolerance` | numeric | `0.1` | Tolerance for intensity deviation in isotope matching (10%) |
| `mass_tolerance` | numeric | `5e-6` | Mass accuracy as fractional (relative) tolerance. Use 5e-6 for 5 ppm, 10e-6 for 10 ppm. Do NOT enter direct ppm values. |
| `isotope_mass_tolerance` | numeric | `NULL` | Fractional tolerance for isotope m/z matching. Defaults to `mass_tolerance`. Use 5e-6 for 5 ppm. |
| `mass_defect_tolerance` | numeric | `0.1` | Tolerance for mass defect matching |
| `mass_defect_precision` | numeric | `0.01` | Precision for binning mass defects |
| `time_tolerance` | numeric | `10` | Retention time tolerance in seconds |
| `peak_rt_width` | numeric | `1` | Expected chromatographic peak width for RT clustering |
| `correlation_threshold` | numeric | `0.7` | Minimum correlation for peak grouping |
| `MplusH_abundance_ratio_check` | logical | `TRUE` | Requires secondary adducts to have lower intensity than M+H/M-H during chemical scoring. Set to FALSE to disable. |
| `multimer_abundance_check` | logical | `TRUE` | Checks that multimer adducts (2M, 3M) have lower intensity than the monomer during confidence assignment. Set to FALSE to disable. |
| `deep_split` | integer | `2` | WGCNA parameter controlling cluster splitting (0-4, higher = more clusters) |
| `min_cluster_size` | integer | `10` | Minimum peaks per module in WGCNA clustering |
| `maximum_isotopes` | integer | `10` | Maximum isotope peaks to consider per compound |
| `min_ions_per_chemical` | integer | `2` | Minimum ions required to annotate a chemical |
| `filter_by` | character vector | `c("M-H", "M+H")` | Primary adducts for confidence scoring |
| `level1_primary_adducts` | character vector | `c("M+H", "M-H")` | Adducts that qualify for Confidence 1 as a single match in the evidence cap. Independent of `filter_by`. |
| `network_type` | character | `"unsigned"` | WGCNA network type ("unsigned", "signed", or "signed hybrid") |
| `redundancy_filtering` | logical | `TRUE` | Whether to remove redundant annotations |
| `identify_isotopologues_flag` | logical | `TRUE` | Use enviPat to identify specific isotopologue substitutions (e.g., 13C:1 vs 15N:1) for isotope peaks. Adds `isotopologue` and `isotopologue_quality` columns to output. Requires the enviPat package; gracefully skips if not installed. |
| `pathway_mode` | character | `"HMDB"` | Pathway matching mode: "HMDB" (default), "custom", or "skip" |
| `pathway_data` | data.frame | `NULL` | Custom pathway-compound mappings (required if `pathway_mode = "custom"`) |
| `excluded_pathways` | character vector | `NULL` | Pathways to exclude from analysis |
| `excluded_pathway_compounds` | character vector | `NULL` | Compounds to exclude from pathway analysis |
| `boosted_compounds` | data.frame | `NULL` | Table of confirmed compounds to boost to Confidence=4 (see Section 6) |
| `boost_match_by` | character vector | `c("mz", "rt")` | Which columns to use for boost matching: `c("mz")`, `c("rt")`, or `c("mz", "rt")` |
| `boost_mass_tolerance` | numeric | same as `mass_tolerance` | Fractional tolerance for boost mz matching (e.g., 5e-6 = 5 ppm) |
| `boost_time_tolerance` | numeric | same as `time_tolerance` | Seconds tolerance for boost RT matching |
| `outloc` | character | `tempdir()` | Output directory for intermediate files |
| `n_workers` | integer | `detectCores()` | Number of parallel workers for WGCNA |
Note on mass tolerance format: All mass tolerance parameters (`mass_tolerance`, `isotope_mass_tolerance`, `boost_mass_tolerance`) use fractional (relative) tolerance notation:

- `5e-6` = 5 ppm (5 parts per million)
- `10e-6` = 10 ppm
- The matching algorithm uses: `|observed - expected| ≤ max(|observed|, |expected|) × tolerance`
- Do NOT enter ppm values directly (e.g., do NOT use `5` for 5 ppm)
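
The comparison rule in the note can be sketched in a few lines of base R. Both helper names below are hypothetical and exist only for this illustration; the inequality itself is the one stated above.

```r
# Hypothetical helpers for illustration only (not exported by the package).
ppm_to_fraction <- function(ppm) ppm * 1e-6

within_mass_tolerance <- function(observed, expected, tolerance) {
  # |observed - expected| <= max(|observed|, |expected|) * tolerance
  abs(observed - expected) <= pmax(abs(observed), abs(expected)) * tolerance
}

tol <- ppm_to_fraction(5)  # fractional form of 5 ppm

# Expected m/z 181.070676 (glucose, M+H); 5 ppm allows roughly +/- 0.0009
print(within_mass_tolerance(181.0707, 181.070676, tol))  # difference 0.000024
print(within_mass_tolerance(181.0720, 181.070676, tol))  # difference 0.001324
```

The first observation falls inside the 5 ppm window and the second does not. Passing `5` instead of `5e-6` would make the allowed window hundreds of m/z units wide, which is why the fractional form matters.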
library(CLUES.xMSannotator)
# Run with default parameters
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = my_compounds
)
# Run with custom parameters
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = my_compounds,
adduct_table = my_adducts,
mass_tolerance = 10e-6, # 10 ppm (fractional: 10 × 10^-6)
isotope_mass_tolerance = 5e-6, # 5 ppm for isotope matching
time_tolerance = 15, # 15 seconds
correlation_threshold = 0.8, # stricter correlation
min_ions_per_chemical = 3, # require more ions
filter_by = c("M+H"), # positive mode only
n_workers = 4 # limit parallelization
)
# Preserve custom feature IDs through the pipeline
result <- advanced_annotation(
peak_table = my_peaks, # has "FeatureID" column
compound_table = my_compounds,
feature_id_column = "FeatureID", # preserve this column in all outputs
outloc = "output/"
)
# All Stage 1-5 output files will include the "FeatureID" column
# Skip pathway matching (for custom compound databases)
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = custom_compounds,
pathway_mode = "skip" # bypass HMDB pathway matching
)
# Use custom pathway database
my_pathways <- data.frame(
compound = c("1", "1", "2", "3"),
pathway = c("glycolysis", "tca_cycle", "glycolysis", "pentose")
)
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = my_compounds,
pathway_mode = "custom",
pathway_data = my_pathways,
excluded_pathways = c("unwanted_pathway") # optional filtering
)
# Boost confidence of confirmed/validated compounds
validated_compounds <- data.frame(
compound_id = c("HMDB0000122", "HMDB0000243"), # Must match compound_table$compound_id
mz = c(181.0707, 133.0496), # Observed m/z values
rt = c(120.5, 85.3) # Observed retention times (seconds)
)
result <- advanced_annotation(
peak_table = my_peaks,
compound_table = my_compounds,
boosted_compounds = validated_compounds,
boost_match_by = c("rt"), # Match by compound_id + RT only
boost_time_tolerance = 15, # 15 second RT tolerance
pathway_mode = "skip"
)
# Annotations matching validated compounds will have Confidence=4 ("Confirmed") and score×100

library(CLUES.xMSannotator)
# 1. Create compound database
compound_table <- data.frame(
compound = 1:5,
monoisotopic_mass = c(180.0634, 132.0423, 146.0579, 176.0321, 192.0270),
molecular_formula = c("C6H12O6", "C4H8O5", "C5H10O5", "C6H8O6", "C6H8O7"),
name = c("Glucose", "Threonic acid", "Ribose", "Ascorbic acid", "Citric acid")
)
# 2. Create peak table
peak_table <- data.frame(
mz = c(181.0707, 133.0496, 147.0652, 177.0394, 193.0343, 203.0526),
rt = c(120.5, 85.3, 95.2, 142.8, 156.1, 165.4)
)
# 3. Define adducts (optional - has defaults)
adduct_table <- data.frame(
adduct = c("M+H", "M+Na", "M-H"),
charge = c(1L, 1L, -1L),
factor = c(1L, 1L, 1L),
mass = c(1.007276, 22.989218, -1.007276)
)
# 4. Run simple annotation
annotation <- simple_annotation(
peak_table = peak_table,
compound_table = compound_table,
adduct_table = adduct_table,
mass_tolerance = 5e-6 # 5 ppm
)
# 5. View results
print(annotation)

library(CLUES.xMSannotator)
# 1. Load compound database from Parquet
compound_table <- load_compound_table_parquet("my_compounds.parquet")
# 2. Load peak table with intensities
peak_table <- load_peak_table_parquet("my_peaks.parquet")
# 3. Define adducts and weights
adduct_table <- data.frame(
adduct = c("M+H", "M+Na", "M+K", "M+NH4"),
charge = c(1L, 1L, 1L, 1L),
factor = c(1L, 1L, 1L, 1L),
mass = c(1.007276, 22.989218, 38.963158, 18.033823)
)
adduct_weights <- data.frame(
adduct = c("M+H", "M+Na", "M+K", "M+NH4"),
weight = c(5, 3, 2, 3)
)
# 4. Run advanced annotation with custom compound database
# Use pathway_mode = "skip" to bypass HMDB pathway matching
result <- advanced_annotation(
peak_table = peak_table,
compound_table = compound_table,
adduct_table = adduct_table,
adduct_weights = adduct_weights,
mass_tolerance = 5e-6, # 5 ppm
time_tolerance = 10, # 10 seconds
correlation_threshold = 0.7,
filter_by = c("M+H"), # positive mode
pathway_mode = "skip", # skip HMDB pathway matching
outloc = "output/"
)
# 5. Or use custom pathway data
my_pathways <- data.frame(
compound = c("1", "1", "2", "2", "3"),
pathway = c("glycolysis", "tca_cycle", "glycolysis", "pentose_phosphate", "tca_cycle")
)
result_with_pathways <- advanced_annotation(
peak_table = peak_table,
compound_table = compound_table,
adduct_table = adduct_table,
adduct_weights = adduct_weights,
pathway_mode = "custom",
pathway_data = my_pathways,
outloc = "output_with_pathways/"
)
# 6. Save results
save_parquet(result, "annotation_results.parquet")
write.csv(result, "annotation_results.csv", row.names = FALSE)

The package includes validation functions that check your data:
| Function | Purpose |
|---|---|
| `as_peak_table()` | Validates peak table format |
| `as_compound_table()` | Validates compound table format |
| `as_adduct_table()` | Validates adduct table format |
| `as_pathway_table()` | Validates custom pathway table format |
| `as_expected_adducts_table()` | Validates expected adducts |
| `load_boost_compounds_csv()` | Loads and validates boosted compounds from CSV |
| `as_boosted_compounds_table()` | Validates boosted compounds (internal, called by `load_boost_compounds_csv()`) |
# Check your data before annotation
tryCatch({
validated_compounds <- as_compound_table(compound_table)
validated_peaks <- as_peak_table(peak_table, intensities = TRUE)
validated_adducts <- as_adduct_table(adduct_table)
print("All data validated successfully!")
}, error = function(e) {
print(paste("Validation error:", e$message))
})

Document created: 2026-01-22
Last updated: 2026-03-14
For use with CLUES.xMSannotator v1.0.0