Changelog

All notable changes to this project will be documented in this file.

Added

  • Added docs/Stage5_Output_Reference.md documenting all 19 columns in Stage5_curated_results.txt with definitions, data types, column origins, internal-to-output name mapping, and interpretation guides for Confidence and score. (2026-03-14)

Changed

  • Audited docs/xMSannotator_Workflow.md against codebase (19/20 references correct). Fixed almost_equal() line reference in Stage 1 mass tolerance section, added missing feature_id_column to Key Parameters Reference table. (2026-03-14)
  • Updated readme.md with Documentation & Examples section containing a table of contents for all guides, example scripts, and developer docs. Added inline references to detailed docs from Pipeline Overview, Confidence Levels, and Installation sections. Updated docs/README.md to fix stale data_transformation.md link and list all current documents and scripts. (2026-03-14)
  • Updated docs/xMSannotator_Input_Formats.md to reflect current codebase: added missing level1_primary_adducts parameter to Section 10, updated confidence level 4 label from "Boosted" to "Confirmed" in Section 6. (2026-03-14)
  • Renamed docs/data_transformation.md to docs/advanced_annotation_input_formatting.md. Rewrote for current pipeline inputs: replaced obsolete recetox-aplcms conversion and HMDB RDA-to-Parquet content with documentation of XCMS feature table format, sample mapfile format, xlsx compound database format, and pre-processing steps (blank removal, fold-change filtering, peak table construction) as implemented in the Example Runscript. Feature Table section distinguishes required columns (mz, time, sample intensities) from workflow-specific QC metadata columns (XCMS stats, QA scores, blank/detection stats, CV columns) that are only needed for post-processing metadata joins and can be modified as needed. (2026-03-14)
  • Rewrote pipeline workflow reference document (docs/xMSannotator_Workflow.md) for clarity and usability. Added Quick Start section with minimal example, replaced ASCII data flow diagram with numbered stage list, added per-stage "Key questions" callout boxes, added worked scoring example for Stage 3, replaced Stage 4 prose with pseudocode decision flowchart, added Parameter Decision Guide (instrument/study/tuning), and added Common Issues troubleshooting section. All existing algorithmic content, tables, formulas, and function references preserved. (2026-03-13)

Added

  • Added comprehensive pipeline workflow reference document (docs/xMSannotator_Workflow.md). Covers all stages (1 through 5) with algorithmic details, scoring formulas, parameter defaults, theory explanations, data flow diagram, output file manifest, and full parameter reference table. (2026-03-13)

Fixed

  • Fixed Stage 5 redundancy filtering reading stale pre-upgrade data. init_chemscoremat() in multilevelannotationstep5.R used any(is.na()) to detect whether a data frame was passed, but annotation data frames contain legitimate NA values in isotopologue/isotopologue_quality columns, causing it to always fall back to reading the old Stage4_confidence_levels.txt file. This discarded all confidence upgrades, caps, labels, and coherence filtering — reverting 136 Conf 2 rows to Conf 0/1 and reintroducing 83 incoherent rows. Fix: changed sentinel check to is_scalar_na() (detects the default NA parameter vs. a real data frame), updated fallback to read Stage4b_confidence_levels.txt (post-coherence), and removed duplicate Stage5 file write from inside multilevelannotationstep5(). (2026-03-10)
  • Fixed single M+H/M-H annotations stuck at Confidence 0 — cap_confidence_with_evidence() now processes Level 0 compounds and rescues primary adduct matches to Level 1. Previously, cap only processed compounds with Confidence > 0, so single-row primary adduct matches that Stage 4 assigned Level 0 (due to score > 10 strict inequality and filter_by = NULL) were never evaluated. (2026-03-10)
  • Fixed orphan isotope rows (no base adduct row present) incorrectly upgraded to Level 2. Both upgrade_confidence_with_evidence() and cap_confidence_with_evidence() now count actual base rows (n_base_rows) separately from unique base adduct types (n_base_adducts). Isotope evidence tiers require n_base_rows >= 1, preventing lone isotope rows (e.g., M+ACN+H_[+1] with no M+ACN+H row) from being treated as isotope + base adduct evidence. (2026-03-10)
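
The sentinel problem behind the Stage 5 fix above can be reproduced in a few lines of base R. This is a minimal sketch under stated assumptions: `is_scalar_na_sketch()` is a hypothetical stand-in written for illustration (it is not the package's `is_scalar_na()`), and the toy data frame only mimics the `isotopologue` column.

```r
# Toy annotation table: the NA in `isotopologue` is legitimate data, not a sentinel.
annotations <- data.frame(
  compound_id  = c("C1", "C2"),
  isotopologue = c("13C:1", NA),
  stringsAsFactors = FALSE
)

# Old-style check: TRUE for any data frame that happens to contain an NA.
old_check <- function(x) any(is.na(x))

# Hypothetical stand-in for a scalar-NA check: TRUE only for a single atomic NA,
# which is what an unset parameter with default NA looks like.
is_scalar_na_sketch <- function(x) {
  is.atomic(x) && length(x) == 1L && is.na(x)
}

old_on_df <- old_check(annotations)            # treats the data frame as "not supplied"
new_on_df <- is_scalar_na_sketch(annotations)  # a data frame is never a scalar NA
new_on_na <- is_scalar_na_sketch(NA)           # the default NA is still detected
```

With the old check, a real data frame containing any NA is indistinguishable from the default parameter value, which is why the function fell back to reading the file from disk.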

Added

  • Added cap_confidence_with_evidence() post-hoc confidence cap function (multilevelannotationstep4.R). Runs after Stage 4 and the upgrade step to enforce hard evidence requirements on confidence values. Unlike upgrade (which only raises), the cap can lower confidence when evidence is insufficient. Cap tiers: Conf 3 requires isotope rows + adduct + module/RT coherence, Conf 2 requires 2+ base adducts + coherence, Conf 1 requires a single primary adduct match, everything else caps to 0. Skips Conf 0 and Conf 4 (user-confirmed). (2026-03-10)
  • Added add_confidence_labels() function (multilevelannotationstep4.R). Maps numeric Confidence column to human-readable Confidence_Level text labels (None/Low/Medium/High/Confirmed) in all output files. (2026-03-10)
  • Added level1_primary_adducts parameter to advanced_annotation() (default: c("M+H", "M-H")). Controls which adducts qualify for Confidence 1 as a single match in the evidence cap. Independent of filter_by, allowing filter_by = NULL for equal Stage 4 scoring while still requiring primary ion evidence for Level 1. (2026-03-10)

Changed

  • Sorted Stage4a and Stage4b output files by Confidence (descending), then compound_id, score, and Adduct. All rows for a compound are now grouped together, ordered from highest to lowest confidence. Matches the existing Stage5 sort order from multilevelannotationstep5(). (2026-03-10)
  • Renamed confidence level 4 label from "Boosted" to "Confirmed" in documentation. (2026-03-10)

Fixed

  • Removed false Confidence 2 assignments for compounds without corroborating evidence. Two changes: (1) Removed boost block in compute_confidence_for_compound() that unconditionally upgraded any weighted adduct with score > 10 from Conf 0/1 → Conf 2, bypassing filter checks. This was redundant for filter compounds (Stage 4 already assigns Conf 2 for filter matches) and too aggressive for non-filter compounds (filter_by=NULL path assigns Conf 0 because has_filter_match() returns FALSE). (2) Removed has_isotope_boost score proxy (score >= 100) from upgrade_confidence_with_evidence() evidence tiers. Scores can reach ≥ 100 from pathway matching or base scoring without any isotopes, creating false-positive upgrades. Evidence tiers now use only actual isotope row detection and multiple adduct counts. Single mass matches with no isotopes and no multiple adducts now correctly remain at Conf 0. (2026-03-10)

Changed

  • Refined coherence enforcement: now filters to the largest module group per compound instead of blanket downgrading to Conf 0. Added enforce_compound_coherence() helper that keeps only rows from the most-represented module for multi-module compounds. Applied as a pre-filter in compute_confidence_for_compound() (before Stage 4 evaluation) and upgrade_confidence_with_evidence() (before evidence gathering), replacing the post-hoc gate. Stage 4 now only sees coherent rows, allowing its internal RT clustering and module rules to work correctly on the filtered subset. Compounds like PEST0681 (2 rows in module 110 + 1 stray row in module 62) now retain their strong evidence instead of being killed. (2026-03-09)
  • Split Stage 4 output into Stage4a (all rows with confidence) and Stage4b (coherent rows only). Stage 5 redundancy filtering now receives Stage4b (coherent subset). Previously wrote single Stage4_confidence_levels.txt. (2026-03-09)
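
The largest-module filter described above can be sketched as follows. `keep_largest_module()` is a hypothetical helper, not the package's `enforce_compound_coherence()`; the adduct labels are made up, and the tie-breaking rule (first module in sorted order) is an assumption of this sketch.

```r
# Toy rows for one compound: two rows in module 110 and one stray row in module 62,
# mirroring the PEST0681 case. Adduct values are illustrative only.
rows <- data.frame(
  compound_id = "PEST0681",
  Adduct      = c("M+H", "M+Na", "M+K"),
  module      = c(110, 110, 62),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only rows from the most-represented module.
keep_largest_module <- function(df) {
  counts  <- table(df$module)
  largest <- names(counts)[which.max(counts)]  # ties resolve to the first module in sorted order
  df[as.character(df$module) == largest, , drop = FALSE]
}

coherent <- keep_largest_module(rows)
```

Here `coherent` keeps the two module 110 rows and drops the stray module 62 row, so the compound's remaining evidence can still be evaluated.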

Fixed

  • Enforced module + RT coherence on all confidence levels > 0. Added check_compound_coherence() helper that verifies all rows for a compound are in the same peak module and within max.rt.diff. Applied at two enforcement points: (1) post-hoc gate in compute_confidence_for_compound() catches all Stage 4 paths at their single exit point, (2) replaced RT-only check in upgrade_confidence_with_evidence() with full module + RT coherence check. Previously, 6 Stage 4 code paths could assign Conf 1 or 2 without checking module coherence, resulting in ~92% of multi-row Conf 2 compounds being split across multiple peak modules. (2026-03-09)
  • Fixed dead code in compute_confidence_for_compound() (multilevelannotationstep4.R). The else block at line 580 re-checked filter.by inside a branch that already established no filter match — the inner condition was always FALSE. Replaced with Confidence <- CONFIDENCE_LOW so non-filter compounds with weighted adducts and high scores now receive Confidence 1 instead of being stuck at 0. (2026-03-09)
  • Fixed score zeroing in get_confidence_stage4() (multilevelannotationstep4.R). Removed else { final_res$score <- 0 } which zeroed the internal score for non-filter compounds, preventing the Confidence 1→2 boost check at line 572 from working correctly. Output scores (from Stage 3 merge) were not affected. (2026-03-09)
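
As an illustration of the module + RT coherence rule above, here is a minimal sketch. `is_coherent_sketch()` is hypothetical and not the package's `check_compound_coherence()`. It assumes "within max.rt.diff" means the spread between the earliest and latest retention time; the actual function may measure this differently.

```r
# Hypothetical check: every row shares one module, and the RT spread
# (latest minus earliest) does not exceed max_rt_diff.
is_coherent_sketch <- function(module, rt, max_rt_diff) {
  length(unique(module)) == 1L && diff(range(rt)) <= max_rt_diff
}

same_module  <- is_coherent_sketch(c(110, 110), c(120.0, 123.5), max_rt_diff = 10)
split_module <- is_coherent_sketch(c(110, 62),  c(120.0, 123.5), max_rt_diff = 10)
rt_too_wide  <- is_coherent_sketch(c(110, 110), c(120.0, 140.0), max_rt_diff = 10)
```

Only the first case passes: the second is split across two modules and the third spans 20 seconds against a 10 second limit.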

Added

  • Added upgrade_confidence_with_evidence() post-hoc confidence upgrade function (multilevelannotationstep4.R). Evaluates compounds below Confidence 3 using evidence already available: isotope rows, multiple base adducts, isotope score boost (≥100), and RT coherence. Can only upgrade, never downgrade. When filter_by is set, non-filter compounds are assigned one tier lower than equivalent filter-matched compounds (e.g., isotope rows + 2 adducts → Conf 2 instead of 3). When filter_by is NULL/NA, same tier as filter-matched. Integrated into advanced_annotation() as Tool 10c, running after identify_isotopologues() and before Stage 4 output write. (2026-03-09)

Changed

  • Replaced rcdk (Java/rJava dependency) with enviPat for isotope pattern calculations. compute_isotopic_pattern() and precompute_isotope_patterns() now use enviPat::isopattern() instead of rcdk::get.formula()/rcdk::get.isotopes.pattern(). This eliminates the rJava dependency entirely — no JDK, JAVA_HOME, or R CMD javareconf required. enviPat moved from Suggests to Imports; rcdk removed from Imports. Isotope pattern output format (mass, abund, mass_number_difference, exact_mass_diff) is preserved. Test data regenerated to match enviPat output values. (2026-03-04)

Removed

  • Removed experimental permutation-based p-value testing. Deleted R/compute_permutation.R (~650 lines, 6 functions: compute_permutation_pvalues(), precompute_isotope_patterns(), detect_isotopic_peaks_cached(), compute_isotopes_with_cache(), compute_full_pvalues(), compute_streaming_pvalues()). Removed enable_permutation, n_permutations, permutation_method, permutation_seed parameters from advanced_annotation(). Deleted 6 man pages and 2 planning documents. The feature was disabled (if (FALSE)) and never production-ready. (2026-03-05)

Fixed

  • Fixed crash on charged molecular formulas (e.g., C12H14N2+2 for Paraquat) after rcdk → enviPat migration. enviPat cannot parse charge notation in formulas unlike rcdk/CDK. Added strip_formula_charge() to remove charge suffixes before calling enviPat::isopattern() (charge doesn't affect isotope patterns). Also added tryCatch safety net in detect_isotopic_peaks() to gracefully skip any other unparseable formulas instead of crashing the entire annotation run. (2026-03-04)
  • Fixed duplicate feature_id columns (feature_id.x, feature_id.y, feature_id) in Stage 4/5 output. The feature_id join was performed inside skip_pathway_step(), multilevelannotationstep3(), and multilevelannotationstep4(), then again by safe_join_feature_id() in advanced_annotation(), causing dplyr::left_join() to create .x/.y suffixes. Removed redundant joins from the three internal functions; safe_join_feature_id() now handles all feature_id joining at output stages. (2026-03-04)
  • Suppressed enviPat "NOTE: You are sure that is the mass of an electrone?" message in get_isotopologue_labels() by explicitly passing emass = 0.00054857990924 to isopattern(). (2026-03-04)
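
The charge-stripping step in the first fix above could look roughly like this. `strip_charge_sketch()` is a hypothetical illustration, not the package's `strip_formula_charge()`. It assumes the charge is written as a trailing sign with optional digits, as in the Paraquat example, and does not handle bracketed notation.

```r
# Hypothetical sketch: drop a trailing charge suffix such as "+2", "+", or "-".
strip_charge_sketch <- function(formula) {
  sub("[+-][0-9]*$", "", formula)
}

paraquat <- strip_charge_sketch("C12H14N2+2")  # charged formula from the changelog entry
anion    <- strip_charge_sketch("C6H5O-")      # bare sign without digits
neutral  <- strip_charge_sketch("C6H12O6")     # uncharged formulas pass through unchanged
```

The stripped string is what would then be handed to the isotope pattern calculation.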

Changed

  • Fixed .gitignore, which contained a *.Rd pattern that blocked all man pages from being committed. Removed stale xmsannotator/ paths and added .DS_Store exclusion. (2026-03-03)
  • Added roxygen2 documentation to 13 exported functions that lacked man pages: simple_annotation, get_chemscore, compute_chemical_score, add_isotopic_peaks, remove_water_adducts, create_adduct_weights, group_by_rt, load_peak_table_parquet, load_adduct_table_parquet, load_compound_table_parquet, load_expected_adducts_csv, load_boost_compounds_csv, save_parquet. (2026-03-03)
  • Updated .Rbuildignore to exclude non-package files (.github, docs/, Dockerfile, .DS_Store, .gitignore, readme.md) from the package tarball. (2026-03-03)
  • Deleted 4 orphaned .Rd files for removed/internal functions: custom_pathway_step.Rd, compute_boosted_confidences.Rd, get_confidence_stage2.Rd, group_by_rt_histv2.Rd. (2026-03-03)
  • Regenerated all man/*.Rd files via roxygen2::roxygenise(). (2026-03-03)

Added

  • Added isotopologue identification step (Tool 10b) to advanced_annotation(). After confidence level assignment, uses enviPat::isopattern() to identify which specific isotope substitution each isotope peak corresponds to (e.g., 13C:1 vs 15N:1 for M+1 peaks). Adds two columns to output: isotopologue (identity label) and isotopologue_quality ("confirmed" if both m/z and abundance match, "mz_only" if only m/z matches). Uses isotope_mass_tolerance for ppm cutoff and intensity_deviation_tolerance for abundance validation. Requires enviPat package (Suggests dependency); gracefully skips if not installed. New identify_isotopologues_flag parameter (default TRUE) to enable/disable. (2026-03-04)
  • Added get_monoisotopic_names() internal helper to compute_isotopes.R for identifying monoisotopic element names from enviPat column headers. (2026-03-04)
  • Added identify_isotopologues.R with identify_isotopologues() and get_isotopologue_labels() functions. (2026-03-04)
  • Added multimer_abundance_check parameter to advanced_annotation() (default TRUE). When enabled, checks that multimer adducts (2M, 3M) have lower intensity than the monomer during confidence level assignment. If a multimer is more abundant than the monomer, the confidence level is downgraded. Set to FALSE to disable this validation. Parameter is passed through multilevelannotationstep4() to get_confidence_stage4(). (2026-01-27)
  • Added MplusH_abundance_ratio_check parameter to advanced_annotation() (default TRUE). When enabled, requires secondary adducts to have lower intensity than the primary M+H or M-H adduct during chemical scoring. Set to FALSE to disable this abundance ratio validation. Parameter is passed through to get_chemscore(). (2026-01-27)
  • Added permutation-based significance testing to advanced_annotation(). New parameters: enable_permutation (default FALSE), n_permutations (default 1000), permutation_method (default "full"), permutation_seed (default 42). When enabled, computes p-values by permuting m/z values across peaks to generate null distributions, then outputs Stage4_permutation_pvalues_multi.txt with a perm_pvalue column. Uses parallel processing via n_workers parameter. Two methods available: "full" (all permutations in parallel, faster) and "streaming" (chunked processing, lower memory). Note: permutation testing is currently disabled/in development and not ready for production use. (2026-01-26)
  • Added compute_permutation.R with compute_permutation_pvalues(), compute_full_pvalues(), and compute_streaming_pvalues() functions for permutation testing. (2026-01-26)
  • Added pathway_mode parameter to advanced_annotation() with options: "HMDB" (default), "custom", or "skip". This allows users with custom compound databases to skip pathway matching or provide their own pathway data. (2026-01-24)
  • Added pathway_data parameter to advanced_annotation() for providing custom pathway-compound mappings when pathway_mode = "custom". Format: data frame with compound and pathway columns. (2026-01-24)
  • Added excluded_pathways and excluded_pathway_compounds parameters to advanced_annotation() for filtering pathway analysis. (2026-01-24)
  • Added as_pathway_table() validation function to utils.R for validating custom pathway data format. (2026-01-24)
  • Added skip_pathway_step() and custom_pathway_step() helper functions for non-HMDB pathway handling. (2026-01-24)
  • Added adduct detection summary output to console after simple_annotation() in advanced_annotation(). Shows total annotations, unique peaks matched, unique compounds matched, and adduct breakdown. (2026-01-24)
  • Added isotope detection summary output to console after compute_isotopes() in advanced_annotation(). Shows count of monoisotopic peaks, isotopes detected (with percentage), and breakdown by adduct type and mass number difference. (2026-01-24)
  • Added feature_id_column parameter to advanced_annotation() allowing users to preserve their custom feature identifiers (e.g., "C0001", "C0005") through the pipeline. The specified column is now included in all stage outputs (Stage 1-5). Feature ID is joined by peak for Stage 1 and by mz+time for Stages 2-5 for robust matching. (2026-01-24)
  • Added mz_rt_feature_id_map parameter to multilevelannotationstep3() and multilevelannotationstep4() to pass feature ID mapping for output files. (2026-01-24)
  • Added Stage3_pathway_matched.txt output after HMDB pathway matching in advanced_annotation(). This captures annotation scores after pathway enrichment (only written when pathway_mode = "HMDB"). (2026-01-25)
  • Added Stage4_confidence_levels.txt output after multilevelannotationstep4() in advanced_annotation(). This captures confidence level assignments before redundancy filtering. (2026-01-25)
  • Added isotope_mass_tolerance parameter to advanced_annotation() for ppm-based filtering of isotope matches. Defaults to same value as mass_tolerance. This filters out isotope candidates where the observed m/z differs from the theoretical isotope m/z by more than the specified ppm tolerance. Improves isotope detection accuracy by rejecting false positive matches with poor mass accuracy. (2026-01-25)
  • Added automatic creation of outloc directory if it doesn't exist. Previously, specifying a non-existent output directory would cause write.table() to fail. (2026-01-25)
  • Added support for string compound IDs via compound_id column in compound_table. Users can now provide meaningful identifiers (e.g., "HMDB0000001", "C00001") that flow through the entire pipeline and appear in all output files as compound_id. The internal integer compound column is auto-generated if compound_id is provided. Legacy mode (numeric compound only) remains supported for backward compatibility. (2026-01-25)
  • Added boosted_compounds parameter to advanced_annotation() for boosting confidence of confirmed annotations to level 4. Requires compound_id column matching compound_id in compound_table. Optional mz and rt columns enable proximity-based matching. (2026-01-25)
  • Added boost_match_by parameter to advanced_annotation() to control matching mode: c("mz") for mz-only, c("rt") for RT-only, or c("mz", "rt") for both (default). (2026-01-25)
  • Added boost_mass_tolerance and boost_time_tolerance parameters to advanced_annotation() for separate control of boost matching tolerances. Uses same format as main tolerance parameters (fractional for mass, seconds for time). Defaults to main tolerances if not specified. (2026-01-25)
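
For the custom pathway input described above (a data frame with compound and pathway columns), a pre-flight check might look like the sketch below. `check_pathway_data()` is hypothetical and is not the package's `as_pathway_table()`; the pathway names are placeholders.

```r
# Custom pathway mapping in the documented shape: one row per compound-pathway pair.
# Pathway names are placeholders for illustration.
pathway_data <- data.frame(
  compound = c("HMDB0000001", "HMDB0000001", "C00001"),
  pathway  = c("pathway_A", "pathway_B", "pathway_B"),
  stringsAsFactors = FALSE
)

# Hypothetical pre-flight check for the two required columns.
check_pathway_data <- function(x) {
  if (!is.data.frame(x)) stop("pathway_data must be a data frame")
  missing_cols <- setdiff(c("compound", "pathway"), names(x))
  if (length(missing_cols) > 0) {
    stop("missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

ok  <- check_pathway_data(pathway_data)
bad <- try(check_pathway_data(data.frame(compound = "C00001")), silent = TRUE)
```

Running a check like this before the annotation call surfaces a malformed mapping early instead of partway through the pipeline.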

Changed

  • Bumped version from 0.10.0 to 1.0.0 in DESCRIPTION, conda/meta.yaml, and docs/xMSannotator_Input_Formats.md. Major version bump signals the new CLUES-Emory release with package rename, permutation testing, boosted compounds, pathway modes, abundance checks, and many bug fixes. (2026-02-08)
  • Updated xMSannotator_Input_Formats.md: renamed all recetox.xmsannotator/RECETOX xMSannotator references to CLUES.xMSannotator, added missing MplusH_abundance_ratio_check and multimer_abundance_check parameters to the Optional Parameters table, added load_boost_compounds_csv() to Validation Functions table and noted as_boosted_compounds_table() is internal, updated footer date. Moved file from repo root to CLUES.xMSannotator/docs/. Added link in docs/README.md. (2026-02-08)
  • Restored RECETOX GitHub URLs in docs/possible_issues.md, docs/refactoring.md, docs/modifications.md, and docs/README.md. These docs describe original RECETOX refactoring work with commit-specific URLs that should point to RECETOX/recetox-xMSannotator, not the CLUES fork. Also restored title in docs/modifications.md. (2026-02-08)
  • Rewrote readme.md to credit upstream lineage (kuppal2 -> RECETOX -> CLUES-Emory), link to the RECETOX fork, and include a summary of CLUES-Emory changes (new features, bug fixes, code cleanup) derived from CHANGELOG.md. (2026-02-08)
  • Renamed package from recetox.xmsannotator to CLUES.xMSannotator to reflect new ownership under CLUES-Emory organization. GitHub repo: CLUES-Emory/CLUES-xMSannotator. Updated DESCRIPTION (package name, author/maintainer), NAMESPACE, R/RcppExports.R, R/simple_annotation.R, src/RcppExports.cpp, tests/testthat.R, readme.md, conda files, GitHub workflow, docs, and git remote URL. Renamed directory from recetox-xMSannotator to CLUES.xMSannotator. (2026-02-08)
  • Unified pathway scoring: multilevelannotationstep3() now accepts both HMDB and custom pathway databases via db_name parameter ("HMDB" or "custom"). Custom mode uses the same module-aware Fisher's test scoring logic (compute_score_pathways) as HMDB mode, replacing the simple co-member counting approach of the old custom_pathway_step(). New parameters: pathway_data, excluded_pathways, excluded_pathway_compounds. chemCompMZ parameter is now optional (default NULL), only needed for HMDB mode. (2026-02-08)
  • Updated advanced_annotation() pathway dispatch to route pathway_mode = "custom" through multilevelannotationstep3() instead of custom_pathway_step(). (2026-02-08)
  • Renamed HMDB pathway output file from Stage3_correlation_scores.txt to Stage3_HMDB_pathways.txt for consistency with Stage3_custom_pathways.txt and Stage3_pathway_skipped.txt. (2026-02-08)
  • Made skip_pathway_step() join feature ID to chemscoremat directly (consistent with multilevelannotationstep3()), instead of joining to a temporary copy for writing only. Both functions now return data with feature ID included. (2026-02-08)
  • Renamed multilevelannotationstep4_v2.R to multilevelannotationstep4.R and stripped all _v2 suffixes from function names: filter_clusters, create_cluster_table, compute_delta_rt, assign_conf, get_confidence_stage4, compute_delta_ppm, boost_confidence_of_IDs, multilevelannotationstep4. The refactored version is now the canonical implementation. (2026-02-07)
  • Updated advanced_annotation() to call multilevelannotationstep4() (was multilevelannotationstep4_v2()). (2026-02-07)
  • Updated comment in compute_permutation.R to remove stale reference to original file's data(adduct_table) global scope pollution. (2026-02-07)
  • Switched advanced_annotation() to use multilevelannotationstep4_v2() for Stage 4 confidence level assignment. The v2 version includes bug fixes (forward-iterating row deletion, NULL adduct weights, unreachable guards), performance improvements (pre-split by compound_id, vectorized delta ppm), and better code structure (11 smaller functions with named constants). (2026-02-07)
  • Extracted is_filter_empty() helper function in multilevelannotationstep4_v2.R to consolidate the repeated is.null(filter.by) || (length(filter.by) == 1 && is.na(filter.by[1])) pattern used in has_filter_match(), count_filter_matches(), and apply_rt_clustering_rules(). (2026-02-07)
  • Optimized permutation testing with precomputed isotope pattern cache. Isotope patterns (computed via rcdk) are now calculated once for all unique molecular formulas before the permutation loop, then reused across all permutations. This eliminates redundant rcdk::get.isotopes.pattern() calls - e.g., for 100 permutations x 1000 annotations, this reduces pattern computations from ~100,000 to ~2,000 (number of unique formulas). Added precompute_isotope_patterns(), detect_isotopic_peaks_cached(), and compute_isotopes_with_cache() functions to compute_permutation.R. (2026-01-26)
  • Improved permutation progress messages to print every 10 permutations instead of every 100, providing better feedback during long-running permutation tests. (2026-01-26)
  • Renamed chemical_ID column to compound_id throughout the package for consistency. This affects all stage output files (Stage1 through Stage5), internal functions, and test files. The column now matches the input compound_id column name from the compound_table, eliminating the confusing name change from input to output. (2026-01-25)
  • Updated documentation in xMSannotator_Input_Formats.md to clarify that mass tolerance parameters (mass_tolerance, isotope_mass_tolerance) use fractional (relative) tolerance format (e.g., 5e-6 for 5 ppm, not direct ppm values). Added formula explanation and usage examples. (2026-01-25)
  • Refactored get_chemscore() to preserve isotopes through chemical scoring. Isotopes are now separated before compute_chemical_score(), which prevents them from being filtered out during module selection. Isotopes are re-added after scoring with their parent adduct's score. This ensures isotopes appear in Stage 3+ outputs. (2026-01-24)
  • Added 100x isotope boost to chemical scores in get_chemscore() when isotope evidence is present, matching master branch behavior in calc_base_score(). Chemicals with detected isotopes now receive significantly higher scores, improving confidence level assignments. (2026-01-24)
  • Added Stage3_chemical_scores.txt output after get_chemscore() in advanced_annotation(). This captures the chemical scoring results before pathway matching. (2026-01-24)
  • Refactored all stage outputs to use tab-delimited text files (.txt) with descriptive filenames, saved directly to output directory (no subfolders). New output files: Stage1_mass_matched.txt, Stage1_peak_clusters.txt, Stage2_isotope_detection.txt, Stage3_chemical_scores.txt, Stage4_confidence_levels.txt, Stage5_curated_results.txt. (2026-01-24)
  • Added outloc parameter to multilevelannotationstep3() function to specify output directory (previously wrote to current directory). (2026-01-24)
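
The fractional tolerance format noted above (5e-6 for 5 ppm) is easy to get wrong, so a short sketch may help. Both helpers are hypothetical, and the m/z values are illustrative. The comparison uses the usual relative-error definition, which is an assumption about how the pipeline applies the tolerance.

```r
# Convert a ppm value to the fractional form the tolerance parameters expect.
ppm_to_fraction <- function(ppm) ppm * 1e-6

# Relative mass error test: |observed - theoretical| / theoretical <= tolerance.
within_tolerance <- function(observed_mz, theoretical_mz, tolerance) {
  abs(observed_mz - theoretical_mz) / theoretical_mz <= tolerance
}

mass_tolerance <- ppm_to_fraction(5)  # 5 ppm expressed as a fraction

theoretical_mz <- 181.0707            # illustrative value
close_hit <- within_tolerance(181.0710, theoretical_mz, mass_tolerance)  # about 1.7 ppm off
far_hit   <- within_tolerance(181.0730, theoretical_mz, mass_tolerance)  # about 12.7 ppm off
```

Passing `5` instead of `5e-6` would make the tolerance a factor of one million too wide, so nearly every candidate would match.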

Fixed

  • Fixed forward-iterating row deletion bug in apply_multimer_rules() (multilevelannotationstep4_v2.R). Removing rows inside a forward loop corrupted indices, causing wrong rows to be removed or skipped. Fix: collect indices to remove first, then remove all at once. (2026-02-07)
  • Fixed create_adduct_weights(NULL) returning NULL instead of default weights (utils.R). When adduct_weights was converted from NA to NULL, is.na(NULL) returned logical(0), which bypassed the default creation. Added explicit is.null() check. (2026-02-07)
  • Fixed unreachable early-return guard in apply_multimer_rules() multimer check (multilevelannotationstep4_v2.R). gregexpr() always returns a list matching input length, so length(check_abundance) == 0 was never true. Replaced with proper check for actual multimer pattern matches. (2026-02-07)
  • Fixed fragile column deletion by position after merge() in get_confidence_stage4_v2() (multilevelannotationstep4_v2.R). Replaced curdata[, -1] with curdata[["cur_adducts"]] <- NULL for explicit column removal by name. (2026-02-07)
  • Fixed hardcoded column index 8 in compute_delta_ppm_v2() (multilevelannotationstep4_v2.R). Now dynamically finds the theoretical.mz column position for inserting delta_ppm. (2026-02-07)
  • Fixed position-based column access curdata[,1] in compute_confidence_for_compound() (multilevelannotationstep4_v2.R). Replaced with explicit curdata$score_level access. (2026-02-07)
  • Fixed cbind() type coercion risk when building result data frames in get_confidence_stage4_v2() and check_minimum_score() (multilevelannotationstep4_v2.R). Replaced cbind(score_level = ..., curdata) with data.frame(score_level = ..., curdata, check.names = FALSE) to prevent silent coercion of all columns to character. (2026-02-07)
  • Fixed inconsistent confidence type in apply_unique_adduct_rules() early return (multilevelannotationstep4_v2.R). Was using bare CONFIDENCE_MEDIUM numeric without score_level column name. Now uses data.frame(score_level = CONFIDENCE_MEDIUM, ...) consistent with other return paths. (2026-02-07)
  • Fixed data(adduct_table) polluting global environment and causing simple_annotation() to fail in permutation testing. Root cause: data(adduct_table) loads the package's dataset (with uppercase Adduct column) into the global environment, which then gets captured in closures passed to compute_permutation_pvalues(). When simple_annotation() runs inside permuted contexts, it references the global adduct_table instead of its parameter adduct_table (lowercase adduct column), causing column name mismatches. Fix: (1) Removed dead data(adduct_table) code from multilevelannotationstep4.R - the loaded table was sorted but never used, (2) Changed data(adduct_table) to data("adduct_table", envir = environment()) in get_confidence_stage2() to load into local scope instead of global. (2026-01-26)
  • Fixed permutation p-values generating 0 null scores because run_permutation() was missing preprocessing steps. After simple_annotation(), the annotation was missing required columns (mass_defect, module, rt_cluster, mean_intensity) that downstream functions (compute_isotopes(), reformat_annotation_table()) require. Fix: Added compute_mass_defect() call and dplyr::inner_join() to copy module, rt_cluster, mean_intensity from peak_table to the null annotation. (2026-01-26)
  • Fixed permutation p-values generating 0 null scores because simple_annotation() doesn't return a score column. Root cause: The previous permutation test only ran mass matching, but scores are computed downstream by the full pipeline (isotope detection + chemical scoring). Fix: Updated compute_permutation_pvalues() to run the full scoring pipeline for each permutation: simple_annotation -> compute_isotopes -> reformat_annotation_table -> get_chemscore. The original peak correlation matrix (computed from peak intensity patterns) is preserved and re-indexed for each permutation's shuffled mz values. This preserves co-elution evidence while breaking the mz-to-compound relationship, providing a meaningful null hypothesis test. New parameters added: adduct_weights, time_tolerance, intensity_deviation_tolerance, mass_defect_tolerance, isotope_mass_tolerance_ppm, correlation_threshold, filter_by, peak_correlation_matrix. (2026-01-26)
  • Fixed permutation p-values all being identical (100% p < 0.05) due to flawed null score matching logic. Root cause: After permuting mz values, the code tried to match null_annotation back to original annotations by mz value, but the permuted mz values almost never matched original mz values within 1e-6 tolerance, resulting in null_scores being mostly 0. Fix: Changed to global null distribution approach - each permutation returns ALL null scores, and p-values are calculated as the proportion of all null scores >= each observed score. This is a valid permutation test that sidesteps the matching problem. (2026-01-26)
  • Fixed parallel processing function lookup failure in compute_permutation_pvalues() where forked processes couldn't find simple_annotation by name. Root cause: when mclapply forks processes on Unix, the closure run_permutation references simple_annotation by name but the package namespace isn't properly accessible in child processes. Fix: Explicitly capture simple_annotation as a local variable (simple_annotation_fn) before defining the closure, ensuring the function reference is stored in the closure's environment. (2026-01-26)
  • Fixed parallel processing errors in compute_permutation_pvalues() where all workers would fail with "all scheduled cores encountered errors in user code". Root cause: forked processes couldn't find simple_annotation and NULL results weren't detected. Fix: (1) Wrapped run_permutation in tryCatch to handle errors gracefully, (2) Added mc.preschedule = FALSE to mclapply calls for better error handling, (3) Added detection of failed permutations (NULL or try-error results), (4) Added warning messages reporting failed permutation counts, (5) Added error if all permutations fail, (6) P-values now calculated using actual successful permutation count. (2026-01-26)
  • Fixed undefined adduct_weights variable in compute_pathways() function (compute_pathways.R line 126). Added adduct_weights as a required parameter. (2026-01-24)
  • Fixed hardcoded RT tolerance in get_chemscore(). The function previously used <= 10 regardless of the max_diff_rt parameter value. Now correctly uses the max_diff_rt parameter for RT filtering. (2026-01-24)
  • Fixed NA rows appearing in annotation output from get_chemscore(). When compute_chemical_score() returned empty data (no valid adduct evidence), the function returned undefined values, which pmap_dfr converted to NA rows. Fix: Return NULL explicitly for empty results, which pmap_dfr skips automatically. Added an early check for empty filtdata and a post-filter check after complete.cases. (2026-01-24)
  • Fixed Stage 2 output (Stage2_isotope_detection.txt) missing column headers. The file was previously written with col.names = FALSE for all rows. Now writes header on first row only. (2026-01-24)
  • Fixed duplicate rows in annotation output caused by pmap_dfr in advanced_annotation() calling get_chemscore() once per input row instead of once per unique compound_id. This caused N*M row multiplication where N is the number of input rows per chemical and M is the number of output rows per chemical. Fix: Use distinct(compound_id) before the pmap_dfr call. (2026-01-24)
  • Fixed isotope rows being removed by na.omit() in get_chemscore(). Isotopes detected by compute_isotopes() have NA values in non-critical columns (theoretical.mz, Name, MonoisotopicMass) which caused them to be incorrectly filtered out. This resulted in loss of isotope evidence and lower confidence level assignments. Fix: Replace na.omit() with selective filtering on critical columns only. (2026-01-24)
  • Removed all setwd() calls from annotation pipeline functions. Using setwd() changes global state and can leave the working directory in an unexpected state if a function fails. Replaced with absolute paths using file.path(). Affected files: get_chemscore_october.R, multilevelannotationstep4.R, multilevelannotationstep5.R, get_chemscorev1.6.71.R. (2026-01-24)
  • Fixed feature_id_column causing validation error in as_peak_table(). When peak_table contained a non-numeric feature ID column (e.g., "C0001"), the as_peak_table() function's validation that all columns must be numeric would fail. Fix: Remove the feature_id_column from peak_table before validation while preserving it in peak_table_orig for mapping. (2026-01-24)
  • Fixed Stage 2 output (Stage2_isotope_detection.txt) not being created. The outlocorig parameter was passed to get_chemscore() but never used (dead code). Fix: Write Stage 2 output directly in advanced_annotation() after reformat_annotation_table() completes. Removed dead outlocorig parameter from get_chemscore() call. (2026-01-24)
  • Fixed rm() warnings in get_confidence_stage4() function. The function attempted to remove variables (temp_curdata, groupB, good_mod, module_clust) that only exist in certain code paths, causing "object not found" warnings. Removed unnecessary rm() call - R's garbage collection handles cleanup automatically when the function returns. (2026-01-25)
  • Fixed Stage5_curated_results.txt not being created when redundancy_filtering = TRUE but feature_id_column is not provided. Stage 5 output is now always written when redundancy filtering is enabled. (2026-01-25)
  • Removed redundant time.y column from reformat_annotation_table() output in integration_utils.R. The column was identical to time (both contained annotation$rt) and was legacy code. (2026-01-25)
  • Fixed duplicate feature_id.x and feature_id.y columns appearing in stage outputs. The feature_id was being joined multiple times (inside multilevelannotationstep3() and multilevelannotationstep4(), then again in stage output sections). Added safe_join_feature_id() helper function that skips the join if the column already exists. (2026-01-25)
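
As an illustration of the global null distribution approach described in the permutation p-value fix above, the calculation can be sketched as follows. This is a minimal sketch, not the package's code: global_null_pvalues() and its arguments are hypothetical names, and it assumes the null scores from all successful permutations have already been pooled into one numeric vector.

```r
# Hypothetical helper (not part of CLUES.xMSannotator): computes, for each
# observed score, the proportion of pooled null scores that are >= that score.
global_null_pvalues <- function(observed_scores, null_scores) {
  # Drop NA null scores so they do not propagate into the proportions
  null_scores <- null_scores[!is.na(null_scores)]
  if (length(null_scores) == 0) {
    stop("No null scores available; all permutations may have failed.")
  }
  vapply(
    observed_scores,
    function(s) mean(null_scores >= s),
    numeric(1)
  )
}

# Toy data: three observed scores and a pooled null distribution of eight scores
observed <- c(0.5, 2.0, 10.0)
null_pool <- c(0, 0.1, 0.5, 1.0, 2.5, 3.0, 0.2, 0.4)

p <- global_null_pvalues(observed, null_pool)
print(p)
```

Because every observed score is compared against the same pooled vector, no matching of permuted mz values back to the original annotations is needed. The sketch uses the plain proportion stated in the changelog entry, so an observed score larger than every null score gets a p-value of exactly 0.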

Removed

  • Removed unused conda/ directory (meta.yaml, environment-build.yaml, environment-dev.yaml) and .github/workflows/r-conda.yml CI workflow. These were inherited from upstream RECETOX and are not needed since CLUES.xMSannotator is distributed as a standard R package only. Removed the R Conda CI badge from readme.md and updated setup instructions in docs/developer_documentation.md. (2026-02-17)
  • Removed redundant Stage3_pathway_matched.txt write.table from advanced_annotation() (HMDB mode). multilevelannotationstep3() already writes Stage3_HMDB_pathways.txt with the same data. (2026-02-08)
  • Removed custom_pathway_step() from advanced_annotation.R. Its functionality is replaced by multilevelannotationstep3() with db_name = "custom". (2026-02-08)
  • Deleted original multilevelannotationstep4.R and get_confidence_stage4.R (now dead code, replaced by refactored v2 implementation). (2026-02-07)
  • Removed unused compute_confidence_levels.R file (~67 lines). This file contained alternative implementations (compute_expected_confidences(), compute_boosted_confidences(), compute_confidence_levels()) that were never called in the pipeline. The active implementation is in multilevelannotationstep4.R + get_confidence_stage4.R. The file also created a naming collision with the local compute_confidence_levels() function in multilevelannotationstep4.R. (2026-01-27)
  • Removed unused ISgroup dummy column from annotation output. The column was always set to "-" and provided no information. (2026-01-25)
  • Removed redundant forms_valid_adduct_pair filter in advanced_annotation(). This filter is already applied in simple_annotation(). (2026-01-24)
  • Removed remove_tmp_files() function and automatic cleanup of stage output files. All output files are now preserved for user inspection. (2026-01-24)
  • Removed dead code files not used by advanced_annotation() workflow (~1500 lines): get_confidence_stage2.R (legacy v1 confidence), multilevelannotationstep2.R (unused Step 2 wrapper), get_chemscorev1.6.71.R (replaced by get_chemscore_october.R), group_by_rt_histv2.R (only called by the removed get_chemscorev1.6.71.R). See Code_Cleanup.md for full analysis. (2026-01-25)