Merge MZMinetoMSstatsFormat converter into development#132
Conversation
Brings metabolomics into the MSstats family by adding an MZMine converter that mirrors the structure of DIANNtoMSstatsFormat. Phase 1 of a two-phase task; Phase 2 (MSstatsShiny BIO=Metabolomics) will be a separate PR.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📝 WalkthroughWalkthroughThis PR adds MZMine metabolomics data conversion to MSstatsConvert. It introduces an S4 class ( ChangesMZMine Metabolomics Converter
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Pre-existing roxygen2 issue surfaced during this work. When re-documenting the package with roxygen2 7.3.3, the build fails with "In topic 'MSstatsClean': @inherits failed to find topic '.cleanRawProteinProspector'." This is unrelated to MZMine. Somewhere in the package an |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
inst/tinytest/test_converters_MZMinetoMSstatsFormat.R (1)
98-100: ⚡ Quick winStrengthen fallback coverage assertion to detect missing mz_rt IDs.
On Lines 98-100,
all(ProteinName %in% expected_mz_rt)only checks “no unexpected values”; it does not guarantee every expected fallback appears. Use set equality on unique values to avoid false positives.Proposed test assertion hardening
expected_mz_rt = c("123.056_1.23", "245.129_3.45", "367.201_5.67", "489.334_7.89", "555.447_9.1", "123.056_1.45") -expect_true(all(as.character(output_nolib_dt$ProteinName) %in% expected_mz_rt)) +expect_equal( + sort(unique(as.character(output_nolib_dt$ProteinName))), + sort(expected_mz_rt) +)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@inst/tinytest/test_converters_MZMinetoMSstatsFormat.R` around lines 98 - 100, The test currently only asserts that there are no unexpected ProteinName values using expected_mz_rt and output_nolib_dt$ProteinName, which can miss missing expected mz_rt IDs; change the assertion to compare the sets (or sorted uniques) so it verifies every expected fallback is present and no extras — e.g., replace the all(... %in% ...) check with a set equality check between unique(as.character(output_nolib_dt$ProteinName)) and expected_mz_rt (or sorted variants) to ensure exact match.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/clean_MZMine.R`:
- Around line 55-56: The code sorts annotations using setorder(ann, id, -score)
but assumes score is numeric; coerce ann$score to numeric before ordering to
avoid mis-ranking when score is character/factor. Update the logic around the
ann data.table (before setorder) to convert score (e.g., ann[, score :=
as.numeric(score)]) and handle potential NAs (optionally warn if coercion
produces NA) so setorder and the subsequent unique(ann, by = "id") produce
correct ann_top results.
In `@R/converters_MZMinetoMSstatsFormat.R`:
- Around line 8-10: Update the parameter docs for annotation to state that the
`Run` column must match the standardized sample/run names produced by the
converter (i.e., after the function's column-name cleaning which strips the
trailing " Peak area" and normalizes spaces/dots), because the code (look for
the converter function that accepts `annotation` and the internal name-cleaning
logic that normalizes sample column names) uses those cleaned names when
joining; tell users to either provide `Run` values in that standardized form or
to run the same name-cleaning routine on their annotation before passing it in.
- Around line 54-55: The wrapper call to MSstatsImport in
converters_MZMinetoMSstatsFormat.R ignores the variadic arguments (...)
advertised for passing to data.table::fread; update the call where input is
assigned (the MSstatsImport invocation) to forward ... into MSstatsImport so
caller-supplied fread options are preserved, i.e., include ... in the
MSstatsImport argument list when calling MSstatsImport(list(input = input),
"MSstats", "MZMine") so the fread passthrough works as documented.
---
Nitpick comments:
In `@inst/tinytest/test_converters_MZMinetoMSstatsFormat.R`:
- Around line 98-100: The test currently only asserts that there are no
unexpected ProteinName values using expected_mz_rt and
output_nolib_dt$ProteinName, which can miss missing expected mz_rt IDs; change
the assertion to compare the sets (or sorted uniques) so it verifies every
expected fallback is present and no extras — e.g., replace the all(... %in% ...)
check with a set equality check between
unique(as.character(output_nolib_dt$ProteinName)) and expected_mz_rt (or sorted
variants) to ensure exact match.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 60c2c010-0294-4a60-b8ea-1991a39f0d27
⛔ Files ignored due to path filters (3)
inst/tinytest/raw_data/MZMine/annotation.csvis excluded by!**/*.csvinst/tinytest/raw_data/MZMine/mzmine_annotations.csvis excluded by!**/*.csvinst/tinytest/raw_data/MZMine/mzmine_input.csvis excluded by!**/*.csv
📒 Files selected for processing (13)
.Rbuildignore.gitignoreDESCRIPTIONNAMESPACER/MSstatsConvert_core_functions.RR/clean_MZMine.RR/converters_MZMinetoMSstatsFormat.Rinst/tinytest/test_converters_MZMinetoMSstatsFormat.Rman/MSstatsClean.Rdman/MSstatsInputFiles.Rdman/MZMinetoMSstatsFormat.Rdman/dot-cleanRawMZMine.Rdvignettes/msstats_data_format.Rmd
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/clean_MZMine.R`:
- Around line 55-57: The score column is being coerced with as.numeric directly
(ann[, score := suppressWarnings(as.numeric(score))]) which will return factor
internal codes if score is a factor; change the coercion to go through character
first (i.e., suppressWarnings(as.numeric(as.character(score)))) so factor levels
are converted correctly, keep the anyNA(ann$score) check and the existing stop
message unchanged to validate numeric coercion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 95a37b9b-01a2-4648-b3c5-3aa94dcfecbf
📒 Files selected for processing (4)
R/clean_MZMine.RR/converters_MZMinetoMSstatsFormat.Rinst/tinytest/test_converters_MZMinetoMSstatsFormat.Rman/MZMinetoMSstatsFormat.Rd
✅ Files skipped from review due to trivial changes (1)
- man/MZMinetoMSstatsFormat.Rd
End-to-end test:MZMinetoMSstatsFormat against the biologist's Same input file PART 1: ROW KEY COMPARISONTheir (feature, run) keys: 59087 -> Every (feature, run) pair matches. No rows dropped, no extra rows. PART 2: INTENSITY COMPARISONstatus N Zero value mismatches. -> All peak-area values agree. The 5,596 both-NA rows are where the PART 3: COMPOUND NAME COMPARISONTheir annotated features (any source): 1796 Their library-annotated features: 69 In both: 69 Name mismatches on the 69 shared library-annotated features: 0 Note on the 1,679 features named by SIRIUS in their pipeline: |
| 2,245.1290,3.45,5000,4800,5200,4900 | ||
| 3,367.2010,5.67,800,0,750,820 | ||
| 4,489.3340,7.89,2000,2100,1900,2050 | ||
| 5,555.4470,9.10,100,0,0,0 |
There was a problem hiding this comment.
In the msstats-ai repo, it would be helpful to start compiling questions to ask a developer when they're creating converters (and integrate it as part of an existing skill / a new skill for msstatsConvert development). One such question to ask:
- Can we assume zeros mean a value is missing due to low abundance? If so, we can safely convert them to NA.
| feature4_proteins = unique(output_dt[PeptideSequence == "4", ProteinName]) | ||
| expect_equal(as.character(feature4_proteins), "489.334_7.89") | ||
| feature5_proteins = unique(output_dt[PeptideSequence == "5", ProteinName]) | ||
| expect_equal(as.character(feature5_proteins), "555.447_9.1") |
There was a problem hiding this comment.
Another question for msstats-ai repo:
- Should we filter out features that don't have a clear corresponding protein/metabolite? Generally the preference is yes, we should filter them out (because it messes with multiple hypothesis testing)
| @@ -0,0 +1,7 @@ | |||
| row ID,row m/z,row retention time,sampleA.mzML Peak area,sampleB.mzML Peak area,sampleC.mzML Peak area,sampleD.mzML Peak area | |||
There was a problem hiding this comment.
I think it was great you created a really small version of the user input that is simple. Makes the unit tests really straightforward and easy to follow (noticed other unit tests are harder to follow since there's so many rows in the input)
Maybe in msstats-ai, mention to generate a subset of the test dataset we're given for testing simplicity. And the test dataset should have things like missing values, etc.
| output_nolib = MZMinetoMSstatsFormat(input, annotation = annot, | ||
| mzmine_annotations = NULL, | ||
| use_log_file = FALSE) | ||
| output_nolib_dt = data.table::as.data.table(output_nolib) | ||
|
|
||
| # Every ProteinName is the mz_rt fallback string | ||
| expect_equal(ncol(output_nolib), 11) | ||
| expect_equal(nrow(output_nolib), 24) | ||
| expected_mz_rt = c("123.056_1.23", "245.129_3.45", "367.201_5.67", | ||
| "489.334_7.89", "555.447_9.1", "123.056_1.45") | ||
| expect_equal( | ||
| sort(unique(as.character(output_nolib_dt$ProteinName))), | ||
| sort(expected_mz_rt) | ||
| ) | ||
| # Compound names from the library must not leak in | ||
| expect_false(any(as.character(output_nolib_dt$ProteinName) %in% | ||
| c("Caffeine", "GlucoseHigh", "GlucoseLow", "Lactate"))) |
There was a problem hiding this comment.
I think you can safely get rid of these unit tests for now and make the mzmine annotation file be mandatory. I don't see this scenario happening / would rather not support it initially.
There was a problem hiding this comment.
Reading your E2E testing
Their annotated features (any source): 1796
sirius: 1679
library: 69
ms2query: 48
Their library-annotated features: 69
Our library-annotated features: 69
In both: 69
Only in theirs (we missed naming): 0
Only in ours (their score < 0.7 cutoff): 0
Name mismatches on the 69 shared library-annotated features: 0
-> Perfect agreement on every library-named feature.
Note on the 1,679 features named by SIRIUS in their pipeline:
those show up as mz_rt fallback strings in our output (e.g.
"455.282_0.65"), because Phase 1 only implements the library
annotation source, not SIRIUS / MS2Query / CANOPUS.
Could you elaborate what you mean here? Are you saying that for features identified by SIRIUS, there's no compound_name? And then that's why you fell back to mz_rt? Because this may affect any of my comments related to the mz_rt fallback.
There was a problem hiding this comment.
@tonywu1999 Okay, let me clarify what's happening with those features.
The library annotations file (mzmine-summer25-microbial-results_annotations.csv) only contains entries for features the spectral library could match i.e. 69 unique feature IDs out of 2,569 total. SIRIUS-named features aren't in that file at all; SIRIUS writes its results to a separate structure_identifications.tsv.
Our converter only reads the library annotations file (the mzmine_annotations argument). So when SIRIUS named a feature in their pipeline, we have no annotation row for that feature at all. Not a low-scoring one, not a missing compound_name, just no row. That's why those features fall back to mz_rt in our output.
To pick up SIRIUS / MS2Query / CANOPUS names, the user would have to either (a) run the full notebook to produce the master_annotation_table.csv and feed that in as mzmine_annotations (after renaming consensus_name → compound_name and consensus_score → score), or (b) we extend the converter to accept multiple annotation sources, which is what Phase 1 deliberately scoped out.
| # Without a library: ProteinName is an mz_rt fallback string | ||
| mzmine_no_lib = MZMinetoMSstatsFormat( | ||
| mzmine_input, | ||
| annotation = mzmine_annotation, | ||
| mzmine_annotations = NULL, | ||
| use_log_file = FALSE | ||
| ) | ||
| head(mzmine_no_lib) | ||
| ``` |
| stop("mzmine_annotations is missing required column(s): ", | ||
| paste(missing_ann, collapse = ", "), ".") | ||
| } | ||
| ann[, score := suppressWarnings(as.numeric(as.character(score)))] |
There was a problem hiding this comment.
What kind of warnings show up? Why as.character and then as.numeric when you can just do as.numeric?
| } | ||
| ann[, score := suppressWarnings(as.numeric(as.character(score)))] | ||
| if (anyNA(ann$score)) { | ||
| stop("mzmine_annotations$score must be numeric (or coercible to numeric).") |
There was a problem hiding this comment.
This error message may be confusing to a user. I'd say the score column must be numeric values in the mzmine annotation file
| round(mz_input[[rt_col]], 2)) | ||
|
|
||
| if (!is.null(mzmine_annotations)) { | ||
| ann <- data.table::as.data.table(mzmine_annotations) |
There was a problem hiding this comment.
Can you use a variable name that is more descriptive, e.g. feature to compound mapping?
| mz_input, | ||
| id.vars = c("ProteinName", "PeptideSequence"), | ||
| measure.vars = peak_area_cols, | ||
| variable.name = "sample_col", |
There was a problem hiding this comment.
Could this simply be named to "Run"?
| compound <- mz_rt_fallback | ||
| } | ||
|
|
||
| mz_input[, ProteinName := compound] |
There was a problem hiding this comment.
Are we assuming mz_input is sorted here by ID? Would it be better to do a join?
e.g.
mz_input[
ann,
ProteinName := i.compound_name,
on = c(id_col = "id"),
mult = "first"
]
Summary
Adds
MZMinetoMSstatsFormat, a new converter for MZMine metabolomics output, mirroring the structure and conventions ofDIANNtoMSstatsFormat. This is Phase 1 of a cross-package task to add metabolomics support to the MSstats family; Phase 2 (theBIO = "Metabolomics"value in MSstatsShiny) will be a separate PR in Vitek-Lab/MSstatsShiny that depends on this one.What's added
R/clean_MZMine.R: internal.cleanRawMZMine()that handles the wide-to-long pivot, column normalization, and the optional compound-name join.R/converters_MZMinetoMSstatsFormat.R— public converter. Signature mirrorsDIANNtoMSstatsFormatwith metabolomics-appropriate defaults (remove_shared_peptides = FALSE, no decoy/oxidation filters,IsotopeLabelType = "Light").R/MSstatsConvert_core_functions.R—MSstatsMZMineFilesclass registration +MSstatsCleanmethod dispatch. Dispatch is name-based in this package, so no other registration site needs editing.inst/tinytest/raw_data/MZMine/: a wide-format quant CSV (6 features × 4 samples), spectral-library annotations CSV, and run-annotation CSV.inst/tinytest/test_converters_MZMinetoMSstatsFormat.R: 38 assertions covering happy path, IsotopeLabelType, NA charge/fragment columns, highest-score-wins annotation policy, mz_rt fallback for unannotated features, andremoveProtein_with1Featurebehavior.vignettes/msstats_data_format.Rmd.Column mapping
ProteinNamemzmine_annotations(highest-scoring hit per feature), or mz_rt fallbackPeptideSequencerow IDfrom quant filePrecursorChargeNAFragmentIonNAProductChargeNAIsotopeLabelType"Light"(matches DIANN's non-SILAC convention)RunPeak areastrippedIntensityDesign choices worth noting
Optional
mzmine_annotationsparameter. Most features in real MZMine datasets have no spectral-library annotation (in the example dataset I tested with, 69 of 2,569 features were annotated — about 3%). To make output biologically usable,MZMinetoMSstatsFormataccepts an optionalmzmine_annotationsdata frame that joins compound names by highest score per feature. WhenNULL, every feature falls back to a mz_rt string (paste0(round(mz, 4), "_", round(rt, 2))). This is the simple spectral-library join only - not the full multi-source SIRIUS/MS2Query/CANOPUS consensus workflow some labs use.Verification
tinytest::test_all("."): 732/732 pass, including the 38 new MZMine assertions.R CMD checkon MSstatsConvert: clean. One new NOTE (MZMinetoMSstatsFormat: no visible binding for global variable 'Intensity'); same pattern as DIANN/Skyline/Spectronaut.R CMD checkon../MSstatsshows no MZMine-related findings (pre-existing WARNINGs/NOTEs only). I wasn't able to run cross-checks onMSstatsTMT,MSstatsPTM, orMSstatsLiPbecause they're not cloned in my immediate workspace - flagging in case you want to verify those before merge.Heads-up for review
Column-name standardization affects user-facing
Runvalues..standardizeColnamesstrips spaces and dots from column names, so the per-sample column"sampleA.mzML Peak area"becomessampleAmzMLPeakareainternally and the resultingRunvalue (after thePeakareasuffix strip) issampleAmzML. This means a user's annotation file must useRun = sampleAmzML, notsampleA.mzMLorsampleA. This is consistent with how every MSstatsConvert converterbehaves — but MZMine's column names are uglier than the others, so the transformation is more visible. Worth deciding whether docs should call this out more loudly, the converter should normalize the user's
Runvalues internally, or it's fine as-is.Closes Phase 1 of the metabolomics integration. Phase 2 (MSstatsShiny
BIO = "Metabolomics") will follow in a separate PR.Motivation and context — short summary
This PR adds first-class support for untargeted metabolomics MZMine exports to MSstatsConvert by introducing a converter and cleaning pipeline that mirrors DIANN conventions but uses metabolomics-appropriate defaults (e.g., IsotopeLabelType = "Light", NA charge/fragment fields). It pivots MZMine wide-format " Peak area" tables to MSstats long format, optionally joins spectral-library compound annotations (highest-score wins), and falls back to an mz_rt identifier for unannotated features.
Detailed changes
Unit tests added/modified
Coding guidelines / notable issues
@inheritParamsreference targets a function whose roxygen block does not produce a discoverable topic, causing roxygen2 v7.3.3 to error (roxygen2 v8 tolerates it). The author avoided re-running document() to keep scope limited and recommends a separate housekeeping PR to fix the ProteinProspector docstring (adjust@inheritParamsor ensure the referenced topic is documentable).