Freed heavy fit objects and large dataProcess intermediates early#207
Freed heavy fit objects and large dataProcess intermediates early#207tonywu1999 wants to merge 4 commits into
Conversation
* Stripped survreg() $y / $linear.predictors and lm() $model after fitting in .fitSurvival and .fitLinearModel; downstream predict / summary / vcov paths are unaffected. * Rewrote .updateUnequalVariances to mutate a single data.frame in place instead of rebuilding it ~5x per iteration via data.frame(...). * Added rm(raw), rm(peptides_dict), and gc(verbose = FALSE) inside dataProcess() so the ~250-400 MB held by function arguments is freed before MSstatsSummarizationOutput runs. * Added rm(survival_fit) and rm(fit) inside the per-protein linear / TMP summarizers so local fit objects drop as soon as survival / result is materialised. * Tests: inst/tinytest/test_memory_optimization.R Issue 1 (tests 1a, 1b) — 14 assertions, all green. See MSstats-ai/todos/active/TODO-MS-20260514_fix-memory-bugs.md Co-Authored-By: Claude <noreply@anthropic.com>
📝 WalkthroughWalkthroughFrees large intermediates and fitted-model internals during data processing and summarization, fixes a conditional in ChangesMemory Optimization in Data Processing and Fitting
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨No code suggestions found for the PR. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/utils_summarization.R`:
- Around line 208-223: The loess iterations reuse stale fitted values because
input[["fitted"]] is only set from the initial fit; after each weighted refit
(wls.fit) update the stored fitted values so subsequent loess calls use
consistent residual/fitted pairs — e.g. inside the loop after computing wls.fit
(the variable wls.fit) assign input[["fitted"]] <- fitted(wls.fit) or
input[["fitted"]] <- wls.fit$fitted.values before recalculating
input[["abs.resids"]] and proceeding to the next iteration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7d6992f3-0ec8-4947-80c1-53b71f9e9caa
📒 Files selected for processing (4)
R/dataProcess.RR/utils_imputation.RR/utils_summarization.Rinst/tinytest/test_memory_optimization_fits.R
| for (i in seq_len(num_iter)) { | ||
| if (i == 1) { | ||
| abs.resids = data.frame(abs.resids = abs(fit$residuals)) | ||
| fitted = data.frame(fitted = fit$fitted.values) | ||
| input = data.frame(input, | ||
| "abs.resids" = abs.resids, | ||
| "fitted" = fitted) | ||
| input[["abs.resids"]] = abs(fit$residuals) | ||
| input[["fitted"]] = fit$fitted.values | ||
| } | ||
| fit.loess = loess(abs.resids ~ fitted, data = input) | ||
| loess.fitted = data.frame(loess.fitted = fitted(fit.loess)) | ||
| input = data.frame(input, "loess.fitted" = loess.fitted) | ||
| ## loess fitted valuaes are predicted sd | ||
| input$weight = 1 / (input$loess.fitted ^ 2) | ||
| input = input[, !(colnames(input) %in% "abs.resids")] | ||
| input[["loess.fitted"]] = fitted(fit.loess) | ||
| ## loess fitted values are predicted sd | ||
| input[["weight"]] = 1 / (input[["loess.fitted"]] ^ 2) | ||
| input[["abs.resids"]] = NULL | ||
| ## re-fit using weight | ||
| wls.fit = lm(formula(fit), data = input, weights = weight) | ||
| abs.resids = data.frame(abs.resids = abs(wls.fit$residuals)) | ||
| input = data.frame(input, "abs.resids" = abs.resids) | ||
| input = input[, -which(colnames(input) %in% c("loess.fitted", "weight"))] | ||
| input[["abs.resids"]] = abs(wls.fit$residuals) | ||
| input[["loess.fitted"]] = NULL | ||
| input[["weight"]] = NULL | ||
| } |
There was a problem hiding this comment.
Refresh fitted after each weighted refit.
At Line 220 you refresh abs.resids from wls.fit, but fitted is never refreshed after Line 211. For num_iter > 1, later loess iterations use stale fitted values and inconsistent residual/fitted pairs.
Suggested fix
for (i in seq_len(num_iter)) {
if (i == 1) {
input[["abs.resids"]] = abs(fit$residuals)
input[["fitted"]] = fit$fitted.values
}
fit.loess = loess(abs.resids ~ fitted, data = input)
input[["loess.fitted"]] = fitted(fit.loess)
## loess fitted values are predicted sd
input[["weight"]] = 1 / (input[["loess.fitted"]] ^ 2)
input[["abs.resids"]] = NULL
## re-fit using weight
wls.fit = lm(formula(fit), data = input, weights = weight)
input[["abs.resids"]] = abs(wls.fit$residuals)
+ input[["fitted"]] = wls.fit$fitted.values
input[["loess.fitted"]] = NULL
input[["weight"]] = NULL
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@R/utils_summarization.R` around lines 208 - 223, The loess iterations reuse
stale fitted values because input[["fitted"]] is only set from the initial fit;
after each weighted refit (wls.fit) update the stored fitted values so
subsequent loess calls use consistent residual/fitted pairs — e.g. inside the
loop after computing wls.fit (the variable wls.fit) assign input[["fitted"]] <-
fitted(wls.fit) or input[["fitted"]] <- wls.fit$fitted.values before
recalculating input[["abs.resids"]] and proceeding to the next iteration.
|
Thanks, fantastic update. |
mstaniak
left a comment
There was a problem hiding this comment.
Thanks again for his fantastic update, just one question
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmark/profile_dataprocess_peak.R`:
- Around line 58-60: The required-column validation in
profile_dataprocess_peak.R currently defines required_cols as c("ProteinName",
"PeptideSequence", "PrecursorCharge", "FragmentIon", "ProductCharge",
"IsotopeLabelType", "Run", "BioReplicate", "Condition", "Intensity") which omits
PeptideModifiedSequence expected by the package validity path and risks later
failures; update the required_cols vector (and the analogous check around lines
85-89) to include "PeptideModifiedSequence", and ensure any check/handling for
"AnomalyScores" remains separate (i.e., validate presence of
PeptideModifiedSequence as required while treating AnomalyScores as an
optional/special-case column).
- Around line 143-148: The benchmark is using non-default parameters which makes
results non-comparable to dataProcess(); update the calls to
MSstats:::MSstatsSelectFeatures and MSstats:::MSstatsPrepareForSummarization to
use the same default knobs as dataProcess() rather than the explicit overrides
(featureSubset = "all", n_top_feature = 3, remove_uninformative_feature_outlier
= TRUE), i.e. remove the custom arguments and call the functions with their
defaults (or explicitly pass the same parameter values used by dataProcess()) so
the benchmark reflects the standard dataProcess() baseline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b06a9a32-3f43-4c52-841e-6e968edfde2d
📒 Files selected for processing (1)
benchmark/profile_dataprocess_peak.R
| required_cols <- c("ProteinName", "PeptideSequence", "PrecursorCharge", | ||
| "FragmentIon", "ProductCharge", "IsotopeLabelType", | ||
| "Run", "BioReplicate", "Condition", "Intensity") |
There was a problem hiding this comment.
Tighten required-column validation to match the pipeline contract.
The script currently accepts a 10-column schema, but the package validity path expects PeptideModifiedSequence too (with AnomalyScores handled separately). This can admit unsupported inputs and fail later in less obvious places.
Suggested patch
-required_cols <- c("ProteinName", "PeptideSequence", "PrecursorCharge",
+required_cols <- c("ProteinName", "PeptideSequence", "PeptideModifiedSequence", "PrecursorCharge",
"FragmentIon", "ProductCharge", "IsotopeLabelType",
"Run", "BioReplicate", "Condition", "Intensity")Also applies to: 85-89
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmark/profile_dataprocess_peak.R` around lines 58 - 60, The
required-column validation in profile_dataprocess_peak.R currently defines
required_cols as c("ProteinName", "PeptideSequence", "PrecursorCharge",
"FragmentIon", "ProductCharge", "IsotopeLabelType", "Run", "BioReplicate",
"Condition", "Intensity") which omits PeptideModifiedSequence expected by the
package validity path and risks later failures; update the required_cols vector
(and the analogous check around lines 85-89) to include
"PeptideModifiedSequence", and ensure any check/handling for "AnomalyScores"
remains separate (i.e., validate presence of PeptideModifiedSequence as required
while treating AnomalyScores as an optional/special-case column).
|
@Rudhik1904 If you could report your benchmarking results with a large dataset in this PR (i.e. amount of memory freed up, etc.), that would be great for documentation purposes |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
R/utils_summarization.R (1)
219-221:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winRefresh
fittedafter each weighted refit.At Line 219,
abs.residsis updated fromwls.fit, butfittedis not. Fornum_iter > 1, subsequent loess steps use stale fitted values.Suggested fix
wls.fit = lm(formula(fit), data = input, weights = weight) input[, abs.resids := abs(wls.fit$residuals)] + input[, fitted := wls.fit$fitted.values] input[, loess.fitted := NULL] input[, weight := NULL]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@R/utils_summarization.R` around lines 219 - 221, The code updates input[, abs.resids := abs(wls.fit$residuals)] but does not refresh the fitted values, so subsequent iterations (when num_iter > 1) use stale predictions; after computing wls.fit add a line to update the stored fitted values (e.g., set input[, fitted := wls.fit$fitted] or equivalent) so each loess reweighting uses the most recent fitted vector before clearing loess.fitted and weight.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@benchmark/profile_dataprocess_peak.R`:
- Line 8: Replace the garbled header comment on line containing "mark and
elapsed time..." with a clear, readable description; e.g., rewrite the comment
used by the benchmark (the commented header string) to: "Mark elapsed time at
every stage boundary; useful for tracking the running max of `used` after
gc(reset = TRUE) to locate peak memory usage." Ensure you update the same
comment block that contains the malformed text so the benchmark header is
readable and informative.
---
Duplicate comments:
In `@R/utils_summarization.R`:
- Around line 219-221: The code updates input[, abs.resids :=
abs(wls.fit$residuals)] but does not refresh the fitted values, so subsequent
iterations (when num_iter > 1) use stale predictions; after computing wls.fit
add a line to update the stored fitted values (e.g., set input[, fitted :=
wls.fit$fitted] or equivalent) so each loess reweighting uses the most recent
fitted vector before clearing loess.fitted and weight.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dad8d3f4-a2a6-430c-be4c-8fd22547e10c
📒 Files selected for processing (2)
R/utils_summarization.Rbenchmark/profile_dataprocess_peak.R
| # Peak-memory + wall-clock instrumentation for MSstats::dataProcess. | ||
| # | ||
| # Replays dataProcess's internals manually, printing the gc() high-water | ||
| # mark and elapsed time at every stage boundary. Useful forunning max of `used` since gc(reset = TRUE)r locating |
There was a problem hiding this comment.
Fix the garbled benchmark description comment.
Line 8 contains corrupted text (forunning ... r locating), which makes the header guidance harder to follow.
Suggested patch
-# mark and elapsed time at every stage boundary. Useful forunning max of `used` since gc(reset = TRUE)r locating
+# mark and elapsed time at every stage boundary. Useful for locating🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@benchmark/profile_dataprocess_peak.R` at line 8, Replace the garbled header
comment on line containing "mark and elapsed time..." with a clear, readable
description; e.g., rewrite the comment used by the benchmark (the commented
header string) to: "Mark elapsed time at every stage boundary; useful for
tracking the running max of `used` after gc(reset = TRUE) to locate peak memory
usage." Ensure you update the same comment block that contains the malformed
text so the benchmark header is readable and informative.
User description
PR Type
Bug fix, Enhancement, Tests
Description
Free large pipeline inputs earlier
Strip heavy fields from fit objects
Avoid repeated copies in variance updates
Add fit memory regression tests
Diagram Walkthrough
File Walkthrough
dataProcess.R
Free pipeline intermediates and fit objects earlyR/dataProcess.R
rawafter initial preparationpeptides_dictafter normalizationgc(verbose = FALSE)before output assemblysurvival_fitandfitsoonerutils_imputation.R
Strip unused `survreg` fields after fittingR/utils_imputation.R
fit$yaftersurvreg()fittingfit$linear.predictorsbefore returningutils_summarization.R
Reduce `lm` footprint and variance-copy churnR/utils_summarization.R
linear_model$modelbefore returninginputtodata.frameonce upfronttest_memory_optimization_fits.R
Add regression tests for memory-stripped fitsinst/tinytest/test_memory_optimization_fits.R
survregfieldslm$modelsummary()andvcov()still workMotivation and Context
This PR reduces peak memory use in MSstats' data-processing and per-protein summarization by freeing large intermediate objects and by stripping heavy fields from fitted model objects that are not required for downstream use. Two main memory sources are targeted: (1) large intermediate data structures created during dataProcess() (raw, peptides_dict, etc.), and (2) large components inside fitted model objects (survreg and lm) that are retained unnecessarily across loop iterations. The PR also reduces copy churn in the unequal-variance weighting loop. Together these changes free on the order of ~250–400 MB during dataProcess in observed benchmarks.
Short summary of the solution
Detailed changes
R/dataProcess.R
R/utils_imputation.R (.fitSurvival)
R/utils_summarization.R (.fitLinearModel, .updateUnequalVariances)
Per-protein summarization orchestration
New benchmark and helpers
Unit tests added or modified
inst/tinytest/test_memory_optimization_fits.R (new)
benchmark/profile_dataprocess_peak.R (new)
No existing tests were modified; new tests follow the project's tinytest pattern and validate both correctness (downstream APIs) and memory reduction.
Coding guidelines / potential violations