fix(maxquant): Filter out decoys with decoy column #133
📝 Walkthrough
Changes: Contaminant Filtering Enhancement
Estimated Code Review Effort: 🎯 2 (Simple) | ⏱️ ~5 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@R/clean_MaxQuant.R`:
- Around line 20-27: The current filter_cols in clean_MaxQuant.R uses incorrect
MaxQuant header names (e.g., "Potentialcontaminant" and "Decoy") so filtering
silently skips expected columns; change filter_cols to use the literal MaxQuant
column names present in our inputs (e.g., "Contaminant",
"Potential.contaminant", "Reverse") and remove "Decoy"; if remove_by_site is
true append "Only.identified.by.site" to filter_cols and update the msg text to
match these literal names so the log reflects the actual columns being filtered;
locate the filter_cols and msg variables in clean_MaxQuant.R to make this edit.
  filter_cols = c("Contaminant", "Potentialcontaminant", "Reverse", "Decoy")
  msg = paste("** + Contaminant, + Reverse, + Decoy, + Potential.contaminant",
              "proteins are removed.")
  if (remove_by_site) {
      filter_cols = c(filter_cols, "Onlyidentifiedbysite")
-     msg = paste("** + Contaminant, + Reverse, + Potential.contaminant,",
+     msg = paste("** + Contaminant, + Reverse, + Decoy, + Potential.contaminant,",
                  "+ Only.identified.by.site proteins are removed.")
  }
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Search for references to MaxQuant Decoy and Potentialcontaminant columns
echo "=== Searching for 'Decoy' column references ==="
rg -n -C3 --type=r -i 'decoy' -g '!*test*'
echo -e "\n=== Searching for 'Potentialcontaminant' column references ==="
rg -n -C3 --type=r 'Potentialcontaminant' -g '!*test*'
echo -e "\n=== Checking test files or documentation for MaxQuant column examples ==="
fd -e r -e md -e Rd . | xargs rg -l -i 'maxquant.*column|evidence.*file'
Repository: Vitek-Lab/MSstatsConvert
Length of output: 9424
🌐 Web query:
What are the column names for contaminant and decoy proteins in recent MaxQuant evidence.txt and proteinGroups.txt output files?
💡 Result:
In recent MaxQuant output tables, the columns used to flag decoy and contaminant entries are:
1) evidence.txt (peptide evidence)
- Decoy (reverse) column name: Reverse [1][2]
- Contaminant column name: Potential.contaminant [1][2]
2) proteinGroups.txt (protein group evidence)
- Decoy (reverse) column name: Reverse [3]
- Contaminant column name: Potential contaminant (reported as "Potential contaminant" in the proteinGroups.txt column description, corresponding to the same contaminant flag used in MaxQuant outputs) [3]
Note on exact naming: Many downstream tools and documentation refer to the contaminant flag as "Potential.contaminant" (with a dot) for evidence-style tables, while protein group tooling sometimes displays it as "Potential contaminant" (with a space) in column descriptions. Confirm the literal header line in your specific proteinGroups.txt, because column names can vary by MaxQuant version [1][2].
Citations:
- 1: https://cambridgecentreforproteomics.github.io/course_expression_proteomics/materials/08_maxquant_notes.html
- 2: https://www.bioconductor.org/packages/release/bioc/vignettes/proDA/inst/doc/data-import.html
- 3: https://analyst-suite.monash-proteomics.cloud.edu.au/apps/lfq-analyst/LFQ-Analyst_manual.pdf
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show clean_MaxQuant.R around the reported lines and nearby logic
echo "=== clean_MaxQuant.R (lines 1-120) ==="
sed -n '1,120p' R/clean_MaxQuant.R | cat -n
# Find and inspect .filterManyColumns usage/definition
echo
echo "=== Searching for .filterManyColumns definition/uses ==="
rg -n "\.filterManyColumns" -S R | head -n 50
echo
echo "=== Show .filterManyColumns implementation (if found) ==="
# Try to locate file containing it
file=$(rg -l "\.filterManyColumns" -S R | head -n 1 || true)
if [ -n "${file:-}" ]; then
  echo "Found in: $file"
  sed -n '1,220p' "$file" | cat -n
else
  echo "No implementation found under R/"
fi
# Additionally, inspect .cleanRawMaxQuant body for evidence/protein_groups handling
echo
echo "=== Show body of .cleanRawMaxQuant (lines 1-220) ==="
sed -n '1,220p' R/clean_MaxQuant.R | cat -n
Repository: Vitek-Lab/MSstatsConvert
Length of output: 19393
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show clean_MaxQuant.R around the reported lines and nearby logic
echo "=== clean_MaxQuant.R (lines 1-160) ==="
sed -n '1,160p' R/clean_MaxQuant.R | cat -n
# Find and inspect helper it uses for filtering many columns
echo
echo "=== Searching for .filterManyColumns definition/uses ==="
rg -n "\.filterManyColumns" -S R | head -n 50
echo
echo "=== Show candidate helper implementation(s) ==="
files=$(rg -l "\.filterManyColumns" -S R || true)
if [ -n "${files:-}" ]; then
  for f in $files; do
    echo "--- $f (first 240 lines) ---"
    sed -n '1,240p' "$f" | cat -n | head -n 240
  done
else
  echo "No occurrences found under R/"
fi
Repository: Vitek-Lab/MSstatsConvert
Length of output: 19390
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show clean_MaxQuant.R around the reported lines and nearby logic
echo "=== clean_MaxQuant.R (lines 1-200) ==="
sed -n '1,200p' R/clean_MaxQuant.R | cat -n
# Find and inspect helper for filtering many columns
echo
echo "=== Searching for .filterManyColumns definition/uses ==="
rg -n "\.filterManyColumns" -S R | head -n 50
echo
echo "=== Show candidate helper implementation(s) ==="
files=$(rg -l "\.filterManyColumns" -S R || true)
if [ -n "${files:-}" ]; then
  for f in $files; do
    echo "--- $f (first 260 lines) ---"
    sed -n '1,260p' "$f" | cat -n
  done
else
  echo "No occurrences found under R/"
fi
Repository: Vitek-Lab/MSstatsConvert
Length of output: 19390
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Search for Potential.contaminant in repo ==="
rg -n --type=r -i 'Potential\.contaminant|Potentialcontaminant|Potential contaminant' .
echo
echo "=== Search for MaxQuant header mappings for decoy/contaminant ==="
rg -n --type=r -i 'MaxQuant|evidence\.txt|proteinGroups\.txt|Reverse|Potential\.contaminant|Decoy' .
Repository: Vitek-Lab/MSstatsConvert
Length of output: 11402
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Header: inst/tinytest/raw_data/MaxQuant/mq_ev.csv ==="
head -n 1 inst/tinytest/raw_data/MaxQuant/mq_ev.csv
echo
echo "=== Header: inst/tinytest/raw_data/MaxQuant/mq_pg.csv ==="
head -n 1 inst/tinytest/raw_data/MaxQuant/mq_pg.csv
echo
echo "=== Header: inst/tinytest/raw_data/MaxQuantTMT/mq_ev.csv ==="
head -n 1 inst/tinytest/raw_data/MaxQuantTMT/mq_ev.csv
echo
echo "=== Header: inst/tinytest/raw_data/MaxQuantTMT/mq_pg.csv ==="
head -n 1 inst/tinytest/raw_data/MaxQuantTMT/mq_pg.csv
echo
echo "=== Columns containing 'Reverse' / 'Decoy' / 'Potential' ==="
for f in \
  inst/tinytest/raw_data/MaxQuant/mq_ev.csv \
  inst/tinytest/raw_data/MaxQuant/mq_pg.csv \
  inst/tinytest/raw_data/MaxQuantTMT/mq_ev.csv \
  inst/tinytest/raw_data/MaxQuantTMT/mq_pg.csv; do
  echo "--- $f ---"
  head -n 1 "$f" | tr ',' '\n' | sed 's/^ *//;s/ *$//' | rg -n -i 'reverse|decoy|potential'
done
Repository: Vitek-Lab/MSstatsConvert
Length of output: 27286
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Inspect inst/tinytest/test_cleanRaw.R ==="
sed -n '1,140p' inst/tinytest/test_cleanRaw.R | cat -n
echo
echo "=== Inspect inst/tinytest/test_converters_MaxQtoMSstatsFormat.R ==="
sed -n '1,120p' inst/tinytest/test_converters_MaxQtoMSstatsFormat.R | cat -n
echo
echo "=== Inspect inst/tinytest/test_converters_MaxQtoMSstatsTMTFormat.R ==="
sed -n '1,160p' inst/tinytest/test_converters_MaxQtoMSstatsTMTFormat.R | cat -n
echo
echo "=== Inspect R/MSstatsConvert_core_functions.R for MSstatsClean MaxQuant method ==="
rg -n "setMethod\\(\"MSstatsClean\".*MaxQuant|cleanRawMaxQuant" -n R/MSstatsConvert_core_functions.R
sed -n '190,270p' R/MSstatsConvert_core_functions.R | cat -n
Repository: Vitek-Lab/MSstatsConvert
Length of output: 15630
Fix MaxQuant contaminant/decoy column names used for filtering (filter_cols)
- `R/clean_MaxQuant.R` sets `filter_cols = c("Contaminant", "Potentialcontaminant", "Reverse", "Decoy")`, but the expected MaxQuant headers in this repo's tinytest inputs are `Reverse` and `Potential.contaminant` (no `Decoy`, no `Potentialcontaminant`).
- Because `.filterManyColumns()` only filters columns in `intersect(filter_columns, colnames(input))`, the incorrect names are silently skipped, so contaminant proteins may not be removed even though the logged `msg` claims `Potential.contaminant` is filtered.
Update `filter_cols` to match the literal MaxQuant headers (at least `Potential.contaminant` and `Reverse`; remove or conditionally handle the non-existent `Decoy`, `Potentialcontaminant`, and `Contaminant`).
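The silent-skip behavior described in this finding can be illustrated with a minimal base R sketch. `filter_many_columns_sketch()` is a hypothetical stand-in, not the package's `.filterManyColumns()`. It assumes, as the finding states, that only names in `intersect(filter_columns, colnames(input))` are applied, and that flagged rows carry a `"+"` marker. The toy data is invented for illustration. The sketch also assumes names are compared literally; if the package normalizes headers before this step, the names that match would differ.

```r
# Hypothetical stand-in for the helper described in the finding: it applies
# only the filter columns that actually exist in the input.
filter_many_columns_sketch <- function(input, filter_columns, marker = "+") {
    present <- intersect(filter_columns, colnames(input))
    skipped <- setdiff(filter_columns, present)
    if (length(skipped) > 0) {
        # Reporting skipped names is the step a bare intersect() leaves out.
        message("Filter columns not found: ", paste(skipped, collapse = ", "))
    }
    keep <- rep(TRUE, nrow(input))
    for (col in present) {
        values <- input[[col]]
        # Keep rows that are unflagged (or missing) in this column.
        keep <- keep & (is.na(values) | values != marker)
    }
    input[keep, , drop = FALSE]
}

# Toy input with literal header names (illustrative data, not from the repo).
toy <- data.frame(
    Proteins = c("P1", "P2", "P3"),
    Reverse = c("", "+", ""),
    Potential.contaminant = c("", "", "+"),
    stringsAsFactors = FALSE
)

# A name that does not match any header is skipped, so row P3 survives.
kept_wrong <- filter_many_columns_sketch(toy, c("Potentialcontaminant", "Reverse"))
# With the matching name, both flagged rows are dropped.
kept_right <- filter_many_columns_sketch(toy, c("Potential.contaminant", "Reverse"))
```

Emitting a message for unmatched names is one possible way to make a mismatch visible in the log instead of leaving the filter quietly incomplete.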
Motivation and Context
https://groups.google.com/g/msstats/c/NwsByfS2Y5M
MaxQuant proteomics software has introduced the "Potential.contaminant" column in its output format as an additional means of identifying potentially problematic proteins. The `.cleanRawMaxQuant()` function previously filtered proteins based only on the `Contaminant`, `Reverse`, and `Decoy` columns. This PR updates the function to also filter out proteins marked in the new `Potential.contaminant` column, so that MSstatsConvert handles recent changes to MaxQuant's output format and keeps potentially problematic proteins out of downstream analysis.
Changes
`R/clean_MaxQuant.R`:
- Added `"Potentialcontaminant"` to the `filter_cols` vector (line 20) to filter rows where this column contains marked values
- Updated the message for the `remove_by_site = TRUE` case (lines 25-26) to also mention "Potential.contaminant" alongside existing filters ("Contaminant", "Reverse", "Decoy", and "Only.identified.by.site")
Unit Tests
No unit tests were added or modified in this PR. The existing test suite in `inst/tinytest/test_cleanRaw.R` does test MaxQuant cleaning functionality (lines 40-55), but the test data already contains the `Potential.contaminant` column in the `mq_pg.csv` file, so the filtering behavior is implicitly covered by existing tests. No explicit new test cases were created to specifically validate the `Potentialcontaminant` filtering behavior.
Coding Guidelines
No violations of coding guidelines identified. The changes follow the existing code patterns and maintain consistency with the R coding style used throughout the package.