fix(maxquant): Filter out decoys with decoy column#133

Open
tonywu1999 wants to merge 1 commit into devel from fix-maxquant

Conversation

@tonywu1999
Contributor

@tonywu1999 tonywu1999 commented May 22, 2026

Motivation and Context

https://groups.google.com/g/msstats/c/NwsByfS2Y5M

I was using MaxQuant output with MSstats for the first time. I ran MaxQtoMSstatsFormat and then dataProcess, and noticed "REV_" (reverse) sequences in the processed_data$ProteinLevelData slot. An older MaxQuant output shows that the decoy column used to be named "Reverse" in the proteinGroups.txt and evidence.txt output files. With MaxQuant 2.8.0.0, the "Reverse" column is missing from these files and a new column called "Decoy" appears to carry the same information. I couldn't find any documentation of this change, but I believe it is why reverse sequences appeared in my processed data.
When I reran MSstats after renaming the "Decoy" columns to "Reverse" in the proteinGroups and evidence data, MSstats removed the decoy sequences and they no longer appeared in the processed data.
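
The renaming workaround described above can be sketched in base R. This is a minimal illustration only: the data frames are toy stand-ins for evidence.txt and proteinGroups.txt after reading them into R, and the helper name `rename_decoy_to_reverse` is hypothetical (it is not part of MSstats or MSstatsConvert).

```r
# Hypothetical helper: rename a "Decoy" column to "Reverse" when only the
# former is present, so a converter that looks for "Reverse" can find it.
rename_decoy_to_reverse <- function(df) {
    cols <- colnames(df)
    if ("Decoy" %in% cols && !("Reverse" %in% cols)) {
        colnames(df)[cols == "Decoy"] <- "Reverse"
    }
    df
}

# Toy stand-ins for newer MaxQuant output (values are illustrative only).
evidence <- data.frame(
    Proteins = c("P12345", "REV__P67890"),
    Decoy = c("", "+"),
    stringsAsFactors = FALSE
)
protein_groups <- data.frame(
    Protein.IDs = c("P12345", "REV__P67890"),
    Decoy = c("", "+"),
    stringsAsFactors = FALSE
)

evidence <- rename_decoy_to_reverse(evidence)
protein_groups <- rename_decoy_to_reverse(protein_groups)
```

The renamed tables would then be passed to MaxQtoMSstatsFormat in place of the originals. The helper leaves a table untouched when it already has a "Reverse" column, so it should be safe to apply to older outputs as well.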

Motivation and Context

MaxQuant proteomics software has introduced the "Potential.contaminant" column in its output format as an additional means of identifying potentially problematic proteins. The .cleanRawMaxQuant() function previously filtered proteins only based on the Contaminant, Reverse, and Decoy columns. This PR updates the function to also filter out proteins marked in the new Potential.contaminant column, ensuring that the MSstatsConvert package properly handles recent changes to MaxQuant's output format and prevents potentially problematic proteins from being included in downstream analysis.

Changes

  • R/clean_MaxQuant.R:
    • Added "Potentialcontaminant" to the filter_cols vector (line 20) to filter rows where this column contains marked values
    • Updated the informational message (lines 21-22) to include "Potential.contaminant" in the list of filtered protein categories
    • Updated the informational message for remove_by_site = TRUE case (lines 25-26) to also mention "Potential.contaminant" alongside existing filters ("Contaminant", "Reverse", "Decoy", and "Only.identified.by.site")
    • Total changes: +3/-3 lines

Unit Tests

No unit tests were added or modified in this PR. The existing test suite in inst/tinytest/test_cleanRaw.R does test MaxQuant cleaning functionality (lines 40-55), but the test data already contains the Potential.contaminant column in the mq_pg.csv file, so the filtering behavior is implicitly covered by existing tests. No explicit new test cases were created to specifically validate the Potentialcontaminant filtering behavior.

Coding Guidelines

No violations of coding guidelines identified. The changes follow the existing code patterns and maintain consistency with the R coding style used throughout the package.

@coderabbitai
Contributor

coderabbitai Bot commented May 22, 2026

📝 Walkthrough

.cleanRawMaxQuant() now includes Potentialcontaminant as an additional column for filtering contaminants. The function's status message is updated to report Potential.contaminant alongside the existing Contaminant, Reverse, and Decoy filters when removing flagged rows.

Changes

Contaminant Filtering Enhancement

| Layer / File(s) | Summary |
|---|---|
| Expand contaminant filter columns and status message (R/clean_MaxQuant.R) | The contaminant filter configuration is expanded to include Potentialcontaminant as an additional filter column alongside Contaminant, Reverse, and Decoy. The informational status message is updated to mention Potential.contaminant in the filtering output. |

Estimated Code Review Effort

🎯 2 (Simple) | ⏱️ ~5 minutes

Poem

🐰 A contaminant lurks in the data stream,
Potential and actual—now caught by the scheme!
With dots in the names and filters aligned,
MaxQuant data shines, with contaminants consigned!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title mentions filtering decoys with a decoy column, but the actual change adds 'Potential.contaminant' filtering, not decoy-related changes. | Update the title to accurately reflect that the change filters Potential.contaminant column alongside existing contaminant-like filters. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description is comprehensive and well-structured, addressing motivation, changes, testing, and coding guidelines. |

@tonywu1999 changed the title from "fix(maxquant): Fix MaxQuant converter w.r.t. recent MaxQ changes" to "fix(maxquant): Filter out decoys with decoy column" on May 22, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@R/clean_MaxQuant.R`:
- Around line 20-27: The current filter_cols in clean_MaxQuant.R uses incorrect
MaxQuant header names (e.g., "Potentialcontaminant" and "Decoy") so filtering
silently skips expected columns; change filter_cols to use the literal MaxQuant
column names present in our inputs (e.g., "Contaminant",
"Potential.contaminant", "Reverse") and remove "Decoy"; if remove_by_site is
true append "Only.identified.by.site" to filter_cols and update the msg text to
match these literal names so the log reflects the actual columns being filtered;
locate the filter_cols and msg variables in clean_MaxQuant.R to make this edit.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d5e62551-f121-49e2-8429-864235c2d3d7

📥 Commits

Reviewing files that changed from the base of the PR and between b9564f2 and 40b0139.

📒 Files selected for processing (1)
  • R/clean_MaxQuant.R

Comment thread R/clean_MaxQuant.R
Comment on lines +20 to 27
+    filter_cols = c("Contaminant", "Potentialcontaminant", "Reverse", "Decoy")
+    msg = paste("** + Contaminant, + Reverse, + Decoy, + Potential.contaminant",
                 "proteins are removed.")
     if (remove_by_site) {
         filter_cols = c(filter_cols, "Onlyidentifiedbysite")
-        msg = paste("** + Contaminant, + Reverse, + Potential.contaminant,",
+        msg = paste("** + Contaminant, + Reverse, + Decoy, + Potential.contaminant,",
                     "+ Only.identified.by.site proteins are removed.")
     }
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for references to MaxQuant Decoy and Potentialcontaminant columns

echo "=== Searching for 'Decoy' column references ==="
rg -n -C3 --type=r -i 'decoy' -g '!*test*'

echo -e "\n=== Searching for 'Potentialcontaminant' column references ==="
rg -n -C3 --type=r 'Potentialcontaminant' -g '!*test*'

echo -e "\n=== Checking test files or documentation for MaxQuant column examples ==="
fd -e r -e md -e Rd . | xargs rg -l -i 'maxquant.*column|evidence.*file'

Repository: Vitek-Lab/MSstatsConvert

Length of output: 9424


🌐 Web query:

What are the column names for contaminant and decoy proteins in recent MaxQuant evidence.txt and proteinGroups.txt output files?

💡 Result:

In recent MaxQuant output tables, the columns used to flag decoy and contaminant entries are:

  1. evidence.txt (peptide evidence)
    • Decoy (reverse) column name: Reverse [1][2]
    • Contaminant column name: Potential.contaminant [1][2]
  2. proteinGroups.txt (protein group evidence)
    • Decoy (reverse) column name: Reverse [3]
    • Contaminant column name: Potential contaminant (reported as “Potential contaminant” in the proteinGroups.txt column description, corresponding to the same contaminant flag used in MaxQuant outputs) [3]

Note on exact naming: Many downstream tools and documentation refer to the contaminant flag as “Potential.contaminant” (with a dot) for evidence-style tables, while protein group tooling sometimes displays it as “Potential contaminant” (space) in column descriptions; you should still confirm the literal header line in your specific proteinGroups.txt because column names can vary by MaxQuant version [1][2].
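
One way to confirm the literal header line, as suggested above, is to read only the first line of the file and look for flag-like column names. The sketch below uses a small temporary file as a stand-in for proteinGroups.txt; its column names are illustrative assumptions, not a statement of what any particular MaxQuant version writes.

```r
# Write a tiny tab-delimited file standing in for proteinGroups.txt.
path <- tempfile(fileext = ".txt")
writeLines(c(
    "Protein IDs\tPotential contaminant\tDecoy\tOnly identified by site",
    "P12345\t\t\t"
), path)

# Read only the header line and split on tabs, so the literal names are
# kept exactly as written (no conversion of spaces to dots).
header <- strsplit(readLines(path, n = 1), "\t", fixed = TRUE)[[1]]

# Flag-like columns, matched case-insensitively.
flag_cols <- grep("reverse|decoy|contaminant|identified by site",
                  header, ignore.case = TRUE, value = TRUE)
print(flag_cols)
```

Reading the raw header avoids the name mangling that `read.delim()` applies by default (`check.names = TRUE`), which would otherwise turn "Potential contaminant" into "Potential.contaminant" before the names can be inspected.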


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show clean_MaxQuant.R around the reported lines and nearby logic
echo "=== clean_MaxQuant.R (lines 1-120) ==="
sed -n '1,120p' R/clean_MaxQuant.R | cat -n

# Find and inspect .filterManyColumns usage/definition
echo
echo "=== Searching for .filterManyColumns definition/uses ==="
rg -n "\.filterManyColumns" -S R | head -n 50

echo
echo "=== Show .filterManyColumns implementation (if found) ==="
# Try to locate file containing it
file=$(rg -l "\.filterManyColumns" -S R | head -n 1 || true)
if [ -n "${file:-}" ]; then
  echo "Found in: $file"
  sed -n '1,220p' "$file" | cat -n
else
  echo "No implementation found under R/"
fi

# Additionally, inspect .cleanRawMaxQuant body for evidence/protein_groups handling
echo
echo "=== Show body of .cleanRawMaxQuant (lines 1-220) ==="
sed -n '1,220p' R/clean_MaxQuant.R | cat -n

Repository: Vitek-Lab/MSstatsConvert

Length of output: 19393


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show clean_MaxQuant.R around the reported lines and nearby logic
echo "=== clean_MaxQuant.R (lines 1-160) ==="
sed -n '1,160p' R/clean_MaxQuant.R | cat -n

# Find and inspect helper it uses for filtering many columns
echo
echo "=== Searching for .filterManyColumns definition/uses ==="
rg -n "\.filterManyColumns" -S R | head -n 50

echo
echo "=== Show candidate helper implementation(s) ==="
files=$(rg -l "\.filterManyColumns" -S R || true)
if [ -n "${files:-}" ]; then
  for f in $files; do
    echo "--- $f (first 240 lines) ---"
    sed -n '1,240p' "$f" | cat -n | head -n 240
  done
else
  echo "No occurrences found under R/"
fi

Repository: Vitek-Lab/MSstatsConvert

Length of output: 19390


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show clean_MaxQuant.R around the reported lines and nearby logic
echo "=== clean_MaxQuant.R (lines 1-200) ==="
sed -n '1,200p' R/clean_MaxQuant.R | cat -n

# Find and inspect helper for filtering many columns
echo
echo "=== Searching for .filterManyColumns definition/uses ==="
rg -n "\.filterManyColumns" -S R | head -n 50

echo
echo "=== Show candidate helper implementation(s) ==="
files=$(rg -l "\.filterManyColumns" -S R || true)
if [ -n "${files:-}" ]; then
  for f in $files; do
    echo "--- $f (first 260 lines) ---"
    sed -n '1,260p' "$f" | cat -n
  done
else
  echo "No occurrences found under R/"
fi

Repository: Vitek-Lab/MSstatsConvert

Length of output: 19390


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Search for Potential.contaminant in repo ==="
rg -n --type=r -i 'Potential\.contaminant|Potentialcontaminant|Potential contaminant' .

echo
echo "=== Search for MaxQuant header mappings for decoy/contaminant ==="
rg -n --type=r -i 'MaxQuant|evidence\.txt|proteinGroups\.txt|Reverse|Potential\.contaminant|Decoy' .

Repository: Vitek-Lab/MSstatsConvert

Length of output: 11402


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Header: inst/tinytest/raw_data/MaxQuant/mq_ev.csv ==="
head -n 1 inst/tinytest/raw_data/MaxQuant/mq_ev.csv

echo
echo "=== Header: inst/tinytest/raw_data/MaxQuant/mq_pg.csv ==="
head -n 1 inst/tinytest/raw_data/MaxQuant/mq_pg.csv

echo
echo "=== Header: inst/tinytest/raw_data/MaxQuantTMT/mq_ev.csv ==="
head -n 1 inst/tinytest/raw_data/MaxQuantTMT/mq_ev.csv

echo
echo "=== Header: inst/tinytest/raw_data/MaxQuantTMT/mq_pg.csv ==="
head -n 1 inst/tinytest/raw_data/MaxQuantTMT/mq_pg.csv

echo
echo "=== Columns containing 'Reverse' / 'Decoy' / 'Potential' ==="
for f in \
  inst/tinytest/raw_data/MaxQuant/mq_ev.csv \
  inst/tinytest/raw_data/MaxQuant/mq_pg.csv \
  inst/tinytest/raw_data/MaxQuantTMT/mq_ev.csv \
  inst/tinytest/raw_data/MaxQuantTMT/mq_pg.csv; do
  echo "--- $f ---"
  head -n 1 "$f" | tr ',' '\n' | sed 's/^ *//;s/ *$//' | rg -n -i 'reverse|decoy|potential'
done

Repository: Vitek-Lab/MSstatsConvert

Length of output: 27286


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Inspect inst/tinytest/test_cleanRaw.R ==="
sed -n '1,140p' inst/tinytest/test_cleanRaw.R | cat -n

echo
echo "=== Inspect inst/tinytest/test_converters_MaxQtoMSstatsFormat.R ==="
sed -n '1,120p' inst/tinytest/test_converters_MaxQtoMSstatsFormat.R | cat -n

echo
echo "=== Inspect inst/tinytest/test_converters_MaxQtoMSstatsTMTFormat.R ==="
sed -n '1,160p' inst/tinytest/test_converters_MaxQtoMSstatsTMTFormat.R | cat -n

echo
echo "=== Inspect R/MSstatsConvert_core_functions.R for MSstatsClean MaxQuant method ==="
rg -n "setMethod\\(\"MSstatsClean\".*MaxQuant|cleanRawMaxQuant" -n R/MSstatsConvert_core_functions.R
sed -n '190,270p' R/MSstatsConvert_core_functions.R | cat -n

Repository: Vitek-Lab/MSstatsConvert

Length of output: 15630


Fix MaxQuant contaminant/decoy column names used for filtering (filter_cols)

  • R/clean_MaxQuant.R sets filter_cols = c("Contaminant", "Potentialcontaminant", "Reverse", "Decoy"), but the expected MaxQuant headers in this repo’s tinytest inputs are Reverse and Potential.contaminant (no Decoy, no Potentialcontaminant).
  • Because .filterManyColumns() only filters columns in intersect(filter_columns, colnames(input)), the incorrect names are silently skipped, so contaminant proteins may not be removed even though the logged msg claims Potential.contaminant is filtered.

Update filter_cols to match the literal MaxQuant headers (at least Potential.contaminant and Reverse; remove/conditionally handle non-existent Decoy/Potentialcontaminant/Contaminant).
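
The silent-skip behaviour described in this finding can be illustrated with a small stand-in. The function `filter_flagged` below is hypothetical: it only mimics the `intersect(filter_columns, colnames(input))` pattern quoted above and is not the package's `.filterManyColumns()` implementation. The "+" flag value and the toy column names are likewise assumptions.

```r
# Hypothetical stand-in: drop rows flagged with "+" in any requested
# column that actually exists in the input; missing names are ignored.
filter_flagged <- function(input, filter_columns) {
    present <- intersect(filter_columns, colnames(input))
    keep <- rep(TRUE, nrow(input))
    for (col in present) {
        flagged <- !is.na(input[[col]]) & input[[col]] == "+"
        keep <- keep & !flagged
    }
    input[keep, , drop = FALSE]
}

toy <- data.frame(
    Protein = c("P1", "CON__P2", "REV__P3"),
    Potential.contaminant = c("", "+", ""),
    Reverse = c("", "", "+"),
    stringsAsFactors = FALSE
)

# A name that does not match any column is skipped without a warning,
# so the flagged contaminant row stays in the result.
mismatched <- filter_flagged(toy, c("Potentialcontaminant", "Reverse"))

# A name that matches the column as it appears in the input removes it.
matched <- filter_flagged(toy, c("Potential.contaminant", "Reverse"))

# Reporting the unmatched names makes a mismatch visible.
unmatched <- setdiff(c("Potentialcontaminant", "Reverse"), colnames(toy))
```

Whether a given name in `filter_cols` matches depends on how the column names look at the point of filtering, for example after any name standardization applied earlier in the pipeline, so a check like the `setdiff()` call is a cheap way to verify this against real inputs.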
