pUniFind: Unified large pretrained deep learning model pushing the limit of mass spectra interpretation

This is the official repository for pUniFind, the most powerful zero-shot open peptide-spectrum scoring model, surpassing other SOTA search engines, and the first zero-shot open de novo sequencing deep learning model supporting over 1300 modifications. It is developed by the pFind group and DP Technology.


🚀 Quick Start

Demo data can be downloaded from Google Drive.

Local deployment for Windows

Running our model on Windows requires a GPU. Dear reviewers, you can use xzhbez6w as your Bohrium ID, which saves you from registering a Bohrium account. Ordinary users need to register a free Bohrium ID in order to add users.

First download the .exe installation package (download link), then install it by following the instructions.

For GPU Batch size, you can set it to 128 if your GPU has more than 8GB memory. If your GPU only has around 4GB memory, consider setting it to 64. You can use the nvidia-smi command in your terminal to check your GPU's memory information.
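The batch-size guidance above can be turned into a quick check. The sketch below is illustrative and not part of pUniFind: it reads total GPU memory through nvidia-smi's query options and applies the two values mentioned above. The function names, and the choice to fall back to 64 for any card with 8 GB or less, are assumptions.

```python
import subprocess


def suggest_batch_size(total_mem_mib):
    """Map total GPU memory (MiB) to the GPU batch size suggested above."""
    # Assumption: "more than 8GB" is read as > 8192 MiB; smaller cards get 64.
    return 128 if total_mem_mib > 8192 else 64


def query_gpu_memory_mib():
    """Return the total memory in MiB of each visible GPU, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.splitlines() if line.strip()]
```

On a machine with an NVIDIA driver installed, `[suggest_batch_size(m) for m in query_gpu_memory_mib()]` gives one suggestion per GPU.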

The rescoring results will be saved in the result folder. The de novo sequencing results will be stored in the pUniFind_result folder.

Alternative Solution for Download Failures

If you encounter a download failure prompt after installation, please download the corresponding files from the Hugging Face repository and place them in the following directories:

  1. torchlib.zip: Place in Installation Path\piUniMS\punifind\torch\lib, then extract all .dll files inside into this directory.
  2. checkpoint_rank.pt: Place in Installation Path\piUniMS\punifind\ckpts.
  3. checkpoint4.pt: Place in Installation Path\piUniMS\punifind\ckpts.

Then reopen the software.

Note: The default installation path is C:\Users\[Your Username]\AppData\Local\Programs\piUniMS. If you customized the location during installation, you can find the actual path by hovering your mouse over the desktop shortcut.
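To confirm that the manually downloaded files ended up in the right places, a small check like the following may help. It is a sketch only, not a tool shipped with pUniFind: `missing_components` is a hypothetical helper, and `install_root` is assumed to be the piUniMS folder of your installation.

```python
from pathlib import Path

# Relative locations taken from the list above.
REQUIRED_FILES = [
    Path("punifind") / "ckpts" / "checkpoint_rank.pt",
    Path("punifind") / "ckpts" / "checkpoint4.pt",
]
TORCH_LIB_DIR = Path("punifind") / "torch" / "lib"


def missing_components(install_root):
    """Return descriptions of the components that appear to be missing."""
    root = Path(install_root)
    problems = [str(rel) for rel in REQUIRED_FILES if not (root / rel).is_file()]
    lib_dir = root / TORCH_LIB_DIR
    # Extracting torchlib.zip should leave .dll files directly in torch\lib.
    if not lib_dir.is_dir() or not any(lib_dir.glob("*.dll")):
        problems.append(f"{TORCH_LIB_DIR} (no .dll files found)")
    return problems
```

An empty list means all three components were found. Otherwise each entry names a path to fix before reopening the software.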

Local deployment for Linux (Preview usage)

(This is only a preview version of the Linux source code. We will improve the user experience in the near future.)

Env setup

pUniFind supports multi-GPU processing to speed up computation.

| env | version |
| --- | --- |
| cuda | >= 11.7 |
| python | 3.8 |

Please ensure your working directory is the root of the pUniFind repository when running environment configuration, re-scoring, or de novo sequencing scripts.

# set up conda env
conda create -n pUniFind python=3.9 -y
conda activate pUniFind

git clone https://github.com/pFindStudio/pUniFind.git
# get to project path
cd pUniFind
pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
pip install -e .

Download checkpoint to ckpts

You can download the checkpoint from here into the ckpts folder.

Use checkpoint_rank.pt for scoring and de novo sequencing of Thermo data, and checkpoint_tims.pt for de novo sequencing of TIMS data. The default model is checkpoint_rank.pt. You can configure this setting in official_denovo_workflow.sh or official_score_workflow.sh.

Check Installation

To verify that your environment is configured correctly, try running pytest via the command line to test the demo data in the projects folder. This may take a few minutes. Run the following command:

pytest

Open Rescore

Put a folder with the following structure under the projects folder.

project_name/ # pFind Task folder generated by pfind !!!!!!
├── ***.pac # protein ids (generated by pfind at fasta folder moved by users to project root)
├── param/ # pFind search parameters (generated by pfind)
├── result/ # pFind search result (generated by pfind)
└── mgfs/ # mgf files (generated by pfind moved by users)

Then run rescoring with:

pUniFind rescore project_path batchsize

We recommend starting with a batchsize of 256 and adjusting it according to speed and CUDA memory usage.

  • Results will be stored at project_namefdr0.01_pUniFind.spectra.
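Because rescoring depends on the folder layout described above, it can be worth checking a project before launching a job. The following is a minimal sketch with a hypothetical helper name (`check_rescore_project`). It only verifies that the items listed above exist and does not inspect their contents.

```python
from pathlib import Path


def check_rescore_project(project_dir):
    """Return a list of layout problems found in a rescoring project folder."""
    root = Path(project_dir)
    problems = []
    # The .pac file has to sit in the project root, not inside a subfolder.
    if not any(root.glob("*.pac")):
        problems.append("no .pac file in project root")
    for sub in ("param", "result", "mgfs"):
        if not (root / sub).is_dir():
            problems.append(f"missing folder: {sub}/")
    mgfs = root / "mgfs"
    if mgfs.is_dir() and not any(mgfs.glob("*.mgf")):
        problems.append("mgfs/ contains no .mgf files")
    return problems
```

An empty return value means the layout matches the tree shown above.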

Open De Novo

Put a folder with the following structure under the projects folder.

project_name/ 
└── mgfs/ # mgf files (generated by pfind moved by users)   

Then run open de novo sequencing with:

pUniFind denovo project_path batchsize

We recommend starting with a batchsize of 256 and adjusting it according to speed and CUDA memory usage.

  • Direct de novo results will be stored at project_name_001_5_merged.csv and project_name_001_5_filtered.csv under pUniFind_result folder.

  • Modification statistics will be stored at project_name_mod.txt under pUniFind_result folder.

  • All peptides connected will be stored at project_name.fasta under pUniFind_result folder.

If you only care about a few modifications, we recommend a follow-up pFind3 search (with open mode disabled) using the FASTA file above, with the modifications you care about (see project_name_mod.txt) set as variable modifications.

To use TIMS mode, add --ckpt tims_checkpoint_path to the commands above. The TIMS checkpoint can be downloaded from here.
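When scripting many runs, the command lines above can be assembled programmatically. The sketch below only mirrors the argument order shown in this README (mode, project path, batch size, optional --ckpt). `build_command` is a hypothetical helper and the example paths are placeholders.

```python
import shlex


def build_command(mode, project_path, batch_size, ckpt=None):
    """Assemble a pUniFind command line as a list of arguments."""
    if mode not in ("rescore", "denovo"):
        raise ValueError("mode must be 'rescore' or 'denovo'")
    args = ["pUniFind", mode, str(project_path), str(batch_size)]
    # --ckpt is only appended when a non-default checkpoint is requested.
    if ckpt is not None:
        args += ["--ckpt", str(ckpt)]
    return args


# Placeholder paths for illustration only.
print(shlex.join(build_command("denovo", "projects/demo", 256, "ckpts/checkpoint_tims.pt")))
```

The returned list can be passed directly to `subprocess.run` from the repository root.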

Web application

If you do not have a GPU, you can use our Bohrium Web Interface to rent one and run pUniFind online directly.

GPU resources on Bohrium can be unstable. If your job does not start, the most likely cause is a shortage of GPU resources. We recommend trying a 4090 first. If no 4090 is available, we recommend a 3090.

If you have any problems, please contact us through Technical Support.

# Folder to upload

# Rescoring
pFind Task folder/ # generated by pfind
├── ***.pac # protein ids (generated by pfind at fasta folder moved by users to project root)
├── param/ # pFind search parameters (generated by pfind)
├── result/ # pFind search result (generated by pfind)
└── mgfs/ # mgf files (generated by pfind moved by users)

# De novo sequencing
project folder/ 
└── mgfs/ # mgf files (generated by pfind moved by users)   

📊 Output Formats

Open Rescore Results

| column name | meaning | example |
| --- | --- | --- |
| File_Name | Title of spectrum from mgf. | example.1.1.2.0.dta |
| Scan_No | Scan No. | 1000 |
| Charge | Charge | 1 |
| Sequence | Sequence identified by pUniFind | SPTCTNQEL |
| Calc_MHplus+ | MH+ mass | 2031.948724 |
| Modification | Modification | 4,Carbamidomethyl[C];8,Cation_Na[E]; |
| Proteins | Proteins | tr\|A0A075B6G3\|A0A075B6G3_HUMAN/ |

Open De Novo Results

Currently, to improve performance, pUniFind only scores predicted peptides whose precursor mass error is within 20 ppm and whose length is between 6 and 40 residues. Predicted peptides outside these ranges are not logged. In future releases, we plan to support more flexible settings.
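As a worked illustration of these two limits, the sketch below computes a ppm error and applies both checks. It is not pUniFind's internal code: the function names are hypothetical, and dividing by the observed precursor mass (rather than the theoretical one) is an assumption.

```python
def ppm_error(predicted_mass, observed_mass):
    """Signed mass error in parts per million, relative to the observed mass."""
    return (predicted_mass - observed_mass) / observed_mass * 1e6


def is_scored(peptide, predicted_mass, observed_mass,
              tol_ppm=20.0, min_len=6, max_len=40):
    """True if the peptide falls inside the length and ppm limits described above."""
    length_ok = min_len <= len(peptide) <= max_len
    mass_ok = abs(ppm_error(predicted_mass, observed_mass)) <= tol_ppm
    return length_ok and mass_ok
```

For example, a 0.02 Da difference at a mass of 1000 Da corresponds to about 20 ppm, right at the limit.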

For readability, the tables below list each output column as a row, in the same order as the columns appear in the file.

For _merged.csv and _filtered.csv

| column name | meaning | example |
| --- | --- | --- |
| spectrum title | Title of spectrum from mgf. | example.1.1.2.0.dta |
| score | Score predicted by pUniFind, which is the same score as in open rescoring. | 7.241 |
| cos similarity | Cosine similarity between the experimental spectrum and the spectrum pUniFind predicts for the de novo peptide. | 0.95 |
| Retention time | Experimental retention time (seconds). | 1169.002807 |
| Missing fragment ion site | Positions of missing fragment ion sites, separated by "_". The last number is the peptide length and should be ignored. | 6_8 |
| mass difference | Mass difference between the predicted peptide sequence and the experimental precursor mass. | 0.0003662109375 |
| Peptide sequence | Predicted peptide sequence. | ['SPTCTNQEL'] |
| Peptide sequence with modification | Predicted peptide sequence with modifications and modification sites. | SPTCTNQEL_4_Carbamidomethyl_8_Cation_Na |
| Modifications | Predicted modifications and their sites. | "4,Carbamidomethyl[C];8,Cation_Na[E];" |
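The Modifications and Missing fragment ion site fields are compact strings, so a small parser is useful for downstream analysis. This is a sketch based only on the example values shown above. The function names are hypothetical and other edge cases in real output are not covered.

```python
def parse_modifications(field):
    """Parse '4,Carbamidomethyl[C];8,Cation_Na[E];' into (site, name) pairs."""
    mods = []
    # Strip the surrounding quotes that appear in the CSV example.
    for item in field.strip().strip('"').split(";"):
        if not item:
            continue  # the trailing ';' leaves an empty last item
        position, name = item.split(",", 1)
        mods.append((int(position), name))
    return mods


def parse_missing_sites(field):
    """Parse '6_8' into [6]; the final number is the peptide length and is dropped."""
    parts = [int(p) for p in field.split("_") if p]
    return parts[:-1]


print(parse_modifications('"4,Carbamidomethyl[C];8,Cation_Na[E];"'))
print(parse_missing_sites("6_8"))
```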

For .fasta

A standard FASTA file.

For _mod.txt

| column name | meaning | example |
| --- | --- | --- |
| Modification Name | Name of modification. | Oxidation[M] |
| Frequency of modification | How often the modification appears among the topN candidates. | 3296 |

🔧 Advanced Configuration Options

A few options can be set in the shell scripts (official_denovo_workflow.sh, official_score_workflow.sh), which you may modify for your own use:

| Name | Usage |
| --- | --- |
| num_proc | Number of CPU processes used during data processing. This is particularly useful when there are many mgf files. Keep num_proc <= number of CPU cores (default=16). |
| range_pred | Number of candidate peptide lengths used for de novo sequencing. pUniFind first predicts the peptide length and then predicts peptides at several lengths. Use an odd number (default=5). |
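One way to read the odd-number requirement for range_pred is that the candidate lengths form a symmetric window around the predicted length. The sketch below illustrates that reading only. The centering is an assumption and `candidate_lengths` is a hypothetical helper, so pUniFind's actual selection may differ.

```python
def candidate_lengths(predicted_length, range_pred=5):
    """Return range_pred consecutive lengths centered on predicted_length."""
    if range_pred < 1 or range_pred % 2 == 0:
        raise ValueError("range_pred should be a positive odd number")
    half = range_pred // 2
    # Assumption: the window is symmetric; lengths below 1 are dropped.
    return [n for n in range(predicted_length - half, predicted_length + half + 1) if n >= 1]


print(candidate_lengths(10, 5))
```

Under this reading, an even value would leave the window lopsided, which is why the sketch rejects it.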

In the future there will be more options supported, such as:

| Name | Usage |
| --- | --- |
| de novo min/max length | Minimum or maximum length of peptides to be predicted. Since pUniFind first predicts the length and then the peptide, spectra whose predicted length falls outside this range will be skipped to save time. |
| predict_score_all | In some cases no candidate peptide for a spectrum satisfies the 20 ppm threshold. In the current default mode, pUniFind does not predict scores for such candidates, to save time. |
| instrument | Type of instrument, e.g. QE, Lumos, TIMS, Astral. For now, it is QE by default. |
| nce file path | Normalized collision energy (NCE). For now, it is 30 by default. |

📈 Result Visualization

Spectral Visualization for Database Search Re-Scoring

Using the script get_pLabel_from_pUniFind_rescoring.py in this repository, you can export files in .plabel format, which can then be directly imported into pLabel for spectral visualization.

python get_pLabel_from_pUniFind_rescoring_English.py [pFind project folder path] [pUniFind result.spectra file path]

Spectral Visualization for De Novo Sequencing

We provide user-friendly de novo result visualization tools for both workflows mentioned in our paper.

  • Regular de novo: pLabel is a convenient tool for visualizing spectra. It requires a .plabel file and the corresponding .mgf file. All you need to do is change the path of the mgf file in the .plabel file (which is generated by pUniFind). The pLabel user guide is available at the link above. It is important to check that the mgf name and mgf path are correct!
# pLabel format example
[FilePath]
File_Path=C:\Users\Ecoli-E1-F2-20151208_HCDFT_extract103.mgf # path of mgf!!
[Modification]
1=Oxidation[M]
2=Carbamidomethyl[C]
[xlink]
xlink=NULL
[Total]
total=1
[Spectrum1]
name=ECOLI-E1-F2-20151208.30360.30360.3.0.DTA
pep1=0 LGLDVLVHGEAER 1 
  • Modification rich de novo: This workflow relies on pFind (with open mode disabled) to search the database generated by pUniFind. pBuild is a visualization tool that is already integrated into pFind.
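For the regular de novo workflow, the File_Path entry of the .plabel file can also be updated with a few lines of code instead of by hand. The sketch below works on the file's text and assumes the `File_Path=` line format shown in the example above. `set_mgf_path` is a hypothetical helper, not a script from this repository.

```python
def set_mgf_path(plabel_text, mgf_path):
    """Return the .plabel text with its File_Path entry pointing at mgf_path."""
    lines = plabel_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.startswith("File_Path="):
            lines[i] = f"File_Path={mgf_path}"
            replaced = True
    if not replaced:
        raise ValueError("no File_Path= line found")
    return "\n".join(lines) + "\n"
```

Read the generated .plabel file as text, pass it through `set_mgf_path` with the real location of your mgf file, and write the result back before opening it in pLabel.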

🧠 Please Read Me

Data type:

  • Currently, pUniFind does not support ITMS data (considered outdated because of its low resolution) or ETD/EThcD data. For Astral narrow-window DIA de novo sequencing, we recommend first trying timsTOF mode. Since Astral narrow-window DIA data is relatively scarce, if you can contribute such data, we are willing to provide fine-tuning services to make pUniFind perform better on Astral.

Open de novo sequencing is a challenging and complicated task, so there are a few things you should be aware of.

  • There are a few "mass coincidences", for example: Q+Deamidated[Q]=E, N+Deamidated[N]=D, glycidamide[anything]=S, K+Acetyl=AV/VA, K+Crotonyl=PV/VP, K+Formyl=GV/VG, K+Ubiq=GG, G+Methyl=A, etc. We do not recommend searching these modifications in the modification rich de novo workflow unless that modification is exactly what you are looking for, in which case you may want to postprocess the search results. Modification information can be found in the modification.ini file in the pFind install path or in our GitHub repo.
  • There are a few loss modifications you might want to ignore: Arg-loss[AnyC-termR], Met-loss[ProteinN-termM], Met-loss+Acetyl[ProteinN-termM], etc.
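The mass coincidences listed above can be checked with simple arithmetic on monoisotopic masses. The sketch below uses standard monoisotopic residue masses and modification mass shifts (in Da) for six of the listed cases. The table layout and function names are illustrative only.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE = {"G": 57.021464, "A": 71.037114, "P": 97.052764, "V": 99.068414,
           "N": 114.042927, "D": 115.026943, "Q": 128.058578, "K": 128.094963,
           "E": 129.042593}
# Monoisotopic mass shifts of the modifications (Da).
SHIFT = {"Deamidated": 0.984016, "Methyl": 14.015650, "Formyl": 27.994915,
         "Acetyl": 42.010565, "Crotonyl": 68.026215}

# (modified residue, modifications, unmodified residues with the same mass)
COINCIDENCES = [
    ("Q", ["Deamidated"], "E"),
    ("N", ["Deamidated"], "D"),
    ("G", ["Methyl"], "A"),
    ("K", ["Acetyl"], "AV"),
    ("K", ["Crotonyl"], "PV"),
    ("K", ["Formyl"], "GV"),
]


def total_mass(residues, mods=()):
    """Sum residue masses plus modification shifts."""
    return sum(RESIDUE[r] for r in residues) + sum(SHIFT[m] for m in mods)


def mass_gaps():
    """Return the absolute mass difference (Da) for each coincidence."""
    return {f"{res}+{'+'.join(mods)}={alt}": abs(total_mass(res, mods) - total_mass(alt))
            for res, mods, alt in COINCIDENCES}


for label, gap in mass_gaps().items():
    print(f"{label}: |difference| = {gap:.6f} Da")
```

Each pair has the same elemental composition, so the differences are zero up to rounding of the tabulated masses. Precursor mass alone therefore cannot separate them.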

🛠️ Technical Support

Should you encounter any technical issues, observe suboptimal performance, or identify inconsistencies between pUniFind results and our evaluation metrics, we welcome your feedback 🙏. We are looking for bad cases to further refine our model. We are actively updating and refining our software, since the main author is far from graduation :(.

We provide priority support for user-reported issues through the following channels:

For technical inquiries:

  1. GitHub Issues: Open a new issue with:

    • Data description.
    • Error logs and environment.
    • Uploaded folder description
  2. pFind Studio user support WeChat group:

    • Please add WeChat: JL_Zhao2000, and I will invite you to our user support group. (A direct invitation is not posted here because WeChat invitations expire in one week.)

For collaboration requests:
📧 Contact info: Jiale Zhao. Email: zhaojiale22z@ict.ac.cn or marshmallowzjl@gmail.com.

💪 Performance Boost

To further enhance the database search and rescoring performance in HLA or non-specific enzyme digestion scenarios, please follow these steps:

  1. In the pFind project folder, navigate to the task directory and open the param subfolder. Then, modify the parameter pepnum in the file pFind.cfg from 10 to 20. (This change can be made either before or after the pFind search, which is currently not supported by GUI.)
  2. Open the corresponding task in pFind and run the database search.
  3. Finally, perform the rescoring using pUniFind.
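Step 1 above can be scripted if you process many tasks. The sketch below edits the configuration text and assumes that pFind.cfg stores the setting as a `pepnum=<integer>` line. That format is an assumption, and `set_pepnum` is a hypothetical helper, so check your own pFind.cfg before relying on it.

```python
import re


def set_pepnum(cfg_text, value=20):
    """Return cfg_text with every 'pepnum=<int>' line set to value."""
    pattern = re.compile(r"^(\s*pepnum\s*=\s*)\d+", flags=re.MULTILINE)
    # A function replacement avoids ambiguity between the group number and the new digits.
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{value}", cfg_text)
    if count == 0:
        raise ValueError("no pepnum entry found")
    return new_text
```

Keeping a backup copy of pFind.cfg before writing the modified text back makes it easy to restore the original search parameters.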

❓ FAQ

  • MGF format: Please make sure your MGF files are generated by pFind. MGF files from different software can be very different. The latest version of pFind supports data from various instruments (including Thermo, timsTOF, etc.). For rescoring or de novo sequencing, you can search your .raw/.d data against any FASTA file. pFind will first preprocess the data and generate MGF files (you might need to click MGF in Data Extraction under MS Data). If you still want to use MGF files from MSConvert, put all of them in one folder and run the script in our repository:

    python3 mgf_processor.py -i /somewhere/mgfs_to_process_folder/ -o /somewhere/processed_mgf_folder/ -p (8 by default, number of processes)
  • Install path: Please make sure your install path and data/result path do not contain spaces.

  • Windows uninstall: If you want to reinstall the Windows version of pUniFind, please use unins000.exe to uninstall; otherwise, you may not be able to change your install path. If you have already uninstalled using other methods, please reinstall and then uninstall using the method mentioned above.

  • Linux deployment: If you encounter the error libstdc++.so.6: version `GLIBCXX_3.4.29' not found, see This Solution. In my case, it was solved by export LD_LIBRARY_PATH=/your_path/miniconda3/envs/pUniFind/lib:$LD_LIBRARY_PATH.

  • Result not found: If you cannot find the result file for rescoring, check that the .pac file is in the right place (the project root folder, not inside result/). Also check both the pUniFind_result folder and the result folder.

  • Targeted methods: If you are using targeted acquisition methods (e.g., AIMS, PRM, SRM) and observe suboptimal performance, please contact our team. We can analyze your data and recommend the optimal analysis strategy.

  • Component Download Failure in Software: If an error occurs while downloading the checkpoints, try downloading them one at a time instead of several at once, and use a reasonably fast network connection. If the issue persists, please contact us.

🤝 Citation

If you find our software useful and it has helped your research, please cite us 🙏:

@misc{zhao2025punifindunifiedlargepretrained,
      title={pUniFind: a unified large pre-trained deep learning model pushing the limit of mass spectra interpretation}, 
      author={Jiale Zhao and Pengzhi Mao and Kaifei Wang and Yiming Li and Yaping Peng and Ranfei Chen and Shuqi Lu and Xiaohong Ji and Jiaxiang Ding and Xin Zhang and Yucheng Liao and Weinan E and Weijie Zhang and Han Wen and Hao Chi},
      year={2025},
      eprint={2507.00087},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.00087}, 
}

Every citation will motivate the main author to make pUniFind more user-friendly and more powerful. The main author needs your valuable citations and stars to find a job after graduation 😫.