pUniFind: Unified large pretrained deep learning model pushing the limit of mass spectra interpretation
This is the official repository for pUniFind, the most powerful zero-shot open peptide-spectrum scoring model surpassing other SOTA search engines and the first zero-shot open de novo sequencing deep learning model supporting over 1300 modifications. Developed by pFind group and DP Technology.
- 🚀 Quick Start
- 📊 Output Formats
- 📈 Result Visualization
- 🔧 Advanced Configuration Options
- 🧠 Please Read Me
- 🛠️ Technical Support
- 💪 Performance Boost
- ❓ FAQ
- 🤝 Citation
Demo data can be downloaded from Google Drive.
Running our model on Windows requires a GPU. Dear reviewers, you can use xzhbez6w as your Bohrium ID, which spares you from registering a Bohrium account. Ordinary users are required to register a free Bohrium ID in order to add users.
Download the .exe installation package first, then install it by following the instructions.
For GPU Batch size, you can set it to 128 if your GPU has more than 8GB memory. If your GPU only has around 4GB memory, consider setting it to 64. You can use the nvidia-smi command in your terminal to check your GPU's memory information.
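The batch-size rule of thumb above can be sketched as a small helper. This is only an illustration: the function names are hypothetical and not part of pUniFind, and treating every card at or below 8 GB as a "64" card is an assumption, since the text only mentions the ~4 GB and >8 GB cases.

```python
import subprocess


def suggest_batch_size(total_memory_mib: int) -> int:
    """Map total GPU memory (MiB) to the batch size suggested in the text."""
    # More than 8 GB of GPU memory: 128. Everything else falls back to 64,
    # the value suggested for ~4 GB cards (the in-between range is an assumption).
    return 128 if total_memory_mib > 8 * 1024 else 64


def query_gpu_memory_mib() -> list:
    """Return the total memory of each visible GPU in MiB, as reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    # One line per GPU, each holding a bare integer number of MiB.
    return [int(line) for line in out.splitlines() if line.strip()]
```

On a machine with an NVIDIA driver installed, `suggest_batch_size(min(query_gpu_memory_mib()))` would pick a value sized for the smallest visible GPU.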
The rescoring results will be saved in the result folder. The de novo sequencing results will be stored in the pUniFind_result folder.
If you encounter a download failure prompt after installation, please download the corresponding files from the Hugging Face repository and place them in the following directories:
- torchlib.zip: Place in `Installation Path\piUniMS\punifind\torch\lib`, then extract all `.dll` files inside into this directory.
- checkpoint_rank.pt: Place in `Installation Path\piUniMS\punifind\ckpts`.
- checkpoint4.pt: Place in `Installation Path\piUniMS\punifind\ckpts`.
Then reopen the software.
Note: The default installation path is C:\Users\[Your Username]\AppData\Local\Programs\piUniMS. If you customized the location during installation, you can find the actual path by hovering your mouse over the desktop shortcut.
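If you place the files manually, a quick check like the following can confirm that everything landed where the list above says. It is a minimal sketch: `missing_components` is a hypothetical helper, and `install_root` is assumed to be the `piUniMS` folder that contains `punifind`.

```python
from pathlib import Path

# Relative locations taken from the manual-download list; the install root
# (the piUniMS folder) is supplied by the caller.
REQUIRED_FILES = [
    Path("punifind") / "ckpts" / "checkpoint_rank.pt",
    Path("punifind") / "ckpts" / "checkpoint4.pt",
]
DLL_DIR = Path("punifind") / "torch" / "lib"


def missing_components(install_root):
    """Return a list of problems found under the install folder (empty if none)."""
    root = Path(install_root)
    problems = [str(rel) for rel in REQUIRED_FILES if not (root / rel).is_file()]
    # torchlib.zip must be extracted, so at least one .dll should sit in torch/lib.
    dll_dir = root / DLL_DIR
    if not (dll_dir.is_dir() and any(dll_dir.glob("*.dll"))):
        problems.append(f"no .dll files in {DLL_DIR}")
    return problems
```

An empty list means the three components are in place; otherwise each entry names what is still missing.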
(This is a preview version of the Linux source code; we will improve the user experience in the near future.)
pUniFind supports multi-GPU processing to speed up runs.
| env | version |
|---|---|
| cuda | >= 11.7 |
| python | 3.8 |
Please ensure your working directory is the root of the pUniFind repository when running environment configuration, re-scoring, or de novo sequencing scripts.

    # set up conda env
    conda create -n pUniFind python=3.9 -y
    conda activate pUniFind
    git clone https://github.com/pFindStudio/pUniFind.git
    # get to project path
    cd pUniFind
    pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
    pip install -e .

You can download the checkpoints from here into the ckpts folder.
Use checkpoint_rank.pt for scoring and de novo sequencing of Thermo data, and checkpoint_tims.pt for de novo sequencing of TIMS data. The default model is checkpoint_rank.pt. You can configure this setting in official_denovo_workflow.sh or official_score_workflow.sh.
To verify that your environment is configured correctly, try running pytest via the command line to test the demo data in the projects folder. This may take a few minutes. Run the following command:

    pytest

Put the following folder under the projects folder:

    project_name/ # pFind Task folder generated by pfind !!!!!!
    ├── ***.pac # protein ids (generated by pfind at fasta folder moved by users to project root)
    ├── param/ # pFind search parameters (generated by pfind)
    ├── result/ # pFind search result (generated by pfind)
    └── mgfs/ # mgf files (generated by pfind moved by users)

Then run rescoring:

    pUniFind rescore project_path batchsize

We recommend starting with a batch size of 256 and then adjusting it according to speed and CUDA memory size.
- Results will be stored at `project_namefdr0.01_pUniFind.spectra`.
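Before launching a rescoring run, it can help to confirm that the project folder has the layout described above. The sketch below is not part of pUniFind; `check_rescore_project` is a hypothetical helper that only looks for the `.pac` file, the three subfolders, and at least one `.mgf` file.

```python
from pathlib import Path


def check_rescore_project(project_dir):
    """Return a list of layout problems for a rescoring project folder (empty if none)."""
    root = Path(project_dir)
    if not root.is_dir():
        return [f"not a folder: {root}"]
    problems = []
    # The .pac file must sit in the project root, not inside result/.
    if not any(root.glob("*.pac")):
        problems.append("no .pac file in the project root")
    for sub in ("param", "result", "mgfs"):
        if not (root / sub).is_dir():
            problems.append(f"missing folder: {sub}/")
    # mgfs/ should actually contain spectra.
    mgf_dir = root / "mgfs"
    if mgf_dir.is_dir() and not any(mgf_dir.glob("*.mgf")):
        problems.append("mgfs/ contains no .mgf files")
    return problems
```

For example, `check_rescore_project("projects/project_name")` returning an empty list suggests the folder is ready to be passed to the rescoring command.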
Put the following folder under the projects folder:

    project_name/
    └── mgfs/ # mgf files (generated by pfind moved by users)

Then run open de novo sequencing:

    pUniFind denovo project_path batchsize

We recommend starting with a batch size of 256 and then adjusting it according to speed and CUDA memory size.
- Direct de novo results will be stored at `project_name_001_5_merged.csv` and `project_name_001_5_filtered.csv` under `pUniFind_result` folder.
- Modification statistics will be stored at `project_name_mod.txt` under `pUniFind_result` folder.
- All peptides connected will be stored at `project_name.fasta` under `pUniFind_result` folder.
If you only care about a few modifications, we recommend a further pFind3 search (with open mode disabled) that uses the FASTA file above as the database and sets the modifications you care about (chosen with the help of project_name_mod.txt) as variable modifications.
To use TIMS mode, add `--ckpt tims_checkpoint_path` after the commands above. The checkpoint for TIMS can be downloaded from here.
If you do not have a GPU, you can use our Bohrium Web Interface to rent one and run pUniFind online directly.
GPU resources on Bohrium can be unstable. If your job does not start, the most likely cause is a shortage of GPU resources. We recommend trying a 4090 first; if no 4090 is available, we recommend a 3090.
If you have any problems, please contact us through Technical Support.

    # Folder to upload
    # Rescoring
    pFind Task folder/ # generated by pfind
    ├── ***.pac # protein ids (generated by pfind at fasta folder moved by users to project root)
    ├── param/ # pFind search parameters (generated by pfind)
    ├── result/ # pFind search result (generated by pfind)
    └── mgfs/ # mgf files (generated by pfind moved by users)
    # De novo sequencing
    project folder/
    └── mgfs/ # mgf files (generated by pfind moved by users)

| column name | meaning | example |
|---|---|---|
| File_Name | Title of spectrum from mgf. | example.1.1.2.0.dta |
| Scan_No | Scan No. | 1000 |
| Charge | Charge | 1 |
| Sequence | Sequence identified by pUniFind | SPTCTNQEL |
| Calc_MHplus+ | MH+ mass | 2031.948724 |
| Modification | Modification | 4,Carbamidomethyl[C];8,Cation_Na[E]; |
| Proteins | Proteins | tr|A0A075B6G3|A0A075B6G3_HUMAN/ |
Currently, to improve performance, we only predict scores for peptides with a precursor mass error tolerance within 20 ppm and peptide lengths ranging from 6 to 40 residues. Predicted peptides outside of these ranges will not be logged. In future releases, we plan to support more flexible settings.
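The two filters described above (20 ppm precursor tolerance, peptide length 6 to 40) can be written down as follows. This is a sketch of the stated criteria only, not pUniFind's internal code; the function names are hypothetical, and both masses are assumed to be in the same convention (for example, both MH+).

```python
def ppm_error(theoretical_mass: float, observed_mass: float) -> float:
    """Signed precursor mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def is_scored(sequence: str, theoretical_mass: float, observed_mass: float,
              tol_ppm: float = 20.0, min_len: int = 6, max_len: int = 40) -> bool:
    """True if a candidate passes both the length and the ppm filter."""
    if not (min_len <= len(sequence) <= max_len):
        return False
    return abs(ppm_error(theoretical_mass, observed_mass)) <= tol_ppm
```

A 0.01 Da deviation on a 1000 Da precursor is 10 ppm and passes, while the same deviation on a 400 Da precursor is 25 ppm and would not be logged.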
To make the result format easier to read, the columns are listed below as rows, in the same order.
| column name | meaning | example |
|---|---|---|
| spectrum title | Title of spectrum from mgf. | example.1.1.2.0.dta |
| score | Score predicted by pUniFind, which is the same score as open rescoring. | 7.241 |
| cos similarity | Cosine similarity between the experimental spectrum and the spectrum pUniFind predicts for the de novo peptide. | 0.95 |
| Retention time | Experimental retention time (seconds). | 1169.002807 |
| Missing fragment ion site | Positions of missing fragment ion sites, separated by "_". The last number is the peptide length and should be ignored. | 6_8 |
| mass difference | Mass difference between the predicted peptide sequence and the experimental precursor mass. | 0.0003662109375 |
| Peptide sequence | Predicted peptide sequence. | ['SPTCTNQEL'] |
| Peptide sequence with modification | Predicted peptide sequence with modifications and modification sites. | SPTCTNQEL_4_Carbamidomethyl_8_Cation_Na |
| Modifications | Modifications and sites predicted | "4,Carbamidomethyl[C];8,Cation_Na[E];" |
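For downstream analysis, the `Modifications` and `Missing fragment ion site` fields can be split with a few lines of Python. The parsers below are a sketch based only on the example values in the table; the function names are hypothetical.

```python
def parse_modifications(field: str) -> list:
    """Split a Modifications field into (site, modification name) pairs."""
    mods = []
    # The example value is wrapped in double quotes and ends with ';'.
    for item in field.strip().strip('"').split(";"):
        if not item:
            continue
        site, name = item.split(",", 1)
        mods.append((int(site), name))
    return mods


def parse_missing_sites(field: str) -> list:
    """Return the missing fragment ion sites, dropping the trailing peptide length."""
    numbers = [int(part) for part in field.split("_") if part]
    return numbers[:-1]
```

With the table's examples, the modification string yields one pair per modified site, and `6_8` yields a single missing site at position 6.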
This is a standard FASTA file.
| column name | meaning | example |
|---|---|---|
| Modification Name | Name of modification. | Oxidation[M] |
| Frequency of modification | Number of times the modification appears among the top-N candidates. | 3296 |
A few settings in the shell scripts (official_denovo_workflow.sh, official_score_workflow.sh) can be adjusted for your use case:
| Name | Usage |
|---|---|
| num_proc | Number of CPU processes used during data processing. This is particularly useful when there are many MGF files. Keep num_proc <= the number of CPU cores (default=16). |
| range_pred | Number of candidate peptide lengths to sequence de novo. pUniFind first predicts the peptide length and then predicts peptides at several lengths. Use an odd number (default=5). |
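One plausible reading of why `range_pred` should be odd is that the candidate lengths form a window centered on the predicted length. The sketch below illustrates that reading only; how pUniFind actually selects lengths is not stated here, and `candidate_lengths` is a hypothetical name.

```python
def candidate_lengths(predicted_length: int, range_pred: int = 5) -> list:
    """Lengths in a symmetric window of size range_pred around the predicted length."""
    if range_pred < 1 or range_pred % 2 == 0:
        raise ValueError("range_pred should be a positive odd number")
    half = range_pred // 2
    # Assumption: the window is centered, and non-positive lengths are dropped.
    return [n for n in range(predicted_length - half, predicted_length + half + 1) if n >= 1]
```

Under this assumption, a predicted length of 10 with the default `range_pred=5` gives the lengths 8 through 12, so a larger value widens the search at the cost of more predictions per spectrum.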
In the future there will be more options supported, such as:
| Name | Usage |
|---|---|
| de novo min/max length | Minimum or maximum length of peptides to be predicted. Since pUniFind first predicts the length and then the peptide, spectra whose predicted length falls outside this range are skipped to save time. |
| predict_score_all | In some cases, none of the candidate peptides for a spectrum satisfies the 20 ppm threshold. In the current default mode, pUniFind skips scoring them to save time. |
| instrument | Type of instrument, e.g. QE, Lumos, TIMS, Astral. For now, it is QE by default. |
| nce file path | Path to a file giving the normalized collision energy (NCE). For now, the NCE is 30 by default. |
Using the script get_pLabel_from_pUniFind_rescoring.py in this repository, you can export files in .plabel format, which can then be directly imported into pLabel for spectral visualization.

    python get_pLabel_from_pUniFind_rescoring_English.py [pFind project folder path] [pUniFind result.spectra file path]

We provide user-friendly de novo result visualization tools for both workflows mentioned in our paper.
- Regular de novo: pLabel is a convenient tool for visualizing spectra. pLabel requires a .plabel file and the corresponding .mgf file. All the user needs to do is change the path of the mgf file in the .plabel file (which is generated by pUniFind). The pLabel user guide can be found via the link above. It is important to check that the mgf name and mgf path are correct!!

      # pLabel format example
      [FilePath]
      File_Path=C:\Users\Ecoli-E1-F2-20151208_HCDFT_extract103.mgf # path of mgf!!
      [Modification]
      1=Oxidation[M]
      2=Carbamidomethyl[C]
      [xlink]
      xlink=NULL
      [Total]
      total=1
      [Spectrum1]
      name=ECOLI-E1-F2-20151208.30360.30360.3.0.DTA
      pep1=0 LGLDVLVHGEAER 1

- Modification rich de novo: This workflow relies on pFind (with `open` mode disabled) to run a database search against the database generated by pUniFind. pBuild is a visualization tool that is already integrated into pFind.
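For the regular de novo workflow, the path edit in the .plabel file can be scripted instead of done by hand. The helper below is a minimal sketch that assumes the `File_Path=` entry looks like the one in the pLabel format example; `set_mgf_path` is a hypothetical name, not a pLabel or pUniFind function.

```python
def set_mgf_path(plabel_text: str, new_mgf_path: str) -> str:
    """Return the .plabel text with every File_Path entry pointing at new_mgf_path."""
    lines = plabel_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        # Plain string handling avoids regex escaping issues with Windows backslashes.
        if line.startswith("File_Path="):
            lines[i] = f"File_Path={new_mgf_path}"
            replaced = True
    if not replaced:
        raise ValueError("no File_Path= entry found")
    return "\n".join(lines) + "\n"
```

Reading the file, passing its text through `set_mgf_path`, and writing it back is enough; the text encoding pLabel expects is not covered here, so keep whatever encoding the generated file already uses.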
Data type:
- Currently, pUniFind does not support ITMS (considered outdated because of its low resolution) or ETD/EThcD data. For Astral narrow-window DIA de novo sequencing, we recommend that users first try timsTOF mode. Since Astral narrow-window DIA data is relatively scarce, we are willing to provide fine-tuning services for users who can contribute such data, so that pUniFind performs better on Astral.
Open de novo sequencing is a very challenging and complicated task, and there are a few things you should watch out for.
- There are a few "mass coincidences", for example: `Q+Deamidated[Q]=E`, `N+Deamidated[N]=D`, `glycidamide[anything]=S`, `Acetyl+K=AV/VA`, `K+Crotonyl=PV/VP`, `K+Formyl=GV/VG`, `K+Ubiq=GG`, `G+Methyl=A`, etc. We do not recommend searching for these modifications in the modification rich de novo workflow unless that kind of modification is exactly what you are after, in which case you might want to postprocess the search results. You can find modification information in the `modification.ini` file in the pFind install path or in our GitHub repo.
- There are a few loss modifications you might want to ignore: `Arg-loss[AnyC-termR]`, `Met-loss[ProteinN-termM]`, `Met-loss+Acetyl[ProteinN-termM]`, etc.
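To see why these cases cannot be separated by mass alone, the check below sums standard monoisotopic residue masses and modification mass shifts for several of the coincidences listed above. It is an illustrative calculation, not pUniFind code; the numbers are the usual monoisotopic values rounded to six decimals.

```python
# Monoisotopic residue masses (Da) for the amino acids involved.
RESIDUE = {
    "G": 57.021464, "A": 71.037114, "V": 99.068414, "P": 97.052764,
    "K": 128.094963, "Q": 128.058578, "E": 129.042593,
    "N": 114.042927, "D": 115.026943,
}
# Monoisotopic mass shifts (Da) of the modifications.
DELTA = {
    "Deamidated": 0.984016, "Acetyl": 42.010565, "Crotonyl": 68.026215,
    "Formyl": 27.994915, "Methyl": 14.015650,
}
# (modified residue, modification, unmodified residues with the same mass)
COINCIDENCES = [
    ("Q", "Deamidated", "E"), ("N", "Deamidated", "D"),
    ("K", "Acetyl", "AV"), ("K", "Crotonyl", "PV"),
    ("K", "Formyl", "GV"), ("G", "Methyl", "A"),
]


def mass_gap(residue: str, mod: str, plain: str) -> float:
    """Mass of the modified residue minus the mass of the unmodified alternative."""
    return RESIDUE[residue] + DELTA[mod] - sum(RESIDUE[aa] for aa in plain)


for residue, mod, plain in COINCIDENCES:
    print(f"{residue}+{mod} vs {plain}: gap = {mass_gap(residue, mod, plain):+.6f} Da")
```

Every gap is far below what a 20 ppm precursor tolerance can resolve, which is why fragment-level evidence or postprocessing is needed to tell these interpretations apart.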
Should you encounter any technical issues, observe suboptimal performance, or identify inconsistencies between pUniFind results and our evaluation metrics, we welcome your feedback 🙏. We are looking for bad cases to further refine our model. We are actively updating and refining our software, since the main author is far from graduation :(.
We provide priority support for user-reported issues through the following channels:
For technical inquiries:
- GitHub Issues: Open a new issue with:
  - Data description.
  - Error logs and environment.
  - Uploaded folder description.
- pFind Studio user support WeChat group:
  - Please add WeChat: `JL_Zhao2000`, and I will invite you into our user support group. (WeChat invitations expire in one week.)
For collaboration requests:
📧 Contact info: Jiale Zhao. Email: zhaojiale22z@ict.ac.cn or marshmallowzjl@gmail.com.
To further enhance the database search and rescoring performance in HLA or non-specific enzyme digestion scenarios, please follow these steps:
- In the pFind project folder, navigate to the task directory and open the param subfolder. Then change the parameter pepnum in the file pFind.cfg from 10 to 20. (This change is currently not supported by the GUI. It can be made either before or after the pFind search.)
- Open the corresponding task in pFind and run the database search.
- Finally, perform the rescoring using pUniFind.
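The first step can also be scripted. The sketch below assumes that `pFind.cfg` stores the setting as a `pepnum=10` style line, which is an assumption about the file format rather than something documented here; `set_pepnum` is a hypothetical helper.

```python
import re


def set_pepnum(cfg_text: str, value: int = 20) -> str:
    """Return the config text with the pepnum entry set to the given value."""
    # Assumed format: a line of the form 'pepnum=<integer>', optionally with spaces.
    pattern = re.compile(r"^([ \t]*pepnum[ \t]*=[ \t]*)\d+", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{value}", cfg_text)
    if count == 0:
        raise ValueError("no pepnum entry found")
    return new_text
```

Apply it to the text of `param/pFind.cfg` in the task directory and write the result back, keeping a copy of the original file in case the format differs from what is assumed.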
- MGF format: Please make sure your MGF files are generated by pFind. MGF files from different software can be very different. The latest version of pFind supports data from various instruments (including Thermo, timsTOF, etc.). For rescoring or de novo sequencing, you can search your .raw/.d data against any FASTA file. pFind will first preprocess the data and generate MGF files (you might need to click `MGF` in Data Extraction under MS Data). If you still insist on using MGF files from MSConvert, put all of them in one folder and run the script in our repository: `python3 mgf_processor.py -i /somewhere/mgfs_to_process_folder/ -o /somewhere/processed_mgf_folder/ -p (8 by default, number of processes)`
- Install path: Please make sure your install path and data/result paths do not contain spaces.
- Windows uninstall: If you want to reinstall the Windows version of pUniFind, please use `unins000.exe` to uninstall; otherwise, you may not be able to change your install path. If you have already uninstalled using other methods, please reinstall and then uninstall using the method mentioned above.
- Linux deployment: If you encounter the error ``libstdc++.so.6: version `GLIBCXX_3.4.29' not found``, see This Solution. In my case, I solved it with `export LD_LIBRARY_PATH=/your_path/miniconda3/envs/pUniFind/lib:$LD_LIBRARY_PATH`.
- Result not found: If you cannot find the result file for rescoring, please check that you put the `.pac` file in the right place (the project root folder, not inside result/). Also check both the `pUniFind_result` folder and the `result` folder.
- Targeted methods: If you are using targeted acquisition methods (e.g., AIMS, PRM, SRM) and observe suboptimal performance, please contact our team. We can analyze your data and recommend the optimal analysis strategy.
- Component Download Failure in Software: If an error occurs while downloading the checkpoints, please try downloading them one by one instead of several at the same time. Please use a reasonably fast network for downloading. If the issue persists, please contact us.
If you find our software useful and it has helped your research, please cite us 🙏:

    @misc{zhao2025punifindunifiedlargepretrained,
    title={pUniFind: a unified large pre-trained deep learning model pushing the limit of mass spectra interpretation},
    author={Jiale Zhao and Pengzhi Mao and Kaifei Wang and Yiming Li and Yaping Peng and Ranfei Chen and Shuqi Lu and Xiaohong Ji and Jiaxiang Ding and Xin Zhang and Yucheng Liao and Weinan E and Weijie Zhang and Han Wen and Hao Chi},
    year={2025},
    eprint={2507.00087},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2507.00087},
    }

Every citation of yours will motivate the main author to make pUniFind more user-friendly and more powerful. The main author needs your valuable citations and stars to find a job after graduation 😫.