Discussion about searching dark proteins by re-using previous search results or storing internal files #169
Hi Xiaolong,

This is an interesting (and challenging!) project. Depending on how much effort you want to put into it, you could probably do quite a bit of optimization to make this dramatically faster.

I would recommend building a full experimental fragment ion index from your 800 TB of data. This would allow you to find all MS2 spectra containing specific fragment m/z values within a given mass tolerance. Binning fragment ion m/z values is probably the easiest way to do this. It would allow you to, for instance, assemble an mzML file containing all likely MS2 scans for a set of peptides (e.g. all scans within MS1 tolerance and with 4+ matched fragment ions), which you could then score with any existing software.

Alternatively, Sage is fully modular and designed as a library, so it's possible to run just pieces of the pipeline as part of other programs you write (in either Rust or Python - see the very cool SagePy project). You can load MS2 spectra from any kind of data source and then score them. It's probably worth looking more into what PepQuery etc. are doing, since they are more designed for this kind of task - most of Sage's speed comes from optimizations around file loading, memory usage, etc. that you won't necessarily benefit from if you take the modular route.

I would also reconsider using a relational database for this task... you will probably hit massive bottlenecks if you are storing m/z and intensity values directly. Parquet files or a custom format on S3 is probably better, especially if you want to build an experimental index.
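A minimal sketch of the binned fragment ion index described above, assuming the raw files have already been parsed into `(scan_id, peaks)` records. The bin width, the Parquet schema, and every function name here are illustrative assumptions, not Sage internals:

```python
import math
from collections import defaultdict

import pyarrow as pa
import pyarrow.parquet as pq

BIN_WIDTH = 0.02  # Da per bin; a hypothetical choice, tune to your fragment tolerance

def mz_bin(mz: float) -> int:
    """Map a fragment m/z value to an integer bin index."""
    return int(mz / BIN_WIDTH)

def build_index(spectra):
    """spectra: iterable of (scan_id, [(mz, intensity), ...]) tuples.
    Returns a dict of bin -> list of (scan_id, mz) postings."""
    index = defaultdict(list)
    for scan_id, peaks in spectra:
        for mz, intensity in peaks:
            index[mz_bin(mz)].append((scan_id, mz))
    return index

def write_index_parquet(index, path):
    """Persist the postings as a Parquet file sorted by bin, so a
    range scan over a few neighboring bins touches contiguous rows."""
    bins, scans, mzs = [], [], []
    for b in sorted(index):
        for scan_id, mz in index[b]:
            bins.append(b)
            scans.append(scan_id)
            mzs.append(mz)
    table = pa.table({"bin": bins, "scan_id": scans, "mz": mzs})
    pq.write_table(table, path)

def query(index, mz, tol):
    """All (scan_id, observed_mz) postings within +/- tol of a query fragment m/z."""
    hits = []
    for b in range(mz_bin(mz - tol), mz_bin(mz + tol) + 1):
        for scan_id, obs_mz in index.get(b, []):
            if abs(obs_mz - mz) <= tol:
                hits.append((scan_id, obs_mz))
    return hits
```

Each fragment lookup only touches a handful of bins, and the bin-sorted layout maps naturally onto Parquet row groups on S3, in line with the storage suggestion above.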
Thank you so much for your quick response. I will put a lot of effort into this because I am running a lab, and we can benefit a lot from this project. I hope to get more help from you in the future.
I found quantms from your posts on X. It turns out I installed it in 2022 but never tested it... I need to do more background research...
Hi Michael,
I starred Sage but hadn't realized that it is such a good innovation. I got suggestions from Jimmy; here is our discussion: UWPR/Comet#75
I also want to seek your advice. The proteomics data are public, but re-using the results is challenging because the search results are not well organized. So I downloaded about 800 TB of public data and decided to process them and organize the results myself.
Tools like ms2rescore and alphapeptdeep can boost the identification of PSMs. There is also a need to use the data to check for the existence of new protein sequences or peptides in the MS data, since "dark proteins" can be important:
‘Dark proteins’ hiding in our cells could hold clues to cancer and other diseases
PepQuery can search for novel peptides, but it is too strict and quite different from the traditional target-decoy search.
So if I have some new protein sequences to check, how can I do the search without running the whole pipeline?
My plan, in short, is this: whenever given a new protein sequence not included in the database, we should be able to quickly get its search results from the huge amount of data. We should also be able to estimate its protein-level LFQ in different experiments.
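One rough way this "new protein sequence" lookup could work, as a sketch rather than Sage's actual API: digest the sequence, generate theoretical b/y fragment ions with pyteomics, and intersect postings from an index like the one sketched earlier in this thread. `query()`, the index object, the length filter, and the 4-fragment threshold are all hypothetical pieces carried over from that sketch:

```python
from collections import Counter

from pyteomics import mass, parser

def candidate_scans(protein_seq, index, frag_tol=0.02, min_matches=4):
    """For each tryptic peptide of a new protein, find indexed MS2 scans
    that match at least `min_matches` theoretical b/y fragment ions."""
    results = {}
    for pep in parser.cleave(protein_seq, parser.expasy_rules['trypsin']):
        if not 7 <= len(pep) <= 30:  # skip peptides outside a typical searchable length
            continue
        # Theoretical singly charged b- and y-ion m/z values.
        frags = []
        for i in range(1, len(pep)):
            frags.append(mass.fast_mass(pep[:i], ion_type='b', charge=1))
            frags.append(mass.fast_mass(pep[i:], ion_type='y', charge=1))
        # Count how many distinct theoretical fragments each scan matches,
        # using the hypothetical query() from the index sketch above.
        matches = Counter()
        for mz in frags:
            for scan_id in {s for s, _ in query(index, mz, frag_tol)}:
                matches[scan_id] += 1
        hits = {s: n for s, n in matches.items() if n >= min_matches}
        if hits:
            results[pep] = hits
    return results
```

The surviving (peptide, scan) pairs could then be pulled back into an mzML file and rescored with Sage or ms2rescore, as suggested earlier in the thread.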
I would like to hear your thoughts. If I use Sage, what should I do? I can write some code, but mostly in Python.
Best wishes!
Xiaolong