Add msproteomics test datasets#1953

Open
an-altosian wants to merge 1 commit into nf-core:msproteomics from an-altosian:msproteomics

@an-altosian

Summary

Public test datasets for the nf-core/msproteomics pipeline.

  • TMT: PRIDE PXD000001 (Erwinia carotovora, TMT6) — 2 mzML subsets
  • DDA LFQ: Zenodo 1051552 (Human SILAC) — 2 mzML subsets
  • DIA: CPTAC CCRCC (Human DIA) — 1 mzML subset
  • FASTA: UniProt reference databases (Erwinia, Human SwissProt 2000-protein subset, E.coli+UPS1)
  • Module inputs: Pre-computed intermediate files for unit testing individual modules
  • Samplesheets: CSV inputs for all workflow test profiles

FASTA file sizes

| File | Size | Proteins |
| --- | --- | --- |
| ecoli_ups1_test.fasta | 1.8 MB | |
| erwinia_carotovora.fasta | 1.6 MB | |
| erwinia_uniprot.fasta | 1.9 MB | |
| human_sp_subset.fasta | 1.4 MB | 2,000 |

human_sp_subset.fasta is a targeted subset: 169 proteins identified from the DDA LFQ test spectra (HEK SILAC) plus 1,831 evenly spaced SwissProt entries for search-space diversity, 2,000 proteins in total. It was validated by running the full FragPipe DDA LFQ pipeline end-to-end (174 proteins identified at 1% FDR).
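The selection described above could be sketched roughly as follows. This is a minimal illustration only, not the contents of generate_test_subsets.sh: the helper names (`read_fasta`, `accession`, `select_subset`) are hypothetical, and it assumes UniProt-style headers of the form `sp|ACCESSION|NAME` and that "evenly spaced" means evenly spaced indices over the entries that were not identified.

```python
from typing import Iterable


def read_fasta(lines: Iterable[str]) -> list[tuple[str, str]]:
    """Parse FASTA text lines into (header, sequence) pairs."""
    records: list[tuple[str, str]] = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def accession(header: str) -> str:
    """Extract the accession from a header like 'sp|P12345|NAME_HUMAN ...' (assumed format)."""
    parts = header.split("|")
    if len(parts) >= 3:
        return parts[1]
    fields = header.split()
    return fields[0] if fields else ""


def select_subset(
    records: list[tuple[str, str]], identified: set[str], total: int
) -> list[tuple[str, str]]:
    """Keep every identified protein, then pad with evenly spaced remaining entries."""
    keep = [r for r in records if accession(r[0]) in identified]
    rest = [r for r in records if accession(r[0]) not in identified]
    n_fill = max(0, min(total - len(keep), len(rest)))
    # Integer arithmetic spreads n_fill picks across the whole remaining list.
    picks = [rest[(i * len(rest)) // n_fill] for i in range(n_fill)]
    return keep + picks
```

With 169 identified accessions and `total=2000`, this would pad with 1,831 entries, matching the counts quoted above. Note that the sketch places the identified proteins first rather than preserving the original database order.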

Supersedes #1946 (closed due to force-push history issue).

🤖 Generated with Claude Code

Public datasets for nf-core/msproteomics pipeline stub and integration testing:
- TMT: PRIDE PXD000001 (Erwinia carotovora, TMT6) - 2 mzML subsets
- DDA LFQ: Zenodo 1051552 (Human SILAC) - 2 mzML subsets
- DIA: CPTAC CCRCC (Human DIA) - 1 mzML subset
- FASTA: UniProt reference databases (Erwinia, Human SwissProt subset, E.coli+UPS1)
- Module inputs: pre-computed intermediate files for unit testing individual modules
- Samplesheets: CSV inputs for all workflow test profiles
- Script: generate_test_subsets.sh for reproducible subset generation

human_sp_subset.fasta contains 2000 proteins: 169 identified from DDA LFQ
test spectra (HEK SILAC) + 1831 evenly-spaced entries for search space diversity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
an-altosian requested a review from mashehu on March 25, 2026, 16:25
@an-altosian
Author

@mashehu any other comments?

@mashehu
Contributor

mashehu commented Mar 27, 2026

Would going below 1000 proteins not make sense?
(Sorry, I commented on the old PR: #1946 (comment))

@an-altosian
Author

I did this. The FASTA is now 1.4 MB.

@mashehu
Contributor

mashehu commented Mar 27, 2026

Ah, I asked because it said 2000. Would it still work with fewer than 100 proteins?

@an-altosian
Author

Any other comments? I can try, but I am not sure at what size FragPipe and DIA-NN will start rejecting it.

@an-altosian
Author

Another question: what is the file size limit? I thought it was 5 MB, and I made sure all FASTA files are under 5 MB.

@mashehu
Contributor

mashehu commented Mar 27, 2026

We basically want it as small as possible. GitHub has a limit of 7 MB, I think, but this repo already has quite a lot of files in this range.
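For checking file sizes before pushing, a small helper along these lines might be useful. It is only a sketch: the function name `oversized_files` is hypothetical, and the threshold is whatever the contributor chooses (the 5 MB figure mentioned in this thread is the author's assumption, not a confirmed repository limit).

```python
from pathlib import Path


def oversized_files(
    root: Path, limit_bytes: int, pattern: str = "*.fasta"
) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for matching files larger than limit_bytes."""
    found: list[tuple[str, int]] = []
    for path in root.rglob(pattern):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                found.append((path.relative_to(root).as_posix(), size))
    # Largest files first, so the best candidates for shrinking come at the top.
    return sorted(found, key=lambda item: item[1], reverse=True)
```

For example, `oversized_files(Path("."), 5 * 1024 * 1024)` would list any FASTA file above 5 MiB under the current directory; passing `pattern="*"` would cover the mzML subsets as well.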
