Add msproteomics test datasets by an-altosian · Pull Request #1953 · nf-core/test-datasets

an-altosian · 2026-03-25T16:23:30Z

Summary

Public test datasets for the nf-core/msproteomics pipeline.

- TMT: PRIDE PXD000001 (Erwinia carotovora, TMT6), 2 mzML subsets
- DDA LFQ: Zenodo 1051552 (Human SILAC), 2 mzML subsets
- DIA: CPTAC CCRCC (Human DIA), 1 mzML subset
- FASTA: UniProt reference databases (Erwinia, Human SwissProt 2000-protein subset, E. coli + UPS1)
- Module inputs: Pre-computed intermediate files for unit testing individual modules
- Samplesheets: CSV inputs for all workflow test profiles

FASTA file sizes

| File | Size | Proteins |
| --- | --- | --- |
| `ecoli_ups1_test.fasta` | 1.8 MB | not reported |
| `erwinia_carotovora.fasta` | 1.6 MB | not reported |
| `erwinia_uniprot.fasta` | 1.9 MB | not reported |
| `human_sp_subset.fasta` | 1.4 MB | 2,000 |

`human_sp_subset.fasta` is a targeted subset: 169 proteins identified from the DDA LFQ test spectra (HEK SILAC) plus 1,831 evenly spaced SwissProt entries for search-space diversity. It was validated by running the full FragPipe DDA LFQ pipeline end to end (174 proteins identified at 1% FDR).
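The subset construction described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of the PR's `generate_test_subsets.sh`: the function names (`parse_fasta`, `accession`, `build_subset`) are hypothetical, headers are assumed to follow the UniProt `sp|ACCESSION|NAME` layout, and "evenly spaced" is interpreted here as picking entries at a constant stride through the remaining database.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples in file order."""
    records = []
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def accession(header):
    """Extract the accession from a header such as 'sp|P12345|NAME_HUMAN ...'.

    Assumption: UniProt-style headers; otherwise fall back to the first token.
    """
    parts = header.split("|")
    if len(parts) >= 3:
        return parts[1]
    tokens = header.split()
    return tokens[0] if tokens else ""


def build_subset(records, identified, total):
    """Keep every identified protein, then pad with evenly spaced other entries."""
    keep = [r for r in records if accession(r[0]) in identified]
    rest = [r for r in records if accession(r[0]) not in identified]
    # Number of filler entries needed, capped by what is available.
    n_fill = min(max(total - len(keep), 0), len(rest))
    # Constant-stride indices across the remaining entries (all distinct).
    picks = [rest[(i * len(rest)) // n_fill] for i in range(n_fill)]
    return keep + picks
```

With 169 identified accessions and `total=2000`, the padding step would select 1,831 entries, which matches the counts quoted in the description, provided the source database has at least that many other entries.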

Supersedes #1946 (closed due to force-push history issue).

🤖 Generated with Claude Code

Public datasets for nf-core/msproteomics pipeline stub and integration testing:
- TMT: PRIDE PXD000001 (Erwinia carotovora, TMT6) - 2 mzML subsets
- DDA LFQ: Zenodo 1051552 (Human SILAC) - 2 mzML subsets
- DIA: CPTAC CCRCC (Human DIA) - 1 mzML subset
- FASTA: UniProt reference databases (Erwinia, Human SwissProt subset, E.coli+UPS1)
- Module inputs: pre-computed intermediate files for unit testing individual modules
- Samplesheets: CSV inputs for all workflow test profiles
- Script: generate_test_subsets.sh for reproducible subset generation

human_sp_subset.fasta contains 2000 proteins: 169 identified from DDA LFQ test spectra (HEK SILAC) + 1831 evenly-spaced entries for search space diversity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

an-altosian · 2026-03-27T03:35:57Z

@mashehu any other comments?

mashehu · 2026-03-27T09:25:51Z

Would going below 1000 proteins not make sense?
(Sorry, I commented on the old PR: #1946 (comment))

an-altosian · 2026-03-27T14:22:44Z

I did this. Now the FASTA is 1.4 MB.

mashehu · 2026-03-27T14:23:53Z

Ah, because it said 2000. Would it still work if it is <100?

an-altosian · 2026-03-27T14:31:35Z

Any other comments? I can try, but I am not sure at what point FragPipe and DIA-NN will start rejecting it.

an-altosian · 2026-03-27T14:59:58Z

Another question: what is the size limit for files? I thought it was 5 MB, and I made sure all FASTA files are under 5 MB.

mashehu · 2026-03-27T15:02:17Z

We basically want it as small as possible. GitHub has a limit of 7 MB, I think, but this repo already has quite a lot of files in this range.
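For anyone wanting to check the size concern locally before pushing, a small sketch like the one below lists files above a chosen threshold. The function name `oversized_files` is hypothetical, and the threshold is whatever the contributor decides to use; the 5 MB figure mentioned in this thread is the author's assumption, not a confirmed repository rule.

```python
from pathlib import Path


def oversized_files(root, limit_bytes, pattern="*.fasta"):
    """Return (path, size_in_bytes) for matching files under root larger than limit_bytes."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                hits.append((path, size))
    return hits
```

Calling `oversized_files(".", 5 * 1000 * 1000)` from the repository root would return an empty list if every FASTA file is at or below 5 MB (decimal megabytes); pass a different `pattern`, such as `"*.mzML"`, to check the spectra subsets as well.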

an-altosian requested a review from mashehu March 25, 2026 16:25

an-altosian wants to merge 1 commit into nf-core:msproteomics from an-altosian:msproteomics
