Add test data for custom/clustering, clustermetrics and clustervisualization #2051

Open
dbaku42 wants to merge 17 commits into nf-core:modules from dbaku42:clustering-test-data

Conversation

dbaku42 commented May 14, 2026

Description

Adds the test files required by the new custom/clustering, custom/clustermetrics and custom/clustervisualization modules (see nf-core/modules#11372).

Files added under:
data/genomics/homo_sapiens/popgen/clustering/

Will be referenced in the modules tests as:

file(params.modules_testdata_base_path + 'genomics/homo_sapiens/popgen/clustering/test.eigenvec', checkIfExists: true)

pinin4fjords (Member) left a comment

Thanks for opening this @dbaku42. AI-assisted review (Claude, on behalf of @pinin4fjords). A few things to address before this can land:

Scope and description

The PR description only mentions data/genomics/homo_sapiens/popgen/clustering/, but the diff also adds 1000g_phase3_plink2_pca/ and 1000g_phase3_small/ as fresh top-level subdirectories under homo_sapiens/. Please update the description to enumerate all three locations and link each to the specific module(s) the test data supports.

Path organisation

homo_sapiens/popgen/ already hosts PLINK content (plink_simulated.*, 1000GP.chr* etc.). Adding 1000g_phase3_plink2_pca/ as a sibling of popgen/ rather than nesting it inside puts PLINK test data in two parallel locations. Suggest moving it under popgen/ (e.g. popgen/1000g_phase3/ or popgen/plink2_pca/) so the sibling sub-tree style with popgen/clustering/ stays consistent.

Path reference in description

params.modules_testdata_base_path + '/data/genomics/...' would resolve to a doubled data segment (.../data//data/...), because the base path already ends in data/. The modules PR should use:

file(params.modules_testdata_base_path + 'genomics/homo_sapiens/popgen/clustering/test.eigenvec', checkIfExists: true)
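
To make the concatenation issue concrete, here is a minimal Python sketch of the same string join. The base URL is an assumed value for params.modules_testdata_base_path and is not taken from this PR; the only property that matters is that it already ends in data/. The helper name resolve is hypothetical.

```python
# Assumed value of params.modules_testdata_base_path (not taken from this PR).
# The relevant property is the trailing "data/".
BASE = "https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/"


def resolve(base: str, relative: str) -> str:
    """Mimic the plain string concatenation used in the module test."""
    return base + relative


# Prefixing the relative path with '/data/' repeats the segment.
wrong = resolve(BASE, "/data/genomics/homo_sapiens/popgen/clustering/test.eigenvec")

# Starting at 'genomics/' yields a single data/ segment.
right = resolve(BASE, "genomics/homo_sapiens/popgen/clustering/test.eigenvec")

print(wrong)
print(right)
```

The first printed URL contains data//data/, while the second contains the data/ segment only once.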

Documentation

  • Add a paragraph + file line to the main README.md of the modules branch describing the new content. This is the consistent ask in recent merged PRs (#2031, #2044).
  • Add provenance docs for the new files. The branch README requires "a short description about how this file was generated [...] either in this description or in the respective subfolder." The existing popgen/README.md documents data generation via plink --simulate ...; the new files need an equivalent. For the VCF, state where it was subsetted from and give the bcftools/plink command. For the eigenvec / clusters / features, either generate them from the existing simulated PLINK data with plink --pca for reproducibility, or document that they are hand-rolled minimal examples.

Optional

The popgen/clustering/ test files are hand-rolled floats. Technically valid (PR #2049: "only technical validity matters, not biological interpretation"), but a few-row eigenvec generated by plink --pca on the existing popgen/plink_simulated.* data would be reproducible from documented commands and slot in naturally with the existing simulated PLINK content. Not blocking; just more in keeping with the directory's conventions.

Comment on lines +1 to +4
Small test VCF from 1000 Genomes Phase 3 (~100 samples × ~1000 SNPs).

Intended for testing the nf-core/snpclustering pipeline
(PCA + clustering downstream analysis).
Member

Severity: high - this misattributes the data to the snpclustering pipeline. These files are intended for nf-core/modules (see modules#11372 - specifically whichever PLINK2 PCA module consumes the VCF). The README should:

  1. Name the consuming module(s) explicitly rather than referencing a pipeline.
  2. Document where this VCF was subsetted from and the exact command(s) used to produce it and the .tbi, so the test data is reproducible (per the modules branch README requirement: "a short description about how this file was generated").

The existing popgen/README.md is a good model: it shows the plink --simulate ... and bcftools view ... commands that produced the files it documents.

Comment on lines +1 to +4
Small test VCF from 1000 Genomes Phase 3 (~100 samples × ~1000 SNPs).

Intended for testing the nf-core/snpclustering pipeline
(PCA + clustering downstream analysis with plink2).
Member

Severity: high - two problems:

  1. Same snpclustering misattribution as the sibling directory's README - should name the actual module(s) this data is for.
  2. This directory contains only this README - no data file is added alongside it. Either add the data file the README is supposed to describe, or drop the directory.

It's also unclear what distinguishes 1000g_phase3_small/ from 1000g_phase3_plink2_pca/ given the READMEs are nearly identical text. If one is the input VCF and the other the PLINK2-processed output, the READMEs should make that explicit.

@@ -0,0 +1,6 @@
#FID IID PC1 PC2 PC3
Member

Severity: medium - the new popgen/clustering/ directory has no README or provenance documentation. The branch's main README requires "a short description about how this file was generated [...] either in this description or in the respective subfolder."

Two options that would satisfy this:

  1. Generate from existing data (preferred, more in keeping with popgen/ conventions): run plink --pca on popgen/plink_simulated.* to produce a real eigenvec, and document the command. Then derive the test_clusters.csv and test_features.tsv from that. Same reproducible-command pattern that popgen/README.md already uses for plink --simulate.
  2. Add a popgen/clustering/README.md that explicitly states the files are hand-rolled minimal examples for testing the format-parsing path of the clustering modules, with sample IDs and PCs chosen for test simplicity rather than biological meaning.
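
Whichever option is chosen, "technically valid" for the eigenvec can be checked mechanically. The following is a minimal Python sketch of such a check, not the parser used by the clustering modules. The header line matches the one in the diff above; the data rows, the family and sample IDs, and the function name parse_eigenvec are invented for illustration.

```python
# Hypothetical file content: the header is the one shown in the diff,
# the two data rows are invented for illustration only.
EIGENVEC = """#FID IID PC1 PC2 PC3
fam1 s1 0.12 -0.03 0.40
fam1 s2 -0.25 0.10 0.05
"""


def parse_eigenvec(text):
    """Parse whitespace-delimited eigenvec text.

    Returns (pc_names, rows), where rows maps (FID, IID) to a list of floats.
    Raises ValueError on a malformed header, a ragged row, or a non-numeric PC.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty eigenvec")
    header = lines[0].lstrip("#").split()
    if header[:2] != ["FID", "IID"] or len(header) < 3:
        raise ValueError(f"unexpected header: {lines[0]!r}")
    pc_names = header[2:]
    rows = {}
    for lineno, ln in enumerate(lines[1:], start=2):
        fields = ln.split()
        if len(fields) != len(header):
            raise ValueError(
                f"line {lineno}: expected {len(header)} fields, got {len(fields)}"
            )
        # float() raises ValueError if a PC value is not numeric.
        rows[(fields[0], fields[1])] = [float(x) for x in fields[2:]]
    return pc_names, rows


pc_names, rows = parse_eigenvec(EIGENVEC)
print(pc_names, len(rows))
```

A check like this only confirms that the file has the expected shape; it says nothing about whether the values are biologically meaningful, which is consistent with the "technical validity" bar quoted above.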

dbaku42 added 8 commits May 14, 2026 23:50
Remove unwanted 1000g_phase3_plink2_pca/ directory
Added README.md with test data information for clustering modules.
Updated README to clarify the purpose and format of test files.
Added sections for custom modules and test data in README.
Removed section on test data for custom modules in popgen/clustering.
Updated the README to consolidate test data information for clustering modules.
Author

dbaku42 commented May 14, 2026

Hi @pinin4fjords,

I have updated the PR description to mention both READMEs (local and main).

All review points have been addressed:

  • unwanted folders removed
  • provenance documentation added
  • main README entry added
  • correct path reference

Thank you for your help 🙂

Comment thread README.md
- popgen/clustering/: Test data for the new custom modules (`custom/clustering`, `custom/clustermetrics`, `custom/clustervisualization`) - `test.eigenvec`, `test_clusters.csv`, `test_features.tsv` + `README.md` (hand-rolled minimal examples)
- svsig:

- NA03697B2_new.pbmm2.repeats.svsig.gz: structural variant file for NA03697B2_new.pbmm2.repeats.bam, created with PBSV discover version (2.9.0 default settings)
Member

Looks like you changed a line you didn't mean to. Please revert.

fix: correct indentation of popgen/clustering in root README.md
Comment thread README.md
- toy.symm.upper.2.cool, toy.symm.upper.2.cp2.cool: test file for cooler_merge. Downloaded from [open2c/cooler](https://github.com/open2c/cooler/master/tests/data/toy.symm.upper.2.cool)
- toy.symm.upper.balanced.2.cool: test file for the cooltools/insulation module. Balanced copy of toy.symm.upper.2.cool, generated with cooler balance.

Member

It was this line I meant with the extra space
