Add test data for custom/clustering, clustermetrics and clustervisualization by dbaku42 · Pull Request #2051 · nf-core/test-datasets

dbaku42 · 2026-05-14T13:39:15Z

Description

Adds the test files required by the new custom/clustering, custom/clustermetrics and custom/clustervisualization modules (see nf-core/modules#11372).

Files added under:
data/genomics/homo_sapiens/popgen/clustering/

Will be referenced in the modules tests as:

file(params.modules_testdata_base_path + 'genomics/homo_sapiens/popgen/clustering/test.eigenvec', checkIfExists: true)

Revised README to clarify test VCF purpose and context.

Add test data for custom/clustering, clustermetrics and clustervisualization

pinin4fjords

Thanks for opening this @dbaku42. AI-assisted review (Claude, on behalf of @pinin4fjords). A few things to address before this can land:

Scope and description

The PR description only mentions data/genomics/homo_sapiens/popgen/clustering/, but the diff also adds 1000g_phase3_plink2_pca/ and 1000g_phase3_small/ as fresh top-level subdirectories under homo_sapiens/. Please update the description to enumerate all three locations and link each to the specific module(s) the test data supports.

Path organisation

homo_sapiens/popgen/ already hosts PLINK content (plink_simulated.*, 1000GP.chr* etc.). Adding 1000g_phase3_plink2_pca/ as a sibling of popgen/, rather than nesting it inside popgen/, puts PLINK test data in two parallel locations. Suggest moving it under popgen/ (e.g. popgen/1000g_phase3/ or popgen/plink2_pca/) so the layout stays consistent with popgen/clustering/.

Path reference in description

params.modules_testdata_base_path + '/data/genomics/...' would resolve to a doubled /data/data/ segment, because the base path already ends in data/. The modules PR should use:

file(params.modules_testdata_base_path + 'genomics/homo_sapiens/popgen/clustering/test.eigenvec', checkIfExists: true)
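To see why the leading /data/ segment is a problem, here is a minimal Python sketch of the same string concatenation. The base URL is a hypothetical placeholder, not the real value of params.modules_testdata_base_path; the only property taken from the review is that the base path already ends in data/.

```python
# Hypothetical base path; the only assumption that matters is the trailing "data/".
BASE_PATH = "https://example.org/test-datasets/modules/data/"


def resolve(base: str, relative: str) -> str:
    """Mimic the plain string concatenation used in the module test."""
    return base + relative


# Relative path that repeats the "data" segment: the segment ends up duplicated.
wrong = resolve(BASE_PATH, "/data/genomics/homo_sapiens/popgen/clustering/test.eigenvec")

# Relative path that starts below "data/": the segment appears exactly once.
right = resolve(BASE_PATH, "genomics/homo_sapiens/popgen/clustering/test.eigenvec")

print(wrong)
print(right)
```

Because the path is built by plain concatenation, nothing normalises the result, so the relative part has to start at genomics/.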

Documentation

Add a paragraph + file line to the main README.md of the modules branch describing the new content. This is the consistent ask in recent merged PRs (#2031, #2044).
Add provenance docs for the new files. The branch README requires "a short description about how this file was generated [...] either in this description or in the respective subfolder." The existing popgen/README.md documents data generation via plink --simulate ...; the new files need an equivalent. For the VCF, state where it was subsetted from and the bcftools/plink command used. For the eigenvec / clusters / features files, either generate them from the existing simulated PLINK data with plink --pca for reproducibility, or document that they're hand-rolled minimal examples.

Optional

The popgen/clustering/ test files are hand-rolled floats. Technically valid (PR #2049: "only technical validity matters, not biological interpretation"), but a few-row eigenvec generated by plink --pca on the existing popgen/plink_simulated.* data would be reproducible from documented commands and slot in naturally with the existing simulated PLINK content. Not blocking; just more in keeping with the directory's conventions.

pinin4fjords · 2026-05-14T15:46:42Z

+Small test VCF from 1000 Genomes Phase 3 (~100 samples × ~1000 SNPs).
+
+Intended for testing the nf-core/snpclustering pipeline 
+(PCA + clustering downstream analysis).


Severity: high - this misattributes the data to the snpclustering pipeline. These files are intended for nf-core/modules (see modules#11372 - specifically whichever PLINK2 PCA module consumes the VCF). The README should:

Name the consuming module(s) explicitly rather than referencing a pipeline.

Document where this VCF was subsetted from and the exact command(s) used to produce it and the .tbi, so the test data is reproducible (per the modules branch README requirement: "a short description about how this file was generated").

The existing popgen/README.md is a good model: it shows the plink --simulate ... and bcftools view ... commands that produced the files it documents.

pinin4fjords · 2026-05-14T15:46:42Z

+Small test VCF from 1000 Genomes Phase 3 (~100 samples × ~1000 SNPs).
+
+Intended for testing the nf-core/snpclustering pipeline 
+(PCA + clustering downstream analysis with plink2).


Severity: high - two problems:

Same snpclustering misattribution as the sibling directory's README - should name the actual module(s) this data is for.

This directory contains only this README - no data file is added alongside it. Either add the data file the README is supposed to describe, or drop the directory.

It's also unclear what distinguishes 1000g_phase3_small/ from 1000g_phase3_plink2_pca/ given the READMEs are nearly identical text. If one is the input VCF and the other the PLINK2-processed output, the READMEs should make that explicit.

pinin4fjords · 2026-05-14T15:46:42Z

@@ -0,0 +1,6 @@
+#FID	IID	PC1	PC2	PC3


Severity: medium - the new popgen/clustering/ directory has no README or provenance documentation. The branch's main README requires "a short description about how this file was generated [...] either in this description or in the respective subfolder."

Two options that would satisfy this:

Generate from existing data (preferred, more in keeping with popgen/ conventions): run plink --pca on popgen/plink_simulated.* to produce a real eigenvec, and document the command. Then derive the test_clusters.csv and test_features.tsv from that. Same reproducible-command pattern that popgen/README.md already uses for plink --simulate.

Add a popgen/clustering/README.md that explicitly states the files are hand-rolled minimal examples for testing the format-parsing path of the clustering modules, with sample IDs and PCs chosen for test simplicity rather than biological meaning.
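If the hand-rolled route is taken, the README could be backed by a small format check. The Python sketch below parses a tab-separated eigenvec with the header shown in the diff (#FID, IID, PC1, PC2, PC3) and confirms that every PC value is numeric. The sample rows and the function name parse_eigenvec are illustrative assumptions; they are not the contents of test.eigenvec and not part of any module.

```python
import csv
import io

EXPECTED_HEADER = ["#FID", "IID", "PC1", "PC2", "PC3"]

# Illustrative rows only; NOT the contents of the real test.eigenvec.
SAMPLE = (
    "#FID\tIID\tPC1\tPC2\tPC3\n"
    "F1\tS1\t0.10\t-0.20\t0.05\n"
    "F2\tS2\t-0.30\t0.15\t0.00\n"
)


def parse_eigenvec(text):
    """Return {(FID, IID): [PC1, PC2, PC3]}; raise ValueError on a malformed file."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows or rows[0] != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {rows[0] if rows else None}")
    samples = {}
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(EXPECTED_HEADER):
            raise ValueError(
                f"line {line_no}: expected {len(EXPECTED_HEADER)} columns, got {len(row)}"
            )
        # float() raises ValueError itself if a PC value is not numeric.
        samples[(row[0], row[1])] = [float(value) for value in row[2:]]
    return samples


pcs = parse_eigenvec(SAMPLE)
print(len(pcs), "samples parsed")
```

A check like this only covers technical validity of the format, which matches the stated purpose of hand-rolled files; it says nothing about biological plausibility.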

Remove unwanted 1000g_phase3_plink2_pca/ directory

Remove unwanted 1000g_phase3_small/

Added README.md with test data information for clustering modules.

Updated README to clarify the purpose and format of test files.

Added sections for custom modules and test data in README.

Removed section on test data for custom modules in popgen/clustering.

Updated the README to consolidate test data information for clustering modules.

dbaku42 · 2026-05-14T23:44:36Z

Hi @pinin4fjords,

I've updated the PR description to mention both READMEs (local + main).

All review points have been addressed:

unwanted folders removed
provenance documentation added
main README entry added
correct path reference

Thank you for your help 🙂

pinin4fjords · 2026-05-15T09:47:36Z

+    - popgen/clustering/: Test data for the new custom modules (`custom/clustering`, `custom/clustermetrics`, `custom/clustervisualization`) - `test.eigenvec`, `test_clusters.csv`, `test_features.tsv` + `README.md` (hand-rolled minimal examples)
  - svsig:

    - NA03697B2_new.pbmm2.repeats.svsig.gz: structural variant file for NA03697B2_new.pbmm2.repeats.bam, created with PBSV discover version (2.9.0 default settings)


Looks like you hit a line you didn't mean to - please revert

fix: correct indentation of popgen/clustering in root README.md

pinin4fjords · 2026-05-15T12:43:50Z

        - toy.symm.upper.2.cool, toy.symm.upper.2.cp2.cool: test file for cooler_merge. Downloaded from [open2c/cooler](https://github.com/open2c/cooler/master/tests/data/toy.symm.upper.2.cool)
        - toy.symm.upper.balanced.2.cool: test file for the cooltools/insulation module. Balanced copy of toy.symm.upper.2.cool, generated with cooler balance.
-
+       


This is the line I meant, the one with the extra space

dbaku42 and others added 5 commits April 22, 2026 23:18

feat: add tiny 1000G VCF test dataset for plink2 pca

ac0e2a1

Update README for 1000 Genomes test VCF

8d7ac3d

Revised README to clarify test VCF purpose and context.

Add clustering subdirectory

ccbc57e

Add files via upload

d8a2521

Add test data for custom/clustering, clustermetrics and clustervisualization

Delete data/genomics/homo_sapiens/popgen/clustering/.gitkeep

63071ab

dbaku42 mentioned this pull request May 14, 2026

Add cluster metrics viz nf-core/modules#11372

Open

pinin4fjords requested changes May 14, 2026


dbaku42 added 8 commits May 14, 2026 23:50

Delete data/genomics/homo_sapiens/1000g_phase3_plink2_pca directory

79a5061

Remove unwanted 1000g_phase3_plink2_pca/ directory

Delete data/genomics/homo_sapiens/1000g_phase3_small directory

31d1427

Remove unwanted 1000g_phase3_small/

Create README.md for clustering test data

4c4b5d8

Added README.md with test data information for clustering modules.

Revise README for test data clarity

333b631

Updated README to clarify the purpose and format of test files.

Update README with custom modules and test data

a0425a4

Added sections for custom modules and test data in README.

Remove popgen/clustering test data section from README

6f921b0

Removed section on test data for custom modules in popgen/clustering.

Refactor clustering test data section in README

be1412b

Updated the README to consolidate test data information for clustering modules.

Fix formatting in README for clustering test data

d082c22

pinin4fjords reviewed May 15, 2026


Fix formatting in README.md for clustering test data

4b519bd

fix: correct indentation of popgen/clustering in root README.md

pinin4fjords reviewed May 15, 2026


dbaku42 added 3 commits May 15, 2026 16:30

Delete data/genomics/homo_sapiens/popgen/clustering/test_features.tsv

34218bf

Delete data/genomics/homo_sapiens/popgen/clustering/test_clusters.csv

abce3c0

Delete data/genomics/homo_sapiens/popgen/clustering/test.eigenvec

70baac4

dbaku42 wants to merge 17 commits into nf-core:modules from dbaku42:clustering-test-data
