Add test data for custom/clustering, clustermetrics and clustervisualization#2051
Add test data for custom/clustering, clustermetrics and clustervisualization#2051dbaku42 wants to merge 17 commits into
Conversation
Revised README to clarify test VCF purpose and context.
Add test data for custom/clustering, clustermetrics and clustervisualization
pinin4fjords
left a comment
There was a problem hiding this comment.
Thanks for opening this @dbaku42. AI-assisted review (Claude, on behalf of @pinin4fjords). A few things to address before this can land:
Scope and description
The PR description only mentions data/genomics/homo_sapiens/popgen/clustering/, but the diff also adds 1000g_phase3_plink2_pca/ and 1000g_phase3_small/ as fresh top-level subdirectories under homo_sapiens/. Please update the description to enumerate all three locations and link each to the specific module(s) the test data supports.
Path organisation
homo_sapiens/popgen/ already hosts PLINK content (plink_simulated.*, 1000GP.chr* etc.). Adding 1000g_phase3_plink2_pca/ as a sibling of popgen/ rather than nesting it inside puts PLINK test data in two parallel locations. Suggest moving it under popgen/ (e.g. popgen/1000g_phase3/ or popgen/plink2_pca/) so the sibling sub-tree style with popgen/clustering/ stays consistent.
Path reference in description
params.modules_testdata_base_path + '/data/genomics/...' would resolve to a double /data/data/. The base path already ends in data/. The modules PR should use:
file(params.modules_testdata_base_path + 'genomics/homo_sapiens/popgen/clustering/test.eigenvec', checkIfExists: true)Documentation
- Add a paragraph + file line to the main
README.mdof themodulesbranch describing the new content. This is the consistent ask in recent merged PRs (#2031, #2044). - Add provenance docs for the new files. The branch README requires "a short description about how this file was generated [...] either in this description or in the respective subfolder." The existing
popgen/README.mddocuments data generation viaplink --simulate ...; the new files need equivalent (e.g. for the VCF, where it was subsetted from + the bcftools/plink command; for the eigenvec / clusters / features, either generate them from the existing simulated PLINK data withplink --pcafor reproducibility, or document that they're hand-rolled minimal examples).
Optional
The popgen/clustering/ test files are hand-rolled floats. Technically valid (PR #2049: "only technical validity matters, not biological interpretation"), but a few-row eigenvec generated by plink --pca on the existing popgen/plink_simulated.* data would be reproducible from documented commands and slot in naturally with the existing simulated PLINK content. Not blocking; just more in keeping with the directory's conventions.
| Small test VCF from 1000 Genomes Phase 3 (~100 samples × ~1000 SNPs). | ||
|
|
||
| Intended for testing the nf-core/snpclustering pipeline | ||
| (PCA + clustering downstream analysis). |
There was a problem hiding this comment.
Severity: high - this misattributes the data to the snpclustering pipeline. These files are intended for nf-core/modules (see modules#11372 - specifically whichever PLINK2 PCA module consumes the VCF). The README should:
- Name the consuming module(s) explicitly rather than referencing a pipeline.
- Document where this VCF was subsetted from and the exact command(s) used to produce it and the
.tbi, so the test data is reproducible (per themodulesbranch README requirement: "a short description about how this file was generated").
The existing popgen/README.md is a good model: it shows the plink --simulate ... and bcftools view ... commands that produced the files it documents.
| Small test VCF from 1000 Genomes Phase 3 (~100 samples × ~1000 SNPs). | ||
|
|
||
| Intended for testing the nf-core/snpclustering pipeline | ||
| (PCA + clustering downstream analysis with plink2). |
There was a problem hiding this comment.
Severity: high - two problems:
- Same snpclustering misattribution as the sibling directory's README - should name the actual module(s) this data is for.
- This directory contains only this README - no data file is added alongside it. Either add the data file the README is supposed to describe, or drop the directory.
It's also unclear what distinguishes 1000g_phase3_small/ from 1000g_phase3_plink2_pca/ given the READMEs are nearly identical text. If one is the input VCF and the other the PLINK2-processed output, the READMEs should make that explicit.
| @@ -0,0 +1,6 @@ | |||
| #FID IID PC1 PC2 PC3 | |||
There was a problem hiding this comment.
Severity: medium - the new popgen/clustering/ directory has no README or provenance documentation. The branch's main README requires "a short description about how this file was generated [...] either in this description or in the respective subfolder."
Two options that would satisfy this:
- Generate from existing data (preferred, more in keeping with
popgen/conventions): runplink --pcaonpopgen/plink_simulated.*to produce a real eigenvec, and document the command. Then derive the test_clusters.csv and test_features.tsv from that. Same reproducible-command pattern thatpopgen/README.mdalready uses forplink --simulate. - Add a
popgen/clustering/README.mdthat explicitly states the files are hand-rolled minimal examples for testing the format-parsing path of the clustering modules, with sample IDs and PCs chosen for test simplicity rather than biological meaning.
Remove unwanted 1000g_phase3_plink2_pca/ directory
Remove unwanted 1000g_phase3_small/
Added README.md with test data information for clustering modules.
Updated README to clarify the purpose and format of test files.
Added sections for custom modules and test data in README.
Removed section on test data for custom modules in popgen/clustering.
Updated the README to consolidate test data information for clustering modules.
|
Hi @pinin4fjords, updated the PR description with mention of both READMEs (local + main). All review points have been addressed:
Thank you for your help 🙂 |
| - popgen/clustering/: Test data for the new custom modules (`custom/clustering`, `custom/clustermetrics`, `custom/clustervisualization`) - `test.eigenvec`, `test_clusters.csv`, `test_features.tsv` + `README.md` (hand-rolled minimal examples) | ||
| - svsig: | ||
|
|
||
| - NA03697B2_new.pbmm2.repeats.svsig.gz: structural variant file for NA03697B2_new.pbmm2.repeats.bam, created with PBSV discover version (2.9.0 default settings) |
There was a problem hiding this comment.
Looks like you hit a line you didn't mean to- please revert
fix: correct indentation of popgen/clustering in root README.md
| - toy.symm.upper.2.cool, toy.symm.upper.2.cp2.cool: test file for cooler_merge. Downloaded from [open2c/cooler](https://github.com/open2c/cooler/master/tests/data/toy.symm.upper.2.cool) | ||
| - toy.symm.upper.balanced.2.cool: test file for the cooltools/insulation module. Balanced copy of toy.symm.upper.2.cool, generated with cooler balance. | ||
|
|
||
There was a problem hiding this comment.
It was this line I meant with the extra space
Description
Adds the test files required by the new
custom/clustering,custom/clustermetricsandcustom/clustervisualizationmodules (see nf-core/modules#11372).Files added under:
data/genomics/homo_sapiens/popgen/clustering/Will be referenced in the modules tests as: