Add cluster metrics viz by dbaku42 · Pull Request #11372 · nf-core/modules

dbaku42 · 2026-04-29T13:26:32Z

Description

This PR adds two new modules:

cluster_metrics: Computes standard clustering quality metrics (Silhouette score, Calinski-Harabasz index, Davies-Bouldin index) and performs a k-sweep analysis.
cluster_viz: Generates 2D visualizations (PCA, UMAP, t-SNE) colored by cluster label and exports the coordinates as TSV files.

Both modules are designed to work together with the existing clustering module (which performs KMeans/DBSCAN on PLINK2 PCA results).

Features

Full support for conda via environment.yml
Comprehensive nf-test (normal + stub tests)
Clean output channels with proper naming
Optional plots (PNG) for visualization

Author

Donald Baku (@dbaku42)

Related modules

Related to: #11337 (clustering)

This completes the core components of the snpclustering subworkflow.

SPPearce · 2026-04-30T08:24:31Z

Um, you need to rebase these changes on the current master, as there was a bulk update (and you have a massive conflict ;) ).
Ping me when you have done that.

SPPearce · 2026-04-30T14:37:47Z

Hmm, this is an entirely custom set of modules?
This needs discussion as to where they live.
They could potentially be in custom/

dbaku42 · 2026-05-04T12:45:59Z

Thanks @SPPearce, that makes sense.

I’ve moved both custom Python-based modules under modules/nf-core/custom/, updated the internal paths and tests, fixed the author typo, and refreshed the nf-test snapshots.

Both module tests now pass locally for cluster_metrics and cluster_viz.

SPPearce

I think it'd need to be custom/clustermetrics and custom/clustervisualiation (no underscores allowed).
You'll need to use template directly, rather than referring to ${projectDir}/modules/nf-core/custom/cluster_metrics/templates/cluster_metrics.py.

dbaku42 · 2026-05-06T15:32:41Z

Thanks @SPPearce — I renamed the custom modules to remove underscores and updated both modules to use ${moduleDir} instead of hardcoded ${projectDir} paths. I also updated the tests and snapshots accordingly.

- cluster_metrics: computes clustering quality metrics + k-sweep - cluster_viz: generates PCA, UMAP and t-SNE plots colored by cluster - Both use conda environment.yml - Full nf-test coverage

…tion

SPPearce

You can't use moduleDir at all in the main.nf, it won't work on cloud systems as the python script won't be there.
You need to use a template, like these modules.

SPPearce · 2026-05-08T07:52:13Z

You could make the docker/singularity containers via Seqera Containers rather than needing a custom image in the nf-core quay.io.

pinin4fjords

Hi @dbaku42, thanks for the contribution. AI-assisted review (Claude, on behalf of @pinin4fjords). Some overlap with @SPPearce's earlier feedback. Beginner-friendly notes plus suggestion blocks you can apply directly.

The biggest fix is the templates/ mechanism. Right now your script: block runs python3 ${moduleDir}/templates/cluster_metrics.py, which is just running a normal Python script that happens to live in templates/. No templating is happening. This works on your laptop because ${moduleDir} resolves to a local path, but it breaks on cloud/HPC executors because that folder is not staged into the work directory the task runs in.

Nextflow's template directive does two things:

Stages the file into the task work directory automatically (so it works on any executor).
Runs the file through Groovy GString interpolation before staging, so ${features}, ${task.ext.prefix}, etc. get substituted into the Python source. This replaces argparse entirely.

Because the file goes through Groovy interpolation, a few Python source patterns need adjusting (\n -> \\n, r"\s+" -> r"\\s+", etc., I verified this locally). I have flagged the specific spots inline. See modules/nf-core/custom/tx2gene/templates/tx2gene.py for the canonical shape.

Two scope questions for discussion:

The *_selected.json "best k" output bakes a decision into the module, and the sweep refits KMeans regardless of what method produced the input labels. Could the module just emit the k-sweep table and let the consuming pipeline pick?
cluster_viz does PCA + UMAP + t-SNE + plotting + TSV export, and the PCA piece is essentially a re-plot of upstream PLINK eigenvec output. Consider dropping PCA plotting from this module and tightening to the embedding methods that genuinely belong together.

The clustervisualiation directory name looks like a typo for clustervisualization; worth fixing while you're renaming things.

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

…iz.py Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

…version

…R to fix numba caching in Singularity

… imports, and KMean n_init stable

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

…d test assertions

dbaku42 · 2026-05-13T15:03:32Z

Hi @pinin4fjords , I'm stuck on a snapshot mismatch issue and would appreciate some guidance. The tests for CUSTOM_CLUSTERMETRICS and CUSTOM_CLUSTERVISUALIZATION are failing because the output files (test_metrics.tsv, test_k_sweep.csv, test_selected.json) produce different MD5 hashes in CI compared to my local environment.

The root cause seems to be that my local snapshots were generated using the Wave container, while the CI runs tests using conda (as shown in the logs: Creating env using conda: .../environment.yml). The two environments resolve slightly different package versions, leading to numerically different outputs from scikit-learn/pandas.

Thanks in advance! 🙏

Templates: - clustering.py: replace inline `${n_clusters}` / `${dbscan_eps}` / `${dbscan_min_samples}` (which fail ruff and don't parse as Python source) with locals assigned at the top of main(); replace yaml.dump with the format_yaml_like helper used by the other two templates so pyyaml is no longer needed in the env. - cluster_metrics.py and cluster_viz.py: drop the PLINK-aware _normalise_id_column / multi-mode load_clusters glue; require sample_id + cluster on the documented inputs. Drop the silent try/except plot warning. Drop the cluster_mode / alignment_mode branching from main() along with the redundant input_clusters / input_features / n_samples_used / alignment_mode metadata. Drop the argparse + sys.argv wrapper pattern in cluster_metrics.py in favour of the direct template-substitution main() that cluster_viz.py already uses. Environments: - Strip the misleading `# clustermetrics/environment.yml` headers; prune per-module deps so the env declares what the script actually imports (clustering: numpy/pandas/python/scikit-learn; clustermetrics: + matplotlib; clustervisualization: + matplotlib/seaborn/umap-learn). Pin to the versions Wave resolved unpinned: numpy 2.4.4, pandas 3.0.3, python 3.12.13, scikit-learn 1.8.0, matplotlib 3.10.9, seaborn 0.13.2, umap-learn 0.5.12. Containers: - Rebuild Wave containers per module and replace the shared `_pruned:0...` URL (which didn't actually contain the packages the env declared) with the matching freshly-built Docker tag. Switch the singularity branch from `docker://...` to the proper https blob URL per nf-core convention. Tests: - Drop the orphan `test_pca.eigenvec` test data file from clustervisualization now that the PCA piece is no longer in the script. - Regenerate snapshots (run on AWS x86 Linux via docker profile). - Pre-commit autofix (prettier, end-of-file-fixer, ruff-format). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds pyyaml to all three environment.yml files and replaces the embedded format_yaml_like helper in clustering.py / cluster_metrics.py / cluster_viz.py with a one-line `yaml.dump(versions, fh, default_flow_style=False, sort_keys=False)`. format_yaml_like dates from before Seqera Containers / Wave made it cheap to add deps. Now that we control the env precisely, pyyaml is a clean swap and the templates are ~10 lines shorter each. Rebuilt Wave Docker + Singularity containers per module with pyyaml=6.0.3 pinned (matching what Wave resolved unpinned) and updated the container URLs. Regenerated snapshots on the AWS x86 Linux VM with --profile docker. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two CPU/container-environment portability issues surfaced on CI that didn't appear on the AWS x86 VM I used to regenerate snapshots: 1. CUSTOM_CLUSTERING `*_clustering_info.json` md5 differed between CI and the snapshot. KMeans inertia is a sum of squared distances, and BLAS reduction order varies by a few ULPs across CPU instruction sets even with a fixed `random_state`. Keep full precision in the production output and round the value in the test snapshot only, matching the `Math.round(... / 1000) * 1000` pattern used in `custom/basicpy`. 2. CUSTOM_CLUSTERVISUALIZATION stub `versions.yml` had `umap-learn: null` under singularity. The stub heredoc ran `python3 -c "import umap"`, which triggers numba's JIT cache write to a path that is read-only inside the singularity image, so the import silently failed and the bash substitution expanded to an empty string. Switched the stub version checks to `importlib.metadata.version(<pkg>)` for every package so they read dist-info without importing the module. Same answer, no import side effects. Regenerated the clustering snapshot on the VM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reuse commodity functions and reduce hand-rolled glue: - clustering.py: replace the 45-line manual eigenvec line parser with `pd.read_csv(sep=r"\\s+")` + a small header check. Same accepted layouts (FID/IID and IID-only, with the leading '#' on the header line); drops the no-header inference path PLINK2 doesn't actually emit. - clustering.py: `len(set(labels) - {-1})` instead of the `len(set(...)) - (1 if -1 in labels else 0)` puzzle; drop redundant `int()` wraps around `len()`; replace `Path(...).write_text(json.dumps())` with `json.dump` + `open()` to match the versions.yml write style. - cluster_metrics.py / cluster_viz.py: replace the positional-index alignment dance with `load_features().join(load_clusters(), how="inner")`. Both helpers now return DataFrame / Series indexed by sample_id, so pandas does the join. Drops ~10 lines of indexing logic per module and removes a subtle reliance on `common.index` preserving positional identity. - cluster_metrics.py: tighten `cluster_quality()` to one expression and drop the early-return + shared dict pattern. Returns just the three scores; `n_clusters` is set at the call site, so the k-sweep no longer needs to filter the result down before merging. - cluster_metrics.py / cluster_viz.py: same json.dump cleanup. Regenerated snapshots on the AWS x86 VM; all six tests pass under `--profile docker`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The .gitignore picked up entries for local snpclustering subworkflow scaffolding and a stray `modules/nf-core/clustering/` ignore. None of those should land in nf-core/modules master; restoring the file to match origin/master. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- clustering tests: add a DBSCAN scenario alongside the existing kmeans one. The DBSCAN branch of clustering.py was previously untested. - clustermetrics + clustervisualization meta.yml: tighten the `features` and `clusters` input descriptions and patterns. Both modules require a TSV with a `sample_id` column and numeric features, and a CSV with `sample_id` + integer `cluster` columns; the old `"Feature matrix file"` / `pattern: "*"` did not advertise the schema or noise-label convention. - All three modules' stub `versions.yml` blocks: switch every package lookup to `importlib.metadata.version(...)`. clustervisualization already did this for umap-learn to dodge numba's read-only cache issue under singularity; using it everywhere is more robust against similar import-side-effect surprises and keeps the three stubs consistent. All 7 nf-tests pass under `--profile docker` on the AWS x86 VM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nf-core spec (modules/general.md): "All non-mandatory command-line tool non-file arguments MUST be provided as a string via the $task.ext.args variable." Adds argparse-on-task.ext.args plumbing in two template scripts so the previously-hardcoded tuning parameters can be overridden via modules.config: - cluster_metrics.py: --k-min (default 2) and --k-max (default 12) for the KMeans sweep range. - cluster_viz.py: --umap-neighbors (default 15) and --tsne-perplexity (default 30) for the embedding parameters. Both are still clamped against sample count so tiny inputs work. The `task.ext.args` substitution is parsed with shlex+argparse from inside the template (matrixfilter does the same with R's parse_args), which keeps required inputs as `val` channels (per input-output-options.md) and optional ones via `ext.args` (per general.md). `clustering` has no optional non-file args; its required algorithm/n_clusters/dbscan_eps/dbscan_min_samples remain `val` channel inputs. Output filename style also unified to `${prefix}.<descriptor>.<ext>` (dot-separated) across all three modules, matching the spec example (`${prefix}.fq.gz`) and the existing dot-separated pattern in `clustervisualization` / `custom/tx2gene` / `custom/matrixfilter`: - clustering: `*_clusters.csv` -> `*.clusters.csv`, `*_clustering_info.json` -> `*.clustering_info.json`. - clustermetrics: `*_metrics.tsv` -> `*.metrics.tsv`, `*_k_sweep.csv` -> `*.k_sweep.csv`, `*_selected.json` -> `*.selected.json`, and plot files from `*_<name>.png` -> `*.<name>.png`. Also renamed `_calinski.png` -> `.calinski_harabasz.png` to match the other metric files (was the only truncated one). Updated meta.yml output patterns and stub touches to match. Regenerated snapshots on the AWS x86 VM with `--profile docker`; all 7 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pinin4fjords

Hi @dbaku42, sorry for the volume of direct commits on your branch (AI-assisted by Claude on my behalf) - given the back-and-forth pattern of the last few rounds it felt like the fastest way to land the modules in a clean, spec-aligned state. Please review and push back on anything that doesn't sit right; happy to revert specific bits if you disagree with a choice. Highlights of what changed and why:

Containers rebuilt on Seqera Containers via Wave to match the actual environment.yml for each module. nf-core uses Biocontainers or Seqera Containers; for multi-package Python envs like these, Seqera Containers via the wave CLI is the right call. Singularity branch is the https://community-cr-prod.seqera.io/.../data blob URL (not docker://).
${...} substitutions in clustering.py: ruff lints the source as Python, so KMeans(n_clusters=${n_clusters}, ...) fails. Moved substitutions to locals at the top of main() and use the locals from then on.
versions.yml writes via yaml.dump (with pyyaml added to the envs), since adding deps is cheap now that we're on Wave. Stub heredocs use importlib.metadata.version(...) to avoid import side effects (import umap triggers numba JIT cache writes that fail under read-only singularity).
Test portability: KMeans inertia wobbles a few ULPs across CPUs (BLAS reduction order), so the clustering test rounds it in the assertion only. Added a DBSCAN test case so both algorithm branches are exercised.
ext.args plumbing in cluster_metrics.py (--k-min, --k-max) and cluster_viz.py (--umap-neighbors, --tsne-perplexity), parsed via argparse + shlex.split("$task.ext.args") inside the template - matches the spec that optional non-file args go via ext.args. Same approach as custom/matrixfilter uses for R.
Output filenames unified to ${prefix}.<descriptor>.<ext> across all three modules (was mixed dot/underscore), matching the spec example and the existing custom/tx2gene / custom/matrixfilter pattern.
Simplifications: removed the PLINK-aware _normalise_id_column / multi-mode load_clusters from cluster_metrics.py and cluster_viz.py (both already require sample_id + cluster and the clustering module emits exactly that). Replaced the positional-index alignment with load_features().join(load_clusters(), how="inner"). Replaced the 45-line manual eigenvec parser with pd.read_csv. Reverted a stray .gitignore change unrelated to this PR.

One blocker before this can land: tests/data/ files can't live here. nf-core stores all module test data in nf-core/test-datasets on the modules branch, referenced via params.modules_testdata_base_path. See custom/basicpy for an example. To unblock:

PR your five test files to the modules branch of nf-core/test-datasets under something like genomics/popgen/clustering/.
Once it merges, update each tests/main.nf.test to use params.modules_testdata_base_path + ... and delete the tests/data/ directories.
Re-record snapshots and push.

Marking as Changes Requested for that, and resolving the previous threads since they're addressed by the recent commits. Apologies again for the heavy hand - shout if anything needs reverting.

Local workflow to catch the easy stuff before pushing in future:

prek run --all-files
nf-test test --profile docker --update-snapshot modules/nf-core/custom/clustering/tests/main.nf.test
nf-test test --profile docker --update-snapshot modules/nf-core/custom/clustermetrics/tests/main.nf.test
nf-test test --profile docker --update-snapshot modules/nf-core/custom/clustervisualization/tests/main.nf.test

dbaku42 · 2026-05-14T12:38:37Z

Hi @pinin4fjords,

thank you so much for all the work you did directly on the branch (and for the super clear explanations)!
I've reviewed all the commits and they look excellent: the Wave/Seqera containers, correct use of the template directive, task.ext.args, unified file naming, versions.yml with pyyaml + importlib.metadata, the Singularity/Numba fixes, etc. I have no objections at all — on the contrary, thank you for speeding up the process so much.

Regarding the remaining blocker:

I'm preparing right now a PR on nf-core/test-datasets (branch modules) to upload the 5 test files under genomics/popgen/clustering/ (as you suggested).
As soon as that PR is merged, I'll update the tests/main.nf.test files for both modules, delete the tests/data/ folders, re-snapshot and push the changes here.

I'll let you know as soon as the test-datasets PR is opened (and I'll tag you).
If you'd like me to use a slightly different path or rename anything in the test files, just let me know!

Thank you again for the huge help ❤️
(By the way, the modules look really great now)

@nf-core/modules

dbaku42 · 2026-05-14T13:58:44Z

Hi @pinin4fjords,

I've just opened the PR for the test data on nf-core/test-datasets:
nf-core/test-datasets#2051

As soon as it's merged I'll update the two main.nf.test files, remove the local tests/data/ folders, re-snapshot and push here.

Thanks again for all the help!

dbaku42 requested review from a team, edmundmiller, maxulysse and pinin4fjords as code owners April 29, 2026 13:26

dbaku42 force-pushed the add-cluster-metrics-viz branch from c038148 to 9e0f196 Compare April 30, 2026 12:29

SPPearce requested changes May 4, 2026

View reviewed changes

dbaku42 added 6 commits May 7, 2026 00:16

New modules: cluster_metrics + cluster_viz

9abde15

- cluster_metrics: computes clustering quality metrics + k-sweep - cluster_viz: generates PCA, UMAP and t-SNE plots colored by cluster - Both use conda environment.yml - Full nf-test coverage

Move custom clustering modules under custom

0320a41

Move custom clustering modules under custom

e424338

Fix custom clustering module lint

670bc33

Fix custom clustering module lint and snapshots

3a26119

Address review comments for clustering custom modules

e4877e4

dbaku42 force-pushed the add-cluster-metrics-viz branch from 462fff8 to e4877e4 Compare May 6, 2026 22:17

dbaku42 added 3 commits May 7, 2026 00:38

Fix custom clustering module metadata

7e19b22

Add Dockerfile for custom/clustermetrics and custom/clustervisualiation

fdb1be4

Add container directive for custom/clustermetrics and clustervisualia…

d3a382b

…tion

SPPearce requested changes May 8, 2026

View reviewed changes

SPPearce reviewed May 8, 2026

View reviewed changes

pinin4fjords reviewed May 8, 2026

View reviewed changes

dbaku42 and others added 6 commits May 8, 2026 13:06

Update modules/nf-core/custom/clustermetrics/main.nf

89287db

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

Update modules/nf-core/custom/clustermetrics/main.nf

d5ac668

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

Update modules/nf-core/custom/clustermetrics/main.nf

55e868a

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

Update modules/nf-core/custom/clustermetrics/main.nf

e492edb

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

Update modules/nf-core/custom/clustervisualiation/templates/cluster_v…

b8529f6

…iz.py Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

Update modules/nf-core/custom/clustervisualiation/templates/cluster_v…

12bb00d

…iz.py Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

dbaku42 and others added 14 commits May 11, 2026 20:10

fix: align clustermetrics and clustervisualization envs and containers

0980208

fix: use docker:// prefix for singularity container to enable OCI con…

f240623

…version

fix(custom/clustervisualization): set NUMBA_CACHE_DIR and MPLCONFIGDI…

c46c26c

…R to fix numba caching in Singularity

fix(custom/clustervisualization): move NUMBA_CACHE_DIR fix before any…

ad39971

… imports, and KMean n_init stable

Apply suggestion from @pinin4fjords

c8fec15

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

Apply suggestion from @pinin4fjords

3634466

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>

Merge branch 'master' into add-cluster-metrics-viz

0ca17c9

Prek and script fix

6a7ac45

Fixed pandas series problem in cluster_metrics.py

cddb5a8

fix: escape \n in f-strings for Groovy template compatibility

4a72291

Format CUSTOM_CLUSTERMETRICS template with ruff

2bf8afa

fix(clustermetrics,clustervisualization): update nf-test snapshots an…

fd15709

…d test assertions

Merge branch 'master' into add-cluster-metrics-viz

792373b

fix environment conflict

6114ebd

feat: add custom clustering and metrics modules

8348609

dbaku42 closed this May 13, 2026

dbaku42 reopened this May 13, 2026

dbaku42 and others added 8 commits May 13, 2026 22:50

Merge branch 'master' into add-cluster-metrics-viz

f1915bd

pinin4fjords requested changes May 14, 2026

View reviewed changes

dbaku42 mentioned this pull request May 14, 2026

Add test data for custom/clustering, clustermetrics and clustervisualization nf-core/test-datasets#2051

Open

Conversation

dbaku42 commented Apr 29, 2026

Description

Features

Author

Related modules

Uh oh!

SPPearce commented Apr 30, 2026

Uh oh!

SPPearce commented Apr 30, 2026

Uh oh!

dbaku42 commented May 4, 2026

Uh oh!

SPPearce left a comment

Choose a reason for hiding this comment

Uh oh!

dbaku42 commented May 6, 2026

Uh oh!

SPPearce left a comment

Choose a reason for hiding this comment

Uh oh!

SPPearce May 8, 2026

Choose a reason for hiding this comment

Uh oh!

pinin4fjords left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

dbaku42 commented May 13, 2026

Uh oh!

pinin4fjords left a comment

Choose a reason for hiding this comment

Uh oh!

dbaku42 commented May 14, 2026

Uh oh!

dbaku42 commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants