Skip to content

Add cluster metrics viz#11372

Open
dbaku42 wants to merge 46 commits into
nf-core:masterfrom
dbaku42:add-cluster-metrics-viz
Open

Add cluster metrics viz#11372
dbaku42 wants to merge 46 commits into
nf-core:masterfrom
dbaku42:add-cluster-metrics-viz

Conversation

@dbaku42
Copy link
Copy Markdown

@dbaku42 dbaku42 commented Apr 29, 2026

Description

This PR adds two new modules:

  • cluster_metrics: Computes standard clustering quality metrics (Silhouette score, Calinski-Harabasz index, Davies-Bouldin index) and performs a k-sweep analysis.
  • cluster_viz: Generates 2D visualizations (PCA, UMAP, t-SNE) colored by cluster label and exports the coordinates as TSV files.

Both modules are designed to work together with the existing clustering module (which performs KMeans/DBSCAN on PLINK2 PCA results).

Features

  • Full support for conda via environment.yml
  • Comprehensive nf-test (normal + stub tests)
  • Clean output channels with proper naming
  • Optional plots (PNG) for visualization

Author

Related modules

  • Related to: #11337 (clustering)

This completes the core components of the snpclustering subworkflow.

@SPPearce
Copy link
Copy Markdown
Contributor

Um, you need to rebase these changes on the current master, as there was a bulk update (and you have a massive conflict ;) ).
Ping me when you have done that.

@dbaku42 dbaku42 force-pushed the add-cluster-metrics-viz branch from c038148 to 9e0f196 Compare April 30, 2026 12:29
@SPPearce
Copy link
Copy Markdown
Contributor

Hmm, this is an entirely custom set of modules?
This needs discussion as to where they live.
They could potentially be in custom/

@dbaku42
Copy link
Copy Markdown
Author

dbaku42 commented May 4, 2026

Thanks @SPPearce, that makes sense.

I’ve moved both custom Python-based modules under modules/nf-core/custom/, updated the internal paths and tests, fixed the author typo, and refreshed the nf-test snapshots.

Both module tests now pass locally for cluster_metrics and cluster_viz.

Copy link
Copy Markdown
Contributor

@SPPearce SPPearce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd need to be custom/clustermetrics and custom/clustervisualiation (no underscores allowed).
You'll need to use template directly, rather than referring to ${projectDir}/modules/nf-core/custom/cluster_metrics/templates/cluster_metrics.py.

@dbaku42
Copy link
Copy Markdown
Author

dbaku42 commented May 6, 2026

Thanks @SPPearce — I renamed the custom modules to remove underscores and updated both modules to use ${moduleDir} instead of hardcoded ${projectDir} paths. I also updated the tests and snapshots accordingly.

@dbaku42 dbaku42 force-pushed the add-cluster-metrics-viz branch from 462fff8 to e4877e4 Compare May 6, 2026 22:17
Copy link
Copy Markdown
Contributor

@SPPearce SPPearce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't use moduleDir at all in the main.nf, it won't work on cloud systems as the python script won't be there.
You need to use a template, like these modules.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could make the docker/singularity containers via Seqera Containers rather than needing a custom image in the nf-core quay.io.

Copy link
Copy Markdown
Member

@pinin4fjords pinin4fjords left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dbaku42, thanks for the contribution. AI-assisted review (Claude, on behalf of @pinin4fjords). Some overlap with @SPPearce's earlier feedback. Beginner-friendly notes plus suggestion blocks you can apply directly.

The biggest fix is the templates/ mechanism. Right now your script: block runs python3 ${moduleDir}/templates/cluster_metrics.py, which is just running a normal Python script that happens to live in templates/. No templating is happening. This works on your laptop because ${moduleDir} resolves to a local path, but it breaks on cloud/HPC executors because that folder is not staged into the work directory the task runs in.

Nextflow's template directive does two things:

  1. Stages the file into the task work directory automatically (so it works on any executor).
  2. Runs the file through Groovy GString interpolation before staging, so ${features}, ${task.ext.prefix}, etc. get substituted into the Python source. This replaces argparse entirely.

Because the file goes through Groovy interpolation, a few Python source patterns need adjusting (\n -> \\n, r"\s+" -> r"\\s+", etc., I verified this locally). I have flagged the specific spots inline. See modules/nf-core/custom/tx2gene/templates/tx2gene.py for the canonical shape.

Two scope questions for discussion:

  • The *_selected.json "best k" output bakes a decision into the module, and the sweep refits KMeans regardless of what method produced the input labels. Could the module just emit the k-sweep table and let the consuming pipeline pick?
  • cluster_viz does PCA + UMAP + t-SNE + plotting + TSV export, and the PCA piece is essentially a re-plot of upstream PLINK eigenvec output. Consider dropping PCA plotting from this module and tightening to the embedding methods that genuinely belong together.

The clustervisualiation directory name looks like a typo for clustervisualization; worth fixing while you're renaming things.

Comment thread modules/nf-core/custom/clustermetrics/Dockerfile Outdated
Comment thread modules/nf-core/custom/clustermetrics/main.nf Outdated
Comment thread modules/nf-core/custom/clustermetrics/main.nf Outdated
Comment thread modules/nf-core/custom/clustermetrics/main.nf Outdated
Comment thread modules/nf-core/custom/clustermetrics/main.nf
Comment thread modules/nf-core/custom/clustervisualiation/templates/cluster_viz.py Outdated
Comment thread modules/nf-core/custom/clustervisualiation/templates/cluster_viz.py Outdated
Comment thread modules/nf-core/custom/clustervisualiation/templates/cluster_viz.py Outdated
Comment thread modules/nf-core/custom/clustervisualiation/templates/cluster_viz.py Outdated
Comment thread modules/nf-core/custom/clustervisualiation/tests/main.nf.test Outdated
dbaku42 and others added 6 commits May 8, 2026 13:06
Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>
Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>
Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>
Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>
…iz.py

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>
…iz.py

Co-authored-by: Jonathan Manning <pininforthefjords@gmail.com>
@dbaku42
Copy link
Copy Markdown
Author

dbaku42 commented May 13, 2026

Hi @pinin4fjords , I'm stuck on a snapshot mismatch issue and would appreciate some guidance. The tests for CUSTOM_CLUSTERMETRICS and CUSTOM_CLUSTERVISUALIZATION are failing because the output files (test_metrics.tsv, test_k_sweep.csv, test_selected.json) produce different MD5 hashes in CI compared to my local environment.

The root cause seems to be that my local snapshots were generated using the Wave container, while the CI runs tests using conda (as shown in the logs: Creating env using conda: .../environment.yml). The two environments resolve slightly different package versions, leading to numerically different outputs from scikit-learn/pandas.

Thanks in advance! 🙏

@dbaku42 dbaku42 closed this May 13, 2026
@dbaku42 dbaku42 reopened this May 13, 2026
dbaku42 and others added 8 commits May 13, 2026 22:50
Templates:
- clustering.py: replace inline `${n_clusters}` / `${dbscan_eps}` /
  `${dbscan_min_samples}` (which fail ruff and don't parse as Python source)
  with locals assigned at the top of main(); replace yaml.dump with the
  format_yaml_like helper used by the other two templates so pyyaml is no
  longer needed in the env.
- cluster_metrics.py and cluster_viz.py: drop the PLINK-aware
  _normalise_id_column / multi-mode load_clusters glue; require sample_id
  + cluster on the documented inputs. Drop the silent try/except plot
  warning. Drop the cluster_mode / alignment_mode branching from main()
  along with the redundant input_clusters / input_features / n_samples_used
  / alignment_mode metadata. Drop the argparse + sys.argv wrapper pattern
  in cluster_metrics.py in favour of the direct template-substitution main()
  that cluster_viz.py already uses.

Environments:
- Strip the misleading `# clustermetrics/environment.yml` headers; prune
  per-module deps so the env declares what the script actually imports
  (clustering: numpy/pandas/python/scikit-learn; clustermetrics: + matplotlib;
  clustervisualization: + matplotlib/seaborn/umap-learn). Pin to the versions
  Wave resolved unpinned: numpy 2.4.4, pandas 3.0.3, python 3.12.13,
  scikit-learn 1.8.0, matplotlib 3.10.9, seaborn 0.13.2, umap-learn 0.5.12.

Containers:
- Rebuild Wave containers per module and replace the shared `_pruned:0...`
  URL (which didn't actually contain the packages the env declared) with
  the matching freshly-built Docker tag. Switch the singularity branch from
  `docker://...` to the proper https blob URL per nf-core convention.

Tests:
- Drop the orphan `test_pca.eigenvec` test data file from
  clustervisualization now that the PCA piece is no longer in the script.
- Regenerate snapshots (run on AWS x86 Linux via docker profile).
- Pre-commit autofix (prettier, end-of-file-fixer, ruff-format).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds pyyaml to all three environment.yml files and replaces the embedded
format_yaml_like helper in clustering.py / cluster_metrics.py / cluster_viz.py
with a one-line `yaml.dump(versions, fh, default_flow_style=False, sort_keys=False)`.

format_yaml_like dates from before Seqera Containers / Wave made it cheap
to add deps. Now that we control the env precisely, pyyaml is a clean
swap and the templates are ~10 lines shorter each.

Rebuilt Wave Docker + Singularity containers per module with pyyaml=6.0.3
pinned (matching what Wave resolved unpinned) and updated the container URLs.
Regenerated snapshots on the AWS x86 Linux VM with --profile docker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CPU/container-environment portability issues surfaced on CI that didn't
appear on the AWS x86 VM I used to regenerate snapshots:

1. CUSTOM_CLUSTERING `*_clustering_info.json` md5 differed between CI and
   the snapshot. KMeans inertia is a sum of squared distances, and BLAS
   reduction order varies by a few ULPs across CPU instruction sets even
   with a fixed `random_state`. Keep full precision in the production
   output and round the value in the test snapshot only, matching the
   `Math.round(... / 1000) * 1000` pattern used in `custom/basicpy`.

2. CUSTOM_CLUSTERVISUALIZATION stub `versions.yml` had `umap-learn: null`
   under singularity. The stub heredoc ran `python3 -c "import umap"`,
   which triggers numba's JIT cache write to a path that is read-only
   inside the singularity image, so the import silently failed and the
   bash substitution expanded to an empty string. Switched the stub
   version checks to `importlib.metadata.version(<pkg>)` for every package
   so they read dist-info without importing the module. Same answer,
   no import side effects.

Regenerated the clustering snapshot on the VM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reuse commodity functions and reduce hand-rolled glue:

- clustering.py: replace the 45-line manual eigenvec line parser with
  `pd.read_csv(sep=r"\\s+")` + a small header check. Same accepted
  layouts (FID/IID and IID-only, with the leading '#' on the header
  line); drops the no-header inference path PLINK2 doesn't actually
  emit.
- clustering.py: `len(set(labels) - {-1})` instead of the
  `len(set(...)) - (1 if -1 in labels else 0)` puzzle; drop redundant
  `int()` wraps around `len()`; replace `Path(...).write_text(json.dumps())`
  with `json.dump` + `open()` to match the versions.yml write style.
- cluster_metrics.py / cluster_viz.py: replace the positional-index
  alignment dance with `load_features().join(load_clusters(), how="inner")`.
  Both helpers now return DataFrame / Series indexed by sample_id, so
  pandas does the join. Drops ~10 lines of indexing logic per module
  and removes a subtle reliance on `common.index` preserving positional
  identity.
- cluster_metrics.py: tighten `cluster_quality()` to one expression and
  drop the early-return + shared dict pattern. Returns just the three
  scores; `n_clusters` is set at the call site, so the k-sweep no longer
  needs to filter the result down before merging.
- cluster_metrics.py / cluster_viz.py: same json.dump cleanup.

Regenerated snapshots on the AWS x86 VM; all six tests pass under
`--profile docker`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The .gitignore picked up entries for local snpclustering subworkflow
scaffolding and a stray `modules/nf-core/clustering/` ignore. None of
those should land in nf-core/modules master; restoring the file to
match origin/master.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- clustering tests: add a DBSCAN scenario alongside the existing kmeans
  one. The DBSCAN branch of clustering.py was previously untested.
- clustermetrics + clustervisualization meta.yml: tighten the `features`
  and `clusters` input descriptions and patterns. Both modules require a
  TSV with a `sample_id` column and numeric features, and a CSV with
  `sample_id` + integer `cluster` columns; the old `"Feature matrix
  file"` / `pattern: "*"` did not advertise the schema or noise-label
  convention.
- All three modules' stub `versions.yml` blocks: switch every package
  lookup to `importlib.metadata.version(...)`. clustervisualization
  already did this for umap-learn to dodge numba's read-only cache
  issue under singularity; using it everywhere is more robust against
  similar import-side-effect surprises and keeps the three stubs
  consistent.

All 7 nf-tests pass under `--profile docker` on the AWS x86 VM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nf-core spec (modules/general.md): "All non-mandatory command-line tool
non-file arguments MUST be provided as a string via the $task.ext.args
variable." Adds argparse-on-task.ext.args plumbing in two template
scripts so the previously-hardcoded tuning parameters can be overridden
via modules.config:

- cluster_metrics.py: --k-min (default 2) and --k-max (default 12) for
  the KMeans sweep range.
- cluster_viz.py: --umap-neighbors (default 15) and --tsne-perplexity
  (default 30) for the embedding parameters. Both are still clamped
  against sample count so tiny inputs work.

The `task.ext.args` substitution is parsed with shlex+argparse from
inside the template (matrixfilter does the same with R's parse_args),
which keeps required inputs as `val` channels (per
input-output-options.md) and optional ones via `ext.args` (per
general.md). `clustering` has no optional non-file args; its required
algorithm/n_clusters/dbscan_eps/dbscan_min_samples remain `val` channel
inputs.

Output filename style also unified to `${prefix}.<descriptor>.<ext>`
(dot-separated) across all three modules, matching the spec example
(`${prefix}.fq.gz`) and the existing dot-separated pattern in
`clustervisualization` / `custom/tx2gene` / `custom/matrixfilter`:

- clustering: `*_clusters.csv` -> `*.clusters.csv`,
  `*_clustering_info.json` -> `*.clustering_info.json`.
- clustermetrics: `*_metrics.tsv` -> `*.metrics.tsv`,
  `*_k_sweep.csv` -> `*.k_sweep.csv`,
  `*_selected.json` -> `*.selected.json`, and plot files from
  `*_<name>.png` -> `*.<name>.png`. Also renamed
  `_calinski.png` -> `.calinski_harabasz.png` to match the other metric
  files (was the only truncated one).

Updated meta.yml output patterns and stub touches to match. Regenerated
snapshots on the AWS x86 VM with `--profile docker`; all 7 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@pinin4fjords pinin4fjords left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dbaku42, sorry for the volume of direct commits on your branch (AI-assisted by Claude on my behalf) - given the back-and-forth pattern of the last few rounds it felt like the fastest way to land the modules in a clean, spec-aligned state. Please review and push back on anything that doesn't sit right; happy to revert specific bits if you disagree with a choice. Highlights of what changed and why:

  • Containers rebuilt on Seqera Containers via Wave to match the actual environment.yml for each module. nf-core uses Biocontainers or Seqera Containers; for multi-package Python envs like these, Seqera Containers via the wave CLI is the right call. Singularity branch is the https://community-cr-prod.seqera.io/.../data blob URL (not docker://).
  • ${...} substitutions in clustering.py: ruff lints the source as Python, so KMeans(n_clusters=${n_clusters}, ...) fails. Moved substitutions to locals at the top of main() and use the locals from then on.
  • versions.yml writes via yaml.dump (with pyyaml added to the envs), since adding deps is cheap now that we're on Wave. Stub heredocs use importlib.metadata.version(...) to avoid import side effects (import umap triggers numba JIT cache writes that fail under read-only singularity).
  • Test portability: KMeans inertia wobbles a few ULPs across CPUs (BLAS reduction order), so the clustering test rounds it in the assertion only. Added a DBSCAN test case so both algorithm branches are exercised.
  • ext.args plumbing in cluster_metrics.py (--k-min, --k-max) and cluster_viz.py (--umap-neighbors, --tsne-perplexity), parsed via argparse + shlex.split("$task.ext.args") inside the template - matches the spec that optional non-file args go via ext.args. Same approach as custom/matrixfilter uses for R.
  • Output filenames unified to ${prefix}.<descriptor>.<ext> across all three modules (was mixed dot/underscore), matching the spec example and the existing custom/tx2gene / custom/matrixfilter pattern.
  • Simplifications: removed the PLINK-aware _normalise_id_column / multi-mode load_clusters from cluster_metrics.py and cluster_viz.py (both already require sample_id + cluster and the clustering module emits exactly that). Replaced the positional-index alignment with load_features().join(load_clusters(), how="inner"). Replaced the 45-line manual eigenvec parser with pd.read_csv. Reverted a stray .gitignore change unrelated to this PR.

One blocker before this can land: tests/data/ files can't live here. nf-core stores all module test data in nf-core/test-datasets on the modules branch, referenced via params.modules_testdata_base_path. See custom/basicpy for an example. To unblock:

  1. PR your five test files to the modules branch of nf-core/test-datasets under something like genomics/popgen/clustering/.
  2. Once it merges, update each tests/main.nf.test to use params.modules_testdata_base_path + ... and delete the tests/data/ directories.
  3. Re-record snapshots and push.

Marking as Changes Requested for that, and resolving the previous threads since they're addressed by the recent commits. Apologies again for the heavy hand - shout if anything needs reverting.

Local workflow to catch the easy stuff before pushing in future:

prek run --all-files
nf-test test --profile docker --update-snapshot modules/nf-core/custom/clustering/tests/main.nf.test
nf-test test --profile docker --update-snapshot modules/nf-core/custom/clustermetrics/tests/main.nf.test
nf-test test --profile docker --update-snapshot modules/nf-core/custom/clustervisualization/tests/main.nf.test

@dbaku42
Copy link
Copy Markdown
Author

dbaku42 commented May 14, 2026

Hi @pinin4fjords,

thank you so much for all the work you did directly on the branch (and for the super clear explanations)!
I've reviewed all the commits and they look excellent: the Wave/Seqera containers, correct use of the template directive, task.ext.args, unified file naming, versions.yml with pyyaml + importlib.metadata, the Singularity/Numba fixes, etc. I have no objections at all — on the contrary, thank you for speeding up the process so much.

Regarding the remaining blocker:

  • I'm preparing right now a PR on nf-core/test-datasets (branch modules) to upload the 5 test files under genomics/popgen/clustering/ (as you suggested).
  • As soon as that PR is merged, I'll update the tests/main.nf.test files for both modules, delete the tests/data/ folders, re-snapshot and push the changes here.

I'll let you know as soon as the test-datasets PR is opened (and I'll tag you).
If you'd like me to use a slightly different path or rename anything in the test files, just let me know!

Thank you again for the huge help ❤️
(By the way, the modules look really great now)

@nf-core/modules

@dbaku42
Copy link
Copy Markdown
Author

dbaku42 commented May 14, 2026

Hi @pinin4fjords,

I've just opened the PR for the test data on nf-core/test-datasets:
nf-core/test-datasets#2051

As soon as it's merged I'll update the two main.nf.test files, remove the local tests/data/ folders, re-snapshot and push here.

Thanks again for all the help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants