Support staging closure for input files in typed process #6905
bentsherman wants to merge 1 commit into master
@Lehmann-Fabian let me know if this PR meets your needs from #2622
    final result = new ArrayBag(files.size())
    for( final holder : files ) {
        final stageName = stagingClosure.call(holder.getStorePath())
        result << holder.withName(stageName)
    }
    return result
@Lehmann-Fabian the main difference I saw with your PR is that you seem to allow the staging closure to still use glob patterns.
I wasn't sure whether this is actually needed, and the implementation is much simpler if we say that the staging closure must return a fully-resolved file name (no glob patterns), so I left it out.
But let me know if you actually need this.
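
To illustrate the simpler contract described above, here is a minimal sketch in Python, not Nextflow's implementation. It assumes a hypothetical `stage_names` helper that applies a mapping function to each source path; the helper name, the set of characters treated as glob syntax, and the duplicate check are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Assumption: these characters are treated as glob syntax and rejected.
GLOB_CHARS = set("*?[]{}")


def stage_names(paths, mapper):
    """Apply `mapper` to each source path and return {source: stage name}.

    Hypothetical helper: the mapper must return a fully-resolved relative
    name, so glob characters, absolute paths and duplicates are rejected.
    """
    result = {}
    seen = set()
    for src in paths:
        name = str(mapper(PurePosixPath(src)))
        if any(ch in GLOB_CHARS for ch in name):
            raise ValueError(f"stage name must not contain glob characters: {name}")
        if name.startswith("/"):
            raise ValueError(f"stage name must be relative: {name}")
        if name in seen:
            raise ValueError(f"duplicate stage name: {name}")
        seen.add(name)
        result[src] = name
    return result


# Keep the immediate parent directory so that same-named files do not collide.
staged = stage_names(
    ["/data/group1/sample1.txt", "/data/group2/sample1.txt"],
    lambda p: f"{p.parent.name}/{p.name}",
)
```

Because each file maps to exactly one name, collisions can be detected before anything is staged, which is harder to guarantee when the result may itself be a pattern.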
pditommaso left a comment
I'm skeptical there's a strong use case for dynamic resolution. On the other hand, it brings back bloated closures in process definitions, as we had (and still have) for publishDir.
You can see Fabian's use case in #2622.

I have come to see it as an alternative to staging via glob pattern, which doesn't give you as much control and forces certain conventions on you (e.g. numbering files).

The closure doesn't bother me so much because (1) it will be rarely used and (2) it is a mapping function rather than a closure with no arguments (which is redundant).

It's essentially the same thing we do for workflow outputs:

    output {
        samples {
            path { sample -> "fastq/${sample.id}/" }
        }
    }

It's the simplest way to map files from one environment to another.
From Fabian's example (nf-core/rangeland):

    script:
    """
    # ...
    # Rename files: /trend/<Tile>/<Filename> to <Tile>_<Filename>, otherwise we can not reextract the tile name later
    results=`find trend -name '*.tif*'`
    parallel -j $task.cpus 'mv {} {//}_{/}' ::: \$results
    """

So the main use case is when you have a multi-dimensional file collection (e.g. grid data).

Many tools expect a particular file/directory structure, and while we are trying to get away from encoding metadata in file paths, the reality is that many tools still do it.

Nextflow's pattern-based staging isn't good enough here, so you end up "re-staging" the files in your process script (like the
Alternative: Declarative String Replacement Rule (sed-like)

Instead of a closure, we could use a declarative pattern-based renaming rule: a string that specifies how to transform the original file path into the staged name.

Option A:
| Token | Meaning | Example for /data/tile1/sample.tif |
|---|---|---|
| {name} | File name with extension | sample.tif |
| {simpleName} | File name without extension | sample |
| {extension} | File extension | tif |
| {parent} | Immediate parent dir name | tile1 |
| {parent2} | Grandparent dir name | data |
| {path:N} | Last N path segments | tile1/sample.tif (for N = 2) |
Examples:

    // Preserve parent directory structure
    stageAs slice, '{parent}/{name}'
    // tile1/sample.tif, tile2/sample.tif

    // Flatten with prefix from parent
    stageAs slice, '{parent}_{name}'
    // tile1_sample.tif, tile2_sample.tif

    // Keep last 2 path segments (same as parent/name)
    stageAs slice, '{path:2}'
    // tile1/sample.tif

Option C: Regex substitution with explicit syntax

    stageAs slice, from: '.*/([^/]+)/([^/]+)', to: '$1/$2'

Comparison
| Criteria | Closure (PR) | Path Templates (B) | Regex Sub (A/C) |
|---|---|---|---|
| Readability | Medium: Groovy knowledge needed | High: self-documenting tokens | Low: regex is hard to read |
| Power/Flexibility | Maximum: arbitrary Groovy code | Covers 90%+ of use cases | Full regex power |
| Declarative | No: imperative code | Yes | Yes |
| Serializable | No: closures are opaque | Yes: plain string | Yes: plain string |
| Config override | Hard | Easy (process.stageAs = ...) | Possible but ugly |
| Learning curve | Low for Groovy users | Very low | High |
| Composable with existing */? patterns | No: separate code path | Could extend existing pattern language | No |
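
For comparison with the other columns, the regex substitution rule (Option C) could be sketched as follows. This is an illustrative Python sketch, not Nextflow code: `regex_stage_name` is a hypothetical helper, and the requirement that the pattern match the whole path is an assumption.

```python
import re


def regex_stage_name(path, pattern, replacement):
    """Rewrite `path` using a from/to rule with $1-style group references.

    Hypothetical helper; assumes `pattern` must match the entire path.
    """
    match = re.fullmatch(pattern, path)
    if match is None:
        raise ValueError(f"path does not match pattern: {path}")
    # Replace each $N in the replacement with the text captured by group N.
    return re.sub(
        r"\$(\d+)",
        lambda ref: match.group(int(ref.group(1))),
        replacement,
    )


# The from/to pair from the Option C example above.
name = regex_stage_name("/data/tile1/sample.tif", r".*/([^/]+)/([^/]+)", "$1/$2")
# name == "tile1/sample.tif"
```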
Recommendation: Path Templates (Option B)
The path template approach ('{parent}/{name}') seems the strongest alternative because:
- Declarative & serializable: it's just a string, making it compatible with config overrides, caching keys, and serialization
- Readable: '{parent}/{name}' is immediately understandable vs { file -> "${file.parent.name}/${file.name}" }
- Consistent with Nextflow's existing pattern language: extends the glob-based */? patterns rather than introducing a completely different paradigm (closures)
- Covers the real use cases: the rangeland case (tile/sample.tif preservation) and the PR's test case (group1/sample1.txt) are both expressible with simple token substitution
- No closure serialization concerns: closures in Nextflow can cause issues with -resume caching and Kryo serialization
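
The token substitution proposed in Option B could be sketched roughly as below. This is a Python illustration of the token table, not an existing Nextflow feature: `expand_template` is a hypothetical name, and the choices that {simpleName} strips everything after the first dot and {extension} returns only the last suffix are assumptions.

```python
import re
from pathlib import PurePosixPath

TOKEN = re.compile(r"\{([^{}]+)\}")


def expand_template(template, path):
    """Expand {token} placeholders in `template` using parts of `path`.

    Hypothetical helper covering only the tokens listed in the table.
    """
    p = PurePosixPath(path)
    segments = [s for s in p.parts if s != "/"]  # drop the root marker

    def resolve(match):
        token = match.group(1)
        if token == "name":
            return p.name
        if token == "simpleName":
            # Assumption: strip everything after the first dot.
            return p.name.split(".", 1)[0]
        if token == "extension":
            # Assumption: only the last suffix, without the leading dot.
            return p.suffix.lstrip(".")
        if token == "parent":
            return p.parent.name
        if token == "parent2":
            return p.parent.parent.name
        if token.startswith("path:"):
            count = int(token[len("path:"):])
            if count < 1:
                raise ValueError(f"segment count must be positive: {token}")
            return "/".join(segments[-count:])
        raise ValueError(f"unknown token: {{{token}}}")

    return TOKEN.sub(resolve, template)


source = "/data/tile1/sample.tif"
preserved = expand_template("{parent}/{name}", source)  # "tile1/sample.tif"
flattened = expand_template("{parent}_{name}", source)  # "tile1_sample.tif"
```

Since the rule is a plain string, it can be validated up front (unknown tokens raise immediately) without running any user code.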
@FloWuenne can you describe the Hive use case you encountered?

One question I also have -- do you actually need to rename / remap the input files from their original structure, or just preserve them as they are?

Part of the issue is that They will be staged in as But the problem is that Nextflow doesn't know how many parent directories should be included when staging them into a task (i.e. which part belongs to the

If we could find an elegant solution to (2), we might not need the staging closure (or any other alternative) at all. We could just make the default staging behavior do the right thing and then you don't have to think about it.
Signed-off-by: Ben Sherman <bentshermann@gmail.com>
1139434 to 8e619c6
This PR adds the ability to stage input files with a staging closure instead of a glob pattern, which allows for more fine-grained control over individual file names.