Support staging closure for input files in typed process#6905

Open
bentsherman wants to merge 1 commit into master from typed-process-stageas-closure

Conversation

@bentsherman
Member

This PR adds the ability to stage input files with a staging closure instead of a glob pattern, which allows for more fine-grained control over individual file names.

@bentsherman bentsherman requested a review from jorgee March 10, 2026 00:29
@bentsherman bentsherman requested review from a team as code owners March 10, 2026 00:29
@netlify

netlify bot commented Mar 10, 2026

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit 8e619c6
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/69b44086a3da750008f9cab1
😎 Deploy Preview https://deploy-preview-6905--nextflow-docs-staging.netlify.app
@bentsherman
Member Author

@Lehmann-Fabian let me know if this PR meets your needs from #2622

Comment on lines +310 to +315
final result = new ArrayBag(files.size())
for( final holder : files ) {
final stageName = stagingClosure.call(holder.getStorePath())
result << holder.withName(stageName)
}
return result
Member Author

@Lehmann-Fabian the main difference I saw with your PR is that it seems like you allowed the staging closure to still use glob patterns

I wasn't sure if this is actually needed, and the implementation is much simpler if we say that the staging closure must return a fully-resolved file name (no glob patterns), so I left it out

But let me know if you actually needed this
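
To make that contract concrete, here is a minimal sketch in Python (not the PR's Groovy code) of a mapping function applied once per input file. The names `stage_files` and `stage_name_fn` and the `/data` paths are hypothetical, and the explicit rejection of glob characters is an assumption added for illustration; the diff above simply uses whatever name the closure returns.

```python
from pathlib import PurePosixPath

GLOB_CHARS = set("*?")

def stage_files(paths, stage_name_fn):
    """Map each source path to a (source, stage_name) pair using a per-file function."""
    staged = []
    for p in paths:
        source = PurePosixPath(p)
        name = stage_name_fn(source)
        # Assumption: the result must be a fully resolved name, so globs are rejected.
        if GLOB_CHARS & set(name):
            raise ValueError(f"stage name must not contain glob characters: {name!r}")
        staged.append((str(source), name))
    return staged

# Hypothetical inputs: keep the parent directory so equal base names do not clash.
staged = stage_files(
    ["/data/group1/sample1.txt", "/data/group2/sample1.txt"],
    lambda f: f"{f.parent.name}/{f.name}",
)
for source, name in staged:
    print(f"{source} -> {name}")
```

Because the function returns one concrete name per file, the staging step stays a plain loop with no pattern expansion.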

@bentsherman bentsherman added this to the 26.04 milestone Mar 10, 2026
Member

@pditommaso pditommaso left a comment

I'm skeptical there's a strong use case for dynamic resolution. On the other hand, it brings back bloated closures in process definitions, as we had (and still have) for publishDir

@bentsherman
Member Author

You can see Fabian's use case in #2622

I have come to see it as an alternative to staging via glob pattern, which doesn't give you as much control and forces certain conventions on you (e.g. numbering files)

The closure doesn't bother me so much because (1) it will be rarely used and (2) it is a mapping function rather than a closure with no arguments (which is redundant)

It's essentially the same thing we do for workflow outputs:

output {
    samples {
        path { sample -> "fastq/${sample.id}/" }
    }
}

It's the simplest way to map files from one environment to another

@bentsherman bentsherman linked an issue Mar 10, 2026 that may be closed by this pull request
@bentsherman
Member Author

From Fabian's example (nf-core/rangeland):

    script:
    """
    # ...

    # Rename files: /trend/<Tile>/<Filename> to <Tile>_<Filename>, otherwise we can not reextract the tile name later
    results=`find trend -name '*.tif*'`
    parallel -j $task.cpus 'mv {} {//}_{/}' ::: \$results
    """

So the main use case is when you have a multi-dimensional file collection (e.g. grid data)

Many tools expect a particular file/directory structure, and while we are trying to get away from encoding metadata in file paths, the reality is that many tools still do it

Nextflow's pattern-based staging isn't good enough here, so you end up "re-staging" the files in your process script (like the mv command above). So I think an "input file -> stage name" staging closure is a better way to do this
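
As a rough illustration of such a mapping, the Python sketch below expresses the `<Tile>_<Filename>` rename described in the script comment as a pure function of the input path. The function name `tile_stage_name` and the sample path are hypothetical, and this is not Nextflow syntax.

```python
from pathlib import PurePosixPath

def tile_stage_name(path):
    """Return '<Tile>_<Filename>' for a path shaped like '.../<Tile>/<Filename>'."""
    p = PurePosixPath(path)
    # The immediate parent directory is the tile; prefix it to the file name.
    return f"{p.parent.name}_{p.name}"

# Hypothetical tile and file names, for illustration only.
print(tile_stage_name("trend/X0001_Y0002/ndvi.tif"))
```

With the rename expressed as a function of the input path, the script itself would no longer need the find/mv step.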

@pditommaso
Member

Alternative: Declarative String Replacement Rule (sed-like)

Instead of a closure, we could use a declarative pattern-based renaming rule — a string that specifies how to transform the original file path into the staged name.

Option A: sed-style substitution syntax

stageAs slice, 's|.*/([^/]+)/([^/]+)$|$1/$2|'

Option B: Path template with named segments

Define a mini-language that references parts of the source path:

stageAs slice, '{parent}/{name}'

Where built-in tokens map to path components:

| Token | Meaning | Example for /data/tile1/sample.tif |
| --- | --- | --- |
| {name} | File name with extension | sample.tif |
| {simpleName} | File name without extension | sample |
| {extension} | File extension | tif |
| {parent} | Immediate parent dir name | tile1 |
| {parent2} | Grandparent dir name | data |
| {path:N} | Last N path segments | tile1/sample.tif (N = 2) |

Examples:

// Preserve parent directory structure
stageAs slice, '{parent}/{name}'
// tile1/sample.tif, tile2/sample.tif

// Flatten with prefix from parent
stageAs slice, '{parent}_{name}'
// tile1_sample.tif, tile2_sample.tif

// Keep last 2 path segments (same as parent/name)
stageAs slice, '{path:2}'
// tile1/sample.tif
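
A minimal Python sketch of how such tokens could be expanded, assuming the token meanings from the table above. The function name `expand_template` is hypothetical, and the handling of multi-dot names (`{simpleName}` stops at the first dot, `{extension}` is the last suffix) is an assumption the proposal does not spell out.

```python
import re
from pathlib import PurePosixPath

def expand_template(template, path):
    """Expand {token} placeholders in a template using components of the source path."""
    p = PurePosixPath(path)
    segments = [s for s in p.parts if s != p.anchor]  # drop the leading '/'
    tokens = {
        "name": p.name,
        "simpleName": p.name.split(".", 1)[0],  # assumption: cut at the first dot
        "extension": p.suffix[1:],              # assumption: last suffix only
        "parent": p.parent.name,
        "parent2": p.parent.parent.name,
    }

    def replace(match):
        token, count = match.group(1), match.group(2)
        if token == "path":
            # {path:N} keeps the last N path segments.
            if count is None or int(count) < 1:
                raise ValueError("{path:N} requires N >= 1")
            return "/".join(segments[-int(count):])
        if token not in tokens or count is not None:
            raise ValueError(f"unknown token: {match.group(0)}")
        return tokens[token]

    return re.sub(r"\{(\w+)(?::(\d+))?\}", replace, template)

print(expand_template("{parent}/{name}", "/data/tile1/sample.tif"))
print(expand_template("{parent}_{name}", "/data/tile1/sample.tif"))
print(expand_template("{path:2}", "/data/tile1/sample.tif"))
```

Unknown tokens raise an error rather than passing through, so a typo in a template fails early instead of producing an odd file name.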

Option C: Regex substitution with explicit syntax

stageAs slice, from: '.*/([^/]+)/([^/]+)', to: '$1/$2'
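
For comparison, a short Python sketch of the from/to substitution. The helper name `regex_stage_name` is hypothetical, Python writes group references as `\1` where the proposal uses `$1`, and requiring the pattern to match the whole path is an assumption.

```python
import re

def regex_stage_name(path, pattern, replacement):
    """Rewrite a source path into a stage name via a regex with capture groups."""
    match = re.fullmatch(pattern, path)
    if match is None:
        # Assumption: a path that does not match is an error, not a silent pass-through.
        raise ValueError(f"path does not match pattern: {path!r}")
    return match.expand(replacement)

# Equivalent of from: '.*/([^/]+)/([^/]+)', to: '$1/$2'
print(regex_stage_name("/data/tile1/sample.tif", r".*/([^/]+)/([^/]+)", r"\1/\2"))
```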

Comparison

| Criteria | Closure (PR) | Path Templates (B) | Regex Sub (A/C) |
| --- | --- | --- | --- |
| Readability | Medium: Groovy knowledge needed | High: self-documenting tokens | Low: regex is hard to read |
| Power/Flexibility | Maximum: arbitrary Groovy code | Covers 90%+ of use cases | Full regex power |
| Declarative | No: imperative code | Yes | Yes |
| Serializable | No: closures are opaque | Yes: plain string | Yes: plain string |
| Config override | Hard | Easy (process.stageAs = ...) | Possible but ugly |
| Learning curve | Low for Groovy users | Very low | High |
| Composable with existing */? patterns | No: separate code path | Could extend existing pattern language | No |

Recommendation: Path Templates (Option B)

The path template approach ('{parent}/{name}') seems the strongest alternative because:

  1. Declarative & serializable — it's just a string, making it compatible with config overrides, caching keys, and serialization
  2. Readable: '{parent}/{name}' is immediately understandable vs { file -> "${file.parent.name}/${file.name}" }
  3. Consistent with Nextflow's existing pattern language — extends the glob-based */? patterns rather than introducing a completely different paradigm (closures)
  4. Covers the real use cases — the rangeland case (tile/sample.tif preservation) and the PR's test case (group1/sample1.txt) are both expressible with simple token substitution
  5. No closure serialization concerns — closures in Nextflow can cause issues with -resume caching and Kryo serialization

@bentsherman
Member Author

@FloWuenne can you describe the Hive use case you encountered?

One question I also have -- do you actually need to rename / remap the input files from their original structure, or just preserve them as they are?

Part of the issue is that stageAs ignores directories when staging input files. It just stages them in by their base file name. So if I provide a collection of files like this:

/.../group1/sample1.txt
/.../group1/sample2.txt
/.../group1/sample3.txt
/.../group2/sample1.txt
/.../group2/sample2.txt
/.../group2/sample3.txt
/.../group3/sample1.txt
/.../group3/sample2.txt
/.../group3/sample3.txt

They will be staged in as sample1.txt, sample2.txt, and sample3.txt, which means they will overwrite each other.
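
The collision can be reproduced with a short Python sketch that groups the nine paths above by base name. The helper name `find_collisions` is hypothetical, and the elided `/.../` prefix is kept as a literal placeholder.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# The nine source paths from the listing above (prefix left elided on purpose).
sources = [f"/.../group{g}/sample{s}.txt" for g in (1, 2, 3) for s in (1, 2, 3)]

def find_collisions(paths):
    """Group paths by base name and keep only names claimed by more than one source."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: srcs for name, srcs in by_name.items() if len(srcs) > 1}

collisions = find_collisions(sources)
for name, srcs in sorted(collisions.items()):
    print(f"{name}: {len(srcs)} sources")
```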

But the problem is that Nextflow doesn't know how many parent directories should be included when staging them into a task (i.e. which part belongs to the ...?)

  1. if the files are coming from an upstream task output, then we can simply use the relative path against the task directory

  2. if the files are coming from outside the work directory... ?

If we could find an elegant solution to (2), we might not need the staging closure (or any other alternative) at all. We could just make the default staging behavior do the right thing and then you don't have to think about it.
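
A minimal Python sketch of case (1), assuming the upstream task directory is known at staging time. The function name `default_stage_name` and the example directories are hypothetical, and falling back to the base name for case (2) only mirrors the current behavior; it does not answer the open question.

```python
from pathlib import PurePosixPath

def default_stage_name(path, upstream_task_dir=None):
    """Pick a stage name: relative to the upstream task dir if possible, else the base name."""
    p = PurePosixPath(path)
    if upstream_task_dir is not None:
        try:
            # Case (1): the file lives under the upstream task directory.
            return str(p.relative_to(PurePosixPath(upstream_task_dir)))
        except ValueError:
            pass  # not under that directory, fall through to case (2)
    # Case (2): no reference directory, so only the base name is kept.
    return p.name

# Hypothetical task directory and file locations.
print(default_stage_name("/work/ab/123456/group1/sample1.txt", "/work/ab/123456"))
print(default_stage_name("/data/group1/sample1.txt", "/work/ab/123456"))
```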

Signed-off-by: Ben Sherman <bentshermann@gmail.com>
@bentsherman bentsherman force-pushed the typed-process-stageas-closure branch from 1139434 to 8e619c6 Compare March 13, 2026 16:51
@bentsherman bentsherman removed this from the 26.04 milestone Mar 16, 2026
Development

Successfully merging this pull request may close these issues.

option to save/restore relative file paths

2 participants