Description
New feature
Add the capability to access attributes of the decompressed file (at least the actual decompressed size in bytes) on the Path object.
Use case
More often than not, the memory required by an algorithm (through its implementation) is a fixed multiple of the memory taken by the input data, plus a small fixed allocation for the code itself. From experience, that information provides a very effective heuristic for pinning memory for a given process, accounting for n-fold parallel memory allocations by its subprocesses. That is how I would like to define the memory directive for processes, dynamically using closures.
However, this heuristic cannot be applied when the input data is compressed, because the Path object doesn't expose the actual byte length of the decompressed data. I agree it's no fun to decompress before the processing itself (it is full of edge cases with the cloud, data in remote centers, and so on), but that information should be surfaced wherever possible.
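To make the heuristic concrete, here is one possible reading of it as a sketch in plain Java. All names and numbers are illustrative (not part of Nextflow): each of n parallel subprocesses is assumed to hold roughly multiplier × inputBytes, plus one fixed overhead for the code itself.

```java
public class MemoryHeuristic {
    // Illustrative estimate: peak memory ≈ multiplier * inputBytes per
    // subprocess, times the number of parallel subprocesses, plus a fixed
    // overhead. Note: inputBytes must be the *decompressed* size, which is
    // exactly what the Path object does not currently expose.
    static long estimate(long inputBytes, double multiplier,
                         int parallelism, long overheadBytes) {
        return (long) (multiplier * inputBytes) * parallelism + overheadBytes;
    }

    public static void main(String[] args) {
        // e.g. 4 GiB of decompressed input, a 1.5x working-set multiplier,
        // 2 parallel subprocesses, 512 MiB fixed overhead:
        long est = estimate(4L << 30, 1.5, 2, 512L << 20);
        System.out.println(est + " bytes");
    }
}
```

With the decompressed size available, such a function could back a dynamic `memory { ... }` closure instead of a hard-coded value.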
Suggested implementation
I have none; this is an open question. Maybe Nextflow input channels should support all popular compression formats so it can introspect them all, or define a subset of acceptable compression algorithms. Or just expose the attribute where the underlying Java implementation can provide it. I don't know, but this is an ever-growing limitation in my pipeline definitions. As of now, I cannot tell my users how much RAM they need; the current guidance is to give as much as they can to any failing job, which is a really bad habit on computing clusters and in the cloud.
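For gzip at least, a decompressed-size estimate is cheap to obtain without decompressing: RFC 1952 stores the uncompressed length modulo 2^32 in the 4-byte little-endian ISIZE trailer. A minimal sketch in plain Java (class and method names are hypothetical; the modulo caveat means this is only reliable for inputs under 4 GiB):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

public class GzipSize {
    // Read the ISIZE field of a gzip file: the last 4 bytes, little-endian,
    // holding the uncompressed size modulo 2^32 (RFC 1952, section 2.3.1).
    static long uncompressedSize(Path gz) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(gz.toFile(), "r")) {
            raf.seek(raf.length() - 4);
            long b0 = raf.read(), b1 = raf.read(), b2 = raf.read(), b3 = raf.read();
            return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo: compress 1 MiB of data, then recover its size from the
        // trailer alone, without decompressing the stream.
        byte[] data = new byte[1 << 20];
        Path gz = Files.createTempFile("demo", ".gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gz))) {
            out.write(data);
        }
        System.out.println(uncompressedSize(gz)); // 1048576
        Files.delete(gz);
    }
}
```

Other formats differ (bzip2 has no such trailer; some, like bgzf-indexed files, can do better), which is why a per-format capability on the Path object, backed by whatever the underlying Java implementation can do, may be the pragmatic route.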