
Determine memory requirements for compressed inputs #6899

@AlexVCaron

Description


New feature

Add the capability to access a decompressed file's attributes (at minimum its actual, decompressed, size in memory) through the Path object.

Use case

More often than not, the memory an algorithm's implementation requires is a fixed multiple of the memory taken by the input data, plus a small allocation for the code itself. In my experience, this gives a very effective heuristic for pinning memory for a given process, even accounting for n-fold parallel allocations by its subprocesses. That is how I would like to define the memory directive for processes: dynamically, using closures.
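The kind of closure-based directive meant here can be sketched in Nextflow DSL2. This is a minimal illustration, not a working pipeline: the process name, the tool invocation, and the 3x multiplier are all hypothetical.

```nextflow
process ALIGN {
    input:
    path reads

    // Dynamic directive: the closure is evaluated per task, and a plain
    // number is taken as bytes. reads.size() reports the on-disk size,
    // so for a .gz input this is the *compressed* size and the estimate
    // under-provisions the task -- the limitation this issue describes.
    memory { 3 * reads.size() + 500_000_000 }

    script:
    """
    align_tool ${reads}
    """
}
```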

However, this heuristic cannot be deployed if the input data is compressed, as the Path object doesn't expose the actual byte length of the data. I agree it's not fun to decompress before the processing itself (it's full of edge cases with the cloud, data in remote centers, and so on), but that information should be surfaced if possible.

Suggested implementation

I have none; this is an open question. Maybe Nextflow input channels should support all popular compression formats so they can all be introspected, or define a subset of acceptable compression algorithms. Or just expose the information where the underlying Java implementation can provide it. I don't know, but this is an ever-growing limitation in my pipeline definitions. As of now, I cannot tell my users how much RAM they need; the guidance ends up being to give as much as they can to any failing job, which is a really bad habit on computing clusters and in the cloud.
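For gzip specifically, one possible introspection route is cheap: the gzip trailer stores the uncompressed length (modulo 2^32) in its last 4 bytes, so the size can be read without decompressing. A sketch in Python; the function name and demo file are illustrative only, and the approach is unreliable for inputs over 4 GiB or multi-member gzip files:

```python
import gzip
import os
import struct


def gzip_isize(path):
    """Read ISIZE from a gzip trailer: the uncompressed length
    modulo 2**32, stored little-endian in the file's last 4 bytes.
    No decompression needed, but wrong for data over 4 GiB and
    for multi-member (concatenated) gzip files."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        return struct.unpack("<I", f.read(4))[0]


# Demo with a throwaway file: 100,000 bytes of payload.
payload = b"x" * 100_000
with gzip.open("demo.gz", "wb") as f:
    f.write(payload)
print(gzip_isize("demo.gz"))
```

Other formats (bzip2, xz/zstd with certain flags) do not store the uncompressed size as cheaply, which is one argument for restricting any such feature to a defined subset of algorithms.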
