
Feature request: Can we have more compact output formats than CSV, such as Parquet? #3332

@jachymb

Description


I run some experiments where the output CSV file easily exceeds 100 GiB. An example is fitting a model with a Gaussian process as a latent variable, where there is essentially one parameter per datapoint, and these are repeated on every row of the output file. Running this many times over for different inputs makes it challenging even to manage the file storage, and simply reading the file into memory becomes tricky.

It would be cool if we had the option to store the outputs directly in other formats. In particular, Apache Parquet and Avro are popular in data science: they use a more compact data representation with compression on top, and they integrate naturally with other big-data tooling.
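For context, something along these lines can already be done as a post-processing workaround outside of Stan. Below is a minimal sketch (not a proposal for the actual implementation) that converts a Stan CSV output file into a compressed Parquet file using pandas and pyarrow; the file names are placeholders, and it assumes the draws fit in memory for the one-off conversion.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_path = "output.csv"          # hypothetical Stan CSV output from a sampler run
parquet_path = "output.parquet"  # hypothetical destination file

# Stan CSV files interleave comment lines (starting with '#') with the draws;
# pandas can skip those while reading.
draws = pd.read_csv(csv_path, comment="#")

# Write the draws as a Parquet file. Column chunks are compressed, so the
# highly repetitive sampler output shrinks considerably.
table = pa.Table.from_pandas(draws, preserve_index=False)
pq.write_table(table, parquet_path, compression="zstd")
```

The point of the feature request is to skip the intermediate giant CSV entirely, but the sketch shows how little tooling the format itself requires.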

Personally, I would favor Parquet. It is a columnar format, which would make it cheap to discard columns holding nuisance parameters or runtime values (e.g. stepsize__) from the stored Stan output without any unnecessary computational overhead, i.e. without processing the entire file (see the sketch below). It also supports structured values, which means a vector/matrix parameter could be stored in a single column, making the output easier to parse than the CSV.
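To illustrate the columnar advantage with pyarrow: only the requested columns are read from disk, and nested types would in principle let a vector parameter live in one column. The parameter names below are hypothetical, just to show the shape of the API.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read only the columns of interest; the column chunks for sampler
# diagnostics and per-datapoint latent values are never touched.
wanted = ["lp__", "alpha", "rho", "sigma"]  # hypothetical parameter names
draws = pq.read_table("output.parquet", columns=wanted).to_pandas()

# Parquet's nested types would also allow a length-N vector parameter
# to be stored as a single list column instead of theta.1, ..., theta.N:
theta_schema = pa.schema([("theta", pa.list_(pa.float64()))])
```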
