
Feature request: Can we have more compact output formats than CSV, such as Parquet? #3332

@jachymb

Description


I run some experiments where the output CSV file easily exceeds 100 GiB. An example is fitting a model with a Gaussian process as a latent variable, where there is essentially one parameter per datapoint, and these are repeated on every row of the output file. Running this many times over for different inputs makes it challenging even to manage the file storage, and simply reading the file into memory becomes tricky.

It would be cool if we had the option to store the outputs directly in other formats. In particular, Apache Parquet and Avro are popular in data science: they use a more compact data representation with compression on top, and they integrate naturally with other big-data tooling.
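For context, something along these lines can already be done as a post-processing workaround outside of Stan. Below is a minimal sketch (not a proposal for the actual implementation) that converts a Stan CSV output file into a compressed Parquet file using pandas and pyarrow; the file names are placeholders, and it assumes the draws fit in memory for the one-off conversion.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_path = "output.csv"          # hypothetical Stan CSV output from a sampler run
parquet_path = "output.parquet"  # hypothetical destination file

# Stan CSV files interleave comment lines (starting with '#') with the draws;
# pandas can skip those while reading.
draws = pd.read_csv(csv_path, comment="#")

# Write the draws as a Parquet file. Column chunks are compressed, so the
# highly repetitive sampler output shrinks considerably.
table = pa.Table.from_pandas(draws, preserve_index=False)
pq.write_table(table, parquet_path, compression="zstd")
```

The point of the feature request is to skip the intermediate giant CSV entirely, but the sketch shows how little tooling the format itself requires.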

Personally, I would favor Parquet. It is a columnar format, which would make it cheap to discard columns holding nuisance parameters or runtime values (e.g. stepsize__) from the stored Stan output without any unnecessary computational overhead, i.e. without processing the entire file (see the sketch below). It also supports structured values, which means a vector/matrix parameter could be stored in a single column, making the output easier to parse than the CSV.
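To illustrate the columnar advantage with pyarrow: only the requested columns are read from disk, and nested types would in principle let a vector parameter live in one column. The parameter names below are hypothetical, just to show the shape of the API.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read only the columns of interest; the column chunks for sampler
# diagnostics and per-datapoint latent values are never touched.
wanted = ["lp__", "alpha", "rho", "sigma"]  # hypothetical parameter names
draws = pq.read_table("output.parquet", columns=wanted).to_pandas()

# Parquet's nested types would also allow a length-N vector parameter
# to be stored as a single list column instead of theta.1, ..., theta.N:
theta_schema = pa.schema([("theta", pa.list_(pa.float64()))])
```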
