Difference files for identical data sets #1

@mjcollin

When there are no differences, the differ produces an empty data set. Writing it out from Spark as CSV produces an empty file with no header row at all. Writing it out as Parquet produces n*3 segments; each one is small and contains the schema but no data, and together they use up a fair amount of space. Currently that's 200 segments * 3 * 26 kB, or about 15.6 MB.
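The size estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function and parameter names are hypothetical, and the numbers are the ones quoted in the description.

```python
# Figures come from the description; the function name and its
# parameter names are hypothetical, chosen only for this illustration.
def empty_parquet_overhead_kb(segments: int, factor: int, kb_per_file: int) -> int:
    """Total size in kB of the schema-only Parquet files written for an empty diff."""
    return segments * factor * kb_per_file


total_kb = empty_parquet_overhead_kb(segments=200, factor=3, kb_per_file=26)
print(f"{total_kb} kB = {total_kb / 1000:.1f} MB")  # prints "15600 kB = 15.6 MB"
```

Since the cost scales linearly with the segment count, writing an empty result with fewer segments would shrink it proportionally.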

What is the best way to represent "no changes" in both Parquet and CSV output?
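One possible convention for the CSV side is to always write the header, so that "no changes" becomes a header-only file rather than a zero-byte one. The sketch below uses plain Python's `csv` module rather than Spark's writer, and the function name and column names are assumptions made up for the illustration, not taken from the differ.

```python
import csv
import io
from typing import Iterable, Mapping, Sequence


def write_diff_csv(out, columns: Sequence[str], rows: Iterable[Mapping[str, object]]) -> int:
    """Write the header unconditionally, then any difference rows; return the row count."""
    writer = csv.DictWriter(out, fieldnames=list(columns))
    writer.writeheader()  # emitted even when there are no rows
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical column names; an empty iterable stands in for "no differences".
buf = io.StringIO()
n = write_diff_csv(buf, ["id", "field", "old", "new"], [])
print(n, repr(buf.getvalue()))
```

A reader of such a file can then tell "compared, nothing changed" apart from "output missing or truncated", which an empty file cannot express.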
