When there are no differences, the differ produces an empty dataset. Writing this out from Spark as CSV results in an empty file, without even a header row. Writing it out as Parquet results in n*3 segments, each small, that contain the schema but no data and still take up a fair amount of space. Currently that's 200 segments * 3 * 26 kB, or about 15.6 MB.
What is the best way to represent "no changes" in both Parquet and CSV form?
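
To make the question concrete, here is a minimal sketch of one possible convention, not a definitive answer: a header-only CSV plus a single schema-only Parquet part. The function names `write_header_only_csv` and `write_no_changes` are hypothetical. `df` is assumed to be a PySpark DataFrame that the caller already knows is empty, and `csv_path` is assumed to be a path on a local filesystem, since the header is written with plain Python rather than through Spark.

```python
import csv


def write_header_only_csv(path, columns):
    """Write a CSV file that contains only the header row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(columns)


def write_no_changes(df, csv_path, parquet_path):
    """Write an empty diff compactly (hypothetical helper).

    Assumes `df` is an empty pyspark.sql.DataFrame; this module does not
    import Spark itself, it only calls methods on the object passed in.
    """
    # Assumed convention: a header-only CSV means "no changes".
    write_header_only_csv(csv_path, df.columns)
    # Collapse to one partition first, so Spark should write a single
    # schema-only part instead of one per original partition.
    df.coalesce(1).write.mode("overwrite").parquet(parquet_path)
```

With this convention a consumer can still read the column names from either output. Whether it is the best representation is exactly what is being asked, and on a distributed filesystem the CSV header would need to be written by some other means.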