Problem Description
An upcoming version of SDGym will save the artifacts (synthesizers, synthetic data) that are created when benchmarking. These artifacts will be saved based on the date of the benchmarking run. For example:
output_destination/
|--- SDGym_results_06_24_2025/
    |--- census_06_24_2025/
        |--- CTGANSynthesizer/
            |-- CTGANSynthesizer.pkl
            |-- CTGANSynthesizer_synthetic_data.csv
        |--- GaussianCopulaSynthesizer/
            |-- GaussianCopulaSynthesizer.pkl
            |-- GaussianCopulaSynthesizer_synthetic_data.csv
    |--- <dataset_name>_06_24_2025/
        |--- <artifacts>
    |--- results.csv
    |--- metainfo.yaml
The problem is that the SDGym benchmark can be run multiple times on a single day. We need a consistent, well-defined output when that happens.
Expected behavior
If only one run happens on a given day, then the naming scheme for all the artifacts should be exactly as shown above.
- The final results should be named results.csv
- The meta info should be named metainfo.yaml
- The synthesizer folder should be named <synthesizer_name>/
If another run happens on the same day, then we should do the following:
- Create a new final results file called results(1).csv. (If that's taken, then keep incrementing the suffix: results(2).csv, results(3).csv, etc.)
  - The number should correspond with the run_id that is saved in the metainfo file. The first one will be run_<date>_0, then the next will be run_<date>_1, then run_<date>_2, etc.
- Do the same with the metainfo file. Name it metainfo(1).yaml. (If that's taken, keep incrementing the suffix: metainfo(2).yaml, metainfo(3).yaml, etc.)
  - The number should correspond with the run_id that is saved in the metainfo file. The first one will be run_<date>_0, then the next will be run_<date>_1, then run_<date>_2, etc.
  - The number should also be the same as the one in the corresponding results file.
- Generally speaking, it would be really rare to run the same (synthesizer, dataset) combo a second time on the same day. However, if it happens, we should apply the same naming scheme to the new synthesizer folder. Use the same number for the run that is used for the corresponding results and metainfo file.
  - For example, the folder would then be called CTGANSynthesizer(1)/.
  - Inside the folder, the artifacts should be renamed to CTGANSynthesizer(1).pkl, CTGANSynthesizer(1)_synthetic_data.csv, etc.
  - In results(1).csv we should refer to it as CTGANSynthesizer(1).
Below is how the structure would look:
output_destination/
|--- SDGym_results_06_24_2025/
    |--- census_06_24_2025/
        |--- CTGANSynthesizer/
            |-- CTGANSynthesizer.pkl
            |-- CTGANSynthesizer_synthetic_data.csv
        |--- GaussianCopulaSynthesizer/
            |-- GaussianCopulaSynthesizer.pkl
            |-- GaussianCopulaSynthesizer_synthetic_data.csv
        |--- CTGANSynthesizer(1)/
            |-- CTGANSynthesizer(1).pkl
            |-- CTGANSynthesizer(1)_synthetic_data.csv
    |--- <dataset_name>_06_24_2025/
        |--- <artifacts>
    |--- results.csv
    |--- results(1).csv
    |--- metainfo.yaml
    |--- metainfo(1).yaml
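To make the naming rules concrete, here is a rough sketch of how the run number and the artifact names could be derived. This is not SDGym code: the helper names (with_suffix_index, next_run_index, run_file_names, synthesizer_artifact_names) are hypothetical, the date string is assumed to be formatted by the caller, and it assumes the synthesizer folder only gets a suffix when the plain folder already exists for that dataset.

```python
from pathlib import Path


def with_suffix_index(stem: str, index: int) -> str:
    """Return `stem` for the first run (index 0), otherwise `stem(index)`."""
    return stem if index == 0 else f"{stem}({index})"


def next_run_index(results_dir: Path) -> int:
    """Return the smallest index whose results and metainfo names are both unused."""
    index = 0
    while (
        (results_dir / f"{with_suffix_index('results', index)}.csv").exists()
        or (results_dir / f"{with_suffix_index('metainfo', index)}.yaml").exists()
    ):
        index += 1
    return index


def run_file_names(date: str, index: int) -> dict:
    """Build the run_id and the file names that must share the same number."""
    return {
        "run_id": f"run_{date}_{index}",
        "results": f"{with_suffix_index('results', index)}.csv",
        "metainfo": f"{with_suffix_index('metainfo', index)}.yaml",
    }


def synthesizer_artifact_names(dataset_dir: Path, synthesizer: str, index: int) -> dict:
    """Name a synthesizer's artifacts, adding the run number only on a collision."""
    # Assumption: the suffix is applied only when the plain folder is already taken.
    taken = (dataset_dir / synthesizer).exists()
    label = with_suffix_index(synthesizer, index) if taken else synthesizer
    return {
        "label": label,  # the name to report in the results file
        "folder": f"{label}/",
        "pkl": f"{label}.pkl",
        "synthetic_data": f"{label}_synthetic_data.csv",
    }
```

With results.csv and metainfo.yaml already present, next_run_index would return 1, and a repeated CTGANSynthesizer run on census would be labeled CTGANSynthesizer(1). Note that checking for existing files and then writing is not safe under concurrent runs, which is consistent with concurrency being out of scope here.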
Additional context
If a user is doing multiple runs on the same day, it is most likely because they are splitting up the (synthesizer, dataset) combinations that they would like to test. Perhaps a second run is done on slower synthesizers or larger datasets.
In an ideal case, we'd just append the results from the subsequent run(s) to the existing results.csv and metainfo.yaml files. However, this would require us to implement a file locking system in case multiple concurrent runs try to access the same file at the same time. For now, this is out of scope, so we're writing a new file instead.