When performing multiple SDGym runs on the same day, save the artifacts with consistent naming #448

@npatki

Problem Description

An upcoming version of SDGym will save the artifacts (synthesizers, synthetic data) that are created when benchmarking. These artifacts will be saved based on the date of the benchmarking run. For example:

output_destination/
|--- SDGym_results_06_24_2025/
     |--- census_06_24_2025/
          |--- CTGANSynthesizer/
               |-- CTGANSynthesizer.pkl
               |-- CTGANSynthesizer_synthetic_data.csv
          |--- GaussianCopulaSynthesizer/
               |-- GaussianCopulaSynthesizer.pkl
               |-- GaussianCopulaSynthesizer_synthetic_data.csv
     |--- <dataset_name>_06_24_2025/
          |--- <artifacts>
     |--- results.csv
     |--- metainfo.yaml

The problem is that it's possible to run the SDGym benchmark multiple times on a single day. We need a consistent, well-defined output when that happens.

Expected behavior

If only one run happens on a given day, then the naming scheme for all the artifacts should be exactly as shown above.

  • The final results should be named results.csv
  • The meta info should be named metainfo.yaml
  • The synthesizer folder should be named <synthesizer_name>/

If another run happens on the same day, then we should do the following:

  • Create a new final results file called results(1).csv. (If that's taken, then keep incrementing the suffixes, results(2).csv, results(3).csv, etc.)
    • The number should correspond with the run_id that is saved in the metainfo file. The first one will be run_<date>_0, then the next will be run_<date>_1, then run_<date>_2, etc.
  • Do the same with the metainfo file. Name it metainfo(1).yaml. (If that's taken, keep incrementing the suffixes, metainfo(2).yaml, metainfo(3).yaml, etc.)
    • The number should correspond with the run_id that is saved in the metainfo file. The first one will be run_<date>_0, then the next will be run_<date>_1, then run_<date>_2, etc.
    • The number should also be the same as the corresponding results file
  • Generally speaking, it would be really rare to run the same (synthesizer, dataset) combo a second time on the same day. However, if it happens, we should apply the same naming scheme to the new synthesizer folder. Use the same run number that is used for the corresponding results and metainfo files.
    • For example the folder would then be called CTGANSynthesizer(1)/.
    • Inside the folder, the artifacts should be renamed to: CTGANSynthesizer(1).pkl, CTGANSynthesizer(1)_synthetic_data.csv, etc.
    • In the results(1).csv we should refer to it as CTGANSynthesizer(1).

Below is how the resulting structure would look:

output_destination/
|--- SDGym_results_06_24_2025/
     |--- census_06_24_2025/
          |--- CTGANSynthesizer/
               |-- CTGANSynthesizer.pkl
               |-- CTGANSynthesizer_synthetic_data.csv
          |--- GaussianCopulaSynthesizer/
               |-- GaussianCopulaSynthesizer.pkl
               |-- GaussianCopulaSynthesizer_synthetic_data.csv
          |--- CTGANSynthesizer(1)/
               |-- CTGANSynthesizer(1).pkl
               |-- CTGANSynthesizer(1)_synthetic_data.csv
     |--- <dataset_name>_06_24_2025/
          |--- <artifacts>
     |--- results.csv
     |--- results(1).csv
     |--- metainfo.yaml
     |--- metainfo(1).yaml
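
The naming rules above could be sketched roughly as follows. This is only an illustration of the expected behavior, not SDGym's actual implementation: the helper names `suffixed`, `next_run_index`, and `synthesizer_label` are hypothetical, and the sketch assumes a synthesizer folder only gets a suffix when a folder with the plain name already exists for that dataset.

```python
from pathlib import Path


def suffixed(name: str, run_index: int) -> str:
    """Return `name` for run 0 and `name(N)` for run N > 0."""
    return name if run_index == 0 else f"{name}({run_index})"


def next_run_index(results_dir: Path) -> int:
    """Find the smallest index whose results and metainfo names are both free.

    Requiring both names to be free keeps the results file, the metainfo
    file, and the run_id (run_<date>_<index>) on the same number.
    """
    index = 0
    while (
        (results_dir / f"{suffixed('results', index)}.csv").exists()
        or (results_dir / f"{suffixed('metainfo', index)}.yaml").exists()
    ):
        index += 1
    return index


def synthesizer_label(dataset_dir: Path, synthesizer: str, run_index: int) -> str:
    """Pick the folder/artifact label for a synthesizer in this run.

    Assumption: the plain name is used unless that folder already exists
    for the dataset, in which case the run's number is appended.
    """
    if not (dataset_dir / synthesizer).exists():
        return synthesizer
    return suffixed(synthesizer, run_index)
```

For the second run in the tree above, `next_run_index` would return 1, giving `results(1).csv` and `metainfo(1).yaml`, and `synthesizer_label` would return `CTGANSynthesizer(1)` for the repeated CTGAN run on `census`. Note that the existence checks are not atomic, so this sketch does not address concurrent runs starting at the same moment.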

Additional context

If a user is doing multiple runs on the same day, it is most likely because they are splitting up the (synthesizer, dataset) combinations that they would like to test. Perhaps a second run is done on slower synthesizers or larger datasets.

Ideally, we'd just append the results from the subsequent run(s) to the existing results.csv and metainfo.yaml files. However, this would require us to implement a file locking system in case multiple concurrent runs try to access the same file at the same time. For now, this is out of scope, so we're writing a new file instead.
