A Differentially Private (DP) Synthetic Data benchmarking package, posing the question: "Can a DP Synthesizer produce private (tabular) data that preserves scientific findings?" In other words, do DP Synthesizers satisfy Epistemic Parity?
Citation: Rosenblatt, L., Holovenko, A., Rumezhak, T., Stadnik, A., Herman, B., Stoyanovich, J., & Howe, B. (2022). Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy. arXiv preprint arXiv:2208.12700. (under review)
The benchmark is currently in beta-0.1. Still, you can install the development version by running the following commands:
- Create your preferred package management environment with `python=3.7` (for example, `conda create -n "synrd" python=3.7`)
- `git clone https://github.com/DataResponsibly/SynRD.git`
- `cd SynRD`
- `pip install git+https://github.com/ryan112358/private-pgm.git`
- `pip install .`
The `pip install git+https://github.com/ryan112358/private-pgm.git` step installs a non-PyPI dependency: private-pgm, an excellent package for DP synthesizers (https://github.com/ryan112358/private-pgm).
Note: This package is under heavy development. If functionality doesn't work or is missing, feel free to open an issue or submit a PR with a fix!
If you would like to use the GEMSynthesizer, you must follow an alternative installation process for SynRD:
- Create your preferred package management environment with `python=3.7` (for example, `conda create -n "synrd" python=3.7`)
- Git clone the SynRD repo: `git clone https://github.com/DataResponsibly/SynRD`, then `cd SynRD/synthesizers`
- Git clone the dp-query-release repo: `git clone https://github.com/terranceliu/dp-query-release.git`
- Move the `src/` folder out of `dp-query-release/` and into `SynRD/synthesizers/`
- From the top level of the SynRD clone, run `pip install .`
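The "move `src/`" step of the GEMSynthesizer installation can also be scripted. The following is a minimal sketch, not part of SynRD: the helper name `move_src_into_synthesizers` is hypothetical, and it assumes `dp-query-release` was cloned inside the `synthesizers` directory as described above.

```python
import shutil
from pathlib import Path


def move_src_into_synthesizers(synthesizers_dir):
    """Move dp-query-release/src into synthesizers_dir and return the new path.

    Hypothetical helper; assumes dp-query-release was cloned inside
    synthesizers_dir, as in the installation steps.
    """
    synthesizers_dir = Path(synthesizers_dir)
    source = synthesizers_dir / "dp-query-release" / "src"
    target = synthesizers_dir / "src"
    if not source.is_dir():
        raise FileNotFoundError(f"expected a src/ folder at {source}")
    if target.exists():
        # Refuse to overwrite or nest inside an existing src/ folder
        raise FileExistsError(f"{target} already exists")
    shutil.move(str(source), str(target))
    return target
```

From the directory that contains your SynRD clone, this would be called as `move_src_into_synthesizers("SynRD/synthesizers")`. The explicit existence checks are a design choice: `shutil.move` would otherwise place `src/` inside an already existing `src/` directory.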
If you would like to benchmark with the paper Fruiht2018Naturally, please follow the rpy2 installation instructions below to configure your R-Python interface package.
If you have a mac with an M1 chip, you may have success installing rpy2 via the following:
- Uninstall existing R versions on your machine.
- Install `R-4.2.2-arm64.pkg` from https://cran.r-project.org/bin/macosx/.
- `conda install -n base conda-forge::mamba`
- `mamba install -c conda-forge rpy2`
To run analysis for papers using R, you must ensure that R is installed and that your R_HOME environment variable is set to the R home directory (the path printed by `R RHOME`).
To install R with Anaconda, you may use `conda install r-base r-essentials`.
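Before importing rpy2, it can help to check that `R_HOME` is actually set. The sketch below uses only the Python standard library; the function name `find_r_home` is hypothetical, and the check is deliberately shallow: it confirms that the variable is set and that the path exists, not that the R installation itself works.

```python
import os
from pathlib import Path


def find_r_home(environ=None):
    """Return R_HOME as a Path if it is set and the path exists, else None.

    Hypothetical helper; it does not verify that R itself is usable.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("R_HOME", "").strip()
    if not value:
        return None
    path = Path(value)
    return path if path.exists() else None
```

Calling `find_r_home()` with no arguments reads the current process environment; a return value of `None` suggests the variable still needs to be configured.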
To confirm rpy2 is working as expected, try the following in Python:

    import rpy2.robjects as robjects  # a bare `import rpy2` does not load the robjects submodule
    robjects.r['pi']  # Returns an R object holding the number pi

- Each "paper" in the benchmark is named according to bibtex convention (authorYEARfirstword).
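As an illustration of the naming convention, the following sketch builds a key in the capitalized form that appears in this README (for example, Saw2018Cross). The helper `paper_key` is hypothetical and not part of SynRD. It assumes "first word" means the first alphabetic run of the title, so a hyphenated opening word is cut at the hyphen, and it does not skip leading articles such as "A" or "The". The titles in the usage line are placeholders.

```python
import re


def paper_key(author, year, title):
    """Build an AuthorYEARFirstword-style key (hypothetical helper)."""
    surname = re.sub(r"[^A-Za-z]", "", author)
    words = re.findall(r"[A-Za-z]+", title)
    if not surname or not words:
        raise ValueError("author and title must each contain letters")
    # Capitalize the first letter of the surname, keep the rest unchanged
    surname = surname[0].upper() + surname[1:]
    # First alphabetic run of the title, capitalized
    first_word = words[0].capitalize()
    return f"{surname}{int(year)}{first_word}"
```

For example, `paper_key("Saw", 2018, "Cross-sectional placeholder title")` evaluates to `"Saw2018Cross"`.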
Brief details on how to add a new paper.
- Create a new folder named `authorYEARfirstword`.
- Create a `process.ipynb` notebook as your data playground. Use this to investigate data cleaning/processing/results generation.
- In parallel with the previous step, create an `authorYEARfirstword.py` file, and extend the `Publication()` metaclass with `AuthorYEARFirstword(Publication)`. Add the relevant details (see `meta_classes.py` for notes on what this means). Then, begin to move `findings` over from `process.ipynb` into replicable lambdas in `AuthorYEARFirstword(Publication)`.
- Ensure that `AuthorYEARFirstword(Publication)` has a `FINDINGS` list class attribute. This should consist of `Finding` objects, each of which wraps a `finding_i(self)` lambda in the proper `Finding`, `VisualFinding` or `FigureFinding` metaclass and is added to the list.
- See `Saw2018Cross` for an example of a cleanly implemented `Publication` class.
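The overall shape of such a class might look like the sketch below. Everything here is an assumption for illustration: `Finding` and `Publication` are simplified stand-ins written for this example, not the real SynRD classes from `meta_classes.py`, and `Author2020Example` and its data are made up. The real classes take different arguments, so treat `Saw2018Cross` as the authoritative reference.

```python
class Finding:
    """Simplified stand-in for SynRD's Finding wrapper (not the real class)."""

    def __init__(self, finding_lambda, description=""):
        self.finding_lambda = finding_lambda
        self.description = description

    def run(self):
        # Evaluate the wrapped finding and return its result tuple
        return self.finding_lambda()


class Publication:
    """Simplified stand-in for SynRD's Publication base class (not the real class)."""

    FINDINGS = []

    def __init__(self, rows):
        self.rows = rows


class Author2020Example(Publication):
    """Hypothetical paper, named in the AuthorYEARFirstword style."""

    FINDINGS = []

    def __init__(self, rows):
        super().__init__(rows)
        # Wrap each finding lambda and add it to the list; building a new
        # list avoids mutating the shared class-level attribute
        self.FINDINGS = self.FINDINGS + [
            Finding(self.finding_1_1, description="finding_1_1"),
        ]

    def finding_1_1(self):
        """(Text from the paper would go here.)"""
        n_yes = sum(1 for row in self.rows if row["answer"] == "yes")
        n_no = len(self.rows) - n_yes
        soft_finding = n_yes > n_no
        return ([n_yes, n_no], soft_finding, [n_yes - n_no])
```

With three made-up rows, two of them answering "yes", `Author2020Example(rows).FINDINGS[0].run()` returns `([2, 1], True, [1])`.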
Finding lambdas must follow a particular structure, which should be strictly adhered to. Consider the following example, and note the return values in particular:

    def finding_i_j(self):  # there can be kwargs
        """
        (Text from paper, usually 2 or 3 sentences)
        """
        # often can use a table finding directly or
        # as a starting point to quickly recreate
        # finding
        results = self.table()
        # (pandas stuff happens here to generate
        # the findings)
        return ([values],
                soft_finding,
                [hard_findings])

The finding lambdas can perform essentially any computation necessary, but must return a tuple of:
- A list of values (any values relevant to the soft finding; the list need not be exhaustive)
  `[interest_stem_ninth, interest_stem_eleventh]`
- A `soft_finding` boolean (simply a boolean that reflects the primary inequality/contrast presented in the original paper for this finding)
  `soft_finding = interest_stem_ninth > interest_stem_eleventh`
- A list of hard findings, i.e. values (this could be the difference or set of differences that affected the `soft_finding` inequality)
  `hard_finding = interest_stem_ninth - interest_stem_eleventh`
  `hard_findings = [hard_finding]`
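To make the three return values concrete, here is a small self-contained sketch. It is not taken from any paper in the benchmark: the rows are made-up data, the function `finding_stem_interest` is hypothetical, and it uses plain Python with `statistics.mean` where a real finding would typically use pandas on the publication's dataframe.

```python
from statistics import mean

# Made-up survey rows: interest_stem is 1 if the student reported STEM interest
ROWS = [
    {"grade": 9, "interest_stem": 1},
    {"grade": 9, "interest_stem": 1},
    {"grade": 9, "interest_stem": 0},
    {"grade": 11, "interest_stem": 1},
    {"grade": 11, "interest_stem": 0},
    {"grade": 11, "interest_stem": 0},
    {"grade": 11, "interest_stem": 0},
]


def finding_stem_interest(rows):
    """(Hypothetical finding) STEM interest is higher in ninth than in eleventh grade."""
    # Share of students reporting STEM interest in each grade
    interest_stem_ninth = mean(r["interest_stem"] for r in rows if r["grade"] == 9)
    interest_stem_eleventh = mean(r["interest_stem"] for r in rows if r["grade"] == 11)

    # 1. values relevant to the soft finding
    values = [interest_stem_ninth, interest_stem_eleventh]
    # 2. the primary inequality claimed by the (hypothetical) paper
    soft_finding = interest_stem_ninth > interest_stem_eleventh
    # 3. the difference behind that inequality
    hard_finding = interest_stem_ninth - interest_stem_eleventh
    hard_findings = [hard_finding]
    return (values, soft_finding, hard_findings)
```

Running `finding_stem_interest` on real rows and again on rows from a DP synthesizer, then comparing the two `soft_finding` booleans, is one simple way to picture what a parity check on a single finding compares.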
