This repository was archived by the owner on Sep 10, 2025. It is now read-only.
Open
204 commits
ef62066
Add appointment noshows dataset and modify metrics and privbayes to b…
gmingas Jan 7, 2021
ede6621
Add noshows input file which just resamples the columns using synthpo…
gmingas Jan 7, 2021
76b4381
Add featuretools to requirements file
gmingas Jan 12, 2021
48dfe8a
Add feature importance .py file with skeleton code
gmingas Jan 12, 2021
ad93ad3
Add feature importance target to Makefile
ots22 Jan 12, 2021
b4180a0
Add PATE-GAN to synth-methods
kasra-hosseini Jan 12, 2021
f5ff866
Create an empty directory: libs
kasra-hosseini Jan 12, 2021
8a0ff84
Merge branch 'feature/152-add-noshows-example' into feature/114-pate-gan
kasra-hosseini Jan 12, 2021
8bfc66d
Compute featuretools utility metric (WIP); add relevant parameters to…
ots22 Jan 12, 2021
481b744
Reformatting (black)
ots22 Jan 12, 2021
9832a46
remove __init__ from libs dir
kasra-hosseini Jan 13, 2021
c19a0ab
Calculate feature importance metrics for real and synth data
OscartGiles Jan 13, 2021
c7c25b5
Add dummy variables and out-of-sample prediction to feature importanc…
gmingas Jan 13, 2021
d29f1f9
BUGFIX: change the order of args in confustion matrix (it should be: …
kasra-hosseini Jan 13, 2021
fb9316b
If columns_to_synthesize is specified in the input file, only synthes…
kasra-hosseini Jan 13, 2021
686037b
Similar to PATE-GAN, if columns_to_synthesize is specified, only synt…
kasra-hosseini Jan 13, 2021
d6dd8af
Add an example input file for PATE-GAN
kasra-hosseini Jan 13, 2021
ecce16d
Merge branch 'feature/152-add-noshows-example' into feature/114-pate-gan
kasra-hosseini Jan 13, 2021
2ad20f1
Update .gitignore
kasra-hosseini Jan 13, 2021
db9def3
Update requirements.txt
kasra-hosseini Jan 13, 2021
aabd3ba
Add batch_size, num_teachers, num_iters and learning_rate params to t…
kasra-hosseini Jan 13, 2021
c46cd57
Add MB_SIZE, LR, NITER, NUM_TEACHERS to PateGanSynthesizer, these par…
kasra-hosseini Jan 13, 2021
4343f64
Update README
kasra-hosseini Jan 13, 2021
986cd30
Add compare_features function to compare two ranked feature vectors
kasra-hosseini Jan 14, 2021
aa9bf2d
Add RBO python code
kasra-hosseini Jan 14, 2021
048375a
Add an example that generates an ensemble of run-inputs for PrivBayes
ots22 Jan 15, 2021
0c5b663
Update the paths according to the new changes in synthetic_data_relea…
kasra-hosseini Jan 17, 2021
a995ffd
Fix Makefile targets related to utility/privacy
ots22 Jan 19, 2021
a7c2bb2
Add helper 'run' targets to Makefile
ots22 Jan 19, 2021
d3b202b
Extend PrivBayes example
ots22 Jan 19, 2021
a661141
Disable debugger entry in feature importance
ots22 Jan 19, 2021
7e0ad89
Default to p=0.6 in RBO of feature importance
ots22 Jan 19, 2021
68ba3a0
Allow feature importance function to handle datasets with no id colum…
gmingas Jan 19, 2021
8c86194
Add feature importance metric to polish data example
gmingas Jan 19, 2021
82ff344
Resolve merge conflict
gmingas Jan 19, 2021
97bb8ee
Add complete list of scored features (orig and synth) to feature impo…
ots22 Jan 19, 2021
faba416
Print all PATE-GAN params
kasra-hosseini Jan 19, 2021
3fa1428
Convert columns of type DateTime to ContinuousNumerical
kasra-hosseini Jan 19, 2021
4b79c1a
All columns can be synthetsized now
kasra-hosseini Jan 19, 2021
cc60eb1
Merge branch 'feature/152-add-noshows-example' into feature/114-pate-gan
kasra-hosseini Jan 19, 2021
ce083eb
Merge pull request #46 from alan-turing-institute/feature/114-pate-gan
kasra-hosseini Jan 19, 2021
14f3301
Add random forest seeds script
gmingas Jan 19, 2021
916925d
Change privbays k to 3 in run-input
gmingas Jan 19, 2021
8e1958c
Change printed messages in feature importance
gmingas Jan 19, 2021
a108700
Fix arguments in feature importance call
gmingas Jan 19, 2021
ee41424
Merge pull request #50 from alan-turing-institute/feature/157-rf-seeds
gmingas Jan 19, 2021
fae55be
add adult dataset
kasra-hosseini Jan 20, 2021
9de3da7
Fix to Makefile run dependencies
ots22 Jan 20, 2021
ae8455c
Add feature importance to noshows-resampling
ots22 Jan 20, 2021
17da643
Fix to feature_importance: Use created index when none is specified
ots22 Jan 20, 2021
43e2dee
Add bootstrap as a synthesis method
ots22 Jan 20, 2021
2ca775f
Add noshows bootstrap example
ots22 Jan 20, 2021
b3651f4
PrivBayes example: Use larger k, adjust output
ots22 Jan 20, 2021
420d55e
Add troubleshooting session notebook
ots22 Jan 20, 2021
5bb298d
Update README.md
ots22 Jan 20, 2021
c32978f
Update README.md
ots22 Jan 20, 2021
ec258c3
First attempt at input for PrivBayes on the adult dataset
ots22 Jan 20, 2021
3c646b8
Update privbayes-adult example
ots22 Jan 20, 2021
fe67be8
Add max_depth parameter to feature_importance
ots22 Jan 20, 2021
daee07b
Set max_depth parameter (feature_importance)
ots22 Jan 20, 2021
97787c2
Cast random_seed to int
ots22 Jan 21, 2021
fb33131
Add adult-bootstrap example
ots22 Jan 21, 2021
f884caa
Fix to adult-bootstrap example
ots22 Jan 21, 2021
65233ca
Feature importance: Add one random permutation and lower bound
ots22 Jan 21, 2021
60752b3
Add resampling example for 'adult' dataset
ots22 Jan 21, 2021
a988d38
Ensemble of adult resampling examples
ots22 Jan 21, 2021
3a2e837
Fix to adult resampling example
ots22 Jan 21, 2021
01f7404
Fix to adult resampling example
ots22 Jan 21, 2021
0e1ccbd
Add comment in rf random seeds script
gmingas Jan 22, 2021
d639adc
Add a new input file for PATE-GAN, update requirements file
Jan 22, 2021
43fe2e9
Add permutation feature importance
gmingas Jan 22, 2021
ee9b29d
Merge branch 'feature/152-add-noshows-example' of https://github.com/…
gmingas Jan 22, 2021
df91013
Remove plotting from privbayes adult ensemble example
ots22 Jan 22, 2021
2190bc9
Add postprocessing and plot script for feature importance examples
ots22 Jan 22, 2021
ea27b8d
Add more parameters for primitives, dummies, return auc, add dropna
gmingas Jan 22, 2021
79e98e5
Update run-input files
gmingas Jan 22, 2021
6f09548
Fix adult resampling run-input
gmingas Jan 22, 2021
c334383
Add adult bootstrap example
ots22 Jan 22, 2021
3b02c65
Update adult ensemble examples
ots22 Jan 22, 2021
65e8772
Turn off classifiers in adult run input files
ots22 Jan 22, 2021
fce68cc
Update feature importance imput parameters for pategan adult
ots22 Jan 22, 2021
8de9c06
Add pategan adult ensemble example
ots22 Jan 22, 2021
c761a72
Fix to feature importance
ots22 Jan 23, 2021
b2628d5
PrivBayes: Add option to save description.json or not
ots22 Jan 25, 2021
9c725d4
Update privbayes example 3
ots22 Jan 25, 2021
8d252d3
Add 'subsample' synthesis method
ots22 Jan 25, 2021
ae5a895
Add ensemble of subsample runs for 'adult' dataset
ots22 Jan 25, 2021
2a60c22
Fix adult subsample example script
ots22 Jan 25, 2021
05b46e8
Done
HarrisonWilde Jan 26, 2021
ca5d3d3
Add shapley feature importance
gmingas Jan 27, 2021
e7be61f
Done
HarrisonWilde Jan 26, 2021
317a331
Add commented interventional option to shapley computation and shap t…
gmingas Jan 27, 2021
35adafc
Merge branch 'feature/152-add-noshows-example' of https://github.com/…
gmingas Jan 27, 2021
591081f
Pass empty agg/trans primitives as None for featuretools to use the c…
ots22 Jan 27, 2021
909bd26
Adjust privbayes-adult-ensemble example
ots22 Jan 27, 2021
da29129
Adult resampling ensemble example: use complete set of featuretools p…
ots22 Jan 28, 2021
1c2b636
Feature importance: Computing SHAP values optional with 'compute_shap…
ots22 Jan 28, 2021
e0065ed
Add Framingham Privbayes example and ensemble
ots22 Jan 28, 2021
ecf78e8
Fix Framingham ensemble example
ots22 Jan 28, 2021
e8303e0
Fix Framingham ensemble example
ots22 Jan 28, 2021
8b34052
Additional values of epsilon for Framingham ensemble example
ots22 Jan 28, 2021
a5f7982
Add subsample ensemble, Framingham
ots22 Jan 28, 2021
fbbc234
Adding in code to clean the data
HarrisonWilde Jan 28, 2021
9bf9ab7
added cleaning code to datasets-raw with raw data
HarrisonWilde Jan 29, 2021
2391dd0
Add cross-AUC to feature importance metric (AUC when training on synt…
gmingas Jan 29, 2021
49df48a
Merge branch 'feature/152-add-noshows-example' into adding_framingham
ots22 Jan 29, 2021
59a1dbb
Merge pull request #53 from HarrisonWilde/adding_framingham
ots22 Jan 29, 2021
7f3a430
Additional figures from the feature-ranking examples
ots22 Jan 31, 2021
6d53e35
Emit json from provenance.py
ots22 Jan 31, 2021
fcf084d
Add provenance information to each Makefile target; rename parameter …
ots22 Jan 31, 2021
298a823
Add helper script to rename input parameter file in existing output
ots22 Jan 31, 2021
fd5daf2
Add parameter to allow skipping feature engineering step
gmingas Feb 1, 2021
b88a20b
Add weighted F1 score in feature importance
gmingas Feb 1, 2021
a18f562
Add runfile for framingham with reasonable-looking feature engineerin…
gmingas Feb 1, 2021
a17813a
run azure vms
OscartGiles Feb 2, 2021
1d8252c
Update infrastruture readme
OscartGiles Feb 2, 2021
19291e3
Update adult ensemble script
gmingas Feb 2, 2021
e116955
Update framingham ensemble script
gmingas Feb 2, 2021
ba2b315
Add updated noshows ensemble script
gmingas Feb 2, 2021
9f60e21
Update adult bootstrap, subsample and resample ensembles scripts
gmingas Feb 2, 2021
f4d1002
Fix missing text in framingham ensemble
gmingas Feb 2, 2021
207124e
Add new noshows ensemble script
gmingas Feb 2, 2021
e16575b
List multiple vm names
OscartGiles Feb 2, 2021
0faef92
Add framingham and noshows ensemble scripts and correct adult script
gmingas Feb 2, 2021
89e2cd5
Update dependencies on VMs
OscartGiles Feb 2, 2021
691a117
Update requirements.txt
OscartGiles Feb 2, 2021
b140932
Allow to pass user pool parameter for DataSynthesizer in run-input file
gmingas Feb 2, 2021
e94c75c
Update framingham script
gmingas Feb 2, 2021
6186020
Final version of ensemble scripts for framingham before runs
gmingas Feb 2, 2021
0d56d54
Update output files in scripts
gmingas Feb 2, 2021
f811724
Correct output directories for ensemble scripts
gmingas Feb 2, 2021
7df100a
Fix epsilons list
gmingas Feb 2, 2021
8213faa
Correct adult subsample script
gmingas Feb 3, 2021
2fe209f
Update requirements.txt
ots22 Feb 4, 2021
2d684f6
Merge branch 'feature/152-add-noshows-example' into feature/autoazurevm
OscartGiles Feb 4, 2021
b087da0
Add exception to catch error in AUC calcuation when very few samples …
gmingas Feb 4, 2021
aeda3a9
Merge branch 'feature/152-add-noshows-example' of https://github.com/…
gmingas Feb 4, 2021
bc7b071
Catch all AUC-related exceptions
gmingas Feb 4, 2021
272d869
Add script to check PrivBayes auto-selected k values
martintoreilly Feb 5, 2021
05b9201
Set AUC to nan on exception
gmingas Feb 5, 2021
396bf50
Drop nan rows instead of columns on adult subsample script
gmingas Feb 5, 2021
87c1540
Drop nan rows instead of columns on adult resampling and privbayes sc…
gmingas Feb 5, 2021
7ea1c42
Merge branch 'feature/152-add-noshows-example' of https://github.com/…
gmingas Feb 5, 2021
ce97bc8
Add cosine_similarity to feature_importance
Feb 8, 2021
9724453
Merge branch 'feature/152-add-noshows-example' into feature/autoazurevm
OscartGiles Feb 19, 2021
e6f6b94
Refactor VM deployment to allow teardown of subsets of VMs
OscartGiles Feb 19, 2021
822e49b
Working with all dependencies installed
OscartGiles Feb 19, 2021
ab66fd8
Update readme
OscartGiles Feb 19, 2021
94de8ae
Format with black
OscartGiles Feb 19, 2021
0d8bc26
Fix bugs caused by dropping columns in the featuretools pipeline and …
gmingas Mar 1, 2021
f1073ed
Make two RF parameters configurable through run file
gmingas Mar 2, 2021
132290c
Fix typo in feature_importance.py
gmingas Mar 2, 2021
2f4f8ad
Add calculation of L2, cosine and KL for randomly permuted feature sc…
gmingas Mar 2, 2021
8b64edf
Normalise vectors before calculating L2 and other metrics
gmingas Mar 2, 2021
44d461a
Remove ipdb breakpoint from feature_importance.py
gmingas Mar 3, 2021
2ed16e0
Merge pull request #54 from alan-turing-institute/feature/autoazurevm
OscartGiles Mar 3, 2021
6a9a7ee
Merge pull request #56 from alan-turing-institute/feature/improve-l2-…
gmingas Mar 3, 2021
d45b76d
Allow try to catch all eceptions when calculating AUC
gmingas Mar 9, 2021
7e6980a
Add first version of Household poverty data generation code
gmingas Mar 24, 2021
d4d19c4
Fix bugs with dropped columns and AUC and modify the parameters passe…
gmingas Mar 24, 2021
154aaf1
Add ensemble scripts for household poverty dataset
gmingas Mar 24, 2021
ed5b1df
Prevent resampling the Id column in the household dataset ensemble to…
gmingas Mar 25, 2021
ee1a17b
Fix AUC and dropped column bugs in feature importance for household d…
gmingas Mar 25, 2021
a4a3080
Make AUC except general
gmingas Mar 26, 2021
b9e77ef
Adjust household example run input
gmingas Mar 26, 2021
f942f8b
Reduce size of household dataset to accelerate synthesis and adapt sc…
gmingas Apr 1, 2021
e177a3f
Update run-input example
gmingas Apr 1, 2021
042b190
Add dataset README.md file with instructions to download
gmingas Apr 6, 2021
f42aacd
Merge pull request #58 from alan-turing-institute/feature/new-datasets
gmingas Apr 8, 2021
89a01c4
Add correlation rank similarity metric
gmingas Apr 5, 2021
71e4c6e
Clean up code, move correlation computation outside of compare_featur…
gmingas Apr 6, 2021
b04dfae
Remove extrapolated v1 metric calculations
gmingas Apr 8, 2021
69e301d
Add pulp to environment
gmingas Apr 8, 2021
2777bb8
Make modifications to ensemble files including creating new versions …
gmingas Apr 8, 2021
01e8029
Fix bug in json creation for small household dataset
gmingas Apr 9, 2021
15307e0
Merge branch 'develop' into develop-paper
OscartGiles Apr 12, 2021
d5f8d9d
Add run input file for Polish dataset with Privbayes and some changes…
gmingas Jun 15, 2021
b41f8b6
Adds ensemble scripts for polish data and fixes some more issues with…
gmingas Jun 15, 2021
33f1f8b
Removes some primitives that do not have meaning or utility in this d…
gmingas Jun 15, 2021
6d5d767
Handle exception caused by all-zero shapley values
gmingas Jun 16, 2021
1cbcd82
Merge branch 'develop-paper' into feature/dataset-modifications
OscartGiles Jun 24, 2021
6df3ab4
Fix docker-image build
OscartGiles Jun 24, 2021
8d53651
Add data generation code for artificial data, along with ensemble scr…
gmingas Jun 24, 2021
b2e73ad
Add user to sudo
OscartGiles Jun 24, 2021
60cf9ae
Add fourth artificial dataset
gmingas Jun 24, 2021
7fed707
Remove user - leave docker as root
OscartGiles Jun 24, 2021
0ea8607
Mount directory to docker image before running to use latest version
OscartGiles Jul 8, 2021
5d6c926
Update volume mount
OscartGiles Jul 8, 2021
d0df1fd
Remove interactive flag
OscartGiles Jul 8, 2021
2bb9c7c
Check volume mounted
OscartGiles Jul 8, 2021
fa2ac2b
this shouldn't fail
OscartGiles Jul 8, 2021
7f17034
Makefile no longer tries to download or preprocess data. Needs to be …
OscartGiles Jul 8, 2021
280d841
Add more artificial examples and fix number of generated rows to 10k …
gmingas Jul 9, 2021
c1fff08
Fix .json metadata for artificial example 6 and provide full list of…
gmingas Jul 10, 2021
c46a12d
Remove ipdb line
gmingas Jul 10, 2021
06523ab
Merge pull request #64 from alan-turing-institute/ogiles/fix-docker-b…
OscartGiles Jul 14, 2021
f6484a2
Merge branch 'feature/dataset-modifications' into feature/artificial-…
OscartGiles Jul 14, 2021
cc29421
Merge pull request #66 from alan-turing-institute/feature/artificial-…
OscartGiles Jul 14, 2021
7efa48d
add privgem as dependency
OscartGiles Jul 14, 2021
3cb3bca
Update requirements file
OscartGiles Jul 21, 2021
8a17d3b
Remove $(AE_DEIDENTIFIED_DATA) $(HP_DATA_CLEAN) from makefile
OscartGiles Jul 21, 2021
d255e09
Set privbayes-adult to encode categorical variables
OscartGiles Jul 21, 2021
b4ad25f
Download AE data
OscartGiles Jul 21, 2021
99896a5
Merge pull request #63 from alan-turing-institute/feature/dataset-mod…
OscartGiles Jul 26, 2021
8 changes: 3 additions & 5 deletions .github/workflows/run-synth-pipeline.yml
@@ -16,9 +16,7 @@ jobs:
with:
username: ${{ secrets.DOCKER_USERNAME }}
password: ${{ secrets.DOCKER_PAC }}
-
name: Build pipeline
run: docker build -f dockerfiles/quipp-dev.Dockerfile -t turinginst/quipp-env:latest .
-
name: Run pipeline
run: docker run turinginst/quipp-env:latest make
name: Run pipeline with privbayes-adult

run: docker run -v $GITHUB_WORKSPACE:/quipp-pipeline --workdir /quipp-pipeline turinginst/quipp-env:base make run-privbayes-adult
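
Reviewer note: the workflow now calls `make run-privbayes-adult`, which matches the `run-%` pattern rule added to the Makefile in this PR. The sketch below only illustrates how that target name maps onto the five prerequisite files listed in that rule. The function name is hypothetical and this is not code from the repository.

```python
def run_target_prerequisites(target):
    """Expand a `run-<name>` target into the files listed by the `run-%` rule."""
    prefix = "run-"
    if not target.startswith(prefix):
        raise ValueError(f"not a run target: {target!r}")
    stem = target[len(prefix):]  # e.g. "privbayes-adult"
    # File names copied from the `run-%` rule in the Makefile diff.
    outputs = [
        "synthetic_data_1.csv",
        "utility_correlations.json",
        "disclosure_risk.json",
        "utility_diff.json",
        "utility_feature_importance.json",
    ]
    return [f"synth-output/{stem}/{name}" for name in outputs]


for path in run_target_prerequisites("run-privbayes-adult"):
    print(path)
```

Each of those files is in turn built from `run-inputs/privbayes-adult.json` by the rules shown in the Makefile diff.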
89 changes: 75 additions & 14 deletions Makefile
@@ -1,7 +1,6 @@
## --- echo commands (for debugging)
## SHELL = sh -xv


##-------------------------------------
## Set up main path variables
##-------------------------------------
@@ -22,25 +21,46 @@ SYNTH_OUTPUTS_PREFIX = $(addprefix synth-output/,$(RUN_INPUTS_BASE_PREFIX))
SYNTH_OUTPUTS_CSV = $(addsuffix /synthetic_data_1.csv,$(SYNTH_OUTPUTS_PREFIX))

## Construct a list of .json file names for each utility and privacy metric
SYNTH_OUTPUTS_PRIV_DISCL_RISK = $(addsuffix /privacy_disclosure_risk.json,$(SYNTH_OUTPUTS_PREFIX))
SYNTH_OUTPUTS_UTIL_CLASS = $(addsuffix /utility_classifiers.json,$(SYNTH_OUTPUTS_PREFIX))
SYNTH_OUTPUTS_PRIV_DISCL_RISK = $(addsuffix /disclosure_risk.json,$(SYNTH_OUTPUTS_PREFIX))
SYNTH_OUTPUTS_UTIL_CLASS = $(addsuffix /utility_diff.json,$(SYNTH_OUTPUTS_PREFIX))
SYNTH_OUTPUTS_UTIL_CORR = $(addsuffix /utility_correlations.json,$(SYNTH_OUTPUTS_PREFIX))
SYNTH_OUTPUTS_UTIL_FEATURE_IMPORTANCE = $(addsuffix /utility_feature_importance.json,$(SYNTH_OUTPUTS_PREFIX))

.PHONY: all all-synthetic generated-data clean

all: $(SYNTH_OUTPUTS_PRIV_DISCL_RISK) $(SYNTH_OUTPUTS_UTIL_CLASS) $(SYNTH_OUTPUTS_UTIL_CORR)
all: $(SYNTH_OUTPUTS_PRIV_DISCL_RISK) $(SYNTH_OUTPUTS_UTIL_CLASS) $(SYNTH_OUTPUTS_UTIL_CORR) $(SYNTH_OUTPUTS_UTIL_FEATURE_IMPORTANCE)

all-synthetic: $(SYNTH_OUTPUTS_CSV)


##-------------------------------------
## Add provenance information to output
##-------------------------------------

PROVENANCE_DEF = provenance() {\
git_result=$$(python provenance.py | jq '{commit, local_modifications}') ; \
( jq ". += {git: $$git_result}" $$1 > $${1}.tmp ) && mv $${1}.tmp $$1 || \
echo "Warning: No provenance could be recorded for this target" && rm -f $${1}.tmp ; \
}
ADD_PROVENANCE = $(PROVENANCE_DEF) && provenance


##-------------------------------------
## Generate input data
##-------------------------------------

## set data file paths
AE_DEIDENTIFIED_DATA = generator-outputs/odi-nhs-ae/hospital_ae_data_deidentify.csv generator-outputs/odi-nhs-ae/hospital_ae_data_deidentify.json
LONDON_POSTCODES = generators/odi-nhs-ae/data/London\ postcodes.csv
generated-data: $(AE_DEIDENTIFIED_DATA)
HP_DATA_CLEAN = generator-outputs/household_poverty/train_cleaned.csv generator-outputs/household_poverty/train_cleaned.json
ARTIFICIAL_DATA_1 = generator-outputs/artificial/artificial_1.csv generator-outputs/artificial/artificial_1.json
ARTIFICIAL_DATA_2 = generator-outputs/artificial/artificial_2.csv generator-outputs/artificial/artificial_2.json
ARTIFICIAL_DATA_3 = generator-outputs/artificial/artificial_3.csv generator-outputs/artificial/artificial_3.json
ARTIFICIAL_DATA_4 = generator-outputs/artificial/artificial_4.csv generator-outputs/artificial/artificial_4.json
ARTIFICIAL_DATA_5 = generator-outputs/artificial/artificial_5.csv generator-outputs/artificial/artificial_5.json
ARTIFICIAL_DATA_6 = generator-outputs/artificial/artificial_6.csv generator-outputs/artificial/artificial_6.json
ARTIFICIAL_DATA_7 = generator-outputs/artificial/artificial_7.csv generator-outputs/artificial/artificial_7.json
generated-data: $(AE_DEIDENTIFIED_DATA) $(HP_DATA_CLEAN) $(ARTIFICIAL_DATA_1) $(ARTIFICIAL_DATA_2) $(ARTIFICIAL_DATA_3) $(ARTIFICIAL_DATA_4) $(ARTIFICIAL_DATA_5) $(ARTIFICIAL_DATA_6) $(ARTIFICIAL_DATA_7)

# download the London Postcodes dataset used by the A&E generated
# dataset (this is about 133 MB)
@@ -54,21 +74,36 @@ $(LONDON_POSTCODES):
# its own rule
$(AE_DEIDENTIFIED_DATA) &: $(LONDON_POSTCODES)
mkdir -p generator-outputs/odi-nhs-ae/ && \
mkdir -p generator-outputs/household_poverty/ && \
cd generator-outputs/odi-nhs-ae/ && \
$(PYTHON) $(QUIPP_ROOT)/generators/odi-nhs-ae/generate.py && \
$(PYTHON) $(QUIPP_ROOT)/generators/odi-nhs-ae/deidentify.py

# pre-process the Household Poverty dataset
$(HP_DATA_CLEAN):
mkdir -p generator-outputs/household_poverty/ && \
cd generator-outputs/household_poverty/ && \
$(PYTHON) $(QUIPP_ROOT)/generators/household_poverty/clean.py

# generate the seven artificial datasets
$(ARTIFICIAL_DATA_1) $(ARTIFICIAL_DATA_2) $(ARTIFICIAL_DATA_3) $(ARTIFICIAL_DATA_4) $(ARTIFICIAL_DATA_5) $(ARTIFICIAL_DATA_6) $(ARTIFICIAL_DATA_7):
mkdir -p generator-outputs/artificial/ && \
cd generator-outputs/artificial/ && \
$(PYTHON) $(QUIPP_ROOT)/generators/artificial/generate.py


##-------------------------------------
## Generate synthetic data
##-------------------------------------

## synthesize data - this rule also builds "synth-output/%/data_description.json"
$(SYNTH_OUTPUTS_CSV) : \
synth-output/%/synthetic_data_1.csv : run-inputs/%.json $(AE_DEIDENTIFIED_DATA)
mkdir -p $$(dirname $@) && \
cp $< $$(dirname $@) && \
python synthesize.py -i $< -o $$(dirname $@)
synth-output/%/synthetic_data_1.csv : run-inputs/%.json $(AE_DEIDENTIFIED_DATA) $(ARTIFICIAL_DATA_1) $(ARTIFICIAL_DATA_2) $(ARTIFICIAL_DATA_3) $(ARTIFICIAL_DATA_4) $(ARTIFICIAL_DATA_5) $(ARTIFICIAL_DATA_6) $(ARTIFICIAL_DATA_7)
outdir=$$(dirname $@) && \
mkdir -p $$outdir && \
cp $< $${outdir}/input.json && \
$(ADD_PROVENANCE) $${outdir}/input.json && \
python synthesize.py -i $< -o $$outdir


##-------------------------------------
@@ -77,19 +112,45 @@ synth-output/%/synthetic_data_1.csv : run-inputs/%.json $(AE_DEIDENTIFIED_DATA)

## compute privacy and utility metrics
$(SYNTH_OUTPUTS_PRIV_DISCL_RISK) : \
synth-output/%/privacy_disclosure_risk.json : \
synth-output/%/disclosure_risk.json : \
run-inputs/%.json synth-output/%/synthetic_data_1.csv
python metrics/privacy-metrics/disclosure_risk.py -i $< -o $$(dirname $@)
python metrics/privacy-metrics/disclosure_risk.py -i $< -o $$(dirname $@) &&\
$(ADD_PROVENANCE) $@

$(SYNTH_OUTPUTS_UTIL_CLASS) : \
synth-output/%/utility_classifiers.json : \
synth-output/%/utility_diff.json : \
run-inputs/%.json synth-output/%/synthetic_data_1.csv
python metrics/utility-metrics/classifiers.py -i $< -o $$(dirname $@)
python metrics/utility-metrics/classifiers.py -i $< -o $$(dirname $@) &&\
$(ADD_PROVENANCE) $@

$(SYNTH_OUTPUTS_UTIL_CORR) : \
synth-output/%/utility_correlations.json : \
run-inputs/%.json synth-output/%/synthetic_data_1.csv
python metrics/utility-metrics/correlations.py -i $< -o $$(dirname $@)
python metrics/utility-metrics/correlations.py -i $< -o $$(dirname $@) &&\
$(ADD_PROVENANCE) $@

$(SYNTH_OUTPUTS_UTIL_FEATURE_IMPORTANCE) : \
synth-output/%/utility_feature_importance.json : \
run-inputs/%.json synth-output/%/synthetic_data_1.csv
python metrics/utility-metrics/feature_importance.py -i $< -o $$(dirname $@) &&\
$(ADD_PROVENANCE) $@


##-------------------------------------
## Helper targets for individual inputs
##-------------------------------------

## make run-example
##
## produces synthetic data and metrics from run-inputs/example.json with output in synth-output/example/

run-% :\
synth-output/%/synthetic_data_1.csv\
synth-output/%/utility_correlations.json\
synth-output/%/disclosure_risk.json\
synth-output/%/utility_diff.json\
synth-output/%/utility_feature_importance.json\
;


##-------------------------------------
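
Reviewer note: the `provenance` shell function in the Makefile diff pipes `provenance.py` through `jq` to keep the `commit` and `local_modifications` keys, merges them under a `git` key into the target JSON file, and writes via a `.tmp` file before moving it into place. A minimal Python sketch of the same merge follows, assuming the target file holds a JSON object. The function name and the sample values are hypothetical, and the real output of `provenance.py` is not shown in this diff.

```python
import json
import os
import tempfile


def add_provenance(json_path, provenance):
    """Merge a `git` entry into the JSON object stored at json_path."""
    # Keep only the two keys the Makefile selects with jq;
    # a missing key becomes null, as it would with jq.
    git_result = {
        key: provenance.get(key) for key in ("commit", "local_modifications")
    }
    with open(json_path) as f:
        data = json.load(f)
    data["git"] = git_result  # same effect as jq's `. += {git: ...}`
    # Write to a temporary file first, then replace the original.
    tmp_path = json_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp_path, json_path)
    return data


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "input.json")
    with open(target, "w") as f:
        json.dump({"enabled": True}, f)
    # Hypothetical provenance values, for illustration only.
    print(add_provenance(target, {"commit": "0000000", "local_modifications": False}))
```

Unlike the shell version, this sketch raises on failure instead of printing a warning and carrying on.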
19 changes: 19 additions & 0 deletions README.md
@@ -99,6 +99,25 @@ environmental variable `SGFROOT` to point to this location. That is, in bash,
We use the PrivBayes implementation within the DataSynthesizer fork found [here](https://github.com/gmingas/DataSynthesizer).
In order to install it, clone the above repository locally, go to its root directory and run `pip install .`

#### Forked synthetic_data_release

We use the PATE-GAN implementation within the `synthetic_data_release` fork found [here](https://github.com/kasra-hosseini/synthetic_data_release).
In order to use PATE-GAN in QUIPP:
1. Create a new directory:

```bash
cd /path/to/QUIPP-pipeline
mkdir libs
```

2. Clone the above repository inside the `libs` directory created in the previous step:

```bash
# from /path/to/QUIPP-pipeline
cd libs
git clone https://github.com/kasra-hosseini/synthetic_data_release.git
```

## Top-level directory contents

The top-level directory structure mirrors the data pipeline.
19 changes: 19 additions & 0 deletions datasets-raw/framingham/README.md
@@ -0,0 +1,19 @@
"Framingham Heart Study" dataset, found on Kaggle (https://www.kaggle.com/amanajmera1/framingham-heart-study-dataset),
used under [CC0 1.0 Universal Public Domain Dedication](https://creativecommons.org/publicdomain/zero/1.0/), modified by filling NA values with the column means.

The dataset is not used for commercial purposes.

Note: We might want to replace this with the details from https://biolincc.nhlbi.nih.gov/studies/framcohort/, which I believe is the true original source of the data. Due to time constraints I am using the version freely available on Kaggle rather than waiting for approval.

Instructions to clean (also included in `datasets-raw/framingham/clean.py`):

```{python}
import pandas as pd

raw_df = pd.read_csv("framingham.csv")

df = raw_df.fillna(raw_df.mean())
df[["cigsPerDay", "age", "education", "BPMeds"]] = df[["cigsPerDay", "age", "education", "BPMeds"]].astype(int)

df.to_csv("../../datasets/framinghamframingham_cleaned.csv", index=False)
```
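
Reviewer note: the cleaning step fills missing values with column means and then casts four columns to `int`, so a fractional mean is truncated in those columns. The sketch below shows this on a toy frame. The values are invented, `totChol` stands in for any column left as float, and `numeric_only=True` is an assumption added so that the mean ignores non-numeric columns (the script calls `raw_df.mean()` without it).

```python
import pandas as pd

# Columns the cleaning script casts to int.
INT_COLUMNS = ["cigsPerDay", "age", "education", "BPMeds"]


def clean(raw_df):
    """Fill NA values with column means, then cast selected columns to int."""
    df = raw_df.fillna(raw_df.mean(numeric_only=True))
    df[INT_COLUMNS] = df[INT_COLUMNS].astype(int)  # truncates fractional means
    return df


# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "cigsPerDay": [0.0, 5.0, None],   # mean 2.5, stored as 2 after the cast
        "age": [40, 50, 60],
        "education": [1.0, None, 4.0],    # mean 2.5, stored as 2 after the cast
        "BPMeds": [0.0, 0.0, None],
        "totChol": [200.0, None, 250.0],  # stays float, filled with 225.0
    }
)
cleaned = clean(toy)
print(cleaned)
```

For a 0/1 column such as `BPMeds`, any mean below 1 becomes 0 after the cast, so imputed rows land in the 0 category.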
8 changes: 8 additions & 0 deletions datasets-raw/framingham/clean.py
@@ -0,0 +1,8 @@
import pandas as pd

raw_df = pd.read_csv("framingham.csv")

df = raw_df.fillna(raw_df.mean())
df[["cigsPerDay", "age", "education", "BPMeds"]] = df[["cigsPerDay", "age", "education", "BPMeds"]].astype(int)

df.to_csv("../../datasets/framinghamframingham_cleaned.csv", index=False)
1 change: 1 addition & 0 deletions datasets-raw/framingham/framingham.csv

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions datasets/README.md
@@ -4,3 +4,4 @@ These datasets are in the [format required by the pipeline](../README.md#data-fo

- [`generated/odi_nhs_ae`](generated/odi_nhs_ae): mock A&E dataset, generated with the scripts [here](../generators/odi-nhs-ae)
- [`polish_data_2011`](polish_data_2011): A prepared version of the [Social Diagnosis](http://www.diagnoza.com/index-en.html) project data, Council for Social Monitoring 2011. This is included in synthpop and extracted with [this script](polish_data_2011/data_prep.R).
- [`adult_dataset`](adult_dataset): https://archive.ics.uci.edu/ml/datasets/adult