Function to generate synthetic data with similar distributional properties to a real dataset #182

adamkucharski · 2025-02-04T12:20:45Z

A topic that has come up in discussions with applied partners is the value of being able to generate synthetic data with similar properties to a real (but sensitive, so not shareable) dataset, to allow external groups to develop and test methods.

{simulist} already has the ability to generate simulated data from defined distributions, so addressing the above need would require us to define a new function that could:

- Estimate key distributions in a line list (e.g. key delays, perhaps with a simple omission of recent data points to avoid truncation/censoring issues; proportion with key outcomes; demographic distribution; secondary case distribution and contact distribution)
- Output an object that could then be used as an input to the existing simulist pipeline.
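The "simple omission of recent data points" idea could be sketched roughly as below. This is only an illustration in base R: the toy line list, the `estimate_delay()` helper and the 14-day buffer are all assumptions made for the example, not part of {simulist}, and dropping recent onsets is a crude guard against right truncation rather than a proper correction.

```r
# Hypothetical toy line list standing in for the sensitive data
set.seed(1)
n <- 500
date_onset <- as.Date("2024-01-01") + sample(0:99, n, replace = TRUE)
delay_true <- round(rlnorm(n, meanlog = 1.5, sdlog = 0.5))
linelist <- data.frame(
  date_onset = date_onset,
  date_admission = date_onset + delay_true
)

# Admissions after the data cut-off have not been observed yet
cutoff <- as.Date("2024-04-09")
linelist$date_admission[linelist$date_admission > cutoff] <- NA

# Hypothetical helper: summarise a delay, ignoring the most recent onsets
estimate_delay <- function(data, from, to, cutoff, buffer_days = 14) {
  keep <- data[[from]] <= cutoff - buffer_days & !is.na(data[[to]])
  delay <- as.numeric(data[[to]][keep] - data[[from]][keep])
  c(n = length(delay), mean = mean(delay), sd = sd(delay))
}

onset_to_admission <- estimate_delay(
  linelist, "date_onset", "date_admission", cutoff
)
```

The summary could then be turned into a fitted delay distribution; how large the buffer should be depends on how long the delays in the real data tend to be.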

As an example, this contact tracing analysis uses synthetic-but-realistic marginal distributions for contacts in different settings, rather than publishing the full (and hence more sensitive) joint dataset: https://github.com/adamkucharski/2020-cov-tracing

joshwlambert · 2025-02-04T14:03:23Z

Thanks @adamkucharski, I like the idea, but I think the workflow as a whole is outside the scope of {simulist}.

Could you provide an overview of how you see the workflow, with pseudo-code if possible?

Then we can sketch out where {simulist} can be enhanced and where other functions/packages are needed. If you also have any datasets that could be used to test this workflow, please share links to them.

adamkucharski · 2025-02-20T08:43:41Z

Some example pseudo-code below, taking a 'true dataset', fitting marginal distributions, then resimulating with these. It would be easier if limited to distributions, as other columns could obviously take lots of forms.

# Simulate 'real data'
linelist <- sim_linelist()

# Define columns to match
match_cols <- list(c("age","integer"),
                   c("case_type","category")
)

# Extract relevant distributions and store...

# Add set up and storage code

for (ii in seq_along(match_cols)) {
  col_ii <- match_cols[[ii]][1] # Column name
  type_ii <- match_cols[[ii]][2] # Column type

  distn <- linelist[[col_ii]] # Get values (base R, so no dplyr needed)
  
  if(type_ii == "category"){
    # ...
  }
  
  if(type_ii == "integer"){
    # ...
  }
  
}

# Define delays to match
match_delays <- list(c("date_onset","date_admission"),
                     c("date_onset","date_outcome")
)

# Fit relevant distributions and store...
fit_onset_admission <- NULL
fit_onset_outcome <- NULL

# Simulate synthetic data with matched properties

# Store under a new name so the 'real' linelist is not overwritten
synthetic_linelist <- sim_linelist(
  population_age = fit_age,
  case_type_probs = fit_case_type,
  onset_to_hosp = fit_onset_admission,
  onset_to_death = fit_onset_outcome
)
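As a rough illustration of how the elided "..." steps above might be filled in, here is a minimal base R sketch. The helper names (`fit_category()`, `fit_age_bands()`, `fit_lognormal()`), the age bands, the toy data and the choice of a lognormal delay are all assumptions made for the example. Whether the returned objects match the input formats `sim_linelist()` expects would need checking against the {simulist} documentation.

```r
# Hypothetical 'real' data; in practice this is the sensitive line list
set.seed(42)
real <- data.frame(
  age = sample(1:90, 200, replace = TRUE),
  case_type = sample(
    c("suspected", "probable", "confirmed"), 200,
    replace = TRUE, prob = c(0.2, 0.3, 0.5)
  ),
  onset_to_admission = rlnorm(200, meanlog = 1.5, sdlog = 0.5)
)

# Categorical column: empirical proportions as a named numeric vector
fit_category <- function(x) {
  props <- prop.table(table(x))
  setNames(as.numeric(props), names(props))
}

# Integer column (age): proportion in each band, labelled by its lower limit
fit_age_bands <- function(x, limits = c(1, 20, 40, 60)) {
  band <- cut(x, breaks = c(limits, Inf), right = FALSE)
  data.frame(
    age_limit = limits,
    proportion = as.numeric(prop.table(table(band)))
  )
}

# Delay: lognormal maximum-likelihood fit, returned as a sampling function
fit_lognormal <- function(x) {
  x <- x[!is.na(x) & x > 0]
  meanlog <- mean(log(x))
  sdlog <- sqrt(mean((log(x) - meanlog)^2))
  function(n) rlnorm(n, meanlog = meanlog, sdlog = sdlog)
}

fit_case_type <- fit_category(real$case_type)
fit_age <- fit_age_bands(real$age)
fit_onset_admission <- fit_lognormal(real$onset_to_admission)
```

Only the fitted summaries (proportions and two lognormal parameters) leave this step, which is what would make the result shareable when the raw rows are not.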

joshwlambert · 2025-02-20T15:55:20Z

Thanks for the overview, it would definitely be really neat to be able to seamlessly go from real line list data to synthetic line list data in these steps, with the final step calling sim_linelist().

However, I don't think this pipeline fits into the scope of the {simulist} package. It would be good to post this on the Epiverse-TRACE discussion board to get others' thoughts and see what the right format is for such a pipeline (e.g. how-to script, blog post, R package, etc.). Let me know if you're happy for me to transfer this issue there.

There are some added complications which would need working out in whatever form the pipeline takes. sim_linelist() produces a fixed set of line list columns, and therefore the output columns would be independent of the real line list being mocked. If we wanted the synthetic line list to match the columns of the original line list, this would require extra steps. Additionally, depending on how closely we would want the synthetic line list to match the original, we'd need to add some conditioning to the simulation (either internally to {simulist} or externally).
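One possible shape for the "extra steps" needed to match columns is sketched below in base R. The `match_columns()` helper and the toy column names are hypothetical and only illustrate the idea: columns with no simulated equivalent are left as NA rather than invented, and the conditioning problem is not addressed.

```r
# Hypothetical helper: reshape a synthetic line list to the real one's columns
match_columns <- function(synthetic, real_cols) {
  missing_cols <- setdiff(real_cols, names(synthetic))
  for (col in missing_cols) {
    synthetic[[col]] <- NA # no simulated equivalent, so left empty
  }
  # Keep only the real line list's columns, in the real line list's order
  synthetic[, real_cols, drop = FALSE]
}

# Toy stand-ins; column names are illustrative only
synthetic <- data.frame(id = 1:3, age = c(34, 5, 61), sex = c("f", "m", "f"))
real_cols <- c("id", "sex", "age", "occupation")

aligned <- match_columns(synthetic, real_cols)
names(aligned)
```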
