Function to generate synthetic data with similar distributional properties to a real dataset #182

Open
adamkucharski opened this issue Feb 4, 2025 · 3 comments

Comments

@adamkucharski
Member

A topic that has come up in discussions with applied partners is the value of being able to generate synthetic data with properties similar to those of a real dataset that is sensitive, and so cannot be shared, to allow external groups to develop and test methods.

{simulist} already has the ability to generate simulated data from defined distributions, so addressing the above need would require us to define a new function that could:

  1. Estimate key distributions in a line list (e.g. key delays, perhaps with a simple omission of recent data points to avoid truncation/censoring issues; proportion with key outcomes; demographic distribution; secondary case distribution and contact distribution)
  2. Output an object that could then be used as an input to the existing simulist pipeline.
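
As a rough illustration of step 1, here is a minimal base R sketch that summarises two columns as marginal distributions. The toy data frame, the helper name `estimate_marginal()` and the age bands are all assumptions made up for this example; nothing here is part of {simulist}, and the output format would still need adapting to whatever `sim_linelist()` expects.

```r
# Toy stand-in for a sensitive line list (values are made up for illustration)
real_linelist <- data.frame(
  age = c(5L, 23L, 37L, 41L, 58L, 62L, 70L, 84L),
  case_type = c("confirmed", "confirmed", "probable", "suspected",
                "confirmed", "probable", "confirmed", "suspected")
)

# Hypothetical helper: summarise one column as a marginal distribution
estimate_marginal <- function(x, type, age_breaks = c(0, 20, 40, 60, 90)) {
  if (type == "category") {
    # Proportion of rows in each category
    return(prop.table(table(x)))
  }
  if (type == "integer") {
    # Proportion of rows in each band (left-closed, right-open intervals)
    bands <- cut(x, breaks = age_breaks, right = FALSE)
    return(prop.table(table(bands)))
  }
  stop("Unsupported column type: ", type)
}

# Columns to match, as (name, type) pairs
match_cols <- list(c("age", "integer"), c("case_type", "category"))

# One marginal per column, stored in a list named by column
marginals <- lapply(match_cols, function(col) {
  estimate_marginal(real_linelist[[col[1]]], type = col[2])
})
names(marginals) <- vapply(match_cols, `[`, character(1), 1)
```

Only the aggregated proportions in `marginals` would need to leave the secure environment, not the individual rows.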

As an example, this contact tracing analysis uses synthetic-but-realistic marginal distributions for contacts in different settings, rather than publishing the full (and hence more sensitive) joint dataset: https://github.com/adamkucharski/2020-cov-tracing

@joshwlambert
Member

Thanks @adamkucharski, I like the idea, but I think the workflow as a whole is outside the scope of {simulist}.

Could you provide an overview of how you see the workflow, with pseudo-code if possible?

Then we can sketch out where {simulist} can be enhanced and where other functions/packages are needed. If you also have any datasets that could be used to test this workflow please share links to them if available.

@adamkucharski
Member Author

Some example pseudo-code is below: it takes a 'true dataset', fits marginal distributions, then resimulates with these. This would be easier if limited to distributions, as other columns could obviously take lots of forms.

# Simulate 'real data'
linelist <- sim_linelist()

# Define columns to match
match_cols <- list(c("age","integer"),
                   c("case_type","category")
)

# Extract relevant distributions and store...

# Add set up and storage code

for (ii in seq_along(match_cols)) {
  col_ii <- match_cols[[ii]][1] # Column name
  type_ii <- match_cols[[ii]][2] # Column type
  
  distn <- linelist[[col_ii]] # Get values (base R, so no dplyr needed)
  
  if(type_ii == "category"){
    # ...
  }
  
  if(type_ii == "integer"){
    # ...
  }
  
}

# Define delays to match
match_delays <- list(c("date_onset","date_admission"),
                     c("date_onset","date_outcome")
)

# Fit relevant distributions and store...
fit_onset_admission <- NULL
fit_onset_outcome <- NULL

# Simulate synthetic data with matched properties

synthetic_linelist <- sim_linelist(
  population_age = fit_age,
  case_type = fit_case_type,
  onset_to_hosp = fit_onset_admission,
  onset_to_death = fit_onset_outcome
)
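
One possible way to fill in the "fit relevant distributions" step for a delay is sketched below in base R. It assumes a lognormal delay and uses the closed-form maximum-likelihood estimates on the log scale; the toy dates and the helper name `fit_delay_lognormal()` are hypothetical. It ignores truncation and censoring, and whether the returned function can be passed directly to `sim_linelist()` should be checked against the {simulist} documentation.

```r
# Toy stand-in for the sensitive line list (dates are made up)
real_linelist <- data.frame(
  date_onset = as.Date("2025-01-01") + c(0, 2, 3, 5, 8, 9, 12, 15),
  date_admission = as.Date("2025-01-01") + c(3, 4, 8, 7, 13, 10, 16, NA)
)

# Hypothetical helper: fit a lognormal to the delay between two date columns
fit_delay_lognormal <- function(df, from, to) {
  delay <- as.numeric(df[[to]] - df[[from]]) # Delay in days
  # Keep observed, strictly positive delays (log is undefined otherwise)
  delay <- delay[!is.na(delay) & delay > 0]
  log_delay <- log(delay)
  meanlog <- mean(log_delay)
  # Maximum-likelihood estimate of sdlog (divides by n, not n - 1)
  sdlog <- sqrt(mean((log_delay - meanlog)^2))
  list(
    meanlog = meanlog,
    sdlog = sdlog,
    # Random number generator using the fitted parameters
    rng = function(n) stats::rlnorm(n, meanlog = meanlog, sdlog = sdlog)
  )
}

fit_onset_admission <- fit_delay_lognormal(
  real_linelist, from = "date_onset", to = "date_admission"
)

# Draw five synthetic onset-to-admission delays from the fitted distribution
fit_onset_admission$rng(5)
```

Same-day admissions (a delay of zero) are dropped here because the lognormal has no mass at zero; a discretised distribution would be one way to keep them.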

@joshwlambert
Member

Thanks for the overview. It would definitely be really neat to be able to go seamlessly from real line list data to synthetic line list data in these steps, with the final step calling sim_linelist().

However, I don't think this pipeline fits into the scope of the {simulist} package. It would be good to post this on the Epiverse-TRACE discussion board to get others' thoughts and see what the right format is for such a pipeline (e.g. how-to script, blog post, R package, etc.). Let me know if you're happy for me to transfer this issue there.

There are some added complications that would need working out in whatever form the pipeline takes. sim_linelist() produces a fixed set of line list columns, so the columns in the output would be independent of the real line list being mocked. If we wanted the synthetic line list to match the columns of the original line list, this would require extra steps. Additionally, depending on how closely we want the synthetic line list to match the original, we'd need to add some conditioning to the simulation (either internally to {simulist} or externally).
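
To make the conditioning point concrete, here is a small base R sketch done externally to {simulist}. The toy data, column names and helper functions are hypothetical. It draws `age_group` from its marginal and then draws `outcome` given `age_group`, so the synthetic data keeps the original column names and the age-outcome association, which independent sampling from each marginal would discard.

```r
set.seed(1)

# Toy stand-in for the original line list (values are made up)
real_linelist <- data.frame(
  age_group = c("child", "child", "adult", "adult",
                "adult", "older", "older", "older"),
  outcome = c("recovered", "recovered", "recovered", "recovered",
              "died", "died", "died", "recovered")
)

# Sample n values from the empirical marginal distribution of x
sample_marginal <- function(x, n) {
  probs <- prop.table(table(x))
  sample(names(probs), size = n, replace = TRUE, prob = as.numeric(probs))
}

# Conditioned sampling: draw age_group first, then outcome given age_group
sample_conditional <- function(df, n) {
  age_group <- sample_marginal(df$age_group, n)
  outcome <- vapply(age_group, function(g) {
    sample_marginal(df$outcome[df$age_group == g], 1)
  }, character(1), USE.NAMES = FALSE)
  data.frame(age_group = age_group, outcome = outcome)
}

synthetic <- sample_conditional(real_linelist, n = 100)

# Compare the joint distribution with the original
table(synthetic$age_group, synthetic$outcome)
```

The more conditioning is added, the closer the synthetic data gets to the original joint dataset, so the level of conditioning is a trade-off against how sensitive the real line list is.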
