Function to generate synthetic data with similar distributional properties to a real dataset #182
Comments
Thanks @adamkucharski, I like the idea, but I think the workflow as a whole is outside the scope of {simulist}. Could you provide an overview of how you see the workflow, with pseudo-code if possible? Then we can sketch out where {simulist} can be enhanced and where other functions/packages are needed. If you have any datasets that could be used to test this workflow, please share links to them.
Some example pseudo-code below, taking a 'true dataset', fitting marginal distributions, then resimulating with these. It would be easier if limited to distributions, as other columns could obviously take many forms.
library(simulist)

# Simulate 'real data'
linelist <- sim_linelist()
# Define columns to match
match_cols <- list(c("age","integer"),
c("case_type","category")
)
# Extract relevant distributions and store...
# Add set up and storage code
for (ii in seq_along(match_cols)) {
col_ii <- match_cols[[ii]][1] # Column name for this iteration
type_ii <- match_cols[[ii]][2] # Column type for this iteration
distn <- linelist[[col_ii]] # Get values
if(type_ii == "category"){
# ...
}
if(type_ii == "integer"){
# ...
}
}
# Define delays to match
match_delays <- list(c("date_onset","date_admission"),
c("date_onset","date_outcome")
)
# Fit relevant distributions and store...
fit_onset_admission <- NULL
fit_onset_outcome <- NULL
# Simulate synthetic data with matched properties
linelist <- sim_linelist(
population_age = fit_age,
case_type = fit_case_type,
onset_to_hosp = fit_onset_admission,
onset_to_death = fit_onset_outcome
)
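The `# ...` placeholders above could be filled in roughly as follows. This is a minimal base-R sketch under stated assumptions, not existing {simulist} functionality: `fit_marginal()` and `fit_delay()` are hypothetical helpers, a toy data frame stands in for the output of `sim_linelist()`, and the gamma delay is fitted by method of moments purely for illustration (a maximum-likelihood fit would be the more usual choice).

```r
# Toy stand-in for a real line list (column names follow the pseudo-code above)
set.seed(1)
n <- 200
onset <- as.Date("2023-01-01") + sample(0:60, n, replace = TRUE)
linelist <- data.frame(
  age = sample(1:90, n, replace = TRUE),
  case_type = sample(c("suspected", "probable", "confirmed"), n,
                     replace = TRUE, prob = c(0.2, 0.3, 0.5)),
  date_onset = onset,
  date_admission = onset + round(rgamma(n, shape = 2, rate = 0.5))
)

# Hypothetical helper: empirical marginal distribution of one column
fit_marginal <- function(x, type) {
  x <- x[!is.na(x)]
  if (type == "category") {
    return(prop.table(table(x)))  # named proportions per category
  }
  if (type == "integer") {
    counts <- table(x)
    return(data.frame(
      value = as.integer(names(counts)),
      prob = as.numeric(counts) / length(x)
    ))
  }
  stop("Unsupported column type: ", type)
}

# Hypothetical helper: method-of-moments gamma fit to a delay in days
fit_delay <- function(start, end) {
  delay <- as.numeric(end - start)
  delay <- delay[!is.na(delay) & delay > 0]  # drop missing and zero delays
  m <- mean(delay)
  v <- var(delay)
  c(shape = m^2 / v, rate = m / v)
}

fit_age <- fit_marginal(linelist$age, "integer")
fit_case_type <- fit_marginal(linelist$case_type, "category")
fit_onset_admission <- fit_delay(linelist$date_onset, linelist$date_admission)
```

The fitted objects would still need converting into whatever format the simulation function expects for each argument; that mapping is not shown here.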
Thanks for the overview, it would definitely be really neat to be able to seamlessly go from real line list data to synthetic line list data in these steps, with the final step calling However, I don't think this pipeline fits into the scope of the {simulist} package. It would be good to post this onto the Epiverse-TRACE discussion board to get others' thoughts and see what the right format is for such a pipeline (e.g. howto script, blog post, R package, etc.). Let me know if you're happy for me to transfer this issue there. There are some added complications which would need working out in whatever form the pipeline takes.
A topic that has come up in discussions with applied partners is the value of being able to generate synthetic data with similar properties to a real (but sensitive, so not shareable) dataset, to allow external groups to develop and test methods.
{simulist} already has the ability to generate simulated data from defined distributions, so addressing the above need would require us to define a new function that could:
As an example, this contact tracing analysis uses synthetic-but-realistic marginal distributions for contacts in different settings, rather than publishing the full (and hence more sensitive) joint dataset: https://github.com/adamkucharski/2020-cov-tracing
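To illustrate the marginal-versus-joint distinction, here is a minimal base-R sketch. It is an assumption for illustration only, not the method used in the linked analysis: `resample_marginals()` is a hypothetical helper that resamples each column independently, so each column's marginal distribution is approximately preserved while row-level links between columns are broken.

```r
set.seed(42)
n <- 500
age <- sample(1:90, n, replace = TRUE)

# Toy 'real' data in which the number of contacts depends on age
real <- data.frame(age = age, contacts = rpois(n, lambda = 2 + age / 10))

# Hypothetical helper: resample every column independently, with replacement
resample_marginals <- function(data, n_out = nrow(data)) {
  as.data.frame(
    lapply(data, function(col) sample(col, n_out, replace = TRUE))
  )
}

synthetic <- resample_marginals(real)

# Dependence between columns is present in the real data ...
cor_real <- cor(real$age, real$contacts)
# ... but largely absent from the synthetic data
cor_synthetic <- cor(synthetic$age, synthetic$contacts)
```

Removing all dependence between columns is what reduces sensitivity here, but it is also a limitation: any method that relies on relationships between columns cannot be tested on data generated this way.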