Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

210 gencorgen with varying cluster sizes #229

Merged
merged 6 commits into from
Jun 10, 2024

Conversation

kgoldfeld
Copy link
Owner

@kgoldfeld kgoldfeld commented Jun 5, 2024

@assignUser I addressed #210 by updating function addCorGen. As part of this update, I fleshed out the help details. Here they are so you can understand what I did:

The original data table can come in one of two formats: a single row per idvar (where data are ungrouped) or multiple rows per idvar (in which case the data are grouped or clustered). The structure of the arguments depends on the format of the data.

In the case of ungrouped data, there are two ways to specify the number of correlated variables and the covariance matrix. In approach (1), nvars needs to be specified along with rho and corstr. In approach (2), corMatrix may be specified by identifying a single square n x n covariance matrix. The number of new variables generated for each record will be n. If nvars, rho, corstr, and corMatrix are all specified, the data will be generated based on the information provided in the covariance matrix alone. In both (1) and (2), the data will be returned in a wide format.

In the case of grouped data, where there are G groups, there are also two ways to proceed. In both cases, the number of new variables to be generated may vary by group, and will be determined by the number of records in each group, $n_i, i \in {1,...,G}$ (i.e., the number of records that share the same value of idvar). nvars is not used in grouped data. In approach (1), the arguments rho and corstr may both be specified to determine the structure of the covariance matrix. In approach (2), the argument corMatrix may be specified. corMatrix can be a single matrix with dimensions $n \ \text{x} \ n$ if $n_i = n$ for all i. However, if the sample sizes of each group vary (i.e., $n_i \ne n_j$ for some groups i and j), corMatrix must be a list of covariance matrices with a length G; each covariance matrix in the list will have dimensions $n_i \ \text{x} \ n_i, \ i \in {1,...,G}$. In the case of grouped data, the new data will be returned in long format (i.e., one new column only).

I didn't add any tests here. I need to do this, because this part of the package is woefully untested; I will hopefully get to it when I have more time.

@kgoldfeld kgoldfeld linked an issue Jun 5, 2024 that may be closed by this pull request
Copy link

github-actions bot commented Jun 5, 2024

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 5992ad7 is merged into main:

  • ✔️define_data: 17.3ms -> 17.2ms [-1.18%, +0.35%]
  • ✔️dist_beta: 214ms -> 213ms [-3.41%, +2.2%]
  • ✔️dist_binary: 8.69ms -> 8.58ms [-3.74%, +1.25%]
  • ✔️dist_binomial: 12.9ms -> 13.1ms [-0.24%, +2.81%]
  • ✔️dist_categorical: 58.1ms -> 58ms [-0.93%, +0.64%]
  • ✔️dist_exponential: 8.79ms -> 9.35ms [-4.87%, +17.68%]
  • ✔️dist_gamma: 16.4ms -> 16.5ms [-0.65%, +0.88%]
  • ✔️dist_mixture: 223ms -> 224ms [-0.91%, +1.5%]
  • ✔️dist_normal: 10.6ms -> 10.6ms [-2.36%, +1.58%]
  • ✔️gen_all_dists: 48.8ms -> 48.8ms [-1.04%, +1.23%]
    Further explanation regarding interpretation and methodology can be found in the documentation.

@kgoldfeld kgoldfeld marked this pull request as ready for review June 7, 2024 01:38
@kgoldfeld kgoldfeld requested a review from assignUser June 7, 2024 01:38
@assignUser
Copy link
Collaborator

I'll be able to have a look over the weekend :)

Copy link
Collaborator

@assignUser assignUser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

this part of the package is woefully untested

Yeah it's surprisingly hard to retroactively add tests to things ^^

@@ -1,9 +1,15 @@
# simstudy (development version)

## New features
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This confused me at first because I was looking for the associated change xD

R/add_correlated_data.R Outdated Show resolved Hide resolved
Comment on lines +536 to +538
if (!wide) { # multiple record per id
if (is.null(corMatrix)) {
if (same_nvar) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine here imo because it's just the one line each but when you nest three levels deep it's usually a good time to check if pulling some of the logic out can flatten things out a bit. (a goto example: instead of wrapping the entire function with if(condition) invert the condition and return early removing that level of nesting from the entire function)

Co-authored-by: Jacob Wujciak-Jens <[email protected]>
Copy link

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 9834a6a is merged into main:

  • ✔️define_data: 18.4ms -> 17.1ms [-22.37%, +7.34%]
  • ✔️dist_beta: 213ms -> 212ms [-2.97%, +1.96%]
  • ✔️dist_binary: 8.56ms -> 8.61ms [-1.32%, +2.41%]
  • ✔️dist_binomial: 13.1ms -> 13ms [-1.56%, +1.06%]
  • ✔️dist_categorical: 58.4ms -> 58.3ms [-0.94%, +0.72%]
  • ✔️dist_exponential: 8.81ms -> 8.8ms [-1.7%, +1.36%]
  • ✔️dist_gamma: 16.3ms -> 16.3ms [-1.02%, +0.9%]
  • ✔️dist_mixture: 224ms -> 225ms [-1.22%, +1.88%]
  • ✔️dist_normal: 10.6ms -> 10.5ms [-3.26%, +0.46%]
  • ✔️gen_all_dists: 49ms -> 48.9ms [-1.31%, +1.14%]
    Further explanation regarding interpretation and methodology can be found in the documentation.

@kgoldfeld kgoldfeld merged commit e153754 into main Jun 10, 2024
9 checks passed
@kgoldfeld kgoldfeld deleted the 210-gencorgen-with-varying-cluster-sizes branch June 10, 2024 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

genCorGen with varying cluster sizes
2 participants