I'm attempting to simulate some binary data, and I'm noticing that the correlation coefficients of the simulated data consistently come out lower than expected.
I have real binary data. From this I calculated the correlation matrix and used it, along with the observed probabilities of the binary variables, to generate the new dataset.
I am unclear whether this behaviour is to be expected. Since the original correlation coefficients were computed from "real" data, I don't understand why it wouldn't be possible to replicate the same correlation matrix in the simulated data. Perhaps I am using the function incorrectly, or there is some statistical property that I am not fully grasping. Any clarification, help, or suggestion would be greatly appreciated.
You can see that every simulated correlation coefficient is smaller than the corresponding coefficient in the actual data: subtracting the actual correlation matrix from the simulated one gives all negative values (except the zeros).
That is just the nature of the copula algorithm. The transformation from continuous data (where the data are generated) to the binary outcome loses a lot of information, so correlation necessarily decreases. The same is true for Poisson data, though the loss is not as substantial. However, in the case of binary data, there is another algorithm: if you specify method = "ep", you should get much better results. Let me know if that improves things.
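The loss described above can be reproduced outside the package. The following is a minimal Python sketch, not the package's code: it assumes the continuous step is a correlated standard normal pair and that each binary value is obtained by cutting the normal at the quantile matching the target probability. The names `pearson` and `simulate_thresholded_pair` are hypothetical and exist only for this illustration.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def simulate_thresholded_pair(rho, p1, p2, n, seed=1):
    """Draw n correlated normal pairs, threshold them to binary, and
    return (latent correlation, binary correlation).

    Assumption: a Gaussian latent pair cut at fixed quantiles stands in
    for the copula step discussed in the thread.
    """
    rng = random.Random(seed)
    # Cut points chosen so that P(z < cut) equals the target probability.
    cut1 = NormalDist().inv_cdf(p1)
    cut2 = NormalDist().inv_cdf(p2)

    latent1, latent2, binary1, binary2 = [], [], [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Mix z1 with independent noise so that corr(z1, z2) is rho.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        latent1.append(z1)
        latent2.append(z2)
        binary1.append(1 if z1 < cut1 else 0)
        binary2.append(1 if z2 < cut2 else 0)

    return pearson(latent1, latent2), pearson(binary1, binary2)


if __name__ == "__main__":
    latent_corr, binary_corr = simulate_thresholded_pair(0.6, 0.5, 0.5, 20000)
    print(f"latent correlation: {latent_corr:.3f}")
    print(f"binary correlation: {binary_corr:.3f}")
```

With a requested correlation of 0.6, the latent pair comes out close to 0.6 while the thresholded binary pair comes out clearly lower, which is the same pattern as the all-negative differences reported in the question.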
Wow, method = "ep" does bring the simulated correlations much closer to those of the original data. I had seen the "ep" option in the documentation but discounted it because of a "range" error I was getting in my actual analysis code (not the reprex). After revisiting, I have it working: I got past the error with a little pre-processing that removed some rows with NA values that had been introduced accidentally (so I think that was the issue).
Thanks so much for your answer. So if I understand correctly, you are saying the copula algorithm loses information because it first simulates continuous data and then thresholds it to binary? Assuming I got that right, this makes a lot of sense to me now and reminds me of what I had read about it before finding your package. I guess I should read up a bit more on how the "ep" algorithm works, but for now I am glad that it seems a better fit for my purpose.
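For the symmetric case, the size of the loss can be written down exactly. This is a sketch under stated assumptions: a bivariate standard normal latent pair $(Z_1, Z_2)$ with correlation $\rho$, and binary variables $B_i = 1$ when $Z_i > 0$, so each probability is 0.5. Nothing in this thread confirms that the package's copula step is exactly this construction.

```latex
\Pr(Z_1 > 0,\; Z_2 > 0) = \frac{1}{4} + \frac{\arcsin\rho}{2\pi},
\qquad
\operatorname{corr}(B_1, B_2)
  = \frac{\Pr(Z_1 > 0,\; Z_2 > 0) - \tfrac{1}{4}}{\tfrac{1}{4}}
  = \frac{2}{\pi}\arcsin\rho .
```

Because $\left|\tfrac{2}{\pi}\arcsin\rho\right| \le |\rho|$, with equality only at $\rho \in \{0, \pm 1\}$, the binary correlation is always pulled toward zero. For example, a latent correlation of 0.6 gives a binary correlation of about 0.41.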
FYI here is the same plot as my previous post, but this time generated using the "ep" method.
Hi,
Thanks for the great package!
Below is a reprex to illustrate:
Another visual comparison
Created on 2024-09-05 with reprex v2.1.0