-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enhance genData with dist="categorical" #100
Comments
I see how that could be helpful and save some post def <- defData(varname = "cat1", formula = c(0.5,0.2,0.2,0.1), variance = LETTERS[1:4], dist = categorical) In this case This would obviously be a pretty large change and not a quick decision but I think it would benefit the maintainability and usability of simstudy. Let me know what you think of the idea. Staying in the current way things work this change should be pretty easy I'll have a look. |
I totally agree with you that your suggested approach is cleaner and easier, and consistent with other R functions. But, we would need to understand how this does not break code using the current approach. I would go even further, though: def <- defData(varname = "cat1", param1 = c(0.5,0.2,0.2,0.1), param2 = LETTERS[1:4], dist = categorical) I know this is a little off topic, but I assume this would work with formulas as well? def <- defData(varname = "x", param1 = 0, param2 = 1, dist = normal)
def <- defData(def, varname = "y", param1 = ~-2 + 0.25*x, dist = binary) Maybe we are talking about simstudy2? |
Yeah, this would definitely require rewriting simstudy's core and I don't see a way to do this backwards compatible, without also keep large parts of the code or having special code that detects the string'ish usage and somehow converts it? I will think about that. I would really rather not have a separate simstudy2, maybe announce the changes a version or 2 before switching the system? Or somehow make it backwards compatible... hm not sure Yes this would also work for all the other parameters and with that big change we could make the parameters generic too, I already thought about a |
Maybe that could be nice Master's thesis - how to convert a cool but not ideally executed software package that has a moderately sized user community into a more robust, user friendly piece of software without deviating too far from its core, underlying philosophy. |
Or it could be a summer job ;) def <- defData(varname = "cat", dist = "categorical", variance = "a;b;c;d;", formula = genCatFormula(n = 4))
genData(10, def)
#> id cat
#> 1: 1 a
#> 2: 2 a
#> 3: 3 a
#> 4: 4 a
#> 5: 5 b
#> 6: 6 c
#> 7: 7 d
#> 8: 8 d
#> 9: 9 c
#> 10: 10 d You can test it in the branch issue-100, it is not really tested and probably needs some more work but let me know if this is along your idea. |
I really do wish I had some funds to support this - maybe some day. This definitely looks like what I had in mind - I will check it out morel closely. |
I took at look at the new functionality - it seems to work pretty nicely. One thing I noticed, and it is not really a problem, is that the categories specified in "variance" are realized as characters. There is one case where I thought this might be useful is to create different numerical values to represent the categories, that can be used as integers in subsequent data generation, as in this example: library(simstudy)
def <- defData(varname = "cat", dist = "categorical", variance = "2;3;5", formula = genCatFormula(n = 3))
set.seed(1111)
dd <- genData(5, def)
genCluster(dd, "id", "cat", "pid")
#> id cat pid
#> 1: 1 5 1
#> 2: 1 5 2
#> 3: 1 5 3
#> 4: 1 5 4
#> 5: 1 5 5
#> 6: 2 3 6
#> 7: 2 3 7
#> 8: 2 3 8
#> 9: 3 2 9
#> 10: 3 2 10
#> 11: 4 2 11
#> 12: 4 2 12
#> 13: 5 3 13
#> 14: 5 3 14
#> 15: 5 3 15
Created on 2021-07-06 by the reprex package (v2.0.0) This works fine, but it might be confusing to users in the context that cat is not an integer: str(dd)
#> Classes 'data.table' and 'data.frame': 5 obs. of 2 variables:
#> $ id : int 1 2 3 4 5
#> $ cat: chr "5" "3" "2" "2" ...
#> - attr(*, ".internal.selfref")=<externalptr>
#> - attr(*, "sorted")= chr "id" I don't feel too strongly about this, but was wondering what you think. |
Yes I just split the string and do not do any conversion. We could do that but would have to check for |
I modified the behaviour to convert the category variable to numeric if conversion is possible without introducing NAs. |
That's nice - a good addition. Thanks. |
I made some minor changes to the documentation in a few places. Can you remind me how I "re-build" that part again so that it shows up in the help files? When I Install and Restart, the changes don't appear - but I remember there is one additional step. |
|
Currently, a categorical distribution generates integer categories
{1, 2, 3, ...}
. It would be nice to be able to generate more flexible levels such as strings (e.g.,{A, B, C, D}
) or even more general integers (e.g. ,{80, 120, 150}
). One way this could be specified usingdefData
isThe text was updated successfully, but these errors were encountered: