Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enhance genData with dist="categorical" #100

Closed
kgoldfeld opened this issue Jul 1, 2021 · 12 comments · Fixed by #101
Closed

enhance genData with dist="categorical" #100

kgoldfeld opened this issue Jul 1, 2021 · 12 comments · Fixed by #101
Labels
feature feature request or enhancement

Comments

@kgoldfeld
Copy link
Owner

Currently, a categorical distribution generates integer categories {1, 2, 3, ...}. It would be nice to be able to generate more flexible levels such as strings (e.g.,{A, B, C, D}) or even more general integers (e.g. ,{80, 120, 150}). One way this could be specified using defData is

def <- defData(varname = "cat1", formula = "0.5;0.2;0.2;0.1", variance = "A;B;C;D", dist = "categorical")
def <- defData(d1, varname = "cat2", formula = "0.4;0.3;0.3", variance = "80;120;150", dist = "categorical")
@kgoldfeld kgoldfeld added the feature feature request or enhancement label Jul 1, 2021
@assignUser
Copy link
Collaborator

I see how that could be helpful and save some post genData work replacing the simple categories. For me, it is also another reason that speaks to my "recent" idea which is basically #75
If we use R-expressions instead of strings (via non-standard evaluation we could still keep the definition table as a human-readable overview), you could just pass in any vector that is then used as the categories:

def <- defData(varname = "cat1", formula = c(0.5,0.2,0.2,0.1), variance = LETTERS[1:4], dist = categorical)

In this case categorical could be a function, that would make adding new distributions and custom distributions much easier as we would not have to insert them into the existing infrastructure of switch and if chains etc. which is as we know a bit error-prone. We would just need to define a certain API a distribution function would need to follow.

This would obviously be a pretty large change and not a quick decision but I think it would benefit the maintainability and usability of simstudy. Let me know what you think of the idea.

Staying in the current way things work this change should be pretty easy I'll have a look.

@kgoldfeld
Copy link
Owner Author

I totally agree with you that your suggested approach is cleaner and easier, and consistent with other R functions. But, we would need to understand how this does not break code using the current approach. I would go even further, though:

def <- defData(varname = "cat1", param1 = c(0.5,0.2,0.2,0.1), param2 = LETTERS[1:4], dist = categorical)

I know this is a little off topic, but I assume this would work with formulas as well?

def <- defData(varname = "x", param1 = 0, param2 = 1, dist = normal)
def <- defData(def, varname = "y", param1 = ~-2 + 0.25*x, dist = binary)

Maybe we are talking about simstudy2?

@assignUser
Copy link
Collaborator

Yeah, this would definitely require rewriting simstudy's core and I don't see a way to do this backwards compatible, without also keep large parts of the code or having special code that detects the string'ish usage and somehow converts it? I will think about that.

I would really rather not have a separate simstudy2, maybe announce the changes a version or 2 before switching the system? Or somehow make it backwards compatible... hm not sure

Yes this would also work for all the other parameters and with that big change we could make the parameters generic too, I already thought about a ... parameter too allow passing additional, arbitrary parameters to the function but that would be implementation dependent (aka needs more thought) and might break the definition tables printability.

@kgoldfeld
Copy link
Owner Author

Maybe that could be nice Master's thesis - how to convert a cool but not ideally executed software package that has a moderately sized user community into a more robust, user friendly piece of software without deviating too far from its core, underlying philosophy.

@assignUser
Copy link
Collaborator

Or it could be a summer job ;)
I have modified the categorical distribution to allow for this:

def <- defData(varname = "cat", dist = "categorical", variance = "a;b;c;d;", formula = genCatFormula(n = 4)) 
    genData(10, def)

#>     id cat
#>  1:  1   a
#>  2:  2   a
#>  3:  3   a
#>  4:  4   a
#>  5:  5   b
#>  6:  6   c
#>  7:  7   d
#>  8:  8   d
#>  9:  9   c
#> 10: 10   d

You can test it in the branch issue-100, it is not really tested and probably needs some more work but let me know if this is along your idea.

@kgoldfeld
Copy link
Owner Author

I really do wish I had some funds to support this - maybe some day. This definitely looks like what I had in mind - I will check it out morel closely.

@kgoldfeld
Copy link
Owner Author

kgoldfeld commented Jul 6, 2021

I took at look at the new functionality - it seems to work pretty nicely. One thing I noticed, and it is not really a problem, is that the categories specified in "variance" are realized as characters. There is one case where I thought this might be useful is to create different numerical values to represent the categories, that can be used as integers in subsequent data generation, as in this example:

library(simstudy)
def <- defData(varname = "cat", dist = "categorical", variance = "2;3;5", formula = genCatFormula(n = 3)) 

set.seed(1111)
dd <- genData(5, def)

genCluster(dd, "id", "cat", "pid")
#>     id cat pid
#>  1:  1   5   1
#>  2:  1   5   2
#>  3:  1   5   3
#>  4:  1   5   4
#>  5:  1   5   5
#>  6:  2   3   6
#>  7:  2   3   7
#>  8:  2   3   8
#>  9:  3   2   9
#> 10:  3   2  10
#> 11:  4   2  11
#> 12:  4   2  12
#> 13:  5   3  13
#> 14:  5   3  14
#> 15:  5   3  15

Created on 2021-07-06 by the reprex package (v2.0.0)

This works fine, but it might be confusing to users in the context that cat is not an integer:

str(dd)
#> Classes 'data.table' and 'data.frame':   5 obs. of  2 variables:
#>  $ id : int  1 2 3 4 5
#>  $ cat: chr  "5" "3" "2" "2" ...
#>  - attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, "sorted")= chr "id"

I don't feel too strongly about this, but was wondering what you think.

@assignUser
Copy link
Collaborator

Yes I just split the string and do not do any conversion. We could do that but would have to check for NAsdue to conversion to prevent issues when the categories are not numeric... I'll have a look.

@assignUser
Copy link
Collaborator

I modified the behaviour to convert the category variable to numeric if conversion is possible without introducing NAs.

@kgoldfeld
Copy link
Owner Author

That's nice - a good addition. Thanks.

@kgoldfeld
Copy link
Owner Author

I made some minor changes to the documentation in a few places. Can you remind me how I "re-build" that part again so that it shows up in the help files? When I Install and Restart, the changes don't appear - but I remember there is one additional step.

@assignUser
Copy link
Collaborator

devtools::document() and commit the changes or you can comment "/document" on the pull request once you have pushed your changes, then the bot will do it :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature feature request or enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants