Add genMixFormula #51
Do you mean something like this:
``` r
genMixFormula <- function(vars, probs = NULL, roundTo = 3) {
  assertNotMissing(vars = missing(vars))
  if (is.null(probs)) {
    n <- length(vars)
    probs <- round(rep(1 / n, n), roundTo)
  } else {
    assertNumeric(probs = probs)
    probs <- .adjustProbs(unlist(probs))
    assertLengthEqual(vars = vars, probs = probs)
  }
  paste(vars, probs, sep = " | ", collapse = " + ")
}
genMixFormula(c("a", "..b[..i]", "c"))
# [1] "a | 0.333 + ..b[..i] | 0.333 + c | 0.333"
genMixFormula(c("a", "..b", "c"), list(.2, .5, .3))
# [1] "a | 0.2 + ..b | 0.5 + c | 0.3"
```
I wasn't quite sure what you meant with
|
So, if
``` r
a <- c(1, 2, 5)
b <- c(.3, .2, .5)
genMixFormula(a, b)
# [1] "1 | .3 + 2 | .2 + 5 | .5"
```
Or it could be more general if we want to define the formula first and later on we are changing a and b:
``` r
genMixFormula(a, b)
# [1] "..a[1] | ..b[1] + ..a[2] | ..b[2] + ..a[3] | ..b[3]"
```
But now that I think about it, I am not sure the second option makes sense, which is what I guess you are questioning. I did have another thought - maybe it makes sense to have a single
|
With the function I posted you could do both variants; you would just need to pass in the array vars as strings, as seen in my first example:
``` r
genMixFormula(c("a", "..b[..i]", "c"))
# [1] "a | 0.333 + ..b[..i] | 0.333 + c | 0.333"
# Could just as well be:
genMixFormula(c("..a[2]", "..b[..i]", "..a[3]"))
# [1] "..a[2] | 0.333 + ..b[..i] | 0.333 + ..a[3] | 0.333"
```
(This formula is ofc only usable once #41 is done.) Btw should we enable passing
|
Yeah - I guess I was trying to simplify things by not having to specify all the elements of the vector manually in the call to genMixFormula. I do see the challenge in distinguishing between the two examples I gave, but it would be nice to have some shorthand way of entering the vector c("..a[1]", "..a[2]", "..a[3]", ..., "..a[n]"). But we can go with the simpler version you have created, and if I ever really come up with a case where a shorthand version might be useful, then we could accommodate that. My guess is your solution is sufficient.
I am not sure I get your follow-up question. In defData, a single categorical variable is defined (and later generated), so there is no correlation structure to be applied. Of course, the probabilities can be dependent on previously defined variables, so that correlation will be induced. Now, I could see a situation where in genCorData, rho could be a function of some group membership (so that correlation for some groups would be higher than others), but I don't think that is what you are talking about.
|
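One way to get the shorthand discussed above without typing every element is to build the vector with `paste0`. This is only a minimal sketch in base R; the helper name `dotdotIndexed` is hypothetical and not part of simstudy.

``` r
# Hypothetical helper (not part of simstudy): build "..a[1]", ..., "..a[n]"
dotdotIndexed <- function(name, n) {
  paste0("..", name, "[", seq_len(n), "]")
}

vars <- dotdotIndexed("a", 3)
vars
# [1] "..a[1]" "..a[2]" "..a[3]"

# Equal mixing probabilities, pasted together the same way as in the
# genMixFormula draft posted earlier in this thread
mixform <- paste(vars, round(1 / 3, 3), sep = " | ", collapse = " + ")
mixform
# [1] "..a[1] | 0.333 + ..a[2] | 0.333 + ..a[3] | 0.333"
```

The resulting character vector could then be passed as `vars` to the draft function, so the function itself would not need to guess whether its input is a set of names or a set of values.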
ah I see, I'll think about the shorthand. ah, 🤦 of course xD |
Not to keep harping on this - after all this is not the most important function - but two possible ideas. One is to interpret based on contents of
``` r
a <- c("x", "y", "z")
genMixFormula(a)
# [1] "x | 0.3333 + y | 0.3333 + z | 0.3333"
```
but
``` r
a <- c(1, 2, 6)
genMixFormula(a)
# [1] "..a[1] | 0.3333 + ..a[2] | 0.3333 + ..a[3] | 0.3333"
```
In the first case
``` r
genMixFormula(aa, len_a = 4)
# [1] "..aa[1] | 0.25 + ..aa[2] | 0.25 + ..aa[3] | 0.25 + ..aa[4] | 0.25"
```
where
|
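The first idea, choosing the behaviour from the type of the argument, could look roughly like the sketch below. The function name `mixFormulaByType` and its details are assumptions for illustration, not simstudy code: a character vector is used as-is, while a numeric vector is replaced by indexed double-dot references to the variable it was passed in as.

``` r
# Hypothetical sketch (not simstudy code): dispatch on the type of `vars`
mixFormulaByType <- function(vars, roundTo = 4) {
  n <- length(vars)
  if (is.numeric(vars)) {
    # Recover the name of the variable the caller passed in, e.g. "a"
    nm <- deparse(substitute(vars))
    vars <- paste0("..", nm, "[", seq_len(n), "]")
  }
  probs <- round(rep(1 / n, n), roundTo)
  paste(vars, probs, sep = " | ", collapse = " + ")
}

a <- c("x", "y", "z")
mixFormulaByType(a)
# [1] "x | 0.3333 + y | 0.3333 + z | 0.3333"

a <- c(1, 2, 6)
mixFormulaByType(a)
# [1] "..a[1] | 0.3333 + ..a[2] | 0.3333 + ..a[3] | 0.3333"
```

One caveat of this design: `deparse(substitute(vars))` only yields a usable name when the caller passes a plain variable, so a call such as `mixFormulaByType(c(1, 2, 6))` would produce references that do not point at any existing object.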
It looks like you are getting very close with this, but it is still not perfect (or maybe I am using it incorrectly).
``` r
library(simstudy)
a <- c(1, 2, 3)
mixform <- genMixFormula("..a", varLength = 3)
mixform
#> [1] "..a[[1]] | 0.333 + ..a[[2]] | 0.333 + ..a[[3]] | 0.333"
d1 <- defData(varname = "a", formula = 6, dist = "nonrandom")
d1 <- defData(d1, varname = "c", formula = 87, dist = "nonrandom")
d1 <- defData(d1, varname = "x", formula = mixform, dist = "mixture")
#> Error in .checkMixture(newform): Invalid variable(s):
#> Probabilities can only be numeric or numeric ..vars (not arrays). See ?distribution
```
Created on 2020-10-02 by the reprex package (v0.3.0)
And, if we don't want the probabilities to be 1/varLength, then we are out of luck. Not saying we need to do this now - fixing the first problem would be sufficient, but it might be nice to be able to specify:
or even more generally
|
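A rough sketch of how custom probabilities might be combined with a `varLength`-style argument follows. The function `mixFormulaSketch` is hypothetical and only builds the formula string. The `[[i]]` indexing mirrors the output shown in the reprex above, and the sum-to-one check is an assumption rather than the package's actual probability handling.

``` r
# Hypothetical sketch (not the simstudy implementation)
mixFormulaSketch <- function(varName, varLength, probs = NULL, roundTo = 3) {
  if (is.null(probs)) {
    # Default: equal probability for each element
    probs <- rep(1 / varLength, varLength)
  }
  # Assumed validation: one numeric probability per element, summing to 1
  stopifnot(
    is.numeric(probs),
    length(probs) == varLength,
    isTRUE(all.equal(sum(probs), 1))
  )
  vars <- paste0(varName, "[[", seq_len(varLength), "]]")
  paste(vars, round(probs, roundTo), sep = " | ", collapse = " + ")
}

mixFormulaSketch("..a", 3)
# [1] "..a[[1]] | 0.333 + ..a[[2]] | 0.333 + ..a[[3]] | 0.333"

mixFormulaSketch("..a", 3, probs = c(0.2, 0.5, 0.3))
# [1] "..a[[1]] | 0.2 + ..a[[2]] | 0.5 + ..a[[3]] | 0.3"
```

As the reprex shows, a formula of this shape was still rejected by `defData` at the time of this comment, so the sketch only covers string construction.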
Oh yeah, this is just the new function; I still have to fix the check in evalDef to work with arrays. I'll have a look at the rest later. |
Should I hold off on merging until evalDef is updated? Or should I go ahead? |
The changes to evalDef are a bit bigger, as we have to move these checks to pre-generation and I'll have to think about how to do that in a nice way (this will probably involve #18). I think merging now would be OK, what do you think? |
Probably OK since only super users would likely attempt to use the mixture formula in this fashion. But given that it fails, we might not want to include it in the examples just yet. And more generally, I kind of like the example in help to be used in the context of data generation - more like what I did above. If you don't disagree (or don't really have an opinion), I am happy to edit the example. |
Sure feel free to improve the documentation :) I will of course make the array stuff work before 0.2.0 |
btw should we rename catProbs to genCatFormula to be in line with genFormula and genMixFormula? |
Yes - definitely |
I had another idea for a useful function, particularly if you implement the vectorization. It could be nice to have a function
genMixFormula(functions, probs)
that takes two arguments - a vector of function names or a double-dot variable that is itself a vector, and a vector of probabilities (which could be NULL, defaulting to equal probability 1/n with n functions being mixed).
Originally posted by @kgoldfeld in #44 (comment)