pass-by-value vs. pass-by-reference #50
@kgoldfeld That is intended behavior by data.table, as I understand it. The terminology is pass-by-value vs. pass-by-reference. R almost exclusively behaves as pass-by-value: a function that modifies one of its arguments works on its own copy (R makes that copy lazily, at the point of modification). This avoids side effects like the one you describe here, but it also lowers performance because of the copying and the increased memory usage. See this Stack Overflow answer for more general information. As data.table is specifically made for huge data sets, it implements changes by reference (e.g. reordering columns) so that it doesn't have to copy data around. I did not remove calls to |
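To make the distinction concrete, here is a minimal sketch contrasting a base R data.frame with a data.table. The function names `add_col_df` and `reorder_dt` are made up for illustration and are not part of simstudy or data.table; only `setcolorder()` is a real data.table function.

```r
library(data.table)

# Base R: modifying a data.frame inside a function leaves the caller's
# object alone, because R copies the object when it is modified.
add_col_df <- function(df) {
  df$y <- df$x * 2
  df
}

df <- data.frame(x = 1:3)
res_df <- add_col_df(df)
names(df)   # still just "x"

# data.table: the set*() functions (and `:=`) modify the object in place.
reorder_dt <- function(dt) {
  setcolorder(dt, c("b", "a"))  # reorders the columns by reference
  invisible(dt)
}

dt <- data.table(a = 1:3, b = 4:6)
reorder_dt(dt)
names(dt)   # now "b" "a": the caller's object was changed
```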
I did see in genOrdCat that you removed |
I can blame you for that 😝 because I just reused the body of simstudy/R/generate_correlated_data.R Line 622 in 84d995d
But whatever the case, let us double-check all functions and make it the default to copy the incoming data.table. We could add an option (e.g. an environment variable) to deactivate that behavior for people who work with very large simulations. That way we could preserve the speed gains but make it a conscious choice. |
You are so right - I just looked at the release version. Not sure how I made that mistake, but I did. And then I accused you of it - doubly bad! I'm not sure it is necessary to deactivate that behavior - since I generally test all the functions generating a million observations to see if things get too bogged down. Just seems like that would add complexity for the user. Now, if people started complaining, then I guess we could do it. |
I had a quick look and at the moment many but not all functions copy their input data.table. I would argue for creating one consistent behavior across all simstudy functions, either with or without copying. Consistency makes for a good user experience 😸 |
I totally agree - we should copy in all cases where it is possible that the user will be creating a new data.table as a result of the call to the function, rather than intending to change the original data set. |
ok I will look into it |
I will think about this some more. For now let's postpone it until after 0.2.0. |
I agree - probably not causing too many problems. But, I am sure we will hear about it if it is. |
I restored the old behaviour of genOrdCat in the trtAssign pull request :) after that is merged I will start the rhub checks. |
I still think we should default to one behavior for all functions, but maybe it would make sense to add a parameter or option to turn the copying off/on. I think we should address this after #79 so we can make sure that performance is not affected negatively. |
I would want it to default to copying to a new data.table, but I am not totally convinced that anyone would ever want to turn it off (and allow changes to the data.table being passed as an argument, which just seems like weird behavior). |
I'm with you on the default, as this is (mostly) the simstudy default anyway. Still, in-place changes are the default behavior of data.table for performance reasons, so maybe this could be helpful for very large datasets? It would at least give heavy data.table users an option to keep the default behavior they are used to. |
I think an option would be fine for those folks. |
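One way the opt-out discussed above could look is sketched below. This is only an illustration under assumptions: the option name `simstudy.copy_input` and the helpers `maybe_copy()` and `add_flag()` are hypothetical and do not exist in simstudy as far as this thread shows.

```r
library(data.table)

# Hypothetical helper: copy the input unless the (made-up) option
# `simstudy.copy_input` has been set to FALSE.
maybe_copy <- function(dt) {
  if (isTRUE(getOption("simstudy.copy_input", default = TRUE))) copy(dt) else dt
}

# Hypothetical function that adds a column to its input.
add_flag <- function(dt) {
  dt <- maybe_copy(dt)
  dt[, flag := TRUE]
  dt
}

original <- data.table(id = 1:3)

res <- add_flag(original)         # default: works on a copy
"flag" %in% names(original)       # FALSE, the input is untouched

options(simstudy.copy_input = FALSE)
invisible(add_flag(original))     # opt-out: modifies the input in place
"flag" %in% names(original)       # TRUE

options(simstudy.copy_input = NULL)  # reset the hypothetical option
```

Defaulting to the copy keeps the safe behavior for most users, while people running very large simulations can make the in-place behavior a conscious choice.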
I was just looking through the code and I noticed something changed that probably shouldn't have. Whenever we pass a data.table as an argument to a function, it is imperative to copy the data table to another data.table. So you might see something like:
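The original snippet is not preserved here. As a stand-in, a minimal sketch of the copy pattern might look like the following; the function name `add_group_safe` and its argument `dtName` are illustrative assumptions, not code taken from simstudy.

```r
library(data.table)

# Hypothetical function: take an explicit copy first, then modify the copy.
add_group_safe <- function(dtName) {
  dt <- copy(dtName)  # deep copy; a plain `dt <- dtName` would NOT copy
  dt[, grp := rep(c("A", "B"), length.out = .N)]
  dt
}

original <- data.table(id = 1:4)
result <- add_group_safe(original)

names(result)    # "id" "grp"
names(original)  # "id": the input is left unchanged
```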
This addresses a weird R problem (maybe there is a better workaround, but this was the only way I could figure out) where the original data.table gets changed by the function call even if that was not the intention. For example:
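The original example is not preserved here. A minimal sketch of the kind of side effect being described, using a made-up helper `add_group` (not a simstudy function), could be:

```r
library(data.table)

# Hypothetical function that adds a column with `:=` and does not copy.
add_group <- function(dt) {
  dt[, grp := rep(c("A", "B"), length.out = .N)]
  dt
}

original <- data.table(id = 1:4)
result <- add_group(original)

names(result)    # "id" "grp"
names(original)  # also "id" "grp": the input was modified by reference
```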
Originally posted by @kgoldfeld in #49 (comment)