allow user to specify which rows to subsample from #75
Comments
I do not think the desideratum is to choose which rows to subsample from, but rather to implement subsampling in its true form (separate issue?). That is, train parallel GPMs on slightly overlapping sets of the data. However, figuring out how GPMs (from the same family or even different families) trained on different portions of the dataset (with the same schema) interact to answer BQL queries is not a straightforward problem.
@fsaad I'm not sure that bagging or bootstrapping GPMs will be all that helpful. If it is, I feel like that's a longer-term project. @riastradh-probcomp, what kinds of restrictions would you want to have? I can think of a few sampling strategies you might want to choose from: first, random, evenly spaced... or do you mean specify a test on rows as to whether they should be eligible for sampling?
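To make the options above concrete, here is a minimal sketch of the three strategies (first, random, evenly spaced) plus an optional eligibility test on rows. It is plain Python and does not use any bayeslite API; `subsample_rows`, its parameters, and the strategy names are hypothetical and only illustrate how the choices differ.

```python
import random


def subsample_rows(rowids, k, strategy="random", eligible=None, seed=0):
    """Pick up to k row ids from rowids (hypothetical helper, not bayeslite API).

    strategy: "first", "random", or "evenly_spaced".
    eligible: optional predicate; rows failing it are never sampled.
    """
    # Apply the eligibility test first, so every strategy sees the same pool.
    pool = [r for r in rowids if eligible is None or eligible(r)]
    if k <= 0:
        return []
    if k >= len(pool):
        return list(pool)
    if strategy == "first":
        return pool[:k]
    if strategy == "random":
        # Seeded pseudorandom sample without replacement, for repeatability.
        rng = random.Random(seed)
        return sorted(rng.sample(pool, k))
    if strategy == "evenly_spaced":
        # Take one row from the start of each of k equal-width strides.
        step = len(pool) / k
        return [pool[int(i * step)] for i in range(k)]
    raise ValueError("unknown strategy: %r" % (strategy,))


rowids = list(range(1, 101))
first_five = subsample_rows(rowids, 5, strategy="first")
spaced_five = subsample_rows(rowids, 5, strategy="evenly_spaced")
random_five = subsample_rows(rowids, 5, strategy="random", seed=42)
even_only = subsample_rows(rowids, 3, strategy="first",
                           eligible=lambda r: r % 2 == 0)
```

Treating the eligibility test as a filter applied before the strategy keeps the two questions (which rows may be sampled, and how to pick among them) independent of each other.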
I don't remember what I was thinking when I made this, other than that I don't think I had in mind any particular mechanism. Whatever I meant, pseudorandom subsampling is doubtless a more immediately fruitful, and perhaps largely sufficient, strategy.
If we're already using a subsample, and someone wants to do a query that focuses in on a particular sub-population, that might be a good time to re-subsample from the full population, selecting for that query. The re-subsampled rows might then go into new generators (perhaps with a new label? #313) for some new analysis.
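As a rough illustration of re-subsampling for a focused query, the sketch below draws a fresh pseudorandom subsample from the full table, restricted to the sub-population of interest, using plain SQLite through Python's `sqlite3` module. The `resubsample` helper, the `patients` table, and its columns are hypothetical; how the resulting row ids would be handed to a new generator is not shown, since that depends on machinery this thread has not settled.

```python
import sqlite3


def resubsample(conn, table, where_sql, params, k):
    """Return up to k pseudorandom rowids from table that match where_sql.

    Hypothetical helper. table and where_sql are interpolated into the SQL
    text, so they must come from trusted code, never from user input.
    """
    query = 'SELECT rowid FROM "%s" WHERE %s ORDER BY random() LIMIT ?' % (
        table, where_sql)
    return [row[0] for row in conn.execute(query, tuple(params) + (k,))]


# Toy population: 200 rows, a quarter of them in the "north" region.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (age INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [(20 + i % 60, "north" if i % 4 == 0 else "south") for i in range(200)],
)

# Re-subsample from the full population, selecting for the focused query.
focused = resubsample(conn, "patients", "region = ?", ["north"], 10)
```

Sampling from the full population rather than from the existing subsample matters here: a uniform subsample may contain very few rows from a small sub-population, while the full table may contain plenty.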
No description provided.