allow user to specify which rows to subsample from #75

Open
riastradh-probcomp opened this issue Jul 3, 2015 · 4 comments

@riastradh-probcomp
Contributor

No description provided.

@fsaad
Collaborator

fsaad commented Aug 26, 2015

I do not think the desideratum is to choose which rows to subsample from, but rather to implement subsampling in its true form (separate issue?). That is, train parallel GPMs on slightly overlapping subsets of the data. However, figuring out how GPMs (from the same or even a different family) trained on different portions of the dataset (with the same schema) interact to answer BQL queries is not a straightforward problem.
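To make "slightly overlapping sets" concrete, here is a minimal sketch of one possible way to split row ids into such subsets. It only illustrates the partitioning step; the function name `overlapping_subsets` and its parameters are hypothetical, not part of any existing API, and it says nothing about how the resulting GPMs would be combined at query time.

```python
import random


def overlapping_subsets(row_ids, n_subsets, overlap, seed=0):
    """Split row_ids into n_subsets shuffled chunks, each extended with
    `overlap` rows borrowed from the start of the next chunk (wrapping)."""
    shuffled = list(row_ids)
    n = len(shuffled)
    # With at least two subsets and overlap no larger than the smallest
    # chunk, no subset can contain the same row twice.
    if n_subsets < 2 or not 0 <= overlap <= n // n_subsets:
        raise ValueError("need n_subsets >= 2 and 0 <= overlap <= n // n_subsets")
    # A seeded generator keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    subsets = []
    for i in range(n_subsets):
        start = i * n // n_subsets
        end = (i + 1) * n // n_subsets
        # Indices past the end wrap around to the first chunk.
        subsets.append([shuffled[j % n] for j in range(start, end + overlap)])
    return subsets
```

Each subset could then be handed to its own model; how those models interact to answer a single query remains the open problem described above.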

@gregory-marton
Contributor

@fsaad I'm not sure that bagging or bootstrapping GPMs will be all that helpful. If it is, I feel like that's a longer-term project.

@riastradh-probcomp, what kinds of restrictions would you want to have?

I can think of a few sampling strategies you might want to choose from: first, random, evenly spaced... Or do you mean specifying a test on rows that decides whether they are eligible for sampling?
For the latter, I would lean towards asking users to do it in preprocessing, e.g. by saving a temp table, rather than trying to come up with a language for it.
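The three strategies named above could look roughly like this. This is a sketch over a plain Python list of rows; the function names are made up for illustration and do not correspond to any existing option in the project.

```python
import random


def subsample_first(rows, k):
    """Take the first k rows."""
    return list(rows[:k])


def subsample_random(rows, k, seed=0):
    """Take k rows pseudorandomly; the seed makes the choice reproducible."""
    rows = list(rows)
    return random.Random(seed).sample(rows, min(k, len(rows)))


def subsample_evenly_spaced(rows, k):
    """Take k rows at (approximately) equal strides through the table."""
    n = len(rows)
    k = min(k, n)
    return [rows[i * n // k] for i in range(k)]
```

For example, `subsample_evenly_spaced(list(range(10)), 5)` picks the rows at indices 0, 2, 4, 6 and 8.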

@riastradh-probcomp
Contributor Author

I don't remember what I was thinking when I made this, other than that I don't think I had in mind any particular mechanism.

Whatever I meant, pseudorandom subsampling is doubtless a more immediately fruitful, and perhaps largely sufficient, strategy.

@gregory-marton
Contributor

If we're already using a subsample, and someone wants to run a query that focuses on a particular sub-population, that might be a good time to re-subsample from the full population, selecting for that query. The new subsample might want to go into new generators (perhaps with a new label? #313) and get some new analysis.
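As a rough illustration of re-subsampling for a sub-population, the sketch below uses Python's standard `sqlite3` module directly rather than BQL. It assumes the full population lives in an SQLite table; the table names `patients` and `patients_focus` and the function `resubsample_subpopulation` are hypothetical. Creating generators on the resulting table and analyzing them is not shown.

```python
import random
import sqlite3


def resubsample_subpopulation(conn, where_sql, params, k, seed=0):
    """Copy up to k pseudorandomly chosen rows of `patients` that match
    where_sql into a temp table `patients_focus`; return the row count.

    where_sql is interpolated into the query, so it must come from a
    trusted source; values go through `params` as bound parameters.
    """
    eligible = [row[0] for row in conn.execute(
        f"SELECT rowid FROM patients WHERE {where_sql} ORDER BY rowid",
        params)]
    # Seeded sampling over the eligible rowids keeps the result reproducible.
    chosen = random.Random(seed).sample(eligible, min(k, len(eligible)))
    conn.execute("DROP TABLE IF EXISTS temp.patients_focus")
    # "WHERE 0" copies the column names but no rows.
    conn.execute(
        "CREATE TEMP TABLE patients_focus AS SELECT * FROM patients WHERE 0")
    conn.executemany(
        "INSERT INTO patients_focus SELECT * FROM patients WHERE rowid = ?",
        [(rowid,) for rowid in chosen])
    return len(chosen)
```

A call such as `resubsample_subpopulation(conn, "age >= ?", (60,), 10)` would draw the new subsample from the whole table rather than from an earlier subsample, which is the point made above.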
