glum LASSO fit not finishing when same problem fits within 1 second in glmnet #709
Comments
The short answer to this is that glum is not designed to solve that type of problem. From the documentation (in the benchmark section): "Note that glum was originally developed to solve problems where N >> K (number of observations is larger than the number of predictors), which is the case for the following benchmarks." I know that there is a large subset of problems where the number of coefficients is close to or larger than the number of observations, but that's simply not what glum was designed for. The algorithm currently materializes the full Hessian matrix, which in your case would be a 10,000 x 10,000 matrix (roughly 800 MB as a dense double-precision matrix) and would be very inefficient. There has been some effort to allow glum to work well in this type of situation, but it is currently not the case. We should maybe make this more explicit in the documentation, and, as you suggested, we should update our benchmarks.
Ha, that's a shame... I should say that even for the n > p case I mostly found glmnet to outperform glum, though I haven't tried with very large n and very small p, and this was just in some quick tests, so I think the benchmarks definitely need updating... For very tall problems with n >> p you might also like to look at the oem package and compare performance with what you have.

In case you would be interested in eventually also covering the p >= n case and allowing for good sparse GLM fits for that case: what I did was simply to take the plain R glm IRLS algorithm and replace the weighted least-squares step with a ridge regression with adaptive penalties. The plain IRLS loop more or less comes down to this (a rough sketch, not my exact code):
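```r
## Bare-bones IRLS for a GLM: repeatedly solve a weighted least-squares
## problem on the working response until the coefficients stop changing.
## (Illustrative only; assumes X already contains an intercept column.)
irls_glm <- function(X, y, family = gaussian(), maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  eta  <- as.vector(X %*% beta)
  for (it in seq_len(maxit)) {
    mu   <- family$linkinv(eta)
    dmu  <- family$mu.eta(eta)                    # d mu / d eta
    w    <- dmu^2 / family$variance(mu)           # IRLS working weights
    z    <- eta + (y - mu) / dmu                  # working response
    beta_new <- solve(crossprod(X, w * X), crossprod(X, w * z))  # weighted LS step
    if (max(abs(beta_new - beta)) < tol) { beta <- as.vector(beta_new); break }
    beta <- as.vector(beta_new)
    eta  <- as.vector(X %*% beta)
  }
  list(coefficients = beta, fitted.values = family$linkinv(as.vector(X %*% beta)))
}
```

The solve of the weighted normal equations in that loop is the step that gets swapped for the adaptive ridge solve.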
Given that this is fit using ridge penalties, this procedure also works very well for high-dimensional cases, and for the p >> n case it is also possible to use the Woodbury identity to solve the ridge regression more efficiently (requiring one to solve just an n x n system as opposed to a p x p one), as in Liu et al. 2017. But it is also very efficient for very tall problems with n >> p or with n = p; problems with millions of observations or variables are no problem. Usually fewer than 20 IRLS iterations are needed for this algorithm to converge, and it doesn't need cross-validation to tune the level of regularisation.

The ridge regression can in general be solved with any least-squares solver (sparse or dense, e.g. the Eigen Cholesky ones, coordinate descent, or the Eigen least-squares conjugate gradient solver), which makes this algorithm quite easy to implement. One simple implementation is to augment your covariate matrix with a diagonal matrix with sqrt(lambdas) along the diagonal (or a general Tikhonov regularisation matrix to support more exotic penalties, e.g. fused penalties) and to augment y with zeros (or with sqrt(lambdas)*prior.means if you would like to also support shrinking towards non-zero-centered Gaussian priors); see the sketch at the end of this comment. Box constraints or nonnegativity constraints implemented via simple clipping in each IRLS step also work very well - almost as well as imposing hard constraints with a quadratic programming solver such as osqp (the latter would also readily allow one to impose linear inequality constraints).

Anyway, let me know if at any stage in the future you might be interested in supporting L0 penalties for either p > n or n > p models; I could advise you on that. An extension for multitask learning problems with multivariate outcomes is also straightforward: start by fitting each task independently, average the coefficients over the different tasks, and use this average as the starting coefficients for the fits on each task (so that features with large average coefficients across all tasks get smaller adaptive ridge penalties). Iterate this process until the features selected across all tasks converge. As a final step one can do a relaxed fit on the selected features with a low, near-zero penalty. These relaxed group L0 penalized GLMs massively outperform e.g. group LASSO (mgaussian) glmnet fits and readily extend to any GLM distribution. I'm not sufficiently proficient in Python to take on any development work myself, but I would be happy to advise on the required steps, based on my experience developing an R package to fit L0 and group L0 penalized GLMs, with performance in terms of timing equal to or better than glmnet, abess and L0Learn, and performance in terms of quality of solution generally much better (often spectacularly so, especially for the multitask learning case)...
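The augmentation trick I mentioned looks roughly like this (an illustrative sketch of a single penalized weighted least-squares step, not code from my package; the function and argument names are made up):

```r
## One ridge-penalized weighted least-squares step via data augmentation:
## minimizes sum_i w_i (z_i - x_i' b)^2 + sum_j lambda_j (b_j - prior_j)^2.
ridge_wls_step <- function(X, z, w, lambdas,
                           prior_means = rep(0, ncol(X)), nonneg = FALSE) {
  p    <- ncol(X)
  Xaug <- rbind(sqrt(w) * X, diag(sqrt(lambdas), p))   # append sqrt(lambda_j) rows
  zaug <- c(sqrt(w) * z, sqrt(lambdas) * prior_means)  # zeros unless shrinking to prior means
  beta <- qr.solve(Xaug, zaug)                         # any least-squares solver works here
  if (nonneg) beta <- pmax(beta, 0)                    # nonnegativity via simple clipping
  beta
}
```

Plugging something like this in for the weighted least-squares solve in the IRLS sketch above gives the penalized IRLS procedure I described.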
Thanks a lot for your detailed reply. This gives us a lot to think about and evaluate. I will try updating the benchmarks this week or next. We don't currently have the short-term resources to tackle this, but I will be on the lookout for an eventual contributor willing to take the dive into this. I have never worked with L0 penalties before, but I will be talking to other people using the package to gauge their interest.
Thanks for that! FYI - I was benchmarking the speed of some GLM fitting engines in R the other day (speedglm, fastglm, parglm, bigglm, glmnet's bigGlm), and for tall datasets with n >> p, parglm's parglm.fit function appeared the most promising for big datasets... A Gamma GLM with n = 10000000 and p = 70 runs in 9 s (or 4.9 s if you directly call the underlying C++ function) on my laptop, which I think is pretty performant (this was using R with Intel MKL installed as the BLAS).
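The kind of timing run I mean is roughly the following (quoted from memory, so treat the exact parglm arguments and the data-generating parameters as assumptions and check its documentation):

```r
## Rough sketch of the kind of benchmark described above (argument names of
## parglm quoted from memory; the simulation parameters are made up).
library(parglm)

n <- 1e7; p <- 70                                  # note: X alone takes ~5.6 GB of RAM
X <- matrix(rnorm(n * p), n, p)
beta_true <- rnorm(p, sd = 0.1)
mu <- exp(1 + as.vector(X %*% beta_true))
y  <- rgamma(n, shape = 2, rate = 2 / mu)          # Gamma outcome with mean mu

system.time(
  fit <- parglm(y ~ X, family = Gamma(link = "log"),
                control = parglm.control(nthreads = 8L))
)
```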
Was just testing glum, but I ran into a problem where a glum LASSO fit does not appear to finish (or if it does, it does so extremely slowly). The data I simulated are a spike train blurred with a Gaussian point spread function, with some Gaussian noise added, and n = p = 10000; the sparse covariate matrix X consists of 10000 time-shifted copies of a Gaussian point spread function (i.e. it is a banded matrix), and 1000 out of the 10000 coefficients are nonzero (all positive, i.e. using nonnegativity constraints). The simulated data are here:
https://github.com/tomwenseleers/glmnet-benchmark/tree/master/data
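For reference, data of this kind can be simulated roughly as follows (an illustrative sketch only; the actual data files and generation script are in the linked repo, and the parameter values below are made up):

```r
## Simulate a noisy spike train blurred with a Gaussian point spread function,
## with a sparse banded covariate matrix of time-shifted PSF copies.
library(Matrix)
set.seed(1)

n <- p <- 10000
sigma_psf <- 10                                   # PSF width (assumed value)
halfw <- ceiling(4 * sigma_psf)                   # truncate the PSF at +/- 4 sd

offs  <- -halfw:halfw
j_idx <- rep(seq_len(p), each = length(offs))     # column = which time-shifted copy
i_idx <- j_idx + offs                             # rows within the band around each centre
keep  <- i_idx >= 1 & i_idx <= n
vals  <- dnorm(i_idx[keep], mean = j_idx[keep], sd = sigma_psf)
X <- sparseMatrix(i = i_idx[keep], j = j_idx[keep], x = vals, dims = c(n, p))

beta <- numeric(p)
nonzero <- sample.int(p, 1000)                    # 1000 of 10000 coefficients nonzero
beta[nonzero] <- runif(1000, min = 1, max = 10)   # all positive
y <- as.vector(X %*% beta) + rnorm(n, sd = 1)     # blurred spikes + Gaussian noise
```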
However, if I try to fit the same LASSO model using glum, it never appears to finish (or if it does, it does so extremely slowly). The syntax I used is this one - am I doing something wrong here?
https://github.com/tomwenseleers/glmnet-benchmark/blob/master/bench_glum.py
If I fit the same problem in glmnet (version glmnet_4.1-8, which is the recent version redone in Rcpp), it finishes in less than a second: https://github.com/tomwenseleers/glmnet-benchmark/blob/master/bench_glmnet.R
Fitting it with the covariate matrix coded as dense is slower, but still doable.
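For reference, the kind of glmnet call I mean is roughly the following (an illustrative sketch; the exact settings and timings are in the bench_glmnet.R script linked above):

```r
## Nonnegative LASSO fit with glmnet on the sparse (banded) covariate matrix;
## X and y as simulated above. Converting X to a dense matrix also works,
## just more slowly.
library(glmnet)

system.time(
  fit <- glmnet(X, y, family = "gaussian",
                alpha = 1,              # alpha = 1 -> LASSO penalty
                lower.limits = 0,       # nonnegativity constraints
                standardize = FALSE)
)
```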
And incidentally, my own R package that's in development (L0glm), which fits L0 penalized GLMs using an iterative adaptive ridge procedure, can fit it 4x faster still than glmnet and also returns a much better quality solution, with far fewer false positives, more true positives and better predictive performance (this uses a very simple algorithm where the weighted least-squares step in the GLM IRLS algorithm is replaced with a ridge regression with adaptive penalties, using the Eigen Cholesky LLT solver).
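Not the actual L0glm code, but the iterative adaptive ridge idea is roughly this (one common reweighting choice from the adaptive ridge literature; the helper fit_ridge_glm is hypothetical and stands for a penalized IRLS fit as sketched earlier):

```r
## Approximate an L0 penalty by iteratively refitting a ridge-penalized GLM
## with adaptive per-coefficient penalties: coefficients that stay large get
## a small penalty, coefficients near zero get a huge one.
adaptive_ridge_l0 <- function(X, y, family = gaussian(), lambda0 = 1,
                              eps = 1e-8, tol = 1e-4, maxit = 20) {
  p <- ncol(X)
  lambdas <- rep(lambda0, p)
  selected_old <- rep(TRUE, p)
  beta <- numeric(p)
  for (it in seq_len(maxit)) {
    beta <- fit_ridge_glm(X, y, family, lambdas)   # penalized IRLS fit (hypothetical helper)
    lambdas <- lambda0 / (beta^2 + eps)            # adaptive reweighting: approximates L0
    selected <- abs(beta) > tol                    # currently selected features
    if (identical(selected, selected_old)) break   # stop when the selected set stabilizes
    selected_old <- selected
  }
  beta
}
```

A final relaxed fit on the selected features with a near-zero penalty would then follow, as described in my earlier comment.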
Any thoughts on what I might be doing wrong in my usage of glum? Or is this rather a bug?
Also, would it be possible to update the included benchmarks with a wider range of simulated data, e.g. generated using abess's function make_glm_data in https://github.com/abess-team/abess/blob/master/python/abess/datasets.py or, in R, https://github.com/abess-team/abess/blob/master/R-package/R/generate.data.R? E.g. for varying n and p, and for different correlation structures and error families? A sketch of what I have in mind is below.
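For instance, something along these lines with abess's R data generator (quoted from memory, so the exact argument names are an assumption; see generate.data.R linked above for the real interface):

```r
## Simulate benchmark datasets across a grid of problem sizes, correlation
## strengths and GLM families (argument names of abess::generate.data quoted
## from memory; check the linked source for the actual signature).
library(abess)

settings <- expand.grid(n      = c(1e3, 1e4, 1e5),
                        p      = c(100, 1e3, 1e4),
                        rho    = c(0, 0.5),                  # correlation between covariates
                        family = c("gaussian", "binomial", "poisson"),
                        stringsAsFactors = FALSE)

datasets <- lapply(seq_len(nrow(settings)), function(i) {
  s <- settings[i, ]
  generate.data(n = s$n, p = s$p, family = s$family,
                support.size = round(0.1 * min(s$n, s$p)),   # 10% truly nonzero coefficients
                rho = s$rho, seed = i)
})
```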
Right now I feel the included benchmarks are out of date with developments in other packages...