MemoryError when fitting on sparse X, as apparently a Hessian matrix is being instantiated? #485
Comments
Hi @mathurinm! Unfortunately, glum is not the right package to estimate this type of model. Are you sure your design matrix is correctly specified? You are trying to train a model containing potentially 4.3 million coefficients with fewer than 20k observations (according to the description of the dataset here). You will probably need to transform your design matrix to reduce the number of coefficients before using glum for this purpose. In general, glum is fine-tuned to solve problems involving many observations (potentially millions) with a number of parameters significantly lower than this (a few thousand at most).
I just wanted to leave a comment saying that implementing a solver more appropriate for this purpose would not be a huge undertaking, and 75% of the necessary pieces already exist within glum.
@MarcAntoineSchmidtQC I may be wrong in the way I do it, but I am trying to fit a Lasso. Statistical analysis of the Lasso shows that it recovers the relevant variables even when the number of features is exponential in the number of observations. Sparse solvers such as Celer, scikit-learn, Blitz, or glmnet handle this problem. I saw your impressive benchmark results and tried to reproduce them on large-scale data.
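As a rough illustration of that claim (a minimal sketch on synthetic sparse data, not the data from this report): scikit-learn's coordinate-descent Lasso accepts a sparse design matrix directly and never allocates a dense n_features x n_features matrix.

```python
# Minimal sketch (synthetic data, not from this report): scikit-learn's Lasso
# fits directly on a sparse design matrix without densifying it.
import numpy as np
from scipy import sparse
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = sparse.random(1_000, 50_000, density=1e-3, format="csr", random_state=rng)
y = rng.randn(1_000)

clf = Lasso(alpha=0.1, fit_intercept=False)
clf.fit(X, y)  # memory usage scales with nnz(X), not with n_features ** 2
print((clf.coef_ != 0).sum(), "nonzero coefficients")
```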
Requirement: pip install libsvmdata -- a Python utility to download data from the LIBSVM website; the first time, downloading the data may take about 2 minutes.

The following script causes a MemoryError on my machine:
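(The original script is not preserved in this thread. Below is a hypothetical reconstruction of the kind of call being described, assuming a high-dimensional LIBSVM dataset fetched via libsvmdata and an L1-penalized fit with glum; the dataset name and alpha value are placeholders, not taken from the report.)

```python
# Hypothetical reconstruction (placeholders, not the original script): a
# Lasso-type fit with glum on a sparse LIBSVM dataset where n_samples << n_features.
from glum import GeneralizedLinearRegressor
from libsvmdata import fetch_libsvm

X, y = fetch_libsvm("news20.binary")  # placeholder dataset; returns a sparse CSR matrix

# l1_ratio=1.0 makes the elastic-net penalty a pure L1 (Lasso) penalty.
model = GeneralizedLinearRegressor(family="normal", l1_ratio=1.0, alpha=0.01)
model.fit(X, y)  # raises MemoryError if a dense Hessian over all features is allocated
```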
output:
ping @QB3