Standardize DenseMatrix by shifting data #414

Open
mlondschien opened this issue Nov 12, 2024 · 4 comments

@mlondschien
Contributor

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p) ** 2
   ...: stds = rng.exponential(10, p) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)) * stds + means
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n) / n, True, True)
   ...: 
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n)

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: 
   ...: 
50.9 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: 
   ...: 
34.2 ms ± 602 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: 
352 ms ± 24.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
0.06973587287713845
0.0697358728771389
8.881784197001252e-16

In [6]: print(np.max(np.abs(sandwich1.T - sandwich1)))
0.0

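This is with #408 included: the sandwich of the matrix returned by standardize is slower than the sandwich of the explicitly standardized DenseMatrix, and the two results differ by about 0.07. I guess centering the data explicitly results in an additional copy.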

@mlondschien
Contributor Author

xref Quantco/glum#872

This also happens if weights is normalized to sum to 1.

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p) ** 2
   ...: stds = rng.exponential(10, p) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)) * stds + means
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n) / n, True, True)
   ...: 
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n)
   ...: weights /= weights.sum()

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: 
   ...: 
50.6 ms ± 2.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: 
   ...: 
34.5 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: 
365 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
0.06973587287713845
0.0697358728771389
8.881784197001252e-16

In [6]: weights.sum()
Out[6]: np.float64(1.0)

@mlondschien
Contributor Author

To make matters worse, the discrepancy is much larger in float32:

In [1]: import numpy as np
   ...: import tabmat
   ...: 
   ...: n = 5_000
   ...: p = 1_000
   ...: 
   ...: rng = np.random.default_rng(0)
   ...: means = rng.exponential(10, p).astype(np.float32) ** 2
   ...: stds = rng.exponential(10, p).astype(np.float32) ** 2
   ...: 
   ...: X = rng.uniform(size=(n, p)).astype(np.float32) * stds + means
   ...: X = X.astype(np.float32)
   ...: 
   ...: matrix = tabmat.DenseMatrix(X)
   ...: standardized_matrix1, emp_mean1, emp_std1 = matrix.standardize(np.ones(n).astype(np.float32) / n, True, True)
   ...: 
   ...: emp_mean2 = X.mean(axis=0)
   ...: emp_std2 = X.std(axis=0)
   ...: X = (X - emp_mean2) / emp_std2
   ...: standardized_matrix2 = tabmat.DenseMatrix(X)
   ...: 
   ...: weights = rng.uniform(size=n).astype(np.float32)
   ...: weights /= weights.sum()

In [2]: %%timeit
   ...: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: 
   ...: 
26.5 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %%timeit
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: 
   ...: 
19.2 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %%timeit
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: 
174 ms ± 6.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [5]: sandwich1 = standardized_matrix1.sandwich(weights)
   ...: sandwich2 = standardized_matrix2.sandwich(weights)
   ...: sandwich3 = X.T @ np.diag(weights) @ X
   ...: 
   ...: print(np.max(np.abs(sandwich1 - sandwich2)))
   ...: print(np.max(np.abs(sandwich1 - sandwich3)))
   ...: print(np.max(np.abs(sandwich2 - sandwich3)))
32769.0
32769.0
8.34465e-07

@stanmart
Collaborator

Interesting, the speed difference remains even when I look at larger/longer matrices (e.g. 5_000_000 x 100). It might be worth looking into why StandardizedMatrix.sandwich is so much slower than DenseMatrix.sandwich.

> I guess centering the data explicitly results in an additional copy.

This is a big deal IMO. Even if we implement standardizing by modifying the data, it should be optional and we should keep the possibility to do it without a copy.
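
For reference, a minimal NumPy sketch of what standardizing "without a copy" can look like for a matrix-vector product. This is not tabmat's code; the names `mult`, `shift`, `explicit` and `lazy` are made up for illustration, and the only assumption is that the standardized matrix can be written as `X * mult + shift` with per-column `mult` and `shift`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1_000, 20

X = rng.uniform(size=(n, p)) * 3.0 + 50.0  # raw, unstandardized data
mult = 1.0 / X.std(axis=0)                 # per-column scale
shift = -X.mean(axis=0) * mult             # per-column offset
v = rng.normal(size=p)

# Explicit standardization: allocates a second (n, p) array.
Z = X * mult + shift
explicit = Z @ v

# Lazy standardization: X is left untouched, and only length-p and
# length-n temporaries are created.
# (X * mult + shift) @ v == X @ (mult * v) + shift @ v
lazy = X @ (mult * v) + shift @ v

print(np.max(np.abs(explicit - lazy)))
```

The lazy form trades the extra (n, p) allocation for a few extra vector operations, which is why keeping it as an option matters for large matrices.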

@mlondschien
Contributor Author

It might also be worth it to investigate why StandardizedMatrix.sandwich appears to return the wrong result, independently of speed.
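
One plausible explanation, which is an assumption and not confirmed anywhere in this thread, is cancellation: if the sandwich of a lazily standardized matrix is computed by expanding the product into terms on the raw data, each term has magnitude around (mean / std)^2 while the result is O(1). The plain NumPy sketch below illustrates that effect in float32. The helpers `sandwich_explicit` and `sandwich_expanded` are hypothetical names, not tabmat functions, and the data is simplified compared to the snippets above.

```python
import numpy as np

def sandwich_explicit(X, mean, std, w):
    """Standardize first, then compute Z.T @ diag(w) @ Z."""
    Z = (X - mean) / std
    return (Z * w[:, None]).T @ Z

def sandwich_expanded(X, mean, std, w):
    """Same quantity without forming Z, using Z = X * mult + shift."""
    mult = 1.0 / std
    shift = -mean * mult
    XtWX = (X * w[:, None]).T @ X   # raw-data sandwich, entries ~ mean**2
    Xtw = (X.T @ w) * mult          # scaled weighted column sums
    return (
        mult[:, None] * XtWX * mult[None, :]
        + np.outer(Xtw, shift)
        + np.outer(shift, Xtw)
        + w.sum() * np.outer(shift, shift)
    )

rng = np.random.default_rng(0)
n, p = 2_000, 50
means = rng.uniform(1e3, 1e4, p)    # offsets that are large relative to the spread
X32 = (rng.uniform(size=(n, p)) + means).astype(np.float32)
w32 = rng.uniform(size=n).astype(np.float32)
w32 /= w32.sum()

# float64 reference computed from the same (already rounded) inputs
X64, w64 = X32.astype(np.float64), w32.astype(np.float64)
mean64, std64 = X64.mean(axis=0), X64.std(axis=0)
reference = sandwich_explicit(X64, mean64, std64, w64)

mean32, std32 = mean64.astype(np.float32), std64.astype(np.float32)
err_explicit = np.max(np.abs(sandwich_explicit(X32, mean32, std32, w32) - reference))
err_expanded = np.max(np.abs(sandwich_expanded(X32, mean32, std32, w32) - reference))
print(err_explicit, err_expanded)
```

If StandardizedMatrix.sandwich uses an expansion of this kind, shifting the data explicitly would avoid the large intermediate terms, at the cost of the extra copy discussed above.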
