Implement Series.cov #1620
base: master
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4858,6 +4858,51 @@ def mad(self): | |
|
|
||
| return mad | ||
|
|
||
| def cov(self, other: "Series", min_periods: Optional[int] = None) -> float: | ||
| """ | ||
| Compute covariance with Series, excluding missing values. | ||
|
|
||
| Parameters | ||
| ---------- | ||
| other : Series | ||
| Series with which to compute the covariance. | ||
| min_periods : int, optional | ||
| Minimum number of observations needed to have a valid result. | ||
|
|
||
| Returns | ||
| ------- | ||
| float | ||
| Covariance between Series and other normalized by N-1 | ||
| (unbiased estimator). | ||
|
|
||
| Examples | ||
| -------- | ||
| >>> import databricks.koalas as ks | ||
| >>> ks.set_option("compute.ops_on_diff_frames", True) | ||
| >>> s1 = ks.Series([0.90010907, 0.13484424, 0.62036035]) | ||
| >>> s2 = ks.Series([0.12528585, 0.26962463, 0.51111198]) | ||
| >>> s1.cov(s2) | ||
| -0.01685762652715874 | ||
| >>> ks.reset_option("compute.ops_on_diff_frames") | ||
| """ | ||
|
|
||
| if not isinstance(other, Series): | ||
| raise ValueError("'other' must be a Series") | ||
|
|
||
| if len(self.index) != len(other.index): | ||
| raise ValueError("series are not aligned") | ||
|
Comment on lines +4892 to +4893
Collaborator
Where is this from? It seems pandas works even when the Series have different lengths.
>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5
Contributor
Oops, I missed it. Thanks, @ueshin.
Author
Hmm, this is interesting. It seems pandas aligns the two Series before computing the covariance. So this:
>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5
and this:
>>> pd.Series([1, 2]).cov(pd.Series([5, 6]))
0.5
are equivalent... I believe this
Collaborator
@lopez- Could you file the issue for |
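The alignment behaviour described in this thread can be reproduced with pandas alone. The following is a minimal sketch, not part of the PR; the variable names are illustrative, and it only shows that an explicit inner-join alignment gives the same result as calling `cov` on Series of different lengths.

```python
import pandas as pd

left = pd.Series([1, 2, 3, 4])
right = pd.Series([5, 6])

# Inner-join alignment keeps only the index labels present in both Series.
left_aligned, right_aligned = left.align(right, join="inner")

# Covariance on the original Series and on the explicitly aligned ones.
cov_unaligned = left.cov(right)
cov_aligned = left_aligned.cov(right_aligned)

print(list(left_aligned.index), cov_unaligned, cov_aligned)
```

Only the labels 0 and 1 survive the alignment, so both calls work on the pairs (1, 5) and (2, 6).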
||
|
|
||
| min_periods = 0 if min_periods is None else min_periods | ||
| if len(self.index) < min_periods or len(self.index) <= 1: | ||
|
Collaborator
We should also compare
>>> pd.Series([1, 2]).cov(pd.Series([5, 6, 7, 8]), min_periods=3)
nan |
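One way to read this comment is that `min_periods` should be checked against the number of paired observations left after alignment, not against the length of either input. The sketch below assumes that reading; `aligned_cov` is a hypothetical pandas-only helper written for illustration, not the Koalas implementation.

```python
import math
from typing import Optional

import pandas as pd


def aligned_cov(left: pd.Series, right: pd.Series, min_periods: Optional[int] = None) -> float:
    """Hypothetical helper: covariance over index-aligned, non-missing pairs."""
    left, right = left.align(right, join="inner")
    valid = left.notna() & right.notna()
    n_obs = int(valid.sum())
    required = 1 if min_periods is None else min_periods
    # Compare min_periods with the number of paired observations,
    # and return NaN when the N-1 normalisation is undefined.
    if n_obs < required or n_obs <= 1:
        return float("nan")
    x = left[valid].astype(float)
    y = right[valid].astype(float)
    return float(((x - x.mean()) * (y - y.mean())).sum() / (n_obs - 1))


# Two paired observations: fewer than min_periods=3, enough for min_periods=2.
too_few = aligned_cov(pd.Series([1, 2]), pd.Series([5, 6, 7, 8]), min_periods=3)
enough = aligned_cov(pd.Series([1, 2]), pd.Series([5, 6, 7, 8]), min_periods=2)
print(too_few, enough)
```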
||
| return np.nan | ||
|
|
||
| if same_anchor(self, other): | ||
| # if they have the same anchor use the more performant Spark native `cov` | ||
| return self._internal.spark_frame.cov(self.name, other.name) | ||
|
Collaborator
self._kdf._internal.resolved_copy.spark_frame.cov(
    self._internal.data_spark_column_names[0],
    other._internal.data_spark_column_names[0])
? FYI: |
||
| else: | ||
| # if not on the same anchor calculate covariance manually | ||
| return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1) | ||
|
Contributor
What do you think about assigning a proper variable and reusing it?
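The manual fallback in the diff is the sample covariance, i.e. the dot product of the mean-centred values divided by N-1. As a sanity check, here is a small NumPy sketch (not part of the PR) that applies the same formula to the values from the docstring example, with the observation count assigned to a variable once; `n_obs` is an illustrative name.

```python
import numpy as np

# Values taken from the docstring example in the diff.
s1 = np.array([0.90010907, 0.13484424, 0.62036035])
s2 = np.array([0.12528585, 0.26962463, 0.51111198])

n_obs = len(s1)  # computed once and reused

# Dot product of mean-centred values, normalised by N-1.
manual_cov = np.dot(s1 - s1.mean(), s2 - s2.mean()) / (n_obs - 1)

# np.cov normalises by N-1 by default; [0, 1] is the cross term.
reference_cov = np.cov(s1, s2)[0, 1]

print(manual_cov, reference_cov)
```

Both values should agree with the result shown in the docstring up to floating-point rounding.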
Comment on lines +4903 to +4904
Collaborator
Maybe we should create a new DataFrame and use it, something like:
kdf = self._kdf.copy()
tmp_column = verify_temp_column_name(kdf, '__tmp_column__')
kdf[tmp_column] = other
return kdf._kser_for(self._column_label).cov(kdf._kser_for(tmp_column), min_periods=min_periods)
I haven't checked the code, so please modify it as needed to make it work. By the way, we should do this at the beginning of this method to avoid the extra checks for length and so on. |
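The idea behind this suggestion, attaching `other` as a temporary column so that both Series live in the same frame, can be illustrated with a pandas analogue. This is only a sketch of the idea under that assumption; it does not use the Koalas internals named above, and the column names are placeholders.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4], name="x")
other = pd.Series([5, 6])

frame = series.to_frame()
# Column assignment aligns `other` to the frame's index; unmatched rows become NaN.
frame["__tmp_column__"] = other

# Both columns now share one frame, and cov skips pairs that contain NaN.
combined_cov = frame["x"].cov(frame["__tmp_column__"])
print(combined_cov)
```

Because the two columns share one index after the assignment, no separate length check is needed before computing the covariance.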
||
|
|
||
| def unstack(self, level=-1): | ||
| """ | ||
| Unstack, a.k.a. pivot, Series with MultiIndex to produce DataFrame. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -948,6 +948,32 @@ def test_series_repeat(self): | |
| else: | ||
| self.assert_eq(kser1.repeat(kser2).sort_index(), pser1.repeat(pser2).sort_index()) | ||
|
|
||
| def test_cov(self): | ||
| pser = pd.Series([90, 91, 85]) | ||
| kser = ks.from_pandas(pser) | ||
| kser_other = ks.Series([90, 91, 85]) | ||
| pser_other = kser_other.to_pandas() | ||
|
|
||
| self.assert_eq(kser.cov(kser_other), pser.cov(pser_other), almost=True) | ||
|
|
||
| kser = ks.Series([90]) | ||
| pser = kser.to_pandas() | ||
| kser_other = ks.Series([85]) | ||
| pser_other = kser_other.to_pandas() | ||
|
|
||
| k_isnan = np.isnan(kser.cov(kser_other)) | ||
| p_isnan = np.isnan(pser.cov(pser_other)) | ||
| self.assert_eq(k_isnan, p_isnan) | ||
|
|
||
| kser = ks.Series([90, 91, 85]) | ||
| pser = kser.to_pandas() | ||
| kser_other = ks.Series([90, 91, 85]) | ||
| pser_other = kser_other.to_pandas() | ||
|
|
||
| k_isnan = np.isnan(kser.cov(kser_other, 4)) | ||
| p_isnan = np.isnan(pser.cov(pser_other, 4)) | ||
| self.assert_eq(k_isnan, p_isnan) | ||
|
|
||
|
Contributor
Can we have a test where each Series has a different index, and an Exception case? For example,
kser = ks.Series([90, 91, 85], index=[1, 2, 3])
pser = kser.to_pandas()
kser_other = ks.Series([90, 91, 85], index=[-1, -2, -3])
pser_other = kser_other.to_pandas()
self.assert_eq(kser.cov(kser_other), pser.cov(pser_other), almost=True)
and
self.assertRaises(ValueError, lambda: kser.cov([90, 91, 85]))  # 'other' must be a Series
self.assertRaises(ValueError, lambda: kser.cov(ks.Series([90])))  # series are not aligned |
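Note that in the first suggested case the two indexes share no labels. On the pandas side, alignment then leaves no paired observations and the result is NaN, so the test would be comparing NaN with NaN. A small pandas-only sketch of that reference behaviour follows; what Koalas returns is not checked here.

```python
import math

import pandas as pd

pser = pd.Series([90, 91, 85], index=[1, 2, 3])
pser_other = pd.Series([90, 91, 85], index=[-1, -2, -3])

# No common index labels: nothing to pair up after alignment.
disjoint_cov = pser.cov(pser_other)

# Same values on a shared index: reduces to the sample variance.
same_index_cov = pser.cov(pd.Series([90, 91, 85], index=[1, 2, 3]))

print(disjoint_cov, same_index_cov)
```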
||
|
|
||
| class OpsOnDiffFramesDisabledTest(ReusedSQLTestCase, SQLTestUtils): | ||
| @classmethod | ||
|
|
||