Conversation
databricks/koalas/series.py
Outdated
| Parameters | ||
| ---------- | ||
| other : Series | ||
| min_periods : int |
There was a problem hiding this comment.
Maybe min_periods also can be an optional ?
- because It will be a zero when nothing is given for
min_periods
| k_isnan = np.isnan(kser.cov(kser_other, 4)) | ||
| p_isnan = np.isnan(pser.cov(pser_other, 4)) | ||
| self.assert_eq(k_isnan, p_isnan) | ||
|
|
There was a problem hiding this comment.
Can we have a test when each Series has a different index and an Exception case?
For example,
kser = ks.Series([90, 91, 85], index=[1, 2, 3])
pser = kser.to_pandas()
kser_other = ks.Series([90, 91, 85], index=[-1, -2, -3])
pser_other = kser_other.to_pandas()
self.assert_eq(kser.cov(kser_other), pser.cov(pser_other), almost=True)and
self.assertRaises(ValueError, lambda: kser.cov([90, 91, 85])) # 'other' must be a Series
self.assertRaises(ValueError, lambda: kser.cov(ks.Series([90]))) # series are not aligned| return self._internal.spark_frame.cov(self.name, other.name) | ||
| else: | ||
| # if not on the same anchor calculate covariance manually | ||
| return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1) |
There was a problem hiding this comment.
len(self.index) is performed four times in this code.
What do you think about we assign a proper variable and reuse it?
(ex. len_index = len(self.index) at the line above this variable is first used)
|
Could you add this to the docs also ?? It is placed at |
|
Otherwise, looks fine to me. Thanks, @lopez- |
| if len(self.index) != len(other.index): | ||
| raise ValueError("series are not aligned") |
There was a problem hiding this comment.
Where is this from? Seems like pandas works even with a different length of Series.
>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5There was a problem hiding this comment.
Mmm this is interesting. Seems like pandas performs an alignment between the series before computing the covariance. So, this:
>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5And this:
>>> pd.Series([1, 2]).cov(pd.Series([5, 6]))
0.5are equivalent... I believe this align is not supported in Koalas today. If this is a blocker I could open an issue and wait until somebody implements this. Another option I can think of is to go ahead and have a slightly different behavior for this edge case while we wait for the align implementation. Do you have any thoughts/preference on how to go about this @itholic @ueshin ?
There was a problem hiding this comment.
@lopez- Could you file the issue for align?
Also, is it possible for you to implement it?
| kser = ks.Series([90, 91, 85]) | ||
| pser = kser.to_pandas() | ||
| kser_other = ks.Series([90, 91, 85]) | ||
| pser_other = kser_other.to_pandas() |
There was a problem hiding this comment.
Please define pandas object first. to_pandas() invokes extra Spark jobs and it will take more time for tests.
pser = pd.Series([90, 91, 85])
kser = ks.from_pandas(pser)
databricks/koalas/series.py
Outdated
|
|
||
| def cov(self, other: "Series", min_periods: Optional[int] = None) -> float: | ||
| """ | ||
| Return the covariance between two series. |
There was a problem hiding this comment.
Shall we just copy the docstring from pandas' with a few modification of examples?
| raise ValueError("series are not aligned") | ||
|
|
||
| min_periods = 0 if min_periods is None else min_periods | ||
| if len(self.index) < min_periods or len(self.index) <= 1: |
There was a problem hiding this comment.
We should also compare len(self.index) with min_periods?
>>> pd.Series([1, 2]).cov(pd.Series([5, 6, 7, 8]), min_periods=3)
nan|
|
||
| if same_anchor(self, other): | ||
| # if the have the same anchor use the more performant Spark native `cov` | ||
| return self._internal.spark_frame.cov(self.name, other.name) |
There was a problem hiding this comment.
self._kdf._internal.resolved_copy.spark_frame.cov(
self._internal.data_spark_column_names[0],
other._internal.data_spark_column_names[0])?
FYI: self.name won't always be the same as the underlying Spark DataFrame column name. See the description of #1554.
| # if not on the same anchor calculate covariance manually | ||
| return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1) |
There was a problem hiding this comment.
Maybe we should create a new DataFrame and use it, something like:
kdf = self._kdf.copy()
tmp_column = verify_temp_column_name(kdf, '__tmp_column__')
kdf[tmp_column] = other
return kdf._kser_for(self._column_label).cov(kdf._kser_for(tmp_column), min_period=min_period)I haven't checked the code, so please modify as it works.
Btw, we should do this at the beginning of this method to avoid extra checks for length or something.
|
Any updates here ? Just confirming :) |
|
Hi @lopez- , since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36401. Otherwise, I may do that for you next week. |
This PR proposes Series.cov