Implement Series.cov by lopez- · Pull Request #1620 · databricks/koalas

lopez- · 2020-06-30T05:48:38Z

This PR proposes Series.cov

>>> s1 = ks.Series([1, 2, 3, 4])
>>> s2 = ks.Series([5, 6, 7, 8])
>>> s1
0    1
1    2
2    3
3    4
Name: 0, dtype: int64

>>> s2
0    5
1    6
2    7
3    8
Name: 0, dtype: int64

>>> s1.cov(s2)
1.666666...
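
For reference, the value above is consistent with the sample covariance, normalized by N - 1. The following is a minimal pure-Python sketch of that formula, not the Koalas implementation; `sample_cov` is a hypothetical helper name used only for illustration.

```python
def sample_cov(xs, ys):
    """Sample covariance with the N - 1 normalization (hypothetical helper)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        # This sketch does no alignment; it simply refuses mismatched input.
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, divided by N - 1.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


result = sample_cov([1, 2, 3, 4], [5, 6, 7, 8])
print(result)
```

The deviations from the means are the same for both inputs here, so the sum of products is 5 and the result is 5 / 3.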

itholic · 2020-06-30T05:57:15Z

databricks/koalas/series.py

+        Parameters
+        ----------
+        other : Series
+        min_periods : int


Maybe min_periods should also be marked as optional?

It falls back to zero when nothing is passed for min_periods.

itholic · 2020-06-30T06:03:08Z

databricks/koalas/tests/test_ops_on_diff_frames.py

+        k_isnan = np.isnan(kser.cov(kser_other, 4))
+        p_isnan = np.isnan(pser.cov(pser_other, 4))
+        self.assert_eq(k_isnan, p_isnan)
+


Can we have a test when each Series has a different index and an Exception case?

For example,

kser = ks.Series([90, 91, 85], index=[1, 2, 3])
pser = kser.to_pandas()
kser_other = ks.Series([90, 91, 85], index=[-1, -2, -3])
pser_other = kser_other.to_pandas()
self.assert_eq(kser.cov(kser_other), pser.cov(pser_other), almost=True)

and

self.assertRaises(ValueError, lambda: kser.cov([90, 91, 85]))  # 'other' must be a Series
self.assertRaises(ValueError, lambda: kser.cov(ks.Series([90])))  # series are not aligned

itholic · 2020-06-30T06:15:58Z

databricks/koalas/series.py

+            return self._internal.spark_frame.cov(self.name, other.name)
+        else:
+            # if not on the same anchor calculate covariance manually
+            return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1)


len(self.index) is evaluated four times in this code.

How about assigning it to a variable and reusing it?
(e.g. len_index = len(self.index) on the line just before it is first used)

itholic · 2020-06-30T06:27:42Z

Could you also add this to the docs?

The file is docs/source/reference/series.rst :)

itholic · 2020-06-30T06:28:03Z

Otherwise, looks fine to me.

Thanks, @lopez-

ueshin · 2020-06-30T21:15:51Z

databricks/koalas/series.py

+        if len(self.index) != len(other.index):
+            raise ValueError("series are not aligned")


Where is this from? Seems like pandas works even with a different length of Series.

>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5

Oops, I missed it. Thanks, @ueshin.

Mmm this is interesting. Seems like pandas performs an alignment between the series before computing the covariance. So, this:

>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5

And this:

>>> pd.Series([1, 2]).cov(pd.Series([5, 6]))
0.5

are equivalent... I believe this alignment is not supported in Koalas today. If this is a blocker, I could open an issue and wait until somebody implements it. Another option is to go ahead with slightly different behavior for this edge case while we wait for the align implementation. Do you have any thoughts or preferences on how to proceed, @itholic @ueshin?
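
To make the alignment behavior discussed above concrete, here is a minimal pure-Python sketch. It assumes an inner join on index labels before the covariance is computed, which is what the two pandas results above suggest. Plain dicts stand in for Series, and `aligned_cov` is a hypothetical name; this is not pandas or Koalas code.

```python
import math


def aligned_cov(left, right):
    """Sample covariance over the index labels shared by both inputs.

    `left` and `right` map index label -> value (a stand-in for a Series).
    """
    # Inner join: keep only the labels present on both sides.
    common = [label for label in left if label in right]
    n = len(common)
    if n < 2:
        # Too few overlapping points for the N - 1 normalization.
        return math.nan
    mean_l = sum(left[k] for k in common) / n
    mean_r = sum(right[k] for k in common) / n
    return sum((left[k] - mean_l) * (right[k] - mean_r) for k in common) / (n - 1)


long_short = aligned_cov({0: 1, 1: 2, 2: 3, 3: 4}, {0: 5, 1: 6})
short_short = aligned_cov({0: 1, 1: 2}, {0: 5, 1: 6})
disjoint = aligned_cov({1: 90, 2: 91, 3: 85}, {-1: 90, -2: 91, -3: 85})
print(long_short, short_short, disjoint)
```

Under this assumption the first two calls agree, because the extra labels 2 and 3 are dropped by the join, and two inputs with no labels in common yield NaN in this sketch.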

@lopez- Could you file the issue for align?
Also, is it possible for you to implement it?

ueshin · 2020-06-30T21:21:31Z

databricks/koalas/tests/test_ops_on_diff_frames.py

+        kser = ks.Series([90, 91, 85])
+        pser = kser.to_pandas()
+        kser_other = ks.Series([90, 91, 85])
+        pser_other = kser_other.to_pandas()


Please define the pandas object first. to_pandas() invokes extra Spark jobs, which makes the tests slower.

pser = pd.Series([90, 91, 85])
kser = ks.from_pandas(pser)

ueshin · 2020-06-30T21:26:39Z

databricks/koalas/series.py


+    def cov(self, other: "Series", min_periods: Optional[int] = None) -> float:
+        """
+        Return the covariance between two series.


Shall we just copy the docstring from pandas, with a few modifications to the examples?

ueshin · 2020-06-30T21:27:50Z

databricks/koalas/series.py

+            raise ValueError("series are not aligned")
+
+        min_periods = 0 if min_periods is None else min_periods
+        if len(self.index) < min_periods or len(self.index) <= 1:


We should also compare len(self.index) with min_periods?

>>> pd.Series([1, 2]).cov(pd.Series([5, 6, 7, 8]), min_periods=3)
nan
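
One way to read the min_periods check is to count the valid pairs that remain once both inputs are matched on their labels, and to return NaN when fewer than min_periods remain. The sketch below is pure Python with dicts standing in for Series. `cov_min_periods` is a hypothetical name, and treating a missing min_periods as 1 is an assumption of this sketch (the diff above uses 0).

```python
import math


def cov_min_periods(left, right, min_periods=None):
    """Covariance that yields NaN when too few valid pairs remain (sketch)."""
    # Assumption of this sketch: a missing min_periods behaves like 1.
    required = 1 if min_periods is None else min_periods
    # Keep pairs whose label exists on both sides and whose values are not NaN.
    pairs = [
        (left[k], right[k])
        for k in left
        if k in right and not math.isnan(left[k]) and not math.isnan(right[k])
    ]
    n = len(pairs)
    if n < required or n < 2:
        return math.nan
    mean_l = sum(a for a, _ in pairs) / n
    mean_r = sum(b for _, b in pairs) / n
    return sum((a - mean_l) * (b - mean_r) for a, b in pairs) / (n - 1)


left = {0: 1, 1: 2}
right = {0: 5, 1: 6, 2: 7, 3: 8}
too_few = cov_min_periods(left, right, min_periods=3)
enough = cov_min_periods(left, right)
print(too_few, enough)
```

Only two pairs survive the matching step here, so asking for three periods gives NaN while the default gives 0.5.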

ueshin · 2020-06-30T21:36:30Z

databricks/koalas/series.py

+
+        if same_anchor(self, other):
+            # if they have the same anchor, use the more performant Spark native `cov`
+            return self._internal.spark_frame.cov(self.name, other.name)


self._kdf._internal.resolved_copy.spark_frame.cov(
    self._internal.data_spark_column_names[0],
    other._internal.data_spark_column_names[0],
)

?

FYI: self.name won't always be the same as the underlying Spark DataFrame column name. See the description of #1554.

ueshin · 2020-06-30T21:42:45Z

databricks/koalas/series.py

+            # if not on the same anchor calculate covariance manually
+            return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1)


Maybe we should create a new DataFrame and use it, something like:

kdf = self._kdf.copy()
tmp_column = verify_temp_column_name(kdf, '__tmp_column__')
kdf[tmp_column] = other
return kdf._kser_for(self._column_label).cov(kdf._kser_for(tmp_column), min_periods=min_periods)

I haven't checked the code, so please adjust it as needed to make it work.

Btw, we should do this at the beginning of this method to avoid the extra length checks.

itholic · 2021-01-11T09:43:20Z

Any updates here? Just checking in :)

xinrong-meng · 2021-08-03T23:06:32Z

Hi @lopez- , since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36401. Otherwise, I may do that for you next week.

set up cov with tests

cd9c1be

itholic reviewed Jun 30, 2020

ueshin reviewed Jun 30, 2020

lopez- added 2 commits July 19, 2020 21:57

adapt docstring to pandas

efdee72

define pandas object first to avoid generating extra spark jobs

a8138c0
