BUG: Subclassed DataFrame doesn't persist _metadata properties across binary operations #34177

Open
3 tasks done
clausmith opened this issue May 14, 2020 · 7 comments · May be fixed by #61101
Labels: Bug; metadata (_metadata, .attrs); Numeric Operations (Arithmetic, Comparison, and Logical operations); Subclassing (Subclassing pandas objects)

Comments

@clausmith

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


When subclassing a DataFrame, fields added to the _metadata property are only persisted across some operations (such as slicing) and not others (such as any arithmetic operation).

I would expect any properties defined on the subclass to persist whenever the result of an operation is an instance of the subclass.

The following is the example taken from the "Extending Pandas" docs: https://pandas.pydata.org/pandas-docs/stable/development/extending.html

import pandas as pd
class SubclassedDataFrame2(pd.DataFrame): 
 
    # temporary properties 
    _internal_names = pd.DataFrame._internal_names + ['internal_cache'] 
    _internal_names_set = set(_internal_names) 
 
    # normal properties 
    _metadata = ['added_property'] 
 
    @property 
    def _constructor(self): 
        return SubclassedDataFrame2 

df = SubclassedDataFrame2({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.internal_cache = "cached"
df.added_property = "property"

With the above setup, here's how to reproduce the problem:

>>> df.added_property
'property'

>>> df[["A", "B"]].added_property # this works as expected
'property'

>>> (df * 2).added_property # I would expect this to work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/clausmith/Developer/pandas-bug/pandas/pandas/core/generic.py", line 5220, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'SubclassedDataFrame2' object has no attribute 'added_property'

Problem description

The current behavior means that you can almost never rely on custom properties to persist on a subclassed DataFrame. This substantially reduces the utility of these custom properties.

Expected Output

I would expect added_property in the example above to persist after performing the arithmetic operation on the DataFrame, especially because the result of (df * 2) is still an instance of SubclassedDataFrame2.

Output of pd.show_versions()

commit           : 507cb1548d36bbf48c3084a78d59af2fed78a9d1
python           : 3.7.3.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 18.7.0
Version          : Darwin Kernel Version 18.7.0: Mon Feb 10 21:08:45 PST 2020; root:xnu-4903.278.28~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0.dev0+1576.g507cb1548
numpy            : 1.18.4
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 19.0.3
setuptools       : 40.8.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
@clausmith clausmith added the Bug and Needs Triage labels May 14, 2020
@TomAugspurger
Contributor

DataFrame binary ops (like mul) don't currently call NDFrame.__finalize__. We have an xfailing test for this:

    pytest.param(
        (
            pd.DataFrame,
            frame_data,
            operator.methodcaller("add", pd.DataFrame(*frame_data)),
        ),
        marks=not_implemented_mark,
    ),
    # TODO: div, mul, etc.

I don't think there would be any objection to calling finalize here. The primary API question is what to do with metadata / attrs when other is a Series or DataFrame. But for scalars there's no issue.
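As a rough illustration of the scalar case described above, here is a minimal subclass-level sketch that calls `__finalize__` after one binary op. `MetaFrame` is a hypothetical class made up for this example, and overriding `__mul__` in the subclass is only a workaround idea under that assumption, not the change proposed for pandas itself.

```python
import pandas as pd


class MetaFrame(pd.DataFrame):
    """Hypothetical subclass used only for this sketch."""

    _metadata = ["added_property"]

    @property
    def _constructor(self):
        return MetaFrame

    def __mul__(self, other):
        # Run the normal multiplication first, then copy the fields listed
        # in _metadata from self onto the result via NDFrame.__finalize__.
        result = super().__mul__(other)
        return result.__finalize__(self)


df = MetaFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.added_property = "property"

doubled = df * 2
print(type(doubled).__name__, doubled.added_property)
```

This always takes the metadata from the left operand, so it sidesteps rather than answers the open question of what should happen when `other` is a Series or DataFrame carrying its own metadata.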

@clausmith are you interested in working on this?

@TomAugspurger TomAugspurger added the metadata and Numeric Operations labels and removed the Bug and Needs Triage labels May 14, 2020
@TomAugspurger TomAugspurger added this to the Contributions Welcome milestone May 14, 2020
@TomAugspurger
Contributor

xref #28283 for the general issue. This can be specific to binops.

@TomAugspurger TomAugspurger changed the title from "BUG: Subclassed DataFrame doesn't persist _metadata properties across certain operations" to "BUG: Subclassed DataFrame doesn't persist _metadata properties across binary operations" May 14, 2020
@clausmith
Author

@TomAugspurger ah that's what I thought. I saw #28283 and figured this might be related. Unfortunately I'm a little out of my depth here (I'm a PM more so than an engineer 😬).

@mroeschke mroeschke added the Bug label May 14, 2020
@jbesomi

jbesomi commented May 22, 2020

Hi @TomAugspurger, I'm working on a side project with pandas and need this problem to be fixed. I would love to help. Can you please explain to me how I should proceed, first to understand the issue and then to fix it? Thank you.

@TomAugspurger
Contributor

TomAugspurger commented May 22, 2020 via email

@jbesomi

jbesomi commented May 22, 2020

Great, thank you. Will start from Series._binop then.

"As I mentioned earlier, it's not 100% clear how we'll propagate metadata / .attrs when the values differ": what do you mean by "the values differ"?

A related yet-different problem is to assign to a new or existing column of a pandas DataFrame a pandas Series containing metadata. In this case the metadata is lost as well. Do you think it will be possible to solve this problem too, or is there some reason why we should keep things as they are?

@TomAugspurger
Contributor

"A related yet-different problem is to assign to a new or existing column of a pandas Dataframe a pandas Series containing metadata"

In general I'd recommend moving away from _metadata and instead using .attrs.
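For readers unfamiliar with `.attrs`, a minimal sketch of the suggestion follows. It uses a plain `pd.DataFrame` with no subclass; the key name `added_property` is simply reused from the example in this issue. `.attrs` is documented as experimental, and which operations carry it over depends on the pandas version, so the sketch only relies on `copy()`.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# .attrs is a dictionary attached to the object, so no subclass or
# _metadata declaration is needed to store a custom value.
df.attrs["added_property"] = "property"

# copy() goes through __finalize__, which carries attrs to the new object.
copied = df.copy()
print(copied.attrs)
```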

By differing attrs I mean different keys, or perhaps the same keys but different values. I think eventually we'll want a system to decide how to propagate the attrs in that case, but I'm not sure yet what that should look like.
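To make the two cases concrete (different keys, and the same keys with different values), here is one possible policy written over plain dictionaries. `combine_attrs` is a hypothetical helper invented for this sketch and is not a pandas function; dropping conflicting keys is just one assumption among several reasonable choices.

```python
from typing import Any, Dict


def combine_attrs(left: Dict[str, Any], right: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical policy: keep agreeing or one-sided keys, drop conflicts."""
    combined: Dict[str, Any] = {}
    for key in left.keys() | right.keys():
        if key in left and key in right:
            # Same key on both sides: keep it only if the values agree.
            if left[key] == right[key]:
                combined[key] = left[key]
        else:
            # Key present on one side only: carry it over unchanged.
            combined[key] = left[key] if key in left else right[key]
    return combined


left_attrs = {"unit": "m", "source": "a"}
right_attrs = {"unit": "m", "source": "b", "note": "x"}
print(combine_attrs(left_attrs, right_attrs))
```

Here `unit` agrees on both sides and is kept, `note` exists on one side only and is kept, and `source` conflicts and is dropped. Other policies (left operand wins, or raise on conflict) would be equally easy to write, which is the design question the comment leaves open.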

@jorisvandenbossche jorisvandenbossche added the Subclassing label Dec 7, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022