Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pivot_table drops rows and columns despite values being non-NaN #61113

Open
3 tasks done
it176131 opened this issue Mar 12, 2025 · 9 comments · May be fixed by #61184
Open
3 tasks done

BUG: pivot_table drops rows and columns despite values being non-NaN #61113

it176131 opened this issue Mar 12, 2025 · 9 comments · May be fixed by #61184
Labels
Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@it176131
Copy link

it176131 commented Mar 12, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
import pandas._testing as tm


def test_pivot_table_index_and_column_with_nan() -> None:
    """Index and columns should exist if any non-null values.

    Input data
    ----------
        row  col  val
    0  NaN  0.0    0
    1  0.0  1.0    1
    2  1.0  2.0    2
    3  2.0  3.0    3
    4  3.0  NaN    4

    Expected output
    ---------------
    col  0.0  1.0  2.0  3.0  NaN
    row                         
    NaN  0.0  NaN  NaN  NaN  NaN
    0.0  NaN  1.0  NaN  NaN  NaN
    1.0  NaN  NaN  2.0  NaN  NaN
    2.0  NaN  NaN  NaN  3.0  NaN
    3.0  NaN  NaN  NaN  NaN  4.0
    """
    data = {
        "row": [None, *range(4)],
        "col": [*range(4), None],
        "val": range(5)
    }
    df = pd.DataFrame(data)
    actual = df.pivot_table(values="val", index="row", columns="col")
    e_index = [None, *range(4)]
    e_columns = [*range(4), None]
    e_data = np.zeros(shape=(5, 5))
    e_data.fill(np.NaN)
    np.fill_diagonal(a=e_data, val=range(5))
    expected = pd.DataFrame(
        data=e_data,
        index=pd.Index(data=e_index, name="row"),
        columns=pd.Index(data=e_columns, name="col")
    )
    tm.assert_frame_equal(left=actual, right=expected)  # fails

Issue Description

Rows and columns are unexpectedly dropped while creating a pivot table with a single NaN index label and a single NaN column label and no NaN values in the input data.

Input data

   row  col  val
0  NaN  0.0    0
1  0.0  1.0    1
2  1.0  2.0    2
3  2.0  3.0    3
4  3.0  NaN    4

Actual output

col  1.0  2.0  3.0
row               
0.0  1.0  NaN  NaN
1.0  NaN  2.0  NaN
2.0  NaN  NaN  3.0

Expected Behavior

The docs for dropna state:

Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.

Given that dropna=True by default, I'd expect all columns and rows to be present as each has at least one non-NaN value.

Expected output

col  0.0  1.0  2.0  3.0  NaN
row                         
NaN  0.0  NaN  NaN  NaN  NaN
0.0  NaN  1.0  NaN  NaN  NaN
1.0  NaN  NaN  2.0  NaN  NaN
2.0  NaN  NaN  NaN  3.0  NaN
3.0  NaN  NaN  NaN  NaN  4.0

Installed Versions

INSTALLED VERSIONS

commit : 3c93d06
python : 3.13.2
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : AMD64 Family 25 Model 116 Stepping 1, AuthenticAMD
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 3.0.0.dev0+2007.g3c93d06d64
numpy : 1.26.4
dateutil : 2.9.0.post0
pip : 25.0.1
Cython : 3.0.12
sphinx : 8.1.3
IPython : 9.0.2
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.3
blosc : None
bottleneck : 1.4.2
fastparquet : 2024.11.0
fsspec : 2025.3.0
html5lib : 1.1
hypothesis : 6.129.0
gcsfs : 2025.3.0
jinja2 : 3.1.6
lxml.etree : 5.3.1
matplotlib : 3.10.1
numba : 0.61.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 19.0.1
pyreadstat : 1.2.8
pytest : 8.3.5
python-calamine : None
pytz : 2025.1
pyxlsb : 1.0.10
s3fs : 2025.3.0
scipy : 1.15.2
sqlalchemy : 2.0.39
tables : 3.10.2
tabulate : 0.9.0
xarray : 2024.9.0
xlrd : 2.0.1
xlsxwriter : 3.2.2
zstandard : 0.23.0
tzdata : 2025.1
qtpy : None
pyqt5 : None

@it176131 it176131 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 12, 2025
@it176131
Copy link
Author

it176131 commented Mar 12, 2025

Strangely, pivot almost produces the expected output (the columns are sorted with NaN first):

data = {
    "row": [None, *range(4)],
    "col": [*range(4), None],
    "val": range(5)
}
df = pd.DataFrame(data)
out = df.pivot(index="col", columns="row", values="val")
print(out)
col  NaN  0.0  1.0  2.0  3.0
row                         
NaN  NaN  0.0  NaN  NaN  NaN
0.0  NaN  NaN  1.0  NaN  NaN
1.0  NaN  NaN  NaN  2.0  NaN
2.0  NaN  NaN  NaN  NaN  3.0
3.0  4.0  NaN  NaN  NaN  NaN

@snitish
Copy link
Member

snitish commented Mar 13, 2025

It looks like dropna is used in multiple places inside __internal_pivot_table:

  1. In the groupby() which means that NaN keys (either in index (row) or columns (col)) will be dropped.
grouped = data.groupby(keys, observed=observed, sort=sort, dropna=dropna)
  1. For dropping all-NaN rows and columns
if dropna and ...
    agged = agged.dropna(how="all")

Due to the usage in 1 above, your result df doesn't contain cases where row or col is NaN. It looks like this behavior is not documented well.
Seems this is a known issue: #53521

@it176131
Copy link
Author

@snitish

Thanks for the reply! What options should I explore to fix and/or improve this? I will look at #53521 to see if suggestions have already been made there as well.

@snitish
Copy link
Member

snitish commented Mar 13, 2025

I'll let @rhshadrach chime in

@rhshadrach
Copy link
Member

Indeed, this seems duplicative of #53521. For now I recommend using dropna=False followed by dropping all NA rows / columns in the output. Does that satisfy your use case?

@rhshadrach rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 22, 2025
@it176131
Copy link
Author

@rhshadrach I was actually looking to keep all the NA rows and columns in the output as in the expected output above. dropna=False does this, but wondering if I should update the docs in pivot_table so users know that dropna=True does more than what it says in the current docs. Maybe something like the following:

    dropna : bool, default True
         Do not include columns whose entries are all NaN. If True,
-        rows with a NaN value in any column will be omitted before
-        computing margins.
+        - rows with a NaN value in any column will be omitted before computing margins
+        - index/column keys containing NA values will be dropped (see ``dropna`` parameter in ``pandas.DataFrame.groupby``)

Shared groupby docs for dropna

@rhshadrach
Copy link
Member

I'm +1 on adding to the docstring as suggested.

@rhshadrach rhshadrach added Docs and removed Bug Needs Info Clarification about behavior needed to assess issue labels Mar 25, 2025
@it176131
Copy link
Author

I'll link a PR

@it176131 it176131 linked a pull request Mar 26, 2025 that will close this issue
5 tasks
@it176131
Copy link
Author

@rhshadrach PR has been linked

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants