to_feather fails because partitions evaluate to Dask graph key tuples #325

Open
fbunt opened this issue Feb 8, 2025 · 0 comments
fbunt commented Feb 8, 2025

to_feather fails with an odd ValueError when writing a concatenated dataframe to disk. Minimal reproduction:

>>> import dask.dataframe as dd
>>> import dask_geopandas as dgpd
>>> import geopandas as gpd
>>> import numpy as np
>>> 
>>> dfs = []
>>> N = 5
>>> for i in range(3):
...     gs = gpd.points_from_xy(np.arange(N), np.arange(N), crs=5070)
...     df = gpd.GeoDataFrame({"data": np.full(N, i), "geometry": gs})
...     dfs.append(dgpd.from_geopandas(df, npartitions=1))
>>> ddf = dd.concat(dfs)
>>> print(ddf.compute())
   data     geometry
0     0  POINT (0 0)
1     0  POINT (1 1)
2     0  POINT (2 2)
3     0  POINT (3 3)
4     0  POINT (4 4)
0     1  POINT (0 0)
1     1  POINT (1 1)
2     1  POINT (2 2)
3     1  POINT (3 3)
4     1  POINT (4 4)
0     2  POINT (0 0)
1     2  POINT (1 1)
2     2  POINT (2 2)
3     2  POINT (3 3)
4     2  POINT (4 4)
>>> ddf.to_feather("test.feather")
Traceback (most recent call last):
  File "/var/mnt/fastdata02/mtbs/src/feather_error.py", line 13, in <module>
    ddf.to_feather("test.feather")
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask_geopandas/expr.py", line 682, in to_feather
    return to_feather(self, path, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask_geopandas/io/arrow.py", line 433, in to_feather
    return compute_as_if_collection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/base.py", line 399, in compute_as_if_collection
    return schedule(dsk2, keys, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/threaded.py", line 91, in get
    results = get_async(
              ^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/local.py", line 516, in get_async
    raise_exception(exc, tb)
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/local.py", line 324, in reraise
    raise exc
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/local.py", line 229, in execute_task
    result = task(data)
             ^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/_task_spec.py", line 741, in __call__
    return self.func(*new_argspec)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask/utils.py", line 79, in apply
    return func(*args)
           ^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask_geopandas/io/arrow.py", line 149, in write_partition
    table = cls._pandas_to_arrow_table(df, preserve_index=None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/dask_geopandas/io/arrow.py", line 202, in _pandas_to_arrow_table
    table = _geopandas_to_arrow(df, index=preserve_index)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/geopandas/io/arrow.py", line 340, in _geopandas_to_arrow
    _validate_dataframe(df)
  File "/home/fred/homes/rts/anaconda3/envs/dgpd_err/lib/python3.12/site-packages/geopandas/io/arrow.py", line 242, in _validate_dataframe
    raise ValueError("Writing to Parquet/Feather only supports IO with DataFrames")
ValueError: Writing to Parquet/Feather only supports IO with DataFrames

If I add print(f"{type(df) = }\n{df}") at the start of geopandas.io.arrow._validate_dataframe, where the error is raised (a sketch of that patch follows the output below), I get the following:

type(df) = <class 'tuple'>
('concat-213dda7de847a70669800b78bbdabba7', 2)
type(df) = <class 'tuple'>
('concat-213dda7de847a70669800b78bbdabba7', 0)
type(df) = <class 'tuple'>
('concat-213dda7de847a70669800b78bbdabba7', 1)
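
For reference, here is a minimal sketch of that instrumentation as a monkeypatch, so the installed geopandas does not need to be edited. It wraps the real _validate_dataframe and works here because the default threaded scheduler runs everything in one process:

import geopandas.io.arrow as gpd_arrow

_orig_validate = gpd_arrow._validate_dataframe

def _validate_dataframe(df):
    # Log what the Feather writer actually received before validating it.
    print(f"{type(df) = }\n{df}")
    return _orig_validate(df)

gpd_arrow._validate_dataframe = _validate_dataframe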

It seems that the partitions are being evaluated to dask graph key tuples instead of the actual dataframes. I have run into similar issues with other dask_geopandas workflows, but those were intermittent (roughly 1 failure in 1000 runs) and I could never reproduce them on demand. This is the first case that reproduces the failure reliably. It started after I updated to a dask version that includes dask-expr.
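
In the meantime, a workaround that sidesteps the dask_geopandas Feather writer entirely is to materialize the frame and write it with geopandas directly (viable here because the result fits in memory):

# Workaround sketch: compute to a single in-memory GeoDataFrame and let
# geopandas write the Feather file, bypassing dask_geopandas.io.arrow.
gdf = ddf.compute()            # plain geopandas.GeoDataFrame
gdf.to_feather("test.feather")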

Environment:

python                    3.12.8          h9e4cc4f_1_cpython    conda-forge
dask                      2025.1.0           pyhd8ed1ab_0    conda-forge
dask-core                 2025.1.0           pyhd8ed1ab_0    conda-forge
dask-expr                 2.0.0              pyhd8ed1ab_0    conda-forge
dask-geopandas            0.4.3              pyhd8ed1ab_0    conda-forge
geopandas-base            1.0.1              pyha770c72_3    conda-forge