[Python][C++] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash #44643

Open
snakingfire opened this issue Nov 5, 2024 · 4 comments

Describe the bug, including details regarding any error messages, version, and platform.

Related to #44640

When converting a pandas dataframe that has a dict-typed column to a pyarrow table with a map column, if the dataframe and column are sufficiently large, the conversion fails with:

/.../arrow/cpp/src/arrow/array/builder_nested.cc:103:  Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder

This is immediately followed by SIGABRT and the process crashing.

When the dataframe is smaller, the conversion succeeds without error. See the reproduction code below: when dataframe_size is set to a small value (e.g. 1M rows) there is no error, but at a certain size (e.g. 10M rows) the crash occurs.

import random
import string

import numpy as np
import pandas as pd
import pyarrow

dataframe_size = 10_000_000

map_keys = [
    "a1B2c3D4e5",
    "f6G7h8I9j0",
    "k1L2m3N4o5",
    "p6Q7r8S9t0",
    "u1V2w3X4y5",
    "z6A7b8C9d0",
    "e1F2g3H4i5",
    "j6K7l8M9n0",
    "o1P2q3R4s5",
    "t6U7v8W9x0",
    "y1Z2a3B4c5",
    "d6E7f8G9h0",
    "i1J2k3L4m5",
    "n6O7p8Q9r0",
    "s1T2u3V4w5",
]

# Pre-generate random strings for columns to avoid repeated computation
print("Generating random column strings")
random_strings = [
    "".join(random.choices(string.ascii_letters + string.digits, k=20))
    for _ in range(int(dataframe_size / 100))
]

# Pre-generate random map values
print("Generating random map value strings")
random_map_values = [
    "".join(
        random.choices(
            string.ascii_letters + string.digits, k=random.randint(20, 200)
        )
    )
    for _ in range(int(dataframe_size / 100))
]

print("Generating random maps")
random_maps = [
    {
        key: random.choice(random_map_values)
        for key in random.sample(map_keys, random.randint(5, 10))
    }
    for _ in range(int(dataframe_size / 100))
]

print("Generating random dataframe")
data_with_map_col = {
    "partition": np.full(dataframe_size, "1"),
    "column1": np.random.choice(random_strings, dataframe_size),
    "map_col": np.random.choice(random_maps, dataframe_size),
}

# Create DataFrame
df_with_map_col = pd.DataFrame(data_with_map_col)

column_types = {
    "partition": pyarrow.string(),
    "column1": pyarrow.string(),
    "map_col": pyarrow.map_(pyarrow.string(), pyarrow.string()),
}
schema = pyarrow.schema(fields=column_types)

# Process crashes when dataframe is large enough
table = pyarrow.Table.from_pandas(
    df=df_with_map_col, schema=schema, preserve_index=False, safe=True
)

Environment Details:

  • Python Version: Python 3.11.8
  • Pyarrow version: 18.0.0

Component(s)

Python

@snakingfire snakingfire changed the title Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash [Python] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash Nov 5, 2024
pitrou (Member) commented Nov 9, 2024

cc @jorisvandenbossche @raulcd

@raulcd raulcd changed the title [Python] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash [Python][C++] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash Nov 14, 2024
raulcd (Member) commented Nov 14, 2024

Really weird:

item_builder_->length(): 19456153
key_builder_->length():  19456154

I'll have to debug a bunch to understand where this mismatch is coming from :)

pitrou (Member) commented Nov 14, 2024

Intuitively, I think what happens is that the item_builder_ overflows because it's a StringBuilder and we try to append more than 2 GiB to it. The converter logic then tries to finish the chunk and start another one, but the key and item builders are out of sync.
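
A quick back-of-envelope check supports this. With the repro's parameters, the total bytes appended to the map's item builder far exceed 2 GiB (a rough estimate; the averages below are assumptions derived from the repro's randint ranges):

rows = 10_000_000
avg_keys_per_map = (5 + 10) / 2    # random.sample(map_keys, randint(5, 10))
avg_value_bytes = (20 + 200) / 2   # randint(20, 200) ASCII characters per value
total_item_bytes = rows * avg_keys_per_map * avg_value_bytes
print(f"~{total_item_bytes / 2**30:.1f} GiB of item data")  # ~7.7 GiB, well over 2 GiB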

pitrou (Member) commented Nov 14, 2024

It looks like the rewind-on-overflow in arrow/util/converter.h is too naive. In particular, if appending to one of a StructBuilder's child builders raises CapacityError, then all child builders should be rewound to the same length to ensure consistency.
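
Until this is fixed, a user-side workaround consistent with the diagnosis above is to convert the dataframe in slices small enough that no single chunk's item data approaches 2 GiB, then concatenate the results. A minimal sketch (from_pandas_chunked and the 1M-row slice size are illustrative, not pyarrow API):

import pyarrow

def from_pandas_chunked(df, schema, rows_per_chunk=1_000_000):
    # Convert each slice separately so the internal string builders
    # never overflow, then stitch the per-slice tables back together.
    tables = [
        pyarrow.Table.from_pandas(
            df.iloc[start : start + rows_per_chunk],
            schema=schema,
            preserve_index=False,
            safe=True,
        )
        for start in range(0, len(df), rows_per_chunk)
    ]
    return pyarrow.concat_tables(tables)

table = from_pandas_chunked(df_with_map_col, schema)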
