
[Python][C++] Calling Table.from_pandas with a dataframe that contains a map column of sufficient size causes SIGABRT and process crash #44643

Open
@snakingfire

Description

Describe the bug, including details regarding any error messages, version, and platform.

Related to #44640

When attempting to convert a pandas DataFrame that has a dict-typed column to a pyarrow table with a map column, the conversion fails once the dataframe and column are sufficiently large:

```
/.../arrow/cpp/src/arrow/array/builder_nested.cc:103:  Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder
```

This is immediately followed by SIGABRT and the process crashing.

When the dataframe is smaller, the conversion succeeds without error. See the reproduction code below: when `dataframe_size` is set to a small value (e.g. 1M rows) there is no error, but at a sufficiently large size (e.g. 10M rows) the crash occurs.

```python
import pandas as pd
import pyarrow

# Example DataFrame creation
import numpy as np
import random
import string

dataframe_size = 10_000_000

map_keys = [
    "a1B2c3D4e5",
    "f6G7h8I9j0",
    "k1L2m3N4o5",
    "p6Q7r8S9t0",
    "u1V2w3X4y5",
    "z6A7b8C9d0",
    "e1F2g3H4i5",
    "j6K7l8M9n0",
    "o1P2q3R4s5",
    "t6U7v8W9x0",
    "y1Z2a3B4c5",
    "d6E7f8G9h0",
    "i1J2k3L4m5",
    "n6O7p8Q9r0",
    "s1T2u3V4w5",
]

# Pre-generate random strings for columns to avoid repeated computation
print("Generating random column strings")
random_strings = [
    "".join(random.choices(string.ascii_letters + string.digits, k=20))
    for _ in range(int(dataframe_size / 100))
]

# Pre-generate random map values
print("Generating random map value strings")
random_map_values = [
    "".join(
        random.choices(
            string.ascii_letters + string.digits, k=random.randint(20, 200)
        )
    )
    for _ in range(int(dataframe_size / 100))
]

print("Generating random maps")
random_maps = [
    {
        key: random.choice(random_map_values)
        for key in random.sample(map_keys, random.randint(5, 10))
    }
    for _ in range(int(dataframe_size / 100))
]

print("Generating random dataframe")
data_with_map_col = {
    "partition": np.full(dataframe_size, "1"),
    "column1": np.random.choice(random_strings, dataframe_size),
    "map_col": np.random.choice(random_maps, dataframe_size),
}

# Create DataFrame
df_with_map_col = pd.DataFrame(data_with_map_col)

column_types = {
    "partition": pyarrow.string(),
    "column1": pyarrow.string(),
    "map_col": pyarrow.map_(pyarrow.string(), pyarrow.string()),
}
schema = pyarrow.schema(fields=column_types)

# Process crashes when dataframe is large enough
table = pyarrow.Table.from_pandas(
    df=df_with_map_col, schema=schema, preserve_index=False, safe=True
)
```

Environment Details:

  • Python version: 3.11.8
  • PyArrow version: 18.0.0

Component(s)

Python
