### Description
Related to #44640
When attempting to convert a pandas DataFrame that has a dict-typed column to a PyArrow table with a map column, the conversion fails once the DataFrame and column are of sufficient size:

```
/.../arrow/cpp/src/arrow/array/builder_nested.cc:103: Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder
```

This is immediately followed by SIGABRT and the process crashing.

When the DataFrame is smaller, the conversion succeeds without error. In the reproduction code below, setting `dataframe_size` to a small value (e.g. 1M rows) produces no error, but above a certain size (e.g. 10M rows) the crash occurs.
```python
import random
import string

import numpy as np
import pandas as pd
import pyarrow

dataframe_size = 10_000_000

map_keys = [
    "a1B2c3D4e5",
    "f6G7h8I9j0",
    "k1L2m3N4o5",
    "p6Q7r8S9t0",
    "u1V2w3X4y5",
    "z6A7b8C9d0",
    "e1F2g3H4i5",
    "j6K7l8M9n0",
    "o1P2q3R4s5",
    "t6U7v8W9x0",
    "y1Z2a3B4c5",
    "d6E7f8G9h0",
    "i1J2k3L4m5",
    "n6O7p8Q9r0",
    "s1T2u3V4w5",
]

# Pre-generate random strings for columns to avoid repeated computation
print("Generating random column strings")
random_strings = [
    "".join(random.choices(string.ascii_letters + string.digits, k=20))
    for _ in range(dataframe_size // 100)
]

# Pre-generate random map values
print("Generating random map value strings")
random_map_values = [
    "".join(
        random.choices(
            string.ascii_letters + string.digits, k=random.randint(20, 200)
        )
    )
    for _ in range(dataframe_size // 100)
]

print("Generating random maps")
random_maps = [
    {
        key: random.choice(random_map_values)
        for key in random.sample(map_keys, random.randint(5, 10))
    }
    for _ in range(dataframe_size // 100)
]

print("Generating random dataframe")
data_with_map_col = {
    "partition": np.full(dataframe_size, "1"),
    "column1": np.random.choice(random_strings, dataframe_size),
    "map_col": np.random.choice(random_maps, dataframe_size),
}

# Create DataFrame
df_with_map_col = pd.DataFrame(data_with_map_col)

column_types = {
    "partition": pyarrow.string(),
    "column1": pyarrow.string(),
    "map_col": pyarrow.map_(pyarrow.string(), pyarrow.string()),
}
schema = pyarrow.schema(fields=column_types)

# Process crashes when the DataFrame is large enough
table = pyarrow.Table.from_pandas(
    df=df_with_map_col, schema=schema, preserve_index=False, safe=True
)
```
Environment details:
- Python version: 3.11.8
- PyArrow version: 18.0.0
### Component(s)
Python