When attempting to convert a pandas dataframe that has a dict-typed column to a pyarrow table with a map column, if the dataframe and column are of sufficient size, the conversion fails with:
/.../arrow/cpp/src/arrow/array/builder_nested.cc:103: Check failed: (item_builder_->length()) == (key_builder_->length()) keys and items builders don't have the same size in MapBuilder
This is immediately followed by SIGABRT and the process crashing.
When the dataframe is smaller, the conversion succeeds without error. See below for reproduction code: when dataframe_size is set to a small value (e.g. 1M rows) there is no error, but at a certain size (e.g. 10M rows) the error occurs.
import pandas as pd
import pyarrow
import numpy as np
import random
import string

# Example DataFrame creation
dataframe_size = 10_000_000
map_keys = [
    "a1B2c3D4e5",
    "f6G7h8I9j0",
    "k1L2m3N4o5",
    "p6Q7r8S9t0",
    "u1V2w3X4y5",
    "z6A7b8C9d0",
    "e1F2g3H4i5",
    "j6K7l8M9n0",
    "o1P2q3R4s5",
    "t6U7v8W9x0",
    "y1Z2a3B4c5",
    "d6E7f8G9h0",
    "i1J2k3L4m5",
    "n6O7p8Q9r0",
    "s1T2u3V4w5",
]

# Pre-generate random strings for columns to avoid repeated computation
print("Generating random column strings")
random_strings = [
    "".join(random.choices(string.ascii_letters + string.digits, k=20))
    for _ in range(int(dataframe_size / 100))
]

# Pre-generate random map values
print("Generating random map value strings")
random_map_values = [
    "".join(
        random.choices(
            string.ascii_letters + string.digits, k=random.randint(20, 200)
        )
    )
    for _ in range(int(dataframe_size / 100))
]

print("Generating random maps")
random_maps = [
    {
        key: random.choice(random_map_values)
        for key in random.sample(map_keys, random.randint(5, 10))
    }
    for _ in range(int(dataframe_size / 100))
]

print("Generating random dataframe")
data_with_map_col = {
    "partition": np.full(dataframe_size, "1"),
    "column1": np.random.choice(random_strings, dataframe_size),
    "map_col": np.random.choice(random_maps, dataframe_size),
}

# Create DataFrame
df_with_map_col = pd.DataFrame(data_with_map_col)

column_types = {
    "partition": pyarrow.string(),
    "column1": pyarrow.string(),
    "map_col": pyarrow.map_(pyarrow.string(), pyarrow.string()),
}
schema = pyarrow.schema(fields=column_types)

# Process crashes when the dataframe is large enough
table = pyarrow.Table.from_pandas(
    df=df_with_map_col, schema=schema, preserve_index=False, safe=True
)
Environment Details:
Python Version: Python 3.11.8
Pyarrow version: 18.0.0
Component(s)
Python
Intuitively, I think what happens is that the item_builder_ overflows because it is a StringBuilder and we try to append more than 2 GiB to it. The converter logic then tries to finish the chunk and start another one, but by that point the key and item builders are out of sync.
It looks like the rewind-on-overflow logic in arrow/util/converter.h is too naive. In particular, if appending to one of a StructBuilder's child builders raises CapacityError, then all child builders should be rewound to the same length to ensure consistency.
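To make the size threshold concrete (rough arithmetic based on the reproduction parameters above, not a measurement): each row carries roughly 5-10 map entries whose values average around 110 bytes, so at 10M rows the item builder would need to hold several GiB of string data, well past the 2 GiB offset limit of a 32-bit StringBuilder, while 1M rows stays comfortably under it. Until the rewind logic is fixed, one possible workaround (a sketch under those assumptions, not an official fix) is to convert the dataframe in slices small enough that each chunk stays below the limit and then concatenate the resulting tables; the slice size rows_per_chunk below is a guess and may need tuning for other data:

import pandas as pd
import pyarrow

def table_from_pandas_chunked(df, schema, rows_per_chunk=1_000_000):
    """Workaround sketch: convert a large dataframe to a pyarrow Table in slices.

    Each slice is assumed small enough that the map column's item strings
    fit under the 2 GiB StringBuilder limit, so the overflow/rewind path
    is never exercised.
    """
    tables = []
    for start in range(0, len(df), rows_per_chunk):
        chunk = df.iloc[start:start + rows_per_chunk]
        tables.append(
            pyarrow.Table.from_pandas(
                df=chunk, schema=schema, preserve_index=False, safe=True
            )
        )
    # concat_tables keeps each slice as a separate chunk, so no single
    # builder accumulates more than one slice's worth of item bytes.
    return pyarrow.concat_tables(tables)

# Usage with the reproduction above:
# table = table_from_pandas_chunked(df_with_map_col, schema)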
Related to #44640