BUG: Impossible creation of array with dtype=string #61155

Open
2 of 3 tasks
Maxence1402 opened this issue Mar 20, 2025 · 6 comments
Labels: Bug, Strings

Comments

@Maxence1402

Maxence1402 commented Mar 20, 2025

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

pd.array([list('test')], dtype='string')
# ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

pd.array([list('test'), list('word')], dtype='string')
# ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

pd.array([list('test'), list('words')], dtype='string')
# <StringArray>
# ["['t', 'e', 's', 't']", "['w', 'o', 'r', 'd', 's']"]
# Length: 2, dtype: string

pd.array([list('test')])
# <NumpyExtensionArray>
# [['t', 'e', 's', 't']]
# Length: 1, dtype: object

Issue Description

I'm trying to convert a list of lists of strings into a StringArray, but pd.array with dtype='string' raises an exception when all the inner lists have the same length (see example). If the inner lists have different lengths, no error is raised and the output is correct.

In an older version of pandas (1.5.3), it produced an array of the same shape as the input, but each element was a repeated string cast of the whole inner array:

<StringArray>
[
["['t' 'e' 's' 't']", "['t' 'e' 's' 't']", "['t' 'e' 's' 't']",
 "['t' 'e' 's' 't']"]
]
Shape: (1, 4), dtype: string

Expected Behavior

Same as pd.array([list('test')]) but with StringArray type.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.12.3.final.0
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 154 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : English_Europe.1252

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 75.1.0
pip : 24.2
Cython : None
pytest : 7.4.4
hypothesis : None
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.27.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.3.7
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.1
gcsfs : None
matplotlib : 3.9.2
numba : 0.60.0
numexpr : 2.8.7
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.6.1
scipy : 1.13.1
sqlalchemy : 2.0.34
tables : 3.10.1
tabulate : 0.9.0
xarray : 2023.6.0
xlrd : None
zstandard : 0.23.0
tzdata : 2023.3
qtpy : 2.4.1
pyqt5 : None

Maxence1402 added the Bug and Needs Triage labels Mar 20, 2025
@rhshadrach
Member

Thanks for the report! For expected behavior you wrote:

Same as pd.array([list('test')]) but with StringArray type.

The result of pd.array([list('test')]) is an array whose elements are lists. For a StringArray, the elements must be strings. Therefore you cannot accomplish this.
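What can be built instead is an array with one scalar string per element. The following is a minimal sketch, assuming the goal is either one string per inner list or one string per character; the variable names `data`, `joined`, and `flat` are illustrative and do not come from the report.

```python
import pandas as pd

# Hypothetical input mirroring the report: a list of lists of characters.
data = [list("test"), list("word")]

# Option 1: join each inner list into a single scalar string,
# so every element handed to pd.array is a str.
joined = pd.array(["".join(chars) for chars in data], dtype="string")

# Option 2: flatten the nesting, giving one character per element.
flat = pd.array([ch for chars in data for ch in chars], dtype="string")

print(list(joined))  # ['test', 'word']
print(len(flat))     # 8
```

Both options give pd.array a flat sequence of strings, so the result does not depend on whether the inner lists happen to have equal lengths.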

rhshadrach added the Needs Info and Strings labels and removed the Needs Triage label Mar 20, 2025
@Maxence1402
Author

Maxence1402 commented Mar 21, 2025

Indeed, I did not realise that in the current version of pandas, arrays must be 1-dimensional (I'm used to working with an older version of pandas, where you can manipulate multi-dimensional arrays). Nonetheless, the conversion of the inner lists to strings seems strange, as it works for
pd.array([list('test'), list('words')], dtype='string')
but not for
pd.array([list('test'), list('word')], dtype='string')
where the sub-lists all have the same length.
Is this the expected behaviour of pandas?

@Manju080
Contributor

Similarly, without dtype='string', the result of pd.array([list('test'), list('word')]) is also an array of lists. With dtype='string' it fails, because a StringArray must be 1-dimensional with scalar string elements. In the other case, where the inner lists have different lengths, each inner list is converted to its string representation, resulting in a one-dimensional StringArray with elements like "['t', 'e', 's', 't']" and "['w', 'o', 'r', 'd', 's']".

Please let me know if I can help with anything.

@Maxence1402
Author

Maxence1402 commented Mar 26, 2025

I mean, what I find strange is the difference in behaviour for the two examples I mentioned, not the cast itself.

@rhshadrach
Member

Agreed @Maxence1402 - this difference is due to our use of NumPy:

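import numpy as np

# Inner lists of different lengths: NumPy keeps a 1-D array of list objects.
print(np.array([list('test'), list('words')], dtype="object"))
# [list(['t', 'e', 's', 't']) list(['w', 'o', 'r', 'd', 's'])]
# Inner lists of equal length: NumPy builds a 2-D array of characters.
print(np.array([list('test'), list('word')], dtype="object"))
# [['t' 'e' 's' 't'] ['w' 'o' 'r' 'd']]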

We shouldn't raise in the case of [list('test'), list('word')]. PRs to fix are welcome!
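Until a fix lands, the dimensionality difference can be inspected directly and sidestepped in user code. This is only a sketch of a possible workaround, not the eventual pandas fix: `to_string_array` is a hypothetical helper that converts each inner list with `str` before calling pd.array, so NumPy never sees nested sequences.

```python
import numpy as np
import pandas as pd

ragged = [list("test"), list("words")]  # inner lengths differ
equal = [list("test"), list("word")]    # inner lengths match

# With dtype=object, NumPy returns 1-D for ragged input and 2-D otherwise.
ragged_arr = np.array(ragged, dtype=object)
equal_arr = np.array(equal, dtype=object)
print(ragged_arr.shape)  # (2,)
print(equal_arr.shape)   # (2, 4)


def to_string_array(rows):
    """Hypothetical helper: stringify each inner list up front."""
    return pd.array([str(row) for row in rows], dtype="string")


# Both inputs now behave the same way: one string per inner list.
result_equal = to_string_array(equal)
result_ragged = to_string_array(ragged)
print(list(result_equal))
# ["['t', 'e', 's', 't']", "['w', 'o', 'r', 'd']"]
```

Stringifying up front reproduces the output of the ragged case from the report for equal-length input as well, which is the consistency asked for above.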

rhshadrach removed the Needs Info label Mar 26, 2025
@Manju080
Contributor

take
