
issue with csv #622

Open
breck7 opened this issue Feb 17, 2025 · 3 comments

Comments

@breck7
Owner

breck7 commented Feb 17, 2025

I recently tried loading your CSV for a data analysis project and noticed an inconsistency in the formatting: some rows do not have the expected number of fields. Have you used this CSV for data analysis yourself? If there is any trick to it, or any insight you could share, that would be really appreciated.

When loading the CSV with pandas I run into the following error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 396 fields in line 5, saw 852

I see similar errors when loading the TSV file, and the JSON file seems to be missing some columns/fields.

@swaptr

swaptr commented Feb 17, 2025

Thanks @breck7 for this issue; here is what I found.
The CSV and TSV were downloaded from the website:

  1. Pandas - Normal loading
df = pd.read_csv("pldb.csv")

throws the following error:

Traceback (most recent call last):
  File "/home/swapnil/repository/test/x.py", line 7, in <module>
    df_csv = pd.read_csv("pldb.csv")
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 838, in pandas._libs.parsers.TextReader.read_low_memory
  File "parsers.pyx", line 905, in pandas._libs.parsers.TextReader._read_rows
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2061, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 396 fields in line 5, saw 852
  2. Pandas - Change engine to 'python'
df = pd.read_csv("pldb.csv", engine='python')

fails with the following error:

Traceback (most recent call last):
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 805, in _next_iter_line
    line = next(self.data)
           ^^^^^^^^^^^^^^^
_csv.Error: ',' expected after '"'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/swapnil/repository/test/x.py", line 7, in <module>
    df_csv = pd.read_csv("pldb.csv", engine='python')
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/readers.py", line 626, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/readers.py", line 1923, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 252, in read
    content = self._get_lines(rows)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 1140, in _get_lines
    next_row = self._next_iter_line(row_num=self.pos + rows + 1)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 834, in _next_iter_line
    self._alert_malformed(msg, row_num)
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/pandas/io/parsers/python_parser.py", line 781, in _alert_malformed
    raise ParserError(msg)
pandas.errors.ParserError: ',' expected after '"'
  3. Pandas - Skip/warn on malformed lines
df = pd.read_csv("pldb.csv", on_bad_lines="warn", low_memory=False)

loads, but skips malformed lines (so data goes missing in this case):

ParserWarning: Skipping line 5: expected 396 fields, saw 852
Skipping line 21: expected 396 fields, saw 460
Skipping line 52: expected 396 fields, saw 628
Skipping line 65: expected 396 fields, saw 503
Skipping line 71: expected 396 fields, saw 563
Skipping line 79: expected 396 fields, saw 477
Skipping line 85: expected 396 fields, saw 569
Skipping line 102: expected 396 fields, saw 562
Skipping line 104: expected 396 fields, saw 711
Skipping line 110: expected 396 fields, saw 415
Skipping line 118: expected 396 fields, saw 561
Skipping line 121: expected 396 fields, saw 421
Skipping line 127: expected 396 fields, saw 421
Skipping line 135: expected 396 fields, saw 437
Skipping line 137: expected 396 fields, saw 558
Skipping line 143: expected 396 fields, saw 559
Skipping line 157: expected 396 fields, saw 429
Skipping line 161: expected 396 fields, saw 562
Skipping line 168: expected 396 fields, saw 559
Skipping line 175: expected 396 fields, saw 415
Skipping line 193: expected 396 fields, saw 418
Skipping line 206: expected 396 fields, saw 544
Skipping line 236: expected 396 fields, saw 506
Skipping line 246: expected 396 fields, saw 503
Skipping line 251: expected 396 fields, saw 447
Skipping line 254: expected 396 fields, saw 409
Skipping line 261: expected 396 fields, saw 558
Skipping line 266: expected 396 fields, saw 505
Skipping line 286: expected 396 fields, saw 419
Skipping line 299: expected 396 fields, saw 503
Skipping line 310: expected 396 fields, saw 415
Skipping line 312: expected 396 fields, saw 436
Skipping line 314: expected 396 fields, saw 415
Skipping line 315: expected 396 fields, saw 420
Skipping line 322: expected 396 fields, saw 569
Skipping line 332: expected 396 fields, saw 568
Skipping line 341: expected 396 fields, saw 579
Skipping line 349: expected 396 fields, saw 423
Skipping line 370: expected 396 fields, saw 436
Skipping line 372: expected 396 fields, saw 567
Skipping line 382: expected 396 fields, saw 561
Skipping line 389: expected 396 fields, saw 504
Skipping line 395: expected 396 fields, saw 503
Skipping line 398: expected 396 fields, saw 503
Skipping line 406: expected 396 fields, saw 503
Skipping line 408: expected 396 fields, saw 416
Skipping line 419: expected 396 fields, saw 560
Skipping line 428: expected 396 fields, saw 563
Skipping line 432: expected 396 fields, saw 569
Skipping line 435: expected 396 fields, saw 419
Skipping line 443: expected 396 fields, saw 461
Skipping line 475: expected 396 fields, saw 580
Skipping line 494: expected 396 fields, saw 504
Skipping line 508: expected 396 fields, saw 557
Skipping line 514: expected 396 fields, saw 561
Skipping line 532: expected 396 fields, saw 425
Skipping line 533: expected 396 fields, saw 436
Skipping line 562: expected 396 fields, saw 556
Skipping line 581: expected 396 fields, saw 426
Skipping line 583: expected 396 fields, saw 422
Skipping line 638: expected 396 fields, saw 561
Skipping line 644: expected 396 fields, saw 721
Skipping line 660: expected 396 fields, saw 416
Skipping line 672: expected 396 fields, saw 722
Skipping line 682: expected 396 fields, saw 556
Skipping line 696: expected 396 fields, saw 558
Skipping line 703: expected 396 fields, saw 557
Skipping line 708: expected 396 fields, saw 434
Skipping line 732: expected 396 fields, saw 504
Skipping line 736: expected 396 fields, saw 558
Skipping line 776: expected 396 fields, saw 417
Skipping line 785: expected 396 fields, saw 564
Skipping line 820: expected 396 fields, saw 516
Skipping line 849: expected 396 fields, saw 423
Skipping line 859: expected 396 fields, saw 415
Skipping line 868: expected 396 fields, saw 576
Skipping line 880: expected 396 fields, saw 558
Skipping line 893: expected 396 fields, saw 424
Skipping line 907: expected 396 fields, saw 424
Skipping line 924: expected 396 fields, saw 432
Skipping line 955: expected 396 fields, saw 567
Skipping line 970: expected 396 fields, saw 563
Skipping line 974: expected 396 fields, saw 567
Skipping line 984: expected 396 fields, saw 423
Skipping line 987: expected 396 fields, saw 898
Skipping line 993: expected 396 fields, saw 503
Skipping line 1002: expected 396 fields, saw 557
Skipping line 1008: expected 396 fields, saw 507
Skipping line 1013: expected 396 fields, saw 564
Skipping line 1026: expected 396 fields, saw 503
Skipping line 1045: expected 396 fields, saw 417
Skipping line 1053: expected 396 fields, saw 541
Skipping line 1061: expected 396 fields, saw 557
Skipping line 1067: expected 396 fields, saw 505
Skipping line 1068: expected 396 fields, saw 417
Skipping line 1082: expected 396 fields, saw 562
Skipping line 1086: expected 396 fields, saw 416
Skipping line 1124: expected 396 fields, saw 416
Skipping line 1127: expected 396 fields, saw 524
Skipping line 1137: expected 396 fields, saw 422
Skipping line 1145: expected 396 fields, saw 503
Skipping line 1151: expected 396 fields, saw 564
Skipping line 1166: expected 396 fields, saw 570
Skipping line 1186: expected 396 fields, saw 505
Skipping line 1187: expected 396 fields, saw 504
Skipping line 1193: expected 396 fields, saw 673
Skipping line 1201: expected 396 fields, saw 891
Skipping line 1227: expected 396 fields, saw 577
Skipping line 1262: expected 396 fields, saw 558
Skipping line 1284: expected 396 fields, saw 557
Skipping line 1296: expected 396 fields, saw 570
Skipping line 1336: expected 396 fields, saw 562
Skipping line 1354: expected 396 fields, saw 579
Skipping line 1357: expected 396 fields, saw 566
Skipping line 1359: expected 396 fields, saw 504
Skipping line 1382: expected 396 fields, saw 422
Skipping line 1397: expected 396 fields, saw 503
Skipping line 1422: expected 396 fields, saw 563
Skipping line 1436: expected 396 fields, saw 507
Skipping line 1442: expected 396 fields, saw 559
Skipping line 1491: expected 396 fields, saw 556
Skipping line 1505: expected 396 fields, saw 746
Skipping line 1578: expected 396 fields, saw 563
Skipping line 1595: expected 396 fields, saw 449
Skipping line 1612: expected 396 fields, saw 724
Skipping line 1646: expected 396 fields, saw 504
Skipping line 1647: expected 396 fields, saw 415
Skipping line 1655: expected 396 fields, saw 507
Skipping line 1694: expected 396 fields, saw 562
Skipping line 1711: expected 396 fields, saw 415
Skipping line 1723: expected 396 fields, saw 557
Skipping line 1797: expected 396 fields, saw 419
Skipping line 1798: expected 396 fields, saw 530
Skipping line 1999: expected 396 fields, saw 437
Skipping line 2097: expected 396 fields, saw 505
Skipping line 2120: expected 396 fields, saw 443
Skipping line 2178: expected 396 fields, saw 559
Skipping line 2267: expected 396 fields, saw 504
Skipping line 2274: expected 396 fields, saw 415
Skipping line 2371: expected 396 fields, saw 424
Skipping line 2377: expected 396 fields, saw 504
Skipping line 2655: expected 396 fields, saw 508
Skipping line 2663: expected 396 fields, saw 503
Skipping line 2743: expected 396 fields, saw 560
Skipping line 2898: expected 396 fields, saw 503
Skipping line 3410: expected 396 fields, saw 417
  4. Using polars
df = pl.read_csv("./pldb.csv")

gives the following error:

Traceback (most recent call last):
  File "/home/swapnil/repository/test/x.py", line 13, in <module>
    df = pl.read_csv("./pldb.csv")
         ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/polars/_utils/deprecation.py", line 92, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/polars/io/csv/functions.py", line 534, in read_csv
    df = _read_csv_impl(
         ^^^^^^^^^^^^^^^
  File "/home/swapnil/repository/test/.venv/lib64/python3.12/site-packages/polars/io/csv/functions.py", line 682, in _read_csv_impl
    pydf = PyDataFrame.read_csv(
           ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not parse `",,true false,,,,,,,true,,,,true,,,true,,,,true,,,,false,,true,,true,true,true,true,,,,,,,,,,,,,,,,true,,,,,,true,,,,,,,true,,,true,,,,true,,,,,,,true,true,,true,true,true,true,,,true,,true,,false,,,,,,,,,,,true,,true,,,,,,true,false,,,,,,false,,,,true,,,,,,,,,false,true,,false,,true,,,,,,,,,true,,true,,,,,true,,,,,,,,,,,,,,,,false,,,,true,true,,true,,,,https://github.com/SpencerPark/IJava,https://en.wikipedia.org/wiki/Java_(programming_language),401,37,https://hopl.info/showlanguage.prx?exp=2131,Java,Java,,Java,https://github.com/textmate/java.tmbundle,,"year|publisher|title|authors|isbn13\n2017|Pearson|Java Software Solutions|Lewis` as dtype `str` at column 'stringToken' (column number 205)

The current offset in the file is 18276 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `schema_overrides` argument
- setting `ignore_errors` to `True`,
- adding `",,true false,,,,,,,true,,,,true,,,true,,,,true,,,,false,,true,,true,true,true,true,,,,,,,,,,,,,,,,true,,,,,,true,,,,,,,true,,,true,,,,true,,,,,,,true,true,,true,true,true,true,,,true,,true,,false,,,,,,,,,,,true,,true,,,,,,true,false,,,,,,false,,,,true,,,,,,,,,false,true,,false,,true,,,,,,,,,true,,true,,,,,true,,,,,,,,,,,,,,,,false,,,,true,true,,true,,,,https://github.com/SpencerPark/IJava,https://en.wikipedia.org/wiki/Java_(programming_language),401,37,https://hopl.info/showlanguage.prx?exp=2131,Java,Java,,Java,https://github.com/textmate/java.tmbundle,,"year|publisher|title|authors|isbn13\n2017|Pearson|Java Software Solutions|Lewis` to the `null_values` list.

Original error: ```invalid csv file

Field `",,true false,,,,,,,true,,,,true,,,true,,,,true,,,,false,,true,,true,true,true,true,,,,,,,,,,,,,,,,true,,,,,,true,,,,,,,true,,,true,,,,true,,,,,,,true,true,,true,true,true,true,,,true,,true,,false,,,,,,,,,,,true,,true,,,,,,true,false,,,,,,false,,,,true,,,,,,,,,false,true,,false,,true,,,,,,,,,true,,true,,,,,true,,,,,,,,,,,,,,,,false,,,,true,true,,true,,,,https://github.com/SpencerPark/IJava,https://en.wikipedia.org/wiki/Java_(programming_language),401,37,https://hopl.info/showlanguage.prx?exp=2131,Java,Java,,Java,https://github.com/textmate/java.tmbundle,,"year|publisher|title|authors|isbn13\n2017|Pearson|Java Software Solutions|Lewis` is not properly escaped.```
  5. Using the stdlib csv module - line 6 malforms this dataframe by adding an extra column (397 instead of 396):
import csv

import pandas as pd
with open('pldb.csv', mode='r', newline='', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    data = list(reader)

df = pd.DataFrame(data)
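
As a quick diagnostic (a sketch using only the standard library, not part of the original report), you can histogram the field count per row to locate the malformed lines directly, without relying on any particular dataframe library's error reporting:

```python
import csv
from collections import Counter

def field_count_histogram(lines):
    """Map field count -> number of rows with that count.

    `lines` is any iterable of CSV text lines (an open file works).
    In a well-formed CSV, every row has the same field count, so the
    histogram should have exactly one key.
    """
    counts = Counter()
    for row in csv.reader(lines):
        counts[len(row)] += 1
    return counts

# Example with an inline malformed CSV: the last row has an extra field.
hist = field_count_histogram(["a,b,c", "1,2,3", "1,2,3,4"])
print(hist)  # Counter({3: 2, 4: 1})
```

Running this over pldb.csv and checking which keys differ from 396 would give the same line numbers pandas reports, but for all rows at once.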

So far I have tried these approaches; let me know if you need help with anything. Thanks for looking into this.

@breck7
Owner Author

breck7 commented Feb 17, 2025

Okay, found the regression: at some point I added an optimized path for faster CSV generation ("buildConcepts") that only quote-escaped values containing commas, forgetting to also escape values that contained quotes but no commas.

Interestingly, all the spreadsheet programs still opened the poorly quoted CSVs fine, but opening the CSVs programmatically didn't work.
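
For reference, the quoting rule the fast path missed can be sketched in a few lines of Python (a minimal illustration of RFC 4180 escaping, not the actual Scroll code):

```python
def escape_csv_field(value: str) -> str:
    """Quote a field per RFC 4180.

    A field must be wrapped in double quotes if it contains a comma,
    a double quote, or a line break; any internal double quotes are
    doubled. The bug described above amounts to checking only for
    commas here, so fields like `say "hi"` went out unescaped.
    """
    if any(ch in value for ch in ',"\n\r'):
        return '"' + value.replace('"', '""') + '"'
    return value

print(escape_csv_field("plain"))         # plain
print(escape_csv_field("a,b"))           # "a,b"
print(escape_csv_field('say "hi"'))      # "say ""hi"""
```

An unescaped quote mid-field is exactly what produces the polars error above (`Field ... is not properly escaped`) and the `_csv.Error: ',' expected after '"'` from the pandas python engine.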

Thank you so much @swaptr for finding and reporting this!

@breck7
Owner Author

breck7 commented Feb 17, 2025

This fix: breck7/scroll@28964ac#r152645225
