bug: single value is invalidly split into multiple fields on certain unicode characters #137

neko-kai · 2018-10-04T21:30:33Z

Example:

import java.io.ByteArrayInputStream
import java.io.InputStreamReader

import com.github.tototoshi.csv.CSVReader

object App extends App {
  val csv = ",퀙䘘縤ઞ◒䘬掤⢶坪⁓匕ମҀꑤꇮ腋觯\uE5D8\uE564栚ℑ钺剸蕁耥믠鐛挀쐜麂\uE6BF슊䧩奌쒒\u0085䃡썙츚祉≔轾╠扒㱉鞎뽖븢暩䜄蚂\uE0F4窏\uEF66㋡\uEAC8\uEDEE\uF172秊ӥ붝ヴ恢둊\uEE65\uED46\uF4AC쫎,,,,2018-10-04T20:23:15.639Z,,,,,,,,,,,233299423,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Ⓑ쁖僂\uEF95䳀呇捧動瀼䂲殆䶐훥鼟쿠덠Ớ땄礪\uEEF4ἳ홤篏碽⪎ʞ昉\uF7E2\u0B29걫雘᪆脟\uEE43ᠪ뒤栗\uE487ɦ瀻\uE4AF뮲\uEF5B橾\uF358ᝬﭧ薪쉶䗹훴殊Ӯ\u0FF2ᢦ\uD7A9묬鼃\uEFBF䀌럚ᆾ掽呈콒ᶿ蟡䵫䃽ꅡᠹ檸ⰹ\uA4CA뢳ᑤ\uE57D웪ⷹ\uF436槵巸貉ﻥگ쁸㎿顲鱿뽿쒏﹪\uEB34浱ퟲ驊,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-7.147540834511315E-49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2018-10-04T09:54:28.639Z,,,,,,,,,"

  val input = new InputStreamReader(new ByteArrayInputStream(csv.getBytes("UTF-8")))

  val res = CSVReader.open(input).all()

  println(res)
}

The output is:

List(List(, 퀙䘘縤ઞ◒䘬掤⢶坪⁓匕ମҀꑤꇮ腋觯栚ℑ钺剸蕁耥믠鐛挀쐜麂슊䧩奌쒒), List(䃡썙츚祉≔轾╠扒㱉鞎뽖븢暩䜄蚂窏㋡秊ӥ붝ヴ恢둊쫎, , , , 2018-10-04T20:23:15.639Z, , , , , , , , , , , 233299423, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Ⓑ쁖僂䳀呇捧動瀼䂲殆䶐훥鼟쿠덠Ớ땄礪ἳ홤篏碽⪎ʞ昉଩걫雘᪆脟ᠪ뒤栗ɦ瀻뮲橾ᝬﭧ薪쉶䗹훴殊Ӯ࿲ᢦ힩묬鼃䀌럚ᆾ掽呈콒ᶿ蟡䵫䃽ꅡᠹ檸ⰹ꓊뢳ᑤ웪ⷹ槵巸貉ﻥگ쁸㎿顲鱿뽿쒏﹪浱ퟲ驊, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , -7.147540834511315E-49, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 2018-10-04T09:54:28.639Z, , , , , , , , , ))

But, the second value was supposed to be 퀙䘘縤ઞ◒䘬掤⢶坪⁓匕ମҀꑤꇮ腋觯栚ℑ钺剸蕁耥믠鐛挀쐜麂슊䧩奌쒒�䃡썙츚祉≔轾╠扒㱉鞎뽖븢暩䜄蚂窏㋡秊ӥ붝ヴ恢둊쫎! instead, it was truncated to 퀙䘘縤ઞ◒䘬掤⢶坪⁓匕ମҀꑤꇮ腋觯栚ℑ钺剸蕁耥믠鐛挀쐜麂슊䧩奌쒒


neko-kai · 2018-12-17T19:21:07Z

The problem is caused by this and other lines in the parser, which treat Unicode line endings, such as the character '\u0085' above, specially. The same line endings are NOT escaped in the Writer, which makes reading and writing asymmetric: scala-csv will not read back the CSV it wrote when Quoting != QUOTE_ALL and the input contains Unicode line endings.
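
To make the asymmetry concrete, here is a toy model of it. This is not scala-csv's code: `AsymmetryDemo`, `writeField`, `writeRow` and `readAll` are hypothetical names, and the set of characters the reader splits on is an assumption based on the description above. The writer quotes only on comma, quote, CR or LF, while the reader also breaks records on '\u0085', '\u2028' and '\u2029'.

```scala
// Toy model only: NOT scala-csv's actual implementation. All names are hypothetical.
object AsymmetryDemo {
  // Assumption: the reader treats these characters as record terminators.
  val readerLineBreaks: Set[Char] = Set('\n', '\r', '\u0085', '\u2028', '\u2029')

  // Writer side: quote a field only if it contains a comma, a quote, CR or LF.
  // Unicode line endings such as U+0085 are written out unquoted.
  def writeField(field: String): String =
    if (field.exists(c => c == ',' || c == '"' || c == '\r' || c == '\n'))
      "\"" + field.replace("\"", "\"\"") + "\""
    else field

  def writeRow(fields: Seq[String]): String =
    fields.map(writeField).mkString(",")

  // Reader side: break records on any assumed line break, then split each
  // record on commas. Quote handling is left out to keep the sketch short.
  def readAll(text: String): List[List[String]] = {
    val records = List.newBuilder[String]
    val current = new StringBuilder
    text.foreach { c =>
      if (readerLineBreaks.contains(c)) {
        records += current.toString
        current.clear()
      } else current.append(c)
    }
    records += current.toString
    records.result().map(_.split(",", -1).toList)
  }
}
```

Under these assumptions, writing the single row `Seq("a", "x\u0085y", "b")` and reading it back yields two rows, with the middle value cut in half, which is the same shape as the output in the original report.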

nicolagi · 2023-12-27T18:33:51Z

I'm also affected by this bug.
It was introduced in #25 to fix #22 and was then kept through subsequent refactorings.
It doesn't seem like the right fix, though: RFC 4180 and the W3C recommendation only specify CRLF as the line terminator for CSV files.

I noticed you can pass an implicit CSVFormat to the CSVReader open methods. The CSVFormat includes a customizable line terminator, but AFAICS, it is only used in the CSVWriter. Perhaps the CSVReader could be made to respect the line terminator in the CSVFormat?
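
A rough sketch of what that could look like, written against plain strings rather than the real parser. `TerminatorAwareSplitter` and `splitRecords` are hypothetical names and not part of scala-csv; the `lineTerminator` parameter stands in for the value that would come from the CSVFormat. Records are broken only on that terminator, and only outside double quotes, so a stray '\u0085' stays inside its field.

```scala
// Hypothetical sketch: not part of scala-csv, just an illustration of the idea.
object TerminatorAwareSplitter {
  /** Splits CSV text into raw records, breaking only where `lineTerminator`
    * occurs outside a double-quoted field. */
  def splitRecords(text: String, lineTerminator: String = "\r\n"): List[String] = {
    require(lineTerminator.nonEmpty, "lineTerminator must not be empty")
    val records = List.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < text.length) {
      val c = text.charAt(i)
      if (c == '"') {
        // An escaped quote ("") toggles twice, so the state comes out unchanged.
        inQuotes = !inQuotes
        current.append(c)
        i += 1
      } else if (!inQuotes && text.startsWith(lineTerminator, i)) {
        records += current.toString
        current.clear()
        i += lineTerminator.length
      } else {
        current.append(c)
        i += 1
      }
    }
    // A trailing terminator does not produce an extra empty record.
    if (current.nonEmpty) records += current.toString
    records.result()
  }
}
```

With the default CRLF terminator, `splitRecords("a,x\u0085y\r\nb,c\r\n")` gives two records and leaves `x\u0085y` intact. Passing `"\n"` instead would cover files written with a different terminator.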
