bug: single value is invalidly split into multiple fields on certain unicode characters #137

Open
neko-kai opened this issue Oct 4, 2018 · 2 comments

@neko-kai

neko-kai commented Oct 4, 2018

Example:

import java.io.ByteArrayInputStream
import java.io.InputStreamReader

import com.github.tototoshi.csv.CSVReader

object App extends App {
  val csv = ",퀙䘘縤ઞ◒䘬掤⢶坪⁓匕ମҀꑤꇮ腋觯\uE5D8\uE564栚ℑ钺剸蕁耥믠鐛挀쐜麂\uE6BF슊䧩奌쒒\u0085䃡썙츚祉≔轾╠扒㱉鞎뽖븢暩䜄蚂\uE0F4\uEF66\uEAC8\uEDEE\uF172秊ӥ붝ヴ恢둊\uEE65\uED46\uF4AC쫎,,,,2018-10-04T20:23:15.639Z,,,,,,,,,,,233299423,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,Ⓑ쁖僂\uEF95䳀呇捧動瀼䂲殆䶐훥鼟쿠덠Ớ땄礪\uEEF4ἳ홤篏碽⪎ʞ昉\uF7E2\u0B29걫雘᪆脟\uEE43ᠪ뒤栗\uE487ɦ瀻\uE4AF\uEF5B\uF358ᝬﭧ薪쉶䗹훴殊Ӯ\u0FF2\uD7A9묬鼃\uEFBF䀌럚ᆾ掽呈콒ᶿ蟡䵫䃽ꅡᠹ檸ⰹ\uA4CA뢳ᑤ\uE57D웪ⷹ\uF436槵巸貉ﻥگ쁸㎿顲鱿뽿쒏﹪\uEB34浱ퟲ驊,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,-7.147540834511315E-49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2018-10-04T09:54:28.639Z,,,,,,,,,"

  // Decode with an explicit charset so the result does not depend on the platform default
  val input = new InputStreamReader(new ByteArrayInputStream(csv.getBytes("UTF-8")), "UTF-8")

  val res = CSVReader.open(input).all()

  println(res)
}

The output is:

List(List(, 퀙䘘縤ઞ◒䘬掤⢶坪⁓匕ମҀꑤꇮ腋觯栚ℑ钺剸蕁耥믠鐛挀쐜麂슊䧩奌쒒), List(䃡썙츚祉≔轾╠扒㱉鞎뽖븢暩䜄蚂窏㋡秊ӥ붝ヴ恢둊쫎, , , , 2018-10-04T20:23:15.639Z, , , , , , , , , , , 233299423, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Ⓑ쁖僂䳀呇捧動瀼䂲殆䶐훥鼟쿠덠Ớ땄礪ἳ홤篏碽⪎ʞ昉଩걫雘᪆脟ᠪ뒤栗ɦ瀻뮲橾ᝬﭧ薪쉶䗹훴殊Ӯ࿲ᢦ힩묬鼃䀌럚ᆾ掽呈콒ᶿ蟡䵫䃽ꅡᠹ檸ⰹ꓊뢳ᑤ웪ⷹ槵巸貉ﻥگ쁸㎿顲鱿뽿쒏﹪浱ퟲ驊, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , -7.147540834511315E-49, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , 2018-10-04T09:54:28.639Z, , , , , , , , , ))

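However, the second value was supposed to be a single field: the text ending in 슊䧩奌쒒, then the character '\u0085', then the text starting with 䃡썙츚祉 and running up to the next comma. Instead, it was truncated at '\u0085' (so it ends in 슊䧩奌쒒), and the remainder became the first field of a second row.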

@neko-kai
Author

neko-kai commented Dec 17, 2018

The problem is caused by this and other lines in the parser, which treat Unicode line endings, such as the character '\u0085' above, specially. The same line endings are NOT escaped in the Writer, which makes reading and writing asymmetric: scala-csv will not read back the CSV it wrote when Quoting != QUOTE_ALL and the input contains Unicode line endings.
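
To illustrate the writer side of that asymmetry, here is a minimal, self-contained sketch of a quoting rule that also quotes fields containing Unicode line endings. It is not scala-csv code: `QuoteSketch`, `needsQuoting` and `render` are hypothetical names, and the set of line-ending characters is an assumption (only '\u0085' is confirmed by this report). It also assumes the reader keeps line endings that appear inside a quoted field, which the QUOTE_ALL remark above suggests.

```scala
object QuoteSketch {
  // Characters assumed to be treated as line endings by the reader.
  // Only NEL (0x85) is confirmed in this issue; the two others are guesses.
  val lineBreaks: Set[Char] = Set('\n', '\r', '\u0085', '\u2028', '\u2029')

  // A field needs quoting if it contains the delimiter, the quote
  // character, or any character the reader may see as a line ending.
  def needsQuoting(field: String, delimiter: Char = ',', quote: Char = '"'): Boolean =
    field.exists(c => c == delimiter || c == quote || lineBreaks(c))

  // Wrap the field in quotes when needed, doubling embedded quote characters.
  def render(field: String, delimiter: Char = ',', quote: Char = '"'): String =
    if (needsQuoting(field, delimiter, quote)) {
      val q = quote.toString
      q + field.replace(q, q + q) + q
    } else field
}
```

With a rule like this, a field containing '\u0085' would be written inside quotes even when not every field is quoted, so a reader that splits on that character outside quotes would still see one field.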

@nicolagi

nicolagi commented Dec 27, 2023

I'm also affected by this bug.
It was introduced in #25 to fix #22 and was then kept through the subsequent refactorings.
It doesn't seem like the right fix, though: RFC 4180 and the W3C recommendation only specify CRLF as the line terminator for CSV files.

I noticed you can pass an implicit CSVFormat to the CSVReader open methods. The CSVFormat includes a customizable line terminator, but AFAICS, it is only used in the CSVWriter. Perhaps the CSVReader could be made to respect the line terminator in the CSVFormat?
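
As a rough sketch of what respecting a configured line terminator could look like on the reading side, the snippet below splits input into records only at the given terminator and only outside quotes. It is self-contained and does not use scala-csv: `RecordSplitSketch` and `splitRecords` are hypothetical names, and the `lineTerminator` parameter merely stands in for the value that would come from the CSVFormat.

```scala
object RecordSplitSketch {
  // Split raw CSV text into records, using only `lineTerminator` as the
  // record separator. Other characters, including NEL (0x85), stay in the data.
  def splitRecords(input: String,
                   lineTerminator: String = "\r\n",
                   quote: Char = '"'): Vector[String] = {
    require(lineTerminator.nonEmpty, "lineTerminator must not be empty")
    val records = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < input.length) {
      val c = input.charAt(i)
      if (c == quote) {
        // A doubled quote toggles twice, so the state ends up unchanged.
        inQuotes = !inQuotes
        current.append(c)
        i += 1
      } else if (!inQuotes && input.startsWith(lineTerminator, i)) {
        // Terminator outside quotes: close the current record.
        records += current.toString
        current.clear()
        i += lineTerminator.length
      } else {
        current.append(c)
        i += 1
      }
    }
    // Keep a final record that has no trailing terminator.
    if (current.length > 0) records += current.toString
    records.result()
  }
}
```

For example, `RecordSplitSketch.splitRecords("a,x\u0085y\r\nb,c\r\n")` yields two records, with '\u0085' left inside the first one. Splitting each record into fields is left out of this sketch.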
