| 1 | +# Why the CSV standard library is broken, broken, broken (and how to fix it), Part III or Returning a CSV Record as an Array? Hash? Struct? Row? |
| 2 | + |
| 3 | + |
| 4 | +What's broken (and wrong, wrong, wrong) with the CSV standard library? Let's count the ways. |
| 5 | + |
| 6 | +Start with the (complete) series: |
| 7 | +- **[Part I or A (Simplistic) String#split Kludge vs A Purpose Built CSV Parser »](why-the-csv-stdlib-is-broken.md)** |
| 8 | +- **[Part II or The Wonders of CSV Formats / Dialects »](csv-formats.md)** |
| 9 | + |
| 10 | + |
| 11 | + |
| 12 | +## What's a Comma-Separated Values (CSV) Record? |
| 13 | + |
| 14 | +Let's read `data.csv`: |
| 15 | + |
| 16 | +``` |
| 17 | +a,b,c |
| 18 | +1,2,3 |
| 19 | +``` |
| 20 | + |
| 21 | +What do you expect? |
| 22 | + |
| 23 | +``` ruby |
| 24 | +pp CSV.read( 'data.csv' ) |
| 25 | +``` |
| 26 | + |
| 27 | +returns |
| 28 | + |
| 29 | +``` ruby |
| 30 | +[["a", "b", "c"], |
| 31 | + ["1", "2", "3"]] |
| 32 | +``` |
| 33 | + |
| 34 | +That's great. At its most basic |
| 35 | +a comma-separated values record once read in / parsed |
| 36 | +is a list of values. |
| 37 | + |
| 38 | + |
| 39 | +Now let's use a header. What do you expect now? |
| 40 | + |
| 41 | +``` ruby |
| 42 | +pp CSV.read( 'data.csv', headers: true ) |
| 43 | + |
| 44 | +# -or- |
| 45 | + |
| 46 | +CSV.foreach( 'data.csv', headers: true ) do |record| |
| 47 | + pp record |
| 48 | +end |
| 49 | +``` |
| 50 | + |
| 51 | +Oh, no! Yes, that's where the trouble starts. |
| 52 | + |
| 53 | +``` |
| 54 | +#<CSV::Table mode:col_or_row row_count:2> |
| 55 | + |
| 56 | +-or- |
| 57 | + |
| 58 | +#<CSV::Row "a":"1" "b":"2" "c":"3"> |
| 59 | +``` |
| 60 | + |
| 61 | +Adding `headers` to `CSV.read` turns the returned type into a lottery. |
| 62 | +You may get a plain old array or a custom `CSV::Table` or a custom `CSV::Row`. |
| 63 | +Let's fix it. How? |
| 64 | + |
| 65 | +The new `Csv.read` always, always, always returns an array of arrays |
| 66 | +for the comma-separated values records. Period. |
| 67 | + |
| 68 | +What about headers? |
| 69 | + |
| 70 | +The new `CsvHash.read` (note, the `Hash` in the name) |
| 71 | +always, always, always returns an array of hashes |
| 72 | +for the comma-separated values records. Period. |
| 73 | +No custom `CSV::Table` or custom `CSV::Row`. Thank you! |
| 74 | +Example: |
| 75 | + |
| 76 | +``` ruby |
| 77 | +pp CsvHash.read( 'data.csv' ) |
| 78 | + |
| 79 | +# => [{"a"=>"1", "b"=>"2", "c"=>"3"}] |
| 80 | +``` |
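
For comparison, here is a minimal sketch that uses only the `csv` library that ships with Ruby. It shows the three different return types side by side and one way to flatten them into the plain array of hashes described above. The inline `txt` string and the variable names are for illustration only (they stand in for `data.csv`).

```ruby
require 'csv'

txt = "a,b,c\n1,2,3\n"   # same content as data.csv, inlined for illustration

plain = CSV.parse( txt )                  # no headers
table = CSV.parse( txt, headers: true )   # headers, no block

rows = []
CSV.parse( txt, headers: true ) { |row| rows << row }   # headers, with block

puts plain.class        # Array
puts table.class        # CSV::Table
puts rows.first.class   # CSV::Row

# flatten the custom types into plain hashes (one hash per record)
hashes = table.map { |row| row.to_h }
pp hashes
```

Three calls, three types. The last step is the extra work you have to do by hand to get back to plain old arrays and hashes.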
| 81 | + |
| 82 | +Bonus: Why not return a (typed) struct instead of a (schema-less) hash? |
| 83 | + |
| 84 | +Great that you asked :-). |
| 85 | +Let's welcome `CsvRecord` |
| 86 | +that lets you define (typed) structs for |
| 87 | +your comma-separated values (csv) records. |
| 88 | +Example: |
| 89 | + |
| 90 | +``` ruby |
| 91 | +Abc = CsvRecord.define do |
| 92 | + field :a ## note: default type is :string |
| 93 | + field :b |
| 94 | + field :c |
| 95 | +end |
| 96 | + |
| 97 | +# or in "classic" style: |
| 98 | + |
| 99 | +class Abc < CsvRecord::Base |
| 100 | + field :a |
| 101 | + field :b |
| 102 | + field :c |
| 103 | +end |
| 104 | +``` |
| 105 | + |
| 106 | +and use it like: |
| 107 | + |
| 108 | +``` ruby |
| 109 | +pp Abc.read( 'data.csv') |
| 110 | + |
| 111 | +# or |
| 112 | + |
| 113 | +Abc.read( 'data.csv' ).each do |rec| |
| 114 | + puts "#{rec.a} #{rec.b} #{rec.c}" |
| 115 | +end |
| 116 | +``` |
| 117 | + |
| 118 | + |
| 119 | +resulting in: |
| 120 | + |
| 121 | +``` |
| 122 | +[#<Abc:0x302c760 @values=["1","2","3"]>] |
| 123 | + |
| 124 | +-or- |
| 125 | + |
| 126 | +1 2 3 |
| 127 | +``` |
| 128 | + |
| 129 | +Note: If you call `CsvRecord.read` directly you "auto(magically)-build" the (typed) struct |
| 130 | +(on-the-fly) from the headers in the datafile. Example: |
| 131 | + |
| 132 | +``` ruby |
| 133 | +pp CsvRecord.read( 'data.csv' ) |
| 134 | +# => [#<Class:0x405c770 @values=["1","2","3"]>] |
| 135 | +``` |
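
To make the "build the struct on-the-fly from the headers" idea concrete, here is a minimal sketch using Ruby's built-in `Struct`. It is not how `CsvRecord` is implemented. The helper name `read_as_structs` is hypothetical, every field stays a string (no typing), and it assumes the headers are valid Ruby method names.

```ruby
require 'csv'

# Hypothetical helper (not part of CsvRecord): build a Struct class
# from the header line, then wrap every following line in it.
def read_as_structs( txt )
  rows    = CSV.parse( txt )
  headers = rows.shift                                # first line holds the field names
  klass   = Struct.new( *headers.map( &:to_sym ) )    # one accessor per header
  rows.map { |values| klass.new( *values ) }
end

recs = read_as_structs( "a,b,c\n1,2,3\n" )
recs.each do |rec|
  puts "#{rec.a} #{rec.b} #{rec.c}"    # prints: 1 2 3
end
```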
| 136 | + |
| 137 | + |
| 138 | + |
| 139 | + |
| 140 | +## Reader vs Parser - Front-End vs Back-End |
| 141 | + |
| 142 | +What's broken (and wrong, wrong, wrong) with the CSV standard library? |
| 143 | +The CSV library is an all-in-one hairball with |
| 144 | +a (simplistic) `String#split` kludge instead of a purpose-built parser. |
| 145 | +Nothing new, see Part I (or Part II) in the series :-). |
| 146 | + |
| 147 | + |
| 148 | +Why not use different "low-level" parsers |
| 149 | +for supporting different CSV formats / dialects? |
| 150 | + |
| 151 | +Let's fix it. Yes, we can. |
| 152 | +The new CSV library alternative uses a reader for its "front-end" |
| 153 | +e.g. `Csv.parse`, `CsvHash.parse`, `CsvRecord.parse`, etc. |
| 154 | +and many parsers for its "back-end" |
| 155 | +e.g. `Csv::ParserStd.parse`, |
| 156 | +`Csv::ParserStrict.parse`, |
| 157 | +`Csv::ParserTab.parse`, etc. |
| 158 | +The idea is that the new "core" CSV library |
| 159 | +welcomes new parsers and |
| 160 | +is built on purpose to support them. |
| 161 | +For example, why not add a faster parser as a C extension (in "native" code)? |
| 162 | +Anyone? :-) |
| 163 | + |
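
The front-end / back-end split can be sketched in a few lines. This is an illustration of the idea only, not the library's code: `TinyReader`, `StdlibParser` and `TabParser` are hypothetical names, and the only assumed contract is that a parser responds to `parse( txt )` and returns an array of arrays.

```ruby
require 'csv'

# Back-ends: any object that responds to parse( txt )
# and returns an array of arrays.
class StdlibParser
  def parse( txt )
    CSV.parse( txt )
  end
end

class TabParser   # "strict" tab format: no quotes, no escapes
  def parse( txt )
    txt.each_line.map { |line| line.chomp.split( "\t", -1 ) }
  end
end

# Front-end: knows nothing about the format; delegates to the parser.
class TinyReader
  def initialize( parser )
    @parser = parser
  end

  def parse( txt )
    @parser.parse( txt )
  end

  def parse_hashes( txt )
    rows    = parse( txt )
    headers = rows.shift
    rows.map { |values| headers.zip( values ).to_h }
  end
end

comma = TinyReader.new( StdlibParser.new )
tab   = TinyReader.new( TabParser.new )

pp comma.parse( "a,b,c\n1,2,3\n" )
pp tab.parse_hashes( "a\tb\tc\n1\t\t3\n" )
```

Adding a new format means writing one more small parser class. The reader (and everything built on top of it, such as the hash variant) stays untouched.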
| 164 | + |
| 165 | + |
| 166 | +How to use a different parser? |
| 167 | +Change `Csv.read` (a convenience shortcut for |
| 168 | +`Csv.default.read`) to: |
| 169 | + |
| 170 | + |
| 171 | +``` ruby |
| 172 | +Csv.strict.read( 'data.csv' ) # will use the ParserStrict |
| 173 | +# -or- |
| 174 | +Csv.tab.read( 'data.tab') # will use the ParserTab ("strict" tab-format) |
| 175 | +# and so on |
| 176 | +``` |
| 177 | + |
| 178 | + |
| 179 | +You can also use different pre-configured / pre-defined |
| 180 | +dialects / formats. Example: |
| 181 | + |
| 182 | +``` ruby |
| 183 | +Csv.mysql.read( 'data.csv') |
| 184 | +Csv.postgresql_text.read( 'data.csv' ) |
| 185 | +Csv.excel.read( 'data.csv' ) |
| 186 | +# and so on |
| 187 | +``` |
| 188 | + |
| 189 | +Note: `Csv.mysql`, for example, is a convenience shortcut for: |
| 190 | + |
| 191 | +``` ruby |
| 192 | +parser = CsvReader::ParserStrict.new( sep: "\t", |
| 193 | + quote: false, |
| 194 | + escape: true, |
| 195 | + null: "\\N" ) |
| 196 | +mysql = CsvReader.new( parser ) |
| 197 | +mysql.read( 'data.csv' ) |
| 198 | +``` |
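
To show what options like `sep` and `null` boil down to, here is a toy parser that honors just those two. It is a sketch under stated assumptions: `ToyParser` is a hypothetical name, and the `quote` and `escape` options from the configuration above are deliberately left out, so it is nowhere near a complete MySQL-format parser.

```ruby
# Hypothetical toy parser; only `sep` and `null` are honored here
# (no quote handling, no escape handling).
class ToyParser
  def initialize( sep: ",", null: nil )
    @sep  = sep
    @null = null
  end

  def parse( txt )
    txt.each_line.map do |line|
      line.chomp.split( @sep, -1 ).map do |value|
        value == @null ? nil : value    # map the null marker to nil
      end
    end
  end
end

mysql_like = ToyParser.new( sep: "\t", null: "\\N" )
pp mysql_like.parse( "1\t\\N\t3\n" )
# => [["1", nil, "3"]]
```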
| 199 | + |
| 200 | + |
| 201 | + |
| 202 | +## CSV is the Future - The World's #1 Data Format |
| 203 | + |
| 204 | +Anyways, what's the point? |
| 205 | +The point is data is the new gold. |
| 206 | +And CSV is the world's #1 (most popular) data format :-). |
| 207 | + |
| 208 | +The csv standard library might once have been |
| 209 | +"state-of-the-art" ten years ago. Now, in 2020, it's unfortunately a |
| 210 | +dead horse with many, many flaws |
| 211 | +that cannot handle the (rich) diversity of csv formats / dialects. |
| 212 | + |
| 213 | + |
| 214 | +The joke (and heart of the matter) is that if you |
| 215 | +want to parse comma-separated values (csv) lines it is more |
| 216 | +complicated than using `line.split(",")` and you need a purpose-built |
| 217 | +parser for the (edge) cases and (special) escape rules, and, thus, |
| 218 | +you're advised to use a csv library. |
| 219 | +That's how it all started. |
| 220 | +What's broken (and wrong, wrong, wrong) with the CSV standard library? Let's count the ways. |
| 221 | + |
| 222 | + |
| 223 | +## Request for Comments (RFC) |
| 224 | + |
| 225 | +Please post your comments to the [ruby-talk mailing list](https://rubytalk.org) thread. Thanks! |