Commit ca23e4c

committed
add docs
1 parent 62905f4 commit ca23e4c

13 files changed

+2595
-0
lines changed

docs/README.md

+71
@@ -0,0 +1,71 @@
1+
# Comma-separated values (csv) scripts & tools docs
2+
3+
4+
5+
## Article Series - Why the CSV standard library is broken (and how to fix it)
6+
7+
<!-- comment out introduction
8+
9+
### Introduction
10+
11+
<details>
12+
<summary>Show/Hide Text</summary>
13+
14+
15+
Reminder: Dear [James Edward Gray II](https://twitter.com/JEG2), We love you. We thank you for your code.
16+
You're a genius. You're beautiful. [We stand on your shoulders. You're a giant.¹](https://en.wikipedia.org/wiki/Standing_on_the_shoulders_of_giants)
17+
Please, please, please - these articles are NOT about you.
18+
It's about the code and how to fix it.
19+
20+
> I'm seeing from you is that we should not consider people's feelings when criticizing their work. [...]
21+
> Please take time to sit down [..] and offer an apology to the author of the CSV library.
22+
23+
[I Apologize - Sorry, Sorry, Sorry - Why the standard CSV library author deserves our hugs and thank yous and why new giants are wanted »](sorry-sorry-sorry.md)
24+
25+
26+
---
27+
¹: stand on someone's shoulders - to make discoveries, insights, or progress due to the discoveries or previous work of those who have come before.
28+
29+
</details>
30+
31+
-->
32+
33+
34+
<!--
35+
### Content
36+
-->
37+
38+
39+
> "Criticism is something we can avoid easily by saying nothing, doing nothing, and being nothing."
40+
>
41+
> -- Aristotle
42+
43+
44+
_What's broken (and wrong, wrong, wrong) in the CSV standard library? Let's count the ways:_
45+
46+
- [**Part I or A (Simplistic) String#split Kludge vs A Purpose Built CSV Parser**](why-the-csv-stdlib-is-broken.md)
47+
- [**Part II or The Wonders of CSV Formats / Dialects**](csv-formats.md)
48+
- [**Part III or Returning a CSV Record as an Array? Hash? Struct? Row?**](csv-array-hash-struct.md)
49+
- [**Part IV or Numerics a.k.a. Auto-Magic Type Inference for Strings and Numbers**](csv-numerics.md)
50+
- [**Part V or Escaping the Stray Quote Error Hell - Do You Want Single, Double, or French Quotes With That Comma?**](csv-quotes.md)
51+
- [**Part VI or Fixes in Alternative CSV Libraries or Evolve or Die or Fast, Faster, Fasterer, Fastest**](csv-libraries.md)
52+
- [**Part VII or What's Your Type? Guess. Again. And Again. And Again. Guess What's a Schema For?**](csv-types.md)
53+
54+
55+
56+
<!--
57+
58+
> "He has a right to criticize, who has a heart to help."
59+
>
60+
> -- Abraham Lincoln
61+
62+
63+
-->
64+
65+
66+
67+
68+
## Migrate / Upgrade from ___ - Side-by-Side Examples
69+
70+
- [**Migrate / Upgrade from Smarter CSV to CSV Reader - Side-by-Side Examples**](smarter-csv.md)
71+

docs/csv-array-hash-struct.md

+225
@@ -0,0 +1,225 @@
1+
# Why the CSV standard library is broken, broken, broken (and how to fix it), Part III or Returning a CSV Record as an Array? Hash? Struct? Row?
2+
3+
4+
What's broken (and wrong, wrong, wrong) with the CSV standard library? Let's count the ways.
5+
6+
Start with the (complete) series:
7+
- **[Part I or A (Simplistic) String#split Kludge vs A Purpose Built CSV Parser »](why-the-csv-stdlib-is-broken.md)**
8+
- **[Part II or The Wonders of CSV Formats / Dialects »](csv-formats.md)**
9+
10+
11+
12+
## What's a Comma-Separated Values (CSV) Record?
13+
14+
Let's read `data.csv`:
15+
16+
```
17+
a,b,c
18+
1,2,3
19+
```
20+
21+
What do you expect?
22+
23+
``` ruby
24+
pp CSV.read( 'data.csv' )
25+
```
26+
27+
returns
28+
29+
``` ruby
30+
[["a", "b", "c"],
31+
["1", "2", "3"]]
32+
```
33+
34+
That's great. At its most basic
35+
a comma-separated values record once read in / parsed
36+
is a list of values.
37+
38+
39+
Now let's use a header. What do you expect now?
40+
41+
``` ruby
42+
pp CSV.read( 'data.csv', headers: true )
43+
44+
# -or-
45+
46+
CSV.foreach( 'data.csv', headers: true ) do |record|   # note: CSV.read ignores a block, CSV.foreach yields each row
47+
pp record
48+
end
49+
```
50+
51+
Oh, no! Yes, that's where the trouble starts.
52+
53+
```
54+
#<CSV::Table mode:col_or_row row_count:2>
55+
56+
-or-
57+
58+
#<CSV::Row "a":"1" "b":"2" "c":"3">
59+
```
60+
61+
Adding `headers` to `CSV.read` turns the returned type into a lottery.
62+
You may get a plain old array or a custom `CSV::Table` or a custom `CSV::Row`.
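
To see the lottery for yourself, here is a minimal sketch using the standard library only. It assumes `CSV.parse` on a string instead of `CSV.read` on a file (so nothing needs to exist on disk); the variable names are made up for illustration.

``` ruby
require 'csv'

txt = "a,b,c\n1,2,3\n"

plain = CSV.parse( txt )                  # no headers: plain array of arrays
table = CSV.parse( txt, headers: true )   # with headers: a CSV::Table
row   = table[0]                          # a single record: a CSV::Row

puts plain.class   # Array
puts table.class   # CSV::Table
puts row.class     # CSV::Row

# getting back to plain old hashes takes an extra conversion step
hashes = table.map { |r| r.to_h }
pp hashes
```

Three calls, three different types. That's the trouble.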
63+
Let's fix it. How?
64+
65+
The new `Csv.read` always, always, always returns an array of arrays
66+
for the comma-separated values records. Period.
67+
68+
What about headers?
69+
70+
The new `CsvHash.read` (note, the `Hash` in the name)
71+
always, always, always returns an array of hashes
72+
for the comma-separated values records. Period.
73+
No custom `CSV::Table` or custom `CSV::Row`. Thank you!
74+
Example:
75+
76+
``` ruby
77+
pp CsvHash.read( 'data.csv' )
78+
79+
# => [{"a"=>"1", "b"=>"2", "c"=>"3"}]
80+
```
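
How hard is "always an array of hashes"? Not very. The following is a rough sketch of the idea only, not the actual `CsvHash` code: the helper name `records_to_hashes` is hypothetical, and it assumes the records are already parsed into an array of arrays with the header in the first row.

``` ruby
# Hypothetical helper (not part of any library):
#   turn an array of arrays (first row = header) into an array of hashes.
def records_to_hashes( rows )
  return [] if rows.empty?

  headers, *records = rows
  records.map { |values| headers.zip( values ).to_h }
end

rows = [["a", "b", "c"],
        ["1", "2", "3"]]

pp records_to_hashes( rows )
```

The return type never changes: an array in, an array of plain hashes out.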
81+
82+
Bonus: Why not return a (typed) struct instead of a (schema-less) hash?
83+
84+
Glad you asked :-).
85+
Let's welcome `CsvRecord`
86+
that lets you define (typed) structs for
87+
your comma-separated values (csv) records.
88+
Example:
89+
90+
``` ruby
91+
Abc = CsvRecord.define do
92+
field :a ## note: default type is :string
93+
field :b
94+
field :c
95+
end
96+
97+
# or in "classic" style:
98+
99+
class Abc < CsvRecord::Base
100+
field :a
101+
field :b
102+
field :c
103+
end
104+
```
105+
106+
and use it like:
107+
108+
``` ruby
109+
pp Abc.read( 'data.csv')
110+
111+
# or
112+
113+
Abc.read( 'data.csv' ).each do |rec|
114+
puts "#{rec.a} #{rec.b} #{rec.c}"
115+
end
116+
```
117+
118+
119+
resulting in:
120+
121+
```
122+
[#<Abc:0x302c760 @values=["1","2","3"]>]
123+
124+
-or-
125+
126+
1 2 3
127+
```
128+
129+
Note: If you call `CsvRecord.read` directly (without defining a class first), the (typed) struct
130+
gets "auto(magically)" built on-the-fly from the headers in the datafile. Example:
131+
132+
``` ruby
133+
pp CsvRecord.read( 'data.csv' )
134+
# => [#<Class:0x405c770 @values=["1","2","3"]>]
135+
```
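
What could "building a struct on-the-fly from the headers" look like? Here is a minimal sketch using Ruby's built-in `Struct`. It is an assumption for illustration, not how `CsvRecord` works inside; the helper name `build_record_class` is hypothetical and all values stay strings (no type conversion).

``` ruby
# Hypothetical helper: build an (anonymous) struct class from the header row.
#   note: assumes the headers are usable as member names (e.g. "a", "b", "c")
def build_record_class( headers )
  Struct.new( *headers.map( &:to_sym ) )
end

rows = [["a", "b", "c"],
        ["1", "2", "3"]]

headers, *records = rows
rec_class = build_record_class( headers )
recs      = records.map { |values| rec_class.new( *values ) }

recs.each do |rec|
  puts "#{rec.a} #{rec.b} #{rec.c}"
end
```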
136+
137+
138+
139+
140+
## Reader vs Parser - Front-End vs Back-End
141+
142+
What's broken (and wrong, wrong, wrong) with the CSV standard library?
143+
The CSV library is an all-in-one hairball with
144+
a (simplistic) `String#split` kludge instead of a purpose-built parser.
145+
Nothing new, see Part I (or Part II) in the series :-).
146+
147+
148+
Why not use different "low-level" parsers
149+
for supporting different CSV formats / dialects?
150+
151+
Let's fix it. Yes, we can.
152+
The new CSV library alternative uses a reader for its "front-end"
153+
e.g. `Csv.parse`, `CsvHash.parse`, `CsvRecord.parse`, etc.
154+
and many parsers for its "back-end"
155+
e.g. `Csv::ParserStd.parse`,
156+
`Csv::ParserStrict.parse`,
157+
`Csv::ParserTab.parse`, etc.
158+
The idea is that the new "core" CSV library
159+
welcomes and
160+
is built on purpose for supporting new parsers.
161+
For example, why not add a faster parser with c-extensions (in "native" code)?
162+
Anyone? :-)
163+
164+
165+
166+
How to use a different parser?
167+
Change `Csv.read` (a convenience shortcut for
167+
`Csv.default.read`) to, for example:
169+
170+
171+
``` ruby
172+
Csv.strict.read( 'data.csv' ) # will use the ParserStrict
173+
# -or-
174+
Csv.tab.read( 'data.tab') # will use the ParserTab ("strict" tab-format)
175+
# and so on
176+
```
177+
178+
179+
You can also use different pre-configured / pre-defined
180+
dialects / formats. Example:
181+
182+
``` ruby
183+
Csv.mysql.read( 'data.csv')
184+
Csv.postgresql_text.read( 'data.csv' )
185+
Csv.excel.read( 'data.csv' )
186+
# and so on
187+
```
188+
189+
Note: `Csv.mysql`, for example, is a convenience shortcut for:
190+
191+
``` ruby
192+
parser = CsvReader::ParserStrict.new( sep: "\t",
193+
quote: false,
194+
escape: true,
195+
null: "\\N" )
196+
mysql = CsvReader.new( parser )
197+
mysql.read( 'data.csv' )
198+
```
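
The front-end / back-end split can be sketched in a few lines. Everything below is a toy for illustration: the class names `ToySepParser` and `ToyReader` are hypothetical, and the parser assumes a "strict" format without any quotes or escapes (which is the only reason a plain split is good enough here).

``` ruby
# Back-end: toy parser for a "strict" format (no quotes, no escapes).
class ToySepParser
  def initialize( sep: ",", null: nil )
    @sep  = sep
    @null = null
  end

  def parse( text )
    lines = text.each_line( chomp: true ).reject { |line| line.empty? }
    lines.map do |line|
      # limit -1 keeps trailing empty values
      line.split( @sep, -1 ).map { |value| value == @null ? nil : value }
    end
  end
end

# Front-end: the reader handles input, the parser handles the format.
class ToyReader
  def initialize( parser )
    @parser = parser
  end

  def parse( text )
    @parser.parse( text )
  end

  def read( path )
    parse( File.read( path ) )
  end
end

tab = ToyReader.new( ToySepParser.new( sep: "\t", null: "\\N" ) )
pp tab.parse( "a\tb\tc\n1\t\\N\t3\n" )
```

Swapping the format means swapping the parser object; the reader stays untouched.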
199+
200+
201+
202+
## CSV is the Future - The World's #1 Data Format
203+
204+
Anyway, what's the point?
205+
The point is data is the new gold.
206+
And CSV is the world's #1 and most popular data format :-).
207+
208+
The csv standard library might once have been
209+
"state-of-the-art" ten years ago - now in 2020 it's unfortunately a
210+
dead horse with many many flaws
211+
that cannot handle the (rich) diversity / dialects of csv formats.
212+
213+
214+
The joke (and heart of the matter) is that if you
215+
want to parse comma-separated values (csv) lines it is more
216+
complicated than using `line.split(",")` and you need a purpose-built
217+
parser for the (edge) cases and (special) escape rules, and, thus,
218+
you're advised to use a csv library.
219+
That's how it all started.
220+
What's broken (and wrong, wrong, wrong) with the CSV standard library? Let's count the ways.
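
Why is `line.split(",")` not enough? A quoted value with a comma inside breaks it. The following is a minimal sketch of a quote-aware line parser, assuming double quotes and the doubled-quote (`""`) escape rule only; the name `parse_line` and its options are made up for illustration, and it is nowhere near a complete csv parser.

``` ruby
# Toy line parser: handles quoted values and doubled quotes ("") only.
def parse_line( line, sep: ",", quote: '"' )
  values    = []
  value     = +""
  in_quotes = false
  chars     = line.chars
  i = 0
  while i < chars.size
    c = chars[i]
    if in_quotes
      if c == quote && chars[i+1] == quote
        value << quote        # doubled quote => literal quote
        i += 1
      elsif c == quote
        in_quotes = false     # closing quote
      else
        value << c            # separators inside quotes are plain text
      end
    elsif c == quote
      in_quotes = true        # opening quote
    elsif c == sep
      values << value
      value = +""
    else
      value << c
    end
    i += 1
  end
  values << value
  values
end

line = '1,"Hello, World!",3'

pp line.split( "," )     # the kludge: cuts the quoted value in two
pp parse_line( line )    # quote-aware: keeps "Hello, World!" in one piece
```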
221+
222+
223+
## Request for Comments (RFC)
224+
225+
Please post your comments to the [ruby-talk mailing list](https://rubytalk.org) thread. Thanks!
