Commit 7427f38

up

1 parent 004eeb1 commit 7427f38

16 files changed: +1062 -1 lines changed

README.md

+1 -1
@@ -9,5 +9,5 @@ Gem Family

 [csvyaml](csvyaml) - read tabular data in the CSV <3 YAML format, that is, comma-separated values (CSV) line-by-line records with yaml ain't markup language (YAML) encoding rules

-
+[csvrecord](csvrecord) - read in comma-separated values (csv) records with typed structs / schemas

csvrecord/.gitignore

+12
@@ -0,0 +1,12 @@
#######################
# ignore ruby rake generated folders

/pkg/
/doc/


################
# ignore (top-level) datapackage folders

/pack/
/.pack/

csvrecord/HISTORY.md

+3
@@ -0,0 +1,3 @@
### 0.0.1 / 2018-08-11

* Everything is new. First release.

csvrecord/Manifest.txt

+14
@@ -0,0 +1,14 @@
HISTORY.md
LICENSE.md
Manifest.txt
README.md
Rakefile
lib/csvrecord.rb
lib/csvrecord/base.rb
lib/csvrecord/version.rb
test/data/beer.csv
test/data/beer11.csv
test/helper.rb
test/test_record.rb
test/test_record_auto.rb
test/test_version.rb

csvrecord/NOTES.md

+258
@@ -0,0 +1,258 @@
# Notes


## More CSV Libraries

<https://github.com/pcreux/csv-importer>

<https://github.com/eval/csv-omg>



# CSV Notes

todo: move to awesome csv page - why? why not?

- <https://www.ietf.org/rfc/rfc4180.txt> - RFC 4180, Common Format and MIME Type for CSV Files, October 2005

(Augmented) Backus-Naur Form (BNF) Grammar:

```
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE = %x22
LF = %x0A
CRLF = CR LF
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
```
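
The `escaped` and `non-escaped` rules above can be sketched from the writer's side. This is a minimal illustration, not part of the csvrecord gem; the helper names `quoteField` and `formatRecord` are made up for the example.

``` go
package main

import (
	"fmt"
	"strings"
)

// quoteField (hypothetical helper) returns s as an RFC 4180 field.
// The field is wrapped in double quotes only when it contains a comma,
// a double quote, CR, or LF; embedded double quotes are doubled (2DQUOTE).
func quoteField(s string) string {
	if !strings.ContainsAny(s, ",\"\r\n") {
		return s // non-escaped = *TEXTDATA
	}
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// formatRecord joins the fields with COMMA (record = field *(COMMA field)).
func formatRecord(fields []string) string {
	quoted := make([]string, len(fields))
	for i, f := range fields {
		quoted[i] = quoteField(f)
	}
	return strings.Join(quoted, ",")
}

func main() {
	fmt.Println(formatRecord([]string{"apple", "orange, Valencia", `"extra virgin" olive`}))
	// prints: apple,"orange, Valencia","""extra virgin"" olive"
}
```

Quoting only when needed keeps simple values readable; quoting every field would also be valid under the grammar.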

- <https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/> - Model for Tabular Data and Metadata on the Web, W3C Recommendation 17 December 2015

(Extended) Backus-Naur Form (BNF) Grammar:

```
[1] csv ::= header record+
[2] header ::= record
[3] record ::= fields #x0D? #x0A
[4] fields ::= field ("," fields)*
[5] field ::= WS* rawfield WS*
[6] rawfield ::= '"' QCHAR* '"' |SCHAR*
[7] QCHAR ::= [^"] |'""'
[8] SCHAR ::= [^",#x0A#x0D]
[9] WS ::= [#x20#x09]
```


Notes on Microsoft Excel Spreadsheets:

Save

Excel generates CSV files encoded using Windows-1252 with LF line endings. Characters that cannot be represented within Windows-1252 are replaced by underscores. Only those cells that need escaping (e.g. because they contain commas or double quotes) are escaped, and double quotes are escaped with two double quotes.
Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent.


Open / Read

When opening CSV files, Excel interprets CSV files saved in UTF-8 as being encoded as Windows-1252 (whether or not a BOM is present). It correctly deals with double quoted cells, except that it converts line breaks within cells into spaces. It understands CRLF as a line break. It detects dates (formatted as YYYY-MM-DD) and formats them in the default date formatting for files.


Wikipedia
- <https://en.wikipedia.org/wiki/Comma-separated_values>

More

<https://www.csvreader.com/csv_format.php>

Use unix-style escape rules
- no quotes required!!!
- \, => e.g. apples\, carrots\, and oranges
- \\ for "escaped" literal backslash
- \n for newline (LF) and
- \r for carriage return (CR)
- \s - use for space - why? why not?
- DO NOT USE - \N - use for Null - why? why not? -- conflicts easily with \n, thus, NOT recommended - AVOID - DO NOT USE!!!!
- allow \### and \o### Octal, \x## Hex, \d### Decimal, and \u#### Unicode - why? why not?
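
A minimal sketch of how a reader for these unix-style escape rules might split one line, assuming only the `\,`, `\\`, `\n`, `\r` and `\s` escapes (the numeric escapes from the last bullet are left out). The function name `splitEscaped` is invented for the example.

``` go
package main

import (
	"fmt"
	"strings"
)

// splitEscaped (hypothetical helper) splits line on unescaped commas and
// resolves the backslash escapes \, \\ \n \r and \s. Any other escaped
// character is kept as-is; a trailing lone backslash is dropped.
func splitEscaped(line string) []string {
	var fields []string
	var cur strings.Builder
	escaped := false
	for _, r := range line {
		switch {
		case escaped:
			switch r {
			case 'n':
				cur.WriteRune('\n')
			case 'r':
				cur.WriteRune('\r')
			case 's':
				cur.WriteRune(' ')
			default: // covers \, and \\
				cur.WriteRune(r)
			}
			escaped = false
		case r == '\\':
			escaped = true
		case r == ',':
			fields = append(fields, cur.String())
			cur.Reset()
		default:
			cur.WriteRune(r)
		}
	}
	return append(fields, cur.String())
}

func main() {
	for _, f := range splitEscaped(`fruit,apples\, carrots\, and oranges,two\nlines`) {
		fmt.Printf("%q\n", f)
	}
}
```

Because no quotes are involved, a single pass with one "escaped" flag is enough; there is no quoted/unquoted state to track.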



"Rules" about spaces:

- recommendation: (always) trim leading and trailing spaces!!!!
- note: frictionless data "csv dialects" only allow trimming leading spaces (e.g. `skipInitialSpace`)

"Rules" about unknown unknowns (nil) and known unknowns (missing)

- recommendation: use a special letter e.g. `?` for marking **known** unknowns and
- use `""` or empty or `NA` or `n/a` etc. for "classic" null values (e.g. unknown unknowns)



- Allow comments starting with `#`
- Skip blank lines (white space trimmed)
  - is `,,,` a blank line? or four nulls?
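
The recommendations above (trim spaces, tell known unknowns from classic nulls, skip comments and blank lines) could be combined roughly as follows. This is only a sketch: the `Value` type and the helpers are hypothetical, fields are split naively on commas with no quoting support, and `,,,` is treated as four nulls, which is one possible answer to the open question.

``` go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Value is a hypothetical three-state cell.
type Value struct {
	Text    string
	Null    bool // unknown unknown: empty, NA, n/a
	Missing bool // known unknown: ?
}

// parseCell trims the raw cell and classifies it.
func parseCell(raw string) Value {
	s := strings.TrimSpace(raw)
	switch strings.ToLower(s) {
	case "?":
		return Value{Missing: true}
	case "", "na", "n/a":
		return Value{Null: true}
	}
	return Value{Text: s}
}

// readRows skips blank lines and lines starting with #, then splits
// the remaining lines on commas.
func readRows(input string) [][]Value {
	var rows [][]Value
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		var row []Value
		for _, cell := range strings.Split(line, ",") {
			row = append(row, parseCell(cell))
		}
		rows = append(rows, row)
	}
	return rows
}

func main() {
	rows := readRows("# beers\n\n  Ottakringer , ? , NA \n,,,\n")
	fmt.Printf("%+v\n", rows)
}
```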
105+
106+
107+
108+
# Go Csv Library
109+
110+
<https://golang.org/pkg/encoding/csv/>
111+
112+
dialect options include:
113+
114+
```
115+
// Comma is the field delimiter.
116+
// It is set to comma (',') by NewReader.
117+
// Comma must be a valid rune and must not be \r, \n,
118+
// or the Unicode replacement character (0xFFFD).
119+
Comma rune
120+
121+
// Comment, if not 0, is the comment character. Lines beginning with the
122+
// Comment character without preceding whitespace are ignored.
123+
// With leading whitespace the Comment character becomes part of the
124+
// field, even if TrimLeadingSpace is true.
125+
// Comment must be a valid rune and must not be \r, \n,
126+
// or the Unicode replacement character (0xFFFD).
127+
// It must also not be equal to Comma.
128+
Comment rune
129+
130+
// FieldsPerRecord is the number of expected fields per record.
131+
// If FieldsPerRecord is positive, Read requires each record to
132+
// have the given number of fields. If FieldsPerRecord is 0, Read sets it to
133+
// the number of fields in the first record, so that future records must
134+
// have the same field count. If FieldsPerRecord is negative, no check is
135+
// made and records may have a variable number of fields.
136+
FieldsPerRecord int
137+
138+
// If LazyQuotes is true, a quote may appear in an unquoted field and a
139+
// non-doubled quote may appear in a quoted field.
140+
LazyQuotes bool
141+
142+
// If TrimLeadingSpace is true, leading white space in a field is ignored.
143+
// This is done even if the field delimiter, Comma, is white space.
144+
TrimLeadingSpace bool
145+
```


``` go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

func main() {
	in := `first_name;last_name;username
"Rob";"Pike";rob
# lines beginning with a # character are ignored
Ken;Thompson;ken
"Robert";"Griesemer";"gri"
`
	r := csv.NewReader(strings.NewReader(in))
	r.Comma = ';'
	r.Comment = '#'

	records, err := r.ReadAll()
	if err != nil {
		log.Fatal(err)
	}

	fmt.Print(records)
}
```
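
The example above only sets `Comma` and `Comment`. A similar sketch for two of the other listed options, `TrimLeadingSpace` and `LazyQuotes`, might look like this; the helper name `readLenient` and the input data are made up for illustration.

``` go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// readLenient (hypothetical helper) parses input with a relaxed dialect.
func readLenient(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.TrimLeadingSpace = true // " Pike" is read as "Pike"
	r.LazyQuotes = true       // tolerate a bare quote inside an unquoted field
	return r.ReadAll()
}

func main() {
	records, err := readLenient("Rob, Pike, 6\" tall\nKen, Thompson, n/a\n")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%q\n", records)
}
```

Without `LazyQuotes`, the bare `"` in the third field of the first record is reported as a parse error.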


More about CSV

- <http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm>
- <http://www.creativyst.com/Doc/Std/ctx/ctx.htm> alternative "extended" CSV format (always uses pipes `|` as the separator)

e.g.

```
\TPersons|People Table|Pet owners in our example db|Pet owners|||
\LNumber|LastName|FirstName
\NPerson Number|Last Name|First Name
\QNUMBER(7)|VARCHAR(65)|CHAR(35)
1|Smythe|Jane
2|Doe|John
3|Mellonhead|Creg
```
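
Going only by the sample above (not by the full format description at the linked page), the lines starting with a backslash and a letter can be separated from the plain data rows like this. The function name `splitCTX` is invented, and the meaning of each prefix is deliberately not interpreted.

``` go
package main

import (
	"fmt"
	"strings"
)

const sample = `\TPersons|People Table|Pet owners in our example db|Pet owners|||
\LNumber|LastName|FirstName
\NPerson Number|Last Name|First Name
\QNUMBER(7)|VARCHAR(65)|CHAR(35)
1|Smythe|Jane
2|Doe|John
3|Mellonhead|Creg
`

// splitCTX (hypothetical helper) keeps each backslash-prefixed line under
// its two-character prefix (e.g. `\L`) and collects all other lines as rows.
// Fields are split on the pipe character; no escaping is handled.
func splitCTX(input string) (meta map[string][]string, rows [][]string) {
	meta = make(map[string][]string)
	for _, line := range strings.Split(strings.TrimSpace(input), "\n") {
		line = strings.TrimRight(line, "\r")
		if len(line) >= 2 && line[0] == '\\' {
			meta[line[:2]] = strings.Split(line[2:], "|")
			continue
		}
		rows = append(rows, strings.Split(line, "|"))
	}
	return meta, rows
}

func main() {
	meta, rows := splitCTX(sample)
	fmt.Println(meta[`\L`], len(rows))
	// prints: [Number LastName FirstName] 3
}
```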


## CSV Format

The CSV Format:
- **Each record is one line** - Line separator may be LF (0x0A) or CRLF (0x0D0A). A line separator may also be embedded in the data, which makes a record span more than one line but is still acceptable.
- **Fields are separated with commas.** - Duh.
- **Leading and trailing whitespace is ignored** - Unless the field is delimited with double-quotes, in which case the whitespace is preserved.
- **Embedded commas** - Field MUST be delimited with double-quotes.
- **Embedded double-quotes** - Embedded double-quote characters MUST be doubled, and the field must be delimited with double-quotes.
- **Embedded line-breaks** - Fields MUST be surrounded by double-quotes.

Source <http://edoceo.com/utilitas/csv-file-format>
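
The three "embedded" rules can be observed with Go's `encoding/csv` writer, which quotes a field only when it has to. A small sketch; the helper name `encode` is made up for the example.

``` go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
	"log"
)

// encode (hypothetical helper) writes the records to a string using the
// default csv.Writer settings (comma separator, LF line endings).
func encode(records [][]string) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.WriteAll(records); err != nil { // WriteAll also flushes
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := encode([][]string{
		{"plain", "embedded, comma", `embedded "quotes"`, "embedded\nline break"},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(out)
}
```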


## CSV Formats

This utility supports two flavors of CSV:

#### UNIX Style

- Backslash escape character for quotes (\"), new lines (\n), and backslashes (\\)
- Each record must be on its own line. If a field contains a new line, the new line must be escaped.
- Leading and trailing white space on an unquoted field is ignored.
- Compatible with standard unix text processing tools such as grep and sed that work on a line by line basis.


#### Microsoft Excel Style

- Two quotes escape character ("" escapes "), no other characters are escaped.
- Compatible with Microsoft Excel and many other programs that have adopted the format for data import and export.
- Leading and trailing white space on an unquoted field is significant.
- Specified by RFC4180.

Note that for simple field data that does not contain quotes or new lines, the two formats are fairly equivalent.
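
The difference between the two flavors shows up when decoding the inside of a quoted field. A minimal sketch, assuming the surrounding quotes have already been stripped; both function names are invented for the example and this is not the utility quoted above.

``` go
package main

import (
	"fmt"
	"strings"
)

// unquoteExcel decodes Excel-style content, where "" stands for one
// double quote and nothing else is escaped.
func unquoteExcel(inner string) string {
	return strings.ReplaceAll(inner, `""`, `"`)
}

// unquoteUnix decodes UNIX-style content, where \" \n and \\ are
// backslash escapes. The Replacer scans left to right without
// overlapping matches, so `\\n` becomes a backslash followed by n.
func unquoteUnix(inner string) string {
	return strings.NewReplacer(`\"`, `"`, `\n`, "\n", `\\`, `\`).Replace(inner)
}

func main() {
	fmt.Printf("%q\n", unquoteExcel(`say ""hi""`))
	fmt.Printf("%q\n", unquoteUnix(`say \"hi\"\nbye`))
}
```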



## CSV Format

```
apple,"wild cherry",peach
pear,plum,"apricot"
mango,papaya,guava
"orange, Valencia", lemon, lime
"""extra virgin"" olive", palm, date
```

Usually fields containing embedded spaces or commas
are contained in " marks, but there are other conventions.
Quotes (") inside quoted fields are doubled.
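
As a quick check, three of the sample rows can be fed to Go's `encoding/csv` reader. `TrimLeadingSpace` is switched on here as an assumption about how the spaces after the commas should be treated; the helper name `parseSample` is made up.

``` go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

const sample = `apple,"wild cherry",peach
"orange, Valencia", lemon, lime
"""extra virgin"" olive", palm, date
`

// parseSample (hypothetical helper) reads all records, dropping the
// leading space that follows each comma in the last two rows.
func parseSample(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.TrimLeadingSpace = true
	return r.ReadAll()
}

func main() {
	records, err := parseSample(sample)
	if err != nil {
		log.Fatal(err)
	}
	for _, rec := range records {
		fmt.Printf("%q\n", rec)
	}
}
```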


## CSV Variations

Behaviors of this program that often vary between CSV implementations:

- Newlines are supported in quoted fields.
- Double quotes are permitted in a non-quoted field. However, a field starting with a quote must follow quoting rules.
- Each record can have a different number of fields.
- The three common forms of newlines are supported: CR, CRLF, LF.
- A newline will be added if the file does not end with one.
- No whitespace trimming is done.
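
For comparison, two of these variations (newlines inside quoted fields, records of different lengths) can be tried with Go's `encoding/csv`, which is a different implementation from the program described in the list. The helper name `readRagged` is invented for the sketch.

``` go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// readRagged (hypothetical helper) accepts records with differing numbers
// of fields by setting FieldsPerRecord to a negative value.
func readRagged(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.FieldsPerRecord = -1
	return r.ReadAll()
}

func main() {
	records, err := readRagged("a,\"line one\nline two\"\nb,c,d\n")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%q\n", records)
}
```

With the default `FieldsPerRecord` of 0, the second record would be rejected because its field count differs from the first.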
258+

0 commit comments

Comments
 (0)