|
| 1 | +# Notes |
| 2 | + |
| 3 | + |
| 4 | + |
| 5 | +## More CSV Libraries |
| 6 | + |
| 7 | +<https://github.com/pcreux/csv-importer> |
| 8 | + |
| 9 | +<https://github.com/eval/csv-omg> |
| 10 | + |
| 11 | + |
| 12 | + |
| 13 | + |
| 14 | + |
| 15 | +# CSV Notes |
| 16 | + |
| 17 | +todo: move to awesome csv page - why? why not? |
| 18 | + |
| 19 | +- <https://www.ietf.org/rfc/rfc4180.txt> - RFC 4180, Common Format and MIME Type for CSV Files, October 2005 |
| 20 | + |
| 21 | +(Augmented) Backus-Naur Form (BNF) Grammar: |
| 22 | + |
| 23 | +``` |
| 24 | +file = [header CRLF] record *(CRLF record) [CRLF] |
| 25 | +header = name *(COMMA name) |
| 26 | +record = field *(COMMA field) |
| 27 | +name = field |
| 28 | +field = (escaped / non-escaped) |
| 29 | +escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE |
| 30 | +non-escaped = *TEXTDATA |
| 31 | +COMMA = %x2C |
| 32 | +CR = %x0D |
| 33 | +DQUOTE = %x22 |
| 34 | +LF = %x0A |
| 35 | +CRLF = CR LF |
| 36 | +TEXTDATA = %x20-21 / %x23-2B / %x2D-7E |
| 37 | +``` |
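
The `escaped` and `non-escaped` rules above can be read as a small quoting routine. The following is a minimal Go sketch, not part of the RFC: the helper names `escapeField` and `joinRecord` are invented for illustration, and the code only covers writing a single record.

```go
package main

import (
	"fmt"
	"strings"
)

// escapeField (hypothetical helper) returns s unchanged when it fits the
// non-escaped rule, and otherwise wraps it in DQUOTEs with every embedded
// DQUOTE doubled, as in the escaped rule.
func escapeField(s string) string {
	if !strings.ContainsAny(s, ",\"\r\n") {
		return s
	}
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// joinRecord (hypothetical helper) builds one record: field *(COMMA field).
func joinRecord(fields []string) string {
	out := make([]string, len(fields))
	for i, f := range fields {
		out[i] = escapeField(f)
	}
	return strings.Join(out, ",")
}

func main() {
	fmt.Println(joinRecord([]string{"apple", "orange, Valencia", `"extra virgin" olive`}))
	// prints: apple,"orange, Valencia","""extra virgin"" olive"
}
```

In real code the `encoding/csv` writer from the Go standard library does this quoting already; the sketch only makes the grammar concrete.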
| 38 | + |
| 39 | + |
| 40 | +- <https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/> - Model for Tabular Data and Metadata on the Web, W3C Recommendation 17 December 2015 |
| 41 | + |
| 42 | +(Extended) Backus-Naur Form (BNF) Grammar: |
| 43 | + |
| 44 | +``` |
| 45 | +[1] csv ::= header record+ |
| 46 | +[2] header ::= record |
| 47 | +[3] record ::= fields #x0D? #x0A |
| 48 | +[4] fields ::= field ("," fields)* |
| 49 | +[5] field ::= WS* rawfield WS* |
| 50 | +[6] rawfield ::= '"' QCHAR* '"' |SCHAR* |
| 51 | +[7] QCHAR ::= [^"] |'""' |
| 52 | +[8] SCHAR ::= [^",#x0A#x0D] |
| 53 | +[9] WS ::= [#x20#x09] |
| 54 | +``` |
| 55 | + |
| 56 | + |
| 57 | +Notes on Microsoft Excel Spreadsheets: |
| 58 | + |
| 59 | +Save |
| 60 | + |
| 61 | +Excel generates CSV files encoded using Windows-1252 with LF line endings. Characters that cannot be represented within Windows-1252 are replaced by underscores. Only those cells that need escaping (e.g. because they contain commas or double quotes) are escaped, and double quotes are escaped with two double quotes. |
| 62 | +Dates and numbers are formatted as displayed, which means that formatting can lead to information being lost or becoming inconsistent. |
| 63 | + |
| 64 | + |
| 65 | +Open / Read |
| 66 | + |
| 67 | +When opening CSV files, Excel interprets CSV files saved in UTF-8 as being encoded as Windows-1252 (whether or not a BOM is present). It correctly deals with double quoted cells, except that it converts line breaks within cells into spaces. It understands CRLF as a line break. It detects dates (formatted as YYYY-MM-DD) and formats them in the default date formatting for files. |
| 68 | + |
| 69 | + |
| 70 | +Wikipedia |
| 71 | +- <https://en.wikipedia.org/wiki/Comma-separated_values> |
| 72 | + |
| 73 | +More |
| 74 | + |
| 75 | +<https://www.csvreader.com/csv_format.php> |
| 76 | + |
| 77 | +Use unix-style escape rules |
| 78 | +- no quotes required!!! |
| 79 | +- \, => e.g. apples\, carrots\, and oranges |
| 80 | +- \\ for "escaped" literal backslash |
| 81 | +- \n for newline (LF) and |
| 82 | +- \r for carriage return (CR) |
| 83 | +- \s - use for space - why? why not? |
| 84 | +- DO NOT USE - \N - use for Null - why? why not? -- easily conflicts with \n, thus NOT recommended - AVOID - DO NOT USE!!!! |
| 85 | +- allow \### and \o### Octal, \x## Hex, \d### Decimal, and \u#### Unicode - why? why not? |
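
To make the unix-style escape rules above concrete, here is a minimal Go sketch of a line splitter. Everything in it is an assumption for illustration: the name `splitEscaped` is made up, `\s` is mapped to a space, any other escaped character is kept literally (which covers `\,` and `\\`), and the numeric escapes (octal, hex, decimal, Unicode) are not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// splitEscaped (hypothetical helper) splits one line on unescaped commas
// and resolves backslash escapes. No quotes are involved.
func splitEscaped(line string) []string {
	var fields []string
	var cur strings.Builder
	escaped := false
	for _, r := range line {
		switch {
		case escaped:
			switch r {
			case 'n':
				cur.WriteRune('\n')
			case 'r':
				cur.WriteRune('\r')
			case 's':
				cur.WriteRune(' ') // assumption: \s stands for a space
			default:
				cur.WriteRune(r) // covers \, and \\
			}
			escaped = false
		case r == '\\':
			escaped = true
		case r == ',':
			fields = append(fields, cur.String())
			cur.Reset()
		default:
			cur.WriteRune(r)
		}
	}
	if escaped {
		cur.WriteRune('\\') // assumption: a trailing lone backslash is kept
	}
	return append(fields, cur.String())
}

func main() {
	for _, f := range splitEscaped(`fruit,apples\, carrots\, and oranges,back\\slash`) {
		fmt.Printf("%q\n", f)
	}
}
```

Because every record stays on one physical line, input in this style can be split on newlines first and then passed to a function like this one line at a time.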
| 86 | + |
| 87 | + |
| 88 | + |
| 89 | + |
| 90 | +"Rules" about spaces: |
| 91 | + |
| 92 | +- recommendation: (always) trim leading and trailing spaces!!!! |
| 93 | + - note: frictionless data "csv dialects" only allow trimming leading space (e.g. `skipInitialSpace`) |
| 94 | + |
| 95 | +"Rules" about unknown unknowns (nil) and known unknowns (missing) |
| 96 | + |
| 97 | +- recommendation: use a special letter e.g. `?` for marking **known** unknowns and |
| 98 | +- use `""` or empty or `NA` or `n/a` etc. for "classic" null values (e.g. unknown unknowns) |
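
One way to read the two recommendations above is as a three-way classification of each parsed field. The Go sketch below is only an illustration: the type `Kind`, its constants, and `classify` are hypothetical names, and matching `NA` and `n/a` case-insensitively after trimming spaces is an assumption, not something these notes specify.

```go
package main

import (
	"fmt"
	"strings"
)

// Kind says how a parsed field value should be interpreted.
type Kind int

const (
	Value   Kind = iota // ordinary data
	Null                // "classic" null (unknown unknown)
	Missing             // known unknown, marked with ?
)

// classify (hypothetical helper) maps a field to a Kind.
func classify(field string) Kind {
	f := strings.TrimSpace(field)
	switch {
	case f == "?":
		return Missing
	case f == "" || strings.EqualFold(f, "NA") || strings.EqualFold(f, "n/a"):
		return Null
	default:
		return Value
	}
}

func main() {
	for _, f := range []string{"42", "?", "", "NA", "n/a"} {
		fmt.Printf("%q -> %d\n", f, classify(f))
	}
}
```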
| 99 | + |
| 100 | + |
| 101 | + |
| 102 | +- Allow comments starting with `#` |
| 103 | +- Skip blank lines (white space trimmed) |
| 104 | + - is ,,, a blank line? or four nulls? |
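
As one data point for the questions above, the Go standard library reader (`encoding/csv`) skips comment lines and completely empty lines, but reads `,,,` as a record of four empty fields. A minimal sketch, assuming `#` as the comment character; the helper name `readAll` is made up:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// readAll (hypothetical helper) parses in with '#' as the comment character.
func readAll(in string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(in))
	r.Comment = '#'
	return r.ReadAll()
}

func main() {
	in := "# a comment line\na,b,c,d\n\n,,,\n"
	records, err := readAll(in)
	if err != nil {
		log.Fatal(err)
	}
	// The comment line and the empty line produce no record.
	for _, rec := range records {
		fmt.Printf("%d fields: %q\n", len(rec), rec)
	}
}
```

Other libraries may answer the `,,,` question differently; this only shows one existing choice.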
| 105 | + |
| 106 | + |
| 107 | + |
| 108 | +# Go CSV Library |
| 109 | + |
| 110 | +<https://golang.org/pkg/encoding/csv/> |
| 111 | + |
| 112 | +dialect options include: |
| 113 | + |
| 114 | +``` |
| 115 | + // Comma is the field delimiter. |
| 116 | + // It is set to comma (',') by NewReader. |
| 117 | + // Comma must be a valid rune and must not be \r, \n, |
| 118 | + // or the Unicode replacement character (0xFFFD). |
| 119 | + Comma rune |
| 120 | + |
| 121 | + // Comment, if not 0, is the comment character. Lines beginning with the |
| 122 | + // Comment character without preceding whitespace are ignored. |
| 123 | + // With leading whitespace the Comment character becomes part of the |
| 124 | + // field, even if TrimLeadingSpace is true. |
| 125 | + // Comment must be a valid rune and must not be \r, \n, |
| 126 | + // or the Unicode replacement character (0xFFFD). |
| 127 | + // It must also not be equal to Comma. |
| 128 | + Comment rune |
| 129 | + |
| 130 | + // FieldsPerRecord is the number of expected fields per record. |
| 131 | + // If FieldsPerRecord is positive, Read requires each record to |
| 132 | + // have the given number of fields. If FieldsPerRecord is 0, Read sets it to |
| 133 | + // the number of fields in the first record, so that future records must |
| 134 | + // have the same field count. If FieldsPerRecord is negative, no check is |
| 135 | + // made and records may have a variable number of fields. |
| 136 | + FieldsPerRecord int |
| 137 | + |
| 138 | + // If LazyQuotes is true, a quote may appear in an unquoted field and a |
| 139 | + // non-doubled quote may appear in a quoted field. |
| 140 | + LazyQuotes bool |
| 141 | +
|
| 142 | + // If TrimLeadingSpace is true, leading white space in a field is ignored. |
| 143 | + // This is done even if the field delimiter, Comma, is white space. |
| 144 | + TrimLeadingSpace bool |
| 145 | +``` |
| 146 | + |
| 147 | + |
| 148 | +``` go |
| 149 | +package main |
| 150 | + |
| 151 | +import ( |
| 152 | + "encoding/csv" |
| 153 | + "fmt" |
| 154 | + "log" |
| 155 | + "strings" |
| 156 | +) |
| 157 | + |
| 158 | +func main() { |
| 159 | + in := `first_name;last_name;username |
| 160 | +"Rob";"Pike";rob |
| 161 | +# lines beginning with a # character are ignored |
| 162 | +Ken;Thompson;ken |
| 163 | +"Robert";"Griesemer";"gri" |
| 164 | +` |
| 165 | + r := csv.NewReader(strings.NewReader(in)) |
| 166 | + r.Comma = ';' |
| 167 | + r.Comment = '#' |
| 168 | + |
| 169 | + records, err := r.ReadAll() |
| 170 | + if err != nil { |
| 171 | + log.Fatal(err) |
| 172 | + } |
| 173 | + |
| 174 | + fmt.Print(records) |
| 175 | +} |
| 176 | +``` |
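
The `TrimLeadingSpace` option listed earlier matters as soon as a quoted field follows a space. A small sketch of the difference; the helper name `parse` is made up for this example:

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// parse (hypothetical helper) reads all records, optionally trimming
// leading white space in each field.
func parse(in string, trim bool) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(in))
	r.TrimLeadingSpace = trim
	return r.ReadAll()
}

func main() {
	in := "apple, \"wild cherry\", peach\n"

	// Without trimming, the second field starts with a space, so the quote
	// counts as a bare quote inside an unquoted field and parsing fails.
	if _, err := parse(in, false); err != nil {
		fmt.Println("strict:", err)
	}

	// With trimming, the field starts at the quote and parses normally.
	records, err := parse(in, true)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%q\n", records)
}
```

Note that only leading space is handled by this option; trailing space in an unquoted field is kept.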
| 177 | + |
| 178 | + |
| 179 | +More about CSV |
| 180 | + |
| 181 | +- <http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm> |
| 182 | +- <http://www.creativyst.com/Doc/Std/ctx/ctx.htm> alternative "extended" csv format (always uses pipes `|` as the separator) |
| 183 | + |
| 184 | +e.g. |
| 185 | + |
| 186 | +``` |
| 187 | +\TPersons|People Table|Pet owners in our example db|Pet owners||| |
| 188 | +\LNumber|LastName|FirstName |
| 189 | +\NPerson Number|Last Name|First Name |
| 190 | +\QNUMBER(7)|VARCHAR(65)|CHAR(35) |
| 191 | +1|Smythe|Jane |
| 192 | +2|Doe|John |
| 193 | +3|Mellonhead|Creg |
| 194 | +``` |
| 195 | + |
| 196 | + |
| 197 | +## CSV Format |
| 198 | + |
| 199 | +The CSV Format: |
| 200 | +- **Each record is one line** - Line separator may be LF (0x0A) or CRLF (0x0D0A), a line separator may also be embedded in the data (making a record more than one line but still acceptable). |
| 201 | +- **Fields are separated with commas.** - Duh. |
| 202 | +- **Leading and trailing whitespace is ignored** - Unless the field is delimited with double-quotes, in which case the whitespace is preserved. |
| 203 | +- **Embedded commas** - Field MUST be delimited with double-quotes. |
| 204 | +- **Embedded double-quotes** - Embedded double-quote characters MUST be doubled, and the field must be delimited with double-quotes. |
| 205 | +- **Embedded line-breaks** - Fields MUST be surrounded by double-quotes. |
| 206 | + |
| 207 | +Source <http://edoceo.com/utilitas/csv-file-format> |
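
The rules for embedded commas, double-quotes, and line breaks can be checked against Go's `encoding/csv` reader. This is a minimal sketch with a made-up helper name, `parseRecord`; it does not demonstrate the whitespace rule, because the Go reader keeps surrounding whitespace by default.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// parseRecord (hypothetical helper) reads the first record from in.
func parseRecord(in string) ([]string, error) {
	return csv.NewReader(strings.NewReader(in)).Read()
}

func main() {
	// One record spread over two physical lines: the last field contains
	// an embedded line break inside double-quotes.
	in := `plain,"a, b","say ""hi""","line one
line two"
`
	fields, err := parseRecord(in)
	if err != nil {
		log.Fatal(err)
	}
	for _, f := range fields {
		fmt.Printf("%q\n", f)
	}
}
```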
| 208 | + |
| 209 | + |
| 210 | +## CSV Formats |
| 211 | + |
| 212 | +This utility supports two flavors of CSV: |
| 213 | + |
| 214 | +#### UNIX Style |
| 215 | + |
| 216 | +- backslash escape character for quotes (\"), new lines (\n), and backslashes (\\) |
| 217 | +- Each record must be on its own line. If a field contains a new line, the new line must be escaped. |
| 218 | +- Leading and trailing white space on an unquoted field is ignored. |
| 219 | +- Compatible with standard unix text processing tools such as grep and sed that work on a line by line basis. |
| 220 | + |
| 221 | + |
| 222 | +#### Microsoft Excel Style |
| 223 | + |
| 224 | +- Two quotes escape character ("" escapes "), no other characters are escaped. |
| 225 | +- Compatible with Microsoft Excel and many other programs that have adopted the format for data import and export. |
| 226 | +- Leading and trailing white space on an unquoted field is significant. |
| 227 | +- Specified by RFC4180. |
| 228 | + |
| 229 | +Note that for simple field data that does not contain quotes or new lines, the two formats are fairly equivalent. |
| 230 | + |
| 231 | + |
| 232 | + |
| 233 | +## CSV Format |
| 234 | + |
| 235 | +``` |
| 236 | +apple,"wild cherry",peach |
| 237 | +pear,plum,"apricot" |
| 238 | +mango,papaya,guava |
| 239 | +"orange, Valencia", lemon, lime |
| 240 | +"""extra virgin"" olive", palm, date |
| 241 | +``` |
| 242 | + |
| 243 | +Usually fields containing embedded spaces or commas |
| 244 | +are contained in " marks, but there are other conventions. |
| 245 | +Quotes (") inside quoted fields are doubled. |
| 246 | + |
| 247 | + |
| 248 | +## CSV Variations |
| 249 | + |
| 250 | +Behaviors of this program that often vary between CSV implementations: |
| 251 | + |
| 252 | +- Newlines are supported in quoted fields. |
| 253 | +- Double quotes are permitted in a non-quoted field. However, a field starting with a quote must follow quoting rules. |
| 254 | +- Each record can have a different number of fields. |
| 255 | +- The three common forms of newlines are supported: CR, CRLF, LF. |
| 256 | +- A newline will be added if the file does not end with one. |
| 257 | +- No whitespace trimming is done. |
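
Some of the behaviors listed above can be approximated with Go's `encoding/csv` reader by setting `LazyQuotes` and a negative `FieldsPerRecord`. The sketch below is an assumption about how one might configure it, not a description of the program these notes quote; the helper name `parseLenient` is made up, and it only exercises quotes in unquoted fields, variable field counts, newlines in quoted fields, and mixed CRLF and LF line endings.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// parseLenient (hypothetical helper) accepts quotes in unquoted fields
// and records with differing numbers of fields.
func parseLenient(in string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(in))
	r.LazyQuotes = true    // allow a quote inside an unquoted field
	r.FieldsPerRecord = -1 // do not check the field count
	return r.ReadAll()
}

func main() {
	in := "size,5\" disk,extra\r\nshort\n\"multi\nline\",x"
	records, err := parseLenient(in)
	if err != nil {
		log.Fatal(err)
	}
	for _, rec := range records {
		fmt.Printf("%d fields: %q\n", len(rec), rec)
	}
}
```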
| 258 | + |