_episodes/03-dates-as-data.md
16 additions & 14 deletions
@@ -38,11 +38,11 @@ In particular, please remember that functions that are valid for a given
38
38
spreadsheet program (be it LibreOffice, Microsoft Excel, OpenOffice.org,
39
39
Gnumeric, etc.) are usually guaranteed to be compatible only within the same
40
40
family of products. If you will later need to export the data and need to
41
-
conserve the timestamps you are better off handling them using one of the solutions discussed below.
41
+
conserve the timestamps you are better off handling them using one of the solutions discussed below.
42
42
43
43
44
44
> ## Exercise
45
-
>
45
+
>
46
46
> Challenge: pulling month, day and year out of dates
47
47
>
48
48
> - In the `Dates` tab of your Excel file we summarized training data from 2015. There's a `date` column.
@@ -56,12 +56,10 @@ conserve the timestamps you are better off handling them using one of the soluti
56
56
>
57
57
> (Make sure the new column is formatted as a number and not as a date. Change the function to correspond to each row - i.e., =MONTH(A3), =DAY(A3), =YEAR(A3) for the next row.)
58
58
>
59
-
60
59
>
61
60
> > ## Solution
62
61
> > You can see that even though you wanted the year to be 2015 for all entries, your spreadsheet program interpreted two entries as 2017, the year the data was entered, not the year of the workshop.
@@ -76,7 +74,7 @@ If you’re working with historic data, be extremely careful with your dates!
76
74
77
75
Excel also supports a second date system, the 1904 date system, which is the default in Excel for Macintosh. This system will assign a
78
76
different serial number than the [1900 date system](https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel). Because of this,
79
-
[dates must be checked for accuracy when exporting data from Excel](http://uc3.cdlib.org/2014/04/09/abandon-all-hope-ye-who-enter-dates-in-excel/) (look for dates that are ~4 years off).
77
+
[dates must be checked for accuracy when exporting data from Excel](http://uc3.cdlib.org/2014/04/09/abandon-all-hope-ye-who-enter-dates-in-excel/) (look for dates that are ~4 years off).
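To make the offset concrete, here is a minimal Python sketch of how one serial number maps to two different calendar dates. The function name `serial_to_date`, the epoch constants and the sample serial number are illustrative assumptions, not part of the lesson. The 1900-system epoch of 1899-12-30 is the usual workaround for Excel treating 1900 as a leap year, so it only holds for dates from 1 March 1900 onward.

```python
from datetime import date, timedelta

# Epochs for the two Excel date systems (assumed helper constants).
# 1899-12-30 compensates for Excel's fictitious 29 Feb 1900, so it is
# only correct for serial numbers >= 61 (1 March 1900 onward).
EPOCH_1900 = date(1899, 12, 30)
EPOCH_1904 = date(1904, 1, 1)


def serial_to_date(serial, system=1900):
    """Convert a whole-day Excel serial number to a calendar date."""
    epoch = EPOCH_1900 if system == 1900 else EPOCH_1904
    return epoch + timedelta(days=serial)


serial = 41859  # sample serial number, chosen for illustration
as_1900 = serial_to_date(serial, system=1900)
as_1904 = serial_to_date(serial, system=1904)
offset_days = (as_1904 - as_1900).days
print(as_1900, as_1904, offset_days)
```

The same serial number lands 1,462 days (about four years) apart under the two systems, which is the discrepancy to look for after an export.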
80
78
81
79
82
80
## Data formats in spreadsheets
@@ -98,11 +96,15 @@ the above functions we can easily add days, months or years to a given date.
98
96
Say you had a sampling plan where you needed to sample every thirty seven days.
99
97
In another cell, you could type:
100
98
101
-
=B2+37
99
+
~~~
100
+
=B2+37
101
+
~~~
102
102
103
103
And it would return
104
104
105
-
8-Aug
105
+
~~~
106
+
8-Aug
107
+
~~~
106
108
107
109
because it understands the date as a number `41822`, and `41822 + 37 = 41859`
108
110
which Excel interprets as August 8, 2014. It retains the format (for the most
@@ -124,15 +126,15 @@ the quantities to the correct entities.
124
126
125
127
Which brings us to the many different ways Excel can display dates. If you refer to the figure above, you’ll see that there are many ways for ambiguity to creep into your data depending on the format you choose when you enter it. If you’re not fully aware of which format you’re using, you can end up entering your data in a way that Excel will badly misinterpret.
126
128
127
-
> ## Exercise
129
+
> ## Exercise
128
130
> What happens to the dates in the `dates` tab of our workbook if we save this sheet in Excel (in `csv` format) and then open the file in a plain text editor (like TextEdit or Notepad)? What happens to the dates if we then open the `csv` file in Excel?
129
131
> > ## Solution
130
-
> > - Click to the `dates` tab of the workbook and double-click on any of the values in the `Date collected` column. Notice that most of the dates display with the year 2015 and two are 2017.
131
-
> > - Select `File -> Save As` in Excel and in the drop down menu for file format select `CSV UTF-8 (Comma delimited) (.csv)`. Click `Save`.
132
-
> > - You will see a pop-up that says "This workbook cannot be saved in the selected file format because it contains multiple sheets." Choose `Save Active Sheet`.
133
-
> > - Navigate to the file in your finder application. Right click and select `Open With`. Choose a plain text editor application and view the file. Notice that the dates display as month/day without any year information.
134
-
> > - Now right click on the file again and open with Excel. Notice that the dates display with the current year, not 2015.
135
-
> > As you can see, exporting data from Excel and then importing it back into Excel fundamentally changed the data once again!
132
+
> > - Click to the `dates` tab of the workbook and double-click on any of the values in the `Date collected` column. Notice that most of the dates display with the year 2015 and two are 2017.
133
+
> > - Select `File -> Save As` in Excel and in the drop down menu for file format select `CSV UTF-8 (Comma delimited) (.csv)`. Click `Save`.
134
+
> > - You will see a pop-up that says "This workbook cannot be saved in the selected file format because it contains multiple sheets." Choose `Save Active Sheet`.
135
+
> > - Navigate to the file in your finder application. Right click and select `Open With`. Choose a plain text editor application and view the file. Notice that the dates display as month/day without any year information.
136
+
> > - Now right click on the file again and open with Excel. Notice that the dates display with the current year, not 2015.
137
+
> > As you can see, exporting data from Excel and then importing it back into Excel fundamentally changed the data once again!
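One way to see why the year disappears: the exported text contains only what the cell displayed. The following is a minimal Python sketch, assuming a day-month display format and using made-up sample dates. Writing full ISO 8601 strings is shown as one possible alternative, not as the fix this lesson prescribes.

```python
import csv
import io
from datetime import date

dates = [date(2015, 8, 8), date(2015, 11, 23)]  # made-up sample dates

# Mimic a day-month display format: the year is simply not written out.
displayed = [f"{d.day}-{d.strftime('%b')}" for d in dates]

# Write full ISO 8601 dates (YYYY-MM-DD) instead, so the year survives.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["date_collected"])
for d in dates:
    writer.writerow([d.isoformat()])

# Reading the text back recovers exactly the dates that were written.
buffer.seek(0)
rows = list(csv.reader(buffer))
restored = [date.fromisoformat(r[0]) for r in rows[1:]]
print(displayed, restored)
```

Any program that later opens the day-month strings has to guess the year, whereas the ISO strings carry it explicitly.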
_episodes/05-exporting-data.md
13 additions & 11 deletions
@@ -65,11 +65,13 @@ An important note for backwards compatibility: you can open CSVs in Excel!
65
65
66
66
## A Note on Cross-platform Operability
67
67
68
-
By default, most coding and statistical environments expect UNIX-style line endings (ASCII `LF` character) as representing line breaks. However, Windows uses an alternate line ending signifier (ASCII `CR LF` characters) by default for legacy compatibility with Teletype-based systems..
68
+
By default, most coding and statistical environments expect UNIX-style line endings (ASCII `LF` character) to represent line breaks. However, Windows uses an alternate line ending signifier (ASCII `CR LF` characters) by default for legacy compatibility with Teletype-based systems.
69
69
70
70
As such, when exporting to CSV using Excel, your data in text format will look like this:
71
71
72
-
>data1,data2<CR><LF>1,2<CR><LF>4,5<CR><LF>
72
+
~~~
73
+
data1,data2<CR><LF>1,2<CR><LF>4,5<CR><LF>
74
+
~~~
73
75
74
76
When opening your CSV file in Excel again, it will parse it as follows:
75
77
@@ -79,11 +81,11 @@ However, if you open your CSV file on a different system that does not parse the
79
81
80
82
Your data in text format then look like this:
81
83
82
-
>data1<br>
83
-
>data2<CR><br>
84
-
>1<br>
85
-
>2<CR><br>
86
-
>…
84
+
~~~
85
+
data1,data2<CR>
86
+
1,2<CR>
87
+
…
88
+
~~~
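The stray carriage return can be reproduced in a few lines. This is a minimal Python sketch, assuming the sample file content is held in memory as bytes. Replacing `CR LF` with `LF` is one straightforward normalization, not necessarily the approach this lesson recommends.

```python
# Bytes as Excel on Windows would write them (CR LF line endings).
raw = b"data1,data2\r\n1,2\r\n4,5\r\n"

# A parser that splits on LF only leaves a trailing CR on each line.
naive_lines = raw.decode("ascii").split("\n")
print(naive_lines[:2])

# Normalizing CR LF to LF first gives clean lines.
normalized = raw.replace(b"\r\n", b"\n")
clean_lines = normalized.decode("ascii").splitlines()
print(clean_lines)
```

The leftover `\r` ends up attached to the last value of each row, which is why it tends to surface in the final column.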
87
89
88
90
You will then see a weird character or possibly the string `CR` or `\r`:
89
91
@@ -100,15 +102,15 @@ There are a handful of solutions for enforcing uniform UNIX-style line endings o
## Dealing with commas as part of data values in `*.csv` files ##
17
17
18
-
In the [previous lesson](../05-exporting-data) we discussed how to export Excel file formats into `*.csv`. Whilst Comma Separated Value files are indeed very useful allowing for easily exchanging and sharing data.
18
+
In the [previous lesson](../05-exporting-data) we discussed how to export Excel file formats into `*.csv`. Comma Separated Value files are indeed very useful, allowing data to be exchanged and shared easily.
19
19
20
20
However, there are some significant problems with this particular format. Quite often the data values themselves may include commas (,). In that case, the software which you use (including Excel) will most likely split the data into the wrong columns. This is because the commas which are a part of the data values will be interpreted as delimiters.
In record `3 Jul,OA,1,25,20,PG, JM ,N` the value for *trainer* includes a comma for multiple trainers (`PG, JM`).
32
34
If we try to read the above into Excel (or other spreadsheet programme), we will get something like this:
33
35
34
36

35
37
36
-
The value for 'trainer' was split into two columns (instead of being put in one column `F`). This can propagate to a number of further errors. For example, the "extra" column will be interpreted as a column with many missing values (and without a proper header!).
37
-
38
-
If you want to store your data in `csv` format and expect that your data values may contain commas, you can avoid the problem discussed above by putting the values to be included in the same column in quotes (""). Applying this rule, the data might look like this:
38
+
The value for 'trainer' was split into two columns (instead of being put in one column `F`). This can propagate to a number of further errors. For example, the "extra" column will be interpreted as a column with many missing values (and without a proper header!).
39
+
40
+
If you want to store your data in `csv` format and expect that your data values may contain commas, you can avoid the problem discussed above by putting the values to be included in the same column in quotes (""). Applying this rule, the data might look like this:
Now opening this file as a `csv` in Excel will not lead to an extra column, because Excel will only use commas that fall outside of quotation marks as delimiting characters. However, if you are working with an already existing dataset in which the data values are not included in "" but which have commas as both delimiters and parts of data values, you are potentially facing a major problem with data cleaning.
49
52
50
53
If the dataset you're dealing with contains hundreds or thousands of records, cleaning them up manually (by either removing commas from the data values or putting the values into quotes - "") is not only going to take hours and hours but may potentially end up with you accidentally introducing many errors.
51
54
52
-
Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. The Python and R lessons will give you the basis for developing skills to build relevant scripts.
55
+
Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. The Python and R lessons will give you the basis for developing skills to build relevant scripts.
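As a small taste of what such a script can look like, here is a hedged Python sketch using the standard-library `csv` module with the `3 Jul` record discussed earlier. Tidying the trainer value to `PG, JM` (without the surrounding spaces) is an assumption made for illustration.

```python
import csv
import io

record = "3 Jul,OA,1,25,20,PG, JM ,N"

# Splitting on every comma breaks the trainer value into two fields.
naive_fields = record.split(",")

# Written with the csv module, a field containing a comma is quoted.
fields = ["3 Jul", "OA", "1", "25", "20", "PG, JM", "N"]
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
quoted_line = buffer.getvalue().strip()

# Reading the quoted line back yields the original seven fields.
parsed = next(csv.reader(io.StringIO(quoted_line)))
print(len(naive_fields), quoted_line, parsed)
```

The writer adds quotes only where a value contains the delimiter, so the rest of the file stays unchanged.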