Commit 7332391

small formatting fixes
1 parent 42a7cee commit 7332391

4 files changed, with 62 additions and 49 deletions

_episodes/01-format-data.md

Lines changed: 8 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -7,7 +7,7 @@ questions:
77
objectives:
88
- "Describe best practices for data entry and formatting in spreadsheets."
99
- "Apply best practices to arrange variables and observations in a spreadsheet."
10-
10+
1111
keypoints:
1212
- Use one column for one variable
1313
- Use one row for one observation
@@ -51,10 +51,13 @@ Unorganized data can make it harder to work with your data,
5151
so you should be mindful of your data organization when doing your data entry.
5252
You'll want to organize your data in a way that allows other programs and people to easily understand and use the data.
5353

54+
> ## Callout
55+
>
5456
> **Note:** the best layouts/formats (as well as software and
5557
> interfaces) for **data entry** and **data analysis** might be
5658
> different. It is important to take this into account, and ideally
5759
> automate the conversion from one to another.
60+
{: .callout}
5861

5962
### Keeping track of your analyses
6063

@@ -140,7 +143,7 @@ with this data and how you fixed it.
140143
{: .challenge}
141144

142145

143-
> ## Important ##
146+
> ## Important
144147
>
145148
> Do not forget our first piece of advice:
146149
> **create a new file** for the cleaned data, and **never
@@ -150,7 +153,10 @@ with this data and how you fixed it.
150153

151154
An excellent reference, in particular with regard to R scripting is
152155

156+
> ## Resource
157+
>
153158
> Hadley Wickham, *Tidy Data*, Vol. 59, Issue 10, Sep 2014, Journal of
154159
> Statistical Software. [http://www.jstatsoft.org/v59/i10](http://www.jstatsoft.org/v59/i10).
160+
{: .callout}
155161

156162
<!-- *Instructors see notes in 'instructors_notes.md' on this exercise.* -->

_episodes/03-dates-as-data.md

Lines changed: 16 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -38,11 +38,11 @@ In particular, please remember that functions that are valid for a given
3838
spreadsheet program (be it LibreOffice, Microsoft Excel, OpenOffice.org,
3939
Gnumeric, etc.) are usually guaranteed to be compatible only within the same
4040
family of products. If you will later need to export the data and need to
41-
conserve the timestamps you are better off handling them using one of the solutions discussed below.
41+
conserve the timestamps you are better off handling them using one of the solutions discussed below.
4242

4343

4444
> ## Exercise
45-
>
45+
>
4646
> Challenge: pulling month, day and year out of dates
4747
>
4848
> - In the `Dates` tab of your Excel file we summarized training data from 2015. There's a `date` column.
@@ -56,12 +56,10 @@ conserve the timestamps you are better off handling them using one of the soluti
5656
>
5757
> (Make sure the new column is formatted as a number and not as a date. Change the function to correspond to each row - i.e., =MONTH(A3), =DAY(A3), =YEAR(A3) for the next row.)
5858
>
59-
6059
>
6160
> > ## Solution
6261
> > You can see that even though you wanted the year to be 2015 for all entries, your spreadsheet program interpreted two entries as 2017, the year the data was entered, not the year of the workshop.
6362
> > ![dates, exercise 1](../fig/3_Dates_as_Columns.png)
64-
> > {: .output}
6563
> {: .solution}
6664
{: .challenge}
6765
@@ -76,7 +74,7 @@ If you’re working with historic data, be extremely careful with your dates!
7674
7775
Excel also entertains a second date system, the 1904 date system, as the default in Excel for Macintosh. This system will assign a
7876
different serial number than the [1900 date system](https://support.microsoft.com/en-us/help/214330/differences-between-the-1900-and-the-1904-date-system-in-excel). Because of this,
79-
[dates must be checked for accuracy when exporting data from Excel](http://uc3.cdlib.org/2014/04/09/abandon-all-hope-ye-who-enter-dates-in-excel/) (look for dates that are ~4 years off).
77+
[dates must be checked for accuracy when exporting data from Excel](http://uc3.cdlib.org/2014/04/09/abandon-all-hope-ye-who-enter-dates-in-excel/) (look for dates that are ~4 years off).
8078
8179
8280
## Data formats in spreadsheets
@@ -98,11 +96,15 @@ the above functions we can easily add days, months or years to a given date.
9896
Say you had a sampling plan where you needed to sample every thirty seven days.
9997
In another cell, you could type:
10098
101-
=B2+37
99+
~~~
100+
=B2+37
101+
~~~
102102
103103
And it would return
104104
105-
8-Aug
105+
~~~
106+
8-Aug
107+
~~~
106108
107109
because it understands the date as a number `41822`, and `41822 + 37 = 41859`
108110
which Excel interprets as August 8, 2014. It retains the format (for the most
@@ -124,15 +126,15 @@ the quantities to the correct entities.
124126
125127
Which brings us to the many different ways Excel provides in how it displays dates. If you refer to the figure above, you’ll see that there are many, MANY ways that ambiguity creeps into your data depending on the format you chose when you enter your data, and if you’re not fully cognizant of which format you’re using, you can end up actually entering your data in a way that Excel will badly misinterpret.
126128
127-
> ## Exercise
129+
> ## Exercise
128130
> What happens to the dates in the `dates` tab of our workbook if we save this sheet in Excel (in `csv` format) and then open the file in a plain text editor (like TextEdit or Notepad)? What happens to the dates if we then open the `csv` file in Excel?
129131
> > ## Solution
130-
> > - Click to the `dates` tab of the workbook and double-click on any of the values in the `Date collected` column. Notice that most of the dates display with the year 2015 and two are 2017.
131-
> > - Select `File -> Save As` in Excel and in the drop down menu for file format select `CSV UTF-8 (Comma delimited) (.csv)`. Click `Save`.
132-
> > - You will see a pop-up that says "This workbook cannot be saved in the selected file format because it contains multiple sheets." Choose `Save Active Sheet`.
133-
> > - Navigate to the file in your finder application. Right click and select `Open With`. Choose a plain text editor application and view the file. Notice that the dates display as month/day without any year information.
134-
> > - Now right click on the file again and open with Excel. Notice that the dates display with the current year, not 2015.
135-
> > As you can see, exporting data from Excel and then importing it back into Excel fundamentally changed the data once again!
132+
> > - Click to the `dates` tab of the workbook and double-click on any of the values in the `Date collected` column. Notice that most of the dates display with the year 2015 and two are 2017.
133+
> > - Select `File -> Save As` in Excel and in the drop down menu for file format select `CSV UTF-8 (Comma delimited) (.csv)`. Click `Save`.
134+
> > - You will see a pop-up that says "This workbook cannot be saved in the selected file format because it contains multiple sheets." Choose `Save Active Sheet`.
135+
> > - Navigate to the file in your finder application. Right click and select `Open With`. Choose a plain text editor application and view the file. Notice that the dates display as month/day without any year information.
136+
> > - Now right click on the file again and open with Excel. Notice that the dates display with the current year, not 2015.
137+
> > As you can see, exporting data from Excel and then importing it back into Excel fundamentally changed the data once again!
136138
> {: .solution}
137139
{: .challenge}
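
The serial-number arithmetic described in this episode (`41822 + 37 = 41859`, displayed as 8-Aug) can be reproduced outside the spreadsheet. The following is a minimal sketch rather than part of the lesson: the function name `serial_to_date` and the two epoch constants are assumptions chosen for illustration. It also shows why the same serial number read under the 1904 date system lands roughly four years away.

```python
from datetime import date, timedelta

# Assumed epochs (not stated in the lesson text). For serials after
# 28 Feb 1900, the 1900 system behaves as if day 0 were 1899-12-30, because
# Excel counts a non-existent 29 Feb 1900. The 1904 system uses 1904-01-01
# as day 0.
EPOCH_1900 = date(1899, 12, 30)
EPOCH_1904 = date(1904, 1, 1)


def serial_to_date(serial: int, epoch: date = EPOCH_1900) -> date:
    """Convert a whole-number spreadsheet serial into a calendar date."""
    return epoch + timedelta(days=serial)


start = serial_to_date(41822)        # the date stored in B2
later = serial_to_date(41822 + 37)   # what =B2+37 evaluates to
print(start.isoformat(), later.isoformat())

# The same serial interpreted under the 1904 system is shifted by the
# distance between the two epochs.
shifted = serial_to_date(41822 + 37, EPOCH_1904)
print((shifted - later).days, "days apart")
```

Comparing a few known dates this way after an export is one way to check for the ~4 year offset mentioned above.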
138140

_episodes/05-exporting-data.md

Lines changed: 13 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -65,11 +65,13 @@ An important note for backwards compatibility: you can open CSVs in Excel!
6565

6666
## A Note on Cross-platform Operability
6767

68-
By default, most coding and statistical environments expect UNIX-style line endings (ASCII `LF` character) as representing line breaks. However, Windows uses an alternate line ending signifier (ASCII `CR LF` characters) by default for legacy compatibility with Teletype-based systems.
68+
By default, most coding and statistical environments expect UNIX-style line endings (ASCII `LF` character) as representing line breaks. However, Windows uses an alternate line ending signifier (ASCII `CR LF` characters) by default for legacy compatibility with Teletype-based systems.
6969

7070
As such, when exporting to CSV using Excel, your data in text format will look like this:
7171

72-
>data1,data2<CR><LF>1,2<CR><LF>4,5<CR><LF>
72+
~~~
73+
data1,data2<CR><LF>1,2<CR><LF>4,5<CR><LF>
74+
~~~
7375

7476
When opening your CSV file in Excel again, it will parse it as follows:
7577

@@ -79,11 +81,11 @@ However, if you open your CSV file on a different system that does not parse the
7981

8082
Your data in text format then look like this:
8183

82-
>data1<br>
83-
>data2<CR><br>
84-
>1<br>
85-
>2<CR><br>
86-
>
84+
~~~
85+
data1,data2<CR>
86+
1,2<CR>
87+
88+
~~~
8789

8890
You will then see a weird character or possibly the string `CR` or `\r`:
8991

@@ -100,15 +102,15 @@ There are a handful of solutions for enforcing uniform UNIX-style line endings o
100102
```
101103
[filter "cr"]
102104
clean = LC_CTYPE=C awk '{printf(\"%s\\n\", $0)}' | LC_CTYPE=C tr '\\r' '\\n'
103-
smudge = tr '\\n' '\\r'
105+
smudge = tr '\\n' '\\r'
104106
```
105-
107+
106108
and then create a file `.gitattributes` that contains the line:
107-
109+
108110
```
109111
*.csv filter=cr
110112
```
111-
113+
112114
3. Use [dos2unix](http://dos2unix.sourceforge.net/) (available on OSX, *nix, and Cygwin) on local files to standardize line endings.
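
Where none of these tools is available, the same normalization can be scripted. This is a minimal sketch, not one of the lesson's listed solutions: the function name `normalize_line_endings` is hypothetical, and it assumes the whole file fits in memory and that no quoted field contains an embedded carriage return that must be preserved.

```python
def normalize_line_endings(raw: bytes) -> bytes:
    """Return raw with CR LF and lone CR line endings rewritten as LF."""
    # Replace the two-byte Windows ending first so it does not turn into
    # two separate LF characters.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# The CSV export shown above, written out as bytes.
windows_csv = b"data1,data2\r\n1,2\r\n4,5\r\n"
unix_csv = normalize_line_endings(windows_csv)
print(unix_csv)
```

Working on bytes rather than text avoids any automatic newline translation by the language's file layer.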
113115

114116
#### A note on Python and `xls`
Lines changed: 25 additions & 22 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
---
2-
title: Caveats of popular data and file formats
2+
title: Caveats of popular data and file formats
33
teaching: 5
44
exercises: 0
55
questions:
@@ -15,38 +15,41 @@ keypoints:
1515

1616
## Dealing with commas as part of data values in `*.csv` files ##
1717

18-
In the [previous lesson](../05-exporting-data) we discussed how to export Excel file formats into `*.csv`. Comma Separated Value files are indeed very useful, allowing for easy exchange and sharing of data.
18+
In the [previous lesson](../05-exporting-data) we discussed how to export Excel file formats into `*.csv`. Comma Separated Value files are indeed very useful, allowing for easy exchange and sharing of data.
1919

2020
However, there are some significant problems with this particular format. Quite often the data values themselves may include commas (,). In that case, the software which you use (including Excel) will most likely incorrectly display the data in columns. It is because the commas which are a part of the data values will be interpreted as a delimiter.
2121

2222
Data could look like this:
23-
24-
date,type,len_hours,num_registered,num_attended,trainer,cancelled
25-
29 Apr,OA,1.5,1.5,15,JM,N
26-
3 Mar,OA,60,19,25,PG,N
27-
3 Jul,OA,1,25,20,PG, JM ,N
28-
4 Jan,OA,1,26,17,JM,N
29-
29 Mar,RDM,1,27,24,JM,N
30-
31-
In record `3 Jul,OA,1,25,20,PG, JM ,N` the value for *trainer* includes a comma for multiple trainers (`PG, JM`).
23+
24+
~~~
25+
date,type,len_hours,num_registered,num_attended,trainer,cancelled
26+
29 Apr,OA,1.5,1.5,15,JM,N
27+
3 Mar,OA,60,19,25,PG,N
28+
3 Jul,OA,1,25,20,PG, JM ,N
29+
4 Jan,OA,1,26,17,JM,N
30+
29 Mar,RDM,1,27,24,JM,N
31+
~~~
32+
33+
In record `3 Jul,OA,1,25,20,PG, JM ,N` the value for *trainer* includes a comma for multiple trainers (`PG, JM`).
3234
If we try to read the above into Excel (or other spreadsheet programme), we will get something like this:
3335

3436
![Issue with importing csv format](../fig/csv-mistake.png)
3537

36-
The value for 'trainer' was split into two columns (instead of being put in one column `F`). This can propagate to a number of further errors. For example, the "extra" column will be interpreted as a column with many missing values (and without a proper header!).
37-
38-
If you want to store your data in `csv` format and expect that your data values may contain commas, you can avoid the problem discussed above by putting the values to be included in the same column in quotes (""). Applying this rule, the data might look like this:
38+
The value for 'trainer' was split into two columns (instead of being put in one column `F`). This can propagate to a number of further errors. For example, the "extra" column will be interpreted as a column with many missing values (and without a proper header!).
39+
40+
If you want to store your data in `csv` format and expect that your data values may contain commas, you can avoid the problem discussed above by putting the values to be included in the same column in quotes (""). Applying this rule, the data might look like this:
3941

40-
date,type,len_hours,num_registered,num_attended,trainer,cancelled
41-
29 Apr,OA,1.5,1.5,15,JM,N
42-
3 Mar,OA,60,19,25,PG,N
43-
3 Jul,OA,1,25,20,"PG, JM",N
44-
4 Jan,OA,1,26,17,JM,N
45-
29 Mar,RDM,1,27,24,JM,N
46-
42+
~~~
43+
date,type,len_hours,num_registered,num_attended,trainer,cancelled
44+
29 Apr,OA,1.5,1.5,15,JM,N
45+
3 Mar,OA,60,19,25,PG,N
46+
3 Jul,OA,1,25,20,"PG, JM",N
47+
4 Jan,OA,1,26,17,JM,N
48+
29 Mar,RDM,1,27,24,JM,N
49+
~~~
4750

4851
Now opening this file as a `csv` in Excel will not lead to an extra column, because Excel will only use commas that fall outside of quotation marks as delimiting characters. However, if you are working with an already existing dataset in which the data values are not included in "" but which have commas as both delimiters and parts of data values, you are potentially facing a major problem with data cleaning.
4952

5053
If the dataset you're dealing with contains hundreds or thousands of records, cleaning them up manually (by either removing commas from the data values or putting the values into quotes - "") is not only going to take hours and hours but may potentially end up with you accidentally introducing many errors.
5154

52-
Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. The Python and R lessons will give you the basis for developing skills to build relevant scripts.
55+
Cleaning up datasets is one of the major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. The Python and R lessons will give you the basis for developing skills to build relevant scripts.
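
As an illustration of the kind of script meant here, the sketch below uses Python's standard `csv` module on the example data from this episode. It is an assumption-laden example, not the lesson's own code: the helper name `find_ragged_rows` is hypothetical, and it only flags records whose field count differs from the header; deciding how to repair them still depends on context.

```python
import csv
import io

SAMPLE = """date,type,len_hours,num_registered,num_attended,trainer,cancelled
29 Apr,OA,1.5,1.5,15,JM,N
3 Mar,OA,60,19,25,PG,N
3 Jul,OA,1,25,20,PG, JM ,N
4 Jan,OA,1,26,17,JM,N
29 Mar,RDM,1,27,24,JM,N
"""


def find_ragged_rows(text):
    """Return (line_number, row) pairs whose field count differs from the header."""
    rows = list(csv.reader(io.StringIO(text)))
    expected = len(rows[0])
    # The header is line 1, so the first data record is line 2.
    return [(n, row) for n, row in enumerate(rows[1:], start=2)
            if len(row) != expected]


ragged = find_ragged_rows(SAMPLE)
print(ragged)

# csv.reader treats commas inside double quotes as data, not delimiters.
good_row = next(csv.reader(['3 Jul,OA,1,25,20,"PG, JM",N']))
print(len(good_row), good_row[5])

# csv.writer adds the quotes automatically when a field contains a comma.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(good_row)
written = buffer.getvalue()
print(written, end="")
```

Writing data out through a CSV library rather than joining strings by hand means fields containing commas are quoted without any manual step.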
