Added CSV file handling in WorkingWithFiles #94

113 changes: 113 additions & 0 deletions Python/Module5_OddsAndEnds/WorkingWithFiles.md
@@ -249,6 +249,119 @@ with open("a_poem.txt", mode="r") as my_open_file:
```
<!-- #endregion -->

<!-- #region -->
## Working with Comma Separated Value Files

Comma Separated Value (CSV) files are commonly used to store data that you might typically find in a table.
These files can be formatted in many ways, but the typical format separates the column values within a row by commas and separates the rows by newlines.
Suppose we have the following table of test scores:

| | Exam 1 (%) | Exam 2 (%) |
| ------------- |:-------------:| -----:|
| Ashley | $93$ | $95$ |
| Brad | $84$ | $100$ |
| Cassie | $99$ | $87$ |

This table depicts the test scores of three students across two exams.
Here is what the corresponding CSV file might look like:

```python
name,exam one score,exam two score
Ashley,93,95
Brad,84,100
Cassie,99,87
```
The first line of a CSV file typically contains a header that names each column.
The values themselves are allowed to contain spaces, as the header fields in this example do.
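
To make the layout concrete, here is a minimal sketch that reads the example file with Python's built-in `csv` module. The file contents are held in a string (`csv_text`, a name chosen for this illustration) so that the snippet runs without any file on disk.

```python
import csv
import io

# The example file from above, held in memory so the snippet is self-contained.
csv_text = """name,exam one score,exam two score
Ashley,93,95
Brad,84,100
Cassie,99,87
"""

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # the first line holds the column names
rows = list(reader)    # every remaining line becomes a list of strings

print(header)   # ['name', 'exam one score', 'exam two score']
print(rows[0])  # ['Ashley', '93', '95']
```

Note that every value comes back as a string; converting the scores to numbers is left to you.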

<div class="alert alert-warning">

**Note**:

It is not guaranteed that all CSV files are actually comma separated.
Non-standard CSV files will typically come with instructions on how the data is organized.
In general, it is a good practice to open up the CSV file and look at the first few lines to get a sense of how it is organized (unless the file is too large).
</div>
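
One way to look at the first few lines without loading the whole file is sketched below. The helper `peek` and the file *scores.csv* are hypothetical names used only for this illustration; the file is written to a temporary directory so that the example is self-contained.

```python
import os
import tempfile
from itertools import islice


def peek(path, num_lines=5):
    """Return the first `num_lines` lines of a text file without reading the rest."""
    with open(path, mode="r") as f:
        return [line.rstrip("\n") for line in islice(f, num_lines)]


# Write a small semicolon-separated file to a temporary directory, then peek at it.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "scores.csv")
    with open(path, mode="w") as f:
        f.write("name;exam one score\nAshley;93\nBrad;84\n")
    first_lines = peek(path, num_lines=2)

print(first_lines)  # ['name;exam one score', 'Ashley;93']
```

Here the first two lines are enough to see that this file is separated by semicolons rather than commas.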

### How to parse CSVs with NumPy

We will first look into parsing and storing CSV data using our favorite package: `numpy`!

To demonstrate how importing a CSV works, we will try to import [a coastal waves dataset](https://www.kaggle.com/jolasa/waves-measuring-buoys-data-mooloolaba/data) from Kaggle.
After you extract the *.csv* from the *.zip*, rename it to *costal_dataset.csv*.
```python
from numpy import genfromtxt # genfromtxt() allows for easy parsing of CSVs
my_data = genfromtxt(r"./Downloads/costal_dataset.csv", delimiter=',')
```
`genfromtxt()` takes the path to the CSV file and a delimiter (the character that separates the values, typically a comma for CSV files).
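
As a small illustration of the `delimiter` argument, the sketch below parses semicolon-separated numbers. The data is made up for this example and is passed in through `io.StringIO`, which `genfromtxt()` accepts in place of a file path.

```python
import io

from numpy import genfromtxt

# Made-up numerical data separated by semicolons instead of commas.
semicolon_text = "93;95\n84;100\n99;87\n"

scores = genfromtxt(io.StringIO(semicolon_text), delimiter=";")

print(scores.shape)  # (3, 2)
print(scores.dtype)  # float64
```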
Let's check out some properties of the CSV:

```python
>>> type(my_data)
numpy.ndarray

>>> my_data.shape
(43729, 7)

# Let's look at the actual data
>>> my_data
array([[ nan, nan, nan, ..., nan, nan, nan],
       [ nan, -99.9 , -99.9 , ..., -99.9 , -99.9 , -99.9 ],
       [ nan, 0.875, 1.39 , ..., 4.506, -99.9 , -99.9 ],
       ...,
       [ nan, 2.157, 3.43 , ..., 12.89 , 97. , 21.95 ],
       [ nan, 2.087, 2.84 , ..., 10.963, 92. , 21.95 ],
       [ nan, 1.926, 2.98 , ..., 12.228, 84. , 21.95 ]])
```
You may notice that there are some `nan` values present when we look at this particular set of data.
By default, `genfromtxt()` tries to convert every field to a floating point number, so any non-numerical values in the file, such as headers and dates, become `nan` in the resulting NumPy array.
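
The same behavior can be reproduced with the small test score table from earlier. The sketch below also shows one possible way to avoid the `nan` values, by skipping the header line and selecting only the numerical columns with the `skip_header` and `usecols` arguments of `genfromtxt()`. The variable names are chosen for this illustration.

```python
import io

import numpy as np

csv_text = "name,exam one score,exam two score\nAshley,93,95\nBrad,84,100\nCassie,99,87\n"

# Parsed as-is, the header row and the name column cannot be read as floats.
raw = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(raw.shape)               # (4, 3)
print(np.isnan(raw[0]).all())  # True: the header row is all nan

# Skip the header line and keep only the two score columns.
clean = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1, usecols=(1, 2))
print(clean.shape)             # (3, 2)
```

This throws the names and headers away; keeping them requires a different approach.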

### How to parse CSVs with Pandas

A really popular library for parsing CSVs is the [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html "Pandas Documentation") library. Here is a quick way to parse a CSV using Pandas:
```Python
import pandas as pd
my_data = pd.read_csv(r"./Downloads/costal_dataset.csv", sep=',', header=None)
```
That's it!
The function `read_csv()` parses the CSV and returns its contents, which we store in the variable `my_data`.
It accepts input parameters similar to those of `genfromtxt()`, along with many additional optional parameters.
Look at its docstring for more information.
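
For a quick feel of the default behavior, here is a minimal sketch that applies `read_csv()` to the small test score table from earlier, held in memory with `io.StringIO`. No `header` argument is passed, so Pandas uses the first line as the column names. The variable names are chosen for this illustration.

```python
import io

import pandas as pd

csv_text = "name,exam one score,exam two score\nAshley,93,95\nBrad,84,100\nCassie,99,87\n"

# With the default settings the first line is used as the column names.
scores = pd.read_csv(io.StringIO(csv_text))

print(list(scores.columns))             # ['name', 'exam one score', 'exam two score']
print(scores["exam one score"].mean())  # 92.0
```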

Let's look at the result of parsing the same [ocean waves CSV](https://www.kaggle.com/jolasa/waves-measuring-buoys-data-mooloolaba/data) from before, this time with Pandas instead of NumPy:

```Python
>>> type(my_data)
pandas.core.frame.DataFrame  # Notice that this is a custom type

>>> my_data.shape
(43729, 7)

>>> my_data.values  # This is how we access the values as an array
array([['Date/Time', 'Hs', 'Hmax', ..., 'Tp', 'Peak Direction', 'SST'],
       ['01/01/2017 00:00', '-99.9', '-99.9', ..., '-99.9', '-99.9',
        '-99.9'],
       ['01/01/2017 00:30', '0.875', '1.39', ..., '4.506', '-99.9',
        '-99.9'],
       ...,
       ['30/06/2019 22:30', '2.157', '3.43', ..., '12.89', '97', '21.95'],
       ['30/06/2019 23:00', '2.087', '2.84', ..., '10.963', '92',
        '21.95'],
       ['30/06/2019 23:30', '1.926', '2.98', ..., '12.228', '84',
        '21.95']], dtype=object)
```
One of the coolest features of Pandas is how it nicely organizes the parsed CSV data for visualization.
Here is how `my_data` is displayed in a Jupyter Notebook:

```Python
my_data[0:21]  # Displays the first 21 rows: the header row plus the first 20 rows of data
```
![Pandas Parsed Figure](pics/Pandas_CSV.jpg)
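
Slicing a `DataFrame` with `[start:stop]` selects rows by position and, like any Python slice, excludes the `stop` position. The sketch below uses a made-up five-row frame to show this, along with the equivalent `head()` method.

```python
import pandas as pd

# A small made-up frame, used only to illustrate row slicing.
frame = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

first_three = frame[0:3]   # rows at positions 0, 1 and 2
same_rows = frame.head(3)  # head(n) returns the first n rows

print(len(first_three))               # 3
print(first_three.equals(same_rows))  # True
```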

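One of the main advantages of Pandas is that each column of a `DataFrame` can hold its own type of data, so text such as headers and dates can be stored alongside numerical columns.
In the example above every value appears as a string only because `header=None` makes Pandas treat the header line as an ordinary row of data, which forces each column to hold text; omitting `header=None` lets Pandas use the first line as column names and infer numerical types for the remaining columns.
By contrast, `genfromtxt()` by default tries to convert every field to a float, which is why the headers and dates showed up as `nan`.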
Read the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html "Documentation Link") for more information.
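
The effect of the `header` argument on the parsed values can be seen with the small test score table from earlier. This is a minimal sketch with variable names chosen for illustration: with `header=None` the header line is read as a row of data, so the scores come back as strings, while the default setting yields numbers.

```python
import io

import pandas as pd

csv_text = "name,exam one score,exam two score\nAshley,93,95\nBrad,84,100\nCassie,99,87\n"

no_header = pd.read_csv(io.StringIO(csv_text), header=None)  # header line becomes row 0
with_header = pd.read_csv(io.StringIO(csv_text))             # header line becomes column names

print(no_header.shape)    # (4, 3)
print(with_header.shape)  # (3, 3)

# Ashley's first exam score: text in one frame, a number in the other.
print(type(no_header.iloc[1, 1]))                       # <class 'str'>
print(with_header.iloc[0, 1] + with_header.iloc[0, 2])  # 188
```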
<!-- #endregion -->

<!-- #region -->
## Globbing for Files

Binary file added Python/Module5_OddsAndEnds/pics/Pandas_CSV.jpg