2-Working-With-Data/07-python/README.md
+45-2
@@ -43,7 +43,9 @@ import matplotlib.pyplot as plt
43
43
from scipy import...# you need to specify exact sub-packages that you need
44
44
```
45
45
46
-
Pandas is centered around the following basic concepts:
46
+
Pandas is centered around a few basic concepts.
47
+
48
+
### Series
47
49
48
50
**Series** is a sequence of values, similar to a list or NumPy array. The main difference is that a series also has an **index**, and when we operate on series (e.g., add them), the index is taken into account. The index can be as simple as an integer row number (the default when creating a series from a list or array), or it can have a complex structure, such as a date interval.
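As a minimal sketch of how the index affects arithmetic (the names and values here are made up for illustration), adding two series matches values by index label rather than by position:

```python
import pandas as pd

# Two small series with partially overlapping integer indices (illustrative values)
x = pd.Series([10, 20, 30], index=[0, 1, 2])
y = pd.Series([1, 2, 3], index=[1, 2, 3])

# Addition aligns on index labels; labels present in only one series give NaN
z = x + y
print(z)

# A series can also be indexed by dates instead of integers
dates = pd.date_range("2020-01-01", periods=3, freq="D")
s = pd.Series([5, 7, 9], index=dates)
print(s.loc[pd.Timestamp("2020-01-02")])
```

Only labels `1` and `2` exist in both series, so only those two positions of `z` hold numeric sums.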
> **Note** that we are not using the simple syntax `total_items+additional_items`. If we did, we would have received a lot of `NaN` (*Not a Number*) values in the resulting series. This is because there are missing values for some of the index points in the `additional_items` series, and adding `NaN` to anything results in `NaN`. Thus we need to specify the `fill_value` parameter during addition.
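The effect of `fill_value` can be seen on a small example. This is a sketch with made-up data: `daily` and `extra` are hypothetical stand-ins, not the series used in the lesson.

```python
import pandas as pd

# A dense daily series (illustrative values)
daily = pd.Series([1, 1, 1, 1], index=pd.date_range("2020-01-01", periods=4, freq="D"))
# A sparser series that has values on only two of those days
extra = pd.Series([10, 10], index=pd.to_datetime(["2020-01-01", "2020-01-03"]))

plain = daily + extra                    # dates missing from `extra` become NaN
filled = daily.add(extra, fill_value=0)  # missing values are treated as 0 instead

print(plain)
print(filled)
```

With the plain `+` operator, the two dates absent from `extra` end up as `NaN`; with `fill_value=0` they keep the value from `daily`.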
78
+
79
+
With time series, we can also **resample** the series at different time intervals. For example, suppose we want to compute the mean sales volume for each month. We can use the following code:
80
+
```python
81
+
monthly = total_items.resample("1M").mean()  # on pandas >= 2.2 the "M" alias is deprecated; use "ME" (month end) there
82
+
ax = monthly.plot(kind='bar')
83
+
```
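To try resampling without the sales data, here is a self-contained sketch on a synthetic daily series. The `daily_sales` name and its values are assumptions for illustration. It uses the `"MS"` (month start) alias, which labels each bin by the first day of the month, whereas the month-end alias labels it by the last day.

```python
import numpy as np
import pandas as pd

# Synthetic daily data covering January through March 2020 (illustrative only)
idx = pd.date_range("2020-01-01", "2020-03-31", freq="D")
daily_sales = pd.Series(np.arange(len(idx)), index=idx)

# Group the daily values into calendar months and average each group
monthly_mean = daily_sales.resample("MS").mean()
print(monthly_mean)
```

The result has one row per month, and calling `.plot(kind='bar')` on it would draw one bar per month.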
84
+

85
+
86
+
### DataFrame
87
+
88
+
A DataFrame is essentially a collection of series with the same index. We can combine several series together into a DataFrame:
89
+
```python
90
+
a = pd.Series(range(1,10))
91
+
b = pd.Series(["I","like","to","use","Python","and","Pandas","very","much"],index=range(0,9))
92
+
df = pd.DataFrame([a,b])
93
+
```
94
+
This will create a horizontal table like this:
95
+
|| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
96
+
|---|---|---|---|---|---|---|---|---|---|
97
+
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
98
+
| 1 | I | like | to | use | Python | and | Pandas | very | much |
99
+
100
+
We can also use Series as columns, and specify the column names using a dictionary:
101
+
```python
102
+
df = pd.DataFrame({ 'A' : a, 'B' : b })
103
+
```
104
+
This will give us a table like this:
105
+
106
+
|| A | B |
107
+
|---|---|---|
108
+
| 0 | 1 | I |
109
+
| 1 | 2 | like |
110
+
| 2 | 3 | to |
111
+
| 3 | 4 | use |
112
+
| 4 | 5 | Python |
113
+
| 5 | 6 | and |
114
+
| 6 | 7 | Pandas |
115
+
| 7 | 8 | very |
116
+
| 8 | 9 | much |
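The two layouts are transposes of each other. The sketch below uses short illustrative series (`nums` and `words` are hypothetical names) to build a DataFrame both ways and compare the shapes; `.T` is the standard pandas transpose accessor.

```python
import pandas as pd

nums = pd.Series(range(1, 4))
words = pd.Series(["x", "y", "z"], index=range(0, 3))

by_rows = pd.DataFrame([nums, words])            # each series becomes a row
by_cols = pd.DataFrame({"A": nums, "B": words})  # each series becomes a named column

print(by_rows.shape)  # (2, 3)
print(by_cols.shape)  # (3, 2)

# Transposing the row-wise frame and naming its columns gives the column-wise layout
transposed = by_rows.T
transposed.columns = ["A", "B"]
print(transposed)
```

The column-wise form is usually more convenient, since each column then holds values of a single kind.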
74
117
## 🚀 Challenge
75
118
76
119
The first problem we will focus on is modelling the epidemic spread of COVID-19. To do that, we will use data on the number of infected individuals in different countries, provided by the [Center for Systems Science and Engineering](https://systems.jhu.edu/) (CSSE) at [Johns Hopkins University](https://jhu.edu/). The dataset is available in [this GitHub Repository](https://github.com/CSSEGISandData/COVID-19).
77
120
78
-
Since we want to demonstrate how to deal with data, we invite you to open [`notebook-pandas.ipynb`](notebook-pandas.ipynb) and read it from top to bottom. You can also execute cells, and do some challenges that we have left for you along the way.
121
+
Since we want to demonstrate how to deal with data, we invite you to open [`notebook-covidspread.ipynb`](notebook-covidspread.ipynb) and read it from top to bottom. You can also execute cells, and do some challenges that we have left for you along the way.