English version/ch01_MathematicalBasis/Chapter 1_MathematicalBasis.md (+1 −1)
@@ -389,7 +389,7 @@ P(A|B) = P(A\cap B) / P(B)
$$
Description: Let $A$ and $B$ be events (subsets) of the same sample space $\Omega$. If an element randomly selected from $\Omega$ is known to belong to $B$, then the probability that it also belongs to $A$ is defined as the conditional probability of $A$ given $B$.
From the Venn diagram it is clear that, given that event $B$ has occurred, the probability that event $A$ also occurs is $P(A\cap B)$ divided by $P(B)$.
Example: A couple has two children. Given that one of them is a girl, what is the probability that the other is also a girl? (A common interview and written-test question.)
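A quick numeric check of this definition on the example (a sketch we added; equally likely birth orders is the standard assumption for this puzzle):

```python
from fractions import Fraction

# Equally likely combinations of (older child, younger child).
omega = [("B", "B"), ("B", "G"), ("G", "B"), ("G", "G")]

B = [s for s in omega if "G" in s]            # condition: at least one child is a girl
A_and_B = [s for s in B if s == ("G", "G")]   # event: the other child is also a girl

print(Fraction(len(A_and_B), len(B)))         # P(A|B) = P(A∩B)/P(B) = 1/3
```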
English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md (+39 −39)
@@ -14,7 +14,7 @@ The feature neuron model in the multi-layer perceptron is called the perceptron
The simple perceptron is shown below:
Where $x_1$, $x_2$, $x_3$ are the inputs to the perceptron, and its output is:
@@ -46,30 +46,30 @@ $$
By choosing appropriate weights and bias $b$, a simple perceptron unit can implement a NAND gate, as shown below:
When the input is $0$ and $1$, the perceptron output is $0 \times (-2) + 1 \times (-2) + 3 = 1$.
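A minimal sketch of this NAND perceptron in code (weights $-2, -2$ and bias $3$ as in the text; the function name and threshold convention are our own):

```python
def perceptron(x1, x2, w1=-2, w2=-2, b=3):
    """Single perceptron: output 1 if the weighted sum plus bias is positive, else 0."""
    return 1 if x1 * w1 + x2 * w2 + b > 0 else 0

# Truth table: the unit behaves as a NAND gate.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", perceptron(x1, x2))   # 1, 1, 1, 0
```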
More complex perceptrons are composed of simple perceptron units:
**Multilayer Perceptron**
The multilayer perceptron is a generalization of the perceptron. Its most important feature is that it has multiple neuron layers, which is why it is also called a deep neural network. Unlike a single perceptron, each neuron in layer $i$ of a multilayer perceptron is connected to every neuron in layer $i-1$.
The output layer can have more than $1$ neuron. There can be only $1$ hidden layer, or there can be multiple hidden layers. A neural network whose output layer has multiple neurons is shown below:
### 3.1.2 What are the common model structures of neural networks?
The figure below contains most of the commonly used models:
### 3.1.3 How to choose a deep learning development platform?
@@ -108,14 +108,14 @@ Some platforms are specifically developed for deep learning research and applica
Gradient vanishing is affected by many factors, such as the size of the learning rate, the initialization of the network parameters, and the saturating (edge) effect of the activation function. In a deep neural network, the gradient computed at each neuron is passed back to the previous layer, so the gradient received by shallower neurons is the product of the gradients of all the deeper layers. If the computed gradient values are very small, the gradient update information decays exponentially as the number of layers increases, and the gradient vanishes. The figure below shows the learning speed of different hidden layers:
2. Exploding Gradient
In network structures such as deep networks or recurrent neural networks (RNNs), gradients can accumulate during the update process and become very large, causing large updates to the network weights and making the network unstable. In extreme cases the weight values even overflow and become $NaN$, after which they can no longer be updated.
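A toy illustration of both effects (our own sketch, not from the original text): repeatedly multiplying a gradient by layer Jacobians that are on average smaller or larger than $1$ makes its norm shrink or blow up exponentially with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 4, 50

for scale, label in [(0.3, "vanishing"), (1.5, "exploding")]:
    grad = np.ones(dim)
    for _ in range(depth):
        # Stand-in for one layer's Jacobian in the backward pass.
        J = scale * rng.standard_normal((dim, dim)) / np.sqrt(dim)
        grad = J @ grad
    print(label, np.linalg.norm(grad))
```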
3. Degeneration of the weight matrices reduces the effective degrees of freedom of the model. The rate of learning along the degenerate directions of parameter space slows down, which reduces the effective dimensionality of the model; the available degrees of freedom of the network contribute to the gradient norm during learning. As the number of multiplied matrices (i.e., the network depth) increases, the matrix product becomes more and more degenerate. In nonlinear networks with hard saturation boundaries (such as ReLU networks), this degeneration becomes faster and faster as the depth increases. A visualization of this degeneration process is shown in a 2014 paper by Duvenaud et al.:
As the depth increases, the input space (shown in the upper left corner) is twisted into thinner and thinner filaments. At each point of the input space, only the direction orthogonal to the filament affects the network's response, and along that direction the network is in fact very sensitive to change.
@@ -129,9 +129,9 @@ Traditional machine learning needs to define some manual features to purposefull
## 3.2 Network Operations and Calculations
@@ -141,23 +141,23 @@ There are two main types of neural network calculations: forward propagation (FP)
**Forward Propagation**
Suppose the nodes $i, j, k, \dots$ of the previous layer are connected to a node $w$ of this layer. How is the value of node $w$ computed? A weighted sum of the outputs of nodes $i, j, k, \dots$ with the corresponding connection weights is formed, a bias term is added (omitted from the figure for simplicity), and the result is passed through a nonlinear activation function such as $ReLU$ or $sigmoid$; the final result is the output of node $w$ in this layer.
Finally, by repeating this operation layer by layer, the output of the output layer is obtained.
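A minimal sketch of this layer-by-layer computation (layer sizes, random weights and the $sigmoid$ choice are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, layers):
    """Each layer is a (W, b) pair; its output is activation(W @ input + b)."""
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),   # hidden layer: 3 inputs -> 4 units
          (rng.standard_normal((2, 4)), np.zeros(2))]   # output layer: 4 units -> 2 outputs
print(forward(np.array([1.0, 0.5, -0.5]), layers))
```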
**Backpropagation**
Taking classification as an example, the final result of forward propagation always has some error. How can this error be reduced? One widely used algorithm is gradient descent, but computing the gradient requires partial derivatives. The lettered figure above is used as an example to explain:
Let the final error be $E$ and let the activation function of the output layer be a linear activation function. The partial derivative of $E$ with respect to the output node $y_l$ is then $y_l - t_l$, where $t_l$ is the true value. $\frac{\partial y_l}{\partial z_l}$ is the derivative of the activation function mentioned above, where $z_l$ is the weighted sum mentioned above, so the partial derivative of $E$ with respect to $z_l$ of this layer is $\frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l} \frac{\partial y_l}{\partial z_l}$. The next layer is calculated in the same way, except that the way $\frac{\partial E}{\partial y_k}$ is computed changes, and this is repeated back to the input layer, where finally $\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_i}$, with $\frac{\partial z_j}{\partial x_i} = w_{ij}$. The weights are then adjusted using these gradients, and the forward-propagation and backpropagation process is repeated until a satisfactory result is obtained.
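A compact numeric sketch of this chain-rule computation for a tiny 2-2-1 sigmoid network trained on one example with squared error (the sizes, data and learning rate are our own illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 2)), np.zeros(2)
W2, b2 = rng.standard_normal((1, 2)), np.zeros(1)
x, t, lr = np.array([0.5, -1.0]), np.array([1.0]), 0.5

for _ in range(200):
    # Forward pass.
    z1 = W1 @ x + b1;  y1 = sigmoid(z1)
    z2 = W2 @ y1 + b2; y2 = sigmoid(z2)
    # Backward pass: dE/dz = dE/dy * dy/dz at each layer, with E = 0.5*(y2 - t)^2.
    dE_dz2 = (y2 - t) * y2 * (1 - y2)
    dE_dz1 = (W2.T @ dE_dz2) * y1 * (1 - y1)
    # Gradient-descent update of the weights and biases.
    W2 -= lr * np.outer(dE_dz2, y1); b2 -= lr * dE_dz2
    W1 -= lr * np.outer(dE_dz1, x);  b1 -= lr * dE_dz1

print(float(y2[0]))   # moves toward the target 1.0
```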
### 3.2.2 How to calculate the output of the neural network?
As shown in the figure above, the input layer has three nodes, numbered 1, 2, and 3; the four nodes of the hidden layer are numbered 4, 5, 6, and 7; and the two nodes of the output layer are numbered 8 and 9. For example, node 4 of the hidden layer is connected to the three nodes 1, 2, and 3 of the input layer, and the weights on these connections are $w_{41}, w_{42}, w_{43}$.
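A sketch of the computation for hidden node 4 (the input values, weights $w_{41}, w_{42}, w_{43}$ and bias below are made-up placeholders; the chapter's worked example uses the $sigmoid$ activation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1, -0.2])     # outputs of input nodes 1, 2, 3
w4 = np.array([0.1, 0.2, 0.3])     # weights w41, w42, w43
b4 = 0.05                          # bias of node 4

a4 = sigmoid(w4 @ x + b4)          # output of hidden node 4
print(a4)
```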
@@ -185,7 +185,7 @@ For the same reason, we can also calculate $ y_2 $. So that the output values
Suppose there is a 5\*5 image, convolved with a 3\*3 filter, and we want a 3\*3 Feature Map, as shown below:
$x_{i,j}$ denotes the element in row $i$, column $j$ of the image; $w_{m,n}$ denotes the weight in row $m$, column $n$ of the filter; $w_b$ denotes the bias of the filter; $a_{i,j}$ denotes the element in row $i$, column $j$ of the Feature Map; $f$ denotes the activation function, and here the $ReLU$ function is used as an example.
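A sketch of this computation (stride $1$, no padding, $ReLU$ activation; the image and filter values below are arbitrary):

```python
import numpy as np

def conv2d(image, kernel, bias=0.0, stride=1):
    """a_{i,j} = ReLU(sum_{m,n} x_{i+m, j+n} * w_{m,n} + w_b), valid cross-correlation."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = max(0.0, float(np.sum(patch * kernel) + bias))  # ReLU
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # 5*5 input image
kernel = np.ones((3, 3)) / 9.0                     # 3*3 filter
print(conv2d(image, kernel).shape)                 # (3, 3) Feature Map
```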
@@ -211,15 +211,15 @@ $$
The calculation process is illustrated as follows:
By analogy, all Feature Maps are calculated.
When the stride is 2, the Feature Map is calculated as follows:
Note: Image size, stride, and the size of the Feature Map after convolution are related. They satisfy the following relationship:
@@ -252,15 +252,15 @@ Where $D$ is the depth; $F$ is the size of the filter; $w_{d,m,n}$ represents th
There can be multiple filters per convolutional layer. After each filter is convolved with the original image, you get one Feature Map, so the depth (number) of Feature Maps after convolution equals the number of filters in the convolutional layer. The illustration below shows the calculation of a convolutional layer with two filters: a $7*7*3$ input, convolved with two $3*3*3$ filters with a stride of $2$, gives an output of $3*3*2$. The zero padding in the figure is $1$, i.e., one ring of $0$s is added around the input.
The above is the calculation method of the convolutional layer. It uses local connections and weight sharing: each neuron is connected only to part of the neurons in the previous layer (following the convolution calculation rule), and the same filter weights are used at every position of the previous layer. For a convolutional layer containing two $3*3*3$ filters, the number of parameters is only $(3*3*3+1)*2 = 56$, independent of the number of neurons in the previous layer. Compared with a fully connected neural network, the number of parameters is greatly reduced.
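A quick check of this parameter count, together with a comparison against a fully connected layer of the same input and output sizes (the fully connected figure is our own illustration):

```python
f, depth, filters = 3, 3, 2
conv_params = (f * f * depth + 1) * filters       # (3*3*3 + 1) * 2 = 56
print(conv_params)

# A fully connected layer mapping the 7*7*3 input to the 3*3*2 output would need:
fc_params = (7 * 7 * 3 + 1) * (3 * 3 * 2)          # 2664 parameters
print(fc_params)
```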
### 3.2.4 How to calculate the output value of the Pooling layer?
The main role of the Pooling layer is downsampling: it further reduces the number of parameters by removing unimportant samples from the Feature Map. There are many pooling methods; the most common one is Max Pooling, which takes the maximum value within each n\*n window as the sampled value. The figure below shows 2\*2 max pooling:
In addition to Max Pooling, Average Pooling is also commonly used, which takes the average value of each window.
For a Feature Map with a depth of $ D $ , each layer does Pooling independently, so the depth after Pooling is still $ D $.
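A sketch of 2\*2 Max Pooling with stride 2 on a single Feature Map (the array values are arbitrary):

```python
import numpy as np

def max_pool(fmap, size=2, stride=2):
    oh = (fmap.shape[0] - size) // stride + 1
    ow = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = fmap[i * stride:i * stride + size,
                             j * stride:j * stride + size].max()
    return out

fmap = np.array([[ 1,  2,  3,  4],
                 [ 5,  6,  7,  8],
                 [ 9, 10, 11, 12],
                 [13, 14, 15, 16]], dtype=float)
print(max_pool(fmap))   # [[ 6.  8.] [14. 16.]]
```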
@@ -269,7 +269,7 @@ For a Feature Map with a depth of $ D $ , each layer does Pooling independently,
A typical three-layer neural network is as follows:
Where Layer $ L_1 $ is the input layer, Layer $ L_2 $ is the hidden layer, and Layer $ L_3 $ is the output layer.
@@ -279,11 +279,11 @@ If the input and output are the same, it is a self-encoding model. If the raw da
Suppose you have the following network layer:
The input layer contains neurons $i_1, i_2$ and a bias $b_1$; the hidden layer contains neurons $h_1, h_2$ and a bias $b_2$; the output layer contains $o_1, o_2$. $w_i$ denotes the weight of a connection between layers, and the activation function is the $sigmoid$ function. The initial values of these parameters are shown below:
where:
@@ -365,7 +365,7 @@ $$
The following diagram shows more intuitively how the error propagates backwards:
### 3.2.6 What is the significance of a "deeper" neural network?
@@ -429,23 +429,23 @@ Among them, the search process requires a search algorithm, generally: grid sear
The function image is as follows:
2. tanh activation function
The function is defined as $f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, and its range is $(-1, 1)$.
The function image is as follows:
3. Relu activation function
The function is defined as $f(x) = \max(0, x)$, and its range is $[0, +\infty)$.
The function image is as follows:
4. Leaky ReLU activation function
@@ -458,15 +458,15 @@ Among them, the search process requires a search algorithm, generally: grid sear
The image is as follows ($ a = 0.5 $):
5. SoftPlus activation function
The function is defined as $f(x) = \ln(1 + e^x)$, and its range is $(0, +\infty)$.
The function image is as follows:
6. softmax function
@@ -478,7 +478,7 @@ Among them, the search process requires a search algorithm, generally: grid sear
For common activation functions, the derivative is calculated as follows:
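As a sketch, the standard formulas for the functions listed above and their derivatives can also be written in code (the Leaky ReLU slope $a = 0.5$ follows the figure above; the rest are textbook formulas, not taken from the original table):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

a = 0.5  # Leaky ReLU slope, as in the figure above

# (function, derivative) pairs for the activations listed in this section.
activations = {
    "sigmoid":    (sigmoid,                       lambda x: sigmoid(x) * (1 - sigmoid(x))),
    "tanh":       (np.tanh,                       lambda x: 1 - np.tanh(x) ** 2),
    "ReLU":       (lambda x: np.maximum(0.0, x),  lambda x: (x > 0).astype(float)),
    "Leaky ReLU": (lambda x: np.where(x > 0, x, a * x),
                   lambda x: np.where(x > 0, 1.0, a)),
    "SoftPlus":   (lambda x: np.log1p(np.exp(x)), sigmoid),   # d/dx ln(1+e^x) = sigmoid(x)
}

x = np.linspace(-2.0, 2.0, 5)
for name, (f, df) in activations.items():
    print(name, f(x).round(3), df(x).round(3))
```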
### 3.4.4 What are the properties of the activation function?
@@ -517,7 +517,7 @@ The following are common choices:
The ReLU activation function image is as follows:
According to the image, it can be seen that it has the following characteristics:
@@ -543,15 +543,15 @@ $$
As shown in the following figure, the neural network takes the input layer, processes it through two feature layers, and finally the softmax layer gives the probability of each class. Here the task has three categories, so we finally obtain the probability values for $y=0$, $y=1$, and $y=2$.
Continuing with the picture below, the three inputs pass through softmax to give the array $[0.05, 0.10, 0.85]$; this is exactly what the softmax function does.
A more visual illustration of the mapping process is shown below:
In this softmax example, the original outputs are $3, 1, -3$. The softmax function maps them to values in $(0, 1)$ whose sum is $1$ (satisfying the properties of a probability distribution), so we can interpret them as probabilities. When we finally select the output node, we simply pick the node with the highest probability (i.e., the largest value) as our prediction target.
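A sketch of the softmax computation for the outputs $3, 1, -3$ mentioned above (subtracting the maximum before exponentiating is a standard numerical-stability trick, not discussed in the text):

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())     # subtract the max for numerical stability
    return e / e.sum()

p = softmax([3, 1, -3])
print(p.round(3), p.sum().round(3))   # approx [0.879 0.119 0.002], summing to 1.0
print(int(p.argmax()))                # index of the node with the highest probability
```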
@@ -668,7 +668,7 @@ At this time, a batch-grading learning method (Mini-batches Learning) can be emp
### 3.6.3 Why can normalization improve the solution speed?
The figure above shows the search for the optimal solution with non-uniform versus normalized data (the circles can be understood as contour lines): the left image shows the search process without normalization, and the right image shows the search process after normalization.
@@ -684,7 +684,7 @@ Suppose $w1$ ranges in $[-10, 10]$, while $w2$ ranges in $[-100, 100]$, the grad
This biases the search more toward the direction of $w1$, so the search path traces out an "L" shape or a "zigzag" shape.
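A sketch of the usual remedy, rescaling each feature before training so that $w1$ and $w2$ see inputs on comparable scales (z-score standardization, one common type of normalization; the sample data are made up):

```python
import numpy as np

X = np.array([[  8.0,  90.0],
              [ -5.0, -40.0],
              [  2.0,  10.0],
              [  6.0,  80.0]])   # two features on very different scales

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit variance per feature
print(X_norm.round(3))
print(X_norm.mean(axis=0).round(3), X_norm.std(axis=0).round(3))   # ~[0 0] and [1 1]
```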
### 3.6.5 What types of normalization are there?
@@ -742,7 +742,7 @@ among them,
A simple diagram is as follows:
### 3.6.8 What is Batch Normalization?
@@ -892,7 +892,7 @@ Deviation Initialization Trap: Both are initialized to the same value.
Take a three-layer network as an example:
First, look at the structure:
Its expression is:
@@ -1051,8 +1051,8 @@ tf.train.RMSPropOptimizer
### 3.12.2 Why is regularization helpful in preventing overfitting?
The left picture shows high bias, the right picture shows high variance, and the middle one is just right, as we saw in the previous lesson.