
Commit c8c06ed

Merge pull request scutan90#477 from naah69/master
github page
2 parents 7c8871a + cf879e2 commit c8c06ed

File tree

29 files changed, +5797 −625 lines


.nojekyll

Whitespace-only changes.

English version/ch01_MathematicalBasis/Chapter 1_MathematicalBasis.md

+1-1
@@ -389,7 +389,7 @@ P(A/B) = P(A\cap B) / P(B)
$$

Description: For events (subsets) $A$ and $B$ of the same sample space $\Omega$, if an element randomly selected from $\Omega$ is known to belong to $B$, then the probability that this element also belongs to $A$ is defined as the conditional probability of $A$ given $B$.
-![conditional probability](./img/ch1/conditional_probability.jpg)
+![conditional probability](img/ch1/conditional_probability.jpg)

According to the Venn diagram, it can be clearly seen that, given that event $B$ has occurred, the probability of event $A$ occurring is $P(A\bigcap B)$ divided by $P(B)$.
Example: A couple has two children. Given that one of them is a girl, what is the probability that the other is also a girl? (a common interview and written-test question)
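A quick way to check the answer (1/3 under the usual "at least one is a girl" reading) is to enumerate the four equally likely birth orders; a minimal Python sketch, not part of the original file:

```python
from itertools import product

# Enumerate the four equally likely gender combinations for two children.
outcomes = list(product("BG", repeat=2))              # BB, BG, GB, GG

at_least_one_girl = [o for o in outcomes if "G" in o]
both_girls = [o for o in at_least_one_girl if o == ("G", "G")]

# P(both girls | at least one girl) = 1/3
print(len(both_girls), "/", len(at_least_one_girl))   # 1 / 3
```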

English version/ch02_MachineLearningFoundation/Chapter 2_TheBasisOfMachineLearning.md

+43-43
Large diffs are not rendered by default.

English version/ch03_DeepLearningFoundation/ChapterIII_DeepLearningFoundation.md

+39-39
@@ -14,7 +14,7 @@ The feature neuron model in the multi-layer perceptron is called the perceptron

The simple perceptron is shown below:

-![](./img/ch3/3-1.png)
+![](img/ch3/3-1.png)

Where $x_1$, $x_2$, $x_3$ are the inputs to the perceptron, and its output is:

@@ -46,30 +46,30 @@

With appropriate weights $w$ and bias $b$, a single perceptron unit implements a NAND gate, as shown below:

-![](./img/ch3/3-2.png)
+![](img/ch3/3-2.png)

When the input is $0$ and $1$, the weighted sum is $0 * (-2) + 1 * (-2) + 3 = 1 > 0$, so the perceptron outputs $1$.
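As a quick illustration (a sketch, not code from the repository), the NAND perceptron with weights $(-2, -2)$ and bias $3$ can be written directly in Python; it outputs $1$ for every input pair except $(1, 1)$:

```python
def nand_perceptron(x1, x2, w=(-2, -2), b=3):
    """Perceptron with weights (-2, -2) and bias 3: fires unless both inputs are 1."""
    weighted_sum = w[0] * x1 + w[1] * x2 + b
    return 1 if weighted_sum > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", nand_perceptron(x1, x2))  # only (1, 1) gives 0
```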

More complex perceptrons are composed of simple perceptron units:

-![](./img/ch3/3-3.png)
+![](img/ch3/3-3.png)

**Multilayer Perceptron**

The multi-layer perceptron generalizes the perceptron. Its most important feature is that it has multiple neuron layers, which is why it is also called a deep neural network. Unlike a single perceptron, every neuron in layer $i$ of a multilayer perceptron is connected to every neuron in layer $i-1$.

-![](./img/ch3/3.1.1.5.png)
+![](img/ch3/3.1.1.5.png)

The output layer can have more than $1$ neuron. The hidden part can have just $1$ layer or multiple layers. A network whose output layer has multiple neurons looks like the following:

-![](./img/ch3/3.1.1.6.png)
+![](img/ch3/3.1.1.6.png)

### 3.1.2 What are the common model structures of neural networks?

The figure below contains most of the commonly used models:

-![](./img/ch3/3-7.jpg)
+![](img/ch3/3-7.jpg)

### 3.1.3 How to choose a deep learning development platform?

@@ -108,14 +108,14 @@ Some platforms are specifically developed for deep learning research and applica

The vanishing-gradient problem is affected by many factors, such as the size of the learning rate, the initialization of the network parameters, and the edge effect of the activation function. In a deep neural network, the gradient computed at each neuron is passed back to the previous layer, so the gradient received by shallower neurons is the product of the gradients passed back through all the later layers. If those gradient values are small, the gradient update information decays exponentially as the number of layers increases, and the gradient vanishes. The figure below shows the learning speed of different hidden layers:

-![](./img/ch3/3-8.png)
+![](img/ch3/3-8.png)

2. Exploding Gradient
In network structures such as deep networks or Recurrent Neural Networks (RNNs), gradients can accumulate during the update process and become very large, resulting in large updates to the network weights and making the network unstable. In extreme cases, the weight values overflow and become $NaN$, after which they can no longer be updated.

3. Degeneration of the weight matrix reduces the effective degrees of freedom of the model, which slows down learning in parameter space and reduces the model's effective dimensionality. As the number of multiplied matrices (i.e., the network depth) increases, the matrix product becomes more and more degenerate. In nonlinear networks with hard saturation boundaries (such as ReLU networks), the degeneration process becomes faster as the depth increases. A visualization of this degeneration process is shown in a 2014 paper by Duvenaud et al.:

-![](./img/ch3/3-9.jpg)
+![](img/ch3/3-9.jpg)

As the depth increases, the input space (shown in the upper left corner) is twisted into thinner and thinner filaments; at each point of the input space only one direction, orthogonal to the filament, affects the response of the network, and in this direction the network is very sensitive to change.
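To make the exponential decay concrete, here is a small illustrative sketch (my own assumptions: a chain of sigmoid units with weight 1 and pre-activations near 0), showing how the backpropagated gradient factor shrinks with depth, since the sigmoid derivative is at most 0.25:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A chain of sigmoid units with weight 1: the gradient passed back through
# k layers is the product of the local derivatives sigma'(z) <= 0.25.
grad = 1.0
z = 0.0  # illustrative pre-activation at each layer
for depth in range(1, 11):
    s = sigmoid(z)
    grad *= s * (1.0 - s)          # multiply by the local derivative
    print(f"after {depth:2d} layers: gradient factor = {grad:.2e}")
# The factor shrinks roughly as 0.25**depth: exponential decay.
```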

@@ -129,9 +129,9 @@ Traditional machine learning needs to define some manual features to purposefull

-![](./img/ch3/3.1.6.1.png)
+![](img/ch3/3.1.6.1.png)

-![](./img/ch3/3-11.jpg)
+![](img/ch3/3-11.jpg)

## 3.2 Network Operations and Calculations

@@ -141,23 +141,23 @@ There are two main types of neural network calculations: forward propagation (FP)

**Forward Propagation**

-![](./img/ch3/3.2.1.1.png)
+![](img/ch3/3.2.1.1.png)

Suppose the nodes $i, j, k, \dots$ of the previous layer are connected to the node $w$ of this layer. What is the value of node $w$? It is computed by taking the weighted sum of the outputs of nodes $i, j, k, \dots$ with the corresponding connection weights, adding a bias term (omitted in the figure for simplicity), and finally passing the result through a nonlinear function (i.e., an activation function) such as $ReLU$ or $sigmoid$; the result is the output of node $w$ in this layer.

Finally, by repeating this computation layer by layer, the result of the output layer is obtained.

**Backpropagation**

-![](./img/ch3/3.2.1.2.png)
+![](img/ch3/3.2.1.2.png)

The final result of forward propagation, taking classification as an example, always has some error. How do we reduce that error? A widely used algorithm is gradient descent, which requires partial derivatives. The symbols in the figure above are used as an example to explain:

Let the final error be $E$ and let the activation function of the output layer be linear. Then the partial derivative of $E$ with respect to the output node $y_l$ is $y_l - t_l$, where $t_l$ is the true value; $\frac{\partial y_l}{\partial z_l}$ is the derivative of the activation function mentioned above, and $z_l$ is the weighted sum mentioned above. The partial derivative of $E$ with respect to $z_l$ of this layer is then $\frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l} \frac{\partial y_l}{\partial z_l}$. The next layer is computed in the same way, except that the way $\frac{\partial E}{\partial y_k}$ is computed changes, and so on back to the input layer, where finally $\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_i}$ with $\frac{\partial z_j}{\partial x_i} = w_{ij}$. The weights are then adjusted using these derivatives, and the forward- and back-propagation process is repeated until a satisfactory result is obtained.
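A minimal numeric sketch of these two passes (my own illustration with made-up numbers: one sigmoid hidden layer, a linear output, and squared error), with the deltas $\partial E/\partial z$ computed exactly as in the chain rule above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative values (not from the original text)
x = np.array([0.5, 0.1])           # input
W1 = np.array([[0.2, -0.3],
               [0.4,  0.1]])       # input -> hidden weights
W2 = np.array([0.7, -0.2])         # hidden -> output weights (linear output)
t = 1.0                            # target

# Forward pass
z1 = W1 @ x                        # weighted sums of the hidden layer
y1 = sigmoid(z1)                   # hidden activations
z2 = W2 @ y1
y2 = z2                            # linear output activation
E = 0.5 * (y2 - t) ** 2

# Backward pass (deltas = dE/dz at each layer)
delta2 = (y2 - t) * 1.0                   # dE/dy2 * dy2/dz2 (linear, derivative 1)
delta1 = (W2 * delta2) * y1 * (1 - y1)    # propagate through W2, then sigmoid'

grad_W2 = delta2 * y1              # dE/dW2
grad_W1 = np.outer(delta1, x)      # dE/dW1
print(E, grad_W2, grad_W1)
```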

### 3.2.2 How to calculate the output of the neural network?

-![](./img/ch3/3.2.2.1.png)
+![](img/ch3/3.2.2.1.png)

As shown in the figure above, the input layer has three nodes, numbered 1, 2, and 3; the four nodes of the hidden layer are numbered 4, 5, 6, and 7; and the two nodes of the output layer are numbered 8 and 9. For example, node 4 of the hidden layer is connected to the three nodes 1, 2, and 3 of the input layer, and the weights on those connections are $w_{41}, w_{42}, w_{43}$.
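A sketch of that computation for node 4: its output is the activation of the weighted sum of inputs 1, 2, 3 plus a bias. The weights, bias, and inputs below are made-up illustrative values, since the hunk does not show them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, 0.1, 0.8])      # outputs of input nodes 1, 2, 3 (illustrative)
w4 = np.array([0.4, -0.2, 0.3])    # w_41, w_42, w_43 (illustrative)
b4 = 0.1                           # bias of node 4 (illustrative)

a4 = sigmoid(w4 @ x + b4)          # output of hidden node 4
print(a4)
```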

@@ -185,7 +185,7 @@ For the same reason, we can also calculate $ y_2 $. So that the output values

Suppose there is a 5\*5 image, convolved with a 3\*3 filter, and we want a 3\*3 Feature Map, as shown below:

-![](./img/ch3/3.2.3.1.png)
+![](img/ch3/3.2.3.1.png)

$x_{i,j}$ denotes the element in row $i$, column $j$ of the image; $w_{m,n}$ denotes the weight in row $m$, column $n$ of the filter; $w_b$ denotes the bias of the filter; $a_{i,j}$ denotes the element in row $i$, column $j$ of the Feature Map; and $f$ denotes the activation function, with the $ReLU$ function used as the example here.
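A small sketch of this convolution (really a cross-correlation, as is conventional in CNNs) with a made-up 5\*5 image and 3\*3 filter, no padding, stride 1, followed by ReLU:

```python
import numpy as np

image = np.arange(25).reshape(5, 5) % 3          # illustrative 5x5 input
filt = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 0, 1]])                     # illustrative 3x3 filter
w_b = 0.0                                        # filter bias

out = np.zeros((3, 3))                           # (5 - 3)/1 + 1 = 3
for i in range(3):
    for j in range(3):
        window = image[i:i + 3, j:j + 3]
        out[i, j] = np.maximum(0, np.sum(window * filt) + w_b)  # a_{i,j} = ReLU(...)

print(out)
```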

@@ -211,15 +211,15 @@

The calculation process is illustrated as follows:

-![](./img/ch3/3.2.3.2.png)
+![](img/ch3/3.2.3.2.png)

By analogy, all the elements of the Feature Map are calculated.

-![](./img/ch3/3.2.3.4.png)
+![](img/ch3/3.2.3.4.png)

When the stride is 2, the Feature Map is calculated as follows:

-![](./img/ch3/3.2.3.5.png)
+![](img/ch3/3.2.3.5.png)

Note: the image size, the stride, and the size of the Feature Map after convolution are related. They satisfy the following relationship:
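In its standard form that relationship is $W_2 = (W_1 - F + 2P)/S + 1$, where $W_1$ is the input width, $F$ the filter size, $P$ the zero padding, and $S$ the stride. A small sketch, assuming this standard formula, that reproduces the sizes in the figures above:

```python
def conv_output_size(w_in, f, p, s):
    """Standard formula: W2 = (W1 - F + 2P) / S + 1."""
    return (w_in - f + 2 * p) // s + 1

print(conv_output_size(5, 3, 0, 1))   # 3  -> 3x3 Feature Map (stride 1)
print(conv_output_size(5, 3, 0, 2))   # 2  -> 2x2 Feature Map (stride 2)
```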

@@ -252,15 +252,15 @@ Where $D$ is the depth; $F$ is the size of the filter; $w_{d,m,n}$ represents th

Each convolutional layer can have multiple filters. After each filter is convolved with the original image, you get one Feature Map, so the depth (number) of the Feature Maps after convolution equals the number of filters in the convolutional layer. The following illustration shows the calculation of a convolutional layer with two filters: a $7*7*3$ input, convolved with two $3*3*3$ filters (stride $2$), gives a $3*3*2$ output. The zero padding in the figure is $1$, i.e., a ring of $0$s is added around the input.

-![](./img/ch3/3.2.3.6.png)
+![](img/ch3/3.2.3.6.png)

The above is the calculation method of the convolutional layer. It uses local connections and weight sharing: each neuron is connected only to a local region of the previous layer (the convolution calculation rule), and all neurons of the same Feature Map share the same filter weights. For a convolutional layer containing two $3*3*3$ filters, the number of parameters is only $(3*3*3+1)*2 = 56$, and this number is independent of the number of neurons in the previous layer. Compared with a fully connected neural network, the number of parameters is greatly reduced.

### 3.2.4 How to calculate the output value of the Pooling layer?

The main role of the Pooling layer is downsampling: it further reduces the number of parameters by removing unimportant samples from the Feature Map. There are many ways to pool; the most common one is Max Pooling. Max Pooling takes the maximum value within each n\*n sample window as the sampled value. The figure below shows 2\*2 max pooling:

-![](./img/ch3/3.2.4.1.png)
+![](img/ch3/3.2.4.1.png)

In addition to Max Pooling, Average Pooling is also commonly used: taking the average of each sample window.
For a Feature Map with a depth of $D$, each layer is pooled independently, so the depth after Pooling is still $D$.
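A minimal sketch of 2\*2 max pooling with stride 2 on a made-up 4\*4 Feature Map:

```python
import numpy as np

feature_map = np.array([[1, 0, 2, 3],
                        [4, 6, 6, 8],
                        [3, 1, 1, 0],
                        [1, 2, 2, 4]])           # illustrative 4x4 Feature Map

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        window = feature_map[2 * i:2 * i + 2, 2 * j:2 * j + 2]
        pooled[i, j] = window.max()              # 2x2 max pooling, stride 2

print(pooled)    # [[6. 8.], [3. 4.]]
```
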
@@ -269,7 +269,7 @@ For a Feature Map with a depth of $ D $ , each layer does Pooling independently,

A typical three-layer neural network is as follows:

-![](./img/ch3/3.2.5.1.png)
+![](img/ch3/3.2.5.1.png)

where Layer $L_1$ is the input layer, Layer $L_2$ is the hidden layer, and Layer $L_3$ is the output layer.

@@ -279,11 +279,11 @@ If the input and output are the same, it is a self-encoding model. If the raw da

Suppose you have the following network layer:

-![](./img/ch3/3.2.5.2.png)
+![](img/ch3/3.2.5.2.png)

The input layer contains the neurons $i_1, i_2$ and the bias $b_1$; the hidden layer contains the neurons $h_1, h_2$ and the bias $b_2$; the output layer contains $o_1, o_2$; $w_i$ is the weight of the connection between the layers, and the activation function is the $sigmoid$ function. The initial values of these parameters are taken as shown below:

-![](./img/ch3/3.2.5.3.png)
+![](img/ch3/3.2.5.3.png)

among them:

@@ -365,7 +365,7 @@

The following diagram shows more intuitively how the error propagates backwards:

-![](./img/ch3/3.2.5.4.png)
+![](img/ch3/3.2.5.4.png)

### 3.2.6 What is the meaning of making a neural network "deeper"?

@@ -429,23 +429,23 @@ Among them, the search process requires a search algorithm, generally: grid sear

The function image is as follows:

-![](./img/ch3/3-26.png)
+![](img/ch3/3-26.png)

2. tanh activation function

The function is defined as $f(x) = tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, with value range $(-1, 1)$.

The function image is as follows:

-![](./img/ch3/3-27.png)
+![](img/ch3/3-27.png)

3. ReLU activation function

The function is defined as $f(x) = max(0, x)$, with value range $[0, +∞)$.

The function image is as follows:

-![](./img/ch3/3-28.png)
+![](img/ch3/3-28.png)

4. Leaky ReLU activation function

@@ -458,15 +458,15 @@ Among them, the search process requires a search algorithm, generally: grid sear

The image is as follows ($a = 0.5$):

-![](./img/ch3/3-29.png)
+![](img/ch3/3-29.png)

5. SoftPlus activation function

The function is defined as $f(x) = ln(1 + e^x)$, with value range $(0, +∞)$.

The function image is as follows:

-![](./img/ch3/3-30.png)
+![](img/ch3/3-30.png)

6. softmax function
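For reference, these activation functions can be written in a few lines of NumPy; this is a sketch of the standard definitions listed above, with the softmax written in the usual max-shifted form for numerical stability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # range (0, 1)

def tanh(x):
    return np.tanh(x)                          # range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                  # range [0, +inf)

def leaky_relu(x, a=0.5):
    return np.where(x > 0, x, a * x)           # a is the slope for x < 0

def softplus(x):
    return np.log(1.0 + np.exp(x))             # range (0, +inf)

def softmax(x):
    e = np.exp(x - np.max(x))                  # shift for numerical stability
    return e / e.sum()                         # components sum to 1
```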

@@ -478,7 +478,7 @@ Among them, the search process requires a search algorithm, generally: grid sear

For the common activation functions, the derivatives are calculated as follows:

-![](./img/ch3/3-31.png)
+![](img/ch3/3-31.png)
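As a companion to that table (a sketch of the standard derivative formulas, not a reproduction of the image), the derivatives can be expressed in terms of the function values themselves:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # sigma'(x) = sigma(x) * (1 - sigma(x))

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2         # tanh'(x) = 1 - tanh(x)^2

def d_relu(x):
    return (x > 0).astype(float)         # 1 for x > 0, else 0

def d_softplus(x):
    return 1.0 / (1.0 + np.exp(-x))      # softplus'(x) = sigmoid(x)
```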

### 3.4.4 What are the properties of the activation function?

@@ -517,7 +517,7 @@ The following are common choices:

The ReLU activation function image is as follows:

-![](./img/ch3/3-32.png)
+![](img/ch3/3-32.png)

From the image, it can be seen that ReLU has the following characteristics:

@@ -543,15 +543,15 @@

As the following figure shows, the neural network takes the input layer, processes it through two feature layers, and finally the softmax classifier gives the probability of each class. Here the task has three categories, so we obtain the probability values for $y=0$, $y=1$, and $y=2$.

-![](./img/ch3/3.4.9.1.png)
+![](img/ch3/3.4.9.1.png)

Continuing with the picture below, the three inputs pass through softmax to give the array $[0.05, 0.10, 0.85]$; this is what softmax does.

-![](./img/ch3/3.4.9.2.png)
+![](img/ch3/3.4.9.2.png)

A more visual illustration of the mapping process is shown below:

-![](./img/ch3/3.4.9.3.png)
+![](img/ch3/3.4.9.3.png)

In the softmax example here, the original outputs are $3, 1, -3$. They are mapped by the softmax function to values in $(0, 1)$ whose sum is $1$ (satisfying the properties of a probability), so we can interpret them as probabilities. When we finally select the output node, we can choose the node with the highest probability (i.e., the largest value) as our prediction target.
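A quick check of that mapping with the softmax sketch from above (the printed numbers are my own computation; they show that the largest logit, $3$, dominates after exponentiation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))      # shift for numerical stability
    return e / e.sum()

logits = np.array([3.0, 1.0, -3.0])
probs = softmax(logits)
print(probs, probs.sum())          # roughly [0.879, 0.119, 0.002], sums to 1
print(np.argmax(probs))            # index 0 is selected as the prediction
```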

@@ -668,7 +668,7 @@ At this time, a batch-grading learning method (Mini-batches Learning) can be emp

### 3.6.3 Why can normalization improve the solution speed?

-![](./img/ch3/3.6.3.1.png)
+![](img/ch3/3.6.3.1.png)

The figure above shows the search process for the optimal solution with and without uniform data (the circles can be understood as contour lines). The left image shows the search process without normalization, and the right image shows the search process with normalization.
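A small sketch of the kind of normalization being discussed, z-score standardization of each feature column (illustrative data), which puts the features on comparable scales so the contours become closer to circles:

```python
import numpy as np

# Illustrative data: feature 1 roughly in [-10, 10], feature 2 roughly in [-100, 100]
X = np.array([[  5.0,  80.0],
              [ -3.0, -40.0],
              [  8.0,  10.0],
              [ -9.0, -90.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std          # z-score: zero mean, unit variance per feature

print(X_norm.mean(axis=0))         # ~[0, 0]
print(X_norm.std(axis=0))          # ~[1, 1]
```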

@@ -684,7 +684,7 @@ Suppose $w1$ ranges in $[-10, 10]$, while $w2$ ranges in $[-100, 100]$, the grad

This causes the search to be biased toward the direction of $w_1$, producing an "L"-shaped or zigzag search path.

-![](./img/ch3/3-37.png)
+![](img/ch3/3-37.png)

### 3.6.5 What types of normalization are there?

@@ -742,7 +742,7 @@ among them,

A simple diagram is as follows:

-![](./img/ch3/3.6.7.1.png)
+![](img/ch3/3.6.7.1.png)

### 3.6.8 What is Batch Normalization?

@@ -892,7 +892,7 @@ Deviation Initialization Trap: Both are initialized to the same value.
Take a three-layer network as an example:
First look at the structure:

-![](./img/ch3/3.8.2.1.png)
+![](img/ch3/3.8.2.1.png)

Its expression is:

@@ -1051,8 +1051,8 @@ tf.train.RMSPropOptimizer

### 3.12.2 Why is regularization helpful in preventing overfitting?

-![](./img/ch3/3.12.2.1.png)
-![](./img/ch3/3.12.2.2.png)
+![](img/ch3/3.12.2.1.png)
+![](img/ch3/3.12.2.2.png)

The left picture shows high bias, the right picture shows high variance, and the middle one is "just right", as we saw in the previous lesson.
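One common way regularization achieves this is the L2 penalty: adding $\frac{\lambda}{2m}\sum_j w_j^2$ to the loss shrinks the weights toward zero, which pushes a high-variance model back toward the "just right" regime. A minimal sketch with illustrative values, not code from the repository:

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lam, m):
    """Add the L2 penalty (lambda / 2m) * sum(w^2) to the data loss."""
    penalty = (lam / (2 * m)) * sum(np.sum(w ** 2) for w in weights)
    return data_loss + penalty

def l2_gradient_term(w, lam, m):
    """Extra gradient contribution of the penalty: (lambda / m) * w, i.e. weight decay."""
    return (lam / m) * w

# Illustrative values
w = [np.array([[0.8, -1.2], [0.3, 0.5]])]
print(l2_regularized_loss(data_loss=0.42, weights=w, lam=0.1, m=100))
print(l2_gradient_term(w[0], lam=0.1, m=100))
```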
