
Commit 1e3e834 (parent 008c071)

dim reduction exercises and update

7 files changed: 59 additions & 28 deletions

docs/day4/dim_reduction.rst
@@ -22,8 +22,15 @@ Guest Lecture by Professor **Anders Hast**
 - We will look at tools for visualising what cannot easily be seen, i.e. high dimensionality reduction
 - Share insights and experience from Anders's own research

-The Essence of Machine Learning: Classification
--------------------------------------------------
+Visualisation <--> Science
+--------------------------
+
+.. figure:: ../img/varanoi.png
+   :width: 300px
+   :align: right
+   :alt: varanoi_regions
+
+
+Clustering

 .. figure:: ../img/ml_classification.png
    :width: 300px
@@ -34,24 +41,33 @@ The Essence of Machine Learning: Classification

-How can a model separate the "blue" from the "red"?
+* We will look at tools for visualising what cannot easily be seen, i.e. high dimensionality reduction
+* We will also see that you can make discoveries in your visualisations!

-Which model is the best, the green curve or the black?
+What is a typical machine learning task?
+----------------------------------------

-Classification challenges:
-
-* **Black curve**: The model will guess "wrong" sometimes for new data
-* **Green curve**: The model will make even more wrong guesses? Why?
-"Outliers" or special cases have too much impact on the classification boundaries.
+* Distinguish between different classes of features
+* Features usually have more than 3 dimensions, hundreds or even thousands!
+* The idea is to find a separating curve in high-dimensional space
+* Usually we visualise this in 2D since it is easier to understand!
+* We will look at several techniques to do this!
+* If we can separate in 2D it can often be done in high-dimensional space, and vice versa!

 **Dimensionality reduction:**

-Project from several dimensions to fewer, often 2D or 3D.
-*Remember*: we get a distorted picture of the high dimensional space!
+* Project from several dimensions to fewer, often 2D or 3D
+* Remember: we get a distorted picture of the high-dimensional space!
+* Some techniques:
+
+  * SOM
+  * PCA
+  * t-SNE
+  * UMAP

 .. figure:: ../img/dim_reduction_bunny.png
-   :width: 300px
+   :width: 500px
    :align: center
    :alt: dim_reduction_bunny
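As an aside on the "distorted picture" warning in the updated list, the effect can be seen in a small numpy sketch (hypothetical random data, not from the lecture): projecting away a coordinate can only shrink pairwise distances, so the low-dimensional view never preserves the full geometry.

```python
import numpy as np

rng = np.random.default_rng(0)
points_3d = rng.normal(size=(5, 3))      # five random points in 3D

# Project onto the xy-plane by simply dropping the z coordinate
points_2d = points_3d[:, :2]

def pairwise_dist(x):
    """Euclidean distance between every pair of rows."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d3 = pairwise_dist(points_3d)
d2 = pairwise_dist(points_2d)

# Distances can only shrink (or stay equal) under this projection,
# so the 2D picture is a distorted view of the 3D configuration.
print(np.all(d2 <= d3 + 1e-12))          # True
```

Methods such as PCA choose the projection that loses as little spread as possible, but some distortion is unavoidable whenever dimensions are dropped.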

@@ -66,9 +82,16 @@ Some Dimensionality Reduction Techniques:
 PCA (on Iris Data)
 ^^^^^^^^^^^^^^^^^^^

-* Fisher's iris data consists of measurements on the sepal length, sepal width, petal length, and petal width for 150 iris specimens. There are 50 specimens from each of three species.
-* Pretty good separation of classes
-* However PCA often fails for high dimensional data as the clusters will overlap!
+* PCA = "find the directions where the data varies the most."
+* PCA finds a new coordinate system that fits the data:
+
+  * The first axis (1st principal component) points where the data spreads out the most.
+  * The second axis (2nd principal component) is perpendicular to the first and captures the next largest spread.
+  * The eigenvectors are the directions of those new axes: the principal components.
+  * The eigenvalues tell you how much variance (spread) each component captures.
+
+* Fisher's iris data consists of measurements on the sepal length, sepal width, petal length, and petal width for 150 iris specimens.
+* There are 50 specimens from each of three species.
+* One axis per data element (which ones are discriminant?)
+* Follow each individual using the lines

 .. admonition:: Iris and its PCA
    :class: dropdown
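The eigenvector/eigenvalue description in the bullets above can be illustrated with a short numpy sketch. The data here is synthetic 2D data rather than the iris measurements, and the variable names are my own:

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic data, deliberately stretched along the first axis
x = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

# 1. Centre the data and compute its covariance matrix
centred = x - x.mean(axis=0)
cov = np.cov(centred, rowvar=False)

# 2. Eigenvectors = directions of the new axes (principal components),
#    eigenvalues = variance (spread) captured along each of them
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Project the data onto the principal components
projected = centred @ eigvecs

# The first component captures the largest spread
print(eigvals[0] >= eigvals[1])              # True
# The two components are perpendicular to each other
print(abs(eigvecs[:, 0] @ eigvecs[:, 1]) < 1e-10)  # True
```

The variance of each projected column equals the corresponding eigenvalue, which is exactly the "how much spread does this axis capture" interpretation from the bullets.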
@@ -85,18 +108,25 @@ PCA (on Iris Data)
    :align: center
    :alt: iris_pca

+   .. raw:: html
+
+      <div style="height: 20px;"></div>
+
+   .. figure:: ../img/iris_lines.png
+      :align: center
+      :alt: iris_lines


 t-SNE
 ^^^^^^

+* A dimensionality-reduction method for visualising high-dimensional data in 2D or 3D
+* It keeps similar points close together and dissimilar ones far apart
+* Works by turning distances between points into probabilities of being neighbours, both in the original space and in the low-dimensional map
+* Then it moves points to make those probabilities match (minimising the KL divergence)
+* Uses a Student's t-distribution in 2D to keep clusters separated and avoid crowding

-* t-distributed stochastic neighbor embedding (t-SNE) is a `statistical <https://en.wikipedia.org/wiki/Statistics>`_ method for visualising high-dimensional data by giving each datapoint a location in a two or three-dimensional map.
-* The t-SNE algorithm comprises two main stages.
-* First, t-SNE constructs a `probability distribution <https://en.wikipedia.org/wiki/Probability_distribution>`_ over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability.
-* Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimises the `Kullback–Leibler divergence <https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence>`_ between the two distributions with respect to the locations of the points in the map.

-.. admonition:: PCA vs t-SNE vs HOG
+.. admonition:: PCA vs t-SNE vs UMAP
    :class: dropdown

    .. figure:: ../img/pca_example.png
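The "distances into probabilities" step from the t-SNE bullets can be sketched in numpy. This is a simplified version assuming a fixed Gaussian bandwidth; real implementations (e.g. ``sklearn.manifold.TSNE``) calibrate the bandwidth per point via the perplexity parameter:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))                 # six points in 4D

# High-dimensional similarities: a Gaussian kernel on squared distances,
# normalised so the values behave like neighbour probabilities.
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
p = np.exp(-d2)                             # fixed bandwidth (t-SNE tunes this per point)
np.fill_diagonal(p, 0.0)                    # a point is not its own neighbour
p /= p.sum()

# Low-dimensional similarities: a Student-t kernel, whose heavy tails
# push dissimilar points apart and avoid crowding.
y = rng.normal(size=(6, 2))                 # a candidate 2D layout
dy2 = ((y[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
q = 1.0 / (1.0 + dy2)
np.fill_diagonal(q, 0.0)
q /= q.sum()

# t-SNE would now move the points in y to minimise this KL divergence
mask = p > 0
kl = (p[mask] * np.log(p[mask] / q[mask])).sum()
print(kl >= 0)                              # KL divergence is never negative
```

Gradient descent on ``kl`` with respect to ``y`` is exactly the "move points to make the probabilities match" step described above.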
@@ -119,21 +149,20 @@ t-SNE

       <div style="height: 20px;"></div>

-   .. figure:: ../img/hog_tnse_example.png
+   .. figure:: ../img/umap.png
       :align: center
-      :alt: hog_tsne_example
+      :alt: umap_example

-      HOG & t-SNE
+      UMAP


 UMAP
 ^^^^^^

-* Uniform manifold approximation and projection (UMAP) is a nonlinear dimensionality reduction technique.
-* Visually, it is similar to t-SNE, but it assumes that the data is uniformly distributed on a `locally connected Riemannian manifold <https://en.wikipedia.org/wiki/Riemannian_manifold>`_ and that the `Riemannian metric <https://en.wikipedia.org/wiki/Riemannian_manifold#Riemannian_metrics_and_Riemannian_manifolds>`_ is locally constant or approximately locally constant.
-* UMAP is newer and therefore preferred by many.
-* However it tends to separate clusters better! But is that always better?
+* A nonlinear dimensionality-reduction method, like t-SNE, used to visualise high-dimensional data in 2D or 3D
+* Based on manifold theory: it assumes your data lies on a curved surface within a high-dimensional space
+* Builds a graph of local relationships (who's close to whom) in the original space, then finds a low-dimensional layout that preserves those relationships

 Face Recognition (FR) Use case
 --------------------------------
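The "graph of local relationships" from the UMAP bullets can be sketched as a plain k-nearest-neighbour graph in numpy. The actual umap-learn library does considerably more (fuzzy edge weights, a force-directed 2D layout), so this shows only the first step:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 10, 3
x = rng.normal(size=(n, 5))                 # ten points in 5D

# Pairwise Euclidean distances, with self-distances excluded
d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
np.fill_diagonal(d, np.inf)

# Adjacency matrix: for each point, mark its k closest neighbours
neighbours = np.argsort(d, axis=1)[:, :k]
adj = np.zeros((n, n), dtype=bool)
rows = np.repeat(np.arange(n), k)
adj[rows, neighbours.ravel()] = True

# UMAP symmetrises this directed graph before optimising a low-dimensional
# layout that keeps connected points close together.
sym = adj | adj.T
print(adj.sum(axis=1))                      # every point gets exactly k neighbours
```

Preserving this graph, rather than all pairwise distances, is what makes UMAP focus on local structure.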
@@ -175,3 +204,5 @@ Exercise
 Try running the notebook and give the correct dataset path wherever required.

 The env required for this notebook is ``pip install numpy matplotlib scikit-learn scipy pillow plotly umap-learn``
+
+Sample examples from the documentation: https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html#sphx-glr-auto-examples-decomposition-plot-pca-iris-py , https://plotly.com/python/t-sne-and-umap-projections/

docs/img/dim_reduction_bunny.png (10.2 KB)

docs/img/iris_lines.png (178 KB)

docs/img/pca_example.png (145 KB)

docs/img/t-SNE_example.png (69.1 KB)

docs/img/umap.png (139 KB)

docs/img/varanoi.png (319 KB)

0 commit comments
