This is the repository associated with our NeurIPS 2019 paper
"Intrinsic dimension of data representations in deep neural networks"
Paper Authors
Alessio Ansuini (1), Alessandro Laio (1), Jakob H. Macke (2), Davide Zoccolan (1)
(1) International School for Advanced Studies (SISSA) https://www.sissa.it/
(2) Technical University of Munich (TUM) https://www.tum.de/
Alessio Ansuini wrote the following content; he is the only one to blame for any errors in these still-under-construction GitHub pages.
We provide
- an Introduction to our work, on this page
- detailed Instructions for reproducing our results
- Extra materials (poster, video, more in the future...)
In the coming weeks/months we will provide Tutorials pointing to possible extensions and open problems.
- Alessio Ansuini's poster at NeurIPS
- Davide Zoccolan's interdisciplinary seminar, at the interface between Neuroscience and Deep Learning, given at the ICTP Workshop on Science of Data Science (smr 3283)
3-5 min read
jump to results
Datasets can be very high-dimensional. In images, each pixel counts for one dimension (three if coloured), so high-resolution images typically have dimensionality larger than 1,000,000. Countless examples can be drawn from biology (genomics, epigenomics), particle physics, and other fields.
The embedding dimension (ED) is the number of features in the data (number of pixels, number of genes expressed in microarrays, etc.), and this is usually the number that counts when data are stored and transmitted (unless we compress them). A very different concept is the intrinsic dimension (ID): informally, the minimal number of parameters needed to describe the data.
The ID can be much lower than the ED, depending on how much structure and redundancy is present in the original representation. Consider an example: a torus in 3 dimensions.
When we use Euclidean coordinates, we specify a point on the torus by giving three numbers: ED = 3. But this description does not take into account the structure of the surface; it does not even take into account that it is a surface. In fact, ID = 2 in this case, because only two coordinates (one angle for each generating circle) are needed to uniquely specify a point on the surface.
Now let us imagine a dataset composed of points lying close to the torus surface, with small fluctuations due to noise. What we require from an algorithm that estimates the intrinsic dimensionality of this dataset is a value close to two, even if the local dimensionality of the noise perturbation is (almost by definition of noise) three.
When faced with a new dataset, we very often know little about the process that generated it. (In the case of the torus, for example, we might not know that, to generate the data, it was enough to choose random pairs of angles from some probability distribution, transform the angles into Euclidean coordinates, and then add a small random perturbation to each point, as in the sketch below.)
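To make this concrete, here is a minimal sketch of that generative process; the function name and the parameter values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noisy_torus(n=2000, R=2.0, r=0.5, noise=0.01):
    """Sample n points near a torus with major radius R and minor radius r."""
    theta = rng.uniform(0, 2 * np.pi, n)  # angle along the generating circle
    phi = rng.uniform(0, 2 * np.pi, n)    # angle around the central axis
    x = (R + r * np.cos(theta)) * np.cos(phi)
    y = (R + r * np.cos(theta)) * np.sin(phi)
    z = r * np.sin(theta)
    points = np.stack([x, y, z], axis=1)  # ED = 3
    # small isotropic noise: locally 3-dimensional, but the data still
    # lie close to a 2-dimensional surface
    return points + noise * rng.standard_normal(points.shape)

X = sample_noisy_torus()  # a good ID estimator should return ~2 on X
```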
So, when exploring data, it is generally helpful to know the intrinsic dimensionality from the start, for many purposes: compression, density estimation, and so on.
It is well known that DNNs, in particular convolutional neural networks (CNNs), transform their input from the original space (pixels, sounds, etc.) into progressively more abstract forms that support classification and downstream actions.
We follow the evolution of the ID of representations along the layers of CNNs, using the TwoNN estimation method described in Facco et al.
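For reference, here is a minimal, simplified sketch of the TwoNN idea: the ratio μ = r2/r1 of the distances to a point's second and first nearest neighbors follows a Pareto distribution with exponent equal to the ID, so the ID can be read off as the slope of a straight-line fit through the origin. The discard fraction and this compact implementation are our own simplifications, not the authors' exact code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X, discard_fraction=0.1):
    """TwoNN intrinsic-dimension estimate (simplified sketch)."""
    # distances to each point's two nearest neighbors
    # (column 0 is the point itself, at distance zero)
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = np.sort(dists[:, 2] / dists[:, 1])  # ratio r2 / r1, always >= 1
    n = len(mu)
    F = np.arange(1, n + 1) / n              # empirical CDF of mu
    k = int(n * (1 - discard_fraction))      # drop the noisy upper tail
    x, y = np.log(mu[:k]), -np.log(1.0 - F[:k])
    # model: -log(1 - F(mu)) = d * log(mu); least squares through the origin
    return float(np.sum(x * y) / np.sum(x * x))
```

On the noisy torus generated above, `twonn_id(X)` should return a value close to 2.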
Our main findings are:
- the ID profile, across a range of state-of-the-art (pre-trained) CNNs, follows a curved shape that we informally nicknamed the "hunchback"
(to compare different architectures, we plot the ID vs. a relative depth: the number of non-trivial transformations the network performs on the input (convolutional and fully-connected layers) divided by the total number of such transformations before the output; a layer-by-layer measurement of this kind is sketched after this list)
- the ID in the last hidden layer is predictive of the network's generalization performance
(this result also holds within architecture classes; see the inset for ResNets)
- representations, even in the last hidden layer, are curved.
This indicates that flattening of data manifolds may not be a general computational goal that deep networks strive to achieve: progressive reduction of the ID, rather than gradual flattening, seems to be the key to achieving linearly separable representations.
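To make the profile measurement concrete, here is a hypothetical sketch of how one could trace the ID across the layers of a pretrained torchvision network, reusing the `twonn_id` function sketched above; the network choice, the random stand-in images and the batch size are illustrative assumptions, not the exact protocol of the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(pretrained=True).eval()

# the "non-trivial transformations" that define relative depth:
# convolutional and fully-connected layers
layers = [m for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]

ids = {}
def make_hook(i):
    # estimate the ID of the flattened activations as soon as they are
    # produced, so only one layer's representation is alive at a time
    def hook(module, inputs, output):
        ids[i] = twonn_id(output.flatten(1).detach().numpy())
    return hook

for i, layer in enumerate(layers):
    layer.register_forward_hook(make_hook(i))

images = torch.randn(256, 3, 224, 224)  # stand-in for a batch of real images
with torch.no_grad():
    model(images)

for i in sorted(ids):
    rel_depth = (i + 1) / len(layers)   # fraction of conv/fc layers applied
    print(f"relative depth {rel_depth:.2f}  ID ~ {ids[i]:.1f}")
```

Plotting ID against relative depth for several architectures is what produces the "hunchback" profiles.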
A linear approach based on PCA (PC-ID = the number of principal components that capture 90% of the variance in the data) was unable to capture the actual dimensionality, since it cannot distinguish qualitatively between trained and untrained networks.
Our ID estimates show, on the contrary, that in untrained networks the ID profile is flat; the hunchback shapes we found in trained networks are therefore a genuine effect of training.
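For comparison, here is a minimal sketch of the linear PC-ID baseline; the function name and the SVD route are our own choices, but the 90% variance criterion is the one described above.

```python
import numpy as np

def pc_id(X, variance_threshold=0.90):
    """Number of principal components that capture `variance_threshold`
    of the total variance in X (rows = samples)."""
    X = X - X.mean(axis=0)                    # center the data
    s = np.linalg.svd(X, compute_uv=False)    # singular values, descending
    explained = s**2 / np.sum(s**2)           # variance ratio per component
    return int(np.searchsorted(np.cumsum(explained), variance_threshold) + 1)
```

PC-ID is tied to the linear span of the data, so it can grossly overestimate the dimensionality of curved manifolds.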
Further results: dynamics
This line of research on dynamics is very important for the development of unsupervised approaches (see, for example, Ma et al. and Gong et al.). We performed these experiments on a VGG-16 network trained on CIFAR-10; the architecture and the optimization procedure are taken from https://github.com/kuangliu/pytorch-cifar.
What we found is:
- during training, different layers show different dynamics: the final layers compress representations, while the initial and intermediate layers expand them
- focusing on the last hidden layer, after a first compression phase (lasting approximately half an epoch) the ID slowly expanded and stabilized at a higher value. This change of regime (from compression to expansion) was not accompanied in our experiments by the onset of overfitting, as was observed in Ma et al., which used local measures of intrinsic dimension. For such comparisons it is also important to remember that our ID estimate is global (a sketch of how one could track this quantity during training follows below).
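As an illustration of how such a measurement could be set up, here is a hypothetical sketch that estimates the last hidden layer's ID on held-out data at checkpoints during training; the `features`/`classifier` split, the loader names and the checkpoint frequency are assumptions for illustration, not the paper's exact protocol.

```python
import torch

@torch.no_grad()
def last_hidden_id(model, loader, n_points=1000):
    """Estimate the ID of last-hidden-layer activations on held-out data."""
    model.eval()
    feats, total = [], 0
    for x, _ in loader:
        # a features/classifier split as in torchvision's VGG is assumed;
        # classifier[:-1] stops just before the output layer
        h = model.classifier[:-1](model.features(x).flatten(1))
        feats.append(h)
        total += h.shape[0]
        if total >= n_points:
            break
    return twonn_id(torch.cat(feats)[:n_points].numpy())

# inside a standard training loop, a few times per epoch:
#   id_trajectory.append(last_hidden_id(model, val_loader))
```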
Overall, we think the dynamics of the ID is not yet completely understood, and may depend on the architecture, dataset and optimization procedure.
We hope that data-driven, empirical approaches to investigating deep neural networks, like the one implemented in our study, will provide intuitions and constraints that ultimately inspire and enable theoretical explanations of their computational capabilities. We also hope that the methods and ideas expressed in this paper will be helpful in problems outside deep learning, for the analysis of datasets in general.
If you found this useful, please consider citing the following paper:
@article{ansuini2019intrinsic,
title={Intrinsic dimension of data representations in deep neural networks},
author={Ansuini, Alessio and Laio, Alessandro and Macke, Jakob H and Zoccolan, Davide},
journal={arXiv preprint arXiv:1905.12784},
year={2019}
}