Paper explained: “UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION” — ICLR’17
The original paper can be found here. It was one of the three papers that received the Best Paper Award at ICLR 2017.

What to expect from this blog: a summary of the paper and my understanding of it, mixed with my personal opinions.
1. Crux of the paper
Let us begin by trying to summarise the claims of the paper.
This is the main claim of the paper:
The paper shows how traditional approaches fail to explain why large neural networks generalize well in practice.
To elaborate further on the above statement, let us look at the following points:
- ‘traditional approaches’: The generalisation performance of large neural networks is generally attributed either to the model family (and the inductive bias associated with it, e.g. CNNs for images) or to explicit (l2 norm of the weights, Dropout, BatchNorm) and implicit (properties of the optimization algorithm) regularisation techniques.
- Generalization: Generalization refers to a model’s ability to perform comparably on unseen data, and is usually quantified as the generalisation error, i.e., the difference between test error and train error.
The paper brings to light that it is not trivial to answer why neural networks have such good generalisation performance. It does this by showing that neural networks easily fit (memorise) random labels and random data with no significant change in the properties of the training process. This is highlighted neatly in the paper by the following two centred, italic notes:
Deep Neural Networks easily fit random labels.
Explicit regularization may improve generalization performance, but is neither necessary nor by itself sufficient for controlling generalization error.
Other claims of the paper:
- A simple 2-layer ReLU network with p = 2n + d parameters can express any labelling of a dataset of size n in d dimensions, i.e., the hypothesis class represented by a 2-layer ReLU network with p parameters can shatter datasets of size n in d dimensions (a small sketch of this construction is given at the end of this section).
- The properties of the training process of standard architectures do not change substantially when fitting random labels, which leads to the claim that whatever justifications exist for the small generalization error of these networks are not enough.
- The statistical learning idea behind (explicit) regularisation, namely confining the hypothesis class to a smaller subset of manageable complexity, is not enough to explain the generalization abilities of deep networks, since the same networks also fit random data.
- Implicit regularization — What properties of global minima explain their generalisation? Do all global minima generalise equally?
They call out the fact that understanding generalization is difficult even for a simple linear model. For linear models, they investigate two properties of the minima and check whether these properties are indicative of the generalization performance of a model.
a) Curvature of the minima: in their construction of the linear case, all the minima have the same curvature, so curvature alone cannot distinguish solutions that generalise well from those that do not.
b) Norm of the minima: for the linear model under consideration, assuming the weights are initialised to zero, they show that the solution SGD converges to is the minimum-l2-norm solution. Unfortunately, this does not guarantee better generalization performance either.
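To make this implicit-regularisation point concrete, here is a minimal numpy sketch (my own code, not the paper's) of the linear case: for an over-parameterised least-squares problem, gradient descent started at zero converges to the minimum-l2-norm interpolating solution, because the iterates never leave the row space of the data matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Over-parameterised linear regression: more features (d) than samples (n),
# so infinitely many weight vectors fit the training data exactly.
n, d = 20, 100
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Full-batch gradient descent on the squared loss, started at w = 0.
# Each gradient X.T @ (X @ w - y) lies in the row space of X, so the iterates
# never leave that row space; the same argument applies to SGD, since every
# mini-batch gradient also lies in it.
w = np.zeros(d)
lr = 0.01
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y) / n

# The minimum-l2-norm interpolating solution, computed via the pseudo-inverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training residual       :", np.linalg.norm(X @ w - y))       # ~0: a global minimum
print("gap to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0: the same minimum
```

Whether this minimum-norm solution actually generalises better is a property of the data rather than of the algorithm, which is exactly the caveat above.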
The paper asks an important question: what makes deep neural networks generalize well? And it brings out the fact that none of the obvious answers to this question holds up.
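Regarding the 2n + d expressivity claim referenced earlier in this section, here is a minimal numpy sketch of that style of construction (the paper states it as a lemma; the code and variable names here are my own): project all points onto one random direction, place one ReLU kink per point, and solve the resulting triangular linear system for the output weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# n arbitrary points in d dimensions with arbitrary (here random) labels.
n, d = 50, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Project every point onto one random direction a; with probability 1 the
# projections z_i = <a, x_i> are all distinct.
a = rng.standard_normal(d)
z = X @ a
order = np.argsort(z)
z_sorted = z[order]

# Place one ReLU "kink" b_j just below the j-th sorted projection, so that
# hidden unit j is active exactly on the points with z_i >= z_sorted[j].
b = np.empty(n)
b[0] = z_sorted[0] - 1.0
b[1:] = (z_sorted[:-1] + z_sorted[1:]) / 2.0

# The activation matrix A_ij = max(z_i - b_j, 0) is lower triangular with a
# positive diagonal (in sorted order), hence invertible: solve A w = y.
A = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
w = np.linalg.solve(A, y[order])

# The network c(x) = sum_j w_j * relu(<a, x> - b_j) has d + n + n = 2n + d
# parameters (a, b, w) and reproduces every label exactly.
pred = np.maximum((X @ a)[:, None] - b[None, :], 0.0) @ w
print("max |prediction - label|:", np.abs(pred - y).max())  # ~0 up to numerical error
```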
2. Significance
The question put forward to the reader is why neural networks generalise well. We have all taken advantage of neural networks’ performance in some way or the other; they have become immensely popular and are almost everywhere. Yet we still do not understand what makes them generalise well.
An answer to this question would enable better design of architectures, optimization algorithms and regularisers. Not knowing why something works well makes it harder to improve and to interpret. A satisfactory answer would therefore have profound implications for understanding deep learning and for making it more reliable and robust.
3. Experimental Setup
The experimental results of the paper do not shock me; there are some similar reactions on the OpenReview forum. It is not shocking that the standard regularizers contribute only so much to generalization performance. Nor is it shocking that, even with standard regularization in place, deep networks can fit random labels.
The following set of experiments is performed:
1. True Labels: the original dataset, unmodified.
2. Partially Corrupted Labels: the label of each image is independently corrupted (replaced by a uniformly random class) with probability p.
3. Random Labels: the p = 1 case, i.e., all labels are random.
4. Shuffled Pixels: a single random permutation of the pixels is applied to all images in the train and test sets.
5. Random Pixels: an independent random permutation is applied to each image.
6. Gaussian: random pixels are generated from a Gaussian distribution with mean and variance matching the original images.
Standard architectures are trained on the CIFAR-10 and ImageNet benchmarks with the same set of hyperparameters throughout, and the training and test accuracies are recorded.
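To make these variants concrete, here is a minimal sketch of how they could be generated for CIFAR-10 (my own code using numpy and torchvision, not the authors' pipeline, so the names and details are assumptions):

```python
import numpy as np
import torchvision

rng = np.random.default_rng(0)

train = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)
images = train.data.astype(np.float32)   # shape (50000, 32, 32, 3)
labels = np.array(train.targets)
n, h, w, c = images.shape
num_classes = 10

# 2./3. Partially corrupted labels: each label is independently replaced by a
# uniformly random class with probability p (p = 1.0 gives fully random labels).
def corrupt_labels(labels, p):
    mask = rng.random(len(labels)) < p
    random_labels = rng.integers(0, num_classes, size=len(labels))
    return np.where(mask, random_labels, labels)

# 4. Shuffled pixels: ONE fixed permutation applied to every image
# (the same permutation would also be applied to the test set).
fixed_perm = rng.permutation(h * w)
shuffled = images.reshape(n, h * w, c)[:, fixed_perm, :].reshape(n, h, w, c)

# 5. Random pixels: an independent permutation per image.
random_pixels = np.stack(
    [img.reshape(h * w, c)[rng.permutation(h * w)].reshape(h, w, c) for img in images]
)

# 6. Gaussian: pure noise images matching the mean and std of the original data.
gaussian = rng.normal(images.mean(), images.std(), size=images.shape)
```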
The first three experiments can show whether deep networks ‘shatter’ datasets of practical size (strictly, shattering requires showing that every possible labelling can be realised by the hypothesis class, but it is safe to assume that if a random labelling can be fit, any labelling can be). The near-zero training error in the second and third experiments brings out the fact that, even though there is no humanly explainable relation between the images and their labels, deep networks still learn some function that satisfies the random labelling. This means the hypothesis class is rich; why, then, does it learn the ‘correct’ function (correctness measured by generalization error) when given the true, human-assigned labels?
The latter three experiments show that even when the images are not natural, CNNs are still able to learn functions that achieve close-to-zero training error. In other words, even with the inductive bias built into CNN architectures, the hypothesis class remains rich enough to fit functions of random pixels. The CNN inductive bias, then, does not by itself explain the good behaviour on natural images, since these networks fit non-natural images just as readily.
Two attributes do differ between random and true data:
1. Training error: on ImageNet, training with random labels does not reach 0% error as it does with true labels, but the error is still very low (~5%) and far better than random chance.
2. Learning characteristics: the paper claims that the learning characteristics (training curves, number of epochs required) are similar across all the above variants. Some readers feel otherwise (see the OpenReview discussion), and I would also not read too much into this. Learning characteristics can reveal a lot about the optimization process, but the paper does not provide enough insight or experiments to claim anything confidently here.
Experiments with regularization
Three very commonly used regularizers are considered:
1. Data Augmentation: domain-specific transformations are applied, such as random cropping, hue perturbation, etc.
2. Weight Decay: l2 regularization on weights
3. Dropout
Without changing the hyperparameters, the experiments are repeated with the various regularizers turned on and off. On CIFAR-10, the generalization error is very low with or without regularizers. On the Inception architecture, however, turning off the regularizers resulted in an 18% drop in top-1 test accuracy.
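For reference, these three regularizers correspond to standard knobs in any modern framework. A minimal PyTorch-style sketch of what turning them on or off amounts to (my own illustration, not the paper's code; values such as 5e-4 and p=0.5 are arbitrary placeholders):

```python
import torch
import torch.nn as nn
import torchvision.transforms as T

# 1. Data augmentation: domain-specific, label-preserving transformations;
#    "off" means keeping only the bare tensor conversion.
augment_on = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ColorJitter(hue=0.1),        # e.g. a small hue perturbation
    T.ToTensor(),
])
augment_off = T.ToTensor()

# 2. Weight decay: an l2 penalty on the weights, usually folded into the optimizer.
def make_optimizer(model, weight_decay_on=True):
    return torch.optim.SGD(
        model.parameters(),
        lr=0.1,
        momentum=0.9,
        weight_decay=5e-4 if weight_decay_on else 0.0,
    )

# 3. Dropout: randomly zeroing activations during training; p = 0 disables it.
def make_classifier_head(in_features, num_classes, dropout_on=True):
    return nn.Sequential(
        nn.Dropout(p=0.5 if dropout_on else 0.0),
        nn.Linear(in_features, num_classes),
    )
```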
The authors also observe that data augmentation based on known symmetries of the data, and changes to the model architecture, seem to be more impactful than weight decay or simply preventing low training error. Changing the model architecture changes the hypothesis class and is a way of encoding the inductive bias, i.e., our knowledge about the problem and the domain. This reduces the effective complexity of the model and hence the variance; and if the encoded inductive bias brings the hypothesis class closer to the underlying true function, it also reduces the bias. When both the bias and the variance shrink, the generalisation error shrinks (see the decomposition below).
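For completeness, the bias/variance argument above is the standard decomposition of the expected squared error (textbook material, not something derived in the paper):

```latex
\mathbb{E}\!\left[\left(y - \hat{f}(x)\right)^{2}\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^{2}\right]}_{\text{variance}}
  + \sigma^{2}
```

A well-chosen inductive bias can shrink the variance term without inflating the bias term, which is the effect described above.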
The conclusion from these experiments is that regularizers, when tuned properly, help improve generalization performance, but they are not the reason deep neural networks generalize well: even with regularization turned off, the networks continue to perform well.
This, to me, is not very surprising; nobody would claim that it is the regularizers that bring the generalization error down from 90% to 10%.
4. What can be learnt from the paper
- We do not know why deep networks generalize well, and the obvious answers are not the correct answers.
- Hypothesis classes are rich enough to contain functions that explain arbitrary labellings, and optimization algorithms do settle on such functions.
5. Other insights, gaps and some follow up references
- Why are rich hypothesis spaces undesirable?
According to learning theory, a hypothesis class with infinite VC dimension is not PAC learnable: if a hypothesis class can realise every possible labelling of arbitrarily large datasets, it is impossible to obtain a Probably Approximately Correct function. Philosophically, a hypothesis class that can explain any fact explains nothing. [2] gives nearly tight bounds on the VC dimension of feedforward networks with piecewise linear (e.g. ReLU) activations, of order W · L · log W, where W is the number of weights and L the number of layers. Since the number of parameters of a practical network (often millions) far exceeds the size of a practical dataset (e.g. 50,000 training images for CIFAR-10), such networks are able to shatter datasets of the sizes used in practice.
- The usual assumption when thinking about generalization is that if a network performs similarly on unseen data, it is generalizing well. But we forget that the unseen data is typically very similar, or close, to the training data. As noted by [1], if the test images are changed even slightly, the test error rises significantly, which hints at the need for better quantification and evaluation of generalisation performance itself.
- Existing work on neural network pruning (e.g., [4]) demonstrates that the function learned by a neural network can often be represented with far fewer parameters. [5] shows the importance of weight initialisation: there exist subnetworks which, when initialised properly (and hence steered towards a better minimum), can reach the same performance as the larger network (with its richer hypothesis class), but when initialised randomly, the same subnetwork architecture is unable to reach that performance.
- A comparison of the performance of different global minima, as done in [3], seems like a step in the right direction.
- Side note: it is interesting to me that when a machine learning algorithm learns a function that explains human-comprehensible labels we call it ‘learning’, and otherwise we call it ‘memorization’. We do want our algorithms to learn what we learn; we want to give them the knowledge that we possess, and so it makes sense to tag any other kind of learning as plain memorization. But technically it is not memorization of labels if the labelling is not humanly comprehensible; it is simply learning some other function, one that we do not comprehend and that does not match the way we perceive our world.
[1] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning (ICML), 2019.
[2] Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930, 2017.
[3] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems (NIPS), 2017.
[4] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pp. 1135–1143, 2015.
[5] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.