Friday, June 28, 2019

The exploding gradient problem: Why your neural network gives NaNs

Have you ever been training your neural network and suddenly got a bunch of NaNs as loss values? This is usually because of something called the exploding gradient problem, which is when the gradients used to train your neural network become extremely large, leading to overflow errors during backpropagation.

The exploding gradient problem is mostly caused by large weight values combined with a large number of layers. It happens a lot with RNNs, which, once unrolled, are effectively neural networks with as many layers as there are items in the input sequence. Basically, given a particular activation function, with every additional layer the gradient of the parameters in the first layer can either vanish or explode, because that gradient contains the product of the weights of all the layers that come after it. If the weights are fractions (magnitude less than 1) then their product will get closer and closer to zero (vanishes) with every additional layer, whilst if the weights are large (magnitude greater than 1) then their product will grow towards infinity (explodes) with every additional layer. Vanishing gradients lead to no learning in the early layers whilst exploding gradients lead to NaNs.
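To make the vanish-or-explode behaviour concrete, here is a minimal Python sketch that simply multiplies a weight by itself once per layer. This is a simplification that ignores the activation function's own derivative, and the function name `gradient_factor` and the choice of 2000 layers are assumptions made purely for illustration. It also hints at where the NaNs come from: once a float overflows to infinity, an operation such as infinity minus infinity gives NaN.

```python
import math

def gradient_factor(weight, num_layers):
    """Product of `num_layers` copies of `weight` (hypothetical helper)."""
    product = 1.0
    for _ in range(num_layers):
        # Repeated float multiplication saturates to inf (or 0.0)
        # instead of raising an error.
        product *= weight
    return product

vanished = gradient_factor(0.5, 2000)  # underflows to 0.0
exploded = gradient_factor(8.0, 2000)  # overflows to inf
not_a_number = exploded - exploded     # inf - inf is NaN

print(vanished, exploded, not_a_number)
print(gradient_factor(8.0, 3))  # only 3 layers: 8 * 8 * 8
```

With only 3 layers and a weight of 8 the factor is a harmless 512; it is the repeated multiplication over many layers that pushes it past what a float can hold.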

Every explanation of this that I've seen gives only general reasoning, without any concrete examples of what it looks like. To illustrate what exploding gradients actually look like, we're going to make some simplifications: we'll build a neural network with just one neural unit per layer, and every layer will share the same single weight and bias. This is not a necessary condition for exploding gradients, but it helps visualise what is going on. Let's see what happens when we have just 3 layers and a weight equal to 8.

ReLU is an activation function that is known for mitigating the vanishing gradient problem, but it also makes it easy to create exploding gradients if the weights are large enough, which is why weights must be initialised to very small values. Here is what the graph produced by a neural network looks like when it consists of single ReLU units in each layer (single number input and single number output as well) as the number of layers varies between 1 (pink), 2 (red), and 3 (dark red).
$y = \text{relu}(w \times x)$
$y = \text{relu}(w \times \text{relu}(w \times x))$
$y = \text{relu}(w \times \text{relu}(w \times \text{relu}(w \times x)))$

See how the steepness grows exponentially with every additional layer? That's because the gradient is basically the product of all the weights in all the layers, which means that, since $w$ is equal to 8, the gradient is increasing 8-fold with each additional layer.
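As a rough check of that 8-fold growth, the following sketch estimates the slope of each of the three networks numerically with a central difference. The helper names (`relu_net`, `slope`) and the choice of evaluating at $x = 1$ are assumptions made for illustration; any positive $x$ should give the same slopes, since these ReLU networks are linear there.

```python
def relu(z):
    return max(0.0, z)

def relu_net(x, w, num_layers):
    """One ReLU unit per layer, same weight in every layer, no bias."""
    y = x
    for _ in range(num_layers):
        y = relu(w * y)
    return y

def slope(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Slopes for 1, 2 and 3 layers with w = 8, measured at x = 1.
slopes = [slope(lambda x: relu_net(x, 8.0, n), 1.0) for n in (1, 2, 3)]
print(slopes)  # roughly 8, 64, 512
```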

Now with ReLU, this sort of thing is kind of expected as there is no bound to the value returned by it, so there should also be no bound to the gradient. But this also happens with squashing functions like tanh. Even though its returned value must be between -1 and 1, the gradient at particular points of the function can still be exceptionally large. This means that if you're training the neural network at a point in parameter space which has a very steep gradient, even if it's just a short burst, you'll end up with overflow errors or at the very least with shooting off the learning trajectory. This is what the graph produced by a neural network looks like when it consists of single tanh units in each layer.
$y = \text{tanh}(w \times x)$
$y = \text{tanh}(w \times \text{tanh}(w \times x))$
$y = \text{tanh}(w \times \text{tanh}(w \times \text{tanh}(w \times x)))$
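To put some numbers on these three curves, here is a small sketch under the same simplifications (one unit per layer, a shared weight of 8, no bias). It estimates the slope at the centre ($x = 0$) and out on the flat part ($x = 1$). The helper names are assumptions made for illustration, and the slopes are finite-difference estimates rather than exact derivatives.

```python
import math

def tanh_net(x, w, num_layers):
    """One tanh unit per layer, same weight in every layer, no bias."""
    y = x
    for _ in range(num_layers):
        y = math.tanh(w * y)
    return y

def slope(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Slopes for 1, 2 and 3 layers with w = 8.
centre_slopes = [slope(lambda x: tanh_net(x, 8.0, n), 0.0) for n in (1, 2, 3)]
flat_slopes = [slope(lambda x: tanh_net(x, 8.0, n), 1.0) for n in (1, 2, 3)]

print(centre_slopes)  # roughly 8, 64, 512
print(flat_slopes)    # all close to 0
```

So the output stays between -1 and 1, but the slope in the middle still grows like the product of the weights, while everywhere else the gradient is practically zero.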

See how the slope that transitions the value from -1 to 1 gets steeper and steeper as the number of layers increases? This means that if you're training a simple RNN that uses tanh as an activation function, you can either make the state weights very small or make the initial state values very large in order to stay off the middle slope. Of course this would have other problems, as then the network wouldn't be able to easily learn to set the initial state to any set of values. It might also be the case that there is no single slope but several slopes throughout the graph, since the only limitation that tanh imposes is that all values lie between -1 and 1. For example, if the previous layer learns some kind of sinusoid-like pattern that alternates between -5 and 5 (remember that a neural network with large enough layers can approximate any function), then this is what passing that through tanh would look like (note that this equation is a valid neural network with a hidden layer size of 3 and a single input and output):
$y = \text{tanh}(5 \times \text{tanh}(40 \times x-5) - 5 \times \text{tanh}(40 \times x+0) + 5 \times \text{tanh}(40 \times x+5))$

In this case you can see how it is possible for there to be many steep slopes spread around the parameter space, depending on what the previous layers are doing. Your parameter space could be a minefield.
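As a rough illustration of that minefield, the sketch below evaluates the slope of the equation above at a handful of inputs. The sample points are assumptions chosen for illustration: $x = -0.125$, $0$ and $0.125$ are where the argument of one of the three hidden tanh units is zero, and the other points sit in between or far away. The function name `bumpy_net` is hypothetical.

```python
import math

def bumpy_net(x):
    """The 3-hidden-unit tanh network from the equation above."""
    hidden = (5 * math.tanh(40 * x - 5)
              - 5 * math.tanh(40 * x + 0)
              + 5 * math.tanh(40 * x + 5))
    return math.tanh(hidden)

def slope(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

steep_points = [-0.125, 0.0, 0.125]
flat_points = [-1.0, -0.0625, 0.0625, 1.0]

steep_slopes = [slope(bumpy_net, x) for x in steep_points]
flat_slopes = [slope(bumpy_net, x) for x in flat_points]

print(steep_slopes)  # roughly 200, -200, 200
print(flat_slopes)   # all close to 0
```

Moving the input by as little as 0.0625 is enough to go from a practically zero gradient to one with a magnitude of about 200.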

With sigmoid it is a little trickier to see how the gradients explode. If you just do what we did above, you'll get the following result:
$y = \text{sig}(w \times x)$
$y = \text{sig}(w \times \text{sig}(w \times x))$
$y = \text{sig}(w \times \text{sig}(w \times \text{sig}(w \times x)))$

The slopes actually get flatter with every additional layer and get closer to $y=1$. This is because, unlike tanh, sigmoid is bounded between 0 and 1, with $\text{sig}(0) = 0.5$. This means that $\text{sig}(\text{sig}(x))$ will be bounded between 0.5 and 1, since the lowest the innermost sigmoid can go is 0, which will be mapped to 0.5 by the outermost sigmoid. With each additional nested sigmoid you'll just be pushing that lower bound closer and closer toward 1 until the graph becomes a flat line at $y=1$. In fact, in order to see exploding gradients we need to make use of the biases (which up till now were set to 0). Setting the biases to $-\frac{w}{2}$ gives very nice curves:
$y = \text{sig}(w \times x)$
$y = \text{sig}(w \times \text{sig}(w \times x) - \frac{w}{2})$
$y = \text{sig}(w \times \text{sig}(w \times \text{sig}(w \times x) - \frac{w}{2}) - \frac{w}{2})$
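The difference the bias makes can be checked numerically. The sketch below compares the slope at $x = 0$ for the two sets of sigmoid equations above, with $w = 8$. As in those equations, the first layer has no bias and only the later layers use $-\frac{w}{2}$. The function names and the `use_bias` flag are assumptions made for illustration.

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def sig_net(x, w, num_layers, use_bias):
    """One sigmoid unit per layer; layers after the first may use bias -w/2."""
    y = sig(w * x)
    bias = -w / 2 if use_bias else 0.0
    for _ in range(num_layers - 1):
        y = sig(w * y + bias)
    return y

def slope(f, x, eps=1e-6):
    """Central-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Slopes at x = 0 for 1, 2 and 3 layers with w = 8.
no_bias = [slope(lambda x: sig_net(x, 8.0, n, False), 0.0) for n in (1, 2, 3)]
with_bias = [slope(lambda x: sig_net(x, 8.0, n, True), 0.0) for n in (1, 2, 3)]

print(no_bias)    # starts at about 2 and shrinks towards 0
print(with_bias)  # roughly 2, 4, 8
```

With the bias, each layer multiplies the central slope by $\frac{w}{4}$, because the steepest slope of the sigmoid itself is only $\frac{1}{4}$ whereas that of tanh is 1.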

Note that with sigmoid the steepness of the slope increases very slowly compared to tanh, which means that you'll need to use either larger weights or more layers to get the same dramatic effect. Also note that all these graphs were for weights equal to 8, which is really large, but if you have many layers, as in the case of a simple RNN working on long sentences, even a weight only slightly larger than 1 could explode after enough inputs.