## Monday, April 30, 2018

### Why Bahdanau's neural attention requires two layers

The neural attention model was first described for the task of machine translation in Bahdanau et al.'s "Neural machine translation by jointly learning to align and translate". The source sentence is translated into the target sentence by attending to different words in the source sentence as the target sentence is generated one word at a time. The attended source sentence is defined as follows:

\begin{align} c_i &= \sum_j \alpha_{ij} s_j \\ \alpha_{ij} &= \frac{e^{z_{ij}}}{\sum_k e^{z_{ik}}} \\ z_{ij} &= W \tanh(V (s_j ++ p_{i-1})) \end{align}
where $c_i$ is the attended source sentence taken from the weighted sum of the source sentence vectors $s_j$ and the weights $\alpha_{ij}$, $p_{i-1}$ is the prefix vector produced by the 'decoder' RNN that remembers what has been generated thus far, $++$ means concatenation, and $W$ and $V$ are two weight matrices.
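The three equations above can be sketched in plain Python for a single decoder step. This is only a minimal illustration under some assumptions that are not in the original: vectors are Python lists, $W$ is treated as a single row so that $z_{ij}$ is a scalar, and the function name `attend` and all the toy numbers are made up for the example.

```python
import math

def softmax(z):
    """Turn a list of scores into weights that are positive and sum to 1."""
    exps = [math.exp(x) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def attend(sources, prefix, V, W):
    """Return (c_i, alpha_i) for one decoder step (hypothetical helper)."""
    scores = []
    for s in sources:
        x = s + prefix  # s_j ++ p_{i-1}: list concatenation
        hidden = [math.tanh(sum(v * a for v, a in zip(row, x))) for row in V]
        scores.append(sum(w * h for w, h in zip(W, hidden)))  # z_ij
    alpha = softmax(scores)  # alpha_ij
    dim = len(sources[0])
    # c_i: weighted sum of the source vectors
    context = [sum(a * s[d] for a, s in zip(alpha, sources)) for d in range(dim)]
    return context, alpha

# Toy values chosen only for illustration.
sources = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # s_0, s_1, s_2
prefix = [0.5, -0.5]                            # p_{i-1}
V = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.3, -0.2, 0.1], [0.2, -0.1, 0.4, -0.3]]
W = [0.5, -0.5, 1.0]

context, alpha = attend(sources, prefix, V, W)
print("alpha:", alpha)
print("c:", context)
```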

So the question we're asking is, why do we need to use two layers to produce $z_{ij}$? Can we do with just one layer?

What happens when you use a single layer is that the attention weights stay the same across time steps: different words in the source sentence receive different amounts of attention, but each word keeps receiving the same attention throughout the whole generation process. The reason for this is that softmax, which is the function that produces $\alpha_{ij}$, is shift invariant, that is, its output does not change if you add the same number to each of its inputs.

Let's say $z$ is defined as [ 1, 2, 3 ]. Then $\alpha$ will be
$$\begin{matrix} (\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3}) \end{matrix}$$
but if we add the constant $k$ to each of the three numbers then the result will still be the same
$$\begin{matrix} & (\frac{e^{1+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{2+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{3+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) \\ =& (\frac{e^1e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^2e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^3e^k}{e^1e^k + e^2e^k + e^3e^k}) \\ =& (\frac{e^1e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^2e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^3e^k}{e^k(e^1 + e^2 + e^3)}) \\ =& (\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3}) \end{matrix}$$

This proves that adding the same constant to every number in $z$ will leave the softmax unaltered. Softmax is shift invariant.
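The same thing can be checked numerically. The snippet below is a small sketch using a naive softmax written just for this purpose; the value of `k` is arbitrary and the two results should agree up to floating point rounding.

```python
import math

def softmax(z):
    """Naive softmax: exponentiate each score and normalise."""
    exps = [math.exp(x) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
k = 5.0  # arbitrary constant, chosen only for illustration

alpha = softmax(z)
alpha_shifted = softmax([x + k for x in z])

print(alpha)
print(alpha_shifted)
```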

Now let's say that $z$ is determined by a single neural layer such that $z_{ij} = W (s_j ++ p_{i-1}) = W_0 s_j + W_1 p_{i-1}$, where $W_0$ and $W_1$ are the parts of $W$ that multiply $s_j$ and $p_{i-1}$ respectively. We can draw a matrix of all possible $z$ where the columns are the source vectors $s$ and the rows are the decoder prefix states $p$.
$$\begin{matrix} (W_0 s_0 + W_1 p_0) & (W_0 s_1 + W_1 p_0) & (W_0 s_2 + W_1 p_0) & \dots \\ (W_0 s_0 + W_1 p_1) & (W_0 s_1 + W_1 p_1) & (W_0 s_2 + W_1 p_1) & \dots \\ (W_0 s_0 + W_1 p_2) & (W_0 s_1 + W_1 p_2) & (W_0 s_2 + W_1 p_2) & \dots \\ \dots & \dots & \dots & \dots \end{matrix}$$

Given a single row, the $p$s are always the same, which means that the only source of variation between $z$s of the same prefix $p$ is from the source vectors $s$. This makes sense.

Now take the first two rows. What is the result of subtracting the second from the first?

$$\begin{matrix} & (W_0 s_0 + W_1 p_0 - W_0 s_0 - W_1 p_1) & (W_0 s_1 + W_1 p_0 - W_0 s_1 - W_1 p_1) & (W_0 s_2 + W_1 p_0 - W_0 s_2 - W_1 p_1) & \dots \\ =& (W_1(p_0-p_1)) & (W_1(p_0-p_1)) & (W_1(p_0-p_1)) & \dots \end{matrix}$$

Notice how all the columns have the same difference, which means that the second row can be rewritten as the first row minus that difference:

$$\begin{matrix} (W_0 s_0 + W_1 p_0 - W_1(p_0-p_1)) & (W_0 s_1 + W_1 p_0 - W_1(p_0-p_1)) & (W_0 s_2 + W_1 p_0 - W_1(p_0-p_1)) & \dots \end{matrix}$$

We know that adding the same constant to every $z$ will leave the softmax unaltered, which means that every time step in the decoder RNN will lead to the same attention vector. The individual attention values will differ from one source word to another, but they will not change throughout the whole generation process. Using two layers with a non-linear activation function in the middle disrupts this: the difference between the $z$s of two time steps is no longer the same for every source word, so it is no longer a constant shift that softmax cancels out.
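To see the effect end to end, here is a rough sketch that compares the two scoring functions on random toy data. Everything in it is an assumption made for illustration: the dimensions, the random weights, the fixed seed, and helper names such as `single_layer_score`, `two_layer_score` and `max_gap`. As before, $W$ is treated as a single row so that each $z_{ij}$ is a scalar.

```python
import math
import random

def softmax(z):
    exps = [math.exp(x) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)  # fixed seed so the run is repeatable
dim, hidden = 4, 8      # toy sizes, chosen arbitrarily

def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

sources = [rand_vec(dim) for _ in range(3)]   # s_0, s_1, s_2
prefixes = [rand_vec(dim) for _ in range(3)]  # p_0, p_1, p_2

W_single = rand_vec(2 * dim)                      # single layer weights
V = [rand_vec(2 * dim) for _ in range(hidden)]    # first layer of the two layer version
W = rand_vec(hidden)                              # second layer of the two layer version

def single_layer_score(s, p):
    # z = W (s ++ p)
    return dot(W_single, s + p)

def two_layer_score(s, p):
    # z = W tanh(V (s ++ p))
    return dot(W, [math.tanh(dot(row, s + p)) for row in V])

def attention(score, p):
    """Attention weights over all source vectors for one prefix state."""
    return softmax([score(s, p) for s in sources])

def max_gap(rows):
    """Largest absolute difference between the first row and any other row."""
    return max(abs(a - b) for row in rows[1:] for a, b in zip(rows[0], row))

single = [attention(single_layer_score, p) for p in prefixes]
double = [attention(two_layer_score, p) for p in prefixes]

print("single layer gap:", max_gap(single))
print("two layer gap:   ", max_gap(double))
```

With the single layer the gap between time steps should be zero up to rounding error, whereas with the two layer version the attention rows should visibly differ from one prefix state to the next.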