
Tuesday, August 31, 2021

Triangles in a diagonally segmented trapezium/trapezoid

When a trapezium (a quadrilateral with two parallel sides) is segmented diagonally, you end up with 4 triangles, as shown below:

These 4 triangles have interesting properties which I shall be proving here, namely that triangles P and Q are similar (same angles, different sizes) and triangles R and S have the same area.

Let's look at just triangles P and Q.

The angles with the same colour are equal for the following reasons: the green angles are vertically opposite angles, whilst the red and blue angles are alternate angles.

Now let's look at just triangles Q, R, and S.

In order to show that R and S have the same area, we'll first look at combined triangles Q and R and combined triangles Q and S.

These two combined triangles have the same base and height, which means that they have the same area ($A = \frac{1}{2}bh$). Triangle Q is the same triangle in both figures, so it has the same area in both. Therefore, removing the area of triangle Q from both shapes leaves equal areas ($R+Q = S+Q \implies R = S$), so triangles R and S have the same area.
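This can also be verified numerically. Below is a quick Python sketch (the trapezium coordinates are an arbitrary example of mine) that finds the intersection of the diagonals and compares the areas of the two side triangles using the shoelace formula:

```python
def shoelace(p1, p2, p3):
    # area of a triangle from its vertices (shoelace formula)
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1*(y2 - y3) + x2*(y3 - y1) + x3*(y1 - y2)) / 2

def intersect(p1, p2, p3, p4):
    # intersection point of line p1-p2 with line p3-p4 (assumed non-parallel)
    (x1, y1), (x2, y2), (x3, y3), (x4, y4) = p1, p2, p3, p4
    d = (x1 - x2)*(y3 - y4) - (y1 - y2)*(x3 - x4)
    t = ((x1 - x3)*(y3 - y4) - (y1 - y3)*(x3 - x4)) / d
    return (x1 + t*(x2 - x1), y1 + t*(y2 - y1))

# an arbitrary trapezium: side AB is parallel to side DC
A, B, C, D = (0, 0), (4, 0), (3, 2), (1, 2)
X = intersect(A, C, B, D)   # diagonals AC and BD cross at X
R = shoelace(A, D, X)       # side triangle on the left
S = shoelace(B, C, X)       # side triangle on the right
print(R, S)                 # both equal to 4/3
```

Changing the trapezium coordinates (keeping one pair of sides parallel) always leaves the two side areas equal.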

Thursday, July 29, 2021

The last digit of square numbers cycles/repeats (modulo of square numbers)

In a previous blog post, I showed that the last digit of a square number could not be 2, 3, 7, or 8. I will now show that the last digit cycles through the same sequence of 10 digits: 0, 1, 4, 9, 6, 5, 6, 9, 4, 1; as shown in these first 31 square numbers:

 0^2:   0
 1^2:   1
 2^2:   4
 3^2:   9
 4^2:  16
 5^2:  25
 6^2:  36
 7^2:  49
 8^2:  64
 9^2:  81
10^2: 100
11^2: 121
12^2: 144
13^2: 169
14^2: 196
15^2: 225
16^2: 256
17^2: 289
18^2: 324
19^2: 361
20^2: 400
21^2: 441
22^2: 484
23^2: 529
24^2: 576
25^2: 625
26^2: 676
27^2: 729
28^2: 784
29^2: 841
30^2: 900

Last time I showed that the reason only a subset of digits can be the last digit of a square number is the following modular equivalence:

$x^2 \text{ mod } 10 = (x \times x) \text{ mod } 10 = ((x \text{ mod } 10) \times (x \text{ mod } 10)) \text{ mod } 10 = (x \text{ mod } 10)^2 \text{ mod } 10$

which says that the last digit of a square number will always be the last digit of the square of a single digit, which limits the possible last digits to the last digit of 0^2, 1^2, 2^2, 3^2, 4^2, 5^2, 6^2, 7^2, 8^2, and 9^2, that is, 0, 1, 4, 9, 6, or 5. From this we can easily see why there is always a cycle among the same 10 digit sequence.

Given that $x^2 \text{ mod } 10 = (x \text{ mod } 10)^2 \text{ mod } 10$, we see that the $(x \text{ mod } 10)$ part is cyclic. This is because it gives the last digit of the number $x$ which cycles through 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Therefore the last digit of the square of these digits must also cycle. Note that this holds for any $n$ in $x^2 \text{ mod } n$ because, given that $x^2 \text{ mod } n = (x \text{ mod } n)^2 \text{ mod } n$, the $(x \text{ mod } n)$ part is cyclic every $n$ numbers.

You'll also notice that the cycle of digits is symmetric (0, 1, 4, 9, 6, 5, 6, 9, 4, 1), that is, goes back through the sequence in reverse order after 5. In a future blog post we'll see why this is.
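The cycle can be seen directly with a couple of lines of Python:

```python
# last digit of x^2 depends only on x mod 10, so it cycles every 10 numbers
last_digits = [(x * x) % 10 for x in range(30)]
print(last_digits)

# the 10-digit block 0, 1, 4, 9, 6, 5, 6, 9, 4, 1 repeats
cycle = last_digits[:10]
print(last_digits == cycle * 3)  # True
```

The same check with `% n` for any other modulus $n$ (using `range` blocks of length $n$) shows the general cyclic behaviour described above.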

Sunday, February 28, 2021

Micro averaged precision, recall, and F1 are always the same

This is something that causes confusion when it happens because it isn't very well known. If you're evaluating a multi-class classifier using micro-averaging, the precision, recall, and F1 scores will always be exactly the same. In fact this is explained in the sklearn documentation: "Note that if all labels are included, “micro”-averaging in a multiclass setting will produce precision, recall and F that are all identical to accuracy." Let's prove this, first using an example and then more generally.

Let's say that your classifier's classes (or labels) are A, B, and C and that you have the following classifier predictions together with the true classes:

pred  true
A     A
A     A
B     A
C     A
A     B
B     B
B     C
C     C
C     C

To measure the micro-averaged precision, recall, and F1 you need to first count the true positives (TP), false positives (FP), and false negatives (FN) of each class. For example, if we look at just class A, the TP is the number of rows where A was predicted and was also the true class, that is, 2 (the first two rows). The FP is the number of rows where A was predicted but was not the true class, that is, 1 (the fifth row). The FN is the number of rows where A was the true class but was not predicted, that is, 2 (the third and fourth rows).

Class  TP  FP  FN
A      2   1   2
B      1   2   1
C      2   1   1

You calculate the micro-averaged precision ($P$), recall ($R$), and $F1$ for labels $L$ as follows:

  • $P = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FP}_l)}} = \frac{2+1+2}{(2+1)+(1+2)+(2+1)} = \frac{5}{9} = 55.6\%$
  • $R = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FN}_l)}} = \frac{2+1+2}{(2+2)+(1+1)+(2+1)} = \frac{5}{9} = 55.6\%$
  • $F1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times \frac{5}{9} \times \frac{5}{9}}{\frac{5}{9} + \frac{5}{9}} = 55.6\%$
So we got all the scores the same. If $P$ and $R$ are the same then $F1$ will be the same because the harmonic mean of two copies of the same number is that number (just like the arithmetic mean). We can easily prove this: $$ \frac{2 \times X \times X}{X + X} = \frac{2X^2}{2X} = X $$ This assumes that $X$ is not zero; $P = R = 0$ would mean that the classifier got every single prediction wrong, in which case $F1$ is undefined.

So the question we should ask is why are the precision and recall always the same. If the precision and recall are always the same, then their definition should be identical. Let's see if we can simplify their identity. $$ P = R \\ \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FP}_l)}} = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FN}_l)}} \\ \sum_{l\in L}{({TP}_l + {FP}_l)} = \sum_{l\in L}{({TP}_l + {FN}_l)} \\ \sum_{l\in L}{{TP}_l} + \sum_{l\in L}{{FP}_l} = \sum_{l\in L}{{TP}_l} + \sum_{l\in L}{{FN}_l} \\ \sum_{l\in L}{{FP}_l} = \sum_{l\in L}{{FN}_l} $$

So it all boils down to whether or not the sum of each label's false positives is always equal to the sum of each label's false negatives. Let's say that you have the following row in the first table:

pred  true
A     B

This is an error, but is it a false positive or a false negative? From the perspective of label A it's a false positive, but from the perspective of label B it's a false negative. So for every error in the table, it will be both a false positive for the label in the prediction column and a false negative for the label in the true column. This means that the number of false positives and false negatives will always be balanced, as with every false positive for one label there will be a false negative for another label.

And with this we have proven that the number of false negatives will always equal the number of false positives, which further proves that the precision will always be equal to the recall, which further proves that the F1 score will always be equal to them as well.
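As a sanity check, here is a short Python sketch that replays the example table above, counting TP, FP, and FN per class and computing the micro-averaged scores:

```python
pred = ['A', 'A', 'B', 'C', 'A', 'B', 'B', 'C', 'C']
true = ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']

labels = sorted(set(true))
tp = {l: 0 for l in labels}
fp = {l: 0 for l in labels}
fn = {l: 0 for l in labels}
for p, t in zip(pred, true):
    if p == t:
        tp[p] += 1
    else:
        fp[p] += 1  # a false positive for the predicted label...
        fn[t] += 1  # ...and a false negative for the true label

P = sum(tp.values()) / sum(tp[l] + fp[l] for l in labels)
R = sum(tp.values()) / sum(tp[l] + fn[l] for l in labels)
F1 = 2 * P * R / (P + R)
print(P, R, F1)  # all equal to 5/9
```

Note how the `else` branch increments one FP and one FN for every single error; that is the balancing act proven above.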

Wednesday, January 27, 2021

The secretary problem

Say you're dating several partners and want to get into a serious relationship eventually. After how many partners can you have a good idea of what a good partner is like? This is called the secretary problem, and it can be formalised as follows:

Let $n$ be the number of partners in total you can date. Let $A_i$ be the fitness of the $i$th partner, such that the order of the partners is random and we are interested in the partner with the maximum fitness. Let $r$ be the number of partners you date before you start looking for a serious relationship. Once a partner has been evaluated, you cannot go back to them if you try the next partner. What you do is go through all partners $i$ starting from 1 up to $r$, at which point you take $\max_{i=1}^{r}(A_i)$ as a reference of what a good partner looks like, then continue through the remaining partners $r+1$ to $n$ and stop as soon as $A_i$ is greater than the reference ($\max_{i=1}^{r}(A_i)$). If none of the partners are better than the best of the first $r$, you settle for the last partner. The question is: what should $r$ be such that the probability of picking the partner with maximum $A_i$ is maximised?

The choice of $r$ actually makes a significant difference, as we can show in the following simulation in Python 3 (in Python everything needs to be 0-based such that the first item is at position 0 rather than 1):

import random

def select(A, r):
    ref = max(A[:r]) # first r items
    for i in range(r, len(A)): # rest of the items
        if A[i] > ref:
            return i
    return len(A)-1 # return last item if search failed

n = 20
for r in range(1, n): # first item not included
    successes = 0
    for _ in range(10000):
        A = [random.random() for _ in range(n)]
        best = max(range(n), key=lambda i:A[i])
        choice = select(A, r)
        if choice == best:
            successes += 1
    print(r+1, successes/10000)
2 0.1746
3 0.2532
4 0.3099
5 0.3459
6 0.3581
7 0.3856
8 0.3927
9 0.3866
10 0.3717
11 0.366
12 0.3382
13 0.3227
14 0.2918
15 0.263
16 0.2228
17 0.1908
18 0.149
19 0.0969
20 0.0474

These are the fractions of times the best partner was selected out of 10000 simulations; note that the code prints $r+1$, the 1-based position of the first partner actually considered for selection. You will notice that the fraction steadily increases up to a peak at 8 (that is, $r = 7$) and then steadily decreases until the end. It's interesting that the peak is not in the middle of the list but closer to the beginning. How can we find the best $r$ for any $n$?

First, we need to find the probability of choosing the best partner for a given $r$ and $n$. This probability can be expressed as two events happening together: you stopping at element $i+1$ and the maximum element being element $i+1$ (we're using $i+1$ instead of just $i$ because it will be mathematically convenient later). These two events are independent and so we can find the probability of each one and then multiply them together.

The probability of the maximum element from all $n$ elements being the $i+1$th element is $\frac{1}{n}$.

The other probability is a little trickier. To stop at element $i+1$, all the elements prior to $i+1$ have to be less than the reference we got from the first $r$ elements. If a larger element is found sooner, we would have stopped there. Therefore, we are interested in the probability that the maximum element from the first $i$ elements is among the first $r$ elements, which is $\frac{r}{i}$. This is assuming that $i+1 \geq r+1$, since we cannot choose any partner before the $r+1$th partner.

Multiplying these two probabilities, we get $\frac{r}{n \times i}$. But this is the probability for a given $i+1$, whereas we want the probability for any $i+1$. Each $i+1$ is mutually exclusive so we can add up the probability of each $i+1$. Given that $i+1$ must be greater than or equal to $r+1$ by definition,

$$P(r, n) = \sum_{i=r}^{n}\frac{r}{n \times i} = \frac{r}{n}\sum_{i=r}^{n}\frac{1}{i}$$
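We can sanity-check this formula numerically. Here is a minimal sketch in Python (using the summation bounds exactly as written above) that evaluates $P(r, n)$ for $n = 20$ and finds the maximising $r$:

```python
n = 20

def P(r, n):
    # probability of a successful selection, per the formula above
    return (r / n) * sum(1 / i for i in range(r, n + 1))

best_r = max(range(1, n), key=lambda r: P(r, n))
print(best_r, round(P(best_r, n), 4))
```

The maximising $r$ comes out around 7 or 8, in line with the simulation's peak and with $n/e \approx 7.4$.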

Great, so now we have the probability of making a successful selection given $r$ and $n$. Now how do we find the $r$ at which this is maximised? Normally we would find the derivative of this expression with respect to $r$ and see what $r$ makes the derivative 0 (the turning point), but we can't differentiate this expression because $r$ is discrete. We also cannot treat $r$ as continuous because it is used as the starting point in the summation. But if we focus on finding $\frac{r}{n}$ rather than $r$, then we can do so in a roundabout way.

Start by multiplying $\frac{1}{n}$ with the expression like this:

$$\frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n}$$

We can do this since $n$ cannot be zero. Now what will this expression look like if we let $n$ go to infinity?

$$\lim_{n \to \infty} \frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n}$$

Let $x = \lim_{n \to \infty} \frac{r}{n}$ and $t = \lim_{n \to \infty} \frac{i}{n}$. In this case, the summation looks like a Riemann sum, that is, it is adding together the areas of equally wide rectangular strips in a fixed region under a graph. As the width of the strips goes to zero, the total area of all the strips equals the area under the graph (in the fixed region), making it equal to the definite integral of the graph. In this case, the graph is $f(t) = \frac{1}{t}$, the width of the strips is $\frac{1}{n}$, and the region is between $\frac{r}{n}$ and $\frac{n}{n}$. This means the following:

$$\lim_{n \to \infty} \frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n} = x\int_x^1 \frac{1}{t} dt$$

Great. So now we need to find the turning point of $x\int_x^1 \frac{1}{t} dt$ with respect to $x$ in order to know what $\frac{r}{n}$ gives the most probable success. This comes as follows:

$$ \begin{align} \frac{d}{dx} (x\int_x^1 \frac{1}{t} dt) &= 0 \\ \frac{d}{dx} x(\ln(1) - \ln(x)) &= 0 \\ \frac{d}{dx} -x\ln(x) &= 0 \\ -\ln(x) - 1 &= 0 \\ x &= \frac{1}{e} \\ \end{align} $$

Which says that the number of partners you should date before looking for a serious relationship is the total number of partners you will date multiplied by $\frac{1}{e}$ (that is, divided by $e$), or about 37% of the number of people you'll go out with.

Tuesday, December 29, 2020

Completing the square

For all quadratic expressions, that is, of the form $ax^2 + bx + c$, which we'll call canonical forms, there exists an identical expression of the form $a(x + p)^2 + q$, which we'll call completed square forms. For example, $3x^2 + 12x + 27 \equiv 3(x + 2)^2 + 15$. Among other reasons, completed square forms are convenient for finding the roots of the quadratic because they are easier to manipulate algebraically. Below is the method for converting a quadratic expression in canonical form into an identical quadratic in completed square form.

The trick to completing the square is to see what would happen if, given $x^2 + bx + c$, you assume that the completed square form is $(x + \frac{b}{2})^2$ and correct it. If we have an $a$ in front of $x^2$, we can factor it out of the expression so that we have $a(x^2 + \frac{b}{a}x + \frac{c}{a})$ and then complete the square of what's inside the brackets. Let's use $3x^2 + 12x + 27$ as an example:

$3x^2 + 12x + 27 \equiv 3(x^2 + 4x + 9)$

Focus on $x^2 + 4x + 9$

Assume that $x^2 + 4x + 9 \equiv (x + 2)^2$

Expanding the brackets:
$(x + 2)^2 \equiv x^2 + 2 \times 2 \times x + 2^2 \equiv x^2 + 4x + 4$

We can see that dividing $b$ (that is, 4) by 2 is useful because that term gets doubled after expanding the brackets. We can further see that instead of 9, the last term is 4. So we need to correct for this. Fortunately, we can introduce the term $q$ to our completed square form, which can be used to correct it. We correct it by adding 5 (that is, $9 - 4$) after the squared brackets, like this:

$(x + 2)^2 + 5$

Now, when we expand the brackets, we get:

$(x + 2)^2 + 5 \equiv (x^2 + 2 \times 2 \times x + 2^2) + 5 \equiv x^2 + 4x + 9$

which is exactly what we want. Therefore,

$3x^2 + 12x + 27 \equiv 3(x^2 + 4x + 9) \equiv 3((x + 2)^2 + 5) \equiv 3(x + 2)^2 + 15$

Let's try this with the general case:

$ax^2 + bx + c \equiv a(x^2 + \frac{b}{a}x + \frac{c}{a})$

Focus on $x^2 + \frac{b}{a}x + \frac{c}{a}$

Assume that $x^2 + \frac{b}{a}x + \frac{c}{a} \equiv (x + \frac{b}{2a})^2$

$(x + \frac{b}{2a})^2 \equiv x^2 + 2 \times \frac{b}{2a} \times x + (\frac{b}{2a})^2 \equiv x^2 + \frac{b}{a}x + (\frac{b}{2a})^2$

Add $\frac{c}{a} - (\frac{b}{2a})^2$ to the completed square expression.

$(x + \frac{b}{2a})^2 + \frac{c}{a} - (\frac{b}{2a})^2$

Therefore,

$$ \begin{align*} ax^2 + bx + c &\equiv a(x^2 + \frac{b}{a}x + \frac{c}{a}) \\ &\equiv a((x + \frac{b}{2a})^2 + \frac{c}{a} - (\frac{b}{2a})^2) \\ &\equiv a(x + \frac{b}{2a})^2 + a\frac{c}{a} - a(\frac{b}{2a})^2 \\ &\equiv a(x + \frac{b}{2a})^2 + (c - \frac{b^2}{4a}) \end{align*} $$

You can use the last line to convert the expression directly.
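As a sanity check, the last line can be turned into a few lines of Python (the example coefficients are from the worked example above):

```python
def complete_square(a, b, c):
    # returns (a, p, q) such that a*x^2 + b*x + c == a*(x + p)^2 + q
    p = b / (2 * a)
    q = c - b * b / (4 * a)
    return a, p, q

a, b, c = 3, 12, 27
a2, p, q = complete_square(a, b, c)
print(a2, p, q)  # 3 2.0 15.0, matching 3(x + 2)^2 + 15

# the two forms agree at every x; spot-check a few values
for x in [-5, -2, 0, 1.5, 10]:
    assert abs((a * x**2 + b * x + c) - (a2 * (x + p)**2 + q)) < 1e-9
```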

Tuesday, September 29, 2020

Proof of the volume of a pyramid and cone formulae

Let's take a square based pyramid, one which is arranged such that it fits in the corner of a cube like this:
The volume of this kind of pyramid is a third of that of the cube. This is because 3 such pyramids will perfectly fit in the cube, as shown below:
In fact you don't even need the pyramid to fit in a cube, it can fit in a cuboid instead. Three such rectangular based pyramids will always exactly fit the cuboid. Now the next natural question is about pyramids that are arranged differently, such as when the top point is in the center of the base, as shown below:
You will notice that no matter how the pyramid is arranged, it always consists of an infinite number of stacked squares, starting from the square at the base of the pyramid and getting progressively smaller until the top point is reached. Since the base and top point of both pyramids are exactly the same, the only thing that is changing is where these square slices are horizontally. So the same set of squares in the orange pyramid are used in the green pyramid with the only difference being where they are placed. This means that the same volume is being used by both pyramids so the same formula is used for both. Therefore, the formula for the volume of a rectangular based pyramid is: $$V = \frac{1}{3}hA$$ where $h$ is the height of the pyramid and $A$ is the area of the base. Now what about when the base is not a square? Will a circle work? A circular based pyramid is a cone:
The cone is exactly the same as the square based pyramid except that, instead of consisting of a stack of squares, it consists of a stack of circles (discs, to be exact). Now if each circle fits exactly inside the corresponding square of a square based pyramid, then the area of the circle is going to be a fixed fraction of the area of the square:
In terms of the radius of the circle ($r$), that fraction is: $$\frac{\pi r^2}{(2r)^2} = \frac{\pi r^2}{4r^2} = \frac{\pi}{4}$$ Now keeping in mind that this fraction will always be the same for any square in the stack of squares forming the pyramid (from the base to the top point), this means that the volume of the cone must be the same fraction of the volume of the pyramid. Therefore: $$V = \frac{\pi}{4}\frac{1}{3}h(2r)^2 = \frac{1}{3}h\pi r^2$$ In fact, the equation did not change from the previous one because it's still using the base area, except that the base area is now of a circle instead of a rectangle. What about a triangular based pyramid (tetrahedron)? Same concept. The stack is made of triangles instead of rectangles or circles:
The fraction of the area occupied by the triangle over the area occupied by the square is half: $$\frac{l^2/2}{l^2} = \frac{1}{2}$$ where $l$ is the side length of the square. So now the volume will be half that of the square based pyramid: $$V = \frac{1}{2}\frac{1}{3}h(l)^2 = \frac{1}{3}h\frac{l^2}{2} = \frac{1}{6}hl^2$$ Again we see that the base area equation still holds and must hold for any shaped base pyramid (where the base shape is linearly reduced to a point along the height of the pyramid).
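The stacked-slices argument can also be checked numerically: approximating the pyramid as a stack of thin square slabs whose sides shrink linearly to a point gives a volume that approaches $\frac{1}{3}hA$. A quick sketch in Python (the dimensions are arbitrary choices of mine):

```python
h = 5.0    # height of the pyramid
l = 2.0    # side length of the square base
A = l * l  # base area

n = 100000  # number of thin slices
dz = h / n
volume = 0.0
for k in range(n):
    scale = 1 - k / n            # side shrinks linearly from base to apex
    volume += A * scale**2 * dz  # each slice is a thin square slab

print(volume, h * A / 3)  # both approximately 6.667
```

Swapping the slice area for $\pi r^2$ or $l^2/2$ reproduces the cone and tetrahedron results in the same way.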

Sunday, August 30, 2020

Why the area under a velocity-time graph gives the distance travelled

I remember back when I was in school when, during physics lessons on motion, we would have velocity-time graphs of some car, where the graph tells you how fast the car was going at any given time, like the one below:

So this is saying that the car started from rest (0m/s), then accelerated to 20m/s, and so on. The interesting part is that you can apparently find the distance travelled by the car by finding the area under the graph. Eventually I learned that this is because the area under the graph is equal to the anti-derivative of the graph, which transforms the velocity-time graph back into a distance-time graph (since the derivative of the distance-time graph is the velocity-time graph). But back when I was doing this in school we didn't know about calculus, so we were not given a reason why this works, which is a shame because you actually don't need calculus if you're not dealing with curves.

So why does this work? Well, in order to find the area under the graph you first need to break it up into a set of basic shapes, as below: triangles, rectangles, trapezia, and so on. You then find the area of each individual basic shape and add them all up.

Let's start with the rectangle. Why is the area under a rectangle equal to the distance travelled? A rectangle means that the velocity for a period of time was constant. The width is the duration and the height is the velocity.

Now if the car was travelling at a constant speed of $v$ for duration $t$, what is the distance travelled? If $v = s/t$ then $s = vt$, that is, the distance travelled is the velocity multiplied by the duration, which just so happens to be the area of the rectangle.

Now one important thing to note about all the sub-shapes under the graph is that they are all different forms of trapezia. A trapezium with equal sides is a rectangle and a trapezium with one side being zero is a triangle. So we can focus on just the trapezium.

So now we can't just use the equation $v = s/t$ because that assumes that the velocity is constant, whereas here it has increased (steadily) from $v_1$ to $v_2$. Now without referring to Newton's equations of motion, it is clear that there exists a single constant velocity which is between $v_1$ and $v_2$ such that the distance travelled in the same duration is equal to the distance travelled after this acceleration. In other words, there is some rectangle with a height that is between $v_1$ and $v_2$ that should give the same answer. We usually call this height the average velocity and it is usually found by dividing the total distance travelled by the total duration of travelling. But here we don't have the distance travelled, only the initial and final velocity. So what is the average velocity of a uniformly increasing velocity?

The answer is the average height of the trapezium, which is $(v_1 + v_2)/2$. We can convince ourselves that this is the average height of the trapezium by sampling a number of points along the top side of the trapezium and finding the average of the sampled heights. Let's take trapezium C in the broken-down graph above, which is between time 20 and 25:

Of course since this is just a single case you might not be convinced yet; so let's do this algebraically using a variable amount of samples $n$:

$$ \hat{v} = \frac{\sum_{i \in 0...n-1}{(v_1 + \frac{i\times(v_2 - v_1)}{n-1})}}{n} \\ \hat{v} = \frac{\sum_{i \in 0...n-1}{v_1} + \sum_{i \in 0...n-1}{\frac{i\times(v_2 - v_1)}{n-1}}}{n} \\ \hat{v} = \frac{n v_1 + \frac{(v_2 - v_1)\times\sum_{i \in 0...n-1}{i}}{n-1}}{n} \\ \hat{v} = v_1 + \frac{(v_2 - v_1)\times\sum_{i \in 0...n-1}{i}}{n\times(n-1)} \\ \hat{v} = v_1 + \frac{(v_2 - v_1)\times\frac{n\times(n-1)}{2}}{n\times(n-1)} \\ \hat{v} = v_1 + \frac{v_2 - v_1}{2} \\ \hat{v} = v_1 + \frac{v_2}{2} - \frac{v_1}{2} \\ \hat{v} = \frac{v_2}{2} + \frac{v_1}{2} \\ \hat{v} = \frac{v_2 + v_1}{2} $$

Great! So now we know that a trapezium in a velocity-time graph is equivalent to a rectangle of the same width but with a height that is the average of the two sides:

We already know how to find the distance travelled from a rectangle shaped velocity-time graph, so let's use it on the above rectangle: $s = \frac{v_1 + v_2}{2} \times t$. It just so happens that this is also the area of the trapezium. So we can skip all of the above and just take the area of the trapezium to find the distance travelled.
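The sampling argument above can be replayed in a few lines of Python (the velocities are arbitrary example values): take many evenly spaced samples of a uniformly increasing velocity and the mean approaches $(v_1 + v_2)/2$:

```python
v1, v2 = 10.0, 20.0  # initial and final velocities
n = 1001             # number of evenly spaced samples

# velocity at sample i, rising uniformly from v1 to v2
samples = [v1 + i * (v2 - v1) / (n - 1) for i in range(n)]
average = sum(samples) / n

print(average, (v1 + v2) / 2)  # both 15.0 (up to floating point error)
```

The mean is exact here for any $n$, because the samples are symmetric about the midpoint of $v_1$ and $v_2$.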

All that's left is the case where the velocity is zero, which is just a flat line, but I'm sure you can guess what the distance travelled is then.

Tuesday, March 31, 2020

A differentiable version of Conway's Game of Life

Conway's Game of Life consists of a grid with black and white squares which flip in colour over time. The rules for changing the colour of a square are as follows:
  • If the current colour of a square is white then if it is surrounded by 2 or 3 adjacent white squares it stays white, otherwise it becomes black.
  • If the current colour of a square is black then if it is surrounded by exactly 3 white squares then it becomes white, otherwise it stays black.

If we treat the grid as a numeric matrix where 0 is black and 1 is white, then a differentiable version of these rules can be made. The idea here is to allow not just black and white but also shades of grey, so that the values of the squares can change only slightly and hence allow the changes to be differentiable. In order to force the values to remain between 0 and 1 we will make use of the sigmoid (logistic) function, which is defined as $y = \frac{1}{1 + e^{-x}}$.

The key trick here is to come up with a differentiable function that maps the number of white squares around a square to 0 or 1. We need two such functions: one where $y$ is 1 when $x$ is 2 or 3 and 0 otherwise, and another where $y$ is 1 when $x$ is 3 and 0 otherwise. We can make such a function using two sigmoid functions subtracted from each other as follows:

$y = \text{sig}(w \times (x - a)) - \text{sig}(w \times (x - b))$

The plot of this function starts at 0, then increases to 1 at $x = a$ (where $a$ is the halfway point as $y$ goes from 0 to 1), then stays 1 until $x = b$, at which point $y$ goes back to 0 (again, $b$ is the halfway point as $y$ goes from 1 to 0). $w$ controls how steep the change from 0 to 1 and back is, with a large $w$ meaning very steep.

So now we apply the above equation to a whole matrix of values, where $x$ is the number of adjacent white squares. Given that a square can also be grey, we instead just take the sum of the adjacent values as the count of white squares. This results in a count which is not a whole number but which is usable in a differentiable function.

We also need to be able to choose between using the rule for white squares or the rule for black squares. To do that we can just take a weighted average as follows:

$y = x \times a + (1 - x) \times b$

where $x$ is a number between 0 and 1 and a is the value $y$ should be when $x$ is 1 whilst $b$ is the value $y$ should be when $x$ is 0.

Given a matrix $G$, you can get the next iteration $G'$ using the following function:

Let $k$ be a 3 by 3 matrix consisting of all 1s and a single 0 in the center.
$N = conv(G, k)$, that is, convolve the kernel $k$ over the matrix $G$ in order to get a new matrix $N$ giving the number of adjacent white squares around every element in the matrix.
$B = \text{sig}(w \times (N - 2.5)) - \text{sig}(w \times (N - 3.5))$ (the resultant matrix if all squares were black)
$W = \text{sig}(w \times (N - 1.5)) - \text{sig}(w \times (N - 3.5))$ (the resultant matrix if all squares were white)
$G' = G \times W + (1 - G) \times B$
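Putting these steps together, here is a sketch in Python using NumPy; the wrap-around boundary handling and the steepness $w = 20$ are my own choices, not from the post. With a steep $w$, one iteration reproduces the classic discrete rules, for example turning a vertical "blinker" into a horizontal one:

```python
import numpy as np

def sig(x):
    # sigmoid (logistic) function
    return 1 / (1 + np.exp(-x))

def step(G, w=20.0):
    # N: sum of the 8 neighbours of every square (wrap-around boundary)
    N = sum(np.roll(np.roll(G, dy, axis=0), dx, axis=1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    B = sig(w * (N - 2.5)) - sig(w * (N - 3.5))  # black rule: exactly 3 -> white
    W = sig(w * (N - 1.5)) - sig(w * (N - 3.5))  # white rule: 2 or 3 -> stays white
    return G * W + (1 - G) * B                   # weighted average picks the rule

G = np.zeros((5, 5))
G[1:4, 2] = 1.0     # vertical blinker
G = step(G)
print(np.round(G))  # horizontal blinker along the middle row
```

A smaller $w$ blurs the transition, which is exactly what makes the update differentiable.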

Thursday, February 27, 2020

Proof that the sum of all natural numbers is equal to -1/12 (provided we make an assumption)

This is one of those weird proofs that turns out has practical value in physics. If you add together all the numbers from 1 to infinity, the answer will be $-\frac{1}{12}$. Here's how we get this.

We start from the simpler question of what is 1 - 1 + 1 - 1 + 1 - 1 + ... for an infinite number of terms. If you stop the series at a 1 then the answer is 1 but if you stop at a -1 then the answer is 0. Since the series never stops then we can make the assumption that the answer is 0.5, which is the average. This is like if you had a light switch that you were constantly switching on and off. If done at a fast enough frequency, the light will seem to be halfway light and dark, which is 0.5.

$$\sum_{i=0}^{\infty}{(-1)^i} = \frac{1}{2}$$

If you accept the above assumption then the rest is rigorous.

We next need to find what 1 - 2 + 3 - 4 + 5 - ... is. This can be reduced to the above identity by applying the following trick:

S   = 1 - 2 + 3 - 4 + 5 - ...

S+S = 1 - 2 + 3 - 4 + 5 - ...
        + 1 - 2 + 3 - 4 + ...

    = 1 - 1 + 1 - 1 + 1 - ...

Therefore $2S = \frac{1}{2}$ which means that $S = \frac{1}{4}$.

$$\sum_{i=1}^{\infty}{(-1)^{i+1} i} = \frac{1}{4}$$

Finally we get to our original question:

S'   = 1 + 2 + 3 + 4 + 5 + ...

S'-S = 1 + 2 + 3 + 4 + 5 + 6 + ...
     - 1 + 2 - 3 + 4 - 5 + 6 - ...

     = 0 + 4 + 0 + 8 + 0 + 12 + ...

Which are the even numbers multiplied by 2, that is, the multiples of 4.

S'-S = 4 + 8 + 12 + ...
     = 4(1 + 2 + 3 + ...)
     = 4S'

Now if S' - S = 4S' then

S' - S   = 4S'
S' - 4S' = S
-3S'     = S
S'       = -S/3
         = -(1/4)/3 = -1/12

$$\sum_{i=1}^{\infty}{i} = -\frac{1}{12}$$

Sunday, January 26, 2020

The Whittaker-Shannon interpolation formula (inferring missing values from a sampled signal using sinc or Lanczos function)

In an earlier blog post we had talked about Lagrange polynomial interpolation which is when you need to find a polynomial that passes through a set of points. This time we will be seeing a similar problem that is encountered in signal processing.

In discrete signal processing, a continuous signal is represented by a discrete signal consisting of values taken at regular intervals in time, a process called sampling. Below we have an example of a continuous signal that is then sampled at intervals of 0.9 (arbitrarily chosen here).





Now it may be the case that you would like to reconstruct the original signal from the sampled signal. You may think that there is an infinite number of signals that can fit the same samples but because the discrete values are sampled at regular intervals, there is actually only one signal that can fit it, provided that the sampling rate is greater than twice the highest frequency of the original signal. If this condition is met then we can construct the original signal using the Whittaker-Shannon interpolation formula.

The process is similar to that of the Lagrange polynomial interpolation. In polynomial interpolation, the trick is to find a set of component polynomials each of which passes through one of the given set of points. In order to be able to combine them all together into a single polynomial that passes through all of the given set of points, the component polynomials also need to equal zero wherever there is a different point. This makes sure that adding the polynomials together will not cause an interference between them at the given set of points.

In Whittaker-Shannon interpolation, the components are sinc functions instead of polynomials. The sinc function is defined as follows:

$$
\text{sinc}(x) = \begin{cases}
1 & \text{for } x = 0 \\
\frac{\sin(x)}{x} & \text{otherwise} \\
\end{cases}
$$



See how sinc is equal to 1 at x = 0 and equal to zero at regular intervals around it? This is how we'll take advantage of it as a component. The first thing we need to do is to synchronise the intervals at which it equals zero to the intervals of the sampled signal. If the sample interval, or period, is 0.9, then the synchronised sinc function would be:

$$
y = sinc\left(\frac{\pi x}{0.9}\right)
$$



Next we need to shift the center peak of the synchronised sinc over one of the samples. Let's pick sample number 3 where sample number 0 is the first sample:

$$
y = sinc\left(\frac{\pi (x - 3 \times 0.9)}{0.9}\right)
$$



Finally, we need to adjust the height of the peak to be equal to that sample's value. Since the peak has a value of one and all the other important points on the sinc function are equal to zero, all we have to do is multiply the sinc function by the sample value:

$$
y = -0.3303 \times sinc\left(\frac{\pi (x - 3 \times 0.9)}{0.9}\right)
$$



Great! Now we just need to do the same for all the other samples and add up all the sinc functions into a single function:

$$
y =
0 \times sinc\left(\frac{\pi (x - 0 \times 0.9)}{0.9}\right) +
0.7628 \times sinc\left(\frac{\pi (x - 1 \times 0.9)}{0.9}\right) +
-0.4309 \times sinc\left(\frac{\pi (x - 2 \times 0.9)}{0.9}\right) +
-0.3303 \times sinc\left(\frac{\pi (x - 3 \times 0.9)}{0.9}\right) +
\dots
$$



Note that the point made at the top, that there is only one signal which passes through the given samples, only applies when there is an infinite supply of such samples.
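As a sanity check, the summation above can be sketched in a few lines of Python (the sample values and the 0.9 period are those used in the example):

```python
import math

def sinc(x):
    # sin(x)/x with the removable hole at x = 0 filled in with 1
    return 1.0 if x == 0 else math.sin(x) / x

def whittaker_shannon(samples, period, x):
    # One scaled, shifted sinc per sample; each sinc peaks at its own
    # sample position and is zero at every other sample position.
    return sum(s * sinc(math.pi * (x - i * period) / period)
               for i, s in enumerate(samples))

samples = [0.0, 0.7628, -0.4309, -0.3303]  # the sample values used above
T = 0.9                                    # the sample period used above

# At a sample position the interpolation reproduces the sample exactly
# (up to floating point error):
print(abs(whittaker_shannon(samples, T, 3 * T) - (-0.3303)) < 1e-9)  # True
```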



Now, the sinc function is only useful when you want to make use of all points together in order to infer any missing point. In practice, this is very computationally expensive and it is better to use only the points that are close to the missing point rather than all the points. To do this, instead of the sinc function we use a truncated sinc function called a Lanczos filter:

$$
\text{lanczos}(x, a) = \begin{cases}
\text{sinc}(\pi x) \times \text{sinc}\left(\frac{\pi x}{a}\right) & -a < x < a \\
0 & \text{otherwise} \\
\end{cases}
$$

where $a$ is some positive whole number, such as 2, and is the number of points to use on each side of the point to infer.



Now we can do the same thing as before but with Lanczos instead of sinc, taking care that the sinc within the lanczos function already multiplies $x$ by $\pi$:

$$
y =
0 \times \text{lanczos}\left(\frac{x - 0 \times 0.9}{0.9}, 2\right) +
0.7628 \times \text{lanczos}\left(\frac{x - 1 \times 0.9}{0.9}, 2\right) +
-0.4309 \times \text{lanczos}\left(\frac{x - 2 \times 0.9}{0.9}, 2\right) +
-0.3303 \times \text{lanczos}\left(\frac{x - 3 \times 0.9}{0.9}, 2\right) +
\dots
$$



The Lanczos filter can be used in things like image enlargement. By treating each pixel in an image as a sample from a continuous field, the values in between existing pixels can be deduced by reconstructing the continuous field with Lanczos interpolation and then reading off the missing values.

In general, for samples $s$ and a sample period of $T$,

$$y = \sum_i s[i] \times sinc\left(\frac{\pi (x - i \times T)}{T}\right)$$
$$y = \sum_i s[i] \times \text{lanczos}\left(\frac{x - i \times T}{T}, a\right)$$
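The truncated version can be sketched the same way, using the example's samples and assuming $a = 2$:

```python
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(x) / x

def lanczos(x, a=2):
    # Truncated sinc: only non-zero within a sample positions of its centre.
    return sinc(math.pi * x) * sinc(math.pi * x / a) if -a < x < a else 0.0

def lanczos_interpolate(samples, period, x, a=2):
    # Note that lanczos already multiplies by pi internally, so we pass
    # the plain (x - i*T)/T offset, as in the formula above.
    return sum(s * lanczos((x - i * period) / period, a)
               for i, s in enumerate(samples))

samples = [0.0, 0.7628, -0.4309, -0.3303]
T = 0.9

# Sample positions are still reproduced, but each missing point now only
# depends on the 2*a nearest samples instead of all of them.
print(abs(lanczos_interpolate(samples, T, 1 * T) - 0.7628) < 1e-9)  # True
```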

Wednesday, October 30, 2019

No square number can end in 2, 3, 7, or 8 (modulo of square numbers)

Look at the following list of square numbers:
 0^2:   0
 1^2:   1
 2^2:   4
 3^2:   9
 4^2:  16
 5^2:  25
 6^2:  36
 7^2:  49
 8^2:  64
 9^2:  81
10^2: 100
11^2: 121
12^2: 144
13^2: 169
14^2: 196
15^2: 225
16^2: 256
17^2: 289
18^2: 324
19^2: 361
20^2: 400
21^2: 441
22^2: 484
23^2: 529
24^2: 576
25^2: 625
26^2: 676
27^2: 729
28^2: 784
29^2: 841
30^2: 900

Notice how the last digit is always 0, 1, 4, 9, 6, or 5? There also seems to be a cyclic pattern of digits repeating themselves but we'll talk about that in another post. In this post, we'll just talk about why certain digits do not occur at the end of square numbers, which is useful in order to check if a number is a square or not.

We can get the last digit of a number by dividing it by 10 and keeping the remainder, also known as the number modulo 10:

 0^2 mod 10: 0
 1^2 mod 10: 1
 2^2 mod 10: 4
 3^2 mod 10: 9
 4^2 mod 10: 6
 5^2 mod 10: 5
 6^2 mod 10: 6
 7^2 mod 10: 9
 8^2 mod 10: 4
 9^2 mod 10: 1
10^2 mod 10: 0
11^2 mod 10: 1
12^2 mod 10: 4
13^2 mod 10: 9
14^2 mod 10: 6
15^2 mod 10: 5
16^2 mod 10: 6
17^2 mod 10: 9
18^2 mod 10: 4
19^2 mod 10: 1
20^2 mod 10: 0
21^2 mod 10: 1
22^2 mod 10: 4
23^2 mod 10: 9
24^2 mod 10: 6
25^2 mod 10: 5
26^2 mod 10: 6
27^2 mod 10: 9
28^2 mod 10: 4
29^2 mod 10: 1
30^2 mod 10: 0

It turns out that this sort of exclusion of certain modulo results from square numbers happens with many other numbers apart from 10. Take 12 for example:

 0^2 mod 12: 0
 1^2 mod 12: 1
 2^2 mod 12: 4
 3^2 mod 12: 9
 4^2 mod 12: 4
 5^2 mod 12: 1
 6^2 mod 12: 0
 7^2 mod 12: 1
 8^2 mod 12: 4
 9^2 mod 12: 9
10^2 mod 12: 4
11^2 mod 12: 1
12^2 mod 12: 0
13^2 mod 12: 1
14^2 mod 12: 4
15^2 mod 12: 9
16^2 mod 12: 4
17^2 mod 12: 1
18^2 mod 12: 0
19^2 mod 12: 1
20^2 mod 12: 4
21^2 mod 12: 9
22^2 mod 12: 4
23^2 mod 12: 1
24^2 mod 12: 0
25^2 mod 12: 1
26^2 mod 12: 4
27^2 mod 12: 9
28^2 mod 12: 4
29^2 mod 12: 1
30^2 mod 12: 0

With 12 you only get 0, 1, 4, and 9 as remainders, which is fewer than with 10. Why does this happen? We can prove that this is the case using modulo algebra. Let's call the number being squared $x$ and the modulus $n$. Then by the distributive law of the modulo operation:

$x^2 \text{ mod } n = (x \times x) \text{ mod } n = ((x \text{ mod } n) \times (x \text{ mod } n)) \text{ mod } n = (x \text{ mod } n)^2 \text{ mod } n$

What the above derivation says is that the remainder of a square number depends only on the remainder of its root. In the case of the last digit of a square number, which is equivalent to the square number modulo 10, the last digit will be one of the last digits of the single digits squared: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81. This is because $x \text{ mod } 10$ gives you the last digit of the number $x$, which is a single digit; this digit is then squared and the last digit of the square is the answer.

In the general case, this is setting a limit on the number of possible remainders (usually $x \text{ mod } n$ has $n$ different possible answers for all $x$). The number of different remainders is first limited to $n$ with the first inner modulo but then only the remainders resulting from these numbers squared can be returned. This means that if one of the square numbers has the same modulo as another square number, then the total number of final remainders has decreased.
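A minimal sketch of this limiting effect, collecting the possible remainders of square numbers modulo $n$:

```python
def square_residues(n):
    # By the identity above, x^2 mod n can only ever be one of the
    # remainders of 0..n-1 squared, whatever x is.
    return sorted({(x * x) % n for x in range(n)})

print(square_residues(10))  # [0, 1, 4, 5, 6, 9]
print(square_residues(12))  # [0, 1, 4, 9]
```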

In a future post we will investigate why the remainders of square numbers are cyclic.

Friday, August 30, 2019

The Fourier transform for real numbers (approximating complicated periodic functions with simple sinusoids)

In a previous post we saw how to approximate complicated functions using simple polynomials by using the Taylor series approximation method. This consisted in assuming that your complicated function was an actual polynomial and then using mathematical tricks to tease out the coefficients of the polynomial one by one. Polynomials are useful for approximating curves but are terrible at approximating periodic functions such as $sin(x)$ (functions that repeat themselves periodically) because polynomials themselves are not periodic. In this post we'll see how to approximate periodic functions by using a sum of simple sinusoids (sines or cosines). With polynomials, we had a summation of powers of $x$, each with a different power. Now we will instead have a summation of sinusoids, each with a different whole number frequency. With polynomials we had to find the coefficients, the constants multiplied by each power of $x$, and any curve could be arbitrarily approximated by finding the right coefficients. Now we will be finding amplitudes, the constants multiplied by each sinusoid, and any periodic function can be arbitrarily approximated by finding the right amplitudes.

Instead of the Taylor series approximation we will now be using the Fourier series approximation, also known as the Fourier transform. In order to keep things simple we will only be looking at real functions rather than complex ones (functions with imaginary components). The Fourier series assumes that the periodic function we're deconstructing, $f(x)$, begins its period at 0 and ends it at $T$, after which $f(x)$ will continue repeating itself such that $f(T + k) = f(k)$. If we'd like our period to start at $x_0$ instead then we can easily just replace our periodic function with $f'(x)$ such that $f'(x) = f(x + x_0)$ and then subtract $x_0$ from $x$ again after getting the approximate function.

Let's say that we have a complex periodic function where the first period is between 0 and 2 and is defined as $y = \cosh(x-1)$. This is what it looks like:



What we mean by this being a periodic function is that if we looked beyond the first period we would see the function looking like this:



We're assuming that our function consists of a summation of sinusoids with different frequencies and amplitudes such that each wavelength is equal to the period of the original function or a whole number division of that period (half, quarter, etc.). To make a sine or cosine have a period equal to $T$, we use $sin(x\frac{2\pi}{T})$ and $cos(x\frac{2\pi}{T})$. Multiplying $x$ by a whole number will change the frequency of the sinusoid such that the wavelength is equal to a whole number division of $T$. For example, $sin(2 x\frac{2\pi}{2})$ will make the sine function repeat itself twice when going from 0 to 2. This is what our sinusoid functions look like for both sine and cosine:



By adding up these sines and cosines together whilst having specific amplitudes (that we need to find) we can get closer and closer to the original periodic function.

The Fourier transform takes advantage of an interesting quirk of sinusoids and the area under their graph. For two positive whole numbers $n$ and $m$, if you multiply $\sin(n x\frac{2\pi}{T})$ by $\sin(m x\frac{2\pi}{T})$ or $\cos(n x\frac{2\pi}{T})$ by $\cos(m x\frac{2\pi}{T})$, the area under the graph between 0 and $T$ will always be zero, with the only exception being when $n = m$. If $n = m$ then the area will be equal to $\frac{T}{2}$. In other words,

$$
\int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \sin\left(m x\frac{2\pi}{T}\right) dx =
\begin{cases}
0& \text{if } n \neq m\\
\frac{T}{2}& \text{otherwise}
\end{cases}
$$
and likewise for $\cos$ instead of $\sin$. Here's an example showing the area under the graph of two cosines with different frequencies being multiplied together:



Note how each hump above the x-axis has a corresponding anti-hump below the x-axis which cancel each other out, resulting in a total area of zero. Here's an example showing the area under the graph of two cosines with the same frequency being multiplied together:



Note how the area is all above the x-axis. If we had to measure this area, it would be equal to $\frac{T}{2}$ where $T = 2$. Let's prove this for all frequencies.

Proof that the area under the product of two sines with different frequencies is 0:
$$
\begin{align}
\int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \sin\left(m x\frac{2\pi}{T}\right) dx
&= \int_{0}^{T} \frac{1}{2}\left(\cos\left(n x\frac{2\pi}{T} - m x\frac{2\pi}{T}\right) - \cos\left(n x\frac{2\pi}{T} + m x\frac{2\pi}{T}\right)\right) dx\\
&= \frac{1}{2}\left(\int_{0}^{T} \cos\left((n - m) \frac{2\pi}{T}x\right) dx - \int_{0}^{T} \cos\left((n + m) \frac{2\pi}{T}x\right) dx\right)\\
&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left[\sin\left((n - m) \frac{2\pi}{T}x\right)\right]_{0}^{T} - \frac{1}{(n + m) \frac{2\pi}{T}}\left[\sin\left((n + m) \frac{2\pi}{T}x\right)\right]_{0}^{T}\right)\\
&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) \frac{2\pi}{T}T\right) - \sin\left((n - m) \frac{2\pi}{T}0\right)\right) - \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) \frac{2\pi}{T}T\right) - \sin\left((n + m) \frac{2\pi}{T}0\right)\right)\right)\\
&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) 2\pi\right) - \sin\left(0\right)\right) - \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) 2\pi\right) - \sin\left(0\right)\right)\right)\\
&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right) - \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\
&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(0 - 0\right) - \frac{1}{(n + m) \frac{2\pi}{T}}\left(0 - 0\right)\right)\\
&= 0\\
\end{align}
$$

Proof that the area under the product of two sines with the same frequencies is half their period:
$$
\begin{align}
\int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \sin\left(n x\frac{2\pi}{T}\right) dx
&= \int_{0}^{T} \sin^2\left(n x\frac{2\pi}{T}\right) dx\\
&= \int_{0}^{T} \frac{1}{2}\left(1 - \cos\left(2n x\frac{2\pi}{T}\right)\right) dx\\
&= \frac{1}{2}\left(\int_{0}^{T} 1 dx - \int_{0}^{T}\cos\left(2n\frac{2\pi}{T} x\right) dx\right)\\
&= \frac{1}{2}\left(\left[x \right]_{0}^{T} - \frac{1}{2n\frac{2\pi}{T}}\left[\sin\left(2n\frac{2\pi}{T} x\right)\right]_{0}^{T}\right)\\
&= \frac{1}{2}\left(\left(T - 0\right) - \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n\frac{2\pi}{T} T\right) - \sin\left(2n\frac{2\pi}{T} 0\right)\right)\right)\\
&= \frac{1}{2}\left(T - \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n2\pi\right) - \sin\left(0\right)\right)\right)\\
&= \frac{1}{2}\left(T - \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\
&= \frac{1}{2}\left(T - \frac{1}{2n\frac{2\pi}{T}}\left(0 - 0\right)\right)\\
&= \frac{T}{2}\\
\end{align}
$$

Proof that the area under the product of two cosines with different frequencies is 0:
$$
\begin{align}
\int_{0}^{T} \cos\left(n x\frac{2\pi}{T}\right) \cos\left(m x\frac{2\pi}{T}\right) dx
&= \int_{0}^{T} \frac{1}{2}\left(\cos\left(n x\frac{2\pi}{T} - m x\frac{2\pi}{T}\right) + \cos\left(n x\frac{2\pi}{T} + m x\frac{2\pi}{T}\right)\right) dx\\
&= \frac{1}{2}\left(\int_{0}^{T} \cos\left((n - m) \frac{2\pi}{T}x\right) dx + \int_{0}^{T} \cos\left((n + m) \frac{2\pi}{T}x\right) dx\right)\\
&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left[\sin\left((n - m) \frac{2\pi}{T}x\right)\right]_{0}^{T} + \frac{1}{(n + m) \frac{2\pi}{T}}\left[\sin\left((n + m) \frac{2\pi}{T}x\right)\right]_{0}^{T}\right)\\
&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) \frac{2\pi}{T}T\right) - \sin\left((n - m) \frac{2\pi}{T}0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) \frac{2\pi}{T}T\right) - \sin\left((n + m) \frac{2\pi}{T}0\right)\right)\right)\\
&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) 2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) 2\pi\right) - \sin\left(0\right)\right)\right)\\
&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\
&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(0 - 0\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(0 - 0\right)\right)\\
&= 0\\
\end{align}
$$

Proof that the area under the product of two cosines with the same frequencies is half their period:
$$
\begin{align}
\int_{0}^{T} \cos\left(n x\frac{2\pi}{T}\right) \cos\left(n x\frac{2\pi}{T}\right) dx
&= \int_{0}^{T} \cos^2\left(n x\frac{2\pi}{T}\right) dx\\
&= \int_{0}^{T} \frac{1}{2}\left(1 + \cos\left(2n x\frac{2\pi}{T}\right)\right) dx\\
&= \frac{1}{2}\left(\int_{0}^{T} 1 dx + \int_{0}^{T}\cos\left(2n\frac{2\pi}{T} x\right) dx\right)\\
&= \frac{1}{2}\left(\left[x \right]_{0}^{T} + \frac{1}{2n\frac{2\pi}{T}}\left[\sin\left(2n\frac{2\pi}{T} x\right)\right]_{0}^{T}\right)\\
&= \frac{1}{2}\left(\left(T - 0\right) + \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n\frac{2\pi}{T} T\right) - \sin\left(2n\frac{2\pi}{T} 0\right)\right)\right)\\
&= \frac{1}{2}\left(T + \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n2\pi\right) - \sin\left(0\right)\right)\right)\\
&= \frac{1}{2}\left(T + \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\
&= \frac{1}{2}\left(T + \frac{1}{2n\frac{2\pi}{T}}\left(0 - 0\right)\right)\\
&= \frac{T}{2}\\
\end{align}
$$

Also interestingly, proof that the area under the product of a sine and cosine, regardless of frequency, is 0 (the derivation below assumes $n \neq m$; when $n = m$, the $(n - m)$ term is simply $\int_{0}^{T} \sin(0) \, dx = 0$, so the result still holds):
$$
\begin{align}
\int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \cos\left(m x\frac{2\pi}{T}\right) dx
&= \int_{0}^{T} \frac{1}{2}\left(\sin\left(n x\frac{2\pi}{T} + m x\frac{2\pi}{T}\right) + \sin\left(n x\frac{2\pi}{T} - m x\frac{2\pi}{T}\right)\right) dx\\
&= \frac{1}{2}\left(\int_{0}^{T} \sin\left((n + m) \frac{2\pi}{T}x\right) dx + \int_{0}^{T} \sin\left((n - m) \frac{2\pi}{T}x\right) dx\right)\\
&= \frac{1}{2}\left(-\frac{1}{(n + m) \frac{2\pi}{T}}\left[\cos\left((n + m) \frac{2\pi}{T}x\right)\right]_{0}^{T} - \frac{1}{(n - m) \frac{2\pi}{T}}\left[\cos\left((n - m) \frac{2\pi}{T}x\right)\right]_{0}^{T}\right)\\
&= \frac{1}{2}\left(-\frac{1}{(n + m) \frac{2\pi}{T}}\left(\cos\left((n + m) 2\pi\right) - \cos\left(0\right)\right) - \frac{1}{(n - m) \frac{2\pi}{T}}\left(\cos\left((n - m) 2\pi\right) - \cos\left(0\right)\right)\right)\\
&= \frac{1}{2}\left(-\frac{1}{(n + m) \frac{2\pi}{T}}\left(1 - 1\right) - \frac{1}{(n - m) \frac{2\pi}{T}}\left(1 - 1\right)\right)\\
&= 0\\
\end{align}
$$
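These identities can also be checked numerically; a quick sketch using a simple midpoint-rule integrator (the specific frequencies 2, 3, 4, and 5 are arbitrary choices):

```python
import math

T = 2.0
w = 2 * math.pi / T  # converts a whole-number frequency into radians per unit

def integrate(g, lo, hi, steps=10000):
    # Midpoint rule; for smooth periodic integrands over a full period
    # this is extremely accurate.
    h = (hi - lo) / steps
    return sum(g(lo + (k + 0.5) * h) for k in range(steps)) * h

# Different frequencies: the humps and anti-humps cancel out.
print(abs(integrate(lambda x: math.sin(2 * w * x) * math.sin(5 * w * x), 0, T)) < 1e-9)      # True
# Same frequency: the area is T/2 = 1.0.
print(abs(integrate(lambda x: math.sin(3 * w * x) * math.sin(3 * w * x), 0, T) - T / 2) < 1e-9)  # True
# A sine times a cosine: zero regardless of frequency.
print(abs(integrate(lambda x: math.sin(3 * w * x) * math.cos(4 * w * x), 0, T)) < 1e-9)      # True
```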

Great! Now we can get back to the Fourier transform. Suppose that we have the following sum of sines and cosines:
$$
\begin{align}
f(x) = &a_1 \cos\left(1 x\frac{2\pi}{T}\right) + b_1 \sin\left(1 x\frac{2\pi}{T}\right) + \\
&a_2 \cos\left(2 x\frac{2\pi}{T}\right) + b_2 \sin\left(2 x\frac{2\pi}{T}\right) + \\
&\dots
\end{align}
$$

How can we tease out the individual amplitudes $a_i$ and $b_i$? Thanks to the identities we proved above, we can now get any amplitude we want by multiplying the function by a sine or cosine, finding the area under the graph, and dividing the area by $\frac{T}{2}$. Here's how it works:
$$
\begin{align}
\frac{2}{T}\int_0^T \cos\left(n x\frac{2\pi}{T}\right) f(x) dx
&= \frac{2}{T}\int_0^T \cos\left(n x\frac{2\pi}{T}\right)
\begin{pmatrix}
&a_1 cos\left(1 x\frac{2\pi}{T}\right) & + b_1 sin\left(1 x\frac{2\pi}{T}\right) + \\
&a_2 cos\left(2 x\frac{2\pi}{T}\right) & + b_2 sin\left(2 x\frac{2\pi}{T}\right) + \\
&\dots&
\end{pmatrix}
dx\\
&= \frac{2}{T}
\begin{pmatrix}
&a_1 \int_0^T\cos\left(n x\frac{2\pi}{T}\right) cos\left(1 x\frac{2\pi}{T}\right) dx & + b_1 \int_0^T\cos\left(n x\frac{2\pi}{T}\right) sin\left(1 x\frac{2\pi}{T}\right) dx + \\
&\dots&\\
&a_n \int_0^T\cos\left(n x\frac{2\pi}{T}\right) cos\left(n x\frac{2\pi}{T}\right) dx & + b_n \int_0^T\cos\left(n x\frac{2\pi}{T}\right) sin\left(n x\frac{2\pi}{T}\right) dx + \\
&\dots&
\end{pmatrix}\\
&= \frac{2}{T}
\begin{pmatrix}
&a_1 0 & + b_1 0 + \\
&\dots&\\
&a_n \frac{T}{2} & + b_n 0 + \\
&\dots&
\end{pmatrix}\\
&= \frac{2}{T} a_n \frac{T}{2}\\
&= a_n
\end{align}
$$

And obviously multiplying by $\sin\left(n x\frac{2\pi}{T}\right)$ would yield the amplitude of a sine. All that's left is by how much to vertically lift the graph, that is, the constant term to add to the sinusoids. The sinusoids should oscillate around the average value of the original function in one period, which is found as follows:

$y_0 = \frac{1}{T} \int_0^T f(x) dx$

Getting back to $f(x) = \cosh(x-1)$, here are the amplitudes:

$y_0 = \frac{1}{2} \int_0^2 f(x) dx = 1.1752$
$a_1 = \frac{2}{2} \int_0^2 \cos\left(1 x\frac{2\pi}{2}\right) f(x) dx = 0.2162$
$b_1 = \frac{2}{2} \int_0^2 \sin\left(1 x\frac{2\pi}{2}\right) f(x) dx = 0.0$
$a_2 = \frac{2}{2} \int_0^2 \cos\left(2 x\frac{2\pi}{2}\right) f(x) dx = 0.0581$
$b_2 = \frac{2}{2} \int_0^2 \sin\left(2 x\frac{2\pi}{2}\right) f(x) dx = 0.0$
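As a sanity check, these amplitudes can be reproduced numerically; a sketch using a simple midpoint-rule integrator (the integrator and step count are implementation choices, not part of the method):

```python
import math

T = 2.0
f = lambda x: math.cosh(x - 1)  # the example function, with period T = 2

def integrate(g, lo, hi, steps=50000):
    # Midpoint rule; accurate enough here to recover four decimal places.
    h = (hi - lo) / steps
    return sum(g(lo + (k + 0.5) * h) for k in range(steps)) * h

y0 = integrate(f, 0, T) / T
a = [(2 / T) * integrate(lambda x, n=n: math.cos(n * x * 2 * math.pi / T) * f(x), 0, T)
     for n in (1, 2)]
b = [(2 / T) * integrate(lambda x, n=n: math.sin(n * x * 2 * math.pi / T) * f(x), 0, T)
     for n in (1, 2)]

print(round(y0, 4), round(a[0], 4), round(a[1], 4))  # 1.1752 0.2162 0.0581
```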

And here are the graphs of the approximated function getting better with every new term (where the approximated function is denoted as $\hat{f}$):

$\hat{f}(x) = 1.1752$


$\hat{f}(x) = 1.1752 + 0.2162 \cos\left(1 x\frac{2\pi}{2}\right)$


$\hat{f}(x) = 1.1752 + 0.2162\cos\left(1 x\frac{2\pi}{2}\right) + 0.0581\cos\left(2 x\frac{2\pi}{2}\right)$


And of course here is the function beyond the first period:

Thursday, December 27, 2018

Language models and the unknown token: Why perplexity gets worse as the corpus size increases

Unless you're making use of a character-based language model somehow, you're going to have a finite vocabulary of words that you can handle, so you need a way to handle out-of-vocabulary words. One common way to handle such words is to replace them all with the unknown token, a pseudo word that stands in for all out-of-vocabulary words. The unknown token has several advantages. Infrequent words in the training corpus will probably either not be learned by the neural network or not be used when generating sentences, and ignoring them saves a lot in model size as the embedding layer and softmax take a lot of parameters. Furthermore, all the infrequent words put together will occur very frequently, so the language model will definitely learn to use the unknown token. It also makes it possible to handle new words at test time (that are not found even once in the training corpus) as they will be replaced by the unknown token.

The problem with this approach is what happens when measuring the probability or perplexity of a sentence based on the probabilities of individual words. If you're comparing language models to see which ones make the best predictions, you usually use them all on a corpus to see how well they predict the words in the corpus. The higher the probabilities assigned to those sentences, the better the language model. This is usually measured using language model perplexity. But see what happens when you vary the vocabulary size. You will find that smaller vocabulary sizes lead to better perplexity scores, even though this makes no sense.

It turns out that if you just multiply all the probabilities of individual words as-is, including that of the unknown token, then your probability will be sensitive to the vocabulary size. Let's say that you have a vocabulary size of 0, that is, you're considering all the words to be out-of-vocabulary and hence all of them will be replaced by the unknown token. This means that during training, the language model will learn that after every unknown token there is (probably) another unknown token, unless it's the end of the sentence. This will make the language model give very high probabilities for the unknown token. High word probabilities mean high sentence probabilities which mean good perplexities.

Now if we add another word to the vocabulary then we'll be introducing some uncertainty into the language model as now it has to decide between using the unknown token or the known word. Even in a perfect language model, the same prefix of words can be followed by either of the two words so there is no way to correctly assign 100% of the probability to one or the other. This means that the probabilities will be split between the two words, leading to an overall decrease in probabilities, leading to a worse perplexity. Adding more words to the vocabulary makes this even worse, which means that language models with smaller vocabularies have a huge unfair advantage over language models that actually do their job and correctly predict the right word.

We can't do away with the unknown token but we can strip away the unknown token's power. Assuming that all the language models are being evaluated on the same corpus, then different vocabularies will have different words being turned into the unknown token. Let's say that your language model considers 1000 different words in its vocabulary but the corpus you're evaluating it on has 500 different words that are out-of-vocabulary. So in reality your language model is predicting one of 1500 different words; it's just that 500 of those words are assumed to be a single word with a single probability. But really there should be 500 separate probabilities for those out-of-vocabulary words and not just one. If we avoid merging all those probabilities into one, then all the language models will have a fair comparison as they will all have the same vocabulary and they will all have the same amount of uncertainty about which word should come next. The question is how to distribute that single unknown token probability between the 500 out-of-vocabulary words. The simplest solution is to assume a uniform distribution and just give each word the same slice of probability from the whole. So if the unknown token has a probability of $p$, then each out-of-vocabulary word gets a probability of $\frac{p}{500}$.

Now every time you encounter the unknown token in the evaluation corpus you know that the token is being used in place of one of those 500 words. But you don't know which one it is. Not a problem, just divide the probability by 500 and it's as if all words in the corpus are in the vocabulary. Do this to every unknown token probability and now you have a fair measure of perplexity. Let's see an example.

Let's say that we want to find the probability of the following sentence:
the loud dog barked at the loud man

and let's say that the language model we're using to do that has the following vocabulary:
the at dog man

this means that the sentence is now changed to:
the UNK dog UNK at the UNK man

Now the naive way to get the probability of the sentence is as follows:

$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})
$$

But now with the new way we'll divide the unknown token's probabilities by 2, the number of different out of vocabulary words ('loud' and 'barked'):

$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the})/2 \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog})/2 \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the})/2 \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})
$$

Of course we can leave the re-weighting till the end of the equation by dividing the first equation by the number of different out-of-vocabulary words for as many times as there are unknown tokens, like this:

$$
P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})/ 2^3
$$

Now the sentence probability goes up as the vocabulary size increases!
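A minimal sketch of this re-weighting, done in log space to avoid underflow (the token probabilities below are made up for illustration; only the division per unknown token comes from the method above):

```python
import math

def sentence_log_prob(tokens, token_probs, num_oov_types):
    # Sum the per-token log probabilities, dividing each UNK probability
    # by the number of different out-of-vocabulary word types.
    total = 0.0
    for tok, p in zip(tokens, token_probs):
        total += math.log(p)
        if tok == "UNK":
            total -= math.log(num_oov_types)
    return total

tokens = "the UNK dog UNK at the UNK man".split()
probs = [0.2, 0.5, 0.1, 0.5, 0.3, 0.2, 0.5, 0.1]  # hypothetical model outputs
naive = sum(math.log(p) for p in probs)
fair = sentence_log_prob(tokens, probs, num_oov_types=2)

# Equivalent to dividing the naive probability by 2^3 (one factor of 2
# per UNK token), as in the equations above:
print(math.isclose(fair, naive - 3 * math.log(2)))  # True
```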

Wednesday, October 24, 2018

Mentally estimating square roots

What's $\sqrt{10}$? I'd reckon it's about $3 + \frac{10-9}{2 \times 3} = 3.166$. The actual answer is 3.162. What about $\sqrt{18}$? Should be about $4 + \frac{18-16}{2 \times 4} = 4.250$. The actual answer is 4.242. I found out about this calculation from this video but there was no explanation for why it works. Here's an explanation.

So the method here is as follows.
  1. Let the number you want to find the square root of be $a^2$.
  2. Let the largest square number which is less than or equal to $a^2$ be $b^2$ such that $b^2 \le a^2 < (b+1)^2$. For example if $a^2$ is 10 then $b^2$ is 9, if $a^2$ is 18 then $b^2$ is 16.
  3. The square root of $a^2$ is approximately $b + \frac{a^2 - b^2}{2 b}$.

This method is easy to carry out mentally but why does it work? The trick here is that the graph of the square root function grows so slowly that we can approximate the curve between two adjacent square numbers as a line.



We can use the line to approximate the square root of any number between two square numbers. The first thing we need to know is the gradient of the line. The vertical distance between two adjacent square numbers on the square root curve is 1, since the two square numbers are the squares of two consecutive numbers. The horizontal distance changes and becomes larger as the adjacent square numbers become larger but we can calculate it as follows:

$$(b+1)^2 - b^2 = b^2 + 2b + 1 - b^2 = 2b + 1$$

So the horizontal distance is twice the square root of the smaller square number plus one. Therefore the gradient of the line is $\frac{1}{2b+1}$. Once we know by how much the line grows vertically for every horizontal unit, we can then determine how much higher than $b$ the point on the line will be at $a^2$ by multiplying the gradient by $a^2-b^2$, as shown below:



Since the difference in height is less than 1, it is going to be the part of the square root that comes after the decimal point, with the whole number part being $b$.

It might be hard to mentally divide by an odd number in $\frac{a^2-b^2}{2b+1}$ so we further approximate it as $\frac{a^2-b^2}{2b}$ instead. And that's why this method works.
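A minimal sketch of the method in Python, using `math.isqrt` to find $b$:

```python
import math

def estimate_sqrt(n):
    # b is the root of the largest square number not exceeding n
    b = math.isqrt(n)
    if b * b == n:
        return float(b)  # n is itself a square number
    return b + (n - b * b) / (2 * b)

print(estimate_sqrt(10))  # 3.1666... (actual: 3.1622...)
print(estimate_sqrt(18))  # 4.25     (actual: 4.2426...)
```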

Thursday, August 30, 2018

The Maclaurin series and Taylor series (approximating complicated functions with simple polynomials)

The Maclaurin series

Imagine you had a function $f(x)$ that you knew was a polynomial, but whose details were unknown and you could only apply operations to the function without being able to read it. How could you find the coefficients of this polynomial? We know that for coefficients $a_i$:

$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + ...$

If we find $f(0)$ then we can find $a_0$.

$f(0) = a_0 + a_1 0 + a_2 0^2 + a_3 0^3 + a_4 0^4 + ... = a_0 + 0 + 0 + ... = a_0$

That was easy, but how can we find $a_1$? We need an operation that gets rid of $a_0$ and also the $x$ in the term $a_1 x$. That operation turns out to be differentiation with respect to $x$:

$f'(x) = a_1 + 2 a_2 x + 3 a_3 x^2 + 4 a_4 x^3 + ...$

Great! Now we can find $a_1$ by replacing $x$ with 0:

$f'(0) = a_1 + 2 a_2 0 + 3 a_3 0^2 + 4 a_4 0^3 + ... = a_1$

We can find $a_2$ by repeating these two steps:

$f''(x) = 2 a_2 + 2 \cdot 3 a_3 x + 3 \cdot 4 a_4 x^2 + ...$
$f''(0) = 2 a_2 + 2 \cdot 3 a_3 0 + 3 \cdot 4 a_4 0^2 + ... = 2 a_2$

What we found is twice of $a_2$ which means that we need to divide by 2 to get $a_2$. The differentiation operation is multiplying constants by each coefficient and the constants get bigger and bigger the more times we apply differentiation. You might notice that what's happening is that $a_0$ was multiplied by 1, $a_1$ was also multiplied by 1, $a_2$ has been multiplied by 2, $a_3$ will be multiplied by 6, $a_4$ by 24, and so on. These are factorials, sequences of whole numbers multiplied together ($3! = 1 \times 2 \times 3 = 6$). Which means that we need to divide by the next factorial after each round of differentiation and substitution by zero.

In general we can find the $i$th coefficient in an unknown polynomial function by doing the following:

$a_i = \frac{f^i(0)}{i!}$

That's great. Now to test it. Let's see if a complicated function is actually a polynomial in disguise. Take something simple such as $f(x) = e^x$. This doesn't look like a polynomial, but it may be represented by a polynomial with an infinite number of terms. Let's find the coefficients of the hidden polynomial in $e^x$.

$f(x) = e^x = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + ...$
$a_0 = \frac{f(0)}{0!} = \frac{e^0}{1} = 1$

OK, so $a_0$ is 1. Let's find the rest of the coefficients.

$a_1 = \frac{f'(0)}{1!} = \frac{e^0}{1} = 1$
$a_2 = \frac{f''(0)}{2!} = \frac{e^0}{2} = \frac{1}{2}$
$a_3 = \frac{f'''(0)}{3!} = \frac{e^0}{6} = \frac{1}{6}$
$...$

So the first few terms of the polynomial hidden within $e^x$ are:
$f(x) = 1 + x + \frac{1}{2} x^2 + \frac{1}{6} x^3 + ...$

Does this partial polynomial look anything like $e^x$ when plotted as a graph?



Pretty good within a boundary! Note how the curve has a perfect fit at $x = 0$ and that it gets worse as we move away from there. Adding more terms to the polynomial will enlarge the area around $x = 0$ that is close to the curve but it will always be radiating out from there.
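A quick sketch of these partial sums, showing that the fit improves as terms are added:

```python
import math

def maclaurin_exp(x, terms):
    # Partial sum of the hidden polynomial: 1 + x + x^2/2! + x^3/3! + ...
    return sum(x ** i / math.factorial(i) for i in range(terms))

# The approximation is exact at x = 0 and degrades away from it, but
# adding more terms widens the region of good fit:
print(abs(maclaurin_exp(1.0, 5) - math.e) < 0.01)    # True
print(abs(maclaurin_exp(1.0, 10) - math.e) < 1e-6)   # True
```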

Let's try for $f(x) = cos(x)$ now.

$a_0 = \frac{f(0)}{0!} = \frac{cos(0)}{1} = 1$
$a_1 = \frac{f'(0)}{1!} = \frac{-sin(0)}{1} = 0$
$a_2 = \frac{f''(0)}{2!} = \frac{-cos(0)}{2} = -\frac{1}{2}$
$a_3 = \frac{f'''(0)}{3!} = \frac{sin(0)}{6} = 0$
$...$

So the first few terms of the polynomial hidden within $cos(x)$ are
$f(x) = 1 - \frac{1}{2} x^2 + ...$
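The same numerical check works here too; a minimal sketch (the function name is my own):

```python
import math

def cos_series(x, n_terms):
    """Partial Maclaurin sum for cos(x): 1 - x^2/2! + x^4/4! - ..."""
    return sum((-1)**i * x**(2 * i) / math.factorial(2 * i)
               for i in range(n_terms))

print(cos_series(0.5, 2), math.cos(0.5))  # two terms: 1 - 0.125 = 0.875
```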



The Taylor series

As you can see, this "polynomialisation" of functions is a neat way to approximate a function we might not know how to implement exactly but do know how to differentiate and evaluate at $x = 0$. But what if we don't know what $f(0)$ is, such as with $ln(x)$ or $\frac{1}{x}$? A slight modification to our method allows us to use any value of $x$, not just 0. Let's call this value $b$. By slightly modifying the polynomial we expect to be hiding inside the function, we can make the polynomial behave the same way when $f(b)$ is used instead of $f(0)$:

$f(x) = a_0 + a_1 (x - b) + a_2 (x - b)^2 + a_3 (x - b)^3 + a_4 (x - b)^4 + ...$
$f(b) = a_0 + a_1 (b - b) + a_2 (b - b)^2 + a_3 (b - b)^3 + a_4 (b - b)^4 + ...$
$f(b) = a_0$

$f'(x) = a_1 + 2 a_2 (x - b) + 3 a_3 (x - b)^2 + 4 a_4 (x - b)^3 + ...$
$f'(b) = a_1$

$a_i = \frac{f^{(i)}(b)}{i!}$

The catch here is that we are now finding the coefficients of the polynomial $a_0 + a_1 (x - b) + a_2 (x - b)^2 + ...$ and not of $a_0 + a_1 x + a_2 x^2 + ...$, but that's OK. Let's try this on $ln(x)$ with $b = 1$.

$a_0 = \frac{f(1)}{0!} = \frac{ln(1)}{1} = 0$
$a_1 = \frac{f'(1)}{1!} = \frac{\frac{1}{1}}{1} = 1$
$a_2 = \frac{f''(1)}{2!} = \frac{-\frac{1}{1^2}}{2} = -\frac{1}{2}$
$a_3 = \frac{f'''(1)}{3!} = \frac{\frac{2}{1^3}}{6} = \frac{1}{3}$
$...$

So the first few terms of the polynomial hidden within $ln(x)$ are
$f(x) = (x - 1) - \frac{1}{2} (x - 1)^2 + \frac{1}{3} (x - 1)^3 + ...$
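Continuing this process, the coefficients follow the pattern $a_i = \frac{(-1)^{i+1}}{i}$ for $i \geq 1$, which a small sketch can verify against `math.log` (the function name is my own):

```python
import math

def ln_series(x, n_terms):
    """Partial Taylor sum for ln(x) around b = 1:
    (x-1) - (x-1)^2/2 + (x-1)^3/3 - ..."""
    return sum((-1)**(i + 1) * (x - 1)**i / i for i in range(1, n_terms + 1))

print(ln_series(1.5, 5), math.log(1.5))
```

Note that this particular series only converges for $x$ between 0 and 2, so the check should stay near $b = 1$.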



Adding more terms will approximate the original function better and better, but what if we didn't have to? Remember how I said in the previous section that the polynomial approximates the original function best close to $x = 0$? Well, now we can approximate it best around any point $b$ and not just around 0. This means that if our function has multiple known values, such as $cos(x)$, which is known to be 1 at $x = 0$, 0 at $x = \frac{\pi}{2}$, -1 at $x = \pi$, etc., then we can use several short polynomials centered at different points in the function instead of one large polynomial that approximates it well over a large interval. Let's try approximating $cos(x)$ around $x = \pi$, which means that we'll set $b$ to $\pi$.

$a_0 = \frac{f(\pi)}{0!} = \frac{cos(\pi)}{1} = -1$
$a_1 = \frac{f'(\pi)}{1!} = \frac{-sin(\pi)}{1} = 0$
$a_2 = \frac{f''(\pi)}{2!} = \frac{-cos(\pi)}{2} = \frac{1}{2}$
$a_3 = \frac{f'''(\pi)}{3!} = \frac{sin(\pi)}{6} = 0$
$...$

So the first few terms of the polynomial hidden within $cos(x)$ that best approximates the area around $x = \pi$ are
$f(x) = -1 + \frac{1}{2} (x - \pi)^2 + ...$



This is useful when implementing mathematical functions on a computer. You keep several simple polynomials defined at different points in the domain of the function and then pick the closest one to the $x$ you need to evaluate. You can then compute an approximation that isn't too bad without requiring a lot of computational time.
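Here is a toy sketch of that idea (all names and the choice of just two centres are my own; a real implementation would use more centres and more terms): two short expansions of $cos(x)$, one centered at 0 and one at $\pi$, with a dispatcher that picks whichever centre is nearer.

```python
import math

def cos_near_0(x):
    # Truncated Maclaurin expansion of cos(x) around 0
    return 1 - x**2 / 2 + x**4 / 24

def cos_near_pi(x):
    # Truncated Taylor expansion of cos(x) centered at pi
    d = x - math.pi
    return -1 + d**2 / 2 - d**4 / 24

def cos_approx(x):
    # Dispatch to whichever expansion point is closer to x
    return cos_near_0(x) if abs(x) < abs(x - math.pi) else cos_near_pi(x)

print(cos_approx(3.0), math.cos(3.0))
```

Each polynomial needs only three terms because it is never asked to stretch far from its own centre.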

Monday, April 30, 2018

Why Bahdanau's neural attention requires two layers

The neural attention model was first described for the task of machine translation in Bahdanau's Neural machine translation by jointly learning to align and translate. The source sentence is translated into the target sentence by attending to different words in the source sentence as the target sentence is generated one word at a time. The attended source sentence is defined as follows:

$$
\begin{align}
c_i &= \sum_j \alpha_{ij} s_j \\
\alpha_{ij} &= \frac{e^{z_{ij}}}{\sum_k e^{z_{ik}}} \\
z_{ij} &= W \tanh(V (s_j ++ p_{i-1}))
\end{align}
$$
where $c_i$ is the attended source sentence taken from the weighted sum of the source sentence vectors $s_j$ and the weights $\alpha_{ij}$, $p_{i-1}$ is the prefix vector produced by the 'decoder' RNN that remembers what has been generated thus far, $++$ means concatenation, and $W$ and $V$ are two weight matrices.

So the question we're asking is: why do we need two layers to produce $z_{ij}$? Can we make do with just one?

In reality, what happens when you use a single layer is that the attention weights remain the same across time steps: although different words in the source sentence will receive different amounts of attention, each word will keep receiving the same attention throughout the whole generation process. The reason for this is that softmax, the function that produces $\alpha_{ij}$, is shift invariant, that is, it does not change if you add the same number to each of its inputs.

Let's say $z$ is defined as [ 1, 2, 3 ]. Then $\alpha$ will be
$$
\begin{matrix}
(\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3})
\end{matrix}
$$
but if we add the constant $k$ to each of the three numbers, then the result will still be the same:
$$
\begin{matrix}
& (\frac{e^{1+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{2+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{3+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) \\
=& (\frac{e^1e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^2e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^3e^k}{e^1e^k + e^2e^k + e^3e^k}) \\
=& (\frac{e^1e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^2e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^3e^k}{e^k(e^1 + e^2 + e^3)}) \\
=& (\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3})
\end{matrix}
$$

This proves that adding the same constant to every number in $z$ will leave the softmax unaltered. Softmax is shift invariant.
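The same fact can be checked numerically with a quick sketch (the function name is my own):

```python
import math

def softmax(zs):
    # Standard softmax: exponentiate, then normalise to sum to 1
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

z = [1.0, 2.0, 3.0]
k = 5.0
print(softmax(z))
print(softmax([zi + k for zi in z]))  # same distribution
```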

Now let's say that $z$ is determined by a single neural layer such that $z_{ij} = W (s_j ++ p_{i-1}) = W_0 s_j + W_1 p_{i-1}$. We can draw a matrix of all possible $z$ where the columns are the source vectors $s$ and the rows are the decoder prefix states $p$.
$$
\begin{matrix}
(W_0 s_0 + W_1 p_0) & (W_0 s_1 + W_1 p_0) & (W_0 s_2 + W_1 p_0) & \dots \\
(W_0 s_0 + W_1 p_1) & (W_0 s_1 + W_1 p_1) & (W_0 s_2 + W_1 p_1) & \dots \\
(W_0 s_0 + W_1 p_2) & (W_0 s_1 + W_1 p_2) & (W_0 s_2 + W_1 p_2) & \dots \\
\dots & \dots & \dots & \dots
\end{matrix}
$$

Given a single row, the $p$s are always the same, which means that the only source of variation between $z$s of the same prefix $p$ is from the source vectors $s$. This makes sense.

Now take the first two rows. What is the result of subtracting the second from the first?

$$
\begin{matrix}
(W_0 s_0 + W_1 p_0 - W_0 s_0 - W_1 p_1) & (W_0 s_1 + W_1 p_0 - W_0 s_1 - W_1 p_1) & (W_0 s_2 + W_1 p_0 - W_0 s_2 - W_1 p_1) & \dots \\
(W_1(p_0-p_1)) & (W_1(p_0-p_1)) & (W_1(p_0-p_1)) & \dots
\end{matrix}
$$

Notice how all the columns have the same difference, which means that the second row can be rewritten as the first row shifted by a constant:

$$
\begin{matrix}
(W_0 s_0 + W_1 p_0 - W_1(p_0-p_1)) & (W_0 s_1 + W_1 p_0 - W_1(p_0-p_1)) & (W_0 s_2 + W_1 p_0 - W_1(p_0-p_1)) & \dots
\end{matrix}
$$

We know that adding the same constant to every $z$ will leave the softmax unaltered, which means that every time step in the decoder RNN will lead to the same attention vector. The individual attention values will differ across source words, but they will not change throughout the whole generation process. Using two layers with a non-linear activation function in the middle disrupts this, as the difference between two consecutive rows of $z$ will no longer be the same constant in every column.
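To make this concrete, here is a toy numerical sketch (scalar "vectors" and made-up weights, purely illustrative) showing that with a single linear layer every decoder time step produces the same attention distribution:

```python
import math
import random

def softmax(zs):
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
W0, W1 = 0.7, -1.3                           # made-up scalar weights
s = [random.gauss(0, 1) for _ in range(4)]   # source positions
p = [random.gauss(0, 1) for _ in range(3)]   # decoder time steps

# z_ij = W0*s_j + W1*p_i: a single linear layer, no non-linearity.
# W1*p_i is a constant within each row, so softmax cancels it out.
rows = [softmax([W0 * sj + W1 * pi for sj in s]) for pi in p]
for row in rows:
    print(row)  # every row is the same distribution
```

Swap the score for something non-separable, like `math.tanh(W0 * sj + W1 * pi)`, and the rows stop being identical.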