## Monday, December 30, 2019

### Implementing a Gaussian blur filter together with convolution operation from scratch

Gaussian blurring is a very common filter used in image processing which is useful for many things such as removing salt and pepper noise from images, resizing images to be smaller (downsampling), and simulating out-of-focus effects. Let's look at how to implement one ourselves.

Let's use the following grey-scale image as an example:

The image consists of many numbers that are arranged in a grid, each number called a pixel. The pixel values range from 0 to 255 where 0 is represented with a black colour and 255 with a white colour and all the other numbers are shades of grey in between. Here are the values of the 5x5 square of pixels in the top-left corner of the image:
[[156, 157, 160, 159, 158],
[156, 157, 159, 158, 158],
[158, 157, 156, 156, 157],
[160, 157, 154, 154, 156],
[158, 157, 156, 156, 157]]

In the case of a colour image, each pixel contains 3 values: the intensity of red, the intensity of blue, and the intensity of green. When these three primary colours are combined in different intensities they can produce any colour.

Let's use the following colour image as an example:

The values of the 5x5 square of pixels in the top-left corner of the image are:
[[[21, 13,  8], [21, 13,  9], [20, 11,  8], [21, 13, 11], [21, 14,  8]],
[[21, 13,  7], [21, 13,  9], [20, 14,  7], [22, 14,  8], [20, 13,  7]],
[[21, 14,  7], [23, 13, 10], [20, 14,  9], [21, 13,  9], [21, 15,  9]],
[[21, 13,  6], [21, 13,  9], [21, 13,  8], [21, 13, 10], [20, 12,  5]],
[[21, 13,  5], [21, 13,  7], [21, 13,  4], [21, 13,  7], [21, 13,  8]]]

The convolution operation

One of the basic operations in image processing is the convolution. This is when you replace every pixel value $p$ in an image with a weighted sum of the pixel values near $p$, that is, multiply each nearby pixel value with a different number before adding them all up and replacing $p$ with the result. The simplest example of a convolution is when 'nearby' means the pixel itself only. For example, if we convolve a 5x5 pixel greyscale image with a 1x1 weighting, this is what we get:

$$\left[\begin{array}{ccccc} 1 & 2 & 3 & 4 & 5\\ 6 & 7 & 8 & 9 & 10\\ 11 & 12 & 13 & 14 & 15\\ 16 & 17 & 18 & 19 & 20\\ 21 & 22 & 23 & 24 & 25\\ \end{array}\right] \circledast \left[\begin{array}{c} 100 \end{array}\right] = \left[\begin{array}{ccccc} 1 \times 100 & 2 \times 100 & 3 \times 100 & 4 \times 100 & 5 \times 100\\ 6 \times 100 & 7 \times 100 & 8 \times 100 & 9 \times 100 & 10 \times 100\\ 11 \times 100 & 12 \times 100 & 13 \times 100 & 14 \times 100 & 15 \times 100\\ 16 \times 100 & 17 \times 100 & 18 \times 100 & 19 \times 100 & 20 \times 100\\ 21 \times 100 & 22 \times 100 & 23 \times 100 & 24 \times 100 & 25 \times 100\\ \end{array}\right] = \left[\begin{array}{ccccc} 100 & 200 & 300 & 400 & 500\\ 600 & 700 & 800 & 900 & 1000\\ 1100 & 1200 & 1300 & 1400 & 1500\\ 1600 & 1700 & 1800 & 1900 & 2000\\ 2100 & 2200 & 2300 & 2400 & 2500\\ \end{array}\right]$$

Here we have replaced the pixel value 1 with 100, 2 with 200, etc. Note that in practice the numbers would be clipped to not exceed 255.

Only the pixel being replaced contributed to its new value. We can extend the range of the weighting by using a 3x3 weight instead of a single number which will combine all the pixels adjacent to the pixel being replaced.

$$\left[\begin{array}{ccccc} 1 & 2 & 3 & 4 & 5\\ 6 & 7 & 8 & 9 & 10\\ 11 & 12 & 13 & 14 & 15\\ 16 & 17 & 18 & 19 & 20\\ 21 & 22 & 23 & 24 & 25\\ \end{array}\right] \circledast \left[\begin{array}{ccc} 100 & 200 & 100\\ 300 & 400 & 300\\ 100 & 200 & 100\\ \end{array}\right] = \left[\begin{array}{ccccc} \left(\begin{array}{cccccc} & 1 \times 100 & + & 2 \times 200 & + & 3 \times 100\\ + & 6 \times 300 & + & 7 \times 400 & + & 8 \times 300\\ + & 11 \times 100 & + & 12 \times 200 & + & 13 \times 100\\ \end{array}\right) & \left(\begin{array}{cccccc} & 2 \times 100 & + & 3 \times 200 & + & 4 \times 100\\ + & 7 \times 300 & + & 8 \times 400 & + & 9 \times 300\\ + & 12 \times 100 & + & 13 \times 200 & + & 14 \times 100\\ \end{array}\right) & \left(\begin{array}{cccccc} & 3 \times 100 & + & 4 \times 200 & + & 5 \times 100\\ + & 8 \times 300 & + & 9 \times 400 & + & 10 \times 300\\ + & 13 \times 100 & + & 14 \times 200 & + & 15 \times 100\\ \end{array}\right) \\ \left(\begin{array}{cccccc} & 6 \times 100 & + & 7 \times 200 & + & 8 \times 100\\ + & 11 \times 300 & + & 12 \times 400 & + & 13 \times 300\\ + & 16 \times 100 & + & 17 \times 200 & + & 18 \times 100\\ \end{array}\right) & \left(\begin{array}{cccccc} & 7 \times 100 & + & 8 \times 200 & + & 9 \times 100\\ + & 12 \times 300 & + & 13 \times 400 & + & 14 \times 300\\ + & 17 \times 100 & + & 18 \times 200 & + & 19 \times 100\\ \end{array}\right) & \left(\begin{array}{cccccc} & 8 \times 100 & + & 9 \times 200 & + & 10 \times 100\\ + & 13 \times 300 & + & 14 \times 400 & + & 15 \times 300\\ + & 18 \times 100 & + & 19 \times 200 & + & 20 \times 100\\ \end{array}\right) \\ \left(\begin{array}{cccccc} & 11 \times 100 & + & 12 \times 200 & + & 13 \times 100\\ + & 16 \times 300 & + & 17 \times 400 & + & 18 \times 300\\ + & 21 \times 100 & + & 22 \times 200 & + & 23 \times 100\\ \end{array}\right) & \left(\begin{array}{cccccc} & 12 \times 100 & + & 13 \times 200 & + & 14 \times 100\\ + & 17 \times 300 & + & 18 \times 400 & + & 19 \times 300\\ + & 22 \times 100 & + & 23 \times 200 & + & 24 \times 100\\ \end{array}\right) & \left(\begin{array}{cccccc} & 13 \times 100 & + & 14 \times 200 & + & 15 \times 100\\ + & 18 \times 300 & + & 19 \times 400 & + & 20 \times 300\\ + & 23 \times 100 & + & 24 \times 200 & + & 25 \times 100\\ \end{array}\right) \\ \end{array}\right] = \left[\begin{array}{ccc} 12600 & 14400 & 16200\\ 21600 & 23400 & 25200\\ 30600 & 32400 & 34200\\ \end{array}\right]$$

Note how the new image becomes smaller than the original image. This is in order to avoid using pixels values that lie outside of the image edges when computing new pixel values. One way to use pixel values outside of the image is to treat any pixel outside of the image as having a value of zero (zero padding). We can also reflect the image outside of its edges or wrap around to the other side of the image.

We can implement the above convolution operation in Python naively using for loops as follows:

def conv(img, wgt):
img_h = len(img)
img_w = len(img[0])
wgt_h = len(wgt)
wgt_w = len(wgt[0])

res_h = img_h - (wgt_h//2)*2
res_w = img_w - (wgt_w//2)*2
res = [ [ 0 for _ in range(res_w) ] for _ in range(res_h) ]

for img_row in range(wgt_h//2, img_h-wgt_h//2):
for img_col in range(wgt_w//2, img_w-wgt_w//2):
new_pix = 0
for wgt_row in range(wgt_h):
for wgt_col in range(wgt_w-1, -1, -1): #Technically, a convolution operation needs to flip the weights left to right.
new_pix += wgt[wgt_row][wgt_col]*img[img_row+wgt_row-wgt_h//2][img_col+wgt_col-wgt_w//2]
res[img_row - wgt_h//2][img_col - wgt_w//2] = new_pix

return res

print(conv([
[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9,10],
[11,12,13,14,15],
[16,17,18,19,20],
[21,22,23,24,25]
], [
[100,200,100],
[300,400,300],
[100,200,100]
]))
[[12600, 14400, 16200], [21600, 23400, 25200], [30600, 32400, 34200]]

I say naively because it is possible to perform a convolution using fewer loops and with parallel processing but that would be beyond the scope of this blog post. Note that in the code above we are flipping the weights before multiplying them because that is what a convolution involves, but since we'll be working with symmetric weights then it doesn't really matter.

If we want to use zero padding in our convolution then we would do it as follows:

img_h = len(img)
img_w = len(img[0])
wgt_h = len(wgt)
wgt_w = len(wgt[0])

res = [ [ 0 for _ in range(img_w) ] for _ in range(img_h) ]

for img_row in range(img_h):
for img_col in range(img_w):
new_pix = 0
for wgt_row in range(wgt_h):
for wgt_col in range(wgt_w-1, -1, -1):
y = img_row+wgt_row-wgt_h//2
x = img_col+wgt_col-wgt_w//2
if 0 >= y > img_h and 0 >= x > img_w:
pix = img[y][x]
else:
pix = 0
new_pix += wgt[wgt_row][wgt_col]*pix
res[img_row][img_col] = new_pix

return res

[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9,10],
[11,12,13,14,15],
[16,17,18,19,20],
[21,22,23,24,25]
], [
[100,200,100],
[300,400,300],
[100,200,100]
]))
[[2900, 4800, 6200, 7600, 6100], [8300, 12600, 14400, 16200, 12500], [14800, 21600, 23400, 25200, 19000], [21300, 30600, 32400, 34200, 25500], [19900, 28800, 30200, 31600, 23100]]

We can adapt this function to be able to operate on colour images by simply applying the filter on each of the three intensity channels separately as follows:

img_h = len(img)
img_w = len(img[0])
wgt_h = len(wgt)
wgt_w = len(wgt[0])

res = [ [ [ 0, 0, 0 ] for _ in range(img_w) ] for _ in range(img_h) ]

for img_row in range(img_h):
for img_col in range(img_w):
for channel in range(3):
new_pix = 0
for wgt_row in range(wgt_h):
for wgt_col in range(wgt_w-1, -1, -1):
y = img_row+wgt_row-wgt_h//2
x = img_col+wgt_col-wgt_w//2
if 0 <= y < img_h and 0 <= x < img_w:
pix = img[y][x][channel]
else:
pix = 0
new_pix += wgt[wgt_row][wgt_col]*pix
res[img_row][img_col][channel] = new_pix

return res

[[ 1,  2,  3], [ 4,  5,  6], [ 7,  8,  9], [10, 11, 12], [13, 14, 15]],
[[16, 17, 18], [19, 20, 21], [22, 23, 24], [25, 26, 27], [28, 29, 30]],
[[31, 32, 33], [34, 35, 36], [37, 38, 39], [40, 41, 42], [43, 44, 45]],
[[46, 47, 48], [49, 50, 51], [52, 53, 54], [55, 56, 57], [58, 59, 60]],
[[61, 62, 63], [64, 65, 66], [67, 68, 69], [70, 71, 72], [73, 74, 75]]
], [
[100,200,100],
[300,400,300],
[100,200,100]
]))
[[[6700, 7700, 8700], [11600, 13000, 14400], [15800, 17200, 18600], [20000, 21400, 22800], [16300, 17300, 18300]], [[22300, 23600, 24900], [34200, 36000, 37800], [39600, 41400, 43200], [45000, 46800, 48600], [34900, 36200, 37500]], [[41800, 43100, 44400], [61200, 63000, 64800], [66600, 68400, 70200], [72000, 73800, 75600], [54400, 55700, 57000]], [[61300, 62600, 63900], [88200, 90000, 91800], [93600, 95400, 97200], [99000, 100800, 102600], [73900, 75200, 76500]], [[57700, 58700, 59700], [83600, 85000, 86400], [87800, 89200, 90600], [92000, 93400, 94800], [67300, 68300, 69300]]]
In practice we can use scipy's convolve function in order to apply a convolution:
import scipy.ndimage
import numpy as np
img = np.array([
[[ 1,  2,  3], [ 4,  5,  6], [ 7,  8,  9], [10, 11, 12], [13, 14, 15]],
[[16, 17, 18], [19, 20, 21], [22, 23, 24], [25, 26, 27], [28, 29, 30]],
[[31, 32, 33], [34, 35, 36], [37, 38, 39], [40, 41, 42], [43, 44, 45]],
[[46, 47, 48], [49, 50, 51], [52, 53, 54], [55, 56, 57], [58, 59, 60]],
[[61, 62, 63], [64, 65, 66], [67, 68, 69], [70, 71, 72], [73, 74, 75]]
])
weights = np.array([
[100,200,100],
[300,400,300],
[100,200,100]
])
res = np.stack([ #Convolve each channel separately as a greyscale image and then combine them back into a single image.
scipy.ndimage.convolve(img[:,:,channel], weights, mode='constant')
for channel in range(3)
], axis=2)
print(res)
In signal processing, the weights are actually called a filter.

The gaussian blur
A blur is made by replacing each pixel value with the average value of its nearby pixels. This softens edges and more the image look out of focus. Averaging nearby pixels can be easily achieved by convolving with a filter of $\frac{1}{n}$ where $n$ is the amount of numbers in the filter. This results in making each pixel equal to $$p_1 \times \frac{1}{n} + p_2 \times \frac{1}{n} + \dots + p_n \times \frac{1}{n} = \frac{1}{n} \left(p_1 + p_2 + \dots + p_n\right)$$ where $p_i$ is a pixel value.
import scipy.ndimage
import skimage
import numpy as np

filter = np.ones([11, 11])
filter = filter/filter.size

res = np.stack([ #Convolve each channel separately as a greyscale image and then combine them back into a single image.
scipy.ndimage.convolve(img[:,:,channel], filter, mode='constant')
for channel in range(3)
], axis=2)

skimage.io.imshow(res)
skimage.io.show()

This is actually called a uniform blur, which is not considered an ideal blur as it gives equal importance to pixels that are far from the pixel being replaced as to pixels that are close. A better blur is a gaussian blur. A gaussian blur is when the filter values follow a gaussian bell curve with a maximum in the center and a steep decline just around the center. The gaussian function is defined as follows: $$G(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{x^2}{2\sigma^2}}$$ where $\sigma$ is a parameter for controlling how wide the bell curve is and therefore how much influence the nearby pixels have on the value of the pixel being replaced (the greater the $\sigma$, the more blurry). In two dimensions all you do is take the gaussian of each variable and multiply them together: $$G(x,y) = G(x) \times G(y)$$ Now the way the gaussian blur works is to use an infinite 2D gaussian curve as a filter for a convolution. The center of the filter is $G(0,0)$ and each cell in the filter being a distance of one in the gaussian function's domain as follows: $$\newcommand\iddots{\mathinner{ \kern1mu\raise1pt{.} \kern2mu\raise4pt{.} \kern2mu\raise7pt{\Rule{0pt}{7pt}{0pt}.} \kern1mu }} \begin{array}{ccccc} \ddots & & \vdots & & \iddots\\ & G(-1, 1) & G( 0, 1) & G( 1, 1) & \\ \cdots & G(-1, 0) & G( 0, 0) & G( 1, 0) & \cdots\\ & G(-1,-1) & G( 0,-1) & G( 1,-1) & \\ \iddots & & \vdots & & \ddots\\ \end{array}$$ In order to be a perfect gaussian blur, the filter needs to be infinitely large, which is not possible. Fortunately, $G(3\sigma)$ is small enough to be rounded down to zero. This means that our kernel needs to only have a side of $\lceil 6\sigma + 1 \rceil$ at most as anything larger than that will not make a difference ($\lceil x \rceil$ means $x$ rounded up). The problem is that even though the missing values are negligible, the values in the filter will not equal 1 (especially for a small $\sigma$) which makes the resulting image will be either brighter or darker. To get around this we cheat a bit and divide each value in the filter by the filter's total, which results in a guassian blur with a neutral effect on the brightness of the image. $$\begin{array}{ccccccc} \frac{G(-\lceil 3\sigma \rceil, \lceil 3\sigma \rceil)}{T} & & & \frac{G(0, \lceil 3\sigma \rceil)}{T} & & & \frac{G(\lceil 3\sigma \rceil, \lceil 3\sigma \rceil)}{T}\\ & \ddots & & \vdots & & \iddots & \\ & & \frac{G(-1, 1)}{T} & \frac{G( 0, 1)}{T} & \frac{G( 1, 1)}{T} & & \\ \frac{G(-\lceil 3\sigma \rceil, 0)}{T} & \cdots & \frac{G(-1, 0)}{T} & \frac{G( 0, 0)}{T} & \frac{G( 1, 0)}{T} & \cdots & \frac{G(\lceil 3\sigma \rceil, 1)}{T}\\ & & \frac{G(-1,-1)}{T} & \frac{G( 0,-1)}{T} & \frac{G( 1,-1)}{T} & & \\ & \iddots & & \vdots & & \ddots & \\ \frac{G(-\lceil 3\sigma \rceil, -\lceil 3\sigma \rceil)}{T} & & & \frac{G(0, -\lceil 3\sigma \rceil)}{T} & & & \frac{G(\lceil 3\sigma \rceil, -\lceil 3\sigma \rceil)}{T}\\ \end{array}$$ where $T$ is the total of all the $G(x,y)$ in the filter.
import scipy.ndimage
import skimage
import numpy as np

sigma = 4
G = lambda x: 1/np.sqrt(2*np.pi*sigma**2)*np.exp(-x**2/(2*sigma**2))
G2 = lambda x,y: G(x)*G(y)
limit = np.ceil(3*sigma).astype(np.int32)
filter = G2(*np.meshgrid(np.arange(-limit, limit+1), np.arange(-limit, limit+1)))
filter = weights/np.sum(filter)

res = np.stack([
scipy.ndimage.convolve(img[:,:,channel], filter, mode='constant')
for channel in range(3)
], axis=2)
skimage.io.imshow(res)
skimage.io.show()

Of course skimage has a gaussian blur filter out of the box for us to use:
import skimage

sigma = 4
res = skimage.filters.gaussian(img, sigma, mode='constant', truncate=3) #Truncate to 3 sigmas.
skimage.io.imshow(res)
skimage.io.show()

## Saturday, November 30, 2019

### Texture Descriptors of Image Regions using Local Binary Patterns

A very simple way to measure the similarity between the texture of regions in images is to capture the region's texture information in a vector using local binary patterns. These vectors are invariant to intensity changes, fast to compute, and very easy to understand. Let's see how this works. Note that since we're focusing on texture then we don't care about colour but about intensities, so the image needs to be greyscale.

The first step is to replace all the pixels in the image region with LBP (local binary pattern) codes. For every pixel, draw an imaginary circle of some radius centred on the pixel and draw some number of imaginary dots on that circle that are equally spaced apart. The radius and number of dots are to be hand tuned in order to maximise performance but they are usually set to a radius of 1 pixel and a number of dots of 8. Now pick the pixels that fall under those dots together with the pixel at the centre.

In the surrounding pixels, any pixel that has a lower intensity than the central pixel gets a zero and all the rest get a one. These numbers are put together in order to form a binary number called the LBP code of the central pixel.

Do this for every pixel in the region and then count how many times each code appears in the region, putting all the frequencies in a vector (a histogram). This is your texture descriptor and you can stop here. The more similar the vector is to other vectors, the more similar the texture of the image regions are.

We can do better than this however by making the codes uniform. If you do an analysis of these LBP code frequencies, you'll find that most of them have a most two bit changes in them. For example, 00011100 has two bit changes, one from a 0 to a 1 and another after from a 1 to a 0. On the other hand 01010101 has 7 changes. The reason why the majority will have at most two bit changes is because, most of the time, the pixels around a central pixel do not have random intensities but change in a flow as shown in the figure above. If there is a gradient of intensity changes in a single direction then there will only be two bit changes. This means that we can replace all LBP codes with more than two bit changes with a single LBP code that just means 'other'. This is called a uniform LBP code and it reduces our vector size significantly without much of a performance hit, which is good.

We can do even better by making the LBP codes rotation invariant, that is, using the same LBP code regardless of where the gradient direction is pointing. In the above figure, if the line of dark grey pixels were rotated, the LBP code would be changed to something like 00011110 for example or 11110000. The only thing that would change when rotating the surrounding pixels would be the bit rotation of the LBP code. We can make our codes invariant by artificially rotating our LBP codes so that for a given number of ones and zeros in the code we always have a canonical code. Usually we treat the binary code as an integer and rotate the bits until we get the minimum integer. For example, here are all the bit rotations of four 1s and four 0s:

11110000 (240), 01111000 (120), 00111100 (60), 00011110 (30), 00001111 (15)

In this case we would pick 00001111 as our canonical representation as that is the smallest integer. This drastically reduces our vector sizes as the amount of different possible LBP codes to find the frequency of will be a small fraction of the original full set of LBP codes.

You can use LBP descriptors using Python's skimage and numpy as follows:
import numpy as np
import skimage
import skimage.feature

region = image[0:100,0:100]
lbp_codes = skimage.feature.local_binary_pattern(region, P=8, R=1, method='uniform') #8 dots, radius 1, rotation invariant uniform codes.
descriptor = np.histogram(lbp_codes, bins=10, range=(0, 10))[0] #Number of bins and range changes depending on the number of dots and LBP method used.

The number of bins and range needed can be obtained by artificially generating all the possible arrangements of darker and lighter pixels around a central pixel:
def get_histogram_params(dots, method):
all_possible_lbp_codes = set()
for i in range(2**8):
str_bin = '{:0>8b}'.format(i)
region = [ {'0':0,'1':2}[ch] for ch in str_bin ] #Surrounding pixels.
region.insert(4, 1) #Central pixel.
region = np.array(region).reshape((3,3))
lbp_codes = skimage.feature.local_binary_pattern(region, dots, 1, method) #Radius is minimum to keep region size to a minimum.
all_possible_lbp_codes.add(lbp_codes[1,1]) #Get code of central pixel and add it to the set (keeps on unique values).
num_bins = len(all_possible_lbp_codes)
rng = (min(all_possible_lbp_codes), max(all_possible_lbp_codes)+1)
return (num_bins, rng)

## Wednesday, October 30, 2019

### No square number can end in 2, 3, 7, or 8 (modulo of square numbers)

Look at the following list of square numbers:
0^2:   0
1^2:   1
2^2:   4
3^2:   9
4^2:  16
5^2:  25
6^2:  36
7^2:  49
8^2:  64
9^2:  81
10^2: 100
11^2: 121
12^2: 144
13^2: 169
14^2: 196
15^2: 225
16^2: 256
17^2: 289
18^2: 324
19^2: 361
20^2: 400
21^2: 441
22^2: 484
23^2: 529
24^2: 576
25^2: 625
26^2: 676
27^2: 729
28^2: 784
29^2: 841
30^2: 900

Notice how the last digit is always 0, 1, 4, 9, 6, or 5? There also seems to be a cyclic pattern of digits repeating themselves but we'll talk about that in another post. In this post, we'll just talk about why certain digits do not occur at the end of square numbers, which is useful in order to check if a number is a square or not.

We can get the last digit of a number by dividing it by 10 and keeping the remainder, also known as the number modulo 10:

0^2 mod 10: 0
1^2 mod 10: 1
2^2 mod 10: 4
3^2 mod 10: 9
4^2 mod 10: 6
5^2 mod 10: 5
6^2 mod 10: 6
7^2 mod 10: 9
8^2 mod 10: 4
9^2 mod 10: 1
10^2 mod 10: 0
11^2 mod 10: 1
12^2 mod 10: 4
13^2 mod 10: 9
14^2 mod 10: 6
15^2 mod 10: 5
16^2 mod 10: 6
17^2 mod 10: 9
18^2 mod 10: 4
19^2 mod 10: 1
20^2 mod 10: 0
21^2 mod 10: 1
22^2 mod 10: 4
23^2 mod 10: 9
24^2 mod 10: 6
25^2 mod 10: 5
26^2 mod 10: 6
27^2 mod 10: 9
28^2 mod 10: 4
29^2 mod 10: 1
30^2 mod 10: 0

It turns out that this sort of exclusion of certain modulo results from square numbers happens with many other numbers apart from 10. Take 12 for example:

0^2 mod 12: 0
1^2 mod 12: 1
2^2 mod 12: 4
3^2 mod 12: 9
4^2 mod 12: 4
5^2 mod 12: 1
6^2 mod 12: 0
7^2 mod 12: 1
8^2 mod 12: 4
9^2 mod 12: 9
10^2 mod 12: 4
11^2 mod 12: 1
12^2 mod 12: 0
13^2 mod 12: 1
14^2 mod 12: 4
15^2 mod 12: 9
16^2 mod 12: 4
17^2 mod 12: 1
18^2 mod 12: 0
19^2 mod 12: 1
20^2 mod 12: 4
21^2 mod 12: 9
22^2 mod 12: 4
23^2 mod 12: 1
24^2 mod 12: 0
25^2 mod 12: 1
26^2 mod 12: 4
27^2 mod 12: 9
28^2 mod 12: 4
29^2 mod 12: 1
30^2 mod 12: 0

With 12 you only get 0, 1, 4, and 9 as digits which is less than with 10. Why does this happen? We can prove that this is the case using modulo algebra. Let's call the number being squared $x$ and the number after the modulo $n$. Then by the distributive law of the modulo operation:

$x^2 \text{ mod } n = (x \times x) \text{ mod } n = ((x \text{ mod } n) \times (x \text{ mod } n)) \text{ mod } n = (x \text{ mod } n)^2 \text{ mod } n$

What the above derivation says is that when dividing a square number, the remainder will be based on the remainder of the root of the square number. In the case of the last digit of a square number, which is equivalent to the square number modulo 10, the last digit will be one of the last digits of the single digits squared: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81. This is because $x \text{ mod } 10$ gives you the last digit of the number $x$, which is a single digit, and then this is squared and the last digit of the square is the answer.

In the general case, this is setting a limit on the number of possible remainders (usually $x \text{ mod } n$ has $n$ different possible answers for all $x$). The number of different remainders is first limited to $n$ with the first inner modulo but then only the remainders resulting from these numbers squared can be returned. This means that if one of the square numbers has the same modulo as another square number, then the total number of final remainders has decreased.

In a future post we will investigate why the remainders of square numbers are cyclic.

## Thursday, September 26, 2019

### Python functions for generating all substrings/sublists, combinations, permutations, and sequences

def subs(xs, sub_len):
'''Get all sublists/substrings from xs of the given sub-length.'''
for i in range(len(xs) - sub_len + 1): #Get all indexes of the substrings that are at least sub_len long.
yield xs[i:i+sub_len]

for x in subs('abcd', 2):
print(x)
ab
bc
cd

def combinations(xs, comb_size):
'''Get all combinations of the given combination size.'''
if comb_size == 0:
yield xs[:0] #Empty list/string
else:
for i in range(len(xs) - comb_size + 1):
for ys in combinations(xs[i+1:], comb_size-1): #For every item, combine it with every item that comes after it.
yield xs[i] + ys

for x in combinations('abcd', 2):
print(x)
ab
ac
bc
bd
cd

def perms(xs):
'''Get all permutations.'''
if len(xs) == 1:
yield xs
else:
for i in range(len(xs)):
for ys in perms(xs[:i] + xs[i+1:]): #For every item, combine it with every item except itself.
yield xs[i] + ys

for x in perms('abc'):
print(x)
abc
acb
bac
bca
cab
cba

def permutations(xs, perm_size):
'''Get all permutations of the given permutation size.'''
for cs in combinations(xs, perm_size):
for p in perms(cs): #Get all the permutations of all the combinations.
yield p

for x in permutations('abcd', 2):
print(x)
ab
ba
ac
ca
da
bc
cb
bd
db
cd
dc

def seqs(xs, seq_len):
'''Get all sequences of a given length.'''
if seq_len == 0:
yield xs[:0] #Empty list/string
else:
for i in range(len(xs)):
for ys in seqs(xs, seq_len-1): #For every item, combine it with every item including itself.
yield xs[i] + ys

for x in seqs('abc', 2):
print(x)
aa
ab
ac
ba
bb
bc
ca
cb
cc

## Friday, August 30, 2019

### The Fourier transform for real numbers (approximating complicated periodic functions with simple sinusoids)

In a previous post we saw how to approximate complex functions using simple polynomials by using the Taylor series approximation method. This consisted in assuming that your complex function was an actual polynomial and then using mathematical tricks to tease out the coefficients of the polynomial one by one. Polynomials are useful functions to approximate curves but are terrible at approximating periodic functions such as $sin(x)$ (repeat themselves periodically) because they themselves are not periodic. In this post we'll see how to approximating periodic functions by using a sum of simple sinusoids (sines or cosines). With polynomials, we had a summation of powers of $x$, each of which having a different power. Now we will instead be having a summation of sinusoids, each of which having a different whole number frequency. With polynomials we had to find the coefficients, which are constants multiplied by the power of $x$, and any curve could be arbitrarily approximated by finding the right coefficients. Now we will be finding amplitudes, which are constants multiplied by each sinusoid, and any periodic function can be arbitrarily approximated by finding the right amplitudes.

Instead of the Taylor series approximation we will now be using the Fourier series approximation, also known as the Fourier transform. In order to keep things simple we will only be looking at real functions rather than complex ones (functions with imaginary components). The Fourier series assumes that the periodic function we're deconstructing, $f(x)$, begins its period at 0 and ends it at $T$, after which $f(x)$ will continue repeating itself such that $f(T + k) = f(k)$. If we'd like our period to start at $x_0$ instead then we can easily just replace our periodic function with $f'(x)$ such that $f'(x) = f(x + x_0)$ and then subtract $x_0$ from $x$ again after getting the approximate function.

Let's say that we have a complex periodic function where the first period is between 0 and 2 and is defined as $y = \cosh(x-1)$. This is what it looks like:

What we mean by this being a periodic function is that if we looked beyond the first period we would see the function looking like this:

We're assuming that our function consists of a summation of sinusoids with different frequencies and amplitudes such that each wavelength is equal to the period of the original function or a whole number division of that period (half, quarter, etc.). To make a sine or cosine have a period equal to $T$, we use $sin(x\frac{2\pi}{T})$ and $cos(x\frac{2\pi}{T})$. Multiplying $x$ by a whole number will change the frequency of the sinusoid such that the wavelength is equal to a whole number division of $T$. For example, $sin(2 x\frac{2\pi}{2})$ will make the sine function repeat itself twice when going from 0 to 2. This is what our sinusoid functions look like for both sine and cosine:

By adding up these sines and cosines together whilst having specific amplitudes (that we need to find) we can get closer and closer to the original periodic function.

The Fourier transform takes advantage of an interesting quirk of sinusoids and the area under their graph. For two positive whole numbers $n$ and $m$, if you multiply $\sin(n x\frac{2\pi}{T})$ by $\sin(m x\frac{2\pi}{T})$ or $\cos(n x\frac{2\pi}{T})$ by $\cos(m x\frac{2\pi}{T})$, the area under the graph between 0 and $T$ will always be zero, with the only exception being when $n = m$. If $n = m$ then the area will be equal to $\frac{T}{2}$. In other words,

$$\int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \sin\left(m x\frac{2\pi}{T}\right) dx = \begin{cases} 0& \text{if } n \neq m\\ \frac{T}{2}& \text{otherwise} \end{cases}$$
and likewise for $\cos$ instead of $\sin$. Here's an example showing the area under the graph of two cosines with different frequencies being multiplied together:

Note how each hump above the x-axis has a corresponding anti-hump below the x-axis which cancel each other out, resulting in a total area of zero. Here's an example showing the area under the graph of two cosines with the same frequency being multiplied together:

Note how the area is all above the x-axis. If we had to measure this area, it would be equal to $\frac{T}{2}$ where $T = 2$. Let's prove this for all frequencies.

Proof that the area under the product of two sines with different frequencies is 0:
\begin{align} \int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \sin\left(m x\frac{2\pi}{T}\right) dx &= \int_{0}^{T} \frac{1}{2}\left(\cos\left(n x\frac{2\pi}{T} - m x\frac{2\pi}{T}\right) - \cos\left(n x\frac{2\pi}{T} + m x\frac{2\pi}{T}\right)\right) dx\\ &= \frac{1}{2}\left(\int_{0}^{T} \cos\left((n - m) \frac{2\pi}{T}x\right) dx - \int_{0}^{T} \cos\left((n + m) \frac{2\pi}{T}x\right) dx\right)\\ &= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left[\sin\left((n - m) \frac{2\pi}{T}x\right)\right]_{0}^{T} - \frac{1}{(n + m) \frac{2\pi}{T}}\left[\sin\left((n + m) \frac{2\pi}{T}x\right)\right]_{0}^{T}\right)\\ &= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) \frac{2\pi}{T}T\right) - \sin\left((n - m) \frac{2\pi}{T}0\right)\right) - \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) \frac{2\pi}{T}T\right) - \sin\left((n + m) \frac{2\pi}{T}0\right)\right)\right)\\ &= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) 2\pi\right) - \sin\left(0\right)\right) - \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) 2\pi\right) - \sin\left(0\right)\right)\right)\\ &= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\ &= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(0 - 0\right) - \frac{1}{(n + m) \frac{2\pi}{T}}\left(0 - 0\right)\right)\\ &= 0\\ \end{align}

Proof that the area under the product of two sines with the same frequencies is half their period:
\begin{align} \int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \sin\left(n x\frac{2\pi}{T}\right) dx &= \int_{0}^{T} \sin^2\left(n x\frac{2\pi}{T}\right) dx\\ &= \int_{0}^{T} \frac{1}{2}\left(1 - \cos\left(2n x\frac{2\pi}{T}\right)\right) dx\\ &= \frac{1}{2}\left(\int_{0}^{T} 1 dx - \int_{0}^{T}\cos\left(2n\frac{2\pi}{T} x\right) dx\right)\\ &= \frac{1}{2}\left(\left[x \right]_{0}^{T} - \frac{1}{2n\frac{2\pi}{T}}\left[\sin\left(2n\frac{2\pi}{T} x\right)\right]_{0}^{T}\right)\\ &= \frac{1}{2}\left(\left(T - 0\right) - \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n\frac{2\pi}{T} T\right) - \sin\left(2n\frac{2\pi}{T} 0\right)\right)\right)\\ &= \frac{1}{2}\left(T - \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n2\pi\right) - \sin\left(0\right)\right)\right)\\ &= \frac{1}{2}\left(T - \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\ &= \frac{1}{2}\left(T - \frac{1}{2n\frac{2\pi}{T}}\left(0 - 0\right)\right)\\ &= \frac{T}{2}\\ \end{align}

Proof that the area under the product of two cosines with different frequencies is 0:
\begin{align} \int_{0}^{T} \cos\left(n x\frac{2\pi}{T}\right) \cos\left(m x\frac{2\pi}{T}\right) dx &= \int_{0}^{T} \frac{1}{2}\left(\cos\left(n x\frac{2\pi}{T} - m x\frac{2\pi}{T}\right) + \cos\left(n x\frac{2\pi}{T} + m x\frac{2\pi}{T}\right)\right) dx\\ &= \frac{1}{2}\left(\int_{0}^{T} \cos\left((n - m) \frac{2\pi}{T}x\right) dx + \int_{0}^{T} \cos\left((n + m) \frac{2\pi}{T}x\right) dx\right)\\ &= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left[\sin\left((n - m) \frac{2\pi}{T}x\right)\right]_{0}^{T} + \frac{1}{(n + m) \frac{2\pi}{T}}\left[\sin\left((n + m) \frac{2\pi}{T}x\right)\right]_{0}^{T}\right)\\ &= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) \frac{2\pi}{T}T\right) - \sin\left((n - m) \frac{2\pi}{T}0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) \frac{2\pi}{T}T\right) - \sin\left((n + m) \frac{2\pi}{T}0\right)\right)\right)\\ &= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) 2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) 2\pi\right) - \sin\left(0\right)\right)\right)\\ &= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\ &= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(0 - 0\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(0 - 0\right)\right)\\ &= 0\\ \end{align}

Proof that the area under the product of two cosines with the same frequencies is half their period:
\begin{align} \int_{0}^{T} \cos\left(n x\frac{2\pi}{T}\right) \cos\left(n x\frac{2\pi}{T}\right) dx &= \int_{0}^{T} \cos^2\left(n x\frac{2\pi}{T}\right) dx\\ &= \int_{0}^{T} \frac{1}{2}\left(1 + \cos\left(2n x\frac{2\pi}{T}\right)\right) dx\\ &= \frac{1}{2}\left(\int_{0}^{T} 1 dx + \int_{0}^{T}\cos\left(2n\frac{2\pi}{T} x\right) dx\right)\\ &= \frac{1}{2}\left(\left[x \right]_{0}^{T} + \frac{1}{2n\frac{2\pi}{T}}\left[\sin\left(2n\frac{2\pi}{T} x\right)\right]_{0}^{T}\right)\\ &= \frac{1}{2}\left(\left(T - 0\right) + \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n\frac{2\pi}{T} T\right) - \sin\left(2n\frac{2\pi}{T} 0\right)\right)\right)\\ &= \frac{1}{2}\left(T + \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n2\pi\right) - \sin\left(0\right)\right)\right)\\ &= \frac{1}{2}\left(T + \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\ &= \frac{1}{2}\left(T + \frac{1}{2n\frac{2\pi}{T}}\left(0 - 0\right)\right)\\ &= \frac{T}{2}\\ \end{align}

Also interestingly, proof that the area under the product of a sine and cosine, regardless of frequency, is 0:
\begin{align} \int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \cos\left(m x\frac{2\pi}{T}\right) dx &= \int_{0}^{T} \frac{1}{2}\left(\sin\left(n x\frac{2\pi}{T} + m x\frac{2\pi}{T}\right) + \sin\left(n x\frac{2\pi}{T} - m x\frac{2\pi}{T}\right)\right) dx\\ &= \frac{1}{2}\left(\int_{0}^{T} \sin\left((n + m) \frac{2\pi}{T}x\right) dx + \int_{0}^{T} \sin\left((n - m) \frac{2\pi}{T}x\right) dx\right)\\ &= \frac{1}{2}\left(\frac{1}{(n + m) \frac{2\pi}{T}}\left[\sin\left((n + m) \frac{2\pi}{T}x\right)\right]_{0}^{T} + \frac{1}{(n - m) \frac{2\pi}{T}}\left[\sin\left((n - m) \frac{2\pi}{T}x\right)\right]_{0}^{T}\right)\\ &= \frac{1}{2}\left(\frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) \frac{2\pi}{T}T\right) - \sin\left((n + m) \frac{2\pi}{T}0\right)\right) + \frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) \frac{2\pi}{T}T\right) - \sin\left((n - m) \frac{2\pi}{T}0\right)\right)\right)\\ &= \frac{1}{2}\left(\frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) 2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) 2\pi\right) - \sin\left(0\right)\right)\right)\\ &= \frac{1}{2}\left(\frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\ &= \frac{1}{2}\left(\frac{1}{(n + m) \frac{2\pi}{T}}\left(0 - 0\right) + \frac{1}{(n - m) \frac{2\pi}{T}}\left(0 - 0\right)\right)\\ &= 0\\ \end{align}

Great! Now we can get back to the Fourier transform. Suppose that we have the following sum of sines and cosines:
\begin{align} f(x) = &a_1 \cos\left(1 x\frac{2\pi}{T}\right) + b_1 \sin\left(1 x\frac{2\pi}{T}\right) + \\ &a_2 \cos\left(2 x\frac{2\pi}{T}\right) + b_2 \sin\left(2 x\frac{2\pi}{T}\right) + \\ &\dots \end{align}

How can we tease out the individual amplitudes $a_i$ and $b_i$? Thanks to the identities we proved above, we can now get any amplitude we want by multiplying the function by a sine or cosine, finding the area under the graph, and dividing the area by $\frac{T}{2}$. Here's how it works:
\begin{align} \frac{2}{T}\int_0^T \cos\left(n x\frac{2\pi}{T}\right) f(x) dx &= \frac{1}{2T}\int_0^T \cos\left(n x\frac{2\pi}{T}\right) \begin{pmatrix} &a_1 cos\left(1 x\frac{2\pi}{T}\right) & + b_1 sin\left(1 x\frac{2\pi}{T}\right) + \\ &a_2 cos\left(2 x\frac{2\pi}{T}\right) & + b_2 sin\left(2 x\frac{2\pi}{T}\right) + \\ &\dots& \end{pmatrix} dx\\ &= \frac{2}{T} \begin{pmatrix} &a_1 \int_0^T\cos\left(n x\frac{2\pi}{T}\right) cos\left(1 x\frac{2\pi}{T}\right) dx & + b_1 \int_0^T\cos\left(n x\frac{2\pi}{T}\right) sin\left(1 x\frac{2\pi}{T}\right) dx + \\ &\dots&\\ &a_n \int_0^T\cos\left(n x\frac{2\pi}{T}\right) cos\left(n x\frac{2\pi}{T}\right) dx & + b_n \int_0^T\cos\left(n x\frac{2\pi}{T}\right) sin\left(n x\frac{2\pi}{T}\right) dx + \\ &\dots& \end{pmatrix}\\ &= \frac{2}{T} \begin{pmatrix} &a_1 0 & + b_1 0 + \\ &\dots&\\ &a_n \frac{T}{2} & + b_n 0 + \\ &\dots& \end{pmatrix}\\ &= \frac{2}{T} a_n \frac{T}{2}\\ &= a_n \end{align}

And obviously multiplying by $\sin\left(n x\frac{2\pi}{T}\right)$ would yield the amplitude of a sine. All that's left is by how much to vertically lift the graph, that is, the constant term to add to the sinusoids. The sinusoids should oscillate around the average value of the original function in one period, which is found as follows:

$y_0 = \frac{1}{T} \int_0^T f(x) dx$

Getting back to $f(x) = \cosh(x-1)$, here are the amplitudes:

$y_0 = \frac{1}{2} \int_0^2 f(x) dx = 1.1752$
$a_1 = \frac{2}{2} \int_0^2 \cos\left(1 x\frac{2\pi}{2}\right) f(x) dx = 0.2162$
$b_1 = \frac{2}{2} \int_0^2 \sin\left(1 x\frac{2\pi}{2}\right) f(x) dx = 0.0$
$a_2 = \frac{2}{2} \int_0^2 \cos\left(2 x\frac{2\pi}{2}\right) f(x) dx = 0.0581$
$b_2 = \frac{2}{2} \int_0^2 \sin\left(2 x\frac{2\pi}{2}\right) f(x) dx = 0.0$

And here are the graphs of the approximated function getting better with every new term (where the approximated function is denoted as $\hat{f}$):

$\hat{f}(x) = 1.1752$

$\hat{f}(x) = 1.1752 + 0.2162 \cos\left(1 x\frac{2\pi}{2}\right)$

$\hat{f}(x) = 1.1752 + 0.2162\cos\left(1 x\frac{2\pi}{2}\right) + 0.0581\cos\left(2 x\frac{2\pi}{2}\right)$

And of course here is the function beyond the first period:

## Wednesday, July 31, 2019

### Recognising simple shapes with OpenCV in Python

Convolutional neural networks are usually used to recognise objects in images but that's a bit of an overkill if you just want to recognise simple flat shapes. Here's how to use OpenCV to do simple object recognition.

Basically the trick to recognise polygons is to convert your image into an approximate polygon representation using something like edge detection and then count the number of sides in the polygon. OpenCV handles a lot of this stuff for you so its quite easy.

Here's the image we will be working on:

The first step is to binarise the pixels of the image, that is they are made either black or white. First the image must be turned into greyscale so that there is just one number per pixel and then anything that is not white (greyscale value 255) is replaced with 255 (white) whilst pure white is replaced with 0 (black).

import cv2
import numpy as np

#Make image greyscale
grey = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
cv2.imshow('greyscale', grey)

#Binarise greyscale pixels
(_, thresh) = cv2.threshold(grey, 254, 255, 1)
cv2.imshow('thresholded', thresh)

Next we're going to find contours around these white shapes, that is, reduce the image into a set of points with each point being a corner in the shape. This is done using the findContours function which returns a list consisting of contour points for every shape and their hierarchy. The hierarchy is the way each shape is nested within other shapes. We won't need this here since we don't have any shapes within shapes but it will be useful in other situations.

#Get shape contours
(contours, hierarchy) = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
thresh_copy = cv2.cvtColor(thresh, cv2.COLOR_GRAY2RGB)
cv2.drawContours(thresh_copy, contours, contourIdx=-1, color=(0, 255, 0), thickness=2)
print('number of sides per shape:')
for contour in contours:
print('', contour.shape[0])
print()
cv2.imshow('contours', thresh_copy)

number of sides per shape:
119
122
4
124

Unfortunately, the number of sides extracted from each shape is nowhere near what it should be. This is because the corners in the shapes are rounded which results in multiple sides around the corner. We therefore need to simplify these contours so that insignificant sides can be removed. This is done using the Ramer–Douglas–Peucker algorithm which is implemented as the approxPolyDP function. The algorithm requires a parameter called epsilon which controls how roughly to chop up the polygon into a smaller number of sides. This number is dependent on the size of the polygon so we make it a fraction of the shape's perimeter.

#Simplify contours
thresh_copy = cv2.cvtColor(thresh, cv2.COLOR_GRAY2RGB)
print('number of sides per shape:')
for contour in contours:
perimeter = cv2.arcLength(contour, True)
e = 0.05*perimeter #The bigger the fraction, the more sides are chopped off the original polygon
contour = cv2.approxPolyDP(contour, epsilon=e, closed=True)
cv2.drawContours(thresh_copy, [ contour ], contourIdx=-1, color=(0, 255, 0), thickness=2)
print('', contour.shape[0])
cv2.imshow('simplified contours', thresh_copy)

number of sides per shape:
6
5
4
3

And with this we have the number of sides in each polygon.

If you want to check for circles as well as polygons then you will not be able to do so by counting sides. Instead you can get the minimum enclosing circle around a contour and check if its area is close to the area of the contour (before it is simplified):

print('circle')

The minimum enclosing circle is the smallest circle that completely contains the contour. Therefore it's area must necessarily be larger or equal to the shape of the contour, which can only be equal in the case that the contour is a circle.

This is also one way how you can get the position of each shape, by getting the centre point of the enclosing circle. The proper way to get a single coordinate representing the position of the contour is to get the centre of gravity using the contour's moments:

m = cv2.moments(contour)
x = m['m10']/m['m00']
y = m['m01']/m['m00']

Here's the full code:

import cv2
import numpy as np

#Binarise pixels
grey = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
(_, thresh) = cv2.threshold(grey, 254, 255, 1)

#Get shape contours
(contours, hierarchy) = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

#Recognise shapes
for contour in contours:
m = cv2.moments(contour)
x = round(m['m10']/m['m00'])
y = round(m['m01']/m['m00'])

shape_name = None

shape_name = 'circle'
else:
e = 0.05*cv2.arcLength(contour, True)
simple_contour = cv2.approxPolyDP(contour, epsilon=e, closed=True)
num_sides = simple_contour.shape[0]
shape_name = { 3: 'triangle', 4: 'quad', 5: 'pentagon', 6: 'hexagon' }.get(num_sides, 'unknown')

cv2.putText(img, shape_name, (x, y), fontFace=cv2.FONT_HERSHEY_SIMPLEX, fontScale=0.6, color=(0, 255, 0), thickness=2)

cv2.imshow('named shapes', img)

## Friday, June 28, 2019

Have you ever been training your neural network and suddenly got a bunch of NaNs as loss values? This is usually because of something called the exploding gradient problem, which is when your neural network's gradients, used to train it, are extremely large, leading to overflow errors when performing backpropagation.

The exploding gradient problem is mostly caused by using large weight values in combination with using a large number of layers. It happens a lot with RNNs which would be neural networks with as many layers as there are items in the input sequence. Basically, given a particular activation function, with every additional layer the gradient of the parameters in the first layer can either vanish or explode as the gradient consists of a multiplication of the weights of the layers in front of it together. If the weights are fractions then their product will get closer and closer to zero (vanishes) with every additional layer, whilst if the weights are large then their product will get closer and closer to infinity (explodes) with every additional layer. Vanishing gradients lead to no learning in the early layers whilst exploding gradients leads to NaNs.

In every explanation of this that I see, I always just see general reasoning without any actual examples of what this actually looks like. In order to illustrate what exploding gradients actually look like, we're going to make some simplifications such as that we're building a neural network with just one neural unit per layer and the single weight and bias in each layer will be the same. This is not a necessary condition to reach exploding gradients, but it will help visualise what is going on. Let's see an example of what happens when we have just 3 layers and a weight equal to 8.

ReLU is an activation function that is known for mitigating the vanishing gradient problem, but it also makes it easy to create exploding gradients if the weights are large enough, which is why weights must be initialised to very small values. Here is what the graph produced by a neural network looks like when it consists of single ReLU units in each layer (single number input and single number output as well) as the number of layers varies between 1 (pink), 2 (red), and 3 (dark red).
$y = \text{relu}(w \times x)$
$y = \text{relu}(w \times \text{relu}(w \times x))$
$y = \text{relu}(w \times \text{relu}(w \times \text{relu}(w \times x)))$

See how the steepness grows exponentially with every additional layer? That's because the gradient is basically the product of all the weights in all the layers, which means that, since $w$ is equal to 8, the gradient is increasing 8-fold with each additional layer.

Now with ReLU, this sort of thing is kind of expected as there is not bound to the value returned by it, so there should also be no bound to the gradient. But this also happens with squashing functions like tanh. Even though its returned value must be between 1 and -1, the gradient at particular points of the function can also be exceptionally large. This means that if you're training the neural network at a point in parameter space which has a very steep gradient, even if it's just a short burst, you'll end up with overflow errors or at the very least with shooting off the learning trajectory. This is what the graph produced by a neural network looksl ike when it consists of single tanh units in each layer.
$y = \text{tanh}(w \times x)$
$y = \text{tanh}(w \times \text{tanh}(w \times x))$
$y = \text{tanh}(w \times \text{tanh}(w \times \text{tanh}(w \times x)))$

See how the slope that transitions the value from -1 to 1 gets steeper and steeper as the number of layers increases? This means that if you're training a simple RNN that uses tanh as an activation function, you can either make the state weights very small or make the initial state values very large in order to stay off the middle slope. Of course this would have other problems as then the initial state wouldn't be able to easily learn to set the initial state to any set of values. It might also be the case that there is no single slope but that there are several slopes throughout the graph since the only limitation that tanh imposes is that all values occur between -1 and 1. For example, if the previous layer learns to perform some kind of sinusoid-like pattern that alternates between -5 and 5 (remember that a neural network with large enough layers can approximate any function), then this is what passing that through tanh would look like (note that this equation is a valid neural network with a hidden layer size of 3 and a single input and output):
$y = \text{tanh}(5 \times \text{tanh}(40 \times x-5) - 5 \times \text{tanh}(40 \times x+0) + 5 \times \text{tanh}(40 \times x+5))$

In this case you can see how it is possible for there to be many steep slopes spread around the parameter space, depending on what the previous layers are doing. Your parameter space could be a minefield.

Now sigmoid is a little more tricky to see how it explodes its gradients. If you just do what we did above, you'll get the following result:
$y = \text{sig}(w \times x)$
$y = \text{sig}(w \times \text{sig}(w \times x))$
$y = \text{sig}(w \times \text{sig}(w \times \text{sig}(w \times x)))$

The slopes actually get flatter with every additional layer and get closer to $y=1$. This is because, contrary to tanh, sigmoid bounds itself to be between 0 and 1, with sigmoid(0) = 0.5. This means that $\text{sig}(\text{sigmoid}(x))$ will be bounded between 0.5 and 1, since the lowest the innermost sigmoid can go is 0, which will be mapped to 0.5 by the outer most sigmoid. With each additional nested sigmoid you'll just be pushing that lower bound closer and closer toward 1 until the graph becomes a flat line at $y=1$. In fact, in order to see exploding gradients we need to make use of the biases (which up till now were set to 0). Setting the biases to $-\frac{w}{2}$ gives very nice curves:
$y = \text{sig}(w \times x)$
$y = \text{sig}(w \times \text{sig}(w \times x) - \frac{w}{2})$
$y = \text{sig}(w \times \text{sig}(w \times \text{sig}(w \times x) - \frac{w}{2}) - \frac{w}{2})$

Note that with sigmoid the steepness of the slope increases very slowly compared to tanh, which means that you'll need to use either larger weights or more layers to get the same dramatic effect. Also note that all these graphs were for weights being equal to 8, which is really large, but if you have many layers as in the case of a simple RNN working on long sentences, even a weight of 0.7 would explode after enough inputs.

## Thursday, May 30, 2019

### Word square maker

I made a word square searcher. A word square is a square grid with equal length words in each row and column such that the word in the first row is the word in the first column, the word in the second row is the word in the second column, and so on. Here's an example:
care
acid
ring
edge

It works by taking a file of words and indexing them by character and position such that I can find all the words that have a particular character at a particular position. The program will then build every possible word square that can be made from these words. It goes through all the words and tries them all as a first row/column in a word square. Each word is used as a basis for finding the rest of the words.

In the above example, the first row/column is "care". In order to find a suitable word for the second row/column, the index is used to find all the words that have a first letter equal to the first row's second letter. "acid"'s first letter is equal to "care"'s second letter, so it is a good candidate. Each one of these candidate words is used tentatively such that if it doesn't pan out then we can backtrack and try another candidate word. For the third row/column, the index is used to find all the words that have a first letter equal to the first row's third letter and a second letter equal to the second row's third letter. The intersection of these two groups of words would be all the suitable words that can be used as a third row. "ring"'s first letter is equal to "care"'s third letter and its second letter is equal to "acid"'s third letter as well, so it is a good candidate for the third row. This process keeps on going until none of the words make a word square given the first row/column. The program then moves on to another possible word in the first row/column.

You can find the program here: https://onlinegdb.com/By7XtdhpE.

## Wednesday, May 1, 2019

### Neutral Room Escape Game: Lights - Plus 7 Times 10 puzzle solver

I've been playing a point and click game called Lights which features a puzzle consisting of a 4 digit counter which starts from '0000' and which needs to be made equal to a certain number by a sequence of operations which are to either add 7 to the current value or multiply the current value by 10. I wrote a Python program to quickly solve the puzzle and I'm sharing the code for any one who needs to use it as well. You can find the program online at https://onlinegdb.com/rkg_l_ZDiV. Just run the program, enter the number you need to reach when requested and you'll see a sequence of steps showing which operations to perform in order to solve the puzzle in the least number of steps. The program works using a breadth first search which explores all possible moves without repeating previously reached numbers until the goal number is reached. Enjoy!

## Tuesday, April 30, 2019

### Sliding tile puzzle solver

So I've been playing a point and click game called Loom Blend which features a sliding tile puzzle. I wrote a Python program to quickly solve the puzzle there and I'm sharing the code for any one who needs to use it as well. You can find the program online at https://onlinegdb.com/BkkrnXLsN. Just run the program and you'll see a sequence of steps showing which tiles to move in order to solve the puzzle in the least number of steps. The program works using a breadth first search which explores all possible moves without repeating previously reached tile configurations until the goal configuration is reached. Enjoy!

## Sunday, March 17, 2019

### The Gambler's Fallacy: Why after tossing 100 heads in a row you can still get heads again

The Gambler's fallacy is a deceptively intuitive fallacy about how randomness works. Imagine you're tossing a fair coin repeatedly. You're astonished to see that every toss is resulting in heads. Heads, heads, heads, heads, over and over. As the number of consequtive heads keeps going up, does the probability of getting tails start increasing? It seems reasonable to conclude this as the number of heads and tails should balance each other out given that the coin is fair, as is stated by the Law of Large Numbers. But the idea that there is some kind of memory in the coin that is used to prevent the fair coin from acting as if it is unfair is wrong.

First, let's show that the Gambler's Fallacy is actually a fallacy by simulating a bunch of coin tosses in a program. We'll keep track of all the times a streak of 10 heads was obtained and the following toss after it and see if the number of heads following the streak is significantly less than the number of tails. We do not consider 11 heads in a row as two streaks of 10 heads but instead start searching for a completely new streak after every streak. Here is a Python 3 program to do that (we also count the number of heads and tails obtained in general to show that the simulated coin is not biased).

import random

seed = 0
num_trails = 1000000

random.seed(seed)

num_tails = 0
next_was_tail = 0
for trial in range(num_trails):

else:
num_tails += 1

else:
else:
else:
next_was_tail += 1

And here are the results:

Both the ratio of heads and tails in general and the ratio of heads and tails following a streak of 10 heads in a row are close to 1. You can try changing the streak length limit to see how making it shorter does not significantly change this ratio.

So why does this happen? It's a simple matter of probability theory. Both the probability of getting a head and of getting a tail is 0.5 and each toss is independent from the previous one. Therefore the probability of getting a head after 10 heads is $0.5^{10} \times 0.5 = 0.5^{11}$ and the probability of getting a tail after 10 heads is also $0.5^{10} \times 0.5 = 0.5^{11}$. It does not make a difference what the previous tosses were, the next toss being heads will always have a probability of 0.5, so both heads and tails are equally likely.

So then the more interesting question is why do we think otherwise? Why do we think that as the chain of consecutive heads gets longer, the probability of getting another head goes down? It could be because we expect the coin to be fair and a coin that never gives tails is a biased coin, but that doesn't really mean anything in terms of predicting what the next toss will be. It is entirely possible that the fairest possible coin will give a million heads in a row. There is nothing stopping this from happening. Not only that, but a million heads in a row is just as likely to happen as any other random looking sequence resulting from a million coin tosses. Let's write down a sequence of heads and tails as a string of 'H's and 'T's. The sequence 'HHHHHH' is just as likely to come up as the sequence 'HHHTTH', yet we think of the first sequence as being unlikely whilst of the second sequence as... what? Is the second sequence more likely to come up than the first? Would you bet more money on the second sequence than the first? This is what what I think is the root cause of the Gambler's Fallacy.

What I think happens in our mind is that we don't really predict whether the next toss will give a heads or a tails but whether the whole sequence will turn out to be an 'interesting' looking sequence or not. We love patterns, and when we see that a random sequence generator (coin tosses in this case) is giving out a sequence that looks like a simple pattern, we become suspicious of the randomness behind the sequence generator and assume that it isn't really random. I mean we're pretty terrible at generating random sequences manually because we assume that in a random process you will not get things like a streak of heads. But it's not really just a streak of heads that would make us suspicious isn't it? It's also a sequence of alternating heads and tails for example or a long sequence of heads followed by a long sequence of tails. I think that the source of the Gambler's Fallacy is that we categorise entire sequences into interesting and uninteresting sequences. As we watch a sequence being generated one toss at a time we expect that interesting sequences of that length are much less likely to happen that uninteresting ones, which is true. This makes us expect the next toss to break the interestingness of the current sequence, which, in the case of a streak of heads, is by the next toss turning out to be tails. Of course, although interesting sequences are less likely to happen as the length grows, given that the probability of the interesting pattern becoming uninteresting on the next toss is equal to the probability of it remaining interesting, it is still an incorrect assumption.

So what is the probability of an interesting sequence? This is a subjective question but we can still try to investigate it. Ideally a survey is conducted on a population of people to check what is considered as interesting by people on average. But that would be weird so I'll just conduct the survey on myself. I have generated all the possible coin tosses that can be obtained after 4 coin tosses and I will be annotating each sequence according to what I find interesting in it. I also did the same on 6 coin tosses to see how the ratio of interesting to uninteresting sequences changes with sequence length.

Here is what I find interesting in sequences of 4 coin tosses.
 TTTT same TTTH not interesting TTHT not interesting TTHH switched THTT not interesting THTH repeated THHT mirrored THHH not interesting HTTT not interesting HTTH mirrored HTHT repeated HTHH not interesting HHTT switched HHTH not interesting HHHT not interesting HHHH same

And here is what I find interesting in sequences of 6 coin tosses (organised into two columns to reduce vertical space).
 TTTTTT same TTTTTH not interesting TTTTHT not interesting TTTTHH not interesting TTTHTT not interesting TTTHTH not interesting TTTHHT not interesting TTTHHH switched TTHTTT not interesting TTHTTH doubled TTHTHT not interesting TTHTHH not interesting TTHHTT mirrored TTHHTH not interesting TTHHHT not interesting TTHHHH not interesting THTTTT not interesting THTTTH not interesting THTTHT mirrored THTTHH not interesting THTHTT not interesting THTHTH alternated THTHHT not interesting THTHHH not interesting THHTTT not interesting THHTTH not interesting THHTHT not interesting THHTHH doubled THHHTT not interesting THHHTH not interesting THHHHT mirrored THHHHH not interesting HTTTTT not interesting HTTTTH mirrored HTTTHT not interesting HTTTHH not interesting HTTHTT doubled HTTHTH not interesting HTTHHT don't know HTTHHH not interesting HTHTTT not interesting HTHTTH not interesting HTHTHT alternated HTHTHH not interesting HTHHTT not interesting HTHHTH mirrored HTHHHT not interesting HTHHHH not interesting HHTTTT not interesting HHTTTH not interesting HHTTHT not interesting HHTTHH mirrored HHTHTT not interesting HHTHTH not interesting HHTHHT doubled HHTHHH not interesting HHHTTT switched HHHTTH not interesting HHHTHT not interesting HHHTHH not interesting HHHHTT not interesting HHHHTH not interesting HHHHHT not interesting HHHHHH same

Among the 4 coin toss sequences, 8 out of the 16 sequences were considered interesting, meaning that you have a 50% chance of generating an interesting sequence. But among the 6 coin toss sequences, only 16 out of the 64 sequences were considered interesting, meaning that you have a 25% chance of generating an interesting sequence. Now I don't know if the patterns I identified in the tables above are the only interesting patterns for any sequence length, since maybe some patterns only emerge in very long sequences, but it is safe to say that the longer the sequence, the greater the proportion of uninteresting and random looking sequences will be. Therefore, the longer the sequence, the less likely that a sequence will be interesting. This might be what is going on when we think that the probability of getting a tails goes up as the streak of heads gets longer; we'd really be comparing how likely it is for a sequence of a given length to be uninteresting, which becomes less likely as the length grows.

Interesting patterns are interesting
I have to admit that whilst annotating interesting sequences I was more interested in being consistent in my reasoning than in saying what I genuinely found interesting. This is because I didn't actually feel that, for example, all "repeated" patterns were interesting, just some of them. But I wanted to avoid explaining myself too much based on what I felt. This means that there might be more going on in feeling that a sequence is aesthetically pleasing than simple mathematical patterns. It would be interesting if someone made an experiment where people are subjected to an EEG to see what is really considered interesting and what is not. How knows, maybe it might even turn out to be a kind of Rorschach test where different patterns of interest indicate different personalities. And maybe it might even be an interesting machine learning challenge to try to make a program predict which sequences would be considered interesting by humans in order to quantify how interesting a given sequence is.

## Tuesday, February 26, 2019

### MySQL extra inner join taking too long: optimizer_search_depth

MySQL has a query optimiser that takes in SQL queries and tries to find the best way to execute them. Then the query gets executed using the best execution plan found. When you use joins in your query, the time the optimiser takes to finish optimising is exponential to the number of joins. This means that for a few joins it only takes a negligible amount of time to finish but after a point, which is somewhere between 7 and 10 tables, the optimisation time will shoot up. In my case I went from 0.1 seconds to 17 seconds but just adding another table in my joins.

To check if this is happening to you, you can profile your query by executing the following:

SET profiling = 1;

SHOW PROFILE FOR QUERY 1;

(If you using phpMyAdmin you might need to add "SHOW PROFILE;" as well at the end)

This will show you a table of execution times for each phase in the execution of the query. The time taken by the query optimiser is in the "statistics" row. If this is too high, then you need to stop your optimiser from spending so much time.

Controlling how much time the optimiser should spend doing its job can be accomplished using the optimizer_search_depth variable. This is the maximum depth to search (presumably in some tree search algorithm) when optimising the query and is unfortunately set to the highest value by default. Being set to the highest value makes it the most likely to find the best execution plan but may also be unnecessarily time consuming. Thankfully there is a better default value to use: 0. When this variable is set to zero, the optimiser automatically picks a value based on the query and that seems to work well for me. To do this just execute the following query:

SET optimizer_search_depth = 0;

## Tuesday, January 29, 2019

### Python string format quick guide

This is a quick summary guide for when you just want to remind yourself how to format a value with Python's 'string'.format().

Referencing

Curly brackets are used to refer to parameter values in the 'format' method. On their own, the first pair of curly brackets will refer to the first parameter, second to the second, etc.

'{} {} {}'.format('a', 1, True)
a 1 True

Alternatively you can specify which parameter you want to use where using indexes.

'{0} {0} {2}'.format('a', 1, True)
a a True

You can even use names instead of indexes.

'{a} {b} {c}'.format(a='a', b=1, c=True)
a 1 True

And even refer to indexes in lists and to instance variables in objects.

'{a[0]} {b.numerator}'.format(a=[3,1,2], b=fractions.Fraction(1,2))
3 1

Formatting

Adding a colon inside the curly brackets allows you to format the values before they are added to the string. You can combine these with any of the above methods of referencing. The following formatting meta characters are to be used in the following order:
{:[alignment] [grouping] [precision] [data type]}

An alignment meta character is either >, <, or ^. The number after it is the field width in which to align.

'|{:>3}|{:<3}|{:^3}|'.format('a', 'b', 'c')
|  a|b  | c |

Any character before the alignment meta character is used instead of spaces.
'{:_>5} {:0>5}'.format('a', 12)
____a 00012

Grouping
You can group digits in numbers into triples with commas automatically by just putting a comma in the curly brackets.

'{:,}'.format(12345678)
12,345,678

Precision
When formatting floats you can specify how many digits to keep after the point by adding a dot and a number. By default the format of the number is in scientific notation which means that your floating point number will look like 12e-4 with the precision referring to the number of digits in the mantissa (the number before 'e'). See the next section to format your number in fixed point notation.

'{:.3}'.format(12345678.0)
1.23e+07

Data types
This is the most important part. The data type always comes at the end and is a single letter.

You can use it to change the numeric base of a whole number.
'dec-{:d} hex-{:X} oct-{:o} bin-{:b}'.format(16, 16, 16, 16)
dec-16 hex-10 oct-20 bin-10000

You can use it to convert unicode values to characters.
'{:c}'.format(97)
a

You can use it to convert numbers into scientific notations,
'{:e}'.format(12345678)
1.234568e+07

or fixed point notations,
'{:f}'.format(12345678)
12345678.000000

or percentages.
'{:%}'.format(0.1234)
12.340000%

Mixed examples

'{:0>2X}'.format(10)
0A

'{:,d}'.format(1234567)
1,234,567

'{:,.2f}'.format(12345.6)
12,345.60

'{:.2%}'.format(0.1234)
12.34%