tag:blogger.com,1999:blog-43189444598231434732019-12-10T03:22:31.383+01:00Geeky is AwesomeTo learn is to teach.Unknownnoreply@blogger.comBlogger121125tag:blogger.com,1999:blog-4318944459823143473.post-75783353664894230552019-11-30T11:25:00.002+01:002019-11-30T11:25:41.184+01:00Texture Descriptors of Image Regions using Local Binary PatternsA very simple way to measure the similarity between the texture of regions in images is to capture the region's texture information in a vector using local binary patterns. These vectors are invariant to intensity changes, fast to compute, and very easy to understand. Let's see how this works. Note that since we're focusing on texture then we don't care about colour but about intensities, so the image needs to be greyscale.<br /><br />The first step is to replace all the pixels in the image region with LBP (local binary pattern) codes. For every pixel, draw an imaginary circle of some radius centred on the pixel and draw some number of imaginary dots on that circle that are equally spaced apart. The radius and number of dots are to be hand tuned in order to maximise performance but they are usually set to a radius of 1 pixel and a number of dots of 8. Now pick the pixels that fall under those dots together with the pixel at the centre.<br /><br /><a href="https://2.bp.blogspot.com/-MjBycXxzkUs/XeJDOkmacGI/AAAAAAAABcY/4mGGaFmxpq8qO6SKgYQO1uR4-9stFoQLgCLcBGAsYHQ/s1600/lbp_pixel_selection.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-MjBycXxzkUs/XeJDOkmacGI/AAAAAAAABcY/4mGGaFmxpq8qO6SKgYQO1uR4-9stFoQLgCLcBGAsYHQ/s320/lbp_pixel_selection.png" width="320" height="117" data-original-width="604" data-original-height="220" /></a><br /><br />In the surrounding pixels, any pixel that has a lower intensity than the central pixel gets a zero and all the rest get a one. These numbers are put together in order to form a binary number called the LBP code of the central pixel.<br /><br /><a href="https://2.bp.blogspot.com/-9HWfiVbE1E8/XeJDTRTUPoI/AAAAAAAABcc/T_N8YaTxyPEkgU6EuV5cgqC0uCBJn-g0QCLcBGAsYHQ/s1600/lbp_code.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-9HWfiVbE1E8/XeJDTRTUPoI/AAAAAAAABcc/T_N8YaTxyPEkgU6EuV5cgqC0uCBJn-g0QCLcBGAsYHQ/s320/lbp_code.png" width="320" height="114" data-original-width="405" data-original-height="144" /></a><br /><br />Do this for every pixel in the region and then count how many times each code appears in the region, putting all the frequencies in a vector (a histogram). This is your texture descriptor and you can stop here. The more similar the vector is to other vectors, the more similar the texture of the image regions are.<br /><br />We can do better than this however by making the codes uniform. If you do an analysis of these LBP code frequencies, you'll find that most of them have a most two bit changes in them. For example, 00011100 has two bit changes, one from a 0 to a 1 and another after from a 1 to a 0. On the other hand 01010101 has 7 changes. The reason why the majority will have at most two bit changes is because, most of the time, the pixels around a central pixel do not have random intensities but change in a flow as shown in the figure above. If there is a gradient of intensity changes in a single direction then there will only be two bit changes. This means that we can replace all LBP codes with more than two bit changes with a single LBP code that just means 'other'. This is called a uniform LBP code and it reduces our vector size significantly without much of a performance hit, which is good.<br /><br />We can do even better by making the LBP codes rotation invariant, that is, using the same LBP code regardless of where the gradient direction is pointing. In the above figure, if the line of dark grey pixels were rotated, the LBP code would be changed to something like 00011110 for example or 11110000. The only thing that would change when rotating the surrounding pixels would be the bit rotation of the LBP code. We can make our codes invariant by artificially rotating our LBP codes so that for a given number of ones and zeros in the code we always have a canonical code. Usually we treat the binary code as an integer and rotate the bits until we get the minimum integer. For example, here are all the bit rotations of four 1s and four 0s:<br /><br />11110000 (240), 01111000 (120), 00111100 (60), 00011110 (30), 00001111 (15)<br /><br />In this case we would pick 00001111 as our canonical representation as that is the smallest integer. This drastically reduces our vector sizes as the amount of different possible LBP codes to find the frequency of will be a small fraction of the original full set of LBP codes.<br /><br />You can use LBP descriptors using Python's skimage and numpy as follows:<br /><pre class="brush:python">import numpy as np<br />import skimage<br />import skimage.feature<br /><br />image = skimage.io.imread('img.png', as_gray=True)<br />region = image[0:100,0:100]<br />lbp_codes = skimage.feature.local_binary_pattern(region, P=8, R=1, method='uniform') #8 dots, radius 1, rotation invariant uniform codes.<br />descriptor = np.histogram(lbp_codes, bins=10, range=(0, 10))[0] #Number of bins and range changes depending on the number of dots and LBP method used.<br /></pre><br />The number of bins and range needed can be obtained by artificially generating all the possible arrangements of darker and lighter pixels around a central pixel:<br /><pre class="brush:python">def get_histogram_params(dots, method):<br /> all_possible_lbp_codes = set()<br /> for i in range(2**8):<br /> str_bin = '{:0>8b}'.format(i)<br /> region = [ {'0':0,'1':2}[ch] for ch in str_bin ] #Surrounding pixels.<br /> region.insert(4, 1) #Central pixel.<br /> region = np.array(region).reshape((3,3))<br /> lbp_codes = skimage.feature.local_binary_pattern(region, dots, 1, method) #Radius is minimum to keep region size to a minimum.<br /> all_possible_lbp_codes.add(lbp_codes[1,1]) #Get code of central pixel and add it to the set (keeps on unique values).<br /> num_bins = len(all_possible_lbp_codes)<br /> rng = (min(all_possible_lbp_codes), max(all_possible_lbp_codes)+1)<br /> return (num_bins, rng)<br /></pre><br />You can get more information about the skimage function from <a href="https://scikit-image.org/docs/dev/api/skimage.feature.html#skimage.feature.local_binary_pattern">here</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-3197103515384156572019-10-30T23:01:00.002+01:002019-10-30T23:07:37.650+01:00No square number can end in 2, 3, 7, or 8 (modulo of square numbers)Look at the following list of square numbers:<br /><pre>0^2: 0<br /> 1^2: 1<br /> 2^2: 4<br /> 3^2: 9<br /> 4^2: 16<br /> 5^2: 25<br /> 6^2: 36<br /> 7^2: 49<br /> 8^2: 64<br /> 9^2: 81<br />10^2: 100<br />11^2: 121<br />12^2: 144<br />13^2: 169<br />14^2: 196<br />15^2: 225<br />16^2: 256<br />17^2: 289<br />18^2: 324<br />19^2: 361<br />20^2: 400<br />21^2: 441<br />22^2: 484<br />23^2: 529<br />24^2: 576<br />25^2: 625<br />26^2: 676<br />27^2: 729<br />28^2: 784<br />29^2: 841<br />30^2: 900<br /></pre><br />Notice how the last digit is always 0, 1, 4, 9, 6, or 5? There also seems to be a cyclic pattern of digits repeating themselves but we'll talk about that in another post. In this post, we'll just talk about why certain digits do not occur at the end of square numbers, which is useful in order to check if a number is a square or not.<br /><br />We can get the last digit of a number by dividing it by 10 and keeping the remainder, also known as the number modulo 10:<br /><br /><pre>0^2 mod 10: 0<br /> 1^2 mod 10: 1<br /> 2^2 mod 10: 4<br /> 3^2 mod 10: 9<br /> 4^2 mod 10: 6<br /> 5^2 mod 10: 5<br /> 6^2 mod 10: 6<br /> 7^2 mod 10: 9<br /> 8^2 mod 10: 4<br /> 9^2 mod 10: 1<br />10^2 mod 10: 0<br />11^2 mod 10: 1<br />12^2 mod 10: 4<br />13^2 mod 10: 9<br />14^2 mod 10: 6<br />15^2 mod 10: 5<br />16^2 mod 10: 6<br />17^2 mod 10: 9<br />18^2 mod 10: 4<br />19^2 mod 10: 1<br />20^2 mod 10: 0<br />21^2 mod 10: 1<br />22^2 mod 10: 4<br />23^2 mod 10: 9<br />24^2 mod 10: 6<br />25^2 mod 10: 5<br />26^2 mod 10: 6<br />27^2 mod 10: 9<br />28^2 mod 10: 4<br />29^2 mod 10: 1<br />30^2 mod 10: 0<br /></pre><br />It turns out that this sort of exclusion of certain modulo results from square numbers happens with many other numbers apart from 10. Take 12 for example:<br /><br /><pre>0^2 mod 12: 0<br /> 1^2 mod 12: 1<br /> 2^2 mod 12: 4<br /> 3^2 mod 12: 9<br /> 4^2 mod 12: 4<br /> 5^2 mod 12: 1<br /> 6^2 mod 12: 0<br /> 7^2 mod 12: 1<br /> 8^2 mod 12: 4<br /> 9^2 mod 12: 9<br />10^2 mod 12: 4<br />11^2 mod 12: 1<br />12^2 mod 12: 0<br />13^2 mod 12: 1<br />14^2 mod 12: 4<br />15^2 mod 12: 9<br />16^2 mod 12: 4<br />17^2 mod 12: 1<br />18^2 mod 12: 0<br />19^2 mod 12: 1<br />20^2 mod 12: 4<br />21^2 mod 12: 9<br />22^2 mod 12: 4<br />23^2 mod 12: 1<br />24^2 mod 12: 0<br />25^2 mod 12: 1<br />26^2 mod 12: 4<br />27^2 mod 12: 9<br />28^2 mod 12: 4<br />29^2 mod 12: 1<br />30^2 mod 12: 0<br /></pre><br />With 12 you only get 0, 1, 4, and 9 as digits which is less than with 10. Why does this happen? We can prove that this is the case using <a href="https://en.wikipedia.org/wiki/Modulo_operation#Properties_(identities)">modulo algebra</a>. Let's call the number being squared $x$ and the number after the modulo $n$. Then by the distributive law of the modulo operation:<br /><br />$x^2 \text{ mod } n = (x \times x) \text{ mod } n = ((x \text{ mod } n) \times (x \text{ mod } n)) \text{ mod } n = (x \text{ mod } n)^2 \text{ mod } n$<br /><br />What the above derivation says is that when dividing a square number, the remainder will be based on the remainder of the root of the square number. In the case of the last digit of a square number, which is equivalent to the square number modulo 10, the last digit will be one of the last digits of the single digits squared: <b>0</b>, <b>1</b>, <b>4</b>, <b>9</b>, 1<b>6</b>, 2<b>5</b>, 3<b>6</b>, 4<b>9</b>, 6<b>4</b>, 8<b>1</b>. This is because $x \text{ mod } 10$ gives you the last digit of the number $x$, which is a single digit, and then this is squared and the last digit of the square is the answer.<br /><br />In the general case, this is setting a limit on the number of possible remainders (usually $x \text{ mod } n$ has $n$ different possible answers for all $x$). The number of different remainders is first limited to $n$ with the first inner modulo but then only the remainders resulting from these numbers squared can be returned. This means that if one of the square numbers has the same modulo as another square number, then the total number of final remainders has decreased.<br /><br />In a future post we will investigate why the remainders of square numbers are cyclic.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-33750336105500644102019-09-26T22:14:00.000+02:002019-09-26T22:14:03.748+02:00Python functions for generating all substrings/sublists, combinations, permutations, and sequences<pre class="brush:python">def subs(xs, sub_len):<br /> '''Get all sublists/substrings from xs of the given sub-length.'''<br /> for i in range(len(xs) - sub_len + 1): #Get all indexes of the substrings that are at least sub_len long.<br /> yield xs[i:i+sub_len]<br /><br />for x in subs('abcd', 2):<br /> print(x)<br /></pre><pre>ab<br />bc<br />cd<br /></pre><br /><pre class="brush:python">def combinations(xs, comb_size):<br /> '''Get all combinations of the given combination size.'''<br /> if comb_size == 0:<br /> yield xs[:0] #Empty list/string<br /> else:<br /> for i in range(len(xs) - comb_size + 1):<br /> for ys in combinations(xs[i+1:], comb_size-1): #For every item, combine it with every item that comes after it.<br /> yield xs[i] + ys<br /><br />for x in combinations('abcd', 2):<br /> print(x)<br /></pre><pre>ab<br />ac<br />ad<br />bc<br />bd<br />cd<br /></pre><br /><pre class="brush:python">def perms(xs):<br /> '''Get all permutations.'''<br /> if len(xs) == 1:<br /> yield xs<br /> else:<br /> for i in range(len(xs)):<br /> for ys in perms(xs[:i] + xs[i+1:]): #For every item, combine it with every item except itself.<br /> yield xs[i] + ys<br /><br />for x in perms('abc'):<br /> print(x)<br /></pre><pre>abc<br />acb<br />bac<br />bca<br />cab<br />cba<br /></pre><br /><pre class="brush:python">def permutations(xs, perm_size):<br /> '''Get all permutations of the given permutation size.'''<br /> for cs in combinations(xs, perm_size):<br /> for p in perms(cs): #Get all the permutations of all the combinations.<br /> yield p<br /><br />for x in permutations('abcd', 2):<br /> print(x)<br /></pre><pre>ab<br />ba<br />ac<br />ca<br />ad<br />da<br />bc<br />cb<br />bd<br />db<br />cd<br />dc<br /></pre><br /><pre class="brush:python">def seqs(xs, seq_len):<br /> '''Get all sequences of a given length.'''<br /> if seq_len == 0:<br /> yield xs[:0] #Empty list/string<br /> else:<br /> for i in range(len(xs)):<br /> for ys in seqs(xs, seq_len-1): #For every item, combine it with every item including itself.<br /> yield xs[i] + ys<br /><br />for x in seqs('abc', 2):<br /> print(x)<br /></pre><pre>aa<br />ab<br />ac<br />ba<br />bb<br />bc<br />ca<br />cb<br />cc<br /></pre>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-10742889354980975022019-08-30T08:31:00.000+02:002019-09-01T22:37:16.623+02:00The Fourier transform for real numbers (approximating complicated periodic functions with simple sinusoids)In <a href="https://geekyisawesome.blogspot.com/2018/08/the-mclauren-series-and-taylor-series.html">a previous post</a> we saw how to approximate complex functions using simple polynomials by using the Taylor series approximation method. This consisted in assuming that your complex function was an actual polynomial and then using mathematical tricks to tease out the coefficients of the polynomial one by one. Polynomials are useful functions to approximate curves but are terrible at approximating periodic functions such as $sin(x)$ (repeat themselves periodically) because they themselves are not periodic. In this post we'll see how to approximating periodic functions by using a sum of simple sinusoids (sines or cosines). With polynomials, we had a summation of powers of $x$, each of which having a different power. Now we will instead be having a summation of sinusoids, each of which having a different whole number frequency. With polynomials we had to find the coefficients, which are constants multiplied by the power of $x$, and any curve could be arbitrarily approximated by finding the right coefficients. Now we will be finding amplitudes, which are constants multiplied by each sinusoid, and any periodic function can be arbitrarily approximated by finding the right amplitudes.<br /><br />Instead of the Taylor series approximation we will now be using the Fourier series approximation, also known as the <a href="https://en.wikipedia.org/wiki/Fourier_transform">Fourier transform</a>. In order to keep things simple we will only be looking at real functions rather than complex ones (functions with imaginary components). The Fourier series assumes that the periodic function we're deconstructing, $f(x)$, begins its period at 0 and ends it at $T$, after which $f(x)$ will continue repeating itself such that $f(T + k) = f(k)$. If we'd like our period to start at $x_0$ instead then we can easily just replace our periodic function with $f'(x)$ such that $f'(x) = f(x + x_0)$ and then subtract $x_0$ from $x$ again after getting the approximate function.<br /><br />Let's say that we have a complex periodic function where the first period is between 0 and 2 and is defined as $y = \cosh(x-1)$. This is what it looks like:<br /><br /><a href="https://3.bp.blogspot.com/-2ccIGeSF1F0/XWjCFgOf6lI/AAAAAAAABaI/TJyqkvk_L4MyhPDk5v-3JxoAo9-eaK06gCLcBGAs/s1600/fourier_fx.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-2ccIGeSF1F0/XWjCFgOf6lI/AAAAAAAABaI/TJyqkvk_L4MyhPDk5v-3JxoAo9-eaK06gCLcBGAs/s320/fourier_fx.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />What we mean by this being a periodic function is that if we looked beyond the first period we would see the function looking like this:<br /><br /><a href="https://4.bp.blogspot.com/-OFoQK7URpIs/XWjCK4NUfCI/AAAAAAAABaM/daWNHWSFBEQgcH4df_nFmIdyUEweWeNrQCLcBGAs/s1600/fourier_fx_periodic.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-OFoQK7URpIs/XWjCK4NUfCI/AAAAAAAABaM/daWNHWSFBEQgcH4df_nFmIdyUEweWeNrQCLcBGAs/s320/fourier_fx_periodic.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />We're assuming that our function consists of a summation of sinusoids with different frequencies and amplitudes such that each wavelength is equal to the period of the original function or a whole number division of that period (half, quarter, etc.). To make a sine or cosine have a period equal to $T$, we use $sin(x\frac{2\pi}{T})$ and $cos(x\frac{2\pi}{T})$. Multiplying $x$ by a whole number will change the frequency of the sinusoid such that the wavelength is equal to a whole number division of $T$. For example, $sin(2 x\frac{2\pi}{2})$ will make the sine function repeat itself twice when going from 0 to 2. This is what our sinusoid functions look like for both sine and cosine:<br /><br /><a href="https://2.bp.blogspot.com/-unRmIQdPXKM/XWjCPRreYtI/AAAAAAAABaQ/k30FwnMy7akPynHmpccJdIXd7uqDNVrOwCLcBGAs/s1600/fourier_fx_sinusoids.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-unRmIQdPXKM/XWjCPRreYtI/AAAAAAAABaQ/k30FwnMy7akPynHmpccJdIXd7uqDNVrOwCLcBGAs/s320/fourier_fx_sinusoids.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />By adding up these sines and cosines together whilst having specific amplitudes (that we need to find) we can get closer and closer to the original periodic function.<br /><br />The Fourier transform takes advantage of an interesting quirk of sinusoids and the area under their graph. For two positive whole numbers $n$ and $m$, if you multiply $\sin(n x\frac{2\pi}{T})$ by $\sin(m x\frac{2\pi}{T})$ or $\cos(n x\frac{2\pi}{T})$ by $\cos(m x\frac{2\pi}{T})$, the area under the graph between 0 and $T$ will always be zero, with the only exception being when $n = m$. If $n = m$ then the area will be equal to $\frac{T}{2}$. In other words,<br /><br />$$<br />\int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \sin\left(m x\frac{2\pi}{T}\right) dx = <br />\begin{cases}<br />0& \text{if } n \neq m\\<br />\frac{T}{2}& \text{otherwise}<br />\end{cases}<br />$$<br />and likewise for $\cos$ instead of $\sin$. Here's an example showing the area under the graph of two cosines with different frequencies being multiplied together:<br /><br /><a href="https://2.bp.blogspot.com/-ZNL0ZTzbaYs/XWwqvc4jiTI/AAAAAAAABbA/f_2iCD25zDcxppAqiCDAmdVGl0GdIKpkACLcBGAs/s1600/fourier_cosn_cosm.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-ZNL0ZTzbaYs/XWwqvc4jiTI/AAAAAAAABbA/f_2iCD25zDcxppAqiCDAmdVGl0GdIKpkACLcBGAs/s320/fourier_cosn_cosm.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />Note how each hump above the x-axis has a corresponding anti-hump below the x-axis which cancel each other out, resulting in a total area of zero. Here's an example showing the area under the graph of two cosines with the same frequency being multiplied together:<br /><br /><a href="https://4.bp.blogspot.com/-FXluztwcWiU/XWwq03yJGLI/AAAAAAAABbE/MIEr6fP4OS8szbG11-U7zJ3tPDSn1K-ZACLcBGAs/s1600/fourier_cosn_cosn.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-FXluztwcWiU/XWwq03yJGLI/AAAAAAAABbE/MIEr6fP4OS8szbG11-U7zJ3tPDSn1K-ZACLcBGAs/s320/fourier_cosn_cosn.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />Note how the area is all above the x-axis. If we had to measure this area, it would be equal to $\frac{T}{2}$ where $T = 2$. Let's prove this for all frequencies.<br /><br />Proof that the area under the product of two sines with different frequencies is 0:<br />$$<br />\begin{align}<br />\int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \sin\left(m x\frac{2\pi}{T}\right) dx<br />&= \int_{0}^{T} \frac{1}{2}\left(\cos\left(n x\frac{2\pi}{T} - m x\frac{2\pi}{T}\right) - \cos\left(n x\frac{2\pi}{T} + m x\frac{2\pi}{T}\right)\right) dx\\<br />&= \frac{1}{2}\left(\int_{0}^{T} \cos\left((n - m) \frac{2\pi}{T}x\right) dx - \int_{0}^{T} \cos\left((n + m) \frac{2\pi}{T}x\right) dx\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left[\sin\left((n - m) \frac{2\pi}{T}x\right)\right]_{0}^{T} - \frac{1}{(n + m) \frac{2\pi}{T}}\left[\sin\left((n + m) \frac{2\pi}{T}x\right)\right]_{0}^{T}\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) \frac{2\pi}{T}T\right) - \sin\left((n - m) \frac{2\pi}{T}0\right)\right) - \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) \frac{2\pi}{T}T\right) - \sin\left((n + m) \frac{2\pi}{T}0\right)\right)\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) 2\pi\right) - \sin\left(0\right)\right) - \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) 2\pi\right) - \sin\left(0\right)\right)\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(0 - 0\right) - \frac{1}{(n + m) \frac{2\pi}{T}}\left(0 - 0\right)\right)\\<br />&= 0\\<br />\end{align}<br />$$<br /><br />Proof that the area under the product of two sines with the same frequencies is half their period:<br />$$<br />\begin{align}<br />\int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \sin\left(n x\frac{2\pi}{T}\right) dx<br />&= \int_{0}^{T} \sin^2\left(n x\frac{2\pi}{T}\right) dx\\<br />&= \int_{0}^{T} \frac{1}{2}\left(1 - \cos\left(2n x\frac{2\pi}{T}\right)\right) dx\\<br />&= \frac{1}{2}\left(\int_{0}^{T} 1 dx - \int_{0}^{T}\cos\left(2n\frac{2\pi}{T} x\right) dx\right)\\<br />&= \frac{1}{2}\left(\left[x \right]_{0}^{T} - \frac{1}{2n\frac{2\pi}{T}}\left[\sin\left(2n\frac{2\pi}{T} x\right)\right]_{0}^{T}\right)\\<br />&= \frac{1}{2}\left(\left(T - 0\right) - \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n\frac{2\pi}{T} T\right) - \sin\left(2n\frac{2\pi}{T} 0\right)\right)\right)\\<br />&= \frac{1}{2}\left(T - \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n2\pi\right) - \sin\left(0\right)\right)\right)\\<br />&= \frac{1}{2}\left(T - \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\<br />&= \frac{1}{2}\left(T - \frac{1}{2n\frac{2\pi}{T}}\left(0 - 0\right)\right)\\<br />&= \frac{T}{2}\\<br />\end{align}<br />$$<br /><br />Proof that the area under the product of two cosines with different frequencies is 0:<br />$$<br />\begin{align}<br />\int_{0}^{T} \cos\left(n x\frac{2\pi}{T}\right) \cos\left(m x\frac{2\pi}{T}\right) dx<br />&= \int_{0}^{T} \frac{1}{2}\left(\cos\left(n x\frac{2\pi}{T} - m x\frac{2\pi}{T}\right) + \cos\left(n x\frac{2\pi}{T} + m x\frac{2\pi}{T}\right)\right) dx\\<br />&= \frac{1}{2}\left(\int_{0}^{T} \cos\left((n - m) \frac{2\pi}{T}x\right) dx + \int_{0}^{T} \cos\left((n + m) \frac{2\pi}{T}x\right) dx\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left[\sin\left((n - m) \frac{2\pi}{T}x\right)\right]_{0}^{T} + \frac{1}{(n + m) \frac{2\pi}{T}}\left[\sin\left((n + m) \frac{2\pi}{T}x\right)\right]_{0}^{T}\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) \frac{2\pi}{T}T\right) - \sin\left((n - m) \frac{2\pi}{T}0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) \frac{2\pi}{T}T\right) - \sin\left((n + m) \frac{2\pi}{T}0\right)\right)\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) 2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) 2\pi\right) - \sin\left(0\right)\right)\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n - m) \frac{2\pi}{T}}\left(0 - 0\right) + \frac{1}{(n + m) \frac{2\pi}{T}}\left(0 - 0\right)\right)\\<br />&= 0\\<br />\end{align}<br />$$<br /><br />Proof that the area under the product of two cosines with the same frequencies is half their period:<br />$$<br />\begin{align}<br />\int_{0}^{T} \cos\left(n x\frac{2\pi}{T}\right) \cos\left(n x\frac{2\pi}{T}\right) dx<br />&= \int_{0}^{T} \cos^2\left(n x\frac{2\pi}{T}\right) dx\\<br />&= \int_{0}^{T} \frac{1}{2}\left(1 + \cos\left(2n x\frac{2\pi}{T}\right)\right) dx\\<br />&= \frac{1}{2}\left(\int_{0}^{T} 1 dx + \int_{0}^{T}\cos\left(2n\frac{2\pi}{T} x\right) dx\right)\\<br />&= \frac{1}{2}\left(\left[x \right]_{0}^{T} + \frac{1}{2n\frac{2\pi}{T}}\left[\sin\left(2n\frac{2\pi}{T} x\right)\right]_{0}^{T}\right)\\<br />&= \frac{1}{2}\left(\left(T - 0\right) + \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n\frac{2\pi}{T} T\right) - \sin\left(2n\frac{2\pi}{T} 0\right)\right)\right)\\<br />&= \frac{1}{2}\left(T + \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2n2\pi\right) - \sin\left(0\right)\right)\right)\\<br />&= \frac{1}{2}\left(T + \frac{1}{2n\frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\<br />&= \frac{1}{2}\left(T + \frac{1}{2n\frac{2\pi}{T}}\left(0 - 0\right)\right)\\<br />&= \frac{T}{2}\\<br />\end{align}<br />$$<br /><br />Also interestingly, proof that the area under the product of a sine and cosine, regardless of frequency, is 0:<br />$$<br />\begin{align}<br />\int_{0}^{T} \sin\left(n x\frac{2\pi}{T}\right) \cos\left(m x\frac{2\pi}{T}\right) dx<br />&= \int_{0}^{T} \frac{1}{2}\left(\sin\left(n x\frac{2\pi}{T} + m x\frac{2\pi}{T}\right) + \sin\left(n x\frac{2\pi}{T} - m x\frac{2\pi}{T}\right)\right) dx\\<br />&= \frac{1}{2}\left(\int_{0}^{T} \sin\left((n + m) \frac{2\pi}{T}x\right) dx + \int_{0}^{T} \sin\left((n - m) \frac{2\pi}{T}x\right) dx\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n + m) \frac{2\pi}{T}}\left[\sin\left((n + m) \frac{2\pi}{T}x\right)\right]_{0}^{T} + \frac{1}{(n - m) \frac{2\pi}{T}}\left[\sin\left((n - m) \frac{2\pi}{T}x\right)\right]_{0}^{T}\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) \frac{2\pi}{T}T\right) - \sin\left((n + m) \frac{2\pi}{T}0\right)\right) + \frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) \frac{2\pi}{T}T\right) - \sin\left((n - m) \frac{2\pi}{T}0\right)\right)\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left((n + m) 2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left((n - m) 2\pi\right) - \sin\left(0\right)\right)\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n + m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right) + \frac{1}{(n - m) \frac{2\pi}{T}}\left(\sin\left(2\pi\right) - \sin\left(0\right)\right)\right)\\<br />&= \frac{1}{2}\left(\frac{1}{(n + m) \frac{2\pi}{T}}\left(0 - 0\right) + \frac{1}{(n - m) \frac{2\pi}{T}}\left(0 - 0\right)\right)\\<br />&= 0\\<br />\end{align}<br />$$<br /><br />Great! Now we can get back to the Fourier transform. Suppose that we have the following sum of sines and cosines:<br />$$<br />\begin{align}<br />f(x) = &a_1 \cos\left(1 x\frac{2\pi}{T}\right) + b_1 \sin\left(1 x\frac{2\pi}{T}\right) + \\<br />&a_2 \cos\left(2 x\frac{2\pi}{T}\right) + b_2 \sin\left(2 x\frac{2\pi}{T}\right) + \\<br />&\dots<br />\end{align}<br />$$<br /><br />How can we tease out the individual amplitudes $a_i$ and $b_i$? Thanks to the identities we proved above, we can now get any amplitude we want by multiplying the function by a sine or cosine, finding the area under the graph, and dividing the area by $\frac{T}{2}$. Here's how it works:<br />$$<br />\begin{align}<br />\frac{2}{T}\int_0^T \cos\left(n x\frac{2\pi}{T}\right) f(x) dx<br />&= \frac{1}{2T}\int_0^T \cos\left(n x\frac{2\pi}{T}\right)<br />\begin{pmatrix}<br />&a_1 cos\left(1 x\frac{2\pi}{T}\right) & + b_1 sin\left(1 x\frac{2\pi}{T}\right) + \\<br />&a_2 cos\left(2 x\frac{2\pi}{T}\right) & + b_2 sin\left(2 x\frac{2\pi}{T}\right) + \\<br />&\dots&<br />\end{pmatrix}<br />dx\\<br />&= \frac{2}{T}<br />\begin{pmatrix}<br />&a_1 \int_0^T\cos\left(n x\frac{2\pi}{T}\right) cos\left(1 x\frac{2\pi}{T}\right) dx & + b_1 \int_0^T\cos\left(n x\frac{2\pi}{T}\right) sin\left(1 x\frac{2\pi}{T}\right) dx + \\<br />&\dots&\\<br />&a_n \int_0^T\cos\left(n x\frac{2\pi}{T}\right) cos\left(n x\frac{2\pi}{T}\right) dx & + b_n \int_0^T\cos\left(n x\frac{2\pi}{T}\right) sin\left(n x\frac{2\pi}{T}\right) dx + \\<br />&\dots&<br />\end{pmatrix}\\<br />&= \frac{2}{T}<br />\begin{pmatrix}<br />&a_1 0 & + b_1 0 + \\<br />&\dots&\\<br />&a_n \frac{T}{2} & + b_n 0 + \\<br />&\dots&<br />\end{pmatrix}\\<br />&= \frac{2}{T} a_n \frac{T}{2}\\<br />&= a_n<br />\end{align}<br />$$<br /><br />And obviously multiplying by $\sin\left(n x\frac{2\pi}{T}\right)$ would yield the amplitude of a sine. All that's left is by how much to vertically lift the graph, that is, the constant term to add to the sinusoids. The sinusoids should oscillate around the average value of the original function in one period, which is found as follows:<br /><br />$y_0 = \frac{1}{T} \int_0^T f(x) dx$<br /><br />Getting back to $f(x) = \cosh(x-1)$, here are the amplitudes:<br /><br />$y_0 = \frac{1}{2} \int_0^2 f(x) dx = 1.1752$<br />$a_1 = \frac{2}{2} \int_0^2 \cos\left(1 x\frac{2\pi}{2}\right) f(x) dx = 0.2162$<br />$b_1 = \frac{2}{2} \int_0^2 \sin\left(1 x\frac{2\pi}{2}\right) f(x) dx = 0.0$<br />$a_2 = \frac{2}{2} \int_0^2 \cos\left(2 x\frac{2\pi}{2}\right) f(x) dx = 0.0581$<br />$b_2 = \frac{2}{2} \int_0^2 \sin\left(2 x\frac{2\pi}{2}\right) f(x) dx = 0.0$<br /><br />And here are the graphs of the approximated function getting better with every new term (where the approximated function is denoted as $\hat{f}$):<br /><br />$\hat{f}(x) = 1.1752$<br /><a href="https://2.bp.blogspot.com/-vgDL3UuWkz4/XWjCa4dFI8I/AAAAAAAABac/Wbik4vorStkVid8QVSqSgkRUIZT4l9wVACLcBGAs/s1600/fourier_approx0.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-vgDL3UuWkz4/XWjCa4dFI8I/AAAAAAAABac/Wbik4vorStkVid8QVSqSgkRUIZT4l9wVACLcBGAs/s320/fourier_approx0.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />$\hat{f}(x) = 1.1752 + 0.2162 \cos\left(1 x\frac{2\pi}{2}\right)$<br /><a href="https://1.bp.blogspot.com/-GQp7rgupQN8/XWjCeT9UfvI/AAAAAAAABag/54W9z8Ppmik2XB-GNaGKzC2u1J4Ku_XOACLcBGAs/s1600/fourier_approx1.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-GQp7rgupQN8/XWjCeT9UfvI/AAAAAAAABag/54W9z8Ppmik2XB-GNaGKzC2u1J4Ku_XOACLcBGAs/s320/fourier_approx1.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />$\hat{f}(x) = 1.1752 + 0.2162\cos\left(1 x\frac{2\pi}{2}\right) + 0.0581\cos\left(2 x\frac{2\pi}{2}\right)$<br /><a href="https://1.bp.blogspot.com/-7f4j8nVLWfM/XWjCjISpKYI/AAAAAAAABao/VYZF1Iy6FuAshaoc4EQmk2VtFQ38tQQGgCLcBGAs/s1600/fourier_approx2.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-7f4j8nVLWfM/XWjCjISpKYI/AAAAAAAABao/VYZF1Iy6FuAshaoc4EQmk2VtFQ38tQQGgCLcBGAs/s320/fourier_approx2.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />And of course here is the function beyond the first period:<br /><a href="https://2.bp.blogspot.com/-a5ZEA9-6Hbk/XWjCnL4vRnI/AAAAAAAABas/787bgLtObkUd6Kvfq-ABtyrlEREtNHwZgCLcBGAs/s1600/fourier_approx_periodic.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-a5ZEA9-6Hbk/XWjCnL4vRnI/AAAAAAAABas/787bgLtObkUd6Kvfq-ABtyrlEREtNHwZgCLcBGAs/s320/fourier_approx_periodic.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-45104141571280924102019-07-31T11:45:00.000+02:002019-07-31T12:35:54.674+02:00Recognising simple shapes with OpenCV in PythonConvolutional neural networks are usually used to recognise objects in images but that's a bit of an overkill if you just want to recognise simple flat shapes. Here's how to use OpenCV to do simple object recognition.<br /><br />Basically the trick to recognise polygons is to convert your image into an approximate polygon representation using something like edge detection and then count the number of sides in the polygon. OpenCV handles a lot of this stuff for you so its quite easy.<br /><br />Here's the image we will be working on:<br /><a href="https://3.bp.blogspot.com/-Evt3--O224c/XUFbwNCrHDI/AAAAAAAABYg/24ChYPwvluIzMNIHaHvu_fpI2j2yR8VUgCLcBGAs/s1600/opencv_polygons_1.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-Evt3--O224c/XUFbwNCrHDI/AAAAAAAABYg/24ChYPwvluIzMNIHaHvu_fpI2j2yR8VUgCLcBGAs/s320/opencv_polygons_1.png" width="320" height="320" data-original-width="300" data-original-height="300" /></a><br /><br />The first step is to binarise the pixels of the image, that is they are made either black or white. First the image must be turned into greyscale so that there is just one number per pixel and then anything that is not white (greyscale value 255) is replaced with 255 (white) whilst pure white is replaced with 0 (black).<br /><br /><pre class="brush:python">import cv2<br />import numpy as np<br /><br />img = cv2.imread('polygons.png', cv2.IMREAD_COLOR)<br /><br />#Make image greyscale<br />grey = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)<br />cv2.imshow('greyscale', grey)<br /><br />#Binarise greyscale pixels<br />(_, thresh) = cv2.threshold(grey, 254, 255, 1)<br />cv2.imshow('thresholded', thresh)<br /></pre><br /><a href="https://4.bp.blogspot.com/-CBY4T_i7jCQ/XUFcTcz41bI/AAAAAAAABYo/b07bY6BperAuDkA7WZ_i1nNmtQkohlOVgCLcBGAs/s1600/opencv_polygons_2.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-CBY4T_i7jCQ/XUFcTcz41bI/AAAAAAAABYo/b07bY6BperAuDkA7WZ_i1nNmtQkohlOVgCLcBGAs/s320/opencv_polygons_2.png" width="291" height="320" data-original-width="455" data-original-height="501" /></a><br /><br /><a href="https://1.bp.blogspot.com/-xLv8LA2EmTE/XUFcY0Mb7jI/AAAAAAAABYs/p8O6XX09IegUSSbODKG8OfOS6a5kKh0MgCLcBGAs/s1600/opencv_polygons_3.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-xLv8LA2EmTE/XUFcY0Mb7jI/AAAAAAAABYs/p8O6XX09IegUSSbODKG8OfOS6a5kKh0MgCLcBGAs/s320/opencv_polygons_3.png" width="291" height="320" data-original-width="455" data-original-height="501" /></a><br /><br />Next we're going to find contours around these white shapes, that is, reduce the image into a set of points with each point being a corner in the shape. This is done using the findContours function which returns a list consisting of contour points for every shape and their hierarchy. The <a href="https://docs.opencv.org/trunk/d9/d8b/tutorial_py_contours_hierarchy.html">hierarchy</a> is the way each shape is nested within other shapes. We won't need this here since we don't have any shapes within shapes but it will be useful in other situations.<br /><br /><pre class="brush:python">#Get shape contours<br />(contours, hierarchy) = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)<br />thresh_copy = cv2.cvtColor(thresh, cv2.COLOR_GRAY2RGB)<br />cv2.drawContours(thresh_copy, contours, contourIdx=-1, color=(0, 255, 0), thickness=2)<br />print('number of sides per shape:')<br />for contour in contours:<br /> print('', contour.shape[0])<br />print()<br />cv2.imshow('contours', thresh_copy)<br /></pre><br /><a href="https://4.bp.blogspot.com/-2YXQAftTfnU/XUFcyHq6oSI/AAAAAAAABY4/c_Iy60O7maMg7M0icKVevTpbNl1DyqyegCLcBGAs/s1600/opencv_polygons_4.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-2YXQAftTfnU/XUFcyHq6oSI/AAAAAAAABY4/c_Iy60O7maMg7M0icKVevTpbNl1DyqyegCLcBGAs/s320/opencv_polygons_4.png" width="291" height="320" data-original-width="455" data-original-height="501" /></a><br /><br /><pre>number of sides per shape:<br /> 119<br /> 122<br /> 4<br /> 124<br /></pre><br />Unfortunately, the number of sides extracted from each shape is nowhere near what it should be. This is because the corners in the shapes are rounded which results in multiple sides around the corner. We therefore need to simplify these contours so that insignificant sides can be removed. This is done using the <a href="https://en.wikipedia.org/wiki/Ramer%E2%80%93Douglas%E2%80%93Peucker_algorithm">Ramer–Douglas–Peucker algorithm</a> which is implemented as the approxPolyDP function. The algorithm requires a parameter called epsilon which controls how roughly to chop up the polygon into a smaller number of sides. This number is dependent on the size of the polygon so we make it a fraction of the shape's perimeter.<br /><br /><pre class="brush:python">#Simplify contours<br />thresh_copy = cv2.cvtColor(thresh, cv2.COLOR_GRAY2RGB)<br />print('number of sides per shape:')<br />for contour in contours:<br /> perimeter = cv2.arcLength(contour, True)<br /> e = 0.05*perimeter #The bigger the fraction, the more sides are chopped off the original polygon<br /> contour = cv2.approxPolyDP(contour, epsilon=e, closed=True)<br /> cv2.drawContours(thresh_copy, [ contour ], contourIdx=-1, color=(0, 255, 0), thickness=2)<br /> print('', contour.shape[0])<br />cv2.imshow('simplified contours', thresh_copy)<br /></pre><br /><a href="https://2.bp.blogspot.com/-GbwvsFF_dUQ/XUFdFz0sgoI/AAAAAAAABZE/i5nLORJl52MYBPyj7j-KkaN_HtIhgPCjwCLcBGAs/s1600/opencv_polygons_5.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-GbwvsFF_dUQ/XUFdFz0sgoI/AAAAAAAABZE/i5nLORJl52MYBPyj7j-KkaN_HtIhgPCjwCLcBGAs/s320/opencv_polygons_5.png" width="291" height="320" data-original-width="455" data-original-height="501" /></a><br /><br /><pre>number of sides per shape:<br /> 6<br /> 5<br /> 4<br /> 3<br /></pre><br />And with this we have the number of sides in each polygon.<br /><br />If you want to check for circles as well as polygons then you will not be able to do so by counting sides. Instead you can get the minimum enclosing circle around a contour and check if its area is close to the area of the contour (before it is simplified):<br /><br /><pre class="brush:python">((centre_x, centre_y), radius) = cv2.minEnclosingCircle(contour)<br />if cv2.contourArea(contour)/(np.pi*radius**2) > 0.95:<br /> print('circle')<br /></pre><br />The minimum enclosing circle is the smallest circle that completely contains the contour. Therefore it's area must necessarily be larger or equal to the shape of the contour, which can only be equal in the case that the contour is a circle.<br /><br />This is also one way how you can get the position of each shape, by getting the centre point of the enclosing circle. The proper way to get a single coordinate representing the position of the contour is to get the centre of gravity using the contour's moments:<br /><br /><pre class="brush:python">m = cv2.moments(contour)<br />x = m['m10']/m['m00']<br />y = m['m01']/m['m00']<br /></pre><br /><hr /><br />Here's the full code:<br /><br /><pre class="brush:python">import cv2<br />import numpy as np<br /><br />img = cv2.imread('polygons.png', cv2.IMREAD_COLOR)<br /><br />#Binarise pixels<br />grey = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)<br />(_, thresh) = cv2.threshold(grey, 254, 255, 1)<br /><br />#Get shape contours<br />(contours, hierarchy) = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)<br /><br />#Recognise shapes<br />for contour in contours:<br /> m = cv2.moments(contour)<br /> x = round(m['m10']/m['m00'])<br /> y = round(m['m01']/m['m00'])<br /><br /> shape_name = None<br /><br /> (_, radius) = cv2.minEnclosingCircle(contour)<br /> if cv2.contourArea(contour)/(np.pi*radius**2) > 0.95:<br /> shape_name = 'circle'<br /> else:<br /> e = 0.05*cv2.arcLength(contour, True)<br /> simple_contour = cv2.approxPolyDP(contour, epsilon=e, closed=True)<br /> num_sides = simple_contour.shape[0]<br /> shape_name = { 3: 'triangle', 4: 'quad', 5: 'pentagon', 6: 'hexagon' }.get(num_sides, 'unknown')<br /><br /> cv2.putText(img, shape_name, (x, y), fontFace=cv2.FONT_HERSHEY_SIMPLEX, fontScale=0.6, color=(0, 255, 0), thickness=2)<br /><br />cv2.imshow('named shapes', img)<br /></pre><br /><a href="https://2.bp.blogspot.com/-rKBGSrLd-d8/XUFuqMgZ67I/AAAAAAAABZQ/MZ6u2OdS2ckoGaJ8OG8UvGdtZbP3dFNkgCLcBGAs/s1600/opencv_polygons_final.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-rKBGSrLd-d8/XUFuqMgZ67I/AAAAAAAABZQ/MZ6u2OdS2ckoGaJ8OG8UvGdtZbP3dFNkgCLcBGAs/s320/opencv_polygons_final.png" width="291" height="320" data-original-width="455" data-original-height="501" /></a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-50741165580673232922019-06-28T12:28:00.000+02:002019-06-28T12:32:32.629+02:00The exploding gradient problem: Why your neural network gives NaNsHave you ever been training your neural network and suddenly got a bunch of NaNs as loss values? This is usually because of something called the exploding gradient problem, which is when your neural network's gradients, used to train it, are extremely large, leading to overflow errors when performing backpropagation.<br /><br />The exploding gradient problem is mostly caused by using large weight values in combination with using a large number of layers. It happens a lot with RNNs which would be neural networks with as many layers as there are items in the input sequence. Basically, given a particular activation function, with every additional layer the gradient of the parameters in the first layer can either vanish or explode as the gradient consists of a multiplication of the weights of the layers in front of it together. If the weights are fractions then their product will get closer and closer to zero (vanishes) with every additional layer, whilst if the weights are large then their product will get closer and closer to infinity (explodes) with every additional layer. Vanishing gradients lead to no learning in the early layers whilst exploding gradients leads to NaNs.<br /><br />In every explanation of this that I see, I always just see general reasoning without any actual examples of what this actually looks like. In order to illustrate what exploding gradients actually look like, we're going to make some simplifications such as that we're building a neural network with just one neural unit per layer and the single weight and bias in each layer will be the same. This is not a necessary condition to reach exploding gradients, but it will help visualise what is going on. Let's see an example of what happens when we have just 3 layers and a weight equal to 8.<br /><br />ReLU is an activation function that is known for mitigating the vanishing gradient problem, but it also makes it easy to create exploding gradients if the weights are large enough, which is why weights must be initialised to very small values. Here is what the graph produced by a neural network looks like when it consists of single ReLU units in each layer (single number input and single number output as well) as the number of layers varies between 1 (pink), 2 (red), and 3 (dark red).<br />$y = \text{relu}(w \times x)$<br />$y = \text{relu}(w \times \text{relu}(w \times x))$<br />$y = \text{relu}(w \times \text{relu}(w \times \text{relu}(w \times x)))$<br /><a href="https://3.bp.blogspot.com/-zT9J7O8W6vs/XRXeax56uOI/AAAAAAAABWU/mL8CjWKWTxgFY6GbpJf3U2FuIpmVjV_9gCLcBGAs/s1600/explodinggrads_relu.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-zT9J7O8W6vs/XRXeax56uOI/AAAAAAAABWU/mL8CjWKWTxgFY6GbpJf3U2FuIpmVjV_9gCLcBGAs/s320/explodinggrads_relu.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />See how the steepness grows exponentially with every additional layer? That's because the gradient is basically the product of all the weights in all the layers, which means that, since $w$ is equal to 8, the gradient is increasing 8-fold with each additional layer.<br /><br />Now with ReLU, this sort of thing is kind of expected as there is not bound to the value returned by it, so there should also be no bound to the gradient. But this also happens with squashing functions like tanh. Even though its returned value must be between 1 and -1, the gradient at particular points of the function can also be exceptionally large. This means that if you're training the neural network at a point in parameter space which has a very steep gradient, even if it's just a short burst, you'll end up with overflow errors or at the very least with shooting off the learning trajectory. This is what the graph produced by a neural network looksl ike when it consists of single tanh units in each layer.<br />$y = \text{tanh}(w \times x)$<br />$y = \text{tanh}(w \times \text{tanh}(w \times x))$<br />$y = \text{tanh}(w \times \text{tanh}(w \times \text{tanh}(w \times x)))$<br /><a href="https://3.bp.blogspot.com/-faTwhD0yrD8/XRXhOmNSCzI/AAAAAAAABWg/lkqoxXPBujEkwvntcSpVDTxr0s86SKn1QCLcBGAs/s1600/explodinggrads_tanh.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-faTwhD0yrD8/XRXhOmNSCzI/AAAAAAAABWg/lkqoxXPBujEkwvntcSpVDTxr0s86SKn1QCLcBGAs/s320/explodinggrads_tanh.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />See how the slope that transitions the value from -1 to 1 gets steeper and steeper as the number of layers increases? This means that if you're training a simple RNN that uses tanh as an activation function, you can either make the state weights very small or make the initial state values very large in order to stay off the middle slope. Of course this would have other problems as then the initial state wouldn't be able to easily learn to set the initial state to any set of values. It might also be the case that there is no single slope but that there are several slopes throughout the graph since the only limitation that tanh imposes is that all values occur between -1 and 1. For example, if the previous layer learns to perform some kind of sinusoid-like pattern that alternates between -5 and 5 (remember that a neural network with large enough layers can approximate any function), then this is what passing that through tanh would look like (note that this equation is a valid neural network with a hidden layer size of 3 and a single input and output):<br />$y = \text{tanh}(5 \times \text{tanh}(40 \times x-5) - 5 \times \text{tanh}(40 \times x+0) + 5 \times \text{tanh}(40 \times x+5))$<br /><a href="https://4.bp.blogspot.com/-obs-rODrXdg/XRXntm5HVWI/AAAAAAAABW8/7ap8ptJW1JA2F70venP340iUP4P1_xNewCLcBGAs/s1600/explodinggrads_tanh_sin.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-obs-rODrXdg/XRXntm5HVWI/AAAAAAAABW8/7ap8ptJW1JA2F70venP340iUP4P1_xNewCLcBGAs/s320/explodinggrads_tanh_sin.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />In this case you can see how it is possible for there to be many steep slopes spread around the parameter space, depending on what the previous layers are doing. Your parameter space could be a minefield.<br /><br />Now sigmoid is a little more tricky to see how it explodes its gradients. If you just do what we did above, you'll get the following result:<br />$y = \text{sig}(w \times x)$<br />$y = \text{sig}(w \times \text{sig}(w \times x))$<br />$y = \text{sig}(w \times \text{sig}(w \times \text{sig}(w \times x)))$<br /><a href="https://4.bp.blogspot.com/-Gttx8ytPtok/XRXpSn4QygI/AAAAAAAABXI/_c_PgvXJPzM5bsCiPvRAgX1E-hCJUAY_gCLcBGAs/s1600/explodinggrads_sig_nobias.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-Gttx8ytPtok/XRXpSn4QygI/AAAAAAAABXI/_c_PgvXJPzM5bsCiPvRAgX1E-hCJUAY_gCLcBGAs/s320/explodinggrads_sig_nobias.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />The slopes actually get flatter with every additional layer and get closer to $y=1$. This is because, contrary to tanh, sigmoid bounds itself to be between 0 and 1, with sigmoid(0) = 0.5. This means that $\text{sig}(\text{sigmoid}(x))$ will be bounded between 0.5 and 1, since the lowest the innermost sigmoid can go is 0, which will be mapped to 0.5 by the outer most sigmoid. With each additional nested sigmoid you'll just be pushing that lower bound closer and closer toward 1 until the graph becomes a flat line at $y=1$. In fact, in order to see exploding gradients we need to make use of the biases (which up till now were set to 0). Setting the biases to $-\frac{w}{2}$ gives very nice curves:<br />$y = \text{sig}(w \times x)$<br />$y = \text{sig}(w \times \text{sig}(w \times x) - \frac{w}{2})$<br />$y = \text{sig}(w \times \text{sig}(w \times \text{sig}(w \times x) - \frac{w}{2}) - \frac{w}{2})$<br /><a href="https://4.bp.blogspot.com/-gOy7aITnXwI/XRXqoFfKM7I/AAAAAAAABXU/3GA60aXT-mg1lOnMQT6sevCcf3CIaJEowCLcBGAs/s1600/explodinggrads_sig.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-gOy7aITnXwI/XRXqoFfKM7I/AAAAAAAABXU/3GA60aXT-mg1lOnMQT6sevCcf3CIaJEowCLcBGAs/s320/explodinggrads_sig.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />Note that with sigmoid the steepness of the slope increases very slowly compared to tanh, which means that you'll need to use either larger weights or more layers to get the same dramatic effect. Also note that all these graphs were for weights being equal to 8, which is really large, but if you have many layers as in the case of a simple RNN working on long sentences, even a weight of 0.7 would explode after enough inputs.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-74522759088179742722019-05-30T10:37:00.000+02:002019-05-30T10:37:15.985+02:00Word square makerI made a word square searcher. A <a href="https://en.wikipedia.org/wiki/Word_square">word square</a> is a square grid with equal length words in each row and column such that the word in the first row is the word in the first column, the word in the second row is the word in the second column, and so on. Here's an example:<br /><pre>care<br />acid<br />ring<br />edge<br /></pre><br />It works by taking a file of words and indexing them by character and position such that I can find all the words that have a particular character at a particular position. The program will then build every possible word square that can be made from these words. It goes through all the words and tries them all as a first row/column in a word square. Each word is used as a basis for finding the rest of the words.<br /><br />In the above example, the first row/column is "care". In order to find a suitable word for the second row/column, the index is used to find all the words that have a first letter equal to the first row's second letter. "acid"'s first letter is equal to "care"'s second letter, so it is a good candidate. Each one of these candidate words is used tentatively such that if it doesn't pan out then we can backtrack and try another candidate word. For the third row/column, the index is used to find all the words that have a first letter equal to the first row's third letter and a second letter equal to the second row's third letter. The intersection of these two groups of words would be all the suitable words that can be used as a third row. "ring"'s first letter is equal to "care"'s third letter and its second letter is equal to "acid"'s third letter as well, so it is a good candidate for the third row. This process keeps on going until none of the words make a word square given the first row/column. The program then moves on to another possible word in the first row/column.<br /><br />You can find the program here: <a href="https://onlinegdb.com/By7XtdhpE">https://onlinegdb.com/By7XtdhpE</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-33614535457461529402019-05-01T13:50:00.001+02:002019-05-01T13:50:35.693+02:00Neutral Room Escape Game: Lights - Plus 7 Times 10 puzzle solverI've been playing a point and click game called <a href="http://neutralxe.net/esc/r2.html">Lights</a> which features a puzzle consisting of a 4 digit counter which starts from '0000' and which needs to be made equal to a certain number by a sequence of operations which are to either add 7 to the current value or multiply the current value by 10. I wrote a Python program to quickly solve the puzzle and I'm sharing the code for any one who needs to use it as well. You can find the program online at <a href="https://onlinegdb.com/rkg_l_ZDiV">https://onlinegdb.com/rkg_l_ZDiV</a>. Just run the program, enter the number you need to reach when requested and you'll see a sequence of steps showing which operations to perform in order to solve the puzzle in the least number of steps. The program works using a breadth first search which explores all possible moves without repeating previously reached numbers until the goal number is reached. Enjoy!<br /><br /><br /><br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-10442361374087890842019-04-30T22:20:00.000+02:002019-04-30T22:20:10.418+02:00Sliding tile puzzle solverSo I've been playing a point and click game called <a href="http://akarika.net/games/escape/loom_blend.htm">Loom Blend</a> which features a <a href="https://en.wikipedia.org/wiki/Sliding_puzzle">sliding tile puzzle</a>. I wrote a Python program to quickly solve the puzzle there and I'm sharing the code for any one who needs to use it as well. You can find the program online at <a href="https://onlinegdb.com/BkkrnXLsN">https://onlinegdb.com/BkkrnXLsN</a>. Just run the program and you'll see a sequence of steps showing which tiles to move in order to solve the puzzle in the least number of steps. The program works using a breadth first search which explores all possible moves without repeating previously reached tile configurations until the goal configuration is reached. Enjoy!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-56345914555229401872019-03-17T15:42:00.002+01:002019-03-17T15:46:16.872+01:00The Gambler's Fallacy: Why after tossing 100 heads in a row you can still get heads againThe <a href="https://en.wikipedia.org/wiki/Gambler's_fallacy">Gambler's fallacy</a> is a deceptively intuitive fallacy about how randomness works. Imagine you're tossing a fair coin repeatedly. You're astonished to see that every toss is resulting in heads. Heads, heads, heads, heads, over and over. As the number of consequtive heads keeps going up, does the probability of getting tails start increasing? It seems reasonable to conclude this as the number of heads and tails should balance each other out given that the coin is fair, as is stated by the <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers">Law of Large Numbers</a>. But the idea that there is some kind of memory in the coin that is used to prevent the fair coin from acting as if it is unfair is wrong.<br /><br />First, let's show that the Gambler's Fallacy is actually a fallacy by simulating a bunch of coin tosses in a program. We'll keep track of all the times a streak of 10 heads was obtained and the following toss after it and see if the number of heads following the streak is significantly less than the number of tails. We do not consider 11 heads in a row as two streaks of 10 heads but instead start searching for a completely new streak after every streak. Here is a Python 3 program to do that (we also count the number of heads and tails obtained in general to show that the simulated coin is not biased).<br /><br /><pre class="brush:python">import random<br /><br />seed = 0<br />num_trails = 1000000<br />head_streak_length_limit = 10<br /><br />random.seed(seed)<br /><br />num_heads = 0<br />num_tails = 0<br />heads_streak_length = 0<br />next_was_head = 0<br />next_was_tail = 0<br />for trial in range(num_trails):<br /> result = random.choice(['head', 'tail'])<br /><br /> if result == 'head':<br /> num_heads += 1<br /> else:<br /> num_tails += 1<br /><br /> if heads_streak_length < head_streak_length_limit:<br /> if result == 'head':<br /> heads_streak_length += 1<br /> else:<br /> heads_streak_length = 0<br /> else:<br /> if result == 'head':<br /> next_was_head += 1<br /> else:<br /> next_was_tail += 1<br /><br /> heads_streak_length = 0<br /><br />print('heads / tails:', round(num_heads/num_tails, 4))<br />print('after heads streak was head / after heads streak was tail:', round(next_was_head/next_was_tail, 4))<br /></pre><br />And here are the results:<br /><pre>heads / tails: 1.002<br />after heads streak was head / after heads streak was tail: 1.0579<br /></pre><br />Both the ratio of heads and tails in general and the ratio of heads and tails following a streak of 10 heads in a row are close to 1. You can try changing the streak length limit to see how making it shorter does not significantly change this ratio.<br /><br />So why does this happen? It's a simple matter of probability theory. Both the probability of getting a head and of getting a tail is 0.5 and each toss is independent from the previous one. Therefore the probability of getting a head after 10 heads is $0.5^{10} \times 0.5 = 0.5^{11}$ and the probability of getting a tail after 10 heads is also $0.5^{10} \times 0.5 = 0.5^{11}$. It does not make a difference what the previous tosses were, the next toss being heads will always have a probability of 0.5, so both heads and tails are equally likely.<br /><br />So then the more interesting question is why do we think otherwise? Why do we think that as the chain of consecutive heads gets longer, the probability of getting another head goes down? It could be because we expect the coin to be fair and a coin that never gives tails is a biased coin, but that doesn't really mean anything in terms of predicting what the next toss will be. It is entirely possible that the fairest possible coin will give a million heads in a row. There is nothing stopping this from happening. Not only that, but a million heads in a row is just as likely to happen as any other random looking sequence resulting from a million coin tosses. Let's write down a sequence of heads and tails as a string of 'H's and 'T's. The sequence 'HHHHHH' is just as likely to come up as the sequence 'HHHTTH', yet we think of the first sequence as being unlikely whilst of the second sequence as... what? Is the second sequence more likely to come up than the first? Would you bet more money on the second sequence than the first? This is what what I think is the root cause of the Gambler's Fallacy.<br /><br />What I think happens in our mind is that we don't really predict whether the next toss will give a heads or a tails but whether the whole sequence will turn out to be an 'interesting' looking sequence or not. We love patterns, and when we see that a random sequence generator (coin tosses in this case) is giving out a sequence that looks like a simple pattern, we become suspicious of the randomness behind the sequence generator and assume that it isn't really random. I mean <a href="https://www.youtube.com/watch?v=tP-Ipsat90c">we're pretty terrible at generating random sequences manually</a> because we assume that in a random process you will not get things like a streak of heads. But it's not really just a streak of heads that would make us suspicious isn't it? It's also a sequence of alternating heads and tails for example or a long sequence of heads followed by a long sequence of tails. I think that the source of the Gambler's Fallacy is that we categorise entire sequences into interesting and uninteresting sequences. As we watch a sequence being generated one toss at a time we expect that interesting sequences of that length are much less likely to happen that uninteresting ones, which is true. This makes us expect the next toss to break the interestingness of the current sequence, which, in the case of a streak of heads, is by the next toss turning out to be tails. Of course, although interesting sequences are less likely to happen as the length grows, given that the probability of the interesting pattern becoming uninteresting on the next toss is equal to the probability of it remaining interesting, it is still an incorrect assumption.<br /><br />So what is the probability of an interesting sequence? This is a subjective question but we can still try to investigate it. Ideally a survey is conducted on a population of people to check what is considered as interesting by people on average. But that would be weird so I'll just conduct the survey on myself. I have generated all the possible coin tosses that can be obtained after 4 coin tosses and I will be annotating each sequence according to what I find interesting in it. I also did the same on 6 coin tosses to see how the ratio of interesting to uninteresting sequences changes with sequence length.<br /><br />Here is what I find interesting in sequences of 4 coin tosses.<br /><table border="1"><tr><td>TTTT</td><td style="font-weight:bold;">same</td></tr><tr><td>TTTH</td><td>not interesting</td></tr><tr><td>TTHT</td><td>not interesting</td></tr><tr><td>TTHH</td><td style="font-weight:bold;">switched</td></tr><tr><td>THTT</td><td>not interesting</td></tr><tr><td>THTH</td><td style="font-weight:bold;">repeated</td></tr><tr><td>THHT</td><td style="font-weight:bold;">mirrored</td></tr><tr><td>THHH</td><td>not interesting</td></tr><tr><td>HTTT</td><td>not interesting</td></tr><tr><td>HTTH</td><td style="font-weight:bold;">mirrored</td></tr><tr><td>HTHT</td><td style="font-weight:bold;">repeated</td></tr><tr><td>HTHH</td><td>not interesting</td></tr><tr><td>HHTT</td><td style="font-weight:bold;">switched</td></tr><tr><td>HHTH</td><td>not interesting</td></tr><tr><td>HHHT</td><td>not interesting</td></tr><tr><td>HHHH</td><td style="font-weight:bold;">same</td></tr></table><br />And here is what I find interesting in sequences of 6 coin tosses (organised into two columns to reduce vertical space).<br /><table border="1"><tr><td>TTTTTT</td><td style="font-weight:bold;">same</td><td>TTTTTH</td><td>not interesting</td></tr><tr><td>TTTTHT</td><td>not interesting</td><td>TTTTHH</td><td>not interesting</td></tr><tr><td>TTTHTT</td><td>not interesting</td><td>TTTHTH</td><td>not interesting</td></tr><tr><td>TTTHHT</td><td>not interesting</td><td>TTTHHH</td><td style="font-weight:bold;">switched</td></tr><tr><td>TTHTTT</td><td>not interesting</td><td>TTHTTH</td><td style="font-weight:bold;">doubled</td></tr><tr><td>TTHTHT</td><td>not interesting</td><td>TTHTHH</td><td>not interesting</td></tr><tr><td>TTHHTT</td><td style="font-weight:bold;">mirrored</td><td>TTHHTH</td><td>not interesting</td></tr><tr><td>TTHHHT</td><td>not interesting</td><td>TTHHHH</td><td>not interesting</td></tr><tr><td>THTTTT</td><td>not interesting</td><td>THTTTH</td><td>not interesting</td></tr><tr><td>THTTHT</td><td style="font-weight:bold;">mirrored</td><td>THTTHH</td><td>not interesting</td></tr><tr><td>THTHTT</td><td>not interesting</td><td>THTHTH</td><td style="font-weight:bold;">alternated</td></tr><tr><td>THTHHT</td><td>not interesting</td><td>THTHHH</td><td>not interesting</td></tr><tr><td>THHTTT</td><td>not interesting</td><td>THHTTH</td><td>not interesting</td></tr><tr><td>THHTHT</td><td>not interesting</td><td>THHTHH</td><td style="font-weight:bold;">doubled</td></tr><tr><td>THHHTT</td><td>not interesting</td><td>THHHTH</td><td>not interesting</td></tr><tr><td>THHHHT</td><td style="font-weight:bold;">mirrored</td><td>THHHHH</td><td>not interesting</td></tr><tr><td>HTTTTT</td><td>not interesting</td><td>HTTTTH</td><td style="font-weight:bold;">mirrored</td></tr><tr><td>HTTTHT</td><td>not interesting</td><td>HTTTHH</td><td>not interesting</td></tr><tr><td>HTTHTT</td><td style="font-weight:bold;">doubled</td><td>HTTHTH</td><td>not interesting</td></tr><tr><td>HTTHHT</td><td style="font-weight:bold;">don't know</td><td>HTTHHH</td><td>not interesting</td></tr><tr><td>HTHTTT</td><td>not interesting</td><td>HTHTTH</td><td>not interesting</td></tr><tr><td>HTHTHT</td><td style="font-weight:bold;">alternated</td><td>HTHTHH</td><td>not interesting</td></tr><tr><td>HTHHTT</td><td>not interesting</td><td>HTHHTH</td><td style="font-weight:bold;">mirrored</td></tr><tr><td>HTHHHT</td><td>not interesting</td><td>HTHHHH</td><td>not interesting</td></tr><tr><td>HHTTTT</td><td>not interesting</td><td>HHTTTH</td><td>not interesting</td></tr><tr><td>HHTTHT</td><td>not interesting</td><td>HHTTHH</td><td style="font-weight:bold;">mirrored</td></tr><tr><td>HHTHTT</td><td>not interesting</td><td>HHTHTH</td><td>not interesting</td></tr><tr><td>HHTHHT</td><td style="font-weight:bold;">doubled</td><td>HHTHHH</td><td>not interesting</td></tr><tr><td>HHHTTT</td><td style="font-weight:bold;">switched</td><td>HHHTTH</td><td>not interesting</td></tr><tr><td>HHHTHT</td><td>not interesting</td><td>HHHTHH</td><td>not interesting</td></tr><tr><td>HHHHTT</td><td>not interesting</td><td>HHHHTH</td><td>not interesting</td></tr><tr><td>HHHHHT</td><td>not interesting</td><td>HHHHHH</td><td style="font-weight:bold;">same</td></tr></table><br />Among the 4 coin toss sequences, 8 out of the 16 sequences were considered interesting, meaning that you have a 50% chance of generating an interesting sequence. But among the 6 coin toss sequences, only 16 out of the 64 sequences were considered interesting, meaning that you have a 25% chance of generating an interesting sequence. Now I don't know if the patterns I identified in the tables above are the only interesting patterns for any sequence length, since maybe some patterns only emerge in very long sequences, but it is safe to say that the longer the sequence, the greater the proportion of uninteresting and random looking sequences will be. Therefore, the longer the sequence, the less likely that a sequence will be interesting. This might be what is going on when we think that the probability of getting a tails goes up as the streak of heads gets longer; we'd really be comparing how likely it is for a sequence of a given length to be uninteresting, which becomes less likely as the length grows.<br /><br /><div class="sectitle">Interesting patterns are interesting</div>I have to admit that whilst annotating interesting sequences I was more interested in being consistent in my reasoning than in saying what I genuinely found interesting. This is because I didn't actually feel that, for example, all "repeated" patterns were interesting, just some of them. But I wanted to avoid explaining myself too much based on what I felt. This means that there might be more going on in feeling that a sequence is aesthetically pleasing than simple mathematical patterns. It would be interesting if someone made an experiment where people are subjected to an EEG to see what is really considered interesting and what is not. How knows, maybe it might even turn out to be a kind of Rorschach test where different patterns of interest indicate different personalities. And maybe it might even be an interesting machine learning challenge to try to make a program predict which sequences would be considered interesting by humans in order to quantify how interesting a given sequence is.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-83109685167918923362019-02-26T17:28:00.001+01:002019-02-26T17:28:59.170+01:00MySQL extra inner join taking too long: optimizer_search_depthMySQL has a <a href="https://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html">query optimiser</a> that takes in SQL queries and tries to find the best way to execute them. Then the query gets executed using the best execution plan found. When you use joins in your query, the time the optimiser takes to finish optimising is <a href="https://dev.mysql.com/doc/refman/5.5/en/controlling-query-plan-evaluation.html">exponential to the number of joins</a>. This means that for a few joins it only takes a negligible amount of time to finish but after a point, which is somewhere between 7 and 10 tables, the optimisation time will shoot up. In my case I went from 0.1 seconds to 17 seconds but just adding another table in my joins.<br /><br />To check if this is happening to you, you can <a href="https://dev.mysql.com/doc/refman/5.5/en/show-profile.html">profile your query</a> by executing the following:<br /><br /><pre class="brush:python">SET profiling = 1;<br /><br />*your query*<br /><br />SHOW PROFILE FOR QUERY 1;<br /></pre><br />(If you using phpMyAdmin you might need to add "SHOW PROFILE;" as well at the end)<br /><br />This will show you a table of execution times for each phase in the execution of the query. The time taken by the query optimiser is in the "statistics" row. If this is too high, then you need to stop your optimiser from spending so much time.<br /><br />Controlling how much time the optimiser should spend doing its job can be accomplished using the <a href="https://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_optimizer_search_depth">optimizer_search_depth</a> variable. This is the maximum depth to search (presumably in some tree search algorithm) when optimising the query and is unfortunately set to the highest value by default. Being set to the highest value makes it the most likely to find the best execution plan but may also be unnecessarily time consuming. Thankfully there is a better default value to use: 0. When this variable is set to zero, the optimiser automatically picks a value based on the query and that seems to work well for me. To do this just execute the following query:<br /><br /><pre class="brush:python">SET optimizer_search_depth = 0;<br /></pre>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-25216890623852623542019-01-29T17:45:00.003+01:002019-01-29T18:13:33.838+01:00Python string format quick guideThis is a quick summary guide for when you just want to remind yourself how to format a value with Python's <a href="https://docs.python.org/3.6/library/string.html#format-string-syntax">'string'.format()</a>.<br /><br /><div class="sectitle">Referencing</div><br />Curly brackets are used to refer to parameter values in the 'format' method. On their own, the first pair of curly brackets will refer to the first parameter, second to the second, etc.<br /><br /><pre class="brush:python">'{} {} {}'.format('a', 1, True)<br /></pre><pre>a 1 True</pre><br />Alternatively you can specify which parameter you want to use where using indexes.<br /><br /><pre class="brush:python">'{0} {0} {2}'.format('a', 1, True)<br /></pre><pre>a a True</pre><br />You can even use names instead of indexes.<br /><br /><pre class="brush:python">'{a} {b} {c}'.format(a='a', b=1, c=True)<br /></pre><pre>a 1 True</pre><br />And even refer to indexes in lists and to instance variables in objects.<br /><br /><pre class="brush:python">'{a[0]} {b.numerator}'.format(a=[3,1,2], b=fractions.Fraction(1,2))<br /></pre><pre>3 1</pre><br /><div class="sectitle">Formatting</div><br />Adding a colon inside the curly brackets allows you to format the values before they are added to the string. You can combine these with any of the above methods of referencing. The following formatting meta characters are to be used in the following order:<br /><pre>{:[alignment] [grouping] [precision] [data type]}</pre><br /><div class="subsectitle">Aligning/padding/leading zeros</div>An alignment meta character is either >, <, or ^. The number after it is the field width in which to align.<br /><br /><pre class="brush:python">'|{:>3}|{:<3}|{:^3}|'.format('a', 'b', 'c')<br /></pre><pre>| a|b | c |</pre><br />Any character before the alignment meta character is used instead of spaces.<br /><pre class="brush:python">'{:_>5} {:0>5}'.format('a', 12)<br /></pre><pre>____a 00012</pre><br /><div class="subsectitle">Grouping</div>You can group digits in numbers into triples with commas automatically by just putting a comma in the curly brackets.<br /><br /><pre class="brush:python">'{:,}'.format(12345678)</pre><pre>12,345,678</pre><br /><div class="subsectitle">Precision</div>When formatting floats you can specify how many digits to keep after the point by adding a dot and a number. By default the format of the number is in scientific notation which means that your floating point number will look like 12e-4 with the precision referring to the number of digits in the mantissa (the number before 'e'). See the next section to format your number in fixed point notation.<br /><br /><pre class="brush:python">'{:.3}'.format(12345678.0)</pre><pre>1.23e+07</pre><br /><div class="subsectitle">Data types</div>This is the most important part. The data type always comes at the end and is a single letter.<br /><br />You can use it to change the numeric base of a whole number.<br /><pre class="brush:python">'dec-{:d} hex-{:X} oct-{:o} bin-{:b}'.format(16, 16, 16, 16)</pre><pre>dec-16 hex-10 oct-20 bin-10000</pre><br />You can use it to convert unicode values to characters.<br /><pre class="brush:python">'{:c}'.format(97)</pre><pre>a</pre><br />You can use it to convert numbers into scientific notations,<br /><pre class="brush:python">'{:e}'.format(12345678)</pre><pre>1.234568e+07</pre><br />or fixed point notations,<br /><pre class="brush:python">'{:f}'.format(12345678)</pre><pre>12345678.000000</pre><br />or percentages.<br /><pre class="brush:python">'{:%}'.format(0.1234)</pre><pre>12.340000%</pre><br /><div class="sectitle">Mixed examples</div><br /><pre class="brush:python">'{:0>2X}'.format(10)</pre><pre>0A</pre><br /><pre class="brush:python">'{:,d}'.format(1234567)</pre><pre>1,234,567</pre><br /><pre class="brush:python">'{:,.2f}'.format(12345.6)</pre><pre>12,345.60</pre><br /><pre class="brush:python">'{:.2%}'.format(0.1234)</pre><pre>12.34%</pre>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-91605694465384295222018-12-27T19:37:00.000+01:002018-12-27T21:31:20.191+01:00Language models and the unknown token: Why perplexity gets worse as the corpus size increasesUnless you're making use of a character-based language model somehow, you're going to have a finite vocabulary of words that you can handle and so you need a way to handle out-of-vocabulary words. One common way to handle such words is by replacing them all with the unknown token, a pseudo word that replaces all out-of-vocabulary words. The unknown token has several advantages: Infrequent words in the training corpus will probably either not be learned by the neural network or not be used when generating sentences. Ignoring them will save a lot in model size as the embedding layer and softmax take a lot of parameters. Plus all the infrequent words put together will occur very frequently so the language model will definitely learn to use the unknown token. It also makes it possible to handle new words at test time (that are not found even once in the training corpus) as they will be replaced by the unknown token.<br /><br />The problem with this approach is what happens when measuring the probability or perplexity of a sentence based on the probabilities of individual words. If you're comparing language models to see which ones make the best predictions, you usually use them all on a corpus to see how well they predict the words in the corpus. The higher the probabilities assigned to those sentences, the better the language model. This is usually measured using language model perplexity. But see what happens when you vary the vocabulary size. You will find that smaller vocabulary sizes lead to better language models, even though this makes no sense.<br /><br />It turns out that if you just multiply all the probabilities of individual words as-is, including that of the unknown token, then your probability will be sensitive to the vocabulary size. Let's say that you have a vocabulary size of 0, that is, you're considering all the words to be out-of-vocabulary and hence all of them will be replaced by the unknown token. This means that during training, the language model will learn that after every unknown token there is (probably) another unknown token, unless its the end of the sentence. This will make the language model give very high probabilities for the unknown token. High word probabilities mean high sentence probabilities which mean good perplexities.<br /><br />Now If we add another word to the vocabulary then we'll be introducing some uncertainty into the language model as now it has to decide between using the unknown token or the known word. Even in a perfect language model, the same prefix of words can be followed by either of the two words so there is no way to correctly assign 100% of the probability to one or the other. This means that the probabilities will be split between the two words, leading to an overall decrease in probabilities, leading to a worse perplexity. Adding more words to the vocabulary makes this even worse, which means that language models with smaller vocabularies have a huge unfair advantage over language models that actually do their job and correctly predict the right word.<br /><br />We can't do away with the unknown token but we can strip away the unknown token's power. Assuming that all the language models are being evaluated on the same corpus, then different vocabularies will have different words being turned into the unknown token. Let's say that your language model considers 1000 different words in its vocabulary but the corpus you're evaluating it on has 500 different words that are out-of-vocabulary. So in reality your language model is predicting one of 1500 different words; it's just that 500 of those words are assumed to be a single word with a single probability. But really there should be 500 separate probabilities for those out-of-vocabulary words and not just one. If we avoid merging all those probabilities into one, then all the language models will have a fair comparison all they will all have the same vocabulary and they will all have the same amount of uncertainty about which word should come next. The question is how to distribute that single unknown token probability between the 500 out-of-vocabulary words. The simplest solution is to assume a uniform distribution and just give each word the same slice of probability from the whole. So if the unknown token has a probability of $p$, then each out-of-vocabulary word gets a probability of $\frac{p}{500}$.<br /><br />Now every time you encounter the unknown token in the evaluation corpus you know that the token is being used in place of one of those 500 words. But you don't know which one it is. Not a problem, just divide the probability by 500 and it's as if all words in the corpus are in the vocabulary. Do this to every unknown token probability and now you have a fair measure of perplexity. Let's see an example.<br /><br />Let's say that we want to find the probability of the following sentence:<br /><pre>the loud dog barked at the loud man</pre><br />and let's say that the language model we're using to do that has the following vocabulary:<br /><pre>the at dog man</pre><br />this means that the sentence is now changed to:<br /><pre>the UNK dog UNK at the UNK man</pre><br />Now the naive way to get the probability of the sentence is as follows:<br /><br />$$<br />P(\text{the UNK dog UNK at the UNK man}) = \\<br />p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\<br />\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\<br />\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})<br />$$<br /><br />But now with the new way we'll divide the unknown token's probabilities by 2, the number of different out of vocabulary words ('loud' and 'barked'):<br /><br />$$<br />P(\text{the UNK dog UNK at the UNK man}) = \\<br />p(\text{the}|\text{}) \times p(\text{UNK}|\text{the})/2 \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog})/2 \\<br />\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the})/2 \\<br />\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})<br />$$<br /><br />Of course we can leave the re-weighting till the end of the equation by dividing the first equation by the number of different out-of-vocabulary words for as many times as there are unknown tokens, like this:<br /><br />$$<br />P(\text{the UNK dog UNK at the UNK man}) = \\<br />p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\<br />\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\<br />\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})/ 2^3<br />$$<br /><br />Now the sentence probability goes up as the vocabulary size increases!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-20510933309610367662018-11-29T22:14:00.002+01:002018-11-29T22:21:46.349+01:00Explainable AI (XAI): Sensitivity AnalysisImagine you have a neural network that can classify images well. At test time, would it be enough to just know what the class of the image is? Or would you also want to know why the image was classified the way it was? This is one of the goals of explainable AI and in this post we'll see what sensitivity analysis is.<br /><br />Sensitivity analysis is a way to measure the importance of different parts of an input to one part of an output. In other words, you want to know which pixels were most important to give a high probability for a particular class. The way sensitivity analysis measures this is by measuring how sensitive the output is to each pixel, that is, which pixels, when changed, will change the output the most. Most pixels should leave the output unchanged when they themselves are changed, but some pixels would be critical to the particular output that the neural network has given. To measure this, we simply find the following:<br /><br />$\left|\frac{df(x)_i}{dx_j}\right|$<br /><br />that is, the sensitivity of output $i$ to input $j$ is the absolute value of the gradient of the output with respect to the input. In Tensorflow we can compute this as follows:<br /><br /><pre class="brush:python">sensitivity = tf.abs(tf.gradients([outputs[0, output_index]], [images])[0][0])</pre><br />where 'output_index' is a placeholder that is a scalar of type tf.int64, 'outputs' is a softmax with probabilities for each class and where the first index is the batch index and the second index is the class probability, and 'images' is the pixels of images where the first index is batch index and the second index is the image in vector form. This code also assumes that only the first image in the batch is to be analysed. This is because Tensorflow can only find the gradient of a single scalar so we can only find the sensitivity of a single output of a single image.<br /><br />Here are some examples I got when I tried it on a simple fully connected two layer neural network trained to classify MNIST handwritten digits.<br /><br /><a href="https://1.bp.blogspot.com/-kLdO_Qo6vVA/XABXKiTZ30I/AAAAAAAABRw/LE_WF4Blg7oWcndSPQuQduTbUNaJkyQdACLcBGAs/s1600/sensitivityanalysis_0.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-kLdO_Qo6vVA/XABXKiTZ30I/AAAAAAAABRw/LE_WF4Blg7oWcndSPQuQduTbUNaJkyQdACLcBGAs/s1600/sensitivityanalysis_0.png" data-original-width="108" data-original-height="210" /></a><a href="https://4.bp.blogspot.com/-EZPrCkoHGe8/XABXK2pdv-I/AAAAAAAABR0/l3fGgIdEw9YdKccqfKnP-iHX81oIHeS5ACLcBGAs/s1600/sensitivityanalysis_4.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-EZPrCkoHGe8/XABXK2pdv-I/AAAAAAAABR0/l3fGgIdEw9YdKccqfKnP-iHX81oIHeS5ACLcBGAs/s1600/sensitivityanalysis_4.png" data-original-width="108" data-original-height="210" /></a><a href="https://2.bp.blogspot.com/-SrkbuGE47ac/XABXK1jIC6I/AAAAAAAABRs/1p_c1eZrCk0GA4Fgm3swRNY7nT4VPztCgCLcBGAs/s1600/sensitivityanalysis_5.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-SrkbuGE47ac/XABXK1jIC6I/AAAAAAAABRs/1p_c1eZrCk0GA4Fgm3swRNY7nT4VPztCgCLcBGAs/s1600/sensitivityanalysis_5.png" data-original-width="108" data-original-height="209" /></a><a href="https://4.bp.blogspot.com/-Zvss5Mw6XGQ/XABXLI2-xxI/AAAAAAAABR4/TTq4wrHC2vYOmHQ5_Sv_Ui053s--P-amwCLcBGAs/s1600/sensitivityanalysis_7.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-Zvss5Mw6XGQ/XABXLI2-xxI/AAAAAAAABR4/TTq4wrHC2vYOmHQ5_Sv_Ui053s--P-amwCLcBGAs/s1600/sensitivityanalysis_7.png" data-original-width="108" data-original-height="209" /></a><a href="https://1.bp.blogspot.com/-SqxUH5Ozhew/XABXLuZg_TI/AAAAAAAABR8/UWUY35aNAx0oirW-jiHeJ8eLnU8ZNr27gCLcBGAs/s1600/sensitivityanalysis_9.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-SqxUH5Ozhew/XABXLuZg_TI/AAAAAAAABR8/UWUY35aNAx0oirW-jiHeJ8eLnU8ZNr27gCLcBGAs/s1600/sensitivityanalysis_9.png" data-original-width="108" data-original-height="209" /></a><br /><br />These heat maps show which pixels have the highest sensitivity score for the digit they were classified as. We can see how the empty space in the middle is important for classifying the zero. The four and the nine can be easy confused for each other were it not for the little bit in the top left corner. The seven can be confused for a nine if we complete the top left loop and the five can be confused with a six or eight if we draw a little diagonal line.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-18073264991404833442018-10-24T22:10:00.000+02:002018-10-24T22:28:24.800+02:00Mentally estimating square rootsWhat's $\sqrt{10}$? I'd reckon it's about $3 + \frac{10-9}{2 \times 3} = 3.166$. The actual answer is 3.162. What about $\sqrt{18}$? Should be about $4 + \frac{18-16}{2 \times 4} = 4.250$. The actual answer is 4.242. I found out about this calculation from <a href="https://www.youtube.com/watch?v=PJHtqMjrStk">this video</a> but there was no explanation for why it works. Here's an explanation.<br /><br />So the method here is as follows.<br /><ol><li>Let the number you want to find the square root of be $a^2$.</li><li>Let the largest square number which is less than $a^2$ be $b^2$ such that $b^2 <= a^2 < (b+1)^2$. For example if $a^2$ is 10 then $b^2$ is 9, if $a^2$ is 18 then $b^2$ is 16.</li><li>The square root of $a^2$ is approximately $b + \frac{a^2 - b^2}{2 b}$.</li></ol><br />This method is easy to carry out mentally but why does it work? The trick here is that the graph of the square root function grows so slowly that we can approximate the curve between two adjacent square numbers as a line.<br /><br /><a href="https://3.bp.blogspot.com/-lgpg71gLShs/W9DVfIo9KHI/AAAAAAAABQE/XwYjBepZ-9g8H-P4mbeFruf1LACrY3bSACLcBGAs/s1600/sqrtestimation_graph.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-lgpg71gLShs/W9DVfIo9KHI/AAAAAAAABQE/XwYjBepZ-9g8H-P4mbeFruf1LACrY3bSACLcBGAs/s640/sqrtestimation_graph.png" width="640" height="338" data-original-width="1104" data-original-height="583" /></a><br /><br />We can use the line to approximate the square root of any number between two square numbers. The first thing we need to know is the gradient of the line. The vertical distance between two adjacent square numbers on the square root curve is 1, since the two square numbers are the squares of two consecutive numbers. The horizontal distance changes and becomes larger as the adjacent square numbers become larger but we can calculate it as follows:<br /><br />$$(b+1)^2 - b^2 = b^2 + 2b + 1 - b^2 = 2b + 1$$<br /><br />So the horizontal distance is twice the square root of the smaller square number plus one. Therefore the gradient of the line is $\frac{1}{2b+1}$. Once we know by how much the line grows vertically for every horizontal unit, we can then determine how much higher than $b$ the point on the line will be at $a$ by multiplying the gradient by $a^2-b^2$, as shown below:<br /><br /><a href="https://3.bp.blogspot.com/-BZyfdJce04Y/W9DVnBb8rFI/AAAAAAAABQI/gCMKoTeR2UwDY02k5__TU929wxxBPue2gCLcBGAs/s1600/sqrtestimation_line.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-BZyfdJce04Y/W9DVnBb8rFI/AAAAAAAABQI/gCMKoTeR2UwDY02k5__TU929wxxBPue2gCLcBGAs/s640/sqrtestimation_line.png" width="640" height="360" data-original-width="1280" data-original-height="720" /></a><br /><br />Since the difference in height is less than 1, it is going to be the part of the square root that comes after the decimal point, with the whole number part being $b$.<br /><br />It might be hard to mentally divide by an odd number in $\frac{a^2-b^2}{2b+1}$ so we further approximate it as $\frac{a^2-b^2}{2b}$ instead. And that's why this method works.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-87712869505393152292018-09-29T17:07:00.004+02:002018-09-29T17:10:25.392+02:00Comparing numpy scalars directly is time consuming, use .tolist() before a comparisonThis is something that I found out about recently when going through the elements of a numpy array in order to do some checks on each numbers. Turns out you shouldn't just do this<br /><br /><pre class="brush:python">for x in nparr:<br /> if x == 0:<br /> something something<br /></pre><br />as that uses a lot more time than doing this<br /><br /><pre class="brush:python">for x in nparr.tolist():<br /> if x == 0:<br /> something something<br /></pre><br />This is because a for loop iterating over a numpy array does not result in a sequence of Python constants but in a sequence of numpy scalars which would result in comparing a numpy array to a constant. Converting the array into a list first before the for loop will then result in a sequence of constants.<br /><br />Here is some profiling I've done using cProfile to check different ways to do an 'if' on a numpy array element:<br /><br /><pre class="brush:python">import cProfile<br />import numpy as np<br /><br />runs = 1000000<br /><br />print('Comparing numpy to numpy')<br />x = np.array(1.0, np.float32)<br />y = np.array(1.0, np.float32)<br />cProfile.run('''<br />for _ in range(runs):<br /> if x == y:<br /> pass<br />''')<br />print()<br /><br />print('Comparing numpy to constant')<br />x = np.array(1.0, np.float32)<br />cProfile.run('''<br />for _ in range(runs):<br /> if x == 1.0:<br /> pass<br />''')<br />print()<br /><br />print('Comparing constant to constant')<br />x = 1.0<br />cProfile.run('''<br />for _ in range(runs):<br /> if x == 1.0:<br /> pass<br />''')<br />print()<br /><br />print('Comparing numpy.tolist() to constant')<br />x = np.array(1.0, np.float32)<br />cProfile.run('''<br />for _ in range(runs):<br /> if x.tolist() == 1.0:<br /> pass<br />''')<br />print()<br /><br />print('Comparing numpy to numpy.array(constant)')<br />x = np.array(1.0, np.float32)<br />cProfile.run('''<br />for _ in range(runs):<br /> if x == np.array(1.0, np.float32):<br /> pass<br />''')<br />print()<br /><br />print('Comparing numpy.tolist() to numpy.tolist()')<br />x = np.array(1.0, np.float32)<br />y = np.array(1.0, np.float32)<br />cProfile.run('''<br />for _ in range(runs):<br /> if x.tolist() == y.tolist():<br /> pass<br />''')<br />print()<br /></pre><br />Here are the results in order of speed:<br /><br /><table border="1"><tr><th>Comparing constant to constant:</th><td>0.088 seconds</td></tr><tr><th>Comparing numpy.tolist() to constant:</th><td>0.288 seconds</td></tr><tr><th>Comparing numpy.tolist() to numpy.tolist():</th><td>0.508 seconds</td></tr><tr><th>Comparing numpy to numpy:</th><td>0.684 seconds</td></tr><tr><th>Comparing numpy to constant:</th><td>1.192 seconds</td></tr><tr><th>Comparing numpy to numpy.array(constant):</th><td>1.203 seconds</td></tr></table><br />It turns out that it is always faster to first convert your numpy scalars into constants via .tolist() than to do anything with them as numpy scalars.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-23594156293254387072018-08-30T23:41:00.000+02:002018-09-29T08:26:01.092+02:00The McLauren series and Taylor series (approximating complicated functions with simple polynomials)<div class="sectitle">The McLauren series</div><br />Imagine you had a function $f(x)$ that you knew was a polynomial, but whose details were unknown and you could only apply operations to the function without being able to read it. How could you find the coefficients of this polynomial? We know that for coefficients $a_i$:<br /><br />$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + ...$<br /><br />If we find $f(0)$ then we can find $a_0$.<br /><br />$f(0) = a_0 + a_1 0 + a_2 0^2 + a_3 0^3 + a_4 0^4 + ... = a_0 + 0 + 0 + ... = a_0$<br /><br />That was easy, but how can we find $a_1$? We need an operation that gets rid of $a_0$ and also the $x$ in the term $a_1 x$. That operation turns out to be differentiation with respect to $x$:<br /><br />$f'(x) = a_1 + 2 a_2 x + 3 a_3 x^2 + 4 a_4 x^3 + ...$<br /><br />Great! Now we can find $a_1$ by replacing $x$ with 0:<br /><br />$f'(0) = a_1 + 2 a_2 0 + 3 a_3 0^2 + 4 a_4 0^3 + ... = a_1$<br /><br />We can find $a_2$ by repeating these two steps:<br /><br />$f''(x) = 2 a_2 + 2 \cdot 3 a_3 x + 3 \cdot 4 a_4 x^2 + ...$<br />$f''(0) = 2 a_2 + 2 \cdot 3 a_3 0 + 3 \cdot 4 a_4 0^2 + ... = 2 a_2$<br /><br />What we found is twice of $a_2$ which means that we need to divide by 2 to get $a_2$. The differentiation operation is multiplying constants by each coefficient and the constants get bigger and bigger the more times we apply differentiation. You might notice that what's happening is that $a_0$ was multiplied by 1, $a_1$ was also multiplied by 1, $a_2$ has been multiplied by 2, $a_3$ will be multiplied by 6, $a_4$ by 24, and so on. These are factorials, sequences of whole numbers multiplied together ($3! = 1 \times 2 \times 3 = 6$). Which means that we need to divide by the next factorial after each round of differentiation and substitution by zero.<br /><br />In general we can find the $i$th coefficient in an unknown polynomial function by doing the following:<br /><br />$a_i = \frac{f^i(0)}{i!}$<br /><br />That's great. Now to test it. Let's see if a complex function is actually a polynomial in disguise. Take something simple such as $f(x) = e^x$. This doesn't look like a polynomial, but it may be represented by a polynomial with an infinite number of terms. Let's find what are the coefficients of the hidden polynomial in $e^x$.<br /><br />$f(x) = e^x = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + ...$<br />$\frac{f(0)}{0!} = \frac{e^0}{1} = a_0$<br />$\frac{f(0)}{0!} = 1$<br /><br />OK, so $a_0$ is 1. Let's find the rest of the coefficients.<br /><br />$a_1 = \frac{f'(0)}{1!} = \frac{e^0}{1} = 1$<br />$a_2 = \frac{f''(0)}{2!} = \frac{e^0}{2} = \frac{1}{2}$<br />$a_3 = \frac{f'''(0)}{3!} = \frac{e^0}{6} = \frac{1}{6}$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $e^x$ are:<br />$f(x) = 1 + x + \frac{1}{2} x^2 + \frac{1}{6} x^3 + ...$<br /><br />Does this partial polynomial look anything like $e^x$ when plotted as a graph?<br /><br /><a href="https://2.bp.blogspot.com/-YfDy5V5J0OY/W4hRuEFy1xI/AAAAAAAABNw/zGZlBU2-19IerOgCL9lb9U4ubDqk5qyvgCLcBGAs/s1600/taylor_exp.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-YfDy5V5J0OY/W4hRuEFy1xI/AAAAAAAABNw/zGZlBU2-19IerOgCL9lb9U4ubDqk5qyvgCLcBGAs/s320/taylor_exp.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />Pretty good within a boundary! Note how the curve has a perfect fit at $x = 0$ and that it gets worse as we move away from there. Adding more terms to the polynomial will enlarge the area around $x = 0$ that is close to the curve but it will always be radiating out from there.<br /><br />Let's try for $f(x) = cos(x)$ now.<br /><br />$a_0 = \frac{f(0)}{0!} = \frac{cos(0)}{1} = 1$<br />$a_1 = \frac{f'(0)}{1!} = \frac{-sin(0)}{1} = 0$<br />$a_2 = \frac{f''(0)}{2!} = \frac{-cos(0)}{2} = -\frac{1}{2}$<br />$a_3 = \frac{f'''(0)}{3!} = \frac{sin(0)}{6} = 0$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $cos(x)$ is<br />$f(x) = 1 - \frac{1}{2} x^2 + ...$<br /><br /><a href="https://4.bp.blogspot.com/-PVy37OfBly4/W4hUwDJ7oDI/AAAAAAAABN8/EI1IcfdV4y8VW3kGMVjBCZ_GV8WlNun-wCLcBGAs/s1600/taylor_cos.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-PVy37OfBly4/W4hUwDJ7oDI/AAAAAAAABN8/EI1IcfdV4y8VW3kGMVjBCZ_GV8WlNun-wCLcBGAs/s320/taylor_cos.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br /><div class="sectitle">The Taylor series</div><br />As you can see, this "polynomialisation" of functions is a neat way to approximate a function we might not know how to implement exactly but know how to differentiate and how to find its value when $x$ is 0. But what if we don't know what $f(0)$ is such as in $ln(x)$ or $\frac{1}{x}$? A slight modification to our method allows us to use any value of $x$ and not just 0. Let's call this value $b$. By slightly modifying the polynomial we expect to be hiding inside the function, we can make the polynomial act the same way when $f(b)$ is used instead of $f(0)$:<br /><br />$f(x) = a_0 + a_1 (x - b) + a_2 (x - b)^2 + a_3 (x - b)^3 + a_4 (x - b)^4 + ...$<br />$f(b) = a_0 + a_1 (b - b) + a_2 (b - b)^2 + a_3 (b - b)^3 + a_4 (b - b)^4 + ...$<br />$f(b) = a_0$<br /><br />$f'(x) = a_1 + 2 a_2 (x - b) + 3 a_3 (x - b)^2 + 4 a_4 (x - b)^3 + ...$<br />$f'(b) = a_1$<br /><br />$a_i = \frac{f^i(b)}{i!}$<br /><br />The catch here is that we are now finding coefficients to the polynomial $a_0 + a_1 (x - b) + a_2 (x - b)^2 + ...$ and not of $a_0 + a_1 x + a_2 x^2 + ...$, but that's OK. Let's try this on $ln(x)$ with $b = 1$.<br /><br />$a_0 = \frac{f(1)}{0!} = \frac{ln(1)}{1} = 0$<br />$a_1 = \frac{f'(1)}{1!} = \frac{\frac{1}{1}}{1} = 1$<br />$a_2 = \frac{f''(1)}{2!} = \frac{-\frac{1}{1^2}}{1} = -1$<br />$a_3 = \frac{f'''(1)}{3!} = \frac{\frac{2}{1^3}}{1} = 2$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $ln(x)$ is<br />$f(x) = (x - 1) - (x - 1)^2 + 2 (x - 1)^3 + ...$<br /><br /><a href="https://3.bp.blogspot.com/-OMYquo9voH4/W4hglr_yutI/AAAAAAAABOI/LIdM_LyV8bczzNaIUomN1Ibxr8KcOX5vQCLcBGAs/s1600/taylor_ln.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-OMYquo9voH4/W4hglr_yutI/AAAAAAAABOI/LIdM_LyV8bczzNaIUomN1Ibxr8KcOX5vQCLcBGAs/s320/taylor_ln.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />Adding more terms will approximate the original function better and better but what if we didn't have to? Remember how I said in the previous section that the polynomial approximates the original function best close to $x = 0$. Well now we can approximate it best around any point $b$ and not just around 0. This means that if our function has multiple known values, such as $cos(x)$ which is known to be 1 at $x = 0$, 0 at $x = \frac{\pi}{2}$, -1 at $x = \pi$, etc., then we can use several short polynomials centered at different points in the function instead of one large polynomial that approximates it well over a large interval. Let's try approximating $cos(x)$ around $x = \pi$, which means that we'll set $b$ to $\pi$.<br /><br />$a_0 = \frac{f(\pi)}{0!} = \frac{cos(\pi)}{1} = -1$<br />$a_1 = \frac{f'(\pi)}{1!} = \frac{-sin(\pi)}{1} = 0$<br />$a_2 = \frac{f''(\pi)}{2!} = \frac{-cos(\pi)}{2} = \frac{1}{2}$<br />$a_3 = \frac{f'''(\pi)}{3!} = \frac{sin(\pi)}{6} = 0$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $cos(x)$ which best approximates the area around $x = \pi$ is<br />$f(x) = -1 + \frac{1}{2} (x - \pi)^2 + ...$<br /><br /><a href="https://2.bp.blogspot.com/-zzagcHK6GKw/W4hjmtagCeI/AAAAAAAABOU/fn9eG7ZYxboXuCtODr0jCJCcTdRw3LQCQCLcBGAs/s1600/taylor_cos_pi.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-zzagcHK6GKw/W4hjmtagCeI/AAAAAAAABOU/fn9eG7ZYxboXuCtODr0jCJCcTdRw3LQCQCLcBGAs/s320/taylor_cos_pi.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />This is useful when implementing mathematical functions on a computer. You keep several simple polynomials defined at different points in the domain of the function and then pick the closest one to the $x$ you need to evaluate. You can then compute an approximation that isn't too bad without requiring a lot of computational time.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-86050409192796957252018-07-29T23:25:00.001+02:002018-07-29T23:45:37.534+02:00Hyperparameter tuning using Scikit-OptimizeOne of my favourite academic discoveries this year was <a href="https://scikit-optimize.github.io/">Scikit-Optimize</a>, a nifty little Python automatic hyperparameter tuning library that comes with a lot of features I found missing in other similar libraries.<br /><br />So as explained in an <a href="https://geekyisawesome.blogspot.com/2017/10/hyperparameter-tuning-using-hyperopt.html">earlier blog post</a>, automatic hyperparameter tuning is about finding the right hyperparameters for a machine learning algorithm automatically. Usually this is done manually using human experience but even simple MonteCarlo search random guesses can result in better performance than human tweaking (<a href="http://www.jmlr.org/papers/v13/bergstra12a.html">see here</a>). So automatic methods were developed that try to explore the space of hyperparameters and their resulting performance after training the machine learning model and then try to home in on the best performing hyperparameters. Of course each time you want to evaluate a new hyperparameter combination you'd need to retrain and evaluate your machine learning model, which might take a very long time to finish, so it's important that a good hyperparameter combination is found with as little evaluations as possible. To do this we'll use <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a>, a process where a separate simpler model is trained to predict the resulting performance of the whole hyperparameter space. We check this trained model to predict which hyperparameters will give the best resulting performance and actually evaluate them by training our machine learning model with them. The actual resulting performance is then used to update the hyperparameter space model so that it makes better predictions and then we'll get a new promising hyperparameter combination from it. This is repeated for a set number of times. Now the most common hyperparameter space model to use is a Gaussian Process which maps continuous numerical hyperparameters to a single number which is the predicted performance. This is not very good when your hyperparameters contain categorical data such as a choice of activation function. There is a <a href="https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9993">paper</a> that suggests that random forests are much better at mapping general hyperparameter combinations to predicted performance.<br /><br />Now that we got the theory out of the way, let's see how to use the library. We'll apply it on a gradient descent algorithm that needs to minimize the squared function. For this we'll need 3 hyperparameters: the range of the initial value to be selected randomly, the learning rate, and the number of epochs to run. So we have two continuous values and one discrete integer value.<br /><br /><pre class="brush:python">import random<br /><br />def cost(x):<br /> return x**2<br /><br />def cost_grad(x):<br /> return 2*x<br /><br />def graddesc(learning_rate, max_init_x, num_epochs):<br /> x = random.uniform(-max_init_x, max_init_x)<br /> for _ in range(num_epochs):<br /> x = x - learning_rate*cost_grad(x)<br /> return x<br /></pre><br />Now we need to define the skopt optimizer:<br /><br /><pre class="brush:python">import skopt<br /><br />opt = skopt.Optimizer(<br /> [<br /> skopt.space.Real(0.0, 10.0, name='max_init_x'),<br /> skopt.space.Real(1.0, 1e-10, 'log-uniform', name='learning_rate'),<br /> skopt.space.Integer(1, 20, name='num_epochs'),<br /> ],<br /> n_initial_points=5,<br /> base_estimator='RF',<br /> acq_func='EI',<br /> acq_optimizer='auto',<br /> random_state=0,<br /> )<br /></pre><br />The above code is specifying 3 hyperparameters:<br /><ul><li>the maximum initial value that is a real number (continuous) and that can be between 10 and 0</li><li>the learning rate that is also a real number but that is also on a logarithmic scale (so that you are equally likely to try very large values and very small values) and can be between 1 and 1e-10</li><li>the number of epochs that is an integer (whole number) and that can be between 1 and 20</li></ul>It is also saying that the hyperparameter space model should be initialized based on 5 random hyperparameter combinations (you train the hyperparameter space model on 5 randomly chosen hyperparameters in order to be able to get the first suggested hyperparameter), that this model should be a random forest (RF), that the acquisition function (the function to decide which hyperparameter combination is the most promising to try next according to the hyperparameter space model) is the expected improvement (EI) of the hyperparameter combination, that the acquisition optimizer (the optimizer to find the next promising hyperparameter combination) is automatically chosen, and that the random state is set to a fixed number (zero) so that it always gives the same random values each time you run it.<br /><br />Next we will use the optimizer to find good hyperparameter combinations.<br /><br /><pre class="brush:python">best_cost = 1e100<br />best_hyperparams = None<br />for trial in range(5 + 20):<br /> [max_init_x, learning_rate, num_epochs] = opt.ask()<br /> [max_init_x, learning_rate, num_epochs] = [max_init_x.tolist(), learning_rate.tolist(), num_epochs.tolist()]<br /> next_hyperparams = [max_init_x, learning_rate, num_epochs]<br /> next_cost = cost(graddesc(max_init_x, learning_rate, num_epochs))<br /> if next_cost < best_cost:<br /> best_cost = next_cost<br /> best_hyperparams = next_hyperparams<br /> print(trial+1, next_cost, next_hyperparams)<br /> opt.tell(next_hyperparams, next_cost)<br />print(best_hyperparams)<br /></pre>The nice thing about this library is that you can use an 'ask/tell' system where you ask the library to give you the next hyperparameters to try and then you do something with them in order to get the actual performance value and finally you tell the library what this performance value is. This lets you do some nifty things such as ask for another value if the hyperparameters resulted in an invalid state in the machine learning model or even to save your progress and continue later.<br /><br />In the above code we're running a for loop to run the number of times we want to evaluate different hyperparameters. We need to run it for the 5 random values we specified before to initialize the hyperparameter space model plus another 20 evaluations to actually optimize the hyperameters. Now skopt does something funny which is that it returns not plain Python values for hyperparameters but rather each number is represented as a numpy scalar. Because of this we convert each numpy scalar back into a plain Python float or int using ".tolist()". We ask for the next hyperparamters to try, convert them to plain Python values, get their resulting cost after running gradient descent, store the best hyperparameters found up to now, and tell the library what the given hyperparameters' resulting performance was. At the end we print the best hyperparamter combination found.<br /><br />Some extra stuff:<br /><ul><li>You can ask for categorical hyperparameters using "skopt.space.Categorical(['option1', 'option2'], name='options')" which will return one of the values in the list when calling "ask".</li><li>You can ask for a different hyperparameter combination in case of an invalid one by changing "ask" to give you several hyperparameter suggestions rather than just one and then trying each one of them until one works by using "opt.ask(num_hyperpars)" (you can also incrementally ask for more values and always take the last one).</li><li>You can save and continue by saving all the hyperparameters evaluated and their corresponding performance value in a text file. You then later resupply the saved hyperparameters and their performance using "tell" for each one. This is much faster than actually evaluating them on the machine learning model so straight supplying known values will be ready quickly. Just be careful that you also call "ask" before each "tell" in order to get the same exact behaviour from the optimizer or else the next "tell" will give different values from what it would have given had you not loaded the previous ones manually.</li></ul>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-53404097895830233462018-06-28T23:19:00.002+02:002018-06-28T23:41:14.889+02:00Saving/Loading a Tensorflow model using HDF5 (h5py)The normal way to save the parameters of a neural network in Tensorflow is to create a tf.train.Saver() object and then calling the object's "save" and "restore" methods. But this can be a bit cumbersome and you might prefer to have more control on what and how things get saved and loaded. The standard file format for saving large tensors (such as the parameters of a neural network) is to use <a href="http://docs.h5py.org/en/stable/">HDF5</a>.<br /><br />Saving is easy. Here is the code you'll need:<br /><pre class="brush:python">with h5py.File('model.hdf5', 'w') as f:<br /> for var in tf.trainable_variables():<br /> key = var.name.replace('/', ' ')<br /> value = session.run(var)<br /> f.create_dataset(key, data=value)<br /></pre><br />Notice that we need to use the session in order to get a parameter value.<br /><br />If you're using variable scopes then your variable names will have slashes in them and here we're replacing slashes with spaces. The reason is because the HDF5 format treats key values as directories where folder names are separated by slashes. This means that you need to traverse the keys recursively in order to arrive at the data (one folder name at a time) if you do not know the full name at the start. This replacement of slashes simplifies the code for loading a model later.<br /><br />Notice also that you can filter the variables to save as you like as well as save extra stuff. I like to save the Tensorflow version in the file in order to be able to check for incompatible variable names in contrib modules (RNNs had some variable names changed in version 1.2).<br /><br />Now comes the loading part. Loading is a tiny bit more involved because it requires that you make you neural network code include stuff for assigning values to the variables. All you need to do whilst constructing your Tensorflow graph is to include the following code:<br /><pre class="brush:python">param_setters = dict()<br />for var in tf.trainable_variables():<br /> placeholder = tf.placeholder(var.dtype, var.shape, var.name.split(':')[0]+'_setter')<br /> param_setters[var.name] = (tf.assign(var, placeholder), placeholder)<br /></pre><br />What this code does is it creates separate placeholder and assign nodes for each variable in your graph. In order to modify a variable you need to run the corresponding assign node in a session and pass the value through the corresponding placeholder. All the corresponding assign nodes and placeholders are kept in a dictionary called param_setters. We're also naming the placeholder the same as the variable but with '_setter' at the end.<br /><br />Notice that param_setters is a dictionary mapping variable names to a tuple consisting of the assign node and the placeholder.<br /><br />Now we can load the HDF5 file as follows:<br /><pre class="brush:python">with h5py.File('model.hdf5', 'r') as f:<br /> for (name, val) in f.items()<br /> name = name.replace(' ', '/')<br /> val = np.array(val)<br /> session.run(param_setters[name][0], { param_setters[name][1]: val })<br /></pre><br />What's happening here is that we're loading each parameter from the file and replacing the spaces in names back into slashes. We then run the corresponding assign node for the given parameter name in param_setters and set it to the loaded value.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-36560109695240696702018-05-23T22:29:00.002+02:002018-05-23T22:34:35.379+02:00Fancy indexing in Tensorflow: Getting a different element from every row in a matrixLet's say you have the following 4 by 3 matrix:<br /><pre class="brush:python">M = np.array([[ 1, 2, 3 ],<br /> [ 4, 5, 6 ],<br /> [ 7, 8, 9 ],<br /> [ 10, 11, 12 ]])<br /></pre>When in Tensorflow we'd also have this line:<br /><pre class="brush:python">M = tf.constant(M)<br /></pre><br />Let's say that you want to get the first element of every row. In both Numpy and Tensorflow you can just do this:<br /><pre class="brush:python">x = M[:, 0]<br /></pre>which means 'get every row and from every row get the element at index 0'. "x" is now equal to:<br /><pre class="brush:python">np.array([1, 4, 7, 10])<br /></pre><br />Now let's say that instead of the first element of every row, you wanted the third element of the first row, the second element of the second row, the first element of the third row, and the second element of the fourth row. In Numpy, this is how you do that:<br /><pre class="brush:python">idx = [2,1,0,1]<br />x = M[[0,1,2,3], idx]<br /></pre>or equivalently:<br /><pre class="brush:python">x = M[np.arange(M.shape[0]), idx]<br /></pre>This is just breaking up the 'coordinates' of the desired elements into separate lists for each axis. "x" is now equal to:<br /><pre class="brush:python">np.array([3, 5, 7, 11])<br /></pre><br />Unfortunately this sort of fancy indexing isn't available in Tensorflow. Instead we have the function "tf.gather_nd" which lets us provide a list of 'coordinates'. Unlike Numpy, tf.gather_nd does not take separate lists per axis but expects a single list with one coordinate per item like this:<br /><pre class="brush:python">idx = tf.constant([[0,2],[1,1],[2,0],[3,1]], tf.int32)<br />x = tf.gather_nd(M, idx)<br /></pre><br />This is typically inconvenient as we usually have a single vector of indexes rather then a list of coordinates. It would be better to be able to just put a "range" like we did with Numpy. We can use the range and then join it to the vector of indexes sideways using "tf.stack", like this:<br /><pre class="brush:python">idx = tf.constant([2,1,0,1], tf.int32)<br />x = tf.gather_nd(M, tf.stack([tf.range(M.shape[0]), idx], axis=1))<br /></pre><br />Kind of bulky but at least it's possible. I miss Theano's Numpy-like interface.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-7222975979872868722018-04-30T22:26:00.000+02:002018-04-30T22:49:16.196+02:00Why Bahdanau's neural attention requires two layersThe neural attention model was first described for the task of machine translation in Bahdanau's <a href="https://arxiv.org/abs/1409.0473">Neural machine translation by jointly learning to align and translate</a>. The source sentence is translated into the target sentence by attending to different words in the source sentence as the target sentence is generated one word at a time. The attended source sentence is defined as follows:<br /><br />$$<br />\begin{align}<br />c_i &= \sum_j \alpha_{ij} s_j \\<br />\alpha_{ij} &= \frac{e^{z_{ij}}}{\sum_k e^{z_{ik}}} \\<br />z_{ij} &= W \tanh(V (s_j ++ p_{i-1}))<br />\end{align}<br />$$<br />where $c_i$ is the attended source sentence taken from the weighted sum of the source sentence vectors $s_j$ and the weights $\alpha_{ij}$, $p_{i-1}$ is the prefix vector produced by the 'decoder' RNN that remembers what has been generated thus far, $++$ means concatenation, and $W$ and $V$ are two weight matrices.<br /><br />So the question we're asking is, why do we need to use two layers to produce $z_{ij}$? Can we do with just one layer?<br /><br />In reality what happens when you use a single layer is that the attention weights will remain the same across time steps such that, although the attention will be different on different words in the source sentence, as the target sentence gets generated these words will keep receiving the same attention they did throughout the whole generation process. The reason for this is that softmax, which is the function the produces $\alpha_{ij}$, is shift invariant, that is, does not change if you add the same number to each of its inputs.<br /><br />Let's say $z$ is defined as [ 1, 2, 3 ]. Then $\alpha$ will be<br />$$<br />\begin{matrix}<br />(\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3})<br />\end{matrix}<br />$$<br />but if we add the constant k to each of the three numbers then the result will still be the same<br />$$<br />\begin{matrix}<br />& (\frac{e^{1+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{2+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{3+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) \\<br />=& (\frac{e^1e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^2e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^3e^k}{e^1e^k + e^2e^k + e^3e^k}) \\<br />=& (\frac{e^1e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^2e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^3e^k}{e^k(e^1 + e^2 + e^3)}) \\<br />=& (\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3})<br />\end{matrix}<br />$$<br /><br />This proves that adding the same constant to every number in $z$ will leave the softmax unaltered. Softmax is shift invariant.<br /><br />Now let's say that $z$ is determined by a single neural layer such that $z_{ij} = W (s_j ++ p_{i-1}) = W_0 s_j + W_1 p_{i-1}$. We can draw a matrix of all possible $z$ where the columns are the source vectors $s$ and the rows are the decoder prefix states $p$.<br />$$<br />\begin{matrix}<br />(W_0 s_0 + W_1 p_0) & (W_0 s_1 + W_1 p_0) & (W_0 s_2 + W_1 p_0) & \dots \\<br />(W_0 s_0 + W_1 p_1) & (W_0 s_1 + W_1 p_1) & (W_0 s_2 + W_1 p_1) & \dots \\<br />(W_0 s_0 + W_1 p_2) & (W_0 s_1 + W_1 p_2) & (W_0 s_2 + W_1 p_2) & \dots \\<br />\dots & \dots & \dots & \dots<br />\end{matrix}<br />$$<br /><br />Given a single row, the $p$s are always the same, which means that the only source of variation between $z$s of the same prefix $p$ is from the source vectors $s$. This makes sense.<br /><br />Now take the first two rows. What is the result of subtracting the second from the first?<br /><br />$$<br />\begin{matrix}<br />(W_0 s_0 + W_1 p_0 - W_0 s_0 - W_1 p_1) & (W_0 s_1 + W_1 p_0 - W_0 s_1 - W_1 p_1) & (W_0 s_2 + W_1 p_0 - W_0 s_2 - W_1 p_1) & \dots \\<br />(W_1(p_0-p_1)) & (W_1(p_0-p_1)) & (W_1(p_0-p_1)) & \dots<br />\end{matrix}<br />$$<br /><br />Notice how all the columns have the same difference, which means that the second column can be rewritten as:<br /><br />$$<br />\begin{matrix}<br />(W_0 s_0 + W_1 p_0 + W_1(p_0-p_1)) & (W_0 s_1 + W_1 p_0 + W_1(p_0-p_1)) & (W_0 s_2 + W_1 p_0 + W_1(p_0-p_1)) & \dots<br />\end{matrix}<br />$$<br /><br />We know that adding the same constant to every $z$ will leave the softmax unaltered, which means that every time step in the decoder RNN will lead to the same attention vector. The individual attention values will be different, but they will not change throughout the whole generation process. Using two layers with a non-linear activation function in the middle will disrupt this as the difference between two consecutive $z$ will now be different at each time step.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-65373482219869214102018-03-29T20:47:00.000+02:002018-03-29T20:54:11.218+02:00Word Mover's Distance (a text semantic similarity measure)<a href="http://proceedings.mlr.press/v37/kusnerb15.pdf">Word Mover's Distance</a> is a way to measure the similarity between the meaning of two sentences. It's basically a bag of words method (ignores the order of the words in sentences) but it's been found to work really well, including for <a href="http://aclweb.org/anthology/E17-1019">evaluating automatic caption generation</a>. You can find code to perform k-nearest neighbours classification of sentiment written by the creator of WMD <a href="https://github.com/mkusner/wmd">here</a>, but reading the code is a challenge so here's how WMD works.<br /><br />This measure makes use of an existing distance measure called <a href="http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/RUBNER/emd.htm">Earth Mover's Distance</a>. Given a set of heaps of dirt and a set of holes, and both the heaps and the holes have a point as a location (a vector) and a certain mass or capacity (a weight), what is the least amount of work needed to move all the dirt into the holes or to fill all the holes completely (depending on whether there is more total mass or total capacity)? By work we mean the total distance travelled multiplied by the mass carried. This is what EMD calculates. It measures the distance between a set of vectors paired with weights and another set of vectors paired with weights, without requiring that both sets have the same number of vectors. It is a linear programming problem and there is a Python library for it called "pyemd" that can be installed via pip.<br /><br />Now back to Word Mover's Distance. WMD uses EMD between two sentences. It assumes some kind of pre-computed word embedding vectors such as <a href="https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing">word2vec</a>, which are vectors that encode the meaning of words. The words in the sentences are the heaps and holes and these word vectors are their locations. For weights, we use the normalized frequency of the words in the sentence, that is, the number of times the word was found in the sentence divided by the total number of words in the sentence. We also ignore any words that are stop words (non-informative common words like "the" and "of") and any words that are not found in the list of word vectors. Once we have collected the set of vectors and weights of the two sentences we then just put them in the EMD solver to get the distance.<br /><br />The Python library "gensim" implements WMD given a word2vec dataset and two lists of words (the sentences) without stop words. You can find a tutorial on how to use it <a href="https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html">here</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-48848289051116548772018-02-26T19:47:00.000+01:002018-02-26T19:55:49.144+01:00Finding the Greatest Common Divisor (GCD) / Highest Common Factor (HCF) of two numbers<p>The greatest common divisor or highest common factor of two numbers can be found using a very simple algorithm described since the time of Euclid, known as the Euclidean algorithm. The nice thing about it is that it can be computed visually by drawing squares in a rectangle as follows.</p><br /><p>Let's say that you have a line segment of length 6 units and a smaller line segment of length 2 units.</p><canvas id="c1" width="620" height="30"></canvas><br /><script>var ctx = document.getElementById('c1').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.moveTo(10, 10); ctx.lineTo(210, 10); ctx.stroke(); ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.moveTo(10, 20); ctx.lineTo(610, 20); ctx.stroke(); </script><br /><br /><p>We can check if the smaller line's length is a divisor of the larger line's length if we can place several of the smaller lines next to each other and reach the exact length of the larger line.</p><canvas id="c2" width="620" height="30"></canvas><br /><script>var ctx = document.getElementById('c2').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.moveTo(10, 20); ctx.lineTo(610, 20); ctx.stroke(); ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.moveTo(10, 10); ctx.lineTo(610, 10); ctx.stroke(); </script><br /><br /><p>Likewise we can check if a number is a common divisor of two other numbers by checking if a square can fill exactly a rectangle.</p><canvas id="c3" width="620" height="420"></canvas><br /><script>var ctx = document.getElementById('c3').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 600, 400); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; for (var y = 10; y < 400; y+=200) { for (var x = 10; x < 600; x+=200) { ctx.rect(x, y, 200, 200); } } ctx.stroke(); </script> <p>In the above diagram, the big blue rectangle has sides of 6 units by 4 units and the small red squares have sides 2 units by 2 units. Since the red square can exactly fill the blue rectangle by placing copies of it next to each other, 2 is a common divisor of 4 and 6.</p><p>The greatest common divisor is the largest square that can exactly fill a rectangle. The largest square that can do this has its side equal to the smallest side of the rectangle (otherwise not even one square will fit in the rectangle) whilst the smallest square has its side equal to 1 unit (assuming that sides are always whole number lengths). Let's take the above rectangle as an example and use the Euclidean algorithm to find the largest square that fits a 6 by 4 rectangle.</p><p>The Euclidean algorithm takes advantage of the fact that the greatest common divisor of two numbers is also a divisor of the difference of the two numbers. If two numbers "a" and "b" have "c" as a common divisor, then they can be rewritten as "c×p" and "c×q". The difference "a-b" is then equal to "c×p - c×q" which is equal to "c×(p - q)" which is therefore a divisor of "c" since one of its factors is "c". Notice that "c" can be any divisor of "a" and "b", which means that "a-b" should also have the greatest common divisor of "a" and "b" as a divisor.</p><p>This means that the largest square to exactly fill the original 4 by 6 square will also be the largest square to exactly fill a 2 by 4 rectangle, since 2 is the difference between 4 and 6.</p><canvas id="c4" width="620" height="420"></canvas><script>var ctx = document.getElementById('c4').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 200, 400); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; for (var y = 10; y < 400+10; y+=200) { for (var x = 10; x < 200+10; x+=200) { ctx.rect(x, y, 200, 200); } } ctx.stroke(); </script> <p>This piece of information allows us to substitute the difficult problem of finding the largest square that exactly fills a rectangle with an easier problem of finding the same square but on a smaller rectangle. The smaller the rectangle, the easier it is to find the largest square to exactly fill it and the Euclidean algorithm allows us to progressively chop off parts of the rectangle until all that remains is a small square which would be the square that exactly fills the original rectangle.</p><p>So the algorithm goes as follows:</p><ol><li>If the rectangle has both sides equal then it itself is the square that is the largest square that exactly fills the rectangle and we are done.</li><li>Otherwise put a square inside the rectangle of length equal to the shortest side of the rectangle.</li><li>Chop off the part of the rectangle that is covered by the square and repeat.</li></ol><p>The 6 by 4 rectangle is trivial so let's try this algorithm on a slightly harder problem of 217 by 124. This time we'll be using exact pixels to represent the rectangle lengths.</p><p>This is a 217 by 124 rectangle.</p><canvas id="c5" width="237" height="144"></canvas><script>var ctx = document.getElementById('c5').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 217, 124); ctx.stroke(); </script><br /><br /><p>This is the overlap it has with a square of length equal to its shortest side.</p><canvas id="c6" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c6').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 217, 124); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10, 10, 124, 124); ctx.stroke(); </script><br /><br /><p>We now chop off that part of the rectangle and find the largest square which fits the remaining 93 by 124 rectangle.</p><canvas id="c7" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c7').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10, 93, 124); ctx.stroke(); </script><br /><br /><p>This is the new square's overlap.</p><canvas id="c8" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c8').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10, 93, 124); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10+124, 10, 93, 92); ctx.stroke(); </script><br /><br /><p>This is the remaining 93 by 31 rectangle.</p><canvas id="c9" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c9').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10+93, 93, 31); ctx.stroke(); </script><br /><br /><p>This is the new square's overlap.</p><canvas id="c10" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c10').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10+93, 93, 31); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10+124, 10+93, 31, 31); ctx.stroke(); </script><br /><br /><p>This is the remaining 62 by 31 rectangle.</p><canvas id="c11" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c11').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+155, 10+93, 62, 31); ctx.stroke(); </script><br /><br /><p>This is the new square's overlap.</p><canvas id="c12" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c12').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+155, 10+93, 62, 31); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10+155, 10+93, 31, 31); ctx.stroke(); </script><br /><br /><p>This is the remaining 31 by 31 rectangle.</p><canvas id="c13" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c13').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+186, 10+93, 31, 31); ctx.stroke(); </script><br /><br /><p>We have finally reached a square of side 31, which means that 31 is the greatest common divisor of 217 and 124.</p><canvas id="c14" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c14').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 217, 124); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; for (var y = 10; y < 124+10; y+=31) { for (var x = 10; x < 217+10; x+=31) { ctx.rect(x, y, 31, 31); } } ctx.stroke(); </script> <p>Of course we don't really need to visualise anything. We can do this all numerically by just repeatedly subtracting the smallest number from the biggest and replacing the biggest with the difference. Here's a Python function that does that:</p><pre class="brush:python"><br />def gcd(a, b):<br /> while a != b:<br /> if a < b:<br /> b = b - a<br /> else:<br /> a = a - b<br /> return a<br /></pre><p>We can make this shorter by using pattern matching:</p><pre class="brush:python"><br />def gcd(a, b):<br /> while a != b:<br /> (a, b) = (min(a,b), max(a,b)-min(a,b))<br /> return a<br /></pre><p>If one of the numbers is much bigger than the other, this program will waste a lot of time repeatedly subtracting the same number until the bigger number becomes smaller than the other. Fortunately this process can be done in a single operation by dividing and getting the remainder. The remainder is what's left after you can't subtract the second number from the first any more. This speeds the algorithm up considerably. The operation is called the modulo operation. With modulo the subtractions will not stop when both numbers are equal as modulo assumes that you can do one more subtraction and make the first number zero. Instead we will now keep on looping until the remainder is zero, in which case the last number you used to get the remainder was the greatest common divisor:</p><pre class="brush:python"><br />def gcd(a, b):<br /> while b > 0:<br /> (a, b) = (min(a,b), max(a,b)%min(a,b))<br /> return a<br /></pre>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-54863868494296382542018-01-30T11:32:00.001+01:002018-01-30T11:47:52.208+01:00A quick guide to running deep learning applications on Amazon AWS virtual server supercomputerDeep learning is serious business and if you want to work on sizeable problems you're going to need more hardware than you probably have at home. Here is how to use Amazon Web Services in order to be able to upload your code and datasets to an online server which will run it for you and then let you download the results. It's basically a virtual computer that you use over the Internet by sending commands.<br /><br />First of all, when I got started I was initially following this video on how to use AWS:<br /><a href="https://youtu.be/vTAMSm06baA?t=1476">https://youtu.be/vTAMSm06baA?t=1476</a><br /><br /><div class="sectitle">AWS Educate</div><br />If you're a student then you can take advantage of AWS Educate where if you are accepted you will receive 100 free hours of runtime to get started. 100 hours in deep learning will not last long but it will allow you to experiment and get your bearings with AWS without worrying about the bill too much. It will also be a few days before you get a reply. Here's the link:<br /><a href="https://aws.amazon.com/education/awseducate/">https://aws.amazon.com/education/awseducate/</a><br /><br /><div class="sectitle">Get the applications you need</div><br />If you're on Windows, then before creating your server you should install <a href="https://winscp.net/eng/index.php">WinSCP</a> and <a href="https://www.putty.org/">PuTTY</a> in order to be able to upload your data and run your programs. If you're not on Windows then you can use the terminal with ssh to do this.<br /><br /><div class="sectitle">Create an account</div><br />Start by visiting this link and making an account:<br /><a href="https://aws.amazon.com/">https://aws.amazon.com/</a><br /><br />You need to provide your credit card details before starting. You will be charged automatically every month so make sure to keep an eye on your usage as you pay for the amount of time you let the server run per hour.<br /><br /><div class="sectitle">Enter the AWS console</div><br />Next go into the AWS console which is where you get to manage everything that has to do with virtual servers:<br /><a href="https://console.aws.amazon.com/">https://console.aws.amazon.com/</a><br /><br />Note that there is the name of a city at the top such as "Ohio". This is to say where you want your servers to be in. Amazon has servers all around the world and you might prefer one region over another, mostly to reduce latency. Different regions also have different prices, so that might take priority in your decision. Ohio and W. Virginia seem to be the cheapest. See here for more information:<br /><a href="https://www.concurrencylabs.com/blog/choose-your-aws-region-wisely/">https://www.concurrencylabs.com/blog/choose-your-aws-region-wisely/</a><br /><br />Keep in mind that each region has its own interface so that if you reserve a server in one region, you can only configure your server when that region is selected. You will not see a list of all your servers in any region. So make sure you remember which regions you use.<br /><br />After having chosen a region, next go to Services and click on EC2:<br /><a href="https://console.aws.amazon.com/ec2/">https://console.aws.amazon.com/ec2/</a><br /><br /><div class="sectitle">Create a virtual server</div><br />Click on the big blue button called "Launch Instance" in order to create your virtual server. You can now see a list of AMIs which are preconfigured virtual servers that are specialized for some kind of task, such as deep learning. You're going to copy an instance of one of these and then upload your files to it. Look for the AMI called "Deep Learning AMI (Ubuntu)" which contains a bunch of deep learning libraries together with CUDA drivers. Click on "Select".<br /><br />This is where you choose the computing power you want. The better it is, the more expensive. If you just want to see how it works then choose a free one which says "Free tier eligible". If you want to get down to business then choose one of the "GPU Compute" instances such as "p2.xlarge" which has <a href="https://aws.amazon.com/ec2/instance-types/">12GB of GPU memory</a>. The "p2.xlarge" costs about 90 cents per hour (which starts from the moment you create the instance, not when you start running your programs so it also includes the time it takes to upload your data).<br /><br />If this is your first time creating a powerful instance then you will first need to <a href="https://console.aws.amazon.com/support">ask Amazon to let you use it</a> (this is something they do to avoid people hogging all the resources). Under "Regarding" choose "Service Limit Increase", under "Limit Type" choose "EC2 Instances", and under "Region" choose the region you selected previously. You also have to say something about how you'll use it. After being granted access you can then continue from where we left off.<br /><br />After ticking the check box next to your selected computer power, click on "Next: Configure Instance Details".<br /><br />Leave this next step with default settings. Click on "Next: Add Storage".<br /><br />This is where you choose your storage space. You will need at least 50GB of space in order to store the deep learning AMI but you will need additional space for your datasets and results. Keep in mind that you have to pay per GB per month for storage. If you need frequent access to the hard drive (such as loading minibatches from disk) then you'll need to use an SSD drive which costs about <a href="https://aws.amazon.com/ebs/pricing/">10 cents per GB per month</a>. Otherwise you can use a magnetic drive which costs about <a href="https://aws.amazon.com/ebs/previous-generation/">5 cents per GB per month</a>.<br /><br />Next click on "Next: Add Tags". This is where you make up some informative tags for your virtual server in order to differentiate it from other virtual servers. You can do something like "Name=Deep learning". If you only have one server then this is not important. Click on "Next: Configure Security Group".<br /><br />This is where you create some firewall rules to make sure that only your computer has access to the virtual server, even if someone else knows the password. It might be the case that this doesn't work for you and you won't be able to connect at all even from your IP address in which case choose "Anywhere" as a source which will not make any restricts based on IP. Click "Review and Launch".<br /><br />As soon as you click on "Launch" at the bottom the clock starts ticking and you will start being billed. If you've got the AWS Educate package then you will have 100 free hours but they will start being used up. You can stop an instance any time you want but you will be billed for one hour as soon as it starts, even if you stop it immediately.<br /><br />If this is your first time using a server in the selected region (the place you want your server to be in) then you will be asked to create a cryptographic private key which is a file that will be used instead of a password to connect to the virtual server. If you're on Windows then you need to use a program that comes with PuTTY called PuTTYgen which converts the .pem file that Amazon sends you to a .ppk file that PuTTY can use. Follow the section called "To convert your private key" in the following link to do this:<br /><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console</a><br /><br /><div class="sectitle">Connecting to your virtual server</div><br />Now that we have our server ready we need to connect to it, upload our stuff, and get it to run. We're also on the clock so we need to hurry. Start by visiting your list of server instances, which can be found in the side bar when you're in the EC2 dashboard:<br /><a href="https://console.aws.amazon.com/ec2/v2/home#Instances:sort=instanceId">https://console.aws.amazon.com/ec2/v2/home#Instances:sort=instanceId</a><br /><br />Your server should be running. You can stop it by right clicking it and under "Instance state" choosing "Stop". This will stop it from charging you every hour but you will be charged for at least one hour as soon as you start.<br /><br />Clicking on a server and then clicking on "Connect" will give you information to access the server using PuTTY or ssh.<br /><br />If you're using Windows, you connect to your server using PuTTY, which you can find out how using this guide:<br /><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console</a><br />After connecting using PuTTY you can transfer the configuration to WinSCP by opening WinSCP, going on Tools, and choosing "Import sites". Now you can connect using WinSCP in order to upload and download files as well.<br /><br />If you're not using Windows then you should use the terminal and follow this guide instead:<br /><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html</a><br />You can upload stuff from Linux by using FileZilla or by using the normal directory explorer and clicking CTRL+L to enter an FTP location.<br /><br />Upload all the files you need including datasets and scripts. You can zip the files before uploading them and then unzip them on the server using the "<a href="https://www.lifewire.com/examples-linux-unzip-command-2201157">unzip</a>" Linux command.<br /><br />DO NOT INSTALL LIBRARIES WITH PIP YET! See the next section first.<br /><br /><div class="sectitle">Using your virtual server</div><br />Finally we are close to start running our scripts. But we still need to do two more things first.<br /><br />First of all, look at the top of the PuTTY or terminal which tells you how to activate different Anaconda environments. These are Python environments with different libraries available which are connected to CUDA so that you will be able to run on the GPU. In PuTTY or terminal enter "source activate <environment name>". Remember this line to use later.<br /><br />Now you can install anything you want using pip.<br /><br />Secondly as soon as you close PuTTY or terminal all running processes will stop (but the instance will still be running so you'll still be paying). What you need to do is to use the application called "screen" which will keep everything running on its own. See this link:<br /><a href="https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/">https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/</a><br /><br />Now you'll need to activate the environment again because screen creates a new session which is disconnected from the previous one.<br /><br /><div class="sectitle">Hoorray!</div><br />You're done! You can now start using your server. When you're ready, make sure to stop it and you can even terminate it but that will completely delete the server with all data so only do it when you're really ready, otherwise you'll be wasting money reuploading and rerunning everything.<br /><br />You can see your current bill by clicking on Services and going on Bill. This will show you your current month's bill as well as previous bills as well. You can ever get a daily break down and forecasts.<br /><a href="https://console.aws.amazon.com/billing/home">https://console.aws.amazon.com/billing/home</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-38909971087674260032017-12-29T12:03:00.002+01:002018-04-04T15:33:30.503+02:00An introduction to the Long Short Term Memory (LSTM) RNN and Gated Recurrent Unit (GRU)In my <a href="https://geekyisawesome.blogspot.com.mt/2017/11/an-introduction-to-recurrent-neural.html">previous post</a> I talked about the recurrent neural network and said that RNNs are used to encode a sequence of vectors into a single vector. The way it works is that the RNN has a memory, which we call a "state", and this memory gets updated each time a new input is introduced to the RNN. Even if the same input is shown twice, the memory can still be changed significantly because the new state is determined by both the new input and the previous state. After all the inputs have been shown to the RNN, the final state will be a vector that represents everything that the RNN has seen. Of course the RNN cannot remember everything in perfect detail since the state vector has a fixed size whilst the sequence of inputs can be arbitrarily long. This means that the final state will actually be a summary of the whole sequence which only "remembers" important or salient features in the sequence. This process has been likened to the short term memory in animals.<br /><br />The simple RNN described in the previous post, when using sigmoid or hyperbolic tangent as an activation function, tends to suffer from "vanishing gradients" when trained to work with long sequences. What does it mean to have vanishing gradients?<br /><br />In order to train your RNN you need to find the gradient of the error with respect to each weight in order to know whether to increase or decrease the weight. Now let's assume that we're using $tanh$ as an activation function for computing the next state. The function that turns an input and a state into a new state is something like the following:<br /><br />$s^1 = tanh(i^1 w_i + s^0 w_s)$<br /><br />We're leaving out the bias term in order to keep things short. If you want to find the gradient of $s^1$ with respect to $w_i$ (for a sequence length one), that gradient will be<br /><br />$\frac{ds^1}{dw_i} = (1 - tanh^2(i^1 w_i + s^0 w_s))(i^1)$<br /><br />Notice how the gradient involves a multiplication with $tanh^2$ of the input. $tanh$ gives a number between 1 and -1 which is a fraction and squaring a fraction makes it smaller. So the gradient might be a pretty small number.<br /><br />What happens when we have a sequence of two inputs?<br /><br />$s^2 = tanh(i^2 w_i + tanh(i^1 w_i + s^0 w_s) w_s)$<br /><br />The gradient will be:<br /><br />$\frac{ds^2}{dw_i} = (1 - tanh^2(\dots))(i^2 + (1 - tanh^2(\dots))(i^1))$<br /><br />Notice how now we've multiplied the already small number produced by $tanh^2$ by another number produced by $tanh^2$ before getting to our first input $i^1$. Multiplying a fraction by another fraction results in a fraction that's smaller than both fractions. This means that $i^1$ will contribute significantly less to the gradient than $i^2$. If we have a sequence of 3 inputs then the first input would require 3 fractional multiplications which makes it even smaller. This will eventually make the gradient negligible (it vanishes) and hence result in the RNN only learning to reduce the error with respect to recent inputs only, completely forgetting inputs towards the beginning of the sequence.<br /><br />One solution for this problem is to use activation functions which do not result in fractional multiplications for each time step, such as the rectified linear unit function (ReLU). $ReLU$ is a function that returns numbers unchanged if they're positive whilst replacing negative numbers with zero:<br /><br />$ReLU(x) = max(x, 0)$<br /><br />The gradient of this function is either 1 (if $x$ is positive) or 0 (if $x$ is negative). In mathematical terms this is equivalent to the indicator function $1_{x>0}$ which gives 1 when the subscript condition is met and 0 otherwise. So if we replace our $tanh$ with $ReLU$ for our previous equation we'll get the following:<br /><br />$s^2 = ReLU(i^2 w_i + ReLU(i^1 w_i + s^0 w_s) w_s)$<br />$\frac{ds^2}{dw_i} = 1_{i^2 w_i + ReLU(i^1 w_i + s^0 w_s) w_s > 0}(i^2 + 1_{i^1 w_i + s^0 w_s > 0}(i^1))$<br /><br />You might be worried about how a single failed condition in the indicator functions will result in the making all previous time steps being multiplied by 0 which will vanish them completely. But keep in mind that $i$ and $s$ are not single numbers but vectors of numbers, and that the $ReLU$ function will work on each number in the vectors independently. So whilst some numbers in the vectors will be negative, others will not. With large enough vectors it is unlikely that an entire vector will consist of just negative numbers so you're likely to have a number of cases with only multiplications by 1 (never 0) which will preserve at least parts of the vectors of early inputs.<br /><br />Although simple RNNs with ReLUs have been reported to give good results in some papers such as <a href="https://arxiv.org/abs/1504.00941">this</a>, a more complex form of RNN called a long short term memory (LSTM) RNN, which was designed to reduce the problem of vanishing gradients, is much more popular. According to one of the authors of the LSTM (Jurgen Schmidhuber), its name means that it is <a href="https://soundcloud.com/twiml/twiml-talk-44-jurgen-schmidhuber-lstms-plus-deep-learning-history-lesson">an RNN with a short term memory that is long</a>.<br /><br />The idea is to make the state pass through to the next state without going through an activation function. You pass the input through an activation function, but the previous state is just added to it linearly. So instead of having this state update:<br /><br />$s^1 = tanh(i^1 w_i + s^0 w_s)$<br /><br />the LSTM has this state update:<br /><br />$s^1 = tanh(i^1 w_i) + s^0$<br /><br />This changes how the gradient changes as the sequence gets longer because now we don't have nested activation functions any more. Instead this is what you end up with when you add another time step:<br /><br />$s^2 = tanh(i^2 w_i) + tanh(i^1 w_i) + s^0$<br /><br />This equation perfectly preserves the gradients as it gets longer since all you're doing is adding another term but it also loses a lot of its expressive power since the order in which you present the items in a sequence doesn't matter any more. You'll get the same state vector regardless of how you shuffle the sequence.<br /><br />What we need is a mixture of the expressiveness of the simple RNN mentioned in the beginning and gradient preservation of this new function. The solution is to have two state vectors: one for expressiveness called the "hidden state" $h$ and one for gradient preservation called the "cell state" $c$. Here is the new equation:<br /><br />$c^1 = tanh(i^1 w_i + h^0 w_h) + c^0$<br />$h^1 = tanh(c^1)$<br /><br />Notice the following things:<br /><ul><li>The order of the inputs matters now because each input is combined with a different hidden state vector.</li><li>The hidden state vector is derived from the cell state vector by just passing the cell state through a $tanh$.</li><li>Even though the hidden state is derived from the cell state, $h^0$ and $c^0$ are separate constants and $h^0$ is not equal to $tanh(c^0)$.</li></ul><br />Let's derive the equation for the final state of a length 2 sequence step by step:<br /><br />$c^2 = tanh(i^2 w_i + h^1 w_h) + c^1$<br />$h^2 = tanh(c^2)$<br /><br />$c^2 = tanh(i^2 w_i + tanh(c^1) w_h) + tanh(i^1 w_i + h^0 w_h) + c^0$<br /><br />$c^2 = tanh(i^2 w_i + tanh(tanh(i^1 w_i + h^0 w_h)) w_h) + tanh(i^1 w_i + h^0 w_h) + c^0$<br /><br />You can see that what is happening is that you end up with a separate term for each input where the input of that term is just inside a $tanh$ and no deeper. On the other hand each term also consists of all the previous inputs but at much lower levels of influence since each previous input is within an additional two nested $tanh$ functions the further back in the sequence it is found. This allows for a distinction in the order of the inputs whilst keeping each term focused on one particular input.<br /><br />The creators of the LSTM didn't stop there however. They also introduced gating functions in the LSTM. A gating function is basically a single layer sub neural net that outputs a fraction between 0 and 1 using a sigmoid. This number is then multiplied by another number in order to either leave the second number unchanged or turn it to zero. In other words, if the gate is 1 then it allows the second number to pass whilst if it is 0 then it blocks the second number (and gates in between will make the number smaller). <a href="http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf">Originally there were two gates</a>: one for the input terms $tanh(i^1 w_i + h^0 w_h)$ and one for the new hidden state $tanh(c^1)$. The gates control whether to accept the new input (the input gate) and whether to allow previous information to be represented by the hidden state (the output gate). Later, <a href="http://www.jmlr.org/papers/volume3/gers02a/gers02a.pdf">another paper</a> introduced a forget gate that regulates the cell state as well. All of these gates are controlled by sub neural nets that take a decision based on the current input and the previous hidden state.<br /><br />Here is what the final LSTM equation looks like:<br /><br />$g_f = sig(i^{t} w_{g_{f}i} + h^{t-1} w_{g_{f}h} + b_{g_f})$<br />$g_i = sig(i^{t} w_{g_{i}i} + h^{t-1} w_{g_{i}h} + b_{g_i})$<br />$g_o = sig(i^{t} w_{g_{o}i} + h^{t-1} w_{g_{o}h} + b_{g_o})$<br />$c^{t} = g_i \odot tanh(i^{t} w_{ci} + h^{t-1} w_{ch} + b_{c}) + g_f \odot c^{t-1}$<br />$h^{t} = g_o \odot tanh(c^{t})$<br /><br />where $g_f$, $g_i$, and $g_o$ are the forget gate, input gate, and output gate respectively. $w_{g_{f}i}$ means weight for the input being used in the calculation of $g_f$. $b$ are the bias terms of the sub neural nets (we didn't use biases in the previous equations to keep things short). $\odot$ is the element-wise product of two vectors and $sig$ stands for sigmoid.<br /><br />Here is a diagram illustrating the different components of the LSTM with gating function application (element-wise multiplication) being represented by triangles:<br /><br /><a href="https://3.bp.blogspot.com/-QIWJ3Ou77NI/WkYzRvV99RI/AAAAAAAABK0/gtcgPGMvMigG15BHrJfYEEN6C0wo0DRIACLcBGAs/s1600/lstm.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-QIWJ3Ou77NI/WkYzRvV99RI/AAAAAAAABK0/gtcgPGMvMigG15BHrJfYEEN6C0wo0DRIACLcBGAs/s1600/lstm.png" data-original-width="756" data-original-height="378" /></a><br /><br />A little while before, you might have been a little sceptical about using two different state vectors in order to combine the expressivity of the simple RNN with the gradient preservation of the addition based RNN described above. Can't we just use the cell state in the $tanh$ as well? In fact there was later <a href="https://arxiv.org/abs/1412.3555">a paper</a> describing another kind of RNN called a gated recurrent unit (GRU) which does just that. Instead of using one state for differentiating between time steps and another for flowing previous states into later calculations, only one state is used. Gating functions are still used, except that the input gate is replaced with 1 minus the forget gate. Basically it finds a compromise between either forgetting previous states and focussing everything on the new input or ignoring the new input and preserving the previous states. Finally, instead of having an output gate we now have a reset gate which is used to control how much the state should contribute to the input term. Here is what the GRU equation looks like:<br /><br />$g_f = sig(i^{t} w_{g_{f}i} + h^{t-1} w_{g_{f}h} + b_{g_f})$<br />$g_i = 1 - g_f$<br />$g_r = sig(i^{t} w_{g_{r}i} + h^{t-1} w_{g_{r}h} + b_{g_r})$<br />$h^{t} = g_i \odot tanh(i^{t} w_{hi} + w_{hh}(g_r \odot h^{t-1}) + b_{h}) + g_f \odot h^{t-1}$<br /><br />And here is the diagram:<br /><br /><a href="https://1.bp.blogspot.com/-WY9AJydm_ZM/WkYzXugXw5I/AAAAAAAABK4/e0O02cSJPiIE1gqEoC713_MgGO9g6SErQCLcBGAs/s1600/gru.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-WY9AJydm_ZM/WkYzXugXw5I/AAAAAAAABK4/e0O02cSJPiIE1gqEoC713_MgGO9g6SErQCLcBGAs/s1600/gru.png" data-original-width="756" data-original-height="378" /></a>Unknownnoreply@blogger.com1