tag:blogger.com,1999:blog-43189444598231434732018-09-18T08:42:48.125+02:00Geeky is AwesomeTo learn is to teach.mtantinoreply@blogger.comBlogger105125tag:blogger.com,1999:blog-4318944459823143473.post-23594156293254387072018-08-30T23:41:00.000+02:002018-08-31T16:47:31.799+02:00The McLauren series and Taylor series (approximating complex functions with simple polynomials)<div class="sectitle">The McLauren series</div><br />Imagine you had a function $f(x)$ that you knew was a polynomial, but whose details were unknown and you could only apply operations to the function without being able to read it. How could you find the coefficients of this polynomial? We know that for coefficients $a_i$:<br /><br />$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + ...$<br /><br />If we find $f(0)$ then we can find $a_0$.<br /><br />$f(0) = a_0 + a_1 0 + a_2 0^2 + a_3 0^3 + a_4 0^4 + ... = a_0 + 0 + 0 + ... = a_0$<br /><br />That was easy, but how can we find $a_1$? We need an operation that gets rid of $a_0$ and also the $x$ in the term $a_1 x$. That operation turns out to be differentiation with respect to $x$:<br /><br />$f'(x) = a_1 + 2 a_2 x + 3 a_3 x^2 + 4 a_4 x^3 + ...$<br /><br />Great! Now we can find $a_1$ by replacing $x$ with 0:<br /><br />$f'(0) = a_1 + 2 a_2 0 + 3 a_3 0^2 + 4 a_4 0^3 + ... = a_1$<br /><br />We can find $a_2$ by repeating these two steps:<br /><br />$f''(x) = 2 a_2 + 2 \cdot 3 a_3 x + 3 \cdot 4 a_4 x^2 + ...$<br />$f''(0) = 2 a_2 + 2 \cdot 3 a_3 0 + 3 \cdot 4 a_4 0^2 + ... = 2 a_2$<br /><br />What we found is twice of $a_2$ which means that we need to divide by 2 to get $a_2$. The differentiation operation is multiplying constants by each coefficient and the constants get bigger and bigger the more times we apply differentiation. You might notice that what's happening is that $a_0$ was multiplied by 1, $a_1$ was also multiplied by 1, $a_2$ has been multiplied by 2, $a_3$ will be multiplied by 6, $a_4$ by 24, and so on. These are factorials, sequences of whole numbers multiplied together ($3! = 1 \times 2 \times 3 = 6$). Which means that we need to divide by the next factorial after each round of differentiation and substitution by zero.<br /><br />In general we can find the $i$th coefficient in an unknown polynomial function by doing the following:<br /><br />$a_i = \frac{f^i(0)}{i!}$<br /><br />That's great. Now to test it. Let's see if a complex function is actually a polynomial in disguise. Take something simple such as $f(x) = e^x$. This doesn't look like a polynomial, but it may be represented by a polynomial with an infinite number of terms. Let's find what are the coefficients of the hidden polynomial in $e^x$.<br /><br />$f(x) = e^x = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + ...$<br />$\frac{f(0)}{0!} = \frac{e^0}{1} = a_0$<br />$\frac{f(0)}{0!} = 1$<br /><br />OK, so $a_0$ is 1. Let's find the rest of the coefficients.<br /><br />$a_1 = \frac{f'(0)}{1!} = \frac{e^0}{1} = 1$<br />$a_2 = \frac{f''(0)}{2!} = \frac{e^0}{2} = \frac{1}{2}$<br />$a_3 = \frac{f'''(0)}{3!} = \frac{e^0}{6} = \frac{1}{6}$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $e^x$ are:<br />$f(x) = 1 + x + \frac{1}{2} x^2 + \frac{1}{6} x^3 + ...$<br /><br />Does this partial polynomial look anything like $e^x$ when plotted as a graph?<br /><br /><a href="https://2.bp.blogspot.com/-YfDy5V5J0OY/W4hRuEFy1xI/AAAAAAAABNw/zGZlBU2-19IerOgCL9lb9U4ubDqk5qyvgCLcBGAs/s1600/taylor_exp.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-YfDy5V5J0OY/W4hRuEFy1xI/AAAAAAAABNw/zGZlBU2-19IerOgCL9lb9U4ubDqk5qyvgCLcBGAs/s320/taylor_exp.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />Pretty good within a boundary! Note how the curve has a perfect fit at $x = 0$ and that it gets worse as we move away from there. Adding more terms to the polynomial will enlarge the area around $x = 0$ that is close to the curve but it will always be radiating out from there.<br /><br />Let's try for $f(x) = cos(x)$ now.<br /><br />$a_0 = \frac{f(0)}{0!} = \frac{cos(0)}{1} = 1$<br />$a_1 = \frac{f'(0)}{1!} = \frac{-sin(0)}{1} = 0$<br />$a_2 = \frac{f''(0)}{2!} = \frac{-cos(0)}{2} = -\frac{1}{2}$<br />$a_3 = \frac{f'''(0)}{3!} = \frac{sin(0)}{6} = 0$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $cos(x)$ is<br />$f(x) = 1 - \frac{1}{2} x^2 + ...$<br /><br /><a href="https://4.bp.blogspot.com/-PVy37OfBly4/W4hUwDJ7oDI/AAAAAAAABN8/EI1IcfdV4y8VW3kGMVjBCZ_GV8WlNun-wCLcBGAs/s1600/taylor_cos.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-PVy37OfBly4/W4hUwDJ7oDI/AAAAAAAABN8/EI1IcfdV4y8VW3kGMVjBCZ_GV8WlNun-wCLcBGAs/s320/taylor_cos.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br /><div class="sectitle">The Taylor series</div><br />As you can see, this "polynomialisation" of functions is a neat way to approximate a function we might not know how to implement exactly but know how to differentiate and how to find its value when $x$ is 0. But what if we don't know what $f(0)$ is such as in $ln(x)$ or $\frac{1}{x}$? A slight modification to our method allows us to use any value of $x$ and not just 0. Let's call this value $b$. By slightly modifying the polynomial we expect to be hiding inside the function, we can make the polynomial act the same way when $f(b)$ is used instead of $f(0)$:<br /><br />$f(x) = a_0 + a_1 (x - b) + a_2 (x - b)^2 + a_3 (x - b)^3 + a_4 (x - b)^4 + ...$<br />$f(b) = a_0 + a_1 (b - b) + a_2 (b - b)^2 + a_3 (b - b)^3 + a_4 (b - b)^4 + ...$<br />$f(b) = a_0$<br /><br />$f'(x) = a_1 + 2 a_2 (x - b) + 3 a_3 (x - b)^2 + 4 a_4 (x - b)^3 + ...$<br />$f'(b) = a_1$<br /><br />$a_i = \frac{f^i(b)}{i!}$<br /><br />The catch here is that we are now finding coefficients to the polynomial $a_0 + a_1 (x - b) + a_2 (x - b)^2 + ...$ and not of $a_0 + a_1 x + a_2 x^2 + ...$, but that's OK. Let's try this on $ln(x)$ with $b = 1$.<br /><br />$a_0 = \frac{f(1)}{0!} = \frac{ln(1)}{1} = 0$<br />$a_1 = \frac{f'(1)}{1!} = \frac{\frac{1}{1}}{1} = 1$<br />$a_2 = \frac{f''(1)}{2!} = \frac{-\frac{1}{1^2}}{1} = -1$<br />$a_3 = \frac{f'''(1)}{3!} = \frac{\frac{2}{1^3}}{1} = 2$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $ln(x)$ is<br />$f(x) = (x - 1) - (x - 1)^2 + 2 (x - 1)^3 + ...$<br /><br /><a href="https://3.bp.blogspot.com/-OMYquo9voH4/W4hglr_yutI/AAAAAAAABOI/LIdM_LyV8bczzNaIUomN1Ibxr8KcOX5vQCLcBGAs/s1600/taylor_ln.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-OMYquo9voH4/W4hglr_yutI/AAAAAAAABOI/LIdM_LyV8bczzNaIUomN1Ibxr8KcOX5vQCLcBGAs/s320/taylor_ln.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />Adding more terms will approximate the original function better and better but what if we didn't have to? Remember how I said in the previous section that the polynomial approximates the original function best close to $x = 0$. Well now we can approximate it best around any point $b$ and not just around 0. This means that if our function has multiple known values, such as $cos(x)$ which is known to be 1 at $x = 0$, 0 at $x = \frac{\pi}{2}$, -1 at $x = \pi$, etc., then we can use several short polynomials centered at different points in the function instead of one large polynomial that approximates it well over a large interval. Let's try approximating $cos(x)$ around $x = \pi$, which means that we'll set $b$ to $\pi$.<br /><br />$a_0 = \frac{f(\pi)}{0!} = \frac{cos(\pi)}{1} = -1$<br />$a_1 = \frac{f'(\pi)}{1!} = \frac{-sin(\pi)}{1} = 0$<br />$a_2 = \frac{f''(\pi)}{2!} = \frac{-cos(\pi)}{2} = \frac{1}{2}$<br />$a_3 = \frac{f'''(\pi)}{3!} = \frac{sin(\pi)}{6} = 0$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $cos(x)$ which best approximates the area around $x = \pi$ is<br />$f(x) = -1 + \frac{1}{2} (x - \pi)^2 + ...$<br /><br /><a href="https://2.bp.blogspot.com/-zzagcHK6GKw/W4hjmtagCeI/AAAAAAAABOU/fn9eG7ZYxboXuCtODr0jCJCcTdRw3LQCQCLcBGAs/s1600/taylor_cos_pi.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-zzagcHK6GKw/W4hjmtagCeI/AAAAAAAABOU/fn9eG7ZYxboXuCtODr0jCJCcTdRw3LQCQCLcBGAs/s320/taylor_cos_pi.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />This is useful when implementing mathematical functions on a computer. You keep several simple polynomials defined at different points in the domain of the function and then pick the closest one to the $x$ you need to evaluate. You can then compute an approximation that isn't too bad without requiring a lot of computational time.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-86050409192796957252018-07-29T23:25:00.001+02:002018-07-29T23:45:37.534+02:00Hyperparameter tuning using Scikit-OptimizeOne of my favourite academic discoveries this year was <a href="https://scikit-optimize.github.io/">Scikit-Optimize</a>, a nifty little Python automatic hyperparameter tuning library that comes with a lot of features I found missing in other similar libraries.<br /><br />So as explained in an <a href="https://geekyisawesome.blogspot.com/2017/10/hyperparameter-tuning-using-hyperopt.html">earlier blog post</a>, automatic hyperparameter tuning is about finding the right hyperparameters for a machine learning algorithm automatically. Usually this is done manually using human experience but even simple MonteCarlo search random guesses can result in better performance than human tweaking (<a href="http://www.jmlr.org/papers/v13/bergstra12a.html">see here</a>). So automatic methods were developed that try to explore the space of hyperparameters and their resulting performance after training the machine learning model and then try to home in on the best performing hyperparameters. Of course each time you want to evaluate a new hyperparameter combination you'd need to retrain and evaluate your machine learning model, which might take a very long time to finish, so it's important that a good hyperparameter combination is found with as little evaluations as possible. To do this we'll use <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a>, a process where a separate simpler model is trained to predict the resulting performance of the whole hyperparameter space. We check this trained model to predict which hyperparameters will give the best resulting performance and actually evaluate them by training our machine learning model with them. The actual resulting performance is then used to update the hyperparameter space model so that it makes better predictions and then we'll get a new promising hyperparameter combination from it. This is repeated for a set number of times. Now the most common hyperparameter space model to use is a Gaussian Process which maps continuous numerical hyperparameters to a single number which is the predicted performance. This is not very good when your hyperparameters contain categorical data such as a choice of activation function. There is a <a href="https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9993">paper</a> that suggests that random forests are much better at mapping general hyperparameter combinations to predicted performance.<br /><br />Now that we got the theory out of the way, let's see how to use the library. We'll apply it on a gradient descent algorithm that needs to minimize the squared function. For this we'll need 3 hyperparameters: the range of the initial value to be selected randomly, the learning rate, and the number of epochs to run. So we have two continuous values and one discrete integer value.<br /><br /><pre class="brush:python">import random<br /><br />def cost(x):<br /> return x**2<br /><br />def cost_grad(x):<br /> return 2*x<br /><br />def graddesc(learning_rate, max_init_x, num_epochs):<br /> x = random.uniform(-max_init_x, max_init_x)<br /> for _ in range(num_epochs):<br /> x = x - learning_rate*cost_grad(x)<br /> return x<br /></pre><br />Now we need to define the skopt optimizer:<br /><br /><pre class="brush:python">import skopt<br /><br />opt = skopt.Optimizer(<br /> [<br /> skopt.space.Real(0.0, 10.0, name='max_init_x'),<br /> skopt.space.Real(1.0, 1e-10, 'log-uniform', name='learning_rate'),<br /> skopt.space.Integer(1, 20, name='num_epochs'),<br /> ],<br /> n_initial_points=5,<br /> base_estimator='RF',<br /> acq_func='EI',<br /> acq_optimizer='auto',<br /> random_state=0,<br /> )<br /></pre><br />The above code is specifying 3 hyperparameters:<br /><ul><li>the maximum initial value that is a real number (continuous) and that can be between 10 and 0</li><li>the learning rate that is also a real number but that is also on a logarithmic scale (so that you are equally likely to try very large values and very small values) and can be between 1 and 1e-10</li><li>the number of epochs that is an integer (whole number) and that can be between 1 and 20</li></ul>It is also saying that the hyperparameter space model should be initialized based on 5 random hyperparameter combinations (you train the hyperparameter space model on 5 randomly chosen hyperparameters in order to be able to get the first suggested hyperparameter), that this model should be a random forest (RF), that the acquisition function (the function to decide which hyperparameter combination is the most promising to try next according to the hyperparameter space model) is the expected improvement (EI) of the hyperparameter combination, that the acquisition optimizer (the optimizer to find the next promising hyperparameter combination) is automatically chosen, and that the random state is set to a fixed number (zero) so that it always gives the same random values each time you run it.<br /><br />Next we will use the optimizer to find good hyperparameter combinations.<br /><br /><pre class="brush:python">best_cost = 1e100<br />best_hyperparams = None<br />for trial in range(5 + 20):<br /> [max_init_x, learning_rate, num_epochs] = opt.ask()<br /> [max_init_x, learning_rate, num_epochs] = [max_init_x.tolist(), learning_rate.tolist(), num_epochs.tolist()]<br /> next_hyperparams = [max_init_x, learning_rate, num_epochs]<br /> next_cost = cost(graddesc(max_init_x, learning_rate, num_epochs))<br /> if next_cost < best_cost:<br /> best_cost = next_cost<br /> best_hyperparams = next_hyperparams<br /> print(trial+1, next_cost, next_hyperparams)<br /> opt.tell(next_hyperparams, next_cost)<br />print(best_hyperparams)<br /></pre>The nice thing about this library is that you can use an 'ask/tell' system where you ask the library to give you the next hyperparameters to try and then you do something with them in order to get the actual performance value and finally you tell the library what this performance value is. This lets you do some nifty things such as ask for another value if the hyperparameters resulted in an invalid state in the machine learning model or even to save your progress and continue later.<br /><br />In the above code we're running a for loop to run the number of times we want to evaluate different hyperparameters. We need to run it for the 5 random values we specified before to initialize the hyperparameter space model plus another 20 evaluations to actually optimize the hyperameters. Now skopt does something funny which is that it returns not plain Python values for hyperparameters but rather each number is represented as a numpy scalar. Because of this we convert each numpy scalar back into a plain Python float or int using ".tolist()". We ask for the next hyperparamters to try, convert them to plain Python values, get their resulting cost after running gradient descent, store the best hyperparameters found up to now, and tell the library what the given hyperparameters' resulting performance was. At the end we print the best hyperparamter combination found.<br /><br />Some extra stuff:<br /><ul><li>You can ask for categorical hyperparameters using "skopt.space.Categorical(['option1', 'option2'], name='options')" which will return one of the values in the list when calling "ask".</li><li>You can ask for a different hyperparameter combination in case of an invalid one by changing "ask" to give you several hyperparameter suggestions rather than just one and then trying each one of them until one works by using "opt.ask(num_hyperpars)" (you can also incrementally ask for more values and always take the last one).</li><li>You can save and continue by saving all the hyperparameters evaluated and their corresponding performance value in a text file. You then later resupply the saved hyperparameters and their performance using "tell" for each one. This is much faster than actually evaluating them on the machine learning model so straight supplying known values will be ready quickly. Just be careful that you also call "ask" before each "tell" in order to get the same exact behaviour from the optimizer or else the next "tell" will give different values from what it would have given had you not loaded the previous ones manually.</li></ul>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-53404097895830233462018-06-28T23:19:00.002+02:002018-06-28T23:41:14.889+02:00Saving/Loading a Tensorflow model using HDF5 (h5py)The normal way to save the parameters of a neural network in Tensorflow is to create a tf.train.Saver() object and then calling the object's "save" and "restore" methods. But this can be a bit cumbersome and you might prefer to have more control on what and how things get saved and loaded. The standard file format for saving large tensors (such as the parameters of a neural network) is to use <a href="http://docs.h5py.org/en/stable/">HDF5</a>.<br /><br />Saving is easy. Here is the code you'll need:<br /><pre class="brush:python">with h5py.File('model.hdf5', 'w') as f:<br /> for var in tf.trainable_variables():<br /> key = var.name.replace('/', ' ')<br /> value = session.run(var)<br /> f.create_dataset(key, data=value)<br /></pre><br />Notice that we need to use the session in order to get a parameter value.<br /><br />If you're using variable scopes then your variable names will have slashes in them and here we're replacing slashes with spaces. The reason is because the HDF5 format treats key values as directories where folder names are separated by slashes. This means that you need to traverse the keys recursively in order to arrive at the data (one folder name at a time) if you do not know the full name at the start. This replacement of slashes simplifies the code for loading a model later.<br /><br />Notice also that you can filter the variables to save as you like as well as save extra stuff. I like to save the Tensorflow version in the file in order to be able to check for incompatible variable names in contrib modules (RNNs had some variable names changed in version 1.2).<br /><br />Now comes the loading part. Loading is a tiny bit more involved because it requires that you make you neural network code include stuff for assigning values to the variables. All you need to do whilst constructing your Tensorflow graph is to include the following code:<br /><pre class="brush:python">param_setters = dict()<br />for var in tf.trainable_variables():<br /> placeholder = tf.placeholder(var.dtype, var.shape, var.name.split(':')[0]+'_setter')<br /> param_setters[var.name] = (tf.assign(var, placeholder), placeholder)<br /></pre><br />What this code does is it creates separate placeholder and assign nodes for each variable in your graph. In order to modify a variable you need to run the corresponding assign node in a session and pass the value through the corresponding placeholder. All the corresponding assign nodes and placeholders are kept in a dictionary called param_setters. We're also naming the placeholder the same as the variable but with '_setter' at the end.<br /><br />Notice that param_setters is a dictionary mapping variable names to a tuple consisting of the assign node and the placeholder.<br /><br />Now we can load the HDF5 file as follows:<br /><pre class="brush:python">with h5py.File('model.hdf5', 'r') as f:<br /> for (name, val) in f.items()<br /> name = name.replace(' ', '/')<br /> val = np.array(val)<br /> session.run(param_setters[name][0], { param_setters[name][1]: val })<br /></pre><br />What's happening here is that we're loading each parameter from the file and replacing the spaces in names back into slashes. We then run the corresponding assign node for the given parameter name in param_setters and set it to the loaded value.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-36560109695240696702018-05-23T22:29:00.002+02:002018-05-23T22:34:35.379+02:00Fancy indexing in Tensorflow: Getting a different element from every row in a matrixLet's say you have the following 4 by 3 matrix:<br /><pre class="brush:python">M = np.array([[ 1, 2, 3 ],<br /> [ 4, 5, 6 ],<br /> [ 7, 8, 9 ],<br /> [ 10, 11, 12 ]])<br /></pre>When in Tensorflow we'd also have this line:<br /><pre class="brush:python">M = tf.constant(M)<br /></pre><br />Let's say that you want to get the first element of every row. In both Numpy and Tensorflow you can just do this:<br /><pre class="brush:python">x = M[:, 0]<br /></pre>which means 'get every row and from every row get the element at index 0'. "x" is now equal to:<br /><pre class="brush:python">np.array([1, 4, 7, 10])<br /></pre><br />Now let's say that instead of the first element of every row, you wanted the third element of the first row, the second element of the second row, the first element of the third row, and the second element of the fourth row. In Numpy, this is how you do that:<br /><pre class="brush:python">idx = [2,1,0,1]<br />x = M[[0,1,2,3], idx]<br /></pre>or equivalently:<br /><pre class="brush:python">x = M[np.arange(M.shape[0]), idx]<br /></pre>This is just breaking up the 'coordinates' of the desired elements into separate lists for each axis. "x" is now equal to:<br /><pre class="brush:python">np.array([3, 5, 7, 11])<br /></pre><br />Unfortunately this sort of fancy indexing isn't available in Tensorflow. Instead we have the function "tf.gather_nd" which lets us provide a list of 'coordinates'. Unlike Numpy, tf.gather_nd does not take separate lists per axis but expects a single list with one coordinate per item like this:<br /><pre class="brush:python">idx = tf.constant([[0,2],[1,1],[2,0],[3,1]], tf.int32)<br />x = tf.gather_nd(M, idx)<br /></pre><br />This is typically inconvenient as we usually have a single vector of indexes rather then a list of coordinates. It would be better to be able to just put a "range" like we did with Numpy. We can use the range and then join it to the vector of indexes sideways using "tf.stack", like this:<br /><pre class="brush:python">idx = tf.constant([2,1,0,1], tf.int32)<br />x = tf.gather_nd(M, tf.stack([tf.range(M.shape[0]), idx], axis=1))<br /></pre><br />Kind of bulky but at least it's possible. I miss Theano's Numpy-like interface.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-7222975979872868722018-04-30T22:26:00.000+02:002018-04-30T22:49:16.196+02:00Why Bahdanau's neural attention requires two layersThe neural attention model was first described for the task of machine translation in Bahdanau's <a href="https://arxiv.org/abs/1409.0473">Neural machine translation by jointly learning to align and translate</a>. The source sentence is translated into the target sentence by attending to different words in the source sentence as the target sentence is generated one word at a time. The attended source sentence is defined as follows:<br /><br />$$<br />\begin{align}<br />c_i &= \sum_j \alpha_{ij} s_j \\<br />\alpha_{ij} &= \frac{e^{z_{ij}}}{\sum_k e^{z_{ik}}} \\<br />z_{ij} &= W \tanh(V (s_j ++ p_{i-1}))<br />\end{align}<br />$$<br />where $c_i$ is the attended source sentence taken from the weighted sum of the source sentence vectors $s_j$ and the weights $\alpha_{ij}$, $p_{i-1}$ is the prefix vector produced by the 'decoder' RNN that remembers what has been generated thus far, $++$ means concatenation, and $W$ and $V$ are two weight matrices.<br /><br />So the question we're asking is, why do we need to use two layers to produce $z_{ij}$? Can we do with just one layer?<br /><br />In reality what happens when you use a single layer is that the attention weights will remain the same across time steps such that, although the attention will be different on different words in the source sentence, as the target sentence gets generated these words will keep receiving the same attention they did throughout the whole generation process. The reason for this is that softmax, which is the function the produces $\alpha_{ij}$, is shift invariant, that is, does not change if you add the same number to each of its inputs.<br /><br />Let's say $z$ is defined as [ 1, 2, 3 ]. Then $\alpha$ will be<br />$$<br />\begin{matrix}<br />(\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3})<br />\end{matrix}<br />$$<br />but if we add the constant k to each of the three numbers then the result will still be the same<br />$$<br />\begin{matrix}<br />& (\frac{e^{1+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{2+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{3+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) \\<br />=& (\frac{e^1e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^2e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^3e^k}{e^1e^k + e^2e^k + e^3e^k}) \\<br />=& (\frac{e^1e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^2e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^3e^k}{e^k(e^1 + e^2 + e^3)}) \\<br />=& (\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3})<br />\end{matrix}<br />$$<br /><br />This proves that adding the same constant to every number in $z$ will leave the softmax unaltered. Softmax is shift invariant.<br /><br />Now let's say that $z$ is determined by a single neural layer such that $z_{ij} = W (s_j ++ p_{i-1}) = W_0 s_j + W_1 p_{i-1}$. We can draw a matrix of all possible $z$ where the columns are the source vectors $s$ and the rows are the decoder prefix states $p$.<br />$$<br />\begin{matrix}<br />(W_0 s_0 + W_1 p_0) & (W_0 s_1 + W_1 p_0) & (W_0 s_2 + W_1 p_0) & \dots \\<br />(W_0 s_0 + W_1 p_1) & (W_0 s_1 + W_1 p_1) & (W_0 s_2 + W_1 p_1) & \dots \\<br />(W_0 s_0 + W_1 p_2) & (W_0 s_1 + W_1 p_2) & (W_0 s_2 + W_1 p_2) & \dots \\<br />\dots & \dots & \dots & \dots<br />\end{matrix}<br />$$<br /><br />Given a single row, the $p$s are always the same, which means that the only source of variation between $z$s of the same prefix $p$ is from the source vectors $s$. This makes sense.<br /><br />Now take the first two rows. What is the result of subtracting the second from the first?<br /><br />$$<br />\begin{matrix}<br />(W_0 s_0 + W_1 p_0 - W_0 s_0 - W_1 p_1) & (W_0 s_1 + W_1 p_0 - W_0 s_1 - W_1 p_1) & (W_0 s_2 + W_1 p_0 - W_0 s_2 - W_1 p_1) & \dots \\<br />(W_1(p_0-p_1)) & (W_1(p_0-p_1)) & (W_1(p_0-p_1)) & \dots<br />\end{matrix}<br />$$<br /><br />Notice how all the columns have the same difference, which means that the second column can be rewritten as:<br /><br />$$<br />\begin{matrix}<br />(W_0 s_0 + W_1 p_0 + W_1(p_0-p_1)) & (W_0 s_1 + W_1 p_0 + W_1(p_0-p_1)) & (W_0 s_2 + W_1 p_0 + W_1(p_0-p_1)) & \dots<br />\end{matrix}<br />$$<br /><br />We know that adding the same constant to every $z$ will leave the softmax unaltered, which means that every time step in the decoder RNN will lead to the same attention vector. The individual attention values will be different, but they will not change throughout the whole generation process. Using two layers with a non-linear activation function in the middle will disrupt this as the difference between two consecutive $z$ will now be different at each time step.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-65373482219869214102018-03-29T20:47:00.000+02:002018-03-29T20:54:11.218+02:00Word Mover's Distance (a text semantic similarity measure)<a href="http://proceedings.mlr.press/v37/kusnerb15.pdf">Word Mover's Distance</a> is a way to measure the similarity between the meaning of two sentences. It's basically a bag of words method (ignores the order of the words in sentences) but it's been found to work really well, including for <a href="http://aclweb.org/anthology/E17-1019">evaluating automatic caption generation</a>. You can find code to perform k-nearest neighbours classification of sentiment written by the creator of WMD <a href="https://github.com/mkusner/wmd">here</a>, but reading the code is a challenge so here's how WMD works.<br /><br />This measure makes use of an existing distance measure called <a href="http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/RUBNER/emd.htm">Earth Mover's Distance</a>. Given a set of heaps of dirt and a set of holes, and both the heaps and the holes have a point as a location (a vector) and a certain mass or capacity (a weight), what is the least amount of work needed to move all the dirt into the holes or to fill all the holes completely (depending on whether there is more total mass or total capacity)? By work we mean the total distance travelled multiplied by the mass carried. This is what EMD calculates. It measures the distance between a set of vectors paired with weights and another set of vectors paired with weights, without requiring that both sets have the same number of vectors. It is a linear programming problem and there is a Python library for it called "pyemd" that can be installed via pip.<br /><br />Now back to Word Mover's Distance. WMD uses EMD between two sentences. It assumes some kind of pre-computed word embedding vectors such as <a href="https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing">word2vec</a>, which are vectors that encode the meaning of words. The words in the sentences are the heaps and holes and these word vectors are their locations. For weights, we use the normalized frequency of the words in the sentence, that is, the number of times the word was found in the sentence divided by the total number of words in the sentence. We also ignore any words that are stop words (non-informative common words like "the" and "of") and any words that are not found in the list of word vectors. Once we have collected the set of vectors and weights of the two sentences we then just put them in the EMD solver to get the distance.<br /><br />The Python library "gensim" implements WMD given a word2vec dataset and two lists of words (the sentences) without stop words. You can find a tutorial on how to use it <a href="https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html">here</a>.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-48848289051116548772018-02-26T19:47:00.000+01:002018-02-26T19:55:49.144+01:00Finding the Greatest Common Divisor (GCD) / Highest Common Factor (HCF) of two numbers<p>The greatest common divisor or highest common factor of two numbers can be found using a very simple algorithm described since the time of Euclid, known as the Euclidean algorithm. The nice thing about it is that it can be computed visually by drawing squares in a rectangle as follows.</p><br /><p>Let's say that you have a line segment of length 6 units and a smaller line segment of length 2 units.</p><canvas id="c1" width="620" height="30"></canvas><br /><script>var ctx = document.getElementById('c1').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.moveTo(10, 10); ctx.lineTo(210, 10); ctx.stroke(); ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.moveTo(10, 20); ctx.lineTo(610, 20); ctx.stroke(); </script><br /><br /><p>We can check if the smaller line's length is a divisor of the larger line's length if we can place several of the smaller lines next to each other and reach the exact length of the larger line.</p><canvas id="c2" width="620" height="30"></canvas><br /><script>var ctx = document.getElementById('c2').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.moveTo(10, 20); ctx.lineTo(610, 20); ctx.stroke(); ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.moveTo(10, 10); ctx.lineTo(610, 10); ctx.stroke(); </script><br /><br /><p>Likewise we can check if a number is a common divisor of two other numbers by checking if a square can fill exactly a rectangle.</p><canvas id="c3" width="620" height="420"></canvas><br /><script>var ctx = document.getElementById('c3').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 600, 400); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; for (var y = 10; y < 400; y+=200) { for (var x = 10; x < 600; x+=200) { ctx.rect(x, y, 200, 200); } } ctx.stroke(); </script> <p>In the above diagram, the big blue rectangle has sides of 6 units by 4 units and the small red squares have sides 2 units by 2 units. Since the red square can exactly fill the blue rectangle by placing copies of it next to each other, 2 is a common divisor of 4 and 6.</p><p>The greatest common divisor is the largest square that can exactly fill a rectangle. The largest square that can do this has its side equal to the smallest side of the rectangle (otherwise not even one square will fit in the rectangle) whilst the smallest square has its side equal to 1 unit (assuming that sides are always whole number lengths). Let's take the above rectangle as an example and use the Euclidean algorithm to find the largest square that fits a 6 by 4 rectangle.</p><p>The Euclidean algorithm takes advantage of the fact that the greatest common divisor of two numbers is also a divisor of the difference of the two numbers. If two numbers "a" and "b" have "c" as a common divisor, then they can be rewritten as "c×p" and "c×q". The difference "a-b" is then equal to "c×p - c×q" which is equal to "c×(p - q)" which is therefore a divisor of "c" since one of its factors is "c". Notice that "c" can be any divisor of "a" and "b", which means that "a-b" should also have the greatest common divisor of "a" and "b" as a divisor.</p><p>This means that the largest square to exactly fill the original 4 by 6 square will also be the largest square to exactly fill a 2 by 4 rectangle, since 2 is the difference between 4 and 6.</p><canvas id="c4" width="620" height="420"></canvas><script>var ctx = document.getElementById('c4').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 200, 400); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; for (var y = 10; y < 400+10; y+=200) { for (var x = 10; x < 200+10; x+=200) { ctx.rect(x, y, 200, 200); } } ctx.stroke(); </script> <p>This piece of information allows us to substitute the difficult problem of finding the largest square that exactly fills a rectangle with an easier problem of finding the same square but on a smaller rectangle. The smaller the rectangle, the easier it is to find the largest square to exactly fill it and the Euclidean algorithm allows us to progressively chop off parts of the rectangle until all that remains is a small square which would be the square that exactly fills the original rectangle.</p><p>So the algorithm goes as follows:</p><ol><li>If the rectangle has both sides equal then it itself is the square that is the largest square that exactly fills the rectangle and we are done.</li><li>Otherwise put a square inside the rectangle of length equal to the shortest side of the rectangle.</li><li>Chop off the part of the rectangle that is covered by the square and repeat.</li></ol><p>The 6 by 4 rectangle is trivial so let's try this algorithm on a slightly harder problem of 217 by 124. This time we'll be using exact pixels to represent the rectangle lengths.</p><p>This is a 217 by 124 rectangle.</p><canvas id="c5" width="237" height="144"></canvas><script>var ctx = document.getElementById('c5').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 217, 124); ctx.stroke(); </script><br /><br /><p>This is the overlap it has with a square of length equal to its shortest side.</p><canvas id="c6" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c6').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 217, 124); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10, 10, 124, 124); ctx.stroke(); </script><br /><br /><p>We now chop off that part of the rectangle and find the largest square which fits the remaining 93 by 124 rectangle.</p><canvas id="c7" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c7').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10, 93, 124); ctx.stroke(); </script><br /><br /><p>This is the new square's overlap.</p><canvas id="c8" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c8').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10, 93, 124); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10+124, 10, 93, 92); ctx.stroke(); </script><br /><br /><p>This is the remaining 93 by 31 rectangle.</p><canvas id="c9" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c9').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10+93, 93, 31); ctx.stroke(); </script><br /><br /><p>This is the new square's overlap.</p><canvas id="c10" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c10').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10+93, 93, 31); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10+124, 10+93, 31, 31); ctx.stroke(); </script><br /><br /><p>This is the remaining 62 by 31 rectangle.</p><canvas id="c11" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c11').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+155, 10+93, 62, 31); ctx.stroke(); </script><br /><br /><p>This is the new square's overlap.</p><canvas id="c12" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c12').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+155, 10+93, 62, 31); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10+155, 10+93, 31, 31); ctx.stroke(); </script><br /><br /><p>This is the remaining 31 by 31 rectangle.</p><canvas id="c13" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c13').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+186, 10+93, 31, 31); ctx.stroke(); </script><br /><br /><p>We have finally reached a square of side 31, which means that 31 is the greatest common divisor of 217 and 124.</p><canvas id="c14" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c14').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 217, 124); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; for (var y = 10; y < 124+10; y+=31) { for (var x = 10; x < 217+10; x+=31) { ctx.rect(x, y, 31, 31); } } ctx.stroke(); </script> <p>Of course we don't really need to visualise anything. We can do this all numerically by just repeatedly subtracting the smallest number from the biggest and replacing the biggest with the difference. Here's a Python function that does that:</p><pre class="brush:python"><br />def gcd(a, b):<br /> while a != b:<br /> if a < b:<br /> b = b - a<br /> else:<br /> a = a - b<br /> return a<br /></pre><p>We can make this shorter by using pattern matching:</p><pre class="brush:python"><br />def gcd(a, b):<br /> while a != b:<br /> (a, b) = (min(a,b), max(a,b)-min(a,b))<br /> return a<br /></pre><p>If one of the numbers is much bigger than the other, this program will waste a lot of time repeatedly subtracting the same number until the bigger number becomes smaller than the other. Fortunately this process can be done in a single operation by dividing and getting the remainder. The remainder is what's left after you can't subtract the second number from the first any more. This speeds the algorithm up considerably. The operation is called the modulo operation. With modulo the subtractions will not stop when both numbers are equal as modulo assumes that you can do one more subtraction and make the first number zero. Instead we will now keep on looping until the remainder is zero, in which case the last number you used to get the remainder was the greatest common divisor:</p><pre class="brush:python"><br />def gcd(a, b):<br /> while b > 0:<br /> (a, b) = (min(a,b), max(a,b)%min(a,b))<br /> return a<br /></pre>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-54863868494296382542018-01-30T11:32:00.001+01:002018-01-30T11:47:52.208+01:00A quick guide to running deep learning applications on Amazon AWS virtual server supercomputerDeep learning is serious business and if you want to work on sizeable problems you're going to need more hardware than you probably have at home. Here is how to use Amazon Web Services in order to be able to upload your code and datasets to an online server which will run it for you and then let you download the results. It's basically a virtual computer that you use over the Internet by sending commands.<br /><br />First of all, when I got started I was initially following this video on how to use AWS:<br /><a href="https://youtu.be/vTAMSm06baA?t=1476">https://youtu.be/vTAMSm06baA?t=1476</a><br /><br /><div class="sectitle">AWS Educate</div><br />If you're a student then you can take advantage of AWS Educate where if you are accepted you will receive 100 free hours of runtime to get started. 100 hours in deep learning will not last long but it will allow you to experiment and get your bearings with AWS without worrying about the bill too much. It will also be a few days before you get a reply. Here's the link:<br /><a href="https://aws.amazon.com/education/awseducate/">https://aws.amazon.com/education/awseducate/</a><br /><br /><div class="sectitle">Get the applications you need</div><br />If you're on Windows, then before creating your server you should install <a href="https://winscp.net/eng/index.php">WinSCP</a> and <a href="https://www.putty.org/">PuTTY</a> in order to be able to upload your data and run your programs. If you're not on Windows then you can use the terminal with ssh to do this.<br /><br /><div class="sectitle">Create an account</div><br />Start by visiting this link and making an account:<br /><a href="https://aws.amazon.com/">https://aws.amazon.com/</a><br /><br />You need to provide your credit card details before starting. You will be charged automatically every month so make sure to keep an eye on your usage as you pay for the amount of time you let the server run per hour.<br /><br /><div class="sectitle">Enter the AWS console</div><br />Next go into the AWS console which is where you get to manage everything that has to do with virtual servers:<br /><a href="https://console.aws.amazon.com/">https://console.aws.amazon.com/</a><br /><br />Note that there is the name of a city at the top such as "Ohio". This is to say where you want your servers to be in. Amazon has servers all around the world and you might prefer one region over another, mostly to reduce latency. Different regions also have different prices, so that might take priority in your decision. Ohio and W. Virginia seem to be the cheapest. See here for more information:<br /><a href="https://www.concurrencylabs.com/blog/choose-your-aws-region-wisely/">https://www.concurrencylabs.com/blog/choose-your-aws-region-wisely/</a><br /><br />Keep in mind that each region has its own interface so that if you reserve a server in one region, you can only configure your server when that region is selected. You will not see a list of all your servers in any region. So make sure you remember which regions you use.<br /><br />After having chosen a region, next go to Services and click on EC2:<br /><a href="https://console.aws.amazon.com/ec2/">https://console.aws.amazon.com/ec2/</a><br /><br /><div class="sectitle">Create a virtual server</div><br />Click on the big blue button called "Launch Instance" in order to create your virtual server. You can now see a list of AMIs which are preconfigured virtual servers that are specialized for some kind of task, such as deep learning. You're going to copy an instance of one of these and then upload your files to it. Look for the AMI called "Deep Learning AMI (Ubuntu)" which contains a bunch of deep learning libraries together with CUDA drivers. Click on "Select".<br /><br />This is where you choose the computing power you want. The better it is, the more expensive. If you just want to see how it works then choose a free one which says "Free tier eligible". If you want to get down to business then choose one of the "GPU Compute" instances such as "p2.xlarge" which has <a href="https://aws.amazon.com/ec2/instance-types/">12GB of GPU memory</a>. The "p2.xlarge" costs about 90 cents per hour (which starts from the moment you create the instance, not when you start running your programs so it also includes the time it takes to upload your data).<br /><br />If this is your first time creating a powerful instance then you will first need to <a href="https://console.aws.amazon.com/support">ask Amazon to let you use it</a> (this is something they do to avoid people hogging all the resources). Under "Regarding" choose "Service Limit Increase", under "Limit Type" choose "EC2 Instances", and under "Region" choose the region you selected previously. You also have to say something about how you'll use it. After being granted access you can then continue from where we left off.<br /><br />After ticking the check box next to your selected computer power, click on "Next: Configure Instance Details".<br /><br />Leave this next step with default settings. Click on "Next: Add Storage".<br /><br />This is where you choose your storage space. You will need at least 50GB of space in order to store the deep learning AMI but you will need additional space for your datasets and results. Keep in mind that you have to pay per GB per month for storage. If you need frequent access to the hard drive (such as loading minibatches from disk) then you'll need to use an SSD drive which costs about <a href="https://aws.amazon.com/ebs/pricing/">10 cents per GB per month</a>. Otherwise you can use a magnetic drive which costs about <a href="https://aws.amazon.com/ebs/previous-generation/">5 cents per GB per month</a>.<br /><br />Next click on "Next: Add Tags". This is where you make up some informative tags for your virtual server in order to differentiate it from other virtual servers. You can do something like "Name=Deep learning". If you only have one server then this is not important. Click on "Next: Configure Security Group".<br /><br />This is where you create some firewall rules to make sure that only your computer has access to the virtual server, even if someone else knows the password. It might be the case that this doesn't work for you and you won't be able to connect at all even from your IP address in which case choose "Anywhere" as a source which will not make any restricts based on IP. Click "Review and Launch".<br /><br />As soon as you click on "Launch" at the bottom the clock starts ticking and you will start being billed. If you've got the AWS Educate package then you will have 100 free hours but they will start being used up. You can stop an instance any time you want but you will be billed for one hour as soon as it starts, even if you stop it immediately.<br /><br />If this is your first time using a server in the selected region (the place you want your server to be in) then you will be asked to create a cryptographic private key which is a file that will be used instead of a password to connect to the virtual server. If you're on Windows then you need to use a program that comes with PuTTY called PuTTYgen which converts the .pem file that Amazon sends you to a .ppk file that PuTTY can use. Follow the section called "To convert your private key" in the following link to do this:<br /><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console</a><br /><br /><div class="sectitle">Connecting to your virtual server</div><br />Now that we have our server ready we need to connect to it, upload our stuff, and get it to run. We're also on the clock so we need to hurry. Start by visiting your list of server instances, which can be found in the side bar when you're in the EC2 dashboard:<br /><a href="https://console.aws.amazon.com/ec2/v2/home#Instances:sort=instanceId">https://console.aws.amazon.com/ec2/v2/home#Instances:sort=instanceId</a><br /><br />Your server should be running. You can stop it by right clicking it and under "Instance state" choosing "Stop". This will stop it from charging you every hour but you will be charged for at least one hour as soon as you start.<br /><br />Clicking on a server and then clicking on "Connect" will give you information to access the server using PuTTY or ssh.<br /><br />If you're using Windows, you connect to your server using PuTTY, which you can find out how using this guide:<br /><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console</a><br />After connecting using PuTTY you can transfer the configuration to WinSCP by opening WinSCP, going on Tools, and choosing "Import sites". Now you can connect using WinSCP in order to upload and download files as well.<br /><br />If you're not using Windows then you should use the terminal and follow this guide instead:<br /><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html</a><br />You can upload stuff from Linux by using FileZilla or by using the normal directory explorer and clicking CTRL+L to enter an FTP location.<br /><br />Upload all the files you need including datasets and scripts. You can zip the files before uploading them and then unzip them on the server using the "<a href="https://www.lifewire.com/examples-linux-unzip-command-2201157">unzip</a>" Linux command.<br /><br />DO NOT INSTALL LIBRARIES WITH PIP YET! See the next section first.<br /><br /><div class="sectitle">Using your virtual server</div><br />Finally we are close to start running our scripts. But we still need to do two more things first.<br /><br />First of all, look at the top of the PuTTY or terminal which tells you how to activate different Anaconda environments. These are Python environments with different libraries available which are connected to CUDA so that you will be able to run on the GPU. In PuTTY or terminal enter "source activate <environment name>". Remember this line to use later.<br /><br />Now you can install anything you want using pip.<br /><br />Secondly as soon as you close PuTTY or terminal all running processes will stop (but the instance will still be running so you'll still be paying). What you need to do is to use the application called "screen" which will keep everything running on its own. See this link:<br /><a href="https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/">https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/</a><br /><br />Now you'll need to activate the environment again because screen creates a new session which is disconnected from the previous one.<br /><br /><div class="sectitle">Hoorray!</div><br />You're done! You can now start using your server. When you're ready, make sure to stop it and you can even terminate it but that will completely delete the server with all data so only do it when you're really ready, otherwise you'll be wasting money reuploading and rerunning everything.<br /><br />You can see your current bill by clicking on Services and going on Bill. This will show you your current month's bill as well as previous bills as well. You can ever get a daily break down and forecasts.<br /><a href="https://console.aws.amazon.com/billing/home">https://console.aws.amazon.com/billing/home</a>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-38909971087674260032017-12-29T12:03:00.002+01:002018-04-04T15:33:30.503+02:00An introduction to the Long Short Term Memory (LSTM) RNN and Gated Recurrent Unit (GRU)In my <a href="https://geekyisawesome.blogspot.com.mt/2017/11/an-introduction-to-recurrent-neural.html">previous post</a> I talked about the recurrent neural network and said that RNNs are used to encode a sequence of vectors into a single vector. The way it works is that the RNN has a memory, which we call a "state", and this memory gets updated each time a new input is introduced to the RNN. Even if the same input is shown twice, the memory can still be changed significantly because the new state is determined by both the new input and the previous state. After all the inputs have been shown to the RNN, the final state will be a vector that represents everything that the RNN has seen. Of course the RNN cannot remember everything in perfect detail since the state vector has a fixed size whilst the sequence of inputs can be arbitrarily long. This means that the final state will actually be a summary of the whole sequence which only "remembers" important or salient features in the sequence. This process has been likened to the short term memory in animals.<br /><br />The simple RNN described in the previous post, when using sigmoid or hyperbolic tangent as an activation function, tends to suffer from "vanishing gradients" when trained to work with long sequences. What does it mean to have vanishing gradients?<br /><br />In order to train your RNN you need to find the gradient of the error with respect to each weight in order to know whether to increase or decrease the weight. Now let's assume that we're using $tanh$ as an activation function for computing the next state. The function that turns an input and a state into a new state is something like the following:<br /><br />$s^1 = tanh(i^1 w_i + s^0 w_s)$<br /><br />We're leaving out the bias term in order to keep things short. If you want to find the gradient of $s^1$ with respect to $w_i$ (for a sequence length one), that gradient will be<br /><br />$\frac{ds^1}{dw_i} = (1 - tanh^2(i^1 w_i + s^0 w_s))(i^1)$<br /><br />Notice how the gradient involves a multiplication with $tanh^2$ of the input. $tanh$ gives a number between 1 and -1 which is a fraction and squaring a fraction makes it smaller. So the gradient might be a pretty small number.<br /><br />What happens when we have a sequence of two inputs?<br /><br />$s^2 = tanh(i^2 w_i + tanh(i^1 w_i + s^0 w_s) w_s)$<br /><br />The gradient will be:<br /><br />$\frac{ds^2}{dw_i} = (1 - tanh^2(\dots))(i^2 + (1 - tanh^2(\dots))(i^1))$<br /><br />Notice how now we've multiplied the already small number produced by $tanh^2$ by another number produced by $tanh^2$ before getting to our first input $i^1$. Multiplying a fraction by another fraction results in a fraction that's smaller than both fractions. This means that $i^1$ will contribute significantly less to the gradient than $i^2$. If we have a sequence of 3 inputs then the first input would require 3 fractional multiplications which makes it even smaller. This will eventually make the gradient negligible (it vanishes) and hence result in the RNN only learning to reduce the error with respect to recent inputs only, completely forgetting inputs towards the beginning of the sequence.<br /><br />One solution for this problem is to use activation functions which do not result in fractional multiplications for each time step, such as the rectified linear unit function (ReLU). $ReLU$ is a function that returns numbers unchanged if they're positive whilst replacing negative numbers with zero:<br /><br />$ReLU(x) = max(x, 0)$<br /><br />The gradient of this function is either 1 (if $x$ is positive) or 0 (if $x$ is negative). In mathematical terms this is equivalent to the indicator function $1_{x>0}$ which gives 1 when the subscript condition is met and 0 otherwise. So if we replace our $tanh$ with $ReLU$ for our previous equation we'll get the following:<br /><br />$s^2 = ReLU(i^2 w_i + ReLU(i^1 w_i + s^0 w_s) w_s)$<br />$\frac{ds^2}{dw_i} = 1_{i^2 w_i + ReLU(i^1 w_i + s^0 w_s) w_s > 0}(i^2 + 1_{i^1 w_i + s^0 w_s > 0}(i^1))$<br /><br />You might be worried about how a single failed condition in the indicator functions will result in the making all previous time steps being multiplied by 0 which will vanish them completely. But keep in mind that $i$ and $s$ are not single numbers but vectors of numbers, and that the $ReLU$ function will work on each number in the vectors independently. So whilst some numbers in the vectors will be negative, others will not. With large enough vectors it is unlikely that an entire vector will consist of just negative numbers so you're likely to have a number of cases with only multiplications by 1 (never 0) which will preserve at least parts of the vectors of early inputs.<br /><br />Although simple RNNs with ReLUs have been reported to give good results in some papers such as <a href="https://arxiv.org/abs/1504.00941">this</a>, a more complex form of RNN called a long short term memory (LSTM) RNN, which was designed to reduce the problem of vanishing gradients, is much more popular. According to one of the authors of the LSTM (Jurgen Schmidhuber), its name means that it is <a href="https://soundcloud.com/twiml/twiml-talk-44-jurgen-schmidhuber-lstms-plus-deep-learning-history-lesson">an RNN with a short term memory that is long</a>.<br /><br />The idea is to make the state pass through to the next state without going through an activation function. You pass the input through an activation function, but the previous state is just added to it linearly. So instead of having this state update:<br /><br />$s^1 = tanh(i^1 w_i + s^0 w_s)$<br /><br />the LSTM has this state update:<br /><br />$s^1 = tanh(i^1 w_i) + s^0$<br /><br />This changes how the gradient changes as the sequence gets longer because now we don't have nested activation functions any more. Instead this is what you end up with when you add another time step:<br /><br />$s^2 = tanh(i^2 w_i) + tanh(i^1 w_i) + s^0$<br /><br />This equation perfectly preserves the gradients as it gets longer since all you're doing is adding another term but it also loses a lot of its expressive power since the order in which you present the items in a sequence doesn't matter any more. You'll get the same state vector regardless of how you shuffle the sequence.<br /><br />What we need is a mixture of the expressiveness of the simple RNN mentioned in the beginning and gradient preservation of this new function. The solution is to have two state vectors: one for expressiveness called the "hidden state" $h$ and one for gradient preservation called the "cell state" $c$. Here is the new equation:<br /><br />$c^1 = tanh(i^1 w_i + h^0 w_h) + c^0$<br />$h^1 = tanh(c^1)$<br /><br />Notice the following things:<br /><ul><li>The order of the inputs matters now because each input is combined with a different hidden state vector.</li><li>The hidden state vector is derived from the cell state vector by just passing the cell state through a $tanh$.</li><li>Even though the hidden state is derived from the cell state, $h^0$ and $c^0$ are separate constants and $h^0$ is not equal to $tanh(c^0)$.</li></ul><br />Let's derive the equation for the final state of a length 2 sequence step by step:<br /><br />$c^2 = tanh(i^2 w_i + h^1 w_h) + c^1$<br />$h^2 = tanh(c^2)$<br /><br />$c^2 = tanh(i^2 w_i + tanh(c^1) w_h) + tanh(i^1 w_i + h^0 w_h) + c^0$<br /><br />$c^2 = tanh(i^2 w_i + tanh(tanh(i^1 w_i + h^0 w_h)) w_h) + tanh(i^1 w_i + h^0 w_h) + c^0$<br /><br />You can see that what is happening is that you end up with a separate term for each input where the input of that term is just inside a $tanh$ and no deeper. On the other hand each term also consists of all the previous inputs but at much lower levels of influence since each previous input is within an additional two nested $tanh$ functions the further back in the sequence it is found. This allows for a distinction in the order of the inputs whilst keeping each term focused on one particular input.<br /><br />The creators of the LSTM didn't stop there however. They also introduced gating functions in the LSTM. A gating function is basically a single layer sub neural net that outputs a fraction between 0 and 1 using a sigmoid. This number is then multiplied by another number in order to either leave the second number unchanged or turn it to zero. In other words, if the gate is 1 then it allows the second number to pass whilst if it is 0 then it blocks the second number (and gates in between will make the number smaller). <a href="http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf">Originally there were two gates</a>: one for the input terms $tanh(i^1 w_i + h^0 w_h)$ and one for the new hidden state $tanh(c^1)$. The gates control whether to accept the new input (the input gate) and whether to allow previous information to be represented by the hidden state (the output gate). Later, <a href="http://www.jmlr.org/papers/volume3/gers02a/gers02a.pdf">another paper</a> introduced a forget gate that regulates the cell state as well. All of these gates are controlled by sub neural nets that take a decision based on the current input and the previous hidden state.<br /><br />Here is what the final LSTM equation looks like:<br /><br />$g_f = sig(i^{t} w_{g_{f}i} + h^{t-1} w_{g_{f}h} + b_{g_f})$<br />$g_i = sig(i^{t} w_{g_{i}i} + h^{t-1} w_{g_{i}h} + b_{g_i})$<br />$g_o = sig(i^{t} w_{g_{o}i} + h^{t-1} w_{g_{o}h} + b_{g_o})$<br />$c^{t} = g_i \odot tanh(i^{t} w_{ci} + h^{t-1} w_{ch} + b_{c}) + g_f \odot c^{t-1}$<br />$h^{t} = g_o \odot tanh(c^{t})$<br /><br />where $g_f$, $g_i$, and $g_o$ are the forget gate, input gate, and output gate respectively. $w_{g_{f}i}$ means weight for the input being used in the calculation of $g_f$. $b$ are the bias terms of the sub neural nets (we didn't use biases in the previous equations to keep things short). $\odot$ is the element-wise product of two vectors and $sig$ stands for sigmoid.<br /><br />Here is a diagram illustrating the different components of the LSTM with gating function application (element-wise multiplication) being represented by triangles:<br /><br /><a href="https://3.bp.blogspot.com/-QIWJ3Ou77NI/WkYzRvV99RI/AAAAAAAABK0/gtcgPGMvMigG15BHrJfYEEN6C0wo0DRIACLcBGAs/s1600/lstm.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-QIWJ3Ou77NI/WkYzRvV99RI/AAAAAAAABK0/gtcgPGMvMigG15BHrJfYEEN6C0wo0DRIACLcBGAs/s1600/lstm.png" data-original-width="756" data-original-height="378" /></a><br /><br />A little while before, you might have been a little sceptical about using two different state vectors in order to combine the expressivity of the simple RNN with the gradient preservation of the addition based RNN described above. Can't we just use the cell state in the $tanh$ as well? In fact there was later <a href="https://arxiv.org/abs/1412.3555">a paper</a> describing another kind of RNN called a gated recurrent unit (GRU) which does just that. Instead of using one state for differentiating between time steps and another for flowing previous states into later calculations, only one state is used. Gating functions are still used, except that the input gate is replaced with 1 minus the forget gate. Basically it finds a compromise between either forgetting previous states and focussing everything on the new input or ignoring the new input and preserving the previous states. Finally, instead of having an output gate we now have a reset gate which is used to control how much the state should contribute to the input term. Here is what the GRU equation looks like:<br /><br />$g_f = sig(i^{t} w_{g_{f}i} + h^{t-1} w_{g_{f}h} + b_{g_f})$<br />$g_i = 1 - g_f$<br />$g_r = sig(i^{t} w_{g_{r}i} + h^{t-1} w_{g_{r}h} + b_{g_r})$<br />$h^{t} = g_i \odot tanh(i^{t} w_{hi} + w_{hh}(g_r \odot h^{t-1}) + b_{h}) + g_f \odot h^{t-1}$<br /><br />And here is the diagram:<br /><br /><a href="https://1.bp.blogspot.com/-WY9AJydm_ZM/WkYzXugXw5I/AAAAAAAABK4/e0O02cSJPiIE1gqEoC713_MgGO9g6SErQCLcBGAs/s1600/gru.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-WY9AJydm_ZM/WkYzXugXw5I/AAAAAAAABK4/e0O02cSJPiIE1gqEoC713_MgGO9g6SErQCLcBGAs/s1600/gru.png" data-original-width="756" data-original-height="378" /></a>mtantinoreply@blogger.com1tag:blogger.com,1999:blog-4318944459823143473.post-50014336646792906302017-11-29T21:28:00.001+01:002018-02-25T20:17:49.223+01:00An introduction to recurrent neural networks: Simple RNNsTraditional feed forward neural networks work by taking a single fixed-length vector of inputs and producing a single fixed-length vector of outputs. After being trained on a training set, the neural network should be able to not only map inputs in the training set to their correct outputs, but also do so with new unseen inputs. The network is able to generalise to new inputs, but the new inputs must be of the same size. The network is not to generalise across complexity. For example if you train the network to perform addition on 4 digit numbers, it will not be able to perform addition on 5 digit numbers or even 3 digit numbers. Likewise if it learns to understand 5 word sentences then it will not be able to do anything with 6 word sentences. With images we usually solve this by resizing differently sized images into a standard size. This is not as straight forward to do with things like sentences. This is where recurrent neural networks come in.<br /><br />Recurrent neural networks (RNNs) are used to give a neural network a short term memory which is used to be able to read a sequence of inputs and remember just a summary of what is important in the sequence. This summary, which would be a fixed-length vector called a state, can then be used by the rest of the neural network as usual. It can be used to predict the next word in a partial sentence or to determine the sentiment of a sentence. A simple RNN is a feed forward neural network where neurons in a layer have extra connections that loop around to the same layer as shown below:<br /><br /><a href="https://4.bp.blogspot.com/-aSg0QgTaNUI/Wh8Xk2OdMoI/AAAAAAAABJI/hEsmLrUCKGcZkxyXqIZQ4w1u6U2BR0COgCEwYBhgL/s1600/rnns_01.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-aSg0QgTaNUI/Wh8Xk2OdMoI/AAAAAAAABJI/hEsmLrUCKGcZkxyXqIZQ4w1u6U2BR0COgCEwYBhgL/s320/rnns_01.png" width="320" height="291" data-original-width="596" data-original-height="542" /></a><br /><br />The figure shows a neural network consisting of 2 inputs and a state of neurons. The red connections allow each state neuron to produce an output based on a combination of the input neurons and the state neurons themselves. In other words the next state vector is produced based on a combination of the current input and previous state. This is the basis of short-term memory and the result is that after being exposed to a number of inputs, the state will be a vector that is influenced by each of the input vectors. The point is to train the neural network to remember what is important according to the task at hand. This means that we also need to use the final state (after processing all inputs) to generate an output (the state itself is not usually a useful output) which will make the RNN learn a useful representation of the input sequence.<br /><br /><a href="https://4.bp.blogspot.com/-qAIWWwYQdxE/Wh8XxT_VqXI/AAAAAAAABJM/L9SPhT_NMxwF-uQaIsprc4bZSpPzCLXsACLcBGAs/s1600/rnns_02.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-qAIWWwYQdxE/Wh8XxT_VqXI/AAAAAAAABJM/L9SPhT_NMxwF-uQaIsprc4bZSpPzCLXsACLcBGAs/s320/rnns_02.png" width="320" height="204" data-original-width="829" data-original-height="529" /></a><br /><br />How are the inputs and the recurrent connections combined together? By concatenating the input and state vectors together and then passing them through a weight matrix as usual, generating the next state vector.<br /><br /><a href="https://2.bp.blogspot.com/-Ju_-TrIqrNo/Wh8X15rX-UI/AAAAAAAABJQ/NAdGlhzua3Uxfg2js_SF4Q7-9fZYJXNBgCLcBGAs/s1600/rnns_03.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-Ju_-TrIqrNo/Wh8X15rX-UI/AAAAAAAABJQ/NAdGlhzua3Uxfg2js_SF4Q7-9fZYJXNBgCLcBGAs/s320/rnns_03.png" width="320" height="301" data-original-width="269" data-original-height="253" /></a><br /><br />But what happens for the first input? What's the first input going to concatenated with if there is no previous state? We have to define a default initial state for the first input. This is usually the all zeros vector but you can instead learn a constant vector that gets optimized during training.<br /><br />Great, so that's all the basics sorted. Now for the formal notation. It is common to use the terminology of time series when talking about RNNs such that each input in a sequence belongs to a different time step in a series. In our formal notation, let's use superscripts to refer to time steps such that the first time step is 1 and the number of time steps is $T$.<br /><br />$$s^0 = \mathbf{0}$$<br />$$s^t = f_s([s^{t-1} i^t] W_s + b_s)$$<br />$$o = f_o(s^T W_o + b_o)$$<br /><br />where $s^t$ is the state vector at time $t$, $i^t$ is the input vector at time $t$, $o$ is the output vector, $\mathbf{0}$ is the all zeros vector, $f_s$ and $f_o$ are the activation functions of the state vector and output vector respectively, $W_s$ $b_s$ $W_o$ and $b_o$ are the weights and biases of the state vector and output vector respectively.<br /><br />The question is how to learn the parameters of a recurrent function, which is not as single as with a feed forward neural network. The first thing we need to do is to unroll the recurrent network into a linear network that reuses the same parameters throughout. Although an RNN can be unrolled to infinity it is also the case that you only need to unroll it as far as your input sequences require. So if your training set contains an input sequence that is 3 time steps long, then you can use an RNN that is unrolled 3 times like this:<br /><br /><a href="https://3.bp.blogspot.com/-xNZPsxw7gtY/Wh8ay2nFidI/AAAAAAAABJo/1PdsiGCM92ceobTPdQPR8PktYcAxiqztQCLcBGAs/s1600/rnns_04.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-xNZPsxw7gtY/Wh8ay2nFidI/AAAAAAAABJo/1PdsiGCM92ceobTPdQPR8PktYcAxiqztQCLcBGAs/s640/rnns_04.png" width="640" height="341" data-original-width="1371" data-original-height="730" /></a><br /><br />Now it makes more sense and we can view it as a feed forward neural network. In fact an RNN is a feed forward network with the constraint that corresponding weights across time steps have to be identical. Notice how $W_{s00}$ and $W_{i00}$ are repeated with every time step. So whereas the weights in different layers in a feed forward neural net can be different, in an RNN they have to be the same. You might be thinking about how to handle the other input sequences of different lengths in the training set. We'll get to that later. For now let's assume that all sequences in the training set are grouped by length and that each minibatch consists of same length sequences.<br /><br />Since training will involve finding the gradient of the loss with respect to each weight, let's start by finding the gradient of the output with respect to a sample of weights. If you're not familiar with the back propagation algorithm and how gradients are used to train neural networks in general you should check out this previous <a href="https://geekyisawesome.blogspot.com.mt/2016/06/the-backpropagation-algorithm-for.html">blog post</a> before continuing on.<br /><br />Let's start with the gradient with respect to a non-recurrent weight in the output:<br /><br />$$\frac{do_0}{dW_{o00}} = \frac{d}{dW_{o00}}f_o(W_{o00}s_0^3 + W_{o10}s_1^3) = f_o'(\ldots)s_0^3$$<br /><br />That was straight forward. What about for recurrent weights?<br /><br />$$\frac{do_0}{dW_{s00}} = \frac{d}{dW_{s00}}f_o(W_{o00}s_0^3 + W_{o10}s_1^3)$$<br />$$= f_o'(\ldots)(\frac{d}{dW_{s00}}f_s(W_{s00}s_0^2 + W_{s10}s_1^2 + W_{i00}i_0^3 + W_{i10}i_1^3) + \frac{d}{dW_{s00}}f_s(W_{s01}s_0^2 + W_{s11}s_1^2 + W_{i01}i_0^3 + W_{i11}i_1^3))$$<br />$$= f_o'(\ldots)(f_s'(\ldots)(\frac{d}{dW_{s00}}W_{s00}s_0^2))$$<br /><br />And it is at this point that we will realise that things are more complicated than usual with feed forward neural nets. This is because $s_0^2$ can be decomposed to reveal more terms that contain $W_{s00}$ which means that we need to use the product rule.<br /><br />$$= f_o'(\ldots)(f_s'(\ldots)(s_0^2\frac{d}{dW_{s00}}W_{s00} + W_{s00}\frac{d}{dW_{s00}}s_0^2))$$<br />$$= f_o'(\ldots)(f_s'(\ldots)(s_0^2 + W_{s00}\frac{d}{dW_{s00}}f_s(W_{s00}s_0^1 + W_{s10}s_1^1 + W_{i00}i_0^2 + W_{i10}i_1^2)))$$<br /><br />...and so on, which would require as many decompositions as the length of the sequence. This is not compatible with the back propagation algorithm as it's not easy to extract a pattern that works for any sequence length. Keep in mind that we need to do this for the input weights as well.<br /><br />Fortunately there is a simple solution: Treat all weights as being different and then add together the corresponding derivatives. What this means is that you put a superscript on each weight which indicates the time step it belongs to, hence making each weight different. So instead of having $W_{s00}$ we'd have $W_{s00}^3$, $W_{s00}^2$ and $W_{s00}^1$. Then we find the derivatives of each separate weight and finally add them all up:<br /><br />$$\frac{do_0}{dW_{s00}} = \frac{do_0}{dW_{s00}^3} + \frac{do_0}{dW_{s00}^2} + \frac{do_0}{dW_{s00}^1}$$<br /><br />This allows us to use normal back propagation to find each individual weight as if we were working on a feed forward neural net and then finally just add together corresponding derivatives in order to keep the weights identical. Notice that this is not a hack to force the weights to remain identical. The sum of the subderivatives really does equal $\frac{do_0}{dW_{s00}}$. You can try proving it yourself by trying to find the derivative using product rules as I was doing before. This trick is called back propagation through time.<br /><br />We now get back to the question of handling variable length sequences in a training set. The solution to this is to make all sequences of equal length by padding them with pad vectors (the all zeros vector for example) and then make the RNN simply return the current state unmodified if the input is a pad vector. That way the state will remain the same beyond the length of the sequence, as if there were no pad vectors. This is the new RNN equation:<br /><br />$$s^0 = \mathbf{0}$$<br />$$s^t =<br />\begin{cases}<br />f_s([s^{t-1} i^t] W_s + b_s) & \quad \text{if } i^t \text{ is not a pad}\\<br />s^{t-1} & \quad \text{otherwise}<br />\end{cases}<br />$$<br />$$o = f_o(s^T W_o + b_o)$$<br /><br />You can now see how to implement a language model which predicts the next word in a partial sentence using Tensorflow by checking this previous <a href="https://geekyisawesome.blogspot.com.mt/2017/05/neural-language-models-and-how-to-make.html">blog post</a>. You might also want to learn about how to represent words as vectors using word embeddings in this other <a href="https://geekyisawesome.blogspot.com.mt/2017/03/word-embeddings-how-word2vec-and-glove.html">blog post</a>.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-91087400828036759802017-10-21T23:19:00.001+02:002017-10-22T12:18:40.051+02:00Hyperparameter tuning using HyperoptOne of the most tedious but important things there is in machine learning is tuning the hyperparameters of your machine learning algorithm, such as the learning rate and initial parameters in gradient descent. For example you might want to check how your gradient descent algorithm performs when the learning rate is 1.0, 0.1, or 0.01 and the initial parameters being randomly chosen between -10.0 and 10.0 or between -1.0 and 1.0.<br /><br />The problem with doing this is that unless you have several supercomputers at your disposal, testing the learning algorithm is a slow process and there are so many different ways to configure your hyperparameters. In the above example you'd have to try six different configurations: (1.0,(-10.0,10.0)), (1.0,(-1.0,1.0)), (0.1,(-10.0,10.0)), etc.. An exhaustive search (grid search) would take too long if each test takes very long and in practice you'd have a search space which is much bigger than six. We can test a random sample of the configurations but ideally the sample would be chosen intelligently rather than randomly. We could start out by trying some randomly chosen configurations and then start homing in on some of the more promising hyperparameter choices.<br /><br />This is where the Python library <a href="https://jaberg.github.io/hyperopt/">Hyperopt</a> comes in. It's a simple library that searches a space of hyperparameters in a rational way so that you'll end up with a better result after trying 100 configurations than if you just randomly sampled 100 different configurations and picked the best performing one. It does this by using a <a href="http://optunity.readthedocs.io/en/latest/user/solvers/TPE.html">tree of Parzen Estimators</a>.<br /><br />Let's start with a simple gradient descent example that finds the minimum of "y = x^2". We'll write a function that takes in a learning rate, and a range within which to initialize "x" (we'll only ask for the maximum of this range and assume that the negative of this number is the minimum). We'll then apply gradient descent on the initial "x" value for 10 epochs. The function will finally return the value of "x" that was found near the minimum.<br /><br /><pre class="brush:python">import random<br /><br />def loss(x):<br /> return x**2<br /><br />def loss_grad(x):<br /> return 2*x<br /><br />def graddesc(learning_rate, max_init_x):<br /> x = random.uniform(-max_init_x, max_init_x)<br /> for _ in range(10):<br /> x = x - learning_rate*loss_grad(x)<br /> return x<br /></pre><br />Great, now we want to find the best hyperparameters to use. In practice we'd have a validation set for our machine learning model to learn to perform well on. Once the hyperparameters that result in the best performance on the validation set are found, we'd apply them to learn on a separate test set and it is this performance that is used to judge the quality of the learning algorithm. However, for this blog post we'll instead focus on the simpler mathematical optimization problem of finding the minimum of "y = x^2".<br /><br />This is how you use Hyperopt to find the best hyperparameter combination. First you create a function called an objective function that takes in a tuple with the chosen hyperparameters and returns a dictionary that is read by hyperopt to assess the fitness of the chosen hyperparameters. Then you take this function and the possible hyperparameter choices you want to allow and pass them to the hyperopt function called "fmin" which finds the hyperparameters that give the smallest loss.<br /><br /><pre class="brush:python">import hyperopt<br /><br />def hyperopt_objective(hyperparams):<br /> (learning_rate, max_init_x) = hyperparams<br /> l = loss(graddesc(learning_rate, max_init_x))<br /> return {<br /> 'loss': l,<br /> 'status': hyperopt.STATUS_OK,<br /> }<br /><br />best = hyperopt.fmin(<br /> hyperopt_objective,<br /> space = [<br /> hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),<br /> hyperopt.hp.choice('max_init_x', [ 10.0, 1.0 ]),<br /> ],<br /> algo = hyperopt.tpe.suggest,<br /> max_evals = 10<br /> )<br />print(best)<br /></pre><br />The output of this program is this:<br /><pre>{'learning_rate': 1, 'max_init_x': 1}<br /></pre>This is saying that the best loss is obtained when the learning rate is the item at index 1 in the given list (0.1) and the maximum initial value is the item at index 1 in the given list (1.0). The "space" parameter in fmin is there to say how to construct a combination of hyperparameters. We specified that we want a list of two things: a learning rate that can be either 1.0, or 0.1, or 0.01, and a maximum initial value that can be either 10.0 or 1.0. We use "hp.choice" to let fmin choose among a list. We could instead use "hp.uniform" in order to allow any number within a range. I prefer to use a list of human friendly numbers instead of allowing any number so I use the choice function instead. We have also said that we want to allow exactly 10 evaluations of the objective function in order to find the best hyperparameter combination.<br /><br />Although this is how we are expected to use this library, it is not very user friendly in general. For example there is no feedback given throughout the search process so if each evaluation of a hyperparameter combination takes hours to complete then we could end up waiting for several days without seeing anything, just waiting for the function to return a value. The return value also requires further processing as it's just indexes. We can make this better by adding some extra stuff to the objective function:<br /><br /><pre class="brush:python">eval_num = 0<br />best_loss = None<br />best_hyperparams = None<br />def hyperopt_objective(hyperparams):<br /> global eval_num<br /> global best_loss<br /> global best_hyperparams<br /><br /> eval_num += 1<br /> (learning_rate, max_init_x) = hyperparams<br /> l = loss(graddesc(learning_rate, max_init_x))<br /> print(eval_num, l, hyperparams)<br /><br /> if best_loss is None or l < best_loss:<br /> best_loss = l<br /> best_hyperparams = hyperparams<br /><br /> return {<br /> 'loss': l,<br /> 'status': hyperopt.STATUS_OK,<br /> }<br /></pre><br />We can now see each hyperparameter combination being evaluated. This is the output we'll see:<br /><pre>1 1.6868761146697238 (1.0, 10.0)<br />2 0.34976768426779775 (0.01, 1.0)<br />3 0.006508209785146999 (0.1, 1.0)<br />4 1.5999357079405185 (0.01, 10.0)<br />5 0.2646974732349057 (0.01, 1.0)<br />6 0.5182259594937579 (1.0, 10.0)<br />7 53.61565213613977 (1.0, 10.0)<br />8 1.8239879002601682 (1.0, 10.0)<br />9 0.15820396975495435 (0.01, 1.0)<br />10 0.784445725853568 (1.0, 1.0)<br /></pre><br />Also, the variable best_hyperparams will contain the tuple with the best hyperparameters found. Printing best_hyperparams will show "(0.1, 1.0)". We can even save the best hyperparameters found till now in a file so that we can stop the search early if we run out of patience.<br /><br />Here is the full code in one place:<br /><br /><pre class="brush:python" style="height:400px; overflow:scroll;">import random<br />import hyperopt<br /><br />def loss(x):<br /> return x**2<br /><br />def loss_grad(x):<br /> return 2*x<br /><br />def graddesc(learning_rate, max_init_x):<br /> x = random.uniform(-max_init_x, max_init_x)<br /> for _ in range(10):<br /> x = x - learning_rate*loss_grad(x)<br /> return x<br /><br />eval_num = 0<br />best_loss = None<br />best_hyperparams = None<br />def hyperopt_objective(hyperparams):<br /> global eval_num<br /> global best_loss<br /> global best_hyperparams<br /><br /> eval_num += 1<br /> (learning_rate, max_init_x) = hyperparams<br /> l = loss(graddesc(learning_rate, max_init_x))<br /> print(eval_num, l, hyperparams)<br /><br /> if best_loss is None or l < best_loss:<br /> best_loss = l<br /> best_hyperparams = hyperparams<br /><br /> return {<br /> 'loss': l,<br /> 'status': hyperopt.STATUS_OK,<br /> }<br /><br />hyperopt.fmin(<br /> hyperopt_objective,<br /> space = [<br /> hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),<br /> hyperopt.hp.choice('max_init_x', [ 10.0, 1.0 ]),<br /> ],<br /> algo = hyperopt.tpe.suggest,<br /> max_evals = 10<br /> )<br /><br />print(best_hyperparams)<br /></pre><br />To find out more about Hyperopt see this <a href="https://github.com/jaberg/hyperopt/wiki/FMin">documentation page</a> and the <a href="https://github.com/jaberg/hyperopt">Github repository</a>.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-15675259591453729732017-09-18T16:27:00.003+02:002017-09-19T08:39:37.834+02:00Find an equation that passes through a set of points (Lagrange polynomial interpolation)When I was in school, I remember learning about how to find the equation of a line that passes through two points. When I asked how to find the equation of a curve instead I was told that this was not possible, probably because there is an infinite number of curves that pass through a finite set of points (as we'll see below). However I would have been happy to learn about this simple method to produce a polynomial curve that passes through any given set of points. In general this task is called interpolation and the particular interpolating method we shall see here is called <a href="https://en.wikipedia.org/wiki/Lagrange_polynomial">Lagrange polynomials</a>.<br /><br />A polynomial is an equation of the form<br /><br />$$a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0$$<br /><br />For example $2x^3 + 4x^2 - 3x + 1$ is a polynomial where 2 is $a_3$, 4 is $a_2$, 4 is $a_1$ and 1 is $a_0$. A polynomial plus another polynomial is also a polynomial, for example $(x^2 + 2x) + (3x + 2) = x^2 + 5x + 2$. A polynomial times another polynomial is also a polynomial, for example $(x^2 + 2x)(3x + 2) = 3x^3 + 8x^2 + 4x$.<br /><br />Now on to polynomial interpolation. Let's start with the simplest case, when you have only one point. Let's say you want a polynomial that passes through (3,2), that is, when $x$ is 3, the polynomial should equal 2. In this case our equation can simply be<br /><br />$$y = 2$$<br /><br />that is, the polynomial is just 2. The equation is always 2 regardless of where $x$ is so we have found our polynomial.<br /><br /><figure><a href="https://1.bp.blogspot.com/-pZ0esix4-30/Wb_VeecRCNI/AAAAAAAABHY/1l_lHMsinhcBgmz961xJPXpD1M3s_XCKgCLcBGAs/s1600/lagrange01.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-pZ0esix4-30/Wb_VeecRCNI/AAAAAAAABHY/1l_lHMsinhcBgmz961xJPXpD1M3s_XCKgCLcBGAs/s320/lagrange01.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = 2$.</figcaption></figure><br /><br />Now on to a more interesting case. Let's say we want to pass through these points:<br /><ul><li>(3,2)</li><li>(4,3)</li><li>(6,4)</li></ul><br />The trick is to add up a set of polynomials each of which goes through one point. We need a polynomial that equals 2 when $x$ is 3, another polynomial that equals 3 when $x$ is 4, and another polynomial that equals 4 when $x$ is 6. But we have to be careful as we don't want these three polynomials to interfere with each other after being added up together. For example if we can't just do $y = (2) + (3) + (4)$ as the two terms will interfere with each other and the result will not pass through any of the points. Instead we need each polynomial to have a shape such that each one passes through its corresponding point but then passes through 0 in place of the remaining points. The three polynomials we need to add up together are:<br /><ul><li>Polynomial corresponding to (3,2): must equal 2 when $x$ is 3 and equal 0 when $x$ is 4 or 6.</li><li>Polynomial corresponding to (4,3): must equal 3 when $x$ is 4 and equal 0 when $x$ is 3 or 6.</li><li>Polynomial corresponding to (6,4): must equal 4 when $x$ is 6 and equal 0 when $x$ is 3 or 4.</li></ul>Since the polynomials equal 0 where the other points are then they will not interfere with each other where the polynomials pass through their corresponding point.<br /><br />Let's focus on the zeros first. There is an easy trick to ensure that a polynomial goes through zero at certain values of $x$. If you want your polynomial to be zero when $x$ is 4 or 6, then use $(x-4)(x-6)$. In this polynomial, when $x$ is 4 then the first bracket will equal 0, which makes the whole thing 0, and when $x$ is 6 the second bracket will equal 0 which will also make the whole thing 0. This is what $y = (x-4)(x-6)$, $y = (x-3)(x-6)$ and $y = (x-3)(x-4)$ looks like:<br /><br /><figure><a href="https://4.bp.blogspot.com/-JHj_I6n6hws/Wb_VlBjGlYI/AAAAAAAABHc/uGVrY8bXcY0zvjsKELIiXkVNDUEAcTMawCLcBGAs/s1600/lagrange02a.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-JHj_I6n6hws/Wb_VlBjGlYI/AAAAAAAABHc/uGVrY8bXcY0zvjsKELIiXkVNDUEAcTMawCLcBGAs/s320/lagrange02a.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.</figcaption></figure><br /><br /><figure><a href="https://2.bp.blogspot.com/-gi8vPgqOXHU/Wb_WNMOiSLI/AAAAAAAABHk/i8FZRs1FUvQ9A0Iwt1xRVXI6KR9gL1JJACLcBGAs/s1600/lagrange02b.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-gi8vPgqOXHU/Wb_WNMOiSLI/AAAAAAAABHk/i8FZRs1FUvQ9A0Iwt1xRVXI6KR9gL1JJACLcBGAs/s320/lagrange02b.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-3)(x-6)$. Passes through zero when $x$ is 3 or 6.</figcaption></figure><br /><br /><figure><a href="https://4.bp.blogspot.com/-18aHYqEuCi0/Wb_WQi_QBfI/AAAAAAAABHo/otev8mhn5w8O9sKIZ8m2gFNNv-neFfN0QCLcBGAs/s1600/lagrange02c.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-18aHYqEuCi0/Wb_WQi_QBfI/AAAAAAAABHo/otev8mhn5w8O9sKIZ8m2gFNNv-neFfN0QCLcBGAs/s320/lagrange02c.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-3)(x-4)$. Passes through zero when $x$ is 3 or 4.</figcaption></figure><br /><br />See how each curve passes through zero at every point except one? That exception point will be the corresponding point of each polynomial. Adding these polynomials up will not make them interfere with each other at the x-values of each point. But in order to get the resultant polynomial to pass through the points, we need each separate polynomial to pass through its corresponding point whilst still being zero at every other point. For this we apply another easy trick which is to multiply the polynomials by a number. Multiplying an equation by a number will not change where it is zero. It will change the shape of the curve but the places at which it is zero will not be moved. This is what $(x-4)(x-6)$ looks like after being multiplied by 2, 0.5, and -1:<br /><br /><figure><a href="https://3.bp.blogspot.com/-VdPamJMwdV0/Wb_WVjc8tJI/AAAAAAAABHs/KIpgKh4SGvU0KxA3VekQt9awicunCzeTgCLcBGAs/s1600/lagrange03a.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-VdPamJMwdV0/Wb_WVjc8tJI/AAAAAAAABHs/KIpgKh4SGvU0KxA3VekQt9awicunCzeTgCLcBGAs/s320/lagrange03a.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = 2(x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.</figcaption></figure><br /><br /><figure><a href="https://3.bp.blogspot.com/-ojV3beeUvVo/Wb_WYhlxOgI/AAAAAAAABHw/D7Jkta6YGNUvxW6ocXJhtV79QnEcs1TfwCLcBGAs/s1600/lagrange03b.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-ojV3beeUvVo/Wb_WYhlxOgI/AAAAAAAABHw/D7Jkta6YGNUvxW6ocXJhtV79QnEcs1TfwCLcBGAs/s320/lagrange03b.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = 0.5(x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.</figcaption></figure><br /><br /><figure><a href="https://4.bp.blogspot.com/-7QB-ON_M0vQ/Wb_Wc11zSQI/AAAAAAAABH0/LE9vFrXyI64jce9D7NijPanjNls6T6SlACLcBGAs/s1600/lagrange03c.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-7QB-ON_M0vQ/Wb_Wc11zSQI/AAAAAAAABH0/LE9vFrXyI64jce9D7NijPanjNls6T6SlACLcBGAs/s320/lagrange03c.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = -(x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.</figcaption></figure><br /><br />So we just need to find the number that when multiplied by each separate polynomial will make it pass through its corresponding point. This number is the y-value of the corresponding point divided by the current value of the polynomial there. What we're doing is first making the polynomial equal 1 at its corresponding point and then multiplying it by whatever number we want it to be. For example if you want to multiply the number 2 by a number that will make the result 3, that number should be $\frac{3}{2}$, that is, $2 \times \frac{3}{2} = 3$.<br /><br />So what is the current value of $(x-4)(x-6)$ at its corresponding point, (3,2)? It's $(3-4)(3-6) = 3$. So by multiplying $(x-4)(x-6)$ by $\frac{2}{(3-4)(3-6)}$ we can make the polynomial pass through 2, without changing where it is zero. This is what $y = (x-4)(x-6)\frac{2}{(3-4)(3-6)}$, $y = (x-3)(x-6)\frac{3}{(4-3)(4-6)}$ and $y = (x-3)(x-4)\frac{4}{(6-3)(6-4)}$ looks like:<br /><br /><figure><a href="https://1.bp.blogspot.com/-rQyM6etoPNA/Wb_WiX6zllI/AAAAAAAABH4/rVZp6eRE9Lk7_QrHrkDWXFrBFAv1C7dlQCLcBGAs/s1600/lagrange04a.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-rQyM6etoPNA/Wb_WiX6zllI/AAAAAAAABH4/rVZp6eRE9Lk7_QrHrkDWXFrBFAv1C7dlQCLcBGAs/s320/lagrange04a.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-4)(x-6)\frac{2}{(3-4)(3-6)}$. Passes through corresponding point (3,2) but through zero when $x$ is 4 or 6.</figcaption></figure><br /><br /><figure><a href="https://3.bp.blogspot.com/-dFQHLfFXY8E/Wb_Wlcgi-cI/AAAAAAAABIA/EQbgcRpk_Rs9SejS8AwDMu4FrZs44vXmQCLcBGAs/s1600/lagrange04b.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-dFQHLfFXY8E/Wb_Wlcgi-cI/AAAAAAAABIA/EQbgcRpk_Rs9SejS8AwDMu4FrZs44vXmQCLcBGAs/s320/lagrange04b.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-3)(x-6)\frac{3}{(4-3)(4-6)}$. Passes through corresponding point (4,3) but through zero when $x$ is 3 or 6.</figcaption></figure><br /><br /><figure><a href="https://1.bp.blogspot.com/-03ofA81EJqI/Wb_WlJQAz6I/AAAAAAAABH8/SpZmHQtlpGob6kod7DJ458_ajErZVPbEQCLcBGAs/s1600/lagrange04c.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-03ofA81EJqI/Wb_WlJQAz6I/AAAAAAAABH8/SpZmHQtlpGob6kod7DJ458_ajErZVPbEQCLcBGAs/s320/lagrange04c.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-3)(x-4)\frac{4}{(6-3)(6-4)}$. Passes through corresponding point (6,4) but through zero when $x$ is 3 or 4.</figcaption></figure><br /><br />Now we can add them up into<br />$$y = (x-4)(x-6)\frac{2}{(3-4)(3-6)} + (x-3)(x-6)\frac{3}{(4-3)(4-6)} + (x-3)(x-4)\frac{4}{(6-3)(6-4)}$$<br />which is simplified into $-\frac{1}{6} x^2 + \frac{13}{6} x - 3$<br /><br /><figure><a href="https://4.bp.blogspot.com/-Tubqgg1WStM/Wb_WxUTM_wI/AAAAAAAABIE/OgVD8QcNQ3YWA_H7LQGFd8IQsYSUVJo-ACLcBGAs/s1600/lagrange05.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-Tubqgg1WStM/Wb_WxUTM_wI/AAAAAAAABIE/OgVD8QcNQ3YWA_H7LQGFd8IQsYSUVJo-ACLcBGAs/s320/lagrange05.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Final graph that passes through the 3 points.</figcaption></figure><br /><br />If we added another point at (10,7) then the polynomial passing through all the points would be<br />$$y = \frac{2(x-4)(x-6)(x-10)}{(3-4)(3-6)(3-10)} + \frac{3(x-3)(x-6)(x-10)}{(4-3)(4-6)(4-10)} + \frac{4(x-3)(x-4)(x-10)}{(6-3)(6-4)(6-10)} + \frac{7(x-3)(x-4)(x-6)}{(10-3)(10-4)(10-6)}$$<br /><br /><figure><a href="https://2.bp.blogspot.com/-z-8_Xr9H9Fo/Wb_W0l02EEI/AAAAAAAABII/7j5_eygGgj4polSjmWqDiv9uDO4d7dIRwCLcBGAs/s1600/lagrange06.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-z-8_Xr9H9Fo/Wb_W0l02EEI/AAAAAAAABII/7j5_eygGgj4polSjmWqDiv9uDO4d7dIRwCLcBGAs/s320/lagrange06.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph that passes through the 3 points plus (10,7).</figcaption></figure><br /><br />See how we have found another curve that also passes through the previous 3 points? This equation building method proves that you can find an infinite number of polynomials that pass through a finite number of points, since you can always make a polynomial that passes through the given set of points plus any other point anywhere where each position of the new point requires a differently shaped curve. Since there is an infinite amount of positions where to place the extra point, there is an infinite number of polynomials (and thus curves) that can pass through the given set of points.<br /><br />In general, given points $(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)$, your polynomial will be<br />$$\sum^{n}_{i=1} y_i \frac{\prod^{n}_{j=1,j\neq i} (x-x_j)}{\prod^{n}_{j=1,j\neq i} (x_i-x_j)}$$<br /><br />Notice that you cannot have two points with the same x-value or you'll get a division by zero. Notice also how you are not guaranteed to get an extrapolation from your points. The polynomial will completely derail into an upward or downward curve beyond the range of the given points, regardless of any pattern suggested by the points. Finally your equation will get more and more complicated with every new point, even after simplification. If you have $n$ points then you will have up to $n$ terms in your polynomial. It might be simplifiable into less terms, but that is rare.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-91711000339029954592017-08-28T09:06:00.001+02:002017-09-19T08:40:07.588+02:00Proof that the sum of digits of any multiple of 3 (or 9) is itself a multiple of 3 (or 9)Take any multiple of 3, such as 327. You can verify that it really is a multiple of 3 by adding together all its digits, that is 3+2+7=12, and checking if the sum is itself a multiple of 3. 12 is a multiple of 3, so 327 must also be a multiple of 3. Since the sum of digits will be a much smaller number than the original number, it will be easier to determine if it is a multiple of 3 or not. Of course you can then take the sum of digits and find its sum of digits again repeatedly until it's a single digit number: 3, 6, 9. The same thing is true of multiples of 9 but instead the sum of digits will be a multiple of 9.<br /><br />This is the proof that this works for every multiple of 3:<br /><br />Start with a number with $n$ digits, such as the 3 digit number 327. We'll call each of these digits $a_i$, starting from $a_{n-1}$ for the first digit and ending with $a_0$ for the last digit. So 327 has $a_2 = 3$, $a_1 = 2$, and $a_0 = 7$.<br /><br />Our number is equal to the following:<br />$$a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0$$<br />This is just the hundreds, tens, units formulation of numbers. For example:<br />$$327 = 3 \times 10^2 + 2 \times 10^1 + 7 \times 10^0 = 3 \times 100 + 2 \times 10 + 7 \times 1 = 327$$<br /><br />Now the trick is to replace each $10^x$ with $999...+1$. For example $100 = 99+1$.<br /><br /><div style="width:100%; height:120px; overflow:scroll; border:solid 1px black;">$$\begin{eqnarray}<br />a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0 &=& a_{n-1} \left(999... + 1\right) + a_{n-2} \left(99... + 1\right) + \ldots + a_1 \left(9 + 1\right) + a_0 \left(0 + 1\right) \\<br />&=& \left(a_{n-1} 999... + a_{n-1}\right) + \left(a_{n-2} 99... + a_{n-2}\right) + \ldots + \left(a_1 9 + a_1\right) + \left(a_0 0 + a_0\right) \\<br />&=& \left(a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1 9 + a_0 0\right) + \left(a_{n-1} + a_{n-2} + \ldots + a_1 + a_0\right) \\<br />\end{eqnarray}$$<br /></div><br />Now we'll make some terms switch places in the equation:<br /><div style="width:100%; height:100px; overflow:scroll; border:solid 1px black;">$$\begin{eqnarray}<br />a_{n-1} + a_{n-2} + \ldots + a_1 + a_0 &=& \left(a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0\right) - \left(a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1 9 + a_0 0\right) \\<br />&=& \left(a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0\right) - \left(a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1 9\right)<br />\end{eqnarray}$$<br /></div><br />Notice that in the last line above we have eliminate the term $a_0 0$ because it's a multiplication by 0.<br /><br />Now, if our original number is a multiple of 3, then it must be that the right hand side at the end of the above equation is a multiple of 3. Any string of 9s is always a multiple of 3 since $999\ldots = 3 \times 3 \times 111\ldots$. It is also a multiple of 9, but we'll get to that later. This means that the two bracketed terms in the last line of the above equation are both multiples of 3 because:<br /><ul><li>$a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0$: is a multiple of 3 by definition (we said that the number would be one).</li><li>$a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1<br />9$: is a sum of numbers multiplied by strings of 9s, which means that we can factor out the common factor 3.</li></ul><br />The difference of two multiples of 3 is itself a multiple of 3. Therefore the left hand side must also be a multiple of 3 (since the two sides are equal), and the left hand side just happens to be:<br /><br />$$a_{n-1} + a_{n-2} + \ldots + a_1 + a_0$$<br /><br />which is the sum of all digits. Hence for any number that is a multiple of 3, the sum of its digits must also be a multiple of 3.<br /><br />Until now we have only shown that any multiple of 3 will have the sum of its digits also be a multiple of 3. But can a number which is not a multiple of 3 coincidentally have the sum of its digits be a multiple of 3? No, because otherwise it would imply that a non-multiple of 3 minus a multiple of 3 is a multiple of 3. The second term in the subtraction in the right hand side of the above derivation is always going to be a multiple of 3, but in order for the whole of the right hand side to be a multiple of 3 you will need both terms being a multiple of 3. So you can rest your mind that checking if a large number is a multiple of 3 can be as simple as checking if the sum of its digits is a multiple of 3. And if that sum is still too big you can check if the sum of its digits are a multiple of 3 and so on, because you're always just reusing the same test each time.<br /><br />What about multiples of 9? As already mentioned above, a string of 9s is also always a multiple of 9. So the second term in the subtraction above is also always a multiple of 9. Hence if both terms of the subtraction are multiples of 9, then the sum of digits is going to be a multiple of 9.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-77056332836214567652017-07-14T09:48:00.004+02:002017-07-14T09:57:07.632+02:00Making a point and click / room escape game using PowerPointBefore we begin, it is important that you understand that there is no simple way to make a complex point and click game with PowerPoint where actions in one room affect what happens in other rooms. You can only make games where everything happens in the same room. You can have different rooms but all the puzzles of a room must be solved within the room. This isn't so bad. There are existing games where you have to go through several levels with each level having a small puzzle to solve, such as <a href="http://www.primarygames.com/puzzles/action/monkeygohappy/">this</a>. You also cannot have an inventory or pass codes without going into a lot of complexity that would make more sense using actual game programming. That said, it can still be a creative use of PowerPoint for children so here is how you do it.<br /><br />The heart of the point and click game is the discovery of items that can be used to get over obstacles. You find a key and then you open a door. In PowerPoint such a thing can be done using transparent boxes that are in front of clickable objects. When you click on a key, a trigger event will set an exit animation on the transparent box, making whatever was behind it available for clicking. Let's begin.<br /><br />Start by drawing a key and a door. I'm using a star to represent a key and I put a star shape on the door to represent a key hole. Also add a second slide with the next room after going through the door. I just put a smiley on the next slide to show that the game has ended.<br /><br /><a href="https://3.bp.blogspot.com/-Xb4cXJEYB9Q/WWh24lOwPfI/AAAAAAAABFk/TdfhYwUxj9M9tDxA71nAZ8y8cU5ojsDRQCLcBGAs/s1600/pointandclick_1.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-Xb4cXJEYB9Q/WWh24lOwPfI/AAAAAAAABFk/TdfhYwUxj9M9tDxA71nAZ8y8cU5ojsDRQCLcBGAs/s320/pointandclick_1.png" width="320" height="172" data-original-width="1600" data-original-height="860" /></a><br /><br />Next make the door take you to the next slide when clicked. This is done using a hyperlink. Just right click on the door, go on hyperlink, place in this document, next slide, OK. Now when you start the presentation, clicking on the door will take you to the next slide.<br /><br /><a href="https://2.bp.blogspot.com/-58cSxDfOBhc/WWh280KyYLI/AAAAAAAABFo/h75I-FFw5Qsa-re1N2selbCEmjXRNAelwCLcBGAs/s1600/pointandclick_2.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-58cSxDfOBhc/WWh280KyYLI/AAAAAAAABFo/h75I-FFw5Qsa-re1N2selbCEmjXRNAelwCLcBGAs/s320/pointandclick_2.png" width="320" height="176" data-original-width="946" data-original-height="519" /></a><br /><br />Of course clicking anywhere in a slide will take you to the next slide. We don't want that. You should only be able to go to the next slide by clicking on the door. Go on transitions and unset advance slide on mouse click.<br /><br /><a href="https://3.bp.blogspot.com/-d0-m8-erDcw/WWh3AGltWoI/AAAAAAAABFs/Nc3w5keMCZcNdfgGb-kGnEy9D_99nd8cACLcBGAs/s1600/pointandclick_3_.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-d0-m8-erDcw/WWh3AGltWoI/AAAAAAAABFs/Nc3w5keMCZcNdfgGb-kGnEy9D_99nd8cACLcBGAs/s320/pointandclick_3_.png" width="320" height="172" data-original-width="1600" data-original-height="860" /></a><br /><br />Now put a box over the door covering it completely. Make this box transparent (not "no fill"). The door is now not clickable any more because there is a box over it. Just right click the box, format shape, fill, transparency 100%. I left the border there so that you can see where the box is but you can remove the outline so that it looks better.<br /><br /><a href="https://2.bp.blogspot.com/-vHHLXwTexqg/WWh3GO-X1VI/AAAAAAAABFw/imaN75p1MqAVHpF7cTCr0Zzy0H3oBXFCACLcBGAs/s1600/pointandclick_4.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-vHHLXwTexqg/WWh3GO-X1VI/AAAAAAAABFw/imaN75p1MqAVHpF7cTCr0Zzy0H3oBXFCACLcBGAs/s320/pointandclick_4.png" width="320" height="172" data-original-width="1600" data-original-height="860" /></a><br /><br />Now we want the key to disappear when clicked on. Click on the key, go on animations, add animation, and choose an exit animation such as Fade. Open the animation pane. You'll see the animation you just created and the automatically assigned name of the key. In my case the name is "4-Point Star 5"<br /><br /><a href="https://3.bp.blogspot.com/-crwssVIGmgg/WWh3JVFFQ5I/AAAAAAAABF0/wTFp4SMBEJomR84gT7WG95b7RYiqpHLNACLcBGAs/s1600/pointandclick_5.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-crwssVIGmgg/WWh3JVFFQ5I/AAAAAAAABF0/wTFp4SMBEJomR84gT7WG95b7RYiqpHLNACLcBGAs/s320/pointandclick_5.png" width="320" height="172" data-original-width="1600" data-original-height="860" /></a><br /><br />At the moment, the key will disappear as soon as you click anywhere, not when you click on the key. To fix this we can click on the key animation (the orange box with "1" in it next to the key), go on trigger, on click of, and choose the name of the key shape you saw in the animation pane. Now the key's exit animation will only start when you click on the key itself.<br /><br /><a href="https://2.bp.blogspot.com/-N85zqZzEGsc/WWh3NzvRAYI/AAAAAAAABF4/BPryAskDxssSnMXTB5BvkX0HmpRiB4AbwCLcBGAs/s1600/pointandclick_6.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-N85zqZzEGsc/WWh3NzvRAYI/AAAAAAAABF4/BPryAskDxssSnMXTB5BvkX0HmpRiB4AbwCLcBGAs/s320/pointandclick_6.png" width="320" height="170" data-original-width="1600" data-original-height="851" /></a><br /><br />When the key is picked up, the door should be usable. This mean that we want the transparent box over the door to disappear. We can do the same exit animation with trigger that we did with the key to the transparent box and make it exit when the key is clicked on. I made the door disappear rather than fade so that there is no delay and the transparent box is removed immediately.<br /><br /><a href="https://4.bp.blogspot.com/-jh7bUCd9328/WWh3R7iVgpI/AAAAAAAABF8/S6gw9Xw9q4sJDxcOZG2G7QyEnSKcRuMsQCLcBGAs/s1600/pointandclick_7.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-jh7bUCd9328/WWh3R7iVgpI/AAAAAAAABF8/S6gw9Xw9q4sJDxcOZG2G7QyEnSKcRuMsQCLcBGAs/s320/pointandclick_7.png" width="320" height="172" data-original-width="1600" data-original-height="860" /></a><br /><br />As things stand, we have two exit animations: one for the key and one for the transparent box. However these two animations will not happen together but will each wait for their own click on the key. To make both animations happen together just click on the transparent box's animation and choose start with previous instead of start on click.<br /><br /><a href="https://3.bp.blogspot.com/-sfqGaBk0Dlw/WWh3U9sNpiI/AAAAAAAABGA/FoWqXzstVAM19FAhKcU3rFB1JnOh4vytQCLcBGAs/s1600/pointandclick_8.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-sfqGaBk0Dlw/WWh3U9sNpiI/AAAAAAAABGA/FoWqXzstVAM19FAhKcU3rFB1JnOh4vytQCLcBGAs/s320/pointandclick_8.png" width="320" height="171" data-original-width="1600" data-original-height="853" /></a><br /><br />That's it. Your one-key-one-door point and click game is ready. Start the animation by pressing F5 and see what happens. Notice that you cannot click on anything other than the key. After clicking on the key it will disappear and then you can click on the door. Clicking on the door will take you to the next room.<br /><br />This basic mechanic that I just described can be used to open chests that contain other keys, kill dragons that guard lairs, and turn off booby traps. You can put keys behind objects that have a motion animation when clicked on so that you find a key behind them. You can even put multiple keys, each of which is accessible only after getting the previous one and the final key is the one that opens the door. You can also have multiple rooms where the next slide is another is another room with a locked door. Can you think of other things to make an interesting point and click game in PowerPoint?mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-6509579955028572092017-06-25T11:16:00.001+02:002017-06-25T11:17:58.275+02:00Display all outputs in Python immediately when in cmd/terminalWhen running a Python program in the terminal, it sometimes happens that there is a long delay before I see some print statement's output. Sometimes I have to wait until the program finishes.<br /><br />If it's just one print statement you're concerned with then you only need to import the "sys" module and call "sys.stdout.flush()" in order to force the program to show anything that has been printed but is still hidden in the buffer.<br /><br /><pre class="brush:python">import sys<br /><br />print('bla bla bla')<br />sys.stdout.flush()<br /></pre><br />But most of the time you want this to always happen and not after some particular point in the program. To force Python to always show whatever is printed just add the command line option "-u" when running python.<br /><br /><pre>> python -u myscript.py<br /></pre><br />You can read more about it here:<br /><a href="https://docs.python.org/3.6/using/cmdline.html#cmdoption-u">https://docs.python.org/3.6/using/cmdline.html#cmdoption-u</a>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-62609445381060570902017-05-30T09:28:00.000+02:002017-06-10T14:25:50.626+02:00Neural language models and how to make them in Tensorflow 1.0In this blog post I will explain the basics you need to know in order to create a neural language model in Tensorflow 1.0. This tutorial will cover how to create a training set from raw text, how to use LSTMs, how to work with variable length sentences, what teacher forcing is, and how to train the neural network with early stopping.<br /><br /><div class="sectitle">Introduction</div><br /><div class="subsectitle">Components and vocabulary</div><br />A neural language model is a neural network that, given a prefix of a sentence, will output the probability that a word will be the next word in the sentence. For example, given the prefix "the dog", the neural network will tell you that "barked" has a high probability of being the next word. This is done by using a recurrent neural network (RNN), such as a long short term memory (LSTM), to turn the prefix into a single vector (that represents the prefix) which is then passed to a softmax layer that gives a probability distribution over every word in the known vocabulary.<br /><br />In order to be able to still have prefix before the first word (to be able to predict a first word in the sentence) we will make use of an artificial "<BEG>" word that represents the beginning of a sentence and is placed before every sentence. Likewise we will have an artificial "<END>" word that represents the end of sentence in order to be able to predict when the sentence ends.<br /><br /><a href="https://3.bp.blogspot.com/-ZQvepKdplXw/WS0gHLNtzMI/AAAAAAAABEo/8dpOb7XBQhYVckxdGymdVEUUBFACGX6RgCLcB/s1600/neurallangmod.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-ZQvepKdplXw/WS0gHLNtzMI/AAAAAAAABEo/8dpOb7XBQhYVckxdGymdVEUUBFACGX6RgCLcB/s320/neurallangmod.png" width="320" height="143" data-original-width="808" data-original-height="360" /></a><br /><br />It is important to note that the words are presented to the neural network as vectors and not as actual strings. The vectors are called word "embeddings" and are derived from a matrix where each row stands for a different word in the vocabulary. This matrix is trained just like the neural network's weights in order to provide the best vector representation for each word.<br /><br /><div class="subsectitle">Training</div><br />The probabilities are not given explicitly during training. Instead we just expose the neural network to complete sentences taken from a corpus (we will use the Brown corpus from NLTK) and let it learn the probabilities indirectly. Before training, each word will have an approximately equal probability given any prefix. The training procedure works by inputting a prefix of a sentence at one end of the neural network and then forcing it to increase the probability of the given next word in the sentence by a little bit. Since the probabilities of all the words in the output must add up to 1, increasing the probability of the given next word will also decrease the probability of all the other words by a little bit. By repeatedly increasing the probability of the correct word, the distribution will accumulate at the correct word.<br /><br /><a href="https://1.bp.blogspot.com/-lhPryIrqDOk/WS0gL83Tm-I/AAAAAAAABEs/1xV76XcRkrEUhpJMkYi8G0qxxlBORA7tACLcB/s1600/neurallangmod-training.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-lhPryIrqDOk/WS0gL83Tm-I/AAAAAAAABEs/1xV76XcRkrEUhpJMkYi8G0qxxlBORA7tACLcB/s320/neurallangmod-training.png" width="320" height="105" data-original-width="932" data-original-height="305" /></a><br /><br />Of course given a prefix there will be several words that can follow and not just one. If each one is increased just a bit in turn repeatedly, all the known correct next words will increase together and all the other words will decrease, forcing the probability distribution to share the peak among all the correct words. If the correct words occur at different frequencies given the prefix, the most frequent word will get the most probability (due to increasing its probability more often) whilst the rest of the correct words will get a fraction of that probability in proportion to their relative frequencies.<br /><br />The most common way to train neural language models is not actually to use a training set of prefixes and next words, but to use full sentences. Upon being inputted with a sequence of vectors, an RNN will output another sequence of vectors. The nth output vector represents the prefix of the input sequence up to the nth input.<br /><br /><a href="https://3.bp.blogspot.com/-m6YO4Dzz87c/WS0gPjIcmqI/AAAAAAAABEw/QWZSPWogBB88OdeU32RrkMPamu6Sc0axgCLcB/s1600/neurallangmod-prefixes.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-m6YO4Dzz87c/WS0gPjIcmqI/AAAAAAAABEw/QWZSPWogBB88OdeU32RrkMPamu6Sc0axgCLcB/s320/neurallangmod-prefixes.png" width="320" height="83" data-original-width="981" data-original-height="254" /></a><br /><br />This means that we can present the neural network with a sentence, which gets turned into a sequence of vectors by the RNN, each of which gets turned into a probability distribution by the softmax. The training objective is to make the neural network predict which is the next word in the sentence after every prefix (including the end of sentence token at the end). This is done with all the sentences in the corpus so that the neural network is forced to extract some kind of pattern in the language of the sentences and be able to predict the next word in any sentence prefix.<br /><br /><div class="subsectitle">Teacher forcing</div><br />If we always provide a correct prefix during training, then we are using a training method called "teacher forcing", where the neural network is only trained to deal with correct prefixes. This is the simplest method (and the method we will be using in this blog post) but it also introduces a bias to the neural network as it might not always be exposed to correct prefixes. Let's say that the neural network is going to be used to generate sentences by, for example, picking the most probable word given the prefix and then adding the chosen word to the end of the prefix. We can repeatedly do this until we choose the end of sentence token, at which point we should have a complete sentence. The problem with teacher forcing is that if the neural network makes one mistake during the generation process and picks a non-sense word as a most probable next word, then the rest of the sentence will probably also be garbage as it was never trained on sentences with mistakes.<br /><br />One way to deal with this is to include not only prefixes in the training sentences by also prefixes with some of the words replaced by words chosen by the still-in-training neural network and force it to still give a higher probability to the correct next word. This is called <a href="https://arxiv.org/abs/1506.03099">scheduled sampling</a>. Another way to deal with this is to take the training prefixes and some generated prefixes (from the still-in-training neural net) and take their vector representation from the RNN. Generative adversarial training will then be used to make the RNN represent both groups of vectors similarly. This forces the RNN to be fault tolerant to prefixes with errors and to be able to represent them in a way that can lead to correct next words. This is called <a href="https://arxiv.org/abs/1610.09038">professor forcing</a>.<br /><br /><div class="sectitle">Full code listing</div><br />This is the full code that you can execute to get a Tensorflow neural language model:<br /><br /><div style="height:400px; overflow:scroll;"><pre class="brush:python">from __future__ import absolute_import, division, print_function, unicode_literals<br />from builtins import ascii, bytes, chr, dict, filter, hex, input, int, map, next, oct, open, pow, range, round, str, super, zip<br /><br />import tensorflow as tf<br />import numpy as np<br />import random<br />import timeit<br />import collections<br />import nltk<br /><br />TRAINING_SESSION = True<br /><br />rand = random.Random(0)<br />embed_size = 256<br />state_size = 256<br />max_epochs = 100<br />minibatch_size = 20<br />min_token_freq = 3<br /><br />run_start = timeit.default_timer()<br /><br />print('Loading raw data...')<br /><br />all_seqs = [ [ token.lower() for token in seq ] for seq in nltk.corpus.brown.sents() if 5 <= len(seq) <= 20 ]<br />rand.shuffle(all_seqs)<br />all_seqs = all_seqs[:20000]<br /><br />trainingset_size = round(0.9*len(all_seqs))<br />validationset_size = len(all_seqs) - trainingset_size<br />train_seqs = all_seqs[:trainingset_size]<br />val_seqs = all_seqs[-validationset_size:]<br /><br />print('Training set size:', trainingset_size)<br />print('Validation set size:', validationset_size)<br /><br />all_tokens = (token for seq in train_seqs for token in seq)<br />token_freqs = collections.Counter(all_tokens)<br />vocab = sorted(token_freqs.keys(), key=lambda token:(-token_freqs[token], token))<br />while token_freqs[vocab[-1]] < min_token_freq:<br /> vocab.pop()<br />vocab_size = len(vocab) + 2 # + edge and unknown tokens<br /><br />token_to_index = { token: i+2 for (i, token) in enumerate(vocab) }<br />index_to_token = { i+2: token for (i, token) in enumerate(vocab) }<br />edge_index = 0<br />unknown_index = 1<br /><br />print('Vocabulary size:', vocab_size)<br /><br />def parse(seqs):<br /> indexes = list()<br /> lens = list()<br /> for seq in seqs:<br /> indexes_ = [ token_to_index.get(token, unknown_index) for token in seq ]<br /> indexes.append(indexes_)<br /> lens.append(len(indexes_)+1) #add 1 due to edge token<br /> <br /> maxlen = max(lens)<br /> <br /> in_mat = np.zeros((len(indexes), maxlen))<br /> out_mat = np.zeros((len(indexes), maxlen))<br /> for (row, indexes_) in enumerate(indexes):<br /> in_mat [row,:len(indexes_)+1] = [edge_index]+indexes_<br /> out_mat[row,:len(indexes_)+1] = indexes_+[edge_index]<br /> return (in_mat, out_mat, np.array(lens))<br /> <br />(train_seqs_in, train_seqs_out, train_seqs_len) = parse(train_seqs)<br />(val_seqs_in, val_seqs_out, val_seqs_len) = parse(val_seqs)<br /><br />print('Training set max length:', train_seqs_in.shape[1]-1)<br />print('Validation set max length:', val_seqs_in.shape[1]-1)<br /><br />################################################################<br />print()<br />print('Training...')<br /><br />#Full correct sequence of token indexes with start token but without end token.<br />seq_in = tf.placeholder(tf.int32, shape=[None, None], name='seq_in') #[seq, token]<br /><br />#Length of sequences in seq_in.<br />seq_len = tf.placeholder(tf.int32, shape=[None], name='seq_len') #[seq]<br />tf.assert_equal(tf.shape(seq_in)[0], tf.shape(seq_len)[0])<br /><br />#Full correct sequence of token indexes without start token but with end token.<br />seq_target = tf.placeholder(tf.int32, shape=[None, None], name='seq_target') #[seq, token]<br />tf.assert_equal(tf.shape(seq_in), tf.shape(seq_target))<br /><br />batch_size = tf.shape(seq_in)[0] #Number of sequences to process at once.<br />num_steps = tf.shape(seq_in)[1] #Number of tokens in generated sequence.<br /><br />#Mask of which positions in the matrix of sequences are actual labels as opposed to padding.<br />token_mask = tf.cast(tf.sequence_mask(seq_len, num_steps), tf.float32) #[seq, token]<br /><br />with tf.variable_scope('prefix_encoder'):<br /> #Encode each sequence prefix into a vector.<br /> <br /> #Embedding matrix for token vocabulary.<br /> embeddings = tf.get_variable('embeddings', [ vocab_size, embed_size ], tf.float32, tf.contrib.layers.xavier_initializer()) #[vocabulary token, token feature]<br /> <br /> #3D tensor of tokens in sequences replaced with their corresponding embedding vector.<br /> embedded = tf.nn.embedding_lookup(embeddings, seq_in) #[seq, token, token feature]<br /> <br /> #Use an LSTM to encode the generated prefix.<br /> init_state = tf.contrib.rnn.LSTMStateTuple(c=tf.zeros([ batch_size, state_size ]), h=tf.zeros([ batch_size, state_size ]))<br /> cell = tf.contrib.rnn.BasicLSTMCell(state_size)<br /> prefix_vectors = tf.nn.dynamic_rnn(cell, embedded, sequence_length=seq_len, initial_state=init_state, scope='rnn')[0] #[seq, prefix, prefix feature]<br /> <br />with tf.variable_scope('softmax'):<br /> #Output a probability distribution over the token vocabulary (including the end token).<br /> <br /> W = tf.get_variable('W', [ state_size, vocab_size ], tf.float32, tf.contrib.layers.xavier_initializer())<br /> b = tf.get_variable('b', [ vocab_size ], tf.float32, tf.zeros_initializer())<br /> logits = tf.reshape(tf.matmul(tf.reshape(prefix_vectors, [ -1, state_size ]), W) + b, [ batch_size, num_steps, vocab_size ])<br /> predictions = tf.nn.softmax(logits) #[seq, prefix, token]<br /><br />losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_target, logits=logits) * token_mask<br />total_loss = tf.reduce_sum(losses)<br />train_step = tf.train.AdamOptimizer().minimize(total_loss)<br />saver = tf.train.Saver()<br /><br />sess = tf.Session()<br /><br />if TRAINING_SESSION:<br /> sess.run(tf.global_variables_initializer())<br /><br /> print('epoch', 'val loss', 'duration', sep='\t')<br /><br /> epoch_start = timeit.default_timer()<br /> <br /> validation_loss = 0<br /> for i in range(len(val_seqs)//minibatch_size):<br /> minibatch_validation_loss = sess.run(total_loss, feed_dict={<br /> seq_in: val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],<br /> seq_len: val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],<br /> seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],<br /> })<br /> validation_loss += minibatch_validation_loss<br /> <br /> print(0, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')<br /> last_validation_loss = validation_loss<br /><br /> saver.save(sess, './model')<br /><br /> trainingset_indexes = list(range(len(train_seqs)))<br /> for epoch in range(1, max_epochs+1):<br /> epoch_start = timeit.default_timer()<br /> <br /> rand.shuffle(trainingset_indexes)<br /> for i in range(len(trainingset_indexes)//minibatch_size):<br /> minibatch_indexes = trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]<br /> sess.run(train_step, feed_dict={<br /> seq_in: train_seqs_in [minibatch_indexes],<br /> seq_len: train_seqs_len[minibatch_indexes],<br /> seq_target: train_seqs_out[minibatch_indexes],<br /> })<br /> <br /> validation_loss = 0<br /> for i in range(len(val_seqs)//minibatch_size):<br /> minibatch_validation_loss = sess.run(total_loss, feed_dict={<br /> seq_in: val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],<br /> seq_len: val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],<br /> seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],<br /> })<br /> validation_loss += minibatch_validation_loss<br /><br /> if validation_loss > last_validation_loss:<br /> break<br /> last_validation_loss = validation_loss<br /> <br /> saver.save(sess, './model')<br /> <br /> print(epoch, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')<br /><br /> print(epoch, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')<br /><br />################################################################<br />print()<br />print('Evaluating...')<br /><br />saver.restore(sess, tf.train.latest_checkpoint('.'))<br /><br />def seq_prob(seq):<br /> seq_indexes = [ token_to_index.get(token, unknown_index) for token in seq ]<br /> outputs = sess.run(predictions, feed_dict={<br /> seq_in: [ [ edge_index ] + seq_indexes ],<br /> seq_len: [ 1+len(seq) ],<br /> })[0]<br /> probs = outputs[np.arange(len(outputs)), seq_indexes+[ edge_index ]]<br /> return np.prod(probs)<br /><br />print('P(the dog barked.) =', seq_prob(['the', 'dog', 'barked', '.']))<br />print('P(the cat barked.) =', seq_prob(['the', 'cat', 'barked', '.']))<br />print()<br /><br />def next_tokens(prefix):<br /> prefix_indexes = [ token_to_index.get(token, unknown_index) for token in prefix ]<br /> probs = sess.run(predictions, feed_dict={<br /> seq_in: [ [ edge_index ] + prefix_indexes ],<br /> seq_len: [ 1+len(prefix) ],<br /> })[0][-1]<br /> token_probs = list(zip(probs, ['<end>', '<unk>']+vocab))<br /> return token_probs<br /><br />print('the dog ...', sorted(next_tokens(['the', 'dog']), reverse=True)[:5])<br />print()<br /><br />def greedy_gen():<br /> prefix = []<br /> for _ in range(100):<br /> probs = sorted(next_tokens(prefix), reverse=True)<br /> (_, next_token) = probs[0]<br /> if next_token == '<unk>':<br /> (_, next_token) = probs[1]<br /> elif next_token == '<end>':<br /> break<br /> else:<br /> prefix.append(next_token)<br /> return prefix<br /><br />print('Greedy generation:', ' '.join(greedy_gen()))<br />print()<br /><br />def random_gen():<br /> prefix = []<br /> for _ in range(100):<br /> probs = next_tokens(prefix)<br /> (unk_prob, _) = probs[unknown_index]<br /> <br /> r = rand.random() * (1.0 - unk_prob)<br /> total = 0.0<br /> for (prob, token) in probs:<br /> if token != '<unk>':<br /> total += prob<br /> if total >= r:<br /> break<br /> if token == '<end>':<br /> break<br /> else:<br /> prefix.append(token)<br /> return prefix<br /><br />print('Random generation:', ' '.join(random_gen()))<br />print()<br /></pre></div><br /><div class="sectitle">Code explanation</div><br />This section explains snippets of the code.<br /><br /><div class="subsectitle">Preprocessing</div><br />The first thing we need to do is create the training and validation sets. We will use the Brown corpus from NLTK as data. Since the purpose of this tutorial is to quickly train a neural language model on a normal computer, we will work with a subset of the corpus so that training will be manageable. We will only take sentences that are between 5 and 20 tokens long and only use a random sample of 20000 sentences from this pool. From this we will take a random 10% to be used for the validation set and the rest will be used for the training set.<br /><br /><pre class="brush:python">all_seqs = [ [ token.lower() for token in seq ] for seq in nltk.corpus.brown.sents() if 5 <= len(seq) <= 20 ]<br />rand.shuffle(all_seqs)<br />all_seqs = all_seqs[:20000]<br /><br />trainingset_size = round(0.9*len(all_seqs))<br />validationset_size = len(all_seqs) - trainingset_size<br />train_seqs = all_seqs[:trainingset_size]<br />val_seqs = all_seqs[-validationset_size:]<br /></pre><br />Next we need to get the vocabulary from the training set. This will consist of all the words that occur frequently in the training set sentences, with the rare words being replaced by an "unknown" token. This will allow the neural network to be able to work with out-of-vocabulary words as they will be represented as "<unk>" and the network will ave seen this token in the training sentences. Each vocabulary word will be represented by an index, with "0" representing the beginning and end token, "1" representing the unknown token, "2" representing the most frequent vocabulary word, "3" the second most frequent word, and so on.<br /><br /><pre class="brush:python">all_tokens = (token for seq in train_seqs for token in seq)<br />token_freqs = collections.Counter(all_tokens)<br />vocab = sorted(token_freqs.keys(), key=lambda token:(-token_freqs[token], token))<br />while token_freqs[vocab[-1]] < min_token_freq:<br /> vocab.pop()<br />vocab_size = len(vocab) + 2 # + edge and unknown tokens<br /><br />token_to_index = { token: i+2 for (i, token) in enumerate(vocab) }<br />index_to_token = { i+2: token for (i, token) in enumerate(vocab) }<br />edge_index = 0<br />unknown_index = 1<br /></pre><br />Finally we need to turn all the sentences in the training and validation set into a matrix of indexes where each row represents a sentence. Since different sentences have different lengths, we will make the matrix as wide as the longest sentence and use 0 to pad each sentence into uniform length (pads are added to the end of the sentences). To know where a sentence ends we will also keep track of the sentence lengths. For training we will need to use an input matrix and a target matrix. The input matrix contains sentences that start with the start-of-sentence token in every row whilst the target matrix contains sentences that end with the end-of-sentence token in every row. The former is used to pass as input to the neural net during training whilst the latter is used to tell which is the correct output of each sentence prefix.<br /><br /><pre class="brush:python">def parse(seqs):<br /> indexes = list()<br /> lens = list()<br /> for seq in seqs:<br /> indexes_ = [ token_to_index.get(token, unknown_index) for token in seq ]<br /> indexes.append(indexes_)<br /> lens.append(len(indexes_)+1) #add 1 due to edge token<br /> <br /> maxlen = max(lens)<br /> <br /> in_mat = np.zeros((len(indexes), maxlen))<br /> out_mat = np.zeros((len(indexes), maxlen))<br /> for (row, indexes_) in enumerate(indexes):<br /> in_mat [row,:len(indexes_)+1] = [edge_index]+indexes_<br /> out_mat[row,:len(indexes_)+1] = indexes_+[edge_index]<br /> return (in_mat, out_mat, np.array(lens))<br /> <br />(train_seqs_in, train_seqs_out, train_seqs_len) = parse(train_seqs)<br />(val_seqs_in, val_seqs_out, val_seqs_len) = parse(val_seqs)<br /></pre><br /><div class="subsectitle">Model definition</div><br />The Tensorflow neural network model we shall implement will accept a batch of sentences at once. This means that the input will be a matrix of integers where each row is a sentence and each integer is a word index. Since both the number of sentences and the sentence length are variable we will use "None" as a dimension size. We will then get the size using the "tf.shape" function. The sentences on their own are not enough as an input as we also need to provide the sentence lengths as a vector. The length of this vector needs to be as long as the number of rows in the sequences (one for each sentence). These lengths will be used to generate a mask matrix of 0s and 1s where a "1" indicates the presence of a token in the sequence matrix whilst a "0" indicates the presence of a pad token. This is generated by the "tf.sequence_mask" function.<br /><br /><pre class="brush:python">#Full correct sequence of token indexes with start token but without end token.<br />seq_in = tf.placeholder(tf.int32, shape=[None, None], name='seq_in') #[seq, token]<br /><br />#Length of sequences in seq_in.<br />seq_len = tf.placeholder(tf.int32, shape=[None], name='seq_len') #[seq]<br />tf.assert_equal(tf.shape(seq_in)[0], tf.shape(seq_len)[0])<br /><br />#Full correct sequence of token indexes without start token but with end token.<br />seq_target = tf.placeholder(tf.int32, shape=[None, None], name='seq_target') #[seq, token]<br />tf.assert_equal(tf.shape(seq_in), tf.shape(seq_target))<br /><br />batch_size = tf.shape(seq_in)[0] #Number of sequences to process at once.<br />num_steps = tf.shape(seq_in)[1] #Number of tokens in generated sequence.<br /><br />#Mask of which positions in the matrix of sequences are actual labels as opposed to padding.<br />token_mask = tf.cast(tf.sequence_mask(seq_len, num_steps), tf.float32) #[seq, token]<br /></pre><br />Now that the inputs are handled we need to process the prefixes of the sequences into prefix vectors using an LSTM. We first convert the token indexes into token vectors. This is done using an embedding matrix where the first row of the matrix is the word vector of the first word in the vocabulary, the second for the second word, and so on. This is then passed to an LSTM that is initialized with zero-vectors for both the cell state and the hidden state. We pass in the sequence lengths to keep the RNN from interpreting pad values.<br /><br /><pre class="brush:python">with tf.variable_scope('prefix_encoder'):<br /> #Encode each sequence prefix into a vector.<br /> <br /> #Embedding matrix for token vocabulary.<br /> embeddings = tf.get_variable('embeddings', [ vocab_size, embed_size ], tf.float32, tf.contrib.layers.xavier_initializer()) #[vocabulary token, token feature]<br /> <br /> #3D tensor of tokens in sequences replaced with their corresponding embedding vector.<br /> embedded = tf.nn.embedding_lookup(embeddings, seq_in) #[seq, token, token feature]<br /> <br /> #Use an LSTM to encode the generated prefix.<br /> init_state = tf.contrib.rnn.LSTMStateTuple(c=tf.zeros([ batch_size, state_size ]), h=tf.zeros([ batch_size, state_size ]))<br /> cell = tf.contrib.rnn.BasicLSTMCell(state_size)<br /> prefix_vectors = tf.nn.dynamic_rnn(cell, embedded, sequence_length=seq_len, initial_state=init_state, scope='rnn')[0] #[seq, prefix, prefix feature]<br /></pre><br />Next we need to take the prefix vectors and derive probabilities for the next word in every prefix. Since the prefix vectors are a 3D tensor and the weights matrix is a 2D tensor we have to first reshape the prefix vectors into a 2D tensor before multiplying them together. Following this we will then reshape the resulting 2D tensor back into a 3D tensor and apply softmax to it.<br /><br /><pre class="brush:python">with tf.variable_scope('softmax'):<br /> #Output a probability distribution over the token vocabulary (including the end token).<br /> <br /> W = tf.get_variable('W', [ state_size, vocab_size ], tf.float32, tf.contrib.layers.xavier_initializer())<br /> b = tf.get_variable('b', [ vocab_size ], tf.float32, tf.zeros_initializer())<br /> logits = tf.reshape(tf.matmul(tf.reshape(prefix_vectors, [ -1, state_size ]), W) + b, [ batch_size, num_steps, vocab_size ])<br /> predictions = tf.nn.softmax(logits) #[seq, prefix, token]<br /></pre><br /><div class="subsectitle">Training</div><br />Now comes the part that has to do with training the model. We need to add the loss function which is masked to ignore padding tokens. We will take the sum crossentropy and apply the Adam optimizer to it. Crossentropy measures how close to 1.0 the probability of the correct next word in the softmax is. This allows us to maximize the correct probability which is done by using Adam, an optimization technique that works using gradient descent but with the learning rate being adapted for every weight. We would also like to be able to save our model weights during training in order to reuse them later. Finally we create a Tensorflow session variable and use it to initialize all the model weights.<br /><br /><pre class="brush:python">losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_target, logits=logits) * token_mask<br />total_loss = tf.reduce_sum(losses)<br />train_step = tf.train.AdamOptimizer().minimize(total_loss)<br />saver = tf.train.Saver()<br /><br />sess = tf.Session()<br /> <br />sess.run(tf.global_variables_initializer())<br /></pre><br />The second thing we should do is measure how well the randomly initialised neural net performs in terms of the sum crossentropy. In order to avoid running out of memory, instead of putting all the validation set data as input to the neural net we will break it up into minibatches of a fixed size and find the total loss of all the minibatches together. This final loss value will be placed in a variable called "last_validation_loss" in order to be able to track how the loss is progressing as training goes on. The last line is to save the weights of the neural network in the same folder as the code and give the files (there will be several files with different extensions) the file name "model".<br /><br /><pre class="brush:python">validation_loss = 0<br />for i in range(len(val_seqs)//minibatch_size):<br /> minibatch_validation_loss = sess.run(total_loss, feed_dict={<br /> seq_in: val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],<br /> seq_len: val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],<br /> seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],<br /> })<br /> validation_loss += minibatch_validation_loss<br /><br />last_validation_loss = validation_loss<br /><br />saver.save(sess, './model')<br /></pre><br />Next we'll do the same thing but on the training set and we'll run the "train_step" operation instead of the "total_loss" operation in order to actually optimize the weights into a smaller "total_loss" value. It is more beneficial to take randomly sampled minibatches instead of just breaking the training set into deterministically chosen groups, so we use an array of indexes called "trainingset_indexes" to determine which training pairs will make it to the next minibatch. We randomly shuffle these indexes and then break it into fixed size groups. The indexes in the next group are used to choose the training pairs are used for the next minibatch. Following this we will again calculate the new loss value on the validation set to see how we're progressing. If the new loss is worse than the previous loss then we stop the training. Otherwise we save the model weights and continue training. This is called early stopping. There is a hard limit set to the number of epochs to run in order to avoid training for too long.<br /><br /><pre class="brush:python">trainingset_indexes = list(range(len(train_seqs)))<br />for epoch in range(1, max_epochs+1):<br /> rand.shuffle(trainingset_indexes)<br /> for i in range(len(trainingset_indexes)//minibatch_size):<br /> minibatch_indexes = trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]<br /> sess.run(train_step, feed_dict={<br /> seq_in: train_seqs_in [minibatch_indexes],<br /> seq_len: train_seqs_len[minibatch_indexes],<br /> seq_target: train_seqs_out[minibatch_indexes],<br /> })<br /> <br /> validation_loss = 0<br /> for i in range(len(val_seqs)//minibatch_size):<br /> minibatch_validation_loss = sess.run(total_loss, feed_dict={<br /> seq_in: val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],<br /> seq_len: val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],<br /> seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],<br /> })<br /> validation_loss += minibatch_validation_loss<br /><br /> if validation_loss > last_validation_loss:<br /> break<br /> last_validation_loss = validation_loss<br /> <br /> saver.save(sess, './model')<br /></pre><br /><div class="subsectitle">Application</div><br />We can now use the trained neural network. We first restore the last saved model weights which are the ones that gave the best validation loss and then we will define two functions: one for getting the probability of a whole sequence and the other for getting the next token after a prefix. "seq_prob" works by getting every prefix's softmax output, getting the probability of each token in the sequence after each prefix and then multiplying them together. "next_tokens" works by passing a prefix to the neural network and only getting the softmax output of the last (longest) prefix. The probabilities and corresponding tokens are returned.<br /><br /><pre class="brush:python">saver.restore(sess, tf.train.latest_checkpoint('.'))<br /><br />def seq_prob(seq):<br /> seq_indexes = [ token_to_index.get(token, unknown_index) for token in seq ]<br /> outputs = sess.run(predictions, feed_dict={<br /> seq_in: [ [ edge_index ] + seq_indexes ],<br /> seq_len: [ 1+len(seq) ],<br /> })[0]<br /> probs = outputs[np.arange(len(outputs)), seq_indexes+[ edge_index ]]<br /> return np.prod(probs)<br /><br />print('P(the dog barked.) =', seq_prob(['the', 'dog', 'barked', '.']))<br />print('P(the cat barked.) =', seq_prob(['the', 'cat', 'barked', '.']))<br />print()<br /><br />def next_tokens(prefix):<br /> prefix_indexes = [ token_to_index.get(token, unknown_index) for token in prefix ]<br /> probs = sess.run(predictions, feed_dict={<br /> seq_in: [ [ edge_index ] + prefix_indexes ],<br /> seq_len: [ 1+len(prefix) ],<br /> })[0][-1]<br /> token_probs = list(zip(probs, ['<end>', '<unk>']+vocab))<br /> return token_probs<br /><br />print('the dog ...', sorted(next_tokens(['the', 'dog']), reverse=True)[:5])<br />print()<br /></pre><br />These are the outputs I got:<br /><br /><pre>P(the dog barked.) = 1.71368e-08<br />P(the cat barked.) = 6.16375e-10<br /><br />the dog ... [(0.097657956, 'was'), (0.089791521, '<unk>'), (0.058101207, 'is'), (0.055007596, 'had'), (0.02786131, 'could')]<br /></pre><br />We can extend "next_tokens" to generate whole sentences using one of two ways: generating the most probable sentence or generating a randomly sampled sentence. For the first we are going to use greedy search which chooses the most probable word given a prefix and adds it to the prefix. This will not give the most probable sentence but it should be close (use <a href="http://geekyisawesome.blogspot.com.mt/2016/10/using-beam-search-to-generate-most.html">beam search</a> for a better selection method). For the second function we want to choose words at random but based on their probability such that rare words are rarely chosen. This is called sampling sentences (the probability of sampling a particular sentence is equal to the probability of the sentence). We did this using <a href="https://en.wikipedia.org/wiki/Fitness_proportionate_selection">roulette selection</a>. For both functions we left out the unknown token during generation and gave a hard maximum word limit of 100 words.<br /><br /><pre class="brush:python">def greedy_gen():<br /> prefix = []<br /> for _ in range(100):<br /> probs = sorted(next_tokens(prefix), reverse=True)<br /> (_, next_token) = probs[0]<br /> if next_token == '<unk>':<br /> (_, next_token) = probs[1]<br /> elif next_token == '<end>':<br /> break<br /> else:<br /> prefix.append(next_token)<br /> return prefix<br /><br />print('Greedy generation:', ' '.join(greedy_gen()))<br />print()<br /><br />def random_gen():<br /> prefix = []<br /> for _ in range(100):<br /> probs = next_tokens(prefix)<br /> (unk_prob, _) = probs[unknown_index]<br /> <br /> r = rand.random() * (1.0 - unk_prob)<br /> total = 0.0<br /> for (prob, token) in probs:<br /> if token != '<unk>':<br /> total += prob<br /> if total >= r:<br /> break<br /> if token == '<end>':<br /> break<br /> else:<br /> prefix.append(token)<br /> return prefix<br /><br />print('Random generation:', ' '.join(random_gen()))<br />print()<br /></pre><br />These are the outputs I got:<br /><br /><pre>Greedy generation: `` i don't know '' .<br /><br />Random generation: miss the road place , title comes to the seeds of others to many and details under the dominant home<br /></pre>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-46466574370919646012017-04-03T20:47:00.002+02:002017-04-03T20:54:08.308+02:00Getting the top n most probable sentences using beam searchThis is a continuation from a <a href="http://geekyisawesome.blogspot.com.mt/2016/10/using-beam-search-to-generate-most.html">previous blog post</a> on single sentence beam search.<br /><br />Sometimes it is not enough to just generate the most probable sentence using a language model. Sometimes you want to generate the top 3 most probable sentences instead. In that case we need to modify our beam search a bit. We will make the function a generator that returns a sequence of sentences in order of probability instead of just returning a single most probable sentence. Here are the changes we need to make:<br /><br />In the single sentence version, we were getting the most probable prefix in the current beam and checking if it is complete. If it is, then we return it and stop there. Instead, we will now not stop until the current beam is empty (or until the caller stops requesting for more sentences). After returning the most probable prefix we will check the second most probable prefix and keep on returning complete prefixes until we either find one which is not complete or we return all the beam. In the case that we return the whole beam then the algorithm stops there as there is nothing left with which to generate new prefixes. This means that the beam width gives a limit on the number of sentences that can be returned. If we do not return all the beam then we continue generating prefixes with the remainder.<br /><br />In the case that some complete sentences were returned, they need to also be removed from the beam before we continue generating. Since the beam is implemented as a min-first heap queue (min-first because we want to pop the least probable prefix quickly when the beam becomes bigger than the beam width) then we cannot remove the highest probable complete sentence quickly as well. In order to do this, we first turn the heap into a list which is sorted by probability and then start popping out the items at the end if they are complete sentences. Following this, we will then heapify the list back into a min-first heap queue and continue as normal. This sorting and reheapifying should not impact on the performance too much if the beam width is relatively small.<br /><br />If the clip length is reached then the whole beam is immediately returned in order of probability. This is because as soon as one prefix is equal to the allowed maximum then that means that the entire beam consists of<br /><ol><li>incomplete sentences that are also as long as the allowed maximum (since all the prefixes grow together)</li><li>complete sentences that were found before but which do not have a maximum probability</li></ol>After returning all the incomplete sentences (they have to be returned since they cannot be extended further) then the complete sentences will also be returned as they will become the most probable sentences of what is left in the beam. The assumption is that if the complete sentence was a nonsensical one, then it wouldn't have remained in the beam so might as well return it as well rather than lose it.<br /><br />Here is the modified Python 3 code:<br /><br /><pre class="brush:python">import heapq<br /><br />class Beam(object):<br /><br /> def __init__(self, beam_width, init_beam=None):<br /> if init_beam is None:<br /> self.heap = list()<br /> else:<br /> self.heap = init_beam<br /> heapq.heapify(self.heap) #make the list a heap<br /> self.beam_width = beam_width<br /><br /> def add(self, prob, complete, prefix):<br /> heapq.heappush(self.heap, (prob, complete, prefix))<br /> if len(self.heap) > self.beam_width:<br /> heapq.heappop(self.heap)<br /><br /> def __iter__(self):<br /> return iter(self.heap)<br /><br /><br />def beamsearch(probabilities_function, beam_width=10, clip_len=-1):<br /> prev_beam = Beam(beam_width)<br /> prev_beam.add(1.0, False, [ '<start>' ])<br /><br /> while True:<br /> curr_beam = Beam(beam_width)<br /><br /> #Add complete sentences that do not yet have the best<br />probability to the current beam, the rest prepare to add more words to<br />them.<br /> for (prefix_prob, complete, prefix) in prev_beam:<br /> if complete == True:<br /> curr_beam.add(prefix_prob, True, prefix)<br /> else:<br /> #Get probability of each possible next word for the<br />incomplete prefix.<br /> for (next_prob, next_word) in probabilities_function(prefix):<br /> if next_word == '<end>': #if next word is the end<br />token then mark prefix as complete and leave out the end token<br /> curr_beam.add(prefix_prob*next_prob, True, prefix)<br /> else: #if next word is a non-end token then mark<br />prefix as incomplete<br /> curr_beam.add(prefix_prob*next_prob, False,<br />prefix+[next_word])<br /><br /> sorted_beam = sorted(curr_beam) #get all prefixes in current beam sorted by probability<br /> any_removals = False<br /> while True:<br /> (best_prob, best_complete, best_prefix) = sorted_beam[-1] #get highest probability prefix<br /> if best_complete == True or len(best_prefix)-1 ==<br />clip_len: #if most probable prefix is a complete sentence or has a length that exceeds the clip length (ignoring the start token) then yield it<br /> yield (best_prefix[1:], best_prob) #yield best sentence without the start token and together with its probability<br /> sorted_beam.pop() #remove the yielded sentence and check the next highest probability prefix<br /> any_removals = True<br /> if len(sorted_beam) == 0: #if there are no more sentences in the beam then stop checking<br /> break<br /> else:<br /> break<br /><br /> if any_removals == True: #if there were any changes in the current beam then...<br /> if len(sorted_beam) == 0: #if there are no more prefixes in the current beam (due to clip length being reached) then end the beam search<br /> break<br /> else: #otherwise set the previous beam to the modified current beam<br /> prev_beam = Beam(beam_width, sorted_beam)<br /> else: #if the current beam was left unmodified then assign it to the previous beam as is<br /> prev_beam = curr_beam<br /></pre>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-55773293215448934652017-03-27T16:43:00.001+02:002017-03-27T16:44:11.140+02:00Word embeddings: How word2vec and GloVe workWord embeddings are vectors that represent words. For example the word "dog" might be represented as [0.1, -2.1, 1.2] whilst "cat" might be represented as [0.2, 2.4, 1.1]. These vectors are important in neural networks because neural networks can only work with continuous numbers whereas words are discrete symbols.<br /><br />One way to represent words as vectors is by taking a finite and fixed vocabulary of words that you want to work with, put the words in the vocabulary in a certain order such as alphabetical, and then use one hot vectors to represent the words. A one hot vector is a vector consisting of zeros everywhere and single one somewhere, such as [0, 0, 1, 0]. The position of the one indicates the position of the word in the vocabulary. The one hot vector [1, 0, 0] represents the first word in a three word vocabulary. You can then feed this vector as an input to a neural network and continue as usual.<br /><br />One hot vectors are a very wasteful representation in both space and time. They are wasteful in space because you need a vector as large as your vocabulary to represent each word. They are wasteful in time because when computing activation values of the first layer in the neural network, the weight matrix multiplication is going to involve multiplying by a lot of zeros. This is what the matrix multiplication looks like when then first layer outputs two activation values:<br /><br /><pre>(1 0 0) (1 2) = (1*1 + 0*3 + 0*5 1*2 + 0*4 + 0*6) = (1 2)<br /> (3 4)<br /> (5 6)<br /></pre><br />In fact, if you think about it, all you need is to pick the row in the matrix corresponding to where the one is in the one hot vector. What we usually do in fact is to have an "embedding layer" which is just a matrix with as many rows as the vocabulary size and as many columns as you want your embedded vectors to be big. We then simply pick the row according to the word we want to pass to the neural network. This embedding matrix is then optimized together with the rest of the neural network and the selected vectors passed as input to the next layer as usual.<br /><br />This has an interesting result. The vectors tend to become similar for similar words, that is, the more similar two words are, the larger the cosine similarity of their corresponding vectors. There are also some other interesting properties such as analogies (asking "man is to king as woman is to...?") and odd word out exercises (which word from these five does not fit with the others?). It seems that the vectors are not just made different for different words but are made different to different degrees according to the differences between the words. If two words are used similarly in certain contexts but not others (such as being similar in gender but not in other cases), then that similarity will be encoded in the vectors. You can see more about these properties in <a href="https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/">https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/</a>.<br /><br />Of course there are systems for creating generic word embeddings that are useful for many natural language processing applications. The two most popular generic embeddings are <a href="https://en.wikipedia.org/wiki/Word2vec">word2vec</a> and <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a>.<br /><br />word2vec is based on one of two flavours: The continuous bag of words model (CBOW) and the skip-gram model. CBOW is a neural network that is trained to predict which word fits in a gap in a sentence. For example, given the partial sentence "the cat ___ on the", the neural network predicts that "sat" has a high probability of filling the gap. The order of the words is not important: given four words, the neural network predicts the word in the middle. The embedding layer of the neural network is used to represent words in general. On the other hand, the skip-gram model works in the other way round. Given a middle word it predicts which words surround it. Of course by "predicts" we mean that it outputs a probability for each word in the vocabulary. These tasks are meant to force the neural network to create embeddings that reflect the relationship between words, hence bestowing them with meaning.<br /><br />GloVe takes a different approach. Instead of extracting the embeddings from a neural network that is designed to perform a surrogate task (predicting neighbouring words), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words will occur near each other (within 5 words for example). For example if "dog" and "cat" occur near each other 10 times in a corpus, then vec(dog) dot vec(cat) = log(10). This forces the vectors to somehow encode the frequency distribution of which words occur near them.<br /><br />Which is better? It probably depends on your data. The nice thing about both of these methods is that you can train your own word vectors based on your own corpus so that the meaning that gets encoded into the vectors becomes specific to your own domain.<br /><br />Download word2vec code from <a href="https://github.com/dav/word2vec">here</a>.<br />Download GloVe code from <a href="https://github.com/stanfordnlp/GloVe">here</a>.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-68025612774432530112017-02-28T22:54:00.000+01:002017-09-19T08:44:42.146+02:00Neural network deep learning hard attention using REINFORCE / score function estimator / likelihood ratio estimatorOne of the things I struggled the most to understand with papers on deep learning is when they mention hard attention. Let's start with a description of what attention is first.<br /><br />Attention based learning is when you have a sequence input where only part of it is relevant and you want your neural network to explicitly choose what is relevant. For example, if you are translating a sentence from English to French, only some of the words in the English sentence are relevant to generating the next word in the French sentence. So part of your neural network would be dedicated to weighing the importance of each English word in the context of what the neural network intends to generate next in the French sentence. This attention module would work by taking two inputs:<br /><ul><li>an embedded word vector from the English sentence</li><li>the current state of the recurrent neural network that is encoding what has been generated up to now</li></ul>Given these two inputs the attention module would then output a single number between 0 and 1 which quantifies how relevant the given word is given what the neural network intends to generate next (which is encoded in the RNN state). This weighting would then be multiplied by the word vector in order to shrink it's magnitude and then the weighted sum of the word vectors would be used to encode the English sentence. This sentence vector would represent the information in the sentence that is relevant to generating the next French word. In this way, the neural network does not need to encode the whole sentence and use the same representation for generating all the French sentence. The English sentence would be attended to in different ways for generating each French word.<br /><br />This technique has not only been used for machine translation but also for image captioning (only attend to parts of the image for every word being generated) and neural turing machines (only attend to parts of the tape with every operation). Just look at what <a href="http://distill.pub/2016/augmented-rnns/">Colah</a> has to say about it. However this is just soft attention, which is easy. It's called soft attention because you still partially attend to every part of the input; it's just some parts get very little attention. This means that it's still possible to measure the effect of a part of an input and so it's still possible to determine, by gradient descent, whether its attention should be increased or decreased.<br /><br />Hard attention, on the other hand, either completely includes or excludes elements of the input. How would you do this in a neural network? You'd need to use thresholding for example, where if the value of a neuron is above a threshold then it outputs a one, zero otherwise. You can also use argmax which selects the neuron with the greatest value and sets it to one and all the others to zero. Unfortunately both of these solutions are undifferentiable and that makes direct gradient descent ineffective if they are used in a hidden layer (in an output layer you can just maximize their continuous values and then threshold them at test time). It would be the same situation neural networks had in the past when they couldn't have a hidden layer because there was no known optimization algorithm for 2 layers of thresholding neurons.<br /><br />It would be nice to somehow make neurons that have discrete outputs but which can still be trained by gradient descent. One way to do this is to use stochastic neurons which output discrete values based on a probability that can be tweaked. The simplest probabilistic neuron is the Bernoulli neuron which outputs a 1 with probability "p" and a 0 otherwise. Let's assume a simple attention based neural network that uses one Bernoulli neuron as an attention module and one normal sigmoid neuron as a feature extraction neuron. The Bernoulli neuron's output is multiplied by the normal neuron's output in order to gate it. Both neurons take the same single number as input.<br /><br /><a href="https://2.bp.blogspot.com/-BeExEkIkfH8/WLZ1H9JotZI/AAAAAAAABAs/lM5FAHjrZ8kvvWx23_ACqewxCpnAqJDiACLcB/s1600/simple_bernoulli_neuron_net.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-BeExEkIkfH8/WLZ1H9JotZI/AAAAAAAABAs/lM5FAHjrZ8kvvWx23_ACqewxCpnAqJDiACLcB/s320/simple_bernoulli_neuron_net.png" width="320" height="259" /></a><br /><br />The grey neuron with a "B" inside is the Bernoulli neuron, "w0" and "w1" are weights and "b0" and "b1" are biases. If we had to turn this into an equation in order to differentiate it, we'd end up with this:<br /><br />$$y = B_x \cdot sig(w_1 x + b_1)$$<br /><br />where "B_x" is a random variable that is dependent on "x". Unfortunately random variables "hide" their inner workings and we cannot express their probability as part of the equation. This means that we cannot optimize "w0" and "b0" by gradient descent as it would be meaningless to differentiation with respect to "w0" given that it's not in the equation. In fact "B" is treated as a constant in the above equation and we can still differentiate the equation with respect to all the other parameters. Note that this situation is different from dropout, where you randomly multiply some of the neuron values by 0 in order to avoid overfitting. In the case of dropout, the above equation would be enough as we're not optimizing the value of the dropout random variable.<br /><br />There is a simple solution to this problem however. Instead of finding the gradient of "y", we can find the gradient of the expected value of "y". The expected value is the mean value you get when running a stochastic function over and over again. For example, if you're tossing a fair coin, with one face representing a 1 and the other face representing a 0, the expected value is 0.5. However, if you're tossing an unfair coin where the "1" face comes up 75% of the time, then on average you will get a value of 0.75. In general, the expected value of a discrete probability distribution, such as the Bernoulli distribution, can be found using the following equation:<br /><br />$$E[X] = \sum_{v \in X} p(v) \cdot v$$<br /><br />that is, just multiple each value by its respective probability and take the sum. In the case of a coin with probability of the "1" face coming up being "p" (a Bernoulli distribution), the expected value is $1 \cdot p + 0 \cdot (1-p)$ which equals "p". We can take advantage of the expect value by using gradient descent to minimize the expected error rather than the error itself. This would expose the parameters that determine the probability of the Bernoulli neuron and we would be able to use gradient descent as usual.<br /><br />Let's take a concrete example of an error function. We want to minimize the sum square error of the neural net such that when given a 0 it should output a 1 and when given a 1 it should output a 0 (a logical NOT). This is what the error function (called the cost function) would look like, together with the error function of a single input:<br /><br />$$<br />\begin{align}<br />C &= \sum_{(x,t_x) \in \{ (0,1),(1,0) \}} (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\<br />C_x &= (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\<br />\end{align}<br />$$<br /><br />Let's focus on just the single input error function. The expected error would look like:<br /><br />$$<br />\begin{align}<br />E[C_x] &= sig(w_0 x + b_0) \cdot (t_x - 1 \cdot sig(w_1 x + b_1))^2 + (1 - sig(w_0 x + b_0)) \cdot (t_x - 0 \cdot sig(w_1 x + b_1))^2 \\<br />&= sig(w_0 x + b_0) \cdot (t_x - sig(w_1 x + b_1))^2 + (1 - sig(w_0 x + b_0)) \cdot (t_x)^2 \\<br />\end{align}<br />$$<br /><br />Remember that expected value finds the sum of all the possible values of the stochastic function due to randomness multiplied by their respective probability. In the case of our neural net, the two possible values due to randomness are caused by the Bernoulli neuron "B" being either 1 with probability $sig(w_0 x + b_0)$ or 0 with probability $1 - sig(w_0 x + b_0)$. Now we can find the gradient of the expected error and minimize it, which would include optimizing the parameters determining the probability of the Bernoulli neuron.<br /><br />In general it might not be tractable to expose all the possible values of a stochastic neural network, especially when multinoulli neurons are used where there are many possible discrete values instead of just 0 or 1 and especially when there are multiple such neurons whose values must be considered together, creating a combinatorial explosion. To solve this problem, we can approximate the expected error by taking samples. For example, if you want to approximate the expected value of a coin, you can toss it a hundred times, count the number of times the coin gives the "1" face and divide by 100. This can be done with a stochastic neural network, where you run the network a hundred times and calculate the mean of the error for each training pair. The problem is that we don't want the expected error, but the derivative of the expected error, which cannot be calculated on the constant you get after approximating the expected error. We need to take samples of the expected derivative of the error instead. This is what the expected derivative of the error looks like:<br /><br />$$<br />\begin{align}<br />\frac{dE[C_x]}{dw} &= \frac{d}{dw} \sum_{v \in C_x} p(v) \cdot v \\<br />&= \sum_{v \in C_x} \frac{d}{dw} (p(v) \cdot v) \\<br />\end{align}<br />$$<br /><br />The problem with this equation is that it can't be sampled like an expected value can so you'll end up having to calculate the full summation, which we're trying to avoid. The solution to this is that we continue breaking down the algebra until eventually we get something that can be approximated by sampling. This has been derived multiple times in the literature and so it has multiple names such as REINFORCE, score function estimator, and likelihood ratio estimator. It takes advantage of the following theorem of logarithms: $\frac{d}{dx} f(x) = f(x) \cdot \frac{d}{dx} \log(f(x))$<br /><br />$$<br />\begin{align}<br />\frac{dE[C_x]}{dw} &= \sum_{v \in C_x} \frac{d}{dw} (p(v) \cdot v) \\<br />&= \sum_{v \in C_x} \left( p(v) \cdot \frac{d}{dw} v + v \cdot \frac{d}{dw} p(v) \right) \\<br />&= \sum_{v \in C_x} \left( p(v) \cdot \frac{d}{dw} v + v \cdot p(v) \cdot \frac{d}{dw} \log(p(v)) \right) \\<br />&= \sum_{v \in C_x} p(v) \cdot \left( \frac{d}{dw} v + v \cdot \frac{d}{dw} \log(p(v)) \right) \\<br />&\approx \frac{\sum_{i = 1}^{N} \frac{d}{dw} \tilde{v} + \tilde{v} \cdot \frac{d}{dw} \log(p(\tilde{v}))}{N} \text{ where } \tilde{v} \sim{} C_x \\<br />\end{align}<br />$$<br /><br />The last line is approximating the derivative of the expected error using "N" samples. This is possible because probability "p" was factored out inside the summation, which gives it the same form of an expect value equation and which hence can be approximated in the same way. You might be asking how it is that you can find the probability of each possible value. As long as you have access to the values returned by the stochastic neurons, you can use a dictionary to map values to their respective probabilities and multiply them together if necessary. The following is the derivative of the above Bernoulli NOT gate:<br /><br />$$<br />\begin{align}<br />C_x &= (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\<br />\frac{dE[C_x]}{dw_0} &\approx \frac{\sum_{i = 1}^{N} \left( \frac{d}{dw_0} (t_x - B_x \cdot sig(w_1 x + b_1))^2 \right) + (t_x - B_x \cdot sig(w_1 x + b_1))^2 \cdot \left(<br />\frac{d}{dw_0}<br />\left\{ \begin{array}{lr}<br />\log(sig(w_0 x + b_0)) & : B_x = 1 \\<br />\log(1 - sig(w_0 x + b_0)) & : B_x = 0 \\<br />\end{array} \right.<br />\right) }{N} \\<br />\end{align}<br />$$mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-45362065973473967332017-01-31T22:47:00.001+01:002017-02-18T19:29:42.626+01:00Applying softmax and categorical_crossentropy to 3D tensors in TheanoTheano is a great deep learning library but there are some things in it that need to be polished a bit. For example at the moment, you can only apply the softmax function in the tensor.nnet module to matrices (2D tensors). Same goes for the categorical_crossentropy function.<br /><br />This is good for when you have a list of predictions, for example, when you have a bunch of images and you want your neural network to output a set of probabilities corresponding to each possible image label for each image. In this case you want to give your softmax function a matrix of values where each row corresponds to an image and each value in each row is a feature that is used to determine how to distribute the label probabilities.<br /><br />But let's say that instead of just a single label we want to generate a sentence or a list of labels. In this case we need to provide a 3D tensor where the first dimension corresponds to sentences, the second dimension corresponds to word place holders in the sentences and the third dimension corresponds to the features that will be used to determine the probabilities of the possible words that fill in each place holder. In this case the softmax will not accept the 3D tensor.<br /><br />Assuming that the probabilities to output are exclusively dependent on their corresponding features and that features are not shared among different "place holders", the solution is to reshape your 3D tensor into a 2D tensor, apply your softmax and then reshape the 2D tensor back into the original 3D tensor shape. Here's an example:<br /><br /><pre>Original 3D tensor:<br />[ [ [111,112], [121,122] ], [ [211,212], [221,222] ] ]<br /><br />Reshaped 2D tensor:<br />[ [111,112], [121,122], [211,212], [221,222] ]<br /><br />Applied softmax:<br />[ [111',112'], [121',122'], [211',212'], [221',222'] ]<br /><br />Reshaping back to 3D:<br />[ [ [111',112'], [121',122'] ], [ [211',212'], [221',222'] ] ]<br /></pre><br />It would be nice if this is done automatically behind the scene by Theano. In the mean time, here is a snippet to help you:<br /><br /><pre class="brush:python" style="border:black solid 2px;">import theano<br />import theano.tensor as T<br /><br />X = T.tensor3()<br />(d1,d2,d3) = X.shape<br />Y = T.nnet.softmax(X.reshape((d1*d2,d3))).reshape((d1,d2,d3))<br /></pre><br />Here is what happens step by step:<br /><pre class="brush:python">print(X.reshape((d1*d2,d3)).eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))<br /><br />>>> [[ 1. 2.]<br /> [ 1. 3.]<br /> [ 1. 4.]<br /> [ 1. 5.]]<br /></pre><br /><pre class="brush:python">print(T.nnet.softmax(X.reshape((d1*d2,d3))).eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))<br /><br />>>> array([[ 0.26894142, 0.73105858],<br /> [ 0.11920292, 0.88079708],<br /> [ 0.04742587, 0.95257413],<br /> [ 0.01798621, 0.98201379]])<br /></pre><br /><pre class="brush:python">print(Y.eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))<br /><br />>>> [[[ 0.26894142 0.73105858]<br /> [ 0.11920292 0.88079708]]<br /><br /> [[ 0.04742587 0.95257413]<br /> [ 0.01798621 0.98201379]]]<br /></pre><br />The categorical_crossentropy function can be used in the same way:<br /><br /><pre class="brush:python" style="border:black solid 2px;">import theano<br />import theano.tensor as T<br /><br />X = T.tensor3()<br />S = T.imatrix()<br />(d1,d2,d3) = X.shape<br />(e1,e2) = S.shape<br />Y = T.nnet.categorical_crossentropy( T.nnet.softmax(X.reshape((d1*d2,d3))), S.reshape((e1*e2,)) ).reshape((d1,d2))<br /></pre><br />And this is how it is used:<br /><pre class="brush:python">print(Y.eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]], S: [[0,1],[1,0]] }))<br /><br />>>> array([[ 1.31326169, 0.12692801],<br /> [ 0.04858735, 4.01814993]])<br /></pre><br />Where "S" is choosing which probability to apply negative log to in each softmax, for example the corresponding number in "S" for [1,2] is 0 so we choose the first probability the comes out of it, which is 0.26894142, to which we apply negative log, that is, -ln(0.26894142) = 1.31326169. Similarly, the corresponding number in "S" for [1,4] is 1 so we choose the second probability which is 0.95257413 from which we perform -ln(0.95257413) = 0.04858735.mtantinoreply@blogger.com2tag:blogger.com,1999:blog-4318944459823143473.post-29095074457151571912016-12-27T16:15:00.002+01:002016-12-27T16:15:46.774+01:00Bernoulli vs Binomial vs Multinoulli vs Multinomial distributions<div class="sectitle">Bernoulli distribution</div>Say you are flipping a coin that has a probability of 0.4 of turning up heads and 0.6 of turning up tails. The simplest probability distribution there is is the <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli distribution</a> which just asks for the probability of the coin turning up heads after one toss, that is, 0.4. That's all. The Bernoulli distribution is only about the probability of a random event with two possible outcomes occurring after one trial. It's used for success/fail type random variables where you want to reason about the probability that a random variable will succeed (such as a coin turning up heads).<br /><br />The probability mass function for the Bernoulli distribution is:<br /><pre>f(x;p) = p^x * (1 - p)^(1-x)<br /></pre>where "*" is multiplication and "^" is power, "p" is the probability of a success and "x" is the number of successes desired, which can be either 1 or 0. When "x" is 1 then you find the probability of a success whilst when it is 0 then you find the probability of a fail. Since there are only two outcomes, if "p" is the probability of a success then the probability of a non-success is "1-p". Notice that this function is quite silly really as it is just a closed form expression for choosing either "p" or "1-p" depending on what "x" is. The number you're interested in is raised to the power of 1 whilst the number you're not interested in is raised to the power of 0, turning it into 1, and then the two results are multiplied together.<br /><br /><div class="sectitle">Binomial distribution</div>One generalization we can make upon the Bernoulli distribution is to consider multiple trails instead of just one and then ask about the probability that a number of successes would occur. This is called the <a href="https://en.wikipedia.org/wiki/Binomial_distribution">binomial distribution</a>. For example, what is the probability that the coin described above results in exactly 2 heads after tossing it 3 times. There are many ways how this result can come about. You can get [head, head, tail], or [head, tail, head], or [tail, head, head]. For this reason we need to consider the number of different ways the required total can be achieved, which is found by using the <a href="https://en.wikipedia.org/wiki/Combination">combination formula</a>.<br /><br />The probability mass function for the Binomial distribution is:<br /><pre>f(x;n,p) = n!/(x! * (n-x)!) * p^x * (1 - p)^(n-x)<br /></pre>where "*" is multiplication and "^" is power, "p" is the probability of a success, "n" is the number of trials to try out, and "x" is the number of desired successes out of the "n" trials.<br /><br /><div class="sectitle">Multinoulli distribution</div>Whereas the binomial distribution generalises the Bernoulli distribution across the number of trials, the <a href="https://en.wikipedia.org/wiki/Categorical_distribution">multinoulli distribution</a> generalises it across the number of outcomes, that is, rolling a dice instead of tossing a coin. The multinoulli distribution is also called the categorical distribution.<br /><br />The probability mass function for the multinoulli distribution is:<br /><pre>f(xs;ps) = ps_1^xs_1 * ps_2^xs_2 * ... * ps_k^xs_k<br /></pre>where "*" is multiplication and "^" is power, "k" is the number of outcomes (6 in the case of a dice, 2 for a coin), "ps" is the list of "k" probabilities where "ps_i" is the probability of the ith outcome resulting in a success, "xs" is a list of numbers of successes where "xs_i" is the number of successes desired for the ith outcome, which can be either 1 or 0 and where there can only be exactly a single "1" in "xs".<br /><br /><div class="sectitle">Multinomial distribution</div>Finally the most general generalisation of the Bernoulli distribution is across both the number of trials and the number of outcomes, called the <a href="https://en.wikipedia.org/wiki/Multinomial_distribution">multinomial distribution</a>. In order to be more general, this distribution also allows specifying the number of successes desired for each individual outcome, rather than just of one outcome like the multinoulli distribution. This lets us calculate the probability of rolling a dice 6 times and getting 3 "2"s, 2 "3"s and a "5". Since we want it to be compatible with the binomial distribution, these outcomes can arrival in any order, that is, it doesn't matter whether you get [5,3,2,3,2,2] or [2,2,2,3,3,5]. For this reason we need to use the combination function again.<br /><br />The probability mass function for the multinomial distribution is:<br /><pre>f(xs;n,ps) = n!/(xs_1! * xs_2! * ... * xs_k!) * ps_1^xs_1 * ps_2^xs_2 * ... * ps_k^xs_k<br /></pre>where "*" is multiplication and "^" is power, "k" is the number of outcomes (6 in the case of a dice, 2 for a coin), "ps" is the list of "k" probabilities where "ps_i" is the probability of the ith outcome resulting in a success, "xs" is a list of numbers of successes where "xs_i" is the number of successes desired for the ith outcome, the total of which must be "n".mtantinoreply@blogger.com1tag:blogger.com,1999:blog-4318944459823143473.post-58131235387847430492016-11-30T20:41:00.001+01:002016-11-30T21:14:14.230+01:00Checking if a number is prime: when to stop checking potential factorsLet's say you want to find out whether the number 888659800715261 is a prime number. If this were a programming exercise, a naive solution would be to go through all the numbers from 2 to 888659800715261/2 (half the number) and check whether any of the numbers divide 888659800715261 evenly. If none of them do, then the number is prime. The reason why we stop at half the number is because the largest factor that a number "n" can have before "n" itself is n/2, the one before that is n/3, and so on. Sure, we don't need to check every single number between 2 and n/2 as it would be enough to only check the prime numbers between 2 and n/2, but the point is that as soon as we reach n/2, we can stop checking and declare that 888659800715261 is prime.<br /><br />The problem is that half of 888659800715261 is 444329900357630 which is still too big to check each number between 2 and it, probably even if you only check the prime numbers. The question is, is there an even lower upper-bound where we can stop checking potential factors in order to be sure that the number has no factors?<br /><br />Let's look at which numbers between 1 and 24 are factors of 24:<br /><pre>[1] [2] [3] [4] 5 [6] 7 [8] 9 10 11 [12] 13 14 15 16 17 18 19 20 21 22 23 [24]</pre><br />In order for a number to be a factor of 24, there must be another factor which when the two factors are multiplied together give 24. For example, if 2 is a factor of 24, then there must be another factor of 24 which when multiplied by 2 gives 24. This other factor is 12. Let's call 2 and 12 co-factors. The co-factor of 3 is 8, the co-factor of 4 is 6. We can imagine co-factors being mirror reflections of each other:<br /><br /><pre>[1] [2] [3] [4] 5 [6] 7 [8] 9 10 11 [12] 13 14 15 16 17 18 19 20 21 22 23 [24]<br /> ( [ { < > } ] )<br /></pre><br />Notice how all the co-factors nest each other. The co-factors 4 and 6 are between the co-factors 3 and 8 which in turn are between the co-factors 2 and 12 which in turn are between 1 and 24. It seems like there is a line of reflection at 5, where all the factors smaller than 5 have co-factors larger than 5. This means that beyond 5, all factors are co-factors of smaller numbers.<br /><br />Let's see another list of factors for 16:<br /><pre>[1] [2] 3 [4] 5 6 7 [8] 9 10 11 12 13 14 15 [16]<br /> ( [ { } ] )<br /></pre><br />Notice how 4 is a co-factor with itself in this case, since 4 squared is 16. The line of reflection is crossed at 4 this time, which happens to be the square root of 16. In fact the square root of 24 is 4.898... which is close to 5 (the line of reflection of 24). This is always the case, because the line of reflection occurs where the two co-factors are the same. If the co-factors are not the same then one factor must be smaller than the square root whilst the other must be larger. If both are smaller or bigger then when multiplied together they will not give the number being factored. This is because if two co-factors "a" and "b" are supposed to equal "n", and both of them are larger than √n, then that would mean that a×b must also be larger than √n×√n, which means that a×b is larger than "n". The area of a rectangle with two large sides cannot be equal to the area of a rectangle with two small sides. Therefore, if the two sides are equal (a square) then in order to change the length of the sides without changing the area of the rectangle you must make one side larger whilst making the other smaller. This proves that the square root must be between any two co-factors.<br /><br />So what's the point? The point is that if you reach the square root of "n" without having found any factors, then you are guaranteed that there will not be any factors beyond √n because if there are any then they must have co-factors smaller than √n, which you have already confirmed that there aren't any. So the square root of a number is the lower upper-bound to check for factors. Can there be an even lower upper-bound? No, because the number might be a square number and have no other factors apart from its square root, such as 25. Therefore the smallest co-factor a number can have is the square root of the number.<br /><br />So in order to check if 888659800715261 is a prime number, we only have to check all the numbers up to 29810397 for factors, which is much better than 444329900357630. By the way, there is no number between 2 and 29810397 that evenly divides 888659800715261, which means that it is a prime number.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-92097245020304925242016-10-30T20:26:00.000+01:002017-04-03T20:53:45.944+02:00Using beam search to generate the most probable sentenceThis blog post continues in a second blog post about <a href="http://geekyisawesome.blogspot.com.mt/2017/04/getting-top-n-most-probable-sentences.html">how to generate the top n most probable sentences</a>.<br /><br />In my <a href="http://geekyisawesome.blogspot.com.mt/2016/09/text-generation-using-language-model.html">last blog post</a> I talked about how to generate random text using a language model that gives the probability of a particular word following a prefix of a sentence. For example, given the prefix "The dog", a language model might tell you that "barked" has a 5% chance of being the next word whilst "meowed" has a 0.3%. It's one thing generating random text in a way that is guided by the probabilities of the words but it is an entirely different thing to generate the most probable text according to the probabilities. By most probable text we mean that if you multiply the probabilities of all the words in the sentence together you get the maximum product. This is useful for conditioned language models which give you different probabilities depending on some extra input, such as an image description generator which accepts an image apart from the prefix and returns probabilities for the next word depending on what's in the image. In this case we'd like to find the most probable description for a particular image.<br /><br />You might think that the solution is to always pick the most probable word and add it to the prefix. This is called a greedy search and is known to not give the optimal solution. The reason is because it might be the case that every combination of words following the best first word might not be as good as those following the second best word. We need to use a more exploratory search than greedy search. We can do this by thinking of the problem as searching a probability tree like this:<br /><br /><a href="https://4.bp.blogspot.com/-Jjpb7iyB37A/WBZI4ImGQII/AAAAAAAAA9s/ululnUWt2vw9NMKuEr-F9H8tR2LEv36lACLcB/s1600/prefix_probability_tree.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-Jjpb7iyB37A/WBZI4ImGQII/AAAAAAAAA9s/ululnUWt2vw9NMKuEr-F9H8tR2LEv36lACLcB/s320/prefix_probability_tree.png" width="320" height="265" /></a><br /><br />The tree shows a probability tree of which words can follow a sequence of words together with their probabilities. To find the probability of a sentence you multiply every probability in the sentence's path from <start> to <end>. For example, the sentence "the dog barked" has a probability of 75% × 73% × 25% × 100% = 13.7%. The problem we want to solve is how to find the sentence that has the highest probability.<br /><br />One way to do this is to use a breadth first search. Starting from the <start> node, go through every node connected to it, then to every node connected to those nodes and so on. Each node represents a prefix of a sentence. For each prefix compute its probability, which is the product of all the probabilities on its path from the <start> node. As soon as the most probable prefix found is a complete sentence, that would be the most probable sentence. The reason why no other less probable prefixes could ever result in more probable sentences is because as a prefix grows, its probability shrinks. This is because any additional multiplications with probabilities made to any prefix probability will only make it smaller. For example, if a prefix has a probability of 20% and another word is added to it which has a probability of 99%, then the new probability will be 20% × 99% which is the smaller probability of 19.8%.<br /><br />Of course a breadth first search is impractical on any language model that has a realistic vocabulary and sentence length since it would take too long to check all the prefixes in the tree. We can instead opt to take a more approximate approach where we only check a subset of the prefixes. The subset would be the top 10 most probable prefixes found at that point. We do a breadth first search as explained before but this time only the top 10 most probable prefixes are kept and we stop when the most probable prefix in these top 10 prefixes is a complete sentence.<br /><br />This is practical but it's important that the way we find the top 10 prefixes is fast. We can't sort all the prefixes and choose the first 10 as there would be too many. We can instead use a <a href="https://en.wikipedia.org/wiki/Heap_%28data_structure%29">heap data structure</a>. This data structure is designed to quickly take in a bunch of numbers and quickly pop out the smallest number. With this you can insert the prefix probabilities one by one until there are 10 prefixes in it. After that start popping out the smallest probability immediately after inserting a new one in order to only keep the 10 largest ones.<br /><br />Python provides a library called "heapq" (heap queue) just for this situation. Here is a class that makes use of heapq in order to create a beam of prefix probabilities:<br /><br /><pre class="brush:python">import heapq<br /><br />class Beam(object):<br />#For comparison of prefixes, the tuple (prefix_probability, complete_sentence) is used.<br />#This is so that if two prefixes have equal probabilities then a complete sentence is preferred over an incomplete one since (0.5, False) < (0.5, True)<br /><br /> def __init__(self, beam_width):<br /> self.heap = list()<br /> self.beam_width = beam_width<br /><br /> def add(self, prob, complete, prefix):<br /> heapq.heappush(self.heap, (prob, complete, prefix))<br /> if len(self.heap) > self.beam_width:<br /> heapq.heappop(self.heap)<br /> <br /> def __iter__(self):<br /> return iter(self.heap)<br /></pre><br />The code to perform the actual beam search is this:<br /><br /><pre class="brush:python">def beamsearch(probabilities_function, beam_width=10, clip_len=-1):<br /> prev_beam = Beam(beam_width)<br /> prev_beam.add(1.0, False, [ '<start>' ])<br /> while True:<br /> curr_beam = Beam(beam_width)<br /> <br /> #Add complete sentences that do not yet have the best probability to the current beam, the rest prepare to add more words to them.<br /> for (prefix_prob, complete, prefix) in prev_beam:<br /> if complete == True:<br /> curr_beam.add(prefix_prob, True, prefix)<br /> else:<br /> #Get probability of each possible next word for the incomplete prefix.<br /> for (next_prob, next_word) in probabilities_function(prefix):<br /> if next_word == '<end>': #if next word is the end token then mark prefix as complete and leave out the end token<br /> curr_beam.add(prefix_prob*next_prob, True, prefix)<br /> else: #if next word is a non-end token then mark prefix as incomplete<br /> curr_beam.add(prefix_prob*next_prob, False, prefix+[next_word])<br /> <br /> (best_prob, best_complete, best_prefix) = max(curr_beam)<br /> if best_complete == True or len(best_prefix)-1 == clip_len: #if most probable prefix is a complete sentence or has a length that exceeds the clip length (ignoring the start token) then return it<br /> return (best_prefix[1:], best_prob) #return best sentence without the start token and together with its probability<br /> <br /> prev_beam = curr_beam<br /></pre><br />"probabilities_function" returns a list of word/probability pairs given a prefix. "beam_width" is the number of prefixes to keep (so that instead of keeping the top 10 prefixes you can keep the top 100 for example). By making the beam search bigger you can get closer to the actual most probable sentence but it would also take longer to process. "clip_len" is a maximum length to tolerate, beyond which the most probable prefix is returned as an incomplete sentence. Without a maximum length, a faulty probabilities function which does not return a highly probable end token will lead to an infinite loop or excessively long garbage sentences.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-36875548580999692502016-09-28T11:12:00.000+02:002016-09-28T11:48:35.121+02:00Text generation using a language modelA <a href="https://en.wikipedia.org/wiki/Language_model">language model</a> is something that tells you the probability of a sequence of words. For example it tells you that the sequence "a dog is barking" is more probable than "a dog is was". This probability can then be used to generate text by selecting a sequence of words that is probable.<br /><br />One way to do this is to take a huge amount of text and count how often each 4 word segment occurs, then divide by the total amount of 4 word segments. For example, you find that in your huge text, "a dog is barking" to occurs 42 times and divide it by the amount of 4 word segments in the text. This will give you an approximate probability of "a dog is barking".<br /><br />Usually language models give you the probability of a particular word following a prefix of a sentence. For example given the prefix "a dog is", what is the probability that the next word in the prefix is "barking"? This is expressed mathematically as P("barking" | "a dog is"). This is what <a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/">neural network language models</a> do. You give them a prefix of words and they output a number for each known word, where the number is the probability of that word following the prefix.<br /><br />Once you have a probability distribution for all the words in your vocabulary, you can then pick the word with the highest probability to continue the incomplete sentence. If you do this for every word, you will generate a whole sentence but you will always generate the same sentence. In order to generate a random sentence, we need to choose a random next word, but in such a way that highly probable words are more likely to be chosen in order to avoid generating gibberish.<br /><br /><div class="sectitle">The softmax function</div>Given a set of numbers, such as frequencies, we can turn these numbers into probabilities by using the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood estimation</a> as described above. This is when you divide a frequency by the total. However this isn't the best way to do things. First of all this assumes that words that don't occur in the text have a probability of zero, which is unlikely to be the case as it's more likely that the word just has a very small probability than a probability of zero. Second of all, this requires the number to be frequencies, that is, to be all greater than or equal to zero, which might be too limiting. A neural network's outputs can be used to quantify how strongly it believes a particular word should follow a prefix, but this quantity might be negative in order to avoid being limited by a lower bound. Thirdly, we might want to skew the probabilities so that the most probable word has a higher probability and the least probable word has a lower probability or vice versa. This is done so that when you're selecting words based on their probability you either bias the selection towards higher probability words or you go the other way and make the probabilities more similar in order to generate more random looking text.<br /><br />To address all these concerns we can use a softmax function to convert the quantities into probabilities which uses a <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Boltzmann distribution</a>. Given a list of "n" quantities called "q", the probability of the kth quantity is:<br /><pre>e^(q_k/T) / sum(e^(q_i/T) for i in 1..n)</pre>where "T" is called the temperature and is used to control how similar the probabilities should be (by default it is 1).<br /><br />For example if your quantities are [0, 1, -1], then for temperature 1 the probabilities are:<br /><br /><pre>0: e^(0/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.245</pre><pre>1: e^(1/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.665</pre><pre>-1: e^(-1/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.09</pre><br />Notice how the probabilities and all valid (non-negative and sum to 1), how by raising "e" to the power of the quantity makes it always greater than zero and see what happens when the temperature is set to a larger number:<br /><br /><pre>0: e^(0/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.307</pre><pre>1: e^(1/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.506</pre><pre>-1: e^(-1/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.186</pre><br />See how more similar the probabilities are now with a higher temperature?<br /><br />The nice thing about the temperature is that even if you have already calculated the probabilities using temperature 1 and have lost the original quantities, you can still change the temperature of the probabilities as follows:<br /><br />Given probabilities "p", new probability "p'_k" based on temperature "T" are<br /><pre>p'_k = p_k^(1/T) / sum(p_i^(1/T) for i in 1..n)</pre><br />From the above examples,<br /><pre>0: 0.245^(1/2) / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.307</pre><pre>1: 0.665^(1/2) / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.506</pre><pre>-1: 0.09^(1/2) / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.186</pre><br /><div class="sectitle">The roulette wheel selection</div>Now how do we select a word using it's probability? There is an algorithm called <a href="https://en.wikipedia.org/wiki/Fitness_proportionate_selection">Roulette wheel selection</a> which does this. The idea is to put all the probabilities next to each other on a number line and then select a point on the number line at random. The word whose probability contains the point is chosen. Here's a diagram showing the probabilities 0.6, 0.3 and 0.1 next to each other on a number line and a random red point is placed on the number line in the "word 1" region meaning that word 1 was selected:<br /><br /><a href="https://4.bp.blogspot.com/-xwb58iTCIf0/V-uHgUWmTDI/AAAAAAAAA74/92yG_53qW7M6qY6eMdfkYS-bJQaKmSlNACLcB/s1600/roulette_wheel_selection.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-xwb58iTCIf0/V-uHgUWmTDI/AAAAAAAAA74/92yG_53qW7M6qY6eMdfkYS-bJQaKmSlNACLcB/s320/roulette_wheel_selection.png" width="320" height="60" /></a><br /><br />To do this, we generate a random number between 0 and 1 (the random point), and then start adding probabilities (in any order) until the total becomes larger than the random number. The word belonging to the last probability added is the one that is chosen. Here's a Python 3 implementation:<br /><br /><pre class="brush:python">def select(probs):<br /> r = random.random() #random number between 0 and 1<br /> total = 0.0<br /> for i in range(len(probs)):<br /> total += probs[i]<br /> if total > r: #important to use > and not >= in order to avoid selecting words with a probability of zero<br /> return i<br /> #a words must have been selected at this point or else the probabilities are invalid (either negative or do not sum to 1<br /> raise ValueError('Invalid probabilities')<br /></pre>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-45771621940275845192016-08-28T18:04:00.000+02:002016-08-28T18:19:44.377+02:00Using dropout in neural networks to avoid overfittingImagine you want to train a neural network to recognise objects in photos. So you build a training set of labelled photos and use it for training. The training brings down the error so that it can recognise the objects in the training set with 100% accuracy. So now you try to use the network on some new photos for actual use but it gets all of them wrong. This is a pesky problem with training neural networks called overfitting, that is, the network learns a set of weights that work for the training set provided but nothing else beyond it. It's a problem of generalisation that is present in every machine learning technique that has to learn from a finite training set due to something called the <a href="https://geekyisawesome.blogspot.com.mt/2013/07/no-free-lunch.html">No Free Lunch theorem</a>.<br /><br />Generally solutions are to make the training set larger by adding a wider variety of examples and to make the model as simple as possible since a complicated model is more likely to learn something complicated but uninteresting rather than something useful. There should also be a separate validation set which has some extra data that was not used in the training set which is used to check how the learning is performing on new data.<br /><br />There is also a very interesting solution in neural networks called dropout. Dropout is a technique that was published in a paper titled <a href="http://arxiv.org/pdf/1207.0580.pdf">Improving neural networks by preventing co-adaptation of feature detectors</a> and it works by randomly replacing activations in certain layers with a zero for every training set entry. This means that given a particular layer, for the first training set entry you randomly choose half of the neurons and replace their computed values with zeros before passing them on to the next layer. For the second training set entry you randomly choose a different set of neurons and repeat.<br /><br />There are several hypothesis for why this should work:<br /><ul><li>Dropout is <span style="font-weight:bold;">a form of regularisation which prevents neurons from overly relying on a small set of previous neurons</span>. An overly important single neuron would mean that the output of the network is relying on a single piece of evidence rather than looking at all the evidence and deciding from there. That single piece of evidence that the neuron learned to detect might be consistently present in the training set data but it might not be in other data which results in overfitting. With dropout no single neuron is guaranteed to be there when using the network on a training set entry so the network learns to avoid relying on a single neuron.</li><li>Dropout forces each neuron to <span style="font-weight:bold;">learn a function that is independently interpretable</span>. This means that a single neuron learns to detect the presence of, say, stripes, whilst another neuron learns to detect the presence of fur, and so on. Without dropout the network will tend to have multiple neurons that detect a single thing as a co-adapted group, which means that all of the neurons have to be activated when used on new data, making them less reliable. With dropout no pair of neurons is guaranteed to be there together when using the network on a training set entry so the neurons learn to avoid relying on each other.<br /><ul><li>If you're thinking that this contradicts the first point, keep in mind that these independent functions will be learned by several nodes since a single neuron is not guaranteed to be there. This means that <span style="font-weight:bold;">several versions of the same function might be learned</span>, adding to the function's reliability.</li></ul></li><li>Dropout <span style="font-weight:bold;">adds noise to the activations which forces the network to be able to handle incomplete input data</span>. Since the activations get corrupted by dropout then the network has to learn to deal with the error, which makes it more reliable.</li><li>As a continuation of the previous point, <span style="font-weight:bold;">the corruption is like new training data</span>. This is an example of an artificial expansion of the training set, which is used explicitly when training with images by, for example, cropping the images, adding noise pixels, blurring, and so on.</li><li>Apart from expanding the training set, dropout also <span style="font-weight:bold;">simulates an ensemble of networks</span>. An ensemble of networks is when you train several neural networks that are trained on the same training set and then use them all together on new data. The networks will likely give different output as they will not learn the same thing, but the outputs of all the networks together are used as votes to determine what the majority of the networks decided the output should be. Dropout simulates this by creating a new neural network for each training set entry, which is done by using different neurons for each entry. These networks will have some shared weights but this is good as it means that they occupy less memory and will give their output faster (since you're computing just one netork rather than many smaller ones). In fact the output of the whole network will the the average of all the simulated smaller ones.</li></ul><br />Which of these hypothesis is true will probably depend on what the network is learning and there are probably many more possible reasons for why dropout works.<br /><br />Moving on, how do you actually implement dropout? What we'll do is we create a vector mask of ones and zeros, with half the numbers being ones chosen uniformly randomly, and then multiply it by the vector of layer activations we want to apply dropout to. The resulting vector is what will be used as input to the next layer. The problem is that we want to use all the neurons at test time (when we finish training). If we do this then what happens is that the neurons of the next layer will receive inputs that are twice as large since the weights of the layer adapted to only half the neurons being there. This means that we need to reduce the activations by half during test time. This is not a very neat solution as we'd like to get rid of any extra processing at test time. Instead, we use "inverted dropout" where we double the activations of the neurons that are not dropped at training time. This will make the weights adapt to having inputs that are twice as large which will work fine with twice the number of neurons at normal activation amount. You can find this explanation in <a href="http://vision.stanford.edu/teaching/cs231n/slides/lecture6.pdf">these lecture notes</a> on page 44.<br /><br />Here is some Python code using numpy:<br /><pre class="brush:python">def apply_dropout(layer, dropout_prob=0.5):<br /> mask = np.random.binomial(n=1, p=1-dropout_prob, size=layer.shape)<br /> dropped_layer = layer * mask<br /> rescaled_dropped_layer = dropped_layer/(1-dropout_prob)<br /> return rescaled_dropped_layer<br /></pre><br />Just use the output of this function as input to the next layer. At test time you can either avoid using it altogether or use a dropout_prob of 0.0 which will have no effect. For implementing in Theano, you can see how it is applied in the function _dropout_from_layer in <a href="https://github.com/mdenil/dropout/blob/master/mlp.py">this source code</a>.<br /><br />As an extra piece of information, when using recurrent neural networks you'll also want to use dropout; however be careful as the paper titled <a href="http://arxiv.org/pdf/1409.2329v5.pdf">Recurrent Neural Network Regularization</a> explains that you shouldn't apply dropout on the recurrent layers but only on the feedforward layers (that is, before or after the RNN).mtantinoreply@blogger.com0