tag:blogger.com,1999:blog-43189444598231434732017-06-22T08:54:02.834+02:00Geeky is AwesomeTo learn is to teach.mtantinoreply@blogger.comBlogger90125tag:blogger.com,1999:blog-4318944459823143473.post-62609445381060570902017-05-30T09:28:00.000+02:002017-06-10T14:25:50.626+02:00Neural language models and how to make them in Tensorflow 1.0In this blog post I will explain the basics you need to know in order to create a neural language model in Tensorflow 1.0. This tutorial will cover how to create a training set from raw text, how to use LSTMs, how to work with variable length sentences, what teacher forcing is, and how to train the neural network with early stopping.<br /><br /><div class="sectitle">Introduction</div><br /><div class="subsectitle">Components and vocabulary</div><br />A neural language model is a neural network that, given a prefix of a sentence, will output the probability that a word will be the next word in the sentence. For example, given the prefix "the dog", the neural network will tell you that "barked" has a high probability of being the next word. This is done by using a recurrent neural network (RNN), such as a long short term memory (LSTM), to turn the prefix into a single vector (that represents the prefix) which is then passed to a softmax layer that gives a probability distribution over every word in the known vocabulary.<br /><br />In order to be able to still have prefix before the first word (to be able to predict a first word in the sentence) we will make use of an artificial "<BEG>" word that represents the beginning of a sentence and is placed before every sentence. Likewise we will have an artificial "<END>" word that represents the end of sentence in order to be able to predict when the sentence ends.<br /><br /><a href="https://3.bp.blogspot.com/-ZQvepKdplXw/WS0gHLNtzMI/AAAAAAAABEo/8dpOb7XBQhYVckxdGymdVEUUBFACGX6RgCLcB/s1600/neurallangmod.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-ZQvepKdplXw/WS0gHLNtzMI/AAAAAAAABEo/8dpOb7XBQhYVckxdGymdVEUUBFACGX6RgCLcB/s320/neurallangmod.png" width="320" height="143" data-original-width="808" data-original-height="360" /></a><br /><br />It is important to note that the words are presented to the neural network as vectors and not as actual strings. The vectors are called word "embeddings" and are derived from a matrix where each row stands for a different word in the vocabulary. This matrix is trained just like the neural network's weights in order to provide the best vector representation for each word.<br /><br /><div class="subsectitle">Training</div><br />The probabilities are not given explicitly during training. Instead we just expose the neural network to complete sentences taken from a corpus (we will use the Brown corpus from NLTK) and let it learn the probabilities indirectly. Before training, each word will have an approximately equal probability given any prefix. The training procedure works by inputting a prefix of a sentence at one end of the neural network and then forcing it to increase the probability of the given next word in the sentence by a little bit. Since the probabilities of all the words in the output must add up to 1, increasing the probability of the given next word will also decrease the probability of all the other words by a little bit. By repeatedly increasing the probability of the correct word, the distribution will accumulate at the correct word.<br /><br /><a href="https://1.bp.blogspot.com/-lhPryIrqDOk/WS0gL83Tm-I/AAAAAAAABEs/1xV76XcRkrEUhpJMkYi8G0qxxlBORA7tACLcB/s1600/neurallangmod-training.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-lhPryIrqDOk/WS0gL83Tm-I/AAAAAAAABEs/1xV76XcRkrEUhpJMkYi8G0qxxlBORA7tACLcB/s320/neurallangmod-training.png" width="320" height="105" data-original-width="932" data-original-height="305" /></a><br /><br />Of course given a prefix there will be several words that can follow and not just one. If each one is increased just a bit in turn repeatedly, all the known correct next words will increase together and all the other words will decrease, forcing the probability distribution to share the peak among all the correct words. If the correct words occur at different frequencies given the prefix, the most frequent word will get the most probability (due to increasing its probability more often) whilst the rest of the correct words will get a fraction of that probability in proportion to their relative frequencies.<br /><br />The most common way to train neural language models is not actually to use a training set of prefixes and next words, but to use full sentences. Upon being inputted with a sequence of vectors, an RNN will output another sequence of vectors. The nth output vector represents the prefix of the input sequence up to the nth input.<br /><br /><a href="https://3.bp.blogspot.com/-m6YO4Dzz87c/WS0gPjIcmqI/AAAAAAAABEw/QWZSPWogBB88OdeU32RrkMPamu6Sc0axgCLcB/s1600/neurallangmod-prefixes.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-m6YO4Dzz87c/WS0gPjIcmqI/AAAAAAAABEw/QWZSPWogBB88OdeU32RrkMPamu6Sc0axgCLcB/s320/neurallangmod-prefixes.png" width="320" height="83" data-original-width="981" data-original-height="254" /></a><br /><br />This means that we can present the neural network with a sentence, which gets turned into a sequence of vectors by the RNN, each of which gets turned into a probability distribution by the softmax. The training objective is to make the neural network predict which is the next word in the sentence after every prefix (including the end of sentence token at the end). This is done with all the sentences in the corpus so that the neural network is forced to extract some kind of pattern in the language of the sentences and be able to predict the next word in any sentence prefix.<br /><br /><div class="subsectitle">Teacher forcing</div><br />If we always provide a correct prefix during training, then we are using a training method called "teacher forcing", where the neural network is only trained to deal with correct prefixes. This is the simplest method (and the method we will be using in this blog post) but it also introduces a bias to the neural network as it might not always be exposed to correct prefixes. Let's say that the neural network is going to be used to generate sentences by, for example, picking the most probable word given the prefix and then adding the chosen word to the end of the prefix. We can repeatedly do this until we choose the end of sentence token, at which point we should have a complete sentence. The problem with teacher forcing is that if the neural network makes one mistake during the generation process and picks a non-sense word as a most probable next word, then the rest of the sentence will probably also be garbage as it was never trained on sentences with mistakes.<br /><br />One way to deal with this is to include not only prefixes in the training sentences by also prefixes with some of the words replaced by words chosen by the still-in-training neural network and force it to still give a higher probability to the correct next word. This is called <a href="https://arxiv.org/abs/1506.03099">scheduled sampling</a>. Another way to deal with this is to take the training prefixes and some generated prefixes (from the still-in-training neural net) and take their vector representation from the RNN. Generative adversarial training will then be used to make the RNN represent both groups of vectors similarly. This forces the RNN to be fault tolerant to prefixes with errors and to be able to represent them in a way that can lead to correct next words. This is called <a href="https://arxiv.org/abs/1610.09038">professor forcing</a>.<br /><br /><div class="sectitle">Full code listing</div><br />This is the full code that you can execute to get a Tensorflow neural language model:<br /><br /><div style="height:400px; overflow:scroll;"><pre class="brush:python">from __future__ import absolute_import, division, print_function, unicode_literals<br />from builtins import ascii, bytes, chr, dict, filter, hex, input, int, map, next, oct, open, pow, range, round, str, super, zip<br /><br />import tensorflow as tf<br />import numpy as np<br />import random<br />import timeit<br />import collections<br />import nltk<br /><br />TRAINING_SESSION = True<br /><br />rand = random.Random(0)<br />embed_size = 256<br />state_size = 256<br />max_epochs = 100<br />minibatch_size = 20<br />min_token_freq = 3<br /><br />run_start = timeit.default_timer()<br /><br />print('Loading raw data...')<br /><br />all_seqs = [ [ token.lower() for token in seq ] for seq in nltk.corpus.brown.sents() if 5 <= len(seq) <= 20 ]<br />rand.shuffle(all_seqs)<br />all_seqs = all_seqs[:20000]<br /><br />trainingset_size = round(0.9*len(all_seqs))<br />validationset_size = len(all_seqs) - trainingset_size<br />train_seqs = all_seqs[:trainingset_size]<br />val_seqs = all_seqs[-validationset_size:]<br /><br />print('Training set size:', trainingset_size)<br />print('Validation set size:', validationset_size)<br /><br />all_tokens = (token for seq in train_seqs for token in seq)<br />token_freqs = collections.Counter(all_tokens)<br />vocab = sorted(token_freqs.keys(), key=lambda token:(-token_freqs[token], token))<br />while token_freqs[vocab[-1]] < min_token_freq:<br /> vocab.pop()<br />vocab_size = len(vocab) + 2 # + edge and unknown tokens<br /><br />token_to_index = { token: i+2 for (i, token) in enumerate(vocab) }<br />index_to_token = { i+2: token for (i, token) in enumerate(vocab) }<br />edge_index = 0<br />unknown_index = 1<br /><br />print('Vocabulary size:', vocab_size)<br /><br />def parse(seqs):<br /> indexes = list()<br /> lens = list()<br /> for seq in seqs:<br /> indexes_ = [ token_to_index.get(token, unknown_index) for token in seq ]<br /> indexes.append(indexes_)<br /> lens.append(len(indexes_)+1) #add 1 due to edge token<br /> <br /> maxlen = max(lens)<br /> <br /> in_mat = np.zeros((len(indexes), maxlen))<br /> out_mat = np.zeros((len(indexes), maxlen))<br /> for (row, indexes_) in enumerate(indexes):<br /> in_mat [row,:len(indexes_)+1] = [edge_index]+indexes_<br /> out_mat[row,:len(indexes_)+1] = indexes_+[edge_index]<br /> return (in_mat, out_mat, np.array(lens))<br /> <br />(train_seqs_in, train_seqs_out, train_seqs_len) = parse(train_seqs)<br />(val_seqs_in, val_seqs_out, val_seqs_len) = parse(val_seqs)<br /><br />print('Training set max length:', train_seqs_in.shape[1]-1)<br />print('Validation set max length:', val_seqs_in.shape[1]-1)<br /><br />################################################################<br />print()<br />print('Training...')<br /><br />#Full correct sequence of token indexes with start token but without end token.<br />seq_in = tf.placeholder(tf.int32, shape=[None, None], name='seq_in') #[seq, token]<br /><br />#Length of sequences in seq_in.<br />seq_len = tf.placeholder(tf.int32, shape=[None], name='seq_len') #[seq]<br />tf.assert_equal(tf.shape(seq_in)[0], tf.shape(seq_len)[0])<br /><br />#Full correct sequence of token indexes without start token but with end token.<br />seq_target = tf.placeholder(tf.int32, shape=[None, None], name='seq_target') #[seq, token]<br />tf.assert_equal(tf.shape(seq_in), tf.shape(seq_target))<br /><br />batch_size = tf.shape(seq_in)[0] #Number of sequences to process at once.<br />num_steps = tf.shape(seq_in)[1] #Number of tokens in generated sequence.<br /><br />#Mask of which positions in the matrix of sequences are actual labels as opposed to padding.<br />token_mask = tf.cast(tf.sequence_mask(seq_len, num_steps), tf.float32) #[seq, token]<br /><br />with tf.variable_scope('prefix_encoder'):<br /> #Encode each sequence prefix into a vector.<br /> <br /> #Embedding matrix for token vocabulary.<br /> embeddings = tf.get_variable('embeddings', [ vocab_size, embed_size ], tf.float32, tf.contrib.layers.xavier_initializer()) #[vocabulary token, token feature]<br /> <br /> #3D tensor of tokens in sequences replaced with their corresponding embedding vector.<br /> embedded = tf.nn.embedding_lookup(embeddings, seq_in) #[seq, token, token feature]<br /> <br /> #Use an LSTM to encode the generated prefix.<br /> init_state = tf.contrib.rnn.LSTMStateTuple(c=tf.zeros([ batch_size, state_size ]), h=tf.zeros([ batch_size, state_size ]))<br /> cell = tf.contrib.rnn.BasicLSTMCell(state_size)<br /> prefix_vectors = tf.nn.dynamic_rnn(cell, embedded, sequence_length=seq_len, initial_state=init_state, scope='rnn')[0] #[seq, prefix, prefix feature]<br /> <br />with tf.variable_scope('softmax'):<br /> #Output a probability distribution over the token vocabulary (including the end token).<br /> <br /> W = tf.get_variable('W', [ state_size, vocab_size ], tf.float32, tf.contrib.layers.xavier_initializer())<br /> b = tf.get_variable('b', [ vocab_size ], tf.float32, tf.zeros_initializer())<br /> logits = tf.reshape(tf.matmul(tf.reshape(prefix_vectors, [ -1, state_size ]), W) + b, [ batch_size, num_steps, vocab_size ])<br /> predictions = tf.nn.softmax(logits) #[seq, prefix, token]<br /><br />losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_target, logits=logits) * token_mask<br />total_loss = tf.reduce_sum(losses)<br />train_step = tf.train.AdamOptimizer().minimize(total_loss)<br />saver = tf.train.Saver()<br /><br />sess = tf.Session()<br /><br />if TRAINING_SESSION:<br /> sess.run(tf.global_variables_initializer())<br /><br /> print('epoch', 'val loss', 'duration', sep='\t')<br /><br /> epoch_start = timeit.default_timer()<br /> <br /> validation_loss = 0<br /> for i in range(len(val_seqs)//minibatch_size):<br /> minibatch_validation_loss = sess.run(total_loss, feed_dict={<br /> seq_in: val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],<br /> seq_len: val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],<br /> seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],<br /> })<br /> validation_loss += minibatch_validation_loss<br /> <br /> print(0, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')<br /> last_validation_loss = validation_loss<br /><br /> saver.save(sess, './model')<br /><br /> trainingset_indexes = list(range(len(train_seqs)))<br /> for epoch in range(1, max_epochs+1):<br /> epoch_start = timeit.default_timer()<br /> <br /> rand.shuffle(trainingset_indexes)<br /> for i in range(len(trainingset_indexes)//minibatch_size):<br /> minibatch_indexes = trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]<br /> sess.run(train_step, feed_dict={<br /> seq_in: train_seqs_in [minibatch_indexes],<br /> seq_len: train_seqs_len[minibatch_indexes],<br /> seq_target: train_seqs_out[minibatch_indexes],<br /> })<br /> <br /> validation_loss = 0<br /> for i in range(len(val_seqs)//minibatch_size):<br /> minibatch_validation_loss = sess.run(total_loss, feed_dict={<br /> seq_in: val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],<br /> seq_len: val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],<br /> seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],<br /> })<br /> validation_loss += minibatch_validation_loss<br /><br /> if validation_loss > last_validation_loss:<br /> break<br /> last_validation_loss = validation_loss<br /> <br /> saver.save(sess, './model')<br /> <br /> print(epoch, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')<br /><br /> print(epoch, round(validation_loss, 3), round(timeit.default_timer() - epoch_start), sep='\t')<br /><br />################################################################<br />print()<br />print('Evaluating...')<br /><br />saver.restore(sess, tf.train.latest_checkpoint('.'))<br /><br />def seq_prob(seq):<br /> seq_indexes = [ token_to_index.get(token, unknown_index) for token in seq ]<br /> outputs = sess.run(predictions, feed_dict={<br /> seq_in: [ [ edge_index ] + seq_indexes ],<br /> seq_len: [ 1+len(seq) ],<br /> })[0]<br /> probs = outputs[np.arange(len(outputs)), seq_indexes+[ edge_index ]]<br /> return np.prod(probs)<br /><br />print('P(the dog barked.) =', seq_prob(['the', 'dog', 'barked', '.']))<br />print('P(the cat barked.) =', seq_prob(['the', 'cat', 'barked', '.']))<br />print()<br /><br />def next_tokens(prefix):<br /> prefix_indexes = [ token_to_index.get(token, unknown_index) for token in prefix ]<br /> probs = sess.run(predictions, feed_dict={<br /> seq_in: [ [ edge_index ] + prefix_indexes ],<br /> seq_len: [ 1+len(prefix) ],<br /> })[0][-1]<br /> token_probs = list(zip(probs, ['<end>', '<unk>']+vocab))<br /> return token_probs<br /><br />print('the dog ...', sorted(next_tokens(['the', 'dog']), reverse=True)[:5])<br />print()<br /><br />def greedy_gen():<br /> prefix = []<br /> for _ in range(100):<br /> probs = sorted(next_tokens(prefix), reverse=True)<br /> (_, next_token) = probs[0]<br /> if next_token == '<unk>':<br /> (_, next_token) = probs[1]<br /> elif next_token == '<end>':<br /> break<br /> else:<br /> prefix.append(next_token)<br /> return prefix<br /><br />print('Greedy generation:', ' '.join(greedy_gen()))<br />print()<br /><br />def random_gen():<br /> prefix = []<br /> for _ in range(100):<br /> probs = next_tokens(prefix)<br /> (unk_prob, _) = probs[unknown_index]<br /> <br /> r = rand.random() * (1.0 - unk_prob)<br /> total = 0.0<br /> for (prob, token) in probs:<br /> if token != '<unk>':<br /> total += prob<br /> if total >= r:<br /> break<br /> if token == '<end>':<br /> break<br /> else:<br /> prefix.append(token)<br /> return prefix<br /><br />print('Random generation:', ' '.join(random_gen()))<br />print()<br /></pre></div><br /><div class="sectitle">Code explanation</div><br />This section explains snippets of the code.<br /><br /><div class="subsectitle">Preprocessing</div><br />The first thing we need to do is create the training and validation sets. We will use the Brown corpus from NLTK as data. Since the purpose of this tutorial is to quickly train a neural language model on a normal computer, we will work with a subset of the corpus so that training will be manageable. We will only take sentences that are between 5 and 20 tokens long and only use a random sample of 20000 sentences from this pool. From this we will take a random 10% to be used for the validation set and the rest will be used for the training set.<br /><br /><pre class="brush:python">all_seqs = [ [ token.lower() for token in seq ] for seq in nltk.corpus.brown.sents() if 5 <= len(seq) <= 20 ]<br />rand.shuffle(all_seqs)<br />all_seqs = all_seqs[:20000]<br /><br />trainingset_size = round(0.9*len(all_seqs))<br />validationset_size = len(all_seqs) - trainingset_size<br />train_seqs = all_seqs[:trainingset_size]<br />val_seqs = all_seqs[-validationset_size:]<br /></pre><br />Next we need to get the vocabulary from the training set. This will consist of all the words that occur frequently in the training set sentences, with the rare words being replaced by an "unknown" token. This will allow the neural network to be able to work with out-of-vocabulary words as they will be represented as "<unk>" and the network will ave seen this token in the training sentences. Each vocabulary word will be represented by an index, with "0" representing the beginning and end token, "1" representing the unknown token, "2" representing the most frequent vocabulary word, "3" the second most frequent word, and so on.<br /><br /><pre class="brush:python">all_tokens = (token for seq in train_seqs for token in seq)<br />token_freqs = collections.Counter(all_tokens)<br />vocab = sorted(token_freqs.keys(), key=lambda token:(-token_freqs[token], token))<br />while token_freqs[vocab[-1]] < min_token_freq:<br /> vocab.pop()<br />vocab_size = len(vocab) + 2 # + edge and unknown tokens<br /><br />token_to_index = { token: i+2 for (i, token) in enumerate(vocab) }<br />index_to_token = { i+2: token for (i, token) in enumerate(vocab) }<br />edge_index = 0<br />unknown_index = 1<br /></pre><br />Finally we need to turn all the sentences in the training and validation set into a matrix of indexes where each row represents a sentence. Since different sentences have different lengths, we will make the matrix as wide as the longest sentence and use 0 to pad each sentence into uniform length (pads are added to the end of the sentences). To know where a sentence ends we will also keep track of the sentence lengths. For training we will need to use an input matrix and a target matrix. The input matrix contains sentences that start with the start-of-sentence token in every row whilst the target matrix contains sentences that end with the end-of-sentence token in every row. The former is used to pass as input to the neural net during training whilst the latter is used to tell which is the correct output of each sentence prefix.<br /><br /><pre class="brush:python">def parse(seqs):<br /> indexes = list()<br /> lens = list()<br /> for seq in seqs:<br /> indexes_ = [ token_to_index.get(token, unknown_index) for token in seq ]<br /> indexes.append(indexes_)<br /> lens.append(len(indexes_)+1) #add 1 due to edge token<br /> <br /> maxlen = max(lens)<br /> <br /> in_mat = np.zeros((len(indexes), maxlen))<br /> out_mat = np.zeros((len(indexes), maxlen))<br /> for (row, indexes_) in enumerate(indexes):<br /> in_mat [row,:len(indexes_)+1] = [edge_index]+indexes_<br /> out_mat[row,:len(indexes_)+1] = indexes_+[edge_index]<br /> return (in_mat, out_mat, np.array(lens))<br /> <br />(train_seqs_in, train_seqs_out, train_seqs_len) = parse(train_seqs)<br />(val_seqs_in, val_seqs_out, val_seqs_len) = parse(val_seqs)<br /></pre><br /><div class="subsectitle">Model definition</div><br />The Tensorflow neural network model we shall implement will accept a batch of sentences at once. This means that the input will be a matrix of integers where each row is a sentence and each integer is a word index. Since both the number of sentences and the sentence length are variable we will use "None" as a dimension size. We will then get the size using the "tf.shape" function. The sentences on their own are not enough as an input as we also need to provide the sentence lengths as a vector. The length of this vector needs to be as long as the number of rows in the sequences (one for each sentence). These lengths will be used to generate a mask matrix of 0s and 1s where a "1" indicates the presence of a token in the sequence matrix whilst a "0" indicates the presence of a pad token. This is generated by the "tf.sequence_mask" function.<br /><br /><pre class="brush:python">#Full correct sequence of token indexes with start token but without end token.<br />seq_in = tf.placeholder(tf.int32, shape=[None, None], name='seq_in') #[seq, token]<br /><br />#Length of sequences in seq_in.<br />seq_len = tf.placeholder(tf.int32, shape=[None], name='seq_len') #[seq]<br />tf.assert_equal(tf.shape(seq_in)[0], tf.shape(seq_len)[0])<br /><br />#Full correct sequence of token indexes without start token but with end token.<br />seq_target = tf.placeholder(tf.int32, shape=[None, None], name='seq_target') #[seq, token]<br />tf.assert_equal(tf.shape(seq_in), tf.shape(seq_target))<br /><br />batch_size = tf.shape(seq_in)[0] #Number of sequences to process at once.<br />num_steps = tf.shape(seq_in)[1] #Number of tokens in generated sequence.<br /><br />#Mask of which positions in the matrix of sequences are actual labels as opposed to padding.<br />token_mask = tf.cast(tf.sequence_mask(seq_len, num_steps), tf.float32) #[seq, token]<br /></pre><br />Now that the inputs are handled we need to process the prefixes of the sequences into prefix vectors using an LSTM. We first convert the token indexes into token vectors. This is done using an embedding matrix where the first row of the matrix is the word vector of the first word in the vocabulary, the second for the second word, and so on. This is then passed to an LSTM that is initialized with zero-vectors for both the cell state and the hidden state. We pass in the sequence lengths to keep the RNN from interpreting pad values.<br /><br /><pre class="brush:python">with tf.variable_scope('prefix_encoder'):<br /> #Encode each sequence prefix into a vector.<br /> <br /> #Embedding matrix for token vocabulary.<br /> embeddings = tf.get_variable('embeddings', [ vocab_size, embed_size ], tf.float32, tf.contrib.layers.xavier_initializer()) #[vocabulary token, token feature]<br /> <br /> #3D tensor of tokens in sequences replaced with their corresponding embedding vector.<br /> embedded = tf.nn.embedding_lookup(embeddings, seq_in) #[seq, token, token feature]<br /> <br /> #Use an LSTM to encode the generated prefix.<br /> init_state = tf.contrib.rnn.LSTMStateTuple(c=tf.zeros([ batch_size, state_size ]), h=tf.zeros([ batch_size, state_size ]))<br /> cell = tf.contrib.rnn.BasicLSTMCell(state_size)<br /> prefix_vectors = tf.nn.dynamic_rnn(cell, embedded, sequence_length=seq_len, initial_state=init_state, scope='rnn')[0] #[seq, prefix, prefix feature]<br /></pre><br />Next we need to take the prefix vectors and derive probabilities for the next word in every prefix. Since the prefix vectors are a 3D tensor and the weights matrix is a 2D tensor we have to first reshape the prefix vectors into a 2D tensor before multiplying them together. Following this we will then reshape the resulting 2D tensor back into a 3D tensor and apply softmax to it.<br /><br /><pre class="brush:python">with tf.variable_scope('softmax'):<br /> #Output a probability distribution over the token vocabulary (including the end token).<br /> <br /> W = tf.get_variable('W', [ state_size, vocab_size ], tf.float32, tf.contrib.layers.xavier_initializer())<br /> b = tf.get_variable('b', [ vocab_size ], tf.float32, tf.zeros_initializer())<br /> logits = tf.reshape(tf.matmul(tf.reshape(prefix_vectors, [ -1, state_size ]), W) + b, [ batch_size, num_steps, vocab_size ])<br /> predictions = tf.nn.softmax(logits) #[seq, prefix, token]<br /></pre><br /><div class="subsectitle">Training</div><br />Now comes the part that has to do with training the model. We need to add the loss function which is masked to ignore padding tokens. We will take the sum crossentropy and apply the Adam optimizer to it. Crossentropy measures how close to 1.0 the probability of the correct next word in the softmax is. This allows us to maximize the correct probability which is done by using Adam, an optimization technique that works using gradient descent but with the learning rate being adapted for every weight. We would also like to be able to save our model weights during training in order to reuse them later. Finally we create a Tensorflow session variable and use it to initialize all the model weights.<br /><br /><pre class="brush:python">losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=seq_target, logits=logits) * token_mask<br />total_loss = tf.reduce_sum(losses)<br />train_step = tf.train.AdamOptimizer().minimize(total_loss)<br />saver = tf.train.Saver()<br /><br />sess = tf.Session()<br /> <br />sess.run(tf.global_variables_initializer())<br /></pre><br />The second thing we should do is measure how well the randomly initialised neural net performs in terms of the sum crossentropy. In order to avoid running out of memory, instead of putting all the validation set data as input to the neural net we will break it up into minibatches of a fixed size and find the total loss of all the minibatches together. This final loss value will be placed in a variable called "last_validation_loss" in order to be able to track how the loss is progressing as training goes on. The last line is to save the weights of the neural network in the same folder as the code and give the files (there will be several files with different extensions) the file name "model".<br /><br /><pre class="brush:python">validation_loss = 0<br />for i in range(len(val_seqs)//minibatch_size):<br /> minibatch_validation_loss = sess.run(total_loss, feed_dict={<br /> seq_in: val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],<br /> seq_len: val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],<br /> seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],<br /> })<br /> validation_loss += minibatch_validation_loss<br /><br />last_validation_loss = validation_loss<br /><br />saver.save(sess, './model')<br /></pre><br />Next we'll do the same thing but on the training set and we'll run the "train_step" operation instead of the "total_loss" operation in order to actually optimize the weights into a smaller "total_loss" value. It is more beneficial to take randomly sampled minibatches instead of just breaking the training set into deterministically chosen groups, so we use an array of indexes called "trainingset_indexes" to determine which training pairs will make it to the next minibatch. We randomly shuffle these indexes and then break it into fixed size groups. The indexes in the next group are used to choose the training pairs are used for the next minibatch. Following this we will again calculate the new loss value on the validation set to see how we're progressing. If the new loss is worse than the previous loss then we stop the training. Otherwise we save the model weights and continue training. This is called early stopping. There is a hard limit set to the number of epochs to run in order to avoid training for too long.<br /><br /><pre class="brush:python">trainingset_indexes = list(range(len(train_seqs)))<br />for epoch in range(1, max_epochs+1):<br /> rand.shuffle(trainingset_indexes)<br /> for i in range(len(trainingset_indexes)//minibatch_size):<br /> minibatch_indexes = trainingset_indexes[i*minibatch_size:(i+1)*minibatch_size]<br /> sess.run(train_step, feed_dict={<br /> seq_in: train_seqs_in [minibatch_indexes],<br /> seq_len: train_seqs_len[minibatch_indexes],<br /> seq_target: train_seqs_out[minibatch_indexes],<br /> })<br /> <br /> validation_loss = 0<br /> for i in range(len(val_seqs)//minibatch_size):<br /> minibatch_validation_loss = sess.run(total_loss, feed_dict={<br /> seq_in: val_seqs_in [i*minibatch_size:(i+1)*minibatch_size],<br /> seq_len: val_seqs_len[i*minibatch_size:(i+1)*minibatch_size],<br /> seq_target: val_seqs_out[i*minibatch_size:(i+1)*minibatch_size],<br /> })<br /> validation_loss += minibatch_validation_loss<br /><br /> if validation_loss > last_validation_loss:<br /> break<br /> last_validation_loss = validation_loss<br /> <br /> saver.save(sess, './model')<br /></pre><br /><div class="subsectitle">Application</div><br />We can now use the trained neural network. We first restore the last saved model weights which are the ones that gave the best validation loss and then we will define two functions: one for getting the probability of a whole sequence and the other for getting the next token after a prefix. "seq_prob" works by getting every prefix's softmax output, getting the probability of each token in the sequence after each prefix and then multiplying them together. "next_tokens" works by passing a prefix to the neural network and only getting the softmax output of the last (longest) prefix. The probabilities and corresponding tokens are returned.<br /><br /><pre class="brush:python">saver.restore(sess, tf.train.latest_checkpoint('.'))<br /><br />def seq_prob(seq):<br /> seq_indexes = [ token_to_index.get(token, unknown_index) for token in seq ]<br /> outputs = sess.run(predictions, feed_dict={<br /> seq_in: [ [ edge_index ] + seq_indexes ],<br /> seq_len: [ 1+len(seq) ],<br /> })[0]<br /> probs = outputs[np.arange(len(outputs)), seq_indexes+[ edge_index ]]<br /> return np.prod(probs)<br /><br />print('P(the dog barked.) =', seq_prob(['the', 'dog', 'barked', '.']))<br />print('P(the cat barked.) =', seq_prob(['the', 'cat', 'barked', '.']))<br />print()<br /><br />def next_tokens(prefix):<br /> prefix_indexes = [ token_to_index.get(token, unknown_index) for token in prefix ]<br /> probs = sess.run(predictions, feed_dict={<br /> seq_in: [ [ edge_index ] + prefix_indexes ],<br /> seq_len: [ 1+len(prefix) ],<br /> })[0][-1]<br /> token_probs = list(zip(probs, ['<end>', '<unk>']+vocab))<br /> return token_probs<br /><br />print('the dog ...', sorted(next_tokens(['the', 'dog']), reverse=True)[:5])<br />print()<br /></pre><br />These are the outputs I got:<br /><br /><pre>P(the dog barked.) = 1.71368e-08<br />P(the cat barked.) = 6.16375e-10<br /><br />the dog ... [(0.097657956, 'was'), (0.089791521, '<unk>'), (0.058101207, 'is'), (0.055007596, 'had'), (0.02786131, 'could')]<br /></pre><br />We can extend "next_tokens" to generate whole sentences using one of two ways: generating the most probable sentence or generating a randomly sampled sentence. For the first we are going to use greedy search which chooses the most probable word given a prefix and adds it to the prefix. This will not give the most probable sentence but it should be close (use <a href="http://geekyisawesome.blogspot.com.mt/2016/10/using-beam-search-to-generate-most.html">beam search</a> for a better selection method). For the second function we want to choose words at random but based on their probability such that rare words are rarely chosen. This is called sampling sentences (the probability of sampling a particular sentence is equal to the probability of the sentence). We did this using <a href="https://en.wikipedia.org/wiki/Fitness_proportionate_selection">roulette selection</a>. For both functions we left out the unknown token during generation and gave a hard maximum word limit of 100 words.<br /><br /><pre class="brush:python">def greedy_gen():<br /> prefix = []<br /> for _ in range(100):<br /> probs = sorted(next_tokens(prefix), reverse=True)<br /> (_, next_token) = probs[0]<br /> if next_token == '<unk>':<br /> (_, next_token) = probs[1]<br /> elif next_token == '<end>':<br /> break<br /> else:<br /> prefix.append(next_token)<br /> return prefix<br /><br />print('Greedy generation:', ' '.join(greedy_gen()))<br />print()<br /><br />def random_gen():<br /> prefix = []<br /> for _ in range(100):<br /> probs = next_tokens(prefix)<br /> (unk_prob, _) = probs[unknown_index]<br /> <br /> r = rand.random() * (1.0 - unk_prob)<br /> total = 0.0<br /> for (prob, token) in probs:<br /> if token != '<unk>':<br /> total += prob<br /> if total >= r:<br /> break<br /> if token == '<end>':<br /> break<br /> else:<br /> prefix.append(token)<br /> return prefix<br /><br />print('Random generation:', ' '.join(random_gen()))<br />print()<br /></pre><br />These are the outputs I got:<br /><br /><pre>Greedy generation: `` i don't know '' .<br /><br />Random generation: miss the road place , title comes to the seeds of others to many and details under the dominant home<br /></pre>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-46466574370919646012017-04-03T20:47:00.002+02:002017-04-03T20:54:08.308+02:00Getting the top n most probable sentences using beam searchThis is a continuation from a <a href="http://geekyisawesome.blogspot.com.mt/2016/10/using-beam-search-to-generate-most.html">previous blog post</a> on single sentence beam search.<br /><br />Sometimes it is not enough to just generate the most probable sentence using a language model. Sometimes you want to generate the top 3 most probable sentences instead. In that case we need to modify our beam search a bit. We will make the function a generator that returns a sequence of sentences in order of probability instead of just returning a single most probable sentence. Here are the changes we need to make:<br /><br />In the single sentence version, we were getting the most probable prefix in the current beam and checking if it is complete. If it is, then we return it and stop there. Instead, we will now not stop until the current beam is empty (or until the caller stops requesting for more sentences). After returning the most probable prefix we will check the second most probable prefix and keep on returning complete prefixes until we either find one which is not complete or we return all the beam. In the case that we return the whole beam then the algorithm stops there as there is nothing left with which to generate new prefixes. This means that the beam width gives a limit on the number of sentences that can be returned. If we do not return all the beam then we continue generating prefixes with the remainder.<br /><br />In the case that some complete sentences were returned, they need to also be removed from the beam before we continue generating. Since the beam is implemented as a min-first heap queue (min-first because we want to pop the least probable prefix quickly when the beam becomes bigger than the beam width) then we cannot remove the highest probable complete sentence quickly as well. In order to do this, we first turn the heap into a list which is sorted by probability and then start popping out the items at the end if they are complete sentences. Following this, we will then heapify the list back into a min-first heap queue and continue as normal. This sorting and reheapifying should not impact on the performance too much if the beam width is relatively small.<br /><br />If the clip length is reached then the whole beam is immediately returned in order of probability. This is because as soon as one prefix is equal to the allowed maximum then that means that the entire beam consists of<br /><ol><li>incomplete sentences that are also as long as the allowed maximum (since all the prefixes grow together)</li><li>complete sentences that were found before but which do not have a maximum probability</li></ol>After returning all the incomplete sentences (they have to be returned since they cannot be extended further) then the complete sentences will also be returned as they will become the most probable sentences of what is left in the beam. The assumption is that if the complete sentence was a nonsensical one, then it wouldn't have remained in the beam so might as well return it as well rather than lose it.<br /><br />Here is the modified Python 3 code:<br /><br /><pre class="brush:python">import heapq<br /><br />class Beam(object):<br /><br /> def __init__(self, beam_width, init_beam=None):<br /> if init_beam is None:<br /> self.heap = list()<br /> else:<br /> self.heap = init_beam<br /> heapq.heapify(self.heap) #make the list a heap<br /> self.beam_width = beam_width<br /><br /> def add(self, prob, complete, prefix):<br /> heapq.heappush(self.heap, (prob, complete, prefix))<br /> if len(self.heap) > self.beam_width:<br /> heapq.heappop(self.heap)<br /><br /> def __iter__(self):<br /> return iter(self.heap)<br /><br /><br />def beamsearch(probabilities_function, beam_width=10, clip_len=-1):<br /> prev_beam = Beam(beam_width)<br /> prev_beam.add(1.0, False, [ '<start>' ])<br /><br /> while True:<br /> curr_beam = Beam(beam_width)<br /><br /> #Add complete sentences that do not yet have the best<br />probability to the current beam, the rest prepare to add more words to<br />them.<br /> for (prefix_prob, complete, prefix) in prev_beam:<br /> if complete == True:<br /> curr_beam.add(prefix_prob, True, prefix)<br /> else:<br /> #Get probability of each possible next word for the<br />incomplete prefix.<br /> for (next_prob, next_word) in probabilities_function(prefix):<br /> if next_word == '<end>': #if next word is the end<br />token then mark prefix as complete and leave out the end token<br /> curr_beam.add(prefix_prob*next_prob, True, prefix)<br /> else: #if next word is a non-end token then mark<br />prefix as incomplete<br /> curr_beam.add(prefix_prob*next_prob, False,<br />prefix+[next_word])<br /><br /> sorted_beam = sorted(curr_beam) #get all prefixes in current beam sorted by probability<br /> any_removals = False<br /> while True:<br /> (best_prob, best_complete, best_prefix) = sorted_beam[-1] #get highest probability prefix<br /> if best_complete == True or len(best_prefix)-1 ==<br />clip_len: #if most probable prefix is a complete sentence or has a length that exceeds the clip length (ignoring the start token) then yield it<br /> yield (best_prefix[1:], best_prob) #yield best sentence without the start token and together with its probability<br /> sorted_beam.pop() #remove the yielded sentence and check the next highest probability prefix<br /> any_removals = True<br /> if len(sorted_beam) == 0: #if there are no more sentences in the beam then stop checking<br /> break<br /> else:<br /> break<br /><br /> if any_removals == True: #if there were any changes in the current beam then...<br /> if len(sorted_beam) == 0: #if there are no more prefixes in the current beam (due to clip length being reached) then end the beam search<br /> break<br /> else: #otherwise set the previous beam to the modified current beam<br /> prev_beam = Beam(beam_width, sorted_beam)<br /> else: #if the current beam was left unmodified then assign it to the previous beam as is<br /> prev_beam = curr_beam<br /></pre>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-55773293215448934652017-03-27T16:43:00.001+02:002017-03-27T16:44:11.140+02:00Word embeddings: How word2vec and GloVe workWord embeddings are vectors that represent words. For example the word "dog" might be represented as [0.1, -2.1, 1.2] whilst "cat" might be represented as [0.2, 2.4, 1.1]. These vectors are important in neural networks because neural networks can only work with continuous numbers whereas words are discrete symbols.<br /><br />One way to represent words as vectors is by taking a finite and fixed vocabulary of words that you want to work with, put the words in the vocabulary in a certain order such as alphabetical, and then use one hot vectors to represent the words. A one hot vector is a vector consisting of zeros everywhere and single one somewhere, such as [0, 0, 1, 0]. The position of the one indicates the position of the word in the vocabulary. The one hot vector [1, 0, 0] represents the first word in a three word vocabulary. You can then feed this vector as an input to a neural network and continue as usual.<br /><br />One hot vectors are a very wasteful representation in both space and time. They are wasteful in space because you need a vector as large as your vocabulary to represent each word. They are wasteful in time because when computing activation values of the first layer in the neural network, the weight matrix multiplication is going to involve multiplying by a lot of zeros. This is what the matrix multiplication looks like when then first layer outputs two activation values:<br /><br /><pre>(1 0 0) (1 2) = (1*1 + 0*3 + 0*5 1*2 + 0*4 + 0*6) = (1 2)<br /> (3 4)<br /> (5 6)<br /></pre><br />In fact, if you think about it, all you need is to pick the row in the matrix corresponding to where the one is in the one hot vector. What we usually do in fact is to have an "embedding layer" which is just a matrix with as many rows as the vocabulary size and as many columns as you want your embedded vectors to be big. We then simply pick the row according to the word we want to pass to the neural network. This embedding matrix is then optimized together with the rest of the neural network and the selected vectors passed as input to the next layer as usual.<br /><br />This has an interesting result. The vectors tend to become similar for similar words, that is, the more similar two words are, the larger the cosine similarity of their corresponding vectors. There are also some other interesting properties such as analogies (asking "man is to king as woman is to...?") and odd word out exercises (which word from these five does not fit with the others?). It seems that the vectors are not just made different for different words but are made different to different degrees according to the differences between the words. If two words are used similarly in certain contexts but not others (such as being similar in gender but not in other cases), then that similarity will be encoded in the vectors. You can see more about these properties in <a href="https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/">https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/</a>.<br /><br />Of course there are systems for creating generic word embeddings that are useful for many natural language processing applications. The two most popular generic embeddings are <a href="https://en.wikipedia.org/wiki/Word2vec">word2vec</a> and <a href="https://nlp.stanford.edu/projects/glove/">GloVe</a>.<br /><br />word2vec is based on one of two flavours: The continuous bag of words model (CBOW) and the skip-gram model. CBOW is a neural network that is trained to predict which word fits in a gap in a sentence. For example, given the partial sentence "the cat ___ on the", the neural network predicts that "sat" has a high probability of filling the gap. The order of the words is not important: given four words, the neural network predicts the word in the middle. The embedding layer of the neural network is used to represent words in general. On the other hand, the skip-gram model works in the other way round. Given a middle word it predicts which words surround it. Of course by "predicts" we mean that it outputs a probability for each word in the vocabulary. These tasks are meant to force the neural network to create embeddings that reflect the relationship between words, hence bestowing them with meaning.<br /><br />GloVe takes a different approach. Instead of extracting the embeddings from a neural network that is designed to perform a surrogate task (predicting neighbouring words), the embeddings are optimized directly so that the dot product of two word vectors equals the log of the number of times the two words will occur near each other (within 5 words for example). For example if "dog" and "cat" occur near each other 10 times in a corpus, then vec(dog) dot vec(cat) = log(10). This forces the vectors to somehow encode the frequency distribution of which words occur near them.<br /><br />Which is better? It probably depends on your data. The nice thing about both of these methods is that you can train your own word vectors based on your own corpus so that the meaning that gets encoded into the vectors becomes specific to your own domain.<br /><br />Download word2vec code from <a href="https://github.com/dav/word2vec">here</a>.<br />Download GloVe code from <a href="https://github.com/stanfordnlp/GloVe">here</a>.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-68025612774432530112017-02-28T22:54:00.000+01:002017-03-01T16:39:23.701+01:00Neural network deep learning hard attention using REINFORCE / score function estimator / likelihood ratio estimator<script type="text/x-mathjax-config"> MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$'], ['\\(','\\)']]}}); </script><br /><script type="text/javascript" async="async" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_CHTML"></script><br /><br />One of the things I struggled the most to understand with papers on deep learning is when they mention hard attention. Let's start with a description of what attention is first.<br /><br />Attention based learning is when you have a sequence input where only part of it is relevant and you want your neural network to explicitly choose what is relevant. For example, if you are translating a sentence from English to French, only some of the words in the English sentence are relevant to generating the next word in the French sentence. So part of your neural network would be dedicated to weighing the importance of each English word in the context of what the neural network intends to generate next in the French sentence. This attention module would work by taking two inputs:<br /><ul><li>an embedded word vector from the English sentence</li><li>the current state of the recurrent neural network that is encoding what has been generated up to now</li></ul>Given these two inputs the attention module would then output a single number between 0 and 1 which quantifies how relevant the given word is given what the neural network intends to generate next (which is encoded in the RNN state). This weighting would then be multiplied by the word vector in order to shrink it's magnitude and then the weighted sum of the word vectors would be used to encode the English sentence. This sentence vector would represent the information in the sentence that is relevant to generating the next French word. In this way, the neural network does not need to encode the whole sentence and use the same representation for generating all the French sentence. The English sentence would be attended to in different ways for generating each French word.<br /><br />This technique has not only been used for machine translation but also for image captioning (only attend to parts of the image for every word being generated) and neural turing machines (only attend to parts of the tape with every operation). Just look at what <a href="http://distill.pub/2016/augmented-rnns/">Colah</a> has to say about it. However this is just soft attention, which is easy. It's called soft attention because you still partially attend to every part of the input; it's just some parts get very little attention. This means that it's still possible to measure the effect of a part of an input and so it's still possible to determine, by gradient descent, whether its attention should be increased or decreased.<br /><br />Hard attention, on the other hand, either completely includes or excludes elements of the input. How would you do this in a neural network? You'd need to use thresholding for example, where if the value of a neuron is above a threshold then it outputs a one, zero otherwise. You can also use argmax which selects the neuron with the greatest value and sets it to one and all the others to zero. Unfortunately both of these solutions are undifferentiable and that makes direct gradient descent ineffective if they are used in a hidden layer (in an output layer you can just maximize their continuous values and then threshold them at test time). It would be the same situation neural networks had in the past when they couldn't have a hidden layer because there was no known optimization algorithm for 2 layers of thresholding neurons.<br /><br />It would be nice to somehow make neurons that have discrete outputs but which can still be trained by gradient descent. One way to do this is to use stochastic neurons which output discrete values based on a probability that can be tweaked. The simplest probabilistic neuron is the Bernoulli neuron which outputs a 1 with probability "p" and a 0 otherwise. Let's assume a simple attention based neural network that uses one Bernoulli neuron as an attention module and one normal sigmoid neuron as a feature extraction neuron. The Bernoulli neuron's output is multiplied by the normal neuron's output in order to gate it. Both neurons take the same single number as input.<br /><br /><a href="https://2.bp.blogspot.com/-BeExEkIkfH8/WLZ1H9JotZI/AAAAAAAABAs/lM5FAHjrZ8kvvWx23_ACqewxCpnAqJDiACLcB/s1600/simple_bernoulli_neuron_net.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-BeExEkIkfH8/WLZ1H9JotZI/AAAAAAAABAs/lM5FAHjrZ8kvvWx23_ACqewxCpnAqJDiACLcB/s320/simple_bernoulli_neuron_net.png" width="320" height="259" /></a><br /><br />The grey neuron with a "B" inside is the Bernoulli neuron, "w0" and "w1" are weights and "b0" and "b1" are biases. If we had to turn this into an equation in order to differentiate it, we'd end up with this:<br /><br />$$y = B_x \cdot sig(w_1 x + b_1)$$<br /><br />where "B_x" is a random variable that is dependent on "x". Unfortunately random variables "hide" their inner workings and we cannot express their probability as part of the equation. This means that we cannot optimize "w0" and "b0" by gradient descent as it would be meaningless to differentiation with respect to "w0" given that it's not in the equation. In fact "B" is treated as a constant in the above equation and we can still differentiate the equation with respect to all the other parameters. Note that this situation is different from dropout, where you randomly multiply some of the neuron values by 0 in order to avoid overfitting. In the case of dropout, the above equation would be enough as we're not optimizing the value of the dropout random variable.<br /><br />There is a simple solution to this problem however. Instead of finding the gradient of "y", we can find the gradient of the expected value of "y". The expected value is the mean value you get when running a stochastic function over and over again. For example, if you're tossing a fair coin, with one face representing a 1 and the other face representing a 0, the expected value is 0.5. However, if you're tossing an unfair coin where the "1" face comes up 75% of the time, then on average you will get a value of 0.75. In general, the expected value of a discrete probability distribution, such as the Bernoulli distribution, can be found using the following equation:<br /><br />$$E[X] = \sum_{v \in X} p(v) \cdot v$$<br /><br />that is, just multiple each value by its respective probability and take the sum. In the case of a coin with probability of the "1" face coming up being "p" (a Bernoulli distribution), the expected value is $1 \cdot p + 0 \cdot (1-p)$ which equals "p". We can take advantage of the expect value by using gradient descent to minimize the expected error rather than the error itself. This would expose the parameters that determine the probability of the Bernoulli neuron and we would be able to use gradient descent as usual.<br /><br />Let's take a concrete example of an error function. We want to minimize the sum square error of the neural net such that when given a 0 it should output a 1 and when given a 1 it should output a 0 (a logical NOT). This is what the error function (called the cost function) would look like, together with the error function of a single input:<br /><br />$$<br />\begin{align}<br />C &= \sum_{(x,t_x) \in \{ (0,1),(1,0) \}} (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\<br />C_x &= (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\<br />\end{align}<br />$$<br /><br />Let's focus on just the single input error function. The expected error would look like:<br /><br />$$<br />\begin{align}<br />E[C_x] &= sig(w_0 x + b_0) \cdot (t_x - 1 \cdot sig(w_1 x + b_1))^2 + (1 - sig(w_0 x + b_0)) \cdot (t_x - 0 \cdot sig(w_1 x + b_1))^2 \\<br />&= sig(w_0 x + b_0) \cdot (t_x - sig(w_1 x + b_1))^2 + (1 - sig(w_0 x + b_0)) \cdot (t_x)^2 \\<br />\end{align}<br />$$<br /><br />Remember that expected value finds the sum of all the possible values of the stochastic function due to randomness multiplied by their respective probability. In the case of our neural net, the two possible values due to randomness are caused by the Bernoulli neuron "B" being either 1 with probability $sig(w_0 x + b_0)$ or 0 with probability $1 - sig(w_0 x + b_0)$. Now we can find the gradient of the expected error and minimize it, which would include optimizing the parameters determining the probability of the Bernoulli neuron.<br /><br />In general it might not be tractable to expose all the possible values of a stochastic neural network, especially when multinoulli neurons are used where there are many possible discrete values instead of just 0 or 1 and especially when there are multiple such neurons whose values must be considered together, creating a combinatorial explosion. To solve this problem, we can approximate the expected error by taking samples. For example, if you want to approximate the expected value of a coin, you can toss it a hundred times, count the number of times the coin gives the "1" face and divide by 100. This can be done with a stochastic neural network, where you run the network a hundred times and calculate the mean of the error for each training pair. The problem is that we don't want the expected error, but the derivative of the expected error, which cannot be calculated on the constant you get after approximating the expected error. We need to take samples of the expected derivative of the error instead. This is what the expected derivative of the error looks like:<br /><br />$$<br />\begin{align}<br />\frac{dE[C_x]}{dw} &= \frac{d}{dw} \sum_{v \in C_x} p(v) \cdot v \\<br />&= \sum_{v \in C_x} \frac{d}{dw} (p(v) \cdot v) \\<br />\end{align}<br />$$<br /><br />The problem with this equation is that it can't be sampled like an expected value can so you'll end up having to calculate the full summation, which we're trying to avoid. The solution to this is that we continue breaking down the algebra until eventually we get something that can be approximated by sampling. This has been derived multiple times in the literature and so it has multiple names such as REINFORCE, score function estimator, and likelihood ratio estimator. It takes advantage of the following theorem of logarithms: $\frac{d}{dx} f(x) = f(x) \cdot \frac{d}{dx} \log(f(x))$<br /><br />$$<br />\begin{align}<br />\frac{dE[C_x]}{dw} &= \sum_{v \in C_x} \frac{d}{dw} (p(v) \cdot v) \\<br />&= \sum_{v \in C_x} \left( p(v) \cdot \frac{d}{dw} v + v \cdot \frac{d}{dw} p(v) \right) \\<br />&= \sum_{v \in C_x} \left( p(v) \cdot \frac{d}{dw} v + v \cdot p(v) \cdot \frac{d}{dw} \log(p(v)) \right) \\<br />&= \sum_{v \in C_x} p(v) \cdot \left( \frac{d}{dw} v + v \cdot \frac{d}{dw} \log(p(v)) \right) \\<br />&\approx \frac{\sum_{i = 1}^{N} \frac{d}{dw} \tilde{v} + \tilde{v} \cdot \frac{d}{dw} \log(p(\tilde{v}))}{N} \text{ where } \tilde{v} \sim{} C_x \\<br />\end{align}<br />$$<br /><br />The last line is approximating the derivative of the expected error using "N" samples. This is possible because probability "p" was factored out inside the summation, which gives it the same form of an expect value equation and which hence can be approximated in the same way. You might be asking how it is that you can find the probability of each possible value. As long as you have access to the values returned by the stochastic neurons, you can use a dictionary to map values to their respective probabilities and multiply them together if necessary. The following is the derivative of the above Bernoulli NOT gate:<br /><br />$$<br />\begin{align}<br />C_x &= (t_x - B_x \cdot sig(w_1 x + b_1))^2 \\<br />\frac{dE[C_x]}{dw_0} &\approx \frac{\sum_{i = 1}^{N} \left( \frac{d}{dw_0} (t_x - B_x \cdot sig(w_1 x + b_1))^2 \right) + (t_x - B_x \cdot sig(w_1 x + b_1))^2 \cdot \left(<br /> \frac{d}{dw_0}<br /> \left\{ \begin{array}{lr}<br /> \log(sig(w_0 x + b_0)) & : B_x = 1 \\<br /> \log(1 - sig(w_0 x + b_0)) & : B_x = 0 \\<br /> \end{array} \right.<br /> \right) }{N} \\<br />\end{align}<br />$$mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-45362065973473967332017-01-31T22:47:00.001+01:002017-02-18T19:29:42.626+01:00Applying softmax and categorical_crossentropy to 3D tensors in TheanoTheano is a great deep learning library but there are some things in it that need to be polished a bit. For example at the moment, you can only apply the softmax function in the tensor.nnet module to matrices (2D tensors). Same goes for the categorical_crossentropy function.<br /><br />This is good for when you have a list of predictions, for example, when you have a bunch of images and you want your neural network to output a set of probabilities corresponding to each possible image label for each image. In this case you want to give your softmax function a matrix of values where each row corresponds to an image and each value in each row is a feature that is used to determine how to distribute the label probabilities.<br /><br />But let's say that instead of just a single label we want to generate a sentence or a list of labels. In this case we need to provide a 3D tensor where the first dimension corresponds to sentences, the second dimension corresponds to word place holders in the sentences and the third dimension corresponds to the features that will be used to determine the probabilities of the possible words that fill in each place holder. In this case the softmax will not accept the 3D tensor.<br /><br />Assuming that the probabilities to output are exclusively dependent on their corresponding features and that features are not shared among different "place holders", the solution is to reshape your 3D tensor into a 2D tensor, apply your softmax and then reshape the 2D tensor back into the original 3D tensor shape. Here's an example:<br /><br /><pre>Original 3D tensor:<br />[ [ [111,112], [121,122] ], [ [211,212], [221,222] ] ]<br /><br />Reshaped 2D tensor:<br />[ [111,112], [121,122], [211,212], [221,222] ]<br /><br />Applied softmax:<br />[ [111',112'], [121',122'], [211',212'], [221',222'] ]<br /><br />Reshaping back to 3D:<br />[ [ [111',112'], [121',122'] ], [ [211',212'], [221',222'] ] ]<br /></pre><br />It would be nice if this is done automatically behind the scene by Theano. In the mean time, here is a snippet to help you:<br /><br /><pre class="brush:python" style="border:black solid 2px;">import theano<br />import theano.tensor as T<br /><br />X = T.tensor3()<br />(d1,d2,d3) = X.shape<br />Y = T.nnet.softmax(X.reshape((d1*d2,d3))).reshape((d1,d2,d3))<br /></pre><br />Here is what happens step by step:<br /><pre class="brush:python">print(X.reshape((d1*d2,d3)).eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))<br /><br />>>> [[ 1. 2.]<br /> [ 1. 3.]<br /> [ 1. 4.]<br /> [ 1. 5.]]<br /></pre><br /><pre class="brush:python">print(T.nnet.softmax(X.reshape((d1*d2,d3))).eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))<br /><br />>>> array([[ 0.26894142, 0.73105858],<br /> [ 0.11920292, 0.88079708],<br /> [ 0.04742587, 0.95257413],<br /> [ 0.01798621, 0.98201379]])<br /></pre><br /><pre class="brush:python">print(Y.eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]] }))<br /><br />>>> [[[ 0.26894142 0.73105858]<br /> [ 0.11920292 0.88079708]]<br /><br /> [[ 0.04742587 0.95257413]<br /> [ 0.01798621 0.98201379]]]<br /></pre><br />The categorical_crossentropy function can be used in the same way:<br /><br /><pre class="brush:python" style="border:black solid 2px;">import theano<br />import theano.tensor as T<br /><br />X = T.tensor3()<br />S = T.imatrix()<br />(d1,d2,d3) = X.shape<br />(e1,e2) = S.shape<br />Y = T.nnet.categorical_crossentropy( T.nnet.softmax(X.reshape((d1*d2,d3))), S.reshape((e1*e2,)) ).reshape((d1,d2))<br /></pre><br />And this is how it is used:<br /><pre class="brush:python">print(Y.eval({ X: [[[1,2],[1,3]],[[1,4],[1,5]]], S: [[0,1],[1,0]] }))<br /><br />>>> array([[ 1.31326169, 0.12692801],<br /> [ 0.04858735, 4.01814993]])<br /></pre><br />Where "S" is choosing which probability to apply negative log to in each softmax, for example the corresponding number in "S" for [1,2] is 0 so we choose the first probability the comes out of it, which is 0.26894142, to which we apply negative log, that is, -ln(0.26894142) = 1.31326169. Similarly, the corresponding number in "S" for [1,4] is 1 so we choose the second probability which is 0.95257413 from which we perform -ln(0.95257413) = 0.04858735.mtantinoreply@blogger.com2tag:blogger.com,1999:blog-4318944459823143473.post-29095074457151571912016-12-27T16:15:00.002+01:002016-12-27T16:15:46.774+01:00Bernoulli vs Binomial vs Multinoulli vs Multinomial distributions<div class="sectitle">Bernoulli distribution</div>Say you are flipping a coin that has a probability of 0.4 of turning up heads and 0.6 of turning up tails. The simplest probability distribution there is is the <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli distribution</a> which just asks for the probability of the coin turning up heads after one toss, that is, 0.4. That's all. The Bernoulli distribution is only about the probability of a random event with two possible outcomes occurring after one trial. It's used for success/fail type random variables where you want to reason about the probability that a random variable will succeed (such as a coin turning up heads).<br /><br />The probability mass function for the Bernoulli distribution is:<br /><pre>f(x;p) = p^x * (1 - p)^(1-x)<br /></pre>where "*" is multiplication and "^" is power, "p" is the probability of a success and "x" is the number of successes desired, which can be either 1 or 0. When "x" is 1 then you find the probability of a success whilst when it is 0 then you find the probability of a fail. Since there are only two outcomes, if "p" is the probability of a success then the probability of a non-success is "1-p". Notice that this function is quite silly really as it is just a closed form expression for choosing either "p" or "1-p" depending on what "x" is. The number you're interested in is raised to the power of 1 whilst the number you're not interested in is raised to the power of 0, turning it into 1, and then the two results are multiplied together.<br /><br /><div class="sectitle">Binomial distribution</div>One generalization we can make upon the Bernoulli distribution is to consider multiple trails instead of just one and then ask about the probability that a number of successes would occur. This is called the <a href="https://en.wikipedia.org/wiki/Binomial_distribution">binomial distribution</a>. For example, what is the probability that the coin described above results in exactly 2 heads after tossing it 3 times. There are many ways how this result can come about. You can get [head, head, tail], or [head, tail, head], or [tail, head, head]. For this reason we need to consider the number of different ways the required total can be achieved, which is found by using the <a href="https://en.wikipedia.org/wiki/Combination">combination formula</a>.<br /><br />The probability mass function for the Binomial distribution is:<br /><pre>f(x;n,p) = n!/(x! * (n-x)!) * p^x * (1 - p)^(n-x)<br /></pre>where "*" is multiplication and "^" is power, "p" is the probability of a success, "n" is the number of trials to try out, and "x" is the number of desired successes out of the "n" trials.<br /><br /><div class="sectitle">Multinoulli distribution</div>Whereas the binomial distribution generalises the Bernoulli distribution across the number of trials, the <a href="https://en.wikipedia.org/wiki/Categorical_distribution">multinoulli distribution</a> generalises it across the number of outcomes, that is, rolling a dice instead of tossing a coin. The multinoulli distribution is also called the categorical distribution.<br /><br />The probability mass function for the multinoulli distribution is:<br /><pre>f(xs;ps) = ps_1^xs_1 * ps_2^xs_2 * ... * ps_k^xs_k<br /></pre>where "*" is multiplication and "^" is power, "k" is the number of outcomes (6 in the case of a dice, 2 for a coin), "ps" is the list of "k" probabilities where "ps_i" is the probability of the ith outcome resulting in a success, "xs" is a list of numbers of successes where "xs_i" is the number of successes desired for the ith outcome, which can be either 1 or 0 and where there can only be exactly a single "1" in "xs".<br /><br /><div class="sectitle">Multinomial distribution</div>Finally the most general generalisation of the Bernoulli distribution is across both the number of trials and the number of outcomes, called the <a href="https://en.wikipedia.org/wiki/Multinomial_distribution">multinomial distribution</a>. In order to be more general, this distribution also allows specifying the number of successes desired for each individual outcome, rather than just of one outcome like the multinoulli distribution. This lets us calculate the probability of rolling a dice 6 times and getting 3 "2"s, 2 "3"s and a "5". Since we want it to be compatible with the binomial distribution, these outcomes can arrival in any order, that is, it doesn't matter whether you get [5,3,2,3,2,2] or [2,2,2,3,3,5]. For this reason we need to use the combination function again.<br /><br />The probability mass function for the multinomial distribution is:<br /><pre>f(xs;n,ps) = n!/(xs_1! * xs_2! * ... * xs_k!) * ps_1^xs_1 * ps_2^xs_2 * ... * ps_k^xs_k<br /></pre>where "*" is multiplication and "^" is power, "k" is the number of outcomes (6 in the case of a dice, 2 for a coin), "ps" is the list of "k" probabilities where "ps_i" is the probability of the ith outcome resulting in a success, "xs" is a list of numbers of successes where "xs_i" is the number of successes desired for the ith outcome, the total of which must be "n".mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-58131235387847430492016-11-30T20:41:00.001+01:002016-11-30T21:14:14.230+01:00Checking if a number is prime: when to stop checking potential factorsLet's say you want to find out whether the number 888659800715261 is a prime number. If this were a programming exercise, a naive solution would be to go through all the numbers from 2 to 888659800715261/2 (half the number) and check whether any of the numbers divide 888659800715261 evenly. If none of them do, then the number is prime. The reason why we stop at half the number is because the largest factor that a number "n" can have before "n" itself is n/2, the one before that is n/3, and so on. Sure, we don't need to check every single number between 2 and n/2 as it would be enough to only check the prime numbers between 2 and n/2, but the point is that as soon as we reach n/2, we can stop checking and declare that 888659800715261 is prime.<br /><br />The problem is that half of 888659800715261 is 444329900357630 which is still too big to check each number between 2 and it, probably even if you only check the prime numbers. The question is, is there an even lower upper-bound where we can stop checking potential factors in order to be sure that the number has no factors?<br /><br />Let's look at which numbers between 1 and 24 are factors of 24:<br /><pre>[1] [2] [3] [4] 5 [6] 7 [8] 9 10 11 [12] 13 14 15 16 17 18 19 20 21 22 23 [24]</pre><br />In order for a number to be a factor of 24, there must be another factor which when the two factors are multiplied together give 24. For example, if 2 is a factor of 24, then there must be another factor of 24 which when multiplied by 2 gives 24. This other factor is 12. Let's call 2 and 12 co-factors. The co-factor of 3 is 8, the co-factor of 4 is 6. We can imagine co-factors being mirror reflections of each other:<br /><br /><pre>[1] [2] [3] [4] 5 [6] 7 [8] 9 10 11 [12] 13 14 15 16 17 18 19 20 21 22 23 [24]<br /> ( [ { < > } ] )<br /></pre><br />Notice how all the co-factors nest each other. The co-factors 4 and 6 are between the co-factors 3 and 8 which in turn are between the co-factors 2 and 12 which in turn are between 1 and 24. It seems like there is a line of reflection at 5, where all the factors smaller than 5 have co-factors larger than 5. This means that beyond 5, all factors are co-factors of smaller numbers.<br /><br />Let's see another list of factors for 16:<br /><pre>[1] [2] 3 [4] 5 6 7 [8] 9 10 11 12 13 14 15 [16]<br /> ( [ { } ] )<br /></pre><br />Notice how 4 is a co-factor with itself in this case, since 4 squared is 16. The line of reflection is crossed at 4 this time, which happens to be the square root of 16. In fact the square root of 24 is 4.898... which is close to 5 (the line of reflection of 24). This is always the case, because the line of reflection occurs where the two co-factors are the same. If the co-factors are not the same then one factor must be smaller than the square root whilst the other must be larger. If both are smaller or bigger then when multiplied together they will not give the number being factored. This is because if two co-factors "a" and "b" are supposed to equal "n", and both of them are larger than √n, then that would mean that a×b must also be larger than √n×√n, which means that a×b is larger than "n". The area of a rectangle with two large sides cannot be equal to the area of a rectangle with two small sides. Therefore, if the two sides are equal (a square) then in order to change the length of the sides without changing the area of the rectangle you must make one side larger whilst making the other smaller. This proves that the square root must be between any two co-factors.<br /><br />So what's the point? The point is that if you reach the square root of "n" without having found any factors, then you are guaranteed that there will not be any factors beyond √n because if there are any then they must have co-factors smaller than √n, which you have already confirmed that there aren't any. So the square root of a number is the lower upper-bound to check for factors. Can there be an even lower upper-bound? No, because the number might be a square number and have no other factors apart from its square root, such as 25. Therefore the smallest co-factor a number can have is the square root of the number.<br /><br />So in order to check if 888659800715261 is a prime number, we only have to check all the numbers up to 29810397 for factors, which is much better than 444329900357630. By the way, there is no number between 2 and 29810397 that evenly divides 888659800715261, which means that it is a prime number.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-92097245020304925242016-10-30T20:26:00.000+01:002017-04-03T20:53:45.944+02:00Using beam search to generate the most probable sentenceThis blog post continues in a second blog post about <a href="http://geekyisawesome.blogspot.com.mt/2017/04/getting-top-n-most-probable-sentences.html">how to generate the top n most probable sentences</a>.<br /><br />In my <a href="http://geekyisawesome.blogspot.com.mt/2016/09/text-generation-using-language-model.html">last blog post</a> I talked about how to generate random text using a language model that gives the probability of a particular word following a prefix of a sentence. For example, given the prefix "The dog", a language model might tell you that "barked" has a 5% chance of being the next word whilst "meowed" has a 0.3%. It's one thing generating random text in a way that is guided by the probabilities of the words but it is an entirely different thing to generate the most probable text according to the probabilities. By most probable text we mean that if you multiply the probabilities of all the words in the sentence together you get the maximum product. This is useful for conditioned language models which give you different probabilities depending on some extra input, such as an image description generator which accepts an image apart from the prefix and returns probabilities for the next word depending on what's in the image. In this case we'd like to find the most probable description for a particular image.<br /><br />You might think that the solution is to always pick the most probable word and add it to the prefix. This is called a greedy search and is known to not give the optimal solution. The reason is because it might be the case that every combination of words following the best first word might not be as good as those following the second best word. We need to use a more exploratory search than greedy search. We can do this by thinking of the problem as searching a probability tree like this:<br /><br /><a href="https://4.bp.blogspot.com/-Jjpb7iyB37A/WBZI4ImGQII/AAAAAAAAA9s/ululnUWt2vw9NMKuEr-F9H8tR2LEv36lACLcB/s1600/prefix_probability_tree.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-Jjpb7iyB37A/WBZI4ImGQII/AAAAAAAAA9s/ululnUWt2vw9NMKuEr-F9H8tR2LEv36lACLcB/s320/prefix_probability_tree.png" width="320" height="265" /></a><br /><br />The tree shows a probability tree of which words can follow a sequence of words together with their probabilities. To find the probability of a sentence you multiply every probability in the sentence's path from <start> to <end>. For example, the sentence "the dog barked" has a probability of 75% × 73% × 25% × 100% = 13.7%. The problem we want to solve is how to find the sentence that has the highest probability.<br /><br />One way to do this is to use a breadth first search. Starting from the <start> node, go through every node connected to it, then to every node connected to those nodes and so on. Each node represents a prefix of a sentence. For each prefix compute its probability, which is the product of all the probabilities on its path from the <start> node. As soon as the most probable prefix found is a complete sentence, that would be the most probable sentence. The reason why no other less probable prefixes could ever result in more probable sentences is because as a prefix grows, its probability shrinks. This is because any additional multiplications with probabilities made to any prefix probability will only make it smaller. For example, if a prefix has a probability of 20% and another word is added to it which has a probability of 99%, then the new probability will be 20% × 99% which is the smaller probability of 19.8%.<br /><br />Of course a breadth first search is impractical on any language model that has a realistic vocabulary and sentence length since it would take too long to check all the prefixes in the tree. We can instead opt to take a more approximate approach where we only check a subset of the prefixes. The subset would be the top 10 most probable prefixes found at that point. We do a breadth first search as explained before but this time only the top 10 most probable prefixes are kept and we stop when the most probable prefix in these top 10 prefixes is a complete sentence.<br /><br />This is practical but it's important that the way we find the top 10 prefixes is fast. We can't sort all the prefixes and choose the first 10 as there would be too many. We can instead use a <a href="https://en.wikipedia.org/wiki/Heap_%28data_structure%29">heap data structure</a>. This data structure is designed to quickly take in a bunch of numbers and quickly pop out the smallest number. With this you can insert the prefix probabilities one by one until there are 10 prefixes in it. After that start popping out the smallest probability immediately after inserting a new one in order to only keep the 10 largest ones.<br /><br />Python provides a library called "heapq" (heap queue) just for this situation. Here is a class that makes use of heapq in order to create a beam of prefix probabilities:<br /><br /><pre class="brush:python">import heapq<br /><br />class Beam(object):<br />#For comparison of prefixes, the tuple (prefix_probability, complete_sentence) is used.<br />#This is so that if two prefixes have equal probabilities then a complete sentence is preferred over an incomplete one since (0.5, False) < (0.5, True)<br /><br /> def __init__(self, beam_width):<br /> self.heap = list()<br /> self.beam_width = beam_width<br /><br /> def add(self, prob, complete, prefix):<br /> heapq.heappush(self.heap, (prob, complete, prefix))<br /> if len(self.heap) > self.beam_width:<br /> heapq.heappop(self.heap)<br /> <br /> def __iter__(self):<br /> return iter(self.heap)<br /></pre><br />The code to perform the actual beam search is this:<br /><br /><pre class="brush:python">def beamsearch(probabilities_function, beam_width=10, clip_len=-1):<br /> prev_beam = Beam(beam_width)<br /> prev_beam.add(1.0, False, [ '<start>' ])<br /> while True:<br /> curr_beam = Beam(beam_width)<br /> <br /> #Add complete sentences that do not yet have the best probability to the current beam, the rest prepare to add more words to them.<br /> for (prefix_prob, complete, prefix) in prev_beam:<br /> if complete == True:<br /> curr_beam.add(prefix_prob, True, prefix)<br /> else:<br /> #Get probability of each possible next word for the incomplete prefix.<br /> for (next_prob, next_word) in probabilities_function(prefix):<br /> if next_word == '<end>': #if next word is the end token then mark prefix as complete and leave out the end token<br /> curr_beam.add(prefix_prob*next_prob, True, prefix)<br /> else: #if next word is a non-end token then mark prefix as incomplete<br /> curr_beam.add(prefix_prob*next_prob, False, prefix+[next_word])<br /> <br /> (best_prob, best_complete, best_prefix) = max(curr_beam)<br /> if best_complete == True or len(best_prefix)-1 == clip_len: #if most probable prefix is a complete sentence or has a length that exceeds the clip length (ignoring the start token) then return it<br /> return (best_prefix[1:], best_prob) #return best sentence without the start token and together with its probability<br /> <br /> prev_beam = curr_beam<br /></pre><br />"probabilities_function" returns a list of word/probability pairs given a prefix. "beam_width" is the number of prefixes to keep (so that instead of keeping the top 10 prefixes you can keep the top 100 for example). By making the beam search bigger you can get closer to the actual most probable sentence but it would also take longer to process. "clip_len" is a maximum length to tolerate, beyond which the most probable prefix is returned as an incomplete sentence. Without a maximum length, a faulty probabilities function which does not return a highly probable end token will lead to an infinite loop or excessively long garbage sentences.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-36875548580999692502016-09-28T11:12:00.000+02:002016-09-28T11:48:35.121+02:00Text generation using a language modelA <a href="https://en.wikipedia.org/wiki/Language_model">language model</a> is something that tells you the probability of a sequence of words. For example it tells you that the sequence "a dog is barking" is more probable than "a dog is was". This probability can then be used to generate text by selecting a sequence of words that is probable.<br /><br />One way to do this is to take a huge amount of text and count how often each 4 word segment occurs, then divide by the total amount of 4 word segments. For example, you find that in your huge text, "a dog is barking" to occurs 42 times and divide it by the amount of 4 word segments in the text. This will give you an approximate probability of "a dog is barking".<br /><br />Usually language models give you the probability of a particular word following a prefix of a sentence. For example given the prefix "a dog is", what is the probability that the next word in the prefix is "barking"? This is expressed mathematically as P("barking" | "a dog is"). This is what <a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/">neural network language models</a> do. You give them a prefix of words and they output a number for each known word, where the number is the probability of that word following the prefix.<br /><br />Once you have a probability distribution for all the words in your vocabulary, you can then pick the word with the highest probability to continue the incomplete sentence. If you do this for every word, you will generate a whole sentence but you will always generate the same sentence. In order to generate a random sentence, we need to choose a random next word, but in such a way that highly probable words are more likely to be chosen in order to avoid generating gibberish.<br /><br /><div class="sectitle">The softmax function</div>Given a set of numbers, such as frequencies, we can turn these numbers into probabilities by using the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood estimation</a> as described above. This is when you divide a frequency by the total. However this isn't the best way to do things. First of all this assumes that words that don't occur in the text have a probability of zero, which is unlikely to be the case as it's more likely that the word just has a very small probability than a probability of zero. Second of all, this requires the number to be frequencies, that is, to be all greater than or equal to zero, which might be too limiting. A neural network's outputs can be used to quantify how strongly it believes a particular word should follow a prefix, but this quantity might be negative in order to avoid being limited by a lower bound. Thirdly, we might want to skew the probabilities so that the most probable word has a higher probability and the least probable word has a lower probability or vice versa. This is done so that when you're selecting words based on their probability you either bias the selection towards higher probability words or you go the other way and make the probabilities more similar in order to generate more random looking text.<br /><br />To address all these concerns we can use a softmax function to convert the quantities into probabilities which uses a <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">Boltzmann distribution</a>. Given a list of "n" quantities called "q", the probability of the kth quantity is:<br /><pre>e^(q_k/T) / sum(e^(q_i/T) for i in 1..n)</pre>where "T" is called the temperature and is used to control how similar the probabilities should be (by default it is 1).<br /><br />For example if your quantities are [0, 1, -1], then for temperature 1 the probabilities are:<br /><br /><pre>0: e^(0/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.245</pre><pre>1: e^(1/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.665</pre><pre>-1: e^(-1/1) / (e^(0/1) + e^(1/1) + e^(-1/1)) = 0.09</pre><br />Notice how the probabilities and all valid (non-negative and sum to 1), how by raising "e" to the power of the quantity makes it always greater than zero and see what happens when the temperature is set to a larger number:<br /><br /><pre>0: e^(0/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.307</pre><pre>1: e^(1/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.506</pre><pre>-1: e^(-1/2) / (e^(0/2) + e^(1/2) + e^(-1/2)) = 0.186</pre><br />See how more similar the probabilities are now with a higher temperature?<br /><br />The nice thing about the temperature is that even if you have already calculated the probabilities using temperature 1 and have lost the original quantities, you can still change the temperature of the probabilities as follows:<br /><br />Given probabilities "p", new probability "p'_k" based on temperature "T" are<br /><pre>p'_k = p_k^(1/T) / sum(p_i^(1/T) for i in 1..n)</pre><br />From the above examples,<br /><pre>0: 0.245^(1/2) / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.307</pre><pre>1: 0.665^(1/2) / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.506</pre><pre>-1: 0.09^(1/2) / (0.245^(1/2) + 0.665^(1/2) + 0.09^(1/2)) = 0.186</pre><br /><div class="sectitle">The roulette wheel selection</div>Now how do we select a word using it's probability? There is an algorithm called <a href="https://en.wikipedia.org/wiki/Fitness_proportionate_selection">Roulette wheel selection</a> which does this. The idea is to put all the probabilities next to each other on a number line and then select a point on the number line at random. The word whose probability contains the point is chosen. Here's a diagram showing the probabilities 0.6, 0.3 and 0.1 next to each other on a number line and a random red point is placed on the number line in the "word 1" region meaning that word 1 was selected:<br /><br /><a href="https://4.bp.blogspot.com/-xwb58iTCIf0/V-uHgUWmTDI/AAAAAAAAA74/92yG_53qW7M6qY6eMdfkYS-bJQaKmSlNACLcB/s1600/roulette_wheel_selection.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-xwb58iTCIf0/V-uHgUWmTDI/AAAAAAAAA74/92yG_53qW7M6qY6eMdfkYS-bJQaKmSlNACLcB/s320/roulette_wheel_selection.png" width="320" height="60" /></a><br /><br />To do this, we generate a random number between 0 and 1 (the random point), and then start adding probabilities (in any order) until the total becomes larger than the random number. The word belonging to the last probability added is the one that is chosen. Here's a Python 3 implementation:<br /><br /><pre class="brush:python">def select(probs):<br /> r = random.random() #random number between 0 and 1<br /> total = 0.0<br /> for i in range(len(probs)):<br /> total += probs[i]<br /> if total > r: #important to use > and not >= in order to avoid selecting words with a probability of zero<br /> return i<br /> #a words must have been selected at this point or else the probabilities are invalid (either negative or do not sum to 1<br /> raise ValueError('Invalid probabilities')<br /></pre>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-45771621940275845192016-08-28T18:04:00.000+02:002016-08-28T18:19:44.377+02:00Using dropout in neural networks to avoid overfittingImagine you want to train a neural network to recognise objects in photos. So you build a training set of labelled photos and use it for training. The training brings down the error so that it can recognise the objects in the training set with 100% accuracy. So now you try to use the network on some new photos for actual use but it gets all of them wrong. This is a pesky problem with training neural networks called overfitting, that is, the network learns a set of weights that work for the training set provided but nothing else beyond it. It's a problem of generalisation that is present in every machine learning technique that has to learn from a finite training set due to something called the <a href="https://geekyisawesome.blogspot.com.mt/2013/07/no-free-lunch.html">No Free Lunch theorem</a>.<br /><br />Generally solutions are to make the training set larger by adding a wider variety of examples and to make the model as simple as possible since a complicated model is more likely to learn something complicated but uninteresting rather than something useful. There should also be a separate validation set which has some extra data that was not used in the training set which is used to check how the learning is performing on new data.<br /><br />There is also a very interesting solution in neural networks called dropout. Dropout is a technique that was published in a paper titled <a href="http://arxiv.org/pdf/1207.0580.pdf">Improving neural networks by preventing co-adaptation of feature detectors</a> and it works by randomly replacing activations in certain layers with a zero for every training set entry. This means that given a particular layer, for the first training set entry you randomly choose half of the neurons and replace their computed values with zeros before passing them on to the next layer. For the second training set entry you randomly choose a different set of neurons and repeat.<br /><br />There are several hypothesis for why this should work:<br /><ul><li>Dropout is <span style="font-weight:bold;">a form of regularisation which prevents neurons from overly relying on a small set of previous neurons</span>. An overly important single neuron would mean that the output of the network is relying on a single piece of evidence rather than looking at all the evidence and deciding from there. That single piece of evidence that the neuron learned to detect might be consistently present in the training set data but it might not be in other data which results in overfitting. With dropout no single neuron is guaranteed to be there when using the network on a training set entry so the network learns to avoid relying on a single neuron.</li><li>Dropout forces each neuron to <span style="font-weight:bold;">learn a function that is independently interpretable</span>. This means that a single neuron learns to detect the presence of, say, stripes, whilst another neuron learns to detect the presence of fur, and so on. Without dropout the network will tend to have multiple neurons that detect a single thing as a co-adapted group, which means that all of the neurons have to be activated when used on new data, making them less reliable. With dropout no pair of neurons is guaranteed to be there together when using the network on a training set entry so the neurons learn to avoid relying on each other.<br /><ul><li>If you're thinking that this contradicts the first point, keep in mind that these independent functions will be learned by several nodes since a single neuron is not guaranteed to be there. This means that <span style="font-weight:bold;">several versions of the same function might be learned</span>, adding to the function's reliability.</li></ul></li><li>Dropout <span style="font-weight:bold;">adds noise to the activations which forces the network to be able to handle incomplete input data</span>. Since the activations get corrupted by dropout then the network has to learn to deal with the error, which makes it more reliable.</li><li>As a continuation of the previous point, <span style="font-weight:bold;">the corruption is like new training data</span>. This is an example of an artificial expansion of the training set, which is used explicitly when training with images by, for example, cropping the images, adding noise pixels, blurring, and so on.</li><li>Apart from expanding the training set, dropout also <span style="font-weight:bold;">simulates an ensemble of networks</span>. An ensemble of networks is when you train several neural networks that are trained on the same training set and then use them all together on new data. The networks will likely give different output as they will not learn the same thing, but the outputs of all the networks together are used as votes to determine what the majority of the networks decided the output should be. Dropout simulates this by creating a new neural network for each training set entry, which is done by using different neurons for each entry. These networks will have some shared weights but this is good as it means that they occupy less memory and will give their output faster (since you're computing just one netork rather than many smaller ones). In fact the output of the whole network will the the average of all the simulated smaller ones.</li></ul><br />Which of these hypothesis is true will probably depend on what the network is learning and there are probably many more possible reasons for why dropout works.<br /><br />Moving on, how do you actually implement dropout? What we'll do is we create a vector mask of ones and zeros, with half the numbers being ones chosen uniformly randomly, and then multiply it by the vector of layer activations we want to apply dropout to. The resulting vector is what will be used as input to the next layer. The problem is that we want to use all the neurons at test time (when we finish training). If we do this then what happens is that the neurons of the next layer will receive inputs that are twice as large since the weights of the layer adapted to only half the neurons being there. This means that we need to reduce the activations by half during test time. This is not a very neat solution as we'd like to get rid of any extra processing at test time. Instead, we use "inverted dropout" where we double the activations of the neurons that are not dropped at training time. This will make the weights adapt to having inputs that are twice as large which will work fine with twice the number of neurons at normal activation amount. You can find this explanation in <a href="http://vision.stanford.edu/teaching/cs231n/slides/lecture6.pdf">these lecture notes</a> on page 44.<br /><br />Here is some Python code using numpy:<br /><pre class="brush:python">def apply_dropout(layer, dropout_prob=0.5):<br /> mask = np.random.binomial(n=1, p=1-dropout_prob, size=layer.shape)<br /> dropped_layer = layer * mask<br /> rescaled_dropped_layer = dropped_layer/(1-dropout_prob)<br /> return rescaled_dropped_layer<br /></pre><br />Just use the output of this function as input to the next layer. At test time you can either avoid using it altogether or use a dropout_prob of 0.0 which will have no effect. For implementing in Theano, you can see how it is applied in the function _dropout_from_layer in <a href="https://github.com/mdenil/dropout/blob/master/mlp.py">this source code</a>.<br /><br />As an extra piece of information, when using recurrent neural networks you'll also want to use dropout; however be careful as the paper titled <a href="http://arxiv.org/pdf/1409.2329v5.pdf">Recurrent Neural Network Regularization</a> explains that you shouldn't apply dropout on the recurrent layers but only on the feedforward layers (that is, before or after the RNN).mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-4902741220827903442016-07-28T14:56:00.000+02:002016-07-28T14:58:09.305+02:00Python console progress bar (using \b and \r)One of the things that really fascinated me in console applications is when I see text being changed on the same line without new lines being added. An example would be a progress bar or some percentage being updated on the same line like this:<br /><br /><a href="https://4.bp.blogspot.com/-1tSfhAXxVHM/V5oACtDWMII/AAAAAAAAA6Y/Ru9Io6-9jBkX7xEU152f7p0bHgH045QFACLcB/s1600/progressbar.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-1tSfhAXxVHM/V5oACtDWMII/AAAAAAAAA6Y/Ru9Io6-9jBkX7xEU152f7p0bHgH045QFACLcB/s320/progressbar.png" width="320" height="74" /></a><br /><br />The idea is to use <a href="https://docs.python.org/2.0/ref/strings.html">ASCII control characters</a> which control the console's behaviour. These control characters were most useful back when computer displays were typewriters that <a href="http://stackoverflow.com/questions/3091524/what-are-carriage-return-linefeed-and-form-feed">printed out the text that the program wanted to show</a> but are now still useful in console applications.<br /><br />First of all, the code shown below only works in the command prompt / terminal and not in IDLE. To try them out open command prompt (if using Windows) or terminal (if using Linux) and enter the command "python" which will open a stripped down version of IDLE. Alternatively you can write a script file and then call it through the command prompt / terminal by entering the command "python <script file>" such as "python C:\file.py".<br /><br /><div class="title">Control characters</div><br />There are two control characters of interest here:<br /><br /><span style="font-weight:bold;">\b</span> moves the cursor one character back, which will print the following character over the previous character. This is useful for overwriting the characters at the end of the line.<br /><pre class="brush:python">print('xy\bz')<br /></pre><br /><a href="https://1.bp.blogspot.com/-jeWSRuVr57w/V5oAK3zFpNI/AAAAAAAAA6c/TJaY3MPsUIY55MefAxqTvHJAUNjpDwyEQCLcB/s1600/example_b.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-jeWSRuVr57w/V5oAK3zFpNI/AAAAAAAAA6c/TJaY3MPsUIY55MefAxqTvHJAUNjpDwyEQCLcB/s1600/example_b.png" /></a><br /><br /><span style="font-weight:bold;">\r</span> moves the cursor to the beginning of the line, which will print the following character over the first character. This is useful for overwriting the characters at the beginning of the line or the whole line.<br /><pre class="brush:python">print('xy\rz')<br /></pre><br /><a href="https://4.bp.blogspot.com/-Tusiw0hNq_k/V5oAQtppB8I/AAAAAAAAA6g/7d5WEebgDDAgJKnrjGArpGY9bbCpb5KxwCLcB/s1600/example_r.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-Tusiw0hNq_k/V5oAQtppB8I/AAAAAAAAA6g/7d5WEebgDDAgJKnrjGArpGY9bbCpb5KxwCLcB/s1600/example_r.png" /></a><br /><br />So how do we use these to display a functioning progress bar? First of all, if we're using \b then we'll probably need to overwrite a number of characters at the end, not just one. To do this, just print \b as many times as needed:<br /><pre class="brush:python">print('xyyy\b\b\bzzz')<br /></pre><br /><a href="https://1.bp.blogspot.com/-O5PawSORXdU/V5oAUejiyLI/AAAAAAAAA6k/oLBkMz0iEhEIDr1eWPgU3aloyhXQbSU4ACLcB/s1600/example_bbb.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-O5PawSORXdU/V5oAUejiyLI/AAAAAAAAA6k/oLBkMz0iEhEIDr1eWPgU3aloyhXQbSU4ACLcB/s1600/example_bbb.png" /></a><br /><br /><div class="title">Separate prints</div><br />We also probably want to use separate prints rather than put everything in the same print statement. To do this you need to make sure that the print will not start a new line at the end. In Python 3 you can add the argument "end=''" to the print:<br /><pre class="brush:python">def f():<br /> print('xy', end='')<br /> print('\bz')<br /><br />f()<br /></pre><br /><a href="https://1.bp.blogspot.com/-B-vXUtLMBVs/V5oAYgVUn7I/AAAAAAAAA6o/4MPiGEiSC4wUCtnpzKQQKPHwQGgyVOo6QCLcB/s1600/example_b_2prints.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-B-vXUtLMBVs/V5oAYgVUn7I/AAAAAAAAA6o/4MPiGEiSC4wUCtnpzKQQKPHwQGgyVOo6QCLcB/s1600/example_b_2prints.png" /></a><br /><br /><pre class="brush:python">def f():<br /> print('xy', end='')<br /> print('\rzz')<br /><br />f()<br /></pre><br /><a href="https://3.bp.blogspot.com/-Beqh4ettAnk/V5oAeQZBNdI/AAAAAAAAA6s/Naa7qDAVnYgqaW3th59A0GZgl2OZEt3SgCLcB/s1600/example_r_2prints.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-Beqh4ettAnk/V5oAeQZBNdI/AAAAAAAAA6s/Naa7qDAVnYgqaW3th59A0GZgl2OZEt3SgCLcB/s1600/example_r_2prints.png" /></a><br /><br />Alternatively you can use the sys.stdout.write function which is equivalent to print without the extra features such as adding a new line at the end:<br /><pre class="brush:python">import sys<br /><br />def f():<br /> sys.stdout.write('xy')<br /> sys.stdout.write('\bz\n')<br /><br />f()<br /></pre><br /><a href="https://3.bp.blogspot.com/-NS6a2zyzIJA/V5oAjDKfFzI/AAAAAAAAA60/EJcBW_2J3sUoMPhyNmKuiQty767hnE03QCLcB/s1600/example_b_2stdouts.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-NS6a2zyzIJA/V5oAjDKfFzI/AAAAAAAAA60/EJcBW_2J3sUoMPhyNmKuiQty767hnE03QCLcB/s1600/example_b_2stdouts.png" /></a><br /><br /><div class="title">String formatting</div><br />When displaying a percentage it's important that the number always takes the same amount of character space so that you know how many characters to overwrite. This can be done using the <a href="https://docs.python.org/3/library/string.html">format method of strings</a>. The format method allows you to make a string take a fixed amount of characters, filling the remainder with spaces as follows:<br /><pre class="brush:python">curr = 12<br />total = 21<br />frac = curr/total<br />print('[{:>7.2%}]'.format(frac))<br /></pre><br /><a href="https://3.bp.blogspot.com/-2CXLXkiSku8/V5oAqC72yHI/AAAAAAAAA64/lTCfCq6pEBwuhGhnBEXcFzZQYQtYotKrACLcB/s1600/example_percentage.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-2CXLXkiSku8/V5oAqC72yHI/AAAAAAAAA64/lTCfCq6pEBwuhGhnBEXcFzZQYQtYotKrACLcB/s1600/example_percentage.png" /></a><br /><br />The formatting code is the "{:>7.2%}", where ">" means that the string is be right aligned with spaces added to the front, "7" is the fixed length of the string + padded spaces (3 whole number digits + a point + 2 decimal digits + a percentage sign) ".2" is the number of decimal places to round the percentage to, and "%" means that the fraction should be expressed as a percentage.<br /><br /><div class="title">Replicated strings</div><br />When displaying a progress bar it would also be useful to know that you can replicate a string for a number of times using the '*' operator, as follows:<br /><pre class="brush:python">curr = 12<br />total = 21<br />full_progbar = 10<br />filled_progbar = round(curr/total*full_progbar)<br />print('#'*filled_progbar + '-'*(full_progbar-filled_progbar))<br /></pre><br /><a href="https://4.bp.blogspot.com/-iAoaC-YrcTs/V5oAuUpoXFI/AAAAAAAAA68/hadvyWN6xAo2lITzMuE9tIXPD7Xq2cIdACLcB/s1600/example_progbar.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-iAoaC-YrcTs/V5oAuUpoXFI/AAAAAAAAA68/hadvyWN6xAo2lITzMuE9tIXPD7Xq2cIdACLcB/s1600/example_progbar.png" /></a><br /><br /><div class="title">Full example</div><br />Here is a full progress bar example:<br /><br /><pre class="brush:python">def progbar(curr, total, full_progbar):<br /> frac = curr/total<br /> filled_progbar = round(frac*full_progbar)<br /> print('\r', '#'*filled_progbar + '-'*(full_progbar-filled_progbar), '[{:>7.2%}]'.format(frac), end='')<br /><br />for i in range(100000+1):<br /> progbar(i, 100000, 20)<br />print()<br /></pre><br /><a href="https://1.bp.blogspot.com/-L4MAKv2RpMc/V5oBWMVveFI/AAAAAAAAA7I/am8sgIjac80GI77-0dUqGJgtuDO37oIoACLcB/s1600/example_full.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-L4MAKv2RpMc/V5oBWMVveFI/AAAAAAAAA7I/am8sgIjac80GI77-0dUqGJgtuDO37oIoACLcB/s1600/example_full.png" /></a><br /><br />It might also be the case that the progress bar is not updating consistently and seems to be getting stuck. This could be because the print is buffering and need to be "flushed" after every print. This is done by adding<br /><pre class="brush:python">sys.stdout.flush()<br /></pre>after every print (don't forget to import "sys" before).mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-18657748109841748922016-06-21T09:05:00.000+02:002016-06-21T09:12:28.642+02:00The Backpropagation Algorithm for Artificial Neural Networks (ANNs)In a <a href="http://geekyisawesome.blogspot.com.mt/2016/02/gradient-descent-algorithm-on.html">previous post</a>, I talked about how to use the gradient descent algorithm to optimize the weights and biases of an artificial neural network in order to give selected outputs for selected inputs. However plain gradient descent is inefficient on large neural nets since it recomputes a lot of values for each neuron, plus it has to be rewritten for every change in architecture (number of neurons and layers) and requires a lot of code. The standard solution is to use an optimized version of the gradient descent algorithm for neural nets called the <a href="https://en.wikipedia.org/wiki/Backpropagation">backpropagation algorithm</a>.<br /><br />I will assume that you have read the previous post. Whereas the previous post was very specific using a specified architecture and a specified activation and cost function, here I will keep things as general as possible such that they can be applied on any feed forward neural network. Here are the definitions of symbols we shall be using, similar to last post's definitions:<br /><br /><a href="https://4.bp.blogspot.com/-8v8tOPKSBAA/V2jk0d7ixzI/AAAAAAAAA2s/x0l2_SCGkV4VJhHsgSQVG_4VavL8Zz4agCLcB/s1600/definitions.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-8v8tOPKSBAA/V2jk0d7ixzI/AAAAAAAAA2s/x0l2_SCGkV4VJhHsgSQVG_4VavL8Zz4agCLcB/s1600/definitions.png" /></a><br /><br />In order to help explain the algorithm and equations, I shall be applying it to the last post's architecture, just to help you understand. So here is last post's neural net:<br /><br /><a href="https://2.bp.blogspot.com/-ZNrsBW15h6g/V2jk7HLMutI/AAAAAAAAA20/kEJeqMj9wb8elyKJNjf5SzMtiXxpQeGWgCLcB/s1600/example_neuralnet.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-ZNrsBW15h6g/V2jk7HLMutI/AAAAAAAAA20/kEJeqMj9wb8elyKJNjf5SzMtiXxpQeGWgCLcB/s1600/example_neuralnet.png" /></a><br /><br />In the example, we have a 3 layer neural net with 2 neurons in each layer and 2 input neurons. It uses the sigmoid function as an activation function and the mean square error as the cost function.<br /><br />The main point of interest in the backpropagation algorithm is the way you find the derivative of the cost function with respect to a weight or bias in any layer. Let's look at how to find these derivatives for each layer in the example neural net.<br /><br /><div class="sectitle">Weights and bias in layer L (output layer)</div><br />We begin by finding the derivatives for the output layer, which are the simplest.<br /><br /><div class="subsectitle">d Cost / d Weight L</div>The derivative of the cost with respect to any weight in layer L can be found using<br /><br /><a href="https://1.bp.blogspot.com/-rYuynl5vJ8g/V2jlARthJSI/AAAAAAAAA28/-FC4Cfg7mZ8y5ZzKbx8o89242wJYCd7fACLcB/s1600/dC_dwL.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-rYuynl5vJ8g/V2jlARthJSI/AAAAAAAAA28/-FC4Cfg7mZ8y5ZzKbx8o89242wJYCd7fACLcB/s1600/dC_dwL.png" /></a><br /><br />Here's how we can arrive to this equation based on a weight in layer 3 in the example:<br /><br /><a href="https://3.bp.blogspot.com/-SY81xEQteOc/V2jlHMAnbTI/AAAAAAAAA3E/SEXKF2aY88sMsioj8ZVjqZl2pWf1Q6V3ACLcB/s1600/example_l3_w.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-SY81xEQteOc/V2jlHMAnbTI/AAAAAAAAA3E/SEXKF2aY88sMsioj8ZVjqZl2pWf1Q6V3ACLcB/s1600/example_l3_w.png" /></a><br /><br /><div class="subsectitle">d Cost / d Bias L</div><br />The derivative of the cost with respect to any bias in layer L can be found using<br /><br /><a href="https://2.bp.blogspot.com/-qXhGkyyjfMw/V2jlNXMGpfI/AAAAAAAAA3M/RE53_YQzdSwWa3zpR6poU6a6gNAdKd7yQCLcB/s1600/dC_dbL.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-qXhGkyyjfMw/V2jlNXMGpfI/AAAAAAAAA3M/RE53_YQzdSwWa3zpR6poU6a6gNAdKd7yQCLcB/s1600/dC_dbL.png" /></a><br /><br />Here's how we can arrive to this equation based on a bias in layer 3 in the example:<br /><br /><a href="https://2.bp.blogspot.com/-bFb8oGUU5NY/V2jlTcJgLjI/AAAAAAAAA3U/kvhc-rYVDVQrcp_Yrf3a_-QsTrW1uUBuACLcB/s1600/example_l3_b.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-bFb8oGUU5NY/V2jlTcJgLjI/AAAAAAAAA3U/kvhc-rYVDVQrcp_Yrf3a_-QsTrW1uUBuACLcB/s1600/example_l3_b.png" /></a><br /><br /><div class="subsectitle">Using delta</div><br />A part of the weights and biases equations is repeated. It will be convenient when finding derivatives of other layers to use an intermediate letter, called lower case delta, to represent this part of these equations.<br /><br /><a href="https://1.bp.blogspot.com/-M4suN0BZC98/V2jlX3szGAI/AAAAAAAAA3c/Xt8srfLaWxgyQ3JwELey18r1FNcQV0akACLcB/s1600/delta_L.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-M4suN0BZC98/V2jlX3szGAI/AAAAAAAAA3c/Xt8srfLaWxgyQ3JwELey18r1FNcQV0akACLcB/s1600/delta_L.png" /></a><br /><br /><div class="sectitle">Weights and bias in layer L-1</div><br />Now we find the derivative with respect to the weights and biases in layer L-1. The derivations will be continuations of the previous derivations and we won't be repeating the first bit of the derivations again.<br /><br /><div class="subsectitle">d Cost / d Weight L-1</div>The derivative of the cost with respect to any weight in layer L-1 can be found using<br /><br /><a href="https://1.bp.blogspot.com/--47pmIGrtSE/V2jlhjlGGvI/AAAAAAAAA3o/8gF05JXHpe4tsGASup7EhBXzr2zesAG3wCLcB/s1600/dC_dwL-1.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/--47pmIGrtSE/V2jlhjlGGvI/AAAAAAAAA3o/8gF05JXHpe4tsGASup7EhBXzr2zesAG3wCLcB/s1600/dC_dwL-1.png" /></a><br /><br />Here's how we can arrive to this equation based on a weight in layer 2 in the example:<br /><br /><a href="https://2.bp.blogspot.com/--q04u4F-0Ck/V2jlnMhVWbI/AAAAAAAAA3w/_J_j-YLkAD0R21fTJxyzNPmYNeb9uJPhACLcB/s1600/example_l2_w.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/--q04u4F-0Ck/V2jlnMhVWbI/AAAAAAAAA3w/_J_j-YLkAD0R21fTJxyzNPmYNeb9uJPhACLcB/s1600/example_l2_w.png" /></a><br /><br /><div class="subsectitle">d Cost / d Bias L-1</div><br />The derivative of the cost with respect to any bias in layer L-1 can be found using<br /><br /><a href="https://2.bp.blogspot.com/-2b3NYATE9zU/V2jltIWpfoI/AAAAAAAAA38/dqvrMeS0WPMLtzDTWvLBKNiE9tfyrJZjwCLcB/s1600/dC_dbL-1.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-2b3NYATE9zU/V2jltIWpfoI/AAAAAAAAA38/dqvrMeS0WPMLtzDTWvLBKNiE9tfyrJZjwCLcB/s1600/dC_dbL-1.png" /></a><br /><br />Here's how we can arrive to this equation based on a bias in layer 2 in the example:<br /><br /><a href="https://4.bp.blogspot.com/-zQUaliNNgQI/V2jlyKguNkI/AAAAAAAAA4I/4Sd4IsdAdvIz9Yk3yRYCt-DJbg9J2oynQCLcB/s1600/example_l2_b.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-zQUaliNNgQI/V2jlyKguNkI/AAAAAAAAA4I/4Sd4IsdAdvIz9Yk3yRYCt-DJbg9J2oynQCLcB/s1600/example_l2_b.png" /></a><br /><br /><div class="subsectitle">Using delta</div><br />These equations can be shortened by using delta again and now we can use the previous delta to shorten this one even more.<br /><br /><a href="https://2.bp.blogspot.com/-DWrc3_oPatQ/V2jl3cwdjkI/AAAAAAAAA4Q/PI_LZ7jIAPYILOl5O73UpAoNrUcmoXW3QCLcB/s1600/delta_L-1.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-DWrc3_oPatQ/V2jl3cwdjkI/AAAAAAAAA4Q/PI_LZ7jIAPYILOl5O73UpAoNrUcmoXW3QCLcB/s1600/delta_L-1.png" /></a><br /><br />Notice how there is a recursive definition of delta. This is the basis for the backpropagation algorithm.<br /><br /><div class="sectitle">Weights and bias in layer L-2</div><br />Now we find the derivative with respect to the weights and biases in layer L-2. Like before, we won't be repeating the first bit of the derivations again.<br /><br /><div class="subsectitle">d Cost / d Weight L-2</div>The derivative of the cost with respect to any weight in layer L-2 can be found using<br /><br /><a href="https://4.bp.blogspot.com/-ftwjORlsIwo/V2jl95kbYEI/AAAAAAAAA4c/YSkkFGwGyDgZr1ra__m635UlPtTE6fXNwCLcB/s1600/dC_dwL-2.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-ftwjORlsIwo/V2jl95kbYEI/AAAAAAAAA4c/YSkkFGwGyDgZr1ra__m635UlPtTE6fXNwCLcB/s1600/dC_dwL-2.png" /></a><br /><br />Here's how we can arrive to this equation based on a weight in layer 1 in the example (notice that there is a trick we'll be using with the summations that is explained below):<br /><br /><a href="https://4.bp.blogspot.com/-vn2IyhwMA_k/V2jmFCl-2LI/AAAAAAAAA4o/uqdXceRF0okCKwNZvY70fm4_oAH3mOczwCLcB/s1600/example_l1_w.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-vn2IyhwMA_k/V2jmFCl-2LI/AAAAAAAAA4o/uqdXceRF0okCKwNZvY70fm4_oAH3mOczwCLcB/s1600/example_l1_w.png" /></a><br /><br />At the end of the derivation we did a trick with the sigmas (summations) in order to turn the equation into a recursive one.<br /><br />First, we moved the delta to the inside of the sigmas. This can be done because it's just a common term that is factored out and we're factoring it back in. For example if the equation is "𝛿(a + b + c)" then we can turn that into "𝛿a + 𝛿b + 𝛿c".<br /><br />Second, we swapped the "p" summation with the "q" summation. This does not change anything if you think about it. All it does is change the order of the summations, which does not change anything.<br /><br />Finally we replaced the "p" summation with the previous delta, allowing us to shorten the equation.<br /><br /><div class="subsectitle">d Cost / d Bias L-2</div><br />The derivative of the cost with respect to any bias in layer L-2 can be found using<br /><br /><a href="https://2.bp.blogspot.com/-suZKM7GlkfI/V2jmKXQhZhI/AAAAAAAAA4w/Lkw8EixvXR0np3FJk2eWn8qOPrK-gFo-gCLcB/s1600/dC_dbL-2.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-suZKM7GlkfI/V2jmKXQhZhI/AAAAAAAAA4w/Lkw8EixvXR0np3FJk2eWn8qOPrK-gFo-gCLcB/s1600/dC_dbL-2.png" /></a><br /><br />Here's how we can arrive to this equation based on a bias in layer 1 in the example:<br /><br /><a href="https://3.bp.blogspot.com/-YhQ7SA4bqCk/V2jmQl7_pxI/AAAAAAAAA44/2qv78xt0wsAsi430cYAQ5V3fWKaHjg0fwCLcB/s1600/example_l1_b.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-YhQ7SA4bqCk/V2jmQl7_pxI/AAAAAAAAA44/2qv78xt0wsAsi430cYAQ5V3fWKaHjg0fwCLcB/s1600/example_l1_b.png" /></a><br /><br /><div class="subsectitle">Using delta</div><br />Again, we can shorten the equations using delta.<br /><br /><a href="https://4.bp.blogspot.com/-M_ANMOvK964/V2jmWOFpi8I/AAAAAAAAA5E/TSmmFuNpYp4pOPjTXZNBXB9Gl_2-yF6NQCLcB/s1600/delta_L-2.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-M_ANMOvK964/V2jmWOFpi8I/AAAAAAAAA5E/TSmmFuNpYp4pOPjTXZNBXB9Gl_2-yF6NQCLcB/s1600/delta_L-2.png" /></a><br /><br />Notice that this is exactly the same as the previous delta. This is because beyond L-1, all the deltas will be identical and can be defined using a single recursive relation.<br /><br /><div class="sectitle">In general: using indices</div><br />Up to now we saw what the derivatives for layer L, L-1, and L-2 are. We can now generalize these equations for any layer. Given the recursive pattern in the deltas, we can create an equation that gives the deltas of any layer provided you have the deltas of the previous layers. The base case of the recursion would be for layer L, which is defined on its own.<br /><br /><a href="https://4.bp.blogspot.com/-Zsxj5o4B6Ww/V2jmbJUc5II/AAAAAAAAA5M/G3ZflephgQw2RIz36Vf85vqz9XlB1mBCQCLcB/s1600/general_indexed.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-Zsxj5o4B6Ww/V2jmbJUc5II/AAAAAAAAA5M/G3ZflephgQw2RIz36Vf85vqz9XlB1mBCQCLcB/s1600/general_indexed.png" /></a><br /><br />Now we have a way to find the derivatives for any layer, no matter how deep the neural network is. Here is how you'd use these equations on the example:<br /><br /><a href="https://4.bp.blogspot.com/-ulvgOSGlDfc/V2jmgH2Ca0I/AAAAAAAAA5Y/h6enw65uQe8fYS4c067QibPtX2c9dljWgCLcB/s1600/example_general_applied.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-ulvgOSGlDfc/V2jmgH2Ca0I/AAAAAAAAA5Y/h6enw65uQe8fYS4c067QibPtX2c9dljWgCLcB/s1600/example_general_applied.png" /></a><br /><br /><div class="sectitle">In general: using matrices</div><br />As things stand, there are a lot of index letters littering our equations (i,j,p). We can get rid of them however if we use matrix operations that work on all the indices at once. If we do this we can even get shorter code when programming the backpropagation algorithm by making use of a fast matrix library such as numpy.<br /><br /><div class="subsectitle">Using matrices in neural nets</div><br />Let's look at how we can use matrices to compute the output of a neural net. A whole layer of activations can be calculated as a whole by treating it as a vector. <span style="font-weight:bold;">We'll treat vectors as being horizontal by default and which need to be transposed in order to be made vertical</span>.<br /><br />Here's an example of what we want to calculate:<br /><a href="https://4.bp.blogspot.com/-MZQPbGMNWbM/V2jmpM3HnpI/AAAAAAAAA5g/JfyCHfDslFcDW5pUdvYLSmkriI1BrnfDACLcB/s1600/example_neuralnet_singlelayer.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-MZQPbGMNWbM/V2jmpM3HnpI/AAAAAAAAA5g/JfyCHfDslFcDW5pUdvYLSmkriI1BrnfDACLcB/s1600/example_neuralnet_singlelayer.png" /></a><br /><br />Of course we're still using indices just because we're organising our activations in a vector. In order to get rid of the indices we need to treat the weights as a matrix, where the first index is the row number and the second index is the column number. That way we can have a single letter "w" which represents all of the weights of a layer. When multiplied by a vector of the previous layer's activations, the matrix multiplication will result in another vector of weighted sums which can be added to a bias vector and passed through an activation function to compute the next vector of activations. Here is an example of the calculation based on the above example:<br /><br /><a href="https://2.bp.blogspot.com/-hvMbzdLM5Bo/V2jmu0SmynI/AAAAAAAAA5s/afxydup5Z9kbYtWnb6KB6KNR6Jjj0Hv1wCLcB/s1600/example_feedforward_matrix.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-hvMbzdLM5Bo/V2jmu0SmynI/AAAAAAAAA5s/afxydup5Z9kbYtWnb6KB6KNR6Jjj0Hv1wCLcB/s1600/example_feedforward_matrix.png" /></a><br /><br /><div class="subsectitle">Final clean generalized deltas and cost gradients</div><br />And now we can finally see what the clean matrix-based general derivatives are for any layer:<br /><br /><a href="https://2.bp.blogspot.com/-8mO6sPRdIik/V2jmzW3kO8I/AAAAAAAAA50/WU0GydJooeUEL6lKW1mXxrrm5lv4vKQuACLcB/s1600/general_matrix.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-8mO6sPRdIik/V2jmzW3kO8I/AAAAAAAAA50/WU0GydJooeUEL6lKW1mXxrrm5lv4vKQuACLcB/s1600/general_matrix.png" /></a><br /><br /><div class="sectitle">The backpropagation algorithm<br /></div><br />Finding the gradients is one step (the most complex one) of the backpropagation algorithm. The full backpropagation algorithm goes as follows:<br /><br /><ol><li>For each input-target pair in the training set,<br /><ol><li>Compute the activations and the "z" of each layer when passing the input through the network. This is called a forward pass.</li><li>Use the values collected from the forward pass to calculate the gradient of the cost with respect to each weight and bias, starting from the output layer and using the delta equations to compute the gradients for previous layers. This is called the backward pass.</li></ol></li><li>Following the backward pass, you have the gradient of the cost with respect to each weight and bias for each input in the training set. Add up all the corresponding gradients of each input in the training set (all the gradients with respect to the same weight or bias).</li><li>Now use the gradient to perform gradient descent by multiplying the gradient by a step size, or learning rate as it's called in the backpropagation algorithm, and subtracting it from the corresponding weights and biases. In algebra,<br />weight = weight - learningrate*dCost/dWeight</li></ol><br /><div class="sectitle">In code<br /></div>And this is what the full program looks like in Python 3 using numpy to perform matrix operations. Again, we shall be training the neural net to perform the function of a half adder. A half adder adds together two binary digits and returns the sum and carry. So 0 + 1 in binary gives 1 carry 0 whilst 1 + 1 in binary gives 0 carry 1.<br /><br /><pre class="brush:python;">import math, random<br />import numpy as np<br /><br />class ANN:<br /> '''General artificial neural network class.'''<br /> <br /> def __init__(self, num_in, num_hiddens, num_out, trainingset):<br /> '''Create a neural network with a set number of layers and neurons and with initialised weights and biases.'''<br /> if not all(len(x)==num_in and len(y)==num_out for (x,y) in trainingset.items()):<br /> raise ValueError()<br /><br /> self._num_in = num_in<br /> self._num_hiddens = num_hiddens<br /> self._num_out = num_out<br /> self._trainingset = trainingset<br /><br /> self._num_layers = len(num_hiddens) + 1<br /> <br /> self._weights = dict([<br /> (1, np.array([<br /> [self.weight_init(1, i, j) for j in range(num_hiddens[0])]<br /> for i in range(num_in)<br /> ]))<br /> ]<br /> + [<br /> (l+1, np.array([<br /> [self.weight_init(l+1, i, j) for j in range(num_hiddens[l])]<br /> for i in range(num_hiddens[l-1])<br /> ]))<br /> for l in range(1, len(num_hiddens))<br /> ]<br /> + [<br /> (self._num_layers, np.array([<br /> [self.weight_init(self._num_layers, i, j) for j in range(num_out)]<br /> for i in range(num_hiddens[-1])<br /> ]))<br /> ])<br /> <br /> self._biases = dict([<br /> (l+1, np.array([<br /> self.bias_init(l+1, j) for j in range(num_hiddens[l])<br /> ]))<br /> for l in range(len(num_hiddens))<br /> ]<br /> + [<br /> (self._num_layers, np.array([<br /> self.bias_init(self._num_layers, j) for j in range(num_out)<br /> ]))<br /> ])<br /><br /> def weight_init(self, layer, i, j):<br /> '''How to initialise weights.'''<br /> return 2*random.random()-1<br /><br /> def bias_init(self, layer, j):<br /> '''How to initialise biases.'''<br /> return 2*random.random()-1<br /> <br /> def a(self, z):<br /> '''The activation function.'''<br /> return 1/(1 + np.vectorize(np.exp)(-z))<br /><br /> def da_dz(self, z):<br /> '''The derivative of the activation function.'''<br /> return self.a(z)*(1-self.a(z))<br /><br /> def out(self, x):<br /> '''Compute the output of the neural network given an input vector.'''<br /> n = x<br /> for l in range(1, self._num_layers+1):<br /> n = self.a(np.dot(n, self._weights[l]) + self._biases[l])<br /> return n<br /><br /> def cost(self):<br /> '''The cost function based on the training set.'''<br /> return sum(sum((self.out(x) - t)**2) for (x,t) in self._trainingset.items())/(2*len(self._trainingset))<br /><br /> def dCxt_dnL(self, x, t):<br /> '''The derivative of the cost function for a particular input-target pair with respect to the output of the network.'''<br /> return self.out(x) - t<br /><br /> def get_cost_gradients(self):<br /> '''Calculate and return the gradient of the cost function with respect to each weight and bias in the network.'''<br /> dC_dws = dict()<br /> dC_dbs = dict()<br /> for l in range(1, self._num_layers+1):<br /> dC_dws[l] = np.zeros_like(self._weights[l])<br /> dC_dbs[l] = np.zeros_like(self._biases[l])<br /> <br /> for (x,t) in self._trainingset.items():<br /> #forward pass<br /> zs = dict()<br /> ns = dict()<br /> ns_T = dict()<br /> ns[0] = np.array(x)<br /> ns_T[0] = np.array([np.array(x)]).T<br /> for l in range(1, self._num_layers+1):<br /> z = np.dot(ns[l-1], self._weights[l]) + self._biases[l]<br /> zs[l] = z<br /> n = self.a(z)<br /> ns[l] = n<br /> ns_T[l] = np.array([n]).T<br /><br /> #backward pass<br /> d = self.dCxt_dnL(x, t)*self.da_dz(zs[self._num_layers])<br /> dC_dws[self._num_layers] += np.dot(ns_T[self._num_layers-1], np.array([d]))<br /> dC_dbs[self._num_layers] += d<br /> for l in range(self._num_layers-1, 1-1, -1):<br /> d = np.dot(d, self._weights[l+1].T)*self.da_dz(zs[l])<br /> dC_dws[l] += np.dot(ns_T[l-1], np.array([d]))<br /> dC_dbs[l] += d<br /><br /> for l in range(1, self._num_layers+1):<br /> dC_dws[l] /= len(self._trainingset)<br /> dC_dbs[l] /= len(self._trainingset)<br /><br /> return (dC_dws, dC_dbs)<br /><br /> def epoch_train(self, learning_rate):<br /> '''Train the neural network for one epoch.'''<br /> (dC_dws, dC_dbs) = self.get_cost_gradients()<br /> for l in range(1, self._num_layers+1):<br /> self._weights[l] -= learning_rate*dC_dws[l]<br /> self._biases[l] -= learning_rate*dC_dbs[l]<br /> <br /> def check_gradient(self, epsilon=1e-5):<br /> '''Check if the gradients are being calculated correctly. This is done according to http://ufldl.stanford.edu/tutorial/supervised/DebuggingGradientChecking/ . This method calculates the difference between each calculated gradient (according to get_cost_gradients()) and an estimated gradient using a small number epsilon. If the gradients are calculated correctly then the returned numbers should all be very small (smaller than 1e-10.'''<br /> (predicted_dC_dws, predicted_dC_dbs) = self.get_cost_gradients()<br /><br /> approx_dC_dws = dict()<br /> approx_dC_dbs = dict()<br /> for l in range(1, self._num_layers+1):<br /> approx_dC_dws[l] = np.zeros_like(self._weights[l])<br /> approx_dC_dbs[l] = np.zeros_like(self._biases[l])<br /> <br /> for l in range(1, self._num_layers+1):<br /> (rows, cols) = self._weights[l].shape<br /> for r in range(rows):<br /> for c in range(cols):<br /> orig = self._weights[l][r][c]<br /> self._weights[l][r][c] = orig + epsilon<br /> cost_plus = self.cost()<br /> self._weights[l][r][c] = orig - epsilon<br /> cost_minus = self.cost()<br /> self._weights[l][r][c] = orig<br /> approx_dC_dws[l][r][c] = (cost_plus - cost_minus)/(2*epsilon)<br /> <br /> (cols,) = self._biases[l].shape<br /> for c in range(cols):<br /> orig = self._biases[l][c]<br /> self._biases[l][c] = orig + epsilon<br /> cost_plus = self.cost()<br /> self._biases[l][c] = orig - epsilon<br /> cost_minus = self.cost()<br /> self._biases[l][c] = orig<br /> approx_dC_dbs[l][c] = (cost_plus - cost_minus)/(2*epsilon)<br /><br /> errors_w = dict()<br /> errors_b = dict()<br /> for l in range(1, self._num_layers+1):<br /> errors_w[l] = abs(predicted_dC_dws[l] - approx_dC_dws[l])<br /> errors_b[l] = abs(predicted_dC_dbs[l] - approx_dC_dbs[l])<br /> return (errors_w, errors_b)<br /><br />################################################################<br />nn = ANN(<br /> 2, #number of inputs<br /> [ 2, 2 ], #number of hidden neurons (each number is a hidden layer)<br /> 2, #number of outputs<br /> { #training set<br /> (0,0):(0,0),<br /> (0,1):(1,0),<br /> (1,0):(1,0),<br /> (1,1):(0,1)<br /> }<br />)<br /><br />print('Initial cost:', nn.cost())<br />print('Starting training')<br />for epoch in range(1, 20000+1):<br /> nn.epoch_train(0.5)<br /> if epoch%1000 == 0:<br /> print(' Epoch:', epoch, ', Current cost:', nn.cost())<br />print('Training finished')<br /><br />print('Neural net output:')<br />for (x,t) in sorted(nn._trainingset.items()):<br /> print(' ', x, ':', nn.out(x))<br /></pre><br />After 20,000 iterations I got this output:<br /><pre>Initial cost: 0.232581149941<br />Starting training<br /> Epoch: 1000 , Current cost: 0.218450291043<br /> Epoch: 2000 , Current cost: 0.215777328824<br /> Epoch: 3000 , Current cost: 0.178593050179<br /> Epoch: 4000 , Current cost: 0.102704613785<br /> Epoch: 5000 , Current cost: 0.0268721186425<br /> Epoch: 6000 , Current cost: 0.00819827730488<br /> Epoch: 7000 , Current cost: 0.00444660869981<br /> Epoch: 8000 , Current cost: 0.00298786209459<br /> Epoch: 9000 , Current cost: 0.00223137023455<br /> Epoch: 10000 , Current cost: 0.00177306322414<br /> Epoch: 11000 , Current cost: 0.00146724847349<br /> Epoch: 12000 , Current cost: 0.00124934223594<br /> Epoch: 13000 , Current cost: 0.00108652952015<br /> Epoch: 14000 , Current cost: 0.000960438140866<br /> Epoch: 15000 , Current cost: 0.000860007442299<br /> Epoch: 16000 , Current cost: 0.000778192177283<br /> Epoch: 17000 , Current cost: 0.000710297773182<br /> Epoch: 18000 , Current cost: 0.000653078479503<br /> Epoch: 19000 , Current cost: 0.000604220195232<br /> Epoch: 20000 , Current cost: 0.0005620295804<br />Training finished<br />Neural net output:<br /> (0, 0) : [ 0.02792593 0.00058427]<br /> (0, 1) : [ 0.97284142 0.01665468]<br /> (1, 0) : [ 0.97352071 0.01661805]<br /> (1, 1) : [ 0.03061993 0.97196112]<br /></pre>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-78263063104989290602016-05-22T20:11:00.000+02:002016-07-29T10:41:21.176+02:00Making your own deep learning workstation: From Windows to Ubuntu dual boot with CUDA and TheanoI recently managed to turn my Windows 7 gaming PC into a dual boot GPU enabled deep learning workstation which I use for deep learning experiments. Here is how I did it.<br /><br />First of all, you'll want to use Theano as a deep learning framework which automatically optimises your code to use the GPU, the fast parallel processor on your graphics card, which makes deep learning faster (twice as fast on my computer). Unfortunately this seems to be difficult to do on Windows, but easy to do on Linux; so I created a partition for Ubuntu Linux (splitting a hard disk into two separate spaces which can contain two different operating systems) and made my computer dual boot (so that I can choose whether to use Windows or Ubuntu when my computer is switched on). It seems that the most compatible Ubuntu version which most tutorials focus on is Ubuntu 14.04, which is what I installed. I then installed CUDA, which is the NVIDIA graphic card driver that allows you to write programs that use the GPU. After installing this, I installed Theano together with the even easier specialisation framework Keras which lets you do deep learning with only a few lines of Python code.<br /><br />Be warned that I had to reinstall Ubuntu multiple times because I followed different tutorials which didn't work. Here I will link to the tutorials which worked for me.<br /><br /><div class="sectitle">Create a partition for Ubuntu</div><a href="http://www.howtogeek.com/101862/how-to-manage-partitions-on-windows-without-downloading-any-other-software/">Tutorial</a><br />The first step is to create a partition in your C: drive in which to install Ubuntu. I made a partition of 100GB using the Windows Disk Management tool. Follow the instructions in the tutorial.<br /><br /><div class="sectitle">Install Ubuntu 14.04 into the partition (yes, <a href="https://www.facebook.com/groups/DeepLearnng/permalink/1814791028754547/">it is recommended to use 14.04</a>), together with making the computer dual boot</div><a href="http://blog.nold.ca/2010/09/adding-ubuntu-to-windows-7-bootloader.html">Tutorial</a><br />Follow the instructions in the tutorial to use BCDEdit, the Windows boot manager tool, to make it possible to choose whether to use Windows or Ubuntu. Stop at step 5.<br /><br />To continue past step 5, download the Ubuntu 14.04 ISO file from <a href="http://releases.ubuntu.com/trusty/">http://releases.ubuntu.com/trusty/</a>. I choose the 64-bit PC (AMD64) desktop image but you might need the 32-bit one instead. After downloading it, burn it into an empty DVD, restart your computer, and pop in the DVD before it starts booting. Assuming your boot priority list has the DVD drive before the hard disk (changing this if it's not the case depends on your BIOS and mother card but it usually requires pressing F12 when the computer is starting up), you should be asked whether to install or try Ubuntu. Choose to try Ubuntu and once the desktop is loaded start the installation application.<br /><br />At this point you're on step 6 of the tutorial. Be sure to choose "something else" when asked how to install Ubuntu (whether it's along side your other operating system or remove whatever you already have in your computer). Create a 1GB partition for swap space and another partition with the remainder for the available unformatted space for Ubuntu. You can find instructions on this at <a href="http://askubuntu.com/questions/343268/how-to-use-manual-partitioning-during-installation/521195#521195">http://askubuntu.com/questions/343268/how-to-use-manual-partitioning-during-installation/521195#521195</a>. <span style="font-weight:bold;">It's very important that the device for boot loader is set to be the same partition where Ubuntu will be installed, otherwise Windows will not show when you turn on your computer.</span> Do not restart, but choose to continue testing the Live CD instead.<br /><br />You are now in step 8 in the tutorial. This is where you make it possible to select Ubuntu when you switch on your computer. Here is a list of steps you need:<br /><ol><li>First, you need to mount the Windows and Ubuntu partitions to make them accessible. In Ubuntu there is no "My Computer" like in Windows where you can see all the partitions. Instead you have the drives and partitions shown on the side. Click on the ones whose size matches those of the Windows C: drive and the new Ubuntu partition (after clicking on them they will be opened so you can check their contents). This will to mount the partitions.</li><li>Next you need to know the partition name of the Ubuntu partition. Click on the Ubuntu icon on the side and search for a program called GParted. Open the program and use the drop down list in the top corner to look for the partition you created for Ubuntu (selecting different choices in the drop down list will show you partitions in different physical hard disks you have). Take note of the name of the Ubuntu partition such as "/dev/sda2". Call this name [Ubuntu partition].</li><li>Next you need to know the directory path to the Windows partition. In an open folder window (or go on the Files icon in the side), click on "Computer" on the side, go in "media", enter into the "ubuntu" folder and there you will find all the mounted partitions. Find the Windows partition. Call the directory path to this partition icon [Windows partition] (/media/ubuntu/abcdef-123456).</li><li>Open the terminal (ctrl+alt+T) and then just copy the following line into the terminal. Where it says '[Windows parition]' just drag and drop the partition icon into the terminal and the whole directory will be written for you.<br /><pre style="border-style:solid;">sudo dd if=[Ubuntu partition] of='[Windows partition]/ubuntu.bin' bs=512 count=1</pre></li></ol><br />Now you can restart, remove the DVD, and log into your newly installed Ubuntu operating system, update it, and do another restart.<br /><br /><div class="sectitle">Installing CUDA</div><a href="http://www.r-tutor.com/gpu-computing/cuda-installation/cuda7.5-ubuntu">Tutorial</a><br />This is the part that took the most tries since only one tutorial actually worked well and gave no errors. Follow the instructions in the tutorial. When downloading the CUDA driver be sure to download the .deb file.<br /><br /><div class="sectitle">Installing Theano with dependencies</div><a href="http://deeplearning.net/software/theano/install_ubuntu.html">Tutorial</a><br />Now comes the straight forward part. Unfortunately it's easier to just use Python 2 than to try to make things work with Python 3. The <a href="https://github.com/fchollet/keras/tree/master/examples">examples in Keras</a> are for Python 2 anyway. Just open the terminal and paste:<br /><pre style="border-style:solid;">sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git</pre><br />However also install an extra package which will let you test that Theano was installed correctly:<br /><pre style="border-style:solid;">sudo pip install nose-parameterized</pre><br />Now you can install theano and keras.<br /><pre style="border-style:solid;">sudo pip install Theano<br />sudo pip install keras<br /></pre><br />Now whenever you want to run a theano or keras python script and use the GPU all you have to do is write the following in the terminal:<br /><pre style="border-style:solid;">THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python [python script]</pre><br />You may also want to use Python IDLE in order to experiment with theano, in which case install it like this:<br /><pre style="border-style:solid;">sudo apt-get install idle</pre>and then run it to use the GPU like this:<br /><pre style="border-style:solid;">THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 idle</pre>or do this to debug an error in your program:<br /><pre style="border-style:solid;">THEANO_FLAGS=mode=FAST_COMPILE,device=cpu,floatX=float32 idle</pre><br />I also recommend this good <a href="http://www.marekrei.com/blog/theano-tutorial/">theano tutorial</a> to get you started.<br /><br />Has everything worked? Leave a comment if you think I should add something else to this tutorial.mtantinoreply@blogger.com2tag:blogger.com,1999:blog-4318944459823143473.post-54274972861186778512016-04-17T16:40:00.000+02:002016-04-17T16:45:31.314+02:00The Kullback–Leibler divergenceIn a text file, each character is represented by a number of bits. Usually each character consists of 8 bits. But in data compression and information theory we learn that the number of bits need not be the same for each character. The most frequent characters can be given the fewest number of bits whilst the least frequent characters can be given many bits. According to <a href="https://en.wikipedia.org/wiki/Information_theory">Shannon's information theory</a>, the <a href="https://en.wikipedia.org/wiki/Self-information">smallest number of bits</a> needed to represent a character that occurs with probability "p" is<br /><pre>-log2(p)</pre><br />This means that if a character occurs half the time in a document, then you need -log2(0.5) bits which equals 1. On the other hand a character that occurs an eighth of the time should be represented with -log2(0.125) bits which equals 3. The rarer the character, the more bits should be assigned to it in order to be able to use less bits for the more frequent characters. This will result in compressing the whole document.<br /><br />The <a href="https://en.wikipedia.org/wiki/Entropy_%28information_theory%29">entropy</a> of a document is the expected number of bits per character, that is, the average number of bits in a document. If the number of bits is minimum, as described above, the expected number of bits is<br /><pre>-sum(p log2(p) for all character probabilities p)</pre><br />Now that we know what entropy is, we can understand what the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a> is. Assume that the number of bits for each character is calculated based on a different document than the one being compressed. Clearly this is a recipe for disaster, but you might want to compute an average probability for each character once based on a representative corpus and then always use these probabilities in all documents, which saves time. By how much will the probabilities diverge? This is what the Kullback-Leibler divergence is used for.<br /><br />What we'll do is we'll calculate the difference in the average number of bits per character when using the wrong probabilities instead of the correct ones. This is done as follows:<br /><pre>-sum(p log2(q)) - -sum(p log2(p))<br /></pre><br />Notice how the correct probability for a particular character is "p" but in the first term we're using a different probability "q". This can now be simplified as follows:<br /><pre>-sum(p log2(q)) - -sum(p log2(p))<br />sum(p log2(p)) - sum(p log2(q))<br />sum(p log2(p) - p log2(q))<br />sum(p(log2(p) - log2(q)))<br />sum(p log2(p/q))<br /></pre><br />And this is the Kullback-Leibler divergence between the probabilities that "p" comes from and the probabilities that "q" comes from. It is usually used to measure the difference between two probability distributions.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-83912832685478195682016-03-14T07:28:00.000+01:002016-03-14T09:44:41.640+01:00Displaying the digits of pi in your web browserHappy <a href="http://www.piday.org/">pi day</a>. Here's how you can display as many digits of pi as you want (subject to your computer's resources).<br /><br />We're going to be using <a href="http://www.cs.ox.ac.uk/people/jeremy.gibbons/publications/spigot.pdf">Gibbon's unbounded spigot algorithm</a> which spits out consecutive digits of pi in decimal without iterative approximations. It's really cool when you see it in action.<br /><br />In the above linked paper, the algorithm is given in Haskell. I won't explain the mathematical derivation in this post, but here is a translation into Python 3:<br /><pre class="brush:python">def gibbons():<br /> (q, r, t, k, n, l) = (1, 0, 1, 1, 3, 3)<br /> while True:<br /> if 4*q + r - t < n*t:<br /> yield n<br /> (q, r, n) = (<br /> 10*q,<br /> 10*(r - n*t),<br /> 10*(3*q + r)//t - 10*n<br /> )<br /> else:<br /> (q, r, t, k, n, l) = (<br /> q*k,<br /> (2*q + r)*l,<br /> t*l,<br /> k+1,<br /> (q*(7*k + 2) + r*l)//(t*l),<br /> l + 2<br /> )<br /></pre>To use this code run this: <pre class="brush:python">digit_seq = gibbons()<br />print(next(digit_seq))<br />print(next(digit_seq))<br />print(next(digit_seq))<br />print(next(digit_seq))<br /></pre>This will output <pre>3<br />1<br />4<br />1<br /></pre>If you want it to print out a thousand digits, do this: <pre class="brush:python">digit_seq = gibbons()<br />for (position, digit) in enumerate(digit_seq):<br /> if position == 1000:<br /> break<br /> print(digit, end='')<br /></pre>Now this code will let you print as many digits as you want; however being in Python it's not something you can just tell your non-geek friend to try out. So for pi day, here's a javascript implementation that anyone can just paste in their browser's web console and get a nice stream of pi digits. One thing you should keep in mind is that in Python you can have numbers which are as big as you like. Javascript is not like that, so we'll need to use an external library called <a href="https://github.com/MikeMcl/bignumber.js/">bignumber.js</a>. This is the full code we'll be using to generate the digits: <pre class="brush:javascript">var script = document.createElement('script');<br />script.src = "https://cdnjs.cloudflare.com/ajax/libs/bignumber.js/2.3.0/bignumber.min.js";<br />document.getElementsByTagName('head')[0].appendChild(script);<br /><br />function gibbons() {<br /> var q=new BigNumber(1), r=new BigNumber(0), t=new BigNumber(1), k=new BigNumber(1), n=new BigNumber(3), l=new BigNumber(3);<br /> function f() {<br /> while (true) {<br /> if (q.mul(4).add(r).sub(t).lt(n.mul(t))) {<br /> var digit = n.toString();<br /> var q_ = q.mul(10);<br /> var r_ = r.sub(n.mul(t)).mul(10);<br /> var n_ = q.mul(3).add(r).mul(10).divToInt(t).sub(n.mul(10));<br /> q=q_, r=r_, n=n_;<br /> return digit;<br /> }<br /> else {<br /> var q_ = q.mul(k);<br /> var r_ = q.mul(2).add(r).mul(l);<br /> var t_ = t.mul(l);<br /> var k_ = k.add(1);<br /> var n_ = (q.mul(k.mul(7).add(2)).add(r.mul(l))).divToInt(t.mul(l));<br /> var l_ = l.add(2);<br /> q=q_, r=r_, t=t_, k=k_, n=n_, l=l_;<br /> }<br /> }<br /> }<br /> return f;<br />}<br /><br />function show_digits(max_num_digits) {<br /> var get_digit = gibbons();<br /> var num_digits = 0;<br /> function f() {<br /> var digits = '';<br /> for (var i = 1; i <= 50; i++) {<br /> digits += get_digit();<br /> num_digits++;<br /> if (num_digits >= max_num_digits) {<br /> break;<br /> }<br /> }<br /> console.log(digits);<br /> if (num_digits < max_num_digits) {<br /> setTimeout(f, 100);<br /> }<br /> else {<br /> console.log('\n'+max_num_digits+' digits of pi displayed!');<br /> }<br /> }<br /> f();<br />}<br /><br />show_digits(1000);<br /></pre> The first line is to import the bignumber library in a way that works in the web console. After that there are two functions. The first is for the Gibbons algorithm which lets us get the digits of pi. The way it works is by returning a function which returns the next digit each time it is called. The second function outputs 50 digits of pi every 100 milliseconds. <a href="#code2copypaste"><div class="sectitle" id="code2copypaste">The code to copy-paste</div></a> But this is not something you'll send you friend. Instead send them this minified code (of course you shouldn't trust obscure code sent by strangers as it might be evil code, but you can minify the above code youself if you don't trust me). <pre class="brush:javascript" style="border:5px solid black;">var script=document.createElement("script");script.src="https://cdnjs.cloudflare.com/ajax/libs/bignumber.js/2.3.0/bignumber.min.js",document.getElementsByTagName("head")[0].appendChild(script);<br /><br />function gibbons(){function u(){for(;;){if(n.mul(4).add(i).sub(m).lt(d.mul(m))){var u=d.toString(),e=n.mul(10),r=i.sub(d.mul(m)).mul(10),t=n.mul(3).add(i).mul(10).divToInt(m).sub(d.mul(10));return n=e,i=r,d=t,u}var e=n.mul(l),r=n.mul(2).add(i).mul(o),a=m.mul(o),b=l.add(1),t=n.mul(l.mul(7).add(2)).add(i.mul(o)).divToInt(m.mul(o)),g=o.add(2);n=e,i=r,m=a,l=b,d=t,o=g}}var n=new BigNumber(1),i=new BigNumber(0),m=new BigNumber(1),l=new BigNumber(1),d=new BigNumber(3),o=new BigNumber(3);return u}function show_digits(u){function n(){for(var l="",d=1;50>=d&&(l+=i(),m++,!(m>=u));d++);console.log(l),u>m?setTimeout(n,100):console.log("\n"+u+" digits of pi displayed!")}var i=gibbons(),m=0;n()}<br /><br />show_digits(1000);<br /></pre><br />What you have to do is open your browser's web console on any web page by pressing F12 (even a blank tab works, although Facebook will block it), then paste the first line of code at the bottom. After pressing enter, paste the second line of code. Finally paste the "show_digits(1000);" line and press enter. You'll see a stream of digits displayed in the console panel. I think this is a nice way to celebrate pi day. Maybe someone else can make the digits somehow interact with the web page you are on instead of just display them in the console. Until then, enjoy your pie today!mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-64957871532796167152016-02-28T12:23:00.000+01:002016-05-10T09:43:35.489+02:00Gradient Descent Algorithm on Artificial Neural Networks (ANNs)<a href="https://en.wikipedia.org/wiki/Polynomial">Polynomials</a> are great. They're a simple mathematical equation pattern which is simple and easy to work with. You basically have a number of constants "a", "b", "c", etc. which you use in the following pattern:<br />f(x) = a x^0 + b x^1 + c x^2 + ...<br /><br />By changing the values of the constants you change the shape of the graph of f(x). You can design a particular graph shape by choosing the values of these constants. In fact, one of the most interesting properties of polynomials is that they can be easily designed so that their graph passes exactly through a given set of points. For example, you can choose the constants of the polynomial to make f(1) = 2, f(3) = -1, f(-1.5) = 0.5, etc. This is very easy to make using <a href="https://en.wikipedia.org/wiki/Lagrange_polynomial#Examples">Lagrange polynomials</a>. The constants of the polynomial can be seen as parameters which can be optimized in order to get a desired function.<br /><br />Polynomials are not the only equation which is designed to be easily modifiable. One of the most well known modifiable equation patterns are <a href="https://en.wikipedia.org/wiki/Artificial_neural_network">Artificial Neural Networks</a>. Although neural networks are inspired by the neurons in the brain and how they change when learning to perform new tasks, the concept behind neural networks is that of having an equation that, by design, can be easily modified to turn chosen inputs into chosen outputs. They can take any number of inputs and outputs and they usually only work on binary numbers (which is great for computers). They have been used for lots of things such as recognising objects in images and classifying text into meaningful categories. Neural networks are known to generalise a set of input-output mappings into useful functions that work well for new inputs. For example, if you design a neural network to map 100 particular pictures of dogs to the output "1" and another 100 particular pictures of cats to the output "0", then the network will start giving the expected output for new pictures of dogs and cats.<br /><br /><div class="sectitle">Neural network definition</div><br />Whereas the building block of the polynomial is the term "k x^i", the building block of the neural network is the neuron. A neuron takes in a number of signals from other neurons, weighs the signals by their importance by making them weaker or stronger, adds them together and if the sum is over a particular threshold, then the neuron will output its own signal to some other neurons. Here is a diagram of this:<br /><a href="https://4.bp.blogspot.com/-I5HuIGLaOAE/Vs1URRSUPfI/AAAAAAAAAqo/Ot69kJjiqJo/s1600/neuron.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-I5HuIGLaOAE/Vs1URRSUPfI/AAAAAAAAAqo/Ot69kJjiqJo/s320/neuron.png" /></a><br /><br />Neurons are organized into layers, where one layer of neurons supplies input signals to the next layer of neurons. In mathematical terms, a neuron is defined as follows:<br /><a href="https://1.bp.blogspot.com/-quQRaHii8f0/Vs1amleI0vI/AAAAAAAAArc/4yWS19-wZE8/s1600/neuron%2Bdef.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-quQRaHii8f0/Vs1amleI0vI/AAAAAAAAArc/4yWS19-wZE8/s1600/neuron%2Bdef.png" /></a><br /><br /><a href="https://en.wikipedia.org/wiki/Activation_function">Activation functions</a> are functions that sharply change value when the input is over a certain threshold. Traditionally, neural networks use the sigmoid function given above which has a graph like this:<br /><a href="https://1.bp.blogspot.com/-tv4UNhzlXdY/Vs1WvuymqmI/AAAAAAAAArA/COJrBCwyaHk/s1600/sigmoid.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-tv4UNhzlXdY/Vs1WvuymqmI/AAAAAAAAArA/COJrBCwyaHk/s320/sigmoid.png" /></a><br /><br />See how the value changes from 0 to 1 after passing x = 0? This is what will make the neuron output a signal if the weighted sum is over a threshold. Left as is, the threshold is 0; but we can change the threshold by changing the value of the bias in order to shift the graph to the left or right. The weighting of input signals by importance is done by multiplying the input by a weight, which can be a large number, a small fraction, or even a negative number.<br /><br />Just like you can modify polynomial functions by changing their constants, you can modify neural network functions by changing their weights and biases. Further down we'll see how to do this.<br /><br /><div class="sectitle">Neural network example</div><br />Given the above definitions, here is a diagram of an example of a neural network:<br /><a href="https://3.bp.blogspot.com/-WxJHhsqeMHg/Vs1YVBDs0ZI/AAAAAAAAArM/s-MEC423LjI/s1600/neuralnet.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-WxJHhsqeMHg/Vs1YVBDs0ZI/AAAAAAAAArM/s-MEC423LjI/s1600/neuralnet.png" /></a><br /><br />We shall be referring to this diagram throughout this blog post. Notice the following things about it:<br /><ul><li>Its 2 inputs are called "x0" and "x1" whilst its 2 outputs are called "y0" and "y1".</li><li>Since there are two outputs, the output of the whole network is a column vector of two numbers.</li><li>The first layer of neurons are square because they propagate the "x" signals to the next layer unmodified. The rest of the neurons work as explained.</li><li>Each circle neuron receives input signals from all neurons in the previous layer.</li><li>Apart from input signals, each circle neuron also accepts a bias signal that is always the same.</li></ul><br />This diagram is just a visual representation of the equation shown below. The equation is derived layer by layer from the output. Notice how the equation is a column vector of two numbers since there are two outputs.<br /><br /><a href="https://2.bp.blogspot.com/-R4BqhXaMd38/VtIO1rTSM6I/AAAAAAAAAvo/zn6kuPvGFfw/s1600/neuralnet%2Bequation.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-R4BqhXaMd38/VtIO1rTSM6I/AAAAAAAAAvo/zn6kuPvGFfw/s1600/neuralnet%2Bequation.png" /></a><br /><br />Notice that we are representing the neural network using normal algebra instead of using linear algebra with matrices as usual. I believe that this makes the example easier to understand. Linear algebra is useful for generalising the concepts to any neural network shape, but for understand one network example we'll stick to this expanded representation.<br /><br /><div class="sectitle">Neural network modification</div><br />So now we know how to create a neural network equation, but how do we choose its weights and biases in order to get the function we want? In the <a href="http://geekyisawesome.blogspot.com.mt/2016/01/gradient-descent.html">previous blog post</a>, we talked about how to use the gradient descent algorithm in order to numerically find the minimum of an equation when solving the equation algebraically is difficult. In fact, one of the most well known uses of the gradient descent algorithm is for "training" neural networks into performing a desired function, that is, giving desired outputs from particular inputs.<br /><br />Usually neural networks are trained using the <a href="https://en.wikipedia.org/wiki/Backpropagation">backpropagation algorithm</a> which is based on the gradient descent algorithm. However we'll see how to do the training using plain gradient descent which will help us understand the backpropagation algorithm in a later blog post.<br /><br />What we'll do is create a "cost" function that quantifies how close the outputs that the network is giving are to the desired outputs. Suppose that we want the above neural network to output (0,1) when given (0,0) and (1,0) when given (1,1). The cost function will measure how far off the neural network is from actually giving these outputs. If the cost function gives 0, then the network is giving exactly these desired outputs, otherwise it will be larger than zero. Different weights and biases will make the cost function give different values and our job is to find the combination of weights and biases that will give a cost of zero. We will do this using the gradient descent algorithm.<br /><br />There are many possible cost functions, but traditionally, this is how we define the cost function:<br /><br /><a href="https://1.bp.blogspot.com/-_HKkT2lC2AI/Vs6wQAwfF3I/AAAAAAAAAr4/ZHZIQREDuQw/s1600/cost%2Bfunction.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-_HKkT2lC2AI/Vs6wQAwfF3I/AAAAAAAAAr4/ZHZIQREDuQw/s1600/cost%2Bfunction.png" /></a><br /><br />Notice that O(X,W,B) = Y.<br /><br />The cost function here is measuring how far the actual output is from the target output by calculating the <a href="https://en.wikipedia.org/wiki/Mean_squared_error">mean squared error</a>. For convenience later on we're also dividing the equation by 2.<br /><br />For the above network, the cost function applies as follows:<br /><a href="https://4.bp.blogspot.com/-3s5QArne_2M/VtF0UPVbDDI/AAAAAAAAAsM/N0Naq1SJ-po/s1600/applied%2Bcost%2Bfunction.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-3s5QArne_2M/VtF0UPVbDDI/AAAAAAAAAsM/N0Naq1SJ-po/s1600/applied%2Bcost%2Bfunction.png" /></a><br /><br />Now that we have a continuous equation that quantifies how far from the desired function our neural network's weights and biases are, we can use gradient descent in order to optimize the weights and biases into giving the desired function. To do that we need to find the partial derivative of the network's equation with respect to every weight and bias. Before we do so, notice that the derivative of the activation function is as follows:<br /><a href="https://4.bp.blogspot.com/-Sj6M0ze14PU/VtHdE9PptLI/AAAAAAAAAuY/uLHqfqsCL6U/s1600/a%2Bderiv.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-Sj6M0ze14PU/VtHdE9PptLI/AAAAAAAAAuY/uLHqfqsCL6U/s1600/a%2Bderiv.png" /></a><br /><br />The above equation is saying that the derivative of a neuron's output is defined using the neuron's output (the output multiplied by one minus the output). This is a useful speed optimization in sigmoid activation functions.<br /><br />We shall now find the partial derivative of two weights and two biases. You can then find the partial derivatives of every other weight and bias yourself. Make sure that you know how the <a href="https://www.math.hmc.edu/calculus/tutorials/chainrule/">chain rule</a> in differentiation works. As you follow the derivations, keep in mind that all the "y"s and "n"s are actually functions of "B", "W", and "X" just like "O", but we'll leave the "(X,W,B)" out in order to save space.<br /><br /><div class="subsectitle">Partial derivative with respect to weight in layer 3 (output layer)</div><br />Let's find the partial derivative of one of the weights in layer 3 (output layer).<br /><a href="https://3.bp.blogspot.com/-nXLDiiMIXLA/Vxv71xJ0InI/AAAAAAAAAyQ/OQhWR6WOnVwp3AhaBmHIgSw3NQDmrH_HgCLcB/s1600/netderv_w3_1.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-nXLDiiMIXLA/Vxv71xJ0InI/AAAAAAAAAyQ/OQhWR6WOnVwp3AhaBmHIgSw3NQDmrH_HgCLcB/s1600/netderv_w3_1.png" /></a><br /><br />Notice how the 2 in the denominator was cancelled with the 2s in the numerator. This is why we put it in the denominator of the cost function. Notice also that the summation does not change anything in the derivative since the derivative of a summation is the summation of the summed terms' derivative.<br /><br />Now we'll find the derivative of each "y" separately.<br /><br /><a href="https://3.bp.blogspot.com/-kFOxNTdz0QQ/Vxv8DWNMMiI/AAAAAAAAAyY/yFcuJv5C3nEBqjfON6IrvbdE0BJ1LiTXwCLcB/s1600/netderv_w3_2.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-kFOxNTdz0QQ/Vxv8DWNMMiI/AAAAAAAAAyY/yFcuJv5C3nEBqjfON6IrvbdE0BJ1LiTXwCLcB/s1600/netderv_w3_2.png" /></a><br /><br />Notice that we stopped decomposing neurons past layer 3 since we know that the weight we are deriving with respect to will not lie in deeper layers.<br /><br />Finally we plug these back into the original equation and we have our partial derivative:<br /><a href="https://4.bp.blogspot.com/-gCtCzNx2VAY/Vxv8KaVOUgI/AAAAAAAAAyc/rHYP6TWVuG0-DU_nG7KDQ9aMd_d-QPMyACLcB/s1600/netderv_w3_3.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-gCtCzNx2VAY/Vxv8KaVOUgI/AAAAAAAAAyc/rHYP6TWVuG0-DU_nG7KDQ9aMd_d-QPMyACLcB/s1600/netderv_w3_3.png" /></a><br /><br />It's important to notice that the derivative uses the output of the network's neurons. This is important, since it means that before finding the derivative we need to first find the output of the network. Notice also that the green "n" at layer 3 is "y0" but we won't replace it with "y0" in order to preserve a pattern that is revealed later.<br /><br /><div class="subsectitle">Partial derivative with respect to bias in layer 3 (output layer)</div><br />Let's find the partial derivative of one of the biases in layer 3 (output layer) now. We'll skip some steps that were shown in the previous derivations.<br /><a href="https://1.bp.blogspot.com/-YfrReoV97m0/Vxv8TAyfGzI/AAAAAAAAAyg/5rFYE0s4KI4sAGGYOiqjlXvUFDKq_TxzACLcB/s1600/netderv_b3_1.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-YfrReoV97m0/Vxv8TAyfGzI/AAAAAAAAAyg/5rFYE0s4KI4sAGGYOiqjlXvUFDKq_TxzACLcB/s1600/netderv_b3_1.png" /></a><br /><br />And now we find the derivatives of the "y"s separately.<br /><a href="https://3.bp.blogspot.com/-pfGpf4PuTpE/Vxv8a5mLb8I/AAAAAAAAAyk/VxmDv4YgUNYxNfC33ZWXYa1k1GSZLIZWQCLcB/s1600/netderv_b3_2.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-pfGpf4PuTpE/Vxv8a5mLb8I/AAAAAAAAAyk/VxmDv4YgUNYxNfC33ZWXYa1k1GSZLIZWQCLcB/s1600/netderv_b3_2.png" /></a><br /><br />And we now plug them back into the original equation.<br /><a href="https://1.bp.blogspot.com/-DkLIbofqHUs/Vxv8lAY0IpI/AAAAAAAAAyo/TJj5_AYho8sExZOV2p7H4ZBn9TARQvhZwCLcB/s1600/netderv_b3_3.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-DkLIbofqHUs/Vxv8lAY0IpI/AAAAAAAAAyo/TJj5_AYho8sExZOV2p7H4ZBn9TARQvhZwCLcB/s1600/netderv_b3_3.png" /></a><br /><br /><div class="subsectitle">Partial derivative with respect to weight in layer 2</div><br />Let's find the partial derivative of one of the weights in layer 2 now. We'll skip some steps that were shown in the previous derivations.<br /><a href="https://4.bp.blogspot.com/-DKFNZFgZnHU/Vxv-YTiKzmI/AAAAAAAAAzc/Qu6FTeAX2IUBONL_Uk9kenhTG51FRLuFgCLcB/s1600/netderv_w2_1.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-DKFNZFgZnHU/Vxv-YTiKzmI/AAAAAAAAAzc/Qu6FTeAX2IUBONL_Uk9kenhTG51FRLuFgCLcB/s1600/netderv_w2_1.png" /></a><br /><br />And now we find the derivatives of the "y"s separately.<br /><a href="https://2.bp.blogspot.com/-u2neDiCzfRo/Vxv-cpeyHII/AAAAAAAAAzg/GpHHn_tFnncUs0KuPRftJZfvHV-Tt0blACLcB/s1600/netderv_w2_2.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-u2neDiCzfRo/Vxv-cpeyHII/AAAAAAAAAzg/GpHHn_tFnncUs0KuPRftJZfvHV-Tt0blACLcB/s1600/netderv_w2_2.png" /></a><br /><br />And we now plug them back into the original equation.<br /><a href="https://4.bp.blogspot.com/-V7KmtYAjEpM/Vxv-imAWUII/AAAAAAAAAzo/auxpYiQdH_8vjrRk2IJUy3O9HxbKjVv3wCLcB/s1600/netderv_w2_3.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-V7KmtYAjEpM/Vxv-imAWUII/AAAAAAAAAzo/auxpYiQdH_8vjrRk2IJUy3O9HxbKjVv3wCLcB/s1600/netderv_w2_3.png" /></a><br /><br /><div class="subsectitle">Partial derivative with respect to bias in layer 2</div><br />Let's find the partial derivative of one of the biases in layer 2 now. We'll skip some steps that were shown in the previous derivations.<br /><a href="https://2.bp.blogspot.com/-5fb5xLJc7kw/Vxv-sMOIKvI/AAAAAAAAAzs/pFOiy3qjFQQcMAzdDcq4zLpQC27dQQZYACLcB/s1600/netderv_b2_1.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-5fb5xLJc7kw/Vxv-sMOIKvI/AAAAAAAAAzs/pFOiy3qjFQQcMAzdDcq4zLpQC27dQQZYACLcB/s1600/netderv_b2_1.png" /></a><br /><br />And now we find the derivatives of the "y"s separately.<br /><a href="https://1.bp.blogspot.com/-2DpQACbjvqE/Vxv-wW0s5hI/AAAAAAAAAzw/A-s0YdsDk00GKVeJA5XJ72M9EpNw2pTZgCLcB/s1600/netderv_b2_2.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-2DpQACbjvqE/Vxv-wW0s5hI/AAAAAAAAAzw/A-s0YdsDk00GKVeJA5XJ72M9EpNw2pTZgCLcB/s1600/netderv_b2_2.png" /></a><br /><br />And we now plug them back into the original equation.<br /><a href="https://4.bp.blogspot.com/-7aQE2zz2K3A/Vxv-2CtrVpI/AAAAAAAAAz0/rpFNH1gm4pwUn5cDIxZehG3h6alIcMJkgCLcB/s1600/netderv_b2_3.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-7aQE2zz2K3A/Vxv-2CtrVpI/AAAAAAAAAz0/rpFNH1gm4pwUn5cDIxZehG3h6alIcMJkgCLcB/s1600/netderv_b2_3.png" /></a><br /><br /><div class="subsectitle">Partial derivative with respect to weight in layer 1</div><br />Let's find the partial derivative of one of the weights in layer 1 now. We'll skip some steps that were shown in the previous derivations.<br /><a href="https://1.bp.blogspot.com/-jPExi6fCJXQ/Vxv9XQFGs5I/AAAAAAAAAy4/VueB_DQDipM1X7FHlv4kEikuaPazjYbtgCLcB/s1600/netderv_w1_1.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-jPExi6fCJXQ/Vxv9XQFGs5I/AAAAAAAAAy4/VueB_DQDipM1X7FHlv4kEikuaPazjYbtgCLcB/s1600/netderv_w1_1.png" /></a><br /><br />And now we find the derivatives of the "y"s separately.<br /><a href="https://2.bp.blogspot.com/-B6PzVib_nis/Vxv9ct3_tqI/AAAAAAAAAy8/V6fIx6fCiv869IlWnOsZQpRkfi8RcpxLACLcB/s1600/netderv_w1_2.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-B6PzVib_nis/Vxv9ct3_tqI/AAAAAAAAAy8/V6fIx6fCiv869IlWnOsZQpRkfi8RcpxLACLcB/s1600/netderv_w1_2.png" /></a><br /><br />And we now plug them back into the original equation.<br /><a href="https://2.bp.blogspot.com/-HVyNYYdfTB8/Vxv9nW8BPiI/AAAAAAAAAzA/ULwcDORhCeI84E203drZTqWkETVRyTQsACLcB/s1600/netderv_w1_3.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-HVyNYYdfTB8/Vxv9nW8BPiI/AAAAAAAAAzA/ULwcDORhCeI84E203drZTqWkETVRyTQsACLcB/s1600/netderv_w1_3.png" /></a><br /><br />Notice that we did not replace the black "n" at layer 0 with "x" in order to preserve a pattern that is revealed later.<br /><br /><div class="subsectitle">Partial derivative with respect to bias in layer 1</div><br />Let's find the partial derivative of one of the biases in layer 1 now. We'll skip some steps that were shown in the previous derivations.<br /><a href="https://2.bp.blogspot.com/-vEXKqGsjQk0/Vxv93GRlMgI/AAAAAAAAAzI/VjL8kUnRG9sPd0BhHebnSPQLfGXDMyoEwCLcB/s1600/netderv_b1_1.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-vEXKqGsjQk0/Vxv93GRlMgI/AAAAAAAAAzI/VjL8kUnRG9sPd0BhHebnSPQLfGXDMyoEwCLcB/s1600/netderv_b1_1.png" /></a><br /><br />And now we find the derivatives of the "y"s separately.<br /><a href="https://1.bp.blogspot.com/-hrxcpJXP-_E/Vxv98D3gJzI/AAAAAAAAAzM/83yiO7TeFEk7TGw8KzSCFFe4YXhwbk-5ACLcB/s1600/netderv_b1_2.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-hrxcpJXP-_E/Vxv98D3gJzI/AAAAAAAAAzM/83yiO7TeFEk7TGw8KzSCFFe4YXhwbk-5ACLcB/s1600/netderv_b1_2.png" /></a><br /><br />And we now plug them back into the original equation.<br /><a href="https://4.bp.blogspot.com/-0-R7NZmAQGA/Vxv9_3rtseI/AAAAAAAAAzQ/FJldSA49vpcWbflO57Boh4D8osex51uJQCLcB/s1600/netderv_b1_3.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-0-R7NZmAQGA/Vxv9_3rtseI/AAAAAAAAAzQ/FJldSA49vpcWbflO57Boh4D8osex51uJQCLcB/s1600/netderv_b1_3.png" /></a><br /><br /><div class="subsectitle">Patterns</div>There is a pattern in the derivatives. In order to see the pattern, let's look at a part of the equation inside the summations of all the derivatives and compare them.<br /><br />Here is the repeating pattern in the derivation for the weights:<br /><a href="https://4.bp.blogspot.com/-oa-cOMK_Bds/Vxv9te6w2aI/AAAAAAAAAzE/KoFEYp89hnUzRMswXdssDUQQ0YdyrOfmACLcB/s1600/pattern_w.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-oa-cOMK_Bds/Vxv9te6w2aI/AAAAAAAAAzE/KoFEYp89hnUzRMswXdssDUQQ0YdyrOfmACLcB/s1600/pattern_w.png" /></a><br /><br />Here is the repeating pattern in the derivation for the biases:<br /><a href="https://4.bp.blogspot.com/-1W_FPSmJPPo/Vxv-GaEK6UI/AAAAAAAAAz4/pwsLuMgzmBM-a25NkGAFihDhIEocMEX4ACKgB/s1600/pattern_b.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-1W_FPSmJPPo/Vxv-GaEK6UI/AAAAAAAAAz4/pwsLuMgzmBM-a25NkGAFihDhIEocMEX4ACKgB/s1600/pattern_b.png" /></a><br /><br />See how there is a repeating pattern that gets longer as the layer of the weight we are deriving with respect to gets deeper into the network? This is an important pattern on which the backpropagation algorithm is based, which computes the gradient of a layer's weights using the gradients of the previous layer.<br /><br />Notice also that as more layers are added, the chain of multiplications gets longer. Since each number is a fraction between 0 and 1, the multiplications produce smaller and smaller numbers as the chain gets longer. This will make the gradients become smaller which makes training the layers at the back very slow. In fact this is a problem in multi-layered neural networks which is known as the <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">vanishing gradient problem</a>. It can be solved by using different activation functions such as the <a href="https://en.wikipedia.org/wiki/Rectifier_%28neural_networks%29">rectified linear unit</a>.<br /><br /><div class="subsectitle">In code</div>Now that we know how to find the partial derivative of every weight and bias, all we have to do is plug them in the gradient descent algorithm and minimize the cost function. When the cost function is minimized, the network will give us the desired outputs.<br /><br />Here is a complete code example in Python 3 of a gradient descent algorithm applied to train the above neural network to act as a half adder. A half adder adds together two binary digits and returns the sum and carry. So 0 + 1 in binary gives 1 carry 0 whilst 1 + 1 in binary gives 0 carry 1. One of the functions is called "ns" which gives the outputs of each neuron in the network given an input, grouped by layer. This will be called several times by the gradient functions, so to avoid recomputing the same values, it's output is automatically cached by Python using the lru_cache decorator. The initial values are set to random numbers between 1 and -1 since starting them off as all zeros as was done in the previous blog post will <a href="http://stackoverflow.com/questions/20027598/why-should-weights-of-neural-networks-be-initialized-to-random-numbers">prevent the gradient descent algorithm from working</a> in a neural network.<br /><br /><pre class="brush:python">import math<br />import random<br />from functools import lru_cache<br /><br />trainingset = { (0,0):(0,0), (0,1):(1,0), (1,0):(1,0), (1,1):(0,1) }<br /><br />def grad_desc(cost, gradients, initial_values, step_size, threshold):<br /> old_values = initial_values<br /> while True:<br /> new_values = [ value - step_size*gradient(*old_values) for (value, gradient) in zip(old_values, gradients) ]<br /> if cost(*new_values) < threshold:<br /> return new_values<br /> old_values = new_values<br /><br />def a(z):<br /> return 1/(1 + math.exp(-z))<br /><br />@lru_cache(maxsize=4)<br />def ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> (n00, n01) = ( x0, x1 )<br /> (n10, n11) = ( a(n00*w100 + n01*w110 + b10), a(n00*w101 + n01*w111 + b11) )<br /> (n20, n21) = ( a(n10*w200 + n11*w210 + b20), a(n10*w201 + n11*w211 + b21) )<br /> (n30, n31) = ( a(n20*w300 + n21*w310 + b30), a(n20*w301 + n21*w311 + b31) )<br /> return ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) )<br /><br />def out(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> return ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)[-1]<br /><br />def cost(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> (y0, y1) = out(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> tmp += (y0 - t0)**2 + (y1 - t1)**2<br /> return tmp/(2*len(trainingset))<br /><br />def dCdw300(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * n30*(1-n30) * n20<br /> return tmp/len(trainingset)<br /><br />def dCdw310(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * n30*(1-n30) * n21<br /> return tmp/len(trainingset)<br /><br />def dCdw301(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y1 - t1) * n31*(1-n31) * n20<br /> return tmp/len(trainingset)<br /><br />def dCdw311(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y1 - t1) * n31*(1-n31) * n21<br /> return tmp/len(trainingset)<br /><br />def dCdb30(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * n30*(1-n30)<br /> return tmp/len(trainingset)<br /><br />def dCdb31(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y1 - t1) * n31*(1-n31)<br /> return tmp/len(trainingset)<br /><br />def dCdw200(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*n10 ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*n10 ))<br /> return tmp/len(trainingset)<br /><br />def dCdw210(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*n11 ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*n11 ))<br /> return tmp/len(trainingset)<br /><br />def dCdw201(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w310*n21*(1-n21)*n10 ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w311*n21*(1-n21)*n10 ))<br /> return tmp/len(trainingset)<br /><br />def dCdw211(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w310*n21*(1-n21)*n11 ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w311*n21*(1-n21)*n11 ))<br /> return tmp/len(trainingset)<br /><br />def dCdb20(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20) ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20) ))<br /> return tmp/len(trainingset)<br /><br />def dCdb21(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w310*n21*(1-n21) ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w311*n21*(1-n21) ))<br /> return tmp/len(trainingset)<br /><br />def dCdw100(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w200*n10*(1-n10)*n00 + w310*n21*(1-n21)*w201*n10*(1-n10)*n00 ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w200*n10*(1-n10)*n00 + w311*n21*(1-n21)*w201*n10*(1-n10)*n00 ))<br /> return tmp/len(trainingset)<br /><br />def dCdw110(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w200*n10*(1-n10)*n01 + w310*n21*(1-n21)*w201*n10*(1-n10)*n01 ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w200*n10*(1-n10)*n01 + w311*n21*(1-n21)*w201*n10*(1-n10)*n01 ))<br /> return tmp/len(trainingset)<br /><br />def dCdw101(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w210*n11*(1-n11)*n00 + w310*n21*(1-n21)*w211*n11*(1-n11)*n00 ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w210*n11*(1-n11)*n00 + w311*n21*(1-n21)*w211*n11*(1-n11)*n00 ))<br /> return tmp/len(trainingset)<br /><br />def dCdw111(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w210*n11*(1-n11)*n01 + w310*n21*(1-n21)*w211*n11*(1-n11)*n01 ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w210*n11*(1-n11)*n01 + w311*n21*(1-n21)*w211*n11*(1-n11)*n01 ))<br /> return tmp/len(trainingset)<br /><br />def dCdb10(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w200*n10*(1-n10) + w310*n21*(1-n21)*w201*n10*(1-n10) ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w200*n10*(1-n10) + w311*n21*(1-n21)*w201*n10*(1-n10) ))<br /> return tmp/len(trainingset)<br /><br />def dCdb11(w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31):<br /> tmp = 0<br /> for ((x0,x1), (t0,t1)) in trainingset.items():<br /> ( (n00, n01), (n10, n11), (n20, n21), (n30, n31) ) = ns(x0, x1, w100, w101, w110, w111, b10, b11, w200, w201, w210, w211, b20, b21, w300, w301, w310, w311, b30, b31)<br /> (y0, y1) = (n30, n31)<br /> tmp += (y0 - t0) * ( n30*(1-n30) * ( w300*n20*(1-n20)*w210*n11*(1-n11) + w310*n21*(1-n21)*w211*n11*(1-n11) ))<br /> tmp += (y1 - t1) * ( n31*(1-n31) * ( w301*n20*(1-n20)*w210*n11*(1-n11) + w311*n21*(1-n21)*w211*n11*(1-n11) ))<br /> return tmp/len(trainingset)<br /><br />new_values = grad_desc(<br /> cost,<br /> [ dCdw100, dCdw101, dCdw110, dCdw111, dCdb10, dCdb11, dCdw200, dCdw201, dCdw210, dCdw211, dCdb20, dCdb21, dCdw300, dCdw301, dCdw310, dCdw311, dCdb30, dCdb31 ],<br /> [ 2*random.random()-1 for _ in range(18) ],<br /> 0.5,<br /> 1e-3<br />)<br />print('cost:', cost(*new_values))<br />print('output:')<br />for ((x0,x1), (t0,t1)) in trainingset.items():<br /> print(' ', (x0,x1), out(x0,x1,*new_values))<br /><br /></pre>Wait a few seconds and... <pre>cost: 0.0009999186077385778<br />output:<br /> (0, 1) (0.9625083358870274, 0.02072955340806345)<br /> (1, 0) (0.9649753465406843, 0.02060601544190889)<br /> (0, 0) (0.03678639641748687, 0.0051592625464787585)<br /> (1, 1) (0.04304611235264617, 0.9642249998318806)<br /></pre><div class="subsectitle">Conclusion</div>Of course this isn't a very convenient way to train neural networks since we need to create a different set of partial derivatives for every new network shape (number of neurons and layers). In fact, in a future blog post we'll see how to use the backpropagation algorithm which exploits patterns in this process in order to simplify the process. On the other hand, the backpropagation algorithm makes certain assumptions about the network, such as that every neuron in a layer is connected to every neuron in the previous layer. If we were to make no assumptions about the network then we'd end up using the bare bones method presented here.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-23272256115275849202016-01-23T09:33:00.000+01:002016-04-16T17:01:40.357+02:00Gradient Descent<div class="sectitle">Algebraic method: 1 variable</div><br />Say you have an equation like<br /><pre>y = 2x^2 - 3x + 2</pre>that you needed to find the minimum of:<br /><a href="http://2.bp.blogspot.com/-7_Xg33mtoPE/VpyINm2pFeI/AAAAAAAAAoQ/FTHQ2le0474/s1600/quadratic.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-7_Xg33mtoPE/VpyINm2pFeI/AAAAAAAAAoQ/FTHQ2le0474/s320/quadratic.png" /></a><br /><br />The quadratic equation is minimum (the turning point at the bottom) somewhere between x = 0.5 and x = 1. How can we find where exactly it is minimum? The standard solution is by algebraically finding where its gradient is 0. The gradient is 0 only at a turning point, which is where the minimum is. You can find the gradient at any point by taking the equation's <a href="https://en.wikipedia.org/wiki/Derivative">derivative</a>. The derivative with respect to "x" of y = 2x^2 - 3x + 2 is<br /><pre>dy/dx = 4x - 3</pre><a href="http://1.bp.blogspot.com/-xAF7WjeZePI/VpyJOk762qI/AAAAAAAAAoc/bQt0nsliLRs/s1600/quadratic_gradient.png" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-xAF7WjeZePI/VpyJOk762qI/AAAAAAAAAoc/bQt0nsliLRs/s320/quadratic_gradient.png" /></a><br /><br />The derivative equation is 0 exactly at the same point where the first equation has a minimum turning point. So we algebraically find at which "x" "4x - 3" is 0:<br /><pre>0 = 4x - 3<br />x = 3/4 = 0.75<br /></pre>Therefore "x" has to be 0.75.<br /><br /><div class="sectitle">Algebraic method: 2 variables</div><br />What if we have 2 variables instead of just "x"? This is what the equation<br /><pre>z = 2x^2 + 2y^2 + 2xy - 6x</pre>looks like:<br /><a href="http://1.bp.blogspot.com/-fCixDVY77Zo/VqKmRqCGU3I/AAAAAAAAAo8/eT0FfZ6OizE/s1600/3d.png" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-fCixDVY77Zo/VqKmRqCGU3I/AAAAAAAAAo8/eT0FfZ6OizE/s320/3d.png" /></a><br /><br />It clearly has a minimum point somewhere close to (x,y) = (0,0). The way to find the minimum is to take it's <a href="https://en.wikipedia.org/wiki/Partial_derivative">partial derivative</a> with respect to "x" and with respect to "y". A partial derivative is when you consider only one of the variables as a variable whilst considering the other variables as constant. This allows you to find how the gradient of the curve changes with respect to one variable. Here is the partial derivative with respect to "x":<br /><pre>∂z/∂x = 4x + 2y - 6</pre>and here is the partial derivative with respect to "y":<br /><pre>∂z/∂y = 4y + 2x</pre><br />The idea is now to find where the gradient with respect to both variables is 0. The minimum has a gradient of zero from all directions ("x" and "y") so we find where this is so:<br /><pre>0 = 4x + 2y - 6 [A]<br />0 = 4y + 2x [B]<br /><br />From [B]<br />x = -2y [C]<br /><br />Substituting [C] in [A]<br />0 = 4(-2y) + 2y - 6<br />y = -1<br /><br />Therefore from [C]<br />x = 2<br /></pre><br />Therefore both derivatives are equal to 0 when (x,y) = (2,-1) which means that that is where the equation is minimum.<br /><br /><div class="sectitle">Gradient Descent</div><br />Notice how finding the the minimum of a 2 variable equation required the use of solving simultaneous equations. Imagine having a 100 variables and solving 100 simultaneous equations. Plus there is no guarantee that the equations are solvable algebraically. So sometimes we use approximate methods to find the minimum. In this case we can use a method called <a href="https://en.wikipedia.org/wiki/Gradient_descent">Gradient Descent</a>. You start by finding the partial derivatives with respect to each variable, then you pick a point on the equation at random and follow the gradient downwards. Let's see an example with the previous equation.<br /><br /><pre>∂z/∂x = 4x + 2y - 6<br />∂z/∂y = 4y + 2x<br /></pre><br />Let's pick (x,y) = (0,0) as a point.<br /><br /><pre>∂z/∂x @ (0,0) = 4(0) + 2(0) - 6 = -6<br />∂z/∂y @ (0,0) = 4(0) + 2(0) = 0<br /></pre><br />The gradients suggest that at (0,0) the curve is flat at the "y" axis and sliding downwards in the positive direction of the "x" axis. Here is what that means.<br /><br />If we fix "y" to be 0, "z" would vary with respect to "x" like this:<br /><a href="http://1.bp.blogspot.com/-5PrrU6WRBBQ/VqNE_ZYyWiI/AAAAAAAAApc/l9GYmxL8TXw/s1600/3d_gradient_0%252C0_x.png" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-5PrrU6WRBBQ/VqNE_ZYyWiI/AAAAAAAAApc/l9GYmxL8TXw/s320/3d_gradient_0%252C0_x.png" /></a><br />Notice how at "x" = 0 the slope steeps down towards the positive direction, which means that "x" = 0 is too small and needs to be bigger for us to find the minimum. The gradient at "x" = 0 is -6, which is negative therefore the minimum is at a greater "x" than 0.<br /><br />If we fix "x" to be 0, "z" would vary with respect to "y" like this:<br /><a href="http://4.bp.blogspot.com/-GD6JP9CJCYg/VqNFzLZqdUI/AAAAAAAAApo/EbtXQorvn-Q/s1600/3d_gradient_0%252C0_y.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-GD6JP9CJCYg/VqNFzLZqdUI/AAAAAAAAApo/EbtXQorvn-Q/s320/3d_gradient_0%252C0_y.png" /></a><br />Notice how at "y" = 0 the slope is flat, which means that "y" is not important to find the minimum at this point and can be left as is. The gradient at "y" = 0 is 0, therefore we can leave "y" as is.<br /><br />In that case, we stay where we are in the "y" axis direction and increase the "x". By how much should we increase? Is the point (1,0) closer to the minimum than (0,0)? Is it too small an increase? Would (10,0) be a better new point? Or is it too big and should be (0.1,0) instead? How can we get a new point which is closer to the minimum than (0,0) is?<br /><br />As we approach the minimum, the gradients will start flattening out in both the "x" and "y" directions and start getting closer to 0. This means that we can, as a rule of thumb, increase or decrease the "x" and "y" in the point in proportion to the gradient. The larger the gradient, the steeper the slope, so the more promising moving in that direction would be to find the minimum. On the other hand, the smaller the gradient, the flatter the slope, so moving in that direction would not be very profitable as we're already close to the minimum there and might overshoot. If the gradient is zero, then we are right where we want to be and shouldn't move in that axis at all. So the greater the gradient, the greater the distance we should move.<br /><br />It's important to understand that the step size should be in proportion to the gradient, not exactly the gradient. We should move at a fraction of the gradient since the gradient will usually be really big compared to the distance required to move and will probably overshoot all the time if used as is. Let's use 0.1 of the gradient.<br /><br /><pre>new x = 0 - 0.1(-6) = 0.6<br />new y = 0 - 0.1(0) = 0<br /></pre><br />Notice that we subtracted the gradient so that if the gradient is negative we move towards the positive direction whilst if the gradient is positive we move towards the negative direction since that is how the slope will be oriented.<br /><br />We check the gradients again to see if we're there yet:<br /><br /><pre>∂z/∂x @ (0.6,0) = 4(0.6) + 2(0) - 6 = -3.6<br />∂z/∂y @ (0.6,0) = 4(0) + 2(0.6) = 1.2<br /></pre><br />At our new point we're at a slope sliding downwards towards the positive direction in the "x" and towards the negative direction in the "y" direction. Here are the graphs to confirm.<br /><br />If we fix "y" to be 0, "z" would vary with respect to "x" like this:<br /><a href="http://4.bp.blogspot.com/-8aMIOnurUe4/VqNHpxaimrI/AAAAAAAAAp0/ZHvUR0mbf8s/s1600/3d_gradient_0.6%252C0_x.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-8aMIOnurUe4/VqNHpxaimrI/AAAAAAAAAp0/ZHvUR0mbf8s/s320/3d_gradient_0.6%252C0_x.png" /></a><br />Notice how at "x" = 0.6 the slope steeps down towards the positive direction. The gradient at "x" = 0 is -3.6.<br /><br />If we fix "x" to be 0.6, "z" would vary with respect to "y" like this:<br /><a href="http://1.bp.blogspot.com/-n61aPmjdK7U/VqNIRFeizOI/AAAAAAAAAp8/A8uvIsKPLak/s1600/3d_gradient_0.6%252C0_y.png" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-n61aPmjdK7U/VqNIRFeizOI/AAAAAAAAAp8/A8uvIsKPLak/s320/3d_gradient_0.6%252C0_y.png" /></a><br />Notice how at "y" = 0 the slope is steeps down towards the negative directions. The gradient at "y" = 0 is 1.2.<br /><br />So we need to make "x" bigger and "y" smaller.<br /><br /><pre>new x = 0.6 - 0.1(-3.6) = 0.96<br />new y = 0 - 0.1(1.2) = -0.12<br /></pre><br />And we keep on reiterating until the gradient is sufficiently close to 0 in both the "x" and "y" directions or until the change in both "x" and "y" is small enough.<br /><br />Here's a list of the first 20 points we'll get:<br />(0, 0)<br />(0.6, 0.0)<br />(0.96, -0.12)<br />(1.2, -0.26)<br />(1.37, -0.4)<br />(1.5, -0.51)<br />(1.6, -0.61)<br />(1.68, -0.69)<br />(1.75, -0.75)<br />(1.8, -0.8)<br />(1.84, -0.84)<br />(1.87, -0.87)<br />(1.9, -0.9)<br />(1.92, -0.92)<br />(1.93, -0.93)<br />(1.95, -0.95)<br />(1.96, -0.96)<br />(1.97, -0.97)<br />(1.97, -0.97)<br />(1.98, -0.98)<br /><br />The point gets closer and closer to the actual minimum which is (2,-1). Here are the illustrated points:<br /><a href="http://2.bp.blogspot.com/-oKqFBhoHeEc/VqM5pat5WwI/AAAAAAAAApM/hQ6BtyuOQcA/s1600/points.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-oKqFBhoHeEc/VqM5pat5WwI/AAAAAAAAApM/hQ6BtyuOQcA/s320/points.png" /></a><br /><br />See how as the point approaches the minimum it starts moving less and less?<br /><br />This is called the Gradient Descent algorithm. We can fine-tune its performance by varying the starting point and the step size fraction by which we're multiplying the gradient.<br /><br />Here's a Python 3 program that performs Gradient Descent given a list of gradient functions (for each variable):<br /><pre class="brush:python">def grad_desc(gradients, initial_values, step_size, threshold):<br /> old_values = initial_values<br /> while True:<br /> new_values = [ value - step_size*gradient(*old_values) for (value, gradient) in zip(old_values, gradients) ]<br /> if all(abs(new - old) < threshold for (new, old) in zip(new_values, old_values)):<br /> return new_values<br /> old_values = new_values<br /></pre>Here's how you would use the previous example: <pre class="brush:python">grad_desc([ lambda x,y:4*x+2*y-6, lambda x,y:4*y+2*x ], [0, 0], 0.1, 0.001)<br /></pre>mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-74730535245595924772015-12-13T13:57:00.000+01:002015-12-13T13:57:24.401+01:00(k-)Nearest Neighbour(s) ClassificationYou want a computer to learn to assign objects into categories, such as the genre of a book. You happen to have a bunch of books with a known category. One of the simplest ways to make the computer assign an unknown book's category is to find the most similar book in the bunch to the unknown book and assume that the two books share the same category. For example, you want to find what genre "Harry Potter" is and find that it is most similar to a book you have called "The Hobbit" which is tagged as fantasy, so you conclude that "Happy Potter" is also fantasy. Of course this only makes sense if you have a big collection of reference books since there might not be any books which are similar otherwise, and the most similar book would be of a genre which is significantly different.<br /><br />This is called the <a href="https://en.wikipedia.org/wiki/Nearest_neighbour_classifiers">nearest neighbours classification</a> algorithm, in particular the 1-nearest neighbour, because you only take into consideration the most similar book. Alternatively you can take the top 10 most similar books and use the most frequent genre among the 10 books. This would be called 10-nearest neighbours classification. In general it's called k-nearest neighbours classification.<br /><br />This is a simple algorithm but its advantage is in its simplicity since it makes no assumptions about the data you give it. Whereas other machine learning algorithms assume that there is some simple pattern to decide which genre a book belongs to, the nearest neighbour classifier can discriminate between very complex patterns and will adapt to any data you train it with, provided that there is enough variety of data. The more complex the relationship between the books and the genre, the more variety of books you need to train it with.<br /><br />The way it works is by first converting each book in your bunch into a bunch of lists of numbers called a vectors. Each vector would be a point in space (a vector of 2 numbers is a 2D point, of 3 numbers is a 3D point, and the rest are of high dimensional space). For example, in order to convert a book into a point, each number could be the number of times a particular word occurs. Create a vocabulary of words that matter, such as "wizard" and "gun", and then create a point consisting of the number of times each word occurs in the book. So if "Happy Potter" had "wizard" appearing 100 times and "gun" appearing 0 times, then it's 2D point would be (100, 0).<br /><br />Next, compare the point version of the book in question to the point versions of every book in the bunch. Use some similarity measure to quantify how similar the points are. Similarity measures include <a href="https://en.wikipedia.org/wiki/Euclidean_distance">Euclidean distance</a> (normal distance between points) and <a href="https://en.wikipedia.org/wiki/Cosine_similarity">Cosine similarity</a> (difference in the angle of the points from the origin).<br /><br /><a href="http://4.bp.blogspot.com/-Z_3ZJAhwnuY/Vm1mXh5605I/AAAAAAAAAmc/61GG-zSBQjo/s1600/knn_points.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-Z_3ZJAhwnuY/Vm1mXh5605I/AAAAAAAAAmc/61GG-zSBQjo/s320/knn_points.png" /></a><br /><br />Which is the most similar point to the purple one in the above diagram (the purple point is (100, 0) which represents "Harry Potter")? The purple point will be of the same colour as the closest point.<br /><br />Of course comparing to every point is slow, which is a problem given that nearest neighbour classification requires a lot of points to compare to. There are <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search#Methods">nearest neighbour search algorithms</a> but they are not very efficient when the points have a lot of dimensions (many numbers in the vector). In some cases it is enough to use approximate search algorithms that do not give exact nearest point, but will find a reasonably close point quickly. The paper "<a href="http://dl.acm.org/ft_gateway.cfm?id=1220221&ftid=713621&dwn=1&CFID=567886069&CFTOKEN=46097158">Scaling Distributional Similarity to Large Corpora</a>" gives an overview of such algorithms for finding words that have similar meanings.<br /><br />If you do not have the genres of the books but still want to categorize similar books together you can use a <a href="https://en.wikipedia.org/wiki/Cluster_analysis">clustering algorithm</a> such as <a href="https://en.wikipedia.org/wiki/K-means_clustering">k-means clustering</a> into order to group books by similarity and then use nearest neighbour classification to associate the new book with the group of the nearest book.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-84361829317315509052015-11-06T07:24:00.000+01:002015-11-06T07:33:53.929+01:00Naive Bayes ClassificationThe <a href="http://geekyisawesome.blogspot.com.mt/2015/10/conditional-probabilities-and-bayes.html">previous post</a> was about Bayes' theorem, so now we'll talk about a use for it in machine learning: <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier">Naive Bayes Classification</a>.<br /><br />Let's say that you're making a program which given the content of a book, will tell you how likely it is that you will like it. In order to do so, it needs to know the contents of books that you like and books that you don't like. Parsing and understanding the content of a book is crazy hard, so you opt for a simpler strategy: You base the decision on the words used in the book. Books that you like use certain words that you like whilst books that you don't like use words that you don't like.<br /><br />So you come up with a vocabulary of words (perhaps only a small set of words need to be considered) and you count the number of times each word appears a book you like and in a book you don't like. Let's say you end up with a table like this:<br /><br />% of books that include word<br /><table border="1"><tr><th>Word\Class</th><th>Like book</th><th>Hate book</th></tr><tr><th>magic</th><td>100%</td><td>0%</td></tr><tr><th>fairy</th><td>90%</td><td>10%</td></tr><tr><th>car</th><td>5%</td><td>95%</td></tr><tr><th>gun</th><td>0%</td><td>100%</td></tr></table><br />This means that 90% of books that you like contain the word "fairy", which is another way of saying that a book with the word "fairy" has a 90% chance of being a good book.<br /><br />Now we have a new book and we want to know if we're likely to like it or not. So we check which words it contains and find the following:<br /><table border="1"><tr><th>Word</th><th>Contained?</th></tr><tr><th>magic</th><td>yes</td></tr><tr><th>fairy</th><td>yes</td></tr><tr><th>car</th><td>no</td></tr><tr><th>gun</th><td>yes</td></tr></table><br />The probability that you'll like the book given that it contains these words is found by calculating<br /><pre>P(Like book | magic=yes, fairy=yes, car=no, gun=yes)</pre><br />Naive Bayes Classification works by first using Baye's theorem on the above conditional probability:<br /><pre>P(magic=yes, fairy=yes, car=no, gun=yes | Like book) P(Like book) / P(magic=yes, fairy=yes, car=no, gun=yes)</pre><br />Now that the list of AND conditions (has magic and fairy and...) is at the front of the conditional, we can use the Naive Bayes Assumption and assume that the occurrence of each term is independent from all the other terms. If we assume this, we can simplify the probability by decomposing the ANDed conditions into separate probabilities multiplied together as follows:<br /><pre>P(magic=yes|Like book) P(fairy=yes|Like book) P(car=no|Like book) P(gun=yes|Like book) P(Like book) / (P(magic=yes) P(fairy=yes) P(car=no) P(gun=yes))</pre><br />Now we can use the table at the top to find P(word|Like book), the probability P(Like book) is the percentage of books that you like (from those used to construct the table), and P(word) is the probability that a book contains the given word (from the books used to construct the table). These percentages are easy to obtain.<br /><br />The problem is that one of our percentages is a zero, P(gun=yes | Like book). Because of this, when it is multiplied by the other probabilities, the result will be zero. The solution is to disallow zero probabilities by assuming that just because a word does not occur in the books you like, doesn't mean that it will never occur. It might be that there is a very tiny probability that it will occur, but that you don't have enough books to find it. In these situations, we need to smooth our probabilities using <a href="http://geekyisawesome.blogspot.com.mt/2013/02/maximum-likelihood-estimate-laplaces-law.html">Laplace Smoothing</a> by adding 1 to every count.<br /><br />Naive Bayes Classification can be used to find the most likely class a list of yes/no answers belongs to (such as whether the book contains the given words), but this is just the simplest type of Naive Bayes Classification known as <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Bernoulli_naive_Bayes">Bernoulli Naive Bayes</a>, so called because it assumes a <a href="https://en.wikipedia.org/wiki/Bernoulli_distribution">Bernoulli distribution</a> in the probabilities (a Bernoulli distribution is when there are only 2 possible outcomes from an event with one outcome having a probability of "p" and the other "p-1"). It can also be used on a list of frequencies of the terms by using a <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes">Multinomial Naive Bayes</a> or on a list of numbers with a decimal point (such as the weight of the book) using <a href="https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Gaussian_naive_Bayes">Gaussian Naive Bayes</a>.mtantinoreply@blogger.com2tag:blogger.com,1999:blog-4318944459823143473.post-77096310283475937052015-10-03T10:04:00.000+02:002016-05-26T22:38:50.193+02:00Conditional probabilities and Bayes' theoremSo we all know that when a sports fan asks "What chance does our team have of winning?", the speaker is asking for a probability, but when that same person later asks "What chance does our team have of winning given that John will not be playing?", the speaker is now asking for a conditional probability. In short, a conditional probability is a probability that is changed due to the addition of new information. Let's see an example.<br /><br /><div class="sectitle">Conditional probabilities</div><br />Let's say that we have the following set of numbers, one of which is to be picked at random with equal probability:<br /><a href="http://1.bp.blogspot.com/-PWWRuknLeLM/VgxI8gd3YZI/AAAAAAAAAkk/_QiwTdfFiWU/s1600/condprob_set.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-PWWRuknLeLM/VgxI8gd3YZI/AAAAAAAAAkk/_QiwTdfFiWU/s320/condprob_set.png"></a><br /><br />The probability of each number being chosen is 1/7. But probabilities are usually based on subsets. So what is the probability of randomly choosing a square number from the above set?<br /><a href="http://3.bp.blogspot.com/-P4tsXZvYGv8/VgxI8sUZClI/AAAAAAAAAko/D0aSjZ2M1-k/s1600/condprob_squares.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-P4tsXZvYGv8/VgxI8sUZClI/AAAAAAAAAko/D0aSjZ2M1-k/s320/condprob_squares.png"></a><br /><br />The probability is, of course, 2/7. Now comes the interesting part. Let's say that the number is still chosen at random, but you have the extra information that the number that will be chosen is going to be an even number. In other words, although you don't know which number will be chosen, you do know that it will be an even number. What is the probability that the chosen number will be a square number?<br /><a href="http://3.bp.blogspot.com/-Z6QKM9IEhk0/VgxI8m7V0BI/AAAAAAAAAks/G-_5dUjZNJg/s1600/condprob_squares_even.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-Z6QKM9IEhk0/VgxI8m7V0BI/AAAAAAAAAks/G-_5dUjZNJg/s320/condprob_squares_even.png"></a><br /><br />Clearly the added information requires us to change the original probability of choosing a square number. We now have a smaller set of possible choices, only 2 (the red set). From these, there is only 1 square number (the intersection of the red and blue sets). So now the probability of choosing a square number is 1/2.<br /><br />This is called a conditional probability. Whereas the first non-conditional probability is expressed as follows in mathematical notation:<br /><pre>P(<span style="color:blue;">number is square</span>)</pre>the second probability is a conditioned one and is expressed as follows:<br /><pre>P(<span style="color:blue;">number is square</span> | <span style="color:red;">number is even</span>)</pre>which is read as "probability that the number is square given that the number is even".<br /><br />In general,<br /><pre>P(<span style="color:blue;">A</span>|<span style="color:red;">B</span>) = P(<span style="color:blue;">A</span>,<span style="color:red;">B</span>)/P(<span style="color:red;">B</span>)</pre>where P(<span style="color:blue;">A</span>|<span style="color:red;">B</span>) is the probability that event <span style="color:blue;">A</span> occurs given that event <span style="color:red;">B</span> has occurred, P(<span style="color:blue;">A</span>,<span style="color:red;">B</span>) is the probability that both events occur together (called the joint probability), and P(<span style="color:red;">B</span>) is the probability that event B occurred.<br /><a href="http://4.bp.blogspot.com/-AGAU3PZjAO4/VgzLX1AMM6I/AAAAAAAAAlE/tzyqeQhaLok/s1600/condprob_AB.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-AGAU3PZjAO4/VgzLX1AMM6I/AAAAAAAAAlE/tzyqeQhaLok/s320/condprob_AB.png"></a><br /><br />From this, we can derive some pretty interesting equations.<br /><br /><div class="sectitle">Bayes' theorem</div><br />First, it is clear from the above picture that it is straightforward to define P(<span style="color:red;">B</span>|<span style="color:blue;">A</span>) by simply dividing by P(<span style="color:blue;">A</span>):<br /><pre>P(<span style="color:red;">B</span>|<span style="color:blue;">A</span>) = P(<span style="color:blue;">A</span>,<span style="color:red;">B</span>)/P(<span style="color:blue;">A</span>)</pre><br />This means that:<br /><pre>P(<span style="color:red;">B</span>|<span style="color:blue;">A</span>) P(<span style="color:blue;">A</span>) = P(<span style="color:blue;">A</span>,<span style="color:red;">B</span>)</pre>and from the other formula, that:<br /><pre>P(<span style="color:blue;">A</span>|<span style="color:red;">B</span>) P(<span style="color:red;">B</span>) = P(<span style="color:blue;">A</span>,<span style="color:red;">B</span>)</pre>which together mean that:<br /><pre>P(<span style="color:blue;">A</span>|<span style="color:red;">B</span>) P(<span style="color:red;">B</span>) = P(<span style="color:red;">B</span>|<span style="color:blue;">A</span>) P(<span style="color:blue;">A</span>)</pre>and<br /><pre>P(<span style="color:blue;">A</span>|<span style="color:red;">B</span>) = P(<span style="color:red;">B</span>|<span style="color:blue;">A</span>) P(<span style="color:blue;">A</span>)/P(<span style="color:red;">B</span>)</pre><br />This last equation is known as <a href="https://en.wikipedia.org/wiki/Bayes'_theorem">Bayes' theorem</a> which is something that you'll encounter all the time in probability and artificial intelligence.<br /><br />In many cases, the probability P(<span style="color:red;">B</span>) is difficult to find, but we can decompose it further by noticing that the probability of selecting from set <span style="color:red;">B</span> depends on whether or not a selection was made from set <span style="color:blue;">A</span>. Specifically:<br /><pre>P(<span style="color:red;">B</span>) = P(<span style="color:blue;">A</span>) P(<span style="color:red;">B</span>|<span style="color:blue;">A</span>) + P(NOT <span style="color:blue;">A</span>) P(<span style="color:red;">B</span>|NOT <span style="color:blue;">A</span>)</pre>This is saying that the probability of selecting from set <span style="color:red;">B</span> is equal to the probability of one of the following events occurring:<br /><ul><li>A selection is made from set <span style="color:blue;">A</span> and it happens to also be an element in set <span style="color:red;">B</span>: P(<span style="color:blue;">A</span>) P(<span style="color:red;">B</span>|<span style="color:blue;">A</span>)</li><li>A selection is not made from set <span style="color:blue;">A</span> but the selected element is in set <span style="color:red;">B</span>: P(NOT <span style="color:blue;">A</span>) P(<span style="color:red;">B</span>|NOT <span style="color:blue;">A</span>)</li></ul><br />Thus Bayes' theorem can be rewritten as<br /><pre>P(<span style="color:blue;">A</span>|<span style="color:red;">B</span>) = P(<span style="color:blue;">A</span>) P(<span style="color:red;">B</span>|<span style="color:blue;">A</span>) / ( P(<span style="color:blue;">A</span>) P(<span style="color:red;">B</span>|<span style="color:blue;">A</span>) + P(NOT <span style="color:blue;">A</span>) P(<span style="color:red;">B</span>|NOT <span style="color:blue;">A</span>) )</pre><br />This is a more practical version of the formula. Let's see a practical example of it.<br /><br /><div class="sectitle">Bayes' theorem in action</div><br />Let's say that you have a robot that is trying to recognise objects in front of a camera. It needs to be able to recognise you when it sees you in order to greet you and fetch you your slippers. The robot sometimes makes mistakes. It sometimes thinks that it saw you when it did not (a false positive) and it sometimes sees you and doesn't realise it (a false negative). We need to calculate how accurate it is. Let's look at the following probability tree:<br /><a href="http://1.bp.blogspot.com/-JxOcmj1k5_8/Vg2NXQzUCNI/AAAAAAAAAlk/wUOErxn2Zt4/s1600/probability_tree.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-JxOcmj1k5_8/Vg2NXQzUCNI/AAAAAAAAAlk/wUOErxn2Zt4/s320/probability_tree.png"></a><br /><br />This tree is showing the following data:<br /><pre>P(you are there) = 0.1<br />P(you are not there) = 0.9<br />P(robot detects you | you are there) = 0.85<br />P(robot detects you | you are not there) = 0.15<br />P(robot does not detect you | you are there) = 0.05<br />P(robot does not detect you | you are not there) = 0.95<br /></pre><br />What is the probability that the robot detects you when you're there?<br /><pre>P(robot detects you AND you are there) =<br />P(robot detects you, you are there) =<br />P(you are there) P(robot detects you | you are there) =<br />0.1 x 0.85 = 0.085<br /></pre><br />Notice how we could have used the probability tree to calculate this (multiply the probabilities along a branch to AND them).<br /><br />If the robot detects you, what is the probability that it is correct?<br /><pre>P(you are there | robot detects you) =<br />P(you are there) P(robot detects you | you are there) / ( P(you are there) P(robot detects you | you are there) + P(you are not there) P(robot detects you | you are not there) ) =<br />0.1 x 0.85 / ( 0.1 x 0.85 + 0.9 x 0.15 ) = 0.39<br /></pre><br />This is a small number, even though it correctly detects you 85% of the time. The reason is because you are in front of it only 10% of the time, which means that the majority of the time that it is trying to detect you you are not there. This will make that 15% of the time falsely detecting you pile up. One way to increase the accuracy is to limit the number of times an attempted detection is made in such a way that the probability that you are actually there is increased.<br /><br /><div class="sectitle">Bayesian inference</div><br />There is more to Bayes' theorem than using it to measure the accuracy of a robot's vision. It has interesting philosophical implications in epistemology. This is because it can be used to model the acquisition of knowledge. When used in this way we say that we are performing <a href="https://en.wikipedia.org/wiki/Bayesian_inference">Bayesian inference</a>. Let's say that you're a detective collecting clues on who committed a murder. You have a suspect in mind that you believe is the murderer with a certain probability. You find a clue which you believe is evidence that incriminates the suspect. This evidence should now increase your probability that the suspect is the murderer. But how do you find the new probability? Enter Bayes' theorem.<br /><br />The probability you assigned to the suspect before the new evidence is P(H), the probability of the hypothesis, also known as the prior probability.<br />The new probability that you should assign to the suspect after discovering the evidence is P(H|E), also known as the posterior probability.<br />Now we use Bayesian inference to calculate the posterior probability as follows:<br /><br /><pre>P(H|E) = P(H)P(E | H) / ( P(H)P(E | H) + P(NOT H)P(E | NOT H) )</pre><br />The interpretation of this makes sense. The new probability given the evidence depends on two things:<br /><ul><li>The likelihood that the suspect was the murderer. The smaller this is, the stronger the evidence needs to be to make the hypothesis likely. This is described exactly by the quote "Extraordinary claims require extraordinary evidence".</li><li>The probability that the evidence would exist given that the suspect was not the murderer. It could be that the evidence actually supports the null-hypothesis, that is, that the suspect is actually not the murderer. This is determined by comparing the probability of the hypothesis with the probability of the null-hypothesis.</li></ul><br />Finally notice also that if you have multiple hypothesis and want to see which is the most likely given a new evidence, we are essentially trying to find the maximum posterior probability of each hypothesis given the same evidence. Given the multiple competing hypothesis H_1, H_2, H_3, etc., the most likely H_i is found by:<br /><pre>argmax_i ( P(H_i)P(E | H_i) / ( P(H_i)P(E | H_i) + P(NOT H_i)P(E | NOT H_i) ) )</pre>But we can simplify this by remembering that the denominator is P(E):<br /><pre>argmax_i ( P(H_i)P(E | H_i) / P(E) )</pre>And of course since P(E) is a constant for each hypothesis, it will not affect which hypothesis will give the maximum posterior probability, so we can leave it out, giving:<br /><br /><pre>argmax_i P(H_i)P(E | H_i)</pre>mtantinoreply@blogger.com2tag:blogger.com,1999:blog-4318944459823143473.post-34952853313163570522015-09-14T08:55:00.000+02:002015-09-14T08:55:11.457+02:00How to make a multiple choice test using ExcelHere is a post for the teachers out there who are technologically savvy enough to use Excel but not quite enough to write a program or web application. It is easy to make your own multiple choice test in Excel which corrects itself, provided that it is feasible to make give a copy of the Excel file to each student and collect them all after the test. This also assumes that the risk of students not saving or accidentally deleting the file is negligible. But I know teachers who actually do this sort of thing so here is how to do it well.<br /><br /><div class="sectitle">STEP 1: Activate "Developer" tab</div>Go on File - Options - Customize Ribbon - Select the "Developer" check box - OK:<br /><a href="http://2.bp.blogspot.com/-uCSrQBhm1M0/VfV9y7dV39I/AAAAAAAAAfg/AveQkMEDKuM/s1600/step1.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-uCSrQBhm1M0/VfV9y7dV39I/AAAAAAAAAfg/AveQkMEDKuM/s320/step1.png" /></a><br /><br /><div class="sectitle">STEP 2: Add a group box</div>Go on Developer - Insert - Group box in Form Control:<br /><a href="http://3.bp.blogspot.com/-T__sVsW5oLs/VfV-exh2qPI/AAAAAAAAAfo/qpYqtqNWWzo/s1600/step2a.png" imageanchor="1" ><img border="0" src="http://3.bp.blogspot.com/-T__sVsW5oLs/VfV-exh2qPI/AAAAAAAAAfo/qpYqtqNWWzo/s320/step2a.png" /></a><br /><br />Then draw the group box and delete the text on it or write your question there:<br /><a href="http://4.bp.blogspot.com/-x5c-gG8hgGw/VfV_UFuFnLI/AAAAAAAAAf0/-SlMifII0iQ/s1600/step2b.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-x5c-gG8hgGw/VfV_UFuFnLI/AAAAAAAAAf0/-SlMifII0iQ/s320/step2b.png" /></a><br /><br /><div class="sectitle">STEP 3: Add option buttons</div>Go on Developer - Insert - Option button in Form Control:<br /><a href="http://3.bp.blogspot.com/-Xak-2zc2wTs/VfWABcxPd_I/AAAAAAAAAf8/-vjCIC5cjJw/s1600/step3.png" imageanchor="1" ><img border="0" src="http://3.bp.blogspot.com/-Xak-2zc2wTs/VfWABcxPd_I/AAAAAAAAAf8/-vjCIC5cjJw/s320/step3.png" /></a><br /><br />Then draw the option button COMPLETELY INSIDE the group box. This is very important, as if you don't draw it completely inside the group box, it will not be associated with other candidate answers of the same question and will not work properly. It's OK to move it outside of the group box afterwards but not before you draw it inside. Delete the text on the option button or write the candidate answer there:<br /><a href="http://3.bp.blogspot.com/-cjH2NkYhCsg/VfWA3aZymaI/AAAAAAAAAgI/WQAmRK1cEOg/s1600/step3b.png" imageanchor="1" ><img border="0" src="http://3.bp.blogspot.com/-cjH2NkYhCsg/VfWA3aZymaI/AAAAAAAAAgI/WQAmRK1cEOg/s320/step3b.png" /></a><br /><br /><div class="sectitle">STEP 4: Complete the test</div>Repeat steps 3 and 2 as needed. If you need to reposition the form elements you first right click on them and then they are movable. Don't worry about accidentally selecting an option button; just make sure that checking an option button in a control group will not affect other control groups. If this happens then it is because the option buttons are not associated with the control group, because they were not drawn inside it, and you will have to draw a new one instead of it.<br /><br /><a href="http://2.bp.blogspot.com/-ei90OzzvZC0/VfWPcF4zuWI/AAAAAAAAAgY/Pud9v75tjMM/s1600/step4.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-ei90OzzvZC0/VfWPcF4zuWI/AAAAAAAAAgY/Pud9v75tjMM/s320/step4.png" /></a><br /><br /><div class="sectitle">STEP 5: Automatically check the answers</div>Right click on the option buttons - Format Control - Set Cell link to a particular cell:<br /><a href="http://4.bp.blogspot.com/-uceEdoGazB4/VfWQxjzuA0I/AAAAAAAAAgk/Z_jjC-JrsV4/s1600/step5a.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-uceEdoGazB4/VfWQxjzuA0I/AAAAAAAAAgk/Z_jjC-JrsV4/s320/step5a.png" /></a><br /><br />You only need to set one of the option buttons for each question. Make sure that option buttons of different questions all use different cell links. While you're in the Format Control window you can unset any accidentally set option buttons.<br /><br />You now have the linked cell of each question contain a number which indicates the selected answer:<br /><a href="http://3.bp.blogspot.com/-SkvsS8ng2IA/VfWVVCAhy0I/AAAAAAAAAgw/EwNryARgTNw/s1600/step5b.png" imageanchor="1" ><img border="0" src="http://3.bp.blogspot.com/-SkvsS8ng2IA/VfWVVCAhy0I/AAAAAAAAAgw/EwNryARgTNw/s320/step5b.png" /></a><br /><br />The number depends on the order in which the option buttons were added. Next write the correct answer next to each linked cell and next to that add a formula which checks if the right answer was selected. The formula is "<b>=G3=H3</b>" where "G3" is the linked cell and "H3" is the cell with the right answer.<br /><a href="http://1.bp.blogspot.com/-Yt7l_oZ15jU/VfWZBzeR75I/AAAAAAAAAhM/qmcuncoBkTM/s1600/step5c.png" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-Yt7l_oZ15jU/VfWZBzeR75I/AAAAAAAAAhM/qmcuncoBkTM/s320/step5c.png" /></a><br /><br />The 3 columns on the side show the chosen answer (automatically set by the option buttons), the correct answer (entered by you), and whether the right answer was chosen or not (automatically set by a formula which compares the previous two).<br /><br /><div class="sectitle">STEP 6: Automatically compute the mark</div>Finally, add the following formula under the TRUE/FALSE cells: "<b>=COUNTIF(I3:I9,TRUE)</b>" where "I3:I9" is the range of cells which are TRUE/FALSE.<br /><a href="http://2.bp.blogspot.com/-XpeoltFRqLM/VfWa-Hkv-8I/AAAAAAAAAhY/ug73Byhxpwc/s1600/step6.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-XpeoltFRqLM/VfWa-Hkv-8I/AAAAAAAAAhY/ug73Byhxpwc/s320/step6.png" /></a><br /><br /><div class="sectitle">STEP 7: Barricade the Excel sheet</div>At the moment the answers are in plain sight and everything is editable which makes it unsuitable for a test. So here's how to fix that.<br /><br /><div class="subsectitle">Unlock the changing cells</div>Start with setting the cells on the side to unlocked. This will allow the sheet to work when you lock it. Highlight the side cells (including the test mark), then right click - Format Cells - Protection - Uncheck both checkboxes:<br /><a href="http://4.bp.blogspot.com/-FA0IgEwp1OM/VfXtdqUVoxI/AAAAAAAAAjU/OoeQvi6rbwo/s1600/step7a.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-FA0IgEwp1OM/VfXtdqUVoxI/AAAAAAAAAjU/OoeQvi6rbwo/s320/step7a.png" /></a><br /><br />If you have any cells which you want the students to edit, such as a space to type their name, unlock these cells as well in the exact same way.<br /><br /><div class="subsectitle">Hide the sensitive information</div>Next we'll hide the side cells. Highlight the columns with the secret information, then right click on the columns and click hide:<br /><a href="http://2.bp.blogspot.com/-rP_lBbP1-jk/VfXtdhWkkXI/AAAAAAAAAi8/BAfjcMKgEb8/s1600/step7b.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-rP_lBbP1-jk/VfXtdhWkkXI/AAAAAAAAAi8/BAfjcMKgEb8/s320/step7b.png" /></a><br /><br /><a href="http://4.bp.blogspot.com/-A5wBgmRDsVA/VfXtdv0FT8I/AAAAAAAAAi4/ktMUrsi_WaQ/s1600/step7c.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-A5wBgmRDsVA/VfXtdv0FT8I/AAAAAAAAAi4/ktMUrsi_WaQ/s320/step7c.png" /></a><br /><br /><div class="subsectitle">Password protect the sheet</div>Next we'll make it all password protected so that nothing can be changed except the option buttons. Go on Review - Protect Sheet - Set a password - OK:<br /><a href="http://2.bp.blogspot.com/-ouVkXp7IDvw/VfX0MsNYE-I/AAAAAAAAAjs/OyDXTxp-KCk/s1600/step7d.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-ouVkXp7IDvw/VfX0MsNYE-I/AAAAAAAAAjs/OyDXTxp-KCk/s320/step7d.png" /></a><br /><br />Also go on Review - Protect Workbook - Set a password - OK:<br /><a href="http://2.bp.blogspot.com/-Rcp5_LKPrEE/VfXte3BEbiI/AAAAAAAAAjM/8cOvBGfSoLI/s1600/step7e.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-Rcp5_LKPrEE/VfXte3BEbiI/AAAAAAAAAjM/8cOvBGfSoLI/s320/step7e.png" /></a><br /><br />Now you have a multiple choice test sheet which cannot be tampered with.<br /><br /><div class="sectitle">STEP 8: Gathering the marks</div>The test has been taken and everyone saved their Excel sheet. Now you have to collect all the files and find everyone's mark. This would involve opening each file, unprotecting the sheet with your password, unhiding the hidden columns, and reading the mark at the bottom. Pretty daunting, but avoidable.<br /><br />You can use Excel to read the data in other Excel files. Just save a blank Excel file with all the answer files and add the following formula: "<b>'[john smith.xlsx]Sheet1'!I13</b>" where "john smith.xlsx" is the file name of the answer file, "Sheet1" is the Excel sheet name in the answer file, and "I13" is the cell containing the mark.<br /><br /><a href="http://4.bp.blogspot.com/-DaQdBU4FHj8/VfYCsTsfPwI/AAAAAAAAAj4/DcX2Y4LSIVQ/s1600/step8b.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-DaQdBU4FHj8/VfYCsTsfPwI/AAAAAAAAAj4/DcX2Y4LSIVQ/s320/step8b.png" /></a><br /><br />Just do this for all files and you've got a nice result sheet. If you want to give a correction you can even check the TRUE/FALSE column of each answer and say which questions were answered wrong. Use "<b>IF('[john smith.xlsx]Sheet1'!I3, "Correct", "You said " & '[john smith.xlsx]Sheet1'!G3 & " instead of " & '[john smith.xlsx]Sheet1'!H3)</b>" where I3 is the cell with the TRUE/FALSE result of the first question, G3 is the cell with the given answer, and H3 is the cell with the correct answer. You can even add another column in the answer sheet with a comment for whoever gets the question wrong.<br /><br />Keep in mind that students might use a technique like this to read the hidden stuff in your answer file, but that shouldn't be easy to do without getting caught, especially if you hide a lot of columns (more than needed) and put the data in random columns.<br /><br /><div class="sectitle">The grid format</div>You can do your multiple choice in the below format where the sheet contains minimal information and the questions and candidate answers are on a printed sheet of paper.<br /><a href="http://4.bp.blogspot.com/-RoiMTY9gwek/VfYJWi9xglI/AAAAAAAAAkQ/fwG15iafSGs/s1600/alternative%2Bmultiple%2Bchoice.png" imageanchor="1" ><img border="0" src="http://4.bp.blogspot.com/-RoiMTY9gwek/VfYJWi9xglI/AAAAAAAAAkQ/fwG15iafSGs/s320/alternative%2Bmultiple%2Bchoice.png" /></a><br /><br />This allows the students to scribble on the paper and to keep the Excel sheet short which saves scrolling.mtantinoreply@blogger.com2tag:blogger.com,1999:blog-4318944459823143473.post-29332270091258205342015-08-07T16:03:00.000+02:002015-08-07T17:50:49.654+02:00Predicting the number of nodes in a trie with uniformly distributed stringsA <a href="https://en.wikipedia.org/wiki/Trie">trie</a> is a type of tree that stores strings. Each character of the strings is a node and strings that share a common prefix also share the nodes, which means that a common prefix is only stored once, reducing some redundancy. But how much space is saved by using a trie? In order to answer this question, first we have to calculate the expected number of nodes a trie will have for "n" strings of "m" characters each with "c" possible characters (character set).<br /><br />Consider the following diagram of a trie that contains the words "me", "if", "in", and "it". In it we have added a new word "my".<br /><a href="http://1.bp.blogspot.com/-5hczuxnslG8/VcJu6r7Ju7I/AAAAAAAAAd0/r3N8r6d3iPQ/s1600/trie.png" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-5hczuxnslG8/VcJu6r7Ju7I/AAAAAAAAAd0/r3N8r6d3iPQ/s320/trie.png" /></a><br /><br />The word "my" only required the creation of one new node, since its first letter already existed in the word "me" so that node was shared and not recreated. In general, if a string is inserted in a trie, the number of new nodes created depends on the length of the longest existing prefix in the trie. This length will be the number of nodes that will be shared/reused. The remainder of the string will require new nodes for each character. If the whole string already exists then there will be 0 new nodes whilst if the string is completely new with no existing prefix then there will be a new node for each character. Specifically, for a string of length "m" whose longest existing prefix is of length "p", the number of new nodes created will be "m - p".<br /><br />The equation we need to figure out looks like the following:<br /><pre style="overflow-x:scroll;white-space:pre;">expected number of nodes = (m)(expected number of strings with prefix of length 1 not found)<br /> + (m-1)(expected number of strings with prefix of length 2 not found)<br /> + (m-2)(expected number of strings with prefix of length 3 not found)<br /> + ...<br /> + (1)(expected number of strings with prefix of length m not found)</pre><br />Assuming that the strings are generated using a uniform distribution (any character can appear anywhere in the string), we need to find the expected number of strings out of "n" inserted strings made from "c" possible characters that will have a non-existing prefix of length "p".<br /><br />This is basically the expected number of strings being selected for the first time when "n" selections are made from among all possible "p" length strings made from "c" possible characters (there are "c^p" possible such prefixes). This is equivalent to saying that it is the expected number of non-collisions when randomly placing "n" objects in "c^p" slots.<br /><br />In my <a href="http://geekyisawesome.blogspot.com/2015/07/expected-number-ofuniformly-distributed.html">previous post</a>, I showed that the expected number of collisions when randomly placing "n" objects in "s" slots is<br />n - s(1 - (1 - 1/s)^n)<br />which means that the number of non-collisions is<br />n - (n - s(1 - (1 - 1/s)^n))<br />which simplifies to<br />s(1 - (1 - 1/s)^n)<br />which when we plug in our values becomes<br />(c^p)(1 - (1 - 1/(c^p))^n)<br /><br />But there's a problem. The above equation tells you the expected number of non-collisions when considering "p" length prefixes. But consider the previous diagram again. If the word "he" was added, it is true that the length 2 prefix of the word ("he") does not result in a collision, but this does not mean that just 1 new node will be added. In reality, 2 new nodes will be added because it is also true that its length 1 prefix ("h") will also not result in a collision. What this means is that the equation will not give the number of strings which will not result in a collision due to their length "p" prefix only, but also due to their length "p-1" prefix, which is not what we want. To fix this, we subtract from the equation the number of non-collisions due to the shorter prefix:<br /><pre style="overflow-x:scroll;white-space:pre;">expected number of strings with prefix of length p not found = (c^p)(1 - (1 - 1/(c^p))^n) - (c^(p-1))(1 - (1 - 1/(c^(p-1)))^n)<br /></pre>Of course this does not apply for the length 1 prefix, so we need to be careful to only apply the subtraction for prefix lengths greater than one.<br />(You might think that we need to subtract for each shorter prefix length, but when this was tried the result became a negative number. Perhaps some form of inclusion-exclusion principle needs to be applied. Using this equation, the result matches empirical data for many different parameters.)<br /><br />So, continuing from our earlier equation,<br /><pre style="border:black solid 5px;overflow-x:scroll;white-space:pre;">expected number of nodes = (m)((c^1)(1 - (1 - 1/(c^1))^n))<br /> + (m-1)((c^2)(1 - (1 - 1/(c^2))^n) - (c^(2-1))(1 - (1 - 1/(c^(2-1)))^n))<br /> + (m-2)((c^3)(1 - (1 - 1/(c^3))^n) - (c^(3-1))(1 - (1 - 1/(c^(3-1)))^n))<br /> + ...<br /> + (1)((c^m)(1 - (1 - 1/(c^m))^n) - (c^(m-1))(1 - (1 - 1/(c^(m-1)))^n))<br />= sum( c^i - c^i*((c^i-1)/c^i)^n for i in 1..m )<br /></pre><br />In Python code this becomes:<br /><pre style="border:black solid 5px;" class="brush:python">from fractions import Fraction<br />def exp_num_trie_nodes(n,m,c):<br /> return float(sum( c**i - c**i*Fraction(c**i-1,c**i)**n for i in range(1,m+1) ))</pre><br /><div class="sectitle">Rate of change</div><br />Here is a comparison of how the number of nodes increases depending of which variable (n,m,c) is changed:<br /><br /><a href="http://1.bp.blogspot.com/-pXZYwhC3LW4/VcR0YXpF9TI/AAAAAAAAAeM/MHE5TEYL3Tg/s1600/trie-graph-c.png" imageanchor="1" ><img border="0" src="http://1.bp.blogspot.com/-pXZYwhC3LW4/VcR0YXpF9TI/AAAAAAAAAeM/MHE5TEYL3Tg/s320/trie-graph-c.png" /></a><a href="http://3.bp.blogspot.com/-A4foJReqOY4/VcR0YQ9RDXI/AAAAAAAAAeI/wiTTa6dA1Ug/s1600/trie-graph-m.png" imageanchor="1" ><img border="0" src="http://3.bp.blogspot.com/-A4foJReqOY4/VcR0YQ9RDXI/AAAAAAAAAeI/wiTTa6dA1Ug/s320/trie-graph-m.png" /></a><a href="http://2.bp.blogspot.com/-iKu_Bx3Xx68/VcR0YaPUaUI/AAAAAAAAAeE/xZ5_s5ofZU4/s1600/trie-graph-n.png" imageanchor="1" ><img border="0" src="http://2.bp.blogspot.com/-iKu_Bx3Xx68/VcR0YaPUaUI/AAAAAAAAAeE/xZ5_s5ofZU4/s320/trie-graph-n.png" /></a><br /><br />As "n" increases, the number of nodes added starts slowing down, which makes sense since the more strings there are, the more existing prefixes can be reused. As "m" increases, the number of nodes added starts speeding up and then becomes linear, which makes sense too since longer strings are sparser and thus it would be harder to find a matching prefix which is long from among 100 strings. As "c" increases, the number of nodes added shoots up until a point where it then slows down, almost like it is logarithmic. This is because after a point it will not matter how rare the strings are since there are only 100 strings to choose from among the "c^m" possible strings. Since the length is not increasing, the same number of nodes will be used.<br /><br /><div class="sectitle">Size of trie</div><br />So does using a trie compress a set of strings? Keep in mind that a node takes more space than a character since it needs to point to other nodes whereas strings are arrays without pointers. We'll assume that all strings are the same length in order to reduce the number of variables. This will reduce the amount of information needed for both the set of strings and the trie (no need to include terminator flags for the strings) and the number of strings of maximum length is greater than the total number of shorter strings so it will not be a significant error in representation.<br /><br />Call the number of nodes in the trie "N(n,m,c)".<br /><br />The size of the normal set of strings is as follows:<br />n(m log(c))<br />where "log(c)" is the size of each character (the number of bits needed to represent each character). Of course this assumes that each string is unique. Tries only store unique strings and the way we compute the number of nodes does not assume that the strings will be unique. So we need to subtract the expected number of repeated strings from among those "n" strings. The number of repeated strings is equal to the number of collisions when placing "n" objects in "c^m" slots.<br /><pre style="overflow-x:scroll;white-space:pre;">Array of strings: n(m log(c)) - (n - (c^m)(1 - (1 - 1/(c^m))^n))</pre><br />The size of the trie is as follows:<br />N(n,m,c)(k(log(c) + log(N(n,m,c))))<br />where "log(c)" is the size of each character (the number of bits needed to represent each character), "log(N(n,m,c))" is the size of a pointer (which at minimum would be the logarithm of the number of nodes), and "k" is the number of pointers used on average per node. Given that the majority of the nodes in a trie will be leaf nodes, the majority of nodes will not have children. In fact the average will be less than one child per node. If arrays are used, "k" must be equal to "c", but if a linked list is used then "k" is the average but we have to also include the linked list pointer size with each character. The pointer size of the linked lists can be assumed to be "log(N(n,m,c))" since the total number of child nodes is equal to the number of nodes (minus the root node).<br /><pre style="overflow-x:scroll;white-space:pre;">Array based: N(n,m,c)(c(log(c) + log(N(n,m,c))))</pre><pre style="overflow-x:scroll;white-space:pre;">Linked list based: N(n,m,c)(k(log(c) + log(N(n,m,c)) + log(N(n,m,c))))</pre><br />Here is a graph showing how the set of strings, array based trie, and linked list based trie increase in size with "n" when "c" is 5, "m" is 5, and "k" is 0.9:<br /><br /><a href="http://3.bp.blogspot.com/-Fvcc9cfpuPg/VcS4kxjLlNI/AAAAAAAAAe0/R5IHcAy6To4/s1600/trie-graph-sizecomparison.png" imageanchor="1" ><img border="0" src="http://3.bp.blogspot.com/-Fvcc9cfpuPg/VcS4kxjLlNI/AAAAAAAAAe0/R5IHcAy6To4/s320/trie-graph-sizecomparison.png" /></a><br /><br />It is clear that an array based trie cannot be used to compress a collection of strings as nodes take too much space. But what if we changed the value of "k" in the linked list based trie?<br /><br /><a href="http://3.bp.blogspot.com/-m1xWQJ7Ebiw/VcS6HYR540I/AAAAAAAAAfA/MOdQgaAcses/s1600/trie-graph-sizecomparison-ks.png" imageanchor="1" ><img border="0" src="http://3.bp.blogspot.com/-m1xWQJ7Ebiw/VcS6HYR540I/AAAAAAAAAfA/MOdQgaAcses/s320/trie-graph-sizecomparison-ks.png" /></a><br /><br />This shows that unless you have an average number of children per node of 0.2 or less, the array of strings will always take less space. Notice that this says nothing about tries which attempt to minimize the number of nodes such as <a href="https://en.wikipedia.org/wiki/Radix_tree">radix trees</a> where a single node represents a substring rather than a character. Also notice that this is about uniformly distributed strings, not linguistic strings which have a lot of redundancy. In a future post I shall make empirical comparisons on linguistic data.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-34103640785739448672015-07-08T19:36:00.000+02:002015-07-08T19:49:48.162+02:00Expected number of uniformly distributed collisions (birthday problem)Here's an interesting mathematical problem. If you have "n" objects to be inserted into "m" available slots using a uniformly distributed random placement, <a href="https://en.wikipedia.org/wiki/Birthday_problem#Collision_counting">how many collisions with already occupied slots should we expect to happen</a>? This is useful for hashtables and other data structures where duplicates are not allowed.<br /><br />Here is a Python 3 program that simulates inserting objects into random positions in an array and counting the average number of collisions.<br /><br /><pre class="brush:python">def collisions(n, m):<br /> trials = 10000<br /> total_collisions = 0<br /> for _ in range(trials):<br /> slot_is_occupied = [ False for _ in range(m) ]<br /> for _ in range(n):<br /> slot = random.randint(0, m-1)<br /> if slot_is_occupied[slot]:<br /> total_collisions += 1<br /> else:<br /> slot_is_occupied[slot] = True<br /> return total_collisions/trials<br /></pre><br />Here is a sample of the average number of collisions given by the above function for different values of "n" and "m":<br /><table border="1"><col width="10"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"><tr><th>n\m</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th></tr><tr><th>1</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr><tr><th>2</th><td>1.0</td><td>0.4947</td><td>0.3334</td><td>0.2515</td><td>0.2054</td><td>0.163</td><td>0.1437</td><td>0.1297</td><td>0.1118</td><td>0.1053</td></tr><tr><th>3</th><td>2.0</td><td>1.2447</td><td>0.8861</td><td>0.685</td><td>0.557</td><td>0.4633</td><td>0.4023</td><td>0.3537</td><td>0.325</td><td>0.2819</td></tr><tr><th>4</th><td>3.0</td><td>2.1291</td><td>1.5812</td><td>1.2588</td><td>1.0443</td><td>0.9054</td><td>0.7754</td><td>0.6924</td><td>0.6176</td><td>0.5459</td></tr><tr><th>5</th><td>4.0</td><td>3.0617</td><td>2.4</td><td>1.944</td><td>1.6457</td><td>1.4204</td><td>1.2364</td><td>1.1123</td><td>0.9984</td><td>0.9019</td></tr><tr><th>6</th><td>5.0</td><td>4.034</td><td>3.265</td><td>2.7078</td><td>2.304</td><td>2.0218</td><td>1.7604</td><td>1.6004</td><td>1.4349</td><td>1.318</td></tr><tr><th>7</th><td>6.0</td><td>5.0168</td><td>4.1742</td><td>3.5406</td><td>3.0498</td><td>2.6716</td><td>2.3973</td><td>2.1499</td><td>1.9417</td><td>1.7744</td></tr><tr><th>8</th><td>7.0</td><td>6.0075</td><td>5.1206</td><td>4.403</td><td>3.8363</td><td>3.3905</td><td>3.0304</td><td>2.7378</td><td>2.5151</td><td>2.3219</td></tr><tr><th>9</th><td>8.0</td><td>7.0035</td><td>6.0765</td><td>5.3052</td><td>4.6788</td><td>4.1652</td><td>3.738</td><td>3.4168</td><td>3.1205</td><td>2.8816</td></tr><tr><th>10</th><td>9.0</td><td>8.0016</td><td>7.0526</td><td>6.2233</td><td>5.5401</td><td>4.9632</td><td>4.493</td><td>4.0913</td><td>3.7721</td><td>3.5016</td></tr></table><br />Basically the answer is the number of objects "n" minus the number of occupied slots. This will give us the number of objects excluding the ones which were inserted without collision, that is, in an empty slot. For example, if I insert 5 objects into an array but at the end there are only 3 occupied slots, then that must mean that 2 of those objects were inserted in the same slot as some other objects (they collided with them).<br /><br />The question is how to predict the expected number of occupied slots.<br /><br /><div class="sectitle">Expected number of occupied slots</div><br />What is the average number of slots ending up being occupied by at least one object? This <a href="http://geekyisawesome.blogspot.com/2015/07/probabilities-are-average-proportions.html">previous blog post</a> explains that you basically just need to multiply the probability of a given slot being occupied at the end by the number of slots. So what is the probability of a slot being occupied?<br /><br /><div class="sectitle">Probability of a slot being occupied</div><br />What is the probability that an object is inserted into a particular slot out of "m" slots?<br />1/m<br /><br />Therefore the probability that the slot remains empty is<br />1 - 1/m<br /><br />What is the probability that the slot is still empty after another placement? It's the probability that the first object did not land on the slot AND that the second object did not land on the slot too. These two probabilities are independent of each other, so<br />(1 - 1/m)(1 - 1/m)<br /><br />In general, after "n" objects have been placed, the probability that the slot is still empty is<br />(1 - 1/m)^n<br /><br />Notice that this makes sense for n = 0 because if no objects were placed, then the probability that the slot is empty is 1.<br /><br />Which means that after "n" objects have been placed, the probability that the slot is occupied is<br />1 - (1 - 1/m)^n<br /><br /><div class="sectitle">Therefore...</div><br />Therefore, the expect number of occupied slots among "m" slots after "n" objects have been inserted with uniform probability is<br />m(1 - (1 - 1/m)^n)<br /><br />Which means that the expected number of collisions is<br />n - m(1 - (1 - 1/m)^n)<br /><br />Here is the same table as the one at the top showing the corresponding predicted number of collisions:<br /><br /><table border="1"><col width="10"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"> <col width="50"><tr><th>n\m</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th></tr><tr><th>1</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>-0.0</td><td>0.0</td></tr><tr><th>2</th><td>1.0</td><td>0.5</td><td>0.3333</td><td>0.25</td><td>0.2</td><td>0.1667</td><td>0.1429</td><td>0.125</td><td>0.1111</td><td>0.1</td></tr><tr><th>3</th><td>2.0</td><td>1.25</td><td>0.8889</td><td>0.6875</td><td>0.56</td><td>0.4722</td><td>0.4082</td><td>0.3594</td><td>0.321</td><td>0.29</td></tr><tr><th>4</th><td>3.0</td><td>2.125</td><td>1.5926</td><td>1.2656</td><td>1.048</td><td>0.8935</td><td>0.7784</td><td>0.6895</td><td>0.6187</td><td>0.561</td></tr><tr><th>5</th><td>4.0</td><td>3.0625</td><td>2.3951</td><td>1.9492</td><td>1.6384</td><td>1.4113</td><td>1.2387</td><td>1.1033</td><td>0.9944</td><td>0.9049</td></tr><tr><th>6</th><td>5.0</td><td>4.0313</td><td>3.2634</td><td>2.7119</td><td>2.3107</td><td>2.0094</td><td>1.776</td><td>1.5904</td><td>1.4394</td><td>1.3144</td></tr><tr><th>7</th><td>6.0</td><td>5.0156</td><td>4.1756</td><td>3.5339</td><td>3.0486</td><td>2.6745</td><td>2.3794</td><td>2.1416</td><td>1.9462</td><td>1.783</td></tr><tr><th>8</th><td>7.0</td><td>6.0078</td><td>5.1171</td><td>4.4005</td><td>3.8389</td><td>3.3954</td><td>3.0395</td><td>2.7489</td><td>2.5077</td><td>2.3047</td></tr><tr><th>9</th><td>8.0</td><td>7.0039</td><td>6.078</td><td>5.3003</td><td>4.6711</td><td>4.1628</td><td>3.7481</td><td>3.4053</td><td>3.118</td><td>2.8742</td></tr><tr><th>10</th><td>9.0</td><td>8.002</td><td>7.052</td><td>6.2253</td><td>5.5369</td><td>4.969</td><td>4.4984</td><td>4.1046</td><td>3.7715</td><td>3.4868</td></tr></table><br />The maximum absolute error between the two tables is 0.0179.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-27157580843991243692015-07-08T19:24:00.001+02:002015-07-08T19:35:44.396+02:00Probabilities are average proportions (expected value)Intuitively, if a coin flip has a probability of 1/2 of turning out heads, and we flipped the coin 100 times, we expect that 1/2 of those 100 flips will be heads. What is meant by "expect" is that if we do this 100 coin flip experiment for many times, count the number of times it turns out heads for each 100 flip trial, and take the average of these counts, the average will be close to 1/2 of 100. Furthermore, the more 100 flip trials we include in our average, the closer the average will be 1/2 of 100.<br /><br />If this were the case, then <a href="http://stats.stackexchange.com/questions/1525/whats-the-difference-between-a-probability-and-a-proportion">a probability can be treated as an average proportion</a>, because if a probability of something happening is, say, 1/100, then after 1000 attempts we should find that, on average, 1/100 of those 1000 attempts would be the thing happening. In general, if the probability of an outcome is "p", and "n" attempts are made, then we should have "pn" positive outcomes. That probability is acting as a proportion of the average number of attempts made which will result in a positive outcome out of the attempts made. In fact, semantically speaking, the phrase "This outcome occurs with probability 1/100" and the phrase "This outcome occurs once every 100 times" are identical.<br /><br />A simple proof of this is in the way we estimate the probability of an outcome. We attempt to produce the outcome (such as a coin flip resulting in heads) for a number of times "n", count the number of times "x" the outcome is positive (heads), and then just find x/n. But in order for this probability to be reliable, the quotient must remain constant for different values of "n" (the value "x" will change according to "n" to keep x/n equal). Given this statement, if we know a reliable probability x/n, and have performed the experiment "m" times, then the number of positive outcomes "y" can be predicted as follows:<br />For x/n to be reliable, x/n = y/m<br />Therefore, y = m(y/m) = m(x/n)<br />That is, since x/n is known and "m" is known, "y" can be found using those two values only.<br /><br />Of course this is not a rigorous proof. To get a rigorous proof we need to turn to a field of probability called <a href="https://en.wikipedia.org/wiki/Expected_value">expected value</a>. The expected value of a random variable (such as a coin flip) is the average of the values (assumed to be numerical) of the outcomes after a large number of trials. It is defined as the sum of each outcome multiplied by its probability. For example, the expected value of the value on a die is<br />1*1/6 + 2*1/6 + 3*1/6 + 4*1/6 + 5*1/6 + 6*1/6<br />because for each outcome from 1 to 6, the probability is 1/6.<br /><br />In general, if the probability of outcome "o_i" is "p_i", then the expected outcome is<br />sum(o_i*p_i for all i)<br /><br />But this isn't useful for proving the statement in the title. The proof is in this <a href="http://math.stackexchange.com/a/35867/11614">Math Exchange answer</a> which explains that the expected number of positive outcomes out of "n" attempts, given that the probability of each outcome each time is "p", is "pn". It goes like this:<br /><br />Let the random variable "U_i" be the outcome of the "i"th attempt (heads or tails). If the outcome is positive (heads), "U_i" is 1, otherwise it is 0. Given "n" attempts, the number of positive outcomes is<br />U_1 + U_2 + U_3 + ... + U_n<br /><br />Call this actual number of positive outcomes "X", that is<br />X = U_1 + U_2 + U_3 + ... + U_n<br /><br />The expected value of "X", written as E(X) is<br />E(X) = E(U_1 + U_2 + U_3 + ... + U_n)<br /><br />Since the expected value is a <a href="https://en.wikipedia.org/wiki/Expected_value#Linearity">linear operator</a>,<br />E(X) = E(U_1) + E(U_2) + E(U_3) + ... + E(U_n)<br /><br />Now, given the above definition of what an expected value is,<br />E(U_i) = 1*(probability of U_i = 1) + 0*(probability of U_i = 0)<br /><br />If the probability of "U_i" being 1 is "p_i", then<br />E(U_i) = p_i<br /><br />But for all "i", the probability of "U_i" is the same. That is<br />E(U_i) = p<br /><br />So that means that<br />E(X) = p + p + p + ... + p<br />E(X) = pn<br /><br />And there we have it, the expected number of positive outcomes out of "n" attempts, each of which has a probability of "p", is "pn", which means that the probability "p" can be treated exactly as if it was the proportion of positive outcomes out of a number of trials.mtantinoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-70855146966237397652015-06-24T13:32:00.000+02:002015-06-24T13:46:20.939+02:00Compressed frequencies: Representing frequencies with less bitsIn a cache memory you usually store the most frequently used data that is currently in a larger but slower memory. For example, you keep your most frequently accessed files cached in RAM rather than on your hard drive. Since you can't fit all the contents of your hard disk in RAM, you keep only the most frequently used files that can fit. You will still need to access your hard disk once in a while in order to access your less frequently used files but the average file access time will now be greatly reduced.<br /><br />The problem is how to keep count of the number of times each file is being used in order to know which is the most frequently used. The obvious solution is to associate each file with a number and increment that number each time it is used. But numbers take space as well, and sometimes this becomes a significant problem. You might not afford to waste 4 or 8 bytes of memory worth of frequency integers for every item. Is there a way to bring down the number of bytes used by the frequency integers without losing their usefulness?<br /><br />Here is an example of a 4 byte int integer number in memory representing the number <span style="font-weight:bold;">45723</span>:<br /><div style="text-align:center;font-family:'Courier New';font-size:16pt;">00000000 00000000 10110010 10011011</div><br />The most obvious thing you can do is to tear away the most significant bits (the ones on the left which have a larger value) by using smaller range number types such as the 2 byte short, which gives us <span style="white-space:nowrap;font-weight:bold;">10110010 10011011</span>. If the frequency is sometimes, but rarely, larger than the ranges provided by these smaller types, then you can just cap it off by stopping incrementation once the maximum number is reached. For example, if we're using a two byte short, then the maximum number this can be is 65535. Once the frequency reaches this number, then it gets frozen and never incremented again. Many frequences follow a zipfian distribution, meaning that the vast majority of items will have a small frequency, followed by a handful of very frequent items. An example of this is words in a document where most words will only occur once and only a few words such as "the" and "of" will occur frequently. If this is the case then you will be fine with capping off your frequencies since only a few items will have a large frequency and it might not be important to order these high frequency items among themselves.<br /><br />It might seem more useful instead to tear away the least significant bits (the ones on the right which have a smaller value) instead, since these are less useful. The way you do this is to divide the frequency by a constant and keep only the whole number part. For example, if we divide the above number by 256, we'd be shifting the bits by one byte to the right, which gives us <span style="white-space:nowrap;font-weight:bold;">00000000 00000000 00000000 10110010</span>. The least significant byte has been removed which means that we can use less bytes to store the frequency. But in order to do that you need to first have the actual frequency which defeats the purpose. So what we can do is to simulate the division by incrementing the frequency only once every 256 times. If we do that then the resulting number will always be a 256th of the actual frequency which is the frequency without the least significant byte. But how do you know when to increment the frequency next? If you keep a separate counter which counts to 256 in order to know when to increment next then you lose the space you would have saved. Instead we can do it stochastically using random numbers. Increment the frequency with a probability of 1 in 256 and the frequency will be approximately a 256th of the actual frequency.<br /><br />By combining these two ideas together we can reduce an 8 byte frequency into a single byte and that byte will be one of the original 8 bytes of the actual frequency. Here is a Python function that increments an integer with a compressed frequency that is a certain number of bytes long and with a certain number of least significant bytes torn off.<br /><br /><pre class="brush: python">def compressed_increment(frequency, bytes_length, bytes_torn):<br /> if frequency < 256**bytes_length: #cap the frequency to the maximum number that can be contained in bytes_length bytes<br /> if random.randint(1, 256**bytes_torn) == 1: #increment with a probability of 1 in 256^bytes_length (number to divide by to shift the frequency by that number of bytes)<br /> return frequency + 1<br /></pre><br />Of course this is a lossy compression. Information is lost. This means that the compressed frequency is not useful in certain situations, such as when you want to also decrement the frequency or when approximate frequencies are inadequate.mtantinoreply@blogger.com0