tag:blogger.com,1999:blog-43189444598231434732019-06-07T12:23:58.787+02:00Geeky is AwesomeTo learn is to teach.Unknownnoreply@blogger.comBlogger115125tag:blogger.com,1999:blog-4318944459823143473.post-74522759088179742722019-05-30T10:37:00.000+02:002019-05-30T10:37:15.985+02:00Word square makerI made a word square searcher. A <a href="https://en.wikipedia.org/wiki/Word_square">word square</a> is a square grid with equal length words in each row and column such that the word in the first row is the word in the first column, the word in the second row is the word in the second column, and so on. Here's an example:<br /><pre>care<br />acid<br />ring<br />edge<br /></pre><br />It works by taking a file of words and indexing them by character and position such that I can find all the words that have a particular character at a particular position. The program will then build every possible word square that can be made from these words. It goes through all the words and tries them all as a first row/column in a word square. Each word is used as a basis for finding the rest of the words.<br /><br />In the above example, the first row/column is "care". In order to find a suitable word for the second row/column, the index is used to find all the words that have a first letter equal to the first row's second letter. "acid"'s first letter is equal to "care"'s second letter, so it is a good candidate. Each one of these candidate words is used tentatively such that if it doesn't pan out then we can backtrack and try another candidate word. For the third row/column, the index is used to find all the words that have a first letter equal to the first row's third letter and a second letter equal to the second row's third letter. The intersection of these two groups of words would be all the suitable words that can be used as a third row. 
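The index lookup described above can be sketched in Python (a minimal sketch, using a tiny hypothetical word list rather than the word file the program actually reads):

```python
from collections import defaultdict

# Hypothetical tiny word list; the real program reads these from a file.
words = ['care', 'acid', 'ring', 'edge']

# Index the words by (position, character) so that all words with a given
# character at a given position can be fetched in one lookup.
index = defaultdict(set)
for word in words:
    for pos, char in enumerate(word):
        index[(pos, char)].add(word)

# Candidates for the third row given 'care' as first row and 'acid' as second:
# the candidate's first letter must equal 'care''s third letter and its
# second letter must equal 'acid''s third letter.
candidates = index[(0, 'care'[2])] & index[(1, 'acid'[2])]
print(candidates)  # {'ring'}
```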
"ring"'s first letter is equal to "care"'s third letter and its second letter is equal to "acid"'s third letter as well, so it is a good candidate for the third row. This process keeps on going until none of the words make a word square given the first row/column. The program then moves on to another possible word in the first row/column.<br /><br />You can find the program here: <a href="https://onlinegdb.com/By7XtdhpE">https://onlinegdb.com/By7XtdhpE</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-33614535457461529402019-05-01T13:50:00.001+02:002019-05-01T13:50:35.693+02:00Neutral Room Escape Game: Lights - Plus 7 Times 10 puzzle solverI've been playing a point and click game called <a href="http://neutralxe.net/esc/r2.html">Lights</a> which features a puzzle consisting of a 4 digit counter which starts from '0000' and which needs to be made equal to a certain number by a sequence of operations which are to either add 7 to the current value or multiply the current value by 10. I wrote a Python program to quickly solve the puzzle and I'm sharing the code for any one who needs to use it as well. You can find the program online at <a href="https://onlinegdb.com/rkg_l_ZDiV">https://onlinegdb.com/rkg_l_ZDiV</a>. Just run the program, enter the number you need to reach when requested and you'll see a sequence of steps showing which operations to perform in order to solve the puzzle in the least number of steps. The program works using a breadth first search which explores all possible moves without repeating previously reached numbers until the goal number is reached. 
Enjoy!<br /><br /><br /><br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-10442361374087890842019-04-30T22:20:00.000+02:002019-04-30T22:20:10.418+02:00Sliding tile puzzle solverSo I've been playing a point and click game called <a href="http://akarika.net/games/escape/loom_blend.htm">Loom Blend</a> which features a <a href="https://en.wikipedia.org/wiki/Sliding_puzzle">sliding tile puzzle</a>. I wrote a Python program to quickly solve the puzzle there and I'm sharing the code for anyone who needs to use it as well. You can find the program online at <a href="https://onlinegdb.com/BkkrnXLsN">https://onlinegdb.com/BkkrnXLsN</a>. Just run the program and you'll see a sequence of steps showing which tiles to move in order to solve the puzzle in the least number of steps. The program works using a breadth first search which explores all possible moves without repeating previously reached tile configurations until the goal configuration is reached. Enjoy!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-56345914555229401872019-03-17T15:42:00.002+01:002019-03-17T15:46:16.872+01:00The Gambler's Fallacy: Why after tossing 100 heads in a row you can still get heads againThe <a href="https://en.wikipedia.org/wiki/Gambler's_fallacy">Gambler's fallacy</a> is a deceptively intuitive fallacy about how randomness works. Imagine you're tossing a fair coin repeatedly. You're astonished to see that every toss is resulting in heads. Heads, heads, heads, heads, over and over. As the number of consecutive heads keeps going up, does the probability of getting tails start increasing? It seems reasonable to conclude this as the number of heads and tails should balance each other out given that the coin is fair, as is stated by the <a href="https://en.wikipedia.org/wiki/Law_of_large_numbers">Law of Large Numbers</a>.
But the idea that there is some kind of memory in the coin that is used to prevent the fair coin from acting as if it is unfair is wrong.<br /><br />First, let's show that the Gambler's Fallacy is actually a fallacy by simulating a bunch of coin tosses in a program. We'll keep track of all the times a streak of 10 heads was obtained and the toss that followed it, and see if the number of heads following the streak is significantly less than the number of tails. We do not consider 11 heads in a row as two streaks of 10 heads but instead start searching for a completely new streak after every streak. Here is a Python 3 program to do that (we also count the number of heads and tails obtained in general to show that the simulated coin is not biased).<br /><br /><pre class="brush:python">import random<br /><br />seed = 0<br />num_trials = 1000000<br />head_streak_length_limit = 10<br /><br />random.seed(seed)<br /><br />num_heads = 0<br />num_tails = 0<br />heads_streak_length = 0<br />next_was_head = 0<br />next_was_tail = 0<br />for trial in range(num_trials):<br />    result = random.choice(['head', 'tail'])<br /><br />    if result == 'head':<br />        num_heads += 1<br />    else:<br />        num_tails += 1<br /><br />    if heads_streak_length < head_streak_length_limit:<br />        if result == 'head':<br />            heads_streak_length += 1<br />        else:<br />            heads_streak_length = 0<br />    else:<br />        if result == 'head':<br />            next_was_head += 1<br />        else:<br />            next_was_tail += 1<br /><br />        heads_streak_length = 0<br /><br />print('heads / tails:', round(num_heads/num_tails, 4))<br />print('after heads streak was head / after heads streak was tail:', round(next_was_head/next_was_tail, 4))<br /></pre><br />And here are the results:<br /><pre>heads / tails: 1.002<br />after heads streak was head / after heads streak was tail: 1.0579<br /></pre><br />Both the ratio of heads and tails in general and the ratio of heads and tails following a streak of 10 heads in a row are close to 1.
You can try changing the streak length limit to see how making it shorter does not significantly change this ratio.<br /><br />So why does this happen? It's a simple matter of probability theory. Both the probability of getting a head and of getting a tail is 0.5 and each toss is independent from the previous one. Therefore the probability of getting 10 heads followed by another head is $0.5^{10} \times 0.5 = 0.5^{11}$ and the probability of getting 10 heads followed by a tail is also $0.5^{10} \times 0.5 = 0.5^{11}$. It does not make a difference what the previous tosses were; the next toss being heads will always have a probability of 0.5, so both heads and tails are equally likely.<br /><br />So then the more interesting question is why do we think otherwise? Why do we think that as the chain of consecutive heads gets longer, the probability of getting another head goes down? It could be because we expect the coin to be fair and a coin that never gives tails is a biased coin, but that doesn't really mean anything in terms of predicting what the next toss will be. It is entirely possible that the fairest possible coin will give a million heads in a row. There is nothing stopping this from happening. Not only that, but a million heads in a row is just as likely to happen as any other random looking sequence resulting from a million coin tosses. Let's write down a sequence of heads and tails as a string of 'H's and 'T's. The sequence 'HHHHHH' is just as likely to come up as the sequence 'HHHTTH', yet we think of the first sequence as being unlikely whilst of the second sequence as... what? Is the second sequence more likely to come up than the first? Would you bet more money on the second sequence than the first?
This is what I think is the root cause of the Gambler's Fallacy.<br /><br />What I think happens in our mind is that we don't really predict whether the next toss will give heads or tails but whether the whole sequence will turn out to be an 'interesting' looking sequence or not. We love patterns, and when we see that a random sequence generator (coin tosses in this case) is giving out a sequence that looks like a simple pattern, we become suspicious of the randomness behind the sequence generator and assume that it isn't really random. I mean <a href="https://www.youtube.com/watch?v=tP-Ipsat90c">we're pretty terrible at generating random sequences manually</a> because we assume that in a random process you will not get things like a streak of heads. But it's not really just a streak of heads that would make us suspicious, is it? It's also a sequence of alternating heads and tails, for example, or a long sequence of heads followed by a long sequence of tails. I think that the source of the Gambler's Fallacy is that we categorise entire sequences into interesting and uninteresting sequences. As we watch a sequence being generated one toss at a time, we expect that interesting sequences of that length are much less likely to happen than uninteresting ones, which is true. This makes us expect the next toss to break the interestingness of the current sequence, which, in the case of a streak of heads, means the next toss turning out to be tails. Of course, although interesting sequences become less likely as the length grows, the probability of an interesting pattern becoming uninteresting on the next toss is equal to the probability of it remaining interesting, so the expectation is still incorrect.<br /><br />So what is the probability of an interesting sequence? This is a subjective question but we can still try to investigate it.
Ideally a survey is conducted on a population of people to check what is considered as interesting by people on average. But that would be weird so I'll just conduct the survey on myself. I have generated all the possible coin tosses that can be obtained after 4 coin tosses and I will be annotating each sequence according to what I find interesting in it. I also did the same on 6 coin tosses to see how the ratio of interesting to uninteresting sequences changes with sequence length.<br /><br />Here is what I find interesting in sequences of 4 coin tosses.<br /><table border="1"><tr><td>TTTT</td><td style="font-weight:bold;">same</td></tr><tr><td>TTTH</td><td>not interesting</td></tr><tr><td>TTHT</td><td>not interesting</td></tr><tr><td>TTHH</td><td style="font-weight:bold;">switched</td></tr><tr><td>THTT</td><td>not interesting</td></tr><tr><td>THTH</td><td style="font-weight:bold;">repeated</td></tr><tr><td>THHT</td><td style="font-weight:bold;">mirrored</td></tr><tr><td>THHH</td><td>not interesting</td></tr><tr><td>HTTT</td><td>not interesting</td></tr><tr><td>HTTH</td><td style="font-weight:bold;">mirrored</td></tr><tr><td>HTHT</td><td style="font-weight:bold;">repeated</td></tr><tr><td>HTHH</td><td>not interesting</td></tr><tr><td>HHTT</td><td style="font-weight:bold;">switched</td></tr><tr><td>HHTH</td><td>not interesting</td></tr><tr><td>HHHT</td><td>not interesting</td></tr><tr><td>HHHH</td><td style="font-weight:bold;">same</td></tr></table><br />And here is what I find interesting in sequences of 6 coin tosses (organised into two columns to reduce vertical space).<br /><table border="1"><tr><td>TTTTTT</td><td style="font-weight:bold;">same</td><td>TTTTTH</td><td>not interesting</td></tr><tr><td>TTTTHT</td><td>not interesting</td><td>TTTTHH</td><td>not interesting</td></tr><tr><td>TTTHTT</td><td>not interesting</td><td>TTTHTH</td><td>not interesting</td></tr><tr><td>TTTHHT</td><td>not interesting</td><td>TTTHHH</td><td 
style="font-weight:bold;">switched</td></tr><tr><td>TTHTTT</td><td>not interesting</td><td>TTHTTH</td><td style="font-weight:bold;">doubled</td></tr><tr><td>TTHTHT</td><td>not interesting</td><td>TTHTHH</td><td>not interesting</td></tr><tr><td>TTHHTT</td><td style="font-weight:bold;">mirrored</td><td>TTHHTH</td><td>not interesting</td></tr><tr><td>TTHHHT</td><td>not interesting</td><td>TTHHHH</td><td>not interesting</td></tr><tr><td>THTTTT</td><td>not interesting</td><td>THTTTH</td><td>not interesting</td></tr><tr><td>THTTHT</td><td style="font-weight:bold;">mirrored</td><td>THTTHH</td><td>not interesting</td></tr><tr><td>THTHTT</td><td>not interesting</td><td>THTHTH</td><td style="font-weight:bold;">alternated</td></tr><tr><td>THTHHT</td><td>not interesting</td><td>THTHHH</td><td>not interesting</td></tr><tr><td>THHTTT</td><td>not interesting</td><td>THHTTH</td><td>not interesting</td></tr><tr><td>THHTHT</td><td>not interesting</td><td>THHTHH</td><td style="font-weight:bold;">doubled</td></tr><tr><td>THHHTT</td><td>not interesting</td><td>THHHTH</td><td>not interesting</td></tr><tr><td>THHHHT</td><td style="font-weight:bold;">mirrored</td><td>THHHHH</td><td>not interesting</td></tr><tr><td>HTTTTT</td><td>not interesting</td><td>HTTTTH</td><td style="font-weight:bold;">mirrored</td></tr><tr><td>HTTTHT</td><td>not interesting</td><td>HTTTHH</td><td>not interesting</td></tr><tr><td>HTTHTT</td><td style="font-weight:bold;">doubled</td><td>HTTHTH</td><td>not interesting</td></tr><tr><td>HTTHHT</td><td style="font-weight:bold;">don't know</td><td>HTTHHH</td><td>not interesting</td></tr><tr><td>HTHTTT</td><td>not interesting</td><td>HTHTTH</td><td>not interesting</td></tr><tr><td>HTHTHT</td><td style="font-weight:bold;">alternated</td><td>HTHTHH</td><td>not interesting</td></tr><tr><td>HTHHTT</td><td>not interesting</td><td>HTHHTH</td><td style="font-weight:bold;">mirrored</td></tr><tr><td>HTHHHT</td><td>not interesting</td><td>HTHHHH</td><td>not 
interesting</td></tr><tr><td>HHTTTT</td><td>not interesting</td><td>HHTTTH</td><td>not interesting</td></tr><tr><td>HHTTHT</td><td>not interesting</td><td>HHTTHH</td><td style="font-weight:bold;">mirrored</td></tr><tr><td>HHTHTT</td><td>not interesting</td><td>HHTHTH</td><td>not interesting</td></tr><tr><td>HHTHHT</td><td style="font-weight:bold;">doubled</td><td>HHTHHH</td><td>not interesting</td></tr><tr><td>HHHTTT</td><td style="font-weight:bold;">switched</td><td>HHHTTH</td><td>not interesting</td></tr><tr><td>HHHTHT</td><td>not interesting</td><td>HHHTHH</td><td>not interesting</td></tr><tr><td>HHHHTT</td><td>not interesting</td><td>HHHHTH</td><td>not interesting</td></tr><tr><td>HHHHHT</td><td>not interesting</td><td>HHHHHH</td><td style="font-weight:bold;">same</td></tr></table><br />Among the 4 coin toss sequences, 8 out of the 16 sequences were considered interesting, meaning that you have a 50% chance of generating an interesting sequence. But among the 6 coin toss sequences, only 16 out of the 64 sequences were considered interesting, meaning that you have a 25% chance of generating an interesting sequence. Now I don't know if the patterns I identified in the tables above are the only interesting patterns for any sequence length, since maybe some patterns only emerge in very long sequences, but it is safe to say that the longer the sequence, the greater the proportion of uninteresting and random looking sequences will be. Therefore, the longer the sequence, the less likely that a sequence will be interesting. 
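These counts can be checked mechanically. Below is a sketch that formalises the pattern categories from the tables as simple string rules (my own formalisation, not a definitive one: it reproduces the 8-out-of-16 and 16-out-of-64 counts above, though it does not capture the one "don't know" entry):

```python
from itertools import product

def is_interesting(seq):
    """Check a toss sequence (string of 'T'/'H') against the simple patterns
    used in the tables: mirrored (palindrome, which includes 'same'),
    doubled/repeated halves, alternated pairs, and switched halves."""
    n = len(seq)
    if seq == seq[::-1]:                  # mirrored (and same)
        return True
    if seq[:n // 2] * 2 == seq:           # doubled / repeated
        return True
    if seq[:2] * (n // 2) == seq:         # alternated
        return True
    if len(set(seq[:n // 2])) == 1 and len(set(seq[n // 2:])) == 1:
        return True                       # switched
    return False

for n in (4, 6):
    count = sum(is_interesting(''.join(s)) for s in product('TH', repeat=n))
    print(f'length {n}: {count} interesting out of {2 ** n}')
```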
This might be what is going on when we think that the probability of getting tails goes up as the streak of heads gets longer; we'd really be judging how likely it is for a sequence of a given length to be interesting, which becomes less likely as the length grows.<br /><br /><div class="sectitle">Interesting patterns are interesting</div>I have to admit that whilst annotating interesting sequences I was more interested in being consistent in my reasoning than in saying what I genuinely found interesting. This is because I didn't actually feel that, for example, all "repeated" patterns were interesting, just some of them. But I wanted to avoid explaining myself too much based on what I felt. This means that there might be more going on in feeling that a sequence is aesthetically pleasing than simple mathematical patterns. It would be interesting if someone ran an experiment where people are subjected to an EEG to see what is really considered interesting and what is not. Who knows, maybe it might even turn out to be a kind of Rorschach test where different patterns of interest indicate different personalities. And maybe it might even be an interesting machine learning challenge to try to make a program predict which sequences would be considered interesting by humans in order to quantify how interesting a given sequence is.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-83109685167918923362019-02-26T17:28:00.001+01:002019-02-26T17:28:59.170+01:00MySQL extra inner join taking too long: optimizer_search_depthMySQL has a <a href="https://dev.mysql.com/doc/refman/5.5/en/optimize-overview.html">query optimiser</a> that takes in SQL queries and tries to find the best way to execute them. Then the query gets executed using the best execution plan found.
When you use joins in your query, the time the optimiser takes to finish optimising is <a href="https://dev.mysql.com/doc/refman/5.5/en/controlling-query-plan-evaluation.html">exponential in the number of joins</a>. This means that for a few joins it only takes a negligible amount of time to finish but after a point, which is somewhere between 7 and 10 tables, the optimisation time will shoot up. In my case I went from 0.1 seconds to 17 seconds by just adding another table to my joins.<br /><br />To check if this is happening to you, you can <a href="https://dev.mysql.com/doc/refman/5.5/en/show-profile.html">profile your query</a> by executing the following:<br /><br /><pre class="brush:python">SET profiling = 1;<br /><br />*your query*<br /><br />SHOW PROFILE FOR QUERY 1;<br /></pre><br />(If you're using phpMyAdmin you might need to add "SHOW PROFILE;" as well at the end)<br /><br />This will show you a table of execution times for each phase in the execution of the query. The time taken by the query optimiser is in the "statistics" row. If this is too high, then you need to stop your optimiser from spending so much time.<br /><br />Controlling how much time the optimiser should spend doing its job can be accomplished using the <a href="https://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_optimizer_search_depth">optimizer_search_depth</a> variable. This is the maximum depth to search (presumably in some tree search algorithm) when optimising the query and is unfortunately set to the highest value by default. Being set to the highest value makes it the most likely to find the best execution plan but may also be unnecessarily time consuming. Thankfully there is a better default value to use: 0. When this variable is set to zero, the optimiser automatically picks a value based on the query and that seems to work well for me.
To do this just execute the following query:<br /><br /><pre class="brush:python">SET optimizer_search_depth = 0;<br /></pre>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-25216890623852623542019-01-29T17:45:00.003+01:002019-01-29T18:13:33.838+01:00Python string format quick guideThis is a quick summary guide for when you just want to remind yourself how to format a value with Python's <a href="https://docs.python.org/3.6/library/string.html#format-string-syntax">'string'.format()</a>.<br /><br /><div class="sectitle">Referencing</div><br />Curly brackets are used to refer to parameter values in the 'format' method. On their own, the first pair of curly brackets will refer to the first parameter, second to the second, etc.<br /><br /><pre class="brush:python">'{} {} {}'.format('a', 1, True)<br /></pre><pre>a 1 True</pre><br />Alternatively you can specify which parameter you want to use where using indexes.<br /><br /><pre class="brush:python">'{0} {0} {2}'.format('a', 1, True)<br /></pre><pre>a a True</pre><br />You can even use names instead of indexes.<br /><br /><pre class="brush:python">'{a} {b} {c}'.format(a='a', b=1, c=True)<br /></pre><pre>a 1 True</pre><br />And even refer to indexes in lists and to instance variables in objects.<br /><br /><pre class="brush:python">'{a[0]} {b.numerator}'.format(a=[3,1,2], b=fractions.Fraction(1,2))<br /></pre><pre>3 1</pre><br /><div class="sectitle">Formatting</div><br />Adding a colon inside the curly brackets allows you to format the values before they are added to the string. You can combine these with any of the above methods of referencing. The following formatting meta characters are to be used in the following order:<br /><pre>{:[alignment] [grouping] [precision] [data type]}</pre><br /><div class="subsectitle">Aligning/padding/leading zeros</div>An alignment meta character is either >, <, or ^. 
The number after it is the field width in which to align.<br /><br /><pre class="brush:python">'|{:>3}|{:<3}|{:^3}|'.format('a', 'b', 'c')<br /></pre><pre>| a|b | c |</pre><br />Any character before the alignment meta character is used instead of spaces.<br /><pre class="brush:python">'{:_>5} {:0>5}'.format('a', 12)<br /></pre><pre>____a 00012</pre><br /><div class="subsectitle">Grouping</div>You can group digits in numbers into triples with commas automatically by just putting a comma in the curly brackets.<br /><br /><pre class="brush:python">'{:,}'.format(12345678)</pre><pre>12,345,678</pre><br /><div class="subsectitle">Precision</div>When formatting floats you can specify the precision by adding a dot and a number. Without an explicit data type, the number is displayed in general format, which switches to scientific notation (like 1.23e+07) for very large or very small numbers, with the precision referring to the number of significant digits. See the next section to format your number in fixed point notation, where the precision refers to the number of digits after the point.<br /><br /><pre class="brush:python">'{:.3}'.format(12345678.0)</pre><pre>1.23e+07</pre><br /><div class="subsectitle">Data types</div>This is the most important part.
The data type always comes at the end and is a single letter.<br /><br />You can use it to change the numeric base of a whole number.<br /><pre class="brush:python">'dec-{:d} hex-{:X} oct-{:o} bin-{:b}'.format(16, 16, 16, 16)</pre><pre>dec-16 hex-10 oct-20 bin-10000</pre><br />You can use it to convert unicode values to characters.<br /><pre class="brush:python">'{:c}'.format(97)</pre><pre>a</pre><br />You can use it to convert numbers into scientific notations,<br /><pre class="brush:python">'{:e}'.format(12345678)</pre><pre>1.234568e+07</pre><br />or fixed point notations,<br /><pre class="brush:python">'{:f}'.format(12345678)</pre><pre>12345678.000000</pre><br />or percentages.<br /><pre class="brush:python">'{:%}'.format(0.1234)</pre><pre>12.340000%</pre><br /><div class="sectitle">Mixed examples</div><br /><pre class="brush:python">'{:0>2X}'.format(10)</pre><pre>0A</pre><br /><pre class="brush:python">'{:,d}'.format(1234567)</pre><pre>1,234,567</pre><br /><pre class="brush:python">'{:,.2f}'.format(12345.6)</pre><pre>12,345.60</pre><br /><pre class="brush:python">'{:.2%}'.format(0.1234)</pre><pre>12.34%</pre>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-91605694465384295222018-12-27T19:37:00.000+01:002018-12-27T21:31:20.191+01:00Language models and the unknown token: Why perplexity gets worse as the corpus size increasesUnless you're making use of a character-based language model somehow, you're going to have a finite vocabulary of words that you can handle and so you need a way to handle out-of-vocabulary words. One common way to handle such words is by replacing them all with the unknown token, a pseudo word that replaces all out-of-vocabulary words. The unknown token has several advantages: Infrequent words in the training corpus will probably either not be learned by the neural network or not be used when generating sentences. 
Ignoring them will save a lot in model size as the embedding layer and softmax take a lot of parameters. Plus all the infrequent words put together will occur very frequently so the language model will definitely learn to use the unknown token. It also makes it possible to handle new words at test time (that are not found even once in the training corpus) as they will be replaced by the unknown token.<br /><br />The problem with this approach is what happens when measuring the probability or perplexity of a sentence based on the probabilities of individual words. If you're comparing language models to see which ones make the best predictions, you usually use them all on a corpus to see how well they predict the words in the corpus. The higher the probabilities assigned to those sentences, the better the language model. This is usually measured using language model perplexity. But see what happens when you vary the vocabulary size. You will find that smaller vocabulary sizes lead to better language models, even though this makes no sense.<br /><br />It turns out that if you just multiply all the probabilities of individual words as-is, including that of the unknown token, then your probability will be sensitive to the vocabulary size. Let's say that you have a vocabulary size of 0, that is, you're considering all the words to be out-of-vocabulary and hence all of them will be replaced by the unknown token. This means that during training, the language model will learn that after every unknown token there is (probably) another unknown token, unless it's the end of the sentence. This will make the language model give very high probabilities for the unknown token. High word probabilities mean high sentence probabilities which mean good perplexities.<br /><br />Now if we add another word to the vocabulary then we'll be introducing some uncertainty into the language model as now it has to decide between using the unknown token or the known word.
Even in a perfect language model, the same prefix of words can be followed by either of the two words so there is no way to correctly assign 100% of the probability to one or the other. This means that the probabilities will be split between the two words, leading to an overall decrease in probabilities, leading to a worse perplexity. Adding more words to the vocabulary makes this even worse, which means that language models with smaller vocabularies have a huge unfair advantage over language models that actually do their job and correctly predict the right word.<br /><br />We can't do away with the unknown token but we can strip away the unknown token's power. Assuming that all the language models are being evaluated on the same corpus, then different vocabularies will have different words being turned into the unknown token. Let's say that your language model considers 1000 different words in its vocabulary but the corpus you're evaluating it on has 500 different words that are out-of-vocabulary. So in reality your language model is predicting one of 1500 different words; it's just that 500 of those words are assumed to be a single word with a single probability. But really there should be 500 separate probabilities for those out-of-vocabulary words and not just one. If we avoid merging all those probabilities into one, then all the language models will have a fair comparison as they will all have the same vocabulary and they will all have the same amount of uncertainty about which word should come next. The question is how to distribute that single unknown token probability between the 500 out-of-vocabulary words. The simplest solution is to assume a uniform distribution and just give each word the same slice of probability from the whole.
So if the unknown token has a probability of $p$, then each out-of-vocabulary word gets a probability of $\frac{p}{500}$.<br /><br />Now every time you encounter the unknown token in the evaluation corpus you know that the token is being used in place of one of those 500 words. But you don't know which one it is. Not a problem, just divide the probability by 500 and it's as if all words in the corpus are in the vocabulary. Do this to every unknown token probability and now you have a fair measure of perplexity. Let's see an example.<br /><br />Let's say that we want to find the probability of the following sentence:<br /><pre>the loud dog barked at the loud man</pre><br />and let's say that the language model we're using to do that has the following vocabulary:<br /><pre>the at dog man</pre><br />this means that the sentence is now changed to:<br /><pre>the UNK dog UNK at the UNK man</pre><br />Now the naive way to get the probability of the sentence is as follows:<br /><br />$$<br />P(\text{the UNK dog UNK at the UNK man}) = \\<br />p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\<br />\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\<br />\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})<br />$$<br /><br />But now with the new way we'll divide the unknown token's probabilities by 2, the number of different out of vocabulary words ('loud' and 'barked'):<br /><br />$$<br />P(\text{the UNK dog UNK at the UNK man}) = \\<br />p(\text{the}|\text{}) \times p(\text{UNK}|\text{the})/2 \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog})/2 \\<br />\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the})/2 \\<br />\times 
p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})<br />$$<br /><br />Of course we can leave the re-weighting till the end of the equation by dividing the first equation by the number of different out-of-vocabulary words for as many times as there are unknown tokens, like this:<br /><br />$$<br />P(\text{the UNK dog UNK at the UNK man}) = \\<br />p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\<br />\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\<br />\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})/ 2^3<br />$$<br /><br />Now the sentence probability goes up as the vocabulary size increases!Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-20510933309610367662018-11-29T22:14:00.002+01:002018-11-29T22:21:46.349+01:00Explainable AI (XAI): Sensitivity AnalysisImagine you have a neural network that can classify images well. At test time, would it be enough to just know what the class of the image is? Or would you also want to know why the image was classified the way it was? This is one of the goals of explainable AI and in this post we'll see what sensitivity analysis is.<br /><br />Sensitivity analysis is a way to measure the importance of different parts of an input to one part of an output. In other words, you want to know which pixels were most important to give a high probability for a particular class. The way sensitivity analysis measures this is by measuring how sensitive the output is to each pixel, that is, which pixels, when changed, will change the output the most. Most pixels should leave the output unchanged when they themselves are changed, but some pixels would be critical to the particular output that the neural network has given. 
To measure this, we simply find the following:<br /><br />$\left|\frac{df(x)_i}{dx_j}\right|$<br /><br />that is, the sensitivity of output $i$ to input $j$ is the absolute value of the gradient of the output with respect to the input. In Tensorflow we can compute this as follows:<br /><br /><pre class="brush:python">sensitivity = tf.abs(tf.gradients([outputs[0, output_index]], [images])[0][0])</pre><br />where 'output_index' is a placeholder that is a scalar of type tf.int64, 'outputs' is a softmax with probabilities for each class and where the first index is the batch index and the second index is the class probability, and 'images' is the pixels of images where the first index is batch index and the second index is the image in vector form. This code also assumes that only the first image in the batch is to be analysed. This is because Tensorflow can only find the gradient of a single scalar so we can only find the sensitivity of a single output of a single image.<br /><br />Here are some examples I got when I tried it on a simple fully connected two layer neural network trained to classify MNIST handwritten digits.<br /><br /><a href="https://1.bp.blogspot.com/-kLdO_Qo6vVA/XABXKiTZ30I/AAAAAAAABRw/LE_WF4Blg7oWcndSPQuQduTbUNaJkyQdACLcBGAs/s1600/sensitivityanalysis_0.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-kLdO_Qo6vVA/XABXKiTZ30I/AAAAAAAABRw/LE_WF4Blg7oWcndSPQuQduTbUNaJkyQdACLcBGAs/s1600/sensitivityanalysis_0.png" data-original-width="108" data-original-height="210" /></a><a href="https://4.bp.blogspot.com/-EZPrCkoHGe8/XABXK2pdv-I/AAAAAAAABR0/l3fGgIdEw9YdKccqfKnP-iHX81oIHeS5ACLcBGAs/s1600/sensitivityanalysis_4.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-EZPrCkoHGe8/XABXK2pdv-I/AAAAAAAABR0/l3fGgIdEw9YdKccqfKnP-iHX81oIHeS5ACLcBGAs/s1600/sensitivityanalysis_4.png" data-original-width="108" data-original-height="210" /></a><a 
href="https://2.bp.blogspot.com/-SrkbuGE47ac/XABXK1jIC6I/AAAAAAAABRs/1p_c1eZrCk0GA4Fgm3swRNY7nT4VPztCgCLcBGAs/s1600/sensitivityanalysis_5.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-SrkbuGE47ac/XABXK1jIC6I/AAAAAAAABRs/1p_c1eZrCk0GA4Fgm3swRNY7nT4VPztCgCLcBGAs/s1600/sensitivityanalysis_5.png" data-original-width="108" data-original-height="209" /></a><a href="https://4.bp.blogspot.com/-Zvss5Mw6XGQ/XABXLI2-xxI/AAAAAAAABR4/TTq4wrHC2vYOmHQ5_Sv_Ui053s--P-amwCLcBGAs/s1600/sensitivityanalysis_7.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-Zvss5Mw6XGQ/XABXLI2-xxI/AAAAAAAABR4/TTq4wrHC2vYOmHQ5_Sv_Ui053s--P-amwCLcBGAs/s1600/sensitivityanalysis_7.png" data-original-width="108" data-original-height="209" /></a><a href="https://1.bp.blogspot.com/-SqxUH5Ozhew/XABXLuZg_TI/AAAAAAAABR8/UWUY35aNAx0oirW-jiHeJ8eLnU8ZNr27gCLcBGAs/s1600/sensitivityanalysis_9.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-SqxUH5Ozhew/XABXLuZg_TI/AAAAAAAABR8/UWUY35aNAx0oirW-jiHeJ8eLnU8ZNr27gCLcBGAs/s1600/sensitivityanalysis_9.png" data-original-width="108" data-original-height="209" /></a><br /><br />These heat maps show which pixels have the highest sensitivity score for the digit they were classified as. We can see how the empty space in the middle is important for classifying the zero. The four and the nine can be easily confused for each other were it not for the little bit in the top left corner. The seven can be confused for a nine if we complete the top left loop, and the five can be confused with a six or eight if we draw a little diagonal line.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-18073264991404833442018-10-24T22:10:00.000+02:002018-10-24T22:28:24.800+02:00Mentally estimating square rootsWhat's $\sqrt{10}$? I'd reckon it's about $3 + \frac{10-9}{2 \times 3} = 3.166$. The actual answer is 3.162. What about $\sqrt{18}$? Should be about $4 + \frac{18-16}{2 \times 4} = 4.250$. 
The actual answer is 4.242. I found out about this calculation from <a href="https://www.youtube.com/watch?v=PJHtqMjrStk">this video</a> but there was no explanation for why it works. Here's an explanation.<br /><br />So the method here is as follows.<br /><ol><li>Let the number you want to find the square root of be $a^2$.</li><li>Let the largest square number which is less than $a^2$ be $b^2$ such that $b^2 <= a^2 < (b+1)^2$. For example if $a^2$ is 10 then $b^2$ is 9, if $a^2$ is 18 then $b^2$ is 16.</li><li>The square root of $a^2$ is approximately $b + \frac{a^2 - b^2}{2 b}$.</li></ol><br />This method is easy to carry out mentally but why does it work? The trick here is that the graph of the square root function grows so slowly that we can approximate the curve between two adjacent square numbers as a line.<br /><br /><a href="https://3.bp.blogspot.com/-lgpg71gLShs/W9DVfIo9KHI/AAAAAAAABQE/XwYjBepZ-9g8H-P4mbeFruf1LACrY3bSACLcBGAs/s1600/sqrtestimation_graph.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-lgpg71gLShs/W9DVfIo9KHI/AAAAAAAABQE/XwYjBepZ-9g8H-P4mbeFruf1LACrY3bSACLcBGAs/s640/sqrtestimation_graph.png" width="640" height="338" data-original-width="1104" data-original-height="583" /></a><br /><br />We can use the line to approximate the square root of any number between two square numbers. The first thing we need to know is the gradient of the line. The vertical distance between two adjacent square numbers on the square root curve is 1, since the two square numbers are the squares of two consecutive numbers. The horizontal distance changes and becomes larger as the adjacent square numbers become larger but we can calculate it as follows:<br /><br />$$(b+1)^2 - b^2 = b^2 + 2b + 1 - b^2 = 2b + 1$$<br /><br />So the horizontal distance is twice the square root of the smaller square number plus one. Therefore the gradient of the line is $\frac{1}{2b+1}$. 
Once we know by how much the line grows vertically for every horizontal unit, we can then determine how much higher than $b$ the point on the line will be at $a^2$ by multiplying the gradient by $a^2-b^2$, as shown below:<br /><br /><a href="https://3.bp.blogspot.com/-BZyfdJce04Y/W9DVnBb8rFI/AAAAAAAABQI/gCMKoTeR2UwDY02k5__TU929wxxBPue2gCLcBGAs/s1600/sqrtestimation_line.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-BZyfdJce04Y/W9DVnBb8rFI/AAAAAAAABQI/gCMKoTeR2UwDY02k5__TU929wxxBPue2gCLcBGAs/s640/sqrtestimation_line.png" width="640" height="360" data-original-width="1280" data-original-height="720" /></a><br /><br />Since the difference in height is less than 1, it is going to be the part of the square root that comes after the decimal point, with the whole number part being $b$.<br /><br />It might be hard to mentally divide by an odd number in $\frac{a^2-b^2}{2b+1}$ so we further approximate it as $\frac{a^2-b^2}{2b}$ instead. And that's why this method works.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-87712869505393152292018-09-29T17:07:00.004+02:002018-09-29T17:10:25.392+02:00Comparing numpy scalars directly is time consuming, use .tolist() before a comparisonThis is something that I found out about recently when going through the elements of a numpy array in order to do some checks on each number. Turns out you shouldn't just do this<br /><br /><pre class="brush:python">for x in nparr:<br />    if x == 0:<br />        something something<br /></pre><br />as that uses a lot more time than doing this<br /><br /><pre class="brush:python">for x in nparr.tolist():<br />    if x == 0:<br />        something something<br /></pre><br />This is because a for loop iterating over a numpy array does not result in a sequence of Python constants but in a sequence of numpy scalars, so each comparison is between a numpy scalar and a constant. 
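You can see the type difference directly; here's a quick sketch:

```python
import numpy as np

nparr = np.array([0.0, 1.0, 2.0], np.float32)

# Iterating over the array yields numpy scalars...
for x in nparr:
    print(type(x))   # <class 'numpy.float32'>
    break

# ...while iterating over .tolist() yields plain Python floats.
for x in nparr.tolist():
    print(type(x))   # <class 'float'>
    break
```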
Converting the array into a list first before the for loop will then result in a sequence of constants.<br /><br />Here is some profiling I've done using cProfile to check different ways to do an 'if' on a numpy array element:<br /><br /><pre class="brush:python">import cProfile<br />import numpy as np<br /><br />runs = 1000000<br /><br />print('Comparing numpy to numpy')<br />x = np.array(1.0, np.float32)<br />y = np.array(1.0, np.float32)<br />cProfile.run('''<br />for _ in range(runs):<br />    if x == y:<br />        pass<br />''')<br />print()<br /><br />print('Comparing numpy to constant')<br />x = np.array(1.0, np.float32)<br />cProfile.run('''<br />for _ in range(runs):<br />    if x == 1.0:<br />        pass<br />''')<br />print()<br /><br />print('Comparing constant to constant')<br />x = 1.0<br />cProfile.run('''<br />for _ in range(runs):<br />    if x == 1.0:<br />        pass<br />''')<br />print()<br /><br />print('Comparing numpy.tolist() to constant')<br />x = np.array(1.0, np.float32)<br />cProfile.run('''<br />for _ in range(runs):<br />    if x.tolist() == 1.0:<br />        pass<br />''')<br />print()<br /><br />print('Comparing numpy to numpy.array(constant)')<br />x = np.array(1.0, np.float32)<br />cProfile.run('''<br />for _ in range(runs):<br />    if x == np.array(1.0, np.float32):<br />        pass<br />''')<br />print()<br /><br />print('Comparing numpy.tolist() to numpy.tolist()')<br />x = np.array(1.0, np.float32)<br />y = np.array(1.0, np.float32)<br />cProfile.run('''<br />for _ in range(runs):<br />    if x.tolist() == y.tolist():<br />        pass<br />''')<br />print()<br /></pre><br />Here are the results in order of speed:<br /><br /><table border="1"><tr><th>Comparing constant to constant:</th><td>0.088 seconds</td></tr><tr><th>Comparing numpy.tolist() to constant:</th><td>0.288 seconds</td></tr><tr><th>Comparing numpy.tolist() to numpy.tolist():</th><td>0.508 seconds</td></tr><tr><th>Comparing numpy to numpy:</th><td>0.684 seconds</td></tr><tr><th>Comparing numpy to 
constant:</th><td>1.192 seconds</td></tr><tr><th>Comparing numpy to numpy.array(constant):</th><td>1.203 seconds</td></tr></table><br />It turns out that it is always faster to first convert your numpy scalars into constants via .tolist() than to do anything with them as numpy scalars.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-23594156293254387072018-08-30T23:41:00.000+02:002018-09-29T08:26:01.092+02:00The Maclaurin series and Taylor series (approximating complicated functions with simple polynomials)<div class="sectitle">The Maclaurin series</div><br />Imagine you had a function $f(x)$ that you knew was a polynomial, but whose details were unknown and you could only apply operations to the function without being able to read it. How could you find the coefficients of this polynomial? We know that for coefficients $a_i$:<br /><br />$f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + ...$<br /><br />If we find $f(0)$ then we can find $a_0$.<br /><br />$f(0) = a_0 + a_1 0 + a_2 0^2 + a_3 0^3 + a_4 0^4 + ... = a_0 + 0 + 0 + ... = a_0$<br /><br />That was easy, but how can we find $a_1$? We need an operation that gets rid of $a_0$ and also the $x$ in the term $a_1 x$. That operation turns out to be differentiation with respect to $x$:<br /><br />$f'(x) = a_1 + 2 a_2 x + 3 a_3 x^2 + 4 a_4 x^3 + ...$<br /><br />Great! Now we can find $a_1$ by replacing $x$ with 0:<br /><br />$f'(0) = a_1 + 2 a_2 0 + 3 a_3 0^2 + 4 a_4 0^3 + ... = a_1$<br /><br />We can find $a_2$ by repeating these two steps:<br /><br />$f''(x) = 2 a_2 + 2 \cdot 3 a_3 x + 3 \cdot 4 a_4 x^2 + ...$<br />$f''(0) = 2 a_2 + 2 \cdot 3 a_3 0 + 3 \cdot 4 a_4 0^2 + ... = 2 a_2$<br /><br />What we found is twice $a_2$, which means that we need to divide by 2 to get $a_2$. Each round of differentiation multiplies the coefficients by constants, and these constants get bigger and bigger the more times we apply differentiation. 
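This differentiate-then-substitute-zero procedure can be sketched on a polynomial stored as a list of coefficients (a toy check on a known polynomial):

```python
import math

def differentiate(coeffs):
    # d/dx of a_i * x^i is i * a_i * x^(i-1)
    return [i * c for i, c in enumerate(coeffs)][1:]

def derivative_at_zero(coeffs, i):
    # Differentiate i times, then substitute x = 0 (keep only the constant term).
    for _ in range(i):
        coeffs = differentiate(coeffs)
    return coeffs[0] if coeffs else 0

poly = [5, -2, 7, 3]  # 5 - 2x + 7x^2 + 3x^3
print([derivative_at_zero(poly, i) for i in range(4)])
# [5, -2, 14, 18]: each coefficient comes out multiplied by a growing constant
print([derivative_at_zero(poly, i) // math.factorial(i) for i in range(4)])
# [5, -2, 7, 3]: dividing by i! recovers the original coefficients
```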
You might notice that what's happening is that $a_0$ was multiplied by 1, $a_1$ was also multiplied by 1, $a_2$ has been multiplied by 2, $a_3$ will be multiplied by 6, $a_4$ by 24, and so on. These are factorials, sequences of whole numbers multiplied together ($3! = 1 \times 2 \times 3 = 6$). Which means that we need to divide by the next factorial after each round of differentiation and substitution by zero.<br /><br />In general we can find the $i$th coefficient in an unknown polynomial function by doing the following:<br /><br />$a_i = \frac{f^i(0)}{i!}$<br /><br />That's great. Now to test it. Let's see if a complex function is actually a polynomial in disguise. Take something simple such as $f(x) = e^x$. This doesn't look like a polynomial, but it may be represented by a polynomial with an infinite number of terms. Let's find what are the coefficients of the hidden polynomial in $e^x$.<br /><br />$f(x) = e^x = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + ...$<br />$\frac{f(0)}{0!} = \frac{e^0}{1} = a_0$<br />$\frac{f(0)}{0!} = 1$<br /><br />OK, so $a_0$ is 1. Let's find the rest of the coefficients.<br /><br />$a_1 = \frac{f'(0)}{1!} = \frac{e^0}{1} = 1$<br />$a_2 = \frac{f''(0)}{2!} = \frac{e^0}{2} = \frac{1}{2}$<br />$a_3 = \frac{f'''(0)}{3!} = \frac{e^0}{6} = \frac{1}{6}$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $e^x$ are:<br />$f(x) = 1 + x + \frac{1}{2} x^2 + \frac{1}{6} x^3 + ...$<br /><br />Does this partial polynomial look anything like $e^x$ when plotted as a graph?<br /><br /><a href="https://2.bp.blogspot.com/-YfDy5V5J0OY/W4hRuEFy1xI/AAAAAAAABNw/zGZlBU2-19IerOgCL9lb9U4ubDqk5qyvgCLcBGAs/s1600/taylor_exp.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-YfDy5V5J0OY/W4hRuEFy1xI/AAAAAAAABNw/zGZlBU2-19IerOgCL9lb9U4ubDqk5qyvgCLcBGAs/s320/taylor_exp.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />Pretty good within a boundary! 
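We can check the four terms we just derived numerically (a quick sketch):

```python
import math

def exp_maclaurin(x, num_terms):
    # Partial sum of 1 + x + x^2/2! + x^3/3! + ...
    return sum(x**i / math.factorial(i) for i in range(num_terms))

for x in [0.1, 1.0, 3.0]:
    print(x, exp_maclaurin(x, 4), math.exp(x))
```

The approximation is very close near zero and drifts further from $e^x$ as $x$ grows, which matches the plot.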
Note how the curve has a perfect fit at $x = 0$ and that it gets worse as we move away from there. Adding more terms to the polynomial will enlarge the area around $x = 0$ that is close to the curve but it will always be radiating out from there.<br /><br />Let's try for $f(x) = cos(x)$ now.<br /><br />$a_0 = \frac{f(0)}{0!} = \frac{cos(0)}{1} = 1$<br />$a_1 = \frac{f'(0)}{1!} = \frac{-sin(0)}{1} = 0$<br />$a_2 = \frac{f''(0)}{2!} = \frac{-cos(0)}{2} = -\frac{1}{2}$<br />$a_3 = \frac{f'''(0)}{3!} = \frac{sin(0)}{6} = 0$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $cos(x)$ is<br />$f(x) = 1 - \frac{1}{2} x^2 + ...$<br /><br /><a href="https://4.bp.blogspot.com/-PVy37OfBly4/W4hUwDJ7oDI/AAAAAAAABN8/EI1IcfdV4y8VW3kGMVjBCZ_GV8WlNun-wCLcBGAs/s1600/taylor_cos.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-PVy37OfBly4/W4hUwDJ7oDI/AAAAAAAABN8/EI1IcfdV4y8VW3kGMVjBCZ_GV8WlNun-wCLcBGAs/s320/taylor_cos.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br /><div class="sectitle">The Taylor series</div><br />As you can see, this "polynomialisation" of functions is a neat way to approximate a function we might not know how to implement exactly but know how to differentiate and how to find its value when $x$ is 0. But what if we don't know what $f(0)$ is such as in $ln(x)$ or $\frac{1}{x}$? A slight modification to our method allows us to use any value of $x$ and not just 0. Let's call this value $b$. 
By slightly modifying the polynomial we expect to be hiding inside the function, we can make the polynomial act the same way when $f(b)$ is used instead of $f(0)$:<br /><br />$f(x) = a_0 + a_1 (x - b) + a_2 (x - b)^2 + a_3 (x - b)^3 + a_4 (x - b)^4 + ...$<br />$f(b) = a_0 + a_1 (b - b) + a_2 (b - b)^2 + a_3 (b - b)^3 + a_4 (b - b)^4 + ...$<br />$f(b) = a_0$<br /><br />$f'(x) = a_1 + 2 a_2 (x - b) + 3 a_3 (x - b)^2 + 4 a_4 (x - b)^3 + ...$<br />$f'(b) = a_1$<br /><br />$a_i = \frac{f^i(b)}{i!}$<br /><br />The catch here is that we are now finding coefficients of the polynomial $a_0 + a_1 (x - b) + a_2 (x - b)^2 + ...$ and not of $a_0 + a_1 x + a_2 x^2 + ...$, but that's OK. Let's try this on $ln(x)$ with $b = 1$.<br /><br />$a_0 = \frac{f(1)}{0!} = \frac{ln(1)}{1} = 0$<br />$a_1 = \frac{f'(1)}{1!} = \frac{\frac{1}{1}}{1} = 1$<br />$a_2 = \frac{f''(1)}{2!} = \frac{-\frac{1}{1^2}}{2} = -\frac{1}{2}$<br />$a_3 = \frac{f'''(1)}{3!} = \frac{\frac{2}{1^3}}{6} = \frac{1}{3}$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $ln(x)$ are<br />$f(x) = (x - 1) - \frac{1}{2} (x - 1)^2 + \frac{1}{3} (x - 1)^3 + ...$<br /><br /><a href="https://3.bp.blogspot.com/-OMYquo9voH4/W4hglr_yutI/AAAAAAAABOI/LIdM_LyV8bczzNaIUomN1Ibxr8KcOX5vQCLcBGAs/s1600/taylor_ln.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-OMYquo9voH4/W4hglr_yutI/AAAAAAAABOI/LIdM_LyV8bczzNaIUomN1Ibxr8KcOX5vQCLcBGAs/s320/taylor_ln.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />Adding more terms will approximate the original function better and better but what if we didn't have to? Remember how I said in the previous section that the polynomial approximates the original function best close to $x = 0$? Well now we can approximate it best around any point $b$ and not just around 0. 
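As a numerical sanity check of this expansion around $b = 1$ (a sketch; it uses the fact that the $i$-th derivative of $\ln(x)$ is $(-1)^{i+1} (i-1)! / x^i$ for $i \geq 1$):

```python
import math

def ln_taylor_coeff(i, b=1.0):
    # a_i = f^(i)(b) / i!, where f(x) = ln(x)
    if i == 0:
        return math.log(b)
    derivative_at_b = (-1) ** (i + 1) * math.factorial(i - 1) / b ** i
    return derivative_at_b / math.factorial(i)

def ln_approx(x, num_terms, b=1.0):
    return sum(ln_taylor_coeff(i, b) * (x - b) ** i for i in range(num_terms))

print(ln_approx(1.2, 6), math.log(1.2))  # close agreement near x = 1
```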
This means that if our function has multiple known values, such as $cos(x)$ which is known to be 1 at $x = 0$, 0 at $x = \frac{\pi}{2}$, -1 at $x = \pi$, etc., then we can use several short polynomials centered at different points in the function instead of one large polynomial that approximates it well over a large interval. Let's try approximating $cos(x)$ around $x = \pi$, which means that we'll set $b$ to $\pi$.<br /><br />$a_0 = \frac{f(\pi)}{0!} = \frac{cos(\pi)}{1} = -1$<br />$a_1 = \frac{f'(\pi)}{1!} = \frac{-sin(\pi)}{1} = 0$<br />$a_2 = \frac{f''(\pi)}{2!} = \frac{-cos(\pi)}{2} = \frac{1}{2}$<br />$a_3 = \frac{f'''(\pi)}{3!} = \frac{sin(\pi)}{6} = 0$<br />$...$<br /><br />So the first few terms of the polynomial hidden within $cos(x)$ which best approximates the area around $x = \pi$ is<br />$f(x) = -1 + \frac{1}{2} (x - \pi)^2 + ...$<br /><br /><a href="https://2.bp.blogspot.com/-zzagcHK6GKw/W4hjmtagCeI/AAAAAAAABOU/fn9eG7ZYxboXuCtODr0jCJCcTdRw3LQCQCLcBGAs/s1600/taylor_cos_pi.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-zzagcHK6GKw/W4hjmtagCeI/AAAAAAAABOU/fn9eG7ZYxboXuCtODr0jCJCcTdRw3LQCQCLcBGAs/s320/taylor_cos_pi.png" width="320" height="169" data-original-width="1104" data-original-height="583" /></a><br /><br />This is useful when implementing mathematical functions on a computer. You keep several simple polynomials defined at different points in the domain of the function and then pick the closest one to the $x$ you need to evaluate. 
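For instance, using the two short polynomials derived above (one around $x = 0$ and one around $x = \pi$), the lookup scheme can be sketched as:

```python
import math

# One short Taylor polynomial per centre, as derived above.
polynomials = {
    0.0:     lambda x: 1.0 - 0.5 * x**2,
    math.pi: lambda x: -1.0 + 0.5 * (x - math.pi)**2,
}

def cos_approx(x):
    # Pick the centre closest to x and evaluate its short polynomial.
    centre = min(polynomials, key=lambda b: abs(x - b))
    return polynomials[centre](x)

for x in [0.2, 3.0]:
    print(x, cos_approx(x), math.cos(x))
```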
You can then compute an approximation that isn't too bad without requiring a lot of computational time.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-86050409192796957252018-07-29T23:25:00.001+02:002018-07-29T23:45:37.534+02:00Hyperparameter tuning using Scikit-OptimizeOne of my favourite academic discoveries this year was <a href="https://scikit-optimize.github.io/">Scikit-Optimize</a>, a nifty little Python automatic hyperparameter tuning library that comes with a lot of features I found missing in other similar libraries.<br /><br />So as explained in an <a href="https://geekyisawesome.blogspot.com/2017/10/hyperparameter-tuning-using-hyperopt.html">earlier blog post</a>, automatic hyperparameter tuning is about finding the right hyperparameters for a machine learning algorithm automatically. Usually this is done manually using human experience, but even simple Monte Carlo random search can result in better performance than human tweaking (<a href="http://www.jmlr.org/papers/v13/bergstra12a.html">see here</a>). So automatic methods were developed that explore the space of hyperparameters and their resulting performance after training the machine learning model and then try to home in on the best performing hyperparameters. Of course each time you want to evaluate a new hyperparameter combination you'd need to retrain and evaluate your machine learning model, which might take a very long time to finish, so it's important that a good hyperparameter combination is found with as few evaluations as possible. To do this we'll use <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a>, a process where a separate, simpler model is trained to predict the resulting performance over the whole hyperparameter space. We query this trained model for the hyperparameters that are predicted to give the best performance and actually evaluate them by training our machine learning model with them. 
The actual resulting performance is then used to update the hyperparameter space model so that it makes better predictions, and then we get a new promising hyperparameter combination from it. This is repeated a set number of times. Now the most common hyperparameter space model to use is a Gaussian Process, which maps continuous numerical hyperparameters to a single number, the predicted performance. This is not very good when your hyperparameters contain categorical data such as a choice of activation function. There is a <a href="https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9993">paper</a> that suggests that random forests are much better at mapping general hyperparameter combinations to predicted performance.<br /><br />Now that we got the theory out of the way, let's see how to use the library. We'll apply it to a gradient descent algorithm that needs to minimize the square function $f(x) = x^2$. For this we'll need 3 hyperparameters: the range of the initial value to be selected randomly, the learning rate, and the number of epochs to run. 
So we have two continuous values and one discrete integer value.<br /><br /><pre class="brush:python">import random<br /><br />def cost(x):<br />    return x**2<br /><br />def cost_grad(x):<br />    return 2*x<br /><br />def graddesc(max_init_x, learning_rate, num_epochs):<br />    x = random.uniform(-max_init_x, max_init_x)<br />    for _ in range(num_epochs):<br />        x = x - learning_rate*cost_grad(x)<br />    return x<br /></pre><br />Now we need to define the skopt optimizer:<br /><br /><pre class="brush:python">import skopt<br /><br />opt = skopt.Optimizer(<br />    [<br />        skopt.space.Real(0.0, 10.0, name='max_init_x'),<br />        skopt.space.Real(1e-10, 1.0, 'log-uniform', name='learning_rate'),<br />        skopt.space.Integer(1, 20, name='num_epochs'),<br />    ],<br />    n_initial_points=5,<br />    base_estimator='RF',<br />    acq_func='EI',<br />    acq_optimizer='auto',<br />    random_state=0,<br />)<br /></pre><br />The above code is specifying 3 hyperparameters:<br /><ul><li>the maximum initial value, which is a real number (continuous) and can be between 0 and 10</li><li>the learning rate, which is also a real number but on a logarithmic scale (so that you are equally likely to try very large values and very small values) and can be between 1e-10 and 1</li><li>the number of epochs, which is an integer (whole number) and can be between 1 and 20</li></ul>It is also saying that the hyperparameter space model should be initialized based on 5 random hyperparameter combinations (you train the hyperparameter space model on 5 randomly chosen hyperparameters in order to be able to get the first suggested hyperparameter), that this model should be a random forest (RF), that the acquisition function (the function to decide which hyperparameter combination is the most promising to try next according to the hyperparameter space model) is the expected improvement (EI) of the hyperparameter combination, that the acquisition optimizer (the optimizer to find the next promising hyperparameter 
combination) is automatically chosen, and that the random state is set to a fixed number (zero) so that it always gives the same random values each time you run it.<br /><br />Next we will use the optimizer to find good hyperparameter combinations.<br /><br /><pre class="brush:python">best_cost = 1e100<br />best_hyperparams = None<br />for trial in range(5 + 20):<br />    [max_init_x, learning_rate, num_epochs] = opt.ask()<br />    [max_init_x, learning_rate, num_epochs] = [max_init_x.tolist(), learning_rate.tolist(), num_epochs.tolist()]<br />    next_hyperparams = [max_init_x, learning_rate, num_epochs]<br />    next_cost = cost(graddesc(max_init_x, learning_rate, num_epochs))<br />    if next_cost &lt; best_cost:<br />        best_cost = next_cost<br />        best_hyperparams = next_hyperparams<br />    print(trial+1, next_cost, next_hyperparams)<br />    opt.tell(next_hyperparams, next_cost)<br />print(best_hyperparams)<br /></pre>The nice thing about this library is that you can use an 'ask/tell' system where you ask the library to give you the next hyperparameters to try and then you do something with them in order to get the actual performance value and finally you tell the library what this performance value is. This lets you do some nifty things such as ask for another value if the hyperparameters resulted in an invalid state in the machine learning model or even to save your progress and continue later.<br /><br />In the above code we're running a for loop for the number of times we want to evaluate different hyperparameters. We need to run it for the 5 random values we specified before to initialize the hyperparameter space model plus another 20 evaluations to actually optimize the hyperparameters. Now skopt does something funny which is that it returns not plain Python values for hyperparameters but rather each number is represented as a numpy scalar. Because of this we convert each numpy scalar back into a plain Python float or int using ".tolist()". 
We ask for the next hyperparameters to try, convert them to plain Python values, get their resulting cost after running gradient descent, store the best hyperparameters found up to now, and tell the library what the given hyperparameters' resulting performance was. At the end we print the best hyperparameter combination found.<br /><br />Some extra stuff:<br /><ul><li>You can ask for categorical hyperparameters using "skopt.space.Categorical(['option1', 'option2'], name='options')" which will return one of the values in the list when calling "ask".</li><li>You can ask for a different hyperparameter combination in case of an invalid one by changing "ask" to give you several hyperparameter suggestions rather than just one and then trying each one of them until one works by using "opt.ask(num_hyperpars)" (you can also incrementally ask for more values and always take the last one).</li><li>You can save and continue by saving all the hyperparameters evaluated and their corresponding performance value in a text file. You then later resupply the saved hyperparameters and their performance using "tell" for each one. This is much faster than actually evaluating them on the machine learning model so straight supplying known values will be ready quickly. Just be careful that you also call "ask" before each "tell" in order to get the exact same behaviour from the optimizer or else the next "tell" will give different values from what it would have given had you not loaded the previous ones manually.</li></ul>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-53404097895830233462018-06-28T23:19:00.002+02:002018-06-28T23:41:14.889+02:00Saving/Loading a Tensorflow model using HDF5 (h5py)The normal way to save the parameters of a neural network in Tensorflow is to create a tf.train.Saver() object and then call the object's "save" and "restore" methods. 
But this can be a bit cumbersome and you might prefer to have more control over what and how things get saved and loaded. A standard file format for saving large tensors (such as the parameters of a neural network) is <a href="http://docs.h5py.org/en/stable/">HDF5</a>.<br /><br />Saving is easy. Here is the code you'll need:<br /><pre class="brush:python">with h5py.File('model.hdf5', 'w') as f:<br />    for var in tf.trainable_variables():<br />        key = var.name.replace('/', ' ')<br />        value = session.run(var)<br />        f.create_dataset(key, data=value)<br /></pre><br />Notice that we need to use the session in order to get a parameter value.<br /><br />If you're using variable scopes then your variable names will have slashes in them and here we're replacing slashes with spaces. The reason is that the HDF5 format treats key values as directories where folder names are separated by slashes. This means that you need to traverse the keys recursively in order to arrive at the data (one folder name at a time) if you do not know the full name at the start. This replacement of slashes simplifies the code for loading a model later.<br /><br />Notice also that you can filter the variables to save as you like as well as save extra stuff. I like to save the Tensorflow version in the file in order to be able to check for incompatible variable names in contrib modules (RNNs had some variable names changed in version 1.2).<br /><br />Now comes the loading part. Loading is a tiny bit more involved because it requires that you make your neural network code include stuff for assigning values to the variables. 
All you need to do whilst constructing your Tensorflow graph is to include the following code:<br /><pre class="brush:python">param_setters = dict()<br />for var in tf.trainable_variables():<br />    placeholder = tf.placeholder(var.dtype, var.shape, var.name.split(':')[0]+'_setter')<br />    param_setters[var.name] = (tf.assign(var, placeholder), placeholder)<br /></pre><br />This code creates separate placeholder and assign nodes for each variable in your graph. In order to modify a variable you need to run the corresponding assign node in a session and pass the new value through the corresponding placeholder. All the corresponding assign nodes and placeholders are kept in a dictionary called param_setters. We're also naming the placeholder the same as the variable but with '_setter' at the end.<br /><br />Notice that param_setters is a dictionary mapping variable names to a tuple consisting of the assign node and the placeholder.<br /><br />Now we can load the HDF5 file as follows:<br /><pre class="brush:python">with h5py.File('model.hdf5', 'r') as f:<br />    for (name, val) in f.items():<br />        name = name.replace(' ', '/')<br />        val = np.array(val)<br />        session.run(param_setters[name][0], { param_setters[name][1]: val })<br /></pre><br />What's happening here is that we're loading each parameter from the file and replacing the spaces in names back into slashes. 
We then run the corresponding assign node for the given parameter name in param_setters and set it to the loaded value.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-36560109695240696702018-05-23T22:29:00.002+02:002018-05-23T22:34:35.379+02:00Fancy indexing in Tensorflow: Getting a different element from every row in a matrixLet's say you have the following 4 by 3 matrix:<br /><pre class="brush:python">M = np.array([[  1,  2,  3 ],<br />              [  4,  5,  6 ],<br />              [  7,  8,  9 ],<br />              [ 10, 11, 12 ]])<br /></pre>When working in Tensorflow we'd also have this line:<br /><pre class="brush:python">M = tf.constant(M)<br /></pre><br />Let's say that you want to get the first element of every row. In both Numpy and Tensorflow you can just do this:<br /><pre class="brush:python">x = M[:, 0]<br /></pre>which means 'get every row and from every row get the element at index 0'. "x" is now equal to:<br /><pre class="brush:python">np.array([1, 4, 7, 10])<br /></pre><br />Now let's say that instead of the first element of every row, you wanted the third element of the first row, the second element of the second row, the first element of the third row, and the second element of the fourth row. In Numpy, this is how you do that:<br /><pre class="brush:python">idx = [2,1,0,1]<br />x = M[[0,1,2,3], idx]<br /></pre>or equivalently:<br /><pre class="brush:python">x = M[np.arange(M.shape[0]), idx]<br /></pre>This is just breaking up the 'coordinates' of the desired elements into separate lists for each axis. "x" is now equal to:<br /><pre class="brush:python">np.array([3, 5, 7, 11])<br /></pre><br />Unfortunately this sort of fancy indexing isn't available in Tensorflow. Instead we have the function "tf.gather_nd" which lets us provide a list of 'coordinates'. 
Unlike Numpy, tf.gather_nd does not take separate lists per axis but expects a single list with one coordinate per item like this:<br /><pre class="brush:python">idx = tf.constant([[0,2],[1,1],[2,0],[3,1]], tf.int32)<br />x = tf.gather_nd(M, idx)<br /></pre><br />This is a bit inconvenient as we usually have a single vector of indexes rather than a list of coordinates. It would be better to be able to just put a "range" like we did with Numpy. We can use the range and then join it to the vector of indexes sideways using "tf.stack", like this:<br /><pre class="brush:python">idx = tf.constant([2,1,0,1], tf.int32)<br />x = tf.gather_nd(M, tf.stack([tf.range(M.shape[0]), idx], axis=1))<br /></pre><br />Kind of bulky but at least it's possible. I miss Theano's Numpy-like interface.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-7222975979872868722018-04-30T22:26:00.000+02:002018-04-30T22:49:16.196+02:00Why Bahdanau's neural attention requires two layersThe neural attention model was first described for the task of machine translation in Bahdanau's <a href="https://arxiv.org/abs/1409.0473">Neural machine translation by jointly learning to align and translate</a>. The source sentence is translated into the target sentence by attending to different words in the source sentence as the target sentence is generated one word at a time.
The attended source sentence is defined as follows:<br /><br />$$<br />\begin{align}<br />c_i &= \sum_j \alpha_{ij} s_j \\<br />\alpha_{ij} &= \frac{e^{z_{ij}}}{\sum_k e^{z_{ik}}} \\<br />z_{ij} &= W \tanh(V (s_j ++ p_{i-1}))<br />\end{align}<br />$$<br />where $c_i$ is the attended source sentence taken from the weighted sum of the source sentence vectors $s_j$ and the weights $\alpha_{ij}$, $p_{i-1}$ is the prefix vector produced by the 'decoder' RNN that remembers what has been generated thus far, $++$ means concatenation, and $W$ and $V$ are two weight matrices.<br /><br />So the question we're asking is, why do we need to use two layers to produce $z_{ij}$? Can we make do with just one layer?<br /><br />In reality what happens when you use a single layer is that the attention weights will remain the same across time steps such that, although the attention will be different on different words in the source sentence, as the target sentence gets generated these words will keep receiving the same attention they did throughout the whole generation process. The reason for this is that softmax, which is the function that produces $\alpha_{ij}$, is shift invariant, that is, it does not change if you add the same number to each of its inputs.<br /><br />Let's say $z$ is defined as [ 1, 2, 3 ].
Then $\alpha$ will be<br />$$<br />\begin{matrix}<br />(\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3})<br />\end{matrix}<br />$$<br />but if we add the constant k to each of the three numbers then the result will still be the same<br />$$<br />\begin{matrix}<br />& (\frac{e^{1+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{2+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) & (\frac{e^{3+k}}{e^{1+k} + e^{2+k} + e^{3+k}}) \\<br />=& (\frac{e^1e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^2e^k}{e^1e^k + e^2e^k + e^3e^k}) & (\frac{e^3e^k}{e^1e^k + e^2e^k + e^3e^k}) \\<br />=& (\frac{e^1e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^2e^k}{e^k(e^1 + e^2 + e^3)}) & (\frac{e^3e^k}{e^k(e^1 + e^2 + e^3)}) \\<br />=& (\frac{e^1}{e^1 + e^2 + e^3}) & (\frac{e^2}{e^1 + e^2 + e^3}) & (\frac{e^3}{e^1 + e^2 + e^3})<br />\end{matrix}<br />$$<br /><br />This proves that adding the same constant to every number in $z$ will leave the softmax unaltered. Softmax is shift invariant.<br /><br />Now let's say that $z$ is determined by a single neural layer such that $z_{ij} = W (s_j ++ p_{i-1}) = W_0 s_j + W_1 p_{i-1}$. We can draw a matrix of all possible $z$ where the columns are the source vectors $s$ and the rows are the decoder prefix states $p$.<br />$$<br />\begin{matrix}<br />(W_0 s_0 + W_1 p_0) & (W_0 s_1 + W_1 p_0) & (W_0 s_2 + W_1 p_0) & \dots \\<br />(W_0 s_0 + W_1 p_1) & (W_0 s_1 + W_1 p_1) & (W_0 s_2 + W_1 p_1) & \dots \\<br />(W_0 s_0 + W_1 p_2) & (W_0 s_1 + W_1 p_2) & (W_0 s_2 + W_1 p_2) & \dots \\<br />\dots & \dots & \dots & \dots<br />\end{matrix}<br />$$<br /><br />Given a single row, the $p$s are always the same, which means that the only source of variation between $z$s of the same prefix $p$ is from the source vectors $s$. This makes sense.<br /><br />Now take the first two rows. 
What is the result of subtracting the second from the first?<br /><br />$$<br />\begin{matrix}<br />(W_0 s_0 + W_1 p_0 - W_0 s_0 - W_1 p_1) & (W_0 s_1 + W_1 p_0 - W_0 s_1 - W_1 p_1) & (W_0 s_2 + W_1 p_0 - W_0 s_2 - W_1 p_1) & \dots \\<br />(W_1(p_0-p_1)) & (W_1(p_0-p_1)) & (W_1(p_0-p_1)) & \dots<br />\end{matrix}<br />$$<br /><br />Notice how all the columns have the same difference, which means that the second row can be rewritten in terms of the first row as:<br /><br />$$<br />\begin{matrix}<br />(W_0 s_0 + W_1 p_0 - W_1(p_0-p_1)) & (W_0 s_1 + W_1 p_0 - W_1(p_0-p_1)) & (W_0 s_2 + W_1 p_0 - W_1(p_0-p_1)) & \dots<br />\end{matrix}<br />$$<br /><br />We know that adding the same constant to every $z$ will leave the softmax unaltered, which means that every time step in the decoder RNN will lead to the same attention vector. The individual attention values will be different, but they will not change throughout the whole generation process. Using two layers with a non-linear activation function in the middle will disrupt this as the difference between two consecutive $z$ will now be different at each time step.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-65373482219869214102018-03-29T20:47:00.000+02:002018-03-29T20:54:11.218+02:00Word Mover's Distance (a text semantic similarity measure)<a href="http://proceedings.mlr.press/v37/kusnerb15.pdf">Word Mover's Distance</a> is a way to measure the similarity between the meaning of two sentences. It's basically a bag of words method (ignores the order of the words in sentences) but it's been found to work really well, including for <a href="http://aclweb.org/anthology/E17-1019">evaluating automatic caption generation</a>.
You can find code to perform k-nearest neighbours classification of sentiment written by the creator of WMD <a href="https://github.com/mkusner/wmd">here</a>, but reading the code is a challenge so here's how WMD works.<br /><br />This measure makes use of an existing distance measure called <a href="http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/RUBNER/emd.htm">Earth Mover's Distance</a>. Given a set of heaps of dirt and a set of holes, and both the heaps and the holes have a point as a location (a vector) and a certain mass or capacity (a weight), what is the least amount of work needed to move all the dirt into the holes or to fill all the holes completely (depending on whether there is more total mass or total capacity)? By work we mean the total distance travelled multiplied by the mass carried. This is what EMD calculates. It measures the distance between a set of vectors paired with weights and another set of vectors paired with weights, without requiring that both sets have the same number of vectors. It is a linear programming problem and there is a Python library for it called "pyemd" that can be installed via pip.<br /><br />Now back to Word Mover's Distance. WMD uses EMD between two sentences. It assumes some kind of pre-computed word embedding vectors such as <a href="https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing">word2vec</a>, which are vectors that encode the meaning of words. The words in the sentences are the heaps and holes and these word vectors are their locations. For weights, we use the normalized frequency of the words in the sentence, that is, the number of times the word was found in the sentence divided by the total number of words in the sentence. We also ignore any words that are stop words (non-informative common words like "the" and "of") and any words that are not found in the list of word vectors. 
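The recipe above (word vectors as locations, normalized word frequencies as weights, Euclidean distance as cost) can be sketched end to end in a few lines. The following is a minimal illustration that solves the transportation problem with scipy's generic linear programming solver instead of pyemd, and uses made-up 2-dimensional word vectors for readability (real word2vec vectors have hundreds of dimensions):

```python
import numpy as np
from scipy.optimize import linprog

def wmd(sent1, sent2, vectors):
    """Word Mover's Distance between two tokenized sentences (a sketch)."""
    # Heaps and holes: the distinct words of each sentence.
    words1, words2 = sorted(set(sent1)), sorted(set(sent2))
    # Masses and capacities: normalized word frequencies.
    w1 = np.array([sent1.count(w) / len(sent1) for w in words1])
    w2 = np.array([sent2.count(w) / len(sent2) for w in words2])
    # Cost of moving one unit of mass: Euclidean distance between embeddings.
    D = np.array([[np.linalg.norm(vectors[a] - vectors[b]) for b in words2]
                  for a in words1])
    n, m = D.shape
    # Linear program: minimise sum_ij flow_ij * D_ij subject to
    # each heap i sending out exactly w1[i] and each hole j receiving w2[j].
    A_eq, b_eq = [], []
    for i in range(n):
        row = np.zeros(n * m); row[i*m:(i+1)*m] = 1
        A_eq.append(row); b_eq.append(w1[i])
    for j in range(m):
        col = np.zeros(n * m); col[j::m] = 1
        A_eq.append(col); b_eq.append(w2[j])
    # Flows are non-negative by linprog's default bounds.
    return linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq)).fun

# Toy word vectors, made up purely for illustration.
vecs = {'king': np.array([0.0, 0.0]), 'queen': np.array([0.0, 1.0]),
        'banana': np.array([5.0, 5.0])}
print(wmd(['king'], ['queen'], vecs))  # ~1.0: all mass moves a distance of 1
```

In practice you would use pyemd (or gensim, mentioned below) rather than building the linear program by hand, but the inputs it needs are exactly the ones constructed here.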
Once we have collected the set of vectors and weights of the two sentences we then just put them in the EMD solver to get the distance.<br /><br />The Python library "gensim" implements WMD given a word2vec dataset and two lists of words (the sentences) without stop words. You can find a tutorial on how to use it <a href="https://markroxor.github.io/gensim/static/notebooks/WMD_tutorial.html">here</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-48848289051116548772018-02-26T19:47:00.000+01:002018-02-26T19:55:49.144+01:00Finding the Greatest Common Divisor (GCD) / Highest Common Factor (HCF) of two numbers<p>The greatest common divisor or highest common factor of two numbers can be found using a very simple algorithm described since the time of Euclid, known as the Euclidean algorithm. The nice thing about it is that it can be computed visually by drawing squares in a rectangle as follows.</p><br /><p>Let's say that you have a line segment of length 6 units and a smaller line segment of length 2 units.</p><canvas id="c1" width="620" height="30"></canvas><br /><script>var ctx = document.getElementById('c1').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.moveTo(10, 10); ctx.lineTo(210, 10); ctx.stroke(); ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.moveTo(10, 20); ctx.lineTo(610, 20); ctx.stroke(); </script><br /><br /><p>We can check if the smaller line's length is a divisor of the larger line's length if we can place several of the smaller lines next to each other and reach the exact length of the larger line.</p><canvas id="c2" width="620" height="30"></canvas><br /><script>var ctx = document.getElementById('c2').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.moveTo(10, 20); ctx.lineTo(610, 20); ctx.stroke(); ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.moveTo(10, 10); ctx.lineTo(610, 10); ctx.stroke(); </script><br /><br /><p>Likewise we can check if a 
number is a common divisor of two other numbers by checking if a square can fill exactly a rectangle.</p><canvas id="c3" width="620" height="420"></canvas><br /><script>var ctx = document.getElementById('c3').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 600, 400); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; for (var y = 10; y < 400; y+=200) { for (var x = 10; x < 600; x+=200) { ctx.rect(x, y, 200, 200); } } ctx.stroke(); </script> <p>In the above diagram, the big blue rectangle has sides of 6 units by 4 units and the small red squares have sides 2 units by 2 units. Since the red square can exactly fill the blue rectangle by placing copies of it next to each other, 2 is a common divisor of 4 and 6.</p><p>The greatest common divisor is the largest square that can exactly fill a rectangle. The largest square that can do this has its side equal to the smallest side of the rectangle (otherwise not even one square will fit in the rectangle) whilst the smallest square has its side equal to 1 unit (assuming that sides are always whole number lengths). Let's take the above rectangle as an example and use the Euclidean algorithm to find the largest square that fits a 6 by 4 rectangle.</p><p>The Euclidean algorithm takes advantage of the fact that the greatest common divisor of two numbers is also a divisor of the difference of the two numbers. If two numbers "a" and "b" have "c" as a common divisor, then they can be rewritten as "c×p" and "c×q". The difference "a-b" is then equal to "c×p - c×q" which is equal to "c×(p - q)" which therefore has "c" as a divisor since one of its factors is "c".
Notice that "c" can be any divisor of "a" and "b", which means that "a-b" should also have the greatest common divisor of "a" and "b" as a divisor.</p><p>This means that the largest square to exactly fill the original 6 by 4 rectangle will also be the largest square to exactly fill a 2 by 4 rectangle, since 2 is the difference between 4 and 6.</p><canvas id="c4" width="620" height="420"></canvas><script>var ctx = document.getElementById('c4').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 200, 400); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; for (var y = 10; y < 400+10; y+=200) { for (var x = 10; x < 200+10; x+=200) { ctx.rect(x, y, 200, 200); } } ctx.stroke(); </script> <p>This piece of information allows us to substitute the difficult problem of finding the largest square that exactly fills a rectangle with an easier problem of finding the same square but on a smaller rectangle. The smaller the rectangle, the easier it is to find the largest square to exactly fill it and the Euclidean algorithm allows us to progressively chop off parts of the rectangle until all that remains is a small square which would be the square that exactly fills the original rectangle.</p><p>So the algorithm goes as follows:</p><ol><li>If the rectangle has both sides equal then it is itself the largest square that exactly fills the rectangle and we are done.</li><li>Otherwise put a square inside the rectangle of length equal to the shortest side of the rectangle.</li><li>Chop off the part of the rectangle that is covered by the square and repeat.</li></ol><p>The 6 by 4 rectangle is trivial so let's try this algorithm on a slightly harder problem of 217 by 124.
This time we'll be using exact pixels to represent the rectangle lengths.</p><p>This is a 217 by 124 rectangle.</p><canvas id="c5" width="237" height="144"></canvas><script>var ctx = document.getElementById('c5').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 217, 124); ctx.stroke(); </script><br /><br /><p>This is the overlap it has with a square of length equal to its shortest side.</p><canvas id="c6" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c6').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 217, 124); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10, 10, 124, 124); ctx.stroke(); </script><br /><br /><p>We now chop off that part of the rectangle and find the largest square which fits the remaining 93 by 124 rectangle.</p><canvas id="c7" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c7').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10, 93, 124); ctx.stroke(); </script><br /><br /><p>This is the new square's overlap.</p><canvas id="c8" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c8').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10, 93, 124); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10+124, 10, 93, 92); ctx.stroke(); </script><br /><br /><p>This is the remaining 93 by 31 rectangle.</p><canvas id="c9" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c9').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10+93, 93, 31); ctx.stroke(); </script><br /><br /><p>This is the new square's overlap.</p><canvas id="c10" width="237" height="144"></canvas><br /><script>var ctx = 
document.getElementById('c10').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+124, 10+93, 93, 31); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10+124, 10+93, 31, 31); ctx.stroke(); </script><br /><br /><p>This is the remaining 62 by 31 rectangle.</p><canvas id="c11" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c11').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+155, 10+93, 62, 31); ctx.stroke(); </script><br /><br /><p>This is the new square's overlap.</p><canvas id="c12" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c12').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+155, 10+93, 62, 31); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; ctx.rect(10+155, 10+93, 31, 31); ctx.stroke(); </script><br /><br /><p>This is the remaining 31 by 31 rectangle.</p><canvas id="c13" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c13').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10+186, 10+93, 31, 31); ctx.stroke(); </script><br /><br /><p>We have finally reached a square of side 31, which means that 31 is the greatest common divisor of 217 and 124.</p><canvas id="c14" width="237" height="144"></canvas><br /><script>var ctx = document.getElementById('c14').getContext('2d'); ctx.lineWidth = 5; ctx.beginPath(); ctx.strokeStyle = 'blue'; ctx.rect(10, 10, 217, 124); ctx.stroke(); ctx.lineWidth = 2; ctx.beginPath(); ctx.strokeStyle = 'red'; for (var y = 10; y < 124+10; y+=31) { for (var x = 10; x < 217+10; x+=31) { ctx.rect(x, y, 31, 31); } } ctx.stroke(); </script> <p>Of course we don't really need to visualise anything. 
We can do this all numerically by just repeatedly subtracting the smallest number from the biggest and replacing the biggest with the difference. Here's a Python function that does that:</p><pre class="brush:python"><br />def gcd(a, b):<br />    while a != b:<br />        if a < b:<br />            b = b - a<br />        else:<br />            a = a - b<br />    return a<br /></pre><p>We can make this shorter by using tuple unpacking:</p><pre class="brush:python"><br />def gcd(a, b):<br />    while a != b:<br />        (a, b) = (min(a,b), max(a,b)-min(a,b))<br />    return a<br /></pre><p>If one of the numbers is much bigger than the other, this program will waste a lot of time repeatedly subtracting the same number until the bigger number becomes smaller than the other. Fortunately this process can be done in a single operation by dividing and getting the remainder. The remainder is what's left after you can't subtract the second number from the first any more. This speeds the algorithm up considerably. The operation is called the modulo operation. With modulo the loop will no longer stop when both numbers are equal, since modulo effectively performs that final subtraction as well and leaves zero. Instead we will now keep on looping until the remainder is zero, in which case the last number you used to get the remainder was the greatest common divisor:</p><pre class="brush:python"><br />def gcd(a, b):<br />    while b > 0:<br />        (a, b) = (min(a,b), max(a,b)%min(a,b))<br />    return a<br /></pre>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-54863868494296382542018-01-30T11:32:00.001+01:002018-01-30T11:47:52.208+01:00A quick guide to running deep learning applications on Amazon AWS virtual server supercomputerDeep learning is serious business and if you want to work on sizeable problems you're going to need more hardware than you probably have at home.
Here is how to use Amazon Web Services in order to be able to upload your code and datasets to an online server which will run them for you and then let you download the results. It's basically a virtual computer that you use over the Internet by sending commands.<br /><br />When I got started I was following this video on how to use AWS:<br /><a href="https://youtu.be/vTAMSm06baA?t=1476">https://youtu.be/vTAMSm06baA?t=1476</a><br /><br /><div class="sectitle">AWS Educate</div><br />If you're a student then you can take advantage of AWS Educate where, if you are accepted, you will receive 100 free hours of runtime to get started. 100 hours in deep learning will not last long but it will allow you to experiment and get your bearings with AWS without worrying about the bill too much. It will also be a few days before you get a reply. Here's the link:<br /><a href="https://aws.amazon.com/education/awseducate/">https://aws.amazon.com/education/awseducate/</a><br /><br /><div class="sectitle">Get the applications you need</div><br />If you're on Windows, then before creating your server you should install <a href="https://winscp.net/eng/index.php">WinSCP</a> and <a href="https://www.putty.org/">PuTTY</a> in order to be able to upload your data and run your programs. If you're not on Windows then you can use the terminal with ssh to do this.<br /><br /><div class="sectitle">Create an account</div><br />Start by visiting this link and making an account:<br /><a href="https://aws.amazon.com/">https://aws.amazon.com/</a><br /><br />You need to provide your credit card details before starting.
You will be charged automatically every month so make sure to keep an eye on your usage, as you pay per hour for the time you let the server run.<br /><br /><div class="sectitle">Enter the AWS console</div><br />Next go into the AWS console which is where you get to manage everything that has to do with virtual servers:<br /><a href="https://console.aws.amazon.com/">https://console.aws.amazon.com/</a><br /><br />Note that there is the name of a city at the top such as "Ohio". This indicates the region where your servers will be located. Amazon has servers all around the world and you might prefer one region over another, mostly to reduce latency. Different regions also have different prices, so that might take priority in your decision. Ohio and W. Virginia seem to be the cheapest. See here for more information:<br /><a href="https://www.concurrencylabs.com/blog/choose-your-aws-region-wisely/">https://www.concurrencylabs.com/blog/choose-your-aws-region-wisely/</a><br /><br />Keep in mind that each region has its own interface so that if you reserve a server in one region, you can only configure your server when that region is selected. There is no single list of all your servers across regions, so make sure you remember which regions you use.<br /><br />After having chosen a region, next go to Services and click on EC2:<br /><a href="https://console.aws.amazon.com/ec2/">https://console.aws.amazon.com/ec2/</a><br /><br /><div class="sectitle">Create a virtual server</div><br />Click on the big blue button called "Launch Instance" in order to create your virtual server. You can now see a list of AMIs which are preconfigured virtual servers that are specialized for some kind of task, such as deep learning. You're going to copy an instance of one of these and then upload your files to it. Look for the AMI called "Deep Learning AMI (Ubuntu)" which contains a bunch of deep learning libraries together with CUDA drivers.
Click on "Select".<br /><br />This is where you choose the computing power you want. The better it is, the more expensive it is. If you just want to see how it works then choose a free one which says "Free tier eligible". If you want to get down to business then choose one of the "GPU Compute" instances such as "p2.xlarge" which has <a href="https://aws.amazon.com/ec2/instance-types/">12GB of GPU memory</a>. The "p2.xlarge" costs about 90 cents per hour (which starts from the moment you create the instance, not when you start running your programs, so it also includes the time it takes to upload your data).<br /><br />If this is your first time creating a powerful instance then you will first need to <a href="https://console.aws.amazon.com/support">ask Amazon to let you use it</a> (this is something they do to avoid people hogging all the resources). Under "Regarding" choose "Service Limit Increase", under "Limit Type" choose "EC2 Instances", and under "Region" choose the region you selected previously. You also have to say something about how you'll use it. After being granted access you can then continue from where we left off.<br /><br />After ticking the check box next to your selected computing power, click on "Next: Configure Instance Details".<br /><br />Leave this next step with default settings. Click on "Next: Add Storage".<br /><br />This is where you choose your storage space. You will need at least 50GB of space in order to store the deep learning AMI but you will need additional space for your datasets and results. Keep in mind that you have to pay per GB per month for storage. If you need frequent access to the hard drive (such as loading minibatches from disk) then you'll need to use an SSD drive which costs about <a href="https://aws.amazon.com/ebs/pricing/">10 cents per GB per month</a>.
Otherwise you can use a magnetic drive which costs about <a href="https://aws.amazon.com/ebs/previous-generation/">5 cents per GB per month</a>.<br /><br />Next click on "Next: Add Tags". This is where you make up some informative tags for your virtual server in order to differentiate it from other virtual servers. You can do something like "Name=Deep learning". If you only have one server then this is not important. Click on "Next: Configure Security Group".<br /><br />This is where you create some firewall rules to make sure that only your computer has access to the virtual server, even if someone else knows the password. It might be the case that this doesn't work for you and you won't be able to connect at all, even from your own IP address, in which case choose "Anywhere" as a source, which will not make any restrictions based on IP. Click "Review and Launch".<br /><br />As soon as you click on "Launch" at the bottom the clock starts ticking and you will start being billed. If you've got the AWS Educate package then you will have 100 free hours but they will start being used up. You can stop an instance any time you want but you will be billed for one hour as soon as it starts, even if you stop it immediately.<br /><br />If this is your first time using a server in the selected region (the place you want your server to be in) then you will be asked to create a cryptographic private key which is a file that will be used instead of a password to connect to the virtual server. If you're on Windows then you need to use a program that comes with PuTTY called PuTTYgen which converts the .pem file that Amazon sends you to a .ppk file that PuTTY can use.
Follow the section called "To convert your private key" in the following link to do this:<br /><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console</a><br /><br /><div class="sectitle">Connecting to your virtual server</div><br />Now that we have our server ready we need to connect to it, upload our stuff, and get it to run. We're also on the clock so we need to hurry. Start by visiting your list of server instances, which can be found in the side bar when you're in the EC2 dashboard:<br /><a href="https://console.aws.amazon.com/ec2/v2/home#Instances:sort=instanceId">https://console.aws.amazon.com/ec2/v2/home#Instances:sort=instanceId</a><br /><br />Your server should be running. You can stop it by right clicking it and under "Instance state" choosing "Stop". This will stop it from charging you every hour, but you will be charged for at least one hour as soon as you start it again.<br /><br />Clicking on a server and then clicking on "Connect" will give you information to access the server using PuTTY or ssh.<br /><br />If you're using Windows, you connect to your server using PuTTY, as explained in this guide:<br /><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html?icmpid=docs_ec2_console</a><br />After connecting using PuTTY you can transfer the configuration to WinSCP by opening WinSCP, going to Tools, and choosing "Import sites".
Now you can connect using WinSCP in order to upload and download files as well.<br /><br />If you're not using Windows then you should use the terminal and follow this guide instead:<br /><a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html">https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html</a><br />You can upload stuff from Linux by using FileZilla or by using the normal directory explorer and clicking CTRL+L to enter an FTP location.<br /><br />Upload all the files you need including datasets and scripts. You can zip the files before uploading them and then unzip them on the server using the "<a href="https://www.lifewire.com/examples-linux-unzip-command-2201157">unzip</a>" Linux command.<br /><br />DO NOT INSTALL LIBRARIES WITH PIP YET! See the next section first.<br /><br /><div class="sectitle">Using your virtual server</div><br />Finally we are close to running our scripts. But we still need to do two more things first.<br /><br />First of all, look at the text at the top of the PuTTY or terminal session, which tells you how to activate different Anaconda environments. These are Python environments with different libraries available which are connected to CUDA so that you will be able to run on the GPU. In PuTTY or the terminal enter "source activate &lt;environment name&gt;". Remember this line to use later.<br /><br />Now you can install anything you want using pip.<br /><br />Secondly, as soon as you close PuTTY or the terminal, all running processes will stop (but the instance will still be running so you'll still be paying). What you need to do is to use the application called "screen" which will keep everything running on its own.
See this link:<br /><a href="https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/">https://www.rackaid.com/blog/linux-screen-tutorial-and-how-to/</a><br /><br />Now you'll need to activate the environment again because screen creates a new session which is disconnected from the previous one.<br /><br /><div class="sectitle">Hooray!</div><br />You're done! You can now start using your server. When you're ready, make sure to stop it. You can even terminate it, but that will completely delete the server with all its data, so only do that when you're really done, otherwise you'll be wasting money reuploading and rerunning everything.<br /><br />You can see your current bill by clicking on Services and going to Bill. This will show you your current month's bill as well as previous ones. You can even get a daily breakdown and forecasts.<br /><a href="https://console.aws.amazon.com/billing/home">https://console.aws.amazon.com/billing/home</a>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-38909971087674260032017-12-29T12:03:00.002+01:002018-04-04T15:33:30.503+02:00An introduction to the Long Short Term Memory (LSTM) RNN and Gated Recurrent Unit (GRU)In my <a href="https://geekyisawesome.blogspot.com.mt/2017/11/an-introduction-to-recurrent-neural.html">previous post</a> I talked about the recurrent neural network and said that RNNs are used to encode a sequence of vectors into a single vector. The way it works is that the RNN has a memory, which we call a "state", and this memory gets updated each time a new input is introduced to the RNN. Even if the same input is shown twice, the memory can still be changed significantly because the new state is determined by both the new input and the previous state. After all the inputs have been shown to the RNN, the final state will be a vector that represents everything that the RNN has seen.
Of course the RNN cannot remember everything in perfect detail since the state vector has a fixed size whilst the sequence of inputs can be arbitrarily long. This means that the final state will actually be a summary of the whole sequence which only "remembers" important or salient features in the sequence. This process has been likened to the short term memory in animals.<br /><br />The simple RNN described in the previous post, when using sigmoid or hyperbolic tangent as an activation function, tends to suffer from "vanishing gradients" when trained to work with long sequences. What does it mean to have vanishing gradients?<br /><br />In order to train your RNN you need to find the gradient of the error with respect to each weight in order to know whether to increase or decrease the weight. Now let's assume that we're using $tanh$ as an activation function for computing the next state. The function that turns an input and a state into a new state is something like the following:<br /><br />$s^1 = tanh(i^1 w_i + s^0 w_s)$<br /><br />We're leaving out the bias term in order to keep things short. If you want to find the gradient of $s^1$ with respect to $w_i$ (for a sequence of length one), that gradient will be<br /><br />$\frac{ds^1}{dw_i} = (1 - tanh^2(i^1 w_i + s^0 w_s))(i^1)$<br /><br />Notice how the gradient involves a multiplication by the factor $1 - tanh^2(\dots)$. $tanh$ gives a number between -1 and 1, so $tanh^2(\dots)$ is a fraction between 0 and 1, which makes $1 - tanh^2(\dots)$ a fraction between 0 and 1 as well. So the gradient might be a pretty small number.<br /><br />What happens when we have a sequence of two inputs?<br /><br />$s^2 = tanh(i^2 w_i + tanh(i^1 w_i + s^0 w_s) w_s)$<br /><br />The gradient will be:<br /><br />$\frac{ds^2}{dw_i} = (1 - tanh^2(\dots))(i^2 + w_s (1 - tanh^2(\dots))(i^1))$<br /><br />Notice how now we've multiplied one such fractional factor by another before getting to our first input $i^1$.
Multiplying a fraction by another fraction results in a fraction that's smaller than both fractions. This means that $i^1$ will contribute significantly less to the gradient than $i^2$. If we have a sequence of 3 inputs then the first input would require 3 fractional multiplications which makes it even smaller. This will eventually make the gradient negligible (it vanishes) and hence result in the RNN learning to reduce the error with respect to recent inputs only, completely forgetting inputs towards the beginning of the sequence.<br /><br />One solution for this problem is to use activation functions which do not result in fractional multiplications for each time step, such as the rectified linear unit function (ReLU). $ReLU$ is a function that returns numbers unchanged if they're positive whilst replacing negative numbers with zero:<br /><br />$ReLU(x) = max(x, 0)$<br /><br />The gradient of this function is either 1 (if $x$ is positive) or 0 (if $x$ is negative). In mathematical terms this is equivalent to the indicator function $1_{x>0}$ which gives 1 when the subscript condition is met and 0 otherwise. So if we replace our $tanh$ with $ReLU$ for our previous equation we'll get the following:<br /><br />$s^2 = ReLU(i^2 w_i + ReLU(i^1 w_i + s^0 w_s) w_s)$<br />$\frac{ds^2}{dw_i} = 1_{i^2 w_i + ReLU(i^1 w_i + s^0 w_s) w_s > 0}(i^2 + w_s 1_{i^1 w_i + s^0 w_s > 0}(i^1))$<br /><br />You might be worried about how a single failed condition in the indicator functions will result in all previous time steps being multiplied by 0, which will vanish them completely. But keep in mind that $i$ and $s$ are not single numbers but vectors of numbers, and that the $ReLU$ function will work on each number in the vectors independently. So whilst some numbers in the vectors will be negative, others will not.
With large enough vectors it is unlikely that an entire vector will consist of just negative numbers so you're likely to have a number of cases with only multiplications by 1 (never 0) which will preserve at least parts of the vectors of early inputs.<br /><br />Although simple RNNs with ReLUs have been reported to give good results in some papers such as <a href="https://arxiv.org/abs/1504.00941">this</a>, a more complex form of RNN called a long short term memory (LSTM) RNN, which was designed to reduce the problem of vanishing gradients, is much more popular. According to one of the authors of the LSTM (Jurgen Schmidhuber), its name means that it is <a href="https://soundcloud.com/twiml/twiml-talk-44-jurgen-schmidhuber-lstms-plus-deep-learning-history-lesson">an RNN with a short term memory that is long</a>.<br /><br />The idea is to make the state pass through to the next state without going through an activation function. You pass the input through an activation function, but the previous state is just added to it linearly. So instead of having this state update:<br /><br />$s^1 = tanh(i^1 w_i + s^0 w_s)$<br /><br />the LSTM has this state update:<br /><br />$s^1 = tanh(i^1 w_i) + s^0$<br /><br />This changes how the gradient changes as the sequence gets longer because now we don't have nested activation functions any more. Instead this is what you end up with when you add another time step:<br /><br />$s^2 = tanh(i^2 w_i) + tanh(i^1 w_i) + s^0$<br /><br />This equation perfectly preserves the gradients as it gets longer since all you're doing is adding another term but it also loses a lot of its expressive power since the order in which you present the items in a sequence doesn't matter any more. You'll get the same state vector regardless of how you shuffle the sequence.<br /><br />What we need is a mixture of the expressiveness of the simple RNN mentioned in the beginning and gradient preservation of this new function. 
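<br /><br />Before moving on, here is a quick numerical sketch of the vanishing effect discussed above (my own illustration, not code from the post). It multiplies together one $(1 - tanh^2(\dots))$ factor per time step, for arbitrary pre-activation values, mimicking the factors a gradient passes through in a $tanh$ RNN:<br /><br />

```python
import math
import random

random.seed(0)

def gradient_factor_product(num_steps):
    """Multiply together one (1 - tanh^2(z)) factor per time step.
    Each factor is a fraction between 0 and 1, so the product shrinks
    rapidly as the number of time steps grows."""
    product = 1.0
    for _ in range(num_steps):
        z = random.uniform(-2.0, 2.0)  # arbitrary pre-activation value
        product *= 1.0 - math.tanh(z)**2
    return product

for steps in [1, 5, 10, 20]:
    print(steps, gradient_factor_product(steps))
```

Even after just 20 steps the product is typically a tiny number, which is why early inputs stop influencing the gradient.<br /><br />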
The solution is to have two state vectors: one for expressiveness called the "hidden state" $h$ and one for gradient preservation called the "cell state" $c$. Here is the new equation:<br /><br />$c^1 = tanh(i^1 w_i + h^0 w_h) + c^0$<br />$h^1 = tanh(c^1)$<br /><br />Notice the following things:<br /><ul><li>The order of the inputs matters now because each input is combined with a different hidden state vector.</li><li>The hidden state vector is derived from the cell state vector by just passing the cell state through a $tanh$.</li><li>Even though the hidden state is derived from the cell state, $h^0$ and $c^0$ are separate constants and $h^0$ is not equal to $tanh(c^0)$.</li></ul><br />Let's derive the equation for the final state of a length 2 sequence step by step:<br /><br />$c^2 = tanh(i^2 w_i + h^1 w_h) + c^1$<br />$h^2 = tanh(c^2)$<br /><br />$c^2 = tanh(i^2 w_i + tanh(c^1) w_h) + tanh(i^1 w_i + h^0 w_h) + c^0$<br /><br />$c^2 = tanh(i^2 w_i + tanh(tanh(i^1 w_i + h^0 w_h) + c^0) w_h) + tanh(i^1 w_i + h^0 w_h) + c^0$<br /><br />You can see that what is happening is that you end up with a separate term for each input where the input of that term is just inside a $tanh$ and no deeper. On the other hand each term also consists of all the previous inputs but at much lower levels of influence since each previous input is within an additional two nested $tanh$ functions the further back in the sequence it is found. This allows for a distinction in the order of the inputs whilst keeping each term focused on one particular input.<br /><br />The creators of the LSTM didn't stop there however. They also introduced gating functions in the LSTM. A gating function is basically a single layer sub neural net that outputs a fraction between 0 and 1 using a sigmoid. This number is then multiplied by another number in order to either leave the second number unchanged or turn it to zero.
In other words, if the gate is 1 then it allows the second number to pass whilst if it is 0 then it blocks the second number (and gates in between will make the number smaller). <a href="http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf">Originally there were two gates</a>: one for the input terms $tanh(i^1 w_i + h^0 w_h)$ and one for the new hidden state $tanh(c^1)$. The gates control whether to accept the new input (the input gate) and whether to allow previous information to be represented by the hidden state (the output gate). Later, <a href="http://www.jmlr.org/papers/volume3/gers02a/gers02a.pdf">another paper</a> introduced a forget gate that regulates the cell state as well. All of these gates are controlled by sub neural nets that take a decision based on the current input and the previous hidden state.<br /><br />Here is what the final LSTM equation looks like:<br /><br />$g_f = sig(i^{t} w_{g_{f}i} + h^{t-1} w_{g_{f}h} + b_{g_f})$<br />$g_i = sig(i^{t} w_{g_{i}i} + h^{t-1} w_{g_{i}h} + b_{g_i})$<br />$g_o = sig(i^{t} w_{g_{o}i} + h^{t-1} w_{g_{o}h} + b_{g_o})$<br />$c^{t} = g_i \odot tanh(i^{t} w_{ci} + h^{t-1} w_{ch} + b_{c}) + g_f \odot c^{t-1}$<br />$h^{t} = g_o \odot tanh(c^{t})$<br /><br />where $g_f$, $g_i$, and $g_o$ are the forget gate, input gate, and output gate respectively. $w_{g_{f}i}$ means weight for the input being used in the calculation of $g_f$. $b$ are the bias terms of the sub neural nets (we didn't use biases in the previous equations to keep things short). 
$\odot$ is the element-wise product of two vectors and $sig$ stands for sigmoid.<br /><br />Here is a diagram illustrating the different components of the LSTM with gating function application (element-wise multiplication) being represented by triangles:<br /><br /><a href="https://3.bp.blogspot.com/-QIWJ3Ou77NI/WkYzRvV99RI/AAAAAAAABK0/gtcgPGMvMigG15BHrJfYEEN6C0wo0DRIACLcBGAs/s1600/lstm.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-QIWJ3Ou77NI/WkYzRvV99RI/AAAAAAAABK0/gtcgPGMvMigG15BHrJfYEEN6C0wo0DRIACLcBGAs/s1600/lstm.png" data-original-width="756" data-original-height="378" /></a><br /><br />A little while before, you might have been a little sceptical about using two different state vectors in order to combine the expressivity of the simple RNN with the gradient preservation of the addition based RNN described above. Can't we just use the cell state in the $tanh$ as well? In fact there was later <a href="https://arxiv.org/abs/1412.3555">a paper</a> describing another kind of RNN called a gated recurrent unit (GRU) which does just that. Instead of using one state for differentiating between time steps and another for flowing previous states into later calculations, only one state is used. Gating functions are still used, except that the input gate is replaced with 1 minus the forget gate. Basically it finds a compromise between either forgetting previous states and focussing everything on the new input or ignoring the new input and preserving the previous states. Finally, instead of having an output gate we now have a reset gate which is used to control how much the state should contribute to the input term. 
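<br /><br />Before looking at the GRU equations, the LSTM equations above can be written out as a short numpy sketch (a hedged illustration of mine; the `lstm_step` name, the parameter dictionary, and the weight shapes are my own choices, not from the post):<br /><br />

```python
import numpy as np

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(i, h_prev, c_prev, p):
    """One LSTM time step following the equations above: three sigmoid
    gates, a tanh candidate term, and the two state updates."""
    g_f = sig(i @ p['w_gfi'] + h_prev @ p['w_gfh'] + p['b_gf'])  # forget gate
    g_i = sig(i @ p['w_gii'] + h_prev @ p['w_gih'] + p['b_gi'])  # input gate
    g_o = sig(i @ p['w_goi'] + h_prev @ p['w_goh'] + p['b_go'])  # output gate
    c = g_i * np.tanh(i @ p['w_ci'] + h_prev @ p['w_ch'] + p['b_c']) + g_f * c_prev
    h = g_o * np.tanh(c)
    return h, c

# Tiny example: input size 3, hidden/cell state size 2, random weights.
rng = np.random.default_rng(0)
in_size, state_size = 3, 2
p = {}
for gate in ['gf', 'gi', 'go']:
    p['w_%si' % gate] = rng.standard_normal((in_size, state_size))
    p['w_%sh' % gate] = rng.standard_normal((state_size, state_size))
    p['b_%s' % gate] = rng.standard_normal(state_size)
p['w_ci'] = rng.standard_normal((in_size, state_size))
p['w_ch'] = rng.standard_normal((state_size, state_size))
p['b_c'] = rng.standard_normal(state_size)

h, c = np.zeros(state_size), np.zeros(state_size)
for i in rng.standard_normal((5, in_size)):  # a sequence of 5 input vectors
    h, c = lstm_step(i, h, c, p)
print(h, c)
```

Running the loop over a short sequence of random input vectors produces a final hidden state $h$ and cell state $c$ of the chosen state size.<br /><br />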
Here is what the GRU equation looks like:<br /><br />$g_f = sig(i^{t} w_{g_{f}i} + h^{t-1} w_{g_{f}h} + b_{g_f})$<br />$g_i = 1 - g_f$<br />$g_r = sig(i^{t} w_{g_{r}i} + h^{t-1} w_{g_{r}h} + b_{g_r})$<br />$h^{t} = g_i \odot tanh(i^{t} w_{hi} + w_{hh}(g_r \odot h^{t-1}) + b_{h}) + g_f \odot h^{t-1}$<br /><br />And here is the diagram:<br /><br /><a href="https://1.bp.blogspot.com/-WY9AJydm_ZM/WkYzXugXw5I/AAAAAAAABK4/e0O02cSJPiIE1gqEoC713_MgGO9g6SErQCLcBGAs/s1600/gru.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-WY9AJydm_ZM/WkYzXugXw5I/AAAAAAAABK4/e0O02cSJPiIE1gqEoC713_MgGO9g6SErQCLcBGAs/s1600/gru.png" data-original-width="756" data-original-height="378" /></a>Unknownnoreply@blogger.com1tag:blogger.com,1999:blog-4318944459823143473.post-50014336646792906302017-11-29T21:28:00.001+01:002018-02-25T20:17:49.223+01:00An introduction to recurrent neural networks: Simple RNNsTraditional feed forward neural networks work by taking a single fixed-length vector of inputs and producing a single fixed-length vector of outputs. After being trained on a training set, the neural network should be able to not only map inputs in the training set to their correct outputs, but also do so with new unseen inputs. The network is able to generalise to new inputs, but the new inputs must be of the same size. The network is not able to generalise across complexity, however. For example if you train the network to perform addition on 4 digit numbers, it will not be able to perform addition on 5 digit numbers or even 3 digit numbers. Likewise if it learns to understand 5 word sentences then it will not be able to do anything with 6 word sentences. With images we usually solve this by resizing differently sized images into a standard size. This is not as straightforward to do with things like sentences.
This is where recurrent neural networks come in.<br /><br />Recurrent neural networks (RNNs) are used to give a neural network a short term memory which is used to be able to read a sequence of inputs and remember just a summary of what is important in the sequence. This summary, which would be a fixed-length vector called a state, can then be used by the rest of the neural network as usual. It can be used to predict the next word in a partial sentence or to determine the sentiment of a sentence. A simple RNN is a feed forward neural network where neurons in a layer have extra connections that loop around to the same layer as shown below:<br /><br /><a href="https://4.bp.blogspot.com/-aSg0QgTaNUI/Wh8Xk2OdMoI/AAAAAAAABJI/hEsmLrUCKGcZkxyXqIZQ4w1u6U2BR0COgCEwYBhgL/s1600/rnns_01.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-aSg0QgTaNUI/Wh8Xk2OdMoI/AAAAAAAABJI/hEsmLrUCKGcZkxyXqIZQ4w1u6U2BR0COgCEwYBhgL/s320/rnns_01.png" width="320" height="291" data-original-width="596" data-original-height="542" /></a><br /><br />The figure shows a neural network consisting of 2 inputs and a state of neurons. The red connections allow each state neuron to produce an output based on a combination of the input neurons and the state neurons themselves. In other words the next state vector is produced based on a combination of the current input and previous state. This is the basis of short-term memory and the result is that after being exposed to a number of inputs, the state will be a vector that is influenced by each of the input vectors. The point is to train the neural network to remember what is important according to the task at hand. 
This means that we also need to use the final state (after processing all inputs) to generate an output (the state itself is not usually a useful output) which will make the RNN learn a useful representation of the input sequence.<br /><br /><a href="https://4.bp.blogspot.com/-qAIWWwYQdxE/Wh8XxT_VqXI/AAAAAAAABJM/L9SPhT_NMxwF-uQaIsprc4bZSpPzCLXsACLcBGAs/s1600/rnns_02.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-qAIWWwYQdxE/Wh8XxT_VqXI/AAAAAAAABJM/L9SPhT_NMxwF-uQaIsprc4bZSpPzCLXsACLcBGAs/s320/rnns_02.png" width="320" height="204" data-original-width="829" data-original-height="529" /></a><br /><br />How are the inputs and the recurrent connections combined together? By concatenating the input and state vectors together and then passing them through a weight matrix as usual, generating the next state vector.<br /><br /><a href="https://2.bp.blogspot.com/-Ju_-TrIqrNo/Wh8X15rX-UI/AAAAAAAABJQ/NAdGlhzua3Uxfg2js_SF4Q7-9fZYJXNBgCLcBGAs/s1600/rnns_03.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-Ju_-TrIqrNo/Wh8X15rX-UI/AAAAAAAABJQ/NAdGlhzua3Uxfg2js_SF4Q7-9fZYJXNBgCLcBGAs/s320/rnns_03.png" width="320" height="301" data-original-width="269" data-original-height="253" /></a><br /><br />But what happens for the first input? What's the first input going to be concatenated with if there is no previous state? We have to define a default initial state for the first input. This is usually the all zeros vector but you can instead learn a constant vector that gets optimized during training.<br /><br />Great, so that's all the basics sorted. Now for the formal notation. It is common to use the terminology of time series when talking about RNNs such that each input in a sequence belongs to a different time step in a series.
In our formal notation, let's use superscripts to refer to time steps such that the first time step is 1 and the number of time steps is $T$.<br /><br />$$s^0 = \mathbf{0}$$<br />$$s^t = f_s([s^{t-1} i^t] W_s + b_s)$$<br />$$o = f_o(s^T W_o + b_o)$$<br /><br />where $s^t$ is the state vector at time $t$, $i^t$ is the input vector at time $t$, $o$ is the output vector, $\mathbf{0}$ is the all zeros vector, $f_s$ and $f_o$ are the activation functions of the state vector and output vector respectively, and $W_s$, $b_s$, $W_o$, and $b_o$ are the weights and biases of the state vector and output vector respectively.<br /><br />The question is how to learn the parameters of a recurrent function, which is not as simple as with a feed forward neural network. The first thing we need to do is to unroll the recurrent network into a linear network that reuses the same parameters throughout. Although an RNN can be unrolled to infinity it is also the case that you only need to unroll it as far as your input sequences require. So if your training set contains an input sequence that is 3 time steps long, then you can use an RNN that is unrolled 3 times like this:<br /><br /><a href="https://3.bp.blogspot.com/-xNZPsxw7gtY/Wh8ay2nFidI/AAAAAAAABJo/1PdsiGCM92ceobTPdQPR8PktYcAxiqztQCLcBGAs/s1600/rnns_04.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-xNZPsxw7gtY/Wh8ay2nFidI/AAAAAAAABJo/1PdsiGCM92ceobTPdQPR8PktYcAxiqztQCLcBGAs/s640/rnns_04.png" width="640" height="341" data-original-width="1371" data-original-height="730" /></a><br /><br />Now it makes more sense and we can view it as a feed forward neural network. In fact an RNN is a feed forward network with the constraint that corresponding weights across time steps have to be identical. Notice how $W_{s00}$ and $W_{i00}$ are repeated with every time step. So whereas the weights in different layers in a feed forward neural net can be different, in an RNN they have to be the same.
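<br /><br />As a concrete sketch of this notation, here is a minimal numpy forward pass (my own illustration, not code from the post, assuming $tanh$ for both $f_s$ and $f_o$; all names and sizes are hypothetical):<br /><br />

```python
import numpy as np

def rnn_forward(inputs, W_s, b_s, W_o, b_o, state_size):
    """Run the simple RNN over a sequence following the equations above:
    s^0 = 0, s^t = tanh([s^{t-1} i^t] W_s + b_s), o = tanh(s^T W_o + b_o)."""
    s = np.zeros(state_size)              # s^0 = the all zeros vector
    for i in inputs:
        concat = np.concatenate([s, i])   # [s^{t-1} i^t]
        s = np.tanh(concat @ W_s + b_s)   # next state
    return np.tanh(s @ W_o + b_o)         # output from the final state

# Tiny example: input size 3, state size 2, output size 1, random weights.
rng = np.random.default_rng(0)
in_size, state_size, out_size = 3, 2, 1
W_s = rng.standard_normal((state_size + in_size, state_size))
b_s = rng.standard_normal(state_size)
W_o = rng.standard_normal((state_size, out_size))
b_o = rng.standard_normal(out_size)
inputs = rng.standard_normal((4, in_size))  # a sequence of 4 input vectors
o = rnn_forward(inputs, W_s, b_s, W_o, b_o, state_size)
print(o)
```

Note how the same $W_s$ and $b_s$ are reused at every time step, which is exactly the weight-sharing constraint described above.<br /><br />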
You might be thinking about how to handle the other input sequences of different lengths in the training set. We'll get to that later. For now let's assume that all sequences in the training set are grouped by length and that each minibatch consists of same length sequences.<br /><br />Since training will involve finding the gradient of the loss with respect to each weight, let's start by finding the gradient of the output with respect to a sample of weights. If you're not familiar with the back propagation algorithm and how gradients are used to train neural networks in general you should check out this previous <a href="https://geekyisawesome.blogspot.com.mt/2016/06/the-backpropagation-algorithm-for.html">blog post</a> before continuing on.<br /><br />Let's start with the gradient with respect to a non-recurrent weight in the output:<br /><br />$$\frac{do_0}{dW_{o00}} = \frac{d}{dW_{o00}}f_o(W_{o00}s_0^3 + W_{o10}s_1^3) = f_o'(\ldots)s_0^3$$<br /><br />That was straightforward. What about for recurrent weights?<br /><br />$$\frac{do_0}{dW_{s00}} = \frac{d}{dW_{s00}}f_o(W_{o00}s_0^3 + W_{o10}s_1^3)$$<br />$$= f_o'(\ldots)(W_{o00}\frac{d}{dW_{s00}}f_s(W_{s00}s_0^2 + W_{s10}s_1^2 + W_{i00}i_0^3 + W_{i10}i_1^3) + W_{o10}\frac{d}{dW_{s00}}f_s(W_{s01}s_0^2 + W_{s11}s_1^2 + W_{i01}i_0^3 + W_{i11}i_1^3))$$<br /><br />To keep the algebra readable we'll follow just the first of these terms, and within it just the $W_{s00}s_0^2$ part, writing $\ldots$ for the remaining terms (they all decompose in the same way):<br /><br />$$= f_o'(\ldots)(W_{o00}f_s'(\ldots)(\frac{d}{dW_{s00}}W_{s00}s_0^2 + \ldots) + \ldots)$$<br /><br />And it is at this point that we will realise that things are more complicated than usual with feed forward neural nets. This is because $s_0^2$ can be decomposed to reveal more terms that contain $W_{s00}$ which means that we need to use the product rule.<br /><br />$$= f_o'(\ldots)(W_{o00}f_s'(\ldots)(s_0^2\frac{d}{dW_{s00}}W_{s00} + W_{s00}\frac{d}{dW_{s00}}s_0^2 + \ldots) + \ldots)$$<br />$$= f_o'(\ldots)(W_{o00}f_s'(\ldots)(s_0^2 + W_{s00}\frac{d}{dW_{s00}}f_s(W_{s00}s_0^1 + W_{s10}s_1^1 + W_{i00}i_0^2 + W_{i10}i_1^2) + \ldots) + \ldots)$$<br /><br />...and so on, which would require as many decompositions as the length of the sequence.
This is not compatible with the back propagation algorithm as it's not easy to extract a pattern that works for any sequence length. Keep in mind that we need to do this for the input weights as well.<br /><br />Fortunately there is a simple solution: Treat all weights as being different and then add together the corresponding derivatives. What this means is that you put a superscript on each weight which indicates the time step it belongs to, hence making each weight different. So instead of having $W_{s00}$ we'd have $W_{s00}^3$, $W_{s00}^2$ and $W_{s00}^1$. Then we find the derivatives of each separate weight and finally add them all up:<br /><br />$$\frac{do_0}{dW_{s00}} = \frac{do_0}{dW_{s00}^3} + \frac{do_0}{dW_{s00}^2} + \frac{do_0}{dW_{s00}^1}$$<br /><br />This allows us to use normal back propagation to find each individual weight as if we were working on a feed forward neural net and then finally just add together corresponding derivatives in order to keep the weights identical. Notice that this is not a hack to force the weights to remain identical. The sum of the subderivatives really does equal $\frac{do_0}{dW_{s00}}$. You can try proving it yourself by trying to find the derivative using product rules as I was doing before. This trick is called back propagation through time.<br /><br />We now get back to the question of handling variable length sequences in a training set. The solution to this is to make all sequences of equal length by padding them with pad vectors (the all zeros vector for example) and then make the RNN simply return the current state unmodified if the input is a pad vector. That way the state will remain the same beyond the length of the sequence, as if there were no pad vectors. 
This is the new RNN equation:<br /><br />$$s^0 = \mathbf{0}$$<br />$$s^t =<br />\begin{cases}<br />f_s([s^{t-1} i^t] W_s + b_s) & \quad \text{if } i^t \text{ is not a pad}\\<br />s^{t-1} & \quad \text{otherwise}<br />\end{cases}<br />$$<br />$$o = f_o(s^T W_o + b_o)$$<br /><br />You can now see how to implement a language model which predicts the next word in a partial sentence using Tensorflow by checking this previous <a href="https://geekyisawesome.blogspot.com.mt/2017/05/neural-language-models-and-how-to-make.html">blog post</a>. You might also want to learn about how to represent words as vectors using word embeddings in this other <a href="https://geekyisawesome.blogspot.com.mt/2017/03/word-embeddings-how-word2vec-and-glove.html">blog post</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-91087400828036759802017-10-21T23:19:00.001+02:002017-10-22T12:18:40.051+02:00Hyperparameter tuning using HyperoptOne of the most tedious but important things there is in machine learning is tuning the hyperparameters of your machine learning algorithm, such as the learning rate and initial parameters in gradient descent. For example you might want to check how your gradient descent algorithm performs when the learning rate is 1.0, 0.1, or 0.01 and the initial parameters being randomly chosen between -10.0 and 10.0 or between -1.0 and 1.0.<br /><br />The problem with doing this is that unless you have several supercomputers at your disposal, testing the learning algorithm is a slow process and there are so many different ways to configure your hyperparameters. In the above example you'd have to try six different configurations: (1.0,(-10.0,10.0)), (1.0,(-1.0,1.0)), (0.1,(-10.0,10.0)), etc.. An exhaustive search (grid search) would take too long if each test takes very long and in practice you'd have a search space which is much bigger than six. 
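<br /><br />To make that counting concrete, the six configurations above can be enumerated mechanically (a trivial sketch of mine, not code from the post):<br /><br />

```python
import itertools

learning_rates = [1.0, 0.1, 0.01]
init_ranges = [(-10.0, 10.0), (-1.0, 1.0)]

# A grid search has to try every pairing of the two hyperparameters.
configurations = list(itertools.product(learning_rates, init_ranges))
print(len(configurations))  # 3 learning rates x 2 ranges = 6 configurations
for config in configurations:
    print(config)
```

Each extra hyperparameter multiplies the number of configurations, which is why an exhaustive grid search quickly becomes infeasible.<br /><br />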
We can test a random sample of the configurations but ideally the sample would be chosen intelligently rather than randomly. We could start out by trying some randomly chosen configurations and then start homing in on some of the more promising hyperparameter choices.<br /><br />This is where the Python library <a href="https://jaberg.github.io/hyperopt/">Hyperopt</a> comes in. It's a simple library that searches a space of hyperparameters in a rational way so that you'll end up with a better result after trying 100 configurations than if you just randomly sampled 100 different configurations and picked the best performing one. It does this by using a <a href="http://optunity.readthedocs.io/en/latest/user/solvers/TPE.html">tree of Parzen Estimators</a>.<br /><br />Let's start with a simple gradient descent example that finds the minimum of "y = x^2". We'll write a function that takes in a learning rate, and a range within which to initialize "x" (we'll only ask for the maximum of this range and assume that the negative of this number is the minimum). We'll then apply gradient descent on the initial "x" value for 10 epochs. The function will finally return the value of "x" that was found near the minimum.<br /><br /><pre class="brush:python">import random<br /><br />def loss(x):<br /> return x**2<br /><br />def loss_grad(x):<br /> return 2*x<br /><br />def graddesc(learning_rate, max_init_x):<br /> x = random.uniform(-max_init_x, max_init_x)<br /> for _ in range(10):<br /> x = x - learning_rate*loss_grad(x)<br /> return x<br /></pre><br />Great, now we want to find the best hyperparameters to use. In practice we'd have a validation set for our machine learning model to learn to perform well on. Once the hyperparameters that result in the best performance on the validation set are found, we'd apply them to learn on a separate test set and it is this performance that is used to judge the quality of the learning algorithm. 
However, for this blog post we'll instead focus on the simpler mathematical optimization problem of finding the minimum of "y = x^2".<br /><br />This is how you use Hyperopt to find the best hyperparameter combination. First you create a function called an objective function that takes in a tuple with the chosen hyperparameters and returns a dictionary that is read by hyperopt to assess the fitness of the chosen hyperparameters. Then you take this function and the possible hyperparameter choices you want to allow and pass them to the hyperopt function called "fmin" which finds the hyperparameters that give the smallest loss.<br /><br /><pre class="brush:python">import hyperopt<br /><br />def hyperopt_objective(hyperparams):<br /> (learning_rate, max_init_x) = hyperparams<br /> l = loss(graddesc(learning_rate, max_init_x))<br /> return {<br /> 'loss': l,<br /> 'status': hyperopt.STATUS_OK,<br /> }<br /><br />best = hyperopt.fmin(<br /> hyperopt_objective,<br /> space = [<br /> hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),<br /> hyperopt.hp.choice('max_init_x', [ 10.0, 1.0 ]),<br /> ],<br /> algo = hyperopt.tpe.suggest,<br /> max_evals = 10<br /> )<br />print(best)<br /></pre><br />The output of this program is this:<br /><pre>{'learning_rate': 1, 'max_init_x': 1}<br /></pre>This is saying that the best loss is obtained when the learning rate is the item at index 1 in the given list (0.1) and the maximum initial value is the item at index 1 in the given list (1.0). The "space" parameter in fmin is there to say how to construct a combination of hyperparameters. We specified that we want a list of two things: a learning rate that can be either 1.0, or 0.1, or 0.01, and a maximum initial value that can be either 10.0 or 1.0. We use "hp.choice" to let fmin choose among a list. We could instead use "hp.uniform" in order to allow any number within a range. 
I prefer to use a list of human friendly numbers instead of allowing any number so I use the choice function instead. We have also said that we want to allow exactly 10 evaluations of the objective function in order to find the best hyperparameter combination.<br /><br />Although this is how we are expected to use this library, it is not very user friendly in general. For example there is no feedback given throughout the search process so if each evaluation of a hyperparameter combination takes hours to complete then we could end up waiting for several days without seeing anything, just waiting for the function to return a value. The return value also requires further processing as it's just indexes. We can make this better by adding some extra stuff to the objective function:<br /><br /><pre class="brush:python">eval_num = 0<br />best_loss = None<br />best_hyperparams = None<br />def hyperopt_objective(hyperparams):<br /> global eval_num<br /> global best_loss<br /> global best_hyperparams<br /><br /> eval_num += 1<br /> (learning_rate, max_init_x) = hyperparams<br /> l = loss(graddesc(learning_rate, max_init_x))<br /> print(eval_num, l, hyperparams)<br /><br /> if best_loss is None or l < best_loss:<br /> best_loss = l<br /> best_hyperparams = hyperparams<br /><br /> return {<br /> 'loss': l,<br /> 'status': hyperopt.STATUS_OK,<br /> }<br /></pre><br />We can now see each hyperparameter combination being evaluated. This is the output we'll see:<br /><pre>1 1.6868761146697238 (1.0, 10.0)<br />2 0.34976768426779775 (0.01, 1.0)<br />3 0.006508209785146999 (0.1, 1.0)<br />4 1.5999357079405185 (0.01, 10.0)<br />5 0.2646974732349057 (0.01, 1.0)<br />6 0.5182259594937579 (1.0, 10.0)<br />7 53.61565213613977 (1.0, 10.0)<br />8 1.8239879002601682 (1.0, 10.0)<br />9 0.15820396975495435 (0.01, 1.0)<br />10 0.784445725853568 (1.0, 1.0)<br /></pre><br />Also, the variable best_hyperparams will contain the tuple with the best hyperparameters found. 
Printing best_hyperparams will show "(0.1, 1.0)". We can even save the best hyperparameters found till now in a file so that we can stop the search early if we run out of patience.<br /><br />Here is the full code in one place:<br /><br /><pre class="brush:python" style="height:400px; overflow:scroll;">import random<br />import hyperopt<br /><br />def loss(x):<br /> return x**2<br /><br />def loss_grad(x):<br /> return 2*x<br /><br />def graddesc(learning_rate, max_init_x):<br /> x = random.uniform(-max_init_x, max_init_x)<br /> for _ in range(10):<br /> x = x - learning_rate*loss_grad(x)<br /> return x<br /><br />eval_num = 0<br />best_loss = None<br />best_hyperparams = None<br />def hyperopt_objective(hyperparams):<br /> global eval_num<br /> global best_loss<br /> global best_hyperparams<br /><br /> eval_num += 1<br /> (learning_rate, max_init_x) = hyperparams<br /> l = loss(graddesc(learning_rate, max_init_x))<br /> print(eval_num, l, hyperparams)<br /><br /> if best_loss is None or l < best_loss:<br /> best_loss = l<br /> best_hyperparams = hyperparams<br /><br /> return {<br /> 'loss': l,<br /> 'status': hyperopt.STATUS_OK,<br /> }<br /><br />hyperopt.fmin(<br /> hyperopt_objective,<br /> space = [<br /> hyperopt.hp.choice('learning_rate', [ 1.0, 0.1, 0.01 ]),<br /> hyperopt.hp.choice('max_init_x', [ 10.0, 1.0 ]),<br /> ],<br /> algo = hyperopt.tpe.suggest,<br /> max_evals = 10<br /> )<br /><br />print(best_hyperparams)<br /></pre><br />To find out more about Hyperopt see this <a href="https://github.com/jaberg/hyperopt/wiki/FMin">documentation page</a> and the <a href="https://github.com/jaberg/hyperopt">Github repository</a>.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-4318944459823143473.post-15675259591453729732017-09-18T16:27:00.003+02:002017-09-19T08:39:37.834+02:00Find an equation that passes through a set of points (Lagrange polynomial interpolation)When I was in school, I remember learning about how to find the equation of a line that passes 
through two points. When I asked how to find the equation of a curve instead I was told that this was not possible, probably because there is an infinite number of curves that pass through a finite set of points (as we'll see below). However I would have been happy to learn about this simple method to produce a polynomial curve that passes through any given set of points. In general this task is called interpolation and the particular interpolating method we shall see here is called <a href="https://en.wikipedia.org/wiki/Lagrange_polynomial">Lagrange polynomials</a>.<br /><br />A polynomial is an equation of the form<br /><br />$$a_n x^n + a_{n-1} x^{n-1} + \dots + a_1 x + a_0$$<br /><br />For example $2x^3 + 4x^2 - 3x + 1$ is a polynomial where 2 is $a_3$, 4 is $a_2$, -3 is $a_1$ and 1 is $a_0$. A polynomial plus another polynomial is also a polynomial, for example $(x^2 + 2x) + (3x + 2) = x^2 + 5x + 2$. A polynomial times another polynomial is also a polynomial, for example $(x^2 + 2x)(3x + 2) = 3x^3 + 8x^2 + 4x$.<br /><br />Now on to polynomial interpolation. Let's start with the simplest case, when you have only one point. Let's say you want a polynomial that passes through (3,2), that is, when $x$ is 3, the polynomial should equal 2. In this case our equation can simply be<br /><br />$$y = 2$$<br /><br />that is, the polynomial is just 2. The equation is always 2 regardless of where $x$ is so we have found our polynomial.<br /><br /><figure><a href="https://1.bp.blogspot.com/-pZ0esix4-30/Wb_VeecRCNI/AAAAAAAABHY/1l_lHMsinhcBgmz961xJPXpD1M3s_XCKgCLcBGAs/s1600/lagrange01.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-pZ0esix4-30/Wb_VeecRCNI/AAAAAAAABHY/1l_lHMsinhcBgmz961xJPXpD1M3s_XCKgCLcBGAs/s320/lagrange01.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = 2$.</figcaption></figure><br /><br />Now on to a more interesting case.
Let's say we want to pass through these points:<br /><ul><li>(3,2)</li><li>(4,3)</li><li>(6,4)</li></ul><br />The trick is to add up a set of polynomials each of which goes through one point. We need a polynomial that equals 2 when $x$ is 3, another polynomial that equals 3 when $x$ is 4, and another polynomial that equals 4 when $x$ is 6. But we have to be careful as we don't want these three polynomials to interfere with each other after being added up together. For example we can't just use $y = (2) + (3) + (4)$ as the three terms will interfere with each other and the result will not pass through any of the points. Instead we need each polynomial to have a shape such that each one passes through its corresponding point but then passes through 0 at the x-values of the remaining points. The three polynomials we need to add up together are:<br /><ul><li>Polynomial corresponding to (3,2): must equal 2 when $x$ is 3 and equal 0 when $x$ is 4 or 6.</li><li>Polynomial corresponding to (4,3): must equal 3 when $x$ is 4 and equal 0 when $x$ is 3 or 6.</li><li>Polynomial corresponding to (6,4): must equal 4 when $x$ is 6 and equal 0 when $x$ is 3 or 4.</li></ul>Since the polynomials equal 0 where the other points are then they will not interfere with each other where the polynomials pass through their corresponding point.<br /><br />Let's focus on the zeros first. There is an easy trick to ensure that a polynomial goes through zero at certain values of $x$. If you want your polynomial to be zero when $x$ is 4 or 6, then use $(x-4)(x-6)$. In this polynomial, when $x$ is 4 then the first bracket will equal 0, which makes the whole thing 0, and when $x$ is 6 the second bracket will equal 0 which will also make the whole thing 0.
This is what $y = (x-4)(x-6)$, $y = (x-3)(x-6)$ and $y = (x-3)(x-4)$ look like:<br /><br /><figure><a href="https://4.bp.blogspot.com/-JHj_I6n6hws/Wb_VlBjGlYI/AAAAAAAABHc/uGVrY8bXcY0zvjsKELIiXkVNDUEAcTMawCLcBGAs/s1600/lagrange02a.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-JHj_I6n6hws/Wb_VlBjGlYI/AAAAAAAABHc/uGVrY8bXcY0zvjsKELIiXkVNDUEAcTMawCLcBGAs/s320/lagrange02a.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.</figcaption></figure><br /><br /><figure><a href="https://2.bp.blogspot.com/-gi8vPgqOXHU/Wb_WNMOiSLI/AAAAAAAABHk/i8FZRs1FUvQ9A0Iwt1xRVXI6KR9gL1JJACLcBGAs/s1600/lagrange02b.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-gi8vPgqOXHU/Wb_WNMOiSLI/AAAAAAAABHk/i8FZRs1FUvQ9A0Iwt1xRVXI6KR9gL1JJACLcBGAs/s320/lagrange02b.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-3)(x-6)$. Passes through zero when $x$ is 3 or 6.</figcaption></figure><br /><br /><figure><a href="https://4.bp.blogspot.com/-18aHYqEuCi0/Wb_WQi_QBfI/AAAAAAAABHo/otev8mhn5w8O9sKIZ8m2gFNNv-neFfN0QCLcBGAs/s1600/lagrange02c.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-18aHYqEuCi0/Wb_WQi_QBfI/AAAAAAAABHo/otev8mhn5w8O9sKIZ8m2gFNNv-neFfN0QCLcBGAs/s320/lagrange02c.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-3)(x-4)$. Passes through zero when $x$ is 3 or 4.</figcaption></figure><br /><br />See how each curve passes through zero at every point except one? That exception point will be the corresponding point of each polynomial. Adding these polynomials up will not make them interfere with each other at the x-values of each point. 
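These zero placements are easy to check numerically. Here is a quick Python sketch of my own (the names p1, p2, p3 are just for illustration) that evaluates the three bracket products at the relevant x-values:

```python
# Each product of brackets is zero exactly at the x-values it was built from.
def p1(x): return (x - 4) * (x - 6)  # built to be zero at x = 4 and x = 6
def p2(x): return (x - 3) * (x - 6)  # built to be zero at x = 3 and x = 6
def p3(x): return (x - 3) * (x - 4)  # built to be zero at x = 3 and x = 4

print(p1(4), p1(6))  # 0 0
print(p2(3), p2(6))  # 0 0
print(p3(3), p3(4))  # 0 0
# Each polynomial is still nonzero at the x-value of its own corresponding point:
print(p1(3), p2(4), p3(6))  # 3 -2 6
```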
But in order to get the resultant polynomial to pass through the points, we need each separate polynomial to pass through its corresponding point whilst still being zero at every other point. For this we apply another easy trick which is to multiply the polynomials by a number. Multiplying an equation by a number will not change where it is zero. It will change the shape of the curve but the places at which it is zero will not be moved. This is what $(x-4)(x-6)$ looks like after being multiplied by 2, 0.5, and -1:<br /><br /><figure><a href="https://3.bp.blogspot.com/-VdPamJMwdV0/Wb_WVjc8tJI/AAAAAAAABHs/KIpgKh4SGvU0KxA3VekQt9awicunCzeTgCLcBGAs/s1600/lagrange03a.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-VdPamJMwdV0/Wb_WVjc8tJI/AAAAAAAABHs/KIpgKh4SGvU0KxA3VekQt9awicunCzeTgCLcBGAs/s320/lagrange03a.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = 2(x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.</figcaption></figure><br /><br /><figure><a href="https://3.bp.blogspot.com/-ojV3beeUvVo/Wb_WYhlxOgI/AAAAAAAABHw/D7Jkta6YGNUvxW6ocXJhtV79QnEcs1TfwCLcBGAs/s1600/lagrange03b.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-ojV3beeUvVo/Wb_WYhlxOgI/AAAAAAAABHw/D7Jkta6YGNUvxW6ocXJhtV79QnEcs1TfwCLcBGAs/s320/lagrange03b.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = 0.5(x-4)(x-6)$. 
Passes through zero when $x$ is 4 or 6.</figcaption></figure><br /><br /><figure><a href="https://4.bp.blogspot.com/-7QB-ON_M0vQ/Wb_Wc11zSQI/AAAAAAAABH0/LE9vFrXyI64jce9D7NijPanjNls6T6SlACLcBGAs/s1600/lagrange03c.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-7QB-ON_M0vQ/Wb_Wc11zSQI/AAAAAAAABH0/LE9vFrXyI64jce9D7NijPanjNls6T6SlACLcBGAs/s320/lagrange03c.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = -(x-4)(x-6)$. Passes through zero when $x$ is 4 or 6.</figcaption></figure><br /><br />So we just need to find the number that when multiplied by each separate polynomial will make it pass through its corresponding point. This number is the y-value of the corresponding point divided by the current value of the polynomial there. What we're doing is first making the polynomial equal 1 at its corresponding point and then multiplying it by whatever number we want it to be. For example if you want to multiply the number 2 by a number that will make the result 3, that number should be $\frac{3}{2}$, that is, $2 \times \frac{3}{2} = 3$.<br /><br />So what is the current value of $(x-4)(x-6)$ at its corresponding point, (3,2)? It's $(3-4)(3-6) = 3$. So by multiplying $(x-4)(x-6)$ by $\frac{2}{(3-4)(3-6)}$ we can make the polynomial pass through 2, without changing where it is zero. 
This is what $y = (x-4)(x-6)\frac{2}{(3-4)(3-6)}$, $y = (x-3)(x-6)\frac{3}{(4-3)(4-6)}$ and $y = (x-3)(x-4)\frac{4}{(6-3)(6-4)}$ look like:<br /><br /><figure><a href="https://1.bp.blogspot.com/-rQyM6etoPNA/Wb_WiX6zllI/AAAAAAAABH4/rVZp6eRE9Lk7_QrHrkDWXFrBFAv1C7dlQCLcBGAs/s1600/lagrange04a.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-rQyM6etoPNA/Wb_WiX6zllI/AAAAAAAABH4/rVZp6eRE9Lk7_QrHrkDWXFrBFAv1C7dlQCLcBGAs/s320/lagrange04a.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-4)(x-6)\frac{2}{(3-4)(3-6)}$. Passes through corresponding point (3,2) but through zero when $x$ is 4 or 6.</figcaption></figure><br /><br /><figure><a href="https://3.bp.blogspot.com/-dFQHLfFXY8E/Wb_Wlcgi-cI/AAAAAAAABIA/EQbgcRpk_Rs9SejS8AwDMu4FrZs44vXmQCLcBGAs/s1600/lagrange04b.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-dFQHLfFXY8E/Wb_Wlcgi-cI/AAAAAAAABIA/EQbgcRpk_Rs9SejS8AwDMu4FrZs44vXmQCLcBGAs/s320/lagrange04b.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-3)(x-6)\frac{3}{(4-3)(4-6)}$. Passes through corresponding point (4,3) but through zero when $x$ is 3 or 6.</figcaption></figure><br /><br /><figure><a href="https://1.bp.blogspot.com/-03ofA81EJqI/Wb_WlJQAz6I/AAAAAAAABH8/SpZmHQtlpGob6kod7DJ458_ajErZVPbEQCLcBGAs/s1600/lagrange04c.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-03ofA81EJqI/Wb_WlJQAz6I/AAAAAAAABH8/SpZmHQtlpGob6kod7DJ458_ajErZVPbEQCLcBGAs/s320/lagrange04c.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph of equation $y = (x-3)(x-4)\frac{4}{(6-3)(6-4)}$. 
Passes through corresponding point (6,4) but through zero when $x$ is 3 or 4.</figcaption></figure><br /><br />Now we can add them up into<br />$$y = (x-4)(x-6)\frac{2}{(3-4)(3-6)} + (x-3)(x-6)\frac{3}{(4-3)(4-6)} + (x-3)(x-4)\frac{4}{(6-3)(6-4)}$$<br />which is simplified into $-\frac{1}{6} x^2 + \frac{13}{6} x - 3$<br /><br /><figure><a href="https://4.bp.blogspot.com/-Tubqgg1WStM/Wb_WxUTM_wI/AAAAAAAABIE/OgVD8QcNQ3YWA_H7LQGFd8IQsYSUVJo-ACLcBGAs/s1600/lagrange05.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-Tubqgg1WStM/Wb_WxUTM_wI/AAAAAAAABIE/OgVD8QcNQ3YWA_H7LQGFd8IQsYSUVJo-ACLcBGAs/s320/lagrange05.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Final graph that passes through the 3 points.</figcaption></figure><br /><br />If we added another point at (10,7) then the polynomial passing through all the points would be<br />$$y = \frac{2(x-4)(x-6)(x-10)}{(3-4)(3-6)(3-10)} + \frac{3(x-3)(x-6)(x-10)}{(4-3)(4-6)(4-10)} + \frac{4(x-3)(x-4)(x-10)}{(6-3)(6-4)(6-10)} + \frac{7(x-3)(x-4)(x-6)}{(10-3)(10-4)(10-6)}$$<br /><br /><figure><a href="https://2.bp.blogspot.com/-z-8_Xr9H9Fo/Wb_W0l02EEI/AAAAAAAABII/7j5_eygGgj4polSjmWqDiv9uDO4d7dIRwCLcBGAs/s1600/lagrange06.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-z-8_Xr9H9Fo/Wb_W0l02EEI/AAAAAAAABII/7j5_eygGgj4polSjmWqDiv9uDO4d7dIRwCLcBGAs/s320/lagrange06.png" width="320" height="169" data-original-width="1103" data-original-height="582" /></a><figcaption>Graph that passes through the 3 points plus (10,7).</figcaption></figure><br /><br />See how we have found another curve that also passes through the previous 3 points? 
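For illustration, the whole construction can be written in a few lines of Python (a sketch of my own; the function name lagrange and its interface are of my choosing, not from any library):

```python
def lagrange(points):
    """Return a function interpolating the given list of (x, y) points."""
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            # Basis term: equals yi at xi and zero at every other x-value,
            # built from the bracket products and the scaling fraction above.
            term = yi
            for j, (xj, _) in enumerate(points):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

f = lagrange([(3, 2), (4, 3), (6, 4)])
print(f(3), f(4), f(6))  # 2.0 3.0 4.0

# Adding a fourth point gives a different curve through the same three points.
g = lagrange([(3, 2), (4, 3), (6, 4), (10, 7)])
print(g(3), g(4), g(6), g(10))  # 2.0 3.0 4.0 7.0
```

Between the given points the two functions disagree, which is exactly the "infinitely many curves" observation made below.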
This equation building method proves that you can find an infinite number of polynomials that pass through a finite number of points, since you can always make a polynomial that passes through the given set of points plus one extra point placed anywhere, and each position of the extra point requires a differently shaped curve. Since there is an infinite number of positions where the extra point can be placed, there is an infinite number of polynomials (and thus curves) that pass through the given set of points.<br /><br />In general, given points $(x_1,y_1), (x_2,y_2), \dots, (x_n,y_n)$, your polynomial will be<br />$$\sum^{n}_{i=1} y_i \frac{\prod^{n}_{j=1,j\neq i} (x-x_j)}{\prod^{n}_{j=1,j\neq i} (x_i-x_j)}$$<br /><br />Notice that you cannot have two points with the same x-value or you'll get a division by zero. Notice also that you are not guaranteed to get a sensible extrapolation from your points. The polynomial will completely derail into an upward or downward curve beyond the range of the given points, regardless of any pattern suggested by the points. Finally your equation will get more and more complicated with every new point, even after simplification. If you have $n$ points then you will have up to $n$ terms in your polynomial. It might be simplifiable into fewer terms, but that is rare.<br /><br />Proof that the sum of digits of any multiple of 3 (or 9) is itself a multiple of 3 (or 9) (2017-08-28)<br /><br />Take any multiple of 3, such as 327. You can verify that it really is a multiple of 3 by adding together all its digits, that is 3+2+7=12, and checking if the sum is itself a multiple of 3. 12 is a multiple of 3, so 327 must also be a multiple of 3. Since the sum of digits will be a much smaller number than the original number, it will be easier to determine if it is a multiple of 3 or not. 
Of course you can then take the sum of digits and find its sum of digits again repeatedly until it's a single digit number: 3, 6, 9. The same thing is true of multiples of 9 but instead the sum of digits will be a multiple of 9.<br /><br />This is the proof that this works for every multiple of 3:<br /><br />Start with a number with $n$ digits, such as the 3 digit number 327. We'll call each of these digits $a_i$, starting from $a_{n-1}$ for the first digit and ending with $a_0$ for the last digit. So 327 has $a_2 = 3$, $a_1 = 2$, and $a_0 = 7$.<br /><br />Our number is equal to the following:<br />$$a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0$$<br />This is just the hundreds, tens, units formulation of numbers. For example:<br />$$327 = 3 \times 10^2 + 2 \times 10^1 + 7 \times 10^0 = 3 \times 100 + 2 \times 10 + 7 \times 1 = 327$$<br /><br />Now the trick is to replace each $10^x$ with $999...+1$. For example $100 = 99+1$.<br /><br /><div style="width:100%; height:120px; overflow:scroll; border:solid 1px black;">$$\begin{eqnarray}<br />a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0 &=& a_{n-1} \left(999... + 1\right) + a_{n-2} \left(99... + 1\right) + \ldots + a_1 \left(9 + 1\right) + a_0 \left(0 + 1\right) \\<br />&=& \left(a_{n-1} 999... + a_{n-1}\right) + \left(a_{n-2} 99... + a_{n-2}\right) + \ldots + \left(a_1 9 + a_1\right) + \left(a_0 0 + a_0\right) \\<br />&=& \left(a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1 9 + a_0 0\right) + \left(a_{n-1} + a_{n-2} + \ldots + a_1 + a_0\right) \\<br />\end{eqnarray}$$<br /></div><br />Now we'll make some terms switch places in the equation:<br /><div style="width:100%; height:100px; overflow:scroll; border:solid 1px black;">$$\begin{eqnarray}<br />a_{n-1} + a_{n-2} + \ldots + a_1 + a_0 &=& \left(a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0\right) - \left(a_{n-1} 999... + a_{n-2} 99... 
+ \ldots + a_1 9 + a_0 0\right) \\<br />&=& \left(a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0\right) - \left(a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1 9\right)<br />\end{eqnarray}$$<br /></div><br />Notice that in the last line above we have eliminated the term $a_0 0$ because it's a multiplication by 0.<br /><br />Now, if our original number is a multiple of 3, then it must be that the right hand side at the end of the above equation is a multiple of 3. Any string of 9s is always a multiple of 3 since $999\ldots = 9 \times 111\ldots = 3 \times 3 \times 111\ldots$. It is also a multiple of 9, but we'll get to that later. This means that the two bracketed terms in the last line of the above equation are both multiples of 3 because:<br /><ul><li>$a_{n-1} 10^{n-1} + a_{n-2} 10^{n-2} + \ldots + a_1 10^1 + a_0 10^0$: is a multiple of 3 by definition (we said that the number would be one).</li><li>$a_{n-1} 999... + a_{n-2} 99... + \ldots + a_1 9$: is a sum of numbers multiplied by strings of 9s, which means that we can factor out the common factor 3.</li></ul><br />The difference of two multiples of 3 is itself a multiple of 3. Therefore the left hand side must also be a multiple of 3 (since the two sides are equal), and the left hand side just happens to be:<br /><br />$$a_{n-1} + a_{n-2} + \ldots + a_1 + a_0$$<br /><br />which is the sum of all digits. Hence for any number that is a multiple of 3, the sum of its digits must also be a multiple of 3.<br /><br />Until now we have only shown that any multiple of 3 will have the sum of its digits also be a multiple of 3. But can a number which is not a multiple of 3 coincidentally have the sum of its digits be a multiple of 3? No, because otherwise it would imply that a non-multiple of 3 minus a multiple of 3 is a multiple of 3. 
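As a quick numerical sanity check of the result, here is a Python sketch of my own (the function names are just for illustration) that compares the digit-sum test against direct divisibility:

```python
def digit_sum(n):
    """Sum the decimal digits of a non-negative integer."""
    return sum(int(d) for d in str(n))

def reduce_to_single_digit(n):
    """Apply the digit sum repeatedly until one digit remains."""
    while n >= 10:
        n = digit_sum(n)
    return n

print(digit_sum(327))               # 12
print(reduce_to_single_digit(327))  # 3

# The digit-sum test agrees with direct divisibility for every n checked here.
for n in range(1, 1000):
    assert (n % 3 == 0) == (digit_sum(n) % 3 == 0)
    assert (n % 9 == 0) == (digit_sum(n) % 9 == 0)
```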
The second term in the subtraction on the right hand side of the above derivation is always a multiple of 3, so the right hand side is a multiple of 3 exactly when the first term, our original number, is a multiple of 3. So you can rest assured that checking if a large number is a multiple of 3 can be as simple as checking if the sum of its digits is a multiple of 3. And if that sum is still too big you can check if the sum of its digits is a multiple of 3 and so on, because you're always just reusing the same test each time.<br /><br />What about multiples of 9? As already mentioned above, a string of 9s is also always a multiple of 9. So the second term in the subtraction above is also always a multiple of 9. Hence, by the same argument, if both terms of the subtraction are multiples of 9, then the sum of digits is going to be a multiple of 9.<br /><br />Making a point and click / room escape game using PowerPoint (2017-07-14)<br /><br />Before we begin, it is important that you understand that there is no simple way to make a complex point and click game with PowerPoint where actions in one room affect what happens in other rooms. You can only make games where everything happens in the same room. You can have different rooms but all the puzzles of a room must be solved within the room. This isn't so bad. There are existing games where you have to go through several levels with each level having a small puzzle to solve, such as <a href="http://www.primarygames.com/puzzles/action/monkeygohappy/">this</a>. You also cannot have an inventory or pass codes without going into a lot of complexity that would make more sense using actual game programming. 
That said, it can still be a creative use of PowerPoint for children so here is how you do it.<br /><br />The heart of the point and click game is the discovery of items that can be used to get over obstacles. You find a key and then you open a door. In PowerPoint such a thing can be done using transparent boxes that are in front of clickable objects. When you click on a key, a trigger event will set an exit animation on the transparent box, making whatever was behind it available for clicking. Let's begin.<br /><br />Start by drawing a key and a door. I'm using a star to represent a key and I put a star shape on the door to represent a key hole. Also add a second slide with the next room after going through the door. I just put a smiley on the next slide to show that the game has ended.<br /><br /><a href="https://3.bp.blogspot.com/-Xb4cXJEYB9Q/WWh24lOwPfI/AAAAAAAABFk/TdfhYwUxj9M9tDxA71nAZ8y8cU5ojsDRQCLcBGAs/s1600/pointandclick_1.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-Xb4cXJEYB9Q/WWh24lOwPfI/AAAAAAAABFk/TdfhYwUxj9M9tDxA71nAZ8y8cU5ojsDRQCLcBGAs/s320/pointandclick_1.png" width="320" height="172" data-original-width="1600" data-original-height="860" /></a><br /><br />Next make the door take you to the next slide when clicked. This is done using a hyperlink. Just right click on the door, go on hyperlink, place in this document, next slide, OK. Now when you start the presentation, clicking on the door will take you to the next slide.<br /><br /><a href="https://2.bp.blogspot.com/-58cSxDfOBhc/WWh280KyYLI/AAAAAAAABFo/h75I-FFw5Qsa-re1N2selbCEmjXRNAelwCLcBGAs/s1600/pointandclick_2.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-58cSxDfOBhc/WWh280KyYLI/AAAAAAAABFo/h75I-FFw5Qsa-re1N2selbCEmjXRNAelwCLcBGAs/s320/pointandclick_2.png" width="320" height="176" data-original-width="946" data-original-height="519" /></a><br /><br />Of course clicking anywhere in a slide will take you to the next slide. We don't want that. 
You should only be able to go to the next slide by clicking on the door. Go on transitions and unset advance slide on mouse click.<br /><br /><a href="https://3.bp.blogspot.com/-d0-m8-erDcw/WWh3AGltWoI/AAAAAAAABFs/Nc3w5keMCZcNdfgGb-kGnEy9D_99nd8cACLcBGAs/s1600/pointandclick_3_.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-d0-m8-erDcw/WWh3AGltWoI/AAAAAAAABFs/Nc3w5keMCZcNdfgGb-kGnEy9D_99nd8cACLcBGAs/s320/pointandclick_3_.png" width="320" height="172" data-original-width="1600" data-original-height="860" /></a><br /><br />Now put a box over the door covering it completely. Make this box transparent (not "no fill"). The door is now not clickable any more because there is a box over it. Just right click the box, format shape, fill, transparency 100%. I left the border there so that you can see where the box is but you can remove the outline so that it looks better.<br /><br /><a href="https://2.bp.blogspot.com/-vHHLXwTexqg/WWh3GO-X1VI/AAAAAAAABFw/imaN75p1MqAVHpF7cTCr0Zzy0H3oBXFCACLcBGAs/s1600/pointandclick_4.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-vHHLXwTexqg/WWh3GO-X1VI/AAAAAAAABFw/imaN75p1MqAVHpF7cTCr0Zzy0H3oBXFCACLcBGAs/s320/pointandclick_4.png" width="320" height="172" data-original-width="1600" data-original-height="860" /></a><br /><br />Now we want the key to disappear when clicked on. Click on the key, go on animations, add animation, and choose an exit animation such as Fade. Open the animation pane. You'll see the animation you just created and the automatically assigned name of the key. 
In my case the name is "4-Point Star 5".<br /><br /><a href="https://3.bp.blogspot.com/-crwssVIGmgg/WWh3JVFFQ5I/AAAAAAAABF0/wTFp4SMBEJomR84gT7WG95b7RYiqpHLNACLcBGAs/s1600/pointandclick_5.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-crwssVIGmgg/WWh3JVFFQ5I/AAAAAAAABF0/wTFp4SMBEJomR84gT7WG95b7RYiqpHLNACLcBGAs/s320/pointandclick_5.png" width="320" height="172" data-original-width="1600" data-original-height="860" /></a><br /><br />At the moment, the key will disappear as soon as you click anywhere, not when you click on the key. To fix this we can click on the key animation (the orange box with "1" in it next to the key), go on trigger, on click of, and choose the name of the key shape you saw in the animation pane. Now the key's exit animation will only start when you click on the key itself.<br /><br /><a href="https://2.bp.blogspot.com/-N85zqZzEGsc/WWh3NzvRAYI/AAAAAAAABF4/BPryAskDxssSnMXTB5BvkX0HmpRiB4AbwCLcBGAs/s1600/pointandclick_6.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-N85zqZzEGsc/WWh3NzvRAYI/AAAAAAAABF4/BPryAskDxssSnMXTB5BvkX0HmpRiB4AbwCLcBGAs/s320/pointandclick_6.png" width="320" height="170" data-original-width="1600" data-original-height="851" /></a><br /><br />When the key is picked up, the door should be usable. This means that we want the transparent box over the door to disappear. We can apply the same triggered exit animation that we used on the key to the transparent box, making it exit when the key is clicked on. 
I made the transparent box disappear rather than fade so that there is no delay and it is removed immediately.<br /><br /><a href="https://4.bp.blogspot.com/-jh7bUCd9328/WWh3R7iVgpI/AAAAAAAABF8/S6gw9Xw9q4sJDxcOZG2G7QyEnSKcRuMsQCLcBGAs/s1600/pointandclick_7.png" imageanchor="1" ><img border="0" src="https://4.bp.blogspot.com/-jh7bUCd9328/WWh3R7iVgpI/AAAAAAAABF8/S6gw9Xw9q4sJDxcOZG2G7QyEnSKcRuMsQCLcBGAs/s320/pointandclick_7.png" width="320" height="172" data-original-width="1600" data-original-height="860" /></a><br /><br />As things stand, we have two exit animations: one for the key and one for the transparent box. However these two animations will not happen together but will each wait for their own click on the key. To make both animations happen together just click on the transparent box's animation and choose start with previous instead of start on click.<br /><br /><a href="https://3.bp.blogspot.com/-sfqGaBk0Dlw/WWh3U9sNpiI/AAAAAAAABGA/FoWqXzstVAM19FAhKcU3rFB1JnOh4vytQCLcBGAs/s1600/pointandclick_8.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-sfqGaBk0Dlw/WWh3U9sNpiI/AAAAAAAABGA/FoWqXzstVAM19FAhKcU3rFB1JnOh4vytQCLcBGAs/s320/pointandclick_8.png" width="320" height="171" data-original-width="1600" data-original-height="853" /></a><br /><br />That's it. Your one-key-one-door point and click game is ready. Start the presentation by pressing F5 and see what happens. Notice that you cannot click on anything other than the key. After clicking on the key it will disappear and then you can click on the door. Clicking on the door will take you to the next room.<br /><br />This basic mechanic that I just described can be used to open chests that contain other keys, kill dragons that guard lairs, and turn off booby traps. You can put keys behind objects that have a motion animation when clicked on so that you find a key behind them. 
You can even put multiple keys, where each key is accessible only after getting the previous one and the final key is the one that opens the door. You can also have multiple rooms where the next slide is another room with a locked door. Can you think of other things to make an interesting point and click game in PowerPoint?<br /><br />Display all outputs in Python immediately when in cmd/terminal (2017-06-25)<br /><br />When running a Python program in the terminal, it sometimes happens that there is a long delay before I see some print statement's output. Sometimes I have to wait until the program finishes.<br /><br />If it's just one print statement you're concerned with then you only need to import the "sys" module and call "sys.stdout.flush()" in order to force the program to show anything that has been printed but is still hidden in the buffer.<br /><br /><pre class="brush:python">import sys<br /><br />print('bla bla bla')<br />sys.stdout.flush()<br /></pre><br />But most of the time you want this to always happen and not just after some particular point in the program. To force Python to always show whatever is printed, just add the command line option "-u" when running python.<br /><br /><pre>> python -u myscript.py<br /></pre><br />You can read more about it here:<br /><a href="https://docs.python.org/3.6/using/cmdline.html#cmdoption-u">https://docs.python.org/3.6/using/cmdline.html#cmdoption-u</a>
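As a side note, since Python 3.3 the built-in print function also accepts a flush keyword argument, so a single statement can be flushed without importing "sys":

```python
# flush=True makes print write its output immediately
# instead of leaving it sitting in the output buffer.
print('bla bla bla', flush=True)
```

The "-u" option is still the better choice when you want every print in the program to behave this way.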