Thursday, May 30, 2019

Word square maker

I made a word square searcher. A word square is a square grid with equal length words in each row and column such that the word in the first row is the word in the first column, the word in the second row is the word in the second column, and so on. Here's an example:

care
acid
ring
edge

It works by taking a file of words and indexing them by character and position such that I can find all the words that have a particular character at a particular position. The program will then build every possible word square that can be made from these words. It goes through all the words and tries them all as a first row/column in a word square. Each word is used as a basis for finding the rest of the words.

In the above example, the first row/column is "care". In order to find a suitable word for the second row/column, the index is used to find all the words that have a first letter equal to the first row's second letter. "acid"'s first letter is equal to "care"'s second letter, so it is a good candidate. Each one of these candidate words is used tentatively such that if it doesn't pan out then we can backtrack and try another candidate word. For the third row/column, the index is used to find all the words that have a first letter equal to the first row's third letter and a second letter equal to the second row's third letter. The intersection of these two groups of words would be all the suitable words that can be used as a third row. "ring"'s first letter is equal to "care"'s third letter and its second letter is equal to "acid"'s third letter as well, so it is a good candidate for the third row. This process keeps on going until none of the words make a word square given the first row/column. The program then moves on to another possible word in the first row/column.
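The steps above can be sketched in Python. This is a minimal sketch of the indexing-and-backtracking approach, not the actual program, so the function name and structure below are my own:

```python
from collections import defaultdict

def find_word_squares(words):
    """Generate all word squares buildable from equal-length words."""
    # Index words by (position, character) so that all words with a given
    # character at a given position can be looked up instantly.
    index = defaultdict(set)
    for word in words:
        for pos, char in enumerate(word):
            index[(pos, char)].add(word)
    n = len(words[0])

    def candidates(rows):
        # A word fits as the next row if, for every existing row i, its
        # letter at position i equals that row's letter at the new position.
        k = len(rows)
        if k == 0:
            return set(words)
        sets = [index[(i, rows[i][k])] for i in range(k)]
        return set.intersection(*sets)

    def extend(rows):
        if len(rows) == n:
            yield list(rows)
            return
        for word in sorted(candidates(rows)):
            rows.append(word)  # tentatively use this candidate
            yield from extend(rows)
            rows.pop()         # backtrack and try another

    yield from extend([])

for square in find_word_squares(['care', 'acid', 'ring', 'edge']):
    print(*square)
```

Running it on the four words of the example square prints the square itself; with a full dictionary file, the same function would enumerate every square those words can form.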

You can find the program here:

Wednesday, May 1, 2019

Neutral Room Escape Game: Lights - Plus 7 Times 10 puzzle solver

I've been playing a point and click game called Lights which features a puzzle consisting of a 4-digit counter that starts from '0000' and needs to be made equal to a certain number through a sequence of operations: either add 7 to the current value or multiply the current value by 10. I wrote a Python program to quickly solve the puzzle and I'm sharing the code for anyone who needs to use it as well. You can find the program online. Just run the program, enter the number you need to reach when requested, and you'll see a sequence of steps showing which operations to perform in order to solve the puzzle in the least number of steps. The program works using a breadth-first search which explores all possible moves, without repeating previously reached numbers, until the goal number is reached. Enjoy!
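The breadth-first search described above can be sketched like this (a sketch in the spirit of the program rather than the program itself; the function name and the 9999 cap for the 4-digit counter are my own assumptions):

```python
from collections import deque

def solve_counter(goal):
    """Shortest sequence of '+7'/'*10' operations from 0 to goal."""
    seen = {0}
    queue = deque([(0, [])])  # (reachable value, operations that reach it)
    while queue:
        value, steps = queue.popleft()
        if value == goal:
            return steps  # first time the goal is popped, the path is shortest
        for op, nxt in (('+7', value + 7), ('*10', value * 10)):
            # The counter has 4 digits, so prune values beyond 9999,
            # and never revisit a previously reached number.
            if nxt <= 9999 and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [op]))
    return None  # goal not reachable on the 4-digit counter

print(solve_counter(77))
```

Each queue entry carries a reachable value together with the shortest list of operations producing it, so the first time the goal value is dequeued the solution is optimal.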

Tuesday, April 30, 2019

Sliding tile puzzle solver

So I've been playing a point and click game called Loom Blend which features a sliding tile puzzle. I wrote a Python program to quickly solve the puzzle there and I'm sharing the code for anyone who needs to use it as well. You can find the program online. Just run the program and you'll see a sequence of steps showing which tiles to move in order to solve the puzzle in the least number of steps. The program works using a breadth-first search which explores all possible moves, without repeating previously reached tile configurations, until the goal configuration is reached. Enjoy!
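The same breadth-first search idea applied to tile configurations can be sketched as follows. The actual board in Loom Blend isn't reproduced here, so this sketch assumes a generic 3x3 sliding tile puzzle, and the function name is my own:

```python
from collections import deque

def solve_tiles(start, goal):
    """Shortest list of tiles to slide to turn start into goal.

    States are 9-tuples read row by row, with 0 for the empty space."""
    # For each board index, the indices the blank can swap with.
    neighbours = {0: (1, 3), 1: (0, 2, 4), 2: (1, 5),
                  3: (0, 4, 6), 4: (1, 3, 5, 7), 5: (2, 4, 8),
                  6: (3, 7), 7: (4, 6, 8), 8: (5, 7)}
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        state, steps = queue.popleft()
        if state == goal:
            return steps  # the tiles to slide, in order
        blank = state.index(0)
        for pos in neighbours[blank]:
            new_state = list(state)
            new_state[blank], new_state[pos] = new_state[pos], 0
            new_state = tuple(new_state)
            if new_state not in seen:  # never revisit a configuration
                seen.add(new_state)
                queue.append((new_state, steps + [state[pos]]))
    return None

print(solve_tiles((1, 2, 3, 4, 5, 6, 0, 7, 8), (1, 2, 3, 4, 5, 6, 7, 8, 0)))
```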

Sunday, March 17, 2019

The Gambler's Fallacy: Why after tossing 100 heads in a row you can still get heads again

The Gambler's Fallacy is a deceptively intuitive fallacy about how randomness works. Imagine you're tossing a fair coin repeatedly. You're astonished to see that every toss is resulting in heads. Heads, heads, heads, heads, over and over. As the number of consecutive heads keeps going up, does the probability of getting tails start increasing? It seems reasonable to conclude this, as the number of heads and tails should balance each other out given that the coin is fair, as stated by the Law of Large Numbers. But the idea that there is some kind of memory in the coin that prevents the fair coin from acting as if it is unfair is wrong.

First, let's show that the Gambler's Fallacy is actually a fallacy by simulating a bunch of coin tosses in a program. We'll keep track of all the times a streak of 10 heads was obtained and the following toss after it and see if the number of heads following the streak is significantly less than the number of tails. We do not consider 11 heads in a row as two streaks of 10 heads but instead start searching for a completely new streak after every streak. Here is a Python 3 program to do that (we also count the number of heads and tails obtained in general to show that the simulated coin is not biased).

import random

seed = 0
num_trials = 1000000
head_streak_length_limit = 10

random.seed(seed)

num_heads = 0
num_tails = 0
heads_streak_length = 0
next_was_head = 0
next_was_tail = 0
for trial in range(num_trials):
    result = random.choice(['head', 'tail'])

    if result == 'head':
        num_heads += 1
    else:
        num_tails += 1

    if heads_streak_length < head_streak_length_limit:
        # Still building up a streak of heads.
        if result == 'head':
            heads_streak_length += 1
        else:
            heads_streak_length = 0
    else:
        # The previous tosses formed a full streak of heads, so record
        # what followed it and then start searching for a new streak.
        if result == 'head':
            next_was_head += 1
        else:
            next_was_tail += 1

        heads_streak_length = 0

print('heads / tails:', round(num_heads/num_tails, 4))
print('after heads streak was head / after heads streak was tail:', round(next_was_head/next_was_tail, 4))

And here are the results:
heads / tails: 1.002
after heads streak was head / after heads streak was tail: 1.0579

Both the ratio of heads and tails in general and the ratio of heads and tails following a streak of 10 heads in a row are close to 1. You can try changing the streak length limit to see how making it shorter does not significantly change this ratio.

So why does this happen? It's a simple matter of probability theory. The probability of getting a head and of getting a tail are both 0.5, and each toss is independent of the previous ones. Therefore the probability of getting 10 heads followed by another head is $0.5^{10} \times 0.5 = 0.5^{11}$, and the probability of getting 10 heads followed by a tail is also $0.5^{10} \times 0.5 = 0.5^{11}$. It does not make a difference what the previous tosses were; the next toss being heads will always have a probability of 0.5, so both heads and tails are equally likely.

So then the more interesting question is why we think otherwise. Why do we think that as the chain of consecutive heads gets longer, the probability of getting another head goes down? It could be because we expect the coin to be fair, and a coin that never gives tails is a biased coin, but that doesn't really mean anything in terms of predicting what the next toss will be. It is entirely possible for the fairest possible coin to give a million heads in a row. There is nothing stopping this from happening. Not only that, but a million heads in a row is just as likely to happen as any other random-looking sequence resulting from a million coin tosses. Let's write down a sequence of heads and tails as a string of 'H's and 'T's. The sequence 'HHHHHH' is just as likely to come up as the sequence 'HHHTTH', yet we think of the first sequence as being unlikely whilst thinking of the second sequence as... what? Is the second sequence more likely to come up than the first? Would you bet more money on the second sequence than on the first? This is what I think is the root cause of the Gambler's Fallacy.

What I think happens in our mind is that we don't really predict whether the next toss will give a head or a tail, but whether the whole sequence will turn out to be an 'interesting' looking sequence or not. We love patterns, and when we see that a random sequence generator (coin tosses in this case) is giving out a sequence that looks like a simple pattern, we become suspicious of the randomness behind the sequence generator and assume that it isn't really random. We're pretty terrible at generating random sequences manually precisely because we assume that a random process will not produce things like a streak of heads. But it's not really just a streak of heads that would make us suspicious, is it? It's also a sequence of alternating heads and tails, for example, or a long sequence of heads followed by a long sequence of tails. I think that the source of the Gambler's Fallacy is that we categorise entire sequences into interesting and uninteresting sequences. As we watch a sequence being generated one toss at a time, we expect that interesting sequences of that length are much less likely to happen than uninteresting ones, which is true. This makes us expect the next toss to break the interestingness of the current sequence, which, in the case of a streak of heads, means the next toss turning out to be tails. But although interesting sequences do become less likely as the length grows, the probability of the interesting pattern becoming uninteresting on the next toss is equal to the probability of it remaining interesting, so the expectation is still incorrect.

So what is the probability of an interesting sequence? This is a subjective question, but we can still try to investigate it. Ideally, a survey would be conducted on a population of people to check what is considered interesting on average. But that would be weird, so I'll just conduct the survey on myself. I have generated all the possible sequences that can be obtained from 4 coin tosses and annotated each one according to what I find interesting in it. I also did the same with 6 coin tosses to see how the ratio of interesting to uninteresting sequences changes with sequence length.

Here is what I find interesting in sequences of 4 coin tosses.
TTTT: same
TTTH: not interesting
TTHT: not interesting
TTHH: switched
THTT: not interesting
THTH: alternated
THHT: mirrored
THHH: not interesting
HTTT: not interesting
HTTH: mirrored
HTHT: alternated
HTHH: not interesting
HHTT: switched
HHTH: not interesting
HHHT: not interesting
HHHH: same

And here is what I find interesting in sequences of 6 coin tosses (organised into two columns to reduce vertical space).
TTTTTT: same | TTTTTH: not interesting
TTTTHT: not interesting | TTTTHH: not interesting
TTTHTT: not interesting | TTTHTH: not interesting
TTTHHT: not interesting | TTTHHH: switched
TTHTTT: not interesting | TTHTTH: doubled
TTHTHT: not interesting | TTHTHH: not interesting
TTHHTT: mirrored | TTHHTH: not interesting
TTHHHT: not interesting | TTHHHH: not interesting
THTTTT: not interesting | THTTTH: not interesting
THTTHT: mirrored | THTTHH: not interesting
THTHTT: not interesting | THTHTH: alternated
THTHHT: not interesting | THTHHH: not interesting
THHTTT: not interesting | THHTTH: not interesting
THHTHT: not interesting | THHTHH: doubled
THHHTT: not interesting | THHHTH: not interesting
THHHHT: mirrored | THHHHH: not interesting
HTTTTT: not interesting | HTTTTH: mirrored
HTTTHT: not interesting | HTTTHH: not interesting
HTTHTT: doubled | HTTHTH: not interesting
HTTHHT: don't know | HTTHHH: not interesting
HTHTTT: not interesting | HTHTTH: not interesting
HTHTHT: alternated | HTHTHH: not interesting
HTHHTT: not interesting | HTHHTH: mirrored
HTHHHT: not interesting | HTHHHH: not interesting
HHTTTT: not interesting | HHTTTH: not interesting
HHTTHT: not interesting | HHTTHH: mirrored
HHTHTT: not interesting | HHTHTH: not interesting
HHTHHT: doubled | HHTHHH: not interesting
HHHTTT: switched | HHHTTH: not interesting
HHHTHT: not interesting | HHHTHH: not interesting
HHHHTT: not interesting | HHHHTH: not interesting
HHHHHT: not interesting | HHHHHH: same

Among the 4 coin toss sequences, 8 out of the 16 sequences were considered interesting, meaning that you have a 50% chance of generating an interesting sequence. But among the 6 coin toss sequences, only 16 out of the 64 sequences were considered interesting, meaning that you have a 25% chance of generating an interesting sequence. Now I don't know if the patterns I identified in the tables above are the only interesting patterns for any sequence length, since maybe some patterns only emerge in very long sequences, but it is safe to say that the longer the sequence, the greater the proportion of uninteresting, random-looking sequences will be. Therefore, the longer the sequence, the less likely it is to be interesting. This might be what is going on when we think that the probability of getting a tails goes up as the streak of heads gets longer; we'd really be comparing how likely it is for a sequence of a given length to be interesting, which becomes less likely as the length grows.
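The counts above can be reproduced programmatically. The following script is my own formalisation of the labels used in the tables (the lone "don't know" sequence is not matched by any of these patterns), and it recovers the same 8-out-of-16 and 16-out-of-64 figures:

```python
from itertools import product

def is_interesting(seq):
    """Check a toss string against the simple patterns from the tables."""
    n = len(seq)
    same = len(set(seq)) == 1                                     # HHHHHH
    alternated = all(seq[i] != seq[i + 1] for i in range(n - 1))  # HTHTHT
    mirrored = seq == seq[::-1]                                   # HTTTTH
    doubled = n % 2 == 0 and seq[:n // 2] == seq[n // 2:]         # TTHTTH
    switched = (n % 2 == 0 and seq[0] != seq[-1]                  # HHHTTT
                and len(set(seq[:n // 2])) == 1
                and len(set(seq[n // 2:])) == 1)
    return same or alternated or mirrored or doubled or switched

for length in (4, 6):
    seqs = [''.join(s) for s in product('HT', repeat=length)]
    count = sum(is_interesting(s) for s in seqs)
    print(length, 'tosses:', count, 'interesting out of', len(seqs))
```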

Interesting patterns are interesting
I have to admit that whilst annotating interesting sequences I was more interested in being consistent in my reasoning than in saying what I genuinely found interesting. This is because I didn't actually feel that, for example, all "doubled" patterns were interesting, just some of them. But I wanted to avoid explaining myself too much based on what I felt. This means that there might be more going on in feeling that a sequence is aesthetically pleasing than simple mathematical patterns. It would be interesting if someone ran an experiment where people are hooked up to an EEG to see what is really considered interesting and what is not. Who knows, maybe it might even turn out to be a kind of Rorschach test where different patterns of interest indicate different personalities. And maybe it might even be an interesting machine learning challenge to try to make a program predict which sequences would be considered interesting by humans, in order to quantify how interesting a given sequence is.

Tuesday, February 26, 2019

MySQL extra inner join taking too long: optimizer_search_depth

MySQL has a query optimiser that takes in SQL queries and tries to find the best way to execute them. Then the query gets executed using the best execution plan found. When you use joins in your query, the time the optimiser takes to finish is exponential in the number of joins. This means that for a few joins it only takes a negligible amount of time, but beyond a point, somewhere between 7 and 10 tables, the optimisation time shoots up. In my case I went from 0.1 seconds to 17 seconds by just adding another table to my joins.

To check if this is happening to you, you can profile your query by executing the following:

SET profiling = 1;

*your query*


(If you're using phpMyAdmin you might need to add "SHOW PROFILE;" as well at the end.)

This will show you a table of execution times for each phase in the execution of the query. The time taken by the query optimiser is in the "statistics" row. If this is too high, then you need to stop your optimiser from spending so much time.

Controlling how much time the optimiser should spend doing its job can be accomplished using the optimizer_search_depth variable. This is the maximum depth to search (presumably in some tree search algorithm) when optimising the query and is unfortunately set to the highest value by default. Being set to the highest value makes it the most likely to find the best execution plan but may also be unnecessarily time consuming. Thankfully there is a better default value to use: 0. When this variable is set to zero, the optimiser automatically picks a value based on the query and that seems to work well for me. To do this just execute the following query:

SET optimizer_search_depth = 0;

Tuesday, January 29, 2019

Python string format quick guide

This is a quick summary guide for when you just want to remind yourself how to format a value with Python's 'string'.format().


Curly brackets are used to refer to parameter values in the 'format' method. On their own, the first pair of curly brackets will refer to the first parameter, second to the second, etc.

'{} {} {}'.format('a', 1, True)
a 1 True

Alternatively you can specify which parameter you want to use where using indexes.

'{0} {0} {2}'.format('a', 1, True)
a a True

You can even use names instead of indexes.

'{a} {b} {c}'.format(a='a', b=1, c=True)
a 1 True

And even refer to indexes in lists and to instance variables in objects.

import fractions

'{a[0]} {b.numerator}'.format(a=[3,1,2], b=fractions.Fraction(1,2))
3 1


Adding a colon inside the curly brackets allows you to format the values before they are added to the string. You can combine these with any of the above methods of referencing. The following formatting meta characters are to be used in the following order:
{:[alignment] [grouping] [precision] [data type]}

Aligning/padding/leading zeros
An alignment meta character is either >, <, or ^. The number after it is the field width in which to align.

'|{:>3}|{:<3}|{:^3}|'.format('a', 'b', 'c')
|  a|b  | c |

Any character before the alignment meta character is used instead of spaces.
'{:_>5} {:0>5}'.format('a', 12)
____a 00012

You can group digits in numbers into triples with commas automatically by just putting a comma in the curly brackets.

'{:,}'.format(1234567)
1,234,567

When formatting floats you can specify how many significant digits to keep by adding a dot and a number. With no data type letter, very small or very large numbers are displayed in scientific notation, with the precision referring to the number of digits in the mantissa (the number before the 'e'). See the next section to format your number in fixed point notation.

'{:.2}'.format(0.000012)
1.2e-05

Data types
This is the most important part. The data type always comes at the end and is a single letter.

You can use it to change the numeric base of a whole number.
'dec-{:d} hex-{:X} oct-{:o} bin-{:b}'.format(16, 16, 16, 16)
dec-16 hex-10 oct-20 bin-10000

You can use it to convert unicode values to characters.

'{:c}'.format(97)
a

You can use it to convert numbers into scientific notations,

'{:e}'.format(12345.6789)
1.234568e+04

or fixed point notations,

'{:.2f}'.format(12345.6789)
12345.68

or percentages.

'{:.0%}'.format(0.25)
25%

Mixed examples
Combining alignment, grouping, precision, and data types:

'{:>8.2f}'.format(3.14159)
    3.14

'{x:_>6,}'.format(x=1234)
_1,234


Thursday, December 27, 2018

Language models and the unknown token: Why perplexity gets worse as the corpus size increases

Unless you're making use of a character-based language model somehow, you're going to have a finite vocabulary of words that you can handle, so you need a way to handle out-of-vocabulary words. One common way is to replace them all with the unknown token, a pseudo word that stands in for all out-of-vocabulary words. The unknown token has several advantages. Infrequent words in the training corpus will probably either not be learned by the neural network or not be used when generating sentences, so ignoring them saves a lot in model size, as the embedding layer and softmax take up a lot of parameters. Furthermore, all the infrequent words put together occur very frequently, so the language model will definitely learn to use the unknown token. It also makes it possible to handle new words at test time (words not found even once in the training corpus), as they will be replaced by the unknown token.

The problem with this approach is what happens when measuring the probability or perplexity of a sentence based on the probabilities of individual words. If you're comparing language models to see which ones make the best predictions, you usually use them all on a corpus to see how well they predict the words in the corpus. The higher the probabilities assigned to those sentences, the better the language model. This is usually measured using language model perplexity. But see what happens when you vary the vocabulary size. You will find that smaller vocabulary sizes lead to better language models, even though this makes no sense.

It turns out that if you just multiply all the probabilities of individual words as-is, including that of the unknown token, then your probability will be sensitive to the vocabulary size. Let's say that you have a vocabulary size of 0, that is, you're considering all the words to be out-of-vocabulary, so all of them will be replaced by the unknown token. This means that during training, the language model will learn that after every unknown token there is (probably) another unknown token, unless it's the end of the sentence. This will make the language model give very high probabilities to the unknown token. High word probabilities mean high sentence probabilities, which mean good perplexities.

Now if we add another word to the vocabulary then we'll be introducing some uncertainty into the language model, as now it has to decide between using the unknown token or the known word. Even in a perfect language model, the same prefix of words can be followed by either of the two words, so there is no way to correctly assign 100% of the probability to one or the other. This means that the probabilities will be split between the two words, leading to an overall decrease in probabilities, leading to a worse perplexity. Adding more words to the vocabulary makes this even worse, which means that language models with smaller vocabularies have a huge unfair advantage over language models that actually do their job and correctly predict the right word.

We can't do away with the unknown token but we can strip away the unknown token's power. Assuming that all the language models are being evaluated on the same corpus, different vocabularies will have different words being turned into the unknown token. Let's say that your language model considers 1000 different words in its vocabulary but the corpus you're evaluating it on has 500 different words that are out-of-vocabulary. So in reality your language model is predicting one of 1500 different words; it's just that 500 of those words are assumed to be a single word with a single probability. But really there should be 500 separate probabilities for those out-of-vocabulary words and not just one. If we avoid merging all those probabilities into one, then all the language models will have a fair comparison, as they will all have the same vocabulary and they will all have the same amount of uncertainty about which word should come next. The question is how to distribute that single unknown token probability between the 500 out-of-vocabulary words. The simplest solution is to assume a uniform distribution and just give each word the same slice of probability from the whole. So if the unknown token has a probability of $p$, then each out-of-vocabulary word gets a probability of $\frac{p}{500}$.

Now every time you encounter the unknown token in the evaluation corpus you know that the token is being used in place of one of those 500 words. But you don't know which one it is. Not a problem, just divide the probability by 500 and it's as if all words in the corpus are in the vocabulary. Do this to every unknown token probability and now you have a fair measure of perplexity. Let's see an example.

Let's say that we want to find the probability of the following sentence:
the loud dog barked at the loud man

and let's say that the language model we're using to do that has the following vocabulary:
the at dog man

this means that the sentence is now changed to:
the UNK dog UNK at the UNK man

Now the naive way to get the probability of the sentence is as follows:

$$P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})$$

But now with the new way we'll divide each unknown token probability by 2, the number of different out-of-vocabulary words ('loud' and 'barked'):

$$P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the})/2 \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog})/2 \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the})/2 \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man})$$

Of course we can leave the re-weighting until the end of the equation by dividing the first equation by the number of different out-of-vocabulary words once for every unknown token, like this:

$$P(\text{the UNK dog UNK at the UNK man}) = \\
p(\text{the}|\text{}) \times p(\text{UNK}|\text{the}) \times p(\text{dog}|\text{the UNK}) \times p(\text{UNK}|\text{the UNK dog}) \\
\times p(\text{at}|\text{the UNK dog UNK}) \times p(\text{the}|\text{the UNK dog UNK at}) \times p(\text{UNK}|\text{the UNK dog UNK at the}) \\
\times p(\text{man}|\text{the UNK dog UNK at the UNK}) \times p(\text{}|\text{the UNK dog UNK at the UNK man}) / 2^3$$

Now the sentence probability goes up as the vocabulary size increases!
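The correction can be folded into a sentence log-probability computation. This is a sketch for illustration only: the function name, the input format, and the toy probabilities are my own, not taken from any particular language model:

```python
import math

def adjusted_log_prob(token_probs, num_oov_types):
    """Sentence log probability with each UNK probability split
    uniformly among the out-of-vocabulary word types.

    token_probs: list of (token, model probability) pairs for one sentence.
    num_oov_types: number of distinct out-of-vocabulary words in the
    evaluation corpus."""
    log_p = 0.0
    for token, p in token_probs:
        if token == 'UNK':
            p /= num_oov_types  # uniform share of the UNK probability
        log_p += math.log(p)
    return log_p

# Toy example: 2 out-of-vocabulary words ('loud' and 'barked'), so every
# UNK probability is divided by 2, as in the equations above.
probs = [('the', 0.5), ('UNK', 0.2), ('dog', 0.1), ('UNK', 0.2)]
print(adjusted_log_prob(probs, 2))
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause on longer sentences.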