## Wednesday, July 28, 2010

### Cool proofs

When I was in university, our course had a forum that we maintained ourselves in which we posted anything which would be of help to each other or which we found worth sharing. At the time of writing this, the forum will soon be closed down and all of the information in it will be probably lost. I shall post here my geekiest post ever which was titled "Cool proofs" in which I posted 3 mathematical proofs used in computer science which I literally found cool. ^^

Power set of an infinite set is uncountable
By uncountable we mean that you cannot have a one to one mapping from a countable infinite set, usually the natural numbers to the uncountable one.

For example the set of all integers is countable because there exists a bijective mapping from the natural numbers to the integers, for example:
0 -> 0
1 -> -1
2 -> 1
3 -> -2
4 -> 2
...
That is, even numbers point to positive integers and odd numbers point to negatives. So for every integer there exists a natural number which points to it and every natural number points to an integer. So even though there appears to be more integers than natural numbers, in reality they are equal because they are both countable infinite sets.

Now in order to show that the power set of an infinite set is uncountable we need to show that there exists at least one element of the power set which cannot be pointed by a natural number. Consider this random relation of natural numbers pointing to elements of the power set of all natural numbers:
0 -> {}
1 -> {1}
2 -> {0, 1, 2}
3 -> {1, 3}
4 -> {3, 5}
5 -> {1, 2, 3, 4}
6 -> {100, 1000, 10000}
...
Notice that some natural numbers are pointing to sets which contain themselves and others do not. Lets call the numbers which point to themselves Selfish and the others Non-Selfish. For example 1, 2 and 3 are selfish whilst 0, 4, 5 and 6 are not. Given that the power set of all natural numbers contains every possible subset of natural numbers, it stands to reason that one of the subsets should happen to be the set of all Non-Selfish numbers in our relation. Let's call the natural number which points to the set of all Non-Selfish numbers 'd' and the set of all Non-Selfish numbers S. Since every natural number must either be a Selfish number or a Non-Selfish number, we can safely say that d itself is either a Selfish number or a Non-Selfish number. So what is it?

If d is a Selfish number then it is an element of S. But only Non-Selfish numbers are elements of S.
If d is a Non-Selfish number then it is not an element of S. But S contains all Non-Selfish numbers.
Therefore we have reached a contradiction, because we have found an element which cannot be mapped by a natural number. Therefore it is impossible to make a bijective relation between the natural numbers and the power set of all natural numbers. Therefore the power set of an infinite set is uncountable.

QED

Set of all infinite length binary strings is uncountable
Enumerate the infinite length binary strings in a table in some order. For example:
0011111000110011...
1011100101100100...
1010101010010101...
0000000000000000...
...
Now we can refer to a particular string by referring to its corresponding row number. We can also refer to a particular bit in the string by referring to a particular column number. Let B(r,c) be the bit in column c of the string in row r.

If this set was countable, we can create a bijective mapping from the natural numbers to each and every binary string. But it is impossible to enumerate each and every infinite length binary string because no matter what order we use and no matter how many strings we include, it is always possible to find a string which is not included. Here's how:
Consider the diagonal bits in the table, that is, consider B(i,i) for every natural number i. Construct a string, s, which is made of the diagonal bits, inverted. That is, s[i] = !B(i,i). Now for every string, there is at least one bit which is different from s, because the diagonal includes one bit from every string and s has that bit inverted. Given the previous table, s would be 1101...

So no matter how many strings you enumerate (infinity included) there are always strings which are not included in the enumeration. Or conversely, no matter how you construct a relation of natural numbers pointing to the strings, there are always strings which are left out. Therefore the set of all infinite length binary strings is uncountable.

QED

Notice that for any power set, its elements can be encoded into binary strings, where each bit represents whether a particular element in the original set is in that subset. For example the power set of {1,2} is {{}, {1}, {2}, {1,2}} which can be represented as {00, 10, 01, 11} where the first bit says if 1 is in the set and the second says if 2 is in the set. Therefore if the set of all infinite length binary string was countable, the power set of an infinite set was also countable.

The halting problem
The halting problem proves that it is impossible to construct an algorithm (program which always terminates) which will analyze any program and correctly answer if, given a particular input, will terminate. Here's how.

Assume that such a program exists and can be used by simply calling the function Halt(p,i) which will return whether program p will terminate given input i. What we shall do is construct a program which will use Halt on itself and do the opposite. If this is possible than we have just found a case where Halt will fail, no matter what the implementation.

Construct program D which is defined as:
D(p) = if Halt(p,p) then infinite_loop else terminate
Program D will take a program p, check if it halts when given itself as input and enter into an infinite loop if so or terminate otherwise. If this self input confuses you, think of the input to be a copy of the original program. So if I have written a compression program which takes other files and compresses them, I can copy the program and compress the copy, essentially giving the program itself as input.

The reason we constructed such a funny program is that if we give D itself as input, the following will happen:
D(D) = if Halt(D,D) then infinite_loop else terminate
and as you can see, Halt will essentially check if D(D) will terminate, therefore checking if the whole program will terminate. But what does the program mean? The program will either terminate with itself as input or not, but if it terminates then Halt will return true, causing the program to enter an infinite loop, whilst if it does not terminate then Halt will return false, causing the program to terminate.

So if we now call Halt on this program, by calling Halt(D,D), Halt will not be able to come up with a correct answer because D will check what the output will be and do the opposite. Notice that we will receive an answer, it is the same as that given in the if-condition, but it will obviously be the opposite of what will happen. Remember that Halt will call D(p) and pass D as input (instead of p) and that D as a program and D as an input are two separate 'files' which are copies of each other. Self reference can be detected by for example changing a global variable to know how many times a function has been called but
a) A semantically equivalent function can be rewritten which avoids such tricks, yet results in the same problem.
b) This means that some program-input pairs will not be accepted, which means that you have not constructed the required function. After all, what do you think Halt(D,D) should output?

Therefore we have proven that it is always possible to construct a program which will cause Halt to give a wrong output. Similarly, you cannot construct an algorithm which will tell you if a particular action will be carried out in a program given an input because you can construct a program similar to D which will cause it to give the wrong output. Namely D(p) = if ActionPerfomed(p,p) then avoid_action else perform_action.

QED

## Tuesday, July 27, 2010

### Typocalypse

Lately me and my friend Andreas Grech (http://blog.dreasgrech.com/) developed jointly a Microsoft XNA game in just under a sleepless and coffee ridden 24 hours called Typocalypse which is a game where you have to type the words under zombies before they reach you.

My involvement in the game was in using my previous two blog posts, the biasing function for gradually increasing the difficulty of the game by gradually biasing the randomly chosen words to prefer longer words as the score increases and the trie for determining which zombies have a word which has a prefix that matches the currently typed text. I was also responsible for handling keyboard events, displaying the words under the zombies and sound effects which I created orally.

To give credit where credit is due, the background music was taken from the Amiga game The Great Giana Sisters, the pictures including the background image where found from google images and we couldn't be bothered with looking for them again and the library which enabled capturing of keyboard input was taken from here which was developed by George Mamaladze.

The game is open source and you can download and use the source code if you wish, but keep in mind that this was developed in under 24 sleepless hours so at some point coding started getting messy.

Enjoy :)

## Wednesday, July 21, 2010

### C# Trie

*Update* I have developed a new an improved version of this data structure which you can find here.

For anyone interested, I have also written a Java Trie here.

In a recent project I have ventured into, I had to develop a set of classes in C# for a simple trie which are now released as open source (http://code.google.com/p/typocalypse/source/browse/#hg/Trie). This blog post shall describe how it works and how to use it.

Introduction to tries
Simple tries are a data structure for mapping strings to values such that it enables you to efficiently retrieve all the values whose key strings have a particular prefix. This works by storing all the strings in a tree where each node is a prefix to a number of strings and each edge is a character to append to the prefix in the parent node. The root node is the empty string which is a prefix to all strings.

As you traverse a path from the root downwards, you follow the edges which lead to prefixes of the string you are trying to match. Eventually if such a string exists, a node will be pointing to the value mapped by the string. By traversing the descendants of a particular node, you can retrieve all the stored strings which have that prefix in common and their corresponding value.

Here's a diagram example from wikipedia:

Implemented trie
The implemented trie supports adding, removing and character-by-character prefix searching of string-value pairs.

The way it works is by keeping a reference in each node to a value object such that if the reference is null then the node does not point to a value object and hence the prefix is not a complete string. The type of the values is generic, any referencable type is allowed. Each node points to its children via a Dictionary object which maps characters to nodes, thereby allowing a fast search by following the characters.

The following is a description of each class:

TrieNode
This is a non-public class which represents the nodes of the trie, that is, the prefixes. However in this implementation, the prefix itself is not stored, only the last character of the prefix called the Key is stored in order to know which character was followed to access this node. It also stores the value (generic type), a pointer to the parent node and a dictionary which maps characters to children nodes. If this node terminates a string then it's value field points to an object, otherwise it is null.

PrefixMatcher
This is a non-public class which is dedicated to allow the user to search the trie for a particular string or strings with a particular prefix. It was designed for enabling the user to search for the values of said strings on a character by character basis rather than providing the whole string immediately so that it can be used to search for user input as the user types in the query. This requires the maintenance of a state so that it is known which prefix has been entered thus far as well as which node that prefix is found in so as to enable an efficient search. For this reason the search feature was kept in a class of its own. Each time the trie is modified with the removal of a string, the current search is cancelled and must be redone from first character so as to avoid the user from searching non-existent strings.

IPrefixMatcher
This is a public interface which abstracts the methods of the PrefixMatcher class so that the user is provided with an abstraction rather than a concrete class in order to search.

Trie
This is the main class which provides the features of the trie data structure. Searching is done through the public property "Matcher" which returns an instance of IPrefixMatcher. When a string is removed, the current search is cleared so as to avoid the user from searching non-existent strings.

Examples
Here is an example of how to use the implemented trie:
```//Create trie
Trie < string > trie = new Trie < string > ();

//Add some key-value pairs to the trie
trie.Put("James", "112");
trie.Put("Jake", "222");
trie.Put("Fred", "326");

//Search the trie
trie.Matcher.NextMatch('J'); //Prefix thus far: "J"
trie.Matcher.GetPrefixMatches(); //[112, 222]
trie.Matcher.IsExactMatch(); //false
trie.Matcher.NextMatch('a');
trie.Matcher.NextMatch('m'); //Prefix thus far: "Jam"
trie.Matcher.GetPrefixMatches(); //[112]
trie.Matcher.NextMatch('e');
trie.Matcher.NextMatch('s'); //Prefix thus far: "James"
trie.Matcher.IsExactMatch(); //true
trie.Matcher.GetExactMatch(); //112

//Remove a string-value pair
trie.Remove("James");
```

## Wednesday, July 7, 2010

### Biasing uniform random fraction generators towards larger or smaller fractions

The need for a function which allows you to bias a uniform probability distribution random number generator towards fractions closer to 0.0 or 1.0, and to control the degree of the bias, has been felt a few times whilst I was programming. C# only has uniform random number generators and in a recent project I had, I had to devise a function to enable me to bias the random numbers generated. In this post I shall describe this function.

Problem
We would like a function which skews the number line between 0.0 and 1.0 such that the range of numbers closer to 0.0 are denser than those closer to 1.0 or vice versa.

Consider the following statistics gathered from the Python function random.random(), where we shall count the number of random numbers generated that are greater than 0.5 and those that are less out of 10000:

Seedrandom()<0.5random()>=0.5
049805020
150234977
250384962

The frequencies are very close to each other. This is because random.random() uses a uniform probability distribution, meaning that each number between 0.0 and 1.0 has an equal chance of being generated. But what if we wanted to bias the frequencies so that more numbers are less than 0.5?

An application for this could be to randomly select an element from an array which is sorted on priority such that you want to randomly select the first element more often than the last. To do this you bias the random number generator towards smaller indices.

Solution
We need a function which takes the set of inputs from 0.0 to 1.0 and returns numbers in the same range, however skewing the mapping such that a larger part of the number range gets mapped to a smaller part of the output range.

In the graph below, y = x, the mapping is unbiased, no part of the output range has been squashed.

However if we were to bend the line so that it still goes through (0,0) and (1,1) but is curved then we could skew the mapping as demonstrated below:

The graph on the left skews the output such that there is a bias for smaller numbers; all inputs which are less than 0.6 will be mapped to be closer to 0.0 whilst the rest of the inputs will be mapped to be closer to 1.0. On the other hand, the graph on the right skews the output such that there is a bias for larger numbers; only inputs smaller than 0.4 will be mapped to be closer to 0.0 whilst the rest of the inputs will be mapped to be closer to 1.0. Also, in the graph on the left, the smaller the number the greater the chance of it being returned whilst in the graph on the right, the larger the number the greater the chance of it being returned.

We would like to devise a function which will give us this curve, and to some how parameterise the curvature so that the bias can be made stronger or weaker. Such a curve is given by the exponent function
y = x^k           (1)
which, given that k is positive, goes through (0,0) and (1,1) and which different values of k will change the curvature such that when k is between 0.0 and 1.0 the curvature will be going flatter and when k is between 1.0 and infinity the curvature will be going steeper.

In order to control the curvature we shall allow the user to specify a point which the curve is to pass through. To find which value of k will go through the point (x',y'), we make k subject of the formula in (1) after substituting x with x' and y with y' and we get
k = ln y' / ln x'           (2)
and by substituting (2) into (1) we get
y = x^(ln y' / ln x')           (3)
which is useful for describing our curve, provided both y' and x' are between 0 and 1, both not included.

However we would like the way we control the curvature to be more meaningful to the user. So instead we let the user provide just one variable, called the bias, which will be the x value which will be mapped to 0.5, hence all smaller values will be mapped to be less than 0.5 and all greater values will be mapped to be more than 0.5. To do this, we set y' to be 0.5, so that x' will be the bias, that is, the x value which gives 0.5. After modifying equation (3), the new equation is
y = x^(ln 0.5 / ln b)           (4)
where b is the bias.

The following graphs are obtained for different values of b:

Results
The following statistics demonstrate the validity of the function. Again 10000 random numbers are generated using Python's random.random() except that this time we shall pass the random numbers to the devised funtion as x using different values for b and all cases use a seed of 0:

bbias(random())<0.5bias(random())>=0.5
0.110268974
0.549805020
0.99043957

So using this function we can easily give a bias to random number generators. The only problem with it is that a bias of zero or one will result in a log of zero and division by zero respectively and hence cannot be done.

Summary
To bias a random number generator which gives random fractions so that it is biased towards numbers closer to 1 or 0, use the equation y = x^(ln 0.5 / ln b) where x is the unbiased random number, b is the fraction of numbers to be less than 0.5 and y is the biased random number.