Monday, October 14, 2013

Python N-gram Map

I have developed a data structure in Python to store and query n-grams, which is released as open source here. This blog post gives examples of how to use it.

To add n-grams


Adding the n-grams (a,a,a), (a,a,b), (a,b,a), (b,a,a) which map to None

x = NGramMap()
x[('a','a','a')] = None
x[('a','a','b')] = None
x[('a','b','a')] = None
x[('b','a','a')] = None

To count n-gram frequency


Adding the n-grams (a,a,a), (a,a,b), (a,b,a) and (b,a,a) several times, mapping each to the number of times it was encountered.

ngrams = [ ('a','a','a'), ('a','a','b'), ('a','a','a'), ('a','a','b'), ('a','b','a'), ('b','a','a'), ('a','b','a'), ('b','a','a'), ('a','b','a'), ('a','b','a') ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for ngram in x.ngrams():
    print(ngram, x[ngram])

To find n-grams which contain particular elements


Finding all stored n-grams which contain the elements 'a' and 'y'.

ngrams = [ ('a','x','y'), ('x','a','y'), ('a','x','y'), ('b','x','y'), ('c','x','z')  ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for ngram in x.ngrams_with_eles({ 'a', 'y' }):
    print(ngram, x[ngram])

To find n-grams which follow a particular pattern


Finding all stored n-grams which match the pattern (_, 'x', 'y'), where index 0 is a wildcard.

ngrams = [ ('a','x','y'), ('x','a','y'), ('a','x','y'), ('b','x','y'), ('c','x','z')  ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for ngram in x.ngrams_by_pattern(( None, 'x', 'y' ), { 0 }):
    print(ngram, x[ngram])

To find elements which share similar contexts


You can find elements which occur in the same context in their n-grams, for example 'a' and 'b' share a context in the n-grams (a, x, y) and (b, x, y) as do 'p' and 'q' in the n-grams (x, p, y) and (x, q, y).

ngrams = [ ('a','x','y'), ('x','a','y'), ('a','x','y'), ('b','x','y'), ('c','x','z')  ]

x = NGramMap()
for ngram in ngrams:
    if ngram not in x:
        x[ngram] = 0
    x[ngram] += 1

for element in x.elements():
    print(element)
    for ngram in x.ngrams_with_eles({ element }):
        ele_index = ngram.index(element)
        for ngram_ in x.ngrams_by_pattern(ngram, { ele_index }):
            if ngram_[ele_index] != element:
                print("\t", ngram_[ele_index])

Friday, September 6, 2013

Summary of research paper "Automatic Word Sense Discrimination" by Hinrich Schütze

This is a summary of the research paper http://delivery.acm.org/10.1145/980000/972724/p97-schutze.pdf. This paper describes a way to automatically identify different senses (meanings) for words with multiple senses and to disambiguate the sense of a word in context according to these identified senses.

Overview
We start by creating word vectors for each word by counting how often certain other words occur close to it. These other words are called the dimensions of the vector. This vector space is called the word space.

For example, if the word "judge" is to be projected into a word space with dimensions being "legal" and "clothes", then we go through a corpus and count the number of times the words "legal" and "clothes" occur close to "judge" and record these frequencies in a vector. This vector would represent the meaning of "judge" and we'd expect the count for "legal" to be higher than the count for "clothes". On the other hand the word vector for "robe" using the same dimensions would have a higher count for "clothes" than for "legal". The problem is that, in a corpus, an ambiguous word like "suit" would have both the legal sense and the clothes sense so its vector would not give a good indication of its meaning.

In order to disambiguate the sense of an ambiguous word like "suit", the sentence in which it occurs is used as a context to see what that particular instance of the word means. Words chosen from the sentence, called features, are used as cues to identify in which sense the ambiguous word is being used. For example, if in the sentence with the word "suit" the words "law", "judge" and "statute" are found, then we can be confident that the sense of "suit" in that context is the legal sense, since all of these words have a high count for the "legal" dimension in their word vector. In practice, the centroid of the word vectors of these features is calculated and used to see in which direction the word vectors of the features are generally pointing. This centroid is called the context vector of the ambiguous word.
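As a toy sketch of this step (the word vectors and counts below are invented for illustration), the context vector is just the component-wise average of the feature word vectors:

```python
# Invented word vectors over the two dimensions ("legal", "clothes").
word_vectors = {
    "law":     (10.0, 1.0),
    "judge":   (8.0, 2.0),
    "statute": (9.0, 0.0),
}

def centroid(vectors):
    # Average the vectors component-wise.
    n = len(vectors)
    return tuple(sum(components) / n for components in zip(*vectors))

# The context vector of "suit" in a sentence containing these features:
features = ["law", "judge", "statute"]
context_vector = centroid([word_vectors[f] for f in features])
print(context_vector)  # (9.0, 1.0) -- points strongly along the "legal" dimension
```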

Of course in practice there will be a lot of dimensions in the word space so senses will not be limited to a single dimension but to combinations of dimensions. This means that in order to establish what the senses of an ambiguous word are, the context vectors of each instance of the word must be clustered such that similar context vectors are clustered as sharing the same sense and each cluster will be a sense of the ambiguous word. The centroid of each cluster is called a sense vector and when an ambiguous word is to be disambiguated into a sense, the sense vector which is most similar to the word's context vector would be considered its sense.

Nitty gritty
Since there must be a fixed number of word-space dimensions and of feature words used for context vectors, a method to select them is needed. Two methods were tested: local and global selection. In local selection the feature and dimension words are exactly the same and different dimensions/features are used for each ambiguous word; so the system will only disambiguate ambiguous words for which dimensions and features were already determined. In global selection the dimension and feature words are different from each other and are selected once, such that all ambiguous words are disambiguated using these same dimensions and features.

Local selection works by simply checking all words which are not stop words and which occur within 25 words of the ambiguous word and accepting as dimensions/features only the 1000 most important words. Importance is measured using one of two criteria: the number of times the dimension/feature occurs near the word in question and the chi-squared test of how much the ambiguous word and the dimension/feature word correlate together.

Global selection on the other hand uses the 20,000 most frequent non-stop words in the corpus as features and the 2,000 most frequent words as dimensions.

Once the dimensions and features are collected, they are arranged into a matrix such that the rows represent the dimensions and the columns the features. In this way the columns of the matrix are the word vectors of the different features. In order to disambiguate an ambiguous word, either these word vectors are used as is or they are reduced using Singular Value Decomposition (SVD) in order to reduce the size of the word vectors (number of rows) to 100 (which means that the word vectors will not correspond to frequencies of neighbouring words anymore but to abstract dimensions). Global selection was not tested with SVD.

Context vector elements are normalized and weighted by inverse document frequency.

Similarity between vectors is calculated using cosine similarity and clustering is done using the Buckshot algorithm.
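A minimal sketch of the disambiguation step (the sense vectors are invented for illustration, and the Buckshot clustering is not shown): compute the cosine similarity between a context vector and each sense vector, then pick the most similar sense.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Invented sense vectors over the dimensions ("legal", "clothes").
sense_vectors = {"legal": (9.0, 1.0), "clothes": (1.0, 8.0)}

context_vector = (7.0, 2.0)
best_sense = max(sense_vectors,
                 key=lambda sense: cosine_similarity(sense_vectors[sense], context_vector))
print(best_sense)  # legal
```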

Evaluation
In order to evaluate the system, pseudowords were created by taking two different words and replacing them with the same invented word. If we have two sentences "The boy peeled the banana." and "The boy opened the door.", and the words "banana" and "door" are replaced with the pseudoword "banana/door" such that the sentences are now "The boy peeled the banana/door." and "The boy opened the banana/door.", then the pseudoword is now an ambiguous word with two senses: the banana sense and the door sense. The system must now identify these two senses and determine when the pseudoword is used in the banana sense and when in the door sense. This is easy to evaluate since we already know how each instance should be disambiguated, and there is no need for prior manual identification of the senses.

The following is the average percentage of correctly disambiguated instances over 20 pseudowords for different experimental conditions:
Condition: % correct
Local-Frequency-no SVD: 77.8
Local-Frequency-SVD: 82.9
Local-Chi squared-no SVD: 72.1
Local-Chi squared-SVD: 84.1
Global-Frequency-SVD: 89.7

Tuesday, August 13, 2013

Python functions for enumerating combinatorics

Here are some useful enumeration functions in Python for generating permutations and combinations.

Enumerating all permutations of a sequence

This function gives all the different ways a sequence can be ordered. It takes each element from the sequence and then recursively finds all the permutations of the sequence without that element. Each sub-permutation is then prefixed with the removed element. The base case is that a sequence of just one element can only be ordered as itself.
def permutations(initial_permutation):
    if len(initial_permutation) == 1:
        yield initial_permutation
    else:
        for i in range(len(initial_permutation)):
            ith_element = initial_permutation[i:i+1]
            without_ith_element = initial_permutation[:i] + initial_permutation[i+1:]
            for sub_perm in permutations(without_ith_element):
                yield ith_element + sub_perm
Output:
list(permutations("ABCD"))

['ABCD', 'ABDC', 'ACBD', 'ACDB', 'ADBC', 'ADCB', 'BACD', 'BADC', 'BCAD', 'BCDA', 'BDAC', 'BDCA', 'CABD', 'CADB', 'CBAD', 'CBDA', 'CDAB', 'CDBA', 'DABC', 'DACB', 'DBAC', 'DBCA', 'DCAB', 'DCBA']

Enumerating all unordered pairs in a sequence

This function gives all the different ways an unordered pair of elements can be taken from a sequence. It is a special case of the next function. It simply runs two nested for loops, each of which selects the index of an element in such a way that the two selected indices are unique.
def pairs(seq):
    for i in range(0, len(seq)-1):
        for j in range(i+1, len(seq)):
            yield (seq[i], seq[j])
Output:
list(pairs("ABCD"))

[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]

Enumerating all unordered subsets of a given length in a sequence

This is the general case of the previous function: instead of pairs it enumerates tuples of a given length. It replaces the nested for loops with recursive for loops; each recursive call reduces the number of elements to choose by one, until the function has to choose 0 elements, at which point it yields an empty tuple.
def choices(seq, choice_len):
    if choice_len == 0:
        yield ()
    else:
        for i in range(0, len(seq)-(choice_len-1)):
            for sub_choice in choices(seq[i+1:], choice_len-1):
                yield (seq[i],) + sub_choice
Output:
list(choices("ABCD", 3))

[('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D'), ('B', 'C', 'D')]
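For comparison, the standard library's itertools module provides equivalents of all three functions. It yields tuples rather than strings, so the permutations are joined below to match the earlier output:

```python
import itertools

# permutations: the same orderings as the recursive version above.
print([''.join(p) for p in itertools.permutations("ABC")])
# ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']

# combinations with r=2 gives the unordered pairs, and any r gives the
# unordered subsets of that length.
print(list(itertools.combinations("ABCD", 2)))
# [('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
print(list(itertools.combinations("ABCD", 3)))
# [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'C', 'D'), ('B', 'C', 'D')]
```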

Saturday, July 20, 2013

No Free Lunch

There's no free lunch, you always have to lose something in order to gain something. Well at least this is true in the case of computer science where there is something called the No Free Lunch theorem. But in this post I'd like to highlight various instances in computer science where you have to lose something in order to gain something and offer some other fields where I think this also applies.

As you read these examples keep in mind those games where you have to set the stats of your character, such as speed, strength, etc., where you have a total number of points which you can distribute as you see fit but cannot exceed. This creates a condition where there is no free lunch, because you cannot have a character with perfect stats. Some stats must always be sacrificed in order to improve others.

Lossless compression

When you use WinZip in order to make a file smaller in such a way that you can extract the original file back again, you are using something called lossless compression. Can you design a compression algorithm which can make any file smaller and be able to decompress it back? No, because there are more large files than there are smaller ones.

Computer files consist of a sequence of ones and zeros called bits. Let's say that we have designed an algorithm which will compress files consisting of 2 bits. There can only be 4 such files which are:
00
01
10
11
Now let's say that our algorithm will somehow compress these files into 1 bit files. How many different 1 bit files are there? Just 2 which are:
0
1
So we want to compress each of the four 2 bit files into one of the two 1 bit files. What this results in is that two of the large files will turn into the same compressed file and so when you get to decompress it, it would be impossible to know which of the two possible files the original was, so you have lost information and hence the process is not reversible. It would be a lossy compression, not a lossless one.

So what lossless compression algorithms do, inevitably, is make some files smaller and others bigger than they were. It is the only way: to compress some files, you must enlarge others. A good algorithm is designed to compress the most typical files whilst enlarging the less common ones. But what is a typical file? This depends on the application, which is why good compression results when an algorithm is designed to compress only certain types of files, such as pictures or music.

Here is an example of how the 2 bit files can be compressed:
00: 0
01: 10
10: 110
11: 111
So the file "00" will be actually compressed whilst the rest will be either the same size or bigger. As you can see each file was compressed into a unique file so it is possible to reverse the process and obtain the original files.
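A sketch of why this works: the code above is prefix-free (no codeword is a prefix of another), so a stream of codewords can be decoded unambiguously by reading bits greedily. The function names are my own.

```python
encode_table = {"00": "0", "01": "10", "10": "110", "11": "111"}
decode_table = {code: original for original, code in encode_table.items()}

def decode(stream):
    # Read bits one at a time until the buffer matches a codeword.
    result, buffer = [], ""
    for bit in stream:
        buffer += bit
        if buffer in decode_table:
            result.append(decode_table[buffer])
            buffer = ""
    return result

encoded = encode_table["00"] + encode_table["11"] + encode_table["01"]  # "011110"
print(decode(encoded))  # ['00', '11', '01'] -- the original files are recovered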

There's no free lunch in lossless compression, you've got to have some files which will be enlarged if you want some others to be compressed.

Hashing functions

A hashing function is an algorithm which transforms a piece of data (such as a person's name) into a number within a fixed range. This number is called a hash.

Here is how the name "john" can be hashed into the number 7. Start by associating each letter with a number (for example "a" can be associated with 1 and "b" with 2, etc.) and then taking each letter in a person's name and finding their associated numbers. These numbers would then be added together producing a single number. So "john" would be associated with the numbers [10,15,8,14] whose sum is 47. However we want the number to always be within a fixed range, such as between 0 and 9. We can do this by dividing the number by 10 and keeping the remainder which will always be between 0 and 9. So 47 divided by 10 gives 4 remainder 7, so the hash of the name "john" would be 7. This is of course just an example and hash functions can be much more complicated than this.
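The toy hash just described can be sketched as follows (the function name is my own):

```python
def toy_hash(name, table_size=10):
    # Sum each letter's position in the alphabet (a=1, b=2, ...),
    # then keep the remainder after dividing by the table size.
    total = sum(ord(letter) - ord('a') + 1 for letter in name.lower())
    return total % table_size

print(toy_hash("john"))  # 7, since 10 + 15 + 8 + 14 = 47 and 47 % 10 = 7
```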

Hashing functions are used in order to create a hash table which is a table where every row is associated with a number and data is placed in a row according to its hash. The number of rows would be the range of the hashes. So "john" in the previous example would be placed in row 7. This speeds up searching for names since you can quickly find if a name is in the table by just computing its hash and checking if that particular row contains that particular name.

The problem is that different names can give the same hash which will create a collision in the hash table. There are ways around it but a good hashing function will minimize the number of collisions. So is there a hash function which will give the minimal number of collisions for any data?

Obviously not. Let's say that the range of the hashing function allows you to use a hash table with 1000 rows. This would mean that at most you can have 1000 different names without a collision, after which the next name will always result in a collision. Let's say that there is a list of 1000 names which when hashed will all be given different hashes. The hash function works well when the names are picked from that list. But there are more names than 1000. There are potentially infinite names which can be hashed. Since there are potentially infinite names and only 1000 can be placed in the hash table, then there are potentially infinite names which will all try to occupy the same row, regardless of which hashing function is used. This is simply because of the limited available space. So you can always find a list of names which will result in collisions. You can increase the range of the hashing function, but it must always be a finite range or it will not be a hashing function and cannot be used in a hash table.

There is no free lunch when it comes to hashing functions. You can only find a hashing function which works well with a particular subset of the data (usually the most typical data, just like in compression). There will always be some data which will result in a collision.

Machine learning / regression analysis

Both machine learning and regression analysis are the process of trying to guess a pattern from some examples. For example I say that I'm thinking of a sequence of numbers which starts as 0,2,4,6,8,... what will the next number be? You might be quick to guess that it would be 10 but this isn't necessarily true. You simply recognized one pattern which the list follows, namely the sequence of even numbers. But this isn't the only pattern which starts with those numbers. If there were more examples it might look like this: 0,2,4,6,8,1,3,5,7,9,10,12,14,16,18,11,13,15,17,19,... in which case we might guess the next number to be 20 since it alternated between even and odd numbers after every 5 numbers.

In fact for any sequence of numbers that you can think of, there exists an equation which will start with that sequence. A method for finding such an equation is called Polynomial Interpolation where an equation which starts with the sequence 1,3,2,4 (for whole number values of x) is:
f(x) = 1((x-2)(x-3)(x-4))/((1-2)(1-3)(1-4)) + 3((x-1)(x-3)(x-4))/((2-1)(2-3)(2-4)) + 2((x-1)(x-2)(x-4))/((3-1)(3-2)(3-4)) + 4((x-1)(x-2)(x-3))/((4-1)(4-2)(4-3))
which simplifies to:
f(x) = (2x^3 - 15x^2 + 35x - 20)/2
which gives the following values:
f(1) = 1
f(2) = 3
f(3) = 2
f(4) = 4
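We can check the simplified polynomial directly:

```python
def f(x):
    # The simplified interpolation polynomial from above.
    return (2*x**3 - 15*x**2 + 35*x - 20) / 2

print([f(x) for x in [1, 2, 3, 4]])  # [1.0, 3.0, 2.0, 4.0]
```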

And this can be done for any finite sequence. The equation might become more and more complex (even when simplified) but there will always be one. This means that for any example sequence you come up with, there are infinitely many different equations which start with that sequence and then all continue differently. So no finite sequence of examples suffices to identify a single pattern: there are infinitely many patterns which fit the examples, not just the one you were thinking of initially.

In general it is not simply a sequence of numbers that we need to find a pattern for. Machine learning is used to learn how to do a task from a set of examples of that task. It could be Youtube trying to guess what sort of videos you like based on the ones you have already seen. It could be Amazon trying to guess what sort of products you tend to buy based on the ones you have already bought. It could be a natural language production program which given a number of good stories will try to produce a new original good story. It could be a program which tries to find what you should invest in that will not fail given the current economic situation and using past economic situations as examples. It could be an automatic car which tries to learn how to drive automatically by observing how people drive. All these can be thought of as more complex versions of the previous equation shown, with the difference that instead of an equation you have a full blown program.

But the same problem persists. Different learning methods will infer different patterns from the same examples, and the problem is choosing the learning method which will give you the pattern you want. Each method has an inductive bias: it tends towards certain kinds of patterns, so a particular learning method will give patterns suitable for some uses whilst another method will give patterns suitable for others. This applies even to us humans, because we try to find patterns which describe our observations all the time, from a student trying to learn how to do an exercise based on the examples given by the teacher, to a scientist trying to find a theory which describes some natural process. There is no single best method of learning, since each can produce a pattern which fits the examples but fails on new observations.

There is no free lunch in learning. Each learning method can fail and so there is no universally good method.

Programming languages?

We now come to the first of my unfounded hypotheses. I think that programming languages also have no free lunch. Each programming language is designed to be efficient at creating certain programs, but none is efficient at everything. The more general purpose a programming language is, the harder it is to find errors in the code, which is why Haskell was reduced in generality to produce Helium, which has better error messages. It seems that a perfect programming language cannot be created: it will either be too unwieldy for humans to use or too specialized to be useful in general. So there is no free lunch; there cannot be a general and easy to use programming language, it must be domain specific (usually implicitly).

Data structures and algorithms?

Here is another of my unfounded hypotheses. Data structures are ways of representing data so as to be efficient for some particular use, such as searching. The problem is that there are always tradeoffs involved, and sometimes these tradeoffs are not obvious. For example, although a sorted list is fast to search through, it is slower to add new data to. An unsorted list, on the other hand, is faster to add data to (just append it to the end) but slower to search through.

Sorting algorithms also have tradeoffs. Many general sorting algorithms have a worst-case time complexity of O(n^2); merge sort achieves a worst-case of O(n log n), but at the expense of using twice as much memory. Heap sort also has a worst-case time complexity of O(n log n) (without big O notation it is 2n log n) and uses the same amount of memory, but it still has its disadvantages. And when an algorithm is optimized to handle more and more special cases in order to avoid its worst case, another tradeoff is involved: the size of the program needed to implement the algorithm. Different computers will also be more efficient at executing certain algorithms than others, just as we humans are better at using the slow sorting algorithms than the fast ones when sorting manually.

It seems that there cannot be perfect data structures and algorithms which are good in every situation; there will always be some situation for which a data structure or algorithm is not well suited. So there is no free lunch; there cannot be a general algorithm or data structure which works well in every situation, it must be application specific.

Monday, June 10, 2013

Fitness function for multi objective optimization / Mapping vectors/tuples to numbers

When you want to let the computer discover how to best solve a problem, you can use a genetic algorithm to evolve a solution for a problem. For example, if you're trying to find which schedule of lessons will cause the least clashes (hopefully none), you create a population of possible schedules and let them evolve into better and better schedules. A more interesting example would be evolving a program which does something in particular, such as control a robot.

In genetic algorithms you have to supply a function called a fitness function which returns a number which quantifies how good a particular solution is at solving the problem. For example, in the case of the time table scheduler, a fitness function would be the number of clashes between lessons that are caused by a given schedule. The smaller this number, the better the schedule. If you're trying to evolve a program which generates prime numbers, then a fitness function would be the percentage of numbers generated which are actually prime numbers. The higher this number, the better the program.

The problems start when you want to maximize or minimize more than one thing, that is, when you have a number of separate fitnesses that you want to optimize together. An example of this is evolving your schedule such that both the number of lesson clashes (when students are expected to be in two different lessons at the same time) and the number of room clashes (when the same room is assigned to be used for more than one lesson at the same time) will be minimized. In the case of the program, you usually not only want to evolve a correct program, but also one which gives the correct output quickly. This is called multi objective optimization.

What is usually done is a weighted summation of the individual fitnesses. So you say that the correctness of a program is, say, twice as important as the speed of the program, so the fitnesses are combined as
fitness = 2 * percentage_correct_outputs + time_taken
Of course this is not correct since the correctness must go up whilst the time should go down. So a better combination might be
fitness = 2 * percentage_correct_outputs + 1/(time_taken + 1)
You might not think that this is the best fitness function but choosing a correct fitness function is part of the difficulty of using genetic algorithms so have fun finding a better one.

The problem with weighted summations is that the fitness can increase by improving just one of the sub-fitnesses. So in the previous example, you can get a good fitness simply by creating a program that does nothing and takes 0 seconds to do so. Such a program is much easier to find than a correct program, so the genetic algorithm tends to improve execution time only and gets stuck, since a program which increases correctness will be much slower and will thus reduce the fitness rather than increase it. A better way to combine the sub-fitnesses is needed.

I found two ways to do this for two different needs. The first is for when the sub fitnesses are of equal importance, for example the time table scheduler minimizing the number of lesson and room clashes. The second is for when the sub fitnesses are prioritized such that the most important sub fitness should always be improved regardless of how the second is affected, for example the prime number generator being correct has the highest priority but between two equally correct programs, the faster one is preferred.

Equal priority fitness

If both sub fitnesses must be optimized together such that improving one without the other is not helpful, then we might add the two fitnesses together and subtract their absolute difference like this:

fitness = (sub1 + sub2) - |sub1 - sub2|

This way as one starts improving without the other, their difference will increase and the whole fitness will start becoming smaller. We can simplify this equation by calling the largest sub fitness "max" and the smallest sub fitness "min":

fitness = (max + min) - (max - min)
fitness = max + min - max + min
fitness = 2*min

So this fitness is equivalent to finding twice the minimum of the sub fitnesses, which makes sense since if you only care about increasing the minimum fitness, then both fitnesses will increase together. Of course multiplying by 2 is redundant since comparing a minimum with another minimum and comparing twice a minimum with twice another minimum will give the same result, so our final fitness function is:

fitness = min(sub1, sub2)

and you can add more sub fitnesses by just finding the minimum of all of them:
fitness = min(sub1, sub2, sub3, ..., subn)

So in our time table example, the fitness function would be:
fitness = min(-lesson_clashes, -room_clashes)

The negations are used so that in order to increase the fitness, the number of clashes have to be decreased.
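A quick sketch of this fitness function (the variable names are my own) shows the behaviour described above: improving only one sub-fitness leaves the fitness unchanged, whilst improving both increases it.

```python
def fitness(lesson_clashes, room_clashes):
    # Negated so that fewer clashes means a higher fitness.
    return min(-lesson_clashes, -room_clashes)

# Removing only the lesson clashes does not improve the fitness...
assert fitness(0, 5) == fitness(5, 5)
# ...but reducing both kinds of clashes does.
assert fitness(2, 2) > fitness(5, 5)
```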

Prioritized fitness

When one sub fitness is more important than the other, such as program correctness versus execution time, a different method is needed. What is needed is a function "f" which combines the two sub fitnesses ("sub_1" and "sub_2" where sub_1 has a greater priority than sub_2) such that the following conditions are met:

f(subA_1, subA_2) > f(subB_1, subB_2)
if subA_1 > subB_1 or
   subA_1 = subB_1 and subA_2 > subB_2

f(subA_1, subA_2) = f(subB_1, subB_2)
if subA_1 = subB_1 and subA_2 = subB_2

So for example f(2, 1) > f(1, 100), f(5, 20) > f(5, 1) and f(3, 3) = f(3, 3).

This is called a lexicographical ordering of the sub fitnesses. With strings this is natural. When sorting names, "John" comes before "Zach" because "J" comes before "Z", regardless of what the second letter is. But "James" comes before "John" because since the first letter is the same then we look at the second one and "a" comes before "o".

The problem we're facing is with finding a function which when given a pair of numbers will return a number which can be used to sort a list of pairs of numbers in lexicographical order. This is similar to Cantor's mapping of pairs of numbers to the natural numbers, except that this time we want a particular mapping which is lexicographic.

A natural lexicographic ordering in numbers comes when we look at numbers with a decimal point such as 2.1 and 3.5. The whole number part will always determine the ordering, unless they are equal, in which case the fractional part is then used. So if we can map our pairs of sub fitnesses to real numbers in such a way that the high priority sub fitness becomes the whole number part and the low priority sub fitness becomes the fractional part, then we would have found our "f".

What I found was that you can do this by squashing the low priority number into a proper fraction (a number between 0 and 1) and then adding it to the second number, assuming that it is a whole number. Unfortunately this method will only work if the high priority number is a whole number. The low priority number can be a real number however. In order to squash the low priority number, you can use a sigmoid function which given any number will return a number between 0 and 1 in such a way that ordering is preserved (sigmoid(a) > sigmoid(b) if and only if a > b). Another and perhaps simpler way would be to use the hyperbolic tangent or the arc tangent and then modify them so that their output is between 0 and 1 (y = (tanh(x)+1)/2 and y = (atan(x)+pi/2)/pi).

So the combined fitness would be:
fitness = sub1 + sigmoid(sub2)

The nice thing about this is that you can add more sub fitnesses in the following way, assuming that only the least important sub fitness is a real number whilst the rest are integers:
fitness = sub1 + sigmoid(sub2 + sigmoid(sub3 + ... + sigmoid(subn)...))

So in our program example, the fitness function would be:
fitness = percentage_correct_outputs + sigmoid(time_taken)

It should be noted that this is a way to map vectors/tuples of integers to real numbers whilst preserving lexicographic ordering.
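Here is a sketch of the prioritized combination, using a logistic sigmoid as the squashing function (the function and variable names are my own):

```python
import math

def sigmoid(x):
    # Squashes any real number into (0, 1) whilst preserving order.
    return 1 / (1 + math.exp(-x))

def fitness(sub1, sub2):
    # sub1 must be an integer; sub2 may be any real number.
    return sub1 + sigmoid(sub2)

# The ordering is lexicographic: the high priority sub-fitness dominates...
assert fitness(2, 1) > fitness(1, 100)
# ...and ties on it are broken by the low priority sub-fitness.
assert fitness(5, 20) > fitness(5, 1)
```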

Combined

Now we can even use both of the combinations in order to combine sub fitnesses in complex ways where some are of equal priority whilst others are of different priority. For example, say we are evolving a time table schedule which minimizes lesson clashes, minimizes room clashes and minimizes density such that you avoid cramming all the lessons in one day. The lesson and room clashes are of equal priority but the density has a lower priority than the other two. So the fitness function might be:

fitness = min(-lesson_clashes, -room_clashes) - sigmoid(density)

This will give us a real number whose whole number part represents the number of undesirable clashes and whose fractional part represents the density, both of which must be minimized. Since fitness is to be maximized, minimization is achieved through the negations and the subtraction.
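A minimal sketch of this timetable fitness, with made-up clash counts and density values purely for illustration:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def fitness(lesson_clashes, room_clashes, density):
    # worst of the two equal-priority clash counts, negated so fewer is fitter,
    # with density squashed into the fractional part as a tie-breaker
    return min(-lesson_clashes, -room_clashes) - sigmoid(density)

# fewer clashes always beats lower density
print(fitness(1, 0, 0.9) > fitness(2, 0, 0.1))  # True
# with equal clashes, the lower density wins
print(fitness(1, 1, 0.1) > fitness(1, 1, 0.9))  # True
```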

Sunday, May 5, 2013

An equation for enumerating fractions

In a previous post I described Cantor's way of proving that some infinite lists are bigger than others. I said that this is done by showing that a number of infinite lists, which are called countably infinite, can be paired up with every natural number from 0 onward. I shall now describe an equation that gives the natural number which is paired up with a particular fraction.

The fractions are:
1/1 1/2 1/3 1/4 ...
2/1 2/2 2/3 2/4 ...
3/1 3/2 3/3 3/4 ...
4/1 4/2 4/3 4/4 ...
...
and can be paired up with the natural numbers by doing the following:
  1. Organize them in a grid like above, where the first row consists of all the fractions with a numerator of 1, the second row with a numerator of 2, etc.
  2. Start from the corner (1/1) and pair it with 0
  3. Go down each diagonal in turn, that is, [1/2, 2/1] would be the first diagonal, [1/3, 2/2, 3/1] would be the second, etc. and pair them with the next consecutive natural number
This will give us the following mapping:
[0:1/1, 1:1/2, 2:2/1, 3:1/3, 4:2/2, 5:3/1, ...]
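The diagonal walk itself can be sketched directly in Python, which gives us a handy reference to check the equation against:

```python
def enumerate_fractions(count):
    # walk the grid diagonal by diagonal, pairing each
    # fraction with the next natural number
    pairs = []
    n = 0
    d = 0
    while len(pairs) < count:
        for i in range(d + 1):
            # the ith fraction in diagonal d is (1+i)/(d+1-i)
            pairs.append((n, (1 + i, d + 1 - i)))
            n += 1
            if len(pairs) == count:
                break
        d += 1
    return pairs

for n, (num, den) in enumerate_fractions(6):
    print(n, ":", num, "/", den)
```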

In order to find an equation which does this mapping, we first need to know in which diagonal a given natural number n is paired up. So 0 is paired up with the corner (diagonal 0), 1 is paired up with 1/2 which is in the second diagonal (diagonal 1), 2 with 2/1 which is also in the second (diagonal 1), 3 with 1/3 in the third (diagonal 2), etc. Once we know the diagonal, we can work out which fraction within it is paired with n.

Let's start by determining which natural number would be paired up with the first fraction in each diagonal. The first fraction in diagonal 0 is paired up with 0, the first fraction in diagonal 1 is paired up with 1, in diagonal 2 with 3, in diagonal 3 with 6, etc. Notice that, since the size of each diagonal increases by 1 each time, the sequence of numbers just mentioned (0, 1, 3, 6, ...) consists of the partial sums of an arithmetic series (the triangular numbers). This means that the first fraction in diagonal 0 is paired up with 0, diagonal 1 with 0+1, diagonal 2 with 0+1+2, diagonal 3 with 0+1+2+3, etc. So in order to find which natural number n will be paired up with the first fraction in diagonal d, we use:

n = d(d+1)/2
d = 0 => n = 0
d = 1 => n = 1
d = 2 => n = 3
d = 3 => n = 6
d = 4 => n = 10

We can use this equation to find the diagonal d in which natural number n is paired up. This is done by making d subject of the formula using the completing the square method: d = -0.5 ± sqrt(2n + 1/4)

This new equation gives the following:
n = 0 => d = 0, -1
n = 1 => d = 1, -2
n = 2 => d = 1.56, -2.56
n = 3 => d = 2, -3
n = 4 => d = 2.37, -3.37
n = 5 => d = 2.7, -3.7
n = 6 => d = 3, -4
n = 7 => d = 3.27, -4.27
n = 8 => d = 3.53, -4.53
n = 9 => d = 3.77, -4.77
n = 10 => d = 4, -5
As you can see, the answer that comes out when using the minus sign is nonsensical, whilst the whole number part of the answer that comes out when using the plus sign is the diagonal in which natural number n belongs. The plus part is what we want, and we can extract the whole number part by using the floor function:

d = floor(-0.5 + sqrt(2n + 1/4))
or
d = floor((sqrt(8n + 1) - 1)/2)

This gives us:
n = 0 => d = 0
n = 1 => d = 1
n = 2 => d = 1
n = 3 => d = 2
n = 4 => d = 2
n = 5 => d = 2
n = 6 => d = 3
n = 7 => d = 3
n = 8 => d = 3
n = 9 => d = 3
n = 10 => d = 4

So natural number 0 will be paired up with a fraction in diagonal 0, 1 will be paired up with a fraction in diagonal 1, 2 in diagonal 1, 3 in diagonal 2, 4 in diagonal 2, etc.

Since the largest numerator and denominator in a given diagonal is 1 more than the position of the diagonal (diagonal 2 has a largest numerator, as well as a largest denominator, of 3, etc.), we now also know that the largest numerator and denominator in the diagonal in which the natural number n belongs is floor((sqrt(8n + 1) - 1)/2)+1. So the first fraction in that diagonal will be 1/(floor((sqrt(8n + 1) - 1)/2)+1) whilst the last fraction will be (floor((sqrt(8n + 1) - 1)/2)+1)/1.

Since we also know which natural number is paired up with the first fraction in a given diagonal, we can know which natural number is paired up with the first fraction in the diagonal in which n is paired.
  1. The natural number n which is paired up with the first fraction in diagonal d is: n = d(d+1)/2
  2. The diagonal d in which natural number n is paired up is: d = floor((sqrt(8n + 1) - 1)/2)
  3. So the natural number n which is paired up with the first fraction in the diagonal in which natural number m is paired up is: n = floor((sqrt(8m + 1) - 1)/2)(floor((sqrt(8m + 1) - 1)/2) + 1)/2

This gives us:
m = 0 => n = 0
m = 1 => n = 1
m = 2 => n = 1
m = 3 => n = 3
m = 4 => n = 3
m = 5 => n = 3
m = 6 => n = 6
m = 7 => n = 6
m = 8 => n = 6
m = 9 => n = 6
m = 10 => n = 10
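This too can be checked in a few lines (diagonal_start is a hypothetical name, combining equations 1 and 2 above):

```python
import math

def diagonal_start(m):
    # natural number paired with the first fraction in m's diagonal
    d = math.floor((math.sqrt(8*m + 1) - 1) / 2)
    return d * (d + 1) // 2

print([diagonal_start(m) for m in range(11)])
# [0, 1, 1, 3, 3, 3, 6, 6, 6, 6, 10]
```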

Observe that in a given diagonal such as [1/4, 2/3, 3/2, 4/1], you can find any of the other fractions by knowing the first fraction and how far along the diagonal the fraction you want is (the numerator increases whilst the denominator decreases). For example, the 0th fraction in the diagonal is (1+0)/(4-0), the 1st fraction is (1+1)/(4-1), the 2nd fraction is (1+2)/(4-2), etc.

We know how far m is from the first natural number n in the diagonal in which m resides:
m - floor((sqrt(8m + 1) - 1)/2)(floor((sqrt(8m + 1) - 1)/2) + 1)/2

And we know the first fraction in the diagonal in which m resides:
1/(floor((sqrt(8m + 1) - 1)/2)+1)

So the fraction which will be paired up with natural number n is:

numerator: 1 + (n - floor((sqrt(8n + 1) - 1)/2)(floor((sqrt(8n + 1) - 1)/2) + 1)/2)
denominator: floor((sqrt(8n + 1) - 1)/2) + 1 - (n - floor((sqrt(8n + 1) - 1)/2)(floor((sqrt(8n + 1) - 1)/2) + 1)/2)

which I'm too lazy to simplify.

The equation in Python 3 is as follows:
import math

for n in range(20):
    d = math.floor((math.sqrt(8*n + 1) - 1)/2)  # diagonal containing n
    offset = n - d*(d + 1)//2  # position of n within its diagonal
    print(n, ":", 1 + offset, "/", d + 1 - offset)

And here is the output:
0 : 1 / 1
1 : 1 / 2
2 : 2 / 1
3 : 1 / 3
4 : 2 / 2
5 : 3 / 1
6 : 1 / 4
7 : 2 / 3
8 : 3 / 2
9 : 4 / 1
10 : 1 / 5
11 : 2 / 4
12 : 3 / 3
13 : 4 / 2
14 : 5 / 1
15 : 1 / 6
16 : 2 / 5
17 : 3 / 4
18 : 4 / 3
19 : 5 / 2

So using this enumeration method, the fraction 100/173 will be paired with:
import math

def findNat(num, den):
    n = 0
    while True:
        d = math.floor((math.sqrt(8*n + 1) - 1)/2)  # diagonal containing n
        offset = n - d*(d + 1)//2  # position of n within its diagonal
        if 1 + offset == num and d + 1 - offset == den:
            return n
        n += 1

print(findNat(100, 173))

36955

Friday, April 12, 2013

The Monty Hall problem

This is an interesting mathematical problem in probability which has an intuitive answer. The only problem is that the intuitive answer is wrong, which is what I love about it. It's called the Monty Hall problem and it goes like this:

Imagine you're in a game show and presented with three doors.



Behind one of the doors is a prize. You are to guess behind which door it is. At this point we should all agree that the chance of guessing the right door would be 1/3. You decide to pick door number 1.



You think that you have sealed your fate, but in a surprising turn of events, the game show host says that, in order to help you, since at least one of the two remaining doors must be wrong, he is going to tell you that door number 3 is surely wrong.



So now you know that the prize is either behind the door you chose or the other remaining door, door number 2. The host now asks you if you'd rather stick with your first choice or choose the other door.

Did the host actually help you? Is there some advantage to knowing which one of the three doors is surely wrong? Intuitively you'd say that your chances of winning by sticking with your first choice or choosing door number 2 are both 1/2. It's either behind one or the other right? So your chances of winning are equal right? Right guys? Guys?

Most people will not be convinced that this intuition is false. So here's a program I've written in Python 3 which simulates this situation 100 times and counts the number of times you would have won if you stayed with your first choice and the number of times you would have won if you swapped your choice.

import random

stay_win = 0
swap_win = 0
doors = {"Door 1", "Door 2", "Door 3"}
for _ in range(100): #For 100 times do the following...
    #Select the correct door
    correct = random.choice(list(doors))
    #The contestant makes their choice
    first_choice = random.choice(list(doors))

    #The door which can be eliminated cannot be the correct door or the contestant's choice
    can_eliminate = doors - { correct, first_choice }
    #Select the eliminated door
    eliminated = random.choice(list(can_eliminate))

    #The contestant has only one door left to choose if they swap their door
    choice_left = doors - { first_choice, eliminated }
    #Make this unchosen, non-eliminated door the second choice for swapping the door
    second_choice = choice_left.pop()

    if first_choice == correct:
        stay_win += 1
    if second_choice == correct:
        swap_win += 1

    print("Correct:", correct, " | ", "First choice:", first_choice, " | ", "Eliminated:", eliminated, " | ", "Second choice:", second_choice, " | ", "Won if stay:", first_choice == correct)
print()
print("stay:", stay_win, "swap:", swap_win)

In order to let you scrutinize the result, all the individual situations will be outputted as well:
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: True
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: True
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: True
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: True
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False
Correct: Door 1 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: True
Correct: Door 2 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: False
Correct: Door 3 | First choice: Door 2 | Eliminated: Door 1 | Second choice: Door 3 | Won if stay: False
Correct: Door 2 | First choice: Door 1 | Eliminated: Door 3 | Second choice: Door 2 | Won if stay: False
Correct: Door 1 | First choice: Door 3 | Eliminated: Door 2 | Second choice: Door 1 | Won if stay: False
Correct: Door 3 | First choice: Door 1 | Eliminated: Door 2 | Second choice: Door 3 | Won if stay: False
Correct: Door 3 | First choice: Door 3 | Eliminated: Door 1 | Second choice: Door 2 | Won if stay: True
Correct: Door 1 | First choice: Door 2 | Eliminated: Door 3 | Second choice: Door 1 | Won if stay: False

stay: 29 swap: 71

What? That can't be right! You win by swapping almost twice as often as when you stay with your first choice. Let's try it again!

stay: 39 swap: 61

Um...

stay: 37 swap: 63

Hmm... Why would you win twice as often when you swap than when you stay? The only way to know this is by seeing what needs to happen in order to win by sticking with your first choice and to win by swapping.

Here is what the different possibilities are when the correct door is door number 1:


And here is what the different possibilities are when the correct door is door number 2:


And finally here is what the different possibilities are when the correct door is door number 3:


Did you get it? No? Well, look carefully at what needs to happen in order to win using the first choice. You need to guess the correct door on the first try. What are the chances of you doing that? It's 1/3 of course. Could you ever win by choosing a wrong door on the first try and sticking to it? No.

Well then what do you need to do in order to win by swapping doors? Clearly, you will only lose if you chose the correct door on the first try; if you chose a wrong door on the first try then you'll win. What are the chances of choosing a wrong door on the first try? 2/3, twice that of choosing the correct one.

Clearly you're twice as likely to win by swapping as by staying. Obviously this doesn't mean that you'll always win; it just means that you're more likely to win. This illustrates a very important point, which is that intuition isn't perfect. Sometimes it's just not enough to use intuition to predict what will happen; you need to get down and dirty and actually do the boring work in order to make a correct judgement. Never overestimate yourself.
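Besides simulation, the two probabilities can be computed exactly by enumerating all nine equally likely (correct door, first choice) combinations; staying wins in exactly the three cases where the first guess was right:

```python
from itertools import product

doors = ["Door 1", "Door 2", "Door 3"]
stay_wins = 0
swap_wins = 0
for correct, first in product(doors, repeat=2):
    if first == correct:
        stay_wins += 1  # staying wins only when the first guess was right
    else:
        swap_wins += 1  # a wrong first guess means the swap lands on the prize
print("P(win | stay) =", stay_wins, "/ 9")  # 3 / 9
print("P(win | swap) =", swap_wins, "/ 9")  # 6 / 9
```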