Saturday, December 13, 2014

Testing for uniformly distributed random events

Uniformly distributed randomness means that every individual outcome of the random process has the same probability. A fair coin toss is uniformly distributed because the probability of getting heads is equal to the probability of getting tails.

Say you developed a program which uses randomness. How can you test that the randomness is actually uniformly distributed? An example is testing that a mutation function for a genetic algorithm mutates each bit with equal probability. Another example is that a random shuffle of a list results in each item occupying any position in the list with equal probability.

Creating the test

The way to test for uniform distribution is to run the random process many times, recording the output each time, and count the number of times each output occurs. After many trials, the frequencies should be similar. For example, to test whether a coin flip is uniformly distributed, count the number of heads and the number of tails; the counts should be similar. In the case of shuffling a list, make a table with a row for each position in the list and a column for each item in the list, and after each shuffle count the number of times each item ended up in each position. The counts across the whole table should be similar.

Here is an example unit test which tests if a dice simulator is uniformly distributed. Assume that there is a function called "similarity" which tells you how similar to each other a bunch of frequencies are.

import unittest

class TestDice(unittest.TestCase):

    def test_uniform(self):
        trials = 1000000
        best_similarity = similarity({ 1 : trials//6, 2 : trials//6, 3 : trials//6, 4 : trials//6, 5 : trials//6, 6 : trials//6 }.values())
        threshold = 0.9*best_similarity

        frequencies = { 1 : 0, 2 : 0, 3 : 0, 4 : 0, 5 : 0, 6 : 0 }
        for i in range(trials):
            outcome = roll_dice()
            frequencies[outcome] += 1

        self.assertTrue(similarity(frequencies.values()) >= threshold)

This program starts by creating a threshold against which to compare the similarity of the frequency list. This threshold is 90% of the similarity measure when the frequencies are all the same. The percentage can be tweaked of course. The similarity between the actual frequencies must be at least as high as this threshold for the test case to pass.

It could be the case that instead of a similarity function there is a "distance" function which is 0 when all frequencies are the same and greater otherwise. In this case the code should be a little different.

import unittest

class TestDice(unittest.TestCase):

    def test_uniform(self):
        trials = 1000000
        worst_distance = distance({ 1 : trials, 2 : 0, 3 : 0, 4 : 0, 5 : 0, 6 : 0 }.values())
        threshold = 0.1*worst_distance

        frequencies = { 1 : 0, 2 : 0, 3 : 0, 4 : 0, 5 : 0, 6 : 0 }
        for i in range(trials):
            outcome = roll_dice()
            frequencies[outcome] += 1

        self.assertTrue(distance(frequencies.values()) <= threshold)

In this case, instead of comparing against the best similarity, we compare against the worst distance, which occurs when a single outcome receives all of the frequency. The threshold is then 10% of this worst measure. The distance between the actual frequencies must not exceed this threshold for the test case to pass.

Measuring the distance/similarity
The remaining problem is how to measure the similarity between a list of frequencies: the similarity must be quantified as a single number so that a program (a unit test) can check whether it is greater or smaller than some threshold. Here are three methods to quantify the similarity between a list of frequencies.

Standard Deviation
The average of a bunch of frequencies is the single number that is closest, overall, to every frequency. But the average is only half of the story; it is also important to know how close the frequencies are to this average. To measure how close all the frequencies are to their average you find their standard deviation, which is roughly the average distance between each frequency and the average. It is found in the following way:

distance(xs) = sqrt( average( (x - average(xs))^2 for x in xs ) )
The closer they are to their average, the more similar they are. So the smaller the standard deviation, the more similar the frequencies are; hence, it is a distance measure rather than a similarity measure.
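In Python, a minimal sketch of this distance measure, matching the formula above, might look like this (it could be dropped straight into the second unit test shown earlier):

```python
import math

def distance(frequencies):
    """Standard deviation of the frequencies: 0 when all are equal,
    growing as they spread away from their average."""
    frequencies = list(frequencies)
    average = sum(frequencies) / len(frequencies)
    return math.sqrt(sum((x - average) ** 2 for x in frequencies) / len(frequencies))
```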

Chi-squared test
Let's say that you're conducting an experiment to measure the effects of smoking cigarettes on a person's health. You take two groups of people, one consisting of smokers and one of non-smokers, and you measure their collective health. I have no idea how to measure collective health, but let's say that you end up with two numbers measuring how healthy each group is. How do you quantify how similar the two numbers are, in order to determine how significant the effect of smoking is on a person's health?

A common way to do this in science is Pearson's chi-squared test, which measures how far a list of observed numbers deviates from a list of expected numbers. In order to apply this to frequencies, we can measure how far each frequency deviates from the average frequency.

distance(xs) = sum( (x - average(xs))^2 / average(xs) for x in xs )
This number will be smaller the more similar the frequencies are, making this a distance measure rather than a similarity measure.
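A minimal Python sketch of this distance measure, using the average frequency as the expected value for every outcome, might be:

```python
def distance(frequencies):
    """Pearson chi-squared statistic of the frequencies against their
    average (the expected count under uniformity): 0 when all are equal."""
    frequencies = list(frequencies)
    average = sum(frequencies) / len(frequencies)
    return sum((x - average) ** 2 / average for x in frequencies)
```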

Min max ratio
Of course there's a less fancy way to measure how similar a bunch of frequencies are, and that is by measuring the difference between the largest and smallest frequency. If the largest and smallest numbers are the same then all the numbers are the same. If they are different then that is the largest difference between any two numbers.

The difference between maximum and minimum isn't a useful measure of similarity since the similarity between 10 and 5 is not the same as the similarity between 1000 and 995. Instead it is more useful to divide the minimum by the maximum. But this will result in a value of zero whenever one of the frequencies is zero (and hence the minimum is zero), which is not measuring anything useful either. A solution for this is to add one to both the minimum and maximum frequency in order to avoid having zeros but still have a maximum similarity of 1.

similarity(xs) = (min(xs) + 1)/(max(xs) + 1)

Apart from this function giving a larger number the more similar the frequencies are, the difference between this measure and the others is that if all the frequencies are the same except for one, this method will identify the frequencies as different, whereas the others will be affected less. This might be considered an advantage when checking if a random process is uniformly distributed.
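A minimal Python sketch of the min-max ratio, matching the formula above, might be (note that, being a similarity rather than a distance, it fits the first unit test shown earlier):

```python
def similarity(frequencies):
    """Ratio of the smallest to the largest frequency, shifted by one so
    that a zero count does not force the similarity down to 0; equals 1
    when all frequencies are the same."""
    frequencies = list(frequencies)
    return (min(frequencies) + 1) / (max(frequencies) + 1)
```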

Wednesday, November 26, 2014

Dijkstra's Algorithm, Uniform Cost Search, and A* Search (graph search series)

Let's say that you're developing a game which involves moving a hero around in order to avoid a monster that is chasing him or her. It's easy to program the hero to just move around wherever the arrow keys on the keyboard indicate, but how do you make the monster automatically follow a sensible path towards the hero?

If there are no obstacles in the way then the monster just has to move in the direction of the hero, but you don't need me to tell you that. In order to get from point A to point B with obstacles in the way you need to use some path finding algorithms, which ideally find the shortest path. Let's start from Dijkstra's Algorithm.

Dijkstra's Algorithm

Dijkstra's Algorithm finds the shortest path from one point to every other point. In other words, it does more work than necessary if all you want is the shortest path between two particular points. This might not be what you need, but it's a good basis for understanding the more focused algorithms.

It is a dynamic programming algorithm which exploits the following observation about shortest paths. Consider the following diagram which shows the shortest path between two crosses with an obstacle in the middle:

If you take any point in the path, such as the circle, the shortest path from this point and one of the crosses is the same path as the shortest path between the two crosses. In other words, imagine that you are searching for the shortest path between the two crosses, and you meet a wizard who tells you that he knows what this shortest path is but that he will not tell you how to take it; however, he will tell you that the circle is on this path. With this information you can simplify your search by looking for the shortest path from one cross to the circle and from the circle to the other cross. These two paths can then be combined into a single shortest path between the two crosses.

Obviously, if the circle is not on the shortest path between the two crosses then this strategy will not work, but it is useful for reducing the number of different paths to check. In general we will not have access to such a wizard, but we can still take advantage of this fact by looking for the shortest path between the first cross and every point along the way to the second cross. The shortest path to the second cross must begin with one of these paths. But how can we find the shortest paths to these intermediate points? By using intermediate points that are along the way to those intermediate points, of course. This strategy allows us to find the shortest path between two points bit by bit, by finding other, shorter shortest paths along the way.

This is how Dijkstra's algorithm works. Here is the algorithm:
distance_from_start = { starting_point : 0 }
previous_point = { starting_point : null }
next_points = [ starting_point ]

while next_points is not empty do
    next_point = point in next_points where distance_from_start[point] is minimum
    remove next_point from next_points

    for each neighbouring_point around next_point do
        new_distance = distance_from_start[next_point] + distance from next_point to neighbouring_point
        if neighbouring_point is not in distance_from_start or new_distance < distance_from_start[neighbouring_point] then
            add neighbouring_point to next_points if not in distance_from_start

            distance_from_start[neighbouring_point] = new_distance
            previous_point[neighbouring_point] = next_point

The algorithm starts at the starting point and checks every neighbouring point around it. The path from the starting point to each of these points is provisionally taken as the shortest path to that point. Each of these points is then checked for its own neighbouring points, and if a point has been visited before, the distance of the newly found path is compared to the distance of the previously recorded path. If the new distance is shorter, the new path replaces the old one. This is repeated until every point has been checked. At the end, every point is associated with some previous point which, if followed back to the starting point, traces the shortest path from the starting point.
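The pseudocode above can be sketched in Python as follows. A heap (priority queue) replaces the linear scan for the minimum-distance point, and the graph is represented as an assumed adjacency structure: a dictionary mapping each point to a list of (neighbour, distance) pairs.

```python
import heapq

def dijkstra(graph, start):
    """Shortest distance from start to every reachable point.
    graph: dict mapping each point to a list of (neighbour, distance) pairs.
    Returns the (distance_from_start, previous_point) dictionaries."""
    distance_from_start = {start: 0}
    previous_point = {start: None}
    next_points = [(0, start)]  # priority queue of (distance, point)
    while next_points:
        dist, point = heapq.heappop(next_points)
        if dist > distance_from_start[point]:
            continue  # outdated queue entry: a shorter path was already found
        for neighbour, edge_distance in graph[point]:
            new_distance = dist + edge_distance
            if neighbour not in distance_from_start or new_distance < distance_from_start[neighbour]:
                distance_from_start[neighbour] = new_distance
                previous_point[neighbour] = point
                heapq.heappush(next_points, (new_distance, neighbour))
    return distance_from_start, previous_point
```

Instead of removing entries from the queue when a shorter path is found, this sketch simply skips outdated entries when they are popped, which is the usual idiom with heapq.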

Uniform Cost Search
Uniform Cost Search is Dijkstra's Algorithm focused on finding a single shortest path to a single finishing point rather than a shortest path to every point. It does this by stopping as soon as the finishing point is reached.

Here is the algorithm:
distance_from_start = { starting_point : 0 }
previous_point = { starting_point : null }
next_points = [ starting_point ]

while next_points is not empty do
    next_point = point in next_points where distance_from_start[point] is minimum
    if next_point = finishing_point then exit
    remove next_point from next_points

    for each neighbouring_point around next_point do
        new_distance = distance_from_start[next_point] + distance from next_point to neighbouring_point
        if neighbouring_point is not in distance_from_start or new_distance < distance_from_start[neighbouring_point] then
            add neighbouring_point to next_points if not in distance_from_start

            distance_from_start[neighbouring_point] = new_distance
            previous_point[neighbouring_point] = next_point

It is not enough to exit as soon as a neighbouring point turns out to be the finishing point, as there might be other paths to the finishing point that are better. In order to be sure that no better path exists, you have to wait until the finishing point is the closest point available in the list of next points to explore. At that moment you can be sure that any other point that could still lead to the finishing point is further away from the starting point (and would thus produce a longer path) than the finishing point already is.

The problem with this approach is that although it is guaranteed to give you the shortest path from the starting point to the finishing point, it does so because it checks every possible path blindly, which would be too long for practical use in a game. A faster algorithm exists.

A* Search
A* Search is the informed version of Uniform Cost Search. It differs in that you must supply a way to estimate how close any point is to the finishing point, which the algorithm uses to make informed decisions about which point to explore next. This takes the "blindly" out of Uniform Cost Search. The estimate is used to approximate the total distance of a path from the starting point to the finishing point that passes through a particular point: the known distance of the point from the starting point (computed as usual) plus the estimated distance from the point to the finishing point. In this way, the next point to explore is selected not just on its distance from the starting point, but on the estimated length of the whole path through it.

The important thing is that this estimate is never more than the actual distance to the finishing point, although it can be shorter. If it can be longer, the algorithm is not guaranteed to give the shortest path. The reason is that as soon as an actual path is found (when the finishing point is reached), if its actual distance is shorter than the underestimated distances of the other candidate paths, then it can be safely concluded that those other paths' actual distances are even longer than the path already found. If, on the other hand, the estimates can be overestimates, they cannot be compared to an actual path distance, since the true distances might be less.

Another thing to notice is that the closer the estimate is to zero, the more the search will behave like Uniform Cost Search, exploring more points but remaining safe. A good estimate is the Euclidean distance (straight line distance) to the finishing point, which can never overestimate the actual path length. This estimate is used to select the most promising point to process from all the ones discovered.

Here is the algorithm:

distance_from_start = { starting_point : 0 }
previous_point = { starting_point : null }
next_points = [ starting_point ]

while next_points is not empty do
    next_point = point in next_points where distance_from_start[point] + estimate to finishing point is minimum
    if next_point = finishing_point then exit
    remove next_point from next_points

    for each neighbouring_point around next_point do
        new_distance = distance_from_start[next_point] + distance from next_point to neighbouring_point
        if neighbouring_point is not in distance_from_start or new_distance < distance_from_start[neighbouring_point] then
            add neighbouring_point to next_points if not in distance_from_start

            distance_from_start[neighbouring_point] = new_distance
            previous_point[neighbouring_point] = next_point
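The A* pseudocode can be sketched in Python much like the Dijkstra version, again with a heap replacing the linear scan; `graph` is the same assumed adjacency structure, and `estimate` is any function that never overestimates the remaining distance to the goal.

```python
import heapq

def a_star(graph, start, goal, estimate):
    """Shortest path from start to goal as a list of points.
    graph: dict mapping each point to a list of (neighbour, distance) pairs.
    estimate(point): a never-overestimating guess of the remaining distance."""
    distance_from_start = {start: 0}
    previous_point = {start: None}
    # Priority queue ordered by distance so far + estimated remaining distance.
    next_points = [(estimate(start), start)]
    while next_points:
        _, point = heapq.heappop(next_points)
        if point == goal:
            # Follow the previous-point links back to reconstruct the path.
            path = []
            while point is not None:
                path.append(point)
                point = previous_point[point]
            return path[::-1]
        for neighbour, edge_distance in graph[point]:
            new_distance = distance_from_start[point] + edge_distance
            if neighbour not in distance_from_start or new_distance < distance_from_start[neighbour]:
                distance_from_start[neighbour] = new_distance
                previous_point[neighbour] = point
                heapq.heappush(next_points, (new_distance + estimate(neighbour), neighbour))
    return None  # goal unreachable
```

Passing an estimate that always returns 0 makes this behave exactly like Uniform Cost Search.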

Wednesday, October 22, 2014

Thoughts on lexical substitution: a contextual thesaurus

Lexical substitution is the task of selecting a word that can replace a target word in a context (a sentence, paragraph, document, etc.) without changing the meaning of the context. If you think about it, this is the main application of a thesaurus: using synonyms to replace words.

The problem is, however, that a thesaurus is not necessary for substituting words. Consider the words "questions" and "answers". Would you find them as synonyms in a thesaurus? In the sentence "I got 10 answers right", the word "answers" can be substituted with "questions". Thesauri are not sufficient either because two words which are commonly considered synonyms might not be substitutable in certain contexts even if they are of the same sense, such as "error" and "mistake" in the sentence "An error message popped up."

Many systems that either automatically perform lexical substitution or automatically find synonyms (see the SemEval 2007 English lexical substitution task for example) do so using the following simple recipe. To substitute the target word "sat" in the sentence "The cat sat on the mat.":

  • Extract the words "the", "cat", "on", and "mat" from the sentence, that is, the words which are near the target word in the sentence. These are called feature words.
  • Look for another word which occurs in sentences containing the same feature words.
  • That word can replace "sat" in "The cat sat on the mat".

There might be restrictions on which feature words are used, such as not using common words like "the" and "on"; there might be word order restrictions such as the feature words having to be in order in the sentence; there might even be some higher order relationship such as using feature words of feature words; however, essentially all of these methods assume that if another sentence is found which is "The cat ____ on the mat.", then whatever word fills that blank will be a substitutable word for "sat".
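The recipe above can be sketched as a toy Python function. Everything here is illustrative: the corpus, the stopword list, and the tokenization are made-up assumptions, and a real system would add the restrictions just described rather than requiring all feature words to co-occur.

```python
def candidate_substitutes(corpus, sentence, target,
                          stopwords=frozenset({'the', 'on', 'a'})):
    """Toy sketch of the feature-word recipe.
    corpus: list of sentences, each a list of lowercase words."""
    # The feature words are the words around the target, minus common words.
    features = {w for w in sentence if w != target and w not in stopwords}
    candidates = set()
    for other in corpus:
        other_words = set(other)
        # A sentence containing all the feature words may hold a word that
        # can stand in for the target.
        if features <= other_words:
            candidates |= other_words - features - stopwords - {target}
    return candidates
```

Run on a two-sentence corpus, it proposes whatever fills the "The cat ____ on the mat." slot, which is exactly the assumption the article goes on to criticise.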

Even if thesauri are used to filter out words with completely different meanings, as is usually done, it is not enough for X and Y to share a thesaurus entry for the two sentences "a b c X p q r." and "a b c Y p q r." to have identical meaning. The problem is that lexical substitution is much more complex than that. In reality, proper lexical substitution requires sentential semantics rather than lexical semantics. When you substitute a word in a sentence, you read out the whole sentence to see if it still sounds like it means the same thing. This is not simply checking whether the sentence makes sense, which is what the previous method does, but a higher cognitive task which involves understanding the meaning of the sentence.

Consider the sentences "I have a big sister." and "I have a large sister.", or "I saw Mary." and "I watched Mary.". In these sentences, "big" and "large", and "saw" and "watched", have different meanings; yet in other sentences, such as "I have a big house." and "I have a large house.", or "I saw a movie." and "I watched a movie.", the same word pairs are synonymous. No thesaurus tells you in which contexts these words mean the same thing, and all of these sentences are perfectly natural.

Consider also that thesauri are not perfect for this task. For example, WordNet groups together "mother" and "father", whilst Roget's Thesaurus groups "man" with "brother".

It's also important to recognize whether or not the target word is part of an expression or idiom. For example, is the sentence "John kicked the bucket." referring to John actually kicking a bucket or to John dying? If it's an expression then none of the words can be changed.

It seems like understanding the meaning of the sentence is necessary in order to perform lexical substitution. After all, the best performing systems in the previously mentioned SemEval 2007 did no better than 20% correct substitutions. Sentential semantics is a very complex task to automate, since even the exact same sentence can have different correct meanings, such as the previous "John kicked the bucket." However, if idioms, collocations, and other pragmatic constructs (such as saying "expired" instead of "died" due to taboo) could be precisely detected then the rest of the sentences could be solved by using a fine grained lexical ontology.

An ontology maps the relationship between words, such as appropriate verbs and adjectives to nouns. This means hard coding all possible word relations in such detail that the system would know that watching a person is different from watching a movie and that a large sister is different from a big sister. This would also require that words be categorized into semantic categories such as "person". In this way, together with a precise thesaurus, it may be possible to use dependency parsing in order to find which words are connected to the target word in the sentence and check if a synonym found in the thesaurus can be semantically used in the same way as the target word in the sentence. If it changes the meaning of the sentence, then the relations between the words would also change.

If you think about it, this system would be a contextual thesaurus, which is a thesaurus on steroids. Imagine a thesaurus, book form or online, which does not only list words in alphabetical order, but also includes the different words that can be connected to them (using word categories). So the thesaurus would not just have entries that belong to single words like this:

big: large
see: watch

but would instead state which words can be connected to it like this:

big [sibling]: older
big [object]: large
see [movie]: watch
seeing [person]: dating

This would also simplify word sense disambiguation. For example, the word "suit" can be both an article of clothing and a legal action. But in the sentence "I'm wearing a suit.", only one of those two senses can be combined with the verb "wearing". This would help finding which word category a word belongs to. If a word can be in more than one category, simply check which category can be matched with the word's connected words.

Perhaps this thesaurus can be compiled manually by some lexicographers. The challenge is to automatically compile this thesaurus on steroids from online text in the same way that people learn word relations. Until we have such a resource lexical substitution cannot be moved forward to a usable state.

Sunday, September 14, 2014

Doing an UPSERT (update or insert if new) in MS SQL Server and MySQL

The UPSERT is an incredibly useful SQL instruction for databases, especially for data synchronization and key-value data structures. Unfortunately it is ignored by some databases.

An UPSERT is an SQL instruction which attempts to update a record (with a particular primary key) and will automatically create a new record with the updated data if it does not exist. This would otherwise be accomplished by first doing a SELECT and then doing an INSERT or UPDATE depending on whether the SELECT found anything. This would require two separate queries plus some programmed logic. The UPSERT allows you to do these two queries in one.

Here's how to do it for MySQL and MS SQL Server. The examples below are based on a table called "tbl" which contains 3 fields: "pk" (primary key), "f1" (field 1), and "f2" (field 2). In both examples the value being inserted is "key" for pk, "data1" for f1, and "data2" for f2.

Notice: Be sure to avoid using autoincrement primary keys for tables in which you want to upsert, since it wouldn't make sense to insert a specific key. It is also advisable to lock the table prior to upserting if multiple upserts may be performed concurrently.


MySQL

The ON DUPLICATE KEY UPDATE clause allows you to perform an update if an insert is not possible due to an attempt to insert a duplicate value into a field with a UNIQUE constraint (such as the primary key).

INSERT INTO tbl (pk, f1, f2)
  VALUES ('key', 'data1', 'data2')
  ON DUPLICATE KEY UPDATE
    pk = VALUES(pk),
    f1 = VALUES(f1),
    f2 = VALUES(f2)

In order to avoid rewriting the values twice, we can refer to values already specified in an INSERT by using the VALUES function.

MS SQL Server

In MS SQL, an UPSERT can be achieved using a MERGE. This is originally meant to be used to insert many rows together into a table and then perform some actions (delete, update, insert, etc.) depending on whether or not a particular field already exists.

MERGE INTO tbl AS t
  USING
    (SELECT pk='key', f1='data1', f2='data2') AS s
  ON t.pk = s.pk
  WHEN MATCHED THEN
    UPDATE SET f1 = s.f1, f2 = s.f2
  WHEN NOT MATCHED THEN
    INSERT (pk, f1, f2)
    VALUES (s.pk, s.f1, s.f2);

In order to avoid rewriting the values three times, the values were placed in a named nested SELECT. "s" refers to the source of the values, that is the values to add, whilst "t" refers to the target of the values, that is the table in which to add the values.

Sunday, August 31, 2014

Tips for LaTeX

I wrote my thesis in LaTeX using Texmaker. Here are some tips worth sharing.

For the bibliography use BibTeX, since you'll find BibTeX citations all over the Internet, and use JabRef to edit the bibliographic details. Unfortunately the default for rendering citations is the number-in-square-brackets style, like [1]. This might not be your preferred option as it doesn't show any information about what you're citing (although it's great for saving space in your text). If you want to use the "(author, year)" style of citation, like "(Smith, 2014)", you'll need to put "\usepackage[round]{natbib}" in your preamble and then use "\citep{}" instead of "\cite{}" to insert citations.
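As a minimal sketch (the file name references.bib and the key smith2014 are made up for illustration), a document using natbib's round author-year citations might look like this:

```latex
\documentclass{article}
\usepackage[round]{natbib}

\begin{document}
As argued by earlier work \citep{smith2014}, ... % renders as "(Smith, 2014)"
\citet{smith2014} gives the textual form.        % renders as "Smith (2014)"

\bibliographystyle{plainnat}
\bibliography{references} % reads entries from references.bib
\end{document}
```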

Unless you're writing only in pure English, you'll often need non-ASCII characters. The standard LaTeX compiler doesn't accept non-ASCII text directly, but it lets you produce such characters using codes like \'{o}; however, these make the source unreadable when used frequently. Instead you can type your text in Unicode and compile with XeLaTeX, which accepts Unicode input.

Single line compilation
LaTeX requires multiple compilations to get a final product. So, if using command line commands, you'll need to enter something like this to have a bibliography in your document:
latex x.tex
bibtex x
latex x.tex
latex x.tex

In order to get around this, IDEs such as Texmaker allow you to customize a compilation button to run a sequence of command line instructions of your choice, with the commands separated by |. Here is how I was doing my compilation in the Texmaker quick build function:

xelatex -synctex=1 -interaction=nonstopmode %.tex|makeindex %.idx|bibtex %.aux|makeindex %.nlo -s -o %.nls|xelatex -synctex=1 -interaction=nonstopmode %.tex|xelatex -synctex=1 -interaction=nonstopmode %.tex

Notice that in Texmaker "%" means the file name of the main LaTeX source file.

Overfull/underfull hbox error handling
LaTeX has rather poor handling of word wrap and often needs some help from you. When this happens, it gives you an overfull hbox or underfull hbox warning, depending on whether a line in the text is too long or too short. To fix it you either reword the offending line (yes, this is normal in the LaTeX community), or you add hyphenation marks using "\-" inside long words, such as "ele\-phant". This tells LaTeX where it may split the word if it falls at the end of a line. LaTeX usually does this on its own, but if the word is unknown to it you'll need to help it.

Overfull/underfull hbox error handling on URLs
Long URLs are particularly problematic in LaTeX because you cannot hyphenate them, as the hyphen might be mistaken for part of the link. I found that the best way to handle them is to just line break the URL using "\\". Here's an example (with a made-up URL):

http://www.example.com/a/very/long/\\path/to/a/page.html

In this way you'll be telling LaTeX where to split the URL without adding hyphens.

Tables
Ouch. Tables are quite a nightmare in LaTeX since you get no word wrapping in the table cells. You can fix this by putting your text in a \parbox, which wraps words by itself, and then putting this \parbox in the table cells. Here's the inefficient way to do this:

\begin{table}
  \begin{tabular}{ l c r }
    \hline
    \parbox[t]{2cm}{
      \begin{flushleft}\vspace{-0.3cm}Title 1\vspace{-0.3cm}\end{flushleft}
    } &
    \parbox[t]{4cm}{
      \begin{center}\vspace{-0.3cm}Title 2\vspace{-0.3cm}\end{center}
    } &
    \parbox[t]{6cm}{
      \begin{flushright}\vspace{-0.3cm}Title 3\vspace{-0.3cm}\end{flushright}
    }
    \\ \hline
    \parbox[t]{2cm}{
      \begin{flushleft}\vspace{-0.3cm}Data 1\vspace{-0.3cm}\end{flushleft}
    } &
    \parbox[t]{4cm}{
      \begin{center}\vspace{-0.3cm}Data 2\vspace{-0.3cm}\end{center}
    } &
    \parbox[t]{6cm}{
      \begin{flushright}\vspace{-0.3cm}Data 3\vspace{-0.3cm}\end{flushright}
    }
    \\ \hline
  \end{tabular}
  \caption{A caption about this table.}
\end{table}

This will make the first column left aligned and 2cm wide, the second centred and 4cm, and the third right aligned and 6cm. The \vspace is there to keep the rows from being too tall. Of course the problem with this approach arises if you need to change the width or alignment of a column, in which case defining custom commands is the best way forward:

\newcommand{\colA}[1]{\parbox[t]{2cm}{\begin{flushleft}\vspace{-0.3cm}#1\vspace{-0.3cm}\end{flushleft}}}
\newcommand{\colB}[1]{\parbox[t]{4cm}{\begin{center}\vspace{-0.3cm}#1\vspace{-0.3cm}\end{center}}}
\newcommand{\colC}[1]{\parbox[t]{6cm}{\begin{flushright}\vspace{-0.3cm}#1\vspace{-0.3cm}\end{flushright}}}

\begin{table}
  \begin{tabular}{ l c r }
    \hline
    \colA{Title 1} &
    \colB{Title 2} &
    \colC{Title 3}
    \\ \hline
    \colA{Data 1} &
    \colB{Data 2} &
    \colC{Data 3}
    \\ \hline
  \end{tabular}
  \caption{A caption about this table.}
\end{table}

Commands are very useful for repetitive code in LaTeX but you need to use a different name for each different table; otherwise you'll need to use \renewcommand instead of \newcommand.

Positioning images and tables (floats)
I used to think that LaTeX intelligently places my floats at the best, most sensible place, but it does not. It will happily place them in the middle of bullet points, without feeling guilty, even if you put your \begin{table} or \begin{figure} right where you want it. So what I did was place each float wherever seemed sensible myself and then force LaTeX to respect that by surrounding the float with "\FloatBarrier", which requires "\usepackage{placeins}" in the preamble.

Thursday, July 24, 2014

Shape of the Universe

At the moment I'm reading A Universe From Nothing by Lawrence Krauss and there's a section which talks about the shape of the universe. So what does it mean for the universe to have a shape?

Let's imagine a flat 2D universe. What this means is that things can move left, right, forward, and backward but not up and down. Everything is flat, as shown in this picture:

Now think of this universe as a bed sheet. The sheet may be laid on the floor, in which case it would have a flat shape, but it might be placed on some curved surface, such as a ball.

Is there a way for the inhabitants of such a spherical universe to know that their universe is a sphere? Yes: by using triangles. A triangle in a flat universe follows the rules of Euclidean geometry, such as that its angles add up to 180 degrees, but what about on the surface of a sphere?

On a sphere, it is actually possible to draw a triangle consisting of three 90 degree angles. Divide the sphere into 8 octants (break it into two halves, then each half into another two halves, and each of those into yet another two halves). Each octant is a triangle consisting of only 90 degree angles. How can this happen? It's because lines on a sphere that diverge tend to reconverge at some point, which doesn't happen on a flat plane.

So, if the inhabitants of this spherical universe were to try to construct a triangle (which has to be really big) and find that each angle is 90 degrees, then they would have proven that they live in a spherical universe.

There are other shapes of the universe that are possible, such as the saddle shape below, also from Wikipedia:

"HyperbolicParaboloid". Licensed under CC BY-SA 3.0 via Wikimedia Commons.

The inhabitants of a saddle shaped universe can tell that their universe is a saddle shape and not a sphere because lines tend to diverge rather than converge, and so the triangle will have angles which are too small, rather than too big.

This is what the inhabitants would see if they drew a triangle on a...

spherical universe:

flat universe:

saddle shaped universe:

You might be asking about other weird shapes that the universe can be in. These shapes were derived from general relativity and depend on the mass of the universe. Different masses give rise to variations of these three shapes (how curved the triangles are) so this is all we will consider.

Now there is no actual need to measure all the angles of the triangle in order to see if they add up to 180 degrees. A more practical solution might be to stand at one of the points, say the top point, place a ruler at the bottom of the triangle a known distance away from you, and check how wide you have to open your field of vision in order to see the whole ruler. From the above three triangles you will have noticed that the top corner of each triangle needs a different angle in order to achieve the same base length. In a spherical universe you'd need a wider top angle to achieve a 1 km base length, whilst in a saddle shaped universe you'd need a narrower top angle to achieve the same 1 km base length. By measuring how wide the angle of vision has to be in order to see a ruler of a known length, a known distance away, the inhabitants of the universe can work out what shape their universe is.

Of course there is no ruler readily available for us to use, especially not one that is big enough and far enough to measure subtle curvatures, and of course the universe is not two dimensional. Let's deal with these one at a time.

Does it matter that the universe is three dimensional? No, these same shapes can happen in any number of dimensions, as long as there is another dimension along which to curve. So can there be a fourth dimension? Determining whether our universe is flat or not can answer that question. But if there is a fourth dimension, we wouldn't be aware of it, just like the inhabitants of the flat universe would not be aware of a third dimension: they can only travel along the surface and cannot jump up or dig down. In the same way, we can only travel along our three dimensional space but cannot move along a fourth dimension, so we never needed to be aware of it. One thing is for sure, however: 3D triangles still behave in the same way on a 4D sphere.

So how do we use the ruler experiment to determine the shape of the universe? Obviously we won't be using an actual ruler, but we can use something that is almost as good. The universe is expanding as we speak and everything is moving further apart, which means that in the past it was smaller and everything was closer together. That's not all. If you look far away, you will be seeing things from the past. This is because light is not instantaneous but travels at a constant speed, which means that if you're looking at something so far away that the light reaching your eyes took 1000 years of travel, you are seeing what it looked like 1000 years ago. The oldest thing you can see is called the cosmic microwave background radiation. This comes from when the universe was 300,000 years old and was so dense that it was an opaque blob (light could not pass through). This means that we cannot see past this 300,000 year mark; it is the oldest thing we can see in the universe.

Now here comes the crux of the matter: gravity also travels at the speed of light. If the sun suddenly disappeared, the Earth would keep revolving around it as if nothing had happened for 8 minutes, and only then fling out of orbit. This is because the Earth is 8 light minutes away from the sun. Since the cosmic microwave background radiation shows the universe at only 300,000 years old, any two blobs in it that are more than 300,000 light years apart could not have been gravitationally attracted to each other. This sets our ruler: a 300,000 light year long ruler. All you have to do is see how far apart blobs have to be in order to not be attracted to each other. In practice this means checking how packed together the blobs are: with gravity attracting only blobs that were closer than 300,000 light years to each other, there will be a characteristic density of blobs, which can be found by computer simulation.

So the trick is to check how wide an angle of vision must be used in order to see a certain number of blobs. If the universe is spherical, a narrower angle must be used; if saddle shaped, a wider angle; and if flat, an angle in between. The BOOMERanG experiment took another approach, however: rather than checking what angle must be used to see a given number of blobs, the number of blobs expected for a particular angle was found for each shape, together with the actual number of blobs seen, as illustrated here (see the last picture).

The actual observation is very similar to what should have been seen if the universe were flat. This is evidence that the universe is, in fact, "flat", that is, it is not curved through some fourth dimension.

Sunday, June 22, 2014

Thoughts on Artificial Intelligence

I've always had issues with the Turing Test, not because I disagree with equating computed intelligence with the ability to imitate intelligence, but simply because it equates intelligence with speaking skills. Obviously intelligence has numerous ways of expressing itself other than through speech. Apart from there being no reason for this particular expression of intelligence to be superior to other expressions of intelligence (provided it is a sufficiently powerful intelligence), the Turing Test is actually impeding the progress of artificial intelligence by diverting attention to natural language processing rather than pure artificial intelligence.

The fact is that a sufficiently intelligent organism will learn to communicate with us in whatever way is most appropriate for it, even if it is not in English, just like a dog does. Communication is a consequence of intelligence rather than the other way round. So what I'm saying is that we should focus on developing pure artificial intelligence rather than chat bots, which will then be able to learn to communicate with us in whatever form is most efficient rather than using English. It may be the case that pure artificial intelligence is easier to achieve than programming a convincing chat bot using the methods we are using right now.

If human imitation is what is being sought, then perhaps it would be more practical to imitate something simpler than conversation. An example would be playing a game. Any video game player can vouch for the differences between playing against a computer and playing against another player. Even simple path finding can give away that you are playing against a computer rather than a real human, as can the inability to adapt to strategies used by the player rather than following a fixed exploitable pattern.

But even this is limiting the expression of intelligence, when the computer would ideally be free to express its intelligence in any way. So how can intelligence be measured without being limited to a particular expression? Intelligence is the ability to discover patterns which can be used to increase the probability of reaching a goal. The process of discovery is usually called "learning". If we are to accept this definition, then I shall make a case against the Turing Test by showing how a plant can be tested for intelligence.

I have lately been dabbling in some indoor gardening by placing a potato in a pot and watering it. Let's illustrate this in a picture:

I have noticed that if you place a light source to one side of the plant, after some time the plant will bend itself to face the light, like so:

And if the light is then moved, the plant will bend appropriately:

This seems like pretty nifty behaviour, but it is not intelligence for it is only responding to a fixed stimulus, in other words, there are no patterns being taken advantage of. But now suppose that we set a revolving light that orbits around the plant which goes around slowly enough for the plant to bend towards it synchronously with the light revolutions. In the absence of intelligence, the plant will simply keep on bending towards the light forever, but what if we come back to check on the plant a few months later and find this:

The plant, rather than bending towards the light, responds to the pattern of the revolving light by adapting itself so that there are leaves at every side of the plant so that at any point in time, there are some of the leaves which are facing the light. Should we ignore biological details on whether or not the extra leaves do more good than harm (maybe the leaves take less energy to maintain than the constant bending), this would be evidence that the plant is intelligent.

Now, if even a plant can be considered intelligent, which is as far away from holding a conversation with a human as an organism can be (I think), think of how unfair the Turing Test is on computers, which requires them to master the human language so much that it fools people into thinking that it is a human. Alan Turing came up with a test for human imitation, and the Turing Test is a perfect test for this, but to claim that this is the best test of intelligence is silly.

I will intentionally not propose a specific test myself, because a specific test would be quantitative and formal in nature, and intelligence is not that. Intelligence can be surprising and we should accept it as that. I would much rather take a qualitative approach to intelligence, where someone proposes a pattern that is being learnt by a computer and researchers investigate its validity.

Sunday, June 8, 2014

Operant Conditioning

In behaviour psychology, in order to change the behaviour of a subject, you can use one of 4 different methods. These are categorized into reinforcements (rewards) and punishments and further into positive and negative. It is important to note that positive and negative have nothing to do with how they are perceived, only with whether something is added or removed to the subject.

Positive reinforcement: a pleasurable experience is given to the subject in exchange for increasing the desired behaviour. Examples: getting a good mark for studying, or giving a dog a treat for doing a trick.

Negative reinforcement: an unpleasant experience is removed from the subject in exchange for increasing the desired behaviour. Examples: removing some of a class's homework for being quiet, or removing a dog's chain for not pulling it.

Positive punishment: an unpleasant experience is given to the subject in exchange for decreasing the undesired behaviour. Examples: getting a report for bad behaviour, or giving a dog a scolding for not obeying.

Negative punishment: a pleasurable experience is removed from the subject in exchange for decreasing the undesired behaviour. Examples: removing a class's break time for being disruptive, or removing a dog's toy for making noise.

Notice that these need not be manually applied by someone. It could be something like eating too many sweets giving you a stomach ache, which would be positive punishment.

Thursday, May 22, 2014

Boolean Algebra and Probability

Boolean Algebra is a field of algebra which formalizes the operations performed in logic, that is, on propositional statements (statements that are either true or false). This requires operations such as "AND", "OR" and "NOT" in order to compose complex statements from simpler ones, such as "it is raining" and "the road is wet" being composed into something like "NOT(it is raining) AND the road is wet". These operators work in the following way:

The statement "Math is awesome AND One Direction suck" is true because each of the simpler statements being joined together is true. If at least one of the simpler statements is false, then the whole thing becomes false.

The nice thing about this algebra is that it can be naturally extended from normal numeric algebra by representing a true statement with 1 and a false statement with 0. So then we can turn the statement "Math is awesome AND One Direction suck" into "1 AND 1" which is equal to "1". Why does it equal 1? Because for "X AND Y" to be true, both X and Y have to be true individually. For example the statement "Math is awesome AND One Direction make great music" is false because one of those substatements is false. With the OR operator, as long as at least one of the substatements is true, the whole thing becomes true. For example "Math is awesome OR One Direction make great music" is true because one of the substatements is true. The NOT operator takes one substatement and inverts it, that is, if it was true then it becomes false and if it was false then it becomes true. For example "NOT Math is awesome" is false and "NOT One Direction make great music" is true. Numerically speaking, the following list summarizes these rules:

0 AND 0 = 0
0 AND 1 = 0
1 AND 0 = 0
1 AND 1 = 1

0 OR 0 = 0
0 OR 1 = 1
1 OR 0 = 1
1 OR 1 = 1

NOT 0 = 1
NOT 1 = 0

Can there be normal arithmetic equations which will give us these outputs?

NOT is easy. What arithmetic equation returns 1 when given 0 and returns 0 when given 1?
NOT X = 1 - X
1 - 0 gives 1 and 1 - 1 gives 0.

AND is also easy. Just multiply the two operands together.
X AND Y = X × Y

OR is a little more tricky. This is because there is no single arithmetic operation that gives the same outputs as OR. However we can easily derive an equation by using De Morgan's laws, which express an OR using AND and NOT only. The statement "in order to apply for the job you need a degree or five years' experience" can be expressed as "do not apply for the job if you do not have a degree and have less than five years' experience". Formally this is written as

X OR Y = NOT((NOT X) AND (NOT Y))

which numerically becomes

X OR Y = 1 - ((1 - X)(1 - Y))
X OR Y = 1 - (1 - Y - X + XY)
X OR Y = 1 - 1 + Y + X - XY
X OR Y = Y + X - XY

This makes sense. The OR operator is the addition of the two substatements minus 1 in the case where they are both true (since XY is 1 only when both X and Y are 1). This will return 1 whenever at least one of the two substatements is a 1.
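These three equations can be checked against the truth tables above with a few lines of Python (a quick sketch of my own, not code from the post):

```python
def NOT(x):
    return 1 - x

def AND(x, y):
    return x * y

def OR(x, y):
    return x + y - x * y

# Verify every row of the truth tables above.
for x in (0, 1):
    assert NOT(x) == (1 if x == 0 else 0)
    for y in (0, 1):
        assert AND(x, y) == (1 if x == 1 and y == 1 else 0)
        assert OR(x, y) == (1 if x == 1 or y == 1 else 0)
print("all truth table rows match")
```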

But it also makes sense on a higher scale. If you think about it, these rules are the exact same rules for probability. Probability is a generalization of Boolean algebra. Rather than just considering truth and falsity like Boolean algebra does, probability considers values between those two cases, such that 1 is certainty, 0 is impossibility and all the numbers in between are a degree of uncertainty, where the closer the number is to 1 the more certain the statement is.

In fact AND and NOT are defined in exactly the same way in probability. But isn't OR defined simply as X + Y? That is in fact only the case when X and Y are events which are mutually exclusive, that is, cannot happen together. In that case X AND Y is always zero and hence can be dropped from OR's equation. The equation we have derived is useful for overlapping events as well, such as asking for the probability of "a fair die resulting in an even number or a number greater than 3". In this case you can't just add the probability of the die being even (3/6) and the probability of the die being greater than 3 (3/6), as otherwise you get 3/6 + 3/6 = 1, which would mean that you are certain this will happen. The problem is that it is possible for the two substatements to both be true, that is, for the die to be both even and greater than 3 (namely 4 and 6). To fix this, just subtract the probability of this special case occurring (2/6), which gives 3/6 + 3/6 - 2/6 = 2/3.

Let's illustrate it with a Venn diagram:

The Venn diagram above is showing two events as a red and blue circle. They have some overlap between them which is coloured purple. If the red circle represented the chance of a die being even and the blue circle represented the chance of a die being greater than 3, then the purple overlap would be the numbers 4 and 6. Now we want to find the probability of the red OR the blue circle occurring. If we just add their areas together we end up with the purple overlap being added twice since it is common to both circles, but we just want to add it once. So we subtract it once after adding both circles, which will result in having just one purple overlap. This is what is happening when you use the equation X + Y - XY.
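The die example can be verified by enumerating the six outcomes (a small sketch; the helper names are my own):

```python
from fractions import Fraction

outcomes = range(1, 7)

def prob(event):
    # Probability that a fair die roll satisfies the given predicate.
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), 6)

p_even = prob(lambda o: o % 2 == 0)            # 3/6
p_gt3 = prob(lambda o: o > 3)                  # 3/6
p_both = prob(lambda o: o % 2 == 0 and o > 3)  # 2/6 (rolling a 4 or a 6)

# X OR Y = X + Y - XY
p_either = p_even + p_gt3 - p_both
print(p_either)  # 2/3
assert p_either == prob(lambda o: o % 2 == 0 or o > 3)  # matches direct count
```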

Sunday, April 20, 2014

Don't edit batch files during runtime

So it turns out that Windows batch files running under cmd (at least in Windows 7) are not preloaded into memory before runtime. Instead each line is loaded from disk right before it is executed. What this means is that if you edit the batch file during its runtime, you'll end up modifying the behaviour of the batch file being run.

Try this:
echo started
timeout 10
echo continued
echo ended

This will make the batch file show "started", then wait for 10 seconds and finally show "continued" and "ended". To run it just paste the above code into Notepad and save it as "something.bat", then just double click the saved file.

Now open the batch file with Notepad again and run the file at the same time. As the batch file is waiting for 10 seconds, delete the "echo continued" line and save (before the 10 seconds are up). When the batch file continues, it will not show "continued" but will immediately jump to "ended".

Strange behaviour, since this sort of thing does not happen with Python, for example, which reads the whole script before running it. I noticed this when I was trying to run multiple versions of the same batch file by just changing something in the file and running a new instance of it. Be careful of this.

Tuesday, March 18, 2014

Random Indexing for word space context vectors

Last time I described what a term-document matrix is and how to make its use faster using Latent Semantic Analysis. Today I shall be explaining how to use Random Indexing as an alternative to LSA. You can read more about it here.

Unlike LSA, random indexing is a very easy and very fast online process. It is online in the sense that you don't need to first completely construct the co-occurrence matrix in order to use it, but can instead use it as you are counting co-occurrences. It also works just as well as LSA. Interested yet?

First of all, how do you construct a co-occurrence matrix? You start off with a matrix full of zeros with a row for each word in context (w1, w2, w3, w4) and a column for each neighbouring word (n1, n2, n3, n4):
   n1 n2 n3 n4
w1 0  0  0  0
w2 0  0  0  0
w3 0  0  0  0
w4 0  0  0  0
You go through the corpus and you encounter word w2 and near it is word n4. You record this encounter by incrementing the matrix element for w2-n4:
   n1 n2 n3 n4
w1 0  0  0  0
w2 0  0  0  1
w3 0  0  0  0
w4 0  0  0  0
You do this for every neighbouring word next to every word in context and finally you have your co-occurrence matrix.

Now this very same process can be viewed in a different way. Instead of incrementing single numbers, it can be viewed as vector addition. Forget the names given to the columns:
w1 0  0  0  0
w2 0  0  0  0
w3 0  0  0  0
w4 0  0  0  0
Instead assign a vector to each neighbour word where each vector consists of just zeros and a single 1. No two vectors can have a 1 in the same position. So these could be the vectors assigned to our 4 neighbour words:
n1: (1 0 0 0)
n2: (0 1 0 0)
n3: (0 0 1 0)
n4: (0 0 0 1)
Now when you encounter neighbour word n4 next to word in context w2, you just take w2's row, which is a vector, and add it to n4's vector:
w2: (0 0 0 0) +
n4: (0 0 0 1)
    (0 0 0 1)
Which gives you the updated matrix:
w1 0  0  0  0
w2 0  0  0  1
w3 0  0  0  0
w4 0  0  0  0
Do this for every neighbour next to every word in context and you have your co-occurrence matrix.
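The vector-addition view can be written out in a few lines of NumPy (a sketch with the toy word lists from above, not code from the post):

```python
import numpy as np

context_words = ["w1", "w2", "w3", "w4"]
neighbour_words = ["n1", "n2", "n3", "n4"]

# One-hot index vector per neighbour word: a single 1, each in a distinct position.
index_vectors = {n: np.eye(len(neighbour_words))[i]
                 for i, n in enumerate(neighbour_words)}

# The co-occurrence matrix is one row vector per word in context.
rows = {w: np.zeros(len(neighbour_words)) for w in context_words}

# Encountering neighbour n4 next to word in context w2 is a vector addition.
rows["w2"] += index_vectors["n4"]
print(rows["w2"])  # [0. 0. 0. 1.]
```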

All random indexing does is change the neighbour words' vectors. Instead of the vectors being as long as the number of neighbours, the vectors are shortened to a fraction of what they are supposed to be. Consequently the number of columns in the matrix will also be reduced. "But how can you put a 1 in a different position for every neighbour word?" I hear you ask. You don't. What you do is you still keep the vectors mostly zero, but now you randomly put a few 1s and -1s in it:
w1 0  0  0
w2 0  0  0
w3 0  0  0
w4 0  0  0

n1: ( 1  0 -1)
n2: ( 0  1 -1)
n3: ( 0 -1  1)
n4: (-1  1  0)
If we encounter word in context w2 with neighbour word n4, then the matrix will become:
w1  0  0  0
w2 -1  1  0
w3  0  0  0
w4  0  0  0
If word in context w2 also has word n2 as a neighbour, then the matrix will become:
w1  0  0  0
w2 -1  2 -1
w3  0  0  0
w4  0  0  0

Now it will not be possible to know how many times each neighbour word co-occurs with each word in context, but that's OK. What matters is that each time you encounter a neighbour word, it leaves its mark on the word in context's row and that the more often it is encountered, the more its mark is pronounced. This means that you can still find similar words by comparing their rows together. The difference is that the rows are shorter and hence faster to compare.

Notice the online nature of this process. At any point in the corpus traversal, the matrix is always in its compressed form and there are no complex transformations that need to be made on it. You can even add words to, or remove words from, the matrix without any further processing or extra data. You just have to find a compromise between how many columns to remove from the matrix and the individuality of the rows. The smaller the rows, the less of an individual mark each neighbour word will leave after adding its vector.
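A minimal random indexing sketch (the reduced dimensionality, seed and word lists are toy values of my choosing):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
reduced_dim = 3  # shorter than the true number of neighbour words

def make_index_vector(dim, nonzero=2):
    # Mostly zeros, with a few randomly placed 1s and -1s.
    v = np.zeros(dim)
    positions = rng.choice(dim, size=nonzero, replace=False)
    v[positions] = rng.choice([1.0, -1.0], size=nonzero)
    return v

neighbour_words = ["n1", "n2", "n3", "n4"]
index_vectors = {n: make_index_vector(reduced_dim) for n in neighbour_words}

rows = {w: np.zeros(reduced_dim) for w in ["w1", "w2", "w3", "w4"]}

# Each encounter still leaves the neighbour's mark on the context word's row.
rows["w2"] += index_vectors["n4"]
rows["w2"] += index_vectors["n2"]
print(rows["w2"])
```

Rows built this way can still be compared with a similarity measure such as cosine, only faster, since they are shorter.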

Wednesday, March 12, 2014

Summary of research paper "Unsupervised Lexical Substitution with a Word Space Model" by Dario Pucci, Marco Baroni, Franco Cutugno, and Alessandro Lenci

This is a summary of the research paper "Unsupervised Lexical Substitution with a Word Space Model" by Dario Pucci, Marco Baroni, Franco Cutugno, and Alessandro Lenci. The paper describes a way to automatically choose a word which can substitute another word in a sentence with minimal resources.

The system described in this paper was used for the EVALITA 2009 evaluation campaign, for which a manually constructed evaluation set in Italian was provided.

In order to determine how well a word can replace another in a context, two types of vectors are used: a contextualized vector and an out-of-context vector. The contextualized vector is a composition of out-of-context vectors.

First, out-of-context vectors are collected from the corpus for each word using the usual method of counting how often each word co-occurs with the word being represented by the vector. The counts are then weighted to give less importance to words that co-occur with many words.

In order to find a substitutable word for a target word in a sentence, the out-of-context vectors of each content word in the sentence are collected, weighted, and summed together. This is the contextualized vector. The word whose out-of-context vector is most similar to this contextualized vector is chosen as the best word to substitute.

Nitty gritty
A lemmatized and part of speech tagged 2 billion token Italian corpus was constructed from which to extract the out-of-context vectors. The 20,000 most frequent nouns, verbs, adjectives and adverbs were considered as content words. Out-of-context vectors for these content words were extracted by counting how often they appear with other content words in the same sentence. The counts were weighted using log likelihood ratios and the co-occurrence matrix of counts was compressed to half its size using random indexing.

To create a contextualized context vector, the context sentence is lemmatized and part of speech tagged. The sentence is then split into a left and right side with respect to the target word and each side is filtered of all non-content words. The words in the sentence sub-parts that are within a set number of content words from the target word are used to construct the contextualized sentence. Their out-of-context vectors are normalized, multiplied by "1/(4*d)" where "d" is the distance of the word in the sentence sub-part from the target word (the closest content word has a distance of 1), and summed together.

Finally, cosine measure is used to find the most substitutable words by measuring the distance of the contextualized context vector to the out-of-context vectors of all the words that have the same part of speech tag as the target word.
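Putting the last two paragraphs together, the contextualized vector might be built like this (a sketch; the toy vectors and function names are invented for illustration, not taken from the paper):

```python
import numpy as np

def contextualized_vector(context, out_of_context):
    # context: list of (word, distance) pairs, where distance counts content
    # words between this word and the target (the closest word has distance 1).
    total = None
    for word, d in context:
        v = np.asarray(out_of_context[word], dtype=float)
        v = v / np.linalg.norm(v)   # normalize
        v = v / (4 * d)             # weight by 1/(4*d)
        total = v if total is None else total + v
    return total

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy out-of-context vectors for a target's context and one candidate.
ooc = {"road": [1.0, 2.0, 0.0], "wet": [0.0, 1.0, 1.0], "street": [1.0, 1.9, 0.1]}
ctx = contextualized_vector([("road", 1), ("wet", 2)], ooc)

# Candidates with the target's part of speech would be ranked by this score.
print(cosine(ctx, np.asarray(ooc["street"])))
```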

The system was evaluated using an evaluation set which consisted of sentences, each with a target word and substitutable words suggested by human annotators. Each suggestion included the number of annotators that suggested it. When comparing the system's best suggested substitutable word with the annotators' most popular suggestion, the system matched on 10.84% of the evaluation set items (those that included a most popular suggested word).

Sunday, February 16, 2014

Proof that there is only one prime number preceding a square number and any other power number (or "What is the use of factoring polynomials?")

Students are made to study how to factor polynomials but are never given an interesting task that involves using polynomial factorization. Here is a use that I find interesting.

Let's say we have the polynomial "x^2 - 1". This is a difference of two squares, which means that you can easily factorize it into "(x+1)(x-1)". But what does this factorization mean?

It means that any number which precedes a square number can always be factored into two numbers: one more than the number that was squared and one less than the number that was squared. For example, 15 precedes the square of 4, that is "4^2 - 1", so we can immediately conclude that "4+1" (5) and "4-1" (3) are factors of 15. Another example is 99 which precedes the square of 10, that is "10^2 - 1", so we can immediately conclude that "10+1" (11) and "10-1" (9) are factors of 99.
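This is easy to check for any x (a trivial sketch of my own):

```python
# x^2 - 1 always factors into (x + 1)(x - 1).
for x in (4, 10, 25):
    n = x * x - 1
    assert n == (x + 1) * (x - 1)
    assert n % (x + 1) == 0 and n % (x - 1) == 0

print(4 * 4 - 1, "=", 4 + 1, "*", 4 - 1)  # 15 = 5 * 3
```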

OK so maybe this is interesting, but can we use this for something more interesting?

Are there any numbers preceding a square number that are prime numbers? If there is such a number, then its factors must be 1 and itself. So we are looking for a number of the form "x^2 - 1" such that it can only be factored into 1 and the number itself (which is prime). We said that any number of the form "x^2 - 1" can always be factored into two numbers, "x+1" and "x-1". So if this number, "(x+1)(x-1)", can only be factorized into 1 and itself, then surely one of these two factors must be 1. Which of these factors can be equal to 1? The "x+1" or the "x-1"? Since "x" is a positive number (the number that is squared), it can only be the "x-1" factor.

Will the "x-1" factor equalling 1 automatically make the other factor, "x+1", a prime number? No, because the "x+1" factor could itself be factorizable by other numbers. The fact that "x-1" equals 1 means that "(x+1)(x-1)" could be prime. But if "x-1" is not equal to 1 then, given that "x+1" cannot be equal to 1, "(x+1)(x-1)" is surely factorizable by "x-1", making it not a prime.

In order for "x-1" to equal 1, "x" must be 2. In this case, the number preceding a square number is "2^2 - 1" which is 3. Notice that the other factor, "x+1", would also equal 3 when "x" is 2, which makes sense since we said that one of the factors must be 1 and the other must be the number itself. Is 3 a prime number? Yes. So when "x" is 2, "x-1" is 1 and "x+1" is a prime number. Therefore 3 is a prime number which precedes a square number.

Are there any other prime numbers which precede a square number? No, because if "x-1" is anything else but 1, then neither of the two factors will be 1, making the number factorizable into two numbers which are not 1 and itself, hence not being a prime number. This means that 2 is the only number "x" can be, which means that there are no other numbers which precede a square number that are prime.

It's interesting to also note that, using the exact same method, one can prove that 7 is the only number preceding a cube number that is prime. Use this website to find the factors of numbers preceding other powers of "x".

It seems that, for every power "n", there can be only one number preceding an nth power of "x" that can be prime. Can we prove this?

What will the factors of "x^n - 1" be for any "n"? Regardless of "n", will "x-1" always be a factor of "x^n - 1"? That is, will 1 always be a root of "x^n - 1"? Yes, because "1^n - 1" will always be 0, for all "n". So we can be sure that for "x^n - 1", regardless of "n", "x-1" will always be a factor. This means that "x^n - 1" will always factor into "(x-1)(...)" where the "..." is some other polynomial factor. Since "x-1" is equal to 1 only when "x" equals 2, "2^n - 1" is the only number preceding the power number "x^n" that can be a prime number. Of course the other factor could itself be factorizable into other numbers, making "(x-1)(...)" not a prime number even when "x" is 2, but if "x" is not 2 then "(x-1)(...)" certainly cannot be a prime number because "x-1" will be a factor.

So, regardless of what "n" is, there can only be one number which precedes "x^n" that can be prime, and that is "2^n - 1", since "x^n - 1" can always be factored into "(x-1)" and some other factor that cannot equal 1. The other factor might not be prime, but if it is, then it is the only prime preceding an nth power for all "x". Such prime numbers are called Mersenne primes.

Is this useful? It might be useful for testing whether a number is a prime number by checking if the number precedes a number raised to some power, "x^n". If it does, then it can only be a prime number if it is of the form "2^n - 1"; otherwise it will always be factorizable by "x-1", regardless of what "x" or "n" are.
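The whole argument can be sanity checked by brute force (a sketch using a naive trial-division primality test):

```python
def is_prime(n):
    # Naive trial division, fine for small numbers.
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# Among all numbers of the form x^n - 1, only x = 2 can ever give a prime.
prime_cases = [(x, n) for x in range(2, 30) for n in range(2, 12)
               if is_prime(x ** n - 1)]
assert all(x == 2 for x, n in prime_cases)
print(prime_cases)  # the Mersenne primes 3, 7, 31, 127 in this range
```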

Saturday, January 18, 2014

Term-Document matrices and Latent Semantic Analysis

Latent Semantic Analysis, or LSA, is a process used in natural language processing in order to find similar words or documents in a term-document matrix.

Term-Document matrix
A term-document matrix is a matrix which tells you how often a given word is found in a given document. Here is an example taken from the paper "Indexing by Latent Semantic Analysis" by Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman:


In this example, there are 9 nameless documents (they could be websites, Wikipedia articles, anything) and 12 selected terms. Each number is the number of times the given term is found in the given document. For example the term "human" is found once in D1 but never occurs in D2.

Term-document matrices are neat because the meaning of both the words and the documents can be characterized by using a row or a column from the matrix. For example the row belonging to the term "human" is [1, 0, 0, 1, 0, 0, 0, 0, 0]. This says something about its meaning because it occurs in D1 and D4. Other words which are also found in the same documents would imply that they are somehow related or similar. On the other hand the column belonging to the document D1 is [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]. This says something about its meaning as well because it contains the words "human", "interface" and "computer" only. Other documents which also contain the same words would imply that they are also somehow related or similar.

This means that by comparing their respective row or column, that is, their vector, you can determine how similar two words or documents are. You can even create a search engine which searches the documents based on keywords by treating the keyword list as a tiny document and comparing its vector with the vectors of the documents.

How can you compare vectors? You can use a vector similarity function such as Cosine similarity which given two vectors will give you a number between -1 and 1, depending on how similar they are.
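Cosine similarity is straightforward to implement (a small sketch of my own, using the "human" and "user" rows from the example matrix):

```python
import math

def cosine_similarity(u, v):
    # Dot product of the vectors divided by the product of their lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

human = [1, 0, 0, 1, 0, 0, 0, 0, 0]
user = [0, 1, 1, 0, 1, 0, 0, 0, 0]
print(cosine_similarity(human, human))  # 1.0 -- identical vectors
print(cosine_similarity(human, user))   # 0.0 -- no shared documents
```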

Unfortunately this is not always enough, because words in isolation do not discriminate meaning well. This is because of synonymous words (different words which mean the same thing, such as "couch" and "sofa") and polysemous words (the same word meaning different things, such as "bank account" and "river bank"). This means that in the above matrix, the words "human" and "user", which should be similar, have the following vectors:
[1, 0, 0, 1, 0, 0, 0, 0, 0]
[0, 1, 1, 0, 1, 0, 0, 0, 0]

These vectors do not hint at a similarity between the two words. However, although they appear in different documents, they may co-occur with other words which are shared between these documents. For example both "human" and "user" occur in documents which contain the words "computer" and "interface". This indirect co-occurrence can be harnessed by using LSA.

Latent Semantic Analysis
LSA uses linear algebra to bring out these latent (hidden) co-occurrences. In particular, it uses Singular Value Decomposition (SVD). We will not go into the method used to perform this operation but will instead assume that there is a library which does it for us; in particular we will be using Python's NumPy. SVD will basically decompose a matrix into 3 matrices which, when multiplied together, will give the original matrix again. These matrices are called U, S and V. Here is the Python 2.7 code with NumPy which does this:

import numpy

a = numpy.matrix("""
1 0 0 1 0 0 0 0 0;
1 0 1 0 0 0 0 0 0;
1 1 0 0 0 0 0 0 0;
0 1 1 0 1 0 0 0 0;
0 1 1 2 0 0 0 0 0;
0 1 0 0 1 0 0 0 0;
0 1 0 0 1 0 0 0 0;
0 0 1 1 0 0 0 0 0;
0 1 0 0 0 0 0 0 1;
0 0 0 0 0 1 1 1 0;
0 0 0 0 0 0 1 1 1;
0 0 0 0 0 0 0 1 1
""")

(U, s, V) = numpy.linalg.svd(a, full_matrices=False)
S = numpy.diag(s)

print U
print S
print V




Rank reduction
If U, S and V are multiplied together in that order, you get the original matrix again (approximately, due to rounding errors). Notice that the second matrix S is a diagonal matrix, that is, every element in the matrix is zero except those on the diagonal. I will not go into the mathematics of this, but if you take S and replace the smallest numbers on the diagonal with zeros, then multiply the three matrices together again, you get a matrix which is similar to the original matrix but which has some important differences. The more of the smallest values in the diagonal are replaced with zeros, the more similar the rows of related words become. This only works up to a point of course: there has to be a balance between how many values are replaced with zeros and how much information is left in the matrix. Conveniently, the values in the diagonal are sorted in descending order. If we replace the smallest 7 values with zeros, we get the following:

S[2][2] = 0.0
S[3][3] = 0.0
S[4][4] = 0.0
S[5][5] = 0.0
S[6][6] = 0.0
S[7][7] = 0.0
S[8][8] = 0.0

A = U*S*V
print A


Now if we take the first and fourth rows which belong to "human" and "user", we get these:
0.162057973899 0.400498283106 0.378954540321 0.467566261179 0.175953674216 -0.05265494658 -0.115142842816 -0.159101981799 -0.0918382678755
0.258049326846 0.841123434512 0.60571994881 0.697357170797 0.392317949475 0.0331180051639 0.0832449070523 0.121772385778 0.187379725005
These rows are much more similar than they were before and they are more similar than other rows in the matrix.
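This claim can be checked numerically by comparing cosine similarities before and after the rank reduction. The following sketch repeats the matrix and the SVD from above so that it can be run on its own (using NumPy arrays rather than the matrix class):

```python
import numpy

# the same 12x9 term-document matrix as above
a = numpy.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

U, s, V = numpy.linalg.svd(a, full_matrices=False)
s[2:] = 0.0                       # zero out the 7 smallest singular values
A = U.dot(numpy.diag(s)).dot(V)   # rank-2 reconstruction

def cosine(v1, v2):
    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))

# rows 0 and 3 are "human" and "user"
print(cosine(a[0], a[3]))  # 0.0 before the reduction - no shared documents
print(cosine(A[0], A[3]))  # much higher after it
```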

This is called a rank reduction because the effect of replacing the values with zeros was that you get a matrix with a smaller "rank" which is just a numeric property of matrices.

Dimension reduction
The problem with a rank reduction is that the matrix suddenly becomes full of non-zero numbers where it used to be mostly zeros. This means that using the matrix requires a lot of processing, which is slow. So a slight modification is used instead which has the effect of making the matrix smaller, whilst still exposing its latent co-occurrences. After zeroing out the values, rather than multiplying all 3 matrices, only U and S or S and V are multiplied together, depending on whether you want to compare rows (terms) or columns (documents).

S[2][2] = 0.0
S[3][3] = 0.0
S[4][4] = 0.0
S[5][5] = 0.0
S[6][6] = 0.0
S[7][7] = 0.0
S[8][8] = 0.0

A = U*S
print A


The columns of zeros on the right of the matrix correspond to the values which were zeroed out in the diagonal of S. We can simply remove these columns.
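As a sketch of how that removal might look, here is a toy stand-in for U*S after the zeroing (the numbers are illustrative), with the all-zero columns dropped by boolean slicing:

```python
import numpy

# toy stand-in for U*S after zeroing: the rightmost columns are all zeros
A = numpy.array([[-0.74, -0.29, 0.0, 0.0],
                 [-1.35,  0.15, 0.0, 0.0]])

# keep only the columns which are not entirely zero
keep = ~numpy.all(A == 0, axis=0)
reduced = A[:, keep]
print(reduced.shape)  # (2, 2)
```

Since the diagonal values are sorted, an equivalent and simpler option is to slice off the first two columns directly with `A[:, :2]`.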


Now we have only two columns to compare. The rows corresponding to "human" and "user" are:
-0.739507219222 -0.287668746646
-1.34837688543 0.145055532965

This is called a dimension reduction because the vectors being compared become smaller, which allows for faster computation. The same thing could have been done to compare documents by multiplying S and V instead, and then removing the bottom rows instead of the columns on the right.
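Putting the whole pipeline together as a self-contained sketch: scaling the first two columns of U by the two surviving singular values gives 2-dimensional term vectors (U*S with the zero columns already dropped), and the same cosine comparison now runs on much shorter vectors. Document vectors come from S*V in the same way.

```python
import numpy

# the same 12x9 term-document matrix as above
a = numpy.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

U, s, V = numpy.linalg.svd(a, full_matrices=False)
k = 2  # number of singular values kept

# U*S with the zero columns dropped: 12 term vectors of length 2
term_vectors = U[:, :k] * s[:k]
# S*V with the zero rows dropped, transposed: 9 document vectors of length 2
doc_vectors = numpy.diag(s[:k]).dot(V[:k]).T

def cosine(v1, v2):
    return numpy.dot(v1, v2) / (numpy.linalg.norm(v1) * numpy.linalg.norm(v2))

# rows 0 and 3 of term_vectors are "human" and "user"
print(cosine(term_vectors[0], term_vectors[3]))  # much higher than the 0.0 of the raw rows
```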