Monday, June 10, 2013

Fitness function for multi objective optimization / Mapping vectors/tuples to numbers

When you want to let the computer discover how to best solve a problem, you can use a genetic algorithm to evolve a solution. For example, if you're trying to find which schedule of lessons will cause the fewest clashes (hopefully none), you create a population of possible schedules and let them evolve into better and better schedules. A more interesting example would be evolving a program which does something in particular, such as controlling a robot.

In genetic algorithms you have to supply a function called a fitness function which returns a number which quantifies how good a particular solution is at solving the problem. For example, in the case of the time table scheduler, a fitness function would be the number of clashes between lessons that are caused by a given schedule. The smaller this number, the better the schedule. If you're trying to evolve a program which generates prime numbers, then a fitness function would be the percentage of numbers generated which are actually prime numbers. The higher this number, the better the program.

The problems start when you want to maximize or minimize more than one thing, that is, when you have a number of separate fitnesses that you want to optimize together. An example of this is evolving your schedule such that both the number of lesson clashes (when students are expected to be in two different lessons at the same time) and the number of room clashes (when the same room is assigned to be used for more than one lesson at the same time) will be minimized. In the case of the program, you usually not only want to evolve a correct program, but also one which gives the correct output quickly. This is called multi objective optimization.

What is usually done is a weighted summation of the individual fitnesses. So you say that the correctness of a program is, say, twice as important as the speed of the program, so the fitnesses are combined as
fitness = 2 * percentage_correct_outputs + time_taken
Of course this is not correct since the correctness must go up whilst the time should go down. So a better combination might be
fitness = 2 * percentage_correct_outputs + 1/(time_taken + 1)
You might not think that this is the best fitness function but choosing a correct fitness function is part of the difficulty of using genetic algorithms so have fun finding a better one.
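As a concrete sketch, the weighted summation above could be written like this (the weight of 2 and the argument names are just the illustrative values from the formulas above):

```python
def weighted_fitness(percentage_correct_outputs, time_taken):
    """Weighted summation of two sub fitnesses: correctness is
    weighted twice as heavily, and the time taken is inverted so
    that faster programs score higher."""
    return 2 * percentage_correct_outputs + 1 / (time_taken + 1)

# A fully correct but slow program versus a useless but instant one.
print(weighted_fitness(1.0, 10.0))  # correct and slow
print(weighted_fitness(0.0, 0.0))   # does nothing, takes 0 seconds
```

Note that the do-nothing program already scores almost half as much as the fully correct one, which hints at the problem described next.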

The problem with weighted summations is that the fitness will increase by simply increasing one of the sub fitnesses. So in the previous example, you can get a good fitness by simply creating a program that does nothing and takes 0 seconds to do so. This program is much easier to find than a program which is correct, so the genetic algorithm tends to improve the execution time only and gets stuck, since a program which increases the correctness will be much, much slower and will thus reduce the fitness rather than increase it. A better way to combine the sub fitnesses is needed.

I found two ways to do this for two different needs. The first is for when the sub fitnesses are of equal importance, for example the time table scheduler minimizing the number of lesson and room clashes. The second is for when the sub fitnesses are prioritized such that the most important sub fitness should always be improved regardless of how the second is affected, for example the prime number generator being correct has the highest priority but between two equally correct programs, the faster one is preferred.

Equal priority fitness

If both sub fitnesses must be optimized together such that improving one without the other is not helpful, then we might add the two fitnesses together and subtract their absolute difference like this:

fitness = (sub1 + sub2) - |sub1 - sub2|

This way, once one sub fitness pulls ahead of the other, improving it further increases the difference as much as the sum, so the whole fitness gains nothing. We can simplify this equation by calling the largest sub fitness "max" and the smallest sub fitness "min":

fitness = (max + min) - (max - min)
fitness = max + min - max + min
fitness = 2*min

So this fitness is equivalent to finding twice the minimum of the sub fitnesses, which makes sense since if you only care about increasing the minimum fitness, then both fitnesses will increase together. Of course multiplying by 2 is redundant since comparing a minimum with another minimum and comparing twice a minimum with twice another minimum will give the same result, so our final fitness function is:

fitness = min(sub1, sub2)

and you can add more sub fitnesses by just finding the minimum of all of them:
fitness = min(sub1, sub2, sub3, ..., subn)

So in our time table example, the fitness function would be:
fitness = min(-lesson_clashes, -room_clashes)

The negations are used so that, in order to increase the fitness, the numbers of clashes have to be decreased.
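A minimal sketch of this equal priority fitness, taking the two clash counts as arguments:

```python
def timetable_fitness(lesson_clashes, room_clashes):
    """Equal priority fitness: the negations turn the two counts,
    which must be minimized, into values to be maximized, and min
    forces both to improve together."""
    return min(-lesson_clashes, -room_clashes)

# Reducing only one kind of clash does not help while the other lags.
print(timetable_fitness(3, 5))  # -5
print(timetable_fitness(0, 5))  # still -5
print(timetable_fitness(2, 2))  # -2: improving both raises the fitness
```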

Prioritized fitness

When one sub fitness is more important than the other, such as program correctness versus execution time, a different method is needed. What is needed is a function "f" which combines the two sub fitnesses ("sub_1" and "sub_2" where sub_1 has a greater priority than sub_2) such that the following conditions are met:

f(subA_1, subA_2) > f(subB_1, subB_2)
if subA_1 > subB_1 or
   subA_1 = subB_1 and subA_2 > subB_2

f(subA_1, subA_2) = f(subB_1, subB_2)
if subA_1 = subB_1 and subA_2 = subB_2

So for example f(2, 1) > f(1, 100), f(5, 20) > f(5, 1) and f(3, 3) = f(3, 3).

This is called a lexicographical ordering of the sub fitnesses. With strings this is natural. When sorting names, "John" comes before "Zach" because "J" comes before "Z", regardless of what the second letter is. But "James" comes before "John" because, since the first letters are the same, we look at the second one, and "a" comes before "o".

The problem we're facing is with finding a function which when given a pair of numbers will return a number which can be used to sort a list of pairs of numbers in lexicographical order. This is similar to Cantor's mapping of pairs of numbers to the natural numbers, except that this time we want a particular mapping which is lexicographic.

A natural lexicographic ordering in numbers comes when we look at numbers with a decimal point such as 2.1 and 3.5. The whole number part will always determine the ordering, unless they are equal, in which case the fractional part is then used. So if we can map our pairs of sub fitnesses to real numbers in such a way that the high priority sub fitness becomes the whole number part and the low priority sub fitness becomes the fractional part, then we would have found our "f".

What I found was that you can do this by squashing the low priority number into a proper fraction (a number between 0 and 1) and then adding it to the high priority number. Unfortunately this method will only work if the high priority number is a whole number; the low priority number can be a real number however. In order to squash the low priority number, you can use a sigmoid function, which given any number will return a number between 0 and 1 in such a way that ordering is preserved (sigmoid(a) > sigmoid(b) if and only if a > b). Another way would be to use the hyperbolic tangent or the arc tangent, modified so that their output is between 0 and 1 (y = (tanh(x)+1)/2 and y = (atan(x)+pi/2)/pi).
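All three squashing functions mentioned above can be sketched as follows; each maps any real number into the interval (0, 1) while preserving order:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1 / (1 + math.exp(-x))

def squash_tanh(x):
    """Hyperbolic tangent shifted and scaled into (0, 1)."""
    return (math.tanh(x) + 1) / 2

def squash_atan(x):
    """Arc tangent shifted and scaled into (0, 1)."""
    return (math.atan(x) + math.pi / 2) / math.pi

# Ordering is preserved: a > b implies squash(a) > squash(b).
for squash in (sigmoid, squash_tanh, squash_atan):
    assert squash(-100) < squash(0) < squash(100)
```

One practical caveat: in floating point the squashed values saturate at 0 or 1 for large inputs, so very large low priority values become indistinguishable.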

So the combined fitness would be:
fitness = sub1 + sigmoid(sub2)

The nice thing about this is that you can add more sub fitnesses in the following way, assuming that only the least important sub fitness is a real number whilst the rest are integers:
fitness = sub1 + sigmoid(sub2 + sigmoid(sub3 + ... + sigmoid(subn)...))

So in our program example, the fitness function would be:
fitness = percentage_correct_outputs + sigmoid(-time_taken)

The time taken is negated so that, just as with the clashes earlier, faster programs get a larger fractional part.

It should be noted that this is a way to map vectors/tuples of integers to real numbers whilst preserving lexicographic ordering.
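A sketch of this mapping for tuples of integers, using the logistic sigmoid as the squashing function and checking that the resulting real numbers sort in the same order as the tuples:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def lex_value(subs):
    """Map a tuple of integers to a real number whose ordering
    matches the lexicographic ordering of the tuples: the last
    (least important) value is squashed into (0, 1), added to the
    one before it, squashed again, and so on."""
    value = subs[-1]
    for sub in reversed(subs[:-1]):
        value = sub + sigmoid(value)
    return value

# Tuples listed in lexicographic order map to increasing reals.
tuples = [(1, 100), (2, 1), (2, 3), (3, 0)]
values = [lex_value(t) for t in tuples]
assert values == sorted(values)
```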


Now we can even use both of the combinations in order to combine sub fitnesses in complex ways where some are of equal priority whilst others are of different priority. For example, say we are evolving a time table schedule which minimizes lesson clashes, minimizes room clashes and minimizes density such that you avoid cramming all the lessons in one day. The lesson and room clashes are of equal priority but the density has a lower priority than the other two. So the fitness function might be:

fitness = min(-lesson_clashes, -room_clashes) - sigmoid(density)

This will give us a real number which has a whole number part representing the number of undesirable clashes and a fractional part representing the density, both of which must be minimized. Since fitness should be increased, minimization is done by using the negations and subtraction.
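Putting the two combinations together, the combined schedule fitness above can be sketched as:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def schedule_fitness(lesson_clashes, room_clashes, density):
    """Clashes are of equal priority (combined with min); density
    has a lower priority and is subtracted as a fraction so that it
    only decides between schedules with equally many clashes."""
    return min(-lesson_clashes, -room_clashes) - sigmoid(density)

# Fewer clashes always wins, regardless of density...
assert schedule_fitness(1, 1, 9.0) > schedule_fitness(2, 1, 0.0)
# ...but with equal clashes, the less dense schedule wins.
assert schedule_fitness(1, 1, 2.0) > schedule_fitness(1, 1, 5.0)
```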


  1. Very Nice. Solved my problem. Thank you.

  2. This is exactly what i needed! Thank you for this perfect explanation.

  3. Does the equal priority fitness function work for minimization problems?

    1. Yes but you need to take the maximum instead of the minimum. Decreasing the maximum will make both values go down.

  4. Many thanks, very nice and clear explanation! Would you be able to provide some reference in order to deepen your work and/or cite it?

    1. Thanks! No I don't have any references to supplement this post. This particular post is stuff I thought of when trying to solve problems I had, not academic work. If this kind of problem pops up in my current line of study I'll let you know.

  5. This is perfect, but you haven't mentioned Pareto efficiency! Can I use this formula as a fitness function without Pareto? I am a little confused!
    My fitness is min(execution_time, -throughput, power_consumption)
    I am planning to sum the 3 values, so f(x) = execution_time - throughput + power_consumption. Is that right?

    1. The minimum function is used so that you do not get to one extreme of the Pareto frontier by just optimizing one dimension that happens to be easy to optimize. You force the optimization algorithm to optimize all dimensions at once so that you get to a more interesting part of the Pareto frontier.

      You cannot just sum your subobjectives (or take their minimum) without first normalizing them to a common scale. For example, with your function you are assuming that one second of execution time is equal to one watt of power consumption. Are the two amounts equally beneficial? Would you sacrifice a watt of power consumption in order to gain a second in execution time? You have to first find a way to scale them (such as by multiplying them by a constant) such that one unit in a subobjective is equivalent to one unit of another objective. Of course by summing the values instead of taking their minimum you risk optimizing just one of the values.

  6. To combine several objective functions using "min" it seems to me they should be normalized to a comparable scale. But how do you normalize time_taken to a comparable scale like percentage_outputs_taken? It's comparing apples to lemons really. I agree that (1/(time_taken+1)) will vary between 0 and 1, whereas correct_percentage_taken can also be between 0 and 1 if you write it as a float, but depending on whether you put time_taken in milliseconds or years you will effectively assign a wildly different weight to one or the other objective function. The devil's in the details :)

    1. That is very true Stefaan since here I am assuming that a decrease of 0.1 in the normalized time taken is equal to a decrease of 0.1 in normalized output correctness. Plus normalization is probably not going to be a linear function so decreasing normalized time taken from 0.5 to 0.4 is not the same as decreasing from 0.2 to 0.1.

      Although there might be simple fixes to this by using different weights between two normalized variables (make correctness twice as important as run time) or even by biasing the normalization in order to give more weighting to some part of the domain over another (see my earlier blog post). The point here is to avoid the problem of having an optimizer latch on to an easily optimized variable whilst ignoring all the others. You want to avoid the situation where the optimizer immediately goes to a Pareto optimal value where the returned optimized program takes zero seconds to execute but does nothing. This trick with using "min" will stop this from happening and will force the optimizer to look for a way to reduce the other variables as well.

      So while what you're saying about normalized scales is true, it is still better than a simple sum of unnormalized variables.