Wednesday, March 31, 2021

How to check which Python installation/environment is being used within Python

Say you have several Python environments, how can you check that you're using the right one whilst running a Python script? The 'executable' constant in the 'sys' module will give you the path to the Python executable currently being used, which is also the path to the environment.
import sys
print(sys.executable)

Sunday, February 28, 2021

Micro averaged precision, recall, and F1 are always the same

This is something that causes confusion when it happens because it isn't very well known. If you're evaluating a multi-class classifier using micro-averaging, the precision, recall, and F1 scores will be exactly the same, always. In fact this is explained in the sklearn documentation as "Note that if all labels are included, “micro”-averaging in a multiclass setting will produce precision, recall and that are all identical to accuracy.". Let's prove this, first using an example and then more generally.

Let's say that your classifier's classes (or labels) are A, B, and C and that you have the following classifier predictions together with the true classes:

predtrue
AA
AA
BA
CA
AB
BB
BC
CC
CC

To measure the micro-averaged precision, recall, and F1 you need to first measure the number true positives (TP), false positives (FP), and false negatives (FN) of each class. For example, if we look at just class A, the TP is the number of rows where A was predicted and also the true class, that is, 2 (first two rows). The FP is the number of rows where A was predicted but was not the true class, that is, 1 (the fifth row). The FN is the number of rows where A was not predicted but was the true class, that is, 1 (the third row).

ClassTPFPFN
A212
B121
C211

You calculate the micro-averaged precision ($P$), recall ($R$), and $F1$ for labels $L$ as follows:

  • $P = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FP}_l)}} = \frac{2+1+2}{(2+1)+(1+2)+(2+1)} = \frac{5}{9} = 55.6\%$
  • $R = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FN}_l)}} = \frac{2+1+2}{(2+2)+(1+1)+(2+1)} = \frac{5}{9} = 55.6\%$
  • $F1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times \frac{5}{9} \times \frac{5}{9}}{\frac{5}{9} + \frac{5}{9}} = 55.6\%$
So we got all the scores the same. If $P$ and $R$ are the same than $F1$ will be the same because the harmonic mean of the same number is that number (just like the arithmetic mean). We can easily prove this: $$ \frac{2 \times X \times X}{X + X} = \frac{2X^2}{2X} = X $$ This is assuming that $X$ is not zero, which would mean that the precision and the recall are both equal to zero, which is impossible.

So the question we should ask is why are the precision and recall always the same. If the precision and recall are always the same, then their definition should be identical. Let's see if we can simplify their identity. $$ P = R \\ \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FP}_l)}} = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FN}_l)}} \\ \sum_{l\in L}{({TP}_l + {FP}_l)} = \sum_{l\in L}{({TP}_l + {FN}_l)} \\ \sum_{l\in L}{{TP}_l} + \sum_{l\in L}{{FP}_l} = \sum_{l\in L}{{TP}_l} + \sum_{l\in L}{{FN}_l} \\ \sum_{l\in L}{{FP}_l} = \sum_{l\in L}{{FN}_l} $$

So it all boils down to whether or not the sum of each label's false positives is always equal to the sum of each label's false negatives. Let's say that you have the following row in the first table:

predtrue
AB

This is an error, but is it a false positive or a false negative? From the perspective of label A it's a false positive, but from the perspective of label B it's a false negative. So for every error in the table, it will be both a false positive for the label in the prediction column and a false negative for the label in the true column. This means that the number of false positives and false negatives will always be balanced, as with every false positive for one label there will be a false negative for another label.

And with this we have proven that the number of false negatives will always equal the number of false positives, which further proves that the precision will always be equal to the recall, which further proves that the F1 score will always be equal to them as well.

Wednesday, January 27, 2021

The secretary problem

Say you're dating several partners and want to get into a serious relationship eventually. After how many partners can you have a good idea of what a good partner is like? This is called the secretary problem, and it can be formalised as follows:

Let $n$ be the number of partners in total you can date. Let $A_i$ be the fitness of the $i$th partner, such that the order of the partners is random and we are interested in the partner with the maximum fitness. Let $r$ be the number of partners you date before you start looking for a serious relationship. Once a partner has been evaluated, you cannot go back to them if you try the next partner. What you're doing is going through all partners $i$ starting from 1 up to $r$, at which point you will take $\max_{i=1}^{r}(A_i)$ as a reference of what a good partner looks like and continue going through the rest of the partners $r$ to $n$ and stop as soon as $A_i$ is greater than the reference ($\max_{i=1}^{r}(A_i)$). If none of the partners are better than the best of the first $r-1$ then you settle for the last partner. The question is, what should $r$ be such that the probability of picking the partner with maximum $A_i$ is maximised.

The choice of $r$ actually makes a significant difference, as we can show in the following simulation in Python 3 (in Python everything needs to be 0-based such that the first item is at position 0 rather than 1):

import random

def select(A, r):
    ref = max(A[:r]) # first r items
    for i in range(r, len(A)): # rest of the items
        if A[i] > ref:
            return i
    return len(A)-1 # return last item if search failed

n = 20
for r in range(1, n): # first item not included
    successes = 0
    for _ in range(10000):
        A = [random.random() for _ in range(n)]
        best = max(range(n), key=lambda i:A[i])
        choice = select(A, r)
        if choice == best:
            successes += 1
    print(r+1, successes/10000)
2 0.1746
3 0.2532
4 0.3099
5 0.3459
6 0.3581
7 0.3856
8 0.3927
9 0.3866
10 0.3717
11 0.366
12 0.3382
13 0.3227
14 0.2918
15 0.263
16 0.2228
17 0.1908
18 0.149
19 0.0969
20 0.0474

These are the fraction of times the best partner is selected out of 10000 simulations for each $r$ from 2 to 20. You will notice that the fraction steadily increases up to a peak at 8 and then steadily decreases until the end. It's interesting that the peak is not in the middle of the list but closer to the beginning. How can we find what the best $r$ would be for any $n$?

First, we need to find the probability of choosing the best partner for a given $r$ and $n$. This probability can be expressed as two events happening together: you stopping at element $i+1$ and the maximum element being element $i+1$ (we're using $i+1$ instead of just $i$ because it will be mathematically convenient later). These two events are independent and so we can find the probability of each one and then multiply them together.

The probability of the maximum element from all $n$ elements being the $i+1$th element is $\frac{1}{n}$.

The other probability is a little trickier. To stop at element $i+1$, all the elements prior to $i+1$ have to be less than the reference we got from the first $r$ elements. If a larger element is found sooner, we would have stopped there. Therefore, we are interested in the probability that the maximum element from the first $i$ elements is among the first $r$ elements, which is $\frac{r}{i}$. This is assuming that $i+1 \geq r+1$, since we cannot choose any partner before the $r+1$th partner.

Multiplying these two probabilities, we get $\frac{r}{n \times i}$. But this is the probability for a given $i+1$, whereas we want the probability for any $i+1$. Each $i+1$ is mutually exclusive so we can add up the probability of each $i+1$. Given that $i+1$ must be greater than or equal to $r+1$ by definition,

$$P(r, n) = \sum_{i=r}^{n}\frac{r}{n \times i} = \frac{r}{n}\sum_{i=r}^{n}\frac{1}{i}$$

Great, so now we have the probability of making a successful selection given $r$ and $n$. Now how do we find at what $r$ is this maximum? Normally we would find the derivative of this equation with respect to $r$ and see what $r$ has to be for the derivative to be 0 (the turning point), but we can't differentiate this expression because $r$ is discrete. We also cannot treat $r$ as continuous because it is used as the starting point in the summation. But if we focus on finding $\frac{r}{n}$ rather than $r$, then we can do so in a round about way.

Start by multiplying $\frac{1}{n}$ with the expression like this:

$$\frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n}$$

We can do this since $n$ cannot be zero. Now what will this expression look like if we let $n$ go to infinity?

$$\lim_{n \to \infty} \frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n}$$

Let $x = \lim_{n \to \infty} \frac{r}{n}$ and $t = \lim_{n \to \infty} \frac{i}{n}$. In this case, the summation looks like a Reinmann sum, that is, it is adding together the areas of equally wide rectangular strips in a fixed region under a graph. As the width of the strips goes to zero, the total area of all the strips equals the area under the graph (in the fixed region), making it equal to the definite integral of the graph. In this case, the graph is $f(t) = \frac{1}{t}$, the width of the strips is $\frac{1}{n}$, and the region is between $\frac{r}{n}$ and $\frac{n}{n}$. This means the following:

$$\lim_{n \to \infty} \frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n} = x\int_x^1 \frac{1}{t} dt$$

Great. So now we need to find the turning point of $x\int_x^1 \frac{1}{t} dt$ with respect to $x$ in order to know what $\frac{r}{n}$ gives the most probable success. This comes as follows:

$$ \begin{align} \frac{d}{dx} (x\int_x^1 \frac{1}{t} dt) &= 0 \\ \frac{d}{dx} x(\ln(1) - \ln(x)) &= 0 \\ \frac{d}{dx} -x\ln(x) &= 0 \\ -\ln(x) - 1 &= 0 \\ x &= \frac{1}{e} \\ \end{align} $$

Which says that the number of partners you should date before looking for a serious relationship is the total number of partners you will date divided by $\frac{1}{e}$, or about 37% of the number of people you'll go out with.

Tuesday, December 29, 2020

Completing the square

For all quadratic expressions, that is, of the form $ax^2 + bx + c$, which we'll call canonical forms, there exists an identical expression of the form $a(x + p)^2 + q$, which we'll call completed square forms. For example, $3x^2 + 12x + 27 \equiv 3(x + 2)^2 + 15$. Among other reasons, completed square forms are convient for finding the root of the quadratic's graph because it is easier to manipulate algebraically. Below is the method for converting a quadratic expression in canonical form into an identical quadratic in completed square form.

The trick to completing the square is to see what would happen if, given $x^2 + bx + c$, you assume that the completed square form is $(x + \frac{b}{2})^2$ and correct it. If we have an $a$ in front of $x^2$, we can factor it out of the expression so that we have $a(x^2 + \frac{b}{a}x + \frac{c}{a})$ and then complete the square of what's inside the brackets. Let's use $3x^2 + 12x + 27$ as an example:

$3x^2 + 12x + 27 \equiv 3(x^2 + 4x + 9)$

Focus on $x^2 + 4x + 9$

Assume that $x^2 + 4x + 9 \equiv (x + 2)^2$

Expanding the brackets:
$(x + 2)^2 \equiv x^2 + 2 \times 2 \times x + 2^2 \equiv x^2 + 4x + 4$

We can see that dividing $b$ (that is, 4) by 2 is useful because that term gets doubled after expanding the brackets. We can further see that instead of 9, the last term is 4. So we need to correct for this. Fortunately, we can introduce the term $q$ to our completed square form, which can be used to correct it. We correct it by adding 5 (that is, $9 - 4$) after the squared brackets, like this:

$(x + 2)^2 + 5$

Now, when we expand the brackets, we get:

$(x + 2)^2 + 5 \equiv (x^2 + 2 \times 2 \times x + 2^2) + 5 \equiv x^2 + 4x + 9$

which is exactly what we want. Therefore,

$3x^2 + 12x + 27 \equiv 3(x^2 + 4x + 9) \equiv 3((x + 2)^2 + 5) \equiv 3(x + 2)^2 + 15$

Let's try this with the general case:

$ax^2 + bx + c \equiv a(x^2 + \frac{b}{a}x + \frac{c}{a})$

Focus on $x^2 + \frac{b}{a}x + \frac{c}{a}$

Assume that $x^2 + \frac{b}{a}x + \frac{c}{a} \equiv (x + \frac{b}{2a})^2$

$(x + \frac{b}{2a})^2 \equiv x^2 + 2 \times \frac{b}{2a} \times x + (\frac{b}{2a})^2 \equiv x^2 + \frac{b}{a}x + (\frac{b}{2a})^2$

Add $\frac{c}{a} - (\frac{b}{2a})^2$ to the completed square expression.

$(x + \frac{b}{2a})^2 + \frac{c}{a} - (\frac{b}{2a})^2$

Therefore,

$$ \begin{align*} ax^2 + bx + c &\equiv a(x^2 + \frac{b}{a}x + \frac{c}{a}) \\ &\equiv a((x + \frac{b}{2a})^2 + \frac{c}{a} - (\frac{b}{2a})^2) \\ &\equiv a(x + \frac{b}{2a})^2 + a\frac{c}{a} - a(\frac{b}{2a})^2 \\ &\equiv a(x + \frac{b}{2a})^2 + (c - \frac{b^2}{4a}) \end{align*} $$

You can use the last line to convert the expression directly.

Sunday, November 29, 2020

Summary of research paper "A Three-Player GAN: Generating Hard Samples To Improve Classification Networks" by Vandenhende

After reading this paper, I was confused by Algorithm 1 (page 2) as it did not really match parts of the text. So I contacted the author and would now like to clarify it for everyone else.

The paper describes a generative adversarial network which generates images that are hard to classify in order to augment a training set for a second classifier. This is done by attaching a classifier to the generator (apart from the discriminator) which learns to classify the generated data with a gradient reversal layer between the classifier and the generator. The gradient reversal layer forces the generator to generate examples which are difficult to classify by negating the gradient signal coming from the classifier loss before passing it to the generator. This makes the generator learn to maximise the classifier loss whilst the classifier is learning the minimise it, resulting in adversarial learning. The author does not give a reason for why the same kind of adversarial learning is not applied to both parts of the model.

What confused me is how would you know what class the difficult to classify examples produced by the generator would be when training the target classifier with the augmented training set. The reason I was confused is because I did not realise that this is not a GAN, but a CGAN, a conditional generative adversarial network. A CGAN works by feeding both the generator and the discriminator a class label. The label would be the label of the real data and a random label in generated data. Since the labels are the same for both the generator and the discriminator, the discriminator learns to discriminate the fake data by matching it to the label and seeing if it looks like what is expected given the label whilst the generator produce images which match the given label in order to fool the discriminator.

By combining this with the gradient reversal layer, the generated images would be both similar to real data of the given label and also difficult to classify. These two properties would be contradictory except for the fact that the effect of the classifier is scaled down due to the lambda hyperparameter of the gradient reversal layer, making the discriminator take precedence.

Sunday, October 25, 2020

Python generic methods/functions

This is something minor but that could use with some more attention. In C# you can use generics with methods so that if a function needs to use a generic type which can be different with different calls then you can simply do this:
class C {
    T getValue() {
    }
}
Now Python has type hints which are used by the program mypy, but I was having trouble finding how to do generic methods with it. The problem is that generic method examples shown require you to pass in a parameter with the generic type. But if the type is not used in the parameters but only in the return, as shown in the C# example above, then you'd be in trouble. The solution is to pass the type itself as a parameter, in the same way that the C# example above does but as a parameter instead of outside the parameter list. It is shown here but not as a generic function.
from typing import TypeVar, Type

T = TypeVar('T')

class C:
    def get_value(return_type: Type[T]) -> T:
        pass

Tuesday, September 29, 2020

Proof of the volume of a pyramid and cone formulae

Let's take a square based pyramid, one which is arranged such that it fits in the corner of a cube like this:
The volume of this kind of pyramid is a third of that of the cube. This is because 3 such pyramids will perfectly fit in the cube, as shown below:
In fact you don't even need the pyramid to fit in a cube, it can fit in a cuboid instead. Three such rectangular based pyramids will always exactly fit the cuboid. Now the next natural question is about pyramids that are arranged differently, such as when the top point is in the center of the base, as shown below:
You will notice that no matter how the pyramid is arranged, it always consists of an infinite number of stacked squares, starting from the square at the base of the pyramid and getting progressively smaller until the top point is reached. Since the base and top point of both pyramids are exactly the same, the only thing that is changing is where these square slices are horizontally. So the same set of squares in the orange pyramid are used in the green pyramid with the only difference being where they are placed. This means that the same volume is being used by both pyramids so the same formula is used for both. Therefore, the formula for the volume of a rectangular based pyramid is: $$V = \frac{1}{3}hA$$ where $h$ is the height of the pyramid and $A$ is the area of the base. Now what about when the base is not a square? Will a circle work? A circular based pyramid is a cone:
The cone is exactly the same as the square based pyramid except that instead of consisting of a stack of squares consists of a stack of circles (discs, to be exact). Now if the circles fit exactly inside those of a square based pyramid, then the area of the circle is going to be a fraction of the area of the square:
In terms of the radius of the circle ($r$), that fraction is: $$\frac{\pi r^2}{(2r)^2} = \frac{\pi r^2}{4r^2} = \frac{\pi}{4}$$ Now keeping in mind that this fraction will always be the same for any square in the stack of squares forming the pyramid (from the base to the top point), this means that the volume of the cone must be the same fraction of the volume of the pyramid. Therefore: $$V = \frac{\pi}{4}\frac{1}{3}h(2r)^2 = \frac{1}{3}h\pi r^2$$ In fact, the equation did not change from the previous one because it's still using the base area except that the base area is now of a circle instead of a rectangle. What about a trangular based pyramid (tetrahedron)? Same concept. The stack is made of triangles instead of rectangles or circles:
The fraction of the area occupied by the triangle over the area occupied by the square is half: $$\frac{l^2/2}{l^2} = \frac{1}{2}$$ where $l$ is the side length of the square. So now the volume will be half that of the square based pyramid: $$V = \frac{1}{2}\frac{1}{3}h(l)^2 = \frac{1}{3}h\frac{l^2}{2} = \frac{1}{6}hl^2$$ Again we see that the base area equation still holds and must hold for any shaped base pyramid (where the base shape is linearly reduced to a point along the height of the pyramid).