import sys print(sys.executable)

# Geeky is Awesome

To learn is to teach.

## Wednesday, March 31, 2021

### How to check which Python installation/environment is being used within Python

## Sunday, February 28, 2021

### Micro averaged precision, recall, and F1 are always the same

This is something that causes confusion when it happens because it isn't very well known. If you're evaluating a multi-class classifier using micro-averaging, the precision, recall, and F1 scores will be exactly the same, always. In fact this is explained in the sklearn documentation as "Note that if all labels are included, “micro”-averaging in a multiclass setting will produce precision, recall and that are all identical to accuracy.". Let's prove this, first using an example and then more generally.

Let's say that your classifier's classes (or labels) are A, B, and C and that you have the following classifier predictions together with the true classes:

pred | true |
---|---|

A | A |

A | A |

B | A |

C | A |

A | B |

B | B |

B | C |

C | C |

C | C |

To measure the micro-averaged precision, recall, and F1 you need to first measure the number true positives (TP), false positives (FP), and false negatives (FN) of each class. For example, if we look at just class A, the TP is the number of rows where A was predicted and also the true class, that is, 2 (first two rows). The FP is the number of rows where A was predicted but was not the true class, that is, 1 (the fifth row). The FN is the number of rows where A was not predicted but was the true class, that is, 1 (the third row).

Class | TP | FP | FN |
---|---|---|---|

A | 2 | 1 | 2 |

B | 1 | 2 | 1 |

C | 2 | 1 | 1 |

You calculate the micro-averaged precision ($P$), recall ($R$), and $F1$ for labels $L$ as follows:

- $P = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FP}_l)}} = \frac{2+1+2}{(2+1)+(1+2)+(2+1)} = \frac{5}{9} = 55.6\%$
- $R = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FN}_l)}} = \frac{2+1+2}{(2+2)+(1+1)+(2+1)} = \frac{5}{9} = 55.6\%$
- $F1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times \frac{5}{9} \times \frac{5}{9}}{\frac{5}{9} + \frac{5}{9}} = 55.6\%$

So the question we should ask is why are the precision and recall always the same. If the precision and recall are always the same, then their definition should be identical. Let's see if we can simplify their identity. $$ P = R \\ \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FP}_l)}} = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FN}_l)}} \\ \sum_{l\in L}{({TP}_l + {FP}_l)} = \sum_{l\in L}{({TP}_l + {FN}_l)} \\ \sum_{l\in L}{{TP}_l} + \sum_{l\in L}{{FP}_l} = \sum_{l\in L}{{TP}_l} + \sum_{l\in L}{{FN}_l} \\ \sum_{l\in L}{{FP}_l} = \sum_{l\in L}{{FN}_l} $$

So it all boils down to whether or not the sum of each label's false positives is always equal to the sum of each label's false negatives. Let's say that you have the following row in the first table:

pred | true |
---|---|

A | B |

This is an error, but is it a false positive or a false negative? From the perspective of label A it's a false positive, but from the perspective of label B it's a false negative. So for every error in the table, it will be both a false positive for the label in the prediction column and a false negative for the label in the true column. This means that the number of false positives and false negatives will always be balanced, as with every false positive for one label there will be a false negative for another label.

And with this we have proven that the number of false negatives will always equal the number of false positives, which further proves that the precision will always be equal to the recall, which further proves that the F1 score will always be equal to them as well.

## Wednesday, January 27, 2021

### The secretary problem

Say you're dating several partners and want to get into a serious relationship eventually. After how many partners can you have a good idea of what a good partner is like? This is called the secretary problem, and it can be formalised as follows:

Let $n$ be the number of partners in total you can date. Let $A_i$ be the fitness of the $i$th partner, such that the order of the partners is random and we are interested in the partner with the maximum fitness. Let $r$ be the number of partners you date before you start looking for a serious relationship. Once a partner has been evaluated, you cannot go back to them if you try the next partner. What you're doing is going through all partners $i$ starting from 1 up to $r$, at which point you will take $\max_{i=1}^{r}(A_i)$ as a reference of what a good partner looks like and continue going through the rest of the partners $r$ to $n$ and stop as soon as $A_i$ is greater than the reference ($\max_{i=1}^{r}(A_i)$). If none of the partners are better than the best of the first $r-1$ then you settle for the last partner. The question is, what should $r$ be such that the probability of picking the partner with maximum $A_i$ is maximised.

The choice of $r$ actually makes a significant difference, as we can show in the following simulation in Python 3 (in Python everything needs to be 0-based such that the first item is at position 0 rather than 1):

import random def select(A, r): ref = max(A[:r]) # first r items for i in range(r, len(A)): # rest of the items if A[i] > ref: return i return len(A)-1 # return last item if search failed n = 20 for r in range(1, n): # first item not included successes = 0 for _ in range(10000): A = [random.random() for _ in range(n)] best = max(range(n), key=lambda i:A[i]) choice = select(A, r) if choice == best: successes += 1 print(r+1, successes/10000)

2 0.1746 3 0.2532 4 0.3099 5 0.3459 6 0.3581 7 0.3856 8 0.3927 9 0.3866 10 0.3717 11 0.366 12 0.3382 13 0.3227 14 0.2918 15 0.263 16 0.2228 17 0.1908 18 0.149 19 0.0969 20 0.0474

These are the fraction of times the best partner is selected out of 10000 simulations for each $r$ from 2 to 20. You will notice that the fraction steadily increases up to a peak at 8 and then steadily decreases until the end. It's interesting that the peak is not in the middle of the list but closer to the beginning. How can we find what the best $r$ would be for any $n$?

First, we need to find the probability of choosing the best partner for a given $r$ and $n$. This probability can be expressed as two events happening together: you stopping at element $i+1$ and the maximum element being element $i+1$ (we're using $i+1$ instead of just $i$ because it will be mathematically convenient later). These two events are independent and so we can find the probability of each one and then multiply them together.

The probability of the maximum element from all $n$ elements being the $i+1$th element is $\frac{1}{n}$.

The other probability is a little trickier. To stop at element $i+1$, all the elements prior to $i+1$ have to be less than the reference we got from the first $r$ elements. If a larger element is found sooner, we would have stopped there. Therefore, we are interested in the probability that the maximum element from the first $i$ elements is among the first $r$ elements, which is $\frac{r}{i}$. This is assuming that $i+1 \geq r+1$, since we cannot choose any partner before the $r+1$th partner.

Multiplying these two probabilities, we get $\frac{r}{n \times i}$. But this is the probability for a given $i+1$, whereas we want the probability for any $i+1$. Each $i+1$ is mutually exclusive so we can add up the probability of each $i+1$. Given that $i+1$ must be greater than or equal to $r+1$ by definition,

$$P(r, n) = \sum_{i=r}^{n}\frac{r}{n \times i} = \frac{r}{n}\sum_{i=r}^{n}\frac{1}{i}$$Great, so now we have the probability of making a successful selection given $r$ and $n$. Now how do we find at what $r$ is this maximum? Normally we would find the derivative of this equation with respect to $r$ and see what $r$ has to be for the derivative to be 0 (the turning point), but we can't differentiate this expression because $r$ is discrete. We also cannot treat $r$ as continuous because it is used as the starting point in the summation. But if we focus on finding $\frac{r}{n}$ rather than $r$, then we can do so in a round about way.

Start by multiplying $\frac{1}{n}$ with the expression like this:

$$\frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n}$$We can do this since $n$ cannot be zero. Now what will this expression look like if we let $n$ go to infinity?

$$\lim_{n \to \infty} \frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n}$$Let $x = \lim_{n \to \infty} \frac{r}{n}$ and $t = \lim_{n \to \infty} \frac{i}{n}$. In this case, the summation looks like a Reinmann sum, that is, it is adding together the areas of equally wide rectangular strips in a fixed region under a graph. As the width of the strips goes to zero, the total area of all the strips equals the area under the graph (in the fixed region), making it equal to the definite integral of the graph. In this case, the graph is $f(t) = \frac{1}{t}$, the width of the strips is $\frac{1}{n}$, and the region is between $\frac{r}{n}$ and $\frac{n}{n}$. This means the following:

$$\lim_{n \to \infty} \frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n} = x\int_x^1 \frac{1}{t} dt$$Great. So now we need to find the turning point of $x\int_x^1 \frac{1}{t} dt$ with respect to $x$ in order to know what $\frac{r}{n}$ gives the most probable success. This comes as follows:

$$ \begin{align} \frac{d}{dx} (x\int_x^1 \frac{1}{t} dt) &= 0 \\ \frac{d}{dx} x(\ln(1) - \ln(x)) &= 0 \\ \frac{d}{dx} -x\ln(x) &= 0 \\ -\ln(x) - 1 &= 0 \\ x &= \frac{1}{e} \\ \end{align} $$Which says that the number of partners you should date before looking for a serious relationship is the total number of partners you will date divided by $\frac{1}{e}$, or about 37% of the number of people you'll go out with.

## Tuesday, December 29, 2020

### Completing the square

For all quadratic expressions, that is, of the form $ax^2 + bx + c$, which we'll call canonical forms, there exists an identical expression of the form $a(x + p)^2 + q$, which we'll call completed square forms. For example, $3x^2 + 12x + 27 \equiv 3(x + 2)^2 + 15$. Among other reasons, completed square forms are convient for finding the root of the quadratic's graph because it is easier to manipulate algebraically. Below is the method for converting a quadratic expression in canonical form into an identical quadratic in completed square form.

The trick to completing the square is to see what would happen if, given $x^2 + bx + c$, you assume that the completed square form is $(x + \frac{b}{2})^2$ and correct it. If we have an $a$ in front of $x^2$, we can factor it out of the expression so that we have $a(x^2 + \frac{b}{a}x + \frac{c}{a})$ and then complete the square of what's inside the brackets. Let's use $3x^2 + 12x + 27$ as an example:

Focus on $x^2 + 4x + 9$

Assume that $x^2 + 4x + 9 \equiv (x + 2)^2$

Expanding the brackets:

$(x + 2)^2 \equiv x^2 + 2 \times 2 \times x + 2^2 \equiv x^2 + 4x + 4$

We can see that dividing $b$ (that is, 4) by 2 is useful because that term gets doubled after expanding the brackets. We can further see that instead of 9, the last term is 4. So we need to correct for this. Fortunately, we can introduce the term $q$ to our completed square form, which can be used to correct it. We correct it by adding 5 (that is, $9 - 4$) after the squared brackets, like this:

$(x + 2)^2 + 5$

Now, when we expand the brackets, we get:

$(x + 2)^2 + 5 \equiv (x^2 + 2 \times 2 \times x + 2^2) + 5 \equiv x^2 + 4x + 9$

which is exactly what we want. Therefore,

$3x^2 + 12x + 27 \equiv 3(x^2 + 4x + 9) \equiv 3((x + 2)^2 + 5) \equiv 3(x + 2)^2 + 15$

Let's try this with the general case:

Focus on $x^2 + \frac{b}{a}x + \frac{c}{a}$

Assume that $x^2 + \frac{b}{a}x + \frac{c}{a} \equiv (x + \frac{b}{2a})^2$

$(x + \frac{b}{2a})^2 \equiv x^2 + 2 \times \frac{b}{2a} \times x + (\frac{b}{2a})^2 \equiv x^2 + \frac{b}{a}x + (\frac{b}{2a})^2$

Add $\frac{c}{a} - (\frac{b}{2a})^2$ to the completed square expression.

$(x + \frac{b}{2a})^2 + \frac{c}{a} - (\frac{b}{2a})^2$

Therefore,

$$ \begin{align*} ax^2 + bx + c &\equiv a(x^2 + \frac{b}{a}x + \frac{c}{a}) \\ &\equiv a((x + \frac{b}{2a})^2 + \frac{c}{a} - (\frac{b}{2a})^2) \\ &\equiv a(x + \frac{b}{2a})^2 + a\frac{c}{a} - a(\frac{b}{2a})^2 \\ &\equiv a(x + \frac{b}{2a})^2 + (c - \frac{b^2}{4a}) \end{align*} $$

You can use the last line to convert the expression directly.

## Sunday, November 29, 2020

### Summary of research paper "A Three-Player GAN: Generating Hard Samples To Improve Classification Networks" by Vandenhende

The paper describes a generative adversarial network which generates images that are hard to classify in order to augment a training set for a second classifier. This is done by attaching a classifier to the generator (apart from the discriminator) which learns to classify the generated data with a gradient reversal layer between the classifier and the generator. The gradient reversal layer forces the generator to generate examples which are difficult to classify by negating the gradient signal coming from the classifier loss before passing it to the generator. This makes the generator learn to maximise the classifier loss whilst the classifier is learning the minimise it, resulting in adversarial learning. The author does not give a reason for why the same kind of adversarial learning is not applied to both parts of the model.

What confused me is how would you know what class the difficult to classify examples produced by the generator would be when training the target classifier with the augmented training set. The reason I was confused is because I did not realise that this is not a GAN, but a CGAN, a conditional generative adversarial network. A CGAN works by feeding both the generator and the discriminator a class label. The label would be the label of the real data and a random label in generated data. Since the labels are the same for both the generator and the discriminator, the discriminator learns to discriminate the fake data by matching it to the label and seeing if it looks like what is expected given the label whilst the generator produce images which match the given label in order to fool the discriminator.

By combining this with the gradient reversal layer, the generated images would be both similar to real data of the given label and also difficult to classify. These two properties would be contradictory except for the fact that the effect of the classifier is scaled down due to the lambda hyperparameter of the gradient reversal layer, making the discriminator take precedence.

## Sunday, October 25, 2020

### Python generic methods/functions

class C { T getValueNow Python has type hints which are used by the program mypy, but I was having trouble finding how to do generic methods with it. The problem is that generic method examples shown require you to pass in a parameter with the generic type. But if the type is not used in the parameters but only in the return, as shown in the C# example above, then you'd be in trouble. The solution is to pass the type itself as a parameter, in the same way that the C# example above does but as a parameter instead of outside the parameter list. It is shown here but not as a generic function.() { } }

from typing import TypeVar, Type T = TypeVar('T') class C: def get_value(return_type: Type[T]) -> T: pass