Tuesday, August 31, 2021

Triangles in a diagonally segmented trapezium/trapezoid

When a trapezium (quadriletaral with two parallel sides) is segmented diagonally, you end up with 4 triangles as shown below:

These 4 triangles have interesting properties which I shall be proving here, namely that triangles P and Q are similar (same angles, different sizes) and triangles R and S have the same area.

Let's look at just triangles P and Q.

The angles with the same colour are equal for the following reasons: the green angles are opposite angles whilst the red and blue angles are alternative angles.

Now let's look at just triangles Q, R, and S.

In order to show that R and S have the same area, we'll first look at combined triangles Q and R and combined triangles Q and S.

These two combined triangles have the same base and height, which means that they have the same area ($A = \frac{1}{2}bh$). Now triangle Q in both figures is the same triangle, which means that it has the same area. Therefore, removing the area of triangle Q from both shapes will still result in an equal area ($R+Q = S+Q \implies R = S$). So therefore, triangles R and S have the same area.

Thursday, July 29, 2021

The last digit of square numbers cycles/repeats (modulo of square numbers)

In a previous blog post, I showed that the last digit of a square number could not be 2, 3, 7, or 8. I will now show that the last digit cycles through the same sequence of 10 digits: 0, 1, 4, 9, 6, 5, 6, 9, 4, 1; as shown in these first 31 square numbers:

 0^2:   0
 1^2:   1
 2^2:   4
 3^2:   9
 4^2:  16
 5^2:  25
 6^2:  36
 7^2:  49
 8^2:  64
 9^2:  81
10^2: 100
11^2: 121
12^2: 144
13^2: 169
14^2: 196
15^2: 225
16^2: 256
17^2: 289
18^2: 324
19^2: 361
20^2: 400
21^2: 441
22^2: 484
23^2: 529
24^2: 576
25^2: 625
26^2: 676
27^2: 729
28^2: 784
29^2: 841
30^2: 900

Last time I showed that the reason for only a subset of digits could be the last digit of square numbers was because of the following module equivalence:

$x^2 \text{ mod } 10 = (x \times x) \text{ mod } 10 = ((x \text{ mod } 10) \times (x \text{ mod } 10)) \text{ mod } 10 = (x \text{ mod } 10)^2 \text{ mod } 10$

which says that the last digit of a square number will always be the last digit of the square of a single digit, which limits the possible last digits to the last digit of 0^2, 1^2, 2^2, 3^2, 4^2, 5^2, 6^2, 7^2, 8^2, and 9^2, that is, 0, 1, 4, 9, 6, or 5. From this we can easily see why there is always a cycle among the same 10 digit sequence.

Given that $x^2 \text{ mod } 10 = (x \text{ mod } 10)^2 \text{ mod } 10$, we see that the $(x \text{ mod } 10)$ part is cyclic. This is because it gives the last digit of the number $x$ which cycles through 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Therefore the last digit of the square of these digits must also cycle. Note that this holds for any $n$ in $x^2 \text{ mod } n$ because, given that $x^2 \text{ mod } n = (x \text{ mod } n)^2 \text{ mod } n$, the $(x \text{ mod } n)$ part is cyclic every $n$ numbers.

You'll also notice that the cycle of digits is symmetric (0, 1, 4, 9, 6, 5, 6, 9, 4, 1), that is, goes back through the sequence in reverse order after 5. In a future blog post we'll see why this is.

Monday, June 28, 2021

SQL Group By with Limit in MySQL: Finding the top n values per group

Say you have the following database table called "tbl_test" which contains rows which can be grouped by a column.
idgroup_namevalue
1a0
2a8
3a9
4a3
5b5
6b6
7b4
8c2
9c9
If you want to find the maximum value in each group, that's easy using a Group By statement and a MAX function:
SELECT
	`tbl_test`.`group_name` AS `group_name`,
    MAX(`tbl_test`.`value`) AS `value`
FROM
	`tbl_test`
GROUP BY
	`tbl_test`.`group_name`
ORDER BY
	`tbl_test`.`group_name`
;
group_namevalue
a9
b6
c9
But what if you wanted to find the top 2 values in each group? In MySQL, this can be achieved using the aggregate function GROUP_CONCAT which takes all the values in a column in a group, sorts them, and concatenates them into a single string. The problem is that in MySQL v5 there is no way to limit the number of values used, so we'll simulate this using the SUBSTRING_INDEX function, which takes the first n values in a string containing delimited values. For example SUBSTRING_INDEX("ab,c,de", ",", 2) returns "ab,c".
SELECT
	`tbl_test`.`group_name` AS `group_name`,
    SUBSTRING_INDEX(
        GROUP_CONCAT(`tbl_test`.`value` ORDER BY `tbl_test`.`value` DESC SEPARATOR ","),
        ",", 2
    ) AS `values`
FROM
	`tbl_test`
GROUP BY
	`tbl_test`.`group_name`
;
group_namevalue
a9,8
b6,5
c9,2
This puts the values together in one column. But sometimes you'll want to have each value being in its own row. In that case you can use the same trick we saw in an earlier post about finding the median value in MySQL, which involves using session variables. I got this trick from this other blog.
SET @row_num := 0;
SET @prev_group := NULL;
SELECT
	`t`.`group_name` AS `group_name`,
    `t`.`value` AS `value`
FROM (
    SELECT
        @row_num := IF(@prev_group = `tbl_test`.`group_name`, @row_num + 1, 1) AS `row_num`,
        @prev_group := `tbl_test`.`group_name`,
        `tbl_test`.`group_name` AS `group_name`,
        `tbl_test`.`value` AS `value`
    FROM
        `tbl_test`
    HAVING
        `row_num` <= 2
    ORDER BY
        `tbl_test`.`group_name` ASC,
        `tbl_test`.`value` DESC
) AS `t`
;
The idea here is to sort by the grouping column and the value and then associate a row number with each row, where the row number restarts on the first row of each group. We then use a Having clause to filter out only the rows that have a row number that is two or less.
group_namevalue
a9
a8
b6
b5
c9
c2

Sunday, May 30, 2021

VSCode Jupyter Notebook is not opening in Anaconda/conda environment

The Jupyter Notebook extension in VSCode let's you choose the Python environment you want to run. As with the Python extension, this is done from the bottom tab shown in red below:
But sometimes this is not enough and you still get module-not-found errors. In this case you need to also change the environment from the top, shown in red below:

Wednesday, March 31, 2021

How to check which Python installation/environment is being used within Python

Say you have several Python environments, how can you check that you're using the right one whilst running a Python script? The 'executable' constant in the 'sys' module will give you the path to the Python executable currently being used, which is also the path to the environment.
import sys
print(sys.executable)

Sunday, February 28, 2021

Micro averaged precision, recall, and F1 are always the same

This is something that causes confusion when it happens because it isn't very well known. If you're evaluating a multi-class classifier using micro-averaging, the precision, recall, and F1 scores will be exactly the same, always. In fact this is explained in the sklearn documentation as "Note that if all labels are included, “micro”-averaging in a multiclass setting will produce precision, recall and that are all identical to accuracy.". Let's prove this, first using an example and then more generally.

Let's say that your classifier's classes (or labels) are A, B, and C and that you have the following classifier predictions together with the true classes:

predtrue
AA
AA
BA
CA
AB
BB
BC
CC
CC

To measure the micro-averaged precision, recall, and F1 you need to first measure the number true positives (TP), false positives (FP), and false negatives (FN) of each class. For example, if we look at just class A, the TP is the number of rows where A was predicted and also the true class, that is, 2 (first two rows). The FP is the number of rows where A was predicted but was not the true class, that is, 1 (the fifth row). The FN is the number of rows where A was not predicted but was the true class, that is, 1 (the third row).

ClassTPFPFN
A212
B121
C211

You calculate the micro-averaged precision ($P$), recall ($R$), and $F1$ for labels $L$ as follows:

  • $P = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FP}_l)}} = \frac{2+1+2}{(2+1)+(1+2)+(2+1)} = \frac{5}{9} = 55.6\%$
  • $R = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FN}_l)}} = \frac{2+1+2}{(2+2)+(1+1)+(2+1)} = \frac{5}{9} = 55.6\%$
  • $F1 = \frac{2 \times P \times R}{P + R} = \frac{2 \times \frac{5}{9} \times \frac{5}{9}}{\frac{5}{9} + \frac{5}{9}} = 55.6\%$
So we got all the scores the same. If $P$ and $R$ are the same than $F1$ will be the same because the harmonic mean of the same number is that number (just like the arithmetic mean). We can easily prove this: $$ \frac{2 \times X \times X}{X + X} = \frac{2X^2}{2X} = X $$ This is assuming that $X$ is not zero, which would mean that the precision and the recall are both equal to zero, which is impossible.

So the question we should ask is why are the precision and recall always the same. If the precision and recall are always the same, then their definition should be identical. Let's see if we can simplify their identity. $$ P = R \\ \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FP}_l)}} = \frac{\sum_{l\in L}{{TP}_l}}{\sum_{l\in L}{({TP}_l + {FN}_l)}} \\ \sum_{l\in L}{({TP}_l + {FP}_l)} = \sum_{l\in L}{({TP}_l + {FN}_l)} \\ \sum_{l\in L}{{TP}_l} + \sum_{l\in L}{{FP}_l} = \sum_{l\in L}{{TP}_l} + \sum_{l\in L}{{FN}_l} \\ \sum_{l\in L}{{FP}_l} = \sum_{l\in L}{{FN}_l} $$

So it all boils down to whether or not the sum of each label's false positives is always equal to the sum of each label's false negatives. Let's say that you have the following row in the first table:

predtrue
AB

This is an error, but is it a false positive or a false negative? From the perspective of label A it's a false positive, but from the perspective of label B it's a false negative. So for every error in the table, it will be both a false positive for the label in the prediction column and a false negative for the label in the true column. This means that the number of false positives and false negatives will always be balanced, as with every false positive for one label there will be a false negative for another label.

And with this we have proven that the number of false negatives will always equal the number of false positives, which further proves that the precision will always be equal to the recall, which further proves that the F1 score will always be equal to them as well.

Wednesday, January 27, 2021

The secretary problem

Say you're dating several partners and want to get into a serious relationship eventually. After how many partners can you have a good idea of what a good partner is like? This is called the secretary problem, and it can be formalised as follows:

Let $n$ be the number of partners in total you can date. Let $A_i$ be the fitness of the $i$th partner, such that the order of the partners is random and we are interested in the partner with the maximum fitness. Let $r$ be the number of partners you date before you start looking for a serious relationship. Once a partner has been evaluated, you cannot go back to them if you try the next partner. What you're doing is going through all partners $i$ starting from 1 up to $r$, at which point you will take $\max_{i=1}^{r}(A_i)$ as a reference of what a good partner looks like and continue going through the rest of the partners $r$ to $n$ and stop as soon as $A_i$ is greater than the reference ($\max_{i=1}^{r}(A_i)$). If none of the partners are better than the best of the first $r-1$ then you settle for the last partner. The question is, what should $r$ be such that the probability of picking the partner with maximum $A_i$ is maximised.

The choice of $r$ actually makes a significant difference, as we can show in the following simulation in Python 3 (in Python everything needs to be 0-based such that the first item is at position 0 rather than 1):

import random

def select(A, r):
    ref = max(A[:r]) # first r items
    for i in range(r, len(A)): # rest of the items
        if A[i] > ref:
            return i
    return len(A)-1 # return last item if search failed

n = 20
for r in range(1, n): # first item not included
    successes = 0
    for _ in range(10000):
        A = [random.random() for _ in range(n)]
        best = max(range(n), key=lambda i:A[i])
        choice = select(A, r)
        if choice == best:
            successes += 1
    print(r+1, successes/10000)
2 0.1746
3 0.2532
4 0.3099
5 0.3459
6 0.3581
7 0.3856
8 0.3927
9 0.3866
10 0.3717
11 0.366
12 0.3382
13 0.3227
14 0.2918
15 0.263
16 0.2228
17 0.1908
18 0.149
19 0.0969
20 0.0474

These are the fraction of times the best partner is selected out of 10000 simulations for each $r$ from 2 to 20. You will notice that the fraction steadily increases up to a peak at 8 and then steadily decreases until the end. It's interesting that the peak is not in the middle of the list but closer to the beginning. How can we find what the best $r$ would be for any $n$?

First, we need to find the probability of choosing the best partner for a given $r$ and $n$. This probability can be expressed as two events happening together: you stopping at element $i+1$ and the maximum element being element $i+1$ (we're using $i+1$ instead of just $i$ because it will be mathematically convenient later). These two events are independent and so we can find the probability of each one and then multiply them together.

The probability of the maximum element from all $n$ elements being the $i+1$th element is $\frac{1}{n}$.

The other probability is a little trickier. To stop at element $i+1$, all the elements prior to $i+1$ have to be less than the reference we got from the first $r$ elements. If a larger element is found sooner, we would have stopped there. Therefore, we are interested in the probability that the maximum element from the first $i$ elements is among the first $r$ elements, which is $\frac{r}{i}$. This is assuming that $i+1 \geq r+1$, since we cannot choose any partner before the $r+1$th partner.

Multiplying these two probabilities, we get $\frac{r}{n \times i}$. But this is the probability for a given $i+1$, whereas we want the probability for any $i+1$. Each $i+1$ is mutually exclusive so we can add up the probability of each $i+1$. Given that $i+1$ must be greater than or equal to $r+1$ by definition,

$$P(r, n) = \sum_{i=r}^{n}\frac{r}{n \times i} = \frac{r}{n}\sum_{i=r}^{n}\frac{1}{i}$$

Great, so now we have the probability of making a successful selection given $r$ and $n$. Now how do we find at what $r$ is this maximum? Normally we would find the derivative of this equation with respect to $r$ and see what $r$ has to be for the derivative to be 0 (the turning point), but we can't differentiate this expression because $r$ is discrete. We also cannot treat $r$ as continuous because it is used as the starting point in the summation. But if we focus on finding $\frac{r}{n}$ rather than $r$, then we can do so in a round about way.

Start by multiplying $\frac{1}{n}$ with the expression like this:

$$\frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n}$$

We can do this since $n$ cannot be zero. Now what will this expression look like if we let $n$ go to infinity?

$$\lim_{n \to \infty} \frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n}$$

Let $x = \lim_{n \to \infty} \frac{r}{n}$ and $t = \lim_{n \to \infty} \frac{i}{n}$. In this case, the summation looks like a Reinmann sum, that is, it is adding together the areas of equally wide rectangular strips in a fixed region under a graph. As the width of the strips goes to zero, the total area of all the strips equals the area under the graph (in the fixed region), making it equal to the definite integral of the graph. In this case, the graph is $f(t) = \frac{1}{t}$, the width of the strips is $\frac{1}{n}$, and the region is between $\frac{r}{n}$ and $\frac{n}{n}$. This means the following:

$$\lim_{n \to \infty} \frac{r}{n}\sum_{i=r}^{n}\frac{n}{i} \frac{1}{n} = x\int_x^1 \frac{1}{t} dt$$

Great. So now we need to find the turning point of $x\int_x^1 \frac{1}{t} dt$ with respect to $x$ in order to know what $\frac{r}{n}$ gives the most probable success. This comes as follows:

$$ \begin{align} \frac{d}{dx} (x\int_x^1 \frac{1}{t} dt) &= 0 \\ \frac{d}{dx} x(\ln(1) - \ln(x)) &= 0 \\ \frac{d}{dx} -x\ln(x) &= 0 \\ -\ln(x) - 1 &= 0 \\ x &= \frac{1}{e} \\ \end{align} $$

Which says that the number of partners you should date before looking for a serious relationship is the total number of partners you will date divided by $\frac{1}{e}$, or about 37% of the number of people you'll go out with.