## Wednesday, July 8, 2015

### Expected number of uniformly distributed collisions (birthday problem)

Here's an interesting mathematical problem. If you have "n" objects to be inserted into "m" available slots using a uniformly distributed random placement, how many collisions with already occupied slots should we expect to happen? This is useful for hashtables and other data structures where duplicates are not allowed.

Here is a Python 3 program that simulates inserting objects into random positions in an array and counting the average number of collisions.

```def collisions(n, m):
trials = 10000
total_collisions = 0
for _ in range(trials):
slot_is_occupied = [ False for _ in range(m) ]
for _ in range(n):
slot = random.randint(0, m-1)
if slot_is_occupied[slot]:
total_collisions += 1
else:
slot_is_occupied[slot] = True
```

Here is a sample of the average number of collisions given by the above function for different values of "n" and "m":
n\m12345678910
10.00.00.00.00.00.00.00.00.00.0
21.00.49470.33340.25150.20540.1630.14370.12970.11180.1053
32.01.24470.88610.6850.5570.46330.40230.35370.3250.2819
43.02.12911.58121.25881.04430.90540.77540.69240.61760.5459
54.03.06172.41.9441.64571.42041.23641.11230.99840.9019
65.04.0343.2652.70782.3042.02181.76041.60041.43491.318
76.05.01684.17423.54063.04982.67162.39732.14991.94171.7744
87.06.00755.12064.4033.83633.39053.03042.73782.51512.3219
98.07.00356.07655.30524.67884.16523.7383.41683.12052.8816
109.08.00167.05266.22335.54014.96324.4934.09133.77213.5016

Basically the answer is the number of objects "n" minus the number of occupied slots. This will give us the number of objects excluding the ones which were inserted without collision, that is, in an empty slot. For example, if I insert 5 objects into an array but at the end there are only 3 occupied slots, then that must mean that 2 of those objects were inserted in the same slot as some other objects (they collided with them).

The question is how to predict the expected number of occupied slots.

Expected number of occupied slots

What is the average number of slots ending up being occupied by at least one object? This previous blog post explains that you basically just need to multiply the probability of a given slot being occupied at the end by the number of slots. So what is the probability of a slot being occupied?

Probability of a slot being occupied

What is the probability that an object is inserted into a particular slot out of "m" slots?
1/m

Therefore the probability that the slot remains empty is
1 - 1/m

What is the probability that the slot is still empty after another placement? It's the probability that the first object did not land on the slot AND that the second object did not land on the slot too. These two probabilities are independent of each other, so
(1 - 1/m)(1 - 1/m)

In general, after "n" objects have been placed, the probability that the slot is still empty is
(1 - 1/m)^n

Notice that this makes sense for n = 0 because if no objects were placed, then the probability that the slot is empty is 1.

Which means that after "n" objects have been placed, the probability that the slot is occupied is
1 - (1 - 1/m)^n

Therefore...

Therefore, the expect number of occupied slots among "m" slots after "n" objects have been inserted with uniform probability is
m(1 - (1 - 1/m)^n)

Which means that the expected number of collisions is
n - m(1 - (1 - 1/m)^n)

Here is the same table as the one at the top showing the corresponding predicted number of collisions:

n\m12345678910
10.00.00.00.00.00.00.00.0-0.00.0
21.00.50.33330.250.20.16670.14290.1250.11110.1
32.01.250.88890.68750.560.47220.40820.35940.3210.29
43.02.1251.59261.26561.0480.89350.77840.68950.61870.561
54.03.06252.39511.94921.63841.41131.23871.10330.99440.9049
65.04.03133.26342.71192.31072.00941.7761.59041.43941.3144
76.05.01564.17563.53393.04862.67452.37942.14161.94621.783
87.06.00785.11714.40053.83893.39543.03952.74892.50772.3047
98.07.00396.0785.30034.67114.16283.74813.40533.1182.8742
109.08.0027.0526.22535.53694.9694.49844.10463.77153.4868

The maximum absolute error between the two tables is 0.0179.