Wednesday, May 23, 2012

Birthday Problem

Punk Rock Operations Research has recently been talking about the birthday problem. This problem asks how many people have to be in a room for you to have a 50% chance of two of them sharing a birthday. The usual answer is 23 people.

However as The Birthday Problem and The Birthday Problem with a mating season: A simulation approach post say because Seasons Sway Human Birth Rates the assumption the birthday problem makes of a random sampling of people usually will not hold. This clumping of people births means that the birthday problem could have a lower number than usually supposed. Think of it this way, if everyone was born on January 1st you would only need two people to be sure of having a matched birthday. In the same way if births are clumped and not evenly spread the number of people needed to get a shared birthday should drop.

I found a good dataset at An Analysis of the Distribution of Birthdays of 480,040 birth dates. This is ideal for a simulation. Presumably the US census has even better data but I can't find it. The files used are bday.txt. This is in the form

0101 1482

0102 1213

0103 1220

0104 1319

Also a file with equal numbers of birth each day here This code is to calculate the average number of people you need to add to a set for two to share a birthday. Which is slightly different from the original birthday problem. Knuth studied this variant and came up with a figure of 24.616 people assuming 365 days in a year.


#!/usr/bin/env ruby -w

puts 'Birthday simulation'

#binary search function I stole
def binarySearch(sortedArray, first, last, key) 
   until first > last do
       mid = (first + last)/2
       if (key > sortedArray[mid])
           first = mid + 1;
           binarySearch(sortedArray, first, last, key)
       elsif (key < sortedArray[mid])
           last = mid - 1;
           binarySearch(sortedArray, first, last, key)
       else
           return mid;
      end
   end
   return first;
end

#binarySearch Test
#sorted_array = [1, 5, 6, 10, 5, 25, 40, 78, 100, 130]
#p binarySearch(sorted_array, 0, sorted_array.length-1, 1)

total=0;
matches= Array.new; #array holding how many people so far have this
found= Array.new; #have we seen this day before
tried=0;  #how many people had to be tried 
i=0;

#First load an array with numbers that correspond to days. Day 0 is 1482 people, day 1 1482 upto 1482+1213 etc 

#for the real dataset
File.open('bday.txt', 'r') do |f1|

# and with the everyone born randomly
#File.open('normal', 'r') do |f1|
  while line = f1.gets  
    parts=line.split;#get the second number part of the line
    total=total+parts[1].to_i;
    matches[i]=total;
    i+=1;
  end  
end 

puts total; #sum up all the values

j=0;
matchFound=0;
totalDays=0;

while j<10000#00
where= binarySearch(matches, 0, matches.length-1, rand(total));
tried=tried+1;
 if found[where] ==1
  totalDays=totalDays+tried;
  tried=0;
  found.delete(1);#empty the array for another simulation run
  matchFound=matchFound+1; 
 else
  found[where]=1
 end
 j+=1;
end
ans=(totalDays.to_f/matchFound.to_f)
puts "average birthday number ", ans;
This code with the same number born every day gives a birthday number of 24.0194321675634. And with the real bday.txt dataset from Roy Murphy gets the result of 23.9779163169884.

This code needs a complete rewrite. This code is fugly and wrong. I tried writing this piece a few years ago and failed so I'm just glad to get it out the door. Also I have to make allowances for the dataset size and the number of simulations run.

Other problems with the birthday problem revolve around getting a random sampling of people. About 1 in 80 births are of twins. So an average of one in every 40 people is a twin and twins tend to hang out together.

This dataset comes from life insurance forms. Which could be biased in that Jan 1st seems to be too popular. It could be what a form defaults to. Maybe life insurance selects for people born at a certain time. It could be that rich people buy life insurance more and rich people are more often born in a particular month. Professions seem to clump around certain birth dates. Malcolm Gladwell points out in Outliers that professional athletes tend to have birthdates that make them old when they play under 8,9,10 and 11's sports. Kary Mullis in "dancing naked in the mind field" professes his support for astrology saying “A recent scientific study of the distribution of medical students in birth months discovered that a lot of medical students were born in late June”. If this is right then some other professional groups clump in age.

Mullis is an interesting person, and his book is very entertaining. By the time he describes his meeting with an extra terrestrial glowing raccoon you know he is nuttier than squirrel shit. Partly it is this oddness that helped him win the Nobel prize brilliant "What if I had not taken LSD ever; would I have still invented PCR?" He replied, "I don't know. I doubt it. I seriously doubt it.". PCR is used to amplify DNA fragments and is a major tool in convicting rapists, murderers and other criminals.

While on the subject of doing calculations on peoples births. Laplace did a calculation showing more baby boys were born in France than girls. Which is one of the founding calculations of probability theory.

The birthday problem turns out to be lower than people claim. It also has enough weirdness involved in the randomness assumption that it is a good illustration that people just aren't random.

No comments: