Monday, January 11, 2010

Analytics X Prize

There is a competition here to predict what proportion of Philadelphia's murders will occur in each of the city's 47 zip codes. Many people who are interested in these sorts of puzzles have started submitting predictions.

So how would you go about predicting the murder proportion in each zip code?
Well, if nothing changes in Philadelphia, you would expect each zip code to keep the same proportion of murders, apart from some random variation you could not predict. So my first guess is a repeat of exactly what happened last year.
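As a rough sketch of that first guess, the snippet below turns last year's murder counts into predicted proportions. The zip code labels and counts are made up purely for illustration, and the function name is my own, not anything from the competition.

```python
def baseline_proportions(last_year_counts):
    """Predict each zip code's share of murders as last year's share."""
    total = sum(last_year_counts.values())
    if total == 0:
        # No murders recorded at all: fall back to an equal share each.
        return {z: 1 / len(last_year_counts) for z in last_year_counts}
    return {z: count / total for z, count in last_year_counts.items()}


# Hypothetical counts, NOT real Philadelphia data.
example_counts = {"zip_A": 6, "zip_B": 12, "zip_C": 18, "zip_D": 0}
prediction = baseline_proportions(example_counts)

for zip_code, share in prediction.items():
    print(zip_code, round(share, 3))
```

The predicted shares always sum to 1, which is what a prediction of proportions needs.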

But in the real world things do change. Say the population changes. If every person had the same chance of being murdered, then the proportion of murders in a zip code would change in proportion to the change in its population. If this were the case, the prediction problem would become finding out what changes in population will take place over the year.
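One way to sketch this idea: scale last year's count in each zip code by its population ratio, then renormalize so the shares sum to 1. This assumes the per-person risk within each zip code stays fixed, which is a simplification of my own. All names and numbers below are hypothetical.

```python
def population_adjusted_shares(last_year_counts, old_pop, new_pop):
    """Scale last year's counts by population change, then renormalize."""
    scaled = {
        z: last_year_counts[z] * new_pop[z] / old_pop[z]
        for z in last_year_counts
    }
    total = sum(scaled.values())
    if total == 0:
        raise ValueError("no murders to distribute; shares are undefined")
    return {z: value / total for z, value in scaled.items()}


# Hypothetical data: zip_A doubles in population, zip_B stays the same.
counts = {"zip_A": 10, "zip_B": 10}
old_population = {"zip_A": 1000, "zip_B": 1000}
new_population = {"zip_A": 2000, "zip_B": 1000}

shares = population_adjusted_shares(counts, old_population, new_population)
print(shares)
```

In this made-up case zip_A's share rises from one half to two thirds, because its population doubled while zip_B's did not.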



The dataset I am using is here, and some errors in it need to be removed. Each dot is a zip code. It looks like the number of murders does roughly follow population, but it is far from an exact match. So changes in population are important, but they are not the only thing we need to predict.

How expensive the houses in an area are, the average income, or the number of people per house might help indicate the murder rate. Here I am looking at the murder rate as (murders/population)*10000, that is, murders per 10,000 residents.
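The rate calculation itself is short. Here is a minimal sketch with invented numbers; the function name is just for illustration.

```python
def murders_per_10k(murders, population):
    """Murder rate expressed per 10,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 10_000 * murders / population


# Hypothetical zip code: 12 murders among 30,000 residents.
rate = murders_per_10k(12, 30_000)
print(rate)
```

Using a rate rather than a raw count lets zip codes with very different populations be compared on the same scale.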





So it looks to me a bit like areas with crowded houses could be more likely to have murders.






House cost does not look connected to murder rate. This could be because a zip code is too coarse-grained to be a good unit for judging this. Maybe the average cost of a house in a block would be a better measure of risk. Philadelphia has even been broken down into 60ft squares here.



Does household income look like it is related to murder rate?

So if the graph is a random scattering of dots, then it looks like the independent variable on the x-axis has no relation to the homicide rate, the dependent variable on the y-axis. If the dots form a line (well, not just a line, but that is another story), then the homicide rate may be related to that independent variable. It really is not this simple, but those are the basics.
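One common way to put a number on "do the dots form a line" is the Pearson correlation coefficient. The sketch below is my own illustration, not something computed for the graphs in this post. It only measures straight-line relationships, so a value near 0 does not rule out a curved one.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: near +1 or -1 for a line, near 0 for a scatter."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with 2+ points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: people per house (x) against murders per 10,000 (y).
people_per_house = [2.1, 2.4, 2.6, 2.9, 3.2]
murder_rate = [1.0, 1.5, 1.4, 2.2, 2.8]
print(pearson_r(people_per_house, murder_rate))
```

With only 47 zip codes, even a fairly strong-looking correlation should be read with caution.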



As Siah pointed out here, young black males seem to be murdered at a disproportionately high rate. The graph above does seem to suggest that predicting changes in the ethnic makeup of a zip code may improve predictions. Age is another important variable, and I do not have data on it, so that might be the next thing to get.

There are already interesting posts on this puzzle:
"Evaluating Spatial Predictions", "Second Pass at Analytics X Prize" and "Homicides as non homogeneous poisson processes" are very informative.
