Showing posts with label Numbercrunching.

02/12/2013

A Quick Test of the Matt Yglesias Hypothesis on Density and Crime

I’ve recently come across a 2011 post by Matt Yglesias (via, via) in which he presents the following little theory of population density and crime:
higher density helps reduce street crime in an urban environment in two ways. One is that in a higher density city, any given street is less likely to be empty of passersby at any given time. The other is that if a given patch of land has more citizens, that means it can also support a larger base of police officers. And for policing efficacy both the ratio of cops to citizens and of cops to land matters. Therefore, all else being equal a denser city will be a better policed city.
While plausible, this is also somewhat surprising, because in the past people have come up with ideas about how density might increase crime, which is understandable given that there is a positive correlation between density and crime (denser cities have higher crime rates). A while back, I half-heartedly reviewed the literature on this; people seem to have come to the conclusion that there's not a lot to it. But that would suggest the effect is (close to) zero rather than negative.

As it happens, I have a dataset for 125 U.S. cities sitting on my hard drive. So let’s run some quick regressions. All are weighted by a variable that divides 1990 population size by the unweighted sample mean for 1990 population size. That means that each city is given a weight proportional to its size while the sample size stays the same; as a consequence, each crime has the same influence on the results irrespective of whether it happens in a small or a large city. While Yglesias writes in the context of having been assaulted, I will not use data on assault, which seems not to be particularly valid, but rather robbery, as official robbery rates appear to correlate highly with the true rates and robbery is the prototypical street crime. I use 1990-2000 changes in density per square mile and changes in robberies per 100,000 population known to the police as the variables of interest. The use of change data takes care of stable differences between cities that may contaminate the results. The estimation method is linear WLS of changes in (untransformed) rates.
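For concreteness, here is a minimal sketch of that weighting and differencing scheme in pandas. The column names (pop_1990, density_1990 and so on) are made up for illustration and are not the actual variable names in my dataset.

```python
import pandas as pd

def prepare(cities: pd.DataFrame) -> pd.DataFrame:
    """Add weights and 1990-2000 change scores to a hypothetical city-level DataFrame."""
    df = cities.copy()
    # Weight = 1990 population divided by its unweighted sample mean, so the
    # weights average 1: each city counts in proportion to its size while the
    # effective sample size stays at n.
    df["weight"] = df["pop_1990"] / df["pop_1990"].mean()
    # 1990-2000 change scores for the variables of interest.
    df["d_density"] = df["density_2000"] - df["density_1990"]
    df["d_robbery"] = df["robbery_rate_2000"] - df["robbery_rate_1990"]
    return df
```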

I am not going to go through the trouble of embedding tables in blogger, but simply report results for the variable of interest in the text. Bivariate regression: B = -.25 (p < .001), meaning that an increase in density of 1 person per square mile is associated with a decrease of .25 robberies per 100,000 persons.

Next, let’s worry about immigration. It is, unsurprisingly, correlated positively with density and there are some students of crime who think that immigration decreased crime rates in the 1990s U.S. While I don’t necessarily agree with this, let’s control for changes in the percentage of the population that is foreign-born anyway. This makes next to no difference: B = -.27 (p < .001).

This may mean that density reduces robbery, robbery reduces density, there are a bunch of unmeasured variables that influence both, or a combination of the above. I am not going to solve that problem here. But what I will do is control for some initial conditions (i.e., 1990 levels of variables) that may influence both of our variables of interest. First, the robbery rate is particularly likely to decline where it is high, so let's control for 1990 levels of robbery rates. Also, better economic conditions will tend to attract people (and hence increase density) and perhaps also foster future decreases in crime. So let's throw in 1990 values for poverty and unemployment rates, as well as the median of 1989 household income. This leads to a substantial reduction in the coefficient: B = -.14 (p < .001).
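That control specification looks roughly like this in statsmodels; again, the column names are hypothetical stand-ins, not my actual variable names.

```python
import statsmodels.formula.api as smf

# df is the DataFrame prepared above; all regressors are hypothetical column names.
model = smf.wls(
    "d_robbery ~ d_density + d_foreign_born + robbery_rate_1990"
    " + poverty_1990 + unemployment_1990 + median_income_1989",
    data=df,
    weights=df["weight"],
).fit()
print(model.params["d_density"], model.pvalues["d_density"])
```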

Is that a lot? The mean of changes in population density is 579 per square mile with a standard deviation of 842; for changes in the robbery rate, those values are 366 and 345, respectively. If we were to interpret the coefficient of the last regression as causal, this would mean that, in the sample as a whole, increases in density averted 579 * .14 ≈ 81 robberies per 100,000 population, meaning that changes in density would be responsible for about a seventh of the observed decline in robberies. That's a lot.

Of course, you shouldn’t take these little analyses all that seriously. I haven’t worried about functional form or heteroskedasticity and the equation isn’t all that convincing as a causal model.

Still. File under “suggestive”.

18/05/2013

The Monotonic Decline of The Strokes (Edited)

I've listened to the new Strokes album, Comedown Machine, and it ain't much. That shouldn't surprise anyone, really: ever since Is This It?, each Strokes album has been poorer than the previous one, a trend that is only continuing. Don't take my word for it. Instead, trust the users of rateyourmusic.com (average ratings out of 5, as of noon today):

[Graph: average rateyourmusic.com rating for each Strokes album, in release order, declining from Is This It to Comedown Machine.]
I guess the ratings for Comedown Machine are inflated due to a recency/fan effect: the biggest fans will check something out first, and give higher ratings than the general population would. So I would expect the last data point to sink a little in the months to come.

Another thing this graph demonstrates is that my Excel skills are also in decline.

(Edited so the graph reflects the fact that the lowest possible rating is 1, not 0. Also, I should mention that the songs "One Way Trigger" and "Happy Ending" are really good.)

06/04/2012

How Good Is Good Reads?

A while back, Andy McKenzie wondered where "the imdb of books" is. By which he means a site that presents an authoritative ranking of books on the basis of a large sample of votes, combined with a clever algorithm for scoring these. One of the candidates he considered for such a role was the site Good Reads:
Upside: As far as I can tell, this is the largest "bookshelf" site with the most user ratings. Huge potential. Downside: They've made no attempt to publish a list of the highest rated books across the site! All I can ask is, what is holding you back, GoodReads editors? Qualms about alienating authors whose works won't make the list? Fears of being labelled imperialistic? These are both hogwash. Our time is scarce and in order to be informed consumers we need to know what the best books are. If you are worried about the arbitrariness of the minimum votes cut-off, then publish multiple lists with different scaling parameters. You will thank me later when the list gets out-of-control traffic. Indeed, a group of passionate GoodReads users recently called for such a list. To this valiant effort I can only say, Viva la Résistance!
As it turns out, Good Reads has now come round to providing such lists. Andy won't be too thrilled, however, and neither am I. That's because they're doing it all wrong. Two big problems: First, if you want to calculate a score on the basis of multiple votes, your measure needs to be metric. But Good Reads provides labels for each point of its five-point scale, and if voters take those labels seriously, the scale is decidedly nonmetric (the labels are "it was amazing", "really liked it", "liked it", "it was o.k.", "didn't like it"). Second, and worse, their scoring method is quite obviously not a variant of the ingenious Bayesian formula used by imdb. I don't know exactly how the Good Reads algorithm works, but it seems to give a lot of weight to the total number of votes, so you won't be surprised to find that their Best Books Ever list looks like this.
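For reference, the formula imdb used to publish for its Top 250 shrinks an item's mean rating toward the site-wide mean in proportion to how few votes it has. A small sketch (the example numbers are made up):

```python
def weighted_rating(R: float, v: int, m: int, C: float) -> float:
    """IMDb-style Bayesian estimate: R = item's mean rating, v = its number of
    votes, m = minimum-votes cutoff, C = mean rating across all items."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# A book with few votes is pulled toward the site-wide mean C, so raw vote
# counts alone cannot push it to the top of a list.
print(weighted_rating(R=4.8, v=50, m=500, C=3.8))     # ~3.89
print(weighted_rating(R=4.3, v=20000, m=500, C=3.8))  # ~4.29
```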

Now, I'll admit I have an Aspergery fascination with lists (including reference sections in academic texts. Really.), but their main use is giving recommendations that I like. Good Reads hence has a chance to redeem itself, because it's heavy on recommendations. That system doesn't look too hot either (it seems to be based simply on matches with books one has rated 3 or higher), but there's no need to speculate about its quality: let's make it empirical! Here are six books that the site has recommended to me, that hadn't been on my mental "to read" list and that looked interesting:
  • The Man Who Was Thursday, by G.K. Chesterton
  • The Speed Queen, by Stewart O'Nan
  • If on a Winter's Night a Traveler, by Italo Calvino
  • Hawksmoor, by Peter Ackroyd
  • Snow Crash, by Neal Stephenson
  • The End of the World News, by Anthony Burgess
In order to test how useful Good Reads is for me, I'll read five of those and see how much I like them. If the average rating is 3 or below, I'll consider the site's recommendations a failure, as I can easily do that well without it. If the average rating is 4 or above, I'll consider the system a big success. In between will be, well, in between. Ratings will be conceived of as metric. Half points are allowed. If I don't like a book enough, I'm going to put it down and rate it on estimated overall quality.

To be clear, these are books I picked from the recommendations list on the basis of anticipated enjoyment. While it might seem "fairer" to pick a random selection of recommendations, such a test would have very low ecological validity: I want to know how useful the site will be to me in the future, and in the future I will not be picking books at random from the list.

Testing will be finished when I've rated five of the six books on the list. Given that I won't restrict my reading to these titles, and have other stuff to do as well, this will probably take a few months' time. I'll keep you posted. Anyone reading this is invited to play along.

And now, a literature review: I finished James Joyce's Dubliners today. The last two pages are really good.

13/05/2009

It's the Austrians, Stupid!

A popular stereotype in the UK about Germans is that we're all David Hasselhoff fans. Let's subject this hypothesis to an empirical test and have a look at Google Trends' country ranking for the search term "david hasselhoff".
[Google Trends screenshot: country ranking for the search term "david hasselhoff".]
A-ha! Relative to all searches from the countries in question, Germany is only no. 10 when it comes to searches for The Hoff. You might be forgiven your mistake, however, as Austrians speak a language related to German proper. Special attention should be directed towards country no. 9 on the list.

I rest my case.

14/05/2008

I'm Making This Empirical!

I earlier wrote:

I have not crunched the numbers on this but am pretty sure that holding [football] player quality [...] constant, players on lower-quality teams commit more fouls due to a) weaker teams having possession of the ball less often combined with b) players being much more likely to commit a foul when their team is not in possession.
Another reason to suspect this is that players may sometimes commit fouls because they're frustrated about losing.

I now have crunched some numbers on this. Via EPL Talk I found the English Premier League's fair play table for the season just finished (which, for some reason, does not include data for the last two days). It contains a sub-statistic for red and yellow cards received (higher values mean fewer cards received; yes, I've double-checked that). Although the score could theoretically vary between 0 and 360, the actual minimum and maximum are 271 and 313, respectively. Not holding player quality constant, I related it to the number of points the clubs gained last season.

It turns out that the correlation between the "card score" and points is .365 - not spectacular, but noteworthy. If you run a linear OLS regression, you get the following equation:

FAIR PLAY SCORE = 283.281 + .211*POINTS

R-squared is .134; in other words, there is a lot left to explain. Neither the correlation nor the regression coefficient is statistically significant at conventional levels (p = .113). This isn't too surprising given a sample size of 20, but it also means you shouldn't put too much faith in the generalizability of the results.
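For the curious, numbers like these can be reproduced from the raw table roughly as follows; the two input arrays would hold the 20 clubs' points and card scores (placeholders here, not the actual values).

```python
from scipy import stats

def card_score_vs_points(points, card_score):
    """Correlation plus OLS fit of card_score on points for the 20 clubs."""
    r, p = stats.pearsonr(points, card_score)     # r and its two-sided p-value
    fit = stats.linregress(points, card_score)    # card_score = intercept + slope * points
    return r, p, fit.intercept, fit.slope, r ** 2
```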

Even so, this is some evidence that my previous intuition was correct. Not that I think you should kneel down and hail me; nine out of ten football fans would probably have told you the same.

10/04/2008

Something Every Football Fan Knows to Be True May Be False

All of this season's Champions League quarter-finals were won by the teams that played the second leg at home. Most fans won't be surprised: We all know that playing the second leg at home is a huge advantage, right?

I looked at data on all previous CL seasons' two-legged knockout fixtures but excluded all first knockout stage fixtures, because in these cases, whether you play the second leg at home is determined by previous performance in the tournament. (In scientific parlance, assignment to conditions is nonrandom.)

It turns out that exactly half of the fixtures were won by the teams that played at home first, which suggests that one of football's most cherished pieces of wisdom is wrong. Two caveats, however: 1. Obviously, if you added the results from this season's quarter-finals, it would alter the outcome - but let's see how the semis go first. 2. The sample is pretty small (n=42). I'd like someone else to collect a much larger data set and repeat the analysis. If you're aware of any such analysis, please let me know.
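To put that small-sample caveat in rough numbers: with 21 out of 42 fixtures won by the team at home first, a simple normal-approximation 95% confidence interval for the underlying proportion is still very wide.

```python
import math

wins, n = 21, 42
p_hat = wins / n
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - half_width, p_hat + half_width)  # roughly 0.35 to 0.65
```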

Earlier: What's a good first-leg result?

03/03/2008

Time to Place Your Bets. Or Maybe Not.

The Champions League 2nd leg matches are coming up. Judging simply from my earlier number-crunching, and taking into account no other information whatsoever, I'll have to predict the following.

Teams to advance in bold; level of certainty in brackets.

FC Porto - FC Schalke 04 (low)
Real Madrid - Roma (medium)
Chelsea - Olympiakos Piräus (medium)
Inter - Liverpool (high)
Barcelona - Celtic (medium)
Manchester United - Olympique Lyonnais (medium)
AC Milan - Arsenal (medium)
Seville - Fenerbahce Istanbul (medium)

Given how primitive the forecasting algorithm is, I'll regard 6/8 as a big success.

22/02/2008

Proven by Research: The Chances of Liverpool Advancing to the Quarter-Finals Are... 100%!

When Champions League first-leg knockout ties come around, there is always discussion before, during and after the match about which result is a good one for the home or away team, and the infamous away goals rule is bound to be invoked. These discussions are always based on those old twins, intuition and experience, rather than statistics. It is hence no surprise that I recently heard contradictory views on whether a 0-0 away from home is a good or a bad result.

So, why not crunch the numbers? The data are from all two-legged Champions League knockout fixtures. The first column gives you the first-leg result from the perspective of the home team, the second column gives the percentage of times that the team which was at home in the first leg won the overall fixture, and the third column gives the number of observations.

0-3   0%      n=2
0-2   0%      n=3
2-3   0%      n=1
1-2   0%      n=3
0-1   9.1%    n=11
4-4   0%      n=1
3-3   0%      n=1
2-2   20%     n=5
1-1   37.5%   n=16
0-0   29.4%   n=17
3-2   0%      n=4
2-1   30%     n=10
1-0   58.3%   n=12
4-2   50%     n=2
3-1   83.3%   n=6
2-0   100%    n=11
5-2   100%    n=1
4-1   100%    n=2
3-0   100%    n=1
4-0   100%    n=1
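(For what it's worth, a table like this can be computed with a simple group-by; the column names below are hypothetical, not the ones in my spreadsheet.)

```python
import pandas as pd

def tabulate(ties: pd.DataFrame) -> pd.DataFrame:
    """Win percentage of the first-leg home team, by first-leg result."""
    grouped = ties.groupby("first_leg_result")["first_leg_home_won_tie"]
    return pd.DataFrame({
        "won_tie_pct": (100 * grouped.mean()).round(1),
        "n": grouped.size(),
    })
```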

Of course, the number of observations is small in many cases, so I wouldn't use this to argue that, say, 5-2 is just as good a result as 4-0, but I will say that 0-0 away is a good result. Take that, Michael.