04/01/2009

Linguistics Prediction/Warning: "Outliers"

"If you look, in fact, at emergency room statistics, you'll see that more people are admitted every year for non-dog bites than dog-bites—which is to say that when you see a Pit Bull, you should worry as much about being bitten by the person holding the leash than the dog on the other end."

Malcolm Gladwell, who gave us the above masterpiece of statistical reasoning, strikes again: his new book, Outliers, is out now. I recently browsed a copy of the book, which deals with success. Gladwell uses the term "outliers" for the exceptionally successful. Steve Sailer comments (emphases omitted):
Gladwell chose the word "outliers" for his title because it sounded scientific. He’s vaguely aware that statistical analysts are much concerned with the outliers in their datasets, so it sounds cool to write a book about why people like Bill Gates and the Beatles are successful and call it Outliers.

Of course, the reason statisticians think about outliers a lot is because, to quote Wikipedia, "Statistics derived from data sets that include outliers may be misleading."

For example, say you are a market researcher doing a random survey of consumers for a mutual fund company to determine the average net worth of Americans by different levels of education. You tote up your results and see that the mean wealth of your 100 college dropouts is $500,050,000.

"That’s weird," you say.

You then look at the individual surveys and see that one respondent claimed to have a fortune of fifty billion dollars.

Is he lying? Is he crazy? Or is he Bill Gates? You don’t know. All you know is that he’s an outlier and therefore you aren’t going to use him in your data set. Otherwise, your innumerate pointy-haired boss in the marketing department (who, by the way, loves Malcolm Gladwell) might take your findings as justifying a huge ad campaign aimed at the evidently vastly wealthy dropout market.
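The arithmetic behind Sailer's example is easy to check. Here is a minimal sketch in Python; the individual net worths are made up, chosen only so that the mean comes out at the $500,050,000 he mentions, and the median is shown as one common statistic that a single extreme value barely moves.

```python
from statistics import mean, median

# Hypothetical survey of 100 college dropouts. The individual figures are
# invented for illustration; only the $50 billion respondent and the
# resulting mean come from the quoted example.
ordinary = [50_000] * 98 + [100_000]   # 99 unremarkable respondents
claimed_fortune = 50_000_000_000       # the one respondent claiming $50 billion
sample = ordinary + [claimed_fortune]

mean_with_outlier = mean(sample)       # dominated by the single huge value
mean_without_outlier = mean(ordinary)  # what the other 99 respondents look like
median_with_outlier = median(sample)   # barely affected by the huge value

print(f"mean with outlier:    ${mean_with_outlier:,.0f}")
print(f"mean without outlier: ${mean_without_outlier:,.0f}")
print(f"median with outlier:  ${median_with_outlier:,.0f}")
```

One respondent out of a hundred accounts for practically the entire mean, which is why the analyst in the story sets him aside.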
To put it a bit more clearly: In standard usage an outlier is a case in which the usual relationship between two or more variables does not hold. For instance, if you have a sample in which the average relationship between a person’s weight and height is described by the formula “weight in kilos equals height in centimetres divided by two”, someone who is 2.20m tall and weighs 200 kilos is an outlier. Someone who is 2.20m tall and weighs 110 kilos is not, despite being exceptionally tall.
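The height and weight example can be sketched in a few lines of Python. The function name and the 30-kilo tolerance are my own assumptions for illustration; real analyses choose the cut-off from the spread of the data rather than fixing it in advance.

```python
def is_outlier(height_cm: float, weight_kg: float, tolerance_kg: float = 30.0) -> bool:
    """Flag a person whose weight departs from the rule 'weight = height / 2'.

    The tolerance is a hypothetical cut-off, not a standard value.
    """
    expected_weight_kg = height_cm / 2           # the relationship from the example
    residual_kg = weight_kg - expected_weight_kg  # how far this person is from it
    return abs(residual_kg) > tolerance_kg


# 2.20 m and 200 kilos: far from the expected 110 kilos, so an outlier.
print(is_outlier(220, 200))

# 2.20 m and 110 kilos: exceptionally tall, but exactly on the line.
print(is_outlier(220, 110))
```

What gets tested is the distance from the expected relationship, not how extreme the height or the weight is on its own.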

This, erm, unconventional use of the common statistical term is particularly regrettable because one of Gladwell’s main points – he spends an entire chapter on it – is that the unusually successful are not outliers in the usual sense. He cites a study which shows that (if I remember the numbers correctly) very good classical musicians typically practiced about six hours a day, while the merely good spent four and the not-so-good two hours on practice. He interprets this to mean that the merely good and the not-so-good musicians could have been very good if only they had practiced six hours a day. Leaving aside the question of whether this interpretation is – pun alert! – sound, his very point is that the very good musicians in the sample are not outliers in the standard sense, because the difference between their performance and that of the other artists is explained by the differences in practice. Put differently, the relationship between hours of practice and quality of performance holds for all of the sample (as described by Gladwell), including the particularly good musicians. A very good musician who practiced only two hours a week would be an outlier, but Gladwell’s point is that such a case just isn’t there.

Expect the misuse of the term “outliers” to come to a discussion near you soon.

1 comment:

pj said...

Mmm, there's a bit of a trend (cough) for these populist meta-theories at the moment (Gladwell, Taleb etc) but they are always so data poor and filled with vague tautologies and bits stolen from the work of real researchers that they're utterly unconvincing.

I've not read Gladwell's book but an article of his I read was bafflingly incoherent and empty.