Unpacking Karl Smith on Experiments and Regressions (An Introduction to Causality and How to Measure It), Part II

Here's where we left off.

Internal Validity (Part B)
So, multiple regression is an attempt to establish a causal connection between X and Y by controlling for everything that may have an influence on Y besides X. This will never work perfectly. First, there are variables that you would like to have measures for, but don't. Second, there are the ones you haven't even thought of, but which in fact do have an influence on Y. In the social sciences, you typically have both of these problems.
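To make the omitted-variable problem concrete, here is a minimal simulation (my own toy numbers, nothing from Smith's post): a variable Z influences both X and Y, and leaving it out of the regression biases the estimated effect of X upward.

```python
import numpy as np

# Toy setup: Z is a confounder that drives both X and Y.
rng = np.random.default_rng(0)
n = 10_000
Z = rng.normal(size=n)                       # the confounder (perhaps unmeasured)
X = 0.5 * Z + rng.normal(size=n)             # X is partly caused by Z
Y = 2.0 * X + 3.0 * Z + rng.normal(size=n)   # true effect of X on Y is 2

def ols(y, *regressors):
    """Least-squares coefficients; intercept comes last."""
    A = np.column_stack([*regressors, np.ones(len(y))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

naive = ols(Y, X)[0]           # Z omitted: estimate well above the true 2
controlled = ols(Y, X, Z)[0]   # Z controlled for: estimate close to 2
```

If Z had never occurred to you, `naive` is all you would ever see, and nothing in the regression output warns you that it is biased.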

Then there's the problem of overcontrol. The textbook example here is the influence of IQ (which we may have measured early in life) on earnings (which we may be measuring at age 40). Education has an influence on earnings, so we want to control for that, right? Well, given that education is itself influenced by IQ (it is a "mediating variable," to use the technical term), you would underestimate the full effect of IQ on earnings if you controlled for education. That's the problem of overcontrol.*

So, you don't control for education. But that might be wrong as well. If education does have an influence of its own on earnings (over and above IQ), and if it is not simply a function of IQ, then you have what Angrist & Pischke call a "proxy control" problem. That is, you want to control for education, but at the same time, you don't want to. There's not really a way out of that dilemma. The best you can do is report estimates with and without education controlled for and call one "upper bound" and the other "lower bound." Let's hope they're similar!
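The bounding strategy can be simulated as well. Here is a sketch with made-up coefficients: IQ has a direct effect on earnings and an indirect one through education, so the regression without education gives the total effect (the "upper bound" on the direct effect) and the regression with it gives the direct effect only (the "lower bound").

```python
import numpy as np

# Hypothetical numbers: IQ -> education -> earnings, plus a direct IQ effect.
rng = np.random.default_rng(1)
n = 20_000
iq = rng.normal(size=n)
educ = 0.6 * iq + rng.normal(size=n)                   # education partly a function of IQ
earnings = 1.0 * iq + 0.8 * educ + rng.normal(size=n)  # total IQ effect = 1.0 + 0.8 * 0.6 = 1.48

def ols(y, *regressors):
    """Least-squares coefficients; intercept comes last."""
    A = np.column_stack([*regressors, np.ones(len(y))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

upper = ols(earnings, iq)[0]         # education not controlled: total effect, ~1.48
lower = ols(earnings, iq, educ)[0]   # education controlled: direct effect only, ~1.0
```

In this simulation we know the truth, so we can see that neither number is "the" effect of IQ; in real data, reporting both is as good as it gets.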

There is also the problem of causal direction. The setup of a multivariate regression does nothing to establish that, but it might be that you have information about your variables that helps. For example, if you try to establish an effect of the weather on violent crime rates, you can be pretty sure that an association between the two does not represent an effect of violent crime rates on the weather.

Compare & contrast with the randomized experiment. If your sample is reasonably large, randomization takes care of all other variables that might have an influence on the outcome, whether you had thought of them or not. (Because you've randomized, the treatment and control groups are very similar on other variables.) Plus, you establish causal direction because you know that you have manipulated one variable, but not the other. VoilĂ : Causality established.
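You can watch this happen in a simulation (again, invented numbers): assignment by coin flip balances a background variable, here age, across the two groups, and the simple difference in group means recovers the true treatment effect without controlling for anything.

```python
import numpy as np

# Made-up experiment: a coin flip assigns treatment, so background
# variables end up balanced across groups by chance alone.
rng = np.random.default_rng(2)
n = 10_000
age = rng.normal(40, 10, size=n)      # a background variable we never control for
treat = rng.integers(0, 2, size=n)    # randomized assignment
outcome = 0.5 * age + 2.0 * treat + rng.normal(size=n)  # true treatment effect = 2

# Randomization check: the groups look alike on age (difference near 0).
balance = age[treat == 1].mean() - age[treat == 0].mean()

# The naive difference in means recovers the true effect (near 2).
effect = outcome[treat == 1].mean() - outcome[treat == 0].mean()
```

Note that age still influences the outcome here; randomization doesn't remove that influence, it just spreads it evenly across the two groups so it cancels out of the comparison.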

So you can say that "[t]here is no fundamental difference between performing a regression on data collected in the field and data generated in the lab," but that's like saying that there is no fundamental difference between the broken-down, rusty Lada with no wheels in my backyard and a brand-new Porsche. They're both cars, right?

As far as I can see, that takes care of the first two of Smith's sentences that I quoted. Which leaves us with one post about the very interesting bit about "double-blind" and the failure of controls, one about external validity, and perhaps an appendix post about assorted issues. I'm confident I'll be finished by the end of summer!

*You might want to know whether an effect of IQ remains after you've controlled for education, but then you'd be asking a different question.
