Problems with Evidence-Based Medicine

[Corrected for grammar]

I might have found a better title for this post, but I'd like to see whether this title brings more traffic from Google than the current frontrunner. If you came here because you are into alternative medicine, this may not be the blog for you, in which case I'm sorry to have wasted ten seconds of your life. It's a pretty long and meandering post about scientific methodology and epistemology; it may not be for everybody. I promise to post another funny little anecdote soon.

Seth Roberts writes:

[S]tatistics such as the mean do not work well for detecting a large change among a small fraction of the sample. If soft drinks cause 2% of children to become hyperactive and leave the other 98% unchanged, looking at mean hyperactivity scores is a poor way to detect this. A good way to detect such changes is to make many measurements per child. Many did-a-drug-harm-my-child? cases come down to parents versus experts. The experts are armed with a study showing no damage. But this study will inevitably have the weaknesses I’ve just mentioned - especially, use of means and few measurements per subject. The parents, on the other hand, will have used, informally, the more sensitive measurement method.

For these reasons, I suspect drug side effects are woefully underreported.

This is not, of course, a problem limited to medical studies. If you studied, say, the effect of listening to heavy metal music on increases in aggression, the same problem might occur: It might make only a subset of children aggressive.
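
To put rough numbers on the quoted point, here is a minimal sketch in Python. It assumes a hypothetical trial in which 2% of treated children shift by three standard deviations on some hyperactivity score and the other 98% do not change at all. The effect size, the sample sizes and the function name `power_of_mean_comparison` are illustrative assumptions, not figures from any study. It uses a normal approximation to ask how often a plain comparison of group means would reach significance.

```python
from statistics import NormalDist


def power_of_mean_comparison(affected_fraction, shift_sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means.

    Assumption: scores have SD 1 in the control group; in the treated group a
    fraction `affected_fraction` of children is shifted by `shift_sd` SDs.
    """
    z = NormalDist()
    # Diluted effect: only the affected fraction contributes to the mean shift.
    mean_shift = affected_fraction * shift_sd
    # Variance of the treated group is inflated by the mixture of two subgroups.
    var_treated = 1.0 + affected_fraction * (1.0 - affected_fraction) * shift_sd ** 2
    standard_error = ((1.0 + var_treated) / n_per_group) ** 0.5
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    noncentrality = mean_shift / standard_error
    # Probability that the test statistic lands in either rejection region.
    return (1.0 - z.cdf(z_crit - noncentrality)) + z.cdf(-z_crit - noncentrality)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        power = power_of_mean_comparison(0.02, 3.0, n)
        print(f"n per group = {n:>5}: power ~ {power:.2f}")
```

Under these assumptions, a trial with 100 children per arm would only rarely flag the effect, even though the affected children change a lot; it takes a very large sample before the diluted mean shift becomes reliably visible.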

In principle, there's a tool to tackle this in the scientist's box: Studying moderators. (A moderator is something that influences the relationship between the independent and the dependent variable. For example, if you studied the influence of firing bullets at people's chests on people's subsequent health, whether or not people are wearing body armour would be a moderator.) If soft drinks make one child hyperactive, but not the other, there must be some reason for this - there must be a moderator. You could measure the moderator and study its effects.

There are two related problems with this. One, the moderator - call it factor X - may be present in only very few people. If you only have one person that displays factor X in your sample, you can study factor X's influence all you like, you're not going to get statistically significant results. Two, and more fundamentally, you may have no effin' clue what factor X is and thus not even measure it.

(For the same reason I once had an argument with a friend who suggested that matching basically does the same thing as randomization. [I have a funny feeling the song "Born to Be Wild" was written about me.] I argued that would only be true if one could not only match perfectly on every variable, but knew all of the variables that have an influence. I think I won that argument hands-down, but someone might disagree.)

What does all of this mean? It means that if you know that all studies testing for a connection between soft drinks and hyperactivity have come up with a "no effect" verdict, but your child becomes hyperactive when she consumes soft drinks and doesn't when she doesn't, and you wonder whether there's a causal connection, you have the choice between three hypotheses:

H1: All possible moderators have been looked into properly. Hence the correlation I've observed is due to chance.

H2: My child is like the average child in those studies. Hence the correlation I've observed is due to chance.

H3: The correlation is causal, although the studies, conducted in line with the best scientific principles, did not suggest this. My child just happens to be different.

I think H1 can be ruled out pretty much a priori; choosing between the other two, I don't think it would take an awful lot of observations before I went with H3. Two explanations.

One, the "soft drink effect" could well be what you might call a placebo effect. But that doesn't matter. For practical purposes an effect is an effect.

Two, yeah, I know, observing my child, I don't have a counterfactual - I don't know whether she'd have been hyperactive on day X if she hadn't had a soft drink. That, after all, is the reason that scientists do randomized trials in the first place: The controls maybe should be called "counterfactuals". But the fact of the matter is that one has to make a guess about the truth, and the evidence for both H2 and H3 is imperfect.

I'm not speaking out against well-established methods scientists use when studying a subject. These are the best methods we have for establishing general knowledge, and I wish more people appreciated that. All I'm saying is that their power is nowhere near unlimited.

I guess what I'm really saying is that I'm doing my best to be a good Bayesian.
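
For the curious, the choice between H2 and H3 can be sketched as a Bayesian update in a few lines of Python. Everything numeric here is an assumption made up for illustration: a 2% prior that my child is one of the affected ones, a 30% chance of a hyperactive day under H2 regardless of soft drinks, and 90% vs 10% under H3 depending on whether she had one. The function name `posterior_h3` is hypothetical too.

```python
def posterior_h3(observations, prior_h3=0.02, p_chance=0.3,
                 p_if_drink=0.9, p_if_none=0.1):
    """Posterior probability of H3 (causal effect in this child) vs. H2 (chance).

    `observations` is a list of (had_soft_drink, was_hyperactive) pairs, one
    per day. All default probabilities are illustrative assumptions.
    """
    likelihood_h2 = 1.0
    likelihood_h3 = 1.0
    for had_drink, was_hyperactive in observations:
        # Under H2 the soft drink is irrelevant to hyperactivity.
        p_h2 = p_chance
        # Under H3 the probability depends on whether she had a soft drink.
        p_h3 = p_if_drink if had_drink else p_if_none
        likelihood_h2 *= p_h2 if was_hyperactive else 1.0 - p_h2
        likelihood_h3 *= p_h3 if was_hyperactive else 1.0 - p_h3
    weighted_h3 = prior_h3 * likelihood_h3
    weighted_h2 = (1.0 - prior_h3) * likelihood_h2
    return weighted_h3 / (weighted_h3 + weighted_h2)


if __name__ == "__main__":
    # Each "pair" is one soft-drink day with hyperactivity and one
    # soft-drink-free day without it.
    pair = [(True, True), (False, False)]
    for n_pairs in range(0, 6):
        print(n_pairs, round(posterior_h3(pair * n_pairs), 3))
```

With these made-up numbers, a handful of consistent observations is enough to push H3 past even odds, which is the sense in which it would not take an awful lot of observations. Different assumptions will of course give different answers, and the sketch ignores the counterfactual problem discussed above.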


pj said...

Of course the problem with Seth Roberts's premise is that medical studies rarely utilise means and normally report rates (ideally of all adverse events that occur). Thus undermining his whole argument. Oh dear.

Seth Roberts said...

Rates are means of zeros and ones. And zero/one measurement of a problem, such as hyperactivity, is less sensitive than a more graded measurement scale.

LemmusLemmus said...

How did I know you'd comment on this one?

You know much more about medical studies than I do; the psychology example was closer to home. If I read you correctly, you mean information such as "1% of subjects in the treatment condition developed symptom X during the study period." (Could you clarify that?) But then there's no way of knowing whether that's causal without comparing treatment and control groups (and using significance tests). That is the whole purpose of an RCT, isn't it?

Take a more extreme example: It might be that zero percent of subjects in the study show a certain reaction to treatment A, but still you do.

As the last but one paragraph should have made clear, I'm not arguing against RCTs at all; in fact, I'm a fan.

LemmusLemmus said...

Also, my argument is by no means confined to adverse effects. If I observe an improvement in cognitive performance whenever I have a Fanta, but I know a study that shows no effect of Fanta on cognitive performance, when does the time come that I should start believing that in my case there is a causal connection? Never?

pj said...

Oh I'm not disagreeing that RCTs can fail to find an effect when there is one (e.g. it is too rare - that is why post-licensing surveillance is very useful), but I am arguing that it is misleading to imply that rare events can be 'averaged out' (as you rightly note, you need to compare the rates in control and treatment groups, and these are usually expressed as odds ratios or relative risks). The statistical approach to discrete events is different to that of continuous variables.

I'm intrigued that Seth is also arguing that rate measures are less sensitive than graded or continuous scales - this rather goes against his premise that:

"statistics such as the mean do not work well for detecting a large change among a small fraction of the sample"

If the change is large then it is likely to be detected as a discrete event (and thus counted as a '1'). Sure a mean is probably a better way to pick up a small effect that is consistent across the sample - but that's not what we're talking about, is it?

I note in the original piece that Seth does mention rates, and says:

"It isn’t easy to measure side effects in conventional studies of treatment vs placebo. If you measure the rates of 100 possible side effects, and use a 5% level of significance, one or two true positives will go unnoticed against a background of five or so false positives. So a drug company can paradoxically assure that they will find nothing by casting a very wide net."

I think this is disingenuous since adverse events are not usually corrected for multiple comparisons or dismissed as false positives, and the overall numbers of adverse events are also compared. In fact analyses are usually optimised to maximise the chances of finding an effect (for example by not using an intention-to-treat analysis which can dilute out adverse events). This is why drug safety leaflets are filled with spurious side effects, erring on the side of safety.
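
The arithmetic behind the "five or so false positives" in the quoted passage can be sketched as follows. This is only an illustration: it assumes the 100 side-effect tests are independent and that none of the side effects is real, and the function names are made up for this example.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of significant results when no effect is real."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha):
    """Chance of one or more false positives across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    n_tests, alpha = 100, 0.05
    print("expected false positives:", expected_false_positives(n_tests, alpha))
    print("P(at least one):",
          round(prob_at_least_one_false_positive(n_tests, alpha), 3))
```

Whether those false positives are then dismissed or simply printed on the leaflet is the point under dispute here; the sketch only shows where the number comes from.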

LemmusLemmus said...

The basic point of my rather long post was that it is not the rational choice to always trust the study results over one's everyday observations. You seem to agree with me on this one - or do you? Why shouldn't rare effects be "averaged out", as you put it?

("Averaged out" may not be the best expression here, if I may say so. By "averaged out" I understand a case in which some things go this way, some things go that way, and in the end we have a bit of variance, but the mean is more or less identical to the median and the mode.)

The point was more that there may be systematic but rare deviations from the mean.

The fact that different statistical techniques are used for discrete vs continuous variables seems fairly unimportant to me in this context.

As for your other points, I'm sure Seth Roberts can defend himself.

J Thomas said...

If the way you sample the population is different from the way the study samples the same population, then of course its results might not apply for you.

So when the study randomises correctly and it finds no significant result, that means there is at least a 5% chance of getting results like the study's results by random chance when there is no effect.

If you ask what is the chance of getting a nonsignificant result when there really is an effect, that's a different question that might be answered by different methods.

You might determine a confidence interval, which says how big an effect could there be on the whole population and still get this result. If the maximum effect on the whole population that could likely be undetected by this study is small, you might figure that it isn't worth pursuing.
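
As a rough illustration of the confidence-interval idea, here is a minimal Python sketch. It assumes a two-arm study with equal group sizes, a known common standard deviation and a normal approximation; the function name `diff_confidence_interval` and the example numbers are hypothetical.

```python
from statistics import NormalDist


def diff_confidence_interval(observed_diff, sd, n_per_group, level=0.95):
    """Normal-approximation CI for a difference between two group means.

    Assumes equal group sizes and a common, known standard deviation `sd`.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    standard_error = sd * (2.0 / n_per_group) ** 0.5
    return observed_diff - z * standard_error, observed_diff + z * standard_error


if __name__ == "__main__":
    # A study that observed no difference at all, with 200 subjects per arm.
    low, high = diff_confidence_interval(0.0, 1.0, 200)
    print(f"95% CI for the average effect: {low:.3f} to {high:.3f} SD")
```

The upper end of the interval is the largest average effect on the whole population that such a study could plausibly have missed. It says nothing about a large effect confined to a small, unknown subgroup.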

Something that only affects part of the population, and you don't know which part? That's much harder. But if the question is about the whole population that's been sampled correctly, then the result can stand. Don't give vaccines to the whole population unless it looks like it will do enough good. Don't give vitamins to the whole population unless it looks like it does enough good. Don't forbid people to take vitamins unless it looks like it does enough harm. Etc.

But if you take a nonrandom sample from the same population? Who knows? And one individual -- you -- is not a very good random sample.

pj said...

I'm not sure - Seth reports a side effect that isn't reported in a study of 20 people - that's hardly surprising - you'd be a bit more questioning if you thought you had a side effect that wasn't found in a post-licensing surveillance of thousands. There are also different degrees of confidence in whether a drug causes a side effect - if you get a convincing and pronounced physiological effect rapidly after taking something and this is reproducible then that is more convincing than the Gardasil example (or MMR) when there is some vague temporal association of the order of months.

There's a difference between averaging out (a small number of people show a large change, but expressed as the average of the whole group it barely registers) and sample sizes being too small to detect a rare event (say a death rate of 1:1000 in a study of 200). I note you count a placebo effect as an effect (i.e. even if the rate is no higher than the placebo arm of a study you'd still call it a side effect) which is fine from one perspective, but it would be unfair to attack medical studies for not talking about these side effects (which they do - they just don't regard them as specific to the drug - and thus they do not regard them as relevant for licensing so they are not headline figures).

So, as a good Bayesian, I'm utterly unimpressed by the 'experiments' by Seth Roberts and Tim Lundeen - not least because they seem to have no idea of practice effects, placebo and unblinding.
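
The sample-size example above (a death rate of 1:1000 in a study of 200) can be worked through in a short Python sketch. It assumes independent subjects with the same event probability; the function names are made up for illustration.

```python
import math


def prob_no_events(event_rate, n_subjects):
    """Chance that a study of `n_subjects` observes zero events."""
    return (1.0 - event_rate) ** n_subjects


def subjects_needed(event_rate, prob_of_seeing_one=0.95):
    """Smallest study size with the given chance of seeing at least one event."""
    return math.ceil(math.log(1.0 - prob_of_seeing_one) / math.log(1.0 - event_rate))


if __name__ == "__main__":
    print("P(no deaths among 200):", round(prob_no_events(0.001, 200), 2))
    print("subjects needed for a 95% chance of one event:", subjects_needed(0.001))
```

Under these assumptions a study of 200 will usually see no such event at all, which is a separate problem from a large change being diluted in a group mean.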

LemmusLemmus said...

I have a feeling you are defending the practice of medical research here - I, for one, didn't attack that at all; I only said one shouldn't put unlimited confidence in it. Also, I'm not the spokesperson for Seth Roberts; I'll only defend the bit of text I quoted approvingly.

I agree that finding a side effect that wasn't found in a study of 20 people is rather unsurprising.

I also agree that a "convincing and pronounced physiological effect rapidly after taking something and this is reproducible" is what one wants to be looking for - you'll note that this is the case in the (hypothetical) soft drinks - hyperactivity scenario. I've never heard about Gardasil, but the MMR example would certainly not qualify. Vaccination is a one-off (two-off? three-off?) event and autism is a permanent condition, so concluding your child got autism because of the vaccination would be rather bold - it might as well have been because she fell off the bike or what have you. The association is based on exactly one observation. (I did not follow the MMR-autism debate in any detail; this is just an epistemological point.)

The placebo effect is an effect - hence the name. I did not attack medical studies in this respect. To stick with the example, if my child becomes all hectic after drinking Coke, I know why: caffeine. But if my child becomes all hectic after drinking Fanta, and I know Fanta has no caffeine in it, at some point I'm still going to say, "What the heck! There appears to be an effect."

(You could come up with a credible learning scenario here: Child first has Coke, notices she gets all hectic, has Fanta, thinks, "Fanta, basically the same as Coke", gets all hectic. I believe such effects have even been observed in people who were led to believe they were drinking alcohol, although I don't remember the details.)

In your last sentence, what's the difference between "placebo" and "unblinding"?

pj said...

In a placebo effect you know you're getting something but don't know if it is active so you may get side effects by chance or because you think you might be on the drug. Unblinding means you know you're on the drug and you can't compare to a placebo group (unless you lied to the placebo group and told them they were on the drug - but that doesn't usually happen).

LemmusLemmus said...

Thank you, not quite the same thing indeed.

J Thomas said...

It sounds like placebo effect is what happens because you *think* something useful is being done to you.

Unblinding is what happens when you think you're getting treated, and you happen to be right.

It used to be common for doctors to give their patients placebos and tell them it would do them good.

It makes sense the placebo effect would be stronger when you believe you're actually getting treated, than when you believe you're in a study where you might be getting a placebo. So drugs should have a stronger placebo effect after they're approved.

I think it would be more effective to get people to just agree to be in experimental groups and not tell them when they're being experimented on. Like, in the USA we could get everybody on Medicare or Medicaid to agree to accept experimental treatment whenever their doctors agree. And then you give them placebo or give them drugs without ever telling them they might be in a study.

We do better when the treatments we test are just like the treatments we actually use in practice.

LemmusLemmus said...

"I think it would be more effective to get people to just agree to be in experimental groups and not tell them when they're being experimented on. Like, in the USA we could get everybody on Medicare or Medicaid to agree to accept experimental treatment whenever their doctors agree."

Heh! Try getting that one through Congress!

pj said...

"It sounds like placebo effect is what happens because you *think* something useful is being done to you."

There's a common conflation when we talk about the placebo effect between the psychological effect of being in a trial (thinking you might have an active drug, the extra attention etc) and things like regression to the mean (where people just get better anyway). The former placebo effect only happens when you give a drug or intervention, the latter is part of the natural history of an illness. This distinction becomes relevant when we realise that clinical trials can't distinguish between the two types of effect, yet people often attribute the latter phenomenon (e.g. people just getting better with time) to some magical mind-over-matter placebo pill effect.

J Thomas said...

"Like, in the USA we could get everybody on Medicare or Medicaid to agree to accept experimental treatment whenever their doctors agree."

"Heh! Try getting that one through Congress!"

It might not be that hard, particularly for Medicaid. A whole lot of Americans believe that poor people and unproductive people don't really deserve anything, but we should still grudgingly give them stuff anyway, just in case for some of them it isn't their own fault.

Pitch it as an advance for medical science and no harm to the experimental subjects -- treatments that are not cost-effective terminated as soon as the statistics show that -- and they might go along pretty easily.

Harder if it includes Medicare and people think it involves cold-hearted experiments performed on their grandmothers.