Anders Huitfeldt

Effect heterogeneity and external validity in medicine

2019-10-25T00:00:00-07:00

Our paper “Effect heterogeneity and variable selection for standardizing causal effects to a target population” has just been published in the European Journal of Epidemiology. While the journal’s version of record is behind a paywall, a preprint is available on arXiv.

This paper argues for my very deeply held belief that we can make significant advances in quantitative reasoning for medical decision making by thinking more closely about effect heterogeneity and how this relates to the choice of effect scale.

Over the course of the last 7 years, external validity and generalizability have become increasingly hot topics in statistical methodology and computer science. In particular, a lot of progress has been made by Judea Pearl and Elias Bareinboim, who introduced a framework based on causal diagrams that can be used to reason about how to take causal information from one setting (for example: a randomized trial) and apply it in a different setting (for example: a clinically relevant target population).

The key questions of interest are: How do we know whether such extrapolation is even possible? How do we determine what information we need from the study, and what information we need from the target population, in order to extrapolate the findings? How do we put this information together in order to obtain a valid prediction for what happens if the intervention is implemented in the target population?

Pearl and Bareinboim’s framework for answering these questions is, of course, mathematically valid. However, in my opinion, their approach also throws the baby out with the bathwater. In particular, we argue that instead of attempting to extrapolate the magnitude of the effect (i.e. a measure of the “size” of the difference between what happens if the drug is taken, and what happens if the drug is not taken), they attempt to look at the people who were assigned to receive the intervention in the study, and extrapolate their distribution of outcomes to the target population without any reference to how that distribution differs from what happened to the people who were randomized to the control condition.

Theoretically, this approach will work if the extrapolation procedure can account for every cause of the outcome whose distribution differs between the people who were in the study, and the people you are trying to make predictions about. However, since the set of causes of the outcome is very large, it is very unlikely that it is possible to measure all of them. Moreover, it is very likely that we do not even know what the causes of the outcome are. Our inferences then become subject to potential uncertainty which arises from the auxiliary assumption that we know what to control for.

Consider a situation where scientists have conducted a randomized controlled trial in men, on the effects of homeopathy on heart disease. The scientists find that homeopathy has no effects in men, and wonder whether this finding can be extrapolated to women. If the scientists attempt to answer this question using Bareinboim and Pearl’s framework, they will be forced to conclude that no extrapolation can be made, unless they are willing to claim that they know all the causes of heart disease that differ between men and women, and have been able to measure every one of these causes in all the patients in the study.

In contrast, we suggest that scientists who want to extrapolate their findings and make predictions outside of the study should attempt to quantify the size of the effect - that is, by how much the outcomes in the people who were randomized to receive the intervention differ from the outcomes in the people who were randomized to the control condition. This effect size could then potentially be used as the basis for extrapolation. Such an approach would correspond much more closely to how external validity and extrapolation have traditionally been understood in the medical literature.

In the real world of clinical medicine, doctors are usually given information about the effects of a drug on the risk ratio scale (the probability of the outcome if treated, divided by the probability of the outcome if untreated). With information on the risk ratio, a doctor may make a prediction for what will happen to the patient if treated, by multiplying the risk ratio by the patient’s risk if untreated (which is predicted informally based on observable markers for the patient’s condition). The problem with this approach is that there are multiple scales on which to quantify the magnitude of the effect. Other possible scales for measuring effects include:

The odds ratio, which applies a \\(\frac{p}{1-p}\\) transformation to the risk
The survival ratio, which uses the probability of survival (1-p) instead of the probability of death (p)
The risk difference (which uses an additive scale instead of a multiplicative one)

Unless the intervention has no effect, the empirical predictions will not be invariant to the choice of scale. This is, of course, a serious problem for principled clinical decision-making, but as we will show, it is not necessarily an impossible one. Despite the scale dependence of the reasoning procedure, the risk ratio is in many cases the only summary of the effect size that is made available to clinicians, whether they get their information from journals, clinical guidelines or online resources for clinical information. Given that the reasoning procedure is not scale-invariant, the universal reliance on the risk ratio may plausibly lead to suboptimal medical decision making in a wide range of clinical scenarios. But, in contrast to the implications of the Bareinboim/Pearl framework, we argue that this does not necessarily mean that we should throw out reliance on parametric effect measures altogether.

Our suggestion for how to choose the scale has been discussed earlier on Less Wrong. I am not going to repeat the argument in full here, but I will ask you to consider the following highly stylized thought experiment, which illustrates the underlying intuition:

Consider a randomized controlled trial where the intervention is that everyone is randomized to play Russian roulette once a year. This trial is conducted in Russia. It is found that among those who did not play Russian roulette, 1% of people died over the course of the year. Among the people who played Russian roulette, 18% of people died. We want to extrapolate these findings to Norway, where nobody ordinarily plays Russian roulette and it is known that 0.5% of people die during any year. Our goal is to find out what happens in Norway if everyone took up playing Russian roulette once a year. Bareinboim and Pearl would suggest taking the risk of death among those who played Russian roulette (18%), controlling for all causes of death that differ between Russia and Norway, and producing an estimate for what happens in Norway if everyone plays Russian roulette. However, due to considerable differences between Russia and Norway in terms of predictors of mortality, this is clearly not feasible in this situation.

If we instead attempt to quantify the effect size in Russia, this can be done on any of the previously discussed scales:

The risk ratio is 0.18/0.01=18
The risk difference is 0.18−0.01=0.17
The survival ratio is (1−0.18)/(1−0.01)≈5/6
The odds ratio is (0.18/(1−0.18))/(0.01/(1−0.01))≈21.7
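As a rough check on this arithmetic, the four measures can be computed directly from the two mortality risks reported for the Russian trial. This is only a minimal sketch: the function name `effect_measures` and its argument names are illustrative and are not taken from the paper.

```python
def effect_measures(risk_treated, risk_control):
    """Return four effect measures for a binary outcome.

    The names used here are illustrative, not the paper's notation.
    """
    odds_treated = risk_treated / (1 - risk_treated)
    odds_control = risk_control / (1 - risk_control)
    return {
        "risk_ratio": risk_treated / risk_control,
        "risk_difference": risk_treated - risk_control,
        "survival_ratio": (1 - risk_treated) / (1 - risk_control),
        "odds_ratio": odds_treated / odds_control,
    }


# Russian trial from the text: 18% died among players, 1% among non-players.
russia = effect_measures(risk_treated=0.18, risk_control=0.01)
for name, value in russia.items():
    print(f"{name}: {value:.3f}")
```

The unrounded survival ratio is 0.82/0.99, roughly 0.828, which the text rounds to 5/6.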

Each of these scales will result in a different prediction for what will happen if people in Norway play Russian roulette:

If we use the risk ratio, we will predict that 0.005×18=9% will die.
If we use the risk difference, we will predict that 0.005+0.17=17.5% will die.
If we use the survival ratio, we will predict that (1−0.005)×5/6≈82.9% will survive, meaning that 17.1% will die.
If we use the odds ratio, we will predict that (0.005/(1−0.005)×21.7)/(1+0.005/(1−0.005)×21.7)≈9.8% will die.
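The four predictions can be reproduced with a short sketch that carries each effect measure from the Russian trial over to the Norwegian baseline risk. The function `predict_risk` and the scale labels are hypothetical names chosen for illustration. The code uses the unrounded survival ratio (0.82/0.99), so its survival-ratio prediction comes out near 17.6% rather than the 17.1% obtained with the rounded value of 5/6.

```python
def predict_risk(baseline_risk, scale, effect):
    """Predict risk under treatment by applying one effect measure
    to a population with the given untreated (baseline) risk."""
    if scale == "risk_ratio":
        return baseline_risk * effect
    if scale == "risk_difference":
        return baseline_risk + effect
    if scale == "survival_ratio":
        return 1 - (1 - baseline_risk) * effect
    if scale == "odds_ratio":
        odds = baseline_risk / (1 - baseline_risk) * effect
        return odds / (1 + odds)
    raise ValueError(f"unknown scale: {scale}")


# Effect measures estimated in the Russian trial (18% vs 1% mortality).
p1, p0 = 0.18, 0.01
effects = {
    "risk_ratio": p1 / p0,
    "risk_difference": p1 - p0,
    "survival_ratio": (1 - p1) / (1 - p0),
    "odds_ratio": (p1 / (1 - p1)) / (p0 / (1 - p0)),
}

# Norwegian baseline mortality from the text: 0.5% per year.
norway_baseline = 0.005
predictions = {
    scale: predict_risk(norway_baseline, scale, effect)
    for scale, effect in effects.items()
}
for scale, risk in predictions.items():
    print(f"{scale}: {risk:.1%}")
```

Only the risk difference and the survival ratio land near the roughly 17% that knowledge of the revolver would lead us to expect.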

These predictions differ massively not only in their implications for decision-making but also in their plausibility: Given what we know about Russian roulette, we would expect to see results much closer to 17% than to 9%. So clearly, some of these scales are doing something “right” and other scales are doing something “wrong”.

We argue that the key to understanding the implications of this scale-dependence is that only the survival ratio (5/6) has a structural meaning: it represents the proportion of empty chambers in the revolver, and therefore produces appropriate, valid predictions. In contrast, the risk ratio (18/1) has no possible structural meaning and therefore produces nonsense results.

Any attempt at extrapolation would, of course, have to account for all factors that determine the magnitude of the effect. For example, if Russians are more likely to be drunk when they play Russian roulette, they may be more likely to miss than Norwegians. This may lead to local deviations from effect sizes of 5/6, which will have implications for extrapolation. But once you have controlled for all of the factors that determine the magnitude of the effect on a scale that has structural meaning, extrapolation may be valid.

Crucially, we argue that controlling for all determinants of effect size (alcohol? how many chambers are there in typical revolvers in each country?) is much more tractable than controlling for all causes of mortality differences between the countries.

The main idea behind my research agenda is to explore how far we can push this argument in more clinically relevant settings. Next, consider a doctor who is trying to determine the pros and cons of treating a patient with a new drug. Suppose a reliable study on the drug shows that among those who received a placebo, 1% got an allergic reaction over the following 12 months; whereas, among those who received the drug, 2% got an allergic reaction. The scientists behind the study can either tell the doctor that the risk ratio is 0.02/0.01=2, or that the survival ratio is 0.98/0.99≈0.99. Both statements are correct, but only the latter has a potential structural interpretation, since it plausibly corresponds to a state of nature where 99% of the population do not have the factors (genes?) that predispose a person to have an allergic reaction if exposed to the drug.

Now consider that this patient also has a severe peanut allergy (which is unrelated to the medical issues that the doctor is treating them for) and lives in an environment where everyone eats peanuts all the time. This patient, therefore, has a 10% baseline risk of getting an allergic reaction over the course of 12 months, even in the absence of treatment with the new drug.

It would be insanity for the doctor to expect that the risk ratio from the study generalizes, and that the patient will have a 20% risk of anaphylaxis if given the new drug. In contrast, it may be meaningful to predict that their risk under treatment is given by 1−(1−0.10)×(0.98/0.99)≈10.9%. This will correspond closely to what one might expect would happen if the patient belongs to a population that has the same distributions of factors that predispose to the specific drug-related allergic reaction, as the population that was studied in the trial.
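The two competing predictions for this patient can be written out as a few lines of arithmetic. This is only a sketch using the numbers given in the text, and the variable names are illustrative.

```python
# Trial risks of an allergic reaction over 12 months (from the text).
risk_placebo, risk_drug = 0.01, 0.02
# Untreated risk for the patient with a peanut allergy (from the text).
patient_baseline = 0.10

risk_ratio = risk_drug / risk_placebo
survival_ratio = (1 - risk_drug) / (1 - risk_placebo)

# Carry each measure over to the patient's own baseline risk.
via_risk_ratio = patient_baseline * risk_ratio
via_survival_ratio = 1 - (1 - patient_baseline) * survival_ratio

print(f"risk ratio prediction:     {via_risk_ratio:.1%}")
print(f"survival ratio prediction: {via_survival_ratio:.1%}")
```

The survival ratio adds roughly one percentage point to the patient's baseline risk, which is about the same absolute increase that the drug produced in the trial population.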

For these reasons, I consider it crucial for medical scientists to become aware of the need to put significant effort into reasoning about whether an effect measure has plausible structural meaning in the context of their current research question, before deciding to use it as a summary of their findings which is suitable for use in clinical decision making.

If anyone can spot any flaws in our argument, such feedback would be invaluable information. I invoke Crocker’s Rules for all responses to the paper and the post. I would very much appreciate it if this blog post and the paper could be forwarded to anyone who is in a position to evaluate its importance.

Finally, let me note that this paper is the first peer-reviewed academic publication to acknowledge support from the EA Hotel Blackpool in its funding section. The EA Hotel is a project worth supporting; see https://forum.effectivealtruism.org/posts/uyvc6p99vsWFMPZiz/ea-hotel-fundraiser-5-out-of-runway

Note: This blog entry is crossposted from Less Wrong. Please make any comments on the Less Wrong version

Counterfactual outcome state transition parameters

2018-07-27T00:00:00-07:00

Today, my paper “The choice of effect measure for binary outcomes: Introducing counterfactual outcome state transition parameters” has been published in the journal Epidemiologic Methods. The version of record is behind a paywall until December 2019, but the final author manuscript is available as a preprint at arXiv.

This paper is the first publication about an ambitious idea which, if accepted by the statistical community, could have significant impact on how randomized trials are reported. Two other manuscripts from the same project are available as working papers on arXiv. This blog post is intended as a high-level overview of the idea, to explain why I think this work is important.

Q: What problem are you trying to solve?

Randomized controlled trials are often conducted in populations that differ substantially from the clinical populations in which the results will be used to guide clinical decision making. My goal is to clarify the conditions that must be met in order for the randomized trial to be informative about what will happen if the drug is given to a target population which differs from the population that was studied.

As a first step, one could attempt to construct a subgroup of the participants in the randomized trial, such that the subgroup is sufficiently similar to the patients you are interested in, in terms of some observed baseline covariates. However, this leaves open the question of how one can determine what baseline covariates need to be accounted for.

In order to determine this, it would be necessary to provide a priori biological facts which would lead to the effect in one population being equal to the effect in another population. For example, if we somehow knew that the effect of a drug is entirely determined by some gene whose prevalence differs between two countries, it is possible that when we compare people in Country A who have the gene with people in Country B who also have the gene, and compare people in Country A who don’t have the gene with people in Country B who don’t have the gene, the effect is equal between the relevant groups. Using an extension of this approach, we can try to look for a set of baseline covariates such that the effect can be expected to be approximately equal between two populations once we make the comparisons within levels of the covariates.

Unfortunately, things are more complicated than this. Specifically, we need to be more precise about what we mean by the word “effect”. When investigators measure effects, they have several options available to them: They can use multiplicative parameters (such as the risk ratio and the odds ratio), additive parameters (such as the risk difference), or several other alternatives that have fallen out of fashion (such as the arcsine difference). If the baseline risks differ between two populations (for example, between men and women), then at most one of these parameters can be equal between the two groups. Therefore, a biological model that ensures equality of the risk ratio cannot also ensure equality of the risk difference. The logic that determines whether a set of covariates is sufficient in order to get effect equality, is therefore necessarily dependent on how we choose to measure the effect.
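A small numerical sketch may make this concrete. The baseline risks and the shared risk ratio below are hypothetical numbers chosen for illustration, not values from the paper. If the risk ratio is forced to be equal in men and women and the effect is not null, the risk difference and the odds ratio necessarily differ between the groups.

```python
def measures(risk_treated, risk_control):
    """Return (risk ratio, risk difference, odds ratio) for one group."""
    risk_ratio = risk_treated / risk_control
    risk_difference = risk_treated - risk_control
    odds_ratio = (risk_treated / (1 - risk_treated)) / (
        risk_control / (1 - risk_control)
    )
    return risk_ratio, risk_difference, odds_ratio


# Hypothetical baseline risks and a hypothetical shared risk ratio.
baseline_men, baseline_women = 0.10, 0.20
shared_risk_ratio = 1.5

treated_men = baseline_men * shared_risk_ratio
treated_women = baseline_women * shared_risk_ratio

rr_m, rd_m, or_m = measures(treated_men, baseline_men)
rr_w, rd_w, or_w = measures(treated_women, baseline_women)

print(f"risk ratio:      men {rr_m:.3f}, women {rr_w:.3f}")
print(f"risk difference: men {rd_m:.3f}, women {rd_w:.3f}")
print(f"odds ratio:      men {or_m:.3f}, women {or_w:.3f}")
```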

Making things even worse, the commonly used risk ratio is not symmetric to the coding of the outcome variable: Generalizations based on the ratio of the probability of death will give different predictions from generalizations based on the ratio of the probability of survival. In other words, when using a risk ratio model, your conclusions are not invariant to an arbitrary decision that was made when the person who constructed the dataset decided whether to encode the outcome variable as (death=1, survival=0) or as (survival=1, death=0).
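The dependence on coding can be seen in a minimal sketch. All numbers here are hypothetical and chosen to make the contrast stark, and `transport_ratio` is an illustrative helper name. The same study data and the same target population yield two different predicted probabilities of death, depending only on which outcome state was coded as 1.

```python
def transport_ratio(p_treated, p_control, p_target_control):
    """Apply the study's ratio of probabilities (treated / control)
    to the target population's untreated probability."""
    return p_target_control * (p_treated / p_control)


# Hypothetical study: 20% die untreated, 40% die when treated.
death_control, death_treated = 0.20, 0.40
# Hypothetical target population: 60% die untreated.
target_death_control = 0.60

# Coding 1: outcome variable is death (death=1, survival=0).
pred_death_coding = transport_ratio(
    death_treated, death_control, target_death_control
)

# Coding 2: outcome variable is survival (survival=1, death=0);
# the prediction is converted back to a probability of death.
pred_survival = transport_ratio(
    1 - death_treated, 1 - death_control, 1 - target_death_control
)
pred_survival_coding = 1 - pred_survival

print(f"death coding:    {pred_death_coding:.2f}")
print(f"survival coding: {pred_survival_coding:.2f}")
```

With these numbers the death-coded risk ratio predicts a value above 1, which is not a valid probability, while the survival-coded ratio predicts a risk of about 0.70.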

The information that doctors (and the public) extract from randomized trials is often in the form of a summary measure based on a multiplicative parameter. For example, a study will often report that a particular drug “doubled” the risk of a particular side effect, and this then becomes the measure of effect that the clinicians will use in order to inform their decision making. Moreover, the standard methodology for meta-analysis is essentially a weighted average of the multiplicative parameter from each study. Any conclusion that is drawn from these studies would have been different if investigators had chosen a different effect parameter, or a different coding scheme for the outcome variable. These analytic choices are rarely justified by any kind of argument, and instead rely on a convention to always use the risk ratio based on the probability of death. No convincing rationale for this convention exists.

My goal is to provide a general framework that allows an investigator to reason from biological facts about which set of covariates is sufficient to condition on, in order for the effect in one population to be equal to the effect in another, in terms of a specified measure of effect. While the necessary biological conditions can at best be considered approximations of the underlying data generating mechanism, clarifying the precise nature of these conditions will be useful to assist reasoning about how much uncertainty there is about whether the results will generalize to other populations.

Q: What are the existing solutions to this problem, and why do you think you can improve on them?

Recently, much attention has been given to a solution by Judea Pearl and Elias Bareinboim, based on an extension of causal directed acyclic graphs. Pearl and Bareinboim’s approach is mathematically valid and elegant. However, the conditions that must be met in order for these graphs to be a reasonable approximation of the data generating mechanism, are much more restrictive than most trialists are comfortable with.

Here, I am going to skip a lot of details about these selection diagrams, and instead focus on the specific aspect that I find problematic. These selection diagrams abandon measures of effect completely, and instead consider the counterfactual distribution of the outcome under the intervention separately from the counterfactual distribution of the outcome under the control condition. This resolves a lot of the problems associated with effect measures, but it also fails to make use of information that is contained in how these two counterfactuals relate to each other.

Consider for example an experiment to determine the effect of homeopathy on heart disease. Suppose this experiment is conducted in men, and determines that there is no effect. If we use selection diagrams to reason about whether these conclusions also hold in women, we will have to construct a causal graph that contains every cause of heart disease whose distribution differs between men and women, measure these variables and control for them. Most likely, this will not be possible, and we will conclude that we are unable to make any prediction for what will happen if women take homeopathic treatments. The approach simply does not allow us to try to extrapolate the effect size (even when it is null), since it cannot make use of information about how what happened under treatment relates to what happens under the control condition. The selection diagram approach therefore leaves key information on the table: In my view the information that is left out is exactly those pieces of information that could most reliably be used to make generalizations about causal effects.

A closely related point is that the Bareinboim-Pearl approach leads to a conclusion that meta-analysis can be conducted separately in the active arm and the control arm. Most meta-analysts would consider this idea crazy, since it arguably abandons randomization (which is an objective fact about how the data was generated) in favor of unverifiable and questionable assumptions encoded in the graph, essentially claiming that all causes of the outcome have been measured.

Q: What are counterfactual outcome state transition parameters?

Our goal is to construct a measure of effect that allows us to capture the relationship between what happens if treated and what happens if untreated. We want to do this in a way that avoids the mathematical problems with standard measures of effect, and such that the magnitude of the parameters has a biological interpretation. If we succeed in doing this, we will be able to determine what covariates to control for on the basis of asking what biological properties are associated with the magnitude of the parameters.

Counterfactual outcome state transition parameters are effect measures that quantify the probability of “switching” outcome state if we move between counterfactual worlds. We define one parameter which measures the probability that the drug kills the patient, conditional on being someone who would have survived without the drug, and another parameter which measures the probability that the drug saves the patient, conditional on being someone who would have died without the drug.

Importantly, these parameters are not identified from the data, except under strong monotonicity conditions. For example, if we believe that the drug helps some people, harms other people and has no effect on a third group, there is no monotonicity and the method cannot be used. However, it is sometimes the case that the drug only operates in one direction. For example, for most drugs, it is very unlikely that the drug prevents someone from getting an allergic reaction to it. Therefore, its effect on allergic reactions is monotonic.

If the effect of treatment is monotonic, one of the COST parameters is equal to 0 or 1, and the other parameter is identified as the risk ratio. If this is a treatment that reduces incidence, the COST parameter associated with a protective effect is equal to the standard risk ratio based on the probability of death. If on the other hand the treatment increases incidence, the COST parameter associated with a harmful effect is identified as the recoded risk ratio based on the probability of survival. Therefore, if we determine which risk ratio to use on the basis of the COST model, the risk ratio is constrained between 0 and 1.
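The identification argument can be sketched in code. By the law of total probability, the risk under treatment equals the untreated risk times the probability of not being "saved", plus the untreated survival probability times the probability of being "killed". That is one equation in two unknowns, which is why a monotonicity assumption is needed. The function below is only a sketch under that assumption: its name, its argument names, and its parametrization in terms of switching probabilities are illustrative and are not the paper's notation. One minus the non-zero switching probability is the corresponding risk ratio.

```python
def cost_parameters(risk_treated, risk_control, direction):
    """Identify the two switching probabilities under assumed monotonicity.

    direction="protective": assume the drug never causes the outcome.
    direction="harmful":    assume the drug never prevents the outcome.

    Returns (prob_causes, prob_prevents), hypothetical names for
      P(outcome if treated | no outcome if untreated) and
      P(no outcome if treated | outcome if untreated).
    """
    if direction == "protective":
        if risk_treated > risk_control:
            raise ValueError("risks are inconsistent with a protective effect")
        # risk_treated = risk_control * (1 - prob_prevents)
        return 0.0, 1 - risk_treated / risk_control
    if direction == "harmful":
        if risk_treated < risk_control:
            raise ValueError("risks are inconsistent with a harmful effect")
        # 1 - risk_treated = (1 - risk_control) * (1 - prob_causes)
        return 1 - (1 - risk_treated) / (1 - risk_control), 0.0
    raise ValueError(f"unknown direction: {direction}")


# Hypothetical protective treatment: risk falls from 10% to 5%.
print(cost_parameters(0.05, 0.10, "protective"))

# Hypothetical harmful exposure: risk rises from 1% to 2%.
print(cost_parameters(0.02, 0.01, "harmful"))
```

The direction of monotonicity is passed in as an argument rather than inferred from the data, since it is meant to come from background biological knowledge.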

Q: Is this idea new?

The underlying intuition behind this idea is not new. For example, Mindel C. Sheps published a remarkable paper in the New England Journal of Medicine in 1958, in which she works from the same intuition and reaches essentially the same conclusions. Sheps’ classic paper has more than 100 citations in the statistical literature, but her recommendations have not been adopted to any detectable extent in the applied statistical literature. Jon Deeks provided empirical evidence for the idea of using the standard risk ratio for protective treatments in Statistics in Medicine in 2012.

What is new to this paper, is that we formalize the intuition Sheps was working from in terms of a formal counterfactual causal model, which is used as a bridge between the background biological knowledge and the choice of effect measure. Formalizing the problem in this way allows us to clarify the scope and limits of the approach, and points the direction to how these ideas can be used to inform future developments in meta-analysis.

Note: This blog entry is crossposted from Less Wrong. Please make any comments on the Less Wrong version