


27 October 2005

Don't Use Hypothesis Tests for Balance

Gary King

Jens' last two blog posts constitute an excellent statement of where the literature on matching is, but I think almost all of the literature has this point wrong. Hypothesis tests for checking balance in matching are in fact (1) unhelpful at best and (2) usually harmful.

Suppose you had a control group and a treatment group that are identical (exactly matched) except for one person, or except for a bunch of people in one very minor way. Suppose hypothesis tests indicate no difference between the groups, and so you'd be in the situation of reporting balance was great and no further adjustment was needed. (We might think of this as a real experiment where the outcome variable hasn't been collected yet and is expensive to collect.) If you were given the chance of dropping the one or few people that caused the two groups to differ and replacing them with others that exactly matched, would you do so? Since the dimension on which the inexact match or matches occurred might be the one that has a huge effect on your outcome variable, the bias due to not switching could be huge. So you'd undoubtedly make the switch, despite the fact that the hypothesis test indicated that there was no problem. Hence (1) the tests are unhelpful: passing the test does not necessarily protect one from bias more than failing the test.

Now suppose you have data that don't match very well by all hypothesis tests and you randomly (rather than systematically to improve matching) drop observations, in a bad application of matching. What will happen? Your t-tests or KS-tests or any other hypothesis tests will lose power and so will indicate that balance is getting better and better. Yet bias is not changing at all, and efficiency is dropping fast. The tests are telling you to discard data! Hence (2) hypothesis tests to evaluate balance are harmful, quite seriously so.

The fact is that there is no superpopulation to which we need to infer features of the explanatory variables; all analysis models we regularly use after matching are conditional on X. Balance should be assessed on the observed data, and not be the subject of inference or hypothesis tests.

This message rehearses an argument in a to-be-revised version of our matching paper by Ho, Imai, King, and Stuart that we hope to be finished with and post in a couple of weeks.

Posted by Gary King at October 27, 2005 4:40 AM

Comments

Hi Gary,

thanks for your follow up. That's the discussion I hoped for.

I agree with your main point that covariate balance tests should not be considered as hypothesis tests in the conventional sense. Given standard pre-test problems alone, it would be a pretty bad idea to read these p-values in the usual sense (i.e. the probability of obtaining a more extreme value than your result, given that there is zero effect in the population). Yet, I still think they are very useful, because, as Jas and Alexis argue in their paper, p-values provide a common metric to judge balance across a wide variety of tests. I am sure Jas will have more to say about this...

On your point re. dropping observations. It's true that discarding units and thus changing the estimand usually means trouble (you don't really know what you're estimating anymore). But isn't that a separate issue? See for example Guido's recent "Moving the Goalpost" paper http://www.courses.fas.harvard.edu/~ec2162/Papers/0922_Imbens.pdf

Cheers,
Jens

Posted by: Jens at October 27, 2005 11:18 AM

p-values do not provide a common metric in this instance. a smaller p-value can be worse and a larger p-value can be better in terms of balance or vice versa. The p-value is not even necessarily monotonic in good balance, since the sample size, the variance of X and the fraction of treated vs control units can change, all of which make up the hypothesis test.

(Jas and Alexis' paper on genetic matching is terrific, and as I understand it they're adding to their software a way to put in a different objective function so maximizing p-values, which is not what you want in this instance, can be dropped.)

In the example of dropping observations I gave, I was briefly summarizing a simulation we did for our paper. What we did was keep the treated units the same and then drop increasing numbers of randomly selected control units. If the quantity of interest is the average treatment effect on the treated, then the quantity of interest doesn't change. Bias (imbalance) also doesn't change, but the t-test gets smaller and smaller and the p-value gets bigger and bigger, which makes no sense. The same problem occurs when matching with replacement even if you keep the apparent sample size the same.
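As an illustrative sketch of the kind of simulation described above (not the authors' actual code or data): the treated units are held fixed, increasing numbers of randomly chosen control units are dropped, and the difference in means is tracked alongside the p-value. The sample sizes, the 0.3 shift in means, the seed, the number of repetitions, and the normal approximation used for the p-value are all assumptions made for this example.

```python
import math
import random
import statistics


def t_stat(treated, control):
    """Unequal-variance two-sample t-statistic for the difference in means."""
    diff = statistics.fmean(treated) - statistics.fmean(control)
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return diff / se


def approx_p_value(t):
    """Two-sided p-value from a normal approximation (an assumption, for simplicity)."""
    return 2.0 * (1.0 - statistics.NormalDist().cdf(abs(t)))


rng = random.Random(0)
treated = [rng.gauss(0.3, 1.0) for _ in range(200)]    # held fixed throughout
controls = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # imbalanced in the mean

results = []  # (controls kept, average mean difference, average p-value)
for n_keep in (2000, 500, 100, 25):
    diffs, pvals = [], []
    for _ in range(100):
        kept = rng.sample(controls, n_keep)  # random, not systematic, dropping
        diffs.append(statistics.fmean(treated) - statistics.fmean(kept))
        pvals.append(approx_p_value(t_stat(treated, kept)))
    results.append((n_keep, statistics.fmean(diffs), statistics.fmean(pvals)))

for n_keep, avg_diff, avg_p in results:
    print(f"controls kept={n_keep:5d}  mean diff={avg_diff:.3f}  avg p-value={avg_p:.4f}")
```

In a run like this, the average difference in means should stay roughly constant while the average p-value climbs as controls are discarded, which is the pattern the comment describes.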

Gary

Posted by: Gary King at October 27, 2005 12:25 PM

Hi Gary,

(1) As you know, I agree with you that the science of balance testing is in its infancy, and you're absolutely correct to argue that we ought to be thinking of better ways of operationalizing matching algorithms. I'm looking forward to the paper presenting the method that dominates the Diamond & Sekhon (2005) operationalization. It's such a worthwhile thing to focus on, because improving the matching procedure should enable better evaluations across disciplines, which in turn could enable the production of better pharmaceuticals, better public policies, etc.

(2) You write:
"...the tests are unhelpful: passing the test does not necessarily protect one from bias more than failing the test."
This claim has no support in the canonical Lalonde/DW dataset and (for what it's worth) goes against conventional wisdom and prior work. And to the contrary, the figures in Diamond & Sekhon (2005) show that as the p-values go up, the bias goes down. Granted, this is only one dataset. But it's an important one, and others are on the way very very soon. Note that our claim here is potentially falsifiable: the burden is on someone else to show empirically (and in a real dataset, or at the very minimum, a realistic simulated scenario) that, as you claim, "the tests are unhelpful: passing the test does not necessarily protect one from bias more than failing the test." We've been showing our figures for a year now, and we've seen nothing to the contrary. Your claim is so absolute; how do you explain the good results we obtain using these "unhelpful" tests?

(3) You write: "Now suppose you...drop observations, in a bad application of matching. what will happen? Your t-tests or ks-tests or any other hypothesis tests will lose power and so will indicate that balance is getting better and better...The tests are telling you to discard data!" I don't understand how these tests could influence the size of your matched sample. Size of sample is influenced by the analyst's decisions--but how is it influenced by the KS-test? Here's the situation that people encounter in practice: You have a dataset, you want to estimate ATT, you match with replacement, and you pick your matching ratio, "M". Conventional wisdom and common advice is to begin with M = 1 for one-to-one matching, because this should produce lowest bias. If your standard error on ATT seems large, you could try M = 2 to see if your s.e.'s shrink without too much loss of matching quality. Excluding the possibility of exact ties, your sample size is fixed by your matching ratio "M". So, for the consequent sample size, you run the balance tests.

Perhaps the discussion in your blog post is implicitly referring to problems associated with dropping treated units due to overlap-related issues, but that's a whole other kettle of fish. Dropping treated units in this manner brings with it a whole other set of problems (see the Imbens paper Jens cites) and depends in the first place upon how one interprets the "Overlap" assumption that matching requires. I'm sure these issues will be further clarified in your forthcoming paper, and I'm really looking forward to reading it.

Posted by: Alexis Diamond at October 27, 2005 1:34 PM

Hi Alexis,

I think you missed my previous comment, which answers some of these points.

I have no quarrels with your paper with Jas. Automating the search for better matches is a great idea, and you guys were the first to do this, and I doubt anyone will do better any time soon. Your approach does not depend on any particular objective function (although obviously you do need one). And with the changes to the software that you're putting in, people can choose their own objective functions to measure balance as they like.

My only claim is that balance in matching has to do with how close the observed treated and control groups are, and has nothing to do with any superpopulation. Hence, hypothesis tests are beside the point, and they are also dangerous as per the example I described in somewhat more detail in my last comment. They shouldn't be in anyone's objective function, and they shouldn't be presented, as they are throughout the literature, as evidence that good balance has been achieved.

For example, take a simple two-sample t-test for one variable. This test statistic (or the p-value based on it) is a function of the difference in means, which is a component of balance. That's good. But it's also a function of three other things that have nothing to do with balance, but that the user effectively chooses by picking a matching solution: the number of observations you retain after matching, the variance of the remaining observations in the variable, and the fraction of treated vs control units. These have nothing to do with balance and so should not be part of an objective function and should not be part of assessing whether balance has been achieved. The example of randomly dropping observations illustrates this pretty clearly, I think. If you match with replacement and use the same control units multiple times, you have the same issue as dropping observations: although the t-test math can become more complicated because of the induced dependence, the story is identical.
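To make that dependence explicit, here is the unequal-variance form of the two-sample t-statistic written out as a sketch. The notation (n = n_T + n_C for the retained sample size and π = n_T/n for the treated fraction) is introduced here for illustration and does not come from the thread.

```latex
t \;=\; \frac{\bar{X}_T - \bar{X}_C}{\sqrt{\dfrac{s_T^2}{n_T} + \dfrac{s_C^2}{n_C}}}
  \;=\; \sqrt{n}\;\frac{\bar{X}_T - \bar{X}_C}{\sqrt{\dfrac{s_T^2}{\pi} + \dfrac{s_C^2}{1-\pi}}},
\qquad n = n_T + n_C,\quad \pi = \frac{n_T}{n}.
```

Only the numerator measures imbalance in the means. The retained sample size n, the variances s_T^2 and s_C^2, and the treated fraction π all change with the choice of matched sample, so t can move even when the difference in means does not.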

Gary

Posted by: Gary King at October 27, 2005 1:57 PM

I agree with one very important point which Gary makes. And I have concerns about some other points. I agree that balance tests should not be thought of as hypothesis tests in the usual sense. Alexis and I make this point at length in our GenMatch paper.

I have two main reasons for why balance tests should not be thought of as usual hypothesis tests. First as Alexis and I show, conventional levels of balance (the cult of .05 and .1) simply do not lead to valid causal inferences. The degree of balance required is generally much higher than what is conventionally thought. Second, there is the problem of data mining. We usually try *many* different matches before settling on one (GenMatch goes through thousands), and as such the p-value suffers from standard pre-test problems.

This leads to the question of how do we know if we have good enough balance? More on this below.

Gary King writes: So you'd undoubtedly make the switch, despite the fact that the hypothesis test indicated that there was no problem. Hence (1) the tests are unhelpful: passing the test does not necessarily protect one from bias more than failing the test.

Passing the test does not mean that there is no balance problem. But this is in part because there is no simple measure for "passing". Gary gives a hypothetical example of this. Alexis and I offer a few examples based on the LaLonde data. For the LaLonde dataset, one needs a "p-value" of over .2 for the paired t-test and over .7 for the unpaired t-test in order to obtain decent estimates!! But the metric DOES work. As the calculated p-value goes up, the estimated bias on average goes down ---see the figures in Genetic Matching.

Note that in these figures it is possible that a model with a low p-value can get very close to the experimental estimate. But there is no reason to select it as opposed to another model which has just as low of a p-value but is very far from the experimental estimate. This is where the Dehejia-Wahba and Smith-Todd debate was before we came along. Only with high values on the p-value metric (higher than observed in any matching exercise in the literature we are aware of other than exact matching) do all of the estimated models get close to the true estimate (in the LaLonde data).

This leads to the question of how do we know if we have good enough balance? The way forward I think is to use something like GenMatch to maximize p-values OR a weakly monotonic function of them using a large variety of functions of balance (difference of means, D stats from KS tests, Wilcoxon Mann-Whitney tests and if you have enough data multivariate KS tests). And *then*, after balance is as good as it is going to get, to calculate the Rosenbaum sensitivity tests to see how sensitive the causal effects are to the observed level of unbalance in X (and to hidden bias). One can adopt Rosenbaum's hidden bias test to see what effect the maximum *observed* bias in X can have (with the assumption, for example, that the unbalanced covariate correlates perfectly with the outcome). This is a worst case calculation.

A new version of my matching software will automate the calculation of the sensitivity tests. Early results show that in order to stand up to the test of hidden bias and bias based on the observed unbalance, one needs to have *far* higher p-values for the estimated causal effect than is commonly thought.

This sensitivity approach takes into account the issues raised, which can result in a non-monotonic mapping between the p-values and bias for the estimated causal effect.

Gary King writes: (Jas and Alexis' paper on genetic matching is terrific, and as I understand it they're adding to their software a way to put in a different objective function so maximizing p-values, which is not what you want in this instance, can be dropped.)

Thanks. My matching package can already take an arbitrary function: (Matching Software). What function do people suggest which isn't simply a weakly monotonic function of the usual functions?

Gary King writes: The p-value is not even necessarily monotonic in good balance, since the sample size, the variance of X and the fraction of treated vs control units can change, all of which make up the hypothesis test.

Comparing p-values as a metric only makes sense for the same sample size and the same distribution of X---i.e., the same estimand. The fact that one may use some observations more than once can be accounted for (the same way it has to be accounted for when calculating SEs for the causal estimate itself). This is done in my GenMatch code.

Gary King writes: The fact is that there is no superpopulation to which we need to infer features of the explanatory variables; all analysis models we regularly use after matching are conditional on X. Balance should be assessed on the observed data, and not be the subject of inference or hypothesis tests.

Not all models make the assumption of fixed X (Abadie and Imbens, for example, do not). Also, if one is going to be *pure*, given observational matching, one needs to give up on any probability statement for the estimated causal effect as well as X. There are a variety of ways of conceiving of where the distribution comes from in an observational matching exercise. One is to assume that the matching exercise approximates an experiment and hence one could even do randomization inference (for example see Rosenbaum's book and also see Imbens and Rosenbaum 2005 for an IV example which is relevant). By the same logic, one can conduct a hypothesis test for the difference between a given X covariate in treated and control. This is similar to the way in which, in a given experiment where the overall population X was fixed and treatment randomly assigned, one can coherently calculate p-values for differences in X. One may not buy this approximation argument or others like it, but one is going to have to make something like it up in order to make probability statements about the estimated causal effect.


Posted by: Sekhon at October 27, 2005 1:58 PM

Hi Jas,

I would suggest users of GenMatch (which by the way, is also available through MatchIt, along with most other approaches to matching we like) exclude p-values from the objective function. If it's a p-value or t-test for the difference in means, just use the numerator, which in this case is the difference in means. As you point out, there are lots of other features of balance too, such as the difference in variances, etc.

P-values and balance are sometimes positively related, but sometimes not, because they are contaminated with other features (see my last post). So it's certainly possible that maximizing p-values can help, but it is also possible (as in the example I gave) that they can hurt a great deal.

I agree, as you say, that "Comparing p-values as a metric only makes sense for the same sample size and the same distribution of X". However, in most applications of matching, the sample size changes since observations are dropped or matching is done with replacement. In addition, with almost all applications of matching, the distribution of X changes too. In fact, the point of matching is to adjust the distribution of X (so it is more unrelated to T), so it is only rarely possible to meet the conditions that make the p-value not misleading. Fortunately, it's easy to fix: just use the numerator (in the t-test) or the equivalent.

Gary

Posted by: Gary King at October 27, 2005 2:41 PM

Hi Gary,

Gary King writes: I agree, as you say, that "Comparing p-values as a metric only makes sense for the same sample size and the same distribution of X". However, in most applications of matching, the sample size changes since observations are dropped or matching is done with replacement.

Sample size does not change with GenMatch. Matching is done with replacement but there is a correction for that (following Abadie-Imbens), just like there is one for adjusting the SEs for the causal estimate. So, I don't understand why the p-value is misleading but the difference in means is okay? Say we are estimating ATT, the variance will converge to be that of the treatment group (GenMatch does NOT change the estimand). So, where is the issue?

Gary King writes: I would suggest users of GenMatch (which by the way, is also available through MatchIt, along with most other approaches to matching we like) exclude p-values from the objective function.

What should they be replaced with? For example, difference of means is not enough so let's add the D stat from something like the KS test and the variance ratio. Now, how does one combine this information? If one just uses, say, the D statistic, performance (in Monte Carlos) is good but worse than the current implementation.
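For concreteness, here is a minimal sketch of the three descriptive quantities mentioned above (difference in means, the KS D statistic, and the variance ratio), computed on the observed data with no p-values involved. The function names and the toy data are hypothetical, and the sketch deliberately does not answer the open question of how to combine the measures into one objective.

```python
import bisect
import statistics


def ks_distance(a, b):
    """Largest vertical gap between the two empirical CDFs (the KS D statistic)."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    for x in sa + sb:  # the supremum is attained at an observed data point
        fa = bisect.bisect_right(sa, x) / len(sa)
        fb = bisect.bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d


def balance_summary(treated, control):
    """Descriptive balance measures for one covariate (hypothetical helper)."""
    diff = statistics.fmean(treated) - statistics.fmean(control)
    return {
        "mean_diff": diff,
        # Standardized by the treated-group SD, which stays fixed when estimating ATT.
        "std_mean_diff": diff / statistics.stdev(treated),
        "variance_ratio": statistics.variance(treated) / statistics.variance(control),
        "ks_d": ks_distance(treated, control),
    }


summary = balance_summary([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
print(summary)
```

None of these quantities shrinks merely because observations were discarded, which is the property the p-value lacks.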

Cheers,
Jas.

Posted by: Sekhon at October 27, 2005 3:16 PM

Every application of matching changes the distribution of X. But changing the distribution of X doesn't necessarily affect balance unless you have a good method of matching or use the method well. It's the latter and not the former we care about and should evaluate. Yet, even if the method of matching you apply didn't work to improve balance, it will still affect the distribution of X, the sample size, or effective sample size (due to how much replacement you choose to use). In any of these cases, hypothesis tests can mislead since they include balance plus extraneous features. Using t-tests to evaluate balance is like measuring temperature by the reading on the thermometer plus the amount of money in your pocket minus three. It might work, but will often mislead.

What specific measure should one use? The ultimate measure is to compare the two multidimensional empirical histograms, but obviously this isn't that helpful in practice given how sparse they are. We've been using QQ-plots and measures constructed from them (max, mean, etc.), but you ultimately need the multivariate relationships too of course. So there won't ever be a "correct" measure of balance, since there are lots of low dimensional summaries that are all reasonable features of balance. If one shows imbalance in one of these, then you have imbalance in the joint distribution for sure, but the reverse obviously doesn't hold. For t-tests and p-values you don't have a relationship in either direction.
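One way to build summaries from a QQ-plot, sketched under assumptions: take the two empirical quantile functions on a common grid of probabilities and report the mean and maximum absolute gap. The grid of 99 points, the linear-interpolation quantile rule, and the function names are choices made for this example, not something specified in the thread.

```python
import statistics


def quantile(sorted_x, p):
    """Empirical quantile by linear interpolation, for p in [0, 1]."""
    pos = p * (len(sorted_x) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])


def eqq_summary(treated, control, n_points=99):
    """Mean and max absolute gap between the two empirical quantile functions."""
    st, sc = sorted(treated), sorted(control)
    probs = [(i + 1) / (n_points + 1) for i in range(n_points)]
    gaps = [abs(quantile(st, p) - quantile(sc, p)) for p in probs]
    return {"mean": statistics.fmean(gaps), "max": max(gaps)}


treated = [0.0, 1.0, 2.0, 3.0, 4.0]
shifted = [x + 0.5 for x in treated]  # same shape, shifted by 0.5
print(eqq_summary(treated, shifted))
print(eqq_summary(treated, treated))  # identical samples give zero gaps
```

These are univariate summaries, so a nonzero value signals imbalance in the joint distribution, but a zero value for every covariate does not rule it out.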

In any event, my original comment is not about GenMatch. It was about the enormous number of applications that use t-tests and the like to claim good balance. Those uses are almost all clearly wrong. GenMatch, of course, can be used for good or evil; I'd use it for the forces of good!

Gary

Posted by: Gary King at October 27, 2005 3:44 PM

Gary King writes: the enormous number of applications that use t-tests and the like to claim good balance. Those uses are almost all clearly wrong.

This is a very good target to have, and I completely agree with your point. One does not want to run around and do hypothesis tests as are usually done in the literature.

Gary King writes: We've been using QQ-plots and measures constructed from them (max, mean, etc.)

Those features of a QQ-plot (which I have also been using in the form of various nonparametric distribution tests), are, in part, functions of the variance of X. Say, for example, the X covariates are normally distributed. The QQ-plot is simply information about the mean and variance. So, if one does not like using the variance information (in calculating a t-stat), one is using it anyways in the QQ-plot!
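The normal case can be written out as a short sketch. The symbols are introduced here for illustration: suppose the covariate is N(μ_T, σ_T²) among the treated and N(μ_C, σ_C²) among the controls, with Q_T and Q_C the corresponding quantile functions and Φ⁻¹ the standard normal quantile function.

```latex
Q_T(p) = \mu_T + \sigma_T\,\Phi^{-1}(p), \qquad
Q_C(p) = \mu_C + \sigma_C\,\Phi^{-1}(p),
\qquad\Longrightarrow\qquad
Q_T(p) - Q_C(p) = (\mu_T - \mu_C) + (\sigma_T - \sigma_C)\,\Phi^{-1}(p).
```

Under that assumption the vertical gap in the QQ-plot is determined entirely by the difference in means and the difference in standard deviations, so any summary of the gap carries variance information as well.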

Also, QQ-plots do not themselves put different covariates on the same metric so it can be decided that, for example, variable 3 is a bigger problem than variable 2. Decisions like that need to be made. The Rosenbaum sensitivity approach I outlined earlier could be one (computationally intensive) alternative.

Jas.

Posted by: Sekhon at October 27, 2005 4:23 PM

A good balance metric should have the property that bias in estimated ATT is (at least weakly) monotonically decreasing along the entire domain of the balance measure. Any paper proposing a new metric should be able to make the case (as we do in the GenMatch paper) that this property holds in realistic settings where we think we know what the "target estimate" is (e.g., Lalonde/DW dataset). I think this is an important requirement for new candidate balance metrics. Best, Alexis

Posted by: Alexis Diamond at October 27, 2005 6:04 PM

I think we all agree on the main point about t-test and p-values.

The minor point about balance stats is that we want a measure that is informative about the equality or proximity of the two multivariate distributions. If one variable differs in the mean, the variance, or the full histogram (as in comparing the qq-plot), then the multivariate empirical distribution differs and there is some imbalance. Even tho the reverse isn't true, this is still informative. Statistics, like p-values or t-tests, do not meet this minimal condition.

Gary

Posted by: Gary King at October 27, 2005 6:43 PM

Gary King writes: If one variable differs in the mean, the variance, or the full histogram (as in comparing the qq-plot), then the multivariate empirical distribution differs and there is some imbalance. Even tho the reverse isn't true, this is still informative. Statistics, like p-values or t-tests, do not meet this minimal condition.

Why don't p-values and t-tests meet this minimal condition? Let's simplify the situation: say all of the covariates are normal and we are estimating ATT with replacement and a weights correction for using obs multiple times. How is it possible that with an abs(t-stat) > 0 there isn't some imbalance in the multivariate empirical distribution? Just to make sure we are on the same page: how about if we standardized by the variance of the treated X (which does not change)?

Posted by: Sekhon at October 27, 2005 7:12 PM

Yes, right! the problem with t-tests is anything above zero. With exact matches, everything works fine.

Posted by: Gary King at October 27, 2005 7:22 PM

Gary King writes: Yes, right! the problem with t-tests is anything above zero. With exact matches, everything works fine.

I don't see the problem. How is it possible that with an abs(t-stat) > 0 there isn't some imbalance in the multivariate empirical distribution? Assume the conditions listed above.

Posted by: Sekhon at October 27, 2005 7:33 PM

"some imbalance...," yes: you can distinguish exact matching with all the stats we've been talking about. That's easy. But ranking different levels of imbalance above exact matching doesn't work with t-tests. Think about the example with observations randomly dropped I gave originally. I'll try to post a figure that illustrates this over the weekend.

Gary

Posted by: Gary King at October 27, 2005 8:44 PM

I see the problem with randomly dropping observations. It is an important point given most of the literature, and the point is utterly clear. What I don't see is how it relates to the case of estimating ATT with a fixed sample size, a fixed ratio between treated and control and with replacement but weights to ensure that the effective sample size is held constant.

Can your point be boiled down to changing sample size (which some do) plus the fact that, for example, the t-test is sensitive to changes in the second moment but does not test for them? So, we can look like we are doing better by just increasing the variance but not changing the mean differences. If that's the point then it is an important point given most of the literature which naively only uses t-tests, but the point is clear and the way around is also clear---i.e., examine the higher moments.
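A small deterministic sketch of that second possibility: the difference in means is held at exactly 0.5 while the spread of the control group is scaled up, and the t-statistic shrinks anyway. The data are made up for illustration and the unequal-variance t-statistic is an assumption about which test is meant.

```python
import math
import statistics


def t_stat(treated, control):
    """Unequal-variance two-sample t-statistic for the difference in means."""
    diff = statistics.fmean(treated) - statistics.fmean(control)
    se = math.sqrt(statistics.variance(treated) / len(treated)
                   + statistics.variance(control) / len(control))
    return diff / se


base = [-2.0, -1.0, 0.0, 1.0, 2.0] * 10  # symmetric around zero, 50 values
treated = [0.5 + z for z in base]        # treated mean is 0.5

rows = []  # (scale, difference in means, t-statistic)
for scale in (1.0, 2.0, 4.0):
    control = [scale * z for z in base]  # control mean stays 0, spread grows
    diff = statistics.fmean(treated) - statistics.fmean(control)
    rows.append((scale, diff, t_stat(treated, control)))

for scale, diff, t in rows:
    print(f"control scale={scale:.0f}  mean diff={diff:.2f}  t={t:.3f}")
```

The mean difference is the same in every row, so a t-test alone would report improving "balance" when all that changed was the second moment.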

Posted by: Sekhon at October 28, 2005 1:40 AM

I've been thinking more about this blog, and now I finally understand the argument, why Gary would raise questions about the power considerations associated with the t-test p-value and the higher moments of the data. It's a subtle and interesting issue in this matching context that I failed to appreciate first time around.

Posted by: Alexis Diamond at October 30, 2005 1:13 AM
