26 October 2005

Did You Achieve Balance?! Part II

Jens Hainmueller

Continuing from yesterday's post, another popular way to test balance is to examine standardized differences (SDIFF) between groups (Rosenbaum and Rubin 1985). The SDIFF is the difference in means in the matched samples, scaled by the square root of the average variance in the unmatched groups. This test has been criticized for lacking formal criteria for judging the size of the standardized bias. Moreover, it may be open to manipulation, since one can add observations to the control group in order to inflate the variance in the denominator and thereby shrink the statistic (Smith and Todd 2005).
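
To make the definition concrete, here is a minimal sketch of the standardized difference for a single covariate. The function name and the toy age values are made up for illustration, and the statistic is left unscaled (some authors report it in percent).

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(treated_matched, control_matched,
                            treated_unmatched, control_unmatched):
    """Difference in matched-sample means, divided by the square root of
    the average (sample) variance of the two unmatched groups."""
    mean_gap = mean(treated_matched) - mean(control_matched)
    avg_var = (variance(treated_unmatched) + variance(control_unmatched)) / 2
    return mean_gap / sqrt(avg_var)


# Hypothetical covariate (age) before and after matching.
treated = [20, 22, 24, 26, 28]
controls_all = [20, 25, 30, 35, 40]
controls_matched = [20, 25, 25, 30, 30]  # one matched control per treated unit

before = standardized_difference(treated, controls_all, treated, controls_all)
after = standardized_difference(treated, controls_matched, treated, controls_all)
print(round(before, 3), round(after, 3))  # -0.997 -0.332
```

Note that the denominator is fixed by the unmatched groups, so matching can only change the numerator. The sketch says nothing about how small the result has to be, which is exactly the missing criterion discussed above.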

Staying in the realm of univariate balance tests, some claim that difference in means tests are insufficient and that Kolmogorov-Smirnov (KS) tests are needed to non-parametrically test for the equality of distributions (Diamond and Sekhon 2005). These KS tests need to be bootstrapped, by the way, to yield correct coverage in the presence of point masses in the distributions of the covariates (Abadie 2002). Again, these tests would substantially increase the balance hurdle. Are they necessary for reliable causal inference?
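
As a rough illustration of what a bootstrapped KS test can look like, the sketch below computes the two-sample KS statistic and then resamples from the pooled data to approximate its distribution under the null of equal distributions. The pooled resampling scheme, the function names, and the toy data are assumptions made for this example; this is not necessarily the exact procedure in Abadie (2002).

```python
import random
from bisect import bisect_right


def ks_statistic(x, y):
    """Largest absolute gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


def bootstrap_ks_pvalue(x, y, n_boot=1000, seed=0):
    """Share of pooled-sample bootstrap draws whose KS statistic is at
    least as large as the observed one. Ties (point masses) are handled
    naturally because resampling reproduces them."""
    rng = random.Random(seed)
    observed = ks_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_boot):
        bx = [rng.choice(pooled) for _ in x]
        by = [rng.choice(pooled) for _ in y]
        if ks_statistic(bx, by) >= observed - 1e-12:  # tolerance for float ties
            hits += 1
    return hits / n_boot


# Hypothetical covariate with a point mass at zero (e.g. prior earnings).
treated = [0, 0, 0, 1, 2, 3, 5]
control = [0, 0, 1, 1, 2, 4, 6]
print(ks_statistic(treated, control), bootstrap_ks_pvalue(treated, control))
```

The point of resampling is that the usual KS reference distribution assumes continuous data; with many tied values, such as a spike at zero, the simulated distribution is used in its place.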

Apart from univariate tests, there are also some multivariate balance tests floating around in the literature, such as Hotelling's T^2 test of the joint null of equal means of all covariates, multivariate (bootstrapped) Kolmogorov-Smirnov (KS) and chi-square null deviance tests based on the estimated assignment probabilities, and various regression-based tests for joint insignificance. Which of these tests is preferable in which situation? What is the relationship between univariate and multivariate balance?

Last but not least, there is the thorny question of significance levels. Is a p-value of, let's say, 0.10 against the null of equality of means high enough for satisfactory balance? Is 0.05 permissible? There is evidence that conventional significance standards are too lenient to obtain reliable causal inference in the canonical LaLonde data set (Diamond and Sekhon 2005).

That is a lot of questions, and I do not know the answers to them. The current lack of a scholarly standard for covariate balance strikes me as troubling, because balance affects the quality of the causal inferences we draw. I think it is important to bring the balance issue to the forefront of the matching debate. That is why Jas Sekhon and I are currently working on a paper on this topic. Suppose you are reviewing a matching article. What does it take to convince you that the authors "achieved balance"? You are cordially invited to join the debate.

Posted by James Greiner at October 26, 2005 4:08 AM

Comments

I think this entry is a good statement of where the literature on matching is, but I think almost all of the literature has this point wrong. Hypothesis tests for checking balance in matching are in fact (1) unhelpful at best and (2) usually harmful.

Suppose you had a control group and a treatment group that are identical (exactly matched) except for one person, or except for a bunch of people in one very minor way. Suppose hypothesis tests indicate no difference between the groups, so you would report that balance was great and that no further adjustment was needed. (We might think of this as a real experiment in which the outcome variable has not yet been collected and is expensive to collect.) If you were given the chance of dropping the one or few people that caused the two groups to differ and replacing them with others that exactly matched, would you do so? Since the dimension on which the inexact match or matches occurred might be the one that has a huge effect on your outcome variable, the bias due to not switching could be huge. So you'd undoubtedly make the switch, despite the fact that the hypothesis test indicated that there was no problem. Hence (1) the tests are unhelpful: passing the test does not necessarily protect one from bias more than failing the test.

Now suppose you have data that don't match very well according to all hypothesis tests, and you drop observations randomly (rather than systematically to improve matching), in a bad application of matching. What will happen? Your t-tests, KS tests, or any other hypothesis tests will lose power and so will indicate that balance is getting better and better. Yet bias is not changing at all, and efficiency is dropping fast. The tests are telling you to discard data! Hence (2) hypothesis tests to evaluate balance are harmful, quite seriously so.
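
The random-dropping argument can be checked with a small simulation. The sketch below is only an illustration under made-up assumptions: a single covariate, roughly normal values built on a deterministic grid, a fixed imbalance of 0.2 between the groups, and hypothetical function names. It drops observations at random and tracks the average mean difference and the average absolute t-statistic.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def t_statistic(a, b):
    """Welch two-sample t-statistic for a difference in means."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def make_groups(n=1000, gap=0.2):
    """Deterministic, roughly normal covariate; treated values shifted by `gap`."""
    base = NormalDist()
    control = [base.inv_cdf((i + 0.5) / n) for i in range(n)]
    treated = [v + gap for v in control]
    return treated, control


def average_after_random_drop(treated, control, keep, reps=200, seed=0):
    """Keep `keep` randomly chosen units per group, `reps` times over.
    Returns the average mean difference and the average |t|."""
    rng = random.Random(seed)
    gaps, tstats = [], []
    for _ in range(reps):
        t_sub = rng.sample(treated, keep)
        c_sub = rng.sample(control, keep)
        gaps.append(mean(t_sub) - mean(c_sub))
        tstats.append(abs(t_statistic(t_sub, c_sub)))
    return mean(gaps), mean(tstats)


treated, control = make_groups()
print("all 1000 per group: |t| =", round(abs(t_statistic(treated, control)), 2))
for keep in (250, 50):
    gap, t_abs = average_after_random_drop(treated, control, keep)
    print(keep, "kept per group: mean gap", round(gap, 3), "avg |t|", round(t_abs, 2))
```

On average the mean difference stays near 0.2 however many observations are discarded, while the t-statistic shrinks roughly with the square root of the sample size. A test-based rule would therefore read the thinned sample as "better balanced".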

The fact is that there is no superpopulation to which we need to infer features of the explanatory variables; all analysis models we regularly use after matching are conditional on X. Balance should be assessed on the observed data, and not be the subject of inference or hypothesis tests.

This message rehearses an argument in a to-be-revised version of our matching paper (by Ho, Imai, King, and Stuart at http://gking.harvard.edu/preprints.shtml#matchp) that we hope to be finished with and post in a couple of weeks.

Gary King

Posted by: Gary King at October 26, 2005 12:08 PM

Hi Jens,

The question of "When have we achieved balance" goes right to the heart of the matter. As you note, Diamond and Sekhon (2005) makes the case that the balance tests used in previous work have been too lenient. (Not surprisingly: After all, why should it be easy to make reliable causal inferences in non-experimental settings?) In particular, we found that all prior work over the past decade with the canonical LaLonde/Dehejia-Wahba dataset failed to achieve a degree of balance adequate to ensure reliable results. My own view is that this problem is not limited to the LaLonde/Dehejia-Wahba datasets, but is in fact endemic to > 95% of all matching-based analyses across the disciplines. In the LaLonde/Dehejia-Wahba setting, you don't begin to see good results until the lowest p-value from paired t- and KS-tests across ALL covariates and 2-way interactions is greater than 0.15 to 0.2. (Incidentally, this very high p-value threshold was Jas Sekhon's intuition very early on, prior to us doing the work and validating it.) How many prior matching analyses can claim to have achieved this level of balance? None that I know of. That's the bad news (for prior work). The good news (for future work, and aspiring muckraking graduate students) is that this level of balance is achievable in many datasets, if you and your computer work hard enough and use the right methodology (i.e., search over the space of covariate weights a la Diamond and Sekhon 2005). Cheers.
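
For readers wondering what "the lowest p-value across all covariates and 2-way interactions" involves in practice, here is a rough sketch. It is not the Diamond and Sekhon implementation: it covers only the paired t-test part (no KS tests), uses a normal approximation to the t distribution to stay within the standard library, builds interactions as products of distinct covariate pairs, and uses hypothetical function names and toy data.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_p_value(treated, control):
    """Two-sided p-value for a paired difference in means
    (normal approximation; an assumption made for this sketch)."""
    diffs = [t - c for t, c in zip(treated, control)]
    spread = stdev(diffs)
    if spread == 0:  # all pairs differ by the same amount
        return 1.0 if mean(diffs) == 0 else 0.0
    z = mean(diffs) / (spread / sqrt(len(diffs)))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def with_interactions(rows):
    """Add the product of every distinct pair of covariates to each row."""
    names = sorted(rows[0])
    expanded = []
    for row in rows:
        new_row = dict(row)
        for a, b in combinations(names, 2):
            new_row[f"{a}*{b}"] = row[a] * row[b]
        expanded.append(new_row)
    return expanded


def lowest_p_value(treated_rows, control_rows):
    """Worst-balanced term and its p-value; row i of each list is a matched pair."""
    t_full = with_interactions(treated_rows)
    c_full = with_interactions(control_rows)
    p_values = {
        name: paired_p_value([r[name] for r in t_full], [r[name] for r in c_full])
        for name in t_full[0]
    }
    worst = min(p_values, key=p_values.get)
    return worst, p_values[worst]


# Hypothetical matched pairs.
treated_rows = [{"age": 30, "educ": 12}, {"age": 41, "educ": 10},
                {"age": 50, "educ": 16}, {"age": 28, "educ": 11}]
control_rows = [{"age": 31, "educ": 12}, {"age": 40, "educ": 11},
                {"age": 47, "educ": 16}, {"age": 29, "educ": 10}]
term, p = lowest_p_value(treated_rows, control_rows)
print(term, round(p, 3), "passes 0.15 threshold:", p > 0.15)
```

The design choice worth noting is the minimum: a single badly balanced covariate or interaction is enough to fail the check, which is what makes this standard so much stricter than looking at covariates one at a time with a 0.05 cutoff.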

Posted by: Alexis Diamond at October 26, 2005 5:02 PM

What's the cite for Diamond and Sekhon 2005?

Posted by: Felix Elwert at October 26, 2005 6:55 PM

Hi Alexis,

thanks again for your muckraking comment. I was actually pretty stunned when I saw your evidence on what balance standards are actually needed to make the inferences in the LaLonde dataset somewhat reliable. That’s a balance hurdle that many articles do not jump over.

It would be interesting to know whether the same pattern holds in other experimental datasets where we can do LaLonde-type evaluations by integrating observational data. If we find that the balance levels required to get somewhat reliable answers are similarly high, this would start to make me anxious about a lot of the published matching results out there.

Did you try that yet?

Best,
Jens

Posted by: Jens at October 26, 2005 7:05 PM

The citation is:

Alexis Diamond and Jasjeet Sekhon. 2005. Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies.

You can download the paper at:
http://sekhon.polisci.berkeley.edu/papers/GenMatch.pdf

Posted by: Jens at October 26, 2005 7:09 PM

Jens, your comment about investigating other LaLonde-like projects is right on the mark, and we have a graduate student working on it. You and the rest of the blog community will be among the first to know what we find out.
Of course, there have been a number of studies that have attempted to answer LaLonde-like questions. See the text "Learning More From Social Experiments" by Howard Bloom (an early innovator of IV-estimation) for more information.

Posted by: Alexis Diamond at October 26, 2005 8:21 PM
