October 2005
Sun Mon Tue Wed Thu Fri Sat
            1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31          

Authors' Committee

Chair:

Matt Blackwell (Gov)

Members:

Martin Andersen (HealthPol)
Kevin Bartz (Stats)
Deirdre Bloome (Social Policy)
Andy Eggers (Gov)
John Graves (HealthPol)
Rich Nielsen (Gov)
Maya Sen (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Arthur Spirling, Jamie Robins, Don Rubin, Chris Winship

Recent Comments

Recent Entries

Categories

Blogroll

Brad DeLong
Cognitive Daily
Complexity & Social Networks
Developing Intelligence
EconLog
The Education Wonks
Empirical Legal Studies
Free Exchange
Freakonomics
Health Care Economist
Junk Charts
Language Log
Law & Econ Prof Blog
Machine Learning (Theory)
Marginal Revolution
Mixing Memory
Mystery Pollster
New Economist
Political Arithmetik
Political Science Methods
Pure Pedantry
Science & Law Blog
Simon Jackman
Social Science++
Statistical modeling, causal inference, and social science

Archives

Notification

Powered by
Movable Type 4.24-en


« October 27, 2005 | Main | October 30, 2005 »

28 October 2005

More Questions About Balance (And No Answers)

Jim Greiner

The recent posts on achieving good balance within matching have stimulated a certain amount of interest. To this debate I offer more questions and, alas, no answers, which are what I'd really like to know. (For what it's worth, I am not doing research in this area. All of my questions are genuine, not rhetorical.)

As I understand it, the genetic algorithm that Diamond and Sekhon favor searches for matches that minimize p-values from hypothesis tests. The subject of the hypothesis tests are the covariates, taken one at a time, and the two-way interactions, also taken one at a time.

My questions:
Is the objective in matching treated and control units to find sets of observations with the same JOINT distribution of the covariates, which is what one would have in a randomized experiment?

If so, do we expect achieving balance in all univariate (i.e. marginal) and two-way distributions to accomplish this goal, given that the marginal distributions of any multidimensional random vector do not determine the joint? On the other hand, if two sets of random vectors have the same joint distribution, would we expect hypothesis tests applied to individual (univariate) covariates or their interactions to achieve p-values of .15 or greater?

Does the dimension of the vector (i.e. the number of covariates) play a role here, in that if we had 20 covariates, we would expect a comparison of individual covariates marginally to produce a few p-values of below .15? Perhaps more broadly, what theory tells us that the genetic algorithm search is actually attempting to do the right thing - and what is it?

A propensity score method has answers to some of these questions, though it raises others. On the plus side, the theorems say that observations with the same propensity score have the same joint (not merely marginal) distribution of the covariates. Thus, if the goal is to replicate a randomized experiment's much-valued ability to produce observations with the same joint covariate distribution, conditioning on the true propensity score will do that. That's the theory that tells us what propensity score matching is attempting to do is the right thing. The problem is, of course, that in any case that matters, we don't know the true propensity scores, and estimation of them raises profound questions about model fit and adequacy. One can check disparities in marginal distributions, but for the reasons stated above, such checks are not really enough. A question for advocates of propensity scores is the following: if propensity score matching is designed to reduce dependence on the substantive model that relates outcomes to covariates, does it do so only by inducing dependence on proper specification of the propensity score model?

For those who would eschew hypothesis tests in assessing balance (see yesterday's post), how does one assess balance? True, one can always reduce the power of any test to reject a null by discarding observations (I have heard that K-S in particular has low power), but any comparison of distributions rests on some set of criteria. Looking at t-scores is a hypothesis test (how else would one decide when the set of scores is too big or too small?). Are hypothesis tests the worst method of assessing balance, except for all of the others?

I have only one suggestion on this subject: whatever method one uses to create matched sets of treated and control groups, after all ordinary checking of marginal distributions is complete, throw something completely wild at the results. For both groups, calculate a fifth moment of covariate one, interact it with a third moment of covariate two and a second moment of covariate three. Do a test and see what happens. If the two groups have the same joint distribution of their covariates . . . .

Posted by James Greiner at 3:19 AM