28 October 2005
Jim Greiner
The recent posts on achieving good balance within matching have stimulated a certain amount of interest. To this debate I offer more questions and, alas, no answers, though answers are what I'd really like to have. (For what it's worth, I am not doing research in this area. All of my questions are genuine, not rhetorical.)
As I understand it, the genetic algorithm that Diamond and Sekhon favor searches for matches that maximize the smallest p-value from a set of hypothesis tests. The subjects of the hypothesis tests are the covariates, taken one at a time, and the two-way interactions, also taken one at a time.
My questions:
Is the objective in matching treated and control units to find sets of observations with the same JOINT distribution of the covariates, which is what one would have in a randomized experiment?
If so, do we expect achieving balance in all univariate (i.e. marginal) and two-way distributions to accomplish this goal, given that the marginal distributions of any multidimensional random vector do not determine the joint? On the other hand, if two sets of random vectors have the same joint distribution, would we expect hypothesis tests applied to individual (univariate) covariates or their interactions to achieve p-values of .15 or greater?
Does the dimension of the vector (i.e. the number of covariates) play a role here, in that if we had 20 covariates, we would expect a comparison of individual covariates marginally to produce a few p-values below .15? Perhaps more broadly, what theory tells us that the genetic algorithm search is actually attempting to do the right thing, and what is that "right thing"?
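To put rough numbers on the dimension question, here is a minimal sketch of the arithmetic. It assumes that each p-value is uniform on [0, 1] when the groups are perfectly balanced, and (for the second function) that the 20 tests are independent. Both are simplifying assumptions, and the function names are hypothetical.

```python
from math import comb


def expected_small_pvalues(n_tests, threshold):
    """Expected number of p-values below `threshold` when every null is true.

    Assumes each p-value is uniform on [0, 1]; independence is not needed here.
    """
    return n_tests * threshold


def prob_at_least(n_tests, threshold, m):
    """Probability that at least m of n_tests p-values fall below threshold.

    Assumes independent tests with uniform p-values (a binomial model).
    """
    return sum(
        comb(n_tests, j) * threshold**j * (1 - threshold) ** (n_tests - j)
        for j in range(m, n_tests + 1)
    )


print(expected_small_pvalues(20, 0.15))
print(prob_at_least(20, 0.15, 1))
```

Under these assumptions one would expect about three of the 20 p-values to fall below .15, and the chance of seeing at least one is roughly 0.96. Real covariates are usually correlated, so the binomial figure is only a rough guide.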
A propensity score method has answers to some of these questions, though it raises others. On the plus side, the theorems say that observations with the same propensity score have the same joint (not merely marginal) distribution of the covariates. Thus, if the goal is to replicate a randomized experiment's much-valued ability to produce observations with the same joint covariate distribution, conditioning on the true propensity score will do that. That's the theory that tells us what propensity score matching is attempting to do is the right thing. The problem is, of course, that in any case that matters, we don't know the true propensity scores, and estimation of them raises profound questions about model fit and adequacy. One can check disparities in marginal distributions, but for the reasons stated above, such checks are not really enough. A question for advocates of propensity scores is the following: if propensity score matching is designed to reduce dependence on the substantive model that relates outcomes to covariates, does it do so only by inducing dependence on proper specification of the propensity score model?
For those who would eschew hypothesis tests in assessing balance (see yesterday's post), how does one assess balance? True, one can always reduce the power of any test to reject a null by discarding observations (I have heard that K-S in particular has low power), but any comparison of distributions rests on some set of criteria. Looking at t-scores is a hypothesis test (how else would one decide when the set of scores is too big or too small?). Are hypothesis tests the worst method of assessing balance, except for all of the others?
I have only one suggestion on this subject: whatever method one uses to create matched sets of treated and control groups, after all ordinary checking of marginal distributions is complete, throw something completely wild at the results. For both groups, calculate a fifth moment of covariate one, interact it with a third moment of covariate two and a second moment of covariate three. Do a test and see what happens. If the two groups have the same joint distribution of their covariates . . . .
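One way to read this suggestion is sketched below. The assumptions are mine: "moments" are taken to mean powers of each unit's covariate values, the test is an unpaired permutation test on the difference in group means (which ignores any pairing produced by the matching), and the names `wild_feature` and `permutation_pvalue` are hypothetical.

```python
import random


def wild_feature(row):
    """Return x1**5 * x2**3 * x3**2 for one unit's covariates (x1, x2, x3)."""
    x1, x2, x3 = row
    return x1**5 * x2**3 * x3**2


def mean(values):
    return sum(values) / len(values)


def permutation_pvalue(treated, control, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in mean wild_feature."""
    zt = [wild_feature(r) for r in treated]
    zc = [wild_feature(r) for r in control]
    observed = abs(mean(zt) - mean(zc))
    pooled = zt + zc
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[: len(zt)]) - mean(pooled[len(zt):]))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Calling `permutation_pvalue(treated_rows, control_rows)` on the matched groups gives a single p-value for this one odd function of the covariates. A small value would suggest that the joint distributions differ in a way the marginal checks missed.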
Posted by James Greiner at October 28, 2005 3:19 AM
James Greiner writes: Perhaps more broadly, what is the theory that tells me that the genetic algorithm search is attempting to do the right thing (and what is that "right thing")?
This is a good question. Just to be clear, let's get the GA part out of the way. There is very good theory that a GA will search a space at least as well as randomization and, if there is weak connective information between points (this assumption is weaker than continuity), it will outperform randomization, usually by a polynomial factor (i.e., swamp randomization).
Now to the main point. There are some distributions for which testing the margins and x-way interactions is good enough, but for most distributions this is not enough! This is why there will always be an art to this. Also, one can use multivariate KS tests which test the complete joint distribution, but these methods will require a lot of data. But there is a theoretical path here.
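A small constructed example of why margins and two-way checks can fall short: take three binary covariates. In one hypothetical group they are independent fair coin flips; in the other, the third is the exclusive-or of the first two. The sketch below, with made-up names and exact fractions, confirms that every one-way and two-way distribution agrees while the full joint distribution does not.

```python
from fractions import Fraction
from itertools import combinations, product


def projected(dist, idxs):
    """Marginal distribution of the coordinates listed in idxs."""
    out = {}
    for point, prob in dist.items():
        key = tuple(point[i] for i in idxs)
        out[key] = out.get(key, 0) + prob
    return out


# Group A: three independent fair binary covariates.
group_a = {p: Fraction(1, 8) for p in product((0, 1), repeat=3)}
# Group B: x3 is fully determined by x1 XOR x2.
group_b = {
    (x1, x2, x1 ^ x2): Fraction(1, 4) for x1, x2 in product((0, 1), repeat=2)
}

one_way_equal = all(
    projected(group_a, (i,)) == projected(group_b, (i,)) for i in range(3)
)
two_way_equal = all(
    projected(group_a, pair) == projected(group_b, pair)
    for pair in combinations(range(3), 2)
)
joint_equal = group_a == group_b

print(one_way_equal, two_way_equal, joint_equal)
```

Here only a three-way check, such as comparing the mean of x1 * x2 * x3 across groups, would reveal the difference.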
James Greiner writes: (I have heard that K-S in particular has low power)
For some null hypotheses, but for others it is okay. That's why we advocate that a researcher use a LARGE variety of functions of the data because, as we note, different functions will have different power against different departures from balance. For example, if the distributions differ only in their means, then a paired t-test has much greater power. For differences related to skew and kurtosis, the KS test is better than the t-test. For differences in the tails, the Wilcoxon Mann-Whitney test is generally more powerful than the KS test.
My code will use the Mann-Whitney test if that is passed to it instead of the KS. But it doesn't make much difference because the p-value that comes out of genmatch is IRRELEVANT. As the LaLonde example shows (see figures in Genetic Matching), the conventional test levels are nowhere near enough to achieve a reliable inference. So all one needs is a metric to move along. And for that the KS and Mann-Whitney amount to basically the same thing.
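As a rough illustration of the general point that different statistics see different departures from balance, the sketch below builds two hypothetical groups with the same mean but different spread. This is a simpler case than the skew and kurtosis differences mentioned above, and it uses a Welch two-sample t statistic rather than a paired one; both choices are assumptions made for the sketch. The data are deterministic normal quantiles, not real samples.

```python
from bisect import bisect_right
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic for a difference in means."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the ECDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


n = 1000
std_normal = NormalDist()
# Evenly spaced quantiles of N(0, 1): symmetric about zero by construction.
group_a = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
# Same centre, three times the spread.
group_b = [3 * x for x in group_a]

t_stat = welch_t(group_a, group_b)
ks_stat = ks_statistic(group_a, group_b)
# Approximate large-sample 5% critical value for the two-sample KS statistic.
ks_critical = 1.36 * sqrt((n + n) / (n * n))

print(t_stat, ks_stat, ks_critical)
```

The t statistic is essentially zero because the means coincide, while the KS statistic sits far above its approximate critical value. A check based on means alone would call these groups balanced.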
James Greiner writes: I have only one suggestion on this subject: whatever method one uses to create matched sets of treated and control groups, after all ordinary checking of marginal distributions is complete, throw something completely wild at the results. For both groups, calculate a fifth moment of covariate one, interact it with a third moment of covariate two and a second moment of covariate three. Do a test and see what happens. If the two groups have the same joint distribution of their covariates...
I think this is a very good idea and is what I currently recommend. For example, in my Matching Software, after genetic optimization, it is suggested that one use the MatchBalance function and throw in some new function of X to test, just as you suggest.
Finally, I will note that GenMatch is agnostic as to what precise fit criterion is used. One could, for example, work with features of the QQ plots directly or a multivariate histogram.
Posted by: Sekhon at October 28, 2005 12:11 PM
Jim Greiner writes: "Is the objective in matching treated and control units to find sets of observations with the same JOINT distribution of the covariates, which is what one would have in a randomized experiment?" This is *not* what one has in a regular randomized experiment in practice. In fact, after GenMatch-ing (in the DW case, for example), treated and GenMatch-ed control units have more similar distributions of covariates than did the original randomized experimental sample. In practice, Genetic Matching can and often will produce better balance **on observables** than a randomized experiment.
A larger point: This kind of statistical work is, in my view, largely about making your argument coherently, stating your assumptions clearly, and letting the scientific community judge. Assuming we all agree on how to interpret the identifying assumptions a method requires, then the role of the analyst is to muster evidence that these assumptions have been sufficiently satisfied. In OLS, one has to argue that there are no omitted variables. In IV, you've got to make the case for exclusion. And in matching, you have to presume conditional independence. And in practice, in all these cases and in all others I know of (except maybe bounds analysis, sometimes), the analyst who claims that all assumptions hold perfectly is almost always stretching the truth in front of a knowing audience that is supposed to know how to assess the consequences. OLS is unbiased under certain conditions, but nearly all OLS-based estimates are certainly biased. Ditto for IV, and everything else including matching. The question is: is the bias going to matter?
So, Jim asks about joint distributions. If that's what it takes to be convincing, check all three-way interactions and beyond.
And by the way, when we are considering how these sorts of arguments are made in our community, we should ask ourselves--does the analyst's method encourage honest or dishonest analysis? Methods (like matching) that do not require the use of outcomes encourage honesty. Methods like regression that involve running analyses, getting an answer, and repeating until JACKPOT... that's a different story altogether.
Posted by: Alexis Diamond at October 28, 2005 2:05 PM