
9 December 2005

What Did (and Do We Still) Learn from the LaLonde Dataset (Part II)?

Jens Hainmueller

I ended yesterday's post about the famous LaLonde dataset with the following two questions: (1) What have we learned from the LaLonde debate? (2) Does it make sense to beat this dataset any further, or have we essentially exhausted the information that can be extracted from these data and need to move on to new datasets?

On the first point, VERY bluntly summarized, the comic strip history goes somewhat like this. First, LaLonde showed that regression and IV do not get it right. Next, Heckman's research group released a string of papers in the late 80s and 90s trying to defend conventional regression and selection-based methods. Enter Dehejia and Wahba (1999). They showed that, apparently, propensity score methods (sub-classification and matching) get it right if one controls for more than one year of pre-intervention earnings. Smith and Todd (2002, 2004) are next in line, claiming that propensity score methods do not get it right: once one slightly tweaks the propensity score specification, the results are again all over the place. The ensuing debate spawned more than five papers as Rajeev Dehejia replied to the Smith and Todd findings (all papers of this debate can be found here). Then, last but not least, Diamond and Sekhon (2005) argue that matching does get it right if it’s done properly, namely if one achieves a really high standard of balance (we’ve already had quite a controversy about balance on this very blog; see for example here).

So what does this leave applied researchers with? What do we take away from the LaLonde debate? Does anyone still think that regression (or maximum likelihood methods more generally) and/or 2-stage least squares IV produce reliable causal inferences in real world observational studies? In all seriousness, where is the validation? This is the million-dollar question, because MLE and IV methods represent the great majority of what is taught and published across the social sciences. Also, can we trust propensity score methods? How about other matching methods? Or is there little hope for causal inference from observational data in any case (in which case I fear we are all out of a job, and the philosophers get the last laugh)? This is not necessarily my personal opinion, but I would be interested to hear people’s opinions. [The evidence is of course not limited to LaLonde; there is ample evidence from other studies with similar findings. For example, see Friedlander and Robins (1995), Fraker and Maynard (1987), Agodini and Dynarski (2004), Wilde and Hollister (2002), and various Rubin papers, to name just a few.]

On the second point, let me play the devil’s advocate again and ask: What can we still learn from the LaLonde data? After all, it’s just one single dataset, the standard errors even for the experimental dataset are large, and once we match in the observational data, why would we even expect to get it right? There is obviously a strong case to be made for selection on unobservables in the case of the job training experiment. So even if we manage to adjust for observed differences, why in the world should we get the estimate right? [Again, this is not my personal opinion, but I have heard a similar contention both at a recent conference and in Stat 214.] Maybe instead of a job training experiment, we should first use experimental and observational data on something like plants or frogs, where hidden bias may (!) be less of a problem (assuming this is actually the case)? Finally, what alternatives do we have? How would we know what the right answer was if we were not working with a LaLonde-esque framework? Again, I would be interested in everybody’s opinion on this point.
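To make the selection-on-unobservables worry concrete, here is a minimal simulation sketch. It is not based on the NSW data, and everything in it is an assumption made for illustration: the data-generating process, the function names `simulate` and `stratified_att`, and all coefficient values. Units have one observed covariate `x` and one unobserved binary confounder `u`. Both affect treatment assignment, and `hidden_strength` controls how strongly `u` also affects the outcome. The estimator stratifies exactly on `x`, so adjustment for observables is as good as it can be.

```python
import random
from statistics import mean


def simulate(n, tau, hidden_strength, seed=0):
    """Simulate (x, t, y) records; the confounder u is used but never recorded."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.randrange(4)            # observed covariate with 4 levels
        u = int(rng.random() < 0.5)     # unobserved binary confounder
        # Treatment probability rises with both x and u (hypothetical values).
        p_treat = 0.2 + 0.1 * x + 0.3 * u
        t = int(rng.random() < p_treat)
        # Outcome: true effect tau, plus x, plus u scaled by hidden_strength.
        y = tau * t + 2.0 * x + hidden_strength * u + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))
    return rows


def stratified_att(rows):
    """Exact stratification on x: treated-weighted mean of within-stratum differences."""
    strata = {}
    for x, t, y in rows:
        strata.setdefault(x, {0: [], 1: []})[t].append(y)
    weighted_sum = 0.0
    n_treated = 0
    for groups in strata.values():
        treated, control = groups[1], groups[0]
        if not treated or not control:
            continue  # skip strata without overlap
        weighted_sum += len(treated) * (mean(treated) - mean(control))
        n_treated += len(treated)
    return weighted_sum / n_treated


if __name__ == "__main__":
    for strength in (0.0, 3.0):
        data = simulate(20000, tau=1.0, hidden_strength=strength, seed=1)
        print(f"hidden_strength={strength}: ATT estimate = {stratified_att(data):.2f}")
```

With `hidden_strength=0.0` the stratified estimate should land close to the true effect of 1.0. With a positive value it should drift upward even though balance on `x` is perfect within strata, which is the scenario the devil’s advocate has in mind.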

Posted by James Greiner at December 9, 2005 6:14 AM

Comments

having worked with nsw data, i agree we need alternatives that permit direct comparison of experimental/scientific versus statistical/observational solutions to the problem of inference. that said, tests based on "messy" data sets such as supported work seem especially appropriate for social scientists modeling messy processes. we just need more tests under different conditions. fortunately, i found 121 disparate data holdings at icpsr under keyword "experimental." we don't have to keep hammering on supported work.

Posted by: chris uggen at December 9, 2005 2:11 PM

Hi Chris Uggen,

thanks for your comment.

I agree that tests based on "messy" data sets such as supported work seem especially appropriate for social scientists modeling messy processes. Yet I think the question remains: if we do not get it right in messy data, what do we conclude? Is it that our estimators are really ill-equipped to correct the bias we could potentially adjust for given the covariates we have in the data? Or is it that our estimators actually do quite well, but hidden bias is too large, in which case we would still get it wrong even if we perfectly adjusted for observables, because there remains imbalance on unobserved stuff that is strongly correlated with both treatment assignment and the outcomes of interest? In principle, bias adjustment properties and the issue of hidden bias are two distinct problems, but in the LaLonde example they are enmeshed. Of course, one could argue that in a particular application, if estimator A gets it right despite hidden bias while estimator B doesn’t, then A is clearly preferred over B…

Best,
Jens

Posted by: Jens at December 9, 2005 4:04 PM

I think the LaLonde data are a bad data set to be using as a test data set for all sorts of procedures for three reasons.

First, the relevant sample size is really small. There are not very many participants at all and, as Dehejia and Wahba show clearly, the number of non-participants who look anything like the participants is quite small (which is why matching without replacement is a disaster in these data).

Second, the evidence in Smith and Todd (2005), plus even an epsilon of thinking about the nature of the selection in this context, make it clear that the unconfoundedness assumption (called the conditional independence assumption in economics) does not hold in these data. Remember that the participants here are ex-cons, ex-addicts and dropouts. No variables are available in the comparison group data sets to identify non-participants who are ex-cons or ex-addicts. We have strong priors that ex-cons and ex-addicts are different from other folks conditional on age, race, schooling and mismeasured annual earnings, which are the only conditioning variables on offer. Much as I respect Bob LaLonde, it was silly, in retrospect, to think that this would ever work.

Third, the data do not contain any good instruments to use to examine the efficacy of either the bivariate normal selection model or standard IV methods. The exclusion restrictions employed in the LaLonde paper would not pass the laugh test in a seminar these days. In addition, as we note in Smith and Todd (2005), his bivariate normal results are wrong because he does not take account of choice-based sampling (to which the estimator is not robust) and because one of his exclusion restrictions is actually a perfect one-way predictor of treatment status.

Finally, it is important to frame the question correctly. The question is not "does matching work" or "does IV work". Both methods work when their identifying assumptions hold in the data and do not work otherwise. The question we should be asking is: in what contexts, defined by data and institutions, does each method work well? I would argue that the NSW data commonly used for these studies do not satisfy the identifying assumptions for any of the commonly used estimators, which is why I think it is time to stop using them as a canonical data set.

Jeff Smith

Posted by: Jeffrey Smith at January 2, 2006 11:28 AM

to second Jeff's comments, i too think the NSW is a very limited dataset. i also think we should try to apply LaLonde's (86) idea to other, better datasets. in the Diamond/Sekhon (2005) paper, we show Monte Carlo results that represent a significant step away from NSW and into something new.

the NSW doesn't provide a very rich set of covariates at all. i think that's why this dataset requires an extremely high degree of balance (as measured by the leximin p-value) before the simple ATT difference-of-matched-means estimate begins to converge on the experimental result. Once you balance super-duper well on the NSW's Xs and all their quads and 2-way interactions, you're probably balancing on a bunch of other stuff you haven't measured. at least that's my conjecture for what's going on in this case.

but yes, let's move on to other datasets and new Monte Carlos, please! :) and let's use these new analyses to validate current notions of balance tests and balance thresholds and expectations for reliability in observational studies.

alexis

Posted by: Alexis Diamond at January 7, 2006 12:45 AM
