23 September 2008

A Handy Trick for Multiple Imputation of Categorical Data

As an applied researcher, I've often come across missing data problems where my data are categorical. This can raise issues because most standard multiple imputation packages assume the multivariate normal (MVN) distribution, which may not hold for certain types of categorical and binary data.

The standard shortcut is to impute under the MVN assumption anyway and then round the imputed values to the nearest category. But as Recai Yucel, Yulei He, and Alan Zaslavsky point out in their May 2008 article in The American Statistician, naive rounding can bias estimates, particularly when the underlying data are asymmetric or multimodal.

So what should the applied researcher do when multiply imputing categorical data? The authors propose a calibration method. First, duplicate the original data and, in the duplicate, set the observed values of the variable of interest to missing. Next, stack the original and duplicated data and carry out the imputation on the stacked dataset. Finally, compare the fraction of 1's in the original observed data (Y_obs) with the imputed values for those same observations in the duplicated data (Y_obs(dup)): the appropriate cutoff (c) is the one at which rounding Y_obs(dup) reproduces the observed fraction of 1's. That cutoff is then used to assign 0's and 1's to the imputed values for the observations that were actually missing.
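To make the stacking and cutoff steps concrete, here is a minimal Python sketch for a single binary variable y with one fully observed covariate x. It is an illustration under assumptions, not the authors' implementation: `impute_normal` is a hypothetical stand-in for a real MVN imputation routine (it draws from a simple linear regression and ignores parameter uncertainty), and `calibrated_round` and all variable names are made up for this example.

```python
import random
import statistics


def impute_normal(y, x, rng):
    """Hypothetical stand-in for a normal-model imputation: regress observed
    y on x, then draw each missing y from the fitted line plus Gaussian noise.
    A real analysis would use a proper MVN imputation package instead."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    slope = sum((xi - mx) * (yi - my) for xi, yi in pairs) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in pairs]
    sigma = (sum(r * r for r in resid) / (len(resid) - 2)) ** 0.5
    return [
        yi if yi is not None else intercept + slope * xi + rng.gauss(0.0, sigma)
        for xi, yi in zip(x, y)
    ]


def calibrated_round(y, x, rng):
    """Round imputed values of a 0/1 variable using a calibrated cutoff c.
    Missing entries of y are marked with None."""
    n = len(y)
    observed = [yi is not None for yi in y]

    # Stack the original data on a duplicate whose observed y values are set
    # to missing. The originally missing values stay missing, so the
    # duplicate's y column ends up entirely missing.
    y_stacked = list(y) + [None] * n
    x_stacked = list(x) + list(x)
    filled = impute_normal(y_stacked, x_stacked, rng)

    # Y_obs(dup): continuous imputations for the duplicate rows whose y was
    # actually observed in the original data.
    dup_obs = sorted(v for v, o in zip(filled[n:], observed) if o)

    # Fraction of 1's in Y_obs.
    p_obs = statistics.fmean([yi for yi in y if yi is not None])

    # Pick c so that the share of Y_obs(dup) above c is about p_obs.
    # The clamp is a simplification for the edge cases p_obs = 0 or 1.
    k = min(max(round((1.0 - p_obs) * len(dup_obs)), 1), len(dup_obs))
    c = dup_obs[k - 1]

    # Apply the cutoff only to the values that were really missing.
    rounded = [
        yi if yi is not None else (1 if v > c else 0)
        for yi, v in zip(y, filled[:n])
    ]
    return rounded, c
```

Calling `calibrated_round(y, x, random.Random(0))` returns one completed copy of y along with the cutoff it used. For multiple imputation, one would repeat the whole procedure once per imputed dataset, so each dataset gets its own cutoff.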

This is a neat technique, and it is very easy to implement in practice. Check out the entire paper for more details on the method.


Posted by John Graves at September 23, 2008 9:01 PM