Authors' Committee

Chair:

Andy Eggers (Gov)

Members:

Weihua An (Soc)
Kevin Bartz (Stats)
Sebastian Bauhoff (HealthPol)
John Graves (HealthPol)
Justin Grimmer (Gov)
Jens Hainmueller (Gov)
Mike Kellermann (Gov)
Ellie Powell (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Kevin Quinn, Jamie Robins, Don Rubin, Chris Winship


27 April 2006

999

Felix Elwert

Why did people code their missing values as real numbers such as 999 in the old days? Why not "." from the get go? And why do many big, federally funded surveys insist on numerical missing values to this day?

Don’t we all have stories about how funny missing value codes ("-8") got people in trouble (think The Bell Curve)? Are there any anecdotes where people got in trouble for mistaking "." for a legitimate observation?

Posted by Felix Elwert at April 27, 2006 6:00 AM

Comments

I think it might have something to do with the dot not existing at the "get go." In the same way that Western Civilization went a long time without figuring out what a useful thing zero would be to have, it took statistical software a while to cotton on to the utility of the dot. Besides, to some of us, the dot is not sufficient anyway, since we are actually interested in different kinds of missingness. Thereupon: the wonderful invention of the multiple dot missing value codes in Stata, SAS, and whatever else now supports them.

Posted by: jeremy at April 27, 2006 2:51 PM

What are you referring to wrt The Bell Curve? Do you mean this or this? (from the wikipedia entry)

Posted by: Brendan at April 27, 2006 3:02 PM

I spent 4 years at ICPSR where this issue came up frequently.

Using the singular dot for missing data is inadequate for archival purposes. There are many reasons for a datum to be missing. 3 examples come to mind:


  1. No response
  2. Skip Pattern
  3. Answer out of range

These missing data codes have quite different analytic implications. Treating all missing data the same (with a dot) pre-emptively hampers analysis.

Good archival practice (at least in survey research) is to record the verbatim response. Let's say that I first ask if the R has ever used drugs. Those who say no will be branched around the questions about specific drugs. Analytically, this respondent answered no to all the specific questions. Mechanically, however, the CAI captured nothing (missing).

Data archivists prefer to leave analytic decisions (e.g., to recode that -8 "Skip" to 0 "No") to the analysts. Using numeric codes to represent these missings is the most transparent way to do this.
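To make the skip-pattern recode concrete, here is a minimal Python sketch. The code values (-9, -8, -7), their labels, and the function name `recode_drug_item` are assumptions for illustration only, not codes from any actual ICPSR study. It recodes a skipped item to 0 ("No") only when the gate question was answered "no", and leaves every other missing code as missing.

```python
# Hypothetical codebook: these codes and labels are assumptions for illustration.
MISSING_CODES = {-9: "no response", -8: "skip pattern", -7: "out of range"}


def recode_drug_item(raw, gate_answer):
    """Recode one specific-drug item (1 = yes, 0 = no) for analysis.

    gate_answer is the response to the "ever used drugs" question.
    A skip that follows a "no" on the gate question is analytically a "no";
    every other missing code stays missing (None).
    """
    if raw == -8 and gate_answer == 0:
        return 0
    if raw in MISSING_CODES:
        return None
    return raw


raw_items = [1, -8, -9, 0, -8]   # archived values for the specific-drug item
gate = [1, 0, 1, 1, -9]          # archived values for the gate question
recoded = [recode_drug_item(r, g) for r, g in zip(raw_items, gate)]
print(recoded)
```

The last respondent shows why the decision belongs to the analyst: the item was skipped, but the gate question itself is missing, so the sketch does not assume a "no".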

A more coherent explanation is offered by ICPSR: http://www.icpsr.umich.edu/access/dpm.html.

Also, as Jeremy mentioned... not all stats packages recognize "." (or better .a, .b, .c) as missing. Archivists do their best to support a multitude of packages. Numeric missing codes are the most democratic method of distributing the recoding work across packages.

fwiw.

Posted by: Corey at April 28, 2006 1:25 PM

You are right, of course, that there are different types of missingness. (Wouldn't it be nice if more empirical work actually accounted for these different types of missingness?) My point is that giving numerical codes that look just like legitimate observations to the _computer_ increases the risk that analysts will treat these missing values as legitimate observations. Using the dot ".", double dot "..", or letter codes would automatically flag these values in all standard statistical packages and thus reduce the risk of mistakes, that's all.

As for the Bell Curve example, I was referring to a criticism different from the ones in the links Brendan provided. Murray and Herrnstein argue that IQ is genetically determined and largely fixed at birth. They support this assumption with an analysis of the effect of education on IQ. The analysis, however, contains five observations with recorded education = -5. Clearly, these are numerical missing value codes that made their way into the analysis. With a mean education of 11.6 years in the sample, even a mere five observations with educ=-5 exert strong leverage on the regression surface. Winship and Korenman show that correcting this mistake increases the estimated effect of an additional year of schooling on IQ from 1.1 to 1.6 points. Murray discovered this problem independently, and fixed the results table in later printings of the book (though without adjusting the conclusions). (Winship and Korenman's article makes several additional corrections, which ultimately increase the estimate to 2.7 IQ points per year of education - which is arguably substantially different from 0.)
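The leverage point can be illustrated with a toy calculation. The sketch below uses entirely made-up data (it is not the NLSY sample and does not reproduce Winship and Korenman's numbers): 90 synthetic cases in which each year of education adds exactly 2 IQ points, plus five cases where a missing-value code of -5 was left in as if it were real education. The helper `ols_slope` is a plain least-squares slope written out by hand.

```python
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Synthetic, noise-free data (an assumption for illustration):
# education runs from 8 to 16 years, and IQ rises 2 points per year.
educ_clean = [8 + (i % 9) for i in range(90)]
iq_clean = [100 + 2 * (e - 12) for e in educ_clean]

# Five respondents whose education is really missing but was recorded as -5.
# Their IQ scores are ordinary (100), since only education is miscoded.
educ_dirty = educ_clean + [-5] * 5
iq_dirty = iq_clean + [100] * 5

slope_clean = ols_slope(educ_clean, iq_clean)
slope_dirty = ols_slope(educ_dirty, iq_dirty)
print("slope without the -5 codes:", slope_clean)
print("slope with the -5 codes:   ", slope_dirty)
```

Because the five miscoded cases sit far to the left of every real observation, they pull the fitted line toward flat, so the estimated effect of education shrinks even though they are only five cases out of 95.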

Posted by: Felix Elwert at April 28, 2006 3:03 PM

I'm not disagreeing with you. There was an exchange in ASR (American Sociological Review) a couple of years ago where someone used GSS data to make a contrarian argument. Someone else then tried to replicate and discovered that the first author forgot to set 9 (the missing code for the DV) as missing.

Our problem is both conceptual and mechanical. Conceptually, we want to represent all the logical codes. Mechanically, we live in a pluralistic world of statistics packages. SPSS does not recognize the .a, .b, .c conventions. Rather, SPSS allows you to set numeric values as missing. [Try reading a raw datafile with these alpha-dot missing conventions into SPSS. You will wind up with a file full of string variables. Not a problem for small files, but a royal pain in the neck for big ones.] So no matter what the archivist does, the data will require some massaging in some stats packages.
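The same mechanical problem shows up in any tool that expects purely numeric input. As a rough Python sketch (the function name `parse_cell` and the sample row are hypothetical), a raw cell such as ".a" cannot be converted to a number directly, so a reader has to split each cell into a numeric value and a separate missing-value code:

```python
def parse_cell(text):
    """Split a raw cell into (value, missing_code).

    "." and alpha-dot codes such as ".a" come back as (None, code);
    anything else is parsed as a number and comes back as (value, None).
    """
    text = text.strip()
    is_alpha_dot = len(text) == 2 and text[0] == "." and text[1].isalpha()
    if text == "." or is_alpha_dot:
        return None, text
    return float(text), None


raw_row = ["12", ".a", "16", ".", ".5", ".b"]   # made-up raw values
parsed = [parse_cell(cell) for cell in raw_row]
print(parsed)
```

Keeping the code alongside the value preserves the reason for missingness, which is the information a single dot would throw away.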

Now if you are a data producer (like the Federal Govt, or ICPSR) are you going to put files out that cause headaches for the bulk of your user-base? [Whether we like SPSS or not, it is the most widely used Stats package out there right now.] The path of least resistance is to use numeric codes for everything and put it back on the user to recode out the missings.

I suppose the moral of the story is to always consult the codebook before interpreting a statistic.
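One way to make "consult the codebook" routine is to check each variable against it before running anything. The sketch below is only an illustration: the `CODEBOOK` structure, the valid range for `educ`, and the function `audit` are assumptions, not the format of any real codebook.

```python
# Hypothetical codebook entry: valid range and declared missing codes.
CODEBOOK = {
    "educ": {"valid": range(0, 21), "missing": {-5}},
}


def audit(name, values, codebook=CODEBOOK):
    """Count valid values, declared missing codes, and undocumented values."""
    spec = codebook[name]
    counts = {"valid": 0, "missing": 0, "undocumented": 0}
    for v in values:
        if v in spec["missing"]:
            counts["missing"] += 1
        elif v in spec["valid"]:
            counts["valid"] += 1
        else:
            counts["undocumented"] += 1
    return counts


result = audit("educ", [12, 16, -5, 99, 8])
print(result)
```

A nonzero "missing" count is a reminder to exclude or recode those cases first, and a nonzero "undocumented" count means the data and the codebook disagree.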

**Note I no longer work for ICPSR and nothing that I write should be interpreted as the consortium's position.

Posted by: Corey at April 30, 2006 6:52 PM
