27 April 2006
Felix Elwert
Why did people code their missing values as real numbers such as 999 in the old days? Why not “.” from the get-go? And why do many big, federally funded surveys insist on numerical missing values to this day?
Don’t we all have stories about how funny missing value codes (“-8”) got people in trouble (think The Bell Curve)? Are there any anecdotes where people got in trouble for mistaking “.” for a legitimate observation?
Posted by Felix Elwert at April 27, 2006 6:00 AM
I think it might have something to do with the dot not existing at the "get go." In the same way that Western Civilization went a long time without figuring out what a useful thing zero would be to have, it took statistical software a while to cotton on to the utility of the dot. Besides, to some of us, the dot is not sufficient anyway, since we are actually interested in different kinds of missingness. Thereupon: the wonderful invention of the multiple dot missing value codes in Stata, SAS, and whatever else now supports them.
Posted by: jeremy at April 27, 2006 2:51 PM
What are you referring to wrt The Bell Curve? Do you mean this or this? (from the wikipedia entry)
Posted by: Brendan at April 27, 2006 3:02 PM
I spent 4 years at ICPSR where this issue came up frequently.
Using the singular dot for missing data is inadequate for archival purposes. There are many reasons for a datum to be missing. 3 examples come to mind:
These missing data codes have quite different analytic implications. Treating all missing data the same (with a dot) pre-emptively hampers analysis.
Good archival practice (at least in survey research) is to record the verbatim response. Let's say that I first ask if the R has ever used drugs. Those who say no will be branched around the questions about specific drugs. Analytically, this respondent answered no to all the specific questions. Mechanically, however, the CAI captured nothing (missing).
Data archivists prefer to leave analytic decisions (e.g., to recode that -8 "Skip" to 0 "No") to the analysts. Using numeric codes to represent these missings is the most transparent way to do this.
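To make the skip-pattern recode concrete, here is a minimal sketch in Python. The -8 "Skip" code comes from the example above; the -9 "Refused" code, the function name, and the data are made up for illustration, and real codes should come from the study's codebook.

```python
# Hypothetical codes for illustration; real ones come from the codebook.
SKIP = -8      # question not asked because of a branch (legitimate skip)
REFUSED = -9   # assumed code for "respondent declined to answer"


def recode_drug_item(value):
    """Map an archived response to an analysis value.

    A legitimate skip becomes 0 ("No"), because the respondent already
    said they never used drugs. A refusal stays unknown (None).
    Substantive answers (0 or 1) pass through unchanged.
    """
    if value == SKIP:
        return 0
    if value == REFUSED:
        return None
    return value


# Made-up archived responses to one "ever used drug X?" item.
archived = [1, 0, -8, -9, -8]
analysis = [recode_drug_item(v) for v in archived]
print(analysis)
```

The archived file keeps both codes distinct, so a different analyst is free to treat the skip differently.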
A more coherent explanation is offered by ICPSR: http://www.icpsr.umich.edu/access/dpm.html.
Also, as Jeremy mentioned... not all stats packages recognize "." (or better .a, .b, .c) as missing. Archivists do their best to support a multitude of packages. Numeric missing codes are the most democratic method of distributing the recoding work across packages.
fwiw.
Posted by: Corey at April 28, 2006 1:25 PM
Posted by: Felix Elwert at April 28, 2006 3:03 PM
I'm not disagreeing with you. There was an exchange in ASR (American Sociological Review) a couple of years ago where someone used GSS data to make a contrarian argument. Someone else then tried to replicate and discovered that the first author forgot to set 9 (the missing code for the DV) as missing.
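A small Python sketch of how that kind of slip distorts a summary statistic. The responses below are invented (this is not GSS data), and the 1-to-5 scale is an assumption; only the use of 9 as the missing code follows the anecdote.

```python
from statistics import mean

MISSING_CODE = 9  # missing code as in the anecdote; check the codebook

# Invented responses on an assumed 1-5 scale, with 9 meaning "missing".
responses = [1, 2, 2, 3, 9, 9, 1, 2]

# Forgetting to declare 9 as missing treats it as a real, very high answer.
naive = mean(responses)

# Dropping the missing code first gives the mean of the valid answers only.
valid = [r for r in responses if r != MISSING_CODE]
cleaned = mean(valid)

print(f"naive mean:   {naive:.3f}")
print(f"cleaned mean: {cleaned:.3f}")
```

With these made-up numbers the naive mean is roughly double the cleaned one, which is the kind of gap that can flip a substantive conclusion.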
Our problem is both conceptual and mechanical. Conceptually, we want to represent all the logical codes. Mechanically, we live in a pluralistic world of statistics packages. SPSS does not recognize the .a, .b, .c conventions. Rather, SPSS allows you to set numeric values as missing. [Try reading a raw datafile with these alpha-dot missing conventions into SPSS. You will wind up with a file full of string variables. Not a problem for small files, but a royal pain in the neck for big ones.] So no matter what the archivist does, the data will require some massaging in some stats packages.
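For a sense of the massaging involved, here is a rough Python sketch of reading a raw file that uses alpha-dot codes while keeping the numeric column numeric. The file contents, the column name `income`, and the helper `parse_cell` are all hypothetical; this is one possible approach, not how any particular package does it.

```python
import csv
import io

# Made-up raw file using "." and alpha-dot codes for missing values.
RAW = "id,income\n1,52000\n2,.a\n3,.\n4,.b\n5,48000\n"


def parse_cell(text):
    """Return (value, missing_code); exactly one of the two is None."""
    text = text.strip()
    is_dot = text == "."
    is_alpha_dot = len(text) == 2 and text[0] == "." and text[1].isalpha()
    if is_dot or is_alpha_dot:
        return None, text       # keep the reason for missingness
    return float(text), None    # substantive numeric answer


rows = list(csv.DictReader(io.StringIO(RAW)))
parsed = [parse_cell(row["income"]) for row in rows]

values = [value for value, _ in parsed]  # numeric column, None where missing
codes = [code for _, code in parsed]     # parallel column of missing codes

print(values)
print(codes)
```

Splitting each cell into a value and a code keeps the different kinds of missingness without turning the whole variable into strings.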
Now if you are a data producer (like the Federal Govt, or ICPSR) are you going to put files out that cause headaches for the bulk of your user-base? [Whether we like SPSS or not, it is the most widely used Stats package out there right now.] The path of least resistance is to use numeric codes for everything and put it back on the user to recode out the missings.
I suppose the moral of the story is to always consult the codebook before interpreting a statistic.
**Note I no longer work for ICPSR and nothing that I write should be interpreted as the consortium's position.
Posted by: Corey at April 30, 2006 6:52 PM