8 May 2006
Jim Greiner
I’m the “teaching fellow” (the “teaching assistant” everywhere but Harvard, which has to have its lovely little quirks: “Spring” semester beginning in February, anyone?) for a course in missing data this semester, and in a recent lecture, an interesting concept came up: coarsened at random.
Suppose you have a dataset in which you know or suspect that some of your data values are rounded. For example, ages of youngsters might be given to the nearest year or half-year. Or perhaps in a survey, you’ve gotten some respondents’ incomes only within certain ranges. Then the data have been “coarsened” in the sense that you know that the true value is within a certain range, but you don’t know where within that range.
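To make "coarsened" concrete, here is a minimal Python sketch that turns a rounded report into the range of true values consistent with it. The function name `coarsening_interval` is made up for this illustration, and it assumes round-to-nearest with ties rounding up; neither comes from the Heitjan & Rubin paper.

```python
def coarsening_interval(reported, step):
    """Return (low, high): true values in [low, high) would be reported
    as `reported` when rounding to the nearest multiple of `step`.

    Assumes round-to-nearest with ties going up (an illustrative choice).
    """
    half = step / 2
    return (reported - half, reported + half)


# An age of 7 reported to the nearest year vs. 7.5 to the nearest half-year.
nearest_year = coarsening_interval(7, 1)
nearest_half_year = coarsening_interval(7.5, 0.5)
```

The reported number alone does not tell you the width of the range; you also need to know (or model) the precision the respondent was using.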
Happily, techniques have been developed to handle this sort of situation. In many ways, the game is the same as that in the missing data setting. Just as in the missing data context good things happen when the data are missing at random, so also in this context good things happen when the data are coarsened at random. Thus, to begin with, you have to consider (among other things) whether you think the probability that you will observe only a range of possible data values, as opposed to the specific true value, depends on something you don’t observe (such as that specific true value). A good place to start on all this is Heitjan & Rubin, “Inference from Coarse Data via Multiple Imputation with Application to Age Heaping,” 85 JASA 410 (1990).
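A toy simulation shows why the coarsening mechanism matters. This is only a sketch under made-up assumptions, not the multiple-imputation method from the paper: incomes are drawn uniformly between 20,000 and 120,000, each respondent either reports exactly or gives only a range, and `simulate` and `p_range_only` are hypothetical names. In one scenario the chance of giving only a range is the same for everyone (a simple special case of coarsened at random); in the other it rises with the unobserved true income, so the data are not coarsened at random.

```python
import random


def simulate(depends_on_true_value, n=20000, seed=0):
    """Return (mean of all true incomes, mean among exact reporters)."""
    rng = random.Random(seed)
    true_incomes = [rng.uniform(20_000, 120_000) for _ in range(n)]
    exact_reports = []
    for income in true_incomes:
        if depends_on_true_value:
            # Hypothetical mechanism: the higher the true income, the more
            # likely the respondent gives only a range.
            p_range_only = (income - 20_000) / 100_000
        else:
            # Same chance for everyone, unrelated to the true income.
            p_range_only = 0.5
        if rng.random() >= p_range_only:
            exact_reports.append(income)
    true_mean = sum(true_incomes) / n
    exact_mean = sum(exact_reports) / len(exact_reports)
    return true_mean, exact_mean


at_random = simulate(depends_on_true_value=False)
not_at_random = simulate(depends_on_true_value=True)
```

In the first scenario the people who report exactly look like everyone else, so their mean stays close to the overall mean. In the second, the exact reporters are disproportionately low earners and their mean falls well below it, which is the kind of situation where you have to model the coarsening itself.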
One final point: you might think that coarsened at random is a specific case of missing at random. Actually, it’s the other way around. Data can be (and often is assumed to be) coarsened at random but not missing at random. Think and you’ll see why.
Posted by James Greiner at May 8, 2006 6:00 AM
Unless I am missing the point -- and sadly that occurs too often -- rounding is the same thing as adding a uniformly distributed "mismeasurement" term to the variable. Further, because with rounding the exact distribution of the mismeasurement term is known, one can correct for the bias in the estimates. For example, in an OLS regression the bias is a reduction in the absolute value of the coefficients (other than the constant, whose variable values are presumably not mismeasured).
I quit reading JASA over 20 years ago so I am not going to try to get the referenced article and see what is being offered. I find it hard to believe that (again for instance) in regression treating the rounding as a missing data problem will improve on treating it as the actual variable plus the mismeasurement term whose distribution is known. But if that is the case, I would be interested in how and why -- maybe enough to travel to an academic library :-).
Posted by: Martin Ringo at May 12, 2006 5:43 PM
If you encounter this sort of problem regularly, it might be worth the trip to the library. I won't try to go through the paper on the blog, but sometimes you need heavier artillery than a mismeasurement term. Suppose, for example, that different people have different coarsening (rounding) patterns. Some people, for example, might round to the nearest half-year, while others might round to the nearest year. If you think this might be going on, you'll need to model the kind of person who's giving you the information before applying a rounding adjustment.
Posted by: Jim at May 12, 2006 6:35 PM