23 October 2009

Sources of Randomness

During a recent conversation with some colleagues regarding data sources, an interesting point was made that left me pondering. One member of our group stated that he would not trust a particular source of data to provide useful estimates of population means, but he would trust it to estimate regression coefficients. This puzzled me, because a regression coefficient is a (perhaps slightly fancy) version of a mean. Why, then, would a data source that cannot be trusted for a simple average be useful for a coefficient?

I think the answer lies in the assumed source of randomness. When we make inferences from our sample data to a wider universe of cases, there are two sources of randomness involved: probabilities introduced through the sampling design and probabilities introduced through an assumed stochastic model underlying our observed data. In the first case, we are interested in the existing finite population and our outcome of interest Y is regarded as fixed; randomness is introduced through the sample inclusion probabilities. In the second case, we are interested in a broader "superpopulation" which we posit is generated through some random process, and thus our outcome Y is regarded as a random variable. In much of social science, researchers are interested in this second source of randomness. Hypotheses center around parameters associated with the probability distribution for Y - such as regression coefficients.

Identifying the sources of randomness underlying our data is important, because they have implications for our analysis. Särndal, Swensson, and Wretman show that the variance of a parameter from an ordinary regression model estimated using sample data can be decomposed into two elements, one based on the sampling design and one based on the model. In the case of a census, the extra variance introduced by the design is zero, and thus the total variance of the estimated parameter is the variance of the "BLUE" estimator. Otherwise, accounting for the sampling design in the analysis should improve inference.
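One way to write this decomposition is as an instance of the law of total variance. The notation below is an assumption made for illustration and is not necessarily the notation Särndal, Swensson, and Wretman use: \xi denotes the superpopulation model, p the sampling design, U the realized finite population, \hat{\beta} the coefficient estimated from the sample, and B the coefficient a census of U would give.

```latex
\operatorname{Var}_{\xi p}\bigl(\hat{\beta}\bigr)
  = \underbrace{\mathbb{E}_{\xi}\!\left[\operatorname{Var}_{p}\bigl(\hat{\beta} \mid U\bigr)\right]}_{\text{design component}}
  + \underbrace{\operatorname{Var}_{\xi}\!\left[\mathbb{E}_{p}\bigl(\hat{\beta} \mid U\bigr)\right]}_{\text{model component}}

% Census: every unit is observed, so \hat{\beta} = B with certainty given U.
\operatorname{Var}_{p}\bigl(\hat{\beta} \mid U\bigr) = 0
\quad\Longrightarrow\quad
\operatorname{Var}_{\xi p}\bigl(\hat{\beta}\bigr) = \operatorname{Var}_{\xi}(B)
```

In the census case only the model component remains, which matches the statement that the total variance reduces to that of the model-based estimator.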

Posted by Deirdre Bloome at October 23, 2009 5:20 PM

Comments

I agree. Very well written. You know what you are talking about. I hope you plan to write more on this topic.

Posted by: direct student loan consolidation at October 24, 2009 5:01 PM

Just think of it visually: imagine an x-y scatter plot for which a linear regression line fits the data really well. Compute the linear regression line and record the slope. Compute the mean of x and mean of y too. Now, take out an eraser and erase all dots above some value of x. Compute a linear regression again: you'll get the same slope. But the mean of x and mean of y in the sample are guaranteed to be different. What's happening here? The linear relationship is pretty strong and selection is based on x alone. Things go awry (that is, the selected sample does not recover the population slope) when selection is based on y or on x and y or when the linear model is not right. This is true whether or not you are basing your inference on finite populations or superpopulations (in either case, the population slope is the population slope is the population slope...).

Posted by: Cyrus at October 26, 2009 8:59 AM
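Cyrus's eraser thought experiment can be checked with a small simulation. The sketch below is only an illustration under assumed values: the population size, the true slope of 2, the noise level, and the cutoffs 0.5 and 1.5 are all hypothetical choices, not anything from the post or the comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: y is linear in x (slope 2, intercept 1) plus noise.
n = 100_000
x = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)


def slope(x_vals, y_vals):
    """Return the OLS slope from regressing y_vals on x_vals."""
    return np.polyfit(x_vals, y_vals, 1)[0]


full_slope = slope(x, y)

# "Erase" all points above a cutoff in x (selection on x alone).
keep_x = x < 0.5
slope_sel_x = slope(x[keep_x], y[keep_x])

# Erase all points above a cutoff in y (selection on the outcome).
keep_y = y < 1.5
slope_sel_y = slope(x[keep_y], y[keep_y])

print(f"full data:      slope={full_slope:.3f}, "
      f"mean x={x.mean():.3f}, mean y={y.mean():.3f}")
print(f"selected on x:  slope={slope_sel_x:.3f}, "
      f"mean x={x[keep_x].mean():.3f}, mean y={y[keep_x].mean():.3f}")
print(f"selected on y:  slope={slope_sel_y:.3f}")
```

With selection on x alone, the slope should stay close to the full-data slope (up to sampling noise) while both sample means shift. With selection on y, the slope should be visibly attenuated.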

1. What Cyrus said.

2. Similar issues arise in psychological experimentation all the time. The sample isn't representative of anything (a convenience sample of psychology class "volunteers") and is incapable of estimating the mean of the population. But by manipulating the values of X (x1 = low, x2 = medium, x3 = high) and relating these to the dependent variable Y, the researcher hopes to show whether and how much X is related to Y.

Posted by: ZBicyclist at November 2, 2009 2:52 PM

This topic is very good and is very interesting, I hope that you write more about it, thanks.

Posted by: Decoração at November 2, 2009 5:06 PM

I agree this topic is really interesting to me too!

Posted by: Social Network at November 5, 2009 4:12 AM

I am not sure I agree. I am still pondering the analysis.

Posted by: WorkFromHomePro at November 8, 2009 2:23 PM
