20 March 2008
We're lucky to have two contested Presidential primaries. One of my favorite habits is to look at cross-tabs of candidate preferences by party and county. Here's an example of an Iowa cross-tab, showing the number of Iowa counties by Republican winner and Democratic winner:
| Iowa (Republican winner) | Obama | Clinton | Edwards |
|---|---|---|---|
| McCain | 0 | 0 | 0 |
| Romney | 15 | 7 | 2 |
| Huckabee | 27 | 21 | 27 |
We can visualize cross-tabs using mosaic plots as in "Visualizing Categorical Data." I did it for nine primary states in the image below. The green represents Obama counties, the orange Hillary counties and the purple Edwards counties. Across the columns are the Republican candidates: McCain, Romney, Huckabee. Across the rows, Obama, Hillary and Edwards. Check it out here. If you instead prefer an inverted version, with Republicans across the rows and Democrats across the columns (this makes it easier to compare the Democrats), check it out here.
The conclusions are the same over most states: Huckabee and Edwards are clearly the most complementary candidates. They shared counties whenever Edwards was in play (Iowa, Florida); after that, Huckabee shared Clinton counties. In Missouri every single county he won was a Clinton county! Huckabee and Clinton are somewhat complementary. Neither McCain nor Romney is particularly complementary with any Democrat (see California, where McCain and Romney split the Hillary-Obama counties), though both did better in Obama counties when Huckabee was in play.
One distracting feature of the plots above is that counties aren't uniformly populous. Obama won Missouri by winning only six counties. An alternative interpretation is to view this as an ecological inference problem, in which we are trying to determine the population totals in each of the cross-tab cells. This isn't perfectly accurate, since Edwards voters don't actually also vote for Huckabee. But it does provide a nice framework for scaling the mosaic plot by population size, and making it look generally less degenerate. I did that using Ryan Moore's eiPack and got this.
Posted by Kevin Bartz at 5:53 PM
Yesterday I went to Professor Stanley Lieberson’s class, Issues in the Interpretation of Empirical Evidence. We discussed a paper by Stan and Glenn Fuguitt titled Correlation of Ratios or Difference Scores Having Common Terms. The basic argument is that ratios and difference scores are often used as dependent variables in regression analysis, but when an independent variable shares a common term with the dependent variable, the estimated coefficients can be severely biased by the spurious correlation that the common term induces (whether it sits in the denominator or the numerator). For example, if the dependent variable has the form X/Z and the independent variable is something like Y/Z, Z, or Z/X, the estimated coefficient can come out statistically significant even when X, Y, and Z are unrelated to one another.
For some concrete examples: criminologists often use the crime rate (crimes divided by city population) as the dependent variable while also using city population size as an independent variable; organizational researchers are interested in the relationship between the relative size of an organization's administration and the absolute size of the organization; and economists often regress GDP per capita on variables such as the population growth rate, or even population size. According to Stan and Fuguitt’s research, all of these designs can produce spurious coefficients, because the dependent and independent variables include a common term. In their paper, they trace this finding back to an 1897 paper by Karl Pearson, in which Pearson showed rigorously where the spurious correlation comes from and gave an approximate formula for computing correlations of ratios.
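To make the source of the spurious correlation concrete, here is a sketch of the first-order approximation usually attributed to Pearson for the common-denominator case. The notation is my own assumption rather than the paper's: $v_A = \sigma_A / \mu_A$ is the coefficient of variation of $A$, and $r_{AB}$ is the correlation between $A$ and $B$.

$$
r_{X/Z,\,Y/Z} \approx \frac{r_{XY}\, v_X v_Y - r_{XZ}\, v_X v_Z - r_{YZ}\, v_Y v_Z + v_Z^2}{\sqrt{\left(v_X^2 + v_Z^2 - 2\, r_{XZ}\, v_X v_Z\right)\left(v_Y^2 + v_Z^2 - 2\, r_{YZ}\, v_Y v_Z\right)}}
$$

If X, Y, and Z are mutually uncorrelated, every $r$ term drops out and what remains is

$$
r_{X/Z,\,Y/Z} \approx \frac{v_Z^2}{\sqrt{\left(v_X^2 + v_Z^2\right)\left(v_Y^2 + v_Z^2\right)}}
$$

which is positive whenever Z varies at all, and equals 0.5 when the three coefficients of variation are equal. The approximation assumes the coefficients of variation are small, so it should be read as an indication of direction and rough size, not an exact value.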
We were asked to do an experiment to demonstrate this spurious correlation. We generated three sets of random integers (X, Y, and Z) ranging from 1 to 99, computed the pairwise correlation matrix, and found no significant correlation between any pair of variables. But we did find a significant correlation between Y/X and X, and when we regressed Y/X on X, the coefficient was significant too. So simply by dividing or subtracting, we artificially created a significant correlation between random variables that were originally uncorrelated.
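A first-order (delta-method) sketch suggests why Y/X and X end up correlated in this experiment. Writing $v_A = \sigma_A / \mu_A$ for the coefficient of variation and $r_{XY}$ for the correlation between X and Y (my notation, not the paper's):

$$
r_{Y/X,\,X} \approx \frac{r_{XY}\, v_Y - v_X}{\sqrt{v_X^2 + v_Y^2 - 2\, r_{XY}\, v_X v_Y}}
$$

With $r_{XY} = 0$ and $v_X = v_Y$ this gives $-1/\sqrt{2} \approx -0.71$. The sign is negative because a larger X mechanically shrinks Y/X. For integers drawn uniformly from 1 to 99 the coefficient of variation is about 0.57, which is too large for a first-order approximation to be accurate, so the simulated correlation should be negative but need not be close to this value.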
Why not try the following in Stata to see if the above claims are overstated or not?
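* start from an empty dataset with 50 observations
clear
set obs 50
* three independent sets of random integers from 1 to 99
* (newer Stata releases call this function runiform())
gen x=int(99*uniform()+1)
gen y=int(99*uniform()+1)
gen z=int(99*uniform()+1)
* the raw variables should show no significant pairwise correlation
pwcorr x y z, sig
* ratio versus its own denominator: Y/X and X
gen ydx = y/x
pwcorr x ydx, sig
* regress Y/X on X (dependent variable comes first)
reg ydx x
* two ratios with a common denominator: X/Z and Y/Z
gen xdz = x/z
gen ydz = y/z
pwcorr xdz ydz, sig
reg xdz ydz
* common term in one denominator and the other numerator: X/Z and Z/Y
gen zdy = z/y
pwcorr xdz zdy, sig
reg xdz zdy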
Are you convinced by now? If not, please go read the source paper below (or just write back and say what is wrong with Stan and Fuguitt’s argument). If yes, the question becomes what we should do about the spurious correlation. Should we just use the variables in their original forms? Should we re-specify the Solow model? And what if our research interest really is in a ratio or a difference? …
Source:
Stanley Lieberson and Glenn Fuguitt, 1974. Correlation of Ratios or Difference Scores Having Common Terms, in Sociological Methodology (1973-1974), edited by Herbert Costner, San Francisco: Jossey-Bass Publishers.