28 November 2007
In political science, as in many other branches of social science, more attention is being paid to the genetic bases of political behavior (I won't say effects, because that opens a whole other barrel of worms). As I was looking around for an overview of some of the statistical issues involved, I came across a couple of blog posts by Cosma Shalizi at Carnegie Mellon that were both informative and amusing. An excerpt:
When we take our favorite population of organisms (e.g., last year's residents of the Morewood Gardens dorm at CMU), and measure the value of our favorite quantitative trait for each organism (e.g., their present zip code), we get a certain distribution of this trait:
| Genotype | ZIP |
|---|---|
| AATGAAATAAAAAAAAACGAAAATAAAAAA... | 15232 |
| AAGGCCATTAAAGTTAAAATAATGAAAGGA... | 15213 |
| AAGGCCATTAAAGTTAAAATAATGAAAGGA... | 48104 |
| CAATGATTAGGACAATAACATACAAGTTAT... | 15212 |
| GGGGTTAATTAATGGTTAGGATGGGTTTTT... | 87501 |
| CCTTCAAAGTTAATGAAAAGTTAAAATTTA... | 15217 |
| CCTTCAAAGTTAATGAAAAGTTAAAATTTA... | 15217 |
| TAAGTATTTGAAGCACAGCAACAACTAGGT... | 02474 |
(Note to our institutional review board: No undergraduates had their DNA sequenced in the writing of this essay.)
If we are limited to the tools of early 20th century statistics (in particular, if we are the great R. A. Fisher, and so simultaneously forging those tools while helping to found evolutionary genetics), we summarize the distribution with a mean and a variance. We can inquire as to where the variance in the population comes from. In particular, assuming the organisms are not all clones, it is reasonable to suppose that some of the variation goes along with differences in genes. The fraction of variance which does so is, roughly speaking, the "heritability" of the trait.
The most basic sort of analysis of variance (see also: Fisher) would make this conceptually simple, though practically unsuccessful. Simply take all the organisms in the population, and group them by their genotypes. For each group of genetically identical organisms, compute the average value of the trait. Compare the variance of these within-genotype averages (that is, the across-genotype variance) to the total population variance; this is the fraction of variation associated with genotypes. In most mammalian populations, where clones (identical twins, triplets, ...) are rare and every organism otherwise has a unique genotype, this would tell you that almost all of the variance of any trait is associated with genetic differences. On such an analysis, almost all of the variance in zip codes in my example would be "due to" genetic differences, and the same would be true of telephone numbers, social security numbers, etc.
To see why, look at my table again. With one exception (the twins who live in 15213 and 48104), in this population changing zip code means changing your genotype. The vast majority (81%) of the variance in zip codes is between genotypes, not within them. With real human data, a quarter of the people wouldn't be twins living apart, and the proportion of variance in zip codes "due to" genotype would be even higher.
Naively, then, on this analysis we would say that the "heritability" of zip code, the fraction of its variance which goes along with genetic variations, is 81%. It is crucial to be clear on what this means, which is merely and exactly this: in this population, if we take a random group of genetically identical people, the variance within that group should be 19% (=100-81) of the total variance in the population.
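For readers who want to try the grouping procedure themselves, here is a minimal sketch in Python. It is my own illustration, not code from Shalizi's post: the function name `between_genotype_fraction` is made up, the genotype strings are only the truncated prefixes shown in the table, and ZIP codes are treated as plain numbers (so 02474 becomes 2474). The sketch uses one common convention, a sum-of-squares decomposition in which each genotype group is weighted by its size; other conventions give different figures, so the printed value need not match the 81% quoted in the excerpt.

```python
from collections import defaultdict

# Rows of the excerpt's table: (truncated genotype prefix, ZIP code as a number).
# Assumption: ZIP "02474" is read as the integer 2474.
rows = [
    ("AATGAAATAAAAAAAAACGAAAATAAAAAA", 15232),
    ("AAGGCCATTAAAGTTAAAATAATGAAAGGA", 15213),
    ("AAGGCCATTAAAGTTAAAATAATGAAAGGA", 48104),
    ("CAATGATTAGGACAATAACATACAAGTTAT", 15212),
    ("GGGGTTAATTAATGGTTAGGATGGGTTTTT", 87501),
    ("CCTTCAAAGTTAATGAAAAGTTAAAATTTA", 15217),
    ("CCTTCAAAGTTAATGAAAAGTTAAAATTTA", 15217),
    ("TAAGTATTTGAAGCACAGCAACAACTAGGT", 2474),
]


def between_genotype_fraction(rows):
    """Share of the total sum of squares that lies between genotype groups.

    Hypothetical helper: groups the trait values by genotype, measures the
    spread left over inside each group, and reports one minus that
    within-group share of the total spread.
    """
    groups = defaultdict(list)
    for genotype, value in rows:
        groups[genotype].append(value)

    values = [value for _, value in rows]
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)

    within_ss = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        within_ss += sum((v - group_mean) ** 2 for v in members)

    return 1.0 - within_ss / total_ss


if __name__ == "__main__":
    print(f"between-genotype share: {between_genotype_fraction(rows):.3f}")
```

If every organism had a unique genotype, the within-group spread would be exactly zero and the function would return 1, which is the point of the excerpt: the number says nothing about genes causing zip codes.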
Yet more on the heritability and malleability of IQ
g, a statistical myth
Posted by Mike Kellermann at November 28, 2007 4:20 PM
I wanted to comment on Shalizi's quite interesting blog post called "g, a statistical myth" that Mike links to at the end, as it relates to on(and on and on)going stuff that I am doing.
The impetus for Shalizi's post was an article in Slate about racial differences in g, or "general intelligence". I think there are *a lot* of problems with the research Shalizi is responding to, but I wanted to challenge one of the claims that Shalizi makes in doing so.
Shalizi essentially claims that confirmatory factor analysis is better (or at least less bad) for making causal, scientific claims than is exploratory factor analysis. One quote (among many similar ones) from Shalizi:
"This brings me to the other major sort of factor analysis, what's called "confirmatory" factor analysis. This is about checking a model where some latent, unobserved variables are supposed to account for the relations among the actual observations. To simplify, the logic is that if the model is right, then we should get certain patterns of correlations and no others — like checking whether the partial correlations are zero, as Spearman's original model required them to be, but adapted to other latent structures. This is a genuinely inferential and not just descriptive piece of statistics [BG: Shalizi argues that exploratory factor analysis is merely descriptive]. It's also a pretty modest one, since failing one of these tests is decisive, but passing often isn't very informative, because, as we'll see, radically different arrangements of latent factors can give basically the same pattern of observed correlations. (In the jargon, the power of these tests can be very low at reasonable sample sizes.)"
I would agree that exploratory factor analysis (EFA), as practiced, is not that useful because I don't think there is a sufficiently strong scientific basis for choosing the algorithm for rotating the initial loadings. But the long-standing distinction between EFA and confirmatory factor analysis (CFA) is something that I wanted to transcend with my presentation and (still) forthcoming R package that does what I call "semi-exploratory factor analysis" (SEFA). The basic idea of SEFA is similar to CFA in that the researcher posits that there are a certain number of zeros in the factor loading matrix. The difference is that CFA specifies *exactly* where these zeros are in the factor loading matrix whereas SEFA estimates the *locations* of the zeros and the values of the non-zero loadings. Thus, SEFA is like EFA in the sense that one is not necessarily seeking to confirm a sharp hypothesis. But because SEFA estimates do not require rotation and are determined entirely by the likelihood (or the prior times the likelihood), SEFA is no less "scientific" than a CFA. Certainly, SEFA could be (mis)used to study intelligence or anything else that is traditionally analyzed via factor analysis. And I think SEFA would help us focus our attention on the design of the study and the interpretation of the estimates, rather than getting caught up in a decades-long debate over the distinction between EFA and CFA, which has never been a particularly sharp or fruitful distinction anyway.
Now, if I could just finish this R package ...
Posted by: Ben Goodrich at November 28, 2007 8:43 PM
Mike, Ben - thank you both for the kind words. Two quick points.
First, a clarification: those posts were mostly written over this summer, growing out of an argument with a friend, and predate the pieces in Slate, and even the mess James Watson made for himself. I expected them to become topical eventually, because this folly, like herpes, always breaks out again sooner or later; but I didn't think it would happen this soon.
Second, from the way Ben describes his SEFA it sounds to me like I'd think of it as (at least potentially) inferential as well. The distinction I have in mind is something like this: saying "the least-squares regression hyperplane for this data is thus-and-so" is just descriptive. Either that hyperplane minimizes the residual sum of squares for that data set or it doesn't, and that's all there is to it. This is something which can be found objectively and unambiguously from the data, without any possibility of rotation, but it's still just descriptive. The statistical inference only comes in when you make some form of extrapolation or generalization beyond the given data - "and we'd have to be damn unlucky to get this plane if there was really no relationship between the variables". I can see how to do the latter with CFA in a way which I can't with EFA. But I think I can see how to do it with Ben's SEFA, too. (I do have trouble imagining a situation where I know how many zeroes the factor loading matrix has, but not where they are, but my imagination is notoriously weak.) I'd be very interested to see the paper when it's ready, since at the very least it sounds like a potential topic the next time I teach data-mining.
Causal inference is another matter yet again, one where I certainly don't want to give the impression that CFA is at all reliable, because it definitely isn't, any more than EFA is.
Posted by: Cosma Shalizi at November 28, 2007 10:37 PM
Hi Cosma,
I think SEFA provides a basis for statistical inference in exactly the sense you suggest. Indeed, one way to think of SEFA is retroactive CFA or maximizing the likelihood over all possible CFA models with a given number of exact zeros per factor in the loading matrix.
I lack imagination too (or at least lack faith in my own theories), so when I do SEFA I am typically specifying that the number of exact zeros per factor is equal to the number of factors, which in most cases is about the minimum number of exact zeros sufficient to avoid rotational indeterminacy problems. In a well-designed study, there should be several more loadings that are near zero at the MLEs but are not required to be exactly zero.
So in principle, one could estimate a SEFA model where the researcher specified that there were 7 zeros on the first factor, 5 zeros on the second factor and 10 zeros on the third factor or something like that. But I would be the first to say something like "Why don't you just require 3 zeros per factor and reestimate the SEFA model? If you are right that there really are more zeros, then at the MLEs you should find additional near-zeros. More generally, evidence for whatever tentative theory compelled you to collect data in the first place should emerge at the SEFA MLEs. If not, your sample was too small, your luck was too bad, you chose the wrong number of factors, or your tentative theory was wrong (or my algorithm had a bug in it). But you might find something at the SEFA MLEs that would lead you to revise your theory and collect new data."
In this sense, I am excited about SEFA as a scientific tool in research areas where it is impossible, impractical, or ethically dubious to try to experimentally manipulate a "factor" to see if it exists or estimate the ATT. Or to figure out what the factors might be and then try to measure them or conduct experiments involving them. I think SEFA is the estimator Thurstone, Tucker, Cattell, Yates, and other proponents of EFA would have wanted but didn't have the hardware (big server or at least modern desktop) or software (genetic algorithm or MCMC) to even contemplate.
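To give a rough sense of why searching over zero placements calls for that kind of software, here is a small Python sketch that counts candidate zero patterns. It is only an illustration under simplifying assumptions and is not code from the SEFA package: the function name `count_zero_patterns` is hypothetical, every factor is assumed to get the same number of exact zeros, and the count ignores identification conditions and the fact that relabeling factors yields equivalent models.

```python
from math import comb


def count_zero_patterns(n_items, n_factors, zeros_per_factor):
    """Count ways to place exact zeros in a loading matrix (hypothetical helper).

    Each column (factor) of an n_items x n_factors loading matrix gets
    `zeros_per_factor` exact zeros, and the placements in different columns
    are chosen independently, so the total is a product of binomial
    coefficients.
    """
    per_factor = comb(n_items, zeros_per_factor)
    return per_factor ** n_factors


if __name__ == "__main__":
    # Example sizes are arbitrary; zeros per factor equals the number of factors.
    for n_items in (10, 20, 40):
        print(n_items, count_zero_patterns(n_items, 3, 3))
```

Each pattern corresponds to one CFA-style model whose likelihood would have to be maximized, so the count grows quickly with the number of observed variables, which makes exhaustive enumeration unattractive compared with a stochastic search.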
So, paper and R package coming Real Soon Now. Actually, in keeping with R tradition, it will probably be R package first and paper second. But I do have some slides, and I will email them to you in a second (or to anyone else for that matter).
Posted by: Ben Goodrich at November 29, 2007 12:30 AM
For those interested, there is a not-unrelated discussion about Cosma's work going on at Crooked Timber.
Posted by: David Kane at December 2, 2007 2:06 PM