June 26, 2008
A few bloggers at other sites (Concurring Opinions and Election Law Blog) have pointed out an interesting footnote in the Supreme Court's recent decision on punitive damages in the Exxon Valdez case. Justice Souter took note of experimental research on jury decisionmaking done by Cass Sunstein, Daniel Kahneman, and others, but then dismissed it for the purposes of the decision because Exxon had contributed funding for the research:
The Court is aware of a body of literature running parallel to anecdotal reports, examining the predictability of punitive awards by conducting numerous “mock juries,” where different “jurors” are confronted with the same hypothetical case. See, e.g., C. Sunstein, R. Hastie, J. Payne, D. Schkade, W. Viscusi, Punitive Damages: How Juries Decide (2002); Schkade, Sunstein, & Kahneman, Deliberating About Dollars: The Severity Shift, 100 Colum. L. Rev. 1139 (2000); Hastie, Schkade, & Payne, Juror Judgments in Civil Cases: Effects of Plaintiff’s Requests and Plaintiff’s Identity on Punitive Damage Awards, 23 Law & Hum. Behav. 445 (1999); Sunstein, Kahneman, & Schkade, Assessing Punitive Damages (with Notes on Cognition and Valuation in Law), 107 Yale L. J. 2071 (1998). Because this research was funded in part by Exxon, we decline to rely on it.
It will be interesting to see whether this position is taken up by the lower courts; if so, we might see less incentive for private actors to fund social science research. That could be good or bad, I suppose, depending on one's views of the likelihood that researchers will be unduly influenced by their funding sources.
Posted by Mike Kellermann at 1:13 PM | Comments (8)
June 13, 2008
Two awards given by the Society for Political Methodology were announced today, and both of them went to IQSS faculty members (and co-authors).
The Gosnell Prize is given to the "best paper on political methodology given at a conference", and this year's prize was awarded to Kevin Quinn for his paper "What Can be Learned from a Simple Table? Bayesian Inference and Sensitivity Analysis for Causal Effects from 2x2 and 2x2xK Tables in the Presence of Unmeasured Confounding." From the announcement:
Quinn's paper offers a set of steps to improve inference with binary independent and dependent variables and unmeasured confounds. He derives large sample, non-parametric bounds on the average treatment effect and shows how these bounds do not rely on auxiliary assumptions. He then provides a graphical way to depict the robustness of inferences as one changes assumptions about the confounds. Finally, he shows how one can use a Bayesian framework relying on substantive knowledge to restrict the set of assumptions on the confounds to improve inference.
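To make the idea of assumption-free bounds concrete, here is a minimal sketch of the textbook no-assumptions bounds on the average treatment effect for a binary treatment X and binary outcome Y. This is not taken from Quinn's paper: the counts in the table and the function name ate_bounds are made up for illustration, and the paper's own derivation may differ in its details.

```r
# Hypothetical 2x2 table of counts (illustrative numbers only):
# rows = treatment X (0/1), columns = outcome Y (0/1).
tab <- matrix(c(40, 10,
                20, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(X = c("0", "1"), Y = c("0", "1")))

# No-assumptions bounds on ATE = E[Y(1)] - E[Y(0)].
# The unobserved potential outcomes are only known to lie in [0, 1], so
#   E[Y(1)] is in [P(X=1,Y=1), P(X=1,Y=1) + P(X=0)]
#   E[Y(0)] is in [P(X=0,Y=1), P(X=0,Y=1) + P(X=1)]
ate_bounds <- function(tab) {
  p <- tab / sum(tab)                      # joint proportions P(X = x, Y = y)
  lower <- -(p["1", "0"] + p["0", "1"])    # worst case for the treatment
  upper <-   p["1", "1"] + p["0", "0"]     # best case for the treatment
  c(lower = lower, upper = upper)
}

ate_bounds(tab)   # lower = -0.3, upper = 0.7 for this table
```

The interval always has width one, which is why additional assumptions or prior knowledge about the confounders are needed to say anything sharper.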
The Warren Miller prize is given annually to the best paper appearing in Political Analysis. This year's prize has been awarded to Daniel E. Ho, Kosuke Imai, Gary King, and Elizabeth A. Stuart for their article, "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference." The abstract of their paper follows:
Although published works rarely include causal estimates from more than a few model specifications, authors usually choose the presented estimates from numerous trial runs readers never see. Given the often large variation in estimates across choices of control variables, functional forms, and other modeling assumptions, how can researchers ensure that the few estimates presented are accurate or representative? How do readers know that publications are not merely demonstrations that it is possible to find a specification that fits the author's favorite hypothesis? And how do we evaluate or even define statistical properties like unbiasedness or mean squared error when no unique model or estimator even exists? Matching methods, which offer the promise of causal inference with fewer assumptions, constitute one possible way forward, but crucial results in this fast-growing methodological literature are often grossly misinterpreted. We explain how to avoid these misinterpretations and propose a unified approach that makes it possible for researchers to preprocess data with matching (such as with the easy-to-use software we offer) and then to apply the best parametric techniques they would have used anyway. This procedure makes parametric models produce more accurate and considerably less model-dependent causal inferences.
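As a rough illustration of the "preprocess with matching, then fit the parametric model you would have used anyway" idea in the abstract, here is a toy sketch in base R. It is not the authors' software: the data and the variable names (treat, x, y) are invented, and one-to-one nearest-neighbor matching on a single covariate is just one simple matching rule chosen for brevity.

```r
# Toy data (hypothetical): treat is a binary treatment, x a single
# confounder, y the outcome.
dat <- data.frame(
  treat = c(1, 1, 1, 0, 0, 0, 0, 0),
  x     = c(2.0, 3.5, 5.0, 1.0, 2.1, 3.4, 5.2, 9.0),
  y     = c(5.1, 6.4, 8.2, 2.9, 4.0, 5.5, 7.1, 11.0)
)

treated  <- which(dat$treat == 1)
controls <- which(dat$treat == 0)

# Step 1: preprocessing. For each treated unit, pick the closest
# not-yet-used control on x (1:1 nearest neighbor, without replacement).
matched_controls <- integer(0)
for (i in treated) {
  available <- setdiff(controls, matched_controls)
  nearest   <- available[which.min(abs(dat$x[available] - dat$x[i]))]
  matched_controls <- c(matched_controls, nearest)
}
matched <- dat[c(treated, matched_controls), ]

# Step 2: fit the same parametric model, but on the matched data only.
fit <- lm(y ~ treat + x, data = matched)
coef(fit)["treat"]
```

The control with x = 9.0 has no comparable treated unit and is dropped in step 1, so the regression no longer has to extrapolate to it.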
Posted by Mike Kellermann at 2:22 PM
May 31, 2008
I'm grateful for the strong response to my original query for quality, free PDF annotation for Linux. In general, there seem to be a few categories.
-Windows-based editors, adaptable through emulators: PDF X-change, Foxit (free version), primopdf
-Linux editors with non-portable annotations: Okular, which has hidden XML files for its annotations (skim, for OS X, has the same scheme)
-early, incomplete solutions that will eventually be good: GNU's PDF project, Xournal
-early, incomplete solutions that aren't user-friendly: pdfedit, Cabaret Stage
-early solutions that are still in progress: evince
Of all of these options, I like Okular the best, mainly because integrating its XML-saved annotations into the PDF is but one plugin away (which might already exist, for all I know), and it's theoretically portable to Windows by installing qt4 binaries. Using an emulator like wine is a hassle big enough that I've avoided it, for the same reason I don't use cygwin on Windows systems.
So we're close to a (more) universal free editing environment. But I'm still not a fan of doing all my work on a screen, and also not willing to print. So I'm trying a middle road.
I bought an iLiad e-paper reader this past week, and so far I'm impressed with how it handles (though its price tag, $600 for the model I bought, definitely isn't for everyone, and was almost not for me). The screen is easily readable, the battery lasts, and I can zoom in and rotate documents to get a half-page display with larger text. More importantly, the device runs Linux, and iRex has made a point of using open source software as much as possible, in contrast to Amazon and the Kindle (which is half the size, can't read PDFs, and can't edit books).
However, as the project is still in its relative infancy, there are a few functions it has yet to incorporate that I really would like, and they're the same ones I want in a computer-based annotator: highlighting multiple-column text, for example, so that I can extract passages I want later at the push of a button. And like Okular, the annotations made on the iLiad are saved in a companion XML file rather than the original PDF, but the company offers a free program to do the merging.
I'm going to continue to explore what the iLiad can do as far as editing, but it's definitely reassuring that everyone who's seen me use it has oohed and aahed at it.
To sum up, I've now got a free platform for reading, editing and annotating PDFs on a Linux machine, and an auxiliary paper-free method for reading them later which is admittedly not free. And I have more needs as well, but I can at least see them being met soon. What else do people want in paperless work we haven't covered yet?
P.S. If the people from iRex are reading this and want me to shill for them for real, they can let me know directly.
Posted by Andrew C. Thomas at 11:05 PM | Comments (5)
May 26, 2008
I'm a Linux user in need of a quality PDF reader with basic annotation tools, and I need it to be available for free. Think I'm asking for too much?
We're at a point where the level of content available online dwarfs our ability to print it all onto paper for examination and notation. As academics, we're expected to sort through volumes of other people's work in order to verify that our own is original, as well as comment, annotate, and on occasion make corrections or forward-references to later works.
But despite a boom in computational power and information bandwidth, the software to do this without resorting to printed or copied matter isn't accessible to most students without paying through the nose. Full software suites like Adobe Acrobat aren't necessary for the kind of work academics need to do. There are a few functions that are essential to the task, currently available in commercial software:
-Adding and reading notes, whether free-floating or attached to highlighted text
-The ability to select and copy multi-column text (none of the free ones seem to be able to get this one right)
-Pop-up previews for links: when LaTeX creates a link to a footnote or citation, hovering over the displayed link should bring up a pop-up box with the information
I'm a man with big ideas but no time, and more importantly, no budget, to motivate and drive the development and use of a free PDF reader with mild annotation capabilities. I can't resort to the for-pay software available from the school website because I'm running Linux, and I shouldn't have to go to a virtual machine or another computer to do this kind of annotation. Likewise, others shouldn't have to spend hundreds for software where they only need a few simple functions.
I suppose the issue is that everyone has their own toys they want included in a PDF editor, which is why the commercial package makes sense. But as academics, wouldn't we be happy with "the basics plus"?
Posted by Andrew C. Thomas at 6:34 PM | Comments (17)
May 22, 2008
Professor Nicholas Christakis and Professor James Fowler’s study on social networks and smoking cessation, which will appear in the New England Journal of Medicine this Thursday, is featured in the New York Times. Congratulations to them!
Their basic findings are that smokers are likely to quit in groups (As Nicholas said, “Whole constellations are blinking off at once.”) and that the remaining smokers tend to be socially marginalized.
One question I have about their study: if friends tend to quit smoking together, could this partly account for the simultaneous weight gain among friends, a result Nicholas and James found last year using the same dataset? In other words, I fully accept that social ties have important effects on individuals' wellbeing, but if you study one wellbeing outcome and do not control for the “contaminating” effects of other outcomes, the estimated social network effect on that outcome could be biased. For example, the weight gain among friends could, from this point of view, result in part from their quitting smoking at the same time. Of course, if smokers make up only a very small fraction of the participants in the sample and their weight changes are not too extreme, the bias should not be a serious problem.
See the following link for a glimpse of their study.
Study Finds Big Social Factor in Quitting Smoking
http://www.nytimes.com/2008/05/22/science/22smoke.html?partner=rssnyt&emc=rss
Apologies for the duplicate if you have already seen this news.
Posted by Weihua An at 12:01 PM | Comments (2)
May 19, 2008
Mark Blumenthal from pollster.com has been posting interviews with scholars at the 2008 AAPOR conference, including two with our very own Sunshine Hillygus and Chase Harrison from the Program on Survey Research:
Posted by Mike Kellermann at 10:50 AM
May 15, 2008
I just finished reading an interesting paper on placebo effects in drug trials by Anup Malani. Malani noticed that participants in high-probability trials know that they are more likely to get the active treatment (because of informed consent prior to the trial). They have higher expectations and hence should show larger placebo effects than patients in low-probability trials. Malani compares outcomes across trials with different assignment probabilities and finds evidence for placebo effects. A related finding is that the control group in high-probability trials reports more side effects.
The paper discusses some potential implications of placebo effects, e.g. that patients who are optimistic about the outcome might change their behavior and hence get better even without the active drug. It makes me wonder how this might translate into non-medical settings and whether there are studies of placebo effects in the social sciences. Also, if placebo drugs can improve health outcomes, maybe ineffective social programs would still work as long as participants don’t know whether the program works or doesn’t? Maybe this is the role of politics. But what about the side-effects?
Malani, A (2006) “Identifying Placebo Effects with Data from Clinical Trials” Journal of Political Economy, Vol. 114, pp. 236-256. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=901838
Abstract:
A medical treatment is said to have placebo effects if patients who are optimistic about the treatment respond better to the treatment. This paper proposes a simple test for placebo effects. Instead of comparing the treatment and control arms of a single trial, one should compare the treatment arms of two trials with different probabilities of assignment to treatment. If there are placebo effects, patients in the higher-probability trial will experience better outcomes simply because they believe that there is a greater chance of receiving treatment. This paper finds evidence of placebo effects in trials of antiulcer and cholesterol-lowering drugs.
Posted by Sebastian Bauhoff at 12:00 PM | Comments (5)
May 13, 2008
I recently came across Datamob.org, a site featuring public datasets and interfaces that have been built to help the public explore them.
From datamob's about page:
Our listings emphasize the connection between data posted by governments and public institutions and the interfaces people are building to explore that data. It's for anyone who's ever looked at a site like MAPLight.org and wondered, "Where did they get their data?" And for anyone who ever looked at THOMAS and thought, "There's got to be a better way to organize this!"
I continue to wonder how the types of interfaces featured on datamob will affect the dissemination of information in society. The dream of a lot of these interface builders is to disintermediate information provision -- i.e., to make it possible for citizens to do their own research, produce their own insights, publish their findings on blogs and via data-laden widgets. (We welcomed Fernanda and Martin from Many Eyes, two prominent participants in this movement, earlier this year at our applied stats workshop.) At the same time, the new interfaces make it cheaper for professional analysts -- academics, journalists, consultants -- to access the data and, as they have always done, package it for public consumption. It makes me wonder to what extent the source of our data-backed insights will really change, i.e., how much more common will "I was playing around with data on this website and found out that . . . " become relative to "I heard about this study where they found that . . ."?
My hunch is that, just as blogging and internet news have democratized political commentary, the new data resources will make it possible for a new group of relatively uncertified people to become intermediaries for data analysis. (I think FiveThirtyEight is a good example in political polling, although since the site's editor is anonymous I can't be sure.) People will overwhelmingly continue to get data insights as packaged by intermediaries rather than through new interfaces to raw data, but the intermediaries (who will use these new services) will be quicker to use data in making their points, will become much larger in number, and will on average become less credentialed.
Posted by Andy Eggers at 9:48 AM | Comments (3)
May 9, 2008
A fabulous three-part series on further adventures in identification on the Freakonomics blog: here, here, and here. The story features Kennedy School Professor Robert Jensen in his five-year quest to achieve rigorous identification of Giffen effects. After finding correlational evidence for Giffen goods in survey data, he and his co-author followed up by running an experiment in China and, guess what, they do find evidence for Giffen behavior. Impressive empirics and a funny read, enjoy!
Posted by Jens Hainmueller at 2:16 PM | Comments (1)
May 8, 2008
Last week we had an International Meeting on Methodology for Empirical Research on Social Interactions, Social Networks, and Health here at the IQ, organized by Professor Charles Manski and Professor Nicholas Christakis. Some people told me that the second day of the meeting was much more “violent” than the first, and based on what I saw, I believe that was true. I saw at least three cliques of speakers form on site along disciplinary lines: statisticians, economists, and sociologists and political scientists. There were even sub-cliques and backfires! Fortunately, nobody was severely wounded. In any case, it was a great intellectual exchange between disciplines. Below are some brief notes I took on the second day of the meeting, particularly during the last 20 minutes, when speakers talked about future directions for network analysis in the social sciences. I apologize that I did not jot down exactly who said what, and that I have also squeezed some of my personal thoughts into the notes. I take full responsibility for all errors in the notes.
1. Need to combine game theory with social network analysis, particularly evolutionary game theory (and transaction costs theory).
2. Need to further develop social network analysis based on (random) graph theory, topology and random matrix theory.
3. Network studies tend to focus on network structure and topology as dependent variables, while the social sciences are more concerned with how network positions and features affect node-level problems. Put simply, network studies tend to start from the nodes and end at the network, while the social sciences take more of a top-down approach.
4. In either case, however, it is crucial to understand the data/tie generating mechanism. In particular, note that the formation of ties can go two ways: influence and selection. For example, two smokers can end up as friends either because one person is influenced by his/her smoking friend to start smoking (influence) or because they are both smokers and then become friends (selection). As another example, a highly educated person is usually less likely to be nominated by others as a best friend. This could be either because the highly educated person is less trustworthy or less capable of maintaining friendship ties, or because he/she is more independent and less willing to associate with others. Longitudinal data may help resolve the influence vs. selection issue.
5. Network analysis assumes that the probability of forming a tie is the same for any pair of nodes. So start with a meaningful number of nodes to build the network, so that each node has roughly the same probability of forming ties with any other.
6. How will severing an existing tie or forming a new tie affect the structure of a social network? How can ties bring more ties and lead to a polarized network? Nonlinear generating processes and dynamics in a network can lead to dramatic differences in network structure from tiny changes at the node level. How does network size affect network structure? (Think about the differences among a monopolistic market, an oligopolistic market and a perfectly competitive market.)
7. How to define homophily between friends? One dimension vs. multiple dimensions? Even with one dimension, there are still two approaches: 1) run a test of means between the tie senders and the tie receivers; 2) use, as an alternative measure, the ratio of the number of ties whose connected nodes fall in the same group that you define (e.g., age +/- 5) to the total number of ties. What else?
8. Need to think about how to incorporate network analysis into the traditional regression framework. We can either include network properties in regression models to study how networks affect personal- or clique-level phenomena, or use regressions to evaluate how network properties are determined by socioeconomic variables.
9. How to deal with the dependence structure among node-level variables, since the errors are not i.i.d.? Is it enough to just use a correlation matrix to weight the standard errors and get robust SEs?
10. Need to combine network software with traditional statistical software. statnet is getting there. But for Stata users, canned programs are needed to generate network data inside Stata.
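The two one-dimensional homophily measures in note 7 can be sketched in a few lines of R. Everything here is hypothetical: the edge list, the ages, and the reading of "mean test" as a paired t-test are my own assumptions for illustration, not something presented at the meeting.

```r
# Hypothetical directed ties: each row is one tie from sender to receiver.
ties <- data.frame(
  sender   = c("a", "a", "b", "c", "d"),
  receiver = c("b", "c", "c", "d", "e"),
  stringsAsFactors = FALSE   # keep names as character so they can index 'age'
)

# Hypothetical node attribute.
age <- c(a = 20, b = 23, c = 31, d = 33, e = 50)

sender_age   <- age[ties$sender]
receiver_age <- age[ties$receiver]

# Approach 1: test of means between tie senders and tie receivers
# (done here as a paired t-test, one pair per tie).
t.test(sender_age, receiver_age, paired = TRUE)

# Approach 2: share of ties whose two endpoints fall in the same group,
# with "same group" defined as ages within +/- 5 years.
same_group      <- abs(sender_age - receiver_age) <= 5
homophily_ratio <- mean(same_group)
homophily_ratio   # 0.4 here: 2 of the 5 ties connect similar ages
```

A ratio like this only means something relative to a baseline, such as the share of all possible pairs that fall in the same group.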
Lastly, for those of you who are interested in causal analysis, read Patrick Doreian (2001), “Causality in Social Network Analysis” (Sociological Methods and Research 30: 81-114) and see if you can improve his study.
Posted by Weihua An at 10:46 AM
May 6, 2008
I've been programming in R for four years now, and it seems that no matter how much I learn there are a million tiny ways that I could do it better. We all have our own programming styles and frequently used functions that may prove useful to others. I often find that a casual conversation with an office mate yields new approaches to a programming quandary. I'm speaking not of statistical insights, though those are important too, but rather the "simple" art of data manipulation and programming implementation--those essential tricks that help to improve coding efficiency. So, to that end I'm announcing the beginning of a bi-weekly "Tuesday Tips & Tricks" posting. These tips may include the description of a useful and perhaps obscure function, or the solutions to common coding problems. I'm selfishly hoping that if readers of this blog know of better or alternate approaches, they'll respond in the comment section. So I'm looking forward to reading your responses.
This week's tip: How to quickly summarize contents of an object.
Answer: summary(), str(), dput()
The primary option, of course, is the familiar summary() command. This command works well for viewing model output, but also for getting a quick sense of data frames, matrices and factors. For example, the summary of a data frame or matrix shows the following:
> summary(dat1)
     Hello           test     citynames
 Min.   :1.00   Min.   :-3   Length:2
 1st Qu.:1.25   1st Qu.:-2   Class :character
 Median :1.50   Median :-1   Mode  :character
 Mean   :1.50   Mean   :-1
 3rd Qu.:1.75   3rd Qu.: 0
 Max.   :2.00   Max.   : 1
This is an incredibly useful function for numeric data, but is less useful for string data. For character vectors the summary function only reveals the length, class, and mode of the variable. In this case, to get a quick look at the data, one might want to use str(). Officially str() "compactly displays the structure of an arbitrary R object", and in practice this is incredibly useful. So using the same dataframe as an example:
> str(dat1)
'data.frame': 2 obs. of 3 variables:
$ Hello : num 1 2
$ test : num -3 1
$ citynames: chr "Cambridge" "Rochester"
Here we have just a 2 x 3 data frame. The first variable, Hello, is numeric, with values 1 and 2. The character vector citynames is displayed much more usefully than it was by summary(). While this is a small example, the function works just as well for much larger data frames and matrices, where it displays only the first few values of each variable.
For smaller objects, the function dput() might also prove useful. This function shows the ASCII text representation of the R object and its characteristics. So for this same example:
> dput(dat1)
structure(list(Hello = c(1, 2), test = c(-3, 1), citynames = c("Cambridge",
"Rochester")), .Names = c("Hello", "test", "citynames"), row.names = c(NA,
-2L), class = "data.frame")
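For anyone who wants to follow along, the example data frame can be rebuilt from the dput() output above. The sketch below also shows one reason dput() is handy: its output is valid R code, so an object can be written to a text file and read back with dget(). The temporary file is only for illustration.

```r
# Rebuild the example data frame from the values shown in the dput() output.
dat1 <- data.frame(
  Hello     = c(1, 2),
  test      = c(-3, 1),
  citynames = c("Cambridge", "Rochester"),
  stringsAsFactors = FALSE   # keep citynames as character, as in the post
)

summary(dat1)   # numeric summaries; only length/class/mode for citynames
str(dat1)       # compact structure, including the first few values
dput(dat1)      # text representation that can be pasted back into R

# Round trip: write the text representation to a file and read it back.
tmp <- tempfile(fileext = ".R")
dput(dat1, file = tmp)
dat2 <- dget(tmp)
identical(dat1, dat2)
```

Pasting dput() output into an email or a forum post is also a quick way to share a small reproducible example.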
Posted by Eleanor Neff Powell at 4:41 PM | Comments (3)
May 1, 2008
James Heckman has a new NBER working paper, “Econometric Causality,” which some of you might find interesting. To give you a flavor, Heckman writes:
“Unlike the Neyman–Rubin model, these [selection] models do not start with the experiment as an ideal but they start with well-posed, clearly articulated models for outcomes and treatment choice where the unobservables that underlie the selection and evaluation problem are made explicit. The hypothetical manipulations define the causal parameters of the model. Randomization is a metaphor and not an ideal or ‘gold standard’.” (page 37)
Heckman, J (2008) “Econometric Causality” NBER working paper #13934. http://papers.nber.org/papers/W13934
Abstract: This paper presents the econometric approach to causal modeling. It is motivated by policy problems. New causal parameters are defined and identified to address specific policy problems. Economists embrace a scientific approach to causality and model the preferences and choices of agents to infer subjective (agent) evaluations as well as objective outcomes. Anticipated and realized subjective and objective outcomes are distinguished. Models for simultaneous causality are developed. The paper contrasts the Neyman-Rubin model of causality with the econometric approach.
Posted by Sebastian Bauhoff at 10:00 AM
April 24, 2008
I am writing a short essay about the connection and distinction between indirect effects and interaction effects for a methodology class, and I found the following website very helpful in clarifying some of the FAQs on the subject. The website is maintained by Professor Regina Branton at the Department of Political Science of Rice University.
http://www.ruf.rice.edu/~branton/interaction/faqshome.htm
Also check out the mediation item at Wikipedia and its great references.
http://en.wikipedia.org/wiki/Mediation_(statistics)
Posted by Weihua An at 11:35 AM | Comments (2)
April 16, 2008
The Journal of the American Medical Association published a piece today on ghostwriting of medical research. Thanks to the Vioxx lawsuits, the authors say that they found documents “describing Merck employees working either independently or in collaboration with medical publishing companies to prepare manuscripts and subsequently recruiting external, academically affiliated investigators to be authors. Recruited authors were frequently placed in the first and second positions of the authorship list.” One of the exhibits uses a placeholder “External author?” for the expert to be named. Obviously the idea that a pharmaceutical company is pre-writing clinical studies is as controversial as doctors possibly signing off on them without really being involved. A NYT article has some comments, and Merck has released a press statement.
Posted by Sebastian Bauhoff at 10:54 PM
April 15, 2008
A few weeks ago I wrote a post sharing some code I wrote to generate sharp-looking PNG scatterplots from R using the Google Chart API. I think there are some nice uses of that (for example, as suggested by a commenter, to send a quick plot over IM), but here's something that I think could be much more useful: maps from R using Google Charts.
So, suppose you have data on the proportion of people who say "pop" (as opposed to "soda" or "coke") in each US state. (I got this data from Many-Eyes.) Once you get my code, you enter a command like this in R
googlemap(x = pct_who_say_pop, codes = state_codes, location = "usa", file ="pop.png")
and this image is saved locally as "pop.png":
To use this, first get the code via
source("http://people.fas.harvard.edu/~aeggers/googlemap.r")
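which loads in a function named googlemap, to which you pass the arguments used in the example above: x (the values to plot), codes (the state or country codes), location (the map area, such as "usa" or "africa"), and file (the name of the PNG to save).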
For optional parameters to affect the scale of the figure and its colors, see the source.
Another quick example:
Suppose you wanted to make a little plot of Germany's colonial possessions in Africa. This code
googlemap(x = c(1,1,1,1), location = "africa", codes = c("CM", "TZ", "NA", "TG"),file = "germans_in_africa.png")
returns this url
"http://chart.apis.google.com/chart?cht=t&chtm=africa . . . etc.
and saves this PNG on your hard drive:
The scatterplot thing before was something of a novelty, but I think this mapping functionality could actually be useful for generating quick maps in R, since the existing approaches are pretty annoying in my (limited) experience. The Google Charts API is not very flexible about labels and whatnot, so you probably won't be publishing any of these figures. But I expect this will serve very well for quick exploratory stuff, and I hope others do too.
I'd love it if someone wanted to help roll this into a proper R package . . . .
Posted by Andy Eggers at 3:01 PM | Comments (5)
April 10, 2008
When Professor Nicholas Christakis came by to give a talk on social networks and health two weeks ago, one commenter expressed concern about the sparseness of the information contained in network graphs (not specifically regarding Nicholas’ research, which I believe was well done). I share that concern. So afterwards I did a preliminary search of the literature on visualization of network data and found several interesting pieces that may help clarify (or even exacerbate) part of the concern some of us have about network graphs.
The first is the lecture notes Professor Peter V. Marsden wrote about visualization of network graphs in soc275. Here I just want to highlight a few points in his notes. (Words in quotes are taken from Professor Marsden’s lecture notes.)
1) Network graphs can be “referenced to known geographical/spatial/social locations of points”.
2) Aesthetic criteria are used to generate network graphs, for example, to minimize crossing lines, to make lines shorter, … and “[to] construct plot such that close vertices are connected, positively connected, strongly connected, or connected via short geodesics”.
3) “Location of points reflects ‘social distances’”. … “Spatial configuration differs depending on what 'distance-generating mechanism' is assumed and built in to one’s data.”
4) Some often-used network graph generating algorithms include factor analysis, multidimensional scaling (MDS) and spring embedders, etc.
So the configuration of a network graph seems to depend to a large degree on the researcher's theoretical interests, and it can change according to the network measures (the number of clusters within the network, overall network connectedness, etc.) that the researcher is most interested in. In other words, before generating any network graphs, researchers have to be clear about what theoretical themes they aim to present through the graphs and then select the corresponding network measures and generating algorithms. For those of you who want to follow up on this topic, Professor Marsden recommends several pieces in his lecture notes that I think are good starting references. See below for more details.
1. Bartholomew, David J., Fiona Steele, Irini Moustaki, and Jane I. Galbraith. 2002. The Analysis and Interpretation of Multivariate Data for Social Scientists. London: Chapman and Hall/CRC. Chapters 3 and 4.
2. Freeman, Linton C. 2005. “Graphic Techniques for Exploring Social Network Data.” Chapter 12 in Carrington, Peter J., John Scott, and Stanley Wasserman. 2005. Models and Methods in Social Network Analysis. New York: Cambridge University Press.
3. Freeman, Linton C. 2000. “Visualizing Social Networks.” Journal of Social Structure 1. (Electronically available at http://www.cmu.edu/joss/content/articles/volindex.html)
Posted by Weihua An at 11:51 AM | Comments (3)
April 7, 2008

Seb just sent this very amusing paper (which he found in a comment to a post on Andrew Gelman's blog):
Objectives: To determine whether parachutes are effective in preventing major trauma related to gravitational challenge. Design: Systematic review of randomised controlled trials. Data sources: Medline, Web of Science, Embase, and the Cochrane Library databases; appropriate internet sites and citation lists. Study selection: Studies showing the effects of using a parachute during free fall. Main outcome measure: Death or major trauma, defined as an injury severity score > 15. Results: We were unable to identify any randomised controlled trials of parachute intervention. Conclusions: As with many interventions intended to prevent ill health, the effectiveness of parachutes has not been subjected to rigorous evaluation by using randomised controlled trials. Advocates of evidence based medicine have criticised the adoption of interventions evaluated by using only observational data. We think that everyone might benefit if the most radical protagonists of evidence based medicine organised and participated in a double blind, randomised, placebo controlled, crossover trial of the parachute.
Funny how such a lampoon can trigger a flame war on the BMJ website. Makes me understand why Gary writes about Misunderstandings between experimentalists and observationalists about causal inference...
Posted by Jens Hainmueller at 7:16 PM | Comments (2)
April 5, 2008
Dear students and colleagues,
We would like to invite you to attend the Political Economy Student Conference, to be held on April 17th in the NBER premises, in Cambridge, MA. The conference is an opportunity for students interested in political economy and other related fields to get together and discuss the open issues in the field, know what other people are working on, and share ideas. The program of the conference can be found at:
http://www.stanford.edu/group/peg/april_2008_conference/conference_program
This year, some members of the NBER Political Economy Group will be joining us for the conference. We are sure that we will greatly benefit from their comments and suggestions during the discussions.
We hope that those of you interested will attend the conference. The success of the conference largely depends on students' attendance and participation. Given that we have limited seats for the conference, please e-mail leopoldo (at) mit (dot) edu as soon as possible if you are interested in attending so that we can secure a spot for you.
Best regards,
Leopoldo Fergusson
Marcello Miccoli
Pablo Querubin
Posted by Jens Hainmueller at 5:04 PM | Comments (2)
April 4, 2008
Here are the results of the Pennsylvania Democratic primary, with Obama counties in purple and Clinton counties in orange.
What, you say? The Pennsylvania primary hasn't happened yet? You're right. Enter statistics!
Consider this scatterplot of Kerry's 2004 vote share versus Obama's 2008 vote share in Ohio counties. The result is something I call the Kerry-Obama smile: Obama does well in Kerry's best counties, where staunchly Democratic urban blacks are concentrated; and in Kerry's worst regions, presumably due to Obama's appeal to crossover Republicans. Clinton does best in the wide middle swath.

This motivates a very simple modeling idea: fit a curve to the scatterplot. Obviously, a quadratic in Kerry's share looks like a decent fit. That gives us the best-fit curve shown on the plot. The R-squared is 0.16, representing an okay fit.
The next step is utterly useless, but utterly fun. We can use Ohio to predict Pennsylvania. In other words, given that we know how Kerry did in Pennsylvania counties in 2004, we can predict how well Obama will do in 2008 in every Pennsylvania county. Note that I first tweaked the model's intercept slightly in Obama's favor, so that the aggregate prediction matches the current polling average (showing Clinton up by 6.6%).
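For readers who want to try the same idea, here is a minimal Python sketch of the two steps above: fit a quadratic by least squares, then shift the intercept so the average prediction hits a target. The county numbers are made up for illustration (they are not the real Ohio or Pennsylvania data), the helper names are hypothetical, the counties are weighted equally, and the 46.7% target assumes a two-way race with no undecideds. None of those choices comes from the original analysis.

```python
import random

def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x^2 via the normal equations."""
    # Power sums of x and cross moments with y.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system (X'X | X'y).
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]

def predict(coefs, x):
    """Evaluate the fitted quadratic at x."""
    b0, b1, b2 = coefs
    return b0 + b1 * x + b2 * x * x

def shift_intercept(coefs, xs, weights, target):
    """Move b0 so the weighted mean prediction over xs equals target."""
    total = sum(weights)
    current = sum(w * predict(coefs, x) for x, w in zip(xs, weights)) / total
    return [coefs[0] + (target - current), coefs[1], coefs[2]]

random.seed(0)  # arbitrary seed, only for reproducibility

# Made-up "Ohio" counties: a U-shaped relationship plus noise.
ohio_kerry = [random.uniform(0.25, 0.75) for _ in range(88)]
ohio_obama = [0.45 + 2.0 * (k - 0.5) ** 2 + random.gauss(0, 0.05)
              for k in ohio_kerry]
coefs = fit_quadratic(ohio_kerry, ohio_obama)

# Made-up "Pennsylvania" counties, equally weighted (an assumption).
pa_kerry = [random.uniform(0.30, 0.80) for _ in range(67)]
pa_weights = [1.0] * len(pa_kerry)
adjusted = shift_intercept(coefs, pa_kerry, pa_weights, target=0.467)
pa_obama = [predict(adjusted, k) for k in pa_kerry]

print("fitted coefficients:", [round(c, 3) for c in coefs])
print("mean predicted share:", round(sum(pa_obama) / len(pa_obama), 3))
```

Shifting only the intercept keeps the shape of the smile fixed and moves every county's prediction by the same amount, which is the simplest way to reconcile a fitted curve with an outside polling number.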
The bad news for Obama is that nearly all of Pennsylvania's counties fall in the middle of the smile. The image below compares Kerry in 2004 to the model's predictions for Obama in 2008. Obama is predicted to carry Philadelphia overwhelmingly, and to do well in some of the curvy, heavily Republican counties in the south-center of the state. Everywhere else, though, is Clinton country.
Posted by Kevin Bartz at 1:15 PM | Comments (8)
April 3, 2008
It's a day or so past April 1, but if you haven't seen this post [Edit: link fixed] over at Andrew Gelman's blog, it is worth a look. It's about as good an apologia from a "born-again frequentist" as you are likely to find. An excerpt:
I like unbiased estimates and I like confidence intervals that really have their advertised confidence coverage. I know that these aren't always going to be possible, but I think the right way forward is to get as close to these goals as possible and to develop robust methods that work with minimal assumptions. The Bayesian approach--to give up even trying to approximate unbiasedness and to instead rely on stronger and stronger assumptions--that seems like the wrong way to go.
Fortunately, Gelman's conversion experience appears to have ended after about a day...
Posted by Mike Kellermann at 12:09 AM | Comments (2)
March 28, 2008
A friend just referred me to Processing, a powerful language for visualizing data:
Processing is an open source programming language and environment for people who want to program images, animation, and interactions. It is used by students, artists, designers, researchers, and hobbyists for learning, prototyping, and production. It is created to teach fundamentals of computer programming within a visual context and to serve as a software sketchbook and professional production tool. Processing is developed by artists and designers as an alternative to proprietary software tools in the same domain.
Their exhibition shows some very impressive results. For example, I liked the visualization of the London Tube map by travel time. I lived in Russell Square once, so this evoked pleasant memories.
If you can spare a minute, also take a look at the other exhibited pieces. Most are art rather than statistics. For chess friends I especially recommend the piece called "Thinking Machine 4" by Martin Wattenberg, who gave a talk at the IQSS applied stats workshop in the fall. Enjoy!
Posted by Jens Hainmueller at 7:43 AM | Comments (7)
March 27, 2008
Recently I read an article by Erin Leahey on how statistical significance testing, the 0.05 cut-off value, and the three-star system became legitimized and dominant in mainstream sociology. According to Erin, one star stands for p<=.05, two stars for p<=.01, and three stars for p<=.001. But my impression is that the cut-off values in use are more like .01, .05, and .10. Anyway, Erin attributes the first use of the .05 significance level to R. A. Fisher’s 1935 book, Design of Experiments. Erin notes that other forms of significance testing besides the .05 test were already very popular in the 1930s, when close to 40 percent of articles published in ASR and AJS applied one form of significance testing or another. Based on the articles she sampled from ASR and AJS, Erin shows that the popularity of significance testing and of the 0.05 cut-off value roughly followed an “S” shape: usage rose from the 1930s to 1950, declined until 1970, and has revived since then. Currently, around 80 percent of articles published in ASR and AJS employ both practices. The three-star system emerged in the 1950s but became popular only after 1970. Now slightly more than 40 percent of articles published in these two top sociology journals use it.
So what accounts for the diffusion of these practices? Erin offers several arguments. For example, she argues that institutional factors such as investment in research and computing, graduate training, the institution’s academic status, and individual journal editors’ preferences could be among the most important factors in the diffusion process. Interestingly, she found that graduating from Harvard had a significant negative “effect” on adopting these statistical practices. :-)
Of course, like almost all research, Erin’s study has some minor drawbacks. For example, her sample is drawn only from the top two sociology journals, so the generalizability of her findings could be limited. But overall, it is a fun read. And if you are interested in a more historical account of how statistical practices were introduced to, and became legitimized in, the social sciences in general, Camic and Xie (1994) is a very good start.
Sources:
Leahey, Erin. 2005. “Alphas and Asterisks: The Development of Statistical Significance Testing Standards in Sociology.” Social Forces 84: 1-24.
Camic, Charles, and Yu Xie. 1994. “The Statistical Turn in American Social Science: Columbia University, 1890-1915.” American Sociological Review 59:773-805.
Posted by Weihua An at 11:57 AM
March 26, 2008
A joint project by Andy Eggers and Jens Hainmueller, two long-time contributors to this blog, is the basis of a piece in The Guardian this Monday. Check out the article "How election paid off for postwar Tory MPs" and the paper "MPs For Sale? Estimating Returns to Office in Post-War British Politics". Congrats to Andy and Jens!
Posted by Sebastian Bauhoff at 4:44 PM
March 20, 2008
Yesterday I went to Professor Stanley Lieberson’s class, Issues in the Interpretation of Empirical Evidence. We discussed a paper by Stan and Glenn Fuguitt titled Correlation of Ratios or Difference Scores Having Common Terms. The basic argument of the paper is that although ratios and difference scores are often used as dependent variables in regression analysis, if some independent variables share a common term with those dependent variables, the estimated coefficients can be severely biased by the spurious correlation that the common term induces (whether it appears in the denominator or the numerator). For example, if the dependent variable has the form X/Z while the independent variables are something like Y/Z, Z, or Z/X, the estimated relationship between the dependent and independent variables can become statistically significant purely because of the shared term.
For some concrete examples: criminologists often use the crime rate (adjusted by city population size) as the dependent variable while also using city population size as an independent variable; organizational researchers are interested in the relationship between the relative size of an organization’s administration and the absolute size of the organization; and economists often regress GDP per capita on variables such as the population growth rate or even population size. According to Lieberson and Fuguitt’s research, all of these examples will produce spurious coefficients because the dependent and independent variables include common terms. In their paper, they trace this finding back to an 1897 paper by Karl Pearson, in which Pearson showed rigorously where the spurious correlation comes from and gave an approximate formula for computing correlations of ratios.
We were asked to run an experiment to demonstrate this spurious correlation. We generated three sets of random integers (X, Y, and Z) ranging from 1 to 99, computed the pairwise correlation matrix, and found no significant correlation between any pair of variables. But we did find a significant correlation between Y/X and X, and when we regressed Y/X on X, the coefficient was significant too. So through manipulations such as division or subtraction, we artificially created a significant correlation between two sets of random integers that were originally uncorrelated.
Why not try the following in Stata to see if the above claims are overstated or not?
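* Draw three independent sets of random integers from 1 to 99
clear
set obs 50
gen x=int(99*uniform()+1)
gen y=int(99*uniform()+1)
gen z=int(99*uniform()+1)
* The raw variables should be uncorrelated
pwcorr x y z, sig
* A ratio against its own denominator: Y/X versus X
gen ydx = y/x
pwcorr x ydx, sig
reg ydx x
* Two ratios sharing a denominator: X/Z versus Y/Z
gen xdz = x/z
gen ydz = y/z
pwcorr xdz ydz, sig
reg xdz ydz
* A term shared between one ratio's denominator and the other's numerator
gen zdy = z/y
pwcorr xdz zdy, sig
reg xdz zdy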
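For readers without Stata, a rough Python analogue of the same experiment might look like the sketch below. It is not a translation of the course exercise: the seed, the larger sample size (5,000 rather than 50, so that the pattern is stable from run to run), and the helper name pearson are assumptions made here, and it reports correlations only, without significance tests.

```python
import random

def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

random.seed(1)  # arbitrary seed, only for reproducibility
n = 5000        # assumption: larger than the 50 used in the Stata exercise

# Three independent sets of random integers from 1 to 99.
x = [random.randint(1, 99) for _ in range(n)]
y = [random.randint(1, 99) for _ in range(n)]
z = [random.randint(1, 99) for _ in range(n)]

# Ratios that share a term with another variable.
ydx = [b / a for a, b in zip(x, y)]
xdz = [a / c for a, c in zip(x, z)]
ydz = [b / c for b, c in zip(y, z)]

r_xy = pearson(x, y)            # independent draws: close to zero
r_ydx_x = pearson(ydx, x)       # X sits in the denominator of Y/X: negative
r_xdz_ydz = pearson(xdz, ydz)   # shared denominator Z: positive

print(f"corr(X, Y)     = {r_xy:.3f}")
print(f"corr(Y/X, X)   = {r_ydx_x:.3f}")
print(f"corr(X/Z, Y/Z) = {r_xdz_ydz:.3f}")
```

Under these assumptions, a back-of-the-envelope calculation puts the population correlation of X/Z and Y/Z at roughly 0.7 and that of Y/X and X at roughly -0.4, even though X, Y, and Z are independent by construction.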
Are you convinced by now? If not, please go read the source paper below (or just write back and say what is wrong with Lieberson and Fuguitt’s argument). If yes, the question becomes what we should do about the spurious correlation. Shall we just use the original forms of the variables? Shall we re-specify the Solow model? But what if our research interest is in the ratio or the difference? … …
Source:
Stanley Lieberson and Glenn Fuguitt, 1974. Correlation of Ratios or Difference Scores Having Common Terms, in Sociological Methodology (1973-1974), edited by Herbert Costner, San Francisco: Jossey-Bass Publishers.
Posted by Weihua An at 11:17 AM | Comments (4)
March 18, 2008
In a conversation with Kevin Quinn this week I was reminded of a fascinating lecture given at Google in 2006 by Luis von Ahn, an assistant professor in computer science at Carnegie Mellon. Von Ahn gives a very entertaining and thought-provoking talk on ingenious ways to apply human intelligence and judgment on a large scale to fairly small problems that computers still struggle with.
(Or watch video on Google video.)
Von Ahn devises games that produce data, the best-known example being the ESP Game, which Google acquired and developed as Google Image Labeler. In the game, you are paired with another (anonymous) player and shown an image. Each of you feverishly types in words describing the image (eg, "Spitzer", "politician", "scandal", "prostitution"); you get points and move to the next image when you and your partner agree on a label. The game is fun, even addictive, and of course Google gets a big, free payoff -- a set of validated keywords for each image.
I'm curious about how these approaches can be applied to coding problems in social science. A lot of recent interesting work has involved developing machine learning techniques to teach computers to label text, but there are clearly cases where language is just too subtle and complex to accurately extract meaning, and we need real people to read the text and make judgments. Mostly we hire RAs or do it ourselves; could we devise games instead?
Posted by Andy Eggers at 9:37 AM | Comments (1)
March 11, 2008
While the Democratic nomination contest drags on (and on and on...; Tom Hanks declared himself bored with the race last week), attention is turning to hypothetical general election matchups between Hillary Clinton or Barack Obama and John McCain. Mystery Pollster has a post up reporting on state-by-state hypothetical matchup numbers obtained from surveys of 600 registered voters in each state conducted by Survey USA. There is some debate about the quality of the data (Survey USA uses Interactive Voice Response to conduct its surveys, there is no likely voter screen, etc.). But we have what we have.
At this point, the results are primarily of interest to the extent that they speak to the "electability" question on the Democratic side; who is more likely to beat McCain? MP goes through the results state by state, classifying each state into Strong McCain, Lean McCain, Toss-up, etc. From this you can calculate the number of electoral votes in each category, which provides some information but isn't exactly what we're interested in.
This problem is a natural one for the application of some simple, naive Bayesian ideas. If we throw on some flat priors, make all sorts of unreasonably strong independence assumptions, and assume that the results were derived from simple random sampling, we can quickly get posterior distributions for the support for each candidate in each state and can calculate estimates of the probability of victory. From there, it is easy to calculate the posterior distribution of the number of electoral votes for each candidate and find posterior probabilities that Obama beats McCain, Clinton beats McCain, or the probability that Obama would receive more electoral votes than Clinton.
While I was sitting around at lunch yesterday, I ran a very quick analysis using the reported SurveyUSA marginals. Essentially, I took samples from 50 independent Dirichlet posteriors for both hypothetical matchups, assuming a flat prior and multinomial sampling density (to allow for undecideds); to avoid dealing with the posterior predictive distributions, I'm just going to assume that all registered voters will vote so I can just compare posterior proportions. When you run this, you obtain estimates (conditional on the data and, most importantly, the model) that the probability of an Obama victory over McCain is about 88% and the probability of a Clinton victory is about 72%. There is a roughly 70% posterior probability that Obama would win more electoral votes than Clinton.
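The procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the numbers in this post: the poll counts are invented, the function names are hypothetical, and a candidate is simply assumed to win by taking more than half of the electoral votes in whatever list of states is supplied.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one vector from a Dirichlet distribution via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def win_probability(polls, n_draws=2000, seed=0):
    """Posterior probability that the Democrat wins a majority of electoral votes.

    polls: list of (electoral_votes, dem_count, rep_count, undecided_count),
    one tuple per state, treated as independent multinomial samples.
    """
    rng = random.Random(seed)
    total_ev = sum(state[0] for state in polls)
    wins = 0
    for _ in range(n_draws):
        dem_ev = 0
        for ev, dem, rep, und in polls:
            # Flat Dirichlet(1, 1, 1) prior plus multinomial counts
            # gives a Dirichlet(dem + 1, rep + 1, und + 1) posterior.
            p_dem, p_rep, _ = sample_dirichlet([dem + 1, rep + 1, und + 1], rng)
            # Assumption from the post: all registered voters vote,
            # so the state goes to whoever has the larger proportion.
            if p_dem > p_rep:
                dem_ev += ev
        if dem_ev > total_ev / 2:
            wins += 1
    return wins / n_draws

# Invented three-state example with 600 respondents per state.
polls = [
    (20, 330, 240, 30),
    (15, 270, 300, 30),
    (10, 290, 285, 25),
]
p = win_probability(polls)
print(f"Posterior probability of a Democratic win: {p:.2f}")
```

Because every state is sampled independently, this sketch inherits the same unreasonably strong independence assumption mentioned above; running it once per matchup and comparing the two sets of draws gives the head-to-head comparison between the two Democrats.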
As I mentioned, this is an extremely naive Bayesian approach. There are a lot of ways that one could make the model better: adding additional sources of uncertainty, allowing for correlations between the states, using historical information to inform priors, and imposing a hierarchical structure to shrink outlying estimates toward the grand mean. One place to start would be by modeling the pairs of responses to the two hypothetical matchup questions. Any of these things, however, is going to be much easier to do in a Bayesian framework, since calculating posterior distributions of functions of the model parameters is extremely easy.
Posted by Mike Kellermann at 11:17 AM | Comments (4)
March 5, 2008
The dramatic increase in cases of autism in children over the past few years has been in the news again in recent days. Most notably, presumptive Republican presidential nominee John McCain said at a recent stop, "there’s strong evidence that indicates that it’s got to do with a preservative in vaccines." Which would be fine if such strong evidence existed; unfortunately, that is a mischaracterization of the current state of the literature to say the least. McCain has since backed away from his initial comments (see this article in yesterday's New York Times), but the debate prompted by his comments will undoubtedly continue.
By coincidence, the Robert Wood Johnson program at Harvard is sponsoring a talk tomorrow on this topic. Professor Peter Bearman (chair of the Statistics Department at Columbia) will be speaking on "Early Thoughts on the Autism Epidemic." Professor Bearman is currently leading a project on the social determinants of autism. The talk is in N262 on the second floor of the Knafel Building at CGIS from 11:00 to 12:30.
Posted by Mike Kellermann at 2:56 PM | Comments (2)
February 23, 2008
A study published in the New England Journal of Medicine last month showed that widely-prescribed antidepressants may not be as effective as the published research indicates. After reading about the study in the NYT, I recently read the article and was struck by how well the authors were able to document the somewhat elusive phenomenon of publication bias.
Researchers in most fields can document publication bias only by pointing out patterns in published results. A jump in the density of t-stats around 2 is one strong sign that null reports are not being published; an inverse relationship between average reported effect size and sample size in studies of the same phenomenon is another strong sign (because the only small studies that could be published are the ones with large estimated effects). These meta-analysis procedures are clever because they infer something about unpublished studies from what we see in published studies.
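A small simulation makes the second pattern concrete. The sketch below is an assumption-laden toy, not a method from the NEJM article: every study estimates the same true effect (0.1, chosen arbitrarily) from unit-variance noise, and a study counts as "published" only if its t-statistic exceeds 2. The function names are hypothetical.

```python
import random

def simulate_study(n, true_effect, rng):
    """Return (estimate, t-statistic) for one study with n observations."""
    sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    mean = sum(sample) / n
    var = sum((v - mean) ** 2 for v in sample) / (n - 1)
    return mean, mean / (var / n) ** 0.5

def published_effects(n, true_effect=0.1, n_studies=1000, seed=0):
    """Simulate many studies of size n and keep the estimates with t > 2."""
    rng = random.Random(seed)
    results = (simulate_study(n, true_effect, rng) for _ in range(n_studies))
    return [est for est, t in results if t > 2]

small = published_effects(n=20)    # small studies
large = published_effects(n=400)   # large studies
mean_small = sum(small) / len(small)
mean_large = sum(large) / len(large)

print(f"n=20:  {len(small)} published, mean effect {mean_small:.2f}")
print(f"n=400: {len(large)} published, mean effect {mean_large:.2f}")
```

In this toy setup the small studies that clear the significance filter report a much larger average effect than the large ones, even though the true effect is identical, which is the inverse relationship between effect size and sample size described above.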
As the NEJM article makes clear, publication bias is more directly observable in drug trials because we have very good information about unpublished trials. When a pharmaceutical company initiates clinical trials for a new drug, the studies are registered with the FDA; in order to get FDA approval to bring the drug to market, the company must submit the results of all of those trials (including the raw data) for FDA review. All trials conducted on a particular drug are therefore reviewed by the FDA, but a subset of those trials are published in medical journals.
The NEJM article uses this information to determine which antidepressant trials made it into the journals:
Among 74 FDA-registered studies, 31%, accounting for 3449 study participants, were not published. Whether and how the studies were published were associated with the study outcome. A total of 37 studies viewed by the FDA as having positive results were published; 1 study viewed as positive was not published. Studies viewed by the FDA as having negative or questionable results were, with 3 exceptions, either not published (22 studies) or published in a way that, in our opinion, conveyed a positive outcome (11 studies). According to the published literature, it appeared that 94% of the trials conducted were positive. By contrast, the FDA analysis showed that 51% were positive. Separate meta-analyses of the FDA and journal data sets showed that the increase in effect size ranged from 11 to 69% for individual drugs and was 32% overall.
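The percentages in the abstract can be reproduced from the study counts it reports. The short Python check below uses only those counts; the variable names are illustrative.

```python
# Study counts as reported in the abstract.
positive_published = 37
positive_unpublished = 1
negative_published_as_negative = 3    # the "3 exceptions"
negative_unpublished = 22
negative_published_as_positive = 11

total = (positive_published + positive_unpublished
         + negative_published_as_negative + negative_unpublished
         + negative_published_as_positive)

unpublished = positive_unpublished + negative_unpublished
published = total - unpublished
appear_positive = positive_published + negative_published_as_positive
fda_positive = positive_published + positive_unpublished

pct_unpublished = 100 * unpublished / total
pct_positive_in_journals = 100 * appear_positive / published
pct_positive_fda = 100 * fda_positive / total

print(f"total studies: {total}")
print(f"not published: {pct_unpublished:.0f}%")
print(f"positive according to the journals: {pct_positive_in_journals:.0f}%")
print(f"positive according to the FDA: {pct_positive_fda:.0f}%")
```

The 94% figure is therefore 48 of the 51 published trials, while the 51% figure is 38 of all 74 registered trials.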
One complaint -- I thought it was too bad that the authors did not determine whether the 22 studies that were "negative or questionable" and went unpublished were never submitted ("the file drawer problem") or were rejected by the journals. But otherwise very thorough and interesting.
Posted by Andy Eggers at 2:05 AM
February 22, 2008
A major item of interest in applied health economics is to understand the impact of health shocks on household income, investments and consumption. This relation is particularly important in developing countries that don’t have programs like universal health insurance or social insurance like Medicaid. Alas it’s also a major challenge to establish causal effects and mechanisms through which the shocks might operate. A main culprit is endogeneity, since health affects wealth and vice versa. As a result there is a huge and truly inter-disciplinary literature on the topic, much of it with suspicious identification strategies.
The main struggle is to find a plausibly exogenous exposure to health shocks that have real-life relevance. A new paper by Manoj Mohanan takes this challenge seriously and looks at the effect of health shocks from bus accidents on household consumption, and examines what mechanisms households rely on to smooth consumption. (Full disclosure: Manoj is a classmate of mine, and I really like his work!)
To address the endogeneity problem, the paper focuses on people who have been in bus accidents as recorded by the state-run bus company in Karnataka, India. Clearly, finding a good control group is critical: people who travel on public buses may be different from those who don’t. For starters, they actually took the risk of getting on a bus – if you have ever been on the road in a developing country you’ll know what this means. Manoj’s approach is to select unexposed individuals among travelers on the same bus route, after matching on age, sex and geographic area of residence. Hence, conditional on these factors, the bus accident can be treated as exogenous.
He then compares the two groups on various dimensions. He finds that households reduce educational and festival spending by a large amount, but appear to be able to smooth food and housing consumption. He is unable to find effects on assets or labor supply. The principal coping mechanism is debt accumulation. Overall this suggests that not all is well: debt traps aside, reducing investments in education could be very costly in the long run (on this point see also Chetty and Looney, 2006).
Posted by Sebastian Bauhoff at 10:00 AM | Comments (2)
February 2, 2008
This year's Spring Conference of the Harvard Program on Survey Research is on "New Technologies and Survey Research." It will be held on May 9, 2008, 9:00 am to 5:00 pm at IQSS, and is open to the public.
See here for details.
Posted by Sebastian Bauhoff at 9:54 AM
February 1, 2008
Abstracts are now being accepted for the 2008 useR! conference in Dortmund, Germany. This conference is designed to bring R users and developers together to trade ideas and find out what is new in the sprawling world of R. Several of us went to the Vienna conference a few years ago, and found it very useful. Previous editions have had a good mix of academic and private sector participants, and I learned more than I have at some of the more traditional academic conferences. The announcement from the useR webpage is below; the website is at http://www.statistik.uni-dortmund.de/useR-2008/
useR! 2008, the R user conference, takes place at the Fakultät Statistik, Technische Universität Dortmund, Germany from 2008-08-12 to 2008-08-14. Pre-conference tutorials will take place on August 11. The conference is organized by the Fakultät Statistik, Technische Universität Dortmund and the Austrian Association for Statistical Computing (AASC). It is funded by the R Foundation for Statistical Computing.
Following the successful useR! 2004, useR! 2006, and useR! 2007 conferences, the conference is focused on
- R as the `lingua franca' of data analysis and statistical computing,
- providing a platform for R users to discuss and exchange ideas how R can be used to do statistical computations, data analysis, visualization and exciting applications in various fields,
- giving an overview of the new features of the rapidly evolving R project.
As for the predecessor conference, the program consists of two parts:
- invited lectures discussing new R developments and exciting applications of R,
- user-contributed presentations reflecting the wide range of fields in which R is used to analyze data.
A major goal of the useR! conference is to bring users from various fields together and provide a platform for discussion and exchange of ideas: both in the formal framework of presentations as well as in the informal part of the conference in Dortmund's famous beer pubs and restaurants.
Prior to the conference, on 2008-08-11, there are tutorials offered at the conference site. Each tutorial has a length of 3 hours and takes place either in the morning or afternoon.
Call for Papers
We invite all R users to submit abstracts presenting innovations or exciting applications of R on topics such as:
- Applied Statistics & Biostatistics
- Bayesian Statistics
- Bioinformatics
- Chemometrics and Computational Physics
- Data Mining
- Econometrics & Finance
- Environmetrics & Ecological Modeling
- High Performance Computing
- Machine Learning
- Marketing & Business Analytics
- Psychometrics
- Robust Statistics
- Sensometrics
- Spatial Statistics
- Statistics in the Social and Political Sciences
- Teaching
- Visualization & Graphics
and many more.
We recommend a length of about one page in pdf format. The program committee decided on the presentation format. There is no proceedings volume, but the abstracts are available in an online collection linked from the conference program and in a single pdf file.
Deadline for submission of abstracts: 2008-03-31.
Posted by Mike Kellermann at 11:55 AM
January 4, 2008
James Fowler sent the following message to the Polmeth list, regarding a conference that we will apparently be hosting in June that may be of interest:
The study of networks has exploded over the last decade, both in the social and hard sciences. From sociology to biology, there has been a paradigm shift from a focus on the units of the system to the relationships among those units. Despite a tradition incorporating network ideas dating back at least 70 years, political science has been largely left out of this recent creative surge. This has begun to change, as witnessed, for example, by an exponential increase in network-related research presented at the major disciplinary conferences.
We therefore announce an open call for paper proposals for presentation at a conference on "Networks in Political Science" (NIPS), aimed at _all_ of the subdisciplines of political science. NIPS is supported by the National Science Foundation, and sponsored by the Program on Networked Governance at Harvard University.
The conference will take place June 13-14. Preceding the conference will be a series of workshops introducing existing substantive areas of research, statistical methods (and software packages) for dealing with the distinctive dependencies of network data, and network visualization. There will be a $50 conference fee. Limited funding will be available to defray the costs of attendance for doctoral students and recent (post 2005) PhDs. Funding may be available for graduate students not presenting papers, but preference will be given to students using network analysis in their dissertations. Women and minorities are especially encouraged to apply.
The deadline for submitting a paper proposal is March 1, 2008. Proposals should include a title and a one-paragraph abstract. Graduate students and recent Ph.D.'s applying for funding should also include their CV, a letter of support from their advisor, and a brief statement about their intended use of network analysis. Send them to networked_governance@ksg.harvard.edu. The final program will be available at www.ksg.harvard.edu/netgov.
Posted by Mike Kellermann at 5:18 PM | Comments (3)
December 11, 2007
A recent message to the Polmeth mailing list announced that a research group at the University of Pittsburgh is looking for beta testers for some new coding reliability software that they have developed:
The Coding Analysis Toolkit (or “CAT”) was developed in the summer of 2007. The system consists of a web-based suite of tools custom built from the ground-up to facilitate efficient and effective analysis of text datasets that have been coded using the commercial-off-the-shelf package ATLAS.ti (http://www.atlasti.com). We have recently posted a narrated slide show about CAT and a tutorial online. The Coding Analysis Toolkit was designed to use keystrokes and automation to clarify and speed-up the validation or consensus adjudication process. Special attention was paid during the design process to the need to eliminate the role of the computer mouse, thereby streamlining the physical and mental tasks in the coding analysis process. We anticipate that CAT will open new avenues for researchers interested in measuring and accurately reporting coder validity and reliability, as well as for those practicing consensus-based adjudication. The availability of CAT can improve the practice of qualitative data analysis at the University of Pittsburgh and beyond.
More information is available at this website: http://www.qdap.pitt.edu/cat.htm. This is far from my area of expertise, but it looks like it might be useful for some projects...
Posted by Mike Kellermann at 6:00 PM
December 5, 2007
The infosthetics blog offers its "shopping guide for the data-addicted." I was intrigued by the chumby and nabaztag, two devices that offer the charms of the internet divorced from the keyboard/mouse/monitor setup. For the urban planner on your list, don't miss the fly swatter whose mesh is a street map of Milan. For the social science stats crowd, though, the best gift on the list has to be the Death and Taxes poster, depicting the US federal discretionary budget in remarkable detail and clarity. Click on the image below to get a close-up look at the poster.

Posted by Andy Eggers at 8:52 AM
November 30, 2007
IQSS is sponsoring a conference next Friday on the emerging area of computational social science. Below is the announcement:
The Conference on Computational Social Science (part of the Eric M. Mindich Conference series)
Friday, December 7, 2007
Center for Government and International Studies South, Tsai Auditorium (Room S010)
1730 Cambridge Street, Cambridge, MA
The development of enormous computational power and the capacity to collect enormous amounts of data has proven transformational in a number of scientific fields. The emergence of a computational social science has been slower than in the sciences. However, the combination of the still exponentially increasing computational power with a massive increase in the capturing of data about human behavior makes the emergence of a field of computational social science desirable, but not inevitable. The creation of a field of computational social science poses enormous challenges, but offers enormous promise to achieve the public good. The hope is that we can produce an understanding of the global network on which many global problems exist: SARS and infectious disease, global warming, strife due to cultural collisions, and the livability of our cities. That is, can sensing our society lead to a sensible society?
To solve these problems will require trading off privacy versus convenience, individual freedom versus societal benefit, and our sense of individuality versus group identity. How will we decide what the sensible society will look like? This conference brings together the wide array of individuals who are working in this emerging research area to discuss how we might address these global challenges, and to evaluate the potential emergence of a field of "computational social science."
Registration is required; more information is available here.
Posted by Mike Kellermann at 9:42 AM | Comments (2)
November 15, 2007
From Andrew Gelman, I saw a link to an interesting "art exhibit" that's actually all about statistics and language. In some ways it reminded me of this other art exhibit that's actually all about statistics -- in this case, the meaning of some of the very large numbers we read about all the time, but find difficult to grasp on an intuitive level.
Both are worth checking out online. And if you live somewhere that you can visit either, lucky you!
Posted by Amy Perfors at 9:47 AM | Comments (1)
October 31, 2007
Amy Perfors
There's an interesting article at Salon today about racial perception. As is normally the case for scientific articles reported in the mainstream media, I have mixed feelings about it.
1) First, a pet peeve: just because something can be localized in the brain using fMRI or similar techniques does not mean it's innate. This drives me craaazy. Everything that we conceptualize or do is represented in the brain somehow (unless you're a dualist, and that has its own major logical flaws). For instance, trained musicians devote more of their auditory processing regions to listening to piano music, and have a larger auditory cortex and larger areas devoted toward motor control of the fingers used to play their instrument. [cite]. This is (naturally, reasonably) not interpreted as meaning that playing the violin is innate, but that the brain can "tune itself" as it learns. [These differences are linked to amount of musical training, and are larger the younger the training began, which all supports such an interpretation]. The point is, localization in the brain != innateness. Aarrgh.
2) The article talks about what agent-based modeling has shown us, which is interesting:
Using this technique, University of Michigan political scientist Robert Axelrod and his colleague Ross Hammond of the Brookings Institution in Washington, D.C., have studied how ethnocentric behavior may have evolved even in the absence of any initial bias or prejudice. To make the model as simple as possible, they made each agent one of four possible colors. None of the colors was given any positive or negative ranking with respect to the other colors; in the beginning, all colors were created equal. The agents were then provided with instructions (simple algorithms) as to possible ways to respond when encountering another agent. One algorithm specified whether or not the agent cooperated when meeting someone of its own color. The other algorithm specified whether or not the agent cooperated with agents of a different color.
The scientists defined an ethnocentric strategy as one in which an agent cooperated only with other agents of its own color, and not with agents of other colors. The other strategies were to cooperate with everyone, cooperate with no one and cooperate only with agents of a different color. Since only one of the four possible strategies is ethnocentric and all were equally likely, random interactions would result in a 25 percent rate of ethnocentric behavior. Yet their studies consistently demonstrated that greater than three-fourths of the agents eventually adopted an ethnocentric strategy. In short, although the agents weren't programmed to have any initial bias for or against any color, they gradually evolved an ethnocentric preference for one's own color at the expense of those of another color.
Axelrod and Hammond don't claim that their studies duplicate the real-world complexities of prejudice and discrimination. But it is hard to ignore that an initially meaningless trait morphed into a trigger for group bias. Contrary to how most of us see bigotry and prejudice as arising out of faulty education and early-childhood indoctrination, Axelrod's model doesn't begin with preconceived notions about the relative values of different colors, nor is it associated with any underlying negative emotional state such as envy, frustration or animosity. Detection of a difference, no matter how innocent, is enough to result in ethnocentric strategies.
As I understand it, the general reason these experiments work the way they do is that the other strategies do worse given the dynamics of the game (single-interaction Prisoner's Dilemma): (a) cooperating with everyone leaves one open to being "suckered" by more people; (b) cooperating with nobody leaves one open to being hurt disproportionately by never getting the benefits of cooperation; and (c) cooperating with different colors is less likely to lead to a stable state.
Why is this last observation -- the critical one -- true? Let's say we have a red, orange, and yellow agent sitting next to each other, and all of them decide to cooperate with a different color. This is good, and leads to an increased probability of all of them being able to reproduce, and the next generation has two red, two yellow, and two orange agents. Now the problem is apparent: each of the agents is now next to an agent (i.e., the other one of its own color) that it is not going to cooperate with, which will hurt its chances of being able to survive and reproduce. By contrast, subsequent generations of agents that favor their own color won't have this problem. And in fact, if you remove "local reproduction" -- if an agent's children aren't likely to end up next to it -- then you don't get the rise of ethnocentrism... but you don't get much cooperation, either. (Again, this is sensible: the key is for agents to be able to essentially adapt to local conditions in such a way that they can rely on the other agents close to them, and they can't do that if reproduction isn't local). I would imagine that if one's cooperation strategy didn't tend to resemble the cooperation strategy of one's parents, you wouldn't see ethnocentrism (or much cooperation) either.
3) One thing the article didn't talk about, but I think is very important, is how much racial perception may have to do with our strategies of categorization in general. There's a rich literature studying categorization, and one of the basic findings is of boundary sharpening and within-category blurring. (Rob Goldstone has been doing lots of interesting work in this area, for instance). Boundary sharpening refers to the tendency, once you've categorized X and Y as different things, to exaggerate their differences: if the categories containing X and Y are defined by differences in size, you would perceive the size difference between X and Y to be greater than it actually is. Within-category blurring refers to the opposite effect: the tendency to minimize the differences of objects within the same category -- so you might see two X's as being closer in size than they really are. This is a sensible strategy, since the more you do it, the better you'll be able to correctly categorize the boundary cases. However, it results in something that looks very much like stereotyping.
Research along these lines is just beginning, and it's too early to go from this observation to conclude that part of the reason for stereotyping is that it emerges from the way we categorize things, but I think it's a possibility. (There also might be an interaction with the cognitive capacity of the learning agent, or its preference for a "simpler" explanation -- the more the agent can't remember subtle distinctions, and the more the agent favors an underlying categorization with few groups or few subtleties between or within groups, the more these effects occur).
All of which doesn't mean, of course, that stereotyping or different in-group/out-group responses are justified or rational in today's situations and contexts. But figuring out why we think this way is a good way to start to understand how not to when we need to.
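For readers who want to poke at the agent-based model from point 2, here is a minimal sketch of the dynamics as I described them: four colors, two cooperation rules per agent, interaction with lattice neighbors, and local reproduction. This is not Axelrod and Hammond's code. The lattice size, the cost and benefit of helping, and the mutation, death, and immigration rates are all illustrative assumptions of mine, and the function names are made up for this example.

```python
import random
from collections import Counter

# All numeric values below are illustrative assumptions, not from the article.
COLORS = 4                   # number of possible agent colors
SIZE = 20                    # lattice is SIZE x SIZE with wrap-around edges
BASE_PTR = 0.12              # baseline chance of reproducing each period
COST, BENEFIT = 0.01, 0.03   # cost of giving help, benefit of receiving it
MUTATION, DEATH = 0.005, 0.10

def classify(coop_same, coop_diff):
    """Name the strategy implied by the two cooperation rules."""
    if coop_same and not coop_diff:
        return "ethnocentric"
    if coop_same and coop_diff:
        return "cooperate with everyone"
    if not coop_same and not coop_diff:
        return "cooperate with no one"
    return "cooperate only with other colors"

def random_agent(rng):
    # (color, cooperate with same color?, cooperate with different color?)
    return (rng.randrange(COLORS), rng.random() < 0.5, rng.random() < 0.5)

def neighbors(cell):
    x, y = cell
    return [((x + 1) % SIZE, y), ((x - 1) % SIZE, y),
            (x, (y + 1) % SIZE), (x, (y - 1) % SIZE)]

def mutate(agent, rng):
    # Children resemble their parent, up to a small chance of mutation per trait.
    color, same, diff = agent
    if rng.random() < MUTATION:
        color = rng.randrange(COLORS)
    if rng.random() < MUTATION:
        same = not same
    if rng.random() < MUTATION:
        diff = not diff
    return (color, same, diff)

def step(grid, rng):
    # 1. Immigration: one random newcomer lands on a random empty cell.
    empty = [(x, y) for x in range(SIZE) for y in range(SIZE) if (x, y) not in grid]
    if empty:
        grid[rng.choice(empty)] = random_agent(rng)
    # 2. Interaction: each agent decides whether to help each neighbor.
    ptr = {cell: BASE_PTR for cell in grid}
    for cell, (color, same, diff) in grid.items():
        for other in neighbors(cell):
            if other in grid:
                helps = same if grid[other][0] == color else diff
                if helps:
                    ptr[cell] -= COST
                    ptr[other] += BENEFIT
    # 3. Local reproduction: a child is placed on an adjacent empty cell.
    cells = list(grid)
    rng.shuffle(cells)
    for cell in cells:
        if rng.random() < ptr[cell]:
            free = [n for n in neighbors(cell) if n not in grid]
            if free:
                grid[rng.choice(free)] = mutate(grid[cell], rng)
    # 4. Death: each agent dies with a fixed probability.
    for cell in list(grid):
        if rng.random() < DEATH:
            del grid[cell]

def simulate(periods=300, seed=0):
    """Run the model and count how many surviving agents use each strategy."""
    rng = random.Random(seed)
    grid = {}
    for _ in range(periods):
        step(grid, rng)
    return Counter(classify(same, diff) for _, same, diff in grid.values())
```

Calling `simulate()` returns a tally of strategies among the surviving agents. To try the "no local reproduction" variant discussed above, one could change step 3 so that children land on a random empty cell anywhere on the lattice rather than next to the parent.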
[*] Axelrod and Hammond's paper can be found here.
Posted by Amy Perfors at 2:32 PM
October 30, 2007
The Clay Mathematics Institute and the Harvard Mathematics Department are sponsoring a lecture by Terry Speed from the Department of Statistics at Berkeley on "Technology-driven statistics," with a focus on the challenges posed to statistical theory and practice by the massive amounts of data that are generated by modern scientific instruments (microarrays, mass spectrometers, etc.). These issues have not yet been as salient in the social sciences, but they are clearly on the horizon. The talk is at 7PM tonight (Oct. 30) in Science Center B at Harvard. The abstract for the talk is after the jump:
Technology-driven Statistics
Terry Speed, UC Berkeley and WEHI in Melbourne, Australia
Tuesday, October 30, 2007, at 7:00 PM
Harvard University Science Center -- Hall B
Forty years ago, biologists collected data in their notebooks. If they needed help from a statistician in analyzing and interpreting it, they would pass over a piece of paper with numbers on it. The theory on which statistical analysis was built a couple of decades earlier seemed entirely adequate for the task. When computers became widely available, analyses became easier and a little different, with the term "computer intensive" entering the lexicon. Now, in contemporary biology and many other areas, new technologies generate data whose quantity and complexity stretches both our hardware and our theory. Genome sequencing, genechips, mass spectrometers and a host of other technologies are now pushing statistics very hard, especially its theory. Terry Speed will talk about this revolution in data availability, and the revolution we need in the way we theorize about it.
Terry Speed splits his time between the Department of Statistics at the University of California, Berkeley and the Walter & Eliza Hall Institute of Medical Research (WEHI) in Melbourne, Australia. Originally trained in mathematics and statistics, he has had a life-long interest in genetics. After teaching mathematics and statistics in universities in Australia and the United Kingdom, and a spell in Australia's Commonwealth Scientific and Industrial Research Organization, he went to Berkeley 20 years ago. Since that time, his research and teaching interests have concerned the application of statistics to genetics and molecular biology. Within that subfield, eventually to be named bioinformatics, his interests are broad, including biomolecular sequence analysis, the mapping of genes in experimental animals and humans, and functional genomics. He has been particularly involved in the low level analysis of microarray data. Ten years ago he took the WEHI job, and now spends half of his time there, half in Berkeley, and the remaining half in the air somewhere in between.
Posted by Mike Kellermann at 12:08 AM
October 29, 2007
Andy Eggers and I are currently working on a project on UK elections. We have collected a new dataset that covers detailed information on races for the House of Commons between 1950 and 1970; seven general elections overall. We have spent some time thinking about new ways to visualize electoral data and Andy has blogged about this here and here. Today, I'd like to present a new set of plots that we came up with to summarize the closeness of constituency races over time. This is important for our project because we exploit close district races as a source of identification.
Conventional wisdom holds that in Britain, about one-quarter of all seats are 'marginal', i.e., decided by majorities of less than 10 percentage points. To visualize this fact Andy and I came up with the following plot. Constituencies are on the x axis and the elections are on the y axis. Colors indicate the closeness of the district race (i.e., vote majority / vote sum) categorized into different bins as indicated in the colorkey on top. Color scales are from Colorbrewer. We have ranked the constituencies from close to safe from left to right. Please take a look:

The same plot is available as a pdf here. The conventional wisdom seems to hold. About 30 percent of the races are close. Also some elections are closer than others.
A long format of the plot is available here. It makes it possible to identify individual districts, but requires some scrolling. We are considering developing an interactive version using javascript so that additional info pops up as one mouses over the plot. Notice that both plots exclude the 50 or so districts that changed names as a result of the 1951 redistricting wave.
Finally, Andy and I care about districts that swing between the two major parties. To visualize this we have produced similar plots where the color now indicates the vote share margins as seen by the Conservative party: ((Conservative vote - Labour vote)/vote sum). So negative values indicate a Labour victory and positive values a victory of the Conservative party. We only look at districts where Labour or the Conservative party took first and second place. Here it is:

The partisan swings from election to election are really clear. Finally, the long format is here. The latter plot makes it easy to identify the party strongholds during this time period. Comments and suggestions are highly welcome. We wonder whether anybody has done such plots before or whether we can legitimately coin them as Eggmueller plots (lol).
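For anyone who wants to reproduce the two quantities behind the colors, here is a minimal sketch of how they could be computed from raw vote counts. It assumes "vote majority" means the winner's lead over the runner-up; the function names and the vote counts are made up for illustration and are not taken from our dataset.

```python
def closeness(votes):
    """Winner's lead over the runner-up as a share of all votes cast."""
    ranked = sorted(votes.values(), reverse=True)
    return (ranked[0] - ranked[1]) / sum(ranked)

def conservative_margin(votes):
    """(Conservative vote - Labour vote) / vote sum; negative means Labour led."""
    return (votes["Conservative"] - votes["Labour"]) / sum(votes.values())

def is_marginal(votes, threshold=0.10):
    """A seat counts as marginal if the majority is under 10 percentage points."""
    return closeness(votes) < threshold

# Hypothetical constituency results (vote counts invented for this example).
close_seat = {"Conservative": 21000, "Labour": 19000, "Liberal": 10000}
safe_labour_seat = {"Conservative": 20000, "Labour": 30000}

print(closeness(close_seat), is_marginal(close_seat))
print(conservative_margin(safe_labour_seat), is_marginal(safe_labour_seat))
```

Binning `closeness` into a handful of intervals and mapping each bin to a Colorbrewer shade gives the cell color for one constituency in one election.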
Posted by Jens Hainmueller at 8:13 PM
October 19, 2007
The Red Sox beat the Indians last night in Game 5 of the ALCS, sending the series back to Fenway and enabling the majority of us at Harvard who are (at least fair-weather) Sox fans to, as Kevin Youkilis said last night, come down off the bridge for a few more days. Why do I bring this up? Well, after Boston's loss in Game 4, a commenter on this blog asked the following question:
In the disastrous inning of the Red Sox game tonight, the announcer (maybe Tim McCarver?) said “One would think that a lead-off walk would lead to more runs than a lead-off home-run, but it’s not true. We’ve researched it and this year a lead-off home-run has led to more multi-run innings than have lead-off walks.”
I must not be "one", b/c I think a lead-off home-run is much more likely to lead to multiple-run innings, b/c after the home-run, you have a run and need only 1 more to have multiple, and the actions after the first batter are mostly independent of the results of the first batter. So, I think he has it totally backwards. I was a fair stats student, so I need confirmation. He was backwards, right?
The short answer is that it was Tim McCarver, and as an empirical matter he was wrong to be surprised. I don't have access to full inning-by-inning statistics over a long period of time, but the most convincing analysis I found in a quick search (here) suggests that between 1974 and 2002, the probability of a multi-run inning conditional on a leadoff walk is .242 and the probability of a multi-run inning after a leadoff home run is .276.
The blogosphere has had a lot of fun at McCarver's expense (not that it takes much to provoke such a reaction, granted): It's Math!, Zero > One, Tim McCarver Does Research, etc. His observation, though, is a good example of Bayesian updating at work: while I doubt that most baseball observers "would think that a lead-off walk would lead to more runs than a lead-off home-run," it is very clear that Tim McCarver thought that at some point. As evidence, in a 2006 game he made the following comment:
"There is nothing that opens up big innings any more than a leadoff walk. Leadoff home runs don't do it. Leadoff singles, maybe. But a leadoff walk. It changes the mindset of a pitcher. Since he walked the first hitter, now all of a sudden he wants to find the fatter part of the plate with the succeeding hitters. And that could make for a big inning."
In 2004, he said during the Yankees-Red Sox ALCS that "a walk is as good as a home run." And back in 2002, he made a similar comment during the playoffs; in fact, it was that comment that prompted the analysis that I linked to above! Clearly, he had a strong prior belief (from where, I don't know) that leadoff walks somehow get in the pitcher's head and produce more big innings. Now that he's been confronted by data, those beliefs are updating, but since his posterior has shifted so much from his prior it's not surprising that he thinks this is some great discovery. In a couple of years, he'll probably think that he always knew a leadoff home run was better.
As for the intuition, it looks like the commenter is also correct. Using the data cited above, the probability of scoring zero runs in an inning is approx. .723, while the probability of scoring no additional runs after a leadoff homer is approx. .724; the rest of the distribution is similar as well.
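The commenter's argument is easy to check with the numbers quoted in this post. The short sketch below only restates that arithmetic; the variable names are mine, and the four probabilities are the approximate figures cited above, not new data.

```python
# Approximate probabilities quoted in the post (1974-2002 data).
p_multi_after_walk = 0.242       # multi-run inning given a leadoff walk
p_multi_after_homer = 0.276      # multi-run inning given a leadoff home run
p_zero_runs = 0.723              # no runs scored in an inning
p_zero_more_after_homer = 0.724  # no additional runs after a leadoff homer

# After a leadoff homer one run is already in, so the inning is multi-run
# exactly when at least one more run scores.
implied_multi_after_homer = 1 - p_zero_more_after_homer

# If the rest of the inning were independent of the first batter, this should
# be close to the chance of scoring at least once in a fresh inning.
p_at_least_one_run = 1 - p_zero_runs

print(round(implied_multi_after_homer, 3))  # compare with p_multi_after_homer
print(round(p_at_least_one_run, 3))
print(p_multi_after_homer > p_multi_after_walk)
```

The implied figure matches the quoted .276, and it is nearly identical to the chance of scoring at least one run in any inning, which is what the "mostly independent" intuition predicts.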
Posted by Mike Kellermann at 1:02 PM | Comments (4)
October 18, 2007
Perl has the Perl quiz, Python has the Python challenges, Ruby has the Ruby quiz, but what about our good old friend R?? Does such a thing exist anywhere? Would be a nice idea I think...
Posted by Jens Hainmueller at 8:52 PM | Comments (3)
October 17, 2007
Continuing on the topic of self-reported health data, and how to correct for reporting (and other) biases, here is an interesting paper on height and weight in the US. Those two measures have received a lot of interest in the past years, not least as components of the body-mass index (BMI), which is used to estimate the prevalence of obesity. BMI itself is not a great measure (more on that another day) but at least it’s relatively easy to collect via telephone and in-person interviews. Of course some people make mistakes while reporting their own vital measures, and some might do so systematically: a height of 6 foot sounds like a good height to have even to me, and I tend to think in the metric system!
Anyway, the paper by Ezzati et al. examines the issue of systematic misreporting. They note that existing smaller-scale studies on this issue might in fact under-estimate the bias because of their design. People might limit their misreporting if they are measured before or after reporting their vitals, which is a challenge for validation studies. And participation might differ systematically between the interview modes of validation studies and those of general health surveys (e.g. in-person versus telephone interviews), so that the studies are not directly comparable to population-level surveys.
The idea of the paper is to employ two nationally representative surveys to compare three different kinds of measurement for height and weight, by age group and gender. The first survey is the National Health and Nutrition Examination Survey (NHANES), which collects self-reported information through in-person interviews, and also measurements through medical examination. The second survey is the Behavioral Risk Factor Surveillance System (BRFSS), an annual cross-sectional telephone survey that is state-level representative and features widely in policy discussions.
The comparisons between the surveys might confirm your priors on misreporting. On average, women under-report their weight and men under 65 tend to over-report their height. The authors find that state-level obesity measures based on the BRFSS are too low – they re-calculate that a number of states in fact had obesity prevalences above 30% in 2000. Of course this is not a perfectly clean assessment, because the NHANES participants might have anticipated the clinical examination a few weeks after the in-person interview. But at the least this study is a good reminder that people do systematically misreport for some reason, and that analysts should treat self-reported BMI carefully.
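To see why small reporting errors matter, recall that BMI is weight in kilograms divided by height in meters squared, with 30 as the usual obesity cutoff. The sketch below uses one made-up person (the measurements are hypothetical, not from the paper) who shaves 2 kg off their weight and adds 2 cm to their height when asked.

```python
OBESITY_CUTOFF = 30.0  # conventional BMI threshold for obesity

def bmi(weight_kg, height_m):
    """Body-mass index: weight in kilograms over height in meters squared."""
    return weight_kg / height_m ** 2

# Hypothetical respondent: values are invented purely for illustration.
measured_bmi = bmi(88.0, 1.70)   # as measured in a clinical examination
reported_bmi = bmi(86.0, 1.72)   # as self-reported in an interview

print(round(measured_bmi, 1), measured_bmi >= OBESITY_CUTOFF)
print(round(reported_bmi, 1), reported_bmi >= OBESITY_CUTOFF)
```

Because height enters squared, a modest over-report of height combined with a modest under-report of weight is enough to move this person from above the cutoff to below it, which is the mechanism by which self-reports can understate obesity prevalence.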
Posted by Sebastian Bauhoff at 10:23 PM | Comments (1)
October 10, 2007
Today's applied stats talk by Fernanda Viegas and Martin Wattenberg covered a wide array of interesting data visualization tools that they and their colleagues have been developing over at IBM Research. One of the early efforts that they described is an applet called History Flow, which allows users to visualize the evolution of a text document that was edited by a number of people, such as Wikipedia entries or computer source code. You can track which authors contributed over time, how long certain parts of the text have remained in place, and how text moves from one part of the document to another. To give you a flavor of what is possible, here is a visualization of the history of the Wikipedia page for Gary King (who is the only blog contributor who has one at the moment):

This shows how the page became longer over time and that it was primarily written by one author. The applet also allows you to connect textual passages from earlier versions to their authors. We noticed this one from Gary's entry:

"Ratherclumsy"'s contribution to the article only survived for 24 minutes, and was deleted by another user with best wishes for becoming "un-screwed". All kidding aside, this is a really interesting tool for text-based projects. Leaving aside the possibility for analysis, this would be useful for people working on coding projects. I can think of more than one R function that I've worked on where it would be nice to know who wrote a particular section of code....
Posted by Mike Kellermann at 5:52 PM | Comments (2)
October 8, 2007
Dear Applied Statistics Community,
Please join us for this week's installment of the Applied Statistics workshop, where Fernanda Viegas and Martin Wattenberg will be presenting their talk entitled "From Wikipedia to Visualization and Back". The authors provided the following abstract for their talk:
This talk will be a tour of our recent visualization work, starting with a case study of how a new data visualization technique uncovered dramatic dynamics in Wikipedia. The technique sheds light on the mix of dedication, vandalism, and obsession that underlies the online encyclopedia. We discuss the reaction of the Wikipedia community to this visualization, and how it led to a recent ambitious project to make data visualization technology available to everyone. This project, Many Eyes, is a web site where people may upload their own data, create interactive visualizations, and carry on conversations. The goal is to foster a social style of data analysis in which visualizations serve not only as a discovery tool for individuals but also as a means to spur discussion and collaboration.
Martin and Fernanda have also provided the following set of links as background for the presentation:
http://alumni.media.mit.edu/~fviegas/papers/history_flow.pdf
http://www.research.ibm.com/visual/papers/viegasinfovis07.pdf
And to a website based upon recent work in data visualization
Link to Many Eyes site:
www.many-eyes.com
As always, the workshop meets at 12 noon on Wednesday, in room N-354 CGIS-Knafel. A light lunch will be provided.
Posted by Justin Grimmer at 12:02 PM
October 4, 2007
Amy Perfors
On Tuesday I went to a talk by Terrence Fine from Cornell University. It was one of those talks tha