Authors' Committee

Chair:

Andy Eggers (Gov)

Members:

Weihua An (Soc)
Kevin Bartz (Stats)
Sebastian Bauhoff (HealthPol)
John Graves (HealthPol)
Justin Grimmer (Gov)
Jens Hainmueller (Gov)
Mike Kellermann (Gov)
Ellie Powell (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Kevin Quinn, Jamie Robins, Don Rubin, Chris Winship

9 May 2008

Adventures in Identification III: The Indiana Jones of Economics

There is a fabulous three-part series on further adventures in identification on the Freakonomics blog here, here, and here. The story features Kennedy School Professor Robert Jensen in his five-year-long quest to achieve rigorous identification of Giffen effects. After finding correlational evidence for Giffen goods in survey data, he and his co-author followed up by running an experiment in China and, guess what, they do find evidence of Giffen behavior. Impressive empirics and a funny read, enjoy!

Posted by Jens Hainmueller at 2:16 PM | Comments (0)

8 May 2008

Some Random Notes about the International Network Meeting

Last week we had an International Meeting on Methodology for Empirical Research on Social Interactions, Social Networks, and Health here at IQSS, organized by Professor Charles Manski and Professor Nicholas Christakis. Some people told me that the second day of the meeting was much more “violent” than the first, and based on what I saw, I believe it. At least three cliques of speakers formed spontaneously along disciplinary lines: statisticians, economists, and sociologists and political scientists. There were even sub-cliques and backfires! Fortunately, nobody was severely wounded. In any case, it was a great intellectual exchange between disciplines. Below are some brief notes I took on the second day of the meeting, particularly during the last 20 minutes, when speakers talked about the future directions of network analysis in the social sciences. I apologize that I forgot to jot down exactly who said what, and that I also squeezed some of my personal thoughts into the notes. I take full responsibility for all errors in the notes.

1. Need to combine game theory with social network analysis, particularly evolutionary game theory (and transaction costs theory).

2. Need to further develop social network analysis based on (random) graph theory, topology and random matrix theory.

3. Network studies tend to focus on network structure and topology as dependent variables, while the social sciences are more concerned with how network positions and features affect node-level outcomes. To put it simply, network studies tend to start from the nodes and end at the network, while the social sciences take more of a top-down approach.

4. In either case, however, it is crucial to understand the data/tie-generating mechanism. In particular, note that the formation of ties can go two ways: influence and selection. For example, two smokers can be friends either because one was influenced by a smoking friend to start smoking, or because both were already smokers and then became friends. As another example, a highly educated person is usually less likely to be nominated by others as a best friend. This could be either because the highly educated person is less trustworthy or less able to maintain friendship ties, or because he/she is more independent and less willing to associate with others. Longitudinal data may help resolve the influence vs. selection issue.

5. Network analysis assumes that the probability of forming a tie is the same for any pair of nodes. So start with a meaningful set of nodes when building the network, so that each node has roughly the same probability of forming ties with any other.

6. How will severing an existing tie and forming a new one affect the structure of a social network? How can ties bring more ties and lead to a polarized network? Nonlinear generating processes and dynamics in a network can turn tiny changes at the node level into dramatic differences in network structure. How does network size affect network structure? (Think about the differences among a monopolistic market, an oligopolistic market and a perfectly competitive market.)

7. How to define homophily between friends? One dimension vs. multiple dimensions? Even with a single dimension, there are still two approaches: 1) test the difference in means between the tie senders and the tie receivers; 2) as an alternative measure, use the ratio of the number of ties whose connected nodes fall in the same group, as you define it (e.g., age +/- 5), to the total number of ties. What else?

8. Need to think about how to incorporate network analysis into the traditional regression framework. We can either include network properties in regression models to study how networks affect personal/clique-level phenomena, or use regressions to evaluate how network properties are determined by socioeconomic variables.

9. How to deal with the dependence structure among node-level variables, since the errors are not i.i.d.? Is it enough to just use a correlation matrix to weight the standard errors and get robust SEs?

10. Need to combine network software with traditional statistical software. statnet is getting there. But for Stata users, canned programs are needed to generate network data inside Stata.
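The two one-dimensional homophily measures in note 7 can be sketched in a few lines of R. This is only an illustration under stated assumptions: the ages and the tie list are made-up toy data, the "same group" rule is the age +/- 5 example from the note, and a paired t-test is just one possible reading of the "mean test" between senders and receivers.

```r
# Hypothetical toy data (not from the meeting): node ages and a directed tie list.
ages <- c(a = 20, b = 23, c = 31, d = 45, e = 47)
ties <- data.frame(
  sender   = c("a", "a", "b", "c", "d", "e"),
  receiver = c("b", "c", "c", "d", "e", "a"),
  stringsAsFactors = FALSE
)

# Look up the age at each end of every tie.
sender_age   <- ages[ties$sender]
receiver_age <- ages[ties$receiver]

# Approach 2: share of ties whose two nodes fall in the same group,
# where "same group" is assumed to mean ages within 5 years of each other.
same_group      <- abs(sender_age - receiver_age) <= 5
homophily_ratio <- mean(same_group)

# Approach 1 (one possible reading): paired test of sender vs. receiver ages.
age_test <- t.test(sender_age, receiver_age, paired = TRUE)

print(homophily_ratio)
print(age_test$p.value)
```

A raw ratio like this says little on its own; it would normally be compared with the share expected if ties formed at random among the same nodes.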

Lastly, for those of you who are interested in causal analysis, read Patrick Doreian (2001), “Causality in Social Network Analysis” (Sociological Methods and Research 30: 81-114) and see if you can improve his study.

Posted by Weihua An at 10:46 AM | Comments (0) | TrackBack

7 May 2008

What's New in Econometrics

Here's a link to a free, 18-hour mini-course on recent advances in econometrics and statistics from the National Bureau of Economic Research. It's co-taught by Guido Imbens and Jeffrey Wooldridge. The intended audience is obviously economists, but several topics (Bayesian inference, missing data, etc.) are likely to interest a wide range of social scientists. The course includes lecture videos, slides, and detailed notes on each topic.

Posted by John Graves at 1:53 PM | Comments (0)

Plotting Survival Curves with Uncertainty Estimates

One of the pesky things I've found in my (limited) experience with survival analysis is that it's almost impossible to plot several survival curves in the same space and include measures of uncertainty without the entire plot becoming incomprehensible. So, to build on the great R discussions Ellie and Andy have provided in recent blog posts, I'd like to offer an extension of my own. I've created a fairly flexible function that allows one to plot several survival curves along with estimation uncertainty from Zelig's Cox proportional hazards output (which was developed by Patrick Lam). Here are two examples of what my surv.plot() function can provide:

survplotex.jpg

Hopefully this will be of some interest to a few readers. More details and example code below.

Here are the arguments for the command:
s.out: Simulated output from Zelig for each curve, organized as a list()
duration: Survival time
censor: Censoring indicator
type: Display type for confidence bands. The default is "line", but "poly" is also supported (to create the shaded region in the right plot above).
plotcensor: Creates a rug() plot indicating censoring times (default is TRUE)
plottimes: Plots a point for each survival time in the step function (default is TRUE)
int: Desired uncertainty interval (default is c(0.025,0.975), which corresponds to a 95% interval)

Here's the surv.plot() source code, and below I've copied the R code I used to create the plots above:

library(Zelig)
data(coalition)

# Fit the Cox Model
z.out1 <- zelig(Surv(duration,ciep12)~invest+numst2+crisis,
                robust=TRUE,cluster="polar",model="coxph",data=coalition)

# Set Low and High Quantities of Interest
low <- setx(z.out1,numst2=0)
high <- setx(z.out1,numst2=1)

# Simulate for Each
s.out1 <- sim(z.out1,x=low)
s.out2 <- sim(z.out1,x=high)

# Create list output that contains both simulations
out <- list(s.out1,s.out2)


# Plot the results
par(mfrow=c(1,2))
surv.plot(s.out=out,duration=coalition$duration,censor=coalition$ciep12,type="line",plottimes=TRUE)
surv.plot(s.out=out,duration=coalition$duration,censor=coalition$ciep12,type="poly",plotcensor=TRUE)
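For readers curious how bands like these can be drawn, here is a minimal base-R sketch of pointwise percentile intervals around a set of simulated survival curves. It is not the surv.plot() source: the simulated curves below are a hypothetical stand-in for Zelig's sim() output (exponential survival with randomly drawn rates), and the assumption is simply that the int argument is applied as per-time-point quantiles.

```r
set.seed(1)
times  <- 1:20
n_sims <- 1000

# Hypothetical stand-in for simulation output: each row is one simulated
# survival curve S(t) = exp(-rate * t), with the rate drawn at random.
rates     <- rlnorm(n_sims, meanlog = log(0.1), sdlog = 0.2)
surv_sims <- exp(-outer(rates, times))   # n_sims x length(times) matrix

# Pointwise summaries at each time: median plus the requested interval.
int   <- c(0.025, 0.975)
est   <- apply(surv_sims, 2, median)
lower <- apply(surv_sims, 2, quantile, probs = int[1])
upper <- apply(surv_sims, 2, quantile, probs = int[2])

# Draw the central curve, then the band in both styles.
plot(times, est, type = "s", ylim = c(0, 1),
     xlab = "Time", ylab = "Survival probability")
# Shaded band (analogous to type = "poly")
polygon(c(times, rev(times)), c(lower, rev(upper)),
        col = adjustcolor("grey50", alpha.f = 0.3), border = NA)
# Dashed limits (analogous to type = "line")
lines(times, lower, type = "s", lty = 2)
lines(times, upper, type = "s", lty = 2)
```

With several curves on one plot, the shaded version tends to stay readable longer than dashed lines, since overlapping bands blend rather than tangle.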


Posted by John Graves at 11:48 AM | Comments (0)

6 May 2008

Tuesday: Tips & Tricks

I've been programming in R for four years now, and it seems that no matter how much I learn, there are a million tiny ways that I could do it better. We all have our own programming styles and frequently used functions that may prove useful to others. I often find that a casual conversation with an office mate yields new approaches to a programming quandary. I'm speaking not of statistical insights, though those are important too, but rather of the "simple" art of data manipulation and programming implementation: those essential tricks that help to improve coding efficiency. To that end, I'm announcing the beginning of a bi-weekly "Tuesday Tips & Tricks" posting. These tips may include the description of a useful and perhaps obscure function, or the solutions to common coding problems. I'm selfishly hoping that if readers of this blog know of better or alternate approaches, they'll respond in the comment section. I'm looking forward to reading your responses.

This week's tip: How to quickly summarize contents of an object.

Answer: summary(), str(), dput()

The primary option, of course, is the familiar summary() command. This command works well for viewing model output, but also for getting a quick sense of data frames, matrices and factors. For example, the summary of a data frame or matrix shows the following:

> summary(dat1)
     Hello           test     citynames
 Min.   :1.00   Min.   :-3   Length:2
 1st Qu.:1.25   1st Qu.:-2   Class :character
 Median :1.50   Median :-1   Mode  :character
 Mean   :1.50   Mean   :-1
 3rd Qu.:1.75   3rd Qu.: 0
 Max.   :2.00   Max.   : 1

This is an incredibly useful function for numeric data, but it is less useful for string data. For character vectors, the summary function only reveals the length, class, and mode of the variable. In this case, to get a quick look at the data, one might want to use str(). Officially, str() "compactly displays the structure of an arbitrary R object", and in practice this is incredibly useful. So, using the same data frame as an example:

> str(dat1)
'data.frame':   2 obs. of  3 variables:
 $ Hello    : num  1 2
 $ test     : num  -3 1
 $ citynames: chr  "Cambridge" "Rochester"

Here we have a data frame with 2 observations of 3 variables. The first variable is Hello, it's a numeric variable, and its values are 1 and 2. The character vector citynames is displayed much more usefully than it was by summary(). While this is a small example, the function works just as well for much larger data frames and matrices, where it displays only the first few values of each variable.

For smaller objects, the function dput() might also prove useful. This function shows an ASCII text representation of the R object and its attributes. So for this same example:

> dput(dat1)
structure(list(Hello = c(1, 2), test = c(-3, 1), citynames = c("Cambridge",
"Rochester")), .Names = c("Hello", "test", "citynames"), row.names = c(NA,
-2L), class = "data.frame")
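To try the three functions yourself, the example object can be rebuilt as below. The construction of dat1 is not shown in the post, so this is a reconstruction from the dput() output above; the round trip through a temporary file with dget() is an added illustration of why dput() is handy for sharing small objects.

```r
# Reconstruct the example data frame (assumed from the dput() output above).
dat1 <- data.frame(
  Hello     = c(1, 2),
  test      = c(-3, 1),
  citynames = c("Cambridge", "Rochester"),
  stringsAsFactors = FALSE   # keep citynames as character, as in the post
)

summary(dat1)   # numeric summaries; only length/class/mode for characters
str(dat1)       # compact structure, including the first few values
dput(dat1)      # text representation that can be pasted back into R

# dput() output is valid R code, so an object can be written out and
# read back in with dget().
tmp <- tempfile(fileext = ".R")
dput(dat1, file = tmp)
dat2 <- dget(tmp)
print(all.equal(dat1, dat2))
```

Depending on the R version, the dput() text may differ slightly from the output printed above (newer versions do not print a separate .Names entry, for example).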

Posted by Eleanor Neff Powell at 4:41 PM | Comments (3)

4 May 2008

IN, NC Predictions

Since I have qualifying exams tomorrow, I'll keep this entry unimaginative. I've re-run my predictions for the Indiana and North Carolina primaries on Tuesday, adding a few new bells and whistles:

  • A turnout model
  • More covariates in the voting share model

nc.dem.2008.pred.share.png

in.dem.2008.pred.share.png

With the help of a turnout model, I can actually predict the election result by multiplying turnout by population and adding up votes for Clinton and Obama. When I do that, I get:

Indiana: Clinton 53.5%, Obama 46.5%; turnout 950,000
North Carolina: Obama 58%, Clinton 42%; turnout 1,200,000
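The aggregation step described above (turnout times population, then summing votes by candidate) can be sketched as follows. The county rows are made-up numbers for illustration only, not the data or predictions behind the figures above, and the column names are hypothetical.

```r
# Hypothetical county-level model outputs (illustrative values only).
counties <- data.frame(
  county        = c("A", "B", "C"),
  population    = c(100000, 250000, 50000),
  turnout_rate  = c(0.20, 0.15, 0.25),  # predicted share of population voting
  clinton_share = c(0.60, 0.45, 0.55)   # predicted Clinton share of the two-way vote
)

# Predicted votes cast in each county, split between the two candidates.
counties$votes         <- counties$population * counties$turnout_rate
counties$clinton_votes <- counties$votes * counties$clinton_share
counties$obama_votes   <- counties$votes - counties$clinton_votes

# Statewide totals: county shares are implicitly weighted by predicted turnout.
total_turnout <- sum(counties$votes)
clinton_pct   <- 100 * sum(counties$clinton_votes) / total_turnout
obama_pct     <- 100 - clinton_pct

print(round(c(Clinton = clinton_pct, Obama = obama_pct), 1))
print(total_turnout)
```

The point of the turnout model is visible here: without it, county shares could only be averaged with population weights, which would overstate the influence of low-turnout counties.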

Yowzers! We'll see how the real numbers pan out. Here are a few details on the two models:

  • The share model is trained on the primary results from Ohio, Pennsylvania and Virginia. This model has R^2 = 0.99, meaning that it's explained nearly as much as it can. The residuals still show a SE of 5%, however, so the results could be shaky at the county level.
  • The turnout model is trained on the primary results from Ohio. Note that Indiana and North Carolina are open primaries. I didn't use Pennsylvania in this model because it was a closed primary, and I didn't use Virginia because it had a contested Republican election at the time. Ohio's Republican primary was technically contested by Huckabee, but he wasn't a serious factor, whereas he had dedicated substantial resources to competing in Virginia. For this model R^2 = .84 and the residual SE is 2%. My turnout projections are mapped below.

    in.dem.2008.pred.turn.png
    nc.dem.2008.pred.turn.png


This time I included even more covariates for both models. Next to the ones found to be important, I've placed their effect in parentheses.


  • Kerry's 2004 vote share and its square (pro-Clinton and +turnout)

  • Proportions White, Black, Asian, Native American and Hispanic (white pro-Clinton and +turnout, others pro-Obama)

  • Proportion male (pro-Clinton, +turnout)

  • Proportions 18-21 and 65+ (both pro-Obama, young -turnout, old +turnout)

  • Percentage urban

  • Log(median household income) (pro-Obama)

  • Proportion with a bachelor's degree, proportion with a master's degree (pro-Obama)

  • Unemployment rate (high is pro-Clinton)

  • Proportions employed in mining, in education, in construction (mining pro-Clinton, education pro-Obama)

How do my results stack up against the current polls? In Indiana, the RealClearPolitics average has Clinton +6%, only a point from my prediction. In North Carolina, the RCP average has Obama +8%, significantly below my predicted 16% victory. Two factors shed light on this discrepancy:


  • In neighboring South Carolina, the polling average had Obama +11.6% and he won by 28.9%.

  • In neighboring Virginia, the polling average had Obama +17.7% and he won by 28.2%.

So perhaps my analysis isn't so crazy in putting Obama above what the polls say in NC.

We'll see how it pans out on Tuesday. I'm more than willing to eat crow :)

Posted by Kevin Bartz at 6:38 PM | Comments (4)

1 May 2008

New NBER working paper by James Heckman: “Econometric Causality”

James Heckman has a new NBER working paper, “Econometric Causality”, which some of you might find interesting. To give you a flavor, Heckman writes:

“Unlike the Neyman–Rubin model, these [selection] models do not start with the experiment as an ideal but they start with well-posed, clearly articulated models for outcomes and treatment choice where the unobservables that underlie the selection and evaluation problem are made explicit. The hypothetical manipulations define the causal parameters of the model. Randomization is a metaphor and not an ideal or ‘gold standard’.” (page 37)

Heckman, J. (2008). “Econometric Causality.” NBER Working Paper No. 13934. http://papers.nber.org/papers/W13934

Abstract: This paper presents the econometric approach to causal modeling. It is motivated by policy problems. New causal parameters are defined and identified to address specific policy problems. Economists embrace a scientific approach to causality and model the preferences and choices of agents to infer subjective (agent) evaluations as well as objective outcomes. Anticipated and realized subjective and objective outcomes are distinguished. Models for simultaneous causality are developed. The paper contrasts the Neyman-Rubin model of causality with the econometric approach.

Posted by Sebastian Bauhoff at 10:00 AM | Comments (0)