Authors' Committee

Chair:

Matt Blackwell (Gov)

Members:

Martin Andersen (HealthPol)
Kevin Bartz (Stats)
Deirdre Bloome (Social Policy)
Andy Eggers (Gov)
John Graves (HealthPol)
Rich Nielsen (Gov)
Maya Sen (Gov)
Gary King (Gov)

Weekly Research Workshop Sponsors

Alberto Abadie, Lee Fleming, Adam Glynn, Guido Imbens, Gary King, Arthur Spirling, Jamie Robins, Don Rubin, Chris Winship


April 2, 2009

Correlation is Not Causation, Part 1,345,649,303

As a follow-up to Andy's post, here's a great addition to the slides of any methods class. Here is the original. Via Megan McArdle at the Atlantic Monthly.

ci700332kn00001.gif

Posted by John Graves at 6:23 PM

March 19, 2009

Writing Excel Tables, Figures and Graphs Directly from R

As those of us who have worked on empirical projects surely know, at times a frustratingly large amount of time can be spent packaging results into tables or figures for publication or review. Fortunately, a number of modules have been developed to facilitate this process. For example, "write.csv" can be used from within R to output a table directly into an Excel-readable format. Likewise, practitioners of Stata can use the package "xml_tab" to do the same with a bit more flexibility.

Recently I've been involved in a large-scale modeling effort that requires a very detailed multi-worksheet Excel output that, depending on the task, includes a mix of tables, graphs and figures created in both R and Stata. Given the amount of modeling we're doing, creating this output manually every time would either take up 90% of my time or require hiring an army of RAs whose sole task is creating these Excel files. So, while the above packages are no doubt helpful in specific contexts, we've had to scour through what's out there to come up with our own way to produce the outputs most efficiently.

What follows is what (I think) is a neat way to automate outputs directly from R. Hopefully, readers of this blog can benefit from using this in their own research. Basically, one can use the "write" function in R to write a Perl script that is then run with the terrific Spreadsheet-WriteExcel module. This gives one the flexibility to, among other things, output to separate worksheets, format tables (with merged cells, different column widths, cell borders, etc.), include figures, and create charts, all in the same Excel file.

The example below is fairly simple -- it outputs two generic tables into separate worksheets -- but gives a good sense of how the powers of R and WriteExcel can be harnessed to really speed up the research process. Also, I'd appreciate any other thoughts on this from folks who have done similar things!

output_excel.txt
Here is the final output.
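
Since the attached script is not reproduced on this page, here is a minimal sketch of the general idea, not the original code. It builds a Perl script as a character vector in R and writes it out with write(). The helper df_to_perl(), the two example tables, and the file names are all hypothetical; the Perl calls (new, add_worksheet, add_format, write, close) are the standard Spreadsheet::WriteExcel methods.

```r
# Two small example tables (made-up numbers)
table1 <- data.frame(group = c("A", "B"), estimate = c(0.42, 0.58),
                     stringsAsFactors = FALSE)
table2 <- data.frame(year = c(2007, 2008), n = c(1200, 1350))

# Hypothetical helper: translate a data frame into Perl statements that
# create a worksheet and fill it cell by cell. Assumes column names and
# values contain no single quotes or backslashes.
df_to_perl <- function(df, sheet_var, sheet_name) {
  lines <- sprintf("my $%s = $workbook->add_worksheet('%s');",
                   sheet_var, sheet_name)
  for (j in seq_along(df)) {
    # Column header goes in row 0, in bold
    lines <- c(lines, sprintf("$%s->write(0, %d, '%s', $bold);",
                              sheet_var, j - 1L, names(df)[j]))
    for (i in seq_len(nrow(df))) {
      value <- df[[j]][i]
      # Numbers are written bare, everything else as a quoted string
      cell <- if (is.numeric(value)) format(value)
              else sprintf("'%s'", as.character(value))
      lines <- c(lines, sprintf("$%s->write(%d, %d, %s);",
                                sheet_var, i, j - 1L, cell))
    }
  }
  lines
}

# Assemble the full Perl script: one workbook, two worksheets
script <- c(
  "use strict;",
  "use warnings;",
  "use Spreadsheet::WriteExcel;",
  "my $workbook = Spreadsheet::WriteExcel->new('results.xls');",
  "my $bold = $workbook->add_format(bold => 1);",
  df_to_perl(table1, "ws1", "Table 1"),
  df_to_perl(table2, "ws2", "Table 2"),
  "$workbook->close();"
)

# Write the script to disk, one statement per line
perl_file <- file.path(tempdir(), "make_tables.pl")
write(script, file = perl_file)

# To build the workbook (requires Perl and the module to be installed):
# system(paste("perl", shQuote(perl_file)))
```

Formatting such as merged cells or column widths would be added as extra Perl statements in the same way, which is where the flexibility comes from.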

Posted by John Graves at 12:06 PM

February 19, 2009

Uncertainty Estimates and the Current Population Survey

The Annual Social and Economic Supplement (ASEC) to the Current Population Survey (CPS) is among the most widely used and influential data sets in the social sciences and in policymaking. For example, the much-cited figure of 45 million uninsured is a CPS estimate; Title I education funding is allocated using the CPS; and state outlays for the State Children's Health Insurance Program are also determined using the survey.

From the perspective of the social scientist, the CPS is a key research tool because of its large sample size (roughly 60,000 households) and because it is typically released publicly about 5-6 months after the survey is initially fielded. However, one major drawback is that, unlike other major national surveys (the SIPP, the MEPS, and the NHIS, to name a few), the public release of the CPS data does not include the variables needed to compute correct standard errors under the complex survey design. Rather, the CPS releases a series of adjustment factors for specific population subgroups (e.g. by race, income group, state, etc.) that can be applied to uncertainty estimates. However, this approach is obviously problematic in the case of regression -- which adjustment factors does one use if the regression contains a rich array of covariates? As a result, much research using the CPS (which appears quite often in economics and health services research journals) proceeds either under the assumption of simple random sampling, or using robust standard errors. These studies therefore likely have understated uncertainty estimates, casting some doubt on the conclusions of this work.

So what is the applied researcher to do? One simple method of approximation (suggested to me once by Alan Zaslavsky) is to exploit the fact that the CPS uses monthly rotation groups that effectively replicate the CPS survey design. That is, one could produce separate estimates for each monthly rotation group and combine these estimates to come up with an estimate of the uncertainty from the survey design.
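
As a rough sketch of how the rotation-group idea could be coded, the example below treats each rotation group as an independent replicate and uses the textbook random-groups variance estimator. The data are simulated, the variable names (rot_group, weight, uninsured) are hypothetical rather than actual CPS variable names, and treating the groups as independent replicates is an assumption of the sketch.

```r
set.seed(1)

# Simulated stand-in for a CPS extract (made-up data, hypothetical names)
n <- 8000
cps <- data.frame(
  rot_group = rep(1:8, length.out = n),  # assumed: 8 monthly rotation groups
  weight    = runif(n, 500, 1500),       # person weights
  uninsured = rbinom(n, 1, 0.15)         # 1 = uninsured
)

# Weighted estimate of the uninsured rate within each rotation group
est_by_group <- sapply(
  split(cps, cps$rot_group),
  function(g) weighted.mean(g$uninsured, g$weight)
)

# Random-groups estimator: average the replicate estimates and use their
# spread to estimate the standard error of that average
k         <- length(est_by_group)
theta_bar <- mean(est_by_group)
se_rg     <- sqrt(sum((est_by_group - theta_bar)^2) / (k * (k - 1)))

c(estimate = theta_bar, se = se_rg)
```

With only a handful of replicate groups the standard error is itself quite noisy, so intervals built from it would more sensibly use a t reference distribution with k - 1 degrees of freedom than a normal one.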

An alternative method (described in Davern et al., Inquiry 43(3), 2006) is to construct synthetic stratum and primary sampling unit (PSU) variables using available information in the survey (e.g. metropolitan statistical area, state, and household identifiers). In that article, the authors compared standard errors from this synthetic method to those from the internal census files (which obviously do have the complex survey design variables) by computing the ratio of the synthetic standard error to the internal-file standard error. In general, the ratios were on the order of 0.75 to 0.85, bringing the uncertainty estimates closer to the internal estimates than the ratios of about 0.5-0.6 they found under the assumption of simple random sampling (i.e. making no adjustment for survey design) and using robust standard errors.

Posted by John Graves at 8:23 AM

February 15, 2009

How many people could improved COBRA cover?

One feature of the recently passed stimulus legislation is a temporary federal subsidy program for individuals to purchase transitional health insurance coverage in the event they lose their job. This coverage -- called COBRA after the legislation that created the program in the 1980s -- is generally available to displaced workers in firms with 20 or more employees, although some states have also adopted policies that allow employees of smaller firms to buy into transitional coverage. The catch, however, is that workers must pay the full premium amount plus 2 percent for administrative expenses.

The new federal program allows for subsidies of up to 65 percent of the cost of health insurance, which is intended to give a boost to individuals trying to make ends meet and remain insured while they're unemployed. But the question is, how many people could this program potentially cover in a given year?

Below I've created a population flow diagram using R's diagram package. The underlying data are drawn from the 2006 Medical Expenditure Panel Survey (MEPS) -- a nationally representative survey of American households conducted each year by the Agency for Healthcare Research and Quality. To construct the diagram I took the health insurance coverage status of non-elderly adults and children in January 2006, and compared this to their coverage status in December of that year.

Clearly, the potential for reducing the number of uninsured is large -- according to these estimates, 6.4 million adults and 1.4 million children lost employer-based group insurance between January and December 2006.* If just one-half of these individuals took up a subsidy and were able to continue that group coverage, the number of uninsured in 2006 could have been reduced by about 4 million. Moreover, this is almost certainly an underestimate, since there are also individuals who lost employer-based coverage prior to January, as well as individuals who were intermittently uninsured during the year, who may not show up in the diagrammed flows but could also have benefited from access to subsidized transitional coverage. Finally, I would also note the potential for spillover and cost-offsetting effects within Medicaid, as the estimated 1.2 million who went from group coverage to public coverage could have also retained their employer-based insurance, saving states and the federal government the costs of these extra Medicaid/SCHIP enrollees.

cov06_adult.png

cov06_child.png

* Note, however, that I have made no attempt to produce confidence intervals around these figures, though in principle this would be straightforward to do with R's survey package. If anyone knows how to easily input these into the figures (I couldn't figure out a way), please let me know!
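
For readers who want to build similar flows, the numbers behind a diagram like this come down to a weighted cross-tabulation of January status by December status. The sketch below uses simulated data with hypothetical variable names (jan, dec, weight), not the MEPS file, and draws the two statuses independently purely for illustration.

```r
set.seed(2006)

# Simulated stand-in for person-level records (made-up data)
n      <- 5000
status <- c("Employer", "Public", "Uninsured")
meps <- data.frame(
  jan    = factor(sample(status, n, replace = TRUE, prob = c(0.65, 0.15, 0.20)),
                  levels = status),
  dec    = factor(sample(status, n, replace = TRUE, prob = c(0.63, 0.16, 0.21)),
                  levels = status),
  weight = runif(n, 10000, 60000)  # survey weights
)

# Weighted flows: rows are January status, columns are December status
flows <- xtabs(weight ~ jan + dec, data = meps)

# Weighted number who started with employer coverage and ended without it
lost_employer <- sum(flows["Employer", c("Public", "Uninsured")])

round(flows / 1e6, 1)  # flows in millions
```

The off-diagonal cells of the table are the arrows in the diagram; the diagonal cells are the people whose status did not change.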

Posted by John Graves at 4:51 PM

November 5, 2008

R Animations for Teaching Statistics

Here is a neat R package that can be used to create animations for teaching a wide variety of topics in statistics: survey sampling, bootstrap, probability theory, just to name a few.

For those less interested in implementing the animations directly in R, there's also a web page with everything already done for you!

Posted by John Graves at 11:13 AM

September 23, 2008

A Handy Trick for Multiple Imputation of Categorical Data

As an applied researcher, I've often come across missing data problems where my data are categorical. This can raise issues because most standard multiple imputation packages assume the multivariate normal (MVN) distribution, which may not hold for certain types of categorical and binary data.

The standard shortcut for overcoming this problem is to just impute under the MVN assumption, then use rounding to finish out the imputation. But as Recai Yucel, Yulei He, and Alan Zaslavsky point out in their May 2008 article in The American Statistician, naive rounding can bias estimates, particularly when the underlying data are asymmetric or multimodal.

So what should the applied researcher do when multiply imputing categorical data? The authors propose a method of calibration whereby one duplicates the original data but sets the observed values for the variable of interest to missing in the duplicated data. The original data and the duplicated data are then stacked and imputation is carried out on the stacked dataset. By comparing the fraction of 1's among the originally observed (but imputed) observations in the duplicated data (Y_obs(dup)) with the fraction of 1's in the original observed data (Y_obs), one can find the appropriate cutoff (c) and assign 0's and 1's using that.
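
To make the steps concrete, here is a minimal sketch on simulated data. It is not the authors' code: as a stand-in for a full MVN multiple imputation it uses a single normal linear-regression imputation, and all variable names are made up. The calibration logic (duplicate, stack, impute, pick the cutoff that reproduces the observed fraction of 1's) follows the description above.

```r
set.seed(42)

# Simulated data: binary y depends on x and is missing for about 30% of rows
n     <- 2000
x     <- rnorm(n)
y     <- rbinom(n, 1, plogis(-1.5 + x))
miss  <- runif(n) < 0.3
y_obs <- ifelse(miss, NA, y)

# Duplicate the observed rows, blank out y in the copy, and stack
orig    <- data.frame(x = x, y = y_obs, dup = FALSE)
dupl    <- data.frame(x = x[!miss], y = NA_real_, dup = TRUE)
stacked <- rbind(orig, dupl)

# Stand-in imputation model (assumption): normal linear regression of y on x,
# fit to the observed rows, with residual noise added to the predictions
fit     <- lm(y ~ x, data = stacked)
to_fill <- is.na(stacked$y)
imputed <- predict(fit, newdata = stacked[to_fill, ]) +
  rnorm(sum(to_fill), mean = 0, sd = summary(fit)$sigma)

# Calibration: choose the cutoff so that the share of imputed values at or
# above it in the duplicated rows matches the share of 1's actually observed
is_dup <- stacked$dup[to_fill]
p_obs  <- mean(y_obs, na.rm = TRUE)
cutoff <- unname(quantile(imputed[is_dup], probs = 1 - p_obs))

# Round the imputations for the truly missing rows using the calibrated cutoff
y_completed       <- y_obs
y_completed[miss] <- as.integer(imputed[!is_dup] >= cutoff)

c(cutoff = cutoff, share_observed = p_obs, share_completed = mean(y_completed))
```

In a real analysis the imputation step would be repeated to create several completed datasets, with the cutoff recomputed for each one.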

This is a neat technique which benefits from the fact that it's very easy to implement in practice. In any case, check out the entire paper for more details on the method.


Posted by John Graves at 9:01 PM

May 7, 2008

What's New in Econometrics

Here's a link to a free, 18-hour mini-course on recent advances in econometrics and statistics from the National Bureau of Economic Research. It's co-taught by Guido Imbens and Jeffrey Wooldridge. The intended audience is obviously economists, but there are several topics (Bayesian inference, missing data, etc.) that are likely of interest to a wide range of social scientists. The course includes lecture videos, slides, as well as detailed notes on each topic.

Posted by John Graves at 1:53 PM

Plotting Survival Curves with Uncertainty Estimates

One of the pesky things I've found in my (limited) experience with survival analysis is that it's almost impossible to plot several survival curves in the same space and include measures of uncertainty without the entire plot becoming incomprehensible. So, to build on the great R discussions Ellie and Andy have provided in recent blog posts, I'd like to offer an extension of my own. I've created a fairly flexible function that allows one to plot several survival curves along with estimation uncertainty from Zelig's Cox proportional hazards output (which was developed by Patrick Lam). Here are two examples of what my surv.plot() function can provide:

survplotex.jpg

Hopefully this will be of some interest to a few readers. More details and example code below.

Here is the syntax for the command:
s.out: Simulated output from Zelig for each curve organized as a list()
duration: Survival time
censor: Censoring indicator
type: Display type for confidence bands. The default is "line" but "poly" is also supported (to create the shaded region in the right plot above).
plotcensor: Creates rug() plot indicating censoring times (Default is TRUE)
plottimes: Plots a point for each survival time in the step function (Default is TRUE)
int: Desired uncertainty interval (Default is c(0.025,0.975) which corresponds to a 95% interval)

Here's the surv.plot() source code, and below I've copied the R code I used to create the plots above:

library(Zelig)
data(coalition)

# Fit the Cox Model
z.out1 <- zelig(Surv(duration,ciep12)~invest+numst2+crisis,
                robust=TRUE,cluster="polar",model="coxph",data=coalition)

# Set Low and High Quantities of Interest
low <- setx(z.out1,numst2=0)
high <- setx(z.out1,numst2=1)

# Simulate for Each
s.out1 <- sim(z.out1,x=low)
s.out2 <- sim(z.out1,x=high)

# Create list output that contains both simulations
out <- list(s.out1,s.out2)


# Plot the results
par(mfrow=c(1,2))
surv.plot(s.out=out,duration=coalition$duration,censor=coalition$ciep12,type="line",plottimes=TRUE)
surv.plot(s.out=out,duration=coalition$duration,censor=coalition$ciep12,type="poly",plotcensor=TRUE)


Posted by John Graves at 11:48 AM

April 9, 2008

Do Default Options Save Lives?

Via Dan Ariely's contribution to this Freakonomics post yesterday, I was led to a fascinating paper on default options and behavior. The results on organ donation in Europe are particularly striking, as the authors show that large differences in organ donation rates in otherwise similar European nations (e.g. Sweden and Denmark) may in large part be a consequence of whether organ donation is an opt-in or opt-out option on the driver's license application.

As the authors note, there are substantial public policy implications to research along these lines. For example, here in the U.S., there is a growing chorus of policy gurus, including at least one major presidential candidate, pushing for policies such as automatic retirement accounts. The idea is that rather than enacting more blunt mechanisms (e.g. mandates), we can implement policies that harness the inertia brought about by default options to achieve policy goals.

Update: In comments, Kieran Healy raises the important point that willingness to donate is not the same as actually donating, and that observed donation rates in European countries tend to be much closer together. Fair point!

However, I would add that I'm not sure how helpful I find the figure presented at the Crooked Timber link. The data points correspond to organ donation rates by year, but it's not a time series so there's no way to know which points correspond to which year. Furthermore, do all of these points correspond to only being on one side or the other of a change in informed consent law? Or did some of these countries change their informed consent policies during the 1990-2002 time frame? This would be important information to know, particularly if we're interested in whether these laws have any effect on actual organ donation. On my first glance at the paper provided I see that the same data are indeed put in a time series, but again I don't see any indication of when each country's policy was enacted and whether there were any shifts in policy during the study time frame. So, based on that, it's hard to make any inference either way about whether the policies affected actual donation rates.

For another take on this issue, here's a paper by IQSS member Alberto Abadie, which does find an effect of presumed consent laws. I'd be interested to hear Healy's take on this paper!


donor_default.jpg

Posted by John Graves at 6:20 PM

March 31, 2008

Simulating Major League Baseball

Here is a neat application of simulation from this weekend's New York Times. The authors, a graduate student and professor at Cornell, simulated the entire history of Major League Baseball 10,000 times to see just how "mythic" Joe DiMaggio’s 56-game hitting streak really is. They find that 56-game streaks are not at all unusual, and furthermore that Joe DiMaggio wasn't even the most likely to set the record!

For those who are interested in doing some simulations of their own, my guess is that the authors used the Lahman Baseball Database, which is freely available online. Perhaps in some future post I'll take a look at some simulations of other baseball records. Any suggestions for what to look at?
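
For anyone who wants a starting point, here is a toy sketch of this kind of simulation. It is not the authors' method: it follows a single hypothetical player with a fixed, made-up probability of getting a hit in any given game, rather than the full historical record, and simply asks how often his longest streak in a season reaches 56 games.

```r
set.seed(56)

# Longest run of consecutive TRUEs (games with at least one hit)
longest_streak <- function(hit_games) {
  runs <- rle(hit_games)
  if (!any(runs$values)) return(0L)
  max(runs$lengths[runs$values])
}

p_hit_game <- 0.80   # hypothetical chance of a hit in any one game
n_games    <- 139    # hypothetical number of games played in the season
n_sims     <- 10000  # number of simulated seasons

# Simulate many seasons and record the longest streak in each
streaks <- replicate(n_sims, longest_streak(runif(n_games) < p_hit_game))

# Share of simulated seasons containing a streak of 56 games or more
mean(streaks >= 56)
```

A fuller version would draw each player's per-game hit probability from his actual season statistics (for example from the Lahman data) and loop over every player-season.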

Posted by John Graves at 11:27 AM

March 26, 2008

Salt Passage and Causal Inference

I'm guessing many of the readers of this blog will get a kick out of this article, which I received in a Research Methodology class I am taking with Prof. Richard Hackman of the Psychology Department.

M. Pacanowsky - "Please Pass the Salt: Examining the Motivational Variables, Idiosyncratic Dynamics, and Historical Precedents Associated With the Utterance," The Washington Post (4/9/78)

Posted by John Graves at 8:07 PM

Does Medicare Save Lives?

A few weeks ago I attended a talk by David Card, a Berkeley economist currently on leave here at Harvard. Card's talk was on a new paper, written with Carlos Dobkin and Nicole Maestas, entitled "Does Medicare Save Lives?"

In the paper, Card and his coauthors analyze data on over 400,000 hospital emergency room encounters in California for "non-deferrable" admissions, which are defined as admissions for conditions whose daily admission rates do not differ across days of the week.

Given the rather strict age cutoff for Medicare* eligibility (which, with a few exceptions, starts in the month one turns 65), and the fact that using non-deferrable ER admissions helps ensure that individuals within a narrow age band have similar underlying health, the authors are able to employ a regression discontinuity design to estimate the effect on mortality of becoming eligible to receive benefits under Medicare. Strikingly, their principal finding is that Medicare eligibility reduces mortality among their study cohort by 20 percent. That is a huge result!

Card mentioned in his talk that he and his coauthors were fairly surprised by the magnitude of this finding. So, what could explain this large decrease in mortality?

As the authors note, the magnitude is too large to be explained by the added health benefit of gaining coverage for the 8 percent of their sample that was previously uninsured. Moreover, the drop in mortality is also seen among individuals who had other coverage prior to Medicare. As an alternative explanation, the authors suggest that the result may be driven by improved "insurance generosity" of gaining Medicare coverage at age 65. That is, if a typical insurance policy for a non-Medicare eligible near-elderly citizen contains a lot of restrictions or administrative hurdles, then the more generous coverage and fewer restrictions provided by Medicare may result in more timely delivery of care, thus reducing mortality.

Here's one mechanism through which I think this explanation could be working. One question I raised during the talk was whether they had any data on the mode of arrival to the ER (unfortunately, they don't). Several years ago I actually worked as an Emergency Medical Technician for an ambulance service in rural Tennessee, and one of the most striking things about working in prehospital care is that the vast majority of ambulance calls are for Medicare recipients. Now, in part this is the result of the obvious fact that folks on Medicare are, on average, in poorer health than everyone else. But, I wonder whether the generous coverage of prehospital care under Medicare causes beneficiaries to call the ambulance, and thus receive earlier medical intervention, more than they would under a standard insurance policy (under which coverage for ambulances is more variable). Given the enormous clinical impact of early intervention on mortality, particularly for conditions such as heart attacks and strokes (which likely make up a good portion of the ER sample used here), this fact could help explain much of the drop in mortality.

In any case, I think the Card paper is a neat example of the use of a regression discontinuity design. The major downside to these designs, however, is that they gobble up the effective sample size, since identification essentially comes from individuals who are assumed to be randomly distributed around the cutoff point. So even with 400,000 observations, it's tough for the authors to really drill down to see which specific health events are showing the biggest declines in mortality.
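
For readers new to the design, here is a minimal sketch of a regression discontinuity estimate on simulated data. It is not the paper's specification: the mortality rates, the size of the jump, and the 3-year bandwidth are all made up, and the estimator is a simple local linear regression with separate slopes on each side of age 65.

```r
set.seed(65)

# Simulated admissions (made-up numbers): age in years and a death indicator
n        <- 400000
age      <- runif(n, 60, 70)
eligible <- as.integer(age >= 65)

# Mortality rises smoothly with age and drops by 1 percentage point at 65
p_die <- 0.05 + 0.002 * (age - 65) - 0.01 * eligible
died  <- rbinom(n, 1, p_die)

# Keep only observations within the bandwidth around the cutoff
h      <- 3
window <- abs(age - 65) <= h

# Local linear regression: the coefficient on `eligible` is the estimated
# jump in mortality at the cutoff
rd_fit <- lm(died ~ eligible * I(age - 65), subset = window)
rd_est <- unname(coef(rd_fit)["eligible"])

rd_est
summary(rd_fit)$coefficients["eligible", ]
```

Shrinking the bandwidth h makes the comparison more credible but throws away observations, which is the sample-size cost described above.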

*For those not familiar, Medicare is a social health insurance program provided to elderly U.S. citizens; it's sometimes confused with Medicaid, which is an insurance program for very low-income families.

Posted by John Graves at 11:11 AM