2 December 2005

Questions about Free Software

Jim Greiner

This past spring at Harvard, a group of students from a variety of academic disciplines agitated for a course in C, C++, and R focusing on implementing iterative statistical algorithms such as EM, Gibbs sampling, and Metropolis-Hastings. The result was an informal summer class sponsored by IQSS and taught by recent Department of Statistics graduate Gopi Goswami. Professor Goswami created (from scratch) class notes, problem sets, and sample programs, and compiled lists of web links and other useful materials. Course participants came from, among other places, Statistics, Biostatistics, Government, Japanese Studies, the Medical School, the Kennedy School, and Health Policy. For those interested in the lecture slides and other materials Professor Goswami compiled, the link is here.

Principal among the subjects taught in the course was how to marry R's data-processing and display capabilities to an iterative inferential engine (try saying that phrase quickly three times) such as an EM or a Gibbs, with the latter written in C or C++ so as to increase (vastly) the speed of runs. In other words, we learned how to have R do the front end (data manipulation, data formatting) and back end (analysis of results, graphics) of an analysis while letting a faster language do the hard work in the middle.

The course both demonstrates and facilitates a growing trend in the quantitative social sciences toward making open-source software stemming from scholarly publications freely available to the academic community. Two examples from the ever-expanding field of ecological inference are Gary King's EI program, based on a truncated bivariate normal model and implemented in GAUSS, and Kosuke Imai and Ying Lu's Dirichlet-process-based model, implemented with an R-C interface.

The trend toward freely available, model-specific software has obvious potential upsides. Previously written code can save the time of a user interested in applying the model. Moreover, if the code is used often enough and potential bugs are reported and fixed, the software may become better than what a potential user could write on his or her own. After all, few of us interested in answers to real-world issues want to spend the rest of our lives coding in C.

Nevertheless, I confess to a certain amount of apprehension. For me at least, freely available, model-specific software provides a temptation to use models I do not fully understand. Relatedly, I often think that I do understand a model fully, that I grasp all of its strengths and weaknesses, only to discover otherwise when I sit down to program it. Finally, oversight, hubris, or a desire to make accompanying documentation readable may cause the author of the software not to describe fully the details of implementation or the compromises made therein. Thus, while I am excited by the possibilities freely available social science software holds, I worry about the potential for misuse as well.

Posted by James Greiner at December 2, 2005 6:00 AM

Comments

If you are interested in open-source C code for stats, you might like apophenia. It's for those who want complete control over their statistical computing and you can use it in conjunction with R/perl/ruby or whatever other scripting language you like. Plus you don't get tied into an endless and expensive cycle of upgrades.

Posted by: afelton at December 2, 2005 12:00 PM

A few comments:

First, R and C++ don't mix terribly well I'm afraid---this is more C++'s fault than it is R's. If you have some C++ code (by which I mean real C++ code using objects, templates, etc., not just C code written in the C++ style) you're going to be stuck writing a wrapper in C for the C++ code to be able to talk to R. Most of the time it's more trouble than it's worth. Of course, the fact that you are forced to resort to C to achieve acceptable performance represents a failure of the statistical computing community (and, IMHO, inadequate support from the larger community... which is a rant that I'll get into some other time. Y'all are lucky it's sunny outside).

Second, related to the availability of free tools, is I think an overlooked fact of the free statistical environments (predominately R these days, but this would also include LispStat and, to some extent BUGS): For-pay statistical environments (SAS, SPSS, JMP, etc) are in the business of packaging routines for end users, typically with a cute little point and click interface. They are not historically in the business of providing an environment for research into new techniques/models/what have you beyond whatever limited scripting facilities they provide. This is very different from say, R, which could, in theory, be implemented in itself. For performance reasons it isn't entirely done that way, but the point is that when you are programming R you have access to all the tools needed to do anything with the same level of priority as the core system. This, it turns out, is important.

This also plays towards your final comment on the use of models you don't understand. Personally, I think cost and model understanding are largely independent, except for the feeling that when you've paid $10,000 for your hammer, you're a lot less likely to use a screwdriver, even when the problem is a screw.

That's not to say there isn't a place for for-pay software. There are lots of things that you really need to pay people to do, like good technical writing (the person who developed the routines is probably the last person who should document them), user interface design, and such.

Posted by: Byron at December 2, 2005 4:22 PM

Jim,

Check out this paper which is full of cool vaporware.

Posted by: Andrew Gelman at December 6, 2005 4:13 PM