October 26, 2009
This Wednesday, October 28th, the Applied Statistics workshop will welcome Eric Tchetgen Tchetgen, Assistant Professor of Epidemiology at Harvard School of Public Health, presenting his work titled "Doubly robust estimation in a semi-parametric odds ratio model." Eric has provided the following abstract for the paper:
We consider the doubly robust estimation of the parameters in a semi-parametric conditional odds ratio model characterizing the effect of an exposure in the presence of many confounders. We develop estimators that are consistent and asymptotically normal in a union model where either a prospective baseline density function or a retrospective baseline density function is correctly specified but not necessarily both. The case of a binary outcome is of particular interest; then our approach yields a doubly robust locally efficient estimator in a semi-parametric logistic regression model. For general types of outcomes, we provide a strategy to obtain doubly robust estimators that are nearly locally efficient. We illustrate the method in a simulation study and an application in statistical genetics. Finally, we briefly discuss extensions of the proposed method to the semi-parametric estimation of a parameter indexing an interaction between two exposures on the logistic scale, as well as extensions to the setting of a time-varying exposure in the presence of time-varying confounding.
The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.
Posted by Matt Blackwell at 11:10 AM | Comments (0)
October 20, 2009
In case you had not already heard, Trevor Hastie, Robert Tibshirani, and Jerome Friedman have put a PDF copy of the second edition of their excellent text Elements of Statistical Learning on the book's website. I am sure many of you already own it, but a searchable version for the laptop is incredibly useful. The second edition has a lot of new content, including completely new chapters on Random Forests, Ensemble Learning, Undirected Graphical Models, and High-Dimensional Problems.
While a copy on your computer is very handy, a desk copy of this book is essential if you are interested in machine learning or data mining. The book is also a sight to behold. You can buy a copy at Amazon or Springer.
Posted by Matt Blackwell at 10:15 AM | Comments (0)
October 19, 2009
Please join us this Wednesday October 21st when we will have a change in the schedule. We are happy to have Andy Eggers (Department of Government) presenting a talk titled "Electoral Rules, Opposition Scrutiny, and Policy Moderation in French Municipalities: An Application of the Regression Discontinuity Design." Andy has provided the following abstract for his talk:
Regression discontinuity design (RDD) is a powerful and increasingly popular approach to causal inference that can be applied when treatment is assigned deterministically based on a continuous covariate. In this talk, I will present an application of RDD from French municipalities, where the system of electing the municipal council depends on whether the city's population is above or below 3500. First I show that cities above the population cutoff have fewer uncontested elections and more opposition representation on municipal councils, consistent with expectations. I then trace the effect of these political changes -- which amount to a heightening of the scrutiny imposed on the mayor -- on policy outcomes, providing evidence that more opposition scrutiny leads to more moderate policy.
The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.
Posted by Matt Blackwell at 7:21 PM | Comments (0)
October 14, 2009
Tim Kreider at the New York Times has a short piece on what he dubs "The Referendum" and how it plagues us:
The Referendum is a phenomenon typical of (but not limited to) midlife, whereby people, increasingly aware of the finiteness of their time in the world, the limitations placed on them by their choices so far, and the narrowing options remaining to them, start judging their peers' differing choices with reactions ranging from envy to contempt. ...Friends who seemed pretty much indistinguishable from you in your 20s make different choices about family or career, and after a decade or two these initial differences yield such radically divergent trajectories that when you get together again you can only regard each other's lives with bemused incomprehension.
Those familiar with causal inference will recognize this as stemming from the Fundamental Problem of Causal Inference: we cannot observe, for one individual, both their response to treatment and their response to control. The article is an elegant look at how we grow to worry about those mysterious missing potential outcomes--the paths we didn't choose--and how we use our friends' lives to impute those missing outcomes. Kreider goes on to make this point exactly, with a beautiful quote from a novel:
The problem is, we only get one chance at this, with no do-overs. Life is, in effect, a non-repeatable experiment with no control. In his novel about marriage, "Light Years," James Salter writes: "For whatever we do, even whatever we do not do prevents us from doing its opposite. Acts demolish their alternatives, that is the paradox." Watching our peers' lives is the closest we can come to a glimpse of the parallel universes in which we didn't ruin that relationship years ago, or got that job we applied for, or got on that plane after all. It's tempting to read other people's lives as cautionary fables or repudiations of our own.
Perhaps the only response is that, while so close to us in so many respects, friends may be poor matches for gauging these kinds of effects. In any case, "Acts demolish their alternatives, that is the paradox" is the best description of the problem of causal inference that I have seen.
Posted by Matt Blackwell at 4:19 PM | Comments (0)
October 13, 2009
We hope you can join us at the Applied Statistics workshop this Wednesday, October 14th at 12 noon, when we will be happy to have Weihua An, a graduate student in the Sociology Department here at Harvard. Weihua will be presenting "Bayesian Propensity Score Estimators: Simulations and Applications." He has provided the following abstract:
Despite their popularity, conventional propensity score estimators (PSEs) do not take into account the estimation uncertainties in the propensity score into causal inference. This paper develops Bayesian propensity score estimators (BPSEs) to model the joint likelihood of both the outcome and the propensity score in one step, which naturally incorporate such uncertainties into causal inference. Simulations show that PSEs treating estimated propensity scores as if they were known will overestimate the variation in treatment effects and result in overly conservative inference, whereas BPSEs will provide corrected variance estimation and valid inference. Compared to other direct adjustment methods (e.g., Abadie and Imbens 2009), BPSEs are guaranteed to provide positive variance estimation, more reliable in small samples, and more flexible to contain complex propensity score models. To illustrate the proposed methods, BPSEs are applied to evaluating a job training program.
Posted by Matt Blackwell at 12:53 AM | Comments (0)
October 9, 2009
We are a few days late to comment on the story of Senator Tom Coburn's amendment to the Commerce, Justice and Science Appropriations Bill to cut all National Science Foundation funding for the political science program and any of its missions. Choice quote (of which there are many): "...it is difficult, even for the most creative scientist, to link NSF's political science findings to the advancement of cures to cancer or any other disease." Snap.
This has received attention from the social science community and others. Even Paul Krugman, mentioned in Coburn's press release as an example of (wasteful? political?) NSF funding, has something to say about it. There's no need to rehash the arguments here, which ever-so-nicely point out that Senator Coburn doesn't really know what he's talking about nor do his arguments make a whole lot of sense.
Regardless of the arguments, I just wanted to put a graph up to put all of this in perspective. In the 111th Congress, Coburn has had very little success with his amendments:

Seven of the rejections are instances when Coburn's amendment was tabled without discussion. Most of the rejections have been of proposed budget cuts or bans on funding for certain projects. And this is just in this year. Out of all the roll call votes on Coburn-sponsored amendments in the Senate over his tenure, only 8 out of 68 have actually passed.
I understand trying to tackle his critiques, as they track with an internal debate already in the discipline. But I think it may be a tad knee-jerk to start letter-writing campaigns to our Senators. Tom Coburn knows that putting out no-win amendments is a great way to take positions in the Senate without committing to anything. Minority amendments are a costless signal of the blandest kind--even a political scientist can see that.
Posted by Matt Blackwell at 12:21 PM | Comments (6)
October 6, 2009
Just in time for Halloween, a study from the British Journal of Psychiatry by Moore, Carter and van Goozen that uses data from the British Cohort Study to estimate the effect of daily candy intake on adult violent behavior.
They find that 10-year-olds who ate candy daily were much more likely to be convicted of a violent crime by age 34 than those who did not eat candy daily. They cite this as evidence that childhood diet has an effect on adult behavior. One of their hypothesized mechanisms is that using candy as a reward for children (e.g. for behavior modification) inhibits the child's ability to delay gratification. And there is evidence that children who have problems with delayed gratification tend to score lower on a host of measures, including the SATs (see also: the marshmallow studies).
The longitudinal data gives them leverage. For instance, the authors are able to control for parenting style at age 5 along with other variables, such as various scales of behavior problems or mental abilities at age 5 (some of these were discarded in the final analysis because of their variable selection rules). These ease my main concern that "problem children" might lead to a certain type of parenting and also indicate a propensity for violent adult behavior. Their controls help to eliminate this possibility (though, I will say that I am not familiar with this literature and they use fairly complicated scales to measure these concepts).
Strangely, at least to me, they do not seem to control for parental income or socio-economic class. I have a few ideas as to why this might matter. First, candy is relatively cheap compared to a good diet, thus poorer families might be forced to choose the cheaper option when feeding their children. Second, financial pressures lead to time pressures, which could force parents to take shortcuts--feeding their children junk food because it is quick or using it to induce behavior because it is easy. Thus, parental income may matter greatly for candy intake and it also may increase propensity to commit violent crimes. I am not certain this is true, but it seems plausible and unmentioned in the paper. Even if the finding is not causal, however, it is still interesting.
Posted by Matt Blackwell at 1:48 PM | Comments (0)
October 5, 2009
Please join us this Wednesday, October 7th at the Applied Statistics workshop when we will be happy to have Jamie Robins, the Mitchell L. and Robin LaFoley Dong Professor of Epidemiology here at Harvard, who will be presenting on "Estimation of Optimal Treatment Strategies from Observational Data with Dynamic Marginal Structural Models." Jamie has passed along a related paper with the following abstract:
We review recent developments in the estimation of an optimal treatment strategy or regime from longitudinal data collected in an observational study. We also propose novel methods for using the data obtained from an observational database in one health-care system to determine the optimal treatment regime for biologically similar subjects in a second health-care system when, for cultural, logistical, or financial reasons, the two health-care systems differ (and will continue to differ) in the frequency of, and reasons for, both laboratory tests and physician visits. Finally, we propose a novel method for estimating the optimal timing of expensive and/or painful diagnostic or prognostic tests. Diagnostic or prognostic tests are only useful in so far as they help a physician to determine the optimal dosing strategy, by providing information on both the current health state and the prognosis of a patient because, in contrast to drug therapies, these tests have no direct causal effect on disease progression. Our new method explicitly incorporates this no direct effect restriction.
A copy of the paper is also available.
The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.
Posted by Matt Blackwell at 11:31 AM | Comments (0)
October 1, 2009
A group of students from the Machine Learning department at Carnegie Mellon took to the streets last week to protest at the G20 summit in Pittsburgh. I am afraid that their issues were not being taken seriously inside the summit. There's a first hand account and a photo set on flickr. I can't decide if my favorite is "Repeal Power Laws" or "Safer Data Mining".
Posted by Matt Blackwell at 3:35 PM | Comments (1)
September 29, 2009
Please join us at the Applied Statistics workshop this Wednesday, Sept 30th when we will be delighted to have the distinguished Susan Athey, Professor of Economics here at Harvard, presenting on "A Structural Model of Equilibrium and Uncertainty in Sponsored Search Advertising Auctions" (joint work with Denis Nekipelov). Susan has passed along the following abstract:
Sponsored links that appear beside internet search results on the major search engines are sold using real-time auctions, where advertisers place standing bids that are entered in an auction each time a user types in a search query. The ranking of advertisements and the prices paid depend on advertiser bids as well as "quality scores" that are assigned for each advertisement and user query. Existing models assume that bids are customized for a single user query and the associated quality scores; however, in practice that is impossible, as queries arrive more quickly than advertisers can change their bids, and advertisers cannot perfectly predict changes in quality scores. This paper develops a new model where bids apply to many user queries, while the quality scores and the set of competing advertisements may vary from query to query. In contrast to existing models that ignore uncertainty, which produce multiplicity of equilibria, we provide sufficient conditions for existence and uniqueness of equilibria, and we provide evidence that these conditions are satisfied empirically. We show that the necessary conditions for equilibrium bids can be expressed as an ordinary differential equation.
We then propose a structural econometric model. With sufficient uncertainty in the environment, the valuations are point-identified, otherwise, we propose a bounds approach. We develop an estimator for bidder valuations, which we show is consistent and asymptotically normal. We provide Monte Carlo analysis to assess the small sample properties of the estimator. We also develop a tractable computational approach to calculate counterfactual equilibria of the auctions.
Finally, we apply the model to historical data for several keywords. We show that our model yields lower implied valuations and bidder profits than approaches that ignore uncertainty. We find that bidders have substantial strategic incentives to reduce their expressed demand in order to reduce the unit prices they pay in the auctions, and in addition, these incentives are asymmetric across bidders, leading to inefficient allocation. We show that for the keywords we study, the auction mechanism used in practice is not only strictly less efficient than a Vickrey auction, but it also raises less revenue.
The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.
Posted by Matt Blackwell at 10:47 AM | Comments (0)
September 23, 2009
Wired has a fascinating article about the placebo effect and how the pharmaceutical companies deal with it. Not only is there evidence that the placebo effect is growing (some drugs approved in the 80s and 90s would struggle to pass the FDA now), but it turns out there may be significant geographic differences in the strength of the effect:
Assumption number one was that if a trial were managed correctly, a medication would perform as well or badly in a Phoenix hospital as in a Bangalore clinic. Potter discovered, however, that geographic location alone could determine whether a drug bested placebo or crossed the futility boundary. By the late '90s, for example, the classic antianxiety drug diazepam (also known as Valium) was still beating placebo in France and Belgium. But when the drug was tested in the US, it was likely to fail. Conversely, Prozac performed better in America than it did in western Europe and South Africa. It was an unsettling prospect: FDA approval could hinge on where the company chose to conduct a trial.
I'm not sure how you separate out the geographic confounding of the drug response versus the geographic confounding of the placebo response when looking at differences between the two, but it is interesting nonetheless.
(via kottke)
UPDATE: I just wanted to clarify why I thought this article was interesting so that folks do not think that I believe all the analysis contained in the article. The "effect" of the placebo treatment is clearly nonsensical, as effects always need to be about comparisons. What is identified from a clinical trial is the difference between the placebo response and the treatment response. My interpretation of the article (which is different from the author's interpretation) is that there is a lot of variation in that difference, both over time and over geography within the same drug. Since I have not read the academic articles that inform the article, I'm not sure if this variation is about what we would expect or not given sampling variation, but the possibility of a systematic relationship is intriguing.
As Kevin notes in the comments below, there are some that are criticizing the article. It took a bit of searching (not that simple!), but I found a good response:
http://scienceblogs.com/whitecoatunderground/2009/09/placebo_is_not_what_you_think.php
The author of the response claims that variation in the placebo response is simply sampling variance.
Posted by Matt Blackwell at 10:04 AM | Comments (6)
September 21, 2009
Please join us this Wednesday, September 23rd at the Applied Statistics Workshop when we will be fortunate to have Marshall Van Alstyne presenting "Network Structure and Information Advantage: The Diversity--Bandwidth Tradeoff." Marshall is an Associate Professor at Boston University in the Department of Management Information Systems as well as Research Associate at MIT's Center for E-Business. Marshall passed along the following abstract:
To get novel information, we propose that actors in brokerage positions face a tradeoff between network diversity and communication channel bandwidth. As the structural diversity of a network increases, the bandwidth of communication channels in that network decreases, creating countervailing effects on the receipt of novel information. This argument is based on the observation that diverse networks are typically made up of weaker ties, characterized by narrower communication channels across which less diverse information is likely to flow. The diversity-bandwidth tradeoff is moderated by (a) the degree to which topics are uniformly or heterogeneously distributed over the alters in a broker's network, (b) the dimensionality of the information in a broker's network (whether the total number of topics communicated by alters is large or small) and (c) the rate at which the information possessed by a broker's contacts refreshes or changes over time. We test this theory by combining social network and performance data with direct observation of information content flowing through email channels at a medium sized executive recruiting firm. These analyses unpack the mechanisms that enable information advantages in networks and serve as a 'proof-of-concept' for using email content data to analyze relationships among information flows, networks, and social capital.
A copy of the paper is also available.
The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.
Posted by Matt Blackwell at 10:26 AM | Comments (0)
September 15, 2009
Please join us tomorrow, September 16th when we are excited to have Ben Goodrich (Government/Social Policy) presenting "Bringing Rank-Minimization Back In: An Estimator of the Number of Inputs to a Data-Generating Process," for which Ben has provided the following abstract:
This paper derives and implements an algorithm to infer the number of inputs to a data-generating process from the outputs. Previous work dating back to the 1930s proves that this inference can be made in theory, but the practical difficulties have been too daunting to overcome. These obstacles can be avoided by looking at the problem from a different perspective, utilizing some insights from the study of economic inequality, and relying on modern computer technology. Now that there is a computational algorithm that can estimate the number of variables that generated observed outcomes, the scope for applications is quite large. Examples are given showing its use for evaluating the reliability of measures of theoretical concepts, empirically testing formal models, verifying whether there is an omitted variable in a regression, checking whether proposed explanatory variables are measured without error, evaluating the completeness of multiple imputation models for missing data, and facilitating the construction of matched pairs in randomized experiments. The algorithm is used to test the main hypothesis in Esping-Andersen (1990), which has been influential in the political economy literature, namely that various welfare-state outcomes are a function of only three underlying variables.
The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm. We hope you can make it.
Posted by Matt Blackwell at 10:47 AM | Comments (0)
September 8, 2009
Please join us tomorrow, September 9th for our first workshop of the year when we are happy to have Justin Grimmer presenting joint work with Gary King entitled "Quantitative Discovery from Qualitative Information: A General-Purpose Document Clustering Methodology."
Justin and Gary have provided the following abstract for their paper:
Many people attempt to discover useful information by reading large quantities of unstructured text, but because of known human limitations even experts are ill-suited to succeed at this task. This difficulty has inspired the creation of numerous automated cluster analysis methods to aid discovery. We address two problems that plague this literature. First, the optimal use of any one of these methods requires that it be applied only to a specific substantive area, but the best area for each method is rarely discussed and usually unknowable ex ante. We tackle this problem with mathematical, statistical, and visualization tools that define a search space built from the solutions to all previously proposed cluster analysis methods (and any qualitative approaches one has time to include) and enable a user to explore it and quickly identify useful information. Second, in part because of the nature of unsupervised learning problems, cluster analysis methods are not routinely evaluated in ways that make them vulnerable to being proven suboptimal or less than useful in specific data types. We therefore propose new experimental designs for evaluating these methods. With such evaluation designs, we demonstrate that our computer-assisted approach facilitates more efficient and insightful discovery of useful information than either expert human coders using qualitative or quantitative approaches or existing automated methods. We (will) make available an easy-to-use software package that implements all our suggestions.
The Applied Statistics workshop meets each Wednesday in room K-354, CGIS-Knafel (1737 Cambridge St). We start at 12 noon with a light lunch, with presentations beginning around 12:15 and we usually wrap up around 1:30 pm.
Posted by Matt Blackwell at 12:00 PM | Comments (1)
August 20, 2009
There was a time that the only place to find R help was through the R-help listserv. But things have changed pretty drastically in just the last year or so as R has gained users from all different disciplines. I wanted to just point out a few resources that I have found useful over the last few months.
The #rstats hashtag on Twitter has a good following and a number of consistent contributors. If you already use Twitter, this is a great way to hear about interesting new applications of R or the growing number of R tutorials and meetups (Los Angeles and New York have already had a few well attended meetups).
Partially born from the #rstats group is the R tag on StackOverflow, a website dedicated to asking and answering programming questions. The R questions have only recently started to appear on StackOverflow, but if it takes off, it might be a smarter way to match up R users who need help and R experts who can help. The site has voting on answers so that unhelpful or repetitive answers will be weeded out. And since all of this is on one website, searching through the questions is quite a bit easier than trying to track down an R-help thread from 2004. Exciting stuff.
Posted by Matt Blackwell at 1:55 PM
May 13, 2009
The social sciences have long embraced the idea of text-as-data, but in recent years, increasing numbers of quantitative researchers are investigating how to have computers find answers to questions in texts. This task might appear easy at the outset (as it apparently did to early researchers in machine translation), but, as we know, natural languages are incredibly complicated. In most of the applications in social science, analysts end up making a "bag of words" assumption--the relevant parts of a document are the actual words, not their order (this is not an unreasonable assumption, especially given the questions being asked).
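To make the "bag of words" assumption concrete, here is a minimal sketch in Python. The tokenizer and the two example sentences are my own illustrative choices, not taken from any particular application: each document is reduced to word counts, so two documents with the same words in a different order become indistinguishable.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count them.

    The regex tokenizer is a deliberately crude, hypothetical choice;
    real applications usually also stem words and drop stop words.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Two made-up sentences with opposite meanings but identical words.
doc_a = "The senator opposed the amendment"
doc_b = "The amendment opposed the senator"

# Word order is discarded, so the two representations are equal.
print(bag_of_words(doc_a) == bag_of_words(doc_b))
print(bag_of_words(doc_a).most_common(1))
```

The example also shows what the assumption throws away: who opposed whom is lost, which matters for some questions and not at all for others (say, classifying the topic of a speech).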
When I see applications of natural language processing (NLP) in the social sciences, my thoughts typically turn very quickly to its future. Computers are making strides at being able to understand, in some sense, what they are reading. Two recent articles, however, give a good overview of the challenges that NLP faces. First, John Seabrook of the New Yorker had an article last summer, Hello, Hal, which states the problem clearly:
The first attempts at speech recognition were made in the nineteen-fifties and sixties, when the A.I. pioneers tried to simulate the way the human mind apprehends language. But where do you start? Even a simple concept like "yes" might be expressed in dozens of different ways--including "yes," "ya," "yup," "yeah," "yeayuh," "yeppers," "yessirree," "aye, aye," "mmmhmm," "uh-huh," "sure," "totally," "certainly," "indeed," "affirmative," "fine," "definitely," "you bet," "you betcha," "no problemo," and "okeydoke"--and what's the rule in that?
The article is mostly about speech recognition, but it definitely hits the main points about why human-generated language is so tricky. The second article, in the New York Times recently, is a short story about Watson, the computer that IBM is creating to compete on Jeopardy! IBM is trying to push the field of Question Answering quite a bit forward with this challenge. The goal is to create a computer that you can ask a natural language question and get the correct answer. A quick story in the article indicates that they may have a bit to go:
In a demonstration match here at the I.B.M. laboratory against two researchers recently, Watson appeared to be both aggressive and competent, but also made the occasional puzzling blunder. For example, given the statement, "Bordered by Syria and Israel, this small country is only 135 miles long and 35 miles wide," Watson beat its human competitors by quickly answering, "What is Lebanon?"
Moments later, however, the program stumbled when it decided it had high confidence that a "sheet" was a fruit.
This whole Watson enterprise makes me wonder if there are applications for this kind of technology within the social sciences. Would this only be useful as a research aid, or are there empirical discoveries to be made with this? I suppose it comes down to this: if a computer could answer your question, what would you ask?
Posted by Matt Blackwell at 9:43 AM
March 25, 2009
Over on the polmeth mailing list there is a small discussion brewing about how to teach undergraduate methods classes. Much of the discussion is on how to manage the balance between computation and statistics. A few posters are using R as their main data analysis tool, which provoked others to comment that this might push a class too far away from its original intent: to learn research methods (although one teacher of R indicated that a bigger problem was the relative inability to handle .zip files). This got me thinking about how research methods, computing and statistics fit into the current education framework.
As a gross and unfair generalization, much of college is about learning how to take a set of skills and use them to make effective and persuasive arguments. In a literature class, for instance, one might use the skills of reading and writing to critically engage a text. In mathematics, one might take the "skill" of logic and use it to derive a proof.
The issue with introductory methods classes is that many undergraduates come into school without a key skill: computing. It is becoming increasingly important to have proficient computing skills in order to make cogent arguments with data. I wonder if it is time to rethink how we teach computing at lower levels of education to adequately prepare students for the modern workplace. There is often emphasis on using computers to teach students, but I think it will become increasingly important to teach computers to students. This way courses on research methods can focus on how to combine computing and statistics in order to answer interesting questions. We could spend more time matching tools to questions and less time simply explaining the tool.
Of course, my argument reeks of passing the buck. A broader question is this: where do data analysis and computing fit in the education model? Is this a more fundamental skill that we should build up in children earlier? Is it perfectly fine where it is, being taught in college?
Posted by Matt Blackwell at 3:08 PM
March 11, 2009
At today's Applied Statistics Workshop, Dan Hopkins gave a talk on contextual effects on political views in the United States and United Kingdom. Dan presented evidence that national political discussions increase the salience of local context for opinion formation. Namely, those who live in areas of high immigrant populations tend to react more strongly to changes in the national discussion of immigration than others. The data and analysis are interesting, but the talk's derailment interested me slightly more.
The derailment involved Dan's choice of method, a version of the difference-in-differences (DID) estimator, and how to represent it in the Rubin Causal Model. Putting this model in terms of the usual counterfactual framework is slightly nuanced, but not impossible.
The typical setup for a DID estimator is that there are two groups G = {0,1} and two time periods T={0,1}. Between time 0 and time 1, some policy is applied to group 1 and not applied to group 0. What we are interested in is the effect of that policy. For instance, if Y is the outcome in time 1 and Y(1) is the potential outcome (in time 1) in the counterfactual world where we forced the policy to be implemented, then we can define a possible quantity of interest: the average treatment effect on the treated (ATT): E[Y(1) - Y(0) | G = 1].
We could proceed from here by simply making an ignorability assumption about the treatment assignment. Unfortunately, policies are often not randomly assigned to the groups, and the groups may differ in ways that affect the outcome. For instance, an example from the Wooldridge textbook is the effect of the placement of a trash processing facility on house prices. The two groups in this case are "houses close to the facility" and "houses far from the facility" and the policy is the facility's placement. It would be borderline insane to imagine city planners randomly assigning the location of the facility, and these two groups will differ in ways that are closely related to house prices (I don't think I have seen too many newly minted trash dumps in rich neighborhoods). Thus, we cannot simply use the observed data from the control group to make the counterfactual inference.
What we can do, however, is look at how changes in the dependent variable occur for the two groups and use these changes to identify the model. For instance, if we assume that X is the outcome in period 0, then the DID identifying assumption is
E[Y(0) - X(0) | G = 1] = E[Y(0) - X(0) | G = 0],
which is simply saying that the change in potential outcomes under control is the same for both groups. Or, that group 1 would have followed the same "path" as group 0 if they had not received treatment. With this assumption in hand, we can identify the ATT as the typical DID estimator
E[Y(1) - Y(0) | G = 1] = (E[Y | G = 1] - E[X | G = 1]) - (E[Y | G = 0] - E[X | G = 0]).
The proof is short and can be found in Abadie (2005) and Athey & Imbens (2006) (these papers also go into considerable depth on how to extend these simple schemes).
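As a rough check on the identification result, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the function names `simulate` and `did_estimate`, the size of the true effect, the common trend, and the gap in levels between the groups. The simulated groups differ in levels but share the same change under control, so the DID identifying assumption holds by construction.

```python
import random
from statistics import mean

def simulate(n=20000, att=2.0, trend=1.0, gap=5.0, seed=0):
    """Simulate (g, x, y) rows: group, period-0 outcome, period-1 outcome.

    All parameter values are hypothetical. Groups differ in levels (gap)
    but share the same change under control (trend)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g = rng.randint(0, 1)
        x = gap * g + rng.gauss(0, 1)        # period-0 outcome, X
        y0 = x + trend + rng.gauss(0, 1)     # period-1 outcome under control, Y(0)
        y = y0 + att * g                     # observed Y: the policy shifts group 1 only
        rows.append((g, x, y))
    return rows

def did_estimate(rows):
    """(E[Y|G=1] - E[X|G=1]) - (E[Y|G=0] - E[X|G=0])."""
    def avg_change(group):
        return mean(y - x for g, x, y in rows if g == group)
    return avg_change(1) - avg_change(0)

rows = simulate()
estimate = did_estimate(rows)

# Naive comparison of period-1 outcomes, which also picks up the gap in levels.
naive = (mean(y for g, x, y in rows if g == 1)
         - mean(y for g, x, y in rows if g == 0))
```

Under these assumptions `estimate` should land close to the assumed effect of 2, while `naive` should land near 7 because it absorbs the difference in levels between the groups.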
Two issues always arise for me when I see DID estimators. First is the incredibly difficult task of arguing that the policy is the only thing that changed between time 0 and time 1 with respect to the two groups. That is, perhaps the city also placed a freeway through the part of town where the trash processing facility was built at the same time. The DID estimator would not be able to differentiate the two effects. Thus, it is up to the practitioner to argue that all other changes in the period are orthogonal to the two groups. Second, I have very little insight about how identification or estimands change as we move from a simple non-parametric world to a highly parametric world (where most applied researchers live). Do inferences change when we move away from simple conditional expectations, and if so, how?
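The first issue can be made concrete with a small arithmetic sketch in Python. All of the numbers below are made up for illustration (mean house prices in thousands of dollars, an assumed facility effect, and an assumed freeway effect), and the helper `did` is a hypothetical name.

```python
def did(pre_treated, post_treated, pre_control, post_control):
    """Difference-in-differences computed from four group-period means."""
    return (post_treated - pre_treated) - (post_control - pre_control)

# Hypothetical quantities, chosen only to illustrate the point.
facility_effect = -20.0   # assumed true effect of the facility on nearby houses
freeway_effect = -15.0    # assumed effect of a freeway built nearby in the same period
common_trend = 10.0       # assumed citywide price change shared by both groups

pre_near, pre_far = 150.0, 200.0
post_far = pre_far + common_trend

# Scenario 1: only the facility changes for houses near the site.
post_near_clean = pre_near + common_trend + facility_effect
# Scenario 2: the freeway is built in the same area at the same time.
post_near_confounded = post_near_clean + freeway_effect

clean = did(pre_near, post_near_clean, pre_far, post_far)
confounded = did(pre_near, post_near_confounded, pre_far, post_far)
```

In the first scenario the DID recovers the assumed facility effect (`clean` is -20.0). In the second it returns the sum of the two effects (`confounded` is -35.0), and nothing in the four means can separate them.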
Posted by Matt Blackwell at 2:13 PM
February 25, 2009
I've been doing some work on diagnostics for missing data issues and one that I have found particularly useful and enlightening has been what I've been calling a "missingness map." In the last few days, I used it on some World Bank data I downloaded to see what missingness looks like in a typical comparative political economy dataset.

The rows (y-axis) here are country-years and the columns (x-axis) are variables. We draw a red square where the country-year-variable cell is missing and a light green square where the cell is observed. We can see immediately that a whole set of variables in the middle columns are almost always unobserved. These are variables measuring income inequality and they are known to have extremely poor coverage. This plot very quickly shows us how listwise deletion will affect our analyzed sample and how the patterns of missingness occur in our data. For example, in these data, it seems that if GDP is missing, then many of the other variables, such as imports and exports, are also missing. I think this is a neat way to get a quick, broad view of missingness.
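The idea behind the map is simple enough to sketch in a few lines of Python. This is only an illustration on made-up toy data, not the code behind the plot: the variable names, the values, and the helpers `missingness_matrix` and `render` are all hypothetical, and the "map" is drawn with text characters instead of colored squares.

```python
# Hypothetical toy data: one row per country-year, None marks a missing cell.
variables = ["gdp", "imports", "exports", "gini"]
rows = [
    ("A", 1980, {"gdp": None, "imports": None, "exports": None, "gini": None}),
    ("A", 1990, {"gdp": 1.2, "imports": 0.3, "exports": 0.4, "gini": None}),
    ("B", 1980, {"gdp": 2.0, "imports": 0.5, "exports": 0.6, "gini": None}),
    ("B", 1990, {"gdp": 2.4, "imports": 0.6, "exports": 0.7, "gini": 35.0}),
]

def missingness_matrix(rows, variables):
    """True where the country-year-variable cell is missing."""
    return [[values[v] is None for v in variables] for _, _, values in rows]

def render(matrix):
    """One text line per row: 'X' for a missing cell, '.' for an observed one."""
    return ["".join("X" if cell else "." for cell in line) for line in matrix]

matrix = missingness_matrix(rows, variables)
text_map = render(matrix)

# Rows that would survive listwise deletion (no missing cell in the row).
complete_rows = sum(not any(line) for line in matrix)

# Share of rows missing for each variable (the column view of the map).
missing_share = {
    v: sum(line[j] for line in matrix) / len(matrix)
    for j, v in enumerate(variables)
}
```

On this toy data only one of the four rows is complete, and the inequality column is missing in three of four rows, which mirrors the pattern described above on a much smaller scale.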
(Another map and some questions after the jump...)
We can also change the ordering of the rows to give a better sense of missingness. For the World Bank data, it is wise to re-sort the data by time and see how missingness changes over time.

A clear pattern emerges that the World Bank has better and better data as we move forward in time (the map becomes more "clear"). This is not surprising, but it is an important point when, say, deciding the population under study in a comparative study. Clearly, listwise deletion will radically change the sample we analyze (the answers will be biased toward more recent data, at the very least). The standard statistical advice of imputation or data augmentation is tricky as well here because we need to choose what to impute. Should we carry forth with imputation given that income inequality measures seem to be completely unavailable before 1985? If we remove observations before this, how do we qualify our findings?
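A numeric companion to the time-sorted map is the share of missing cells in each year. The sketch below assumes hypothetical `(country, year, values)` records with `None` for missing cells; the data and the function names `by_time` and `share_missing_by_year` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical records; None marks a missing cell.
records = [
    ("A", 1995, {"gdp": 1.5, "gini": 40.0}),
    ("B", 1975, {"gdp": None, "gini": None}),
    ("A", 1975, {"gdp": 1.0, "gini": None}),
    ("B", 1995, {"gdp": 2.5, "gini": None}),
]

def by_time(records):
    """Re-sort rows by year, then country, as in the time-ordered map."""
    return sorted(records, key=lambda r: (r[1], r[0]))

def share_missing_by_year(records):
    """Fraction of cells that are missing within each year."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for _, year, values in records:
        total[year] += len(values)
        missing[year] += sum(v is None for v in values.values())
    return {year: missing[year] / total[year] for year in sorted(total)}

ordered = by_time(records)
shares = share_missing_by_year(records)
```

A falling share over time, as in this toy example, is the numeric version of the map becoming more "clear", and it gives a rough basis for deciding where to truncate the sample if one goes that route.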
Any input on the missingness map would be amazing, as I am trying to add it as a diagnostic to a new version of Amelia. What would make these plots better?
Posted by Matt Blackwell at 2:58 PM
February 3, 2009
You can now answer that question and so many more. The Japanese Statistics Bureau conducts a survey every five years called the "Survey on Time Use and Leisure Activities" where they give people journals to record their activities throughout the day. Thus, they have a survey of what people in Japan are doing at any given time of the day. This is fun data in and of itself, but it was made downright addictive by Jonathan Soma, who created a slick Stream Graph based on the data. (via kottke)
There are actually three Stream Graphs: one for the various activities, another for how the current activity differs between sexes, and a final one for how the current activity breaks down by economic status. Thus, the view contains not only information about daily routines, but also how those routines vary across sex and economic status. For instance, gardening tends to happen in the afternoon and evening at around equal intensity and is fairly evenly distributed between men and women. Household upkeep, on the other hand, is done mostly by women and mostly in the morning. This visualization is so compelling, I think, because it allows for deep exploration of rich and interesting data (to be honest, though, I find the economic status categories a little strange and not incredibly useful).
Two points come to mind when seeing this. First, it would be fascinating to see how these would look across countries, even if it were just one other country. The category of this survey on the website for the Japanese Bureau of Statistics is "culture." Seeing the charts actually makes me wonder how different this culture is from that of other countries. Soma does point out, though, that Japanese men are rather interested in "productive sports," which is perhaps unique to the island.
Second, I think that Stream Graphs might be useful for other time-based data types. Long-term survey projects, such as the General Social Survey, track respondent spending priorities. It seems straightforward to use a Stream Graph to capture how priorities shift over time. Other implemented Stream Graphs are the NYT box-office returns data and Lee Byron's last.fm playlist data. This graph type seems best suited for showing how different categories change over time, how rapidly they grow, and how quickly they shrink. They also seem to require some knowledge of Processing. There are still some open questions here: What other types of social science data might these charts be useful for? Should we incorporate uncertainty, and if so, how? (Soma warns that the Japan data is rather slim on the number of respondents.)
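For a sense of what the layout involves, here is a minimal sketch in Python of the simplest version: layers stacked around a centered (symmetric) baseline. This is an assumption made for illustration and not necessarily how Soma's or Byron's graphs are computed; the function name `stream_layers` and the toy numbers are hypothetical, and the output is only the band boundaries, not a drawing.

```python
def stream_layers(series):
    """Stack categories around a centered baseline.

    series: dict mapping a category name to a list of non-negative values,
    one per time step. Returns a dict mapping each name to a list of
    (lower, upper) band boundaries, one pair per time step."""
    names = list(series)
    n_steps = len(next(iter(series.values())))
    bounds = {name: [] for name in names}
    for t in range(n_steps):
        total = sum(series[name][t] for name in names)
        lower = -total / 2.0              # symmetric baseline: the stack is centered on zero
        for name in names:
            upper = lower + series[name][t]
            bounds[name].append((lower, upper))
            lower = upper                 # the next layer sits on top of this one
    return bounds

# Hypothetical activity counts at three times of day.
activity = {
    "sleep":   [8, 2, 6],
    "work":    [0, 6, 2],
    "leisure": [2, 2, 2],
}
layers = stream_layers(activity)
```

Each band's thickness at a time step is the category's value, so growth and shrinkage show up as the band widening and narrowing. A smoother would normally be applied before plotting.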
Also: October 18th is Statistics Day in Japan. There are posters. And a slogan: "Statistical Surveys Owe You and You Owe Statistical Data"!
Posted by Matt Blackwell at 5:37 PM
November 19, 2008
I like the noise of democracy.
--James Buchanan
There has been quite a bit of popular and scholarly interest in the mechanics of voting over the last decade, especially after the 2000 Florida Presidential election threw the concepts of butterfly ballots, residual votes and chads into the spotlight. The recount of the U.S. Senate race in Minnesota between Norm Coleman and Al Franken has brought the voting-error fun right on back. Minnesota Public Radio has compiled a list of challenged ballots for you to judge (via kottke). You can even use the Minnesota state statutes governing voter's intent. I think the write-in for "Lizard People" is one of the best.
It is refreshing to see that in spite of all of the attention toward electronic voting problems, the old paper method can still make a mess. Things have changed a bit since the blanket ballots of the nineteenth century, but ballot design still has quite a few problems. The most obvious case is the butterfly ballot of Palm Beach County in 2000, which almost certainly changed the outcome of the presidential election (see Wand et al. (2001)). Laurin Frisina, Michael Herron, James Honaker, and Jeff Lewis recently published an article in the Election Law Journal about undervoting in Florida's 13th Congressional District, a phenomenon they attribute to poor (electronic) ballot design. Other examples abound.
The good folks at AIGA put together an interactive guide for designing ballots and the problems with current designs. A lot of these suggestions are really spot on and would help to solve a lot of the errors in the Minnesota ballots. Especially important are the "if you make a mistake..." guidelines. This was posted at the New York Times in late August, which seems to me to be plenty of time for registrars to get these issues worked out. On the other hand, some of the Minnesota ballot problems do seem to transcend clear design. Depressingly, this probably brings a smile to the faces of anti-plebeian elites.
If you are a sucker, like me, for images of old ballots, you can find plenty of old California ballots at the voting technology project. Melanie Goodrich put this together. The real gem of this collection is the Regular Cactus Ticket of 1888.
Posted by Matt Blackwell at 2:10 PM