April 25, 2006

Open and Transparent Data

You, Jong-Sung

There was a big scandal in scientific research recently. Dr. Hwang Woo-suk of Seoul National University in Korea announced last June that he and his team had cloned human embryonic stem cells from 11 patients. It was a remarkable breakthrough in stem cell research, and many people expected that he would eventually win a Nobel Prize. According to a Seoul National University panel, however, Hwang's team intentionally fabricated key data in two landmark papers on human embryonic stem cells. Prosecutors are now investigating his team’s alleged fabrication of data and violation of bioethics law.

Remarkably, the prestigious journal Science was unable to detect the fabrication either before or after publishing the articles. This is understandable, considering that peer reviewers typically examine the presented analysis of the data but neither receive nor examine the data themselves. Even more surprisingly, most of the 26 co-authors of the June 2005 article were unaware of the data fabrication. It was revealed only through an inside whistleblower, who was the second author of the earlier article, and through a team of investigative journalists.

This incident exposes the weakness and vulnerability of the review system of academic journals. Indeed, there have been many fraud cases in the history of scientific research, and Dr. Hwang has just added one more. Although outright fabrication may not be very common, errors in data and data analysis may be much more common than most people assume.

I was struck by numerous errors that were found by students of Gov 2001 who replicated the analysis of an article published in a prominent social science journal. Many of the errors are probably benign and not critical to their key findings, but some errors may be critical and even deliberate. It can be tempting to distort the data or results of data analysis when a researcher has spent much time and energy to find evidence to support his or her hypothesis and the results are close but fall short of significance.

In his entry entitled Citing and Finding Data, Gary King discussed the [in]ability to reliably cite, access, and find quantitative data, all of which remain in an entirely primitive state of affairs. Sebastian Bauhoff also stressed the need for making data available in his entry Data Availability. I could not agree with them more. If journals required authors to submit their data along with the manuscript, and published the data used in each article as an online appendix, it would certainly reduce errors in data and data analysis as well as spur further research. This should apply to qualitative data (such as interview transcripts) as well as quantitative data.

Posted by Jong-sung You at 6:00 AM

April 14, 2006

Statistical Humor

You, Jong-Sung

Here are some good statistics jokes for all of you.

  • "When she told me I was average, she was just being mean."
  • "Old statisticians never die-- they just become insignificant." - Gary Ramseyer, First Internet Gallery of Statistical Jokes
  • "Three statisticians go deer hunting with bows and arrows. They spot a big buck and take aim. One shoots and his arrow flies off three meters to the right. The second shoots and his arrow flies off three meters to the left. The third statistician jumps up and down yelling, 'We got him! We got him (on average)!'" - Richard Lomax and Seyed Moosavi, 1998, Using Humor to Teach Statistics: Must They Be Orthogonal?

Is the use of humor an effective way of teaching statistics? Lomax and Moosavi (1998), citing J. Bryant and D. Zillmann (1988), suggest that there is little empirical evidence that humor (1) increases student attention, (2) improves the classroom climate, or (3) reduces tension. Fortunately, however, the same research indicates that humor does (1) increase enjoyment and (2) motivate students to achieve more. Hence, it may not be a bad idea to incorporate some statistical jokes (their article and Gary Ramseyer's website are two good sources).

This isn't a joke as such, but here is another interesting statistical dialogue from Lomax and Moosavi:

Q. I read that a sex survey said the typical male has six sexual partners in his life and the typical female has two. Assuming the typical male is heterosexual, and since the number of males and females is approximately equal, how can this be true?

A. You’ve assumed that "typical" refers to the arithmetical average of the numbers. But "average" also means "middle" and "most common". (Statisticians call these three kinds of averages the mean, the median and the mode, respectively.) Here’s how the three are used: Say you’re having five guests at a dinner party. Their ages are 100, 99, 17, 2, and 2. You tell the butler that their average age is 44 (100 + 99 + 17 + 2 + 2 = 220; 220 ÷ 5 = 44). Just to be safe, you tell the footman their average age is 17 (the age right in the middle). And to be sure everything is right, you tell the cook their average age is 2 (the most common age). Voila! Everyone is treated to pureed peas accompanied by Michael Jackson’s latest CD, followed by a fine cognac. In the case of the sex survey, "typical" may have referred to "most common", which would fit right in with all the stereotypes. (That is, if you believe sex surveys.)
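For anyone who wants to check the dinner-party arithmetic, here is a minimal Python sketch using the standard `statistics` module. The variable names are my own illustration and do not come from Lomax and Moosavi.

```python
from statistics import mean, median, mode

# Ages of the five dinner guests from the dialogue above.
ages = [100, 99, 17, 2, 2]

butler_average = mean(ages)     # arithmetic mean: 220 / 5
footman_average = median(ages)  # middle value once the ages are sorted
cook_average = mode(ages)       # most common value

print(butler_average, footman_average, cook_average)  # prints: 44 17 2
```

The same list yields three different "averages", which is the whole point of the joke.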

Posted by Jong-sung You at 6:00 AM

February 14, 2006

Is Military Spending Justified by Security Threats?

You, Jong-Sung

At the recent ASSA meeting in Boston, Linda Bilmes, a Kennedy School lecturer, and Joseph Stiglitz, a Columbia professor and Nobel Prize-winning economist, presented an interesting paper, “The Economic Costs of the Iraq War.” They estimated that the total economic costs of the war, including direct costs and macroeconomic costs, lie between $1 and $2 trillion. Interestingly, the “$2 trillion” figure had already been projected by William Nordhaus, Yale professor of economics, even before the war. In his paper, “The Economic Consequences of a War With Iraq” (2002), he predicted that the costs of the Iraq war would range from $99 billion, if the war were short and favorable, to $1,924 billion, if the war were protracted and unfavorable.

At the same ASSA meeting, Nordhaus raised important questions about excessive military spending in his paper entitled “The Problem of Excessive Military Spending in the United States.” I am providing some excerpts from the paper below.

Nordhaus notes, “The U.S. has approximately half of total national security spending for the entire world. Total outlays for ‘defense’ as defined by the Congressional Budget Office were $493 billion for FY2005, while the national accounts concept of national defense totaled around $590 billion for 2005. It constitutes about $5000 per family. By comparison, the Federal government current expenditures in 2004 were $14 billion for energy, $4.7 billion for recreation and culture, and $1.8 billion for transit and railroads.” The question is whether the US is earning a good return on its national-security ‘investment,’ for it is clearly an investment in peace and safety. The bottom line, he argues, is probably not.

Nordhaus asks whether it is plausible that the United States faces a variety and severity of objective security threats equal to those facing the rest of the world put together. He then points to the following facts. “Unlike Israel, no serious country wishes to wipe the U.S. off the face of the earth. Unlike Russia, India, China, and much of Europe, no one has invaded the U.S. since the nineteenth century. We have common borders with two friendly democratic countries with which we have fought no wars for more than a century.”

He raises the issue of strategic and budgetary inertia. “Many costly programs are still in place a decade and a half after the end of the cold war. The U.S. has around 6000 deployed nuclear weapons, and Russia has around 4000 weapons. There can be little doubt that the world and the U.S. are more vulnerable rather than less vulnerable with such a large stock of weapons, yet they survive in the military budget. There is a kind of security Laffer curve in nuclear material, where more is less in the sense that the more nuclear material floating around the more difficult it is to control it and the more likely it is that it can be stolen.” He argues that today’s slow decline in spending on obsolete systems arises largely because budgetary pressures on military spending are so weak and political pressures virtually non-existent: the ‘loose budget constraints.’

He suggests that an excessive military budget is not just economic waste; it also causes problems rather than solving them, by tempting leaders to use an existing military capability. “Countries without military capability cannot easily undertake ‘wars of choice’ or wars whose purposes evolve, as in Iraq, from dismantling weapons of mass destruction to promoting democracy. To the extent that Vietnam and Iraq prove to be miscalculations and strategic blunders, the ability to conduct them is clearly a cost of having a large military budget.”

A final concern he raises is that the large national-security budget leads to loose budget constraints and poor control over spending and programs. “Congress exercises no visible oversight on defense spending and a substantial part is secret. Some of the abuses in recent military activities arise because Congress cannot possibly effectively oversee such a large operation where programs involving $24 billion are enacted as a single line item. Even worse, how can citizens or ordinary members of Congress understand the activities of an agency like the National Security Agency, whose spending level and justification are actually classified?”

Posted by Jong-sung You at 6:00 AM

January 17, 2006

Network Analysis and Detection of Health Care Fraud

You, Jong-Sung

In my earlier entries on “Statistics and Detection of Corruption” and “Missing Women and Sex-Selective Abortion,” I demonstrated that examining statistical anomalies can be a useful tool for detecting crime and corruption. In those cases, the binomial probability distribution was a very useful tool.

Professor Malcolm Sparrow at the Kennedy School of Government shows how network analysis can be used to detect health care fraud in his book, License to Steal: How Fraud Bleeds America's Health Care System (2000). He gives an example of network analysis performed within Blue Cross/Blue Shield of Florida in 1993.

An analyst explored the network of patient-provider relationships with twenty-one months of Medicare data, treating a patient as linked to a provider if the patient had received services during the twenty-one-month period. The resulting patient-provider network had 188,403 links within it. The analyst then looked for unnaturally dense cliques within that structure. He found a massive one. “At its densest core, the cluster consisted of a specific set of 122 providers, linked to a specific set of 181 beneficiaries. The (symmetric) density criteria between these sets were as follows:

A. Any one of these 122 providers was linked with (i.e., had billed for services for) a minimum of 47 of these 181 patients.
B. Any one of these 181 patients was linked with (i.e., had been “serviced” by) a minimum of 47, and an average of about 80, of these providers.”

After the analyst found this unnaturally dense clique, field investigations confirmed a variety of illegal practices. “Some providers were indeed using the lists of patients for billing purposes without seeing the patients. Other patients were being paid cash to ride a bus from clinic to clinic and receive unnecessary tests, all of which were then billed to Medicare.”

Professor Sparrow suggests that many ideas and concepts from network analysis can be useful in developing fraud-detection tools, in particular for monitoring organized and collusive multiparty frauds and conspiracies.
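To make the density criteria concrete, here is a rough Python sketch of one way to look for such a core. It is not the method the Blue Cross/Blue Shield analyst used, which Sparrow's account does not spell out; it is a simple iterative pruning, my own assumption, that keeps only providers linked to at least `min_patients` surviving patients and patients linked to at least `min_providers` surviving providers. The function name, parameter names, and toy data are all hypothetical.

```python
from collections import defaultdict

def dense_core(links, min_patients, min_providers):
    """Prune a patient-provider network down to a mutually dense core.

    links: iterable of (patient, provider) pairs.
    Returns (providers, patients) such that every remaining provider is
    linked to >= min_patients remaining patients and every remaining
    patient is linked to >= min_providers remaining providers.
    """
    by_provider = defaultdict(set)
    by_patient = defaultdict(set)
    for patient, provider in links:
        by_provider[provider].add(patient)
        by_patient[patient].add(provider)

    providers = set(by_provider)
    patients = set(by_patient)
    changed = True
    while changed:
        changed = False
        # Drop providers with too few links into the surviving patient set.
        weak = {p for p in providers
                if len(by_provider[p] & patients) < min_patients}
        if weak:
            providers -= weak
            changed = True
        # Drop patients with too few links into the surviving provider set.
        weak = {b for b in patients
                if len(by_patient[b] & providers) < min_providers}
        if weak:
            patients -= weak
            changed = True
    return providers, patients

# Toy data: patients a, b, c each see providers X, Y, Z; d and W are stragglers.
toy_links = [(pt, pr) for pt in "abc" for pr in "XYZ"] + [("d", "X"), ("a", "W")]
core_providers, core_patients = dense_core(toy_links, 3, 3)
print(sorted(core_providers), sorted(core_patients))
```

With the Florida figures, both thresholds would be 47. A non-empty result only says that a suspiciously dense cluster exists; as in the case above, it takes field investigation to establish what is actually going on.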

Posted by Jong-sung You at 2:36 AM

January 10, 2006

Statistics and Detection of Corruption

You, Jong-Sung

Duggan and Levitt's (2002) article on "corruption in sumo wrestling" demonstrates how statistical analysis may be used to detect crime and corruption. Sumo wrestling is a national sport of Japan. A sumo tournament involves 66 wrestlers participating in 15 bouts each. A wrestler with a winning record rises up the official ranking, while a wrestler with a losing record falls in the rankings. An interesting feature of sumo wrestling is the existence of a sharp nonlinearity in the payoff function. There is a large gap in the payoffs for seven wins and eight wins. The critical eighth win garners a wrestler roughly four times the value of the typical victory.

Duggan and Levitt uncover striking evidence that match rigging takes place in the final days of sumo tournaments. They find that the wrestler who is on the margin for an eighth win is victorious with an unusually high frequency, but the next time those same two wrestlers face each other, it is the opponent who has a very high win percentage. This suggests that part of the currency used in match rigging is the promise of throwing a future match in return for taking a fall today. They present a histogram of final wins for the 60,000 wrestler-tournament observations between 1989 and 2000, in which a wrestler completes exactly 15 matches. Approximately 26.0 percent of all wrestlers finish with eight wins, compared to only 12.2 percent with seven wins. The binomial distribution predicts that these two outcomes should occur with an equal frequency of 19.6 percent. The null hypothesis that the probability of seven and eight wins is equal can be rejected at resounding levels of statistical significance. They report that two former sumo wrestlers have made public the names of 29 wrestlers whom they allege to be corrupt and 14 wrestlers who they claim refuse to rig matches. Interestingly, they find that wrestlers identified as "not corrupt" do no better in matches on the bubble than in typical matches, whereas those accused of being corrupt are extremely successful on the bubble.
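The 19.6 percent benchmark is easy to reproduce. The sketch below assumes, as the binomial benchmark does, that each of the 15 bouts is an independent even-odds contest; the function name is my own, and this is only the benchmark calculation, not a reproduction of Duggan and Levitt's analysis.

```python
from math import comb

BOUTS = 15  # each wrestler fights 15 bouts per tournament

def prob_wins(k, n=BOUTS):
    """P(exactly k wins in n bouts) if every bout is a fair coin flip."""
    return comb(n, k) / 2**n

p7 = prob_wins(7)
p8 = prob_wins(8)
print(f"P(7 wins) = {p7:.3f}, P(8 wins) = {p8:.3f}")  # both print as 0.196
```

Because choosing 7 wins out of 15 is the same count as choosing 8, the two probabilities are identical under this assumption, which is what makes the observed 12.2 versus 26.0 percent split so striking.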

A similar kind of empirical study of corruption dates to 1846, when Quetelet documented that the heights of French men measured at conscription were normally distributed except for a puzzling shortage of men measuring 1.57–1.597 meters (roughly 5 feet 2 inches to 5 feet 3 inches) and an excess of men below 1.57 meters. Not coincidentally, the minimum height for conscription into the Imperial army was 1.57 meters (as cited in Duggan and Levitt 2002). These examples show that detecting statistical anomalies can yield compelling evidence of corruption.

Corruption in conscription has been a big political issue in South Korea. Examining anomalies in the distributions of height, weight, and eyesight at each physical examination site for conscription may provide evidence of cheating and/or corruption. This kind of statistical evidence will fall short of proving crime or corruption, but it can make a sufficient case for a thorough investigation.

Posted by Jong-sung You at 6:39 AM

November 30, 2005

Missing Women and Sex-Selective Abortion

You, Jong-Sung

The problem of “missing women” in many developing countries reflects not just gender inequality but a serious violation of human rights, as Amartya Sen reported in his book Development as Freedom (1999). It refers to the phenomenon of excess mortality and artificially lower survival rates of women. Particularly disturbing is the practice of sex-selective abortion, which has become quite widespread in China and South Korea.

Statistical analysis, in particular examination of anomalies in a distribution of interest, can give compelling evidence of crime or corruption. If nine out of ten babies delivered at a hospital are boys, we must have a strong suspicion that the doctor(s) in the hospital conduct(s) sex-selective abortion. It may not be evidence sufficient for a conviction, but it probably is sufficient grounds for investigation. What, then, would be a good guide for deciding when to investigate? Applying a fixed percentage threshold is not a good idea, because the probability of 6 or more boys out of 10 babies is much larger than the probability of 600 or more boys out of 1000 babies. A more appropriate guide may be the binomial probability distribution.

Suppose the probability of producing a boy or a girl is exactly 50 percent. Then the probability of producing six or more boys out of ten babies is 37.7 percent, while the probability of producing 60 or more boys out of 100 babies is only 2.8 percent in the absence of some explanatory factor (probably sex-selective abortion). The probability of producing 55 or more boys out of 100 babies is 18.4 percent, but the probability of producing 550 or more boys out of 1000 babies is only 0.09 percent in the absence of some explanatory factor (again, probably sex-selective abortion). If the police decide to investigate hospitals whose share of boy births exceeds a certain percentage, say 60 percent, then many honest small hospitals will be investigated, while large hospitals that really engage in sex-selective abortion may escape investigation.
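These tail probabilities can be checked with a few lines of Python. The sketch assumes, as above, that each birth is a boy with probability exactly 0.5. The function names and the `alpha` cutoff are hypothetical choices of mine, included only to show what a size-aware investigation rule might look like.

```python
from math import comb

def prob_at_least(k, n):
    """P(at least k boys among n births), assuming P(boy) = 0.5 exactly."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

def min_boys_to_flag(n, alpha=0.01):
    """Smallest k whose tail probability P(X >= k) falls below alpha.

    alpha is a hypothetical investigation cutoff, not a figure from the post.
    """
    for k in range(n + 1):
        if prob_at_least(k, n) < alpha:
            return k
    return None  # even n boys out of n births is not rare enough

for k, n in [(6, 10), (60, 100), (55, 100), (550, 1000)]:
    print(f"P(at least {k} boys in {n} births) = {prob_at_least(k, n):.4f}")
```

A rule such as `min_boys_to_flag` scales the cutoff with hospital size, so a small hospital is not flagged for an imbalance that chance alone produces often, while a large hospital cannot hide behind a modest-looking percentage.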

Posted by Jong-sung You at 5:59 AM