November 12, 2009
There was a lot of press on the 1,000+-page length of the House health care bill, H.R. 3962. That got me thinking... didn't we hear the same thing about the stimulus bill and the Patriot Act? Aren't most "controversial" bills also very long?
It would make sense. Controversial bills require a lot more ink -- pork, special cases, exceptions -- to reel in support. Uncontroversial bills can be written succinctly and pass as is.
To assess this I scraped bills from OpenCongress, which maintains the full text, voting results and amendment history of House and Senate Resolutions. You can even comment on specific portions of bills. There's already a bunch of neat comments on potential loopholes in H.R. 3962.
I downloaded the text and voting results for all 152 House resolutions passed by the 111th House. A boxplot of page length against support appears below. Each page length group represents roughly 20% of House resolutions. The plot shows the suspected trend, that longer bills have less support. One-page bills almost always pass unanimously!
Posted by Kevin Bartz at 10:12 AM
May 15, 2009
In late April, the FAA released the long-awaited bird strike data. It shows every recorded bird strike since 2000.
Since then, we've had a whole host of stories bemoaning the doubling in bird strikes since 2000, complete with worrisome bar graphs and explanations from experts.
But as far as I can tell, the stories seem to have forgotten about the denominator: the number of flights, which has been increasing just as rapidly since 2000.
To test this, I went and pulled commercial flight totals from a public BTS data set from 2003 on. Then I limited the bird data to the same airports, months, years and carriers as appear in the BTS flight data. Then I divided bird strikes by flights, and presto: the strike rate has been flat since 2003.
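The denominator fix is just a division, and a minimal Python sketch makes it concrete. The yearly totals below are invented placeholders, not FAA or BTS figures, and `strikes_per_10k` is a hypothetical helper name.

```python
def strikes_per_10k(strikes, flights):
    """Bird strikes per 10,000 flights."""
    if flights <= 0:
        raise ValueError("flights must be positive")
    return 10_000 * strikes / flights


# Made-up totals: strikes grow by 20%, but so do flights.
yearly = {2003: (4_000, 10_000_000), 2008: (4_800, 12_000_000)}
rates = {year: strikes_per_10k(s, f) for year, (s, f) in yearly.items()}
```

With these placeholder numbers the raw strike count rises while the rate does not move, which is the pattern described above.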

But the most interesting part of this mashup was breaking down the figures by carrier. Do some airlines strike birds more than others? The answer appears to be "Yes."

At the top of the pile are Frontier, United and JetBlue. Frontier Airlines has a staggering 9.4 strikes per 10,000 flights, compared to the industry average of 4.0. Now, a good statistician (or a Frontier executive) would wonder about confounders. Frontier's Denver hub is the most bird-prone major airport in the U.S., with 7.8 strikes per 10,000 flights. Here's the breakdown by airport for the top 34 airports in the U.S.

To try to account for all the possible confounders, I fit a Poisson regression modeling strike rate using the following covariates:
Since the data have 100,000 rows and hundreds of columns after expanding all the categorical covariates, I used bigglm in R to fit the model. The operator coefficients (actually, exp(coefficient)) are shown below. 1.0 refers to the "base" rate -- the strike rate you would expect given the airline's flight history of airports, years and months. The "winners" -- Hawaiian, United and Frontier -- all have values above 2, which means their strike rates are more than double what you would expect from an airline with the same flying schedule.
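For readers who want to see the mechanics, here is a toy Poisson regression with a log(flights) offset, written in plain Python rather than R. It is only a sketch under loud assumptions: a single hypothetical 0/1 carrier indicator stands in for the full set of covariates, the rows are invented, and `fit_poisson_rate` is my own name, not a library function.

```python
import math


def fit_poisson_rate(rows, iterations=25):
    """Fit log E[strikes] = log(flights) + b0 + b1 * x by Newton's method.

    rows: (flights, strikes, x) tuples, where x is a 0/1 indicator.
    """
    total_n = sum(n for n, _, _ in rows)
    total_y = sum(y for _, y, _ in rows)
    b0, b1 = math.log(total_y / total_n), 0.0  # start at the pooled rate
    for _ in range(iterations):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for n, y, x in rows:
            mu = n * math.exp(b0 + b1 * x)  # expected strikes for this row
            s0 += y - mu                    # score for b0
            s1 += (y - mu) * x              # score for b1
            h00 += mu                       # information matrix entries
            h01 += mu * x
            h11 += mu * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * s0 - h01 * s1) / det   # 2x2 Newton step
        b1 += (h00 * s1 - h01 * s0) / det
    return b0, b1


# Invented rows: (flights, strikes, is_carrier_of_interest)
rows = [(20_000, 8, 0), (30_000, 12, 0), (10_000, 9, 1), (10_000, 10, 1)]
b0, b1 = fit_poisson_rate(rows)
rate_ratio = math.exp(b1)  # carrier's strike rate relative to the baseline
```

Here `exp(b1)` plays the same role as the exponentiated operator coefficients: a value of 1.0 means the carrier strikes birds at the baseline rate for its exposure.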

So why do some airlines strike more birds than others, even after accounting for airport and month? Possibilities include differing planes, pilots or maintenance crews.
Posted by Kevin Bartz at 3:06 PM
March 13, 2009
This entry uses people data from ZabaSearch to show which English first names are most popular among Chinese Americans.
When I worked at Google, I once did an employee search on "Vivian" and 26 of the 30 results were Chinese. This post examines this phenomenon a bit more scientifically, with two goals:
The ideal approach would be to download a phone book and tally the first names for Chinese last names. While there's nowhere to download a phone book, there are several searchable people databases online. The largest and most famous free option is ZabaSearch.
The first step: get a list of Chinese last names to search for. I used the 100 most common last names in both the natural Pinyin and Wade-Giles variants, for a total of 128 unique last names. With a script, I searched for each on ZabaSearch. Sadly, Zaba won't show you the results if there are over 1000 -- it just says "1000's of CHIN's found!" If you search across the entire U.S., then this happens for too many of the names, so I limited my searches to Boston. The 128 last names yielded 22,483 unique people (after also de-duping by address).
Among Chinese in Boston, the most common three first names are Wei (1.34%), Hong (0.916%) and Hui (0.836%). Only about 25% -- 5,949 of 22,483 -- of the first names are English.
Of the Chinese population with English first names, the most popular three male and female names are shown below. For the American public, these are downloaded from the latest census report. (Note: To standardize the population sizes, I limited both populations to those with first names in the top 500.)
| Name | Rank across America | Rank among Chinese | Frequency across America | Frequency among Chinese |
|---|---|---|---|---|
| DAVID | 6 | 1 | 0.0286 | 0.0511 |
| JOHN | 2 | 2 | 0.0396 | 0.0378 |
| JAMES | 1 | 3 | 0.0402 | 0.0369 |

| Name | Rank across America | Rank among Chinese | Frequency across America | Frequency among Chinese |
|---|---|---|---|---|
| JENNIFER | 6 | 1 | 0.01300 | 0.0311 |
| AMY | 32 | 2 | 0.00631 | 0.0303 |
| ANGELA | 29 | 3 | 0.00655 | 0.0178 |
The three most popular Chinese male first names are also very popular in America as a whole. A more interesting question is the one about P(Chinese | Name n) -- which English first names are much more common among Chinese Americans than among all Americans? To answer that, I conducted a binomial proportion test and sorted the results by p-value, identifying the most extreme differences. The top 10 male and female differences are given below.
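A two-proportion z-test is one common way to run a binomial proportion test like this. The sketch below assumes that variant (the post does not say exactly which one was used) and runs on invented counts, since the underlying sample sizes are not given here. The function name is hypothetical.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test that two binomial proportions are equal.

    x1 of n1 people in group 1 carry the name, x2 of n2 in group 2.
    Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Invented counts: (count among Chinese sample, count among general sample)
counts = {"AMY": (30, 63), "JOHN": (38, 396)}
n_chinese, n_general = 1_000, 10_000
results = {
    name: two_proportion_z_test(c, n_chinese, g, n_general)
    for name, (c, g) in counts.items()
}
ranked = sorted(results, key=lambda name: results[name][1])  # smallest p first
```

Sorting by p-value pushes the names with the most lopsided frequencies to the top; names used about equally often in both groups sink to the bottom.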
Some of the top results are nicknames -- Chinese are much more likely to pick "Andy" or "Jenny" as a legal name, while general Americans are formally named by the longer versions.
The other names on the list are more interesting. For males, "Andrew," "Eric," "Peter" and "Albert" are much more common among Chinese than among Americans. For females, it's "Amy," "Grace," "May" and, yes, "Vivian." By comparing the frequencies, you can see that these names are anywhere from about three to sixteen times more popular among Chinese Americans!
I'll leave interpretation to the sociologists.
| Name | Frequency across America | Frequency among Chinese | p-value |
|---|---|---|---|
| ANDREW | 0.006510 | 0.02810 | 2.2e-30 |
| ANDY | 0.000594 | 0.00937 | 2.0e-26 |
| DAN | 0.001220 | 0.01120 | 3.6e-23 |
| PETER | 0.004620 | 0.01990 | 5.3e-22 |
| ALBERT | 0.003810 | 0.01570 | 7.0e-17 |
| ERIC | 0.006590 | 0.02120 | 1.5e-16 |
| ALAN | 0.002470 | 0.01150 | 2.9e-14 |
| SAM | 0.001110 | 0.00786 | 3.7e-14 |
| ALEX | 0.001390 | 0.00846 | 1.4e-13 |
| DAVID | 0.028600 | 0.05110 | 1.7e-12 |

| Name | Frequency across America | Frequency among Chinese | p-value |
|---|---|---|---|
| AMY | 0.006310 | 0.03030 | 2.5e-29 |
| JENNY | 0.000951 | 0.01330 | 6.9e-28 |
| GRACE | 0.002640 | 0.01670 | 4.3e-21 |
| MAY | 0.000406 | 0.00644 | 3.1e-15 |
| VIVIAN | 0.001650 | 0.01100 | 5.2e-15 |
| ALICE | 0.004990 | 0.01780 | 3.5e-13 |
| JENNIFER | 0.013000 | 0.03110 | 2.7e-12 |
| CECILIA | 0.000769 | 0.00644 | 6.8e-11 |
| JANE | 0.003500 | 0.01290 | 2.7e-10 |
| CINDY | 0.002690 | 0.01060 | 2.2e-09 |
Posted by Kevin Bartz at 5:46 PM
November 23, 2008
Which past market environment was most like today's? This post shows that in terms of correlation to returns, the closest of all major markets to the S&P 500 in the last year was the Nikkei in 1991.
The S&P 500 peaked in mid-October of 2007, and has been down, down, down since then. Common comparisons include the Nikkei in 1990, the Dow Jones in 1929, and the NASDAQ crash of 2000.
But did those downturns really feel like today's? The plot below shows weekly returns overlaid for these four bear markets. All the plots start at the (then) all-time week-ending high of the market index in question.

They're not really all that similar. The 1929 and 2000 crashes both fell much more dramatically from their all-time highs initially. In contrast, this crash started slowly, gradually accelerating its decline until those dark days of August and September 2008. One depressing fact is that we're actually worse now than we were 58 weeks after the crash of '29!
Are there any precedents for this kind of decline? To answer that I scraped all week-by-week returns for all major indices, U.S. and international, from Yahoo! Finance. Then I programmatically scanned each index's past for a 58-week period where the returns have high correlation with the S&P 500's returns over the past 58 weeks since its October high. The winners, in order of correlation:

In particular, the ebbs and flows of the Nikkei in 1991 are eerily close to those of the S&P 500 in the last year!
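The scan itself is a sliding-window correlation. Here is a minimal sketch of that idea in plain Python; the function names are hypothetical, the return series would come from whatever data source you have on hand, and nothing here is the script behind the rankings.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat window has no defined correlation
    return cov / math.sqrt(vx * vy)


def best_matching_window(history, target):
    """Find the window of len(target) in history most correlated with target.

    history, target: lists of weekly returns. Returns (start_index, correlation).
    """
    width = len(target)
    best_start, best_corr = None, -2.0
    for start in range(len(history) - width + 1):
        corr = pearson(history[start:start + width], target)
        if corr > best_corr:
            best_start, best_corr = start, corr
    return best_start, best_corr
```

Running `best_matching_window` once per index, with `target` set to the last 58 weekly S&P 500 returns, and sorting the results by correlation would give a ranking of the same kind.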
Although no one knows what our future holds, all but one of these indices went on to rally in the following two years. Perhaps it is some consolation that none declined much further, and most gained. The S&P 500 in 1969 recouped all of its losses by two years later. The worst of the bunch was the German DAX 30, which was about flat at its trough after two years.

Posted by Kevin Bartz at 11:22 AM
October 30, 2008
LendingClub is a P2P lending site much like Prosper. What makes them special is that they've released a full data set of all 4,564 past loans and their current status. As a data source this is extraordinary, since most literature on credit scoring uses proprietary data. For the LendingClub data, can we beat the FICO score at default prediction by incorporating additional clues?
This post focuses on the borrower's "Loan Description," which I use along with FICO scores to predict defaults. The loan description is written by the borrower and usually pitches his qualifications and reasons for needing the money. Here's a randomly chosen example from someone who is current on his payments.
I have some credit card debt that I would like to pay-off. It makes sense to pay one lender as opposed to 5 credit card companies. I'd rather pay interest to one payee rather than split between 5 or 6.
This is a relatively short one -- the average description is 58 words long. Perhaps there are keywords in the description that impact the probability of default after controlling for the FICO score. Here's what I did to test for these keywords:
Now, the fun stuff. For our purposes, define a delinquency as either being late in your payments or having defaulted completely. The 10 words with the smallest p-values are below. I report marginal delinquency probabilities, not broken out by FICO score, simply for brevity; the actual M-H test controlled for the FICO scores.
| Word | Loans With | P(Delinquency \| No word) | P(Delinquency \| Word) | p-value |
|---|---|---|---|---|
| also | 215 | 0.067 | 0.140 | 0.0004 |
| need | 608 | 0.062 | 0.105 | 0.0015 |
| business | 233 | 0.069 | 0.116 | 0.0038 |
| live | 91 | 0.070 | 0.154 | 0.0057 |
| already | 64 | 0.071 | 0.156 | 0.0059 |
| other | 285 | 0.068 | 0.112 | 0.0081 |
| bills | 223 | 0.067 | 0.135 | 0.0082 |
| bill | 279 | 0.066 | 0.125 | 0.0117 |
| interest | 660 | 0.081 | 0.053 | 0.0136 |
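The M-H (Mantel-Haenszel) test pools one 2x2 table per FICO band, so a word only scores well if it predicts delinquency within bands. A hedged sketch follows: the strata are invented counts, the function name is hypothetical, and the continuity correction mirrors a common default rather than anything stated in the post.

```python
import math


def mantel_haenszel_test(strata, correct=True):
    """Mantel-Haenszel chi-squared test across 2x2 tables.

    strata: one (a, b, c, d) tuple per FICO band, where
      a = loans with the word that are delinquent
      b = loans with the word that are not
      c = loans without the word that are delinquent
      d = loans without the word that are not
    Returns (statistic, p_value) on 1 degree of freedom.
    """
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        observed += a
        expected += row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        return 0.0, 1.0
    delta = abs(observed - expected)
    yates = 0.5 if correct and delta >= 0.5 else 0.0
    stat = (delta - yates) ** 2 / variance
    return stat, math.erfc(math.sqrt(stat / 2))  # chi-squared(1) upper tail


# Invented counts for two FICO bands.
strata = [(15, 85, 20, 380), (10, 90, 15, 385)]
stat, p_value = mantel_haenszel_test(strata)
```

Running this once per word and sorting by `p_value` would produce a ranking of the same kind as the table above.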
I have good credit and am looking to consolidate all my debt into one easy payment. I am looking to get married soon so the less multiple bills we have to keep track of the better. I have two credit cards with low balances that I would like to pay off. I have a furniture debt that I would also like to consolidate and I need to overhaul the commuter vehicle my fiance will begin driving. I have no recorded late or delinquent payments on my credit. I have worked for my current employer for 5 1/2 yrs and have good standing. I am excited to join hands in marriage with my lovely fiance and the remainder balance after consolidation will be used for marraige documentation purposes. I appreciate your consideration. Thank you.
As for the other words, "need" implies that the borrower is in straits of some kind, while "live," "bill" and "bills" suggest that the money will be used for day-to-day expenses rather than a targeted goal, implying a systemic negative cash flow. "Already" suggests an existing outstanding loan. All but one word ("interest") on the list enhances delinquency risk. "Business" is somewhat surprising -- people who want money to start businesses must be greater risks. Here's an example:
i am trying to buy a residential Land in emerging and booming market like new delhi where building cost is very cheap and return of investment is 150% in just six months. I intend to purchase the land build the house with my friends help who is in building house business and make a six flats/3 floor house. and sale it each one of them under USD 12, 000.00.
I'm stunned something like this got funded! All in all, such keywords look like a good building block for enhancing a credit score model that goes beyond FICO scores. In a saner credit market, a viable strategy would be to fund P2P loans judged by an enhanced model to minimize default risk. Right now, however, I'd be worried that the credit crisis could wipe out all these sites at the drop of a hat.
Posted by Kevin Bartz at 2:16 PM
October 3, 2008
This post looks at the linguistics of last night's Biden-Palin debate. Palin used the word "reform" 12 times compared to Biden's none. Biden used "middle class" 12 times to Palin's one.
Here's a sequel to my earlier Obama-Clinton post.
Overall, Biden uttered 7,065 words and Palin 7,646, with a total of 2,117 unique words. Which words did Biden use significantly more or less than Palin? For each word, we apply a chi-squared test that the candidates spoke the word with equal probability. Finally, we sort the list by p-value, highlighting the differences. I've eliminated words that appear over 50 times (mostly stop words like "the," which Palin evidently used a couple hundred times more than Biden).
| Word | Biden | Palin | pval |
|---|---|---|---|
| also | 3 | 47 | 0.0000 |
| their | 32 | 4 | 0.0000 |
| number | 15 | 0 | 0.0002 |
| want | 0 | 16 | 0.0003 |
| united | 16 | 1 | 0.0004 |
| policy | 22 | 4 | 0.0004 |
| just | 6 | 28 | 0.0007 |
| those | 10 | 34 | 0.0013 |
| too | 0 | 13 | 0.0014 |
| they | 41 | 18 | 0.0015 |
| well | 24 | 7 | 0.0019 |
| these | 1 | 15 | 0.0020 |
| said | 40 | 18 | 0.0022 |
| reform | 0 | 12 | 0.0023 |
| who | 11 | 34 | 0.0025 |
| even | 3 | 19 | 0.0025 |
| down | 16 | 3 | 0.0034 |
| gwen | 16 | 3 | 0.0034 |
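Here is a small Python sketch of a per-word chi-squared test on a 2x2 table (word vs. all other words, by speaker). The word totals and counts come from the post; the use of Yates' continuity correction is my assumption, since the exact variant of the test is not spelled out, and the function name is hypothetical.

```python
import math


def word_chi2_pvalue(count_a, total_a, count_b, total_b, correct=True):
    """Chi-squared test that two speakers use a word with equal probability.

    count_a of total_a words (speaker A) and count_b of total_b (speaker B)
    are the word in question. Returns the p-value on 1 degree of freedom.
    """
    a, b = count_a, total_a - count_a
    c, d = count_b, total_b - count_b
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # word never used (or always used): nothing to test
    diff = abs(a * d - b * c)
    if correct:
        diff = max(0.0, diff - n / 2)  # Yates' continuity correction
    stat = n * diff ** 2 / denom
    return math.erfc(math.sqrt(stat / 2))  # chi-squared(1) upper tail


BIDEN_TOTAL, PALIN_TOTAL = 7_065, 7_646
counts = {"reform": (0, 12), "also": (3, 47), "down": (16, 3)}
pvals = {
    word: word_chi2_pvalue(biden, BIDEN_TOTAL, palin, PALIN_TOTAL)
    for word, (biden, palin) in counts.items()
}
ranked = sorted(pvals, key=pvals.get)  # most lopsided words first
```

Setting `correct=False` gives the uncorrected statistic, which produces noticeably smaller p-values for rare words.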
Observations:
We can also look at bigrams, pairs of words, in a similar way.
| Word | Biden | Palin | pval |
|---|---|---|---|
| the united | 16 | 1 | 0.0004 |
| united states | 16 | 1 | 0.0004 |
| we have | 9 | 34 | 0.0007 |
| want to | 0 | 14 | 0.0009 |
| he said | 11 | 0 | 0.0016 |
| have got | 0 | 12 | 0.0023 |
| and i | 6 | 25 | 0.0025 |
| that is | 4 | 21 | 0.0026 |
| and that's | 1 | 14 | 0.0032 |
| middle class | 12 | 1 | 0.0035 |
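Counting bigrams takes only a few lines. This is a sketch with an assumed tokenizer (lowercase runs of letters and apostrophes); it ignores sentence boundaries, so a pair can straddle two sentences, and `bigram_counts` is a hypothetical name.

```python
import re
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs in a transcript."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(" ".join(pair) for pair in zip(words, words[1:]))


sample = "The United States and the united states"
pairs = bigram_counts(sample)
```

Each speaker's `Counter` can then be fed through the same word-level test, treating every bigram as if it were a single word.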
Posted by Kevin Bartz at 11:38 AM
May 20, 2008
Jointly with Dave Kane, an IQSS fellow and head of Kane Capital, I've been working on applying causal inference techniques to the financial problem of performance evaluation. We have a draft on SSRN up here.
The problem: how do you evaluate a stock portfolio's performance? This is usually done by comparing the returns on the manager's portfolio against those of a counterfactual portfolio of investments the manager could have chosen, but did not. A common choice is a passive portfolio like the S&P 500. If a manager can't perform at least as well as a passive benchmark like this, why not just invest in the S&P 500? But this may not be a fair comparison, since the S&P 500 contains only large-cap stocks, while the manager may actually have considered a wider universe of possibilities. Any difference in returns could be due to the portfolio's smaller capitalization rather than the manager's stock-picking ability.
Dave Kane and I view performance evaluation as a causal inference problem. We consider the treatment to be the manager's claimed advantage. Does he time the market? Does he pick hot sectors? Most commonly a manager claims an ability to pick stocks. Then the covariates are the set of confounding factors: observable characteristics of stocks, such as their capitalization, sector and country.
To get a better benchmark, we propose forming a matching portfolio of stocks with similar characteristics, but which are not held in the portfolio. In the leftmost figure above, the black dots represent the characteristics of holdings in a particular portfolio we considered (an equal-weighted portfolio based on the StarMine indicator). The gray dots represent non-holdings. We form the matching portfolio by matching each black dot to a nearby gray dot, using a propensity score method. When we're done, we end up with a well-matched portfolio -- the exposures are compared in the second figure, and they line up nicely. Notice from the figure that there are several possible matched portfolios -- we consider a random set of 100 of them, matching within a thin caliper, as part of our benchmark.
Finally, we compare the realized portfolio return against the returns of the matched portfolios. When we do that, we obtain the histogram below. The portfolio outperforms 75% of the matched portfolios, suggesting there's a moderate but not overwhelming amount of evidence for the stock-picking ability of the StarMine indicator.
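To make the two steps concrete, here is a deliberately simplified Python sketch: random matching within a caliper on an already-estimated score, followed by the share of matched portfolios that the real portfolio beats. All names and numbers are hypothetical, and the paper's actual procedure may differ in its details.

```python
import random


def matched_portfolio(holdings, candidates, caliper, rng):
    """Match each holding to a random unused non-holding within the caliper.

    holdings, candidates: lists of (name, score) pairs, where score is a
    propensity score assumed to be estimated elsewhere.
    Returns the matched names, or None if some holding has no match.
    """
    used = set()
    matches = []
    for _, score in holdings:
        pool = [
            name for name, s in candidates
            if name not in used and abs(s - score) <= caliper
        ]
        if not pool:
            return None
        choice = rng.choice(pool)
        used.add(choice)
        matches.append(choice)
    return matches


def fraction_outperformed(portfolio_return, matched_returns):
    """Share of matched portfolios with a lower return than the portfolio."""
    beaten = sum(portfolio_return > r for r in matched_returns)
    return beaten / len(matched_returns)


rng = random.Random(0)
holdings = [("A", 0.30), ("B", 0.70)]
candidates = [("X", 0.31), ("Y", 0.69), ("Z", 0.10)]
match = matched_portfolio(holdings, candidates, caliper=0.05, rng=rng)
share = fraction_outperformed(0.05, [0.01, 0.02, 0.06, 0.04])
```

Calling `matched_portfolio` repeatedly with the same random generator yields different matched portfolios whenever more than one candidate falls inside a caliper, which is what produces a distribution of benchmark returns.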
In the paper we consider several extensions of this framework to situations with non-equal portfolio weights and to long-short portfolios. We employ the generalized propensity score of Imai and Imbens to form the matching portfolios in this case, treating the portfolio weights of the stocks as a continuous treatment.
We welcome any comments, thoughts or reactions to these ideas! The SSRN draft is linked above, and an accompanying R package is available here if you want to reproduce the computations.
Posted by Kevin Bartz at 11:10 PM
May 13, 2008
I know this isn't my normal day, but three points today:
| Error | Actual | Predicted |
|---|---|---|
I'm less worried about the turnout discrepancy; it happened because there had been no semi-open Democratic primary since Huckabee dropped out of the Republican contest. I was forced to use Pennsylvania (a closed primary) and Ohio (a semi-open primary, but with Huckabee still formally in) to predict turnout, which resulted in my underestimates. I'm more confident about my turnout projection in West Virginia, which is a semi-open primary, now that I have North Carolina to use as a predictor.
In predicting voter shares, my overall county-level correlations were .81 for Indiana and .88 for North Carolina -- on the whole pretty good, but with some problems. Below are spatial plots of residuals for North Carolina, and Indiana's appear above. Dark red corresponds to overestimation of Obama's support, and dark grey to underestimation of Obama's support.
| Error | Actual | Predicted |
|---|---|---|
The biggest mistake in my North Carolina predictions came with rural Blacks, who had not appeared significantly in my training data. The largest-magnitude residual was Greene County, a rural county that's 50% White and 40% Black (it's the small dark red). I projected a 70%-30% Obama victory, as is typical for counties with this racial split (note that among Democrats in such a county, Blacks will dominate). But somehow Clinton actually won this county 53% to 47%, putting me 23% off. In all of the neighboring rural black counties I had similarly overestimated Obama's support. This points to a possible interaction effect -- that rural blacks are more pro-Clinton than urban blacks.
Now to my top-line West Virginia prediction: Clinton 70.5%, Obama 29.5%, with a turnout of 300,000 votes. The map is below. I have Clinton taking every county in the state. Obama comes closest in Jefferson (a high-income, well-educated county next to Virginia) and Monongalia (a well-educated urban county that’s part of the Pittsburgh tri-state area).
With Clinton's impending departure, however, I plan to abandon these projections and move on to other fun. I really want to try a language model on Obama's and McCain's speeches.
Posted by Kevin Bartz at 5:48 PM
May 4, 2008
Since I have qualifying exams tomorrow, I'll keep this entry unimaginative. I've re-run my predictions for the Indiana and North Carolina primaries on Tuesday, adding a few new bells and whistles:
With the help of a turnout model, I can actually predict the election result by multiplying turnout by population and adding up votes for Clinton and Obama. When I do that, I get:
Indiana: Clinton 53.5%, Obama 46.5%; turnout 950,000
North Carolina: Obama 58%, Clinton 42%; turnout 1,200,000
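The roll-up from counties to a statewide number can be sketched as follows. I am assuming each county comes with a population, a predicted turnout rate and a predicted Obama share of the two-candidate vote; the function name and the two example counties are invented.

```python
def statewide_result(counties):
    """Aggregate county-level predictions into statewide shares.

    counties: (population, turnout_rate, obama_share) tuples.
    Returns (obama_share, clinton_share, total_votes).
    """
    obama = clinton = 0.0
    for population, turnout_rate, obama_share in counties:
        votes = population * turnout_rate  # predicted ballots cast
        obama += votes * obama_share
        clinton += votes * (1 - obama_share)
    total = obama + clinton
    return obama / total, clinton / total, total


# Invented counties: a large low-turnout one and a small high-turnout one.
counties = [(100_000, 0.2, 0.6), (50_000, 0.4, 0.3)]
obama_share, clinton_share, total_votes = statewide_result(counties)
```

Weighting by predicted votes rather than by county is what lets a few large counties dominate the statewide share.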
Yowzers! We'll see how the real numbers pan out. Here are a few details on the two models:
This time I included even more covariates for both models. Next to the ones found to be important, I've placed their effect in parentheses.
How do my results stack up against the current polls? In Indiana, the RealClearPolitics average has Clinton +6%, only a point from my prediction. In North Carolina, the RCP average has Obama +8%, significantly below my predicted 16% victory. Two factors shed light on this discrepancy:
We'll see how it pans out on Tuesday. I'm more than willing to eat crow :)
Posted by Kevin Bartz at 6:38 PM
April 22, 2008
Update: Check out how my predictions fared! Two comparisons are given, one showing both maps in the same image and one as an animated GIF (kudos to the animation package in R).

Overall, my predictions did pretty well. Their overall correlation with the true vote shares was .89 -- leading to an R^2 of .79, just below the in-sample R^2. My biggest miss was Centre County, where I predicted that Clinton would edge out Obama. Instead, Obama won pretty convincingly, with over 60% of the vote. I also overestimated Obama’s support in some of the counties surrounding Philadelphia. Not sure what I can do to improve the model next time. If you have any ideas, leave a comment.
Original entry: This isn't my normal blogging day, but I wanted to show my final Pennsylvania prediction map. Later on I will update my post to include the true map in the same color scheme, so we can compare. I have updated the prediction model after everyone's suggestions last time.
The big problems last time were:
There were other comments, too, but not all of them could be addressed effectively (What else can I do besides predict on the county level? That's where we have data!) Well, I'm happy to say that for the latest model I pulled in lots more covariates from the census:
With all these, the model fits like a dream come true. R^2 = 0.82 and a residual standard error of 0.04 (i.e., about ±8% of Obama's true share, taking two standard errors). Here are the estimated coefficients (after pruning some variables based on the BIC):
| Name | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -1.93 | 0.35 | -5.44 | 0.00 |
| kerry | -0.29 | 0.06 | -4.66 | 0.00 |
| black | 1.00 | 0.10 | 9.81 | 0.00 |
| hisp | 0.74 | 0.30 | 2.49 | 0.01 |
| male | -1.52 | 0.33 | -4.60 | 0.00 |
| young | 1.46 | 0.22 | 6.59 | 0.00 |
| log(income) | 0.29 | 0.03 | 9.96 | 0.00 |
The coefficients are pretty much as you expect: counties with more Blacks, young people and higher incomes vote for Obama. Poorer counties and counties where Kerry did well tend to go for Clinton. The only somewhat surprising part is the negative coefficient on male population. You would think counties with more females would go for Clinton. There's probably some confounder, because there were several counties in Ohio with 55% male populations who went for Clinton.
Anyway, I will update this post tomorrow comparing my predictions to the realized results.
Posted by Kevin Bartz at 11:16 AM
April 18, 2008
In last week's debate in Philadelphia,
Last week's debate provides a small but interesting corpus to analyze the candidates' favorite linguistic formulations. Overall,
So all in all, the candidates spoke about the same number of words. But which words? We can test that using a basic corpus comparison method. In all, there were 1,971 unique words. For each of these, we test the hypothesis that the candidates spoke the word with equal probability, using a simple chi-squared test. Next we sort all words by their p-values so that the most differentially expressed words percolate to the top. Here are the top 20 words by p-value, along with their frequencies from Obama and Clinton.
| Word | obama | clinton | pval |
|---|---|---|---|
| will | 18 | 56 | 0.0000 |
| know | 23 | 64 | 0.0000 |
| that's | 43 | 12 | 0.0001 |
| she | 16 | 0 | 0.0002 |
| it | 41 | 79 | 0.0005 |
| how | 36 | 12 | 0.0010 |
| clinton | 14 | 1 | 0.0021 |
| i | 150 | 205 | 0.0024 |
| he | 5 | 21 | 0.0029 |
| politics | 10 | 0 | 0.0047 |
| this | 58 | 30 | 0.0047 |
| american | 20 | 5 | 0.0056 |
| to | 211 | 268 | 0.0058 |
| begin | 0 | 9 | 0.0072 |
| york | 0 | 9 | 0.0072 |
| decade | 9 | 0 | 0.0081 |
| economic | 9 | 0 | 0.0081 |
| election | 9 | 0 | 0.0081 |
| going | 49 | 26 | 0.0128 |
| give | 1 | 10 | 0.0149 |
Sometimes control words (I, it, etc.) are excluded from analysis, but here I thought it would be fun to leave them in so we could see each candidate's preferred constructions. Besides the points listed above, here are a few interesting notes:
- Clinton used the word "I" 205 times to Obama's 150
- Obama loves to start sentences with "That's": "That's why I'm...", "That's what we're...", etc.
- Obama loves the word "decade" -- evidently he used the phrase "decades after decades" several times
Of course, unigrams -- single words -- can only tell you so much. If we do the same analysis using bigrams, a few more bits of information drip out:
| Word | obama | clinton | pval |
|---|---|---|---|
| you know | 18 | 49 | 0.0002 |
| american people | 16 | 1 | 0.0008 |
| senator clinton | 13 | 0 | 0.0009 |
| the american | 17 | 2 | 0.0014 |
| and that's | 13 | 1 | 0.0035 |
| have a | 5 | 20 | 0.0046 |
| this country | 10 | 0 | 0.0047 |
| i will | 7 | 23 | 0.0055 |
| going to | 46 | 22 | 0.0061 |
| new york | 0 | 9 | 0.0072 |
So Clinton always punctuates her thoughts with "you know," while Obama attributes his goals to the "American people."
It will be interesting when McCain gets into the mix with one of these two. I think it would be fun to construct a language model -- a model for the probability that each candidate spoke a certain sentence. Given the differences, I bet that given a sentence, it could easily figure out whether Obama, Clinton or McCain said it!
Posted by Kevin Bartz at 12:44 PM
April 4, 2008
Here are the results of the Pennsylvania Democratic primary, with Obama counties in purple and Clinton counties in orange.
What, you say? The Pennsylvania primary hasn't happened yet? You're right. Enter statistics!
Consider this scatterplot of Kerry's 2004 vote share versus Obama's 2008 vote share in Ohio counties. The result is something I call the Kerry-Obama smile: Obama does well in Kerry's best counties, where staunchly Democratic urban blacks are concentrated; and in Kerry's worst regions, presumably due to Obama's appeal to crossover Republicans. Clinton does best in the wide middle swath.

This motivates a very simple modeling idea: fit a curve to the scatterplot. Obviously, a quadratic in Kerry's share looks like a decent fit. That gives us the best-fit curve shown on the plot. The R-squared is 0.16, representing an okay fit.
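A quadratic least-squares fit is easy to write out by hand. The sketch below solves the normal equations directly in Python; the county shares are invented to have a smile shape, the function names are hypothetical, and it needs at least three distinct x values to work.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x**2 via the normal equations."""
    power_sums = [sum(x ** k for x in xs) for k in range(5)]
    rhs = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x4 matrix [X'X | X'y].
    a = [[power_sums[i + j] for j in range(3)] + [rhs[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def r_squared(xs, ys, coeffs):
    """Share of the variance in ys explained by the fitted quadratic."""
    c0, c1, c2 = coeffs
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (c0 + c1 * x + c2 * x * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented county data: Kerry's 2004 share vs. Obama's 2008 share.
kerry = [0.25, 0.35, 0.45, 0.55, 0.65, 0.75]
obama = [0.48, 0.41, 0.38, 0.39, 0.45, 0.56]
coeffs = fit_quadratic(kerry, obama)
fit_quality = r_squared(kerry, obama, coeffs)
```

A positive `coeffs[2]` is the "smile": predicted support is lowest in the middle of Kerry's range and rises toward both ends.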
The next step is utterly useless, but utterly fun. We can use Ohio to predict Pennsylvania. In other words, given that we know how Kerry did in Pennsylvania counties in 2004, we can predict how well Obama will do in 2008 in every Pennsylvania county. Note that I first tweaked the model's intercept slightly in Obama's favor, so that the aggregate prediction matches the current polling average (showing Clinton up by 6.6%).
The bad news for Obama is that nearly all of Pennsylvania's counties fall in the middle of the smile. The image below compares Kerry in 2004 to the model's predictions for Obama in 2008. Obama is predicted to carry Philadelphia overwhelmingly, and to do well in some of the curvy, heavily Republican counties in the south-center of the state. Everywhere else, though, is Clinton country.
Posted by Kevin Bartz at 1:15 PM
March 20, 2008
We're lucky to have two contested Presidential primaries. One of my favorite habits is to look at cross-tabs of candidate preferences by party and county. Here's an example of an Iowa cross-tab, showing the number of Iowa counties by Republican winner and Democratic winner:
| Iowa | Obama | Clinton | Edwards |
|---|---|---|---|
| McCain | 0 | 0 | 0 |
| Romney | 15 | 7 | 2 |
| Huckabee | 27 | 21 | 27 |
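A cross-tab like this is just a count of (Republican winner, Democratic winner) pairs, one per county. Here is a minimal Python sketch with a handful of invented counties; `county_crosstab` is a hypothetical helper.

```python
from collections import Counter


def county_crosstab(winners):
    """Count counties by (republican_winner, democratic_winner) pair."""
    return Counter(winners)


# Invented example: one tuple per county.
winners = [
    ("Huckabee", "Obama"),
    ("Huckabee", "Edwards"),
    ("Romney", "Obama"),
    ("Huckabee", "Obama"),
]
table = county_crosstab(winners)
```

Because `Counter` returns 0 for missing keys, empty cells such as `table[("McCain", "Clinton")]` need no special handling when the table is printed.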
We can visualize cross-tabs using mosaic plots as in "Visualizing Categorical Data." I did it for nine primary states in the image below. The green represents Obama counties, the orange Hillary counties and the purple Edwards counties. Across the columns are the Republican candidates: McCain, Romney, Huckabee. Across the rows, Obama, Hillary and Edwards. Check it out here. If you instead prefer an inverted version, with Republicans across the rows and Democrats across the columns (this makes it easier to compare the Democrats), check it out here.
The conclusions are the same over most states: Huckabee and Edwards are clearly the most complementary candidates. They shared counties whenever Edwards was in play (Iowa, Florida); after that, Huckabee shared Clinton counties. In Missouri every single county he won was a Clinton county! Huckabee and Clinton are somewhat complementary. Neither McCain nor Romney is particularly complementary with any Democrat (see California, where McCain and Romney split the Hillary-Obama counties), though both did better in Obama counties when Huckabee was in play.
One distracting feature of the plots above is that counties aren't uniformly populous. Obama won Missouri by winning only six counties. An alternative interpretation is to view this as an ecological inference problem, in which we are trying to determine the population totals in each of the cross-tab cells. This isn't perfectly accurate, since Edwards voters don't actually also vote for Huckabee. But it does provide a nice framework for scaling the mosaic plot by population size, and making it look generally less degenerate. I did that using Ryan Moore's eiPack and got this.
Posted by Kevin Bartz at 5:53 PM