November 12, 2009

Bill Support by Page Length

There was a lot of press on the 1,000+-page length of the House health care bill, H.R. 3962. That got me thinking... didn't we hear the same thing about the stimulus bill and the Patriot Act? Aren't most "controversial" bills also very long?

It would make sense. Controversial bills require a lot more ink -- pork, special cases, exceptions -- to reel in support. Uncontroversial bills can be written succinctly and pass as is.

To assess this I scraped bills from OpenCongress, which maintains the full text, voting results and amendment history of House and Senate Resolutions. You can even comment on specific portions of bills. There's already a bunch of neat comments on potential loopholes in H.R. 3962.

I downloaded the text and voting results for all 152 House resolutions passed by the 111th House. A boxplot of page length against support appears below. Each page length group represents roughly 20% of House resolutions. The plot shows the suspected trend, that longer bills have less support. One-page bills almost always pass unanimously!

SupportByLength.png

Posted by Kevin Bartz at 10:12 AM | Comments (1)

May 15, 2009

The Most Bird-prone: Frontier, United, Hawaiian

In late April, the FAA released the long-awaited bird strike data. It shows every recorded bird strike since 2000.

Since then, we've had a whole host of stories bemoaning the doubling in bird strikes since 2000, complete with worrisome bar graphs and explanations from experts.

But as far as I can tell, the stories seem to have forgotten about the denominator: the number of flights, which has been increasing just as rapidly since 2000.

To test this, I went and pulled commercial flight totals from a public BTS data set from 2003 on. Then I limited the bird data to the same airports, months, years and carriers as appear in the BTS flight data. Then I divided bird strikes by flights, and presto: the strike rate has been flat since 2003.

strike.year.png

But the most interesting part of this mashup was breaking down the figures by carrier. Do some airlines strike birds more than others? The answer appears to be "Yes."

strike.carrier.png

At the top of the pile are Frontier, United and JetBlue. Frontier Airlines has a staggering 9.4 strikes per 10,000 flights, compared to the industry average of 4.0. Now, a good statistician (or a Frontier executive) would wonder about confounders. Frontier's Denver hub is the most bird-prone major airport in the U.S., with 7.8 strikes per 10,000 flights. Here's the breakdown by airport for the top 34 airports in the U.S.

strike.airport.png

To try to account for all the possible confounders, I fit a Poisson regression modeling strike rate using the following covariates:

  • Year
  • Month
  • Airport
  • Operator

Since the data have 100,000 rows and hundreds of columns after expanding all the categorical covariates, I used bigglm in R to fit the model. The operator coefficients (actually, exp(coefficient)) are shown below. 1.0 refers to the "base" rate -- the strike rate you would expect given the airline's flight history of airports, years and months. The "winners" -- Hawaiian, United and Frontier -- all have values above 2, which means their strike rates are more than double what you would expect from their flying schedules alone.

strike.model.png
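To make the "observed versus expected given the flying schedule" reading concrete, here is a minimal Python sketch. It is not the Poisson regression from the post: it swaps in a simpler technique, indirect standardization, and adjusts for airport only. The carriers, airports and counts are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy records: (carrier, airport, flights, strikes).
records = [
    ("A", "DEN", 10_000, 9),
    ("A", "BOS", 10_000, 3),
    ("B", "DEN", 5_000, 3),
    ("B", "BOS", 25_000, 6),
]

def standardized_ratios(rows):
    """Return observed/expected strike ratios per carrier.

    Expected strikes are what a carrier would see if it struck birds
    at the pooled rate of each airport it flies from.
    """
    flights = defaultdict(int)
    strikes = defaultdict(int)
    for _, airport, n, k in rows:
        flights[airport] += n
        strikes[airport] += k
    # Pooled strike rate per flight at each airport, across all carriers.
    rate = {a: strikes[a] / flights[a] for a in flights}

    observed = defaultdict(int)
    expected = defaultdict(float)
    for carrier, airport, n, k in rows:
        observed[carrier] += k
        expected[carrier] += n * rate[airport]
    return {c: observed[c] / expected[c] for c in observed}

for carrier, ratio in standardized_ratios(records).items():
    print(carrier, round(ratio, 2))
```

A ratio above 1 plays the same role as exp(coefficient) above 1 in the model: more strikes than the carrier's mix of airports would predict. A regression with year and month terms handles several adjustments at once, which this sketch does not.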

So why do some airlines strike more birds than others, even after accounting for airport and month? Possibilities include differing planes, pilots or maintenance crews.

Posted by Kevin Bartz at 3:06 PM

March 13, 2009

English First Names for Chinese Americans

This entry uses people data from ZabaSearch to show which English first names are most popular among Chinese Americans.

When I worked at Google, I once did an employee search on "Vivian" and 26 of the 30 results were Chinese. This post examines this phenomenon a bit more scientifically, with two goals:


  • Find the most common English first names for Chinese last names (P(Name n | Chinese)).

  • Find the English first names that are differentially expressed -- that is, which are much more popular among Chinese Americans than among the general American public (i.e., P(Chinese | Name n)).

The ideal approach would be to download a phone book and tally the first names for Chinese last names. While there's nowhere to download a phone book, there are several searchable people databases online. The largest and most famous free option is ZabaSearch.

The first step: get a list of Chinese last names to search for. I used the 100 most common last names in both the natural Pinyin and Wade-Giles variants, for a total of 128 unique last names. With a script, I searched for each on ZabaSearch. Sadly, Zaba won't show you the results if there are over 1000 -- it just says "1000's of CHIN's found!" If you search across the entire U.S., then this happens for too many of the names, so I limited my searches to Boston. The 128 last names yielded 22,483 unique people (after also de-duping by address).

Among Chinese in Boston, the most common three first names are Wei (1.34%), Hong (0.916%) and Hui (0.836%). Only about 26% -- 5,949 of 22,483 -- of the first names are English.

Of the Chinese population with English first names, the most popular three male and female names are shown below. For the American public, these are downloaded from the latest census report. (Note: To standardize the population sizes, I limited both populations to those with first names in the top 500.)

Name      Rank across America   Rank among Chinese   Frequency across America   Frequency among Chinese
DAVID     6                     1                    0.0286                     0.0511
JOHN      2                     2                    0.0396                     0.0378
JAMES     1                     3                    0.0402                     0.0369

Name      Rank across America   Rank among Chinese   Frequency across America   Frequency among Chinese
JENNIFER  6                     1                    0.01300                    0.0311
AMY       32                    2                    0.00631                    0.0303
ANGELA    29                    3                    0.00655                    0.0178

The three most popular Chinese male first names are also very popular in America as a whole. A more interesting question is the one about P(Chinese | Name n) -- which English first names are much more common among Chinese Americans than among all Americans? To answer that, I conducted a binomial proportion test and sorted the results by p-value, identifying the most extreme differences. The top 10 male and female differences are given below.
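As a rough illustration of this kind of comparison, here is a minimal Python sketch of a pooled two-proportion z-test. The post does not say which variant of the binomial proportion test was used, so treat this as one common form rather than the exact method, and note that the counts in the example are hypothetical (the post reports frequencies, not sample sizes).

```python
import math

def two_proportion_test(k1, n1, k2, n2):
    """Two-sided pooled z-test for equality of two proportions.

    k1 of n1 people in group 1 have the name, k2 of n2 in group 2.
    Assumes the pooled proportion is strictly between 0 and 1.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 30 of 1,000 in one group vs. 600 of 100,000 in the other.
z, p = two_proportion_test(30, 1000, 600, 100000)
print(f"z = {z:.2f}, p = {p:.2g}")
```

Sorting names by this p-value, smallest first, gives the kind of ranking described above.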

Some of the top results are nicknames -- Chinese are much more likely to pick "Andy" or "Jenny" as a legal name, while general Americans are formally named by the longer versions.

The other names on the list are more interesting. For males, "Andrew," "Eric," "Peter" and "Albert" are much more common among Chinese than among Americans. For females, it's "Amy," "Grace," "May" and, yes, "Vivian." By comparing the frequencies, you can see that these names are roughly three to sixteen times more popular among Chinese Americans!

I'll leave interpretation to the sociologists.

Name      Frequency across America   Frequency among Chinese   p-value
ANDREW    0.006510                   0.02810                   2.2e-30
ANDY      0.000594                   0.00937                   2.0e-26
DAN       0.001220                   0.01120                   3.6e-23
PETER     0.004620                   0.01990                   5.3e-22
ALBERT    0.003810                   0.01570                   7.0e-17
ERIC      0.006590                   0.02120                   1.5e-16
ALAN      0.002470                   0.01150                   2.9e-14
SAM       0.001110                   0.00786                   3.7e-14
ALEX      0.001390                   0.00846                   1.4e-13
DAVID     0.028600                   0.05110                   1.7e-12

Name      Frequency across America   Frequency among Chinese   p-value
AMY       0.006310                   0.03030                   2.5e-29
JENNY     0.000951                   0.01330                   6.9e-28
GRACE     0.002640                   0.01670                   4.3e-21
MAY       0.000406                   0.00644                   3.1e-15
VIVIAN    0.001650                   0.01100                   5.2e-15
ALICE     0.004990                   0.01780                   3.5e-13
JENNIFER  0.013000                   0.03110                   2.7e-12
CECILIA   0.000769                   0.00644                   6.8e-11
JANE      0.003500                   0.01290                   2.7e-10
CINDY     0.002690                   0.01060                   2.2e-09

Posted by Kevin Bartz at 5:46 PM

November 23, 2008

Which Crash Was Most Like This?

Which past market environment was most like today's? This post shows that in terms of the correlation of weekly returns, the closest of all major markets to the S&P 500 in the last year was the Nikkei in 1991.

The S&P 500 peaked in mid-October of 2007, and has been down, down, down since then. Common comparisons include the Nikkei in 1990, the Dow Jones in 1929, and the NASDAQ crash of 2000.

But did those downturns really feel like today's? The plot below shows weekly returns overlaid for these four bear markets. All the plots start at the (then) all-time week-ending high of the market index in question.

They're not really all that similar. The 1929 and 2000 crashes both fell much more dramatically from their all-time highs initially. In contrast, this crash started slowly, gradually accelerating its decline until those dark days of August and September 2008. One depressing fact is that we're actually worse now than we were 58 weeks after the crash of '29!

Are there any precedents for this kind of decline? To answer that I scraped all week-by-week returns for all major indices, U.S. and international, from Yahoo! Finance. Then I programmatically scanned each index's past for a 58-week period where the returns have high correlation with the S&P 500's returns over the past 58 weeks since its October high. The winners, in order of correlation:

  • The Nikkei 225 from March 1991 is the closest, and the big winner. This is interesting because it is the cusp of the Nikkei's so-called "second wave" of declines, after it had stabilized following its better-known 1990 crash.
  • Germany's DAX 30 starting September 2000. The European markets all had a more pronounced downturn in the wake of the tech bubble crash than did the U.S. indices besides the NASDAQ.
  • The S&P 500 in May 1969 and in July 1973. Both of these were the crest of major bear markets, the latter owing to an oil crash.
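The scan described above can be sketched in a few lines of Python. This is an illustrative version under simple assumptions, not the script behind the post: returns are plain lists of weekly returns, similarity is the Pearson correlation, and every start week in an index's history is tried. The function names and the toy numbers are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def best_window(target, history):
    """Return (start_index, correlation) of the window of `history`
    that is most correlated with `target`."""
    w = len(target)
    best = (None, -2.0)  # any real correlation beats -2
    for start in range(len(history) - w + 1):
        r = pearson(target, history[start:start + w])
        if r > best[1]:
            best = (start, r)
    return best

# Toy example: the window starting at index 2 is exactly twice the target.
target = [0.01, -0.02, 0.03, -0.05]
history = [0.0, 0.004, 0.02, -0.04, 0.06, -0.10, 0.01, 0.0]
print(best_window(target, history))
```

In the post's setting, `target` would hold the S&P 500's 58 weekly returns since its October high, and the scan would be repeated over each index's full history. Because correlation ignores scale, a match says the ups and downs line up in time, not that the declines are the same size.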

In particular, the ebbs and flows of the Nikkei in 1991 are eerily close to those of the S&P 500 in the last year!

Although no one knows what our future holds, all but one of these indices went on to rally in the following two years. Perhaps it is some consolation that none declined much further, and most gained. The S&P 500 in 1969 recouped all of its losses by two years later. The worst of the bunch was the German DAX 30, which was about flat at its trough after two years.

Posted by Kevin Bartz at 11:22 AM

October 30, 2008

Words and Credit Scores

Find statistical evidence that borrowers who use words like "bill," "bills," and "need" in their loan applications are twice as likely to default. This post uses freely available data from the P2P lending site LendingClub.

LendingClub is a P2P lending site much like Prosper. What makes them special is that they've released a full data set of all 4,564 past loans and their current status. As a data source this is extraordinary, since most literature on credit scoring uses proprietary data. For the LendingClub data, can we beat the FICO at default prediction by incorporating additional clues?

This post focuses on the borrower's "Loan Description," which I use along with FICO scores to predict defaults. The loan description is written by the borrower and usually pitches his qualifications and reasons for needing the money. Here's a randomly chosen example from someone who is current on his payments.

I have some credit card debt that I would like to pay-off. It makes sense to pay one lender as opposed to 5 credit card companies. I'd rather pay interest to one payee rather than split between 5 or 6.

This is a relatively short one -- the average description is 58 words long. Perhaps there are keywords in the description that impact the probability of default after controlling for the FICO score. Here's what I did to test for these keywords:

  1. Find the 300 most common words in all loan descriptions.
  2. For each word w, test the hypothesis that use of w is conditionally independent of delinquency given the FICO score range (six ranges from 640 up). I apply the Mantel-Haenszel test. Note that for simplicity I am ignoring the survival analysis aspect of the problem here (i.e., some loans are newer than others), since all loans are relatively new anyway (Lending Club started in January of 2007).
  3. Order all the words by the test's p-value. Check that the distribution of p-values is non-uniform to ensure significance in the presence of multiple comparisons.
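Step 2 can be sketched as follows. This is a minimal Python version of the Mantel-Haenszel chi-squared test for a set of 2x2 tables, one per FICO range. The continuity correction is switched on by default, which is an assumption on my part since the post does not say whether one was used, and the counts in the example are invented.

```python
import math

def mantel_haenszel(strata, correct=True):
    """Mantel-Haenszel chi-squared test (1 degree of freedom).

    Each stratum is a tuple (a, b, c, d):
      a = word & delinquent,     b = word & not delinquent,
      c = no word & delinquent,  d = no word & not delinquent.
    Assumes at least one stratum has non-degenerate margins.
    """
    diff = 0.0  # sum over strata of observed minus expected count in cell a
    var = 0.0   # sum over strata of the hypergeometric variance of cell a
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        diff += a - (a + b) * (a + c) / n
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    gap = abs(diff)
    if correct:
        gap = max(0.0, gap - 0.5)  # continuity correction
    chi2 = gap * gap / var
    # Upper tail of a chi-squared with 1 df, via the normal distribution.
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts for two FICO ranges.
strata = [(30, 70, 10, 90), (25, 75, 8, 92)]
chi2, p = mantel_haenszel(strata)
print(f"chi2 = {chi2:.2f}, p = {p:.2g}")
```

Running this once per word and sorting by p-value gives the ordering in step 3.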

Now, the fun stuff. For our purposes, define a delinquency as either being late in your payments or having defaulted completely. The words with the smallest p-values are below. I report marginal delinquency probabilities, not broken out by FICO score, simply for brevity; the actual M-H test controlled for the FICO scores.


Word       Loans with word   P(Delinquency | no word)   P(Delinquency | word)   p-value
also       215               0.067                      0.140                   0.0004
need       608               0.062                      0.105                   0.0015
business   233               0.069                      0.116                   0.0038
live       91                0.070                      0.154                   0.0057
already    64                0.071                      0.156                   0.0059
other      285               0.068                      0.112                   0.0081
bills      223               0.067                      0.135                   0.0082
bill       279               0.066                      0.125                   0.0117
interest   660               0.081                      0.053                   0.0136

Some speculative reasoning: A word like "also" implies that the loan will be used for more than one purpose, which points to a heightened risk. Here's a randomly chosen delinquent borrower who used "also." It's clear that he has multiple goals in mind for the money and has obviously racked up quite a bit of debt.
I have good credit and am looking to consolidate all my debt into one easy payment. I am looking to get married soon so the less multiple bills we have to keep track of the better. I have two credit cards with low balances that I would like to pay off. I have a furniture debt that I would also like to consolidate and I need to overhaul the commuter vehicle my fiance will begin driving. I have no recorded late or delinquent payments on my credit. I have worked for my current employer for 5 1/2 yrs and have good standing. I am excited to join hands in marriage with my lovely fiance and the remainder balance after consolidation will be used for marraige documentation purposes. I appreciate your consideration. Thank you.
As for the other words, "need" implies that the borrower is in straits of some kind, while "live," "bill" and "bills" suggest that the money will be used for day-to-day expenses rather than a targeted goal, implying a systemic negative cash flow. "Already" suggests an existing outstanding loan. All but one word ("interest") on the list enhances delinquency risk. "Business" is somewhat surprising -- people who want money to start businesses must be greater risks. Here's an example:
i am trying to buy a residential Land in emerging and booming market like new delhi where building cost is very cheap and return of investment is 150% in just six months. I intend to purchase the land build the house with my friends help who is in building house business and make a six flats/3 floor house. and sale it each one of them under USD 12, 000.00.
I'm stunned something like this got funded! All in all, such keywords look like a good building block for enhancing a credit score model that goes beyond FICO scores. In a saner credit market, a viable strategy would be to fund P2P loans judged by an enhanced model to minimize default risk. Right now, however, I'd be worried that the credit crisis could wipe out all these sites at the drop of a hat.

Posted by Kevin Bartz at 2:16 PM

October 3, 2008

Biden-Palin Linguistics

This post looks at the linguistics of last night's Biden-Palin debate. Palin used the word "reform" 12 times compared to Biden's none. Biden used "middle class" 12 times to Palin's one.


Here's a sequel to my earlier Obama-Clinton post.

Overall, Biden uttered 7,065 words and Palin 7,646, with a total of 2,117 unique words. Which words did Biden use significantly more or less than Palin? For each word, we apply a chi-squared test that the candidates spoke the word with equal probability. Finally, we sort the list by p-value, highlighting the differences. I've eliminated words that appear over 50 times (mostly stop words like "the," which Palin evidently used a couple hundred times more than Biden).

Word     Biden  Palin  pval
also         3     47  0.0000
their       32      4  0.0000
number      15      0  0.0002
want         0     16  0.0003
united      16      1  0.0004
policy      22      4  0.0004
just         6     28  0.0007
those       10     34  0.0013
too          0     13  0.0014
they        41     18  0.0015
well        24      7  0.0019
these        1     15  0.0020
said        40     18  0.0022
reform       0     12  0.0023
who         11     34  0.0025
even         3     19  0.0025
down        16      3  0.0034
gwen        16      3  0.0034
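For readers who want to try the per-word test, here is a minimal Python sketch. The post does not say which variant of the chi-squared test was run; I am assuming a 2x2 table (this word versus every other word, by speaker) with Yates' continuity correction, which appears to give p-values close to the table for the rows I checked. The function name is hypothetical; the word totals (7,065 for Biden, 7,646 for Palin) come from the text above.

```python
import math

def word_chi2(count_a, count_b, total_a, total_b):
    """Yates-corrected chi-squared test that two speakers use a word
    at the same rate.

    count_a, count_b: times each speaker used the word.
    total_a, total_b: total words each speaker uttered.
    Assumes the word was used at least once overall.
    """
    a, b = count_a, total_a - count_a
    c, d = count_b, total_b - count_b
    n = total_a + total_b
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # Upper tail of a chi-squared with 1 df.
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# "reform": Biden 0, Palin 12; compare with the 0.0023 in the table.
chi2, p = word_chi2(0, 12, 7065, 7646)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Applying the function to every word and sorting by p-value gives a ranking like the one in the table.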

Observations:


  • Look at all the connectors in the Palin column: "also," "just," "too." She uses these words to string together her thoughts.

  • Biden's favored conjunctions are "number" -- from his several "number one, number two" formulations -- and "well."

  • Two interesting ones are "policy" (Biden 22, Palin 4) and "reform" (Palin 12, Biden 0). These were certainly buzzwords from debate prep.

We can also look at bigrams, pairs of words, in a similar way.


  • Biden used "United States" 16 times as opposed to Palin's 1.

  • Like McCain, Palin only once used the term "middle class," compared to Biden's 12. To be fair, she made several allusions to the middle class without mentioning the word.
  • Here's an interesting observation: Palin shares many of Obama's constructions; her favorite conjunctions are "and that's" and "and i."

Word           Biden  Palin  pval
the united        16      1  0.0004
united states     16      1  0.0004
we have            9     34  0.0007
want to            0     14  0.0009
he said           11      0  0.0016
have got           0     12  0.0023
and i              6     25  0.0025
that is            4     21  0.0026
and that's         1     14  0.0032
middle class      12      1  0.0035

Posted by Kevin Bartz at 11:38 AM

May 20, 2008

Matching Portfolios

Jointly with Dave Kane, an IQSS fellow and head of Kane Capital, I've been working on applying causal inference techniques to the financial problem of performance evaluation. We have a draft on SSRN up here.

matching.strip.png matching.expo.png

The problem: how do you evaluate a stock portfolio's performance? This is usually done by comparing the returns on the manager's portfolio against those of a counterfactual portfolio of investments the manager could have chosen, but did not. A common choice is a passive portfolio like the S&P 500. If a manager can't perform at least as well as a passive benchmark like this, why not just invest in the S&P 500? But this may not be a fair comparison, since the S&P 500 contains only large-cap stocks, while the manager may actually have considered a wider universe of possibilities. Any difference in returns could be due to the portfolio's smaller capitalization rather than the manager's stock-picking ability.

Dave Kane and I view performance evaluation as a causal inference problem. We consider the treatment to be the manager's claimed advantage. Does he time the market? Does he pick hot sectors? Most commonly a manager claims an ability to pick stocks. Then the covariates are the set of confounding factors: observable characteristics of stocks, such as their capitalization, sector and country.

To get a better benchmark, we propose forming a matching portfolio of stocks with similar characteristics, but which are not held in the portfolio. In the leftmost figure above, the black dots represent the characteristics of holdings in a particular portfolio we considered (an equal-weighted portfolio based on the StarMine indicator). The gray dots represent non-holdings. We form the matching portfolio by matching each black dot to a nearby gray dot, using a propensity score method. When we're done, we end up with a well-matched portfolio -- the exposures are compared in the second figure, and they line up nicely. Notice from the figure that there are several possible matched portfolios -- we consider a random set of 100 of them, matching within a thin caliper, as part of our benchmark.

Finally, we compare the realized portfolio return against the returns of the matched portfolios. When we do that, we obtain the histogram below. The portfolio outperforms 75% of the matched portfolios, suggesting there's a moderate but not overwhelming amount of evidence for the stock-picking ability of the StarMine indicator.

matching.perf.png
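The matching and comparison steps described above can be sketched in Python. This is a simplified illustration, not the code from the paper or the R package: it assumes propensity scores have already been estimated, matches each holding to a random unused non-holding within the caliper (matching without replacement is my assumption), and all names and numbers are hypothetical.

```python
import random

def matched_portfolios(holding_scores, candidate_scores, caliper, n_draws, seed=0):
    """Draw `n_draws` random matched portfolios.

    holding_scores / candidate_scores: dicts mapping stock -> propensity score.
    Each holding is paired with a random, not-yet-used non-holding whose
    score lies within `caliper`; holdings with no such candidate are skipped.
    """
    rng = random.Random(seed)
    portfolios = []
    for _ in range(n_draws):
        available = dict(candidate_scores)
        match = {}
        for stock, score in holding_scores.items():
            close = [c for c, s in available.items() if abs(s - score) <= caliper]
            if not close:
                continue  # nothing within the caliper
            pick = rng.choice(close)
            match[stock] = pick
            del available[pick]  # match without replacement
        portfolios.append(match)
    return portfolios

def share_beaten(portfolio_return, matched_returns):
    """Fraction of matched portfolios whose return is below the real portfolio's."""
    return sum(r < portfolio_return for r in matched_returns) / len(matched_returns)

holdings = {"H1": 0.30, "H2": 0.55}
candidates = {"C1": 0.29, "C2": 0.32, "C3": 0.54, "C4": 0.90}
draws = matched_portfolios(holdings, candidates, caliper=0.05, n_draws=100)
print(draws[0])
print(share_beaten(0.05, [0.01, 0.02, 0.06, 0.04]))
```

With 100 draws, `share_beaten` plays the role of the 75% figure quoted above: it locates the realized return within the distribution of matched-portfolio returns.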

In the paper we consider several extensions of this framework to situations with non-equal portfolio weights and to long-short portfolios. We employ the generalized propensity score of Imai and Imbens to form the matching portfolios in this case, treating the portfolio weights of the stocks as a continuous treatment.

We welcome any comments, thoughts or reactions to these ideas! The SSRN draft is linked above, and an accompanying R package is available here if you want to reproduce the computations.

Posted by Kevin Bartz at 11:10 PM

May 13, 2008

IN, NC Rehash; WV Prediction

I know this isn't my normal day, but two points today:


  • How I did in IN and NC

  • My prediction for WV

Error Actual Predicted
in.dem.2008.actual.error.png in.dem.2008.actual.share.png in.dem.2008.pred.share.png
  • Indiana was off by about 3%: I had predicted 53.5% Clinton, 46.5% Obama; the result was 50.6% Clinton, 49.4% Obama.
  • North Carolina was near spot-on: I had predicted 58% Obama, 42% Clinton; the result was 57.3% Obama, 42.7% Clinton.
  • I significantly underestimated turnout: there were 1.27 million votes in IN and 1.53 million in NC, while I had predicted 950,000 and 1.2 million, respectively.

I'm less worried about the turnout discrepancy; it happened because there had been no semi-open Democratic primary since Huckabee dropped out of the Republican contest. I was forced to use Pennsylvania (a closed primary) and Ohio (a semi-open primary, but with Huckabee still formally in) to predict turnout, which resulted in my underestimates. I'm more confident about my turnout projection in West Virginia, which is a semi-open primary, now that I have North Carolina to use as a predictor.

In predicting voter shares, my overall county-level correlations were .81 for Indiana and .88 for North Carolina -- on the whole pretty good, but with some problems. Below are spatial plots of residuals for North Carolina, and Indiana's appear above. Dark red corresponds to overestimation of Obama's support, and dark grey to underestimation of Obama's support.

Error Actual Predicted
nc.dem.2008.actual.error.png nc.dem.2008.actual.share.png nc.dem.2008.pred.share.png

The biggest mistake in my North Carolina predictions came with rural Blacks, who had not appeared significantly in my training data. The largest-magnitude residual was Greene County, a rural county that's 50% White and 40% Black (it's the small dark red). I projected a 70%-30% Obama victory, as is typical for counties with this racial split (note that among Democrats in such a county, Blacks will dominate). But somehow Clinton actually won this county 53% to 47%, putting me 23% off. In all of the neighboring rural black counties I had similarly overestimated Obama's support. This points to a possible interaction effect -- that rural blacks are more pro-Clinton than urban blacks.

Now to my top-line West Virginia prediction: Clinton 70.5%, Obama 29.5%, with a turnout of 300,000 votes. The map is below. I have Clinton taking every county in the state. Obama comes closest in Jefferson (a high-income, well-educated county next to Virginia) and Monongalia (a well-educated urban county that’s part of the Pittsburgh tri-state area).

wv.dem.2008.pred.share.png

With Clinton's impending departure, however, I plan to abandon these projections and move on to other fun. I really want to try a language model on Obama's and McCain's speeches.

Posted by Kevin Bartz at 5:48 PM

May 4, 2008

IN, NC Predictions

Since I have qualifying exams tomorrow, I'll keep this entry unimaginative. I've re-run my predictions for the Indiana and North Carolina primaries on Tuesday, adding a few new bells and whistles:

  • A turnout model
  • More covariates in the voting share model

nc.dem.2008.pred.share.png

in.dem.2008.pred.share.png

With the help of a turnout model, I can actually predict the election result by multiplying turnout by population and adding up votes for Clinton and Obama. When I do that, I get:

Indiana: Clinton 53.5%, Obama 46.5%; turnout 950,000
North Carolina: Obama 58%, Clinton 42%; turnout 1,200,000
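The aggregation itself is simple arithmetic. Here is a minimal Python sketch with made-up county numbers; the tuple layout and function name are hypothetical, and the two models' outputs are taken as given.

```python
def statewide_result(counties):
    """Combine county-level predictions into a statewide result.

    counties: list of (population, predicted_turnout_rate, predicted_obama_share).
    """
    obama = clinton = 0.0
    for population, turnout_rate, obama_share in counties:
        votes = population * turnout_rate        # predicted votes cast
        obama += votes * obama_share
        clinton += votes * (1 - obama_share)
    total = obama + clinton
    return {"turnout": total, "obama": obama / total, "clinton": clinton / total}

# Two hypothetical counties.
result = statewide_result([(100_000, 0.2, 0.6), (50_000, 0.1, 0.3)])
print(result)
```

Because counties are weighted by predicted votes rather than counted equally, a large county with high turnout can dominate the statewide share.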

Yowzers! We'll see how the real numbers pan out. Here are a few details on the two models:

  • The share model is trained on the primary results from Ohio, Pennsylvania and Virginia. This model has R^2 = 0.99, meaning that it's explained nearly as much as it can. The residuals still show a SE of 5%, however, so the results could be shaky at the county level.
  • The turnout model is trained on the primary results from Ohio. Note that Indiana and North Carolina are open primaries. I didn't use Pennsylvania in this model because it was a closed primary, and I didn't use Virginia because it had a contested Republican election at the time. Ohio's Republican primary was technically contested by Huckabee, but he wasn't a serious factor, whereas he had dedicated substantial resources to competing in Virginia. For this model R^2 = .84 and the residual SE is 2%. My turnout projections are mapped below.

    in.dem.2008.pred.turn.png
    nc.dem.2008.pred.turn.png


This time I included even more covariates for both models. Next to the ones found to be important, I've placed their effect in parentheses.


  • Kerry's 2004 vote share and its square (pro-Clinton and +turnout)

  • Proportions White, Black, Asian, Native American and Hispanic (white pro-Clinton and +turnout, others pro-Obama)

  • Proportion male (pro-Clinton, +turnout)

  • Proportions 18-21 and 65+ (both pro-Obama, young -turnout, old +turnout)

  • Percentage urban

  • Log(median household income) (pro-Obama)

  • Proportion with a bachelor's degree, proportion with a master's degree (pro-Obama)

  • Unemployment rate (high is pro-Clinton)

  • Proportions employed in mining, in education, in construction (mining pro-Clinton, education pro-Obama)

How do my results stack up against the current polls? In Indiana, the RealClearPolitics average has Clinton +6%, only a point from my prediction. In North Carolina, the RCP average has Obama +8%, significantly below my predicted 16% victory. Two factors shed light on this discrepancy:


  • In neighboring South Carolina, the polling average had Obama +11.6% and he won by 28.9%.

  • In neighboring Virginia, the polling average had Obama +17.7% and he won by 28.2%.

  • So perhaps my analysis isn't so crazy putting Obama above what the polls say in NC.

We'll see how it pans out on Tuesday. I'm more than willing to eat crow :)

Posted by Kevin Bartz at 6:38 PM

April 22, 2008

Predicting Pennsylvania, Updated

Update: Check out how my predictions fared! Two comparisons are given, one showing both maps in the same image and one as an animated GIF (kudos to the animation package in R).

pa.movie.gif

pa.dem.2008.comp.png

Overall, my predictions did pretty well. Their overall correlation with the true vote shares was .89 -- leading to an R^2 of .79, just below the in-sample R^2. My biggest miss was Centre County, where I predicted that Clinton would edge out Obama. Instead, Obama won pretty convincingly, with over 60% of the vote. I also overestimated Obama’s support in some of the counties surrounding Philadelphia. Not sure what I can do to improve the model next time. If you have any ideas, leave a comment.

Original entry: This isn't my normal blogging day, but I wanted to show my final Pennsylvania prediction map. Later on I will update my post to include the true map in the same color scheme, so we can compare. I have updated the prediction model after everyone's suggestions last time.

pa.dem.2008.png

The big problems last time were:

  • Kerry's vote share was only a loose indicator of Obama's, not enough to base a model upon
  • The model didn't incorporate other obvious factors like population density, nearby colleges, etc.
  • R^2 = 0.16 isn't all that good!

There were other comments, too, but not all of them could be addressed effectively (What else can I do besides predict on the county level? That's where we have data!) Well, I'm happy to say that for the latest model I pulled in lots more covariates from the census:

  • Kerry's 2004 vote share
  • % Whites
  • % Blacks
  • % Hispanics
  • % males
  • % young people (age 18 through 21)
  • % urban population
  • Population density
  • Median household income

With all these, the model fits like a dream come true. R^2 = 0.82 and a residual standard error of 0.04 (i.e., +- 8% of Obama's true share). Here are the estimated coefficients (after pruning some variables based on the BIC):

Name          Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      -1.93        0.35    -5.44      0.00
kerry            -0.29        0.06    -4.66      0.00
black             1.00        0.10     9.81      0.00
hisp              0.74        0.30     2.49      0.01
male             -1.52        0.33    -4.60      0.00
young             1.46        0.22     6.59      0.00
log(income)       0.29        0.03     9.96      0.00

The coefficients are pretty much as you expect: counties with more Blacks, young people and higher incomes vote for Obama. Poorer counties and counties where Kerry did well tend to go for Clinton. The only somewhat surprising part is the negative coefficient on male population. You would think counties with more females would go for Clinton. There's probably some confounder, because there were several counties in Ohio with 55% male populations who went for Clinton.

Anyway, I will update this post tomorrow comparing my predictions to the realized results.

Posted by Kevin Bartz at 11:16 AM

April 18, 2008

Linguistics of the Debate

In last week's debate in Philadelphia,

  • Clinton's favorite phrase was "You know," which she used 49 times to Obama's 18
  • Obama's favorite phrase was "American people," which he used 16 times to Clinton's 1
  • Obama was the only one to use the words "politics" (10 times), "economic" (9 times) and "election" (9 times).

Last week's debate provides a small but interesting corpus to analyze the candidates' favorite linguistic formulations. Overall,

  • 12,329 words were uttered by a candidate
  • Obama uttered 6,206 words (1,331 unique) in 40 chunks
  • Clinton uttered 6,123 words (1,250 unique) in 37 chunks

So all in all, the candidates spoke about the same number of words. But which words? We can test that using a basic corpus comparison method. In all, there were 1,971 unique words. For each of these, we test the hypothesis that the candidates spoke the word with equal probability, using a simple chi-squared test. Next we sort all words by their p-values so that the most differentially expressed words percolate to the top. Here are the top 20 words by p-value, along with their frequencies from Obama and Clinton.

Word      obama  clinton  pval
will         18       56  0.0000
know         23       64  0.0000
that's       43       12  0.0001
she          16        0  0.0002
it           41       79  0.0005
how          36       12  0.0010
clinton      14        1  0.0021
i           150      205  0.0024
he            5       21  0.0029
politics     10        0  0.0047
this         58       30  0.0047
american     20        5  0.0056
to          211      268  0.0058
begin         0        9  0.0072
york          0        9  0.0072
decade        9        0  0.0081
economic      9        0  0.0081
election      9        0  0.0081
going        49       26  0.0128
give          1       10  0.0149

Sometimes function words (I, it, etc.) are excluded from analysis, but here I thought it would be fun to leave them in so we could see each candidate's preferred constructions. Besides the points listed above, here are a few interesting notes:

  • Clinton used the word "I" 205 times to Obama's 150
  • Obama loves to start sentences with "That's": "That's why I'm...", "That's what we're," etc.
  • Obama loves the word "decade" -- evidently he used the phrase "decades after decades" several times

Of course, unigrams -- single words -- can only tell you so much. If we do the same analysis using bigrams, a few more bits of information drip out:

Bigram            Obama  Clinton  p-value
you know             18       49   0.0002
american people      16        1   0.0008
senator clinton      13        0   0.0009
the american         17        2   0.0014
and that's           13        1   0.0035
have a                5       20   0.0046
this country         10        0   0.0047
i will                7       23   0.0055
going to             46       22   0.0061
new york              0        9   0.0072

So Clinton always punctuates her thoughts with "you know," while Obama attributes his goals to the "American people."
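For readers who want to try the bigram version, here is a minimal sketch of the counting step. The tokenizer (lowercase, letters and apostrophes only) is an assumption, not necessarily what was used for the table, and this simple version lets bigrams span sentence boundaries.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bigram_counts(text):
    """Count adjacent word pairs, joined with a space ("you know")."""
    words = tokenize(text)
    return Counter(" ".join(pair) for pair in zip(words, words[1:]))


# Illustrative sentence only, not a quote from the debate.
counts = bigram_counts("You know, the American people know.")
```

The resulting counters can be compared across speakers with the same chi-squared test used for single words.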

It will be interesting when McCain gets into the mix with one of these two. I think it would be fun to construct a language model -- a model for the probability that each candidate spoke a certain sentence. Given the differences above, I bet such a model could easily figure out whether Obama, Clinton or McCain said a particular sentence!
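One simple way such a language model could look is a unigram model with add-one smoothing: score a sentence under each speaker's word frequencies and pick the speaker with the highest likelihood. This is only a sketch of the idea. The function names are hypothetical, the training text below is a made-up toy example rather than the debate transcript, and equal prior probabilities for the speakers are assumed.

```python
import math
from collections import Counter


def train(transcripts):
    """Map each speaker to a Counter of the words they used."""
    return {speaker: Counter(tokens) for speaker, tokens in transcripts.items()}


def log_likelihood(tokens, counts, vocab_size):
    """Log-probability of the tokens under a smoothed unigram model."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in tokens
    )


def guess_speaker(tokens, models):
    """Return the speaker whose model gives the sentence the highest score."""
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen words
    return max(
        models, key=lambda s: log_likelihood(tokens, models[s], vocab_size)
    )


# Toy training text (hypothetical, for illustration only).
models = train({
    "speaker_a": "you know i will you know".split(),
    "speaker_b": "the american people that's why".split(),
})
guess = guess_speaker("you know".split(), models)
```

A real version would train on full transcripts and would probably use bigrams as well, since the tables above show that phrases separate the candidates more sharply than single words.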

Posted by Kevin Bartz at 12:44 PM

April 4, 2008

Predicting Pennsylvania

Here are the results of the Pennsylvania Democratic primary, with Obama counties in purple and Clinton counties in orange.

pa.dem.2008.png

What, you say? The Pennsylvania primary hasn't happened yet? You're right. Enter statistics!

Consider this scatterplot of Kerry's 2004 vote share versus Obama's 2008 vote share in Ohio counties. The result is something I call the Kerry-Obama smile: Obama does well in Kerry's best counties, where staunchly Democratic urban blacks are concentrated, and in Kerry's worst counties, presumably due to Obama's appeal to crossover Republicans. Clinton does best in the wide middle swath.

kerry.obama.png

This motivates a very simple modeling idea: fit a curve to the scatterplot. A quadratic in Kerry's share looks like a decent choice, and it gives the best-fit curve shown on the plot. The R-squared is 0.16, so the curve captures the overall smile but leaves most of the county-to-county variation unexplained.
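For concreteness, a quadratic fit and its R-squared can be computed as in the sketch below. It uses plain least squares through the normal equations and only the standard library; the function names are hypothetical and the post's own fitting code is not shown, so treat this as one possible implementation.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x**2; returns [b0, b1, b2]."""
    # Normal equations: the 3x3 matrix holds power sums of x,
    # the right-hand side holds sums of y * x**k.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


def predict(coef, x):
    """Evaluate the fitted quadratic at x."""
    return coef[0] + coef[1] * x + coef[2] * x * x


def r_squared(xs, ys, coef):
    """Share of the variance in ys explained by the fitted curve."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(coef, x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot
```

Here `xs` would be Kerry's 2004 share in each Ohio county and `ys` Obama's 2008 share. A "smile" corresponds to a positive coefficient on the squared term.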

The next step is utterly useless, but utterly fun. We can use Ohio to predict Pennsylvania. In other words, given that we know how Kerry did in Pennsylvania counties in 2004, we can predict how well Obama will do in 2008 in every Pennsylvania county. Note that I first tweaked the model's intercept slightly in Obama's favor, so that the aggregate prediction matches the current polling average (showing Clinton up by 6.6%).
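The intercept tweak amounts to adding one constant to every county prediction so that the statewide total lands on a target. A minimal sketch follows, with two assumptions that are not in the post: counties are weighted by expected turnout, and the target share is supplied by the caller (for instance, treating a 6.6-point Clinton lead as a two-way race would give Obama (100 - 6.6) / 2 = 46.7%).

```python
def shift_to_target(predicted_shares, weights, target_share):
    """Shift all predictions by one constant to hit a weighted average.

    predicted_shares -- model predictions per county (fractions)
    weights          -- county weights, e.g. expected turnout (assumption)
    target_share     -- desired weighted average, e.g. from polling
    Returns the shifted predictions and the shift that was applied.
    """
    weighted_mean = (
        sum(p * w for p, w in zip(predicted_shares, weights)) / sum(weights)
    )
    delta = target_share - weighted_mean
    return [p + delta for p in predicted_shares], delta
```

Shifting the intercept leaves the shape of the curve untouched, so the ranking of counties does not change. The sketch does not clip results to the 0 to 1 range, which a careful version should.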

The bad news for Obama is that nearly all of Pennsylvania's counties fall in the middle of the smile. The image below compares Kerry in 2004 to the model's predictions for Obama in 2008. Obama is predicted to carry Philadelphia overwhelmingly, and to do well in some of the curvy, heavily Republican counties in the south-center of the state. Everywhere else, though, is Clinton country.

pa.comp.png

Posted by Kevin Bartz at 1:15 PM

March 20, 2008

Primary Crosstabs

We're lucky to have two contested Presidential primaries. One of my favorite habits is to look at cross-tabs of candidate preferences by party and county. Here's an example of an Iowa cross-tab, showing the number of Iowa counties by Republican winner and Democratic winner:






Iowa       Obama  Clinton  Edwards
McCain         0        0        0
Romney        15        7        2
Huckabee      27       21       27

This paints a very clear picture: Huckabee won the Edwards counties and, to a lesser extent, the Clinton counties, and, to an even lesser extent, the Obama counties.
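To make the "lesser extent" comparison concrete, here is a small sketch that builds a cross-tab from per-county winners and computes, for one Republican, the share of each Democrat's counties that he won. The function names are hypothetical; the Iowa counts are the ones in the table above.

```python
from collections import Counter


def crosstab(pairs):
    """Count counties by (Republican winner, Democratic winner)."""
    return Counter(pairs)


def column_shares(table, row):
    """For one Republican, the share of each Democrat's counties he won."""
    shares = {}
    for col in table[row]:
        col_total = sum(table[r][col] for r in table)
        shares[col] = table[row][col] / col_total
    return shares


# Iowa counts from the table above.
iowa = {
    "McCain": {"Obama": 0, "Clinton": 0, "Edwards": 0},
    "Romney": {"Obama": 15, "Clinton": 7, "Edwards": 2},
    "Huckabee": {"Obama": 27, "Clinton": 21, "Edwards": 27},
}
huckabee = column_shares(iowa, "Huckabee")
```

Huckabee's shares work out to about 93% of the Edwards counties, 75% of the Clinton counties and 64% of the Obama counties, which is the ordering described above.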

We can visualize cross-tabs using mosaic plots as in "Visualizing Categorical Data." I did it for nine primary states in the image below. The green represents Obama counties, the orange Hillary counties and the purple Edwards counties. Across the columns are the Republican candidates: McCain, Romney, Huckabee. Across the rows, Obama, Hillary and Edwards. Check it out here. If you instead prefer an inverted version, with Republicans across the rows and Democrats across the columns (this makes it easier to compare the Democrats), check it out here.

The conclusions are the same over most states: Huckabee and Edwards are clearly the most complementary candidates. They shared counties whenever Edwards was in play (Iowa, Florida); after that, Huckabee shared Clinton counties. In Missouri every single county he won was a Clinton county! Huckabee and Clinton are somewhat complementary. Neither McCain nor Romney is particularly complementary with any Democrat (see California, where McCain and Romney split the Hillary-Obama counties), though both did better in Obama counties when Huckabee was in play.

One distracting feature of the plots above is that counties aren't uniformly populous. Obama won Missouri by winning only six counties. An alternative interpretation is to view this as an ecological inference problem, in which we are trying to determine the population totals in each of the cross-tab cells. This isn't perfectly accurate, since Edwards voters don't actually also vote for Huckabee. But it does provide a nice framework for scaling the mosaic plot by population size, and making it look generally less degenerate. I did that using Ryan Moore's eiPack and got this.

Posted by Kevin Bartz at 5:53 PM