12 October 2007

Visualization for data cleaning

Speaking of Fernanda Viegas and Martin Wattenberg's excellent presentation on visualization, I recently came across a data cleaning problem where visualization was a big help. Data cleaning is all about having powerful ways of finding mistakes quickly. Much of the time, clever scripting is the best way to detect errors, but in this case a simple data visualization turned out to be the best tool. Screenshot after the jump.

First, a little background on the project, which is a collaboration with Jens Hainmueller. Throughout the 20th century, The Times of London published election guides that included voting results and candidate bios for every constituency in every election to the House of Commons. We scanned and OCR'd seven volumes of this series and wrote scripts to extract information about each constituency race, including the name, vote total, and short bio of each candidate. The challenge then was to determine which appearances belonged to the same individual. For example, when "P G Agnew" runs in 1950 and "Peter Agnew" runs in 1955, are they the same person? We trained a clustering algorithm to do this matching based on name similarity, year of birth, party, and gender, and wrote some scripts to catch likely errors. When we thought we had done as well as we could, we decided to produce a little visualization to admire our perfectly cleaned data. To our surprise, the visualization revealed a number of hard-to-catch remaining errors.
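
To make the "P G Agnew" versus "Peter Agnew" question concrete, here is a minimal Python sketch of one possible name-compatibility check. It is not our clustering algorithm; the function names and the rule that an initial matches any given name starting with that letter are assumptions made purely for illustration.

```python
def parse_name(name):
    """Split a non-empty name like 'P G Agnew' into (given-name tokens, surname)."""
    tokens = name.replace(".", " ").split()
    return [t.lower() for t in tokens[:-1]], tokens[-1].lower()


def names_compatible(a, b):
    """Return True if two names could plausibly belong to the same person.

    Assumed rule (hypothetical): surnames must match exactly, and given-name
    tokens are compared position by position, where a single-letter token
    (an initial) is compatible with any name that starts with that letter.
    """
    given_a, surname_a = parse_name(a)
    given_b, surname_b = parse_name(b)
    if surname_a != surname_b:
        return False
    for x, y in zip(given_a, given_b):
        if len(x) == 1 or len(y) == 1:
            # At least one side is an initial: compare first letters only.
            if x[0] != y[0]:
                return False
        elif x != y:
            return False
    return True


print(names_compatible("P G Agnew", "Peter Agnew"))
print(names_compatible("P G Agnew", "John Agnew"))
```

A check like this only says that two names are compatible, not that they are the same person ("P G Agnew" is also compatible with "Paul Agnew"), which is why additional evidence such as year of birth, party, and gender is needed.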

As can be seen in the screenshot below, we listed the candidates alphabetically by surname and depicted each candidate's electoral career graphically, with a colored rectangle for each appearance in a race. We selected the colors to reflect the margin in the race, with deep green indicating an easy victory and deep red indicating a resounding defeat.
[Screenshot: thc_screenshot3.JPG]
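
For readers who want to try a similar color scheme, here is a rough Python sketch of mapping a race margin to a color. The details are assumptions rather than a description of what we actually did: the margin is taken to be a signed vote-share difference (positive for a win), `max_margin` is a hypothetical cap, and close races are assumed to fade toward white.

```python
def margin_to_rgb(margin, max_margin=0.5):
    """Map a signed margin to an (r, g, b) tuple.

    Positive margins shade toward deep green, negative margins toward deep
    red, and a margin of zero is white. Margins beyond +/- max_margin are
    clamped to the deepest color.
    """
    t = max(-1.0, min(1.0, margin / max_margin))
    intensity = abs(t)
    # Target color: deep green for wins, deep red for losses (assumed values).
    deep = (0, 100, 0) if t >= 0 else (139, 0, 0)
    # Linear blend from white (255) toward the deep color.
    return tuple(round(255 + (c - 255) * intensity) for c in deep)


print(margin_to_rgb(0.5))   # easy victory
print(margin_to_rgb(-0.5))  # resounding defeat
print(margin_to_rgb(0.0))   # dead heat
```
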
Depicting the candidates' campaign history in this way helped us see patterns that suggested that a single candidate had been incorrectly coded as separate candidates. Brian Batsford, shown at the top of the screenshot, was one such case: the Brian Batsford who ran in 1959, 1964, and 1970 was very likely to be the same person as the Brian Batsford who ran in 1966. Indeed, it turned out that they were the same person; our clustering algorithm had mistakenly split him into two candidates because the year of birth had been miscoded as 1928 in his 1966 appearance.

The key point here is that the pattern that allowed us to see this mistake is easier to see than it is to articulate and, perhaps more importantly, than it is to write in a script. (OK, I'll try: "Find pairs of candidates who have similar names and did not appear in the same elections, especially if they appeared in contiguous elections and had similar results.") I prefer the pretty colors.
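
For the curious, the rule in parentheses above might be sketched in Python roughly as follows. This is an illustration, not a script we ran: the data layout, the `likely_duplicates` name, the use of `difflib.SequenceMatcher` for name similarity, and the 0.9 threshold are all assumptions, and the "similar results" part of the rule is left out.

```python
from difflib import SequenceMatcher
from itertools import combinations


def likely_duplicates(candidates, elections, min_similarity=0.9):
    """Flag pairs of candidate records that may be the same person.

    candidates: dict mapping a record id to (name, set of election years).
    elections: every election year in the data, used to judge contiguity.
    Returns (id_a, id_b, contiguous) tuples, contiguous pairs first.
    """
    order = {year: i for i, year in enumerate(sorted(elections))}
    flagged = []
    for (id_a, (name_a, years_a)), (id_b, (name_b, years_b)) in combinations(
        candidates.items(), 2
    ):
        if years_a & years_b:
            continue  # both appear in the same election: presumably distinct
        similarity = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
        if similarity < min_similarity:
            continue
        # Contiguous: one record's election directly precedes or follows
        # an election of the other record.
        contiguous = any(
            abs(order[x] - order[y]) == 1 for x in years_a for y in years_b
        )
        flagged.append((id_a, id_b, contiguous))
    flagged.sort(key=lambda pair: not pair[2])
    return flagged


# Hypothetical records mirroring the Batsford case described above.
ELECTIONS = [1955, 1959, 1964, 1966, 1970]
records = {
    "batsford_a": ("Brian Batsford", {1959, 1964, 1970}),
    "batsford_b": ("Brian Batsford", {1966}),
    "smith": ("John Smith", {1966}),
}
print(likely_duplicates(records, ELECTIONS))
```

Even this short version needs a similarity threshold and a definition of "contiguous" to be chosen by hand, whereas the pattern is obvious at a glance in the picture.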

Posted by Andy Eggers at October 12, 2007 12:35 PM