Applied Stats Workshop (Gov 3009)


Wednesday, March 29, 2017, 12:00pm to 1:30pm


CGIS Knafel K354

The Applied Statistics Workshop (Gov 3009) meets on Wednesdays throughout the academic year, 12:00pm to 1:30pm, in CGIS K354. The workshop is a forum for advanced graduate students, faculty, and visiting scholars to present and discuss methodological or empirical work in progress in an interdisciplinary setting. It features a tour of Harvard's statistical innovations and applications, with weekly stops in different fields and disciplines, and includes occasional presentations by invited speakers. Free lunch is provided.

Kosuke Imai (Princeton) presents. 


Title: Using a Probabilistic Model to Assist Merging of Large-scale Administrative Records



Since most social science research relies upon multiple data sources, merging data sets is an essential part of the workflow for many researchers. In many situations, however, a unique identifier that unambiguously links data sets is unavailable, and data sets may contain missing and inaccurate information. As a result, researchers can no longer combine data sets "by hand" without sacrificing the quality of the resulting merged data set. This problem is especially severe when merging large-scale administrative records such as voter files. The existing algorithms to automate the merging process do not scale, yield far fewer matches, and require arbitrary decisions by researchers. To overcome this challenge, we develop a fast algorithm to implement the canonical probabilistic model of record linkage for merging large data sets. Researchers can combine this model with a small amount of human coding to produce a high-quality merged data set. The proposed methodology can handle millions of observations and account for missing data and auxiliary information. We conduct simulation studies to show that our algorithm performs well in a variety of practically relevant settings. Finally, we use our methodology to merge the campaign contribution data (5 million records), the Cooperative Congressional Election Study data (50 thousand records), and the nationwide voter file (160 million records).
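
For readers unfamiliar with probabilistic record linkage, the following is a minimal sketch of the textbook mixture model usually attributed to Fellegi and Sunter, fitted by EM. It assumes that this is the "canonical" model the abstract refers to; it is not the speaker's fast algorithm, and it ignores missing data and auxiliary information. Each candidate record pair is summarized by a binary agreement pattern (1 if a field such as name or birth year agrees, 0 otherwise), and fields are assumed conditionally independent given match status. The function names, starting values, and iteration count are illustrative choices, not taken from the talk.

```python
from typing import List, Sequence, Tuple


def match_probability(pattern: Sequence[int], lam: float,
                      m: Sequence[float], u: Sequence[float]) -> float:
    """Posterior probability that a record pair is a true match.

    lam  : prior share of matches among all candidate pairs
    m[j] : P(field j agrees | pair is a match)
    u[j] : P(field j agrees | pair is a non-match)
    """
    p_match, p_non = lam, 1.0 - lam
    for g, mj, uj in zip(pattern, m, u):
        p_match *= mj if g else 1.0 - mj
        p_non *= uj if g else 1.0 - uj
    return p_match / (p_match + p_non)


def fit_em(patterns: List[Sequence[int]], n_iter: int = 200,
           eps: float = 1e-6) -> Tuple[float, List[float], List[float]]:
    """Estimate (lam, m, u) by EM from unlabeled agreement patterns."""
    n, k = len(patterns), len(patterns[0])
    # Illustrative starting values: matches mostly agree, non-matches rarely do.
    lam, m, u = 0.1, [0.9] * k, [0.1] * k

    def clip(p: float) -> float:
        # Keep probabilities strictly inside (0, 1) to avoid division by zero.
        return min(max(p, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior match probability for every pair.
        w = [match_probability(g, lam, m, u) for g in patterns]
        total = sum(w)
        # M-step: weighted agreement rates among matches and non-matches.
        lam = clip(total / n)
        m = [clip(sum(wi * g[j] for wi, g in zip(w, patterns)) / total)
             for j in range(k)]
        u = [clip(sum((1.0 - wi) * g[j] for wi, g in zip(w, patterns))
                  / (n - total))
             for j in range(k)]
    return lam, m, u
```

In practice, pairs whose posterior probability exceeds a chosen threshold would be declared matches, and pairs in an intermediate range could be sent for the human coding the abstract mentions. The threshold itself is a design choice that this sketch leaves open.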