Applied Stats Workshop (Gov 3009)

Date: Wednesday, March 21, 2018, 12:00pm to 1:30pm

Location: CGIS Knafel K354

The Applied Statistics Workshop (Gov 3009) meets all academic year, Wednesdays, 12pm-1:30pm, in CGIS K354. This workshop is a forum for advanced graduate students, faculty, and visiting scholars to present and discuss methodological or empirical work in progress in an interdisciplinary setting. The workshop features a tour of Harvard's statistical innovations and applications with weekly stops in different fields and disciplines and includes occasional presentations by invited speakers. Free lunch is provided.

Luke Miratrix presents.

Title: Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality

Abstract: How should one perform matching in observational studies when the units are text documents? The lack of randomized assignment of documents into treatment and control groups may lead to systematic differences between groups on high-dimensional and latent features of text such as topical content and sentiment. Standard balance metrics, used to measure the quality of a matching method, fail in this setting. We present a framework for matching documents that decomposes matching methods into two parts: (1) a text representation, and (2) a distance metric. We consider various methods that can be used at each step and conduct a systematic multifactor evaluation experiment using human subjects to identify the methods that dominate. We also show that our framework can be used to produce matches with higher subjective match quality than current state-of-the-art techniques.

We then apply our chosen method to a substantive debate in the study of media bias using a novel data set of front page news articles from thirteen news sources. Media bias is composed of topic selection bias and presentation bias; using our matching method to control for topic selection, we find that both components contribute significantly to media bias, though some news sources rely on one component more than the other.
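
To make the two-part decomposition in the abstract concrete, here is a minimal Python sketch. It is not the speaker's method. It assumes one illustrative choice at each step: bag-of-words term counts as the text representation and cosine distance as the distance metric, combined with nearest-neighbor matching with replacement. The function names (represent, cosine_distance, match_documents), the caliper parameter, and the example headlines are all hypothetical.

```python
import math
from collections import Counter


def represent(text):
    """Step 1, text representation (assumed): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Step 2, distance metric (assumed): 1 - cosine similarity of two count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def match_documents(treated, control, caliper=None):
    """Pair each treated document with its nearest control document.

    Matching is with replacement; `control` must be non-empty. If `caliper`
    is given, pairs farther apart than the caliper are dropped.
    Returns a list of (treated_index, control_index, distance) tuples.
    """
    control_vecs = [represent(doc) for doc in control]
    matches = []
    for i, doc in enumerate(treated):
        vec = represent(doc)
        dists = [cosine_distance(vec, c) for c in control_vecs]
        j = min(range(len(dists)), key=dists.__getitem__)
        if caliper is None or dists[j] <= caliper:
            matches.append((i, j, dists[j]))
    return matches


# Hypothetical headlines, for illustration only.
treated = ["senate passes budget bill", "storm hits coastal towns"]
control = [
    "coastal towns brace for storm",
    "budget bill stalls in senate",
    "local team wins final",
]
matches = match_documents(treated, control)
```

Swapping in a different representation (for example, topic proportions) or a different metric only requires replacing represent or cosine_distance; the matching loop stays the same, which is the point of separating the two steps.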