Workshop in Applied Statistics (Gov 3009)

Location: 

CGIS Knafel, room K354, or online via Zoom

This Week's Speaker

Soichiro Yamauchi (Google), "Statistical Analysis with Machine Learning Predicted Variables"

Abstract

Scholars in the social sciences are increasingly relying on machine learning (ML) techniques to construct data from large corpora of text and images. The ML-generated variables are subsequently utilized in statistical analysis to address substantive questions through regression and hypothesis testing. However, this approach can introduce substantial bias and lead to incorrect inferences due to prediction errors during the machine learning stage. In this paper, we present an approach that incorporates ML-generated variables into regression analysis while ensuring consistency and asymptotic normality. The proposed approach leverages a small-scale human-coded sample to capture the bias in the naive estimator, without the need for strict assumptions about the structure of prediction errors. Furthermore, we have developed diagnostic tools to assess whether additional human coding can further reduce variance in the main analysis. We illustrate the effectiveness of our method by revisiting a study on the sources of election fraud with ballot image data and regression analysis.

The Applied Statistics Workshop (Gov 3009) meets throughout the academic year on Wednesdays, 12:00pm-1:30pm, in CGIS K354. This workshop is a forum for advanced graduate students, faculty, and visiting scholars to present and discuss methodological or empirical work in progress in an interdisciplinary setting. The workshop features a tour of Harvard's statistical innovations and applications, with weekly stops in different fields and disciplines, and includes occasional presentations by invited speakers.

More information is available at the Gov 3009 website: https://projects.iq.harvard.edu/applied.stats.workshop-gov3009

All interested Harvard affiliates are invited to attend.