CGA releases Twitter Sentiment Geographical Index (TSGI) dataset

November 21, 2022

by Devika Kakkar
 

Promoting well-being is one of the key targets of the United Nations Sustainable Development Goals. Many national and city governments worldwide are incorporating subjective well-being (SWB) indicators into their agendas to complement traditional objective development and economic metrics. In this study, we develop the Twitter Sentiment Geographical Index (TSGI), a proxy for SWB, by applying natural language processing techniques to a comprehensive archive of 7.4 billion geotagged tweets. In contrast to previous work on SWB, TSGI is not limited to a specific topic, period, or location. Using this data, we construct a high-frequency, multi-year database with global coverage, which enables the evaluation of SWB in 163 countries and regions over one decade. The database provides a detailed sentiment index spanning time and geography and offers opportunities to investigate a wide range of topics related to SWB. To the best of our knowledge, it is the first SWB dataset at this scale and granularity. The TSGI is a collaborative project between the MIT Sustainable Urbanization Lab and the Center for Geographic Analysis at IQSS to study the effect of climate change on human well-being using social media data. More information can be found on the TSGI website here.

Screen shot of introductory info from TSGI website

 

Data Availability

This dataset is open to the public and can be accessed on TSGI’s Dataverse repository here. Researchers can also access the national indices, which are updated monthly with new data, at this link.

Data Source

The raw tweet data we used to produce the global sentiment and geography index dataset (GSGD) comes from the Harvard CGA Geotweet Archive v2.0, a global collection of geotagged tweets spanning time, geography, and language, maintained by the Center for Geographic Analysis. The Archive extends from 2010 to the present and is updated daily. The collection contains approximately 10 billion tweets. More information on this dataset is available here.

Methodology

The sentiment index for global geotagged tweets is produced in three steps:

  • First, we vectorize the text of each tweet into a 768-dimensional vector.
  • Then, we feed the vector into a trained neural classifier to obtain a single sentiment score.
  • Finally, we aggregate the scores by administrative area to represent local subjective well-being.
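
The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's actual pipeline: the hashing-based `embed` function stands in for the real text encoder, the logistic `sentiment_score` stands in for the trained neural classifier, and the 0-to-1 score range, the toy weights, the sample tweets, and the grouping by (area, day) are all assumptions made for the example.

```python
import hashlib
import math
from collections import defaultdict

EMBED_DIM = 768  # vector size stated in the methodology


def token_index(token):
    """Map a token to a stable position in the vector (hashing trick)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % EMBED_DIM


def embed(text):
    """Step 1 (placeholder): turn text into a 768-dimensional count vector.

    A real pipeline would use a trained language-model encoder here.
    """
    vec = [0.0] * EMBED_DIM
    for token in text.lower().split():
        vec[token_index(token)] += 1.0
    return vec


def sentiment_score(vec, weights, bias=0.0):
    """Step 2 (placeholder): a logistic unit standing in for the trained
    neural classifier. Returns one score in (0, 1) (assumed range)."""
    z = sum(w * x for w, x in zip(weights, vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def sentiment_index(tweets, weights):
    """Step 3: average the scores per (area, day).

    `tweets` is an iterable of (area, day, text) tuples (hypothetical schema).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for area, day, text in tweets:
        key = (area, day)
        totals[key] += sentiment_score(embed(text), weights)
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Toy weights (made up): "happy" pushes the score up, "sad" pushes it down.
toy_weights = [0.0] * EMBED_DIM
toy_weights[token_index("happy")] += 2.0
toy_weights[token_index("sad")] -= 2.0

# Made-up sample records, for illustration only.
sample_tweets = [
    ("US", "2022-11-21", "so happy today"),
    ("US", "2022-11-21", "feeling sad"),
    ("IN", "2022-11-21", "happy happy news"),
]

for key, value in sorted(sentiment_index(sample_tweets, toy_weights).items()):
    print(key, round(value, 3))
```

Averaging per (area, day) is one simple aggregation choice; the same structure works for other administrative levels or time windows by changing the grouping key.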

Generate sentiment index flowchart, showing Geotagged tweets to Selected best model to Imputed sentiment score to Sentiment index

Applications/Publications

The dataset is being used in several studies, including Global Sentiment and Climate Change and Global Sentiment during COVID-19. The following publications are currently under review:

  • Chai, Y., Kakkar, D., Palacios, J., & Zheng, S. (2022). Twitter Sentiment Geographical Index: A global high-frequency dataset for monitoring Subjective Well-Being. Nature Scientific Data (Under Review).
  • Wang, J., Guetta-Jeanrenaud, N., Palacios, J., Fan, Y., Kakkar, D., Obradovich, N., & Zheng, S. (2022). A global nonlinear effect of temperature on human sentiment. Nature Human Behaviour (Under Review).

Questions/Comments

Any questions or comments on this dataset can be sent to Harvard CGA or emailed to Devika Kakkar.