Twitter Sentiment Geographical Index digs through social media posts to look at regional moods and well-being

July 17, 2023

by Colleen Walsh
 

In Twitter’s infancy, early adopters typically used the social networking site for mundane updates about dinners or trips to the dentist. It was, many complained, just another way for the selfie-obsessed to share personal details of their everyday lives.

But the platform, which initially restricted users to 140 characters—meaning messages had to be pithy and to the point—quickly became a way for people to also share critical information. From a journalist detained in Egypt in the wake of the Arab Spring—who tweeted “beaten arrested in interior ministry” and was freed when her message went viral—to posts by government agencies about weather or safety emergencies, Twitter was increasingly used as a vital resource.

And the academic community took note.

In recent years, tweets have informed the work of countless scholars, including a group at Harvard currently analyzing billions of the social media messages. The team of geospatial experts at the University’s Center for Geographic Analysis (CGA) at IQSS is working in collaboration with the Massachusetts Institute of Technology (MIT) to mine a massive dataset of tweets containing geographical information, called geotweets. Their goal: help researchers track people’s attitudes toward a range of topics, from climate change to COVID-19.

For Harvard’s Devika Kakkar, who is leading the effort known as the Twitter Sentiment Geographical Index (TSGI), the project is particularly timely.

TSGI website banner

“This work is so important because promoting well-being is one of the United Nations sustainable development goals aimed at tackling a range of pressing problems such as poverty, global health, inequality, and climate change,” says Kakkar, data science project manager at CGA, a center located within Harvard’s Institute for Quantitative Social Science.

Devika Kakkar

The TSGI draws its raw data from the CGA’s Geotweet Archive v2.0, a global record of tweets spanning time, geography, and location, and applies natural language processing algorithms to that data to examine in detail how people are feeling about a wide range of issues, explains Kakkar. (Version 2.0 merges the CGA archive with one developed by Clemens Havas and Bernd Resch at the Department of Geoinformatics at the University of Salzburg. The original archive, Version 1.0, was developed in 2012 by Ben Lewis, CGA geospatial technology manager, and then-Harvard graduate student Todd Mostak as part of a project to create a GPU-powered spatial database known as GEOPS.)

How does it work? First, using an algorithm, the CGA team assigns each tweet from the Archive a sentiment score, a number between 0 and 1, with 1 indicating the highest positive sentiment and 0 the lowest. They then compile their findings into a publicly available downloadable database. Incoming tweets are processed in real time, and the database is updated monthly. “Until now, people really haven’t been studying this on such granular levels,” says Kakkar. “We have processed about eight billion historical tweets beginning in 2012 that offer up information about how people on social media are feeling, making this a very refined dataset in regards to subjective well-being.”
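To make the scoring step concrete, here is a minimal Python sketch of how per-tweet scores between 0 and 1 might be rolled up into a regional index. The article does not describe the team's aggregation method, so the record layout, the monthly grouping, and the use of a simple mean are all assumptions made for illustration, and the sample values are invented.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (country code, "YYYY-MM" month, sentiment score in [0, 1]).
# These values are invented for illustration, not taken from the TSGI database.
tweets = [
    ("US", "2023-01", 0.75),
    ("US", "2023-01", 0.25),
    ("US", "2023-02", 0.5),
    ("ID", "2023-01", 1.0),
]


def regional_index(records):
    """Average the per-tweet scores for each (region, month) pair.

    Assumption: a plain mean; the real index may weight or filter tweets.
    """
    groups = defaultdict(list)
    for region, month, score in records:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        groups[(region, month)].append(score)
    return {key: mean(scores) for key, scores in groups.items()}


index = regional_index(tweets)
for (region, month), value in sorted(index.items()):
    print(region, month, value)
```

Grouping by region and month mirrors the monthly update cycle described above, but any other spatial or temporal unit could be swapped in by changing the grouping key.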

Unlike earlier efforts to track subjective well-being (a blend of a person’s cognitive judgment and emotions) that were limited to a specific topic, period, or location, TSGI is expansive in scope and scale, encompassing 163 countries, 104 languages, and myriad subjects of interest to academic researchers.

Siqi Zheng
(credit: Jordan Knight)

One of those researchers is MIT professor Siqi Zheng, whose work focuses on urban environmental economics and policy. Using a language processing program on the Chinese equivalent of Twitter known as Sina Weibo, she and her MIT team extracted sentiments from those online messages to measure the impact of extreme temperatures and air pollution on people’s happiness in China. When she wanted to take her work global, Zheng turned to Harvard.

By collaborating with CGA, Zheng and her team have studied the impact of wildfires on people’s moods in Indonesia, examining both how often people discuss the topic and their expressed sentiments. They have explored the emotional toll of climate change, assessing “sentiment change due to warming temperatures, increased weather unpredictability, and more frequent environmental disasters,” and they have explored how COVID-19 affected people’s moods around the globe.

“Devika’s group really has a very strong expertise and their independent product in this area has been key to our work,” said Zheng. “We realized we really complemented each other, so we started to work together.”

Scholars have increasingly turned to Twitter as a way to track emotion and sentiment on a global scale. Such work has taken on increased importance in the wake of COVID-19 and the numerous studies indicating the pandemic triggered a worldwide surge in cases of anxiety and depression. In the past, researchers have relied on surveys to try to assess emotional well-being, says Kakkar, but that process can be labor intensive, time consuming, and expensive. By contrast, a social media-based approach is easier, more cost effective, and can be done in real time, she notes. It can also cast a much wider net. And in recent years, advances in machine learning have further streamlined the process.

Experts have traditionally employed dictionary-based language processing algorithms that assess individual words in a tweet to determine its overall sentiment score based on an average. The problem, says Kakkar, is that those programs “don’t account for context, which is very important when someone tweets something sarcastically or tries to express a more complex idea in their message.” To overcome that hurdle, the TSGI team uses Bidirectional Encoder Representations from Transformers (BERT), a more robust language processing algorithm that considers the surrounding words in a sentence in order to establish context. 

“That's the advantage of BERT—it takes into account what the person is trying to say, what the idea is—and that is very important when you try to measure a sophisticated parameter like subjective well-being,” says Kakkar. She offers the example of the two tweets: “This movie about time-travel is terrible,” and “I didn’t love this last sci-fi film.”

BERT scatter graph comparing 2 tweets: "This movie about time travel is terrible" and "I didn't love his last sci-fi film."

“Here both tweets express the same negative sentiment in a different manner; the first uses the negative sentiment word ‘terrible’ and the second the positive sentiment word ‘love,’” says Kakkar. “But BERT will put these tweets in the same negative embedding space because it considers the words in the second tweet in relation to others in the sentence.”
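The weakness of the dictionary-based approach can be reproduced with a toy Python sketch. The two-word lexicon, its scores, and the choice to average only the words found in the lexicon are assumptions made for illustration; this is not the TSGI pipeline, and it does not run BERT.

```python
import re

# Hypothetical toy lexicon on the same 0-to-1 scale used for TSGI scores.
# Real sentiment dictionaries contain thousands of scored words.
LEXICON = {"terrible": 0.0, "love": 1.0}
NEUTRAL = 0.5  # assumed fallback when no word in the text is in the lexicon


def dictionary_score(text):
    """Average the lexicon scores of the words in text, ignoring context."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else NEUTRAL


first = dictionary_score("This movie about time-travel is terrible")
second = dictionary_score("I didn't love this last sci-fi film")
print(first, second)
```

Because the word-by-word lookup never connects "didn't" to "love," the second tweet receives the most positive score possible even though a reader understands it as negative. A context-aware model such as BERT is meant to close exactly that gap.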

Wendy Guan

Wendy Guan, CGA’s executive director, notes that the center’s expertise is “analyzing data in the specialty domain of space and time.” She calls the CGA’s Geotweet Archive “a unique and valuable resource that enables us to take the pulse of the society at large on just about any research topic,” and says Kakkar’s recent work is helping scholars delve even deeper into the data.

“You can now come to the Archive and search all the tweets containing certain events, and then further specify your criteria using certain keywords involving time and location. And you can find out if people were happy about an event, if they were angry about it, what was the up and down in their mood from before, and after,” says Guan. “The TSGI is a real trove of information about sentiment on a deep level.”

For the next phase of the project, Kakkar is working on making the latest billion tweets instantly searchable with the Billion Object Platform, which will allow researchers to query, analyze, and visualize the most recent billion geotweets they’ve collected in real time. Through an interactive online interface, users can instantly filter geotweets by keyword, geographic region, and time. Colorful maps of the globe indicate the tweets’ sentiment scores and highlight the spatial distribution of tweets in different languages. “This platform allows you to do interactive analysis on these latest one billion tweets in real time,” says Kakkar, adding that the instant nature of the platform will enable researchers to study issues unfolding on the ground with the click of a button. “Within a matter of seconds, you can filter through the most recent billion tweets to match the query criteria.”
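The three filters described above (keyword, geographic region, and time) can be sketched in a few lines of Python. The `Geotweet` record, the `filter_geotweets` function, the bounding-box representation, and the sample data are all hypothetical; the article does not document the Billion Object Platform's actual interface or data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Geotweet:
    """Hypothetical record layout, not the platform's actual schema."""
    text: str
    lat: float
    lon: float
    created: datetime
    sentiment: float  # score in [0, 1]


def filter_geotweets(tweets, keyword, bbox, start, end):
    """Keep tweets that contain keyword, fall inside bbox, and were
    created in the half-open interval [start, end).

    bbox is (min_lat, min_lon, max_lat, max_lon).
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    needle = keyword.lower()
    return [
        t for t in tweets
        if needle in t.text.lower()
        and min_lat <= t.lat <= max_lat
        and min_lon <= t.lon <= max_lon
        and start <= t.created < end
    ]


# Invented sample data for illustration only.
sample = [
    Geotweet("Wildfire smoke again today", -6.2, 106.8, datetime(2023, 6, 1), 0.2),
    Geotweet("Lovely sunny morning", -6.2, 106.8, datetime(2023, 6, 2), 0.9),
    Geotweet("Wildfire near the coast", 34.0, -118.2, datetime(2023, 6, 3), 0.3),
]
indonesia_bbox = (-11.0, 95.0, 6.0, 141.0)  # rough, assumed bounding box
matches = filter_geotweets(
    sample, "wildfire", indonesia_bbox, datetime(2023, 6, 1), datetime(2023, 7, 1)
)
print([t.text for t in matches])
```

A linear scan like this would be far too slow for a billion records. Answering such queries within seconds at that scale presumably depends on spatial and temporal indexing, which this sketch leaves out.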

Billion Object Platform screenshot

The real-time approach has far-reaching applications and can help those studying a wide range of issues, says Kakkar. Government officials, for example, might be eager to track how people are responding in the event of another COVID-19 lockdown so they can tailor their policies more effectively. She notes that in the early days of the Ukraine war, many wanted to know up to the minute how people on the ground were responding to Russian attacks. “They wanted immediate access to this kind of data,” says Kakkar. “It’s that real-time access that we are targeting in the next phase of our work.”

 

The Center for Geographic Analysis (CGA) was established in 2006 to support research and teaching across all disciplines in the University as they relate to geospatial technology and methods.
