When Hurricane Sandy hit New York and New Jersey in 2012, many people turned to Twitter to share firsthand information about the disaster. Twitter has become a useful social media tool for obtaining and sharing news as it happens, but the data it generates can also be a rich source of information for researchers in a number of different fields, say Penn State researchers.
Guangqing Chi, associate professor of rural sociology and demography and director of the Computational and Spatial Analysis (CSA) Core in the Social Science Research Institute (SSRI), and his team have collected 25 terabytes of geo-tagged tweets over the last three years.
“Knowing the demographics of a group is usually the first step in population research, and Twitter offers one of the newest and most rapidly growing Big Data sources,” he said. Twitter data is very versatile and can reveal a variety of social, behavioral and emotional information about its users in real time.
While Twitter data has drawn interest in research fields such as computer science, transportation planning, and urban studies, social scientists have resisted using the data. According to Chi, this is because the data is not representative of the population and because we know little about the demographic characteristics of the users who produce it. “Previous research focused on predicting only a few demographic characteristics of Twitter users and relied on small amounts of Twitter data. Also, because Twitter user demographics and language use may change year to year, the prediction methods may be inaccurate. These factors have limited researchers from taking full advantage of the information embedded in Twitter.”
In response to these challenges, Chi and his team are developing a set of methods to accurately predict demographics in real time. “Our work has the potential to change the landscape of population research,” said Chi. “It could open the door for demographers to take advantage of rich Twitter data and strengthen research in many other social science disciplines that use demographic data.”
Their methods are essentially machine-learning models. “We’re not trying to predict demographics of individual Twitter users. Rather, we are predicting the composition of a group of Twitter users and then the demographic composition of the population,” Chi explained.
“The approach is based on the premise that it is difficult to make predictions about an individual but is much easier to make predictions about large groups of individuals,” he said. His team also compares its findings against U.S. Census data to determine how effective its models are.
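The premise that group-level prediction is easier than individual-level prediction can be illustrated with a small simulation. The sketch below is not the team's method: it uses simulated users, a hypothetical classifier that is right only 70 percent of the time, and a standard correction from the quantification literature (the "adjusted count"), and it assumes the classifier's error rates are known. The function names, the 30 percent group share and the sample size are all illustrative assumptions. The simulated true share stands in for the kind of benchmark that Census data would provide.

```python
import random


def simulate_labels(n_users, true_share, rng):
    """Simulated ground truth: True if a user belongs to the group of interest."""
    return [rng.random() < true_share for _ in range(n_users)]


def noisy_classifier(labels, accuracy, rng):
    """Hypothetical classifier: returns the correct label with probability `accuracy`."""
    return [label if rng.random() < accuracy else not label for label in labels]


def adjusted_share(predictions, tpr, fpr):
    """Estimate the group's share from noisy predictions (adjusted count).

    raw share = tpr * share + fpr * (1 - share), solved for share.
    Assumes the true-positive rate (tpr) and false-positive rate (fpr) are known.
    """
    raw = sum(predictions) / len(predictions)
    estimate = (raw - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))  # keep the estimate a valid proportion


rng = random.Random(0)  # fixed seed so the run is repeatable
labels = simulate_labels(20000, 0.30, rng)
predictions = noisy_classifier(labels, 0.70, rng)

# Share of individual users the classifier labels incorrectly.
individual_error = sum(p != t for p, t in zip(predictions, labels)) / len(labels)

# Group-level comparison against the benchmark share (stand-in for Census data).
benchmark_share = sum(labels) / len(labels)
estimated_share = adjusted_share(predictions, tpr=0.70, fpr=0.30)
group_error = abs(estimated_share - benchmark_share)

print(f"individual error rate: {individual_error:.3f}")
print(f"benchmark share:       {benchmark_share:.3f}")
print(f"estimated share:       {estimated_share:.3f}")
print(f"group-level error:     {group_error:.3f}")
```

In this setup the classifier mislabels roughly three in ten individual users, yet the corrected estimate of the group's share should land close to the benchmark, because individual errors largely cancel out once they are aggregated over many users.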
Chi and the CSA Core will be building an infrastructure to more efficiently manage and analyze the information they’ve collected for social science research. Other CSA Core staff members involved in the project include Daniel Nugent, managing director; Cynthia Mitchell, research analyst/programmer; and Yosef Bodovski, research analyst.
The team will be hosting a series of workshops to promote the use of Big Data for social science research and packaging the Twitter data and capacity into a product for collaboration with SSRI researchers. “While other researchers could start data collection on their own, it would be difficult and expensive,” said Chi. “Since we’ve been collecting and analyzing the data for three years, I feel we are ahead of the curve in knowing the potential uses and challenges of Big Data.”
For example, there are many social scientists already using Twitter data to determine interests among demographic groups and track behaviors such as cyberbullying. Additionally, researchers could use the data to study user locations, times they are online, commuting patterns, topics of interest and how those topics change over time.
“Geographically annotated social media is extremely valuable for modern information retrieval,” Chi reported. “Twitter data can provide a significant amount of individual social, behavioral and emotional information that is longitudinal and georeferenced. The latter enables the linkage to other individual-level data, such as patient-based data and social surveys, as well as other environmental data.”
The work is being supported by the Social Science Research Institute and the Population Research Institute at Penn State.
Additional researchers participating in this project include Daniel Kifer, associate professor of computer science; Jennifer Van Hook, professor of sociology and demography and director of the Population Research Institute; Lee Giles, professor of information sciences and technology and director of the Intelligent Information Systems Research Laboratory; Corina Graif, assistant professor of sociology and criminology; and Stephen A. Matthews, professor of sociology, anthropology, and demography, all of Penn State; as well as Xiaopeng Li, assistant professor of civil engineering at the University of South Florida; and Tse-Chuan Yang, associate professor of sociology at SUNY Albany.