Mapping Chinese History and Society: The China Biographical Database quantifies a distant society and social network that looks remarkably like our own

July 2, 2020

by Elaine K. Howley

For millennia, Chinese society has been richly populated with individuals who’ve shared ideas and connections across an enormous landmass. Identifying and cataloging all of these people through the ages is the aim of an ambitious prosopography project supported by the Institute for Quantitative Social Science called the China Biographical Database Project (CBDB).

Prosopography uses individual biographies to build an interrelated story about historical groups. Part genealogy, part biography, part social and political history, prosopography looks at many biographies to shed light on a group for social, political, and historical understanding. At its core, the prosopographical study of China is the work of the CBDB.

To that end, led by director Peter K. Bol, the Charles H. Carswell Professor of East Asian Languages and Civilizations at Harvard University, the CBDB has been systematically identifying and capturing the biographical details of elite members of Chinese society from ancient times onward.

Professor Bol’s work continues and expands on the efforts of the late Robert M. Hartwell, who was “the first social science historian of Middle Period Chinese history,” Bol explains. Hartwell was interested in mapping and collecting biographies to use in his work and launched a Chinese biography project at the University of Pennsylvania in the 1980s. Eventually, Bol and Hartwell connected over a shared interest in Chinese history, and Hartwell decided he would bequeath his project data to Bol when he died.

“I didn’t worry about it,” Bol recalls. “He was a difficult man and he would get enamored of people and then get angry at them, so I knew he’d get angry at me and I’d be off the hook.” But Hartwell died in 1996, before removing Bol from his will, and thus, his digital projects came to the Harvard Yenching Institute, with Bol their primary caretaker.

Building upon Hartwell’s data and upgrading its accessibility and depth has been a 15-year journey for Bol. In 2004 and 2005, Michael Fuller at UC Irvine revised and modernized the database structure, and around that same time, the CBDB became a collaboration among three institutions—the Fairbank Center for Chinese Studies at Harvard University, the Institute of History and Philology of Academia Sinica in Taiwan, and the Center for Research on Ancient Chinese History at Peking University in Beijing.

When the database first came to Harvard, it contained about 25,000 individual names. Since then, Bol and his team have accelerated their research and added hundreds of thousands more individuals and their biographical details to a now much larger and more detailed relational database.

Today, the CBDB contains roughly 470,000 individuals, primarily from the 7th through 19th centuries. Bol says the project is preparing to publish a series of additional biographies that will bring the grand total up to about 500,000 in 2020.

Technology Supports Academic Pursuit

To amass such a vast number of names and accompanying biographical details, Bol needed to automate as much as possible of the process of extracting them from digital texts. “It would make sense to study one person and gather all possible data on that person and then put the data into the database,” he says. But to do that properly, “you’d have to hire a bunch of very skilled historians. We figured that if we did that kind of work, we’d probably be at the order of 50,000 names thus far, because it’s very slow and we can’t hire people full-time. So that’s why we switched to computational methods.”

These computational methods center on named-entity recognition (NER), a natural language processing technique in which a computer program mines digitized texts. The program looks for certain types of terms based on parameters it’s been taught, such as names, job titles, how the person got that job, family relationships, dates, and locations.
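As a rough illustration of this kind of pattern-driven extraction, and not the CBDB's actual pipeline, the minimal Python sketch below tags a few entity types in an English sentence using hand-written patterns. The pattern set, the labels, and the sample sentence (including the person named in it) are all invented for this example; systems that process historical Chinese texts would need far richer rules or trained models.

```python
import re

# Hypothetical patterns for illustration only; these are not the CBDB's rules.
PATTERNS = {
    "YEAR": re.compile(r"\b(1[0-9]{3})\b"),
    "OFFICE": re.compile(r"\b(magistrate|prefect|censor)\b", re.IGNORECASE),
    "KINSHIP": re.compile(r"\b(son|father|brother|grandson) of\b", re.IGNORECASE),
}

def tag_entities(text):
    """Return (label, matched_text, start_offset) tuples sorted by position."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(0), match.start()))
    return sorted(found, key=lambda item: item[2])

# Invented sentence about an invented person.
sample = "Wang An, son of Wang Yi, became magistrate of Jinhua in 1482."
for label, value, offset in tag_entities(sample):
    print(label, value, offset)
```

Each tagged span keeps its character offset, so a later step could link, say, a kinship term to the names on either side of it.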

The data is culled from a variety of sources including biographies, personal letters, court and office documents, civil service examination records, poems and other literary documents, and other biographical databases. A simple text seven sentences long can provide more than 20 discrete pieces of useful data that are then cataloged and made searchable and relational in the database.
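To make "searchable and relational" concrete, here is a small sketch using Python's built-in sqlite3 module. The table names, column names, and records are hypothetical and far simpler than the real CBDB schema; the point is only that facts about people, kinship, and offices live in separate tables that can be joined.

```python
import sqlite3

# Hypothetical, simplified schema; the real CBDB schema differs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, birth_year INTEGER);
    CREATE TABLE kinship (person_id INTEGER, kin_id INTEGER, relation TEXT);
    CREATE TABLE posting (person_id INTEGER, office TEXT, place TEXT, year INTEGER);
""")

# Invented records standing in for facts extracted from a short text.
conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                 [(1, "Wang An", 1450), (2, "Wang Yi", 1420)])
conn.execute("INSERT INTO kinship VALUES (1, 2, 'father')")
conn.execute("INSERT INTO posting VALUES (1, 'magistrate', 'Jinhua', 1482)")

def offices_of_children(conn, father_name):
    """Find postings held by people whose recorded father has the given name."""
    query = """
        SELECT child.name, posting.office, posting.place, posting.year
        FROM person AS father
        JOIN kinship ON kinship.kin_id = father.id AND kinship.relation = 'father'
        JOIN person AS child ON child.id = kinship.person_id
        JOIN posting ON posting.person_id = child.id
        WHERE father.name = ?
    """
    return conn.execute(query, (father_name,)).fetchall()

print(offices_of_children(conn, "Wang Yi"))
```

Keeping kinship and postings in their own tables means a single new fact, such as one more office held, is one more row rather than a change to the person's record.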

The project is first and foremost an academic pursuit intended to support researchers studying Chinese history around the world. But Bol says the database may also help Chinese people answer questions about their own past and families. “Descent groups in villages across China have genealogies all mapped out with thousands of people in them,” Bol says. This data can be helpful in building, augmenting, or verifying such genealogies, and a vendor in China makes the database available to libraries across the country. Bol says he hopes that partnership will provide enough revenue to support the ongoing work of the project in the future.

Societal Insights Based on Individuals

Tracing this network of relationships across Chinese history is helpful for understanding how society has moved and changed over thousands of years, and some have called for more contemporary data.

“We’re moving slowly on that,” Bol says, because early in the 20th century, Chinese society fundamentally shifted in ways that make collecting and publishing that type of information problematic. “As you enter the 20th century, powerful people don’t want anyone to know who their relatives are.” A known proximity to wealth and power can spell trouble for some.

Titles and institutions have also changed, and with the introduction of communism in the mid-20th century, it became even less likely that elites would want to have their familial relationships and wealth status cataloged and broadcast.

Nevertheless, the CBDB contains a fascinating cornucopia of information about some of China’s previous dynasties and the relationships that drove the top tiers of those societies. That information can also be represented spatially, by overlaying data points onto maps to show clearly where people have moved and information has flowed. “I’m an historian but I’ve always been interested in maps,” Bol says. “What interested me about maps is that they let you see spatial variations” that provide insight into the way a society works.

For example, Bol recently ran a query that looked at where students and their teachers lived during the Ming dynasty. A quick mapping of the data showed very clearly that the vast majority of these individuals lived on or very near a major postal route, which was not surprising, given that from the 14th through 17th centuries, the postal service was the original information superhighway—a critical means of sharing documents and supporting education.

Topographical map of China with multi-colored lines drawn on it

Above: Leading scholars in China between 1450 and 1550 (green), concentrations of students (red), and national postal routes (yellow roads and blue rivers).
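A query like the Ming dynasty one described above comes down to asking how far each person's location lies from the nearest point on a route. The sketch below is one simple way to approximate that in Python, assuming each route is reduced to a list of station coordinates; the coordinates, the 25 km threshold, and the function names are invented for illustration and do not come from the CBDB.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def share_near_route(people, stations, max_km):
    """Fraction of people within max_km of at least one station on the route."""
    near = 0
    for lat, lon in people:
        nearest = min(haversine_km(lat, lon, s_lat, s_lon)
                      for s_lat, s_lon in stations)
        if nearest <= max_km:
            near += 1
    return near / len(people)

# Invented (latitude, longitude) pairs for illustration only.
stations = [(30.0, 120.0), (30.5, 120.0), (31.0, 120.0)]
people = [(30.0, 120.1), (30.5, 120.05), (31.0, 123.0), (30.9, 120.0)]
print(share_near_route(people, stations, max_km=25))
```

Measuring to stations rather than to the line segments between them is a simplification; it understates proximity wherever stations are far apart.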

A Drive to Catalog and Quantify History

History is fundamentally made up of people and their stories, but quantifying and converting that information into data points can help researchers better understand how societies work and change over time. Making such detailed and data-driven insights widely accessible has long been a driving force for Bol.

He says he was drawn to history as a high schooler and likens the experience of studying history to that of looking at a lava lamp. The dynamic and constantly changing colors and arrangement of shapes mirror the arc of history. “Over time, you see different people or different configurations of society. Any moment, you could do a cut and that would be the situation at that moment.” But wait a moment and it’s all changed again. Quantifying those changes has become a big part of his work as an historian.

Bol initially started studying Russian—he’s always been “interested in presumed-to-be enemies of the US that we have known very little about”—but says he found the language too difficult. So, he took up Chinese instead.

Although Bol is fluent in both English and written and spoken Chinese, he recognizes that not everyone who could benefit from the database is bilingual. “One of the reasons we’ve kept it bilingual is we want it to be possible for people who don’t know Chinese to do research with the data.”

Challenges: Past, Present, and Future

It has been a challenge to accurately record information about a single person across Chinese history because common names are repeated and individuals often have multiple aliases or changes to their names and titles. But continued diligence has helped to solve the mystery of who’s who and to eliminate duplicate entries for the same individual. “Over time, ambiguity gets resolved,” Bol says.
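One way to picture the duplicate problem is as grouping records that probably describe the same person. The Python sketch below merges records that share at least one name and the same birth year, using a small union-find structure. The matching rule, the record layout, and every name in the sample data are assumptions invented for this example; the article does not describe how the CBDB resolves duplicates, and a rule this crude would still need a historian's review.

```python
from collections import defaultdict

def merge_duplicates(records):
    """Group record ids that share at least one name and the same birth year."""
    parent = {rec["id"]: rec["id"] for rec in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the representative of the merged group.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    seen = {}  # (name, birth_year) -> first record id seen with that key
    for rec in records:
        for name in rec["names"]:
            key = (name, rec["birth_year"])
            if key in seen:
                union(seen[key], rec["id"])
            else:
                seen[key] = rec["id"]

    groups = defaultdict(list)
    for rec in records:
        groups[find(rec["id"])].append(rec["id"])
    return sorted(groups.values())

# Invented records: a primary name plus aliases, and a birth year.
records = [
    {"id": 1, "names": ["Wang An", "Zijing"], "birth_year": 1450},
    {"id": 2, "names": ["Zijing"], "birth_year": 1450},
    {"id": 3, "names": ["Wang An"], "birth_year": 1390},
    {"id": 4, "names": ["Wang An", "Master Donghu"], "birth_year": 1450},
]
print(merge_duplicates(records))
```

Records 1, 2, and 4 end up in one group because they are linked through shared names in the same year, while record 3 stays separate: the same common name with a different birth year is treated as a different person.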

Looking ahead, Bol says he hopes the project will one day have a complete census of elite members of Chinese society.

But it’s a project that’s going to need long-term tending. “One of my concerns is that this is an open-ended project. I need to make sure it’s on a solid footing so it can be maintained over time and that it has an institutional home. It does now at three major institutions that I hope will be able to support it,” long into the future.

The project is also cautiously moving toward crowdsourcing, Bol says. Currently, a trained historian would need to review any data that might be input by a layperson. But technological improvements the team is working on may permit a broader group of individuals to assist with the project in the future.

For now, Bol says the more texts are analyzed and the more data is input, the more sophisticated the database becomes. “The goal for the database is to have everyone connected to something. We’re interested in social science data, and when you add up the anecdotes, it becomes data.”