Beyond 100 Installations: Dataverse's Growth and Global Role in Research Data Sharing

by Danielle Benaroche Gottesman

In the vast realm of obtaining and sharing research data, the possibilities are nearly limitless. Realizing that potential, however, can feel like a daunting voyage through uncharted territory. Even for researchers versed in harnessing their findings, the path to a viable open-source sharing platform, whether used internally or across a broader network, does not always lead directly to efficient data recording, storage, or discovery.

Enter Dataverse, an open-source web application created to allow users to share, preserve, cite, explore, and analyze research data. Developed at Harvard's Institute for Quantitative Social Science (IQSS) in collaboration with contributors from around the world, the Dataverse Project was built to make data readily available; allow users to reference and replicate others' work more efficiently; and facilitate traceable academic credit and web visibility among researchers, journals, data distributors, and institutions. In the quantitative data world, where information management encompasses compilation methods, citation practices, and dataset integrity, the ability to reliably track informational origins while building upon others’ work can be transformative.

[Photo: Stefano Iacus]

Dataverse recently surpassed a milestone of more than 100 installations worldwide, a refreshingly clear indication of how researchers and organizations want to both use and store data. Once an organization installs and customizes the Dataverse platform, its users may deposit and share their datasets with the public or with specific users. “It’s highly configurable,” says Stefano Iacus, director of Data Science and Product Research at IQSS, “and many of the users use it differently to share their data among their communities in their own ways and using their own field-specific language.”

Users with sensitive information benefit from the option to share data over an internal, secure network rather than relying on, for example, vulnerable individual laptops. For scientific communities eager to share and use one another’s findings more publicly, Dataverse serves equally well.

Once inside a Dataverse installation, users can access thousands of collections of related data, known as datasets, much like books on specific topics in a library. What sets the software apart is that the data it hosts is organized and tagged in ways that make it easy to find, interpret, and use, comparable to a well-organized library in which each book has a distinct author, title, and synopsis. Because Dataverse lets people share and collaborate on the data that resides within it, users contribute to a broader and more accessible pool of knowledge. In a broad sense, this fosters research, facilitates analysis, and enhances applications across innumerable fields.

In terms of nomenclature, says Sonia Barbosa, associate director of Dataverse Support, Data Curation, and the Henry A. Murray Research Archive at IQSS, “people confuse The Dataverse Project and the Harvard Dataverse Repository. The Project is the software people can use to create their repository,” explains Barbosa, who has worked with repositories since the 1990s. To summarize, “The Dataverse Project” is the term used to describe the software itself; a Dataverse installation is an adoption of the software by an organization or institution as its data repository; and a “collection” is a container used to hold the data from related projects within a Dataverse installation [see graphic].

[Graphic: Dataverse Collection Management]

The Dataverse platform as a whole is designed to adhere to the FAIR principles, in which data meet specifications for Findability (F), Accessibility (A), Interoperability (I), and Reusability (R) and “emphasize machine-actionability (i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention).”[1] Philip Durbin, a Dataverse software developer who leads the Dataverse Users Community Group, likens the user experience to WordPress—an open-source content management system that hosts blogs and websites for the average user who is not versed in coding. “I think of Dataverse as similar to downloading software to host a blog. In our world though, we’re not hosting blogs; we’re hosting a data repository that we run ourselves here at Harvard. An IT professional would download the web application to encourage data sharing across their organization.”

[Photo: Phil Durbin]

“For a long time I've been working to foster the community,” Durbin says. “Having so many users is a good problem to have, and we are trying to move more toward modularity. Instead of a single monolithic piece of software, we would ideally enable the community to write plug-ins, and share them with the community to add to the core functionality.” Iacus adds, “It is also important to mention the deep re-architectural work done under the hood during these last two years to improve user experience, modularity, personalization, and speed of development.”

Devoted to the software's adoption and ongoing growth, the IQSS team has made downloading it as small a hurdle as possible. Institutions, whether they opt to make their data public or private, can download the software directly from the Dataverse site.

“We’ve almost made it so that you can click a button to have the Dataverse installation running within a few minutes,” says Iacus.

[Figure: World map of Dataverse installations]

Surpassing 100 Installations Worldwide, and the Dataverse Team 

As of today, there are 120 installations in more than 38 countries worldwide, from Iceland to Kenya to Lebanon. The number is even more impressive given that each installation serves more than one institution. “For example,” says Iacus, “we host Harvard, but we also host journals and a number of other institutions, meaning that per each download of the software, the number of institutions using Dataverse is much higher than just the single download.”

“It’s an incredible endogenous growth,” says Iacus, “because when I arrived less than two years ago, there were 70 installations overall. We are also aware that more are coming.” While he acknowledges that proliferation can be both good and bad, he also points out that the IQSS team “foresees steady growth and recently completed the support for hosting very large data, which is becoming more and more common across so many disciplines.”

Dataverse’s growth is due largely to educational efforts around usage and implementation, which include teaching researchers how to share information in ways that will benefit the larger community.

[Photo: Sonia Barbosa]

“Our role centers around guiding on data sharing, best practices, and how to do that correctly,” says Barbosa. “The software is designed to influence [users] toward best practices. It has all of the prompts to guide you in entering data properly, and all the metadata to guide you in your discipline on how to share your data (by title, by author, by description, by discipline, etc.). There are over 100 metadata fields to help describe the data so other people can use it well. If you have two similar datasets, someone looking at it will go with the one that makes the most sense,” she explains. “The Dataverse-supported Repository allows the data to become usable and user-friendly, and the best part is that the workflow is set up to educate on best practices as the data gets entered.”

Because the National Institutes of Health (NIH) has, since January 2023, required NIH-funded researchers to submit plans describing how scientific data (“any data needed to replicate research findings”) will be managed and shared, the Dataverse team acknowledges the importance of its collective role as educators.

“We created data management plan guidelines specifically for the Dataverse-based Repository, asking researchers, ‘What are you collecting? How are you collecting? What are the licenses?’ and any other requirements the NIH has, sometimes regarding authorship or ethical guidelines,” says Barbosa. “So we are also educating researchers when they come to us for help with those guidelines.”

Beyond the webinars and trainings offered by the Dataverse team, Barbosa emphasizes the educational value of their guidance on best practices and creating metrics to make tools consistent across repositories. The Dataverse team also offers open office hours (available by appointment in person or via Zoom).

“Researchers will come to us from partner organizations, like the Harvard-affiliated hospitals. We’ll learn about their projects, consult on how to use the Harvard Repository and get enough information on what tools we have to support their data needs—like if they are generating 3D images, for example,” says Barbosa. “The Repository provides metrics like download counts, and DataCite mints the DOIs we use, connecting the research and tracking usage; and that’s how [researchers] find out they’re being cited, used, and linked.” 

Barbosa credits Dataverse’s success to the software being open access and having such a large community of support. Organizations and institutions determine their community needs, and they have access to a community of experts on efforts around training and data sharing, “letting people know how to share, clean, and process data. Now with the [NIH] mandates, you have to give enough information so that it’s clear that best practices are being followed around securing data.” While Dataverse does not yet support sensitive data in an official capacity, Barbosa points out that the team is actively working to do so and notes, “we can still provide all the guidance needed on how to do this. We also ensure that the data provider maintains copyright on their data and can publish and remove data as necessary (anonymized sensitive medical data, for example, and who has access to it). As they say in the data-sharing world, ‘You make data as open as possible and as closed as necessary.’”

A Decade of Community Meetings 

At the recently concluded 10th annual Dataverse Community Meeting in Mexico City, developers, researchers, librarians, and IT professionals came together at the International Maize and Wheat Improvement Center (CIMMYT, by its Spanish acronym) headquarters in Texcoco, Mexico, to discuss and collaborate on the continued growth of Dataverse.

The Community Meeting offered Q&As, keynotes from speakers across numerous fields, updates, and workshops on data preservation, integrity, and maintenance.

“I love the international flavor of it,” Durbin says. “Our community is asking to move [the conference] around, and I’m proud to be part of this movement for greater access to research data.” Last year’s conference took place in Braga, Portugal, following three pandemic-influenced virtual editions. The destination for next year’s meeting has yet to be determined, but the event promises excitement.

“One of the nicest pieces of news we heard at the Community Meeting,” says Barbosa, “is that despite everything they are going through in Ukraine, they were able to get a repository installation at an open social science program.”

“The meeting,” Iacus says, “was a celebration of the huge effort made by the IQSS team and the whole community to make Dataverse what it is today. The choice of the hosting institution was also very important: CIMMYT is a very important center for agricultural development that focuses on selecting seeds from maize, rice, and wheat that can survive pests and calamitous events, and they have saved millions of lives by sending these seeds to Africa and Asia (their founder won the Nobel Prize in 1970). They have sensor data, genetic inscription, satellite data, and field data that they want to make available for research. They chose Dataverse as a platform for its unique ability to make data discoverable, using the FAIR principles. This is one of the first data repositories that has implemented these principles to make data searchable in a way that can be accessible by researchers.”

When discussing Dataverse, Durbin references Brian Nosek’s Strategy for Culture Change pyramid, where the base level emphasizes, “‘make it possible, then make it easy, then normative, and then rewarding’; and encourage best practices to influence larger organizations like the NIH to influence behavior for the greater good.”

In its ever-expanding state, Dataverse steadily offers value for scientists and scholars looking to exchange and perpetuate clean data. As the software continues to enable users to safely and reliably take charge of how their research is recorded and disseminated, reproducibility and transparency are facilitating new levels of collaboration, dissolving boundaries, and making a positive global impact. The future looks promising.

“Up to now,” says Iacus, “Dataverse was seen exclusively as a data repository. But Dataverse is also more than that in at least two ways. It is a platform that is now able to support different types of research objects (not just traditional data) throughout the entire research lifecycle. But it is also very powerful in handling metadata and therefore connecting researchers to data no matter where they are physically hosted. We think Dataverse can be useful in many situations beyond the present most common use case.”

[1] "FAIR Principles". GO FAIR. Retrieved 2020-02-16. Material was copied from this source, which is available under a Creative Commons Attribution 4.0 International License.
