IQSS Data Science: Aiding Reproducible Research By Adding Provenance in Data Citations

November 4, 2014

In partnership with the Harvard School of Engineering and Applied Sciences (SEAS) and the Dataverse Project at the Institute for Quantitative Social Science (IQSS) at Harvard, we are pleased to announce the launch of a new project to capture meaningful provenance and incorporate it into data citations, in order to facilitate research reproducibility and reuse.

Funded by an EAGER grant from the National Science Foundation, this project, titled “Citation++: Data citation, provenance, and documentation”, will design and prototype mechanisms for adding provenance metadata to data citations. The PIs for this project (Margo Seltzer, Gary King, and Mercè Crosas) also plan to work with the research data community, especially groups working on data citation solutions, including DataCite, to incorporate provenance more broadly.

In the digital world, provenance, or lineage, is the history of how an artifact came to be in its current state. It typically includes precise references to both the inputs and the transformations that led to an object’s existence. There are myriad uses of provenance, but one of the most frequently mentioned is in data citation. Nonetheless, no widely used data citation standard or service currently includes provenance.
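
To make the definition concrete, here is a minimal Python sketch of a provenance record that holds references to inputs and the transformations applied to them. It is an illustration only, not the project's design: the class names, fields, the `fingerprint` helper, and the example identifiers are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Transformation:
    """One step applied to the inputs (hypothetical structure)."""
    tool: str    # e.g. "R" or "SQL"
    script: str  # the code or query that was run


@dataclass(frozen=True)
class ProvenanceRecord:
    """History of how one dataset came to be (hypothetical structure)."""
    output_id: str                              # identifier of the resulting dataset
    inputs: Tuple[str, ...]                     # identifiers of the input datasets
    transformations: Tuple[Transformation, ...]

    def fingerprint(self) -> str:
        """Hash the inputs and transformations so a citation could
        refer to this exact history with a short, stable string."""
        h = hashlib.sha256()
        parts = list(self.inputs)
        parts += [f"{t.tool}:{t.script}" for t in self.transformations]
        for part in parts:
            h.update(part.encode("utf-8"))
            h.update(b"\x00")  # separator so adjacent parts cannot run together
        return h.hexdigest()


# Made-up identifiers, for illustration only.
record = ProvenanceRecord(
    output_id="example:cleaned-survey",
    inputs=("example:raw-survey",),
    transformations=(Transformation("R", "clean <- na.omit(raw)"),),
)
print(record.fingerprint())
```

Two records with the same inputs and transformations yield the same fingerprint, while changing any script or input changes it, which is the property a citation would need in order to point at one precise history.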

This project leverages research in data citation and provenance to prototype and evaluate a provenance-enabled citation service that will allow researchers to access the history of a dataset. To do this, the project will deliver at least two instances of data citation services, one for R-based transformations of a dataset and one for SQL-based transformations of a dataset, using Dataverse and in conjunction with the USENIX open access repository. PI Margo Seltzer from SEAS has this to say about the collaboration:

“Having worked on provenance systems for the past several years, I’m excited to be working closely with colleagues from IQSS and Dataverse to put our research into practice and help keep Dataverse on the cutting edge of data sharing.”

The success of this project will be evaluated using at least the following metrics: the fraction of citations added after deployment of our service that incorporate the provenance keyword, the absolute number of provenance queries issued, and the ratio of non-provenance metadata queries to provenance queries issued.
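
As a rough illustration of how these three metrics relate, the Python sketch below computes them from raw counts. The function name, parameter names, and the counts in the usage example are hypothetical, not figures or code from the project.

```python
def evaluation_metrics(new_citations, provenance_citations,
                       provenance_queries, other_metadata_queries):
    """Compute the three metrics named above from raw counts.

    All arguments are non-negative integers. A metric whose denominator
    is zero is reported as None rather than raising an error.
    """
    # Fraction of citations added after deployment that include provenance.
    fraction = (provenance_citations / new_citations
                if new_citations else None)
    # Ratio of non-provenance metadata queries to provenance queries.
    ratio = (other_metadata_queries / provenance_queries
             if provenance_queries else None)
    return {
        "provenance_citation_fraction": fraction,
        "provenance_query_count": provenance_queries,
        "non_provenance_to_provenance_ratio": ratio,
    }


# Made-up counts, for illustration only.
metrics = evaluation_metrics(
    new_citations=200,
    provenance_citations=50,
    provenance_queries=40,
    other_metadata_queries=120,
)
print(metrics)
```

A falling ratio over time would suggest that provenance queries are becoming a larger share of metadata lookups, though how the project interprets these numbers is not stated in the announcement.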

Originally posted in the IQSS Data Science blog. Please see the original post for more information.