DataUp—Data Curation for the Long Tail of Science

The long tail: sure, it’s a well-known concept in business and marketing, but there’s a very important “hidden” long tail in the sciences, too. So, what is this hidden long tail of science? It consists of the millions of datasets that are not stored in a databank and therefore are not available for use by other scientists. Every day, researchers throughout the world are observing, calculating, and compiling data, recording it all on local machines within their labs, often without even sharing it within their own institutions. Regrettably, much of this data never gets deposited in larger web-accessible data repositories where it could be reused by other investigators around the globe.

As a researcher who works with other researchers around the globe, I am acutely aware of scientific data pain points; after all, those of us in the research community understand better than anyone that data preservation, curation, and sharing are critical for the advancement of scientific discovery. We want to share our data beyond our immediate groups, but many times we find ourselves hindered by a lack of tools and services designed to promote data curation and sharing.

Enter DataUp, an open-source tool that helps us document, manage, and archive our tabular data. The DataUp project was born out of the need to integrate data management seamlessly into researchers’ existing workflows. The University of California Curation Center (UC3) at the California Digital Library (CDL), with sponsorship from Microsoft Research and the Gordon and Betty Moore Foundation (GBMF), focused on creating a tool that could be used by researchers in the environmental sciences. They recognized that this field epitomizes the problems of data management and curation; in particular, data is stored locally without the description (metadata), such as where it was collected, by whom, and when, that would make it more usable by others.

By conducting surveys at ecological and environmental science events, CDL found that the majority of these scientists use spreadsheets to collect and organize their data. Rather than make them learn a new program, UC3 recognized a need for a tool that works with a program most scientists already know: Microsoft Excel.

Further surveys showed that about half of the scientists preferred a tool that would be installed on their laptop, while the other half wanted a web-based tool that they could use on any device. Well, we sponsors and the UC3 team were not about to let this divided preference thwart the creation of a much-needed tool, so, together, we decided that there needed to be two versions of the tool: an open-source add-in (extension) for Microsoft Excel, and an open-source web application.

To achieve the project goals of facilitating data management, sharing, and archiving, both the add-in and the web application accomplish four main tasks:

  1. Perform a best-practices check to ensure good data organization
  2. Guide users through creation of metadata for their Excel file
  3. Help users obtain a unique identifier for their dataset
  4. Connect users to a major repository, where their data can be deposited and shared with others
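To make the first task concrete, here is a minimal sketch of what a best-practices check on tabular data could look like. It is not DataUp's actual code or rule set: the function name `check_table` and the three rules it applies (blank or duplicate column headers, blank rows, and rows with the wrong number of cells) are assumptions chosen for illustration, and the sample data is made up.

```python
import csv
import io


def check_table(rows):
    """Return a list of warnings for a table given as a list of rows.

    Hypothetical illustration only; these rules are assumptions,
    not the checks DataUp itself performs.
    """
    warnings = []
    if not rows:
        return ["table is empty"]

    header, body = rows[0], rows[1:]

    # Every column should have a non-blank, unique name.
    seen = set()
    for i, name in enumerate(header):
        label = name.strip()
        if not label:
            warnings.append(f"column {i + 1} has no header")
        elif label in seen:
            warnings.append(f"duplicate header '{label}'")
        seen.add(label)

    # Data rows should be non-blank and as wide as the header.
    # Row numbers are 1-based and count the header as row 1.
    for r, row in enumerate(body, start=2):
        if all(not cell.strip() for cell in row):
            warnings.append(f"row {r} is blank")
        elif len(row) != len(header):
            warnings.append(
                f"row {r} has {len(row)} cells, expected {len(header)}"
            )
    return warnings


# Made-up sample: third column is unnamed, row 3 is blank, row 4 is short.
sample = "site,date,\nA1,2012-06-01,14.2\n,,\nB2,2012-06-02\n"
rows = list(csv.reader(io.StringIO(sample)))
for warning in check_table(rows):
    print(warning)
```

The sketch reads comma-separated text rather than an Excel workbook so that it runs with the Python standard library alone; the same row-by-row checks would apply to rows read from a spreadsheet.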

The California Digital Library established the initial repository, ONEShare. Researchers will be able to find tools from the DataUp project as part of the Investigator Toolkit for DataONE.

I want to thank Carly Strasser, Trisha Cruse, John Kunze, and Stephen Abrams from UC3 for their passion and commitment to bringing DataUp to life. I also want to thank Chris Mentzel from GBMF for co-funding the project with Microsoft Research Connections.

Now, get out there and DataUp!

Kristin Tolle, Director, Microsoft Research Connections
