Effective Data Versioning for Collaborative Data Science

As more individuals in every organization and team perform data science, dataset versions proliferate at various stages of data analysis. More often than not, these versions are stored in an ad hoc manner in shared file systems, leading to massive redundancy and duplication and making it impossible to keep track of and find specific versions. In the first part of this talk, I will present OrpheusDB, a tool we developed for effective versioning of structured data. OrpheusDB enables true collaboration via Git-style commands and supports reasoning about versions via a rich syntax of SQL-like statements. It can compactly store, keep track of, and retrieve versions on demand. In the second part of the talk, I will focus on one of our attempts toward general-purpose data versioning, which removes assumptions made in OrpheusDB. Specifically, I will describe a generalized storage representation for arbitrary data formats that enables compact storage while maintaining fast version retrieval.

Date:
Speakers: Silu Huang
Affiliation: Illinois