Self-service Data Preparation

It is often cited that data scientists spend a significant portion of their time (up to 80%) cleaning and preparing data. For less-technical users (e.g., users of Excel, Power BI, and Tableau), who may be less proficient in writing code, preparing and cleaning data is not just time-consuming but also technically challenging.

In the “Self-service Data Preparation” project, our goal is to develop technologies that automate common data-preparation tasks in data science and business intelligence workflows. We aim to empower technical and non-technical users alike and, in doing so, to help democratize data.

Our research has been recognized with best paper awards at VLDB and SIGMOD. Some of our technologies have been integrated into Microsoft products such as Power Query for Power BI (program synthesis, operator recommendations, fuzzy join, fuzzy deduplication), Excel (error detection, data cleansing), Azure Machine Learning (data prep SDK), Azure Purview (auto-tagging in data lake), Azure Data Factory (fuzzy join), and Dynamics 365 Customer Insights (fuzzy join, fuzzy deduplication).