MLBench: Benchmarking Machine Learning Services Against Human Experts

  • Yu Liu,
  • Hantian Zhang,
  • Luyuan Zeng,
  • Wentao Wu,
  • Ce Zhang

Proceedings of the VLDB Endowment (VLDB 2018)

Modern machine learning services and systems are complex data systems; designing them is an art of balancing functionality, performance, and quality. Providing different levels of system support for functionalities such as automatic feature engineering, model selection and ensembling, and hyperparameter tuning can improve quality, but it also introduces additional cost and system complexity. In this paper, we aim to facilitate answering questions of the following type: How much do users lose if we remove support for functionality x from a machine learning service?
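
As a hedged illustration of the kind of ablation question posed above (not the paper's actual experimental setup), the sketch below compares the test accuracy of a model trained with and without one such functionality, hyperparameter tuning. The dataset, model, and search grid are all illustrative assumptions.

```python
# Illustrative ablation: quality with vs. without hyperparameter tuning.
# The dataset, model, and search space are placeholders, not the paper's setup.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "With" the functionality: a small grid search over the regularization strength.
tuned = GridSearchCV(LogisticRegression(max_iter=5000),
                     {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5).fit(X_tr, y_tr)

# "Without" the functionality: the default configuration, no tuning.
default = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)

print("with tuning:   ", tuned.score(X_te, y_te))
print("without tuning:", default.score(X_te, y_te))
```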

Answering this question using existing datasets, such as the UCI datasets, is challenging. The main contribution of this work is a novel dataset, mlbench, harvested from Kaggle competitions. Unlike existing datasets, mlbench contains not only the raw features for a machine learning task, but also the features used by the winning teams of the Kaggle competitions. The winning features serve as a baseline of best human effort and enable quality measures for machine learning services that existing datasets cannot support, such as relative ranking on Kaggle and relative accuracy compared with best-effort systems.
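
To make these measures concrete, the following sketch shows one way they could be computed, assuming a hypothetical record that pairs a service's accuracy on a task with the winning team's accuracy and the public leaderboard scores. The function names and numbers are illustrative assumptions, not mlbench's actual schema or data.

```python
# Hedged sketch of the two quality measures described above. The accuracy
# values and the leaderboard are placeholders, not data from mlbench or
# any real Kaggle competition.

def relative_accuracy(service_acc: float, winner_acc: float) -> float:
    """Service accuracy as a fraction of the winning team's accuracy."""
    return service_acc / winner_acc

def leaderboard_rank(service_acc: float, leaderboard: list[float]) -> int:
    """1-based rank the service would obtain on a leaderboard sorted by
    descending accuracy (ties placed below existing entries)."""
    better = sum(1 for score in leaderboard if score > service_acc)
    return better + 1

# Illustrative numbers only.
print(relative_accuracy(0.87, 0.91))                     # -> ~0.956
print(leaderboard_rank(0.87, [0.91, 0.89, 0.86, 0.80]))  # -> 3
```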

We then conduct an empirical study using mlbench on two example machine learning services, from Azure and Amazon, and showcase how mlbench enables a comparative study that reveals the strengths and weaknesses of these existing machine learning services quantitatively and systematically.