Steering Query Optimizers: A Practical Take on Big Data Workloads

  • Parimarjan Negi ,
  • ,
  • Ryan Marcus ,
  • Mohammad Alizadeh ,
  • Tim Kraska ,
  • Marc Friedman ,
  • Alekh Jindal

2021 International Conference on Management of Data |

Published by ACM

PDF | Publication | Publication

In recent years, there has been tremendous interest in research that applies machine learning to database systems. Being one of the most complex components of a DBMS, query optimizers could benefit from adaptive policies that are learned systematically from the data and the query workload. Recent research has brought up novel ideas towards a learned query optimizer, however these ideas have not been evaluated on a commercial query processor or on large scale, real-world workloads. In this paper, we take the approach used by Marcus et al. in Bao and adapt it to SCOPE, a big data system used internally at Microsoft. Along the way, we solve multiple new challenges: we define how optimizer rules affect final query plans by introducing the concept of a rule signature, we devise a pipeline computing interesting rule configurations for recurring jobs, and we define a new learning problem allowing us to apply such interesting rule configurations to previously unseen jobs. We evaluate the efficacy of the approach on production workloads that include 150K daily jobs. Our results show that alternative rule configurations can generate plans with lower costs, and this can translate to runtime latency savings of 7-30% on average and up to 90% for a non trivial subset of the workload.