Pipemizer: An Optimizer for Analytics Data Pipelines

  • Sunny Gakhar
  • Joyce Cahoon
  • Wangchao Le
  • Xiangnan Li
  • Kaushik Ravichandran
  • Hiren Patel
  • Marc Friedman
  • Brandon Haynes
  • Shi Qiao
  • Alekh Jindal

PVLDB

Pipemizer is an optimizer and recommender aimed at improving the performance of queries or jobs in pipelines. Such job pipelines are ubiquitous in modern data analytics because jobs commonly read output files written by other jobs. Given that more than 650k jobs run on Microsoft’s SCOPE job service per day and about 70% have inter-job dependencies, identifying optimization opportunities across query jobs is of considerable interest to both cluster operators and users. Pipemizer addresses this need by providing recommendations to users, helping users understand their system, and facilitating the automated application of recommendations. Pipemizer introduces novel optimizations that include holistic pipeline-aware statistics generation, inter-job operator push-up, and job split & merge.