Instance-Optimized Data Layouts for Cloud Analytics Workloads
- Jialin Ding ,
- Umar Farooq Minhas ,
- Badrish Chandramouli ,
- Chi Wang ,
- Yinan Li ,
- Ying Li ,
- Donald Kossmann ,
- Johannes Gehrke ,
- Tim Kraska
SIGMOD 2021 |
Today, businesses rely on efficiently running analytics on large amounts of operational and historical data to gain business insights and competitive advantage. Increasingly, such analytics are run using cloud-based data analytics services, such as Google BigQuery, Microsoft Azure Synapse, Amazon Redshift, and Snowflake. These services persist and process data in compressed, columnar formats, stored in large blocks, each of which contains thousands or millions of records. For these services, disk I/O from (remote) cloud storage is often one of the dominant costs for query processing. To reduce the amount of I/O, services often maintain per-block metadata, such as zone maps, which are used to skip blocks that are irrelevant to the query, leading to lower query execution times. However, the effectiveness of block skipping via zone maps is dependent on how the records are assigned to blocks. Recent work on instance-optimized data layouts aims to maximize block skipping by specializing the block assignment strategy to a specific dataset and workload. However, these existing approaches only optimize the layout for a single table.
In this paper, we propose MTO, an instance-optimized data layout framework that determines the blocking strategy for all tables in a multi-table dataset in the presence of joins, such as in a star or snowflake schema common in real-world workloads. MTO takes advantage of sideways information passing through joins to jointly optimize the layout for all tables, which results in better block skipping and hence reduced query execution times. Experiments on a commercial cloud-based analytics service show that MTO achieves up to 93% reduction in blocks accessed and 75% reduction in end-to-end query times compared to state-of-the-art blocking strategies.