The Case for ML-Enhanced High-Dimensional Indexes

  • Rong Kang ,
  • ,
  • Chen Wang ,
  • Ce Zhang ,
  • Jianmin Wang

AIDB@VLDB 2021 |

Author's Version

The case of learned indexes has recently inspired intensive research in the database community. It has been shown that learned indexes can often outperform traditional indexes such as B-tree on lowdimensional data. However, learned indexes suffer from inherent difficulties due to the “curse of dimensionality” when applied to datasets with very high dimensions (e.g., time series data). In this paper, we explore the “middle ground” between traditional and learned indexes where we can leverage advantages of both, targeting highdimensional data. Specifically, we study ML-enhanced indexes in the context of processing ????-nearest-neighbor (????NN) queries over time series data. Our ML-enhanced indexes take traditional tree-based indexes as inputs, and train deep neural networks that capture the distribution of nearest neighbors among index leaf nodes. We then process ????NN queries by scanning the leaf nodes with respect to the order suggested by the neural networks (rather than their original order). Experimental evaluation shows that ML-enhanced indexes can significantly outperform their traditional counterparts in terms of the recall of the nearest neighbors with respect to the number of index leaf nodes being scanned.