Efficient Attribute Recommendation with Probabilistic Guarantee

Proc. 2018 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining |

Published by ACM – Association for Computing Machinery

Publication

We study how to efficiently solve a primitive data
exploration problem: Given two ad-hoc predicates which define two subsets
of a relational table, find the top-K attributes whose distributions in
the two subsets deviate most from each other. The deviation is measured
by ℓ1 or ℓ2 distance. The exact approach is to query the full table to
calculate the deviation for each attribute and then sort them. It is too
expensive for large tables. Researchers have proposed heuristic sampling
solutions to avoid accessing the entire table for all attributes. However,
these solutions have no theoretical guarantee of correctness and their
speedup over the exact approach is limited. In this paper, we develop an
adaptive querying solution with probabilistic guarantee of
correctness and near-optimal sample complexity. We perform experiments in both
synthetic and real-world datasets. Compared to the exact approach
implemented with a commercial DBMS, previous sampling solutions achieve
up to 2× speedup with erroneous answers. Our solution can produce 25×
speedup with near-zero error in the answer.