Web data extraction using hybrid program synthesis: a combination of top-down and bottom-up inference

SIGMOD (Special Interest Group on Management of Data)

Organized by ACM

Automatic synthesis of web data extraction programs has been explored in a variety of settings, but several usability challenges remain in practice: robustness, the amount of training effort required, the complexity of the synthesized programs, and the ease of interaction in limited UI environments. In this work we address these challenges with a novel program synthesis approach that combines the benefits of deductive (top-down) and enumerative (bottom-up) synthesis strategies. This yields a semi-supervised technique that synthesizes concise web data extraction programs, expressible in standard XPath/CSS, from a small number of user-provided examples. We demonstrate the effectiveness of our method in comparison to existing techniques in terms of overall accuracy, robust inference from a small number of examples, and inference of concise programs comparable to hand-written selectors. Our method has been deployed as a feature in the Microsoft Power BI product and released to millions of users.
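
To make the idea of synthesizing a selector from a few examples concrete, the following is a minimal toy sketch in Python. It is not the algorithm from the paper. It only loosely mirrors the two directions named in the abstract: a top-down split of the goal into an optional ancestor context plus a final step, and a bottom-up enumeration of concrete steps from the example node's tag and class. The sample document, the function names (`steps_for`, `ancestors`, `synthesize`), and the ranking rule (fewest selected nodes, then shortest expression) are all assumptions made for illustration. The sketch also assumes the page is well-formed XML so that the standard library parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page (assumed well-formed so ElementTree can parse it).
DOC = """<html><body>
<div class="nav"><span class="title">Menu</span></div>
<ul class="results">
<li class="item"><span class="title">Alpha</span><span class="price">1</span></li>
<li class="item"><span class="title">Beta</span><span class="price">2</span></li>
<li class="item"><span class="title">Gamma</span><span class="price">3</span></li>
</ul>
</body></html>"""


def steps_for(node):
    """Bottom-up: enumerate concrete XPath steps that match a single node."""
    steps = [node.tag]
    cls = node.get("class")
    if cls:
        steps.append(f"{node.tag}[@class='{cls}']")
    return steps


def ancestors(node, parent_of):
    """Return the chain of ancestors of node, nearest first."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain


def synthesize(root, examples):
    """Return a short XPath that selects every example node, or None."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    first = examples[0]

    # Top-down: a selector is a final step, optionally scoped by an ancestor.
    candidates = [f".//{leaf}" for leaf in steps_for(first)]
    for anc in ancestors(first, parent_of):
        for ctx in steps_for(anc):
            for leaf in steps_for(first):
                candidates.append(f".//{ctx}//{leaf}")

    # Keep candidates covering all examples. Toy ranking (an assumption):
    # fewest selected nodes first, then the shortest expression.
    scored = []
    for xpath in candidates:
        selected = root.findall(xpath)
        if all(example in selected for example in examples):
            scored.append((len(selected), len(xpath), xpath))
    return min(scored)[2] if scored else None


root = ET.fromstring(DOC)
# The "user" marks only two of the three result titles as examples.
examples = [el for el in root.iter("span") if el.text in ("Alpha", "Beta")]
selector = synthesize(root, examples)
print(selector)
print([el.text for el in root.findall(selector)])
```

On this toy page, the chosen selector should also pick up the third, unlabeled title while excluding the navigation entry, which is the kind of generalization from few examples that the abstract describes. A real system would need a far richer selector language and a more principled ranking than the one assumed here.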