ProLinguist: Program Synthesis for Linguistics and NLP

We introduce ProLinguist, an approach that uses program synthesis to learn explicit string transformation rules from input-output examples for NLP tasks. Our algorithms learn not only context-dependent rules, where the output depends on the surrounding input context, but also stateful rules, where the output additionally depends on the results of applying transformation rules to that context. They work with both small and large amounts of potentially noisy training data. Furthermore, an expert can control the learning process, as well as the level of abstraction of the inferred rules, by supplying linguistic knowledge to ProLinguist in the form of a domain-specific language (DSL). We demonstrate ProLinguist on a variety of NLP tasks, ranging from textbook phonology problems to more complex grapheme-to-phoneme conversion for Hindi and Tamil, and show that it can produce interpretable rules from small amounts of training data.
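
To make the distinction between context-dependent and stateful rules concrete, the following Python sketch enumerates candidate contexts over a tiny predicate vocabulary and keeps those consistent with all examples. It is only an illustration under our own assumptions, not ProLinguist's actual algorithm or DSL: the names `PREDICATES`, `apply_rule`, and `synthesize` are hypothetical, the search is brute-force enumeration, and noisy data is not handled.

```python
from itertools import product

VOWELS = set("aeiou")

# Hypothetical toy "DSL": predicates over one neighbouring character.
# "#" marks a word boundary.
PREDICATES = {
    "any": lambda c: True,
    "vowel": lambda c: c in VOWELS,
    "boundary": lambda c: c == "#",
    "z": lambda c: c == "z",
}

def apply_rule(word, target, replacement, left, right, stateful=False):
    """Rewrite `target` as `replacement` where the neighbours satisfy the
    `left` and `right` predicates. If `stateful` is True, the left
    neighbour is read from the output produced so far, not from the input."""
    padded = "#" + word + "#"
    out = "#"  # output so far, starting with the left boundary marker
    for i in range(1, len(padded) - 1):
        ch = padded[i]
        left_char = out[-1] if stateful else padded[i - 1]
        if (ch == target and PREDICATES[left](left_char)
                and PREDICATES[right](padded[i + 1])):
            out += replacement
        else:
            out += ch
    return out[1:]  # drop the boundary marker

def synthesize(examples, target, replacement):
    """Return every (left, right) context pair consistent with all examples."""
    return [
        (left, right)
        for left, right in product(PREDICATES, repeat=2)
        if all(apply_rule(x, target, replacement, left, right) == y
               for x, y in examples)
    ]

# Toy data for intervocalic voicing: s becomes z only between vowels.
examples = [("asa", "aza"), ("sa", "sa"), ("as", "as"), ("asta", "asta")]
print(synthesize(examples, "s", "z"))  # [('vowel', 'vowel')]

# Stateless vs. stateful application of "s becomes z after z".
print(apply_rule("zss", "s", "z", "z", "any"))                 # zzs
print(apply_rule("zss", "s", "z", "z", "any", stateful=True))  # zzz
```

In the stateless call, the second `s` sees the input character `s` to its left and is left unchanged; in the stateful call it sees the already-rewritten `z`, so the change propagates. Restricting or enriching `PREDICATES` plays the role, in this toy setting, of an expert adjusting the vocabulary available to the learner.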