FLAME: A Small Language Model for Spreadsheet Formulas

AAAI |

Organized by Association for the Advancement of Artificial Intelligence

Spreadsheets are a vital tool for end-user data management.
Using large language models for formula authoring assistance in these environments can be difficult, as these models are expensive to train and challenging to deploy due to
their size (up to billions of parameters). We present FLAME,
a transformer-based model trained exclusively on Excel formulas that leverages domain insights to achieve competitive
performance while being substantially smaller (60M parameters) and training on two orders of magnitude less data. We
curate a training dataset using sketch deduplication, introduce
an Excel-specific formula tokenizer, and use domain-specific
versions of masked span prediction and noisy auto-encoding
as pre-training objectives. We evaluate FLAME on formula
repair, formula completion, and similarity-based formula retrieval. FLAME can outperform much larger models, such as
the Davinci (175B) and Cushman (12B) variants of Codex
and CodeT5 (220M), in 10 of 14 evaluation settings for the
repair and completion tasks. For formula retrieval, FLAME
outperforms CodeT5, CodeBERT, and GraphCodeBERT.