FS-Mol: A Few-Shot Learning Dataset of Molecules

Megan Stanley; John Bronskill; Krzysztof Maziarz; Hubert Misztela; Jessica Lanini; Marwin Segler; Nadine Schneider; Marc Brockschmidt

NeurIPS 2021 | December 2021

Small datasets are ubiquitous in drug discovery, as data generation is expensive and can be restricted for ethical reasons (e.g. in vivo experiments). A widely applied technique in early drug discovery for identifying novel active molecules against a protein target is modelling quantitative structure-activity relationships (QSAR). QSAR modelling is known to be extremely challenging, as the available measurements of compound activity number only in the low dozens or hundreds. However, many such related datasets exist, each with a small number of datapoints, which opens up the opportunity for few-shot learning after pre-training on a substantially larger corpus of data. At the same time, many few-shot learning methods are currently evaluated in the computer-vision domain. We propose that expansion into a new application, as well as the possibility of using explicitly graph-structured data, will drive exciting progress in few-shot learning. Here, we provide a few-shot learning dataset (FS-Mol) and a complementary benchmarking procedure. We define a set of tasks on which few-shot learning methods can be evaluated, with a separate set of tasks for use in pre-training. In addition, we implement and evaluate a number of existing single-task, multi-task, and meta-learning approaches as baselines for the community. We hope that our dataset, support code release, and baselines will encourage future work on this extremely challenging new domain for few-shot learning.
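To make the few-shot setting described above concrete, the following is a minimal sketch of how the datapoints of a single task might be divided into a small support set (used for adaptation) and a query set (used for evaluation). It is not the FS-Mol code release: the `Datapoint` type, the `split_support_query` function, and the toy molecules and labels are all hypothetical and chosen only for illustration.

```python
import random
from typing import List, Tuple

# Hypothetical datapoint: (SMILES string, binary activity label).
Datapoint = Tuple[str, bool]


def split_support_query(
    task: List[Datapoint], support_size: int, seed: int = 0
) -> Tuple[List[Datapoint], List[Datapoint]]:
    """Shuffle one task's datapoints and split them into a small
    support set and a query set holding the remaining datapoints."""
    if not 0 < support_size < len(task):
        raise ValueError("support_size must leave at least one query datapoint")
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = task[:]         # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    return shuffled[:support_size], shuffled[support_size:]


# Toy task with made-up molecules and labels (illustrative only, not FS-Mol data).
toy_task: List[Datapoint] = [
    ("CCO", True),
    ("CCN", False),
    ("c1ccccc1", True),
    ("CC(=O)O", False),
    ("CCCC", False),
    ("CCOC", True),
]

support, query = split_support_query(toy_task, support_size=2)
print(len(support), len(query))
```

In a benchmark of this kind, a model would typically be adapted on `support` and scored on `query`, with the procedure repeated over several tasks and support-set sizes; the exact protocol used by FS-Mol is defined in the paper and its code release, not here.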

Publication Downloads

FS-Mol

December 10, 2021

FS-Mol is a few-shot learning dataset of molecules, containing molecular compounds with measurements of activity against a variety of protein targets. The dataset is presented with a model evaluation benchmark, which aims to drive few-shot learning research in the domain of molecules and graph-structured data.