The Deep Program Understanding project aims to teach machines to understand complex algorithms, combining methods from the programming languages, software engineering, and machine learning communities.
We have open-sourced much of our work and implementations, including utilities and project-specific sample code. See our Publications and Downloads tabs for more details.
Learning to understand programs
Building “smart” software engineering tools requires learning to analyse and understand existing code and related artefacts such as documentation and online resources (e.g., StackOverflow). One of our primary concerns is integrating standard static analysis with machine learning to create learning-based program analyses that can be embedded in software engineering tools. Such tools can then be used to find bugs, automatically retrieve or produce relevant documentation, or verify programs.
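To make this concrete, a minimal sketch of the kind of graph representation such learning-based analyses consume, using Python's standard `ast` module. This is illustrative only: published systems (e.g., Learning to Represent Programs with Graphs) add further edge types such as NextToken and data-flow edges like LastRead/LastWrite, which are omitted here.

```python
import ast

def program_graph(source: str):
    """Build a simple program graph from Python source using AST
    child edges. A sketch of the graph representations fed to
    graph neural networks in learning-based program analyses."""
    tree = ast.parse(source)
    nodes, edges = [], []

    def visit(node, parent=None):
        idx = len(nodes)
        nodes.append(type(node).__name__)   # node label = AST node type
        if parent is not None:
            edges.append(("Child", parent, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(tree)
    return nodes, edges

nodes, edges = program_graph("def f(x):\n    return x + 1\n")
print(nodes[0], len(nodes), len(edges))
```

A real pipeline would turn these labelled nodes and edges into tensors and train a graph neural network over them, for example to flag misused variables.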
Highlighted publications
- Self-Supervised Bug Detection and Repair (NeurIPS’21) | Code on GitHub
- Typilus: Neural Type Hints (PLDI’20)
- Learning to Represent Edits (ICLR’19) | Code on GitHub
- Learning to Represent Programs with Graphs (ICLR’18) | Code on GitHub
- A Survey of Machine Learning for Big Code and Naturalness (ACM Computing Surveys 2018)
Learning to generate programs
A core problem of machine learning is to learn algorithms that explain observed behaviour. This can take several forms, such as program synthesis from examples, in which the goal is to produce an interpretable program matching given input/output pairs, or programming by demonstration, in which a system learns to mimic sequences of actions.
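As a toy illustration of synthesis from examples, the sketch below enumerates sequences of primitive list operations and returns the first one consistent with the given input/output pairs. The primitive set and search strategy are hypothetical simplifications; systems like DeepCoder instead use a learned model to predict which primitives are likely and thereby prune this search.

```python
from itertools import product

# Hypothetical primitive set for illustration.
PRIMITIVES = {
    "inc": lambda xs: [x + 1 for x in xs],
    "double": lambda xs: [x * 2 for x in xs],
    "reverse": lambda xs: list(reversed(xs)),
    "sort": lambda xs: sorted(xs),
}

def run(prog, xs):
    """Apply a sequence of primitives to an input list."""
    for name in prog:
        xs = PRIMITIVES[name](xs)
    return xs

def synthesize(examples, max_len=3):
    """Enumerate programs up to max_len primitives and return the
    first one consistent with every input/output example."""
    for length in range(1, max_len + 1):
        for prog in product(PRIMITIVES, repeat=length):
            if all(run(prog, i) == o for i, o in examples):
                return prog
    return None

prog = synthesize([([3, 1, 2], [2, 4, 6])])
print(prog)
```

With a single example several programs may be consistent; adding more input/output pairs narrows the search to the intended behaviour.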
Highlighted publications
- Learning to Complete Code with Sketches (ICLR’22)
- Fast and Memory-Efficient Neural Code Completion (MSR’20)
- Generative Code Modeling with Graphs (ICLR’19) | Code on GitHub
- DeepCoder: Learning to Write Programs (ICLR’17) | Code on GitHub
- TerpreT: A Probabilistic Programming Language for Program Induction (Tech Report, 2016) | Code on GitHub
- Bimodal Modelling of Source Code and Natural Language (ICML’15)
Advancing the machine learning frontier
Structured data such as programs pose a challenge for machine learning methods. The combination of domain constraints, known semantics, and complex structure requires new machine learning methods and techniques. Our focus in this area is the analysis and generation of graphs, for which we have developed novel neural network architectures and generative procedures.
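The core computation in these graph architectures is message passing: each node aggregates transformed states from its neighbours and gates the result into its own state. A minimal NumPy sketch of one such propagation step is below; the single edge type and the simplified gate are assumptions for brevity, whereas the published gated graph neural networks use per-edge-type weights and a full GRU update.

```python
import numpy as np

def ggnn_step(h, edges, W, update):
    """One propagation step over a graph (sketch): each node sums
    linearly transformed neighbour states, then applies a gated
    state update."""
    messages = np.zeros_like(h)
    for src, dst in edges:            # message passing along edges
        messages[dst] += h[src] @ W
    return update(h, messages)

def gated_update(h, m):
    """Simplified gate: convex combination of old state and
    transformed message (a stand-in for a full GRU cell)."""
    z = 1.0 / (1.0 + np.exp(-(h + m)))  # update gate
    return (1.0 - z) * h + z * np.tanh(m)

rng = np.random.default_rng(0)
h = rng.standard_normal((3, 4))       # 3 nodes, state dimension 4
W = rng.standard_normal((4, 4))
edges = [(0, 1), (1, 2), (2, 0)]      # a directed 3-cycle
h2 = ggnn_step(h, edges, W, gated_update)
print(h2.shape)
```

Stacking several such steps lets information flow along longer paths in the graph, which is what allows these models to reason over program structure.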
Highlighted publications
- HEAT: Hyperedge Attention Networks
- Constrained Graph Variational Autoencoders for Molecule Design (NeurIPS’18) | Code on GitHub
- Graph Partition Neural Networks for Semi-Supervised Classification (ICLR’18 Workshop) | Code on GitHub
- Gated Graph Sequence Neural Networks (ICLR’16) | Code on GitHub