
Microsoft Research & Berkeley AI Research (BAIR)

Phase 2 collaborations

Secure and Privacy-Preserving Federated Learning

Kim Laine (MSR), Rob Sim (MSR), Dawn Song (BAIR), Lun Wang (PhD student), Xiaoyuan Liu (PhD student) 

Federated learning (FL) is a powerful new distributed learning paradigm that has grown into an active research field, with large-scale real-world deployments over the last several years. In FL, participants collaboratively train a model while all data is held locally, preserving data privacy. Despite its success, FL still faces a variety of security challenges, among which inference attacks and poisoning attacks are the two most notable categories. How to ensure privacy and model integrity under these attacks remains an open question of critical importance.
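As a rough illustration of the training loop described above, here is a minimal sketch of one round of federated averaging (FedAvg) for a classification model; the helper and variable names are ours, and this is not the specific protocol the project studies:

```python
import copy
import torch
import torch.nn.functional as F

def fedavg_round(global_model, client_loaders, lr=0.01):
    """One FedAvg round: each client trains locally on its own private
    data; only model weights (never raw data) are sent to the server."""
    client_states = []
    for loader in client_loaders:  # one DataLoader per participant
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(local(x), y).backward()
            opt.step()
        client_states.append(local.state_dict())
    # The server aggregates by averaging weights across clients.
    avg = {k: torch.stack([s[k].float() for s in client_states]).mean(0)
           for k in client_states[0]}
    global_model.load_state_dict(avg)
    return global_model
```

Both attack families target exactly this exchange: inference attacks try to recover private data from the shared weight updates, while poisoning attacks tamper with the client updates before aggregation.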

We propose to further explore inference and poisoning attacks in FL and to design countermeasures. Specifically, we plan to map the attack landscape under novel, stronger threat models (for example, inference attacks mounted by malicious participants) and to design and develop new defenses against these attacks.


Causal and Interpretable Machine Learning for Medical Images

Emre Kiciman (MSR), Bin Yu (BAIR), Robert Netzorg (PhD student) 

The goal of this collaboration is to develop methods that learn predictive and stable, if not causal, representations from data across domains, medical and beyond, and then to understand how a method uses those representations to make predictions when the representations are not known beforehand. Specifically, we plan to investigate approaches that improve out-of-distribution or transfer-learning performance in precision medicine, where it is often unclear to humans which high-level features drive, say, a tumor classification. For doctors seeking to apply machine learning to precision medicine, black-box models that predict from spurious correlations are not sufficient: making and communicating medical diagnoses at the individual level requires decisions made on an interpretable and stable, and ideally causal, basis. Our proposed setting, tumor identification from medical images, departs from previous settings in several ways: doctors themselves develop heuristic rules for classifying tumors; different machines produce images with very different characteristics; and the learned model must be interpretable and medically meaningful for doctors to use it in practice.


Distributed learning: privacy and data summarization

Lester Mackey (MSR), Nika Haghtalab (BAIR), Abishek Shetty (PhD student) 

This project will explore a wide range of multi-agent learning tasks under the lens of privacy. We will consider distributed learning (e.g., where the objective is to learn a model that performs well over the agents) and collaborative learning (e.g., where the objective is to learn a model that performs well for each agent). Our goal is to design algorithms that preserve the privacy of the data with guarantees that significantly outperform those that agents can achieve on their own. We plan to address these challenges by bridging privacy and data summarization. We expect the two to become closely related when agents have sufficiently large data sets: preserving the privacy of the data will approximately translate to creating a synthetic data set that summarizes the data effectively. By exploring these connections further, our project will build synergies between two well-established areas and contribute to their further progress by providing a unified perspective on the use of ML on important and sensitive data sets.
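To make the privacy-to-summarization intuition concrete, here is a minimal sketch of our own (not the project's algorithm): an agent releases a differentially private mean of its dataset via the standard Gaussian mechanism, and the calibrated noise shrinks as the dataset grows.

```python
import numpy as np

def private_mean(data, epsilon=1.0, delta=1e-5, clip=1.0):
    """Release a differentially private mean of one agent's dataset.
    Rows are clipped to L2 norm `clip`, so one record can change the
    mean by at most clip / n (the L2 sensitivity)."""
    n, d = data.shape
    norms = np.maximum(np.linalg.norm(data, axis=1, keepdims=True), 1e-12)
    clipped = data * np.minimum(1.0, clip / norms)
    sensitivity = clip / n
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return clipped.mean(axis=0) + np.random.normal(0.0, sigma, size=d)

# With n = 10,000 records the noise scale is tiny: the private release
# is also an accurate summary, illustrating the large-data regime above.
summary = private_mean(np.random.default_rng(0).normal(size=(10_000, 5)))
```

The point of the sketch is the scaling: the noise scale decreases like 1/n, so with sufficiently large local datasets a private release doubles as a faithful synthetic summary.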


Realistic Large-Scale Benchmark for Adversarial Patch

Jerry Li (MSR), David Wagner (BAIR), Chawin Sitawarin (PhD student), Nabeel Hingun (undergraduate student) 

The goal of our study is to make machine learning models robust against patch attacks. In particular, we will develop the first standardized benchmark for security against patch attacks under realistic threat models. Our benchmark will cover two important aspects often ignored in past work: (1) realistic patches that must work under multiple camera angles, lighting conditions, etc., and (2) realistic constraints on the location of the patch. We will then develop better patch attacks and use them, together with adversarial training, to improve defenses.
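As a sketch of what evaluating under a realistic threat model can look like (our own illustration with hypothetical helpers, in the spirit of expectation over transformation; the benchmark will additionally constrain where the patch may appear):

```python
import torch
import torch.nn.functional as F

def apply_patch(images, patch, max_jitter=0.2):
    """Paste `patch` (C, h, w) onto each image in a batch (B, C, H, W),
    valued in [0, 1], at a random location with random brightness,
    crudely simulating varying viewpoints and lighting."""
    B, _, H, W = images.shape
    _, h, w = patch.shape
    out = images.clone()
    for i in range(B):
        top = torch.randint(0, H - h + 1, (1,)).item()
        left = torch.randint(0, W - w + 1, (1,)).item()
        gain = 1.0 + (2.0 * torch.rand(1).item() - 1.0) * max_jitter
        out[i, :, top:top + h, left:left + w] = (patch * gain).clamp(0, 1)
    return out

def eot_patch_loss(model, images, labels, patch):
    """The attacker maximizes this loss; sampling a fresh placement and
    brightness per image approximates an expectation over transformations,
    so the optimized patch must work from many views."""
    return F.cross_entropy(model(apply_patch(images, patch)), labels)
```

A patch that only fools the model at one fixed location and exposure scores poorly under this kind of evaluation, which is precisely the gap between prior digital-only evaluations and the realistic setting we target.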


Video Representation Learning for Global and Local Features

Yale Song (MSR), Avideh Zakhor (BAIR), Franklin Wang (PhD student) 

Existing video representation learning frameworks generally learn global representations of videos, usually at the clip level. These representations are typically evaluated on action recognition benchmarks (which have a strong bias toward global appearance information) and are ill-suited for local tasks involving fine details and dense prediction, such as action segmentation and tracking. In this work, we propose to learn representations optimized for both global and local tasks by developing contrastive learning methods that operate in a spatiotemporally denser regime, beyond the clip level. Our self-supervised framework will operate on RGB frames and motion features such as optical flow to learn both coarse- and fine-grained representations of appearance and motion.
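As a minimal sketch of what "spatiotemporally denser" contrastive learning could look like (our own illustration, not the project's final method): an InfoNCE loss applied per space-time location rather than once per clip.

```python
import torch
import torch.nn.functional as F

def dense_infonce(feats_a, feats_b, temperature=0.1):
    """Contrastive loss between two augmented views of the same clips.
    feats_*: (B, C, T, H, W) feature maps. Each space-time location in
    view A is pulled toward the same location in view B and pushed away
    from every other location in the batch."""
    B, C, T, H, W = feats_a.shape
    a = F.normalize(feats_a.permute(0, 2, 3, 4, 1).reshape(-1, C), dim=1)
    b = F.normalize(feats_b.permute(0, 2, 3, 4, 1).reshape(-1, C), dim=1)
    logits = a @ b.t() / temperature            # (N, N) with N = B*T*H*W
    targets = torch.arange(len(a), device=a.device)  # positives on diagonal
    return F.cross_entropy(logits, targets)
```

The usual clip-level objective is the special case T = H = W = 1; keeping the full feature grid is what lets the representation serve dense tasks like segmentation and tracking.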


Active Visual Planning: Handling Uncertainty in Perception, Prediction, and Planning Pipelines

Xin Wang (MSR), Joey Gonzalez (BAIR), Charles Packer (PhD student) 

Our goal is to develop an Active Visual Planner (AVP) for multi-agent planning in environments with partial observability (e.g., limited visibility). Recent work on Active Perception has studied improving perception via prediction (e.g., repositioning a LiDAR sensor to improve object detection); however, existing approaches generally assume control of a specific sensor and do not enable a planner to plan entire future states (e.g., vehicle position and multiple sensor configurations) with respect to uncertainty in perception. Whereas Active Perception is primarily concerned with reducing perception uncertainty as an end in itself, an AVP will plan to improve perception only if doing so aids the planning objective. Prior work on limited-visibility reasoning is intractable (POMDP methods), imposes constraints that severely limit real-world application (game-theoretic methods), or is overly conservative, failing to account for the effect of an agent's future actions on its own perception or on other agents' future state (Forward Reachable Set methods). Prior work on contingency planning can explicitly reason about uncertainty in agent behavior but does not account for uncertainty in perception.


Nonconvex Optimization and Robustness of Neural Networks

Sebastien Bubeck (MSR), Peter Bartlett (BAIR), Yeshwanth Cherapanamjeri (PhD student) 

Machine learning systems are set to play an increasing role in everyday life, owing to the abundance of large-scale training data and the use of sophisticated statistical models such as neural networks. In light of these trends, much recent attention has been devoted to understanding both how robust and reliable these methods are when deployed in the real world and the computational complexity of actually learning them from data. In our collaboration so far, we have adopted a theoretical perspective on each of these questions, and we plan to explore and empirically validate them in future work.


Reinforcement Learning in High Dimensional Systems

Sham Kakade (MSR), Akshay Krishnamurthy (MSR), Peter Bartlett (BAIR), Juan C. Perdomo (PhD student) 

The goal of this collaboration is to explore the limits and possibilities of sequential decision making in complex, high-dimensional environments. Compared with more classical settings such as supervised learning, relatively little is known about the minimal assumptions, representational conditions, and algorithmic principles needed to enable sample-efficient learning in complex control systems with rich sets of actions and observations. Given recent empirical breakthroughs in robotics and game playing ([SHM+16], [MKS+15]), we believe this is a timely moment to develop our understanding of the theoretical foundations of reinforcement learning (RL). In doing so, we aim to identify new algorithmic techniques and theoretical insights that may help RL mature into a well-founded technology that can be routinely used in practice by non-experts.


Towards Human-like Attention

Xin Wang (MSR), Trevor Darrell (BAIR), Baifeng Shi (PhD student) 

Transformers have been shown to achieve stronger performance than CNN models, as well as greater robustness to distribution shift, image perturbations, and random/adversarial noise, thanks to the self-attention mechanism. However, there is increasing evidence of flaws in self-attention, for example:

  • The output of self-attention tends to converge to a rank-1 matrix, thus losing most of the information in the features.
  • Self-attention tends to look at different parts of an object separately based on pixel/patch similarity, instead of attending to the whole object based on semantics.
  • Self-attention can accurately recognize patch-shuffled images, suggesting that it’s neglecting position information even with positional encoding added.

All these phenomena make us wonder whether the current self-attention mechanism is the best we can do, or even the right thing to do. In particular, self-attention appears to differ from the attention mechanism in the human visual system (elaborated in “Novelty and Innovation”). To this end, we propose to improve the current self-attention mechanism, ideally borrowing ideas from the human visual system, to fix these flaws and boost performance and robustness.
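The third flaw above has a simple explanation: without positional information, self-attention is exactly permutation-equivariant, so shuffling the input tokens merely shuffles the output. The observation that patch-shuffled images are still recognized suggests the positional encodings are not being used effectively. A minimal check of the equivariance (our illustration, using PyTorch's built-in attention layer):

```python
import torch

torch.manual_seed(0)
attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
tokens = torch.randn(1, 16, 64)   # 16 patch tokens, no positional encoding

perm = torch.randperm(16)
out, _ = attn(tokens, tokens, tokens)
out_perm, _ = attn(tokens[:, perm], tokens[:, perm], tokens[:, perm])

# The output of the shuffled sequence equals the shuffled output:
# self-attention alone carries no notion of token order.
assert torch.allclose(out[:, perm], out_perm, atol=1e-5)
```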


Towards GPT-3 Style Vision-and-Language Model Scaling

Pengchuan Zhang (MSR), Trevor Darrell (BAIR) 

The one-year goal of this project is to scale up a GPT-3-style vision-and-language model, pre-trained on large-scale vision-and-language datasets, while validating the model's generalization, expandability, and debuggability over a range of vision-language tasks (including phrase grounding, visual question answering, referring expression comprehension, referring expression segmentation, and instruction following).


Pretraining Efficient Vision-Language Transformers with View- and Region-level Supervision That Encourage Cross-modal Alignment

Pengchuan Zhang (MSR), Trevor Darrell (BAIR), Dong Huk Park (PhD student) 

The goals of this research include (1) maintaining or reducing the number of model parameters compared to ViLT; (2) maintaining or increasing throughput compared to ViLT; (3) outperforming ViLT and other region-feature-based VLP models on various downstream tasks; and (4) comparing with MDETR on V+L grounding tasks and open-vocabulary object detection tasks.


ML-Based Robotic Manipulation via the Use of Language-Annotated Diverse Datasets

Andrey Kolobov (MSR), Sergey Levine (BAIR), Frederik Ebert (PhD student) 

Thanks to advances in robotic manipulation hardware, robots' physical capabilities are approaching those of the human hand, potentially enabling robots to assist people with tedious or undesirable tasks, from datacenter maintenance to sorting trash. However, the difficulty of constructing robust policies that enable versatile manipulation hardware to operate in weakly structured environments remains prohibitive. The focus of our project is overcoming this formidable challenge by pretraining robotic manipulation agents on diverse data that is only weakly relevant to the target task and operating environment. Our approach leverages language annotations in task demonstrations to train a robot to induce dense reward functions for other, previously unseen tasks, and then uses these reward functions to learn manipulation policies via a combination of reinforcement and imitation learning.
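One way to picture the reward-induction step (a hedged sketch with hypothetical module names, not the project's actual model): learn a compatibility score between the current observation and a language description of the task, and use that score as a dense per-step reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageReward(nn.Module):
    """Scores (observation, instruction) compatibility. Trained on
    language-annotated demonstrations (matched pairs as positives,
    mismatched pairs as negatives), it can emit a dense reward even
    for instructions describing previously unseen tasks."""
    def __init__(self, obs_dim, text_dim, hidden=256):
        super().__init__()
        self.obs_enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())

    def forward(self, obs, text_emb):
        o = F.normalize(self.obs_enc(obs), dim=-1)
        t = F.normalize(self.txt_enc(text_emb), dim=-1)
        return (o * t).sum(-1)   # cosine similarity in [-1, 1]
```

At training time for a new task, the policy would receive this score at every step, replacing the sparse task-completion signal that usually makes manipulation learning slow.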


Enabling Non-Experts to Annotate Complex Logical Forms at Scale

Jason Eisner (MSR), Dan Klein (BAIR), Ruiqi Zhong (PhD student) 

We propose to use non-expert annotators, who do not understand logical forms, to annotate complex logical forms at scale. We hope our research can lead to better semantic parsers at lower cost, and that it can be incorporated into tools, such as the Semantic Machines SDK, that allow non-experts to collect data and build their own semantic parsers. We will take an active learning approach with indirect supervision.


Pre-trained Representations for Language-Guided Web Navigation

Xiaodong Liu (MSR), Dan Klein (BAIR), Kevin Lin (PhD student), Cathy Chen (PhD student), Nikita Kitaev (PhD student), Jessy Lin (PhD student) 

Most existing web navigation assistants rely on text-only pretrained representations, which do not take advantage of the structural information in webpages. However, rich structured representations of webpages are needed to ground natural language requests in webpage elements more effectively.

To create more effective web assistants, we propose to (a) determine a good architecture for constructing context-dependent representations of webpages and their text, (b) train this architecture in a self-supervised manner, using only raw webpages obtained from the Internet, and (c) demonstrate that the resulting representations provide benefits on tasks involving webpages.
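For step (b), one plausible self-supervised objective on raw webpages is masked-token prediction over serialized HTML, so the representation must capture page structure as well as text (a sketch under our assumptions; choosing the actual architecture is exactly the open question in (a)):

```python
import torch
import torch.nn.functional as F

def masked_html_loss(model, token_ids, mask_id, mask_prob=0.15):
    """BERT-style pretraining on webpage tokens: hide a random subset of
    HTML/text tokens and train the model to reconstruct them from the
    surrounding markup and content."""
    masked = token_ids.clone()
    hide = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    masked[hide] = mask_id
    logits = model(masked)                       # (B, L, vocab_size)
    return F.cross_entropy(logits[hide], token_ids[hide])
```

Because tags, attributes, and visible text are interleaved in the token stream, recovering a masked token often requires reasoning about the surrounding DOM context, which is exactly the structural signal that text-only pretraining misses.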