[Project header image: an eye plus language equals AI, representing visual-language learning]

Project Florence-VL

Humans perceive the world through many channels, such as images viewed by the eyes or voices heard by the ears. Though any individual channel might be incomplete or noisy, humans can naturally align and fuse information collected from multiple channels, in order to grasp the key concepts needed for a better understanding of the world.

So, what is Florence-VL about?

One of the core aspirations in artificial intelligence is to develop algorithms that endow computers with the ability to effectively learn from multi-modality (or multi-channel) data. Such data is analogous to the sights and sounds that humans gather through vision and hearing to make sense of the world around us. For example, computers could mimic this ability by retrieving the images most similar to a text query (or vice versa) and by describing the content of an image in natural language.

Azure Florence-Vision and Language, Florence-VL for short, was launched to achieve this goal: building new foundation models for multimodal intelligence. Florence-VL, as part of Project Florence, has been funded by the Microsoft AI Cognitive Service team since 2020. Motivated by strong demand from real applications and recent research progress in computer vision, natural language processing, and vision-language understanding, we strive to advance the state of the art in vision-language modeling and develop the best computer vision technologies as part of our mission to empower everyone on the planet to achieve more.

Our journey starts with VL pre-training

Recently, Vision-Language Pre-training (VLP) has shown great progress toward learning general-purpose multimodal representations. The most representative approach is to train large transformer-based models on massive image-text pair data in a self-supervised manner, for example by predicting masked elements based on their context. The cross-modal representations of the pre-trained models can then be fine-tuned to adapt to various downstream vision-language tasks.
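
To make the masked-prediction objective concrete, here is a minimal sketch, assuming BERT-style text tokens and pre-extracted region features; the module names, shapes, and mask id are illustrative assumptions, not the actual UNITER/OSCAR code.

```python
# Minimal sketch of the masked-prediction idea behind VLP (hypothetical
# shapes and module names; not an actual released implementation).
import torch
import torch.nn as nn

class TinyVLEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, n_heads=4, n_layers=2, img_feat_dim=2048):
        super().__init__()
        self.txt_embed = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(img_feat_dim, dim)    # project region features into the shared space
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlm_head = nn.Linear(dim, vocab_size)      # predict the identity of masked text tokens

    def forward(self, token_ids, region_feats):
        x = torch.cat([self.txt_embed(token_ids), self.img_proj(region_feats)], dim=1)
        fused = self.fusion(x)                          # cross-modal self-attention over text + regions
        return self.mlm_head(fused[:, :token_ids.size(1)])  # logits only for the text positions

# Toy pre-training step: mask ~15% of text tokens and predict them from context + image.
model = TinyVLEncoder()
token_ids = torch.randint(0, 30522, (2, 16))            # a batch of 2 captions, 16 tokens each
region_feats = torch.randn(2, 36, 2048)                 # 36 detected regions per image
labels = token_ids.clone()
mask = torch.rand_like(token_ids, dtype=torch.float) < 0.15
labels[~mask] = -100                                    # ignore unmasked positions in the loss
inputs = token_ids.masked_fill(mask, 103)               # 103 = [MASK] id in BERT-style vocabularies
loss = nn.functional.cross_entropy(model(inputs, region_feats).flatten(0, 1), labels.flatten())
loss.backward()
```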

There are tons of images on the web and social media that come with accompanying text, which can be used as a “free” source of labels. There are also a large number of videos whose audio tracks describe what happens in them, and this audio can be transcribed into text labels as well. In addition to not requiring manual data labeling, VLP has another important benefit: cross-modal knowledge distillation, where knowledge learned in one modality helps learning in another.

Along the journey, our team has developed a series of seminal works, including UNITER, OSCAR, VILLA, and VinVL. These models, when equipped with large-scale pre-training, have helped us build state-of-the-art techniques for challenging vision-language tasks. For example, with VIVO, we achieved the first human parity on the novel image captioning (nocaps) task. By enhancing pre-training with scene text detected in images, our TAP model also achieved No. 1 on the TextCaps Challenge 2021.

How do we further modernize our Florence-VL efforts?

These encouraging successes have driven us to further modernize our Florence-VL efforts, as detailed below.

  • End-to-end pre-training: Our previous efforts adopt transformers for multimodal fusion, while pre-trained object detectors on the vision side are used to extract regional features offline. Recently, end-to-end pre-training methods have become appealing and increasingly popular. To modernize our Florence-VL models, we have developed UFO and METER, which systematically investigate how to train a performant multimodal transformer in an end-to-end manner.
  • Scaling up: Scale is believed to be an important factor behind the recent advances in VLP. To this end, we have developed LEMON and GIT, where GIT is our most recent multimodal generative foundation model trained on 800M image-text pairs. GIT achieves new state-of-the-art results across 12 image/video captioning and QA tasks, including the first human parity on TextCaps, and reaches an accuracy of 88.79% on ImageNet-1k using a generative scheme. GIT can also recognize logos, landmarks, characters, and more.
  • Few-shot learning: Modern language models, such as GPT-3, have shown strong few-shot capabilities: by providing only a few in-context examples, GPT-3 can achieve very strong performance on downstream language understanding tasks. However, such an ability has not yet been seen in VLP. As a first step toward multimodal few-shot learning, we have developed PICa, a simple but effective method that prompts GPT-3 with image captions (see the sketch after this list). We envision that our work can inspire future studies to train a real multimodal GPT-3.
  • Efficiency and robustness: In terms of efficiency, we have developed MiniVLM, DistillVLM, and VL tickets, which investigate the use of knowledge distillation and the lottery ticket hypothesis for compressing large VL models. Further, despite the rapid progress of VLP, it remains unclear whether these state-of-the-art models are robust when encountering examples in the wild, which is critical for real-life applications. To this end, we have introduced a new Adversarial VQA benchmark, which we hope will shed new light on robustness research in the community and serve as a valuable benchmark for future work.
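
To illustrate the PICa idea from the few-shot bullet above, here is a minimal sketch. It assumes the image has already been converted into a caption, and `call_gpt3` is a hypothetical stand-in for a text-completion API; the prompt template is illustrative, not the exact one used in the paper.

```python
# A minimal sketch of the PICa idea: turn the image into text (a caption),
# then prompt a frozen language model with a few in-context VQA examples.
from typing import List, Tuple

def build_pica_prompt(in_context: List[Tuple[str, str, str]],
                      caption: str, question: str) -> str:
    """in_context: (caption, question, answer) triples used as few-shot examples."""
    header = "Please answer the question according to the context.\n\n"
    shots = "".join(
        f"Context: {c}\nQuestion: {q}\nAnswer: {a}\n\n" for c, q, a in in_context
    )
    query = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return header + shots + query

def answer_with_pica(call_gpt3, in_context, caption, question) -> str:
    prompt = build_pica_prompt(in_context, caption, question)
    return call_gpt3(prompt).strip()          # the language model completes the "Answer:" slot

# Example usage with a dummy few-shot example; no API call is made here.
examples = [("a dog catching a frisbee in a park", "What is the dog doing?", "catching a frisbee")]
print(build_pica_prompt(examples, "a man riding a red bicycle", "What color is the bicycle?"))
```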

All our modernization endeavors are deeply connected with each other, with the same final goal in mind. For example, the model architectures developed in end-to-end pre-training can serve as building blocks for our multimodal foundation model, and the techniques we developed for scaling up VLP can be used to scale our final solution. Further, we believe that when the model is truly scaled up, the capability of few-shot learning will emerge naturally. With a few lightweight adapters and a few in-context examples, we envision that our final unified VL foundation model can readily adapt to different tasks. Lastly, in order to deploy these state-of-the-art models into products, we also need to improve their efficiency and robustness for fast and reliable use.

Extension to video-language pre-training

So far, we’ve been focusing on static images. There are a lot of videos and live cameras in the world, and many application scenarios require video understanding. We have extended our image-text pre-training methods to the video domain, leveraging large amounts of video-text data to learn image and motion features for diverse video-language tasks. Along the journey, we have explored several directions to pursue state-of-the-art video-language research.

  • Learning from multi-channel videos: Videos are multi-channel in nature, usually containing two or more modalities among vision (video frames), text (subtitles), speech, and non-speech audio. To enrich video representations for video-language tasks, we have developed HERO, which learns multi-channel video representations from both video frames and their accompanying subtitles on a large-scale video-text corpus. Although our model design focuses on multi-channel videos, HERO generalizes to different video types (multi- and single-channel).
  • End-to-end pre-training: Traditional video-language methods often adopt a two-stage approach: offline video feature extraction, followed by training of a video-language fusion model. To remedy the disconnect in video domain and training objectives between the video feature extractor and the fusion model, we introduced ClipBERT, in which a CNN+transformer model operates on sparsely sampled frames and is optimized end to end on popular video-language tasks (a minimal sketch of this sparse-sampling idea follows the list). VIOLET and SwinBERT further enhance ClipBERT by introducing Masked Visual-token Modeling and Sparse Attention. More recently, we have developed LAVENDER, where Masked Language Modeling (MLM) serves as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture and, surprisingly, provides strong performance on a wide range of benchmarks covering video question answering, text-to-video retrieval, and video captioning.
  • Benchmarking video-language models: Unlike image-text pre-trained models, video-language models are often evaluated on their own choices of tasks, datasets, and video domains, making it difficult to compare different models comprehensively. To facilitate advances in video-language research, we released VALUE, a comprehensive multi-task video-and-language understanding evaluation benchmark. VALUE includes diverse and challenging video-language tasks built on multi-channel videos, covering a broad range of video genres, video lengths, and data volumes. We hope VALUE can inspire active research and discussion in the community.
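
As a concrete illustration of the sparse-sampling idea behind ClipBERT mentioned above, the following is a minimal sketch; the backbone, shapes, and hyperparameters are stand-ins, not the released model.

```python
# Minimal sketch of ClipBERT-style sparse sampling plus video-text fusion
# (hypothetical modules; the real model uses a 2D CNN backbone and a
# BERT-style fusion transformer trained end to end).
import torch
import torch.nn as nn

def sample_sparse_frames(video: torch.Tensor, n_frames: int = 4) -> torch.Tensor:
    """video: (T, C, H, W). Keep only a few uniformly spaced frames per clip,
    so the whole pipeline fits in memory and can be trained end to end."""
    t = video.size(0)
    idx = torch.linspace(0, t - 1, n_frames).long()
    return video[idx]

class TinyVideoTextFusion(nn.Module):
    def __init__(self, dim=256, vocab_size=30522):
        super().__init__()
        self.frame_enc = nn.Sequential(                  # stand-in for a 2D CNN backbone
            nn.Conv2d(3, dim, kernel_size=16, stride=16), nn.AdaptiveAvgPool2d(1), nn.Flatten()
        )
        self.txt_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frames, token_ids):
        f = self.frame_enc(frames).unsqueeze(0)          # (1, n_frames, dim) visual tokens
        t = self.txt_embed(token_ids)                    # (1, seq_len, dim) text tokens
        return self.fusion(torch.cat([t, f], dim=1))     # joint video-text representation

video = torch.randn(64, 3, 224, 224)                     # a 64-frame clip
frames = sample_sparse_frames(video, n_frames=4)
out = TinyVideoTextFusion()(frames, torch.randint(0, 30522, (1, 12)))
print(out.shape)                                         # (1, 12 + 4, 256)
```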

Unification of localization and VL understanding

We ask how language can play a more fundamental role in core vision tasks (such as image classification and object detection (OD)). Our goal is to unify both vision tasks and vision-language (VL) tasks into one general framework and to enable computer vision in the wild by using natural language as a general-purpose interface. To this end, we have developed several models along the way.

  • Grounded Language-Image Pre-training: In GLIP, we unified object detection and phrase grounding for grounded pre-training; this unification allows GLIP to learn from both detection and grounding data, improving both tasks and bootstrapping a good grounding model. In GLIPv2, we have tried to further unify localization and VL understanding, which not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Besides using standard OD heads (e.g., DyHead) to output bounding boxes, as in GLIP and GLIPv2, in UniTAB we have tried to unify text generation and bounding box prediction in a single transformer encoder-decoder architecture by representing each bounding box as a few discrete tokens, inspired by Pix2seq (see the sketch after this list). This enables UniTAB to approach different VL tasks with a single set of parameters.
  • Coarse-to-fine vision-language pre-training: In FIBER, we provide another elegant solution that aims to tackle both localization and VL understanding tasks. FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in both memory and performance. In addition, unlike previous work that is pre-trained either only on image-text data or only on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both kinds of data efficiently: (i) coarse-grained pre-training on image-text data, followed by (ii) fine-grained pre-training on image-text-box data.
  • External knowledge: In K-LITE, we provide a simple strategy that leverages external knowledge to build transferable visual systems. In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that understand both visual concepts and their associated knowledge. In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones), enabling zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on both image classification and object detection tasks.
  • Others: In DAP, we have developed detection-aware pre-training, which leverages only weakly-labeled, classification-style datasets (e.g., ImageNet) for pre-training but is specifically tailored to benefit OD tasks. In our Soft Teacher project, we enhance semi-supervised end-to-end OD training by gradually improving pseudo-label quality during the training curriculum, and show that the increasingly accurate pseudo labels in turn benefit OD training. Besides new methods, we have also introduced a new task, Open-Vocabulary Visual Instance Search (OVIS): given a textual search query, OVIS aims to return a ranked list of visual instances, i.e., image patches, that satisfy the search intent from an image database. We leverage massive image-caption pairs as weak image-level supervision to tackle this new task.
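
To make the box-as-tokens idea used in UniTAB concrete, here is a minimal sketch of the coordinate quantization step; the bin count and token layout are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of representing a bounding box as discrete tokens, the idea
# UniTAB borrows from Pix2seq (bin count and layout here are assumptions).
N_BINS = 1000  # quantize each normalized coordinate into one of 1000 bins

def box_to_tokens(box, img_w, img_h):
    """box = (x_min, y_min, x_max, y_max) in pixels -> 4 integer coordinate tokens,
    which a text decoder can emit inline with ordinary word tokens."""
    x0, y0, x1, y1 = box
    norm = (x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h)
    return [min(int(v * N_BINS), N_BINS - 1) for v in norm]

def tokens_to_box(tokens, img_w, img_h):
    """Invert the quantization (up to bin resolution)."""
    x0, y0, x1, y1 = [(t + 0.5) / N_BINS for t in tokens]
    return (x0 * img_w, y0 * img_h, x1 * img_w, y1 * img_h)

# A grounded caption can then be one token sequence mixing words and boxes, e.g.
# ["a", "dog", "<box>", 156, 104, 468, 416, "</box>", "on", "the", "grass"]
print(box_to_tokens((100, 50, 300, 200), img_w=640, img_h=480))   # -> [156, 104, 468, 416]
```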

Visual synthesis

NUWA-Infinity is our new multimodal generative model for infinite visual synthesis, defined as the task of generating arbitrarily-sized high-resolution images or long-duration videos. An autoregressive-over-autoregressive generation mechanism is proposed to handle this variable-size generation task: a global patch-level autoregressive model considers dependencies between patches, and a local token-level autoregressive model considers dependencies between visual tokens within each patch. Compared to DALL-E, Imagen, and Parti, NUWA-Infinity can generate high-resolution images of arbitrary size and additionally supports long-duration video generation. Check out our project website here.
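
For intuition, here is a minimal sketch of the nested generation loop described above; the two predictor functions are hypothetical stubs standing in for the global patch-level and local token-level models.

```python
# A minimal sketch of the "autoregressive over autoregressive" generation loop
# in NUWA-Infinity: a global model orders the patches, a local model fills each
# patch token by token. Both `predict_*` functions below are hypothetical stubs.
import random

def predict_next_patch_context(generated_patches):
    """Global, patch-level step: summarize already-generated neighbor patches
    (stub: returns a random context vector)."""
    return [random.random() for _ in range(4)]

def predict_next_token(context, patch_tokens):
    """Local, token-level step: next visual token given the patch context and
    the tokens generated so far in this patch (stub: returns a random token id)."""
    return random.randrange(8192)

def generate(n_patches: int, tokens_per_patch: int):
    patches = []
    for _ in range(n_patches):                      # outer autoregression over patches
        context = predict_next_patch_context(patches)
        tokens = []
        for _ in range(tokens_per_patch):           # inner autoregression over visual tokens
            tokens.append(predict_next_token(context, tokens))
        patches.append(tokens)                      # a VQ decoder would map these token
    return patches                                  # grids back into pixels

image_patches = generate(n_patches=6, tokens_per_patch=16)   # arbitrary size: just keep adding patches
print(len(image_patches), len(image_patches[0]))
```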