
Project Florence-VL

Object Detection in the Wild via Grounded Language-Image Pre-training


Visual recognition systems are typically trained to predict a fixed set of predetermined object categories in a specific domain, which limits their usability in real-world applications. How can we build a model that generalizes to various concepts and domains with minimal annotations? While great progress has been made on coarse-grained (image-level) recognition such as CLIP, generalizable fine-grained (object-level) localization ability (e.g., object detection) remains an open challenge. Existing detection and segmentation models are “good at one task but one task only and require significant effort to adapt to a new task.”

In this blog, we introduce our recent efforts on building a generalizable localization model with language supervision (GLIP). GLIP and GLIPv2 unify localization and vision-language understanding, paving the way towards a unified CV foundation model. GLIP was accepted to CVPR 2022, where it was selected as a Best Paper Finalist.

GLIP (Grounded Language-Image Pre-training) is a generalizable object detection model (we use object detection as the representative localization task). As illustrated in Figure 1, it is language-aware, taking a natural language prompt as instruction. It is also semantic-rich, able to detect millions of visual concepts out of the box. GLIPv2 further extends this ability to instance segmentation and grounded vision-language understanding tasks; see examples in Figure 2. GLIP introduces language into object detection and leverages self-training techniques to pre-train on scalable and semantic-rich data: 24M grounded image-caption pairs. This marks a milestone towards generalizable localization models: as shown in Figure 3, GLIP enjoys strong zero-shot and few-shot transfer ability, similar to that of CLIP/GPT-2/GPT-3. We also release a HuggingFace demo. Feel free to give it a try.

Figure 1: GLIP detects objects based on a text prompt. Its zero-shot performance surpasses supervised detection models on established benchmarks (COCO & LVIS) and generalizes to various downstream tasks – the Object Detection in the Wild Benchmark (ODinW), introduced in GLIP. The visualizations are from the zero-shot (not trained on any of the task data) GLIP.
Figure 2: GLIPv2 extends the generalization ability of GLIP to instance/referring segmentation (Row 1 and 2) and grounded vision-language understanding tasks, such as grounded VQA (Row 3) and grounded image captioning (Row 4).
Figure 3. (Left) GLIP shows great data efficiency on 13 downstream tasks (ODinW): zero-shot GLIP rivals few-shot baselines, and few-shot GLIP rivals fully supervised baselines. (Right) Prompt tuning with GLIP almost matches full fine-tuning.

Object detection as a vision-language task

Figure 4. Architecture of GLIP.

At the core of GLIP is the reformulation of object detection as a vision-language task: the model is not trained to predict objects with a multi-class classifier for specific benchmarks; rather, we reformulate object detection as phrase grounding. The model takes in an image and a text prompt – either a synthesized sentence as a concatenation of category names (for detection) or a natural language sentence (for phrase grounding); the task is to identify the correspondence between phrases in the prompt and objects (or regions) in an image.
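
To make this reformulation concrete, here is a minimal sketch (not the released implementation): classification logits become alignment scores between region features and the prompt’s token features, so the prompt itself plays the role of the classifier weights.

```python
import torch

def grounding_scores(region_feats: torch.Tensor,
                     token_feats: torch.Tensor) -> torch.Tensor:
    """Region-to-token alignment logits.

    region_feats: (num_regions, d) visual features of candidate boxes
    token_feats:  (num_tokens, d)  contextual features of the prompt tokens
    returns:      (num_regions, num_tokens) alignment scores
    """
    return region_feats @ token_feats.T

# Toy example: 3 candidate regions scored against a 4-token prompt.
regions = torch.randn(3, 256)
tokens = torch.randn(4, 256)
scores = grounding_scores(regions, tokens)   # shape: (3, 4)
# In a classical detector these logits would come from a fixed
# (num_classes, d) weight matrix; here the "classifier" is the prompt itself.
print(scores.shape)
```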

We also introduce deep fusion into the model. The language features are computed using a language model, which gives the new detection (or grounding) model a dual-encoder structure. Unlike CLIP, which fuses vision and language only at the final dot-product layer, GLIP applies deep cross-modality fusion, as shown in Figure 4 (middle); we show that this fusion is crucial for learning high-quality, language-aware visual representations.
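
The sketch below illustrates the deep-fusion idea with standard cross-attention layers; it is a simplification for illustration, not the exact cross-modality multi-head attention module used in GLIP, and the shapes are chosen arbitrarily.

```python
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """One illustrative deep-fusion step: image and text features attend to
    each other, so visual features become language-aware (and vice versa)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        # Visual queries attend over the prompt tokens ...
        img_update, _ = self.img_to_txt(img_feats, txt_feats, txt_feats)
        # ... and prompt tokens attend over the visual features.
        txt_update, _ = self.txt_to_img(txt_feats, img_feats, img_feats)
        return img_feats + img_update, txt_feats + txt_update

# Toy shapes: a batch of 2 images with 100 visual features each and a
# 16-token prompt, both projected into a shared 256-d space.
fusion = CrossModalFusionLayer()
img = torch.randn(2, 100, 256)
txt = torch.randn(2, 16, 256)
img_fused, txt_fused = fusion(img, txt)
print(img_fused.shape, txt_fused.shape)
```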

Figure 5. Grounding predictions from GLIP. GLIP can locate rare entities, phrases with attributes, and even abstract words.

This reformulation allows us to pre-train GLIP on scalable and semantic-rich data: millions of image-caption pairs with millions of unique grounded phrases. Given a good grounding model (a teacher GLIP trained on a moderate amount of gold grounding data), we can automatically generate grounding boxes for massive image-text-paired data and train a student GLIP model. We showcase two real examples of the generated boxes in Figure 5. Training on such semantic-rich data delivers a semantic-rich student model. In contrast, prior work on scaling detection data simply cannot predict concepts outside the teacher models’ pre-defined vocabulary.
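
The scaling recipe can be outlined with the following hedged sketch; `teacher`, `student`, `parse_phrases`, and their methods are hypothetical placeholders standing in for the corresponding stages, not the released training code.

```python
# A hedged sketch of the teacher-student recipe; `teacher.ground` and
# `student.train_step` are hypothetical placeholders, not a real API.

def pseudo_label(teacher, image, caption, parse_phrases, score_thresh=0.5):
    """Turn one image-caption pair into pseudo box annotations."""
    phrases = parse_phrases(caption)                   # e.g. noun-phrase chunks
    matches = teacher.ground(image, caption, phrases)  # (phrase, box, score)
    # Keep only confident phrase-box matches as pseudo ground truth.
    return [(phrase, box) for phrase, box, score in matches
            if score >= score_thresh]

def self_train(teacher, student, web_data, parse_phrases):
    """Train a student GLIP on teacher-generated grounding boxes."""
    for image, caption in web_data:                    # millions of pairs
        targets = pseudo_label(teacher, image, caption, parse_phrases)
        if targets:
            student.train_step(image, caption, targets)
    return student
```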

Flexible transfer ability

Zero-shot GLIP can surpass established supervised models: GLIP can “zero-shot” transfer to a new detection task by simply rewriting the candidate categories into a language prompt. See Figure 1 and Figure 3 (left; data amount = 0) for examples.

When writing the prompt, one could take the default approach of simply concatenating all the object names with “ . ”; one could also inject domain knowledge by describing rare objects with attributes and language context. See Table 1 below, where we designed custom prompts for 6 datasets and observed significant performance improvements without any parameter changes.

Table 1. Transfer to novel concepts by writing descriptive prompts.
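
The snippet below is a tiny, runnable illustration of the two prompting styles; the example category description is invented for illustration and is not the exact prompt behind Table 1.

```python
def default_prompt(categories):
    """Default zero-shot prompt: just concatenate the category names."""
    return " . ".join(categories) + " ."

def descriptive_prompt(categories, descriptions):
    """Inject domain knowledge by describing rare or ambiguous categories."""
    parts = []
    for name in categories:
        desc = descriptions.get(name)
        parts.append(f"{name}, {desc}" if desc else name)
    return " . ".join(parts) + " ."

categories = ["stingray", "person"]
# Illustrative description only; the prompts used for Table 1 may differ.
descriptions = {"stingray": "which is flat and round with a long tail"}

print(default_prompt(categories))
# stingray . person .
print(descriptive_prompt(categories, descriptions))
# stingray, which is flat and round with a long tail . person .
```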

Few-shot / full-data fine-tuning: GLIP serves as a strong pre-trained checkpoint for easy adaptation to various tasks. When fine-tuned on COCO, GLIP (Large) achieves 60.8 AP on COCO 2017val and 61.5 on test-dev, surpassing the current public SoTA models; on 13 downstream tasks, a 1-shot GLIP rivals a fully supervised Dynamic Head (see Figure 3).

COCO | PascalVOC | AerialDrone | Aquarium | Rabbits | EgoHands | Mushrooms
58.8 | 72.9/86.7 | 23.0 | 51.8 | 72.0 | 75.8 | 88.1
Packages | Racoon | Shellfish | Vehicles | Pistols | Pothole | Thermal
75.2 | 69.5 | 73.6 | 72.1 | 73.7 | 53.5 | 81.4
Table 2. One GLIP performs well for all tasks.

One model for all detection tasks through prompt tuning: GLIP takes a language prompt as input; thus one can change the model’s predictions by tuning only the prompt embeddings. This is similar to linear probing, but the key difference is that the language and visual representations in GLIP are deeply fused. In Figure 3 (right), prompt tuning on GLIP almost matches full fine-tuning, while linear probing a conventional object detector cannot. This makes deploying GLIP efficient: one GLIP model can simultaneously perform well on all downstream tasks, reducing fine-tuning and deployment costs. See Table 2 above.
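
As a minimal sketch (assuming, for illustration, a model whose forward pass takes images plus prompt embeddings; the real interface differs), prompt tuning freezes every shared weight and optimizes only a small per-task prompt tensor:

```python
from itertools import cycle, islice

import torch
import torch.nn as nn

def prompt_tune(model: nn.Module, prompt_embeddings: torch.Tensor,
                data_loader, task_loss, steps: int = 1000, lr: float = 1e-3):
    """Illustrative prompt tuning: the shared model weights stay frozen and
    only a task-specific prompt tensor receives gradients."""
    for param in model.parameters():
        param.requires_grad_(False)                 # freeze the whole detector
    prompt = prompt_embeddings.detach().clone().requires_grad_(True)
    optimizer = torch.optim.AdamW([prompt], lr=lr)

    for images, targets in islice(cycle(data_loader), steps):
        # The forward signature (images plus prompt embeddings) is an assumption.
        loss = task_loss(model(images, prompt), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Deployment: one frozen model shared by all tasks, plus one small
    # prompt tensor per downstream task.
    return prompt
```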

GLIPv2: Unifying localization and vision-language understanding

Figure 6. GLIPv2 can perform a wide range of tasks.

The development of a general-purpose CV foundation model has been hindered by the distinction between localization tasks (traditionally considered single-modality tasks) and vision-language (VL) understanding tasks such as visual question answering and image captioning. The reformulation technique in GLIP opens a new door: we can turn every localization task (e.g., object detection and instance segmentation) into a vision-language task. We introduce GLIPv2, an upgraded model that unifies various localization and VL understanding tasks in one model architecture and shows the mutual benefit between localization and VL understanding.
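
One way to picture this unification is that every task reduces to the same (image, text) interface with grounded regions as output. The mapping below is only an illustrative sketch; the exact prompt formats used by GLIPv2 may differ.

```python
# Illustrative only: different tasks expressed through one (image, text)
# interface whose output is a set of grounded regions. The prompt formats
# shown here are examples, not the exact ones used by GLIPv2.
tasks = {
    "object detection":       ("img.jpg", "person . car . traffic light ."),
    "instance segmentation":  ("img.jpg", "person . car ."),
    "referring segmentation": ("img.jpg", "the woman holding a red umbrella"),
    "grounded VQA":           ("img.jpg", "what is the man riding?"),
    "grounded captioning":    ("img.jpg", ""),  # caption is generated, then grounded
}

for task, (image, text) in tasks.items():
    print(f"{task:23s} -> image={image}, text={text!r}")
```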

Table 3. GLIPv2 achieves near-SoTA performance on various localization and VL understanding tasks.
  • One Model Architecture for all: GLIPv2 achieves near-SoTA performance on various localization and understanding tasks. See Table 3.
  • One Set of Model Parameters for all: The pre-trained GLIPv2 can be effortlessly transferred to object detection and grounding tasks without further fine-tuning; see Table 4 (Left). With the technique of prompt tuning, a single GLIPv2 model achieves performance comparable to multiple task-specific, fully fine-tuned models; see Table 4 (Right).
  • Grounded VL understanding: Inherently a grounding model, GLIPv2 leads to VL understanding models with strong grounding ability, which are self-explainable and easy to debug. When GLIPv2 is fine-tuned on VQA, it can answer questions while localizing the mentioned entities; see Figure 2 for examples.
Table 4. One set of model parameters for all localization / grounding tasks.

Towards language-augmented visual models

GLIP and GLIPv2 demonstrate our vision: visual models can be augmented with language to achieve unprecedented generalization ability. Accompanying GLIP and GLIPv2, we are also releasing the Object Detection in the Wild Benchmark (ODinW), to be hosted at the ECCV Computer Vision in the Wild Workshop.


Acknowledgement: This research was conducted by Liunian Harold Li, Haotian Zhang, Pengchuan Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Xiaowei Hu, Yen-Chun Chen, Xiyang Dai, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao. Additional thanks go to the Microsoft Research Horizontal AI Team and Microsoft Alexander Multi-modal team for providing computing resources for large-scale training. The baseline models used in our experiments are based on the open-source code released in the GitHub repository; we acknowledge all the authors who made their code public, which tremendously accelerated our project progress.