Microsoft at CVPR 2023: Pushing the boundaries of computer vision

By , Distinguished Scientist, and , Senior Principal Research Manager

Logo for the CVPR 2023 conference showing the Vancouver, British Columbia skyline with the conference dates, June 18–23, 2023. In the background, there is a faded photo of the city of Vancouver on a sunny day.

In the vast realm of artificial intelligence, few fields have captivated our imagination and pushed the boundaries of possibility quite like computer vision. At the core of this domain of research and innovation lies the ambition to empower technologies for real-world vision-based systems, enabling machines to take in and respond to visual stimuli with unparalleled precision and sophistication. Through the combination of AI, deep learning, and vast amounts of data, computer vision has made great strides in recent years, catapulting us into an era in which the seemingly impossible becomes achievable.

The Conference on Computer Vision and Pattern Recognition (CVPR) 2023, held June 18 through June 22 in Vancouver, is a widely recognized event that brings together leading experts in the field of computer vision. It serves as a platform for showcasing some of the most compelling and innovative work in this domain.

The contributions presented by Microsoft researchers and their collaborators at this year’s CVPR cover a wide spectrum of research endeavors. From generative models and network pretraining to sign language understanding and neural video codecs, these cutting-edge advancements underscore the evolving capabilities of systems to analyze and extract valuable insights from visual data.

Here are some of the highlights (see below for a list of published papers and their authors): 

Uniting vision, language, and multi-modal encoding

The paper “Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks” lies at the intersection of vision, language, and multimodal pretraining. To learn from these different forms of data, we present a general-purpose foundation model that treats images as a “foreign language.” Data from the different modalities are encoded with Multiway Transformers, a modular architecture that enables both modality-specific encoding and deep fusion. The model is pretrained on images, text, and image-text pairs in a way that generalizes the masked language modeling approach to different modalities. By substantially scaling the model and data, we found that these advances in foundation architecture and pretraining lead to excellent transfer performance across a variety of vision and vision-language tasks, including object detection, semantic segmentation, image classification, visual reasoning, visual question answering, image captioning, and cross-modal image retrieval.
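
To make the modality routing concrete, below is a minimal PyTorch-style sketch of a Multiway Transformer block: a self-attention layer shared across modalities followed by per-modality feed-forward experts. The module name, the dimensions, and routing by a modality_ids tensor are illustrative assumptions, not the paper's released implementation.

```python
# A minimal sketch of a Multiway Transformer block, assuming a PyTorch setting.
# Shared self-attention mixes tokens from all modalities; each token is then
# routed to a feed-forward "expert" for its modality (vision, language, or fused).
import torch
import torch.nn as nn


class MultiwayBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward expert per modality.
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for name in ("vision", "language", "vl")
        })

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # Shared attention over the full token sequence (all modalities together).
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Modality-specific encoding: each token goes through its own expert.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for idx, name in enumerate(("vision", "language", "vl")):
            mask = modality_ids == idx
            if mask.any():
                out[mask] = self.experts[name](h[mask])
        return x + out
```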

Scaling training data for large vision models

The strength of large language models stems from their ability to leverage unlabeled training data on a massive scale. By using this data, these models acquire a broad understanding of language, enhance their generalization abilities, and improve their performance across a wide range of language-related tasks. Inspired by this achievement, our research focuses on the possibilities of scaling training data for large vision models. In the paper “On Data Scaling in Masked Image Modeling,” we explore the effects of data scaling on large vision models that are pretrained through masked image modeling. Through extensive investigation, we discovered that masked image modeling in large vision models requires large-scale data for effective pretraining. However, unlike large language models, large vision models cannot benefit from more data in a non-overfitting scenario. These findings deepen our understanding of masked image modeling and may pave the way for future advancements in large-scale vision models.
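
As background for what a masked-image-modeling pretraining step involves, here is a minimal sketch under common assumptions: a ViT-style encoder over patch tokens, raw-pixel reconstruction targets, and a fixed masking ratio. These choices are illustrative and are not the configuration studied in the paper.

```python
# A minimal sketch of one masked-image-modeling training step, assuming a generic
# ViT-style encoder and head passed in by the caller; patch size, masking ratio,
# and the raw-pixel MSE target are illustrative assumptions.
import torch
import torch.nn as nn


def mim_loss(encoder: nn.Module, head: nn.Module, images: torch.Tensor,
             patch_size: int = 16, mask_ratio: float = 0.4) -> torch.Tensor:
    b, c, hgt, wid = images.shape
    # Flatten the image into patch tokens of shape (B, N, patch_size*patch_size*C).
    patches = images.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)
    n = patches.shape[1]
    # Randomly mask a fixed ratio of patches per image.
    mask = torch.rand(b, n, device=images.device) < mask_ratio
    tokens = patches.clone()
    tokens[mask] = 0.0  # placeholder; a learnable mask token is typically used instead
    # Encode the corrupted sequence and predict the original patch content.
    pred = head(encoder(tokens))
    # The reconstruction loss is computed only on the masked positions.
    return ((pred - patches) ** 2)[mask].mean()
```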

Creating 3D avatars with a diffusion network

In the world of image generation, incredible strides have been made in transforming text descriptions into stunning visuals. The rise of DALL-E and diffusion models has brought these cutting-edge tools into the hands of everyday users. In the paper “RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion,” we expand on this innovation by bringing the power of diffusion to 3D avatar generation. Doing so requires transferring diffusion from 2D to 3D, a significant challenge because of the prohibitive memory and processing costs of producing high-quality results with rich detail in 3D. We address this problem with the roll-out diffusion network (RODIN), which unrolls a 3D neural radiance field into a single 2D feature plane and performs 3D-aware diffusion on it. Supported by other technical contributions, including latent conditioning to promote global coherence and hierarchical synthesis to further enhance details, RODIN significantly accelerates the otherwise tedious 3D modeling process and opens new opportunities for 3D artists.
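
To illustrate the roll-out idea, the sketch below assumes a tri-plane representation of the radiance field (three axis-aligned feature planes) and shows how those planes can be laid out into a single 2D feature map for a 2D denoiser and then split back. The shapes, the roll_out/roll_in helpers, and the placeholder convolution are illustrative assumptions rather than RODIN's actual architecture, where 3D-aware attention inside the diffusion backbone enforces cross-plane consistency.

```python
# A minimal sketch of rolling out tri-plane features into one 2D feature map,
# assuming planes of shape (B, 3, C, H, W); all names and shapes are illustrative.
import torch
import torch.nn as nn


def roll_out(planes: torch.Tensor) -> torch.Tensor:
    # (B, 3, C, H, W) -> one wide 2D feature map of shape (B, C, H, 3*W).
    b, p, c, h, w = planes.shape
    return planes.permute(0, 2, 3, 1, 4).reshape(b, c, h, p * w)


def roll_in(feature_map: torch.Tensor, num_planes: int = 3) -> torch.Tensor:
    # Inverse operation: split the wide 2D map back into the tri-plane tensor.
    b, c, h, pw = feature_map.shape
    w = pw // num_planes
    return feature_map.reshape(b, c, h, num_planes, w).permute(0, 3, 1, 2, 4)


# Illustrative denoising step: a placeholder 2D network operates on the rolled-out
# plane; in practice a diffusion U-Net with 3D-aware attention would be used here.
denoiser = nn.Conv2d(32, 32, kernel_size=3, padding=1)
planes = torch.randn(2, 3, 32, 64, 64)      # noisy tri-plane features
noise_pred = denoiser(roll_out(planes))     # (2, 32, 64, 192)
planes_pred = roll_in(noise_pred)           # back to (2, 3, 32, 64, 64)
```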

Microsoft papers published at CVPR 2023 with their authors:

  1. 3D Human Mesh Estimation from Virtual Markers
    Xiaoxuan Ma, Peking University; Jiajun Su, Peking University; Chunyu Wang, Microsoft Research; Wentao Zhu, Peking University; Yizhou Wang, Peking University and National Engineering Research Center of Visual Technology
  2. 3D Line Mapping Revisited
    Shaohui Liu, ETH Zurich; Yifan Yu, ETH Zurich; Rémi Pautrat, ETH Zurich; Marc Pollefeys, ETH Zurich and Microsoft Research; Viktor Larsson, Lund University
  3. BlendFields: Few-Shot Example-Driven Facial Modeling
    Kacper Kania, Warsaw University of Technology; Stephan J. Garbin, Microsoft Research; Andrea Tagliasacchi, Simon Fraser University and Google Brain; Virginia Estellers, Microsoft Research; Kwang Moo Yi, University of British Columbia; Julien Valentin, Microsoft Research; Tomasz Trzciński, Jagiellonian University; Marek Kowalski, Microsoft Research
  4. CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning
    Yiting Cheng, Fudan University; Fangyun Wei, Microsoft Research; Jianmin Bao, Microsoft Research; Dong Chen, Microsoft Research; Wenqiang Zhang, Fudan University
  5. Deep Frequency Filtering for Domain Generalization
    Shiqi Lin, University of Science and Technology of China; Zhizheng Zhang, Microsoft Research; Zhipeng Huang, University of Science and Technology of China; Yan Lu, Microsoft Research; Cuiling Lan, Microsoft Research; Peng Chu, Microsoft; Quanzeng You, Microsoft; Jiang Wang, Microsoft; Zicheng Liu, Microsoft Research; Amey Parulkar, Microsoft; Viraj Navkal, Microsoft; Zhibo Chen, University of Science and Technology of China
  6. DeepLSD: Line Segment Detection and Refinement with Deep Image Gradients
    Rémi Pautrat, ETH Zurich; Daniel Barath, ETH Zurich; Viktor Larsson, Lund University; Martin R. Oswald, University of Amsterdam; Marc Pollefeys, ETH Zurich and Microsoft Research
  7. DETRs with Hybrid Matching
    Ding Jia, Peking University; Yuhui Yuan, Microsoft Research; Haodi He, Stanford University; Xiaopei Wu, Zhejiang University; Haojun Yu, Peking University; Weihong Lin, Microsoft Research; Lei Sun, Microsoft Research; Chao Zhang, Peking University; Han Hu, Microsoft Research
  8. EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention
    Xinyu Liu, Chinese University of Hong Kong; Houwen Peng, Microsoft Research; Ningxin Zheng, Microsoft Research; Yuqing Yang, Microsoft Research; Han Hu, Microsoft Research; Yixuan Yuan, Chinese University of Hong Kong
  9. Four-View Geometry with Unknown Radial Distortion
    Petr Hruby, Viktor Korotynskiy, Timothy Duff, Luke Oeding, Marc Pollefeys, ETH Zurich and Microsoft Research; Tomas Pajdla, Viktor Larsson, Lund University
  10. High-Fidelity and Freely Controllable Talking Head Video Generation
    Yue Gao, Microsoft Research; Yuan Zhou, Microsoft Research; Jinglu Wang, Microsoft Research; Xiao Li, Microsoft Research; Xiang Ming, Microsoft Research; Yan Lu, Microsoft Research
  11. Human Pose as Compositional Tokens
    Zigang Geng, University of Science and Technology of China and Microsoft Research; Chunyu Wang, Microsoft Research; Yixuan Wei, Tsinghua University and Microsoft Research; Ze Liu, University of Science and Technology of China and Microsoft Research; Houqiang Li, University of Science and Technology of China; Han Hu, Microsoft Research
  12. iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-training for Visual Recognition
    Yixuan Wei, Tsinghua University and Microsoft Research; Yue Cao, Microsoft Research; Zheng Zhang, Microsoft Research; Houwen Peng, Microsoft Research; Zhuliang Yao, Tsinghua University and Microsoft Research; Zhenda Xie, Tsinghua University and Microsoft Research; Han Hu, Microsoft Research; Baining Guo, Microsoft Research
  13. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
    Wenhui Wang, Microsoft; Hangbo Bao, Microsoft; Li Dong, Microsoft Research; Johan Bjorck, Microsoft; Zhiliang Peng, Microsoft; Qiang Liu, Microsoft; Kriti Aggarwal, Microsoft Research; Owais Khan Mohammed, Microsoft; Saksham Singhal, Microsoft Research; Subhojit Som, Microsoft; Furu Wei, Microsoft Research
  14. Iterative Proposal Refinement for Weakly-Supervised Video Grounding
    Meng Cao, Peking University; Fangyun Wei, Microsoft Research; Can Xu, Microsoft Research; Xiubo Geng, Microsoft Research; Long Chen, Hong Kong University of Science and Technology; Can Zhang, Peking University; Yuexian Zou, Peking University; Tao Shen, Microsoft; Daxin Jiang, Microsoft Research
  15. LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction 
    Zhaoyun Jiang, Xi’an Jiaotong University; Jiaqi Guo, Microsoft Research; Shizhao Sun, Microsoft Research; Huayu Deng, Shanghai Jiaotong University; Zhongkai Wu, Beihang University; Vuksan Mijovic, Microsoft; Zijiang James Yang, Xi’an Jiaotong University; Jian-Guang Lou, Microsoft Research; Dongmei Zhang, Microsoft Research
  16. Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
    Shruthi Bannur, Microsoft Research; Stephanie Hyland, Microsoft Research; Qianchu Liu, Fernando Pérez García, Microsoft Research; Maximilian Ilse, Microsoft Research; Daniel C. Castro, Microsoft Research; Benedikt Boecking, Harshita Sharma, Microsoft Research; Kenza Bouzid, Microsoft Research; Anja Thieme, Microsoft Research; Anton Schwaighofer, Microsoft Research; Maria Wetscherek, Matthew P. Lungren, Aditya Nori, Microsoft Research; Javier Alvarez-Valle, Microsoft Research; Ozan Oktay, Microsoft Research
  17. Look Before You Match: Instance Understanding Matters in Video Object Segmentation
    Junke Wang, Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Dongdong Chen, Microsoft Research; Zuxuan Wu, Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Chong Luo, Microsoft Research; Chuanxin Tang, Microsoft Research; Xiyang Dai, Microsoft Research; Yucheng Zhao, Microsoft Research; Yujia Xie, Microsoft Research; Lu Yuan, Microsoft Research; Yu-Gang Jiang, Shanghai Collaborative Innovation Center of Intelligent Visual Computing
  18. MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
    Xiaoyi Dong, University of Science and Technology of China; Jianmin Bao, Microsoft Research; Yinglin Zheng, Xiamen University; Ting Zhang, Microsoft Research; Dongdong Chen, Microsoft Research; Hao Yang, Microsoft Research; Ming Zeng, Xiamen University; Weiming Zhang, University of Science and Technology of China; Lu Yuan, Microsoft Research; Dong Chen, Microsoft Research; Fang Wen, Microsoft Research; Nenghai Yu, University of Science and Technology of China
  19. MetaPortrait: Identity-Preserving Talking Head Generation with Fast Personalized Adaptation
    Bowen Zhang, University of Science and Technology of China; Chenyang Qi, Hong Kong University of Science and Technology; Pan Zhang, University of Science and Technology of China; Bo Zhang, Microsoft Research; HsiangTao Wu, Microsoft; Dong Chen, Microsoft Research; Qifeng Chen, Hong Kong University of Science and Technology; Yong Wang, University of Science and Technology of China; Fang Wen, Microsoft
  20. MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
    Ludan Ruan, Renmin University of China; Yiyang Ma, Peking University; Huan Yang, Microsoft Research; Huiguo He, Microsoft Research; Bei Liu, Microsoft Research; Jianlong Fu, Microsoft Research; Nicholas Jing Yuan, Microsoft Research; Qin Jin, Renmin University of China; Baining Guo, Microsoft Research
  21. Motion Information Propagation for Neural Video Compression
    Linfeng Qi, University of Science and Technology of China; Jiahao Li, Microsoft Research; Bin Li, Microsoft Research; Houqiang Li, University of Science and Technology of China; Yan Lu, Microsoft Research
  22. Natural Language-Assisted Sign Language Recognition
    Ronglai Zuo, Hong Kong University of Science and Technology; Fangyun Wei, Microsoft Research; Brian Mak, Hong Kong University of Science and Technology
  23. Neural Video Compression with Diverse Contexts
    Jiahao Li, Microsoft Research; Bin Li, Microsoft Research; Yan Lu, Microsoft Research
  24. On Data Scaling in Masked Image Modeling
    Zhenda Xie, Tsinghua University and Microsoft Research; Zheng Zhang, Microsoft Research; Yue Cao, Microsoft Research; Yutong Lin, Xi’an Jiaotong University and Microsoft Research; Yixuan Wei, Tsinghua University and Microsoft Research; Qi Dai, Microsoft Research; Han Hu, Microsoft Research
  25. Paint by Example: Exemplar-based Image Editing with Diffusion Models
    Binxin Yang, University of Science and Technology of China; Shuyang Gu, Microsoft Research; Bo Zhang, Microsoft Research; Ting Zhang, Microsoft Research; Xuejin Chen, University of Science and Technology of China; Xiaoyan Sun, University of Science and Technology of China; Dong Chen, Microsoft Research; Fang Wen, Microsoft Research
  26. ReCo: Region-Controlled Text-to-Image Generation
    Zhengyuan Yang, Microsoft Research; Jianfeng Wang, Microsoft; Zhe Gan, Microsoft; Linjie Li, Microsoft Research; Kevin Lin, Microsoft Research; Chenfei Wu, Microsoft Research; Nan Duan, Microsoft; Zicheng Liu, Microsoft Research; Ce Liu, Microsoft; Michael Zeng, Microsoft Research; Lijuan Wang, Microsoft Research
  27. ResFormer: Scaling ViTs with Multi-Resolution Training
    Rui Tian, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Zuxuan Wu, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Qi Dai, Microsoft Research; Han Hu, Microsoft Research; Yu Qiao, Shanghai AI Laboratory; Yu-Gang Jiang, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing
  28. Revealing the Dark Secrets of Masked Image Modeling
    Zhenda Xie, Tsinghua University and Microsoft Research; Zigang Geng, University of Science and Technology of China and Microsoft Research; Jingcheng Hu, Tsinghua University and Microsoft Research; Zheng Zhang, Microsoft Research; Han Hu, Microsoft Research; Yue Cao, Microsoft Research
  29. RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion
    Tengfei Wang, Hong Kong University of Science and Technology; Bo Zhang, Microsoft Research; Ting Zhang, Microsoft Research; Shuyang Gu, Microsoft Research; Jianmin Bao, Microsoft Research; Tadas Baltrusaitis, Microsoft Research; Jingjing Shen, Microsoft Research; Dong Chen, Microsoft Research; Fang Wen, Microsoft Research; Qifeng Chen, Hong Kong University of Science and Technology; Baining Guo, Microsoft Research
  30. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking
    Xin Chen, Dalian University of Technology; Houwen Peng, Microsoft Research; Dong Wang, Dalian University of Technology; Huchuan Lu, Dalian University of Technology and Peng Cheng Laboratory; Han Hu, Microsoft Research
  31. Side Adapter Network for Open-Vocabulary Semantic Segmentation
    Mengde Xu, Huazhong University of Science and Technology and Microsoft Research; Zheng Zhang, Huazhong University of Science and Technology and Microsoft Research; Fangyun Wei, Microsoft Research; Han Hu, Microsoft Research; Xiang Bai, Huazhong University of Science and Technology
  32. Streaming Video Model
    Yucheng Zhao, University of Science and Technology of China; Chong Luo, Microsoft Research; Chuanxin Tang, Microsoft Research; Dongdong Chen, Microsoft Research; Noel Codella, Microsoft Research; Zheng-Jun Zha, University of Science and Technology of China
  33. Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction
    Mingfang Zhang, University of Tokyo and Microsoft Research; Jinglu Wang, Microsoft Research; Xiao Li, Microsoft Research; Yifei Huang, University of Tokyo; Yoichi Sato, University of Tokyo; Yan Lu, Microsoft Research
  34. SVFormer: Semi-supervised Video Transformer for Action Recognition
    Zhen Xing, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Qi Dai, Microsoft Research; Han Hu, Microsoft Research; Jingjing Chen, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Zuxuan Wu, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Yu-Gang Jiang, Fudan University and Shanghai Collaborative Innovation Center of Intelligent Visual Computing
  35. TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models
    Sucheng Ren, Microsoft Research; Fangyun Wei, Microsoft Research; Zheng Zhang, Microsoft Research; Han Hu, Microsoft Research
  36. Two-shot Video Object Segmentation
    Kun Yan, Peking University; Xiao Li, Microsoft Research; Fangyun Wei, Microsoft Research; Jinglu Wang, Microsoft Research; Chenbin Zhang, Peking University; Ping Wang, Peking University; Yan Lu, Microsoft Research
  37. Unifying Layout Generation with a Decoupled Diffusion Model
    Mude Hui, Xi’an Jiaotong University; Zhizheng Zhang, Microsoft Research; Xiaoyi Zhang, Microsoft Research; Wenxuan Xie, Microsoft Research; Yuwang Wang, Tsinghua University; Yan Lu, Microsoft Research
  38. VideoTrack: Learning to Track Objects via Video Transformer
    Fei Xie, Shanghai Jiao Tong University; Lei Chu, Microsoft Research; Jiahao Li, Microsoft Research; Yan Lu, Microsoft Research; Chao Ma, Shanghai Jiao Tong University
  39. VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction
    Yufan Ren, EPFL; Fangjinhua Wang, ETH Zurich; Tong Zhang, EPFL; Marc Pollefeys, ETH Zurich and Microsoft Research; Sabine Süsstrunk, EPFL
  40. X-Avatar: Expressive Human Avatars
    Kaiyue Shen, ETH Zurich; Chen Guo, ETH Zurich; Manuel Kaufmann, ETH Zurich; Juan Jose Zarate, ETH Zurich; Julien Valentin, Microsoft Research; Jie Song, ETH Zurich; Otmar Hilliges, ETH Zurich
  41. Unifying Vision, Text, and Layout for Universal Document Processing
    Zineng Tang, University of North Carolina (UNC) Chapel Hill; Ziyi Yang, Microsoft Research; Guoxin Wang, Microsoft Research; Yuwei Fang, Microsoft Research; Yang Liu, Microsoft Research; Chenguang Zhu, Microsoft Research; Michael Zeng, Microsoft Research; Cha Zhang, Microsoft Research; Mohit Bansal, University of North Carolina (UNC) Chapel Hill
