Visual Foundation Model


Nearly all computer vision applications build on fundamental network architectures and pre-training techniques. This project aims to advance these foundational technologies, which find broad application across computer vision. Over the past few years, we have developed widely used visual architectures, such as the Swin Transformer series, as well as popular self-supervised learning methods, such as PixPro and SimMIM. The Swin Transformer paper won the ICCV 2021 Best Paper Award (Marr Prize). As of November 2021, we had also trained the world's largest and strongest dense visual model, SwinV2-G, with 3 billion parameters. Through this project, we hope to continue driving fundamental advances in visual modeling and pre-training.