Image and video have become the language people use to communicate on the Internet. Multimedia content connects people and appeals to younger audiences. This project aims to use deep image and video transformation to automatically generate high-quality image and video content and to create more engaging experiences for modern work and life. Our vision is broad: developing state-of-the-art AI technology for fast, reliable, and cost-effective content creation, communication, and consumption. This technology can benefit experiences across M365, including enterprise, education, consumer, and device-specific scenarios.
What are we trying to do?
The objective of this project is to enable a more automated pipeline for image and video generation, review, and publishing. Applications include image/video stylization, inpainting, super-resolution, icon generation, looping-video generation, and scene composition. Powered by deep learning models built by MSRA, multiple high-quality variants can be produced for each image or video at no additional cost in designer effort.
How is it done today?
Today, image and video transformations are mainly performed by human designers, which is labor intensive. For example, users insert 26 million images into PowerPoint every day, and human designers can generally provide only limited treatment of those images.
What is novel in our approach?
We seek to push forward the frontiers of research on high-quality image/video transformation and to make an impact in both academia and industry. We have developed self-attention-based generative networks with lightweight network optimization. We are also designing large models that understand and create multimedia content from the perspectives of multiple modalities (e.g., vision and language). We have already published top-tier papers along these two dimensions, for example in CVPR 2019, CVPR 2020, ECCV 2020, ACM Multimedia 2020, and NeurIPS 2020.
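To make the self-attention idea above concrete, the sketch below shows a SAGAN-style self-attention layer over image feature maps in PyTorch. It is a minimal illustration only; the class name, channel-reduction factor, and framework choice are assumptions for demonstration, not the project's actual implementation.

```python
# Illustrative sketch: a SAGAN-style self-attention layer for 2D feature maps.
# Names and hyperparameters are assumptions, not the project's own code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention2d(nn.Module):
    """Non-local self-attention over the spatial positions of a feature map."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # Learnable gate so the block starts as an identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        n = h * w
        q = self.query(x).view(b, -1, n).permute(0, 2, 1)   # (b, n, c')
        k = self.key(x).view(b, -1, n)                       # (b, c', n)
        v = self.value(x).view(b, c, n)                      # (b, c, n)
        attn = F.softmax(torch.bmm(q, k), dim=-1)            # (b, n, n) attention over positions
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x                           # gated residual connection


if __name__ == "__main__":
    layer = SelfAttention2d(channels=64)
    feats = torch.randn(2, 64, 32, 32)   # e.g., an intermediate generator feature map
    print(layer(feats).shape)            # torch.Size([2, 64, 32, 32])
```

Such a layer is typically inserted between convolutional stages of a generator so that each spatial position can aggregate information from the whole image, which helps with long-range structure in generated content.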
People
Jianlong Fu
Senior Research Manager
Bei Liu
Senior Researcher
Kai Qiu
Researcher