Synchronized Audio-Visual Generation with a Joint Generative Diffusion Model and Contrastive Loss

The rapid development of deep learning has led to significant advances in multimedia generation and synthesis. However, generating coherent, temporally aligned audio and video remains challenging due to the complex relationships between visual and auditory information. In this work, we propose a joint generative diffusion model that addresses this challenge by generating video and audio content simultaneously, enabling tighter synchronization and temporal alignment. Our approach builds on guided sampling, which allows greater flexibility in conditional generation and improves the overall quality of the generated content. In addition, we introduce a joint contrastive loss, inspired by prior work that successfully employed contrastive losses in conditional diffusion models; incorporating this loss further improves both generation quality and temporal alignment. Extensive evaluations with subjective and objective metrics demonstrate that the proposed joint generative diffusion model produces high-quality, temporally aligned audio and video content.
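Since the abstract names the joint contrastive loss but does not give its form, the following minimal PyTorch sketch illustrates one common instantiation: a symmetric InfoNCE-style objective over paired audio and video embeddings. The names used here (joint_contrastive_loss, audio_emb, video_emb, temperature) are illustrative assumptions, not the paper's definition.

```python
# Minimal sketch of a symmetric contrastive loss over paired audio/video
# embeddings. Assumes each batch row i of audio_emb and video_emb comes from
# the same clip (a positive pair); all other combinations are negatives.
import torch
import torch.nn.functional as F


def joint_contrastive_loss(audio_emb: torch.Tensor,
                           video_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Pull together embeddings of paired audio/video clips and push apart
    mismatched pairs within the batch (InfoNCE in both directions)."""
    # L2-normalize so the dot product becomes a cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = audio_emb @ video_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: audio->video and video->audio.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)


if __name__ == "__main__":
    # Toy usage: a batch of 8 paired 512-dimensional audio/video embeddings.
    a = torch.randn(8, 512)
    v = torch.randn(8, 512)
    print(joint_contrastive_loss(a, v))
```

In a guided-sampling setup of the kind the abstract describes, a loss of this type could in principle also serve as a guidance signal during denoising, but the exact way it is combined with the diffusion objective is not specified here.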
