Training Audio Captioning Models without Audio
- Soham Deshmukh,
- Benjamin Elizalde,
- Dimitra Emmanouilidou,
- Bhiksha Raj,
- Rita Singh,
- Huaming Wang
2024 International Conference on Acoustics, Speech, and Signal Processing
Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multi-modal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose using noise injection or a learnable adapter during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.
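As a rough illustration of the training recipe described above, the following is a minimal PyTorch-style sketch, not the paper's actual implementation: names such as `clap_text_encoder`, `decoder`, and all dimensions are hypothetical placeholders. Frozen CLAP text embeddings are perturbed with Gaussian noise and mapped to a prefix that conditions a frozen language-model decoder.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a single CLAP embedding to a sequence of prefix vectors (hypothetical sizes)."""
    def __init__(self, clap_dim=1024, decoder_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        self.decoder_dim = decoder_dim
        self.proj = nn.Sequential(
            nn.Linear(clap_dim, prefix_len * decoder_dim),
            nn.Tanh(),
        )

    def forward(self, clap_emb):                      # (B, clap_dim)
        prefix = self.proj(clap_emb)                  # (B, prefix_len * decoder_dim)
        return prefix.view(-1, self.prefix_len, self.decoder_dim)

def text_only_training_step(captions, clap_text_encoder, mapper, decoder, noise_std=0.015):
    # CLAP text encoder is frozen; only the mapping network is updated.
    with torch.no_grad():
        text_emb = clap_text_encoder(captions)        # (B, clap_dim)
    # Noise injection: perturb the text embedding so the mapper also covers the
    # region of the joint space where audio embeddings fall at inference time.
    noisy_emb = text_emb + noise_std * torch.randn_like(text_emb)
    prefix = mapper(noisy_emb)                        # (B, prefix_len, decoder_dim)
    # The frozen decoder is conditioned on the prefix and trained to reproduce
    # the same caption with a standard cross-entropy loss (hypothetical interface).
    return decoder(prefix=prefix, labels=captions)
```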
Figure 1. The first panel depicts the modality gap between CLAP's pretrained audio and pretrained text embeddings in the joint audio-text space. The second panel shows the proposed text-only training method for Automated Audio Captioning. At inference, the text encoder is swapped with the audio encoder and a caption is produced for the input audio. Only the mapping network m is trainable; modules marked with a snowflake are frozen. The prefix is the output of m. Single arrows depict embedding vectors, while multiple arrows indicate sequences of vectors.
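The inference-time swap described in the caption could then look like the sketch below, reusing the hypothetical `mapper` and `decoder` from the previous sketch and assuming a frozen CLAP audio encoder that produces embeddings of the same dimension as the text encoder.

```python
import torch

@torch.no_grad()
def caption_audio(audio, clap_audio_encoder, mapper, decoder):
    # At inference the text encoder is replaced by the CLAP audio encoder;
    # the mapping network and decoder are reused unchanged.
    audio_emb = clap_audio_encoder(audio)      # (1, clap_dim)
    prefix = mapper(audio_emb)                 # (1, prefix_len, decoder_dim)
    return decoder.generate(prefix=prefix)     # decoded caption (hypothetical interface)
```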
The table shows results of various models trained on both AudioCaps and Clotho. Models in rows 1-4 use both audio and text for training. The proposed text-only model (row 5) uses only text data and random Gaussian noise with a standard deviation of 0.015. It achieves performance comparable to the best audio captioning models in the literature, obtaining a SPIDEr score of 0.256 on Clotho and 0.455 on AudioCaps, higher than the 0.215 and 0.437 reported by Kim et al.
Text-only training is a valid alternative for training and/or initializing audio captioning systems. We also train the architecture designed for text-only training with audio-text pairs. The architecture is the same as in Fig. 1, except that during training we use audio files with an audio encoder instead of text with a text encoder and Gaussian noise. This corresponds to the last (grayed) row in the table above. The difference in SPIDEr score between audio-text and text-only training is small: +0.02 on AudioCaps and +0.01 on Clotho. This indicates that our text-only training can achieve comparable results without audio data. The main benefit of text-only training is the ability to train on unpaired, openly available text. We explore this in Section 5.1, where we show that, using LLM-generated text, text-only training can improve over audio-text training.
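For comparison, a sketch of the paired audio-text variant (the grayed row) under the same hypothetical interfaces as above: the only change from the text-only step is that the prefix is computed from the CLAP audio embedding of the paired clip rather than from a noise-perturbed text embedding.

```python
import torch

def audio_text_training_step(audio, captions, clap_audio_encoder, mapper, decoder):
    # Paired variant: condition the decoder on the CLAP audio embedding of
    # the paired clip instead of a noised text embedding.
    with torch.no_grad():                      # CLAP audio encoder stays frozen
        audio_emb = clap_audio_encoder(audio)  # (B, clap_dim)
    prefix = mapper(audio_emb)                 # (B, prefix_len, decoder_dim)
    # Same caption cross-entropy objective as in the text-only step.
    return decoder(prefix=prefix, labels=captions)
```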