Toxic Speech and Speech Emotions: Investigations of Audio-based Modeling and Intercorrelations

European Signal Processing Conference (EUSIPCO)

Published by IEEE | Organized by EURASIP

Content moderation (CM) systems have become essential following the rapid growth of multimodal online social platforms; yet while published work increasingly focuses on text-based solutions, audio-based methods remain underexplored. In this study we explore the relationship between speech emotions and toxic speech as part of a CM scenario. We first investigate an appropriate framework for combining speech emotion recognition (SER) and audio-based CM models. We then investigate which emotional aspects (i.e., attribute, sentiment, or attitude) could contribute most to audio-based CM recognition. Our experimental results indicate that conventional shared-feature-encoder approaches may fail to capture additional discriminative features for boosting audio-based CM tasks while leveraging SER learning. We further investigate the performance trade-offs of late-fusion frameworks for combining SER and CM information. We argue that these observations can be attributed to an emotionally biased class distribution in the CM scenario, and conclude that SER could indeed play a role in content moderation frameworks, given added application-specific emotional information.
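As a concrete illustration of the late-fusion framing discussed above, the sketch below shows one way to combine utterance-level SER and CM embeddings through a learned per-channel attention weight. This is a minimal, hypothetical PyTorch sketch, not the paper's implementation: the module name LateFusionCM, the embedding dimensions, and the two-channel softmax attention are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class LateFusionCM(nn.Module):
    """Hypothetical late-fusion head combining SER and CM utterance embeddings
    via learned channel attention (an illustrative sketch, not the paper's code)."""

    def __init__(self, ser_dim=128, cm_dim=128, hidden=64, n_classes=2):
        super().__init__()
        # Project each channel's utterance embedding into a shared space.
        self.ser_proj = nn.Linear(ser_dim, hidden)
        self.cm_proj = nn.Linear(cm_dim, hidden)
        # One scalar attention score per channel, normalized with softmax.
        self.attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, ser_emb, cm_emb):
        # ser_emb: (batch, ser_dim), cm_emb: (batch, cm_dim)
        channels = torch.stack(
            [torch.tanh(self.ser_proj(ser_emb)),
             torch.tanh(self.cm_proj(cm_emb))],
            dim=1)                                      # (batch, 2, hidden)
        weights = torch.softmax(self.attn(channels), dim=1)  # (batch, 2, 1)
        fused = (weights * channels).sum(dim=1)              # (batch, hidden)
        # Return both the CM logits and the per-channel attention weights,
        # so the SER contribution can be inspected at test time.
        return self.classifier(fused), weights.squeeze(-1)
```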


Table. Recognition performance of the content moderation (CM) model, compared to emotion-based recognition models.
Attr-1D: emotional regressor trained on arousal and valence attributes (IEMOCAP).
Senti-1D: categorical classifier for 3-class sentiment labels, Pos/Neu/Neg (IEMOCAP).
Senti-5D: categorical classifier for 3-class sentiment labels, Pos/Neu/Neg (5 corpora).
Atti-1D: categorical classifier for 3-class attitude labels (customer support calls attitude corpus).


Figure: Performance trade-off trend for different training data sizes (left); distribution of attention weights during testing (right). CM: audio content moderation model. CM+SER: CM model with incorporated emotional information. The distribution of the model's attention weights shows that SER contributes considerably less than the CM features to the overall audio-based CM model (i.e., most attention weights for the SER channel fall below 0.2). This may be driven by the negative sentiment bias of the online CM scenario, or by the more controlled recording settings of the SER domain compared to the CM corpus.
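To connect this observation to the sketch above: one could inspect the per-channel attention weights at test time roughly as follows. This is again purely illustrative, reusing the hypothetical LateFusionCM module with random placeholder embeddings in place of real SER/CM encoder outputs.

```python
# Illustrative inspection of the SER-channel attention mass, reusing the
# hypothetical LateFusionCM sketch above (inputs are random placeholders).
model = LateFusionCM()
ser_emb, cm_emb = torch.randn(32, 128), torch.randn(32, 128)
with torch.no_grad():
    logits, weights = model(ser_emb, cm_emb)  # weights: (batch, 2)
ser_w = weights[:, 0]  # attention weight assigned to the SER channel
frac = (ser_w < 0.2).float().mean().item()
print(f"share of SER-channel weights below 0.2: {frac:.2f}")
```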