ICASSP 2023 Deep Noise Suppression Challenge

Region: Global

Includes de-reverberation and suppression of interfering talkers for headset and speakerphone scenarios

Program dates: November 2022–August 2023

See the Results tab for the final results evaluated on the blind testset. Five papers were invited from seven teams, with the three teams from Tencent agreeing to submit one joint paper. These teams will submit 2-page papers to ICASSP 2023; for the deadline, please see the Timeline tab. The five invited teams will receive an email with instructions on writing the 2-page paper. All teams are invited to submit their current and future work to the DNS Challenge Special Issues of the IEEE OJ-SP journal. More details on OJ-SP will be emailed to all participants in the next few months.

With decades of research, deep noise suppression (DNS) has advanced to improve subjective overall audio quality. However, we are still quite far from eliminating the speech distortions that result from over-suppression of noise, reverberation, neighboring talkers, etc. in real-world scenarios. Enhancing speech quality (i.e., eliminating speech distortion) while suppressing noise, reverberation, and neighboring talkers is a trade-off that DNS models must handle. The IEEE ICASSP 2023 Deep Noise Suppression (DNS) grand challenge is the 5th edition of the Microsoft DNS challenges, with a focus on deep speech enhancement achieved by suppressing background noise, reverberation, and neighboring talkers while enhancing signal quality. This challenge invites researchers to develop real-time deep speech enhancement models for full-band speech. Deep speech enhancement models submitted to the challenge are expected to perform joint denoising and dereverberation in the presence of neighboring (interfering) talkers. We will release a development test set for intermediate evaluation of challenge models and a blind test set for the final evaluation used to select the top five models. Each test clip will have a corresponding enrollment clip (30 s) for the primary talker to enable the development of personalized models. Participants are encouraged to develop both personalized and non-personalized models to elucidate the benefits of personalization for deep speech enhancement models.

The challenge has two tracks: (1) headset (wired/wireless headphones, earbuds such as AirPods, etc.) speech enhancement; (2) non-headset (speakerphone, built-in mic in a laptop/desktop/mobile phone/other meeting device, etc.) speech enhancement. Past challenges demonstrated that headset scenarios exhibit certain acoustic properties owing to the proximity of the microphone to the primary talker. Such acoustic properties can be leveraged to improve deep speech enhancement models with and without personalization. On the other hand, speakerphone cases exhibit a different set of acoustic properties, which motivates a separate track for non-headset scenarios. Participants can develop personalized as well as non-personalized (non-enrollment) deep speech enhancement models for both tracks. The blind test set will have paralinguistic test clips covering standard forms of paralanguage, including but not limited to: throat-clearing, "hmm" or "mhm", "Huh?" or "What?", gasps, sighs, moans and groans, deceptive speech, sincere speech, speech with high bass, speech with high pitch, speech with low pitch, confident speech, tired speech (when the talker is tired), persuasive speech, and voice change mid-clip (i.e., mimicry in the last 50% of the clip). The blind testset will also include emotional speech, including but not limited to happy, sad, angry, yelling, crying, and laughter. The blind testset includes real test clips with high reverberation, high reverberation with noise, and noise in the presence of interfering talkers. Testset noises include, but are not limited to: office scenarios (typing, AC, door shutting, eating/munching, copy machine, squeaking chair, notification sounds, etc.), home scenarios (baby crying, dogs, TV, radiators, hair dryer, kitchen noise, running water, etc.), appliances (washer/dryer, dishwasher, coffee maker, kitchen noise, vacuum cleaner, etc.), fire alarm, car, inside a parked car on a busy road, in-car neighboring talkers, traffic/road noise, car noise (from machinery, control systems, turn signal, etc.), café, coffee machine, blender, background babble, and airport announcements.

In all test clips, there is only one primary talker in the enrollment clip, while the noisy test clip may have noise, reverberation, and one or more neighboring talkers in addition to the primary talker. The goal of a deep speech enhancement model is to preserve the primary talker's speech while suppressing everything else. We provide a flexible framework for synthesizing training datasets, which allows participants to choose a subset of the challenge dataset or add their own corpora to augment the challenge training dataset (a minimal mixing sketch follows this paragraph). The two tracks have different test sets using the corresponding device types. Test clips are real-world recordings collected through crowdsourcing. The test set includes representative noisy scenarios relevant for video/audio meetings in hybrid-work settings. The challenge overview paper will discuss the test set data collection in detail, including the list of devices, the specifications sent to crowdsourced workers, and the steps taken for quality assurance (QA) of the test set. Test sets are selected to include speaker variety, device variety, and different acoustic properties such as impulse response, direct-to-reverberation ratio (DRR), and T60, achieved by changing the relative and absolute positions of the primary and interfering talkers, the noise source, and the presence of reflecting surfaces. Along with the training datasets and testsets, we also provide a baseline model (or enhanced clips) for both tracks. Both Track 1 and Track 2 may have personalized or non-personalized models, but we provide only one baseline for each track.
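For illustration, a minimal Python sketch of how a noisy training clip could be synthesized from clean speech, a room impulse response, and a noise recording at a target SNR. The file names, the mono 48 kHz assumption, and the SNR range are assumptions for this sketch; the official synthesis scripts released with the challenge remain the reference.

# Minimal noisy-clip synthesis sketch (illustrative only; the official
# challenge scripts are the reference). File names, mono audio, and the
# SNR range below are assumptions, not challenge specifications.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def mix_clip(clean_path, noise_path, rir_path, snr_db, out_path):
    clean, sr = sf.read(clean_path)
    noise, _ = sf.read(noise_path)
    rir, _ = sf.read(rir_path)

    # Reverberant speech: convolve clean speech with a room impulse response.
    reverb = fftconvolve(clean, rir)[: len(clean)]

    # Loop/trim the noise to match the speech length.
    reps = int(np.ceil(len(reverb) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverb)]

    # Scale the noise to reach the requested SNR relative to reverberant speech.
    speech_pow = np.mean(reverb ** 2) + 1e-12
    noise_pow = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    noisy = reverb + gain * noise

    # Avoid clipping before writing the mixture.
    peak = np.max(np.abs(noisy)) + 1e-12
    if peak > 0.99:
        noisy = 0.99 * noisy / peak
    sf.write(out_path, noisy, sr)

# Example: mix at an SNR drawn uniformly from [0, 20] dB (assumed range).
# mix_clip("clean.wav", "noise.wav", "rir.wav", np.random.uniform(0, 20), "noisy.wav")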

We also provide a personalized P.835 subjective evaluation framework, used for both tracks, along with a Word Accuracy (WAcc) Azure API. The personalized P.835 framework is an improved version of the framework used in the 4th DNS Challenge. Our subjective framework is a modified version of ITU-T (International Telecommunication Union-Telecommunication Standardization Sector) P.835, which provides three scores for each test clip (and its corresponding enrollment clip): speech quality (SIG), background noise quality (BAK), and overall quality (OVRL). This challenge aims to improve the subjective audio quality as measured by SIG, BAK, and OVRL, and to provide improvements in WAcc. Crowdsourced workers doing subjective evaluation are instructed to rate interfering talkers as an undesirable signal, so a model that suppresses interfering talkers is rated higher. Similarly, the WAcc ground-truth transcripts will only contain words spoken by the primary talker, thus treating the interfering talker as an undesirable signal. Enhanced test clips from past DNS Challenges showed noticeable WAcc and SIG degradation due to over-suppression that removed the primary talker's speech. We have open-sourced DNSMOS P.835, a deep neural network (DNN) model for non-intrusive prediction of the speech, background noise, and overall quality of an audio signal. DNSMOS aims to help participants with intermediate model evaluations. Participants need to register on the challenge's CMT site, which will be used for challenge communication: https://cmt3.research.microsoft.com/DNSChallenge2023. Questions related to the challenge can be sent to dns_challenge@microsoft.com.

The paper for the previous edition of the challenge is "ICASSP 2022 Deep Noise Suppression Challenge".

Please NOTE that the intellectual property (IP) is not transferred to the challenge organizers, i.e., if code is shared/submitted, the participants remain the owners of their code (when the code is made publicly available, an appropriate license should be added).

Challenge Tracks

• This challenge has two tracks: Track-1: Headset DNS; Track-2: Speakerphone DNS.
• Each testclip in both tracks has an enrollment clip of 30 s duration. The enrollment speech can be noise-free or noisy, with or without reverberation. This facilitates multi-condition enrollment of primary talkers, which serves as a measure of robustness for personalized models that use enrollment speech as an additional input for denoising the testclips.
• Participants can choose to work on models with speaker enrollment or without it, for one or both tracks. Each team can submit 1-4 models depending on what experiments they conduct. Each participating team can submit a maximum of one personalized and one non-personalized model for each track, e.g., a team can submit one personalized and one non-personalized model for Track 1, but not two personalized or two non-personalized models for Track 1. Similarly, a team can submit 4 models: a personalized and a non-personalized model for Track 1, and a personalized and a non-personalized model for Track 2. All models for a track will be evaluated and ranked together, i.e., both personalized and non-personalized models for Track 1 will go through one subjective evaluation. Similarly, for Track 2, all models will go through one subjective evaluation.
• Participants are encouraged to conduct experiments with both personalized and non-personalized models to elucidate the benefits of personalization. However, this is NOT a requirement for this challenge.
• Each track will have its own dev testset and blind testset, each consisting of 600 testclips. Thus, a total of 2,400 unique testclips are released in this challenge. The main difference between the two tracks is the devices used for collecting the testsets, i.e., headsets for Track 1 and speakerphones for Track 2.

Challenge Requirements

Failing to adhere to challenge rules will lead to disqualification from the challenge.

Algorithmic latency: The offset introduced by the whole processing chain (STFT, iSTFT, overlap-add, additional lookahead frames, etc.) compared to just passing the signal through without modification. This does not include buffering latency. Worked examples follow, with a short computation sketch after the list.

  • Ex.1: An STFT-based processing with window length = 20 ms and hop length = 10 ms introduces an algorithmic delay of window length – hop length = 10 ms.
  • Ex.2: An STFT-based processing with window length = 32 ms and hop length = 8 ms introduces an algorithmic delay of window length – hop length = 24 ms.
  • Ex.3: An overlap-save based processing algorithm introduces no additional algorithmic latency.
  • Ex.4: A time-domain convolution with a filter kernel size = 16 samples introduces an algorithmic latency of kernel size – 1 = 15 samples. Using one-sided padding, the operation can be made fully "causal", i.e., a left-sided padding with kernel size – 1 samples would result in no algorithmic latency.
  • Ex.5: An STFT-based processing with window_length = 20 ms and hop_length = 10 ms using 2 future frames of information introduces an algorithmic latency of (window_length – hop_length) + 2*hop_length = 30 ms.
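For reference, a minimal sketch that reproduces the latency numbers in Ex.1, Ex.2, and Ex.5 above, assuming an overlap-add STFT front end where each lookahead frame adds one hop of latency:

# Algorithmic latency of an STFT-based pipeline (overlap-add), reproducing
# Ex.1, Ex.2, and Ex.5 above. Lookahead frames are counted in hops, as in Ex.5.
def algorithmic_latency_ms(window_ms: float, hop_ms: float, lookahead_frames: int = 0) -> float:
    """(window - hop) plus one hop per future frame used."""
    return (window_ms - hop_ms) + lookahead_frames * hop_ms

print(algorithmic_latency_ms(20, 10))      # Ex.1 -> 10.0 ms
print(algorithmic_latency_ms(32, 8))       # Ex.2 -> 24.0 ms
print(algorithmic_latency_ms(20, 10, 2))   # Ex.5 -> 30.0 ms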

Buffering latency: The latency introduced by block-wise processing; the block size is often referred to as hop size, frame shift, or temporal stride.

  • Ex.1: An STFT-based processing has a buffering latency corresponding to the hop size.
  • Ex.2: An overlap-save processing has a buffering latency corresponding to the frame size.
  • Ex.3: A time-domain convolution with stride 1 introduces a buffering latency of 1 sample.

Real-time factor (RTF): RTF is defined as the ratio of the compute time for one processing step to the duration of that step. For an STFT-based algorithm, one processing step is the hop size; for a time-domain convolution, one processing step is 1 sample. RTF = compute time / time step.
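For illustration, a minimal sketch of one way to estimate RTF for a frame-wise model. The enhance_frame callable, the 10 ms hop (480 samples), and the 48 kHz rate are placeholders for this sketch, not challenge requirements.

# Estimating the real-time factor (RTF) of a frame-wise model.
# RTF = compute time per step / duration of one processing step.
import time
import numpy as np

def measure_rtf(enhance_frame, hop_samples=480, sample_rate=48000, n_frames=1000):
    hop_duration = hop_samples / sample_rate          # one processing step, in seconds
    frame = np.zeros(hop_samples, dtype=np.float32)   # dummy input frame
    enhance_frame(frame)                              # warm-up call (e.g., JIT, caches)
    start = time.perf_counter()
    for _ in range(n_frames):
        enhance_frame(frame)
    elapsed = time.perf_counter() - start
    return (elapsed / n_frames) / hop_duration

# Example with a trivial stand-in "model":
# print(measure_rtf(lambda x: x * 0.5))   # should be well below the 0.5 limit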

All models submitted to this challenge must meet all of the below requirements.

  1. To be able to execute an algorithm in real time, and to accommodate the variance in compute time that occurs in practice, we require RTF <= 0.5 in the challenge on an Intel Core i5 quad-core clocked at 2.4 GHz using a single thread.
  2. Algorithmic latency + buffering latency <= 20 ms (see the self-check sketch after this list).
  3. No future information can be used during model inference.
  4. Participants can only enhance the testclips by a single pass through their model.
  5. None of the testclips from current or previous DNS Challenges can be used for training or fine-tuning the model.
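For reference, a minimal self-check sketch against requirements 1 and 2, assuming an STFT front end whose buffering latency equals its hop; only the limits (RTF <= 0.5, total latency <= 20 ms) come from the rules above, while the example configurations are illustrative.

# Self-check against requirements 1 and 2. The STFT configurations shown are
# examples only; the limits are the only values taken from the challenge rules.
def meets_requirements(window_ms, hop_ms, lookahead_frames, measured_rtf):
    algorithmic = (window_ms - hop_ms) + lookahead_frames * hop_ms
    buffering = hop_ms                      # block-wise processing: one hop of latency
    total_latency = algorithmic + buffering
    return measured_rtf <= 0.5 and total_latency <= 20.0

print(meets_requirements(20, 10, 0, 0.3))   # True: 10 ms + 10 ms = 20 ms, RTF 0.3
print(meets_requirements(32, 8, 0, 0.3))    # False: 24 ms + 8 ms = 32 ms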

Evaluation Criteria and Methodology

This challenge adopts the ITU-T P.835 subjective test framework to measure speech quality (SIG), background noise quality (BAK), and overall audio quality (OVRL). We modified ITU-T P.835 to make it reliable for test clips with interfering (undesired neighboring) talkers. We are also releasing DNSMOS P.835, a machine-learning-based model for predicting SIG, BAK, and OVRL. Participants can use DNSMOS P.835 to evaluate their intermediate models. In this challenge, we introduce Word Accuracy (WAcc) as an additional metric to compare the performance of DNS models. Challenge winners will be decided based on OVRL and WAcc as follows:

\( \text{Metric} = \frac{\frac{\text{OVRL} - 1}{4} + \text{WAcc}}{2} \)

WAcc will be obtained using the Microsoft Azure Speech Recognition API. This challenge metric gives equal weighting to subjective quality and speech recognition performance. The dev-test set and DNSMOS P.835 are provided to participants to accelerate model development, and a script to evaluate WAcc is also provided. We use neither the dev-test set nor DNSMOS P.835 for deciding the final winners. DNSMOS P.835 has a high correlation with human perception and hence can serve as a robust measure of audio quality. Challenge winners will be decided based on the metric above, computed on enhanced clips from the blind test set.
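For illustration, a minimal sketch of the ranking metric above. The official WAcc comes from the Azure Speech Recognition API; the word-level edit-distance WAcc below is only a rough local stand-in, not the challenge's scoring pipeline.

# Sketch of the ranking metric: equal weighting of OVRL (mapped to [0, 1]) and WAcc.
def word_accuracy(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words (substitutions, insertions, deletions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return max(0.0, 1.0 - d[len(ref)][len(hyp)] / max(len(ref), 1))

def challenge_metric(ovrl: float, wacc: float) -> float:
    # OVRL is on a 1-5 MOS scale, so (OVRL - 1) / 4 maps it to [0, 1] like WAcc.
    return ((ovrl - 1) / 4 + wacc) / 2

print(challenge_metric(3.5, 0.9))   # -> 0.7625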

Participants are also required to report the multiply-accumulate (MAC) or multiply-add (MAD) operations for a single-pass inference of one audio frame for all models submitted to the challenge. In case of a tie, the model with lower MACs, lower RTF, and lower algorithmic latency will be ranked higher.

Registration procedure

  • There are two steps in registering for the challenge.
  • Step 1: Participants are required to email the Deep Noise Suppression Challenge at dns_challenge@microsoft.com with the list of all participants, the affiliation of each participant (including country), contact information for the team, and the name of your team.
  • Step 2: Participants need to register on the Challenge CMT site https://cmt3.research.microsoft.com/DNSChallenge2023 where they can submit the enhanced clips and receive challenge announcements etc.
  • Organizers plan to announce the availability of data, baseline models, and evaluation results etc. via CMT.
  • The challenge leaderboard will be managed using the Piazza platform (event name: ICASSP 2023 DNS CHALLENGE). Participants can register using the link below: https://piazza.com/microsoft dns challenge/spring2023/icassp2023dnschallenge
  • Results, challenge rules and descriptions of the two tracks will be posted to the challenge website: https://aka.ms/dns-challenge
  • Organizers will make the challenge overview paper available initially on arXiv/ResearchGate. It will eventually be published in OJ-SP as per IEEE GC guidelines.

Contact us: If you have questions about this program, email us at dns_challenge@microsoft.com.