Speech Signal Improvement Challenge – ICASSP 2024

Region: Global

Program dates: October 2023-February 2024

The Speech Signal Improvement Challenge, a Grand Challenge at ICASSP 2024, is intended to stimulate research on improving speech signal quality in communication systems. Speech signal quality is measured with SIG in ITU-T P.835 and remains a top issue in audio communication and conferencing systems.

This challenge benchmarks the performance of real-time speech enhancement models on a real (not simulated) test set. The audio scenario is the send signal in telecommunication; it does not include echo impairments. Participants run their speech enhancement model on the test set and submit the resulting enhanced clips for evaluation.

Challenge tracks

There are two tracks for this challenge:

Real-time track
Non-real-time track

Latency and runtime requirements

Algorithmic latency: The offset introduced by the whole processing chain (STFT, iSTFT, overlap-add, additional lookahead frames, etc.) compared to passing the signal through without modification. It does not include buffering latency.

Ex.1: An STFT-based processing with window length = 20 ms and hop length = 10 ms introduces an algorithmic latency of window length – hop length = 10 ms.
Ex.2: An STFT-based processing with window length = 32 ms and hop length = 8 ms introduces an algorithmic latency of window length – hop length = 24 ms.
Ex.3: An overlap-save-based processing algorithm introduces no additional algorithmic latency.
Ex.4: A time-domain convolution with a filter kernel size = 16 samples introduces an algorithmic latency of kernel size – 1 = 15 samples. Using one-sided padding, the operation can be made fully “causal”, i.e. a left-sided padding with kernel size – 1 samples results in no algorithmic latency.
Ex.5: An STFT-based processing with window length = 20 ms and hop length = 10 ms that uses information from 2 future frames introduces an algorithmic latency of (window length – hop length) + 2 * hop length = 30 ms.
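
The arithmetic in the examples above can be sketched in a few lines of Python. The function and parameter names below are illustrative assumptions and are not part of the challenge rules; the formulas are the ones stated in Ex.1, Ex.2, Ex.4, and Ex.5.

```python
def stft_algorithmic_latency_ms(window_ms: float, hop_ms: float,
                                lookahead_frames: int = 0) -> float:
    """Algorithmic latency of an STFT-based chain, in milliseconds.

    Follows the examples above: (window - hop) + lookahead_frames * hop.
    All names here are hypothetical helpers, not a challenge API.
    """
    if hop_ms <= 0 or window_ms < hop_ms:
        raise ValueError("expected 0 < hop_ms <= window_ms")
    if lookahead_frames < 0:
        raise ValueError("lookahead_frames must be non-negative")
    return (window_ms - hop_ms) + lookahead_frames * hop_ms


def conv_algorithmic_latency_samples(kernel_size: int,
                                     left_padded: bool = False) -> int:
    """Algorithmic latency of a time-domain convolution, in samples.

    kernel_size - 1 in general; 0 when the input is left-padded with
    kernel_size - 1 samples (the fully causal case in Ex.4).
    """
    if kernel_size < 1:
        raise ValueError("kernel_size must be at least 1")
    return 0 if left_padded else kernel_size - 1


if __name__ == "__main__":
    print(stft_algorithmic_latency_ms(20, 10))       # Ex.1
    print(stft_algorithmic_latency_ms(32, 8))        # Ex.2
    print(conv_algorithmic_latency_samples(16))      # Ex.4
    print(stft_algorithmic_latency_ms(20, 10, 2))    # Ex.5
```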

Buffering latency: The latency introduced by block-wise processing. The block size is often referred to as hop size, frame shift, or temporal stride.

• Ex.1: An STFT-based processing has a buffering latency corresponding to the hop size.
• Ex.2: An overlap-save processing has a buffering latency corresponding to the frame size.
• Ex.3: A time-domain convolution with stride 1 introduces a buffering latency of 1 sample.

Real-time factor (RTF): RTF is the compute time needed to execute one processing step, expressed as a fraction of the duration of that step. For an STFT-based algorithm, one processing step is the hop size. For a time-domain convolution, one processing step is 1 sample. RTF = compute time / time step.
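
As a rough illustration of the RTF definition, the sketch below times a processing callable and divides the average compute time per step by the step duration. The names `real_time_factor`, `measure_rtf`, and `process_step` are assumptions made for this example. A measurement taken on a machine other than the reference CPU is only indicative.

```python
import time
from typing import Callable


def real_time_factor(compute_time_s: float, step_duration_s: float) -> float:
    """RTF = compute time / time step, both in seconds."""
    if step_duration_s <= 0:
        raise ValueError("step_duration_s must be positive")
    return compute_time_s / step_duration_s


def measure_rtf(process_step: Callable[[], None], step_duration_s: float,
                n_steps: int = 100) -> float:
    """Average RTF over n_steps calls of a hypothetical per-step function.

    process_step stands in for one model inference on one hop of audio;
    step_duration_s is the duration of that hop (e.g. 0.010 for 10 ms).
    """
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    start = time.perf_counter()
    for _ in range(n_steps):
        process_step()
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed / n_steps, step_duration_s)


if __name__ == "__main__":
    # A placeholder workload; replace with a real per-hop inference call.
    rtf = measure_rtf(lambda: sum(range(1000)), step_duration_s=0.010)
    print(f"measured RTF: {rtf:.4f}")
```

Averaging over many steps smooths out timer noise, but it can hide occasional slow steps. Looking at the worst-case step time as well may be worthwhile when checking against a limit.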

All models submitted to this challenge must meet all of the below requirements.

For the real-time track: To be able to execute an algorithm in real time, and to accommodate the variance in compute time that occurs in practice, we require RTF <= 0.5 in the challenge on an Intel Core i5 quad-core clocked at 2.4 GHz using a single thread. The non-real-time track must have an RTF > 0.5.
Algorithmic latency + buffering latency <= 20ms.
No future information can be used during model inference.
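
A minimal sketch of how the three requirements above might be checked together for a real-time-track model. The function name, its parameters, and the constant names are hypothetical; the thresholds (20 ms total latency, RTF of 0.5, no future information) are taken from the text.

```python
MAX_TOTAL_LATENCY_MS = 20.0  # algorithmic + buffering latency limit
MAX_RTF = 0.5                # real-time track limit on the reference CPU


def meets_real_time_requirements(algorithmic_ms: float, buffering_ms: float,
                                 rtf: float,
                                 uses_future_info: bool = False) -> bool:
    """Return True if all three stated requirements hold.

    This is an illustrative helper, not an official validation tool.
    """
    total_latency_ms = algorithmic_ms + buffering_ms
    return (total_latency_ms <= MAX_TOTAL_LATENCY_MS
            and rtf <= MAX_RTF
            and not uses_future_info)


if __name__ == "__main__":
    # 20 ms window, 10 ms hop: 10 ms algorithmic + 10 ms buffering = 20 ms.
    print(meets_real_time_requirements(10, 10, rtf=0.4))
    # 32 ms window, 8 ms hop: 24 ms algorithmic + 8 ms buffering = 32 ms.
    print(meets_real_time_requirements(24, 8, rtf=0.4))
```

Under these definitions, an STFT front end with a 20 ms window and a 10 ms hop sits exactly at the 20 ms limit, while a 32 ms window with an 8 ms hop exceeds it.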

Registration procedure

To register for the challenge, participants are required to email the Speech Signal Improvement Challenge organizers at sig_challenge@microsoft.com with the names of their team members, emails, affiliations, team name, the track(s) they are participating in, team captain, and a tentative paper title. Participants also need to register on the Challenge CMT site, where they can submit the enhanced clips. Registration data is captured and stored in the US.

Submission instructions

Please use the Microsoft Conference Management Toolkit for submitting the results. After logging in, complete the following steps to submit the results:

Choose “Create new submission” in the Author Console.
Enter the title, abstract, and co-authors, and upload a lastname.txt file (can be empty or contain additional information regarding the submission).
Compress the enhanced result files into a single lastname.zip file, retaining the same folder and file names as the blind test set (max file size: 1.8 GB).
After creating the submission, return to the “Author Console” (by clicking on “Submissions” at the top of the page) and upload the lastname.zip file via “Upload Supplementary Material”.
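
The compression step above could be scripted as follows. This is a sketch under stated assumptions: the function name and paths are hypothetical, archive entries are stored relative to the enhanced-clips folder (the text does not say whether a top-level folder should be included), and 1.8 GB is interpreted as 1.8 * 10^9 bytes.

```python
import zipfile
from pathlib import Path

# Assumption: "1.8 GB" is read as decimal gigabytes (the stricter reading).
MAX_ZIP_BYTES = int(1.8 * 1000**3)


def zip_enhanced_clips(enhanced_dir, zip_path) -> int:
    """Zip every file under enhanced_dir, keeping relative folder and file names.

    Returns the archive size in bytes and raises ValueError if it exceeds
    the assumed size limit.
    """
    enhanced_dir = Path(enhanced_dir)
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(enhanced_dir.rglob("*")):
            # Skip directories and the archive itself if it lives in the tree.
            if not path.is_file() or path.resolve() == zip_path.resolve():
                continue
            zf.write(path, arcname=path.relative_to(enhanced_dir).as_posix())
    size = zip_path.stat().st_size
    if size > MAX_ZIP_BYTES:
        raise ValueError(f"{zip_path} is {size} bytes, above the assumed limit")
    return size
```

A call such as `zip_enhanced_clips("enhanced", "lastname.zip")` would then produce the archive to upload via "Upload Supplementary Material".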

Contact us: For questions, please contact sig_challenge@microsoft.com