June 6, 2021 - June 11, 2021

Microsoft at ICASSP 2021

Location: Virtual

All times are displayed in Eastern Daylight Time (UTC -4)

Monday, June 7

10:00 – 13:30 | Tutorial

Distant conversational speech recognition and analysis: Recent advances, and trends towards end-to-end optimization

Presenters: Keisuke Kinoshita, Yusuke Fujita, Naoyuki Kanda, Shinji Watanabe

18:00 – 19:00

Young Professionals Panel Discussion

Moderator: Subhro Das
Panelists: Sabrina Rashid, Vanessa Testoni, Hamid Palangi


Tuesday, June 8

13:00 – 13:45 | Speech Synthesis 1: Architecture

LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search

Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Sheng Zhao, Enhong Chen, Tie-Yan Liu

13:00 – 13:45 | Speech Synthesis 1: Architecture

A New High Quality Trajectory Tiling Based Hybrid TTS In Real Time

Feng-Long Xie, Xin-Hui Li, Wen-Chao Su, Li Lu, Frank K. Soong

13:00 – 13:45 | Language Modeling 1: Fusion and Training for End-to-End ASR

Internal Language Model Training for Domain-Adaptive End-To-End Speech Recognition

Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong

13:00 – 13:45 | Audio and Speech Source Separation 1: Speech Separation

Session Chair: Zhuo Chen

Rethinking The Separation Layers In Speech Separation Networks

Yi Luo, Zhuo Chen, Cong Han, Chenda Li, Tianyan Zhou, Nima Mesgarani

13:00 – 13:45 | Deep Learning Training Methods 3

Session Chair: Jinyu Li

13:00 – 13:45 | Brain-Computer Interfaces

Decoding Music Attention from “EEG Headphones”: A User-Friendly Auditory Brain-Computer Interface

Wenkang An, Barbara Shinn-Cunningham, Hannes Gamper, Dimitra Emmanouilidou, David Johnston, Mihai Jalobeanu, Edward Cutrell, Andrew Wilson, Kuan-Jung Chiang, Ivan Tashev

14:00 – 14:45 | Speech Enhancement 1: Speech Separation

Session Chair: Takuya Yoshioka

Dual-Path Modeling for Long Recording Speech Separation in Meetings

Chenda Li, Zhuo Chen, Yi Luo, Cong Han, Tianyan Zhou, Keisuke Kinoshita, Marc Delcroix, Shinji Watanabe, Yanmin Qian

14:00 – 14:45 | Speech Enhancement 1: Speech Separation

Continuous Speech Separation with Conformer

Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Jinyu Li, Takuya Yoshioka, Chengyi Wang, Shujie Liu, Ming Zhou

14:00 – 14:45 | Speech Enhancement 2: Speech Separation and Dereverberation

Session Chair: Takuya Yoshioka

14:00 – 14:45 | Speaker Recognition 1: Benchmark Evaluation

Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020

Xiong Xiao, Naoyuki Kanda, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka, Sanyuan Chen, Yong Zhao, Gang Liu, Yu Wu, Jian Wu, Shujie Liu, Jinyu Li, Yifan Gong

14:00 – 14:45 | Dialogue Systems 2: Response Generation

Topic-Aware Dialogue Generation with Two-Hop Based Graph Attention

Shijie Zhou, Wenge Rong, Jianfei Zhang, Yanmeng Wang, Libin Shi, Zhang Xiong

16:30 – 17:15 | Speech Recognition 4: Transformer Models 2

Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset

Xie Chen, Yu Wu, Zhenghao Wang, Shujie Liu, Jinyu Li

16:30 – 17:15 | Active Noise Control, Echo Reduction, and Feedback Reduction 2: Active Noise Control and Echo Cancellation

Session Chair: Hannes Gamper

ICASSP 2021 Acoustic Echo Cancellation Challenge: Datasets, Testing Framework, and Results

Kusha Sridhar, Ross Cutler, Ando Saabas, Tanel Parnamaa, Markus Loide, Hannes Gamper, Sebastian Braun, Robert Aichner, Sriram Srinivasan

16:30 – 17:15 | Learning

Session Chair: Zhong Meng

Sequence-Level Self-Teaching Regularization

Eric Sun, Liang Lu, Zhong Meng, Yifan Gong


Wednesday, June 9

13:00 – 13:45 | Language Understanding 1: End-to-end Speech Understanding 1

Speech-Language Pre-Training for End-to-End Spoken Language Understanding

Yao Qian, Ximo Bian, Yu Shi, Naoyuki Kanda, Leo Shen, Zhen Xiao, Michael Zeng

13:00 – 13:45 | Audio and Speech Source Separation 4: Multi-Channel Source Separation

DBnet: DOA-Driven Beamforming Network for End-to-End Reverberant Sound Source Separation

Ali Aroudi, Sebastian Braun

14:00 – 14:45 | Speech Enhancement 4: Multi-channel Processing

Don’t Shoot Butterfly with Rifles: Multi-Channel Continuous Speech Separation with Early Exit Transformer

Sanyuan Chen, Yu Wu, Zhuo Chen, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu

14:00 – 14:45 | Matrix Factorization and Applications

Cold Start Revisited: A Deep Hybrid Recommender with Cold-Warm Item Harmonization

Oren Barkan, Roy Hirsch, Ori Katz, Avi Caciularu, Yoni Weill, Noam Koenigstein

14:00 – 14:45 | Biological Image Analysis

CMIM: Cross-Modal Information Maximization For Medical Imaging

Tristan Sylvain, Francis Dutil, Tess Berthier, Lisa Di Jorio, Margaux Luck, Devon Hjelm, Yoshua Bengio

15:30 – 16:15 | Speech Recognition 8: Multilingual Speech Recognition

Multi-Dialect Speech Recognition in English Using Attention on Ensemble of Experts

Amit Das, Kshitiz Kumar, Jian Wu

15:30 – 16:15 | Quality and Intelligibility Measures

MBNET: MOS Prediction for Synthesized Speech with Mean-Bias Network

Yichong Leng, Xu Tan, Sheng Zhao, Frank K. Soong, Xiang-Yang Li, Tao Qin

15:30 – 16:15 | Quality and Intelligibility Measures

Crowdsourcing Approach for Subjective Evaluation of Echo Impairment

Ross Cutler, Babak Nadari, Markus Loide, Sten Sootla, Ando Saabas

16:30 – 17:15 | Speech Recognition 9: Confidence Measures

Session Chair: Yifan Gong

16:30 – 17:15 | Speech Recognition 10: Robustness to Human Speech Variability

Session Chair: Yifan Gong

16:30 – 17:15 | Speech Processing 2: General Topics

DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors

Chandan K A Reddy, Vishak Gopal, Ross Cutler

16:30 – 17:15 | Style and Text Normalization

Generating Human Readable Transcript for Automatic Speech Recognition with Pre-Trained Language Model

Junwei Liao, Yu Shi, Ming Gong, Linjun Shou, Sefik Eskimez, Liyang Lu, Hong Qu, Michael Zeng

16:30 – 17:15 | Modeling, Analysis and Synthesis of Acoustic Environments 3: Acoustic Analysis

Prediction of Object Geometry from Acoustic Scattering Using Convolutional Neural Networks

Ziqi Fan, Vibhav Vineet, Chenshen Lu, T.W. Wu, Kyla McMullen


Thursday, June 10

13:00 – 13:45 | Speech Recognition 11: Novel Approaches

Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR

Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka

13:00 – 13:45 | Speech Synthesis 5: Prosody & Style

Speech BERT Embedding for Improving Prosody in Neural TTS

Liping Chen, Yan Deng, Xi Wang, Frank K. Soong, Lei He

13:00 – 13:45 | Speech Synthesis 6: Data Augmentation & Adaptation

AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data

Yuzi Yan, Xu Tan, Bohan Li, Tao Qin, Sheng Zhao, Yuan Shen, Tie-Yan Liu

14:00 – 14:45 | Speech Enhancement 5: DNS Challenge Task

Session Chair: Chandan K A Reddy

ICASSP 2021 Deep Noise Suppression Challenge

Chandan K A Reddy, Harishchandra Dubey, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan

14:00 – 14:45 | Speech Enhancement 6: Multi-modal Processing

Session Chair: Chandan K A Reddy

14:00 – 14:45 | Graph Signal Processing

Fast Hierarchy Preserving Graph Embedding via Subspace Constraints

Xu Chen, Lun Du, Mengyuan Chen, Yun Wang, QingQing Long, Kunqing Xie

15:30 – 16:15 | Speech Recognition 13: Acoustic Modeling 1

Hypothesis Stitcher for End-to-End Speaker-Attributed ASR on Long-Form Multi-Talker Recordings

Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

15:30 – 16:15 | Speech Recognition 14: Acoustic Modeling 2

Ensemble Combination between Different Time Segmentations

Jeremy Heng Meng Wong, Dimitrios Dimitriadis, Kenichi Kumatani, Yashesh Gaur, George Polovets, Partha Parthasarathy, Eric Sun, Jinyu Li, Yifan Gong

15:30 – 16:15 | Privacy and Information Security

Detection Of Malicious DNS and Web Servers using Graph-Based Approaches

Jinyuan Jia, Zheng Dong, Jie Li, Jack W. Stokes

16:30 – 17:15 | Language Assessment

Improving Pronunciation Assessment Via Ordinal Regression with Anchored Reference Samples

Bin Su, Shaoguang Mao, Frank K. Soong, Yan Xia, Jonathan Tien, Zhiyong Wu

16:30 – 17:15 | Signal Enhancement and Restoration 1: Deep Learning

Towards Efficient Models for Real-Time Deep Noise Suppression

Sebastian Braun, Hannes Gamper, Chandan K A Reddy, Ivan Tashev

16:30 – 17:15 | Signal Enhancement and Restoration 3: Signal Enhancement

Phoneme-Based Distribution Regularization for Speech Enhancement

Yajing Liu, Xiulian Peng, Zhiwei Xiong, Yan Lu

16:30 – 17:15 | Audio & Images

Session Chair: Ivan Tashev


Friday, June 11

11:30 – 12:15 | Speech Recognition 18: Low Resource ASR

MixSpeech: Data Augmentation for Low-Resource Automatic Speech Recognition

Linghui Meng, Jin Xu, Xu Tan, Jindong Wang, Tao Qin, Bo Xu

11:30 – 12:15 | Speech Synthesis 7: General Topics

DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

Chen Zhang, Yi Ren, Xu Tan, Jinglin Liu, Kejun Zhang, Tao Qin, Sheng Zhao, Tie-Yan Liu

13:00 – 13:45 | Speech Enhancement 8: Echo Cancellation and Other Tasks

Cascaded Time + Time-Frequency Unet For Speech Enhancement: Jointly Addressing Clipping, Codec Distortions, And Gaps

Arun Asokan Nair, Kazuhito Koishida

13:00 – 13:45 | Speaker Diarization

Hidden Markov Model Diarisation with Speaker Location Information

Jeremy Heng Meng Wong, Xiong Xiao, Yifan Gong

13:00 – 13:45 | Detection and Classification of Acoustic Scenes and Events 5: Scenes

Cross-Modal Spectrum Transformation Network for Acoustic Scene Classification

Yang Liu, Alexandros Neophytou, Sunando Sengupta, Eric Sommerlade