Complex-bin2bin

Complex-bin2bin: A Latency-Flexible Generative Neural Model for Audio Packet Loss Concealment

C. Aironi, L. Gabrielli, S. Cornell and S. Squartini

Abstract - Despite significant advancements in networking technologies, real-time transmission of data packets, particularly in speech communications, continues to face challenges due to the possibility of data loss. This loss not only compromises sound quality but also diminishes overall intelligibility. In such cases, packet loss concealment (PLC) techniques can help by reconstructing the missing content and restoring the audio quality. This work proposes a novel method that improves on previous time-frequency generative inpainting approaches. Compared to other state-of-the-art methods, the proposed approach has the flexibility to restore lost packets either in real time at low latency or in offline mode, without the need to retrain the network. Evaluations conducted against two recent state-of-the-art methods, ranked at the top of the 2022 Microsoft PLC competition, and against four DNN-based PLC solutions from the literature, show superior scores in terms of task-specific metrics. The method has also been tested in more challenging scenarios than those above, with packet loss rates of up to 50%, showing the ability to help automatic speech recognition (ASR) systems achieve a relative word error rate (WER) reduction of up to almost 50%. Additionally, a comparative subjective evaluation has been conducted, confirming the effectiveness of the proposed method in relation to the state of the art.

Below are some examples of repaired speech sequences, at different loss rates, for complex-bin2bin and the comparison methods: bin2bin [1], CRNN [2], SEGAN [3], Wave UNet [4] and TFGAN-PLC [5]. Next to each audio sample is the magnitude spectrogram of an excerpt, with a binary loss mask (in red) marking which samples were lost (0 = non-lost, 1 = lost).
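To make the loss mask and the loss rate concrete, here is a minimal Python sketch of how a per-sample binary mask (0 = non-lost, 1 = lost) could be built from a list of lost packets, and how a loss rate follows from it. The sample rate, packet length, lost-packet indices, zero-filling of lost samples, and the function names `packet_loss_mask` and `apply_loss` are all illustrative assumptions; they are not taken from the paper.

```python
import numpy as np

def packet_loss_mask(num_samples, packet_len, lost_packets):
    """Return a per-sample binary mask: 0 = non-lost, 1 = lost."""
    mask = np.zeros(num_samples, dtype=np.uint8)
    for p in lost_packets:
        start = p * packet_len
        # Slicing clips automatically at the end of the signal.
        mask[start:start + packet_len] = 1
    return mask

def apply_loss(signal, mask):
    """Zero out lost samples (zero-filling is an assumption of this sketch)."""
    return signal * (1 - mask)

# Hypothetical setup: 1 s of noise at 16 kHz, 20 ms packets (320 samples each).
sample_rate = 16000
packet_len = 320
signal = np.random.default_rng(0).standard_normal(sample_rate)

mask = packet_loss_mask(len(signal), packet_len, lost_packets=[3, 4, 10, 27, 41])
lossy = apply_loss(signal, mask)

# The loss rate is the fraction of samples marked as lost.
loss_rate = mask.mean()
print(f"loss rate: {100 * loss_rate:.1f}%")
```

With 5 of 50 packets marked as lost, this hypothetical signal has a loss rate of 10%. The percentages listed for each sample below are properties of the actual test sequences.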

Sample 1 - Loss rate 23.4%

Clean (reference)

Lossy

bin2bin [1]

CRNN [2]

SEGAN [3]

Wave UNet [4]

TFGAN-PLC [5]

complex-bin2bin

Sample 2 - Loss rate 32.2%

Clean (reference)

Lossy

bin2bin [1]

CRNN [2]

SEGAN [3]

Wave UNet [4]

TFGAN-PLC [5]

complex-bin2bin

Sample 3 - Loss rate 10.9%

Clean (reference)

Lossy

bin2bin [1]

CRNN [2]

SEGAN [3]

Wave UNet [4]

TFGAN-PLC [5]

complex-bin2bin

[1] - C. Aironi, S. Cornell, L. Serafini, and S. Squartini, “A Time-Frequency Generative Adversarial Based Method for Audio Packet Loss Concealment,” in 31st European Signal Processing Conference (EUSIPCO), 2023, pp. 121–125.
[2] - J. Lin, Y. Wang, K. Kalgaonkar, G. Keren, D. Zhang, and C. Fuegen, “A Time-Domain Convolutional Recurrent Network for Packet Loss Concealment,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7148–7152.
[3] - Y. Shi, N. Zheng, Y. Kang, and W. Rong, “Speech Loss Compensation by Generative Adversarial Networks,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 347–351.
[4] - M. N. Ali, A. Brutti, and D. Falavigna, “Speech Enhancement Using Dilated Wave-U-Net: an Experimental Analysis,” in 27th Conference of Open Innovations Association (FRUCT), 2020, pp. 3–9.
[5] - J. Wang, Y. Guan, C. Zheng, R. Peng, and X. Li, “A temporal-spectral generative adversarial network based end-to-end packet loss concealment for wideband speech transmission,” The Journal of the Acoustical Society of America, vol. 150, no. 4, pp. 2577–2588, 2021.