Complex-bin2bin: A Latency-Flexible Generative Neural Model for Audio Packet Loss Concealment

C. Aironi, L. Gabrielli, S. Cornell and S. Squartini

Abstract - Despite significant advancements in networking technologies, real-time transmission of data packets, particularly in speech communications, continues to face challenges due to the possibility of data loss. This loss not only compromises sound quality but also diminishes overall intelligibility. In such cases, packet loss concealment (PLC) techniques can help by reconstructing the missing content and restoring the audio quality. This work proposes a novel method that improves on previous time-frequency generative inpainting approaches. Compared to other state-of-the-art methods, the proposed approach has the flexibility to restore lost packets either in real time at low latency or in offline mode, without the need to retrain the network. Evaluations conducted against two recent state-of-the-art methods, ranked at the top of the 2022 Microsoft PLC competition, and against four DNN-based PLC solutions from the literature, show superior scores in terms of task-specific metrics. The method has also been tested in more challenging scenarios than the aforementioned ones, with packet loss rates of up to 50%, showing the ability to help automatic speech recognition (ASR) systems reduce word error rate (WER) by up to almost 50% in relative terms. Additionally, a comparative subjective evaluation has been conducted, confirming the effectiveness of the proposed method in relation to the state of the art.


Below are some examples of repaired speech sequences, at different loss rates, for complex-bin2bin and the comparison methods: bin2bin [1], CRNN [2], SEGAN [3], Wave UNet [4] and TFGAN-PLC [5]. Next to each audio sample is the magnitude spectrogram of an excerpt, with a binary loss mask (in red) marking the position of lost samples (0 = non-lost, 1 = lost).
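To make the mask convention concrete, here is a minimal sketch of how a lossy signal and its per-sample binary mask could be produced. It is an illustration only: this page does not state how losses were simulated, so the function name `simulate_packet_loss`, the 16 kHz sample rate, the 320-sample (20 ms) packet length, the 30% loss rate, the independent per-packet loss model and the zero-filling of lost packets are all assumptions.

```python
import math
import random


def simulate_packet_loss(signal, packet_len, loss_rate, seed=0):
    """Zero out randomly chosen packets (hypothetical helper).

    Returns the lossy signal and a per-sample mask: 0 = non-lost, 1 = lost.
    """
    rng = random.Random(seed)
    n_packets = -(-len(signal) // packet_len)  # ceiling division
    # One independent draw per packet (assumed loss model)
    packet_mask = [1 if rng.random() < loss_rate else 0 for _ in range(n_packets)]
    # Expand packet-level decisions to one mask value per sample
    sample_mask = [packet_mask[i // packet_len] for i in range(len(signal))]
    # Assumed concealment-free baseline: lost samples are replaced by zeros
    lossy = [0.0 if m else x for x, m in zip(signal, sample_mask)]
    return lossy, sample_mask


# Illustrative values only: 1 s of a 440 Hz tone at 16 kHz, 20 ms packets
sr = 16000
clean = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
lossy, mask = simulate_packet_loss(clean, packet_len=320, loss_rate=0.3)
```

Under these assumptions the mask is constant within each packet, so in a spectrogram view a lost packet shows up as a contiguous gap of the same width as the mask pulse.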

Clean (reference)

Clean spectrogram

Lossy

Lossy spectrogram

bin2bin [1]

bin2bin spectrogram

CRNN [2]

CRNN spectrogram

SEGAN [3]

SEGAN spectrogram

Wave UNet [4]

Wave UNet spectrogram

TFGAN-PLC [5]

TFGAN-PLC spectrogram

complex-bin2bin

complex-bin2bin spectrogram

Clean (reference)

Clean spectrogram

Lossy

Lossy spectrogram

bin2bin [1]

bin2bin spectrogram

CRNN [2]

CRNN spectrogram

SEGAN [3]

SEGAN spectrogram

Wave UNet [4]

Wave UNet spectrogram

TFGAN-PLC [5]

TFGAN-PLC spectrogram

complex-bin2bin

complex-bin2bin spectrogram

Clean (reference)

Clean spectrogram

Lossy

Lossy spectrogram

bin2bin [1]

bin2bin spectrogram

CRNN [2]

CRNN spectrogram

SEGAN [3]

SEGAN spectrogram

Wave UNet [4]

Wave UNet spectrogram

TFGAN-PLC [5]

TFGAN-PLC spectrogram

complex-bin2bin

complex-bin2bin spectrogram

[1] - C. Aironi, S. Cornell, L. Serafini, and S. Squartini, “A Time-Frequency Generative Adversarial Based Method for Audio Packet Loss Concealment,” in 31st European Signal Processing Conference (EUSIPCO), 2023, pp. 121–125.
[2] - J. Lin, Y. Wang, K. Kalgaonkar, G. Keren, D. Zhang, and C. Fuegen, “A Time-Domain Convolutional Recurrent Network for Packet Loss Concealment,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7148–7152.
[3] - Y. Shi, N. Zheng, Y. Kang, and W. Rong, “Speech Loss Compensation by Generative Adversarial Networks,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 347–351.
[4] - M. N. Ali, A. Brutti, and D. Falavigna, “Speech Enhancement Using Dilated Wave-U-Net: an Experimental Analysis,” in 27th Conference of Open Innovations Association (FRUCT), 2020, pp. 3–9.
[5] - J. Wang, Y. Guan, C. Zheng, R. Peng, and X. Li, “A temporal-spectral generative adversarial network based end-to-end packet loss concealment for wideband speech transmission,” The Journal of the Acoustical Society of America, vol. 150, no. 4, pp. 2577–2588, 2021.