Abstract
In real-time applications, the received speech signal contains background noise and reverberation. These disturbances degrade speech quality, so it is important to suppress the noise and improve the intelligibility and quality of the speech signal. Speech enhancement is therefore a primary task in any real-time application that handles speech signals. The proposed method removes babble noise, the most disruptive and challenging type of noise, and recovers the clean speech. The corrupted speech signal is enhanced by a deep neural network-based denoising algorithm in which the ideal ratio mask is used to mask the noisy speech and separate the clean speech signal. Evaluation of the enhanced speech with performance metrics such as short-time objective intelligibility and signal-to-noise ratio shows that the proposed method improves both speech intelligibility and speech quality.
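As a rough illustration of the masking step, the sketch below computes an oracle ideal ratio mask from known clean and noise signals and applies it to the noisy magnitude spectrogram. It is not the authors' implementation: in the proposed method a deep neural network estimates the mask, whereas here the mask is computed directly. The function names (stft_mag, ideal_ratio_mask, snr_db), the frame and hop sizes, the exponent beta = 0.5, and the synthetic tone-plus-random-noise signal (a stand-in for speech and babble noise) are all assumptions made for the example. Short-time objective intelligibility is omitted because it needs a dedicated implementation.

```python
import numpy as np

def stft_mag(x, frame=256, hop=128):
    """Magnitude spectrogram (frames x bins) from a Hann-windowed framed FFT."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def ideal_ratio_mask(speech_mag, noise_mag, beta=0.5, eps=1e-12):
    """Oracle ratio mask per time-frequency unit; beta = 0.5 is an assumed choice."""
    s2 = speech_mag ** 2
    n2 = noise_mag ** 2
    return (s2 / (s2 + n2 + eps)) ** beta

def snr_db(reference, estimate, eps=1e-12):
    """Signal-to-noise ratio in dB between a reference and an estimate."""
    err = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps)
                           / (np.sum(err ** 2) + eps))

# Synthetic stand-ins (assumptions): a 440 Hz tone plays the role of clean
# speech and Gaussian noise plays the role of babble noise.
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.5 * rng.standard_normal(fs)
noisy = clean + noise

clean_mag = stft_mag(clean)
noise_mag = stft_mag(noise)
noisy_mag = stft_mag(noisy)

# Mask the noisy magnitudes; a trained network would predict `mask` instead.
mask = ideal_ratio_mask(clean_mag, noise_mag)
enhanced_mag = mask * noisy_mag

# SNR is measured on magnitude spectrograms to keep the sketch short.
snr_noisy = snr_db(clean_mag, noisy_mag)
snr_enhanced = snr_db(clean_mag, enhanced_mag)
print(f"SNR before masking: {snr_noisy:.2f} dB")
print(f"SNR after masking:  {snr_enhanced:.2f} dB")
```

Measuring the signal-to-noise ratio on magnitudes avoids an inverse transform; a fuller pipeline would resynthesize the waveform using the noisy phase before scoring.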
Keywords:
deep neural network, noisy speech, speech enhancement, feature extraction, speech quality, computational intelligence
