Computational Intelligence for Speech Enhancement using Deep Neural Network

Authors

  • Hepsiba D. Department of Biomedical Instrumentation Engineering, Avinashilingam Institute for Home Science and Higher Education for Women / Department of Biomedical Engineering, Karunya Institute of Technology and Sciences, Coimbatore, India
  • Judith Justin Department of Biomedical Instrumentation Engineering, Avinashilingam Institute for Home Science and Higher Education for Women, Coimbatore, India

Abstract

In real-time applications, the received speech signal contains background noise and reverberation. These disturbances reduce speech quality, so it is important to remove the noise and improve the intelligibility and quality of the speech signal. Speech enhancement is therefore a primary task in any real-time application that handles speech signals. The proposed method removes the most challenging type of noise, babble noise, and recovers the clean speech. The corrupted speech signal is enhanced by a deep neural network-based denoising algorithm in which the ideal ratio mask is used to mask the noisy speech and separate the clean speech signal. Evaluation of the enhanced speech with performance metrics such as short-time objective intelligibility and signal-to-noise ratio shows that the proposed method improves both speech intelligibility and speech quality.
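As a rough illustration of the masking and evaluation steps mentioned in the abstract, the following minimal sketch computes an ideal ratio mask per time-frequency bin, applies it to noisy magnitudes, and measures a signal-to-noise ratio in decibels. It assumes the commonly used definition of the mask, (S / (S + N)) ** beta with beta = 0.5, where S and N are the speech and noise power in a bin; the abstract does not state which form the authors use. The function names and the toy values are hypothetical, and the deep neural network that would estimate the mask from noisy features is not shown.

```python
import math


def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Per-bin ideal ratio mask: (S / (S + N)) ** beta.

    speech_power and noise_power are flat sequences of non-negative
    power values, one per time-frequency bin (assumed layout).
    """
    mask = []
    for s, n in zip(speech_power, noise_power):
        total = s + n
        # An empty bin carries no speech, so the mask is set to zero.
        mask.append((s / total) ** beta if total > 0 else 0.0)
    return mask


def apply_mask(noisy_magnitude, mask):
    """Scale each noisy magnitude bin by its mask value."""
    return [m * x for m, x in zip(mask, noisy_magnitude)]


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB between a clean reference and an estimate."""
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error == 0:
        return math.inf
    if signal == 0:
        return -math.inf
    return 10.0 * math.log10(signal / error)


# Toy bins: speech only, equal speech and noise, noise only.
toy_mask = ideal_ratio_mask([4.0, 1.0, 0.0], [0.0, 1.0, 2.0])
toy_enhanced = apply_mask([2.0, 1.5, 1.4], toy_mask)
toy_snr = snr_db([1.0, 1.0], [1.0, 0.0])
```

In this toy case the speech-only bin passes unchanged, the noise-only bin is suppressed entirely, and the mixed bin is attenuated. In a supervised system the mask would be predicted by the network at test time, since the clean speech and noise powers are only available during training.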

Keywords:

deep neural network, noisy speech, speech enhancement, feature extraction, speech quality, computational intelligence
