
[1] Luo Xinwei, Liu Ting, Huang Ming, Xu Xiaogang, Cao Hongli, et al. Speech detection method based on a multi-window analysis [J]. Journal of Southeast University (English Edition), 2021, 37(4): 343-349. DOI:10.3969/j.issn.1003-7985.2021.04.001.

Speech detection method based on a multi-window analysis

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
37
Issue:
4
Page:
343-349
Research Field:
Information and Communication Engineering
Publishing date:
2021-12-20

Info

Title:
Speech detection method based on a multi-window analysis
Author(s):
Luo Xinwei1 Liu Ting1 Huang Ming1 Xu Xiaogang1 Cao Hongli1 Bai Xianghua2 Xu Dayong2
1Key Laboratory of Underwater Acoustic Signal Processing of Ministry of Education, Southeast University, Nanjing 210096, China
2The 1st Military Representative Office in Nanjing, Shanghai Bureau, PLA Navy Military Equipment Department, Nanjing 210000, China
Keywords:
voice activity detection; multi-window spectral analysis; K-means clustering; threshold adjustment; sequential decision
PACS:
TN911.72
DOI:
10.3969/j.issn.1003-7985.2021.04.001
Abstract:
To address the poor performance of speech signal detection at low signal-to-noise ratios (SNRs), a method is proposed to detect active speech frames based on multi-window time-frequency (T-F) diagrams. First, the T-F diagram of the signal is calculated by a multi-window T-F analysis, and a speech test statistic is constructed from the characteristic difference between the signal and the background noise. Second, dynamic double-threshold processing is used for preliminary detection, and the global double thresholds are then obtained by K-means clustering. Finally, the detection results are obtained by sequential decision. Experimental results show that the overall performance of the method is better than that of traditional methods under various SNR conditions and background noises. The method also has the advantages of low complexity, strong robustness, and adaptability to multiple languages.
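The three-stage pipeline summarized in the abstract (multi-window T-F statistic, K-means-derived double thresholds, sequential decision) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the per-frame log energy averaged over several window lengths stands in for the paper's test statistic, the window lengths and hysteresis run length are arbitrary choices, and the threshold placement between the two cluster centers is an assumption.

```python
import numpy as np

def multiwindow_energy(x, win_lengths=(256, 512, 1024), hop=256):
    """Per-frame log energy averaged over several analysis window
    lengths (a stand-in for the paper's multi-window T-F statistic)."""
    n_frames = 1 + (len(x) - max(win_lengths)) // hop
    stat = np.zeros(n_frames)
    for n in win_lengths:
        w = np.hanning(n)
        for i in range(n_frames):
            frame = x[i * hop : i * hop + n] * w
            spec = np.abs(np.fft.rfft(frame)) ** 2
            stat[i] += np.log(spec.mean() + 1e-12)
    return stat / len(win_lengths)

def kmeans_thresholds(stat, n_iter=50):
    """Two-class 1-D K-means over the frame statistics; returns global
    (low, high) thresholds placed between the two cluster centers."""
    c = np.array([stat.min(), stat.max()], dtype=float)
    for _ in range(n_iter):
        assign = np.abs(stat[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                c[k] = stat[assign == k].mean()
    mid, gap = c.mean(), c[1] - c[0]
    return mid - 0.25 * gap, mid + 0.25 * gap

def detect(stat, thr_lo, thr_hi, min_run=3):
    """Sequential decision with hysteresis: enter the speech state after
    min_run consecutive frames above the high threshold, and leave it
    after min_run consecutive frames below the low threshold."""
    active, run, out = False, 0, []
    for s in stat:
        if (not active and s > thr_hi) or (active and s < thr_lo):
            run += 1
            if run >= min_run:
                active, run = not active, 0
        else:
            run = 0
        out.append(active)
    return np.array(out)
```

A quick check on synthetic data: embedding a tone burst in weak noise, computing the statistic, clustering the thresholds, and running the sequential decision marks the burst frames as active and the noise-only frames as inactive.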

References:

[1] Lamel L, Rabiner L, Rosenberg A, et al. An improved endpoint detector for isolated word recognition[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1981, 29(4): 777-785. DOI:10.1109/TASSP.1981.1163642.
[2] Lu L, Jiang H, Zhang H J. A robust audio classification and segmentation method[C]//Proceedings of the Ninth ACM International Conference on Multimedia. Ottawa, Canada, 2001: 203-211. DOI:10.1145/500141.500173.
[3] Song J L, Meng Y, Cao J M, et al. Research on digital hearing aid speech enhancement algorithm[C]//2018 37th Chinese Control Conference(CCC). Wuhan, 2018: 4316-4320. DOI:10.23919/chicc.2018.8482732.
[4] Çolak R, Akdeniz R. A novel voice activity detection for multi-channel noise reduction[J]. IEEE Access, 2021, 9: 91017-91026.
[5] Jaiswal R. Speech activity detection under adverse noisy conditions at low SNRs[C]//2021 6th International Conference on Communication and Electronics Systems(ICCES). Coimbatore, India, 2021: 97-101. DOI:10.1109/ICCES51350.2021.9488934.
[6] Masumura R, Matsui K, Koizumi Y, et al. Context-aware neural voice activity detection using auxiliary networks for phoneme recognition, speech enhancement and acoustic scene classification[C]//2019 27th European Signal Processing Conference(EUSIPCO). A Coruña, Spain, 2019: 1-5. DOI:10.23919/EUSIPCO.2019.8902703.
[7] Moldovan A, Stan A, Giurgiu M. Improving sentence-level alignment of speech with imperfect transcripts using utterance concatenation and VAD[C]//2016 IEEE 12th International Conference on Intelligent Computer Communication and Processing(ICCP). Cluj-Napoca, Romania, 2016: 171-174. DOI:10.1109/ICCP.2016.7737141.
[8] Rabiner L R, Sambur M R. An algorithm for determining the endpoints of isolated utterances[J]. The Bell System Technical Journal, 1975, 54(2): 297-315. DOI:10.1002/j.1538-7305.1975.tb02840.x.
[9] Nemer E, Goubran R, Mahmoud S. Robust voice activity detection using higher-order statistics in the LPC residual domain[J]. IEEE Transactions on Speech and Audio Processing, 2001, 9(3): 217-231. DOI:10.1109/89.905996.
[10] Marzinzik M, Kollmeier B. Speech pause detection for noise spectrum estimation by tracking power envelope dynamics[J]. IEEE Transactions on Speech and Audio Processing, 2002, 10(2): 109-118. DOI:10.1109/89.985548.
[11] Shi L, Ahmad I, He Y J, et al. Hidden Markov model based drone sound recognition using MFCC technique in practical noisy environments[J]. Journal of Communications and Networks, 2018, 20(5): 509-518. DOI:10.1109/JCN.2018.000075.
[12] Al-Ali A K H, Dean D, Senadji B, et al. Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions[J]. IEEE Access, 2017, 5: 15400-15413. DOI:10.1109/ACCESS.2017.2728801.
[13] Pham T V, Stark M, Rank E. Performance analysis of wavelet subband based voice activity detection in cocktail party environment[C]//The 2010 International Conference on Advanced Technologies for Communications. Ho Chi Minh City, Vietnam, 2010: 85-88. DOI:10.1109/ATC.2010.5672718.
[14] Ghosh P K, Tsiartas A, Narayanan S. Robust voice activity detection using long-term signal variability[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(3): 600-613. DOI:10.1109/TASL.2010.2052803.
[15] Tsiartas A, Chaspari T, Katsamanis N, et al. Multi-band long-term signal variability features for robust voice activity detection[C]//14th Annual Conference of the International Speech Communication Association(INTERSPEECH 2013). ISCA, 2013: 718-722. DOI:10.21437/interspeech.2013-201.
[16] Haider F, Luz S. Attitude recognition using multi-resolution cochleagram features[C]//2019 IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Brighton, UK, 2019: 3737-3741. DOI:10.1109/ICASSP.2019.8682974.
[17] Aneeja G, Yegnanarayana B. Single frequency filtering approach for discriminating speech and nonspeech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(4): 705-717. DOI:10.1109/TASLP.2015.2404035.
[18] Makowski R, Hossa R. Voice activity detection with quasi-quadrature filters and GMM decomposition for speech and noise[J]. Applied Acoustics, 2020, 166: 107344. DOI:10.1016/j.apacoust.2020.107344.
[19] Sohn J, Kim N S, Sung W. A statistical model-based voice activity detection[J]. IEEE Signal Processing Letters, 1999, 6(1): 1-3. DOI:10.1109/97.736233.
[20] Dey J, Bin Hossain M S, Haque M A. An ensemble SVM-based approach for voice activity detection[C]//2018 10th International Conference on Electrical and Computer Engineering(ICECE). Dhaka, Bangladesh, 2018: 297-300. DOI:10.1109/ICECE.2018.8636745.
[21] Krishnakumar H, Williamson D S. A comparison of boosted deep neural networks for voice activity detection[C]//2019 IEEE Global Conference on Signal and Information Processing(GlobalSIP). Ottawa, ON, Canada, 2019: 1-5. DOI:10.1109/GlobalSIP45357.2019.8969258.
[22] Germain F G, Sun D L, Mysore G J. Speaker and noise independent voice activity detection[C]//Interspeech 2013. France, 2013: 732-736. DOI:10.21437/interspeech.2013-204.
[23] Tachioka Y. DNN-based voice activity detection using auxiliary speech models in noisy environments[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Calgary, AB, Canada, 2018: 5529-5533. DOI:10.1109/ICASSP.2018.8461551.
[24] Paseddula C, Gangashetty S V. DNN based acoustic scene classification using score fusion of MFCC and inverse MFCC[C]//2018 IEEE 13th International Conference on Industrial and Information Systems(ICIIS). Rupnagar, India, 2018: 18-21. DOI:10.1109/ICIINFS.2018.8721379.
[25] Sun Y N, Yen G G, Yi Z. Evolving unsupervised deep neural networks for learning meaningful representations[J]. IEEE Transactions on Evolutionary Computation, 2019, 23(1): 89-103. DOI:10.1109/TEVC.2018.2808689.
[26] Long J Y, Zhang S H, Li C. Evolving deep echo state networks for intelligent fault diagnosis[J]. IEEE Transactions on Industrial Informatics, 2020, 16(7): 4928-4937. DOI:10.1109/TII.2019.2938884.

Memo

Memo:
Biography: Luo Xinwei (1978—), male, Ph.D., associate professor, luoxinwei@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China (No. 12174053, 91938203, 11674057, 11874109), the Fundamental Research Funds for the Central Universities (No. 2242021k30019).
Citation: Luo Xinwei, Liu Ting, Huang Ming, et al. Speech detection method based on a multi-window analysis[J]. Journal of Southeast University (English Edition), 2021, 37(4): 343-349. DOI:10.3969/j.issn.1003-7985.2021.04.001.
Last Update: 2021-12-20