
[1] Luo Xinwei, Liu Ting, Huang Ming, Xu Xiaogang, Cao Hongli, et al. Speech detection method based on a multi-window analysis [J]. Journal of Southeast University (English Edition), 2021, 37(4): 343-349. [doi:10.3969/j.issn.1003-7985.2021.04.001]

Speech detection method based on a multi-window analysis

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
37
Issue:
4
Page:
343-349
Research Field:
Information and Communication Engineering
Publishing date:
2021-12-20

Info

Title:
Speech detection method based on a multi-window analysis
Author(s):
Luo Xinwei1 Liu Ting1 Huang Ming1 Xu Xiaogang1 Cao Hongli1 Bai Xianghua2 Xu Dayong2
1Key Laboratory of Underwater Acoustic Signal Processing of Ministry of Education, Southeast University, Nanjing 210096, China
2The 1st Military Representative Office in Nanjing, Shanghai Bureau, PLA Navy Military Equipment Department, Nanjing 210000, China
Keywords:
voice activity detection; multi-window spectral analysis; K-means clustering; threshold adjustment; sequential decision
PACS:
TN911.72
DOI:
10.3969/j.issn.1003-7985.2021.04.001
Abstract:
To address the poor performance of speech signal detection at low signal-to-noise ratios (SNRs), a method is proposed to detect active speech frames based on multi-window time-frequency (T-F) diagrams. First, the T-F diagram of the signal is calculated by a multi-window T-F analysis, and a speech test statistic is constructed from the characteristic difference between the speech signal and the background noise. Second, dynamic double-threshold processing is used for a preliminary detection, and the global double thresholds are then obtained by K-means clustering. Finally, the detection results are obtained by a sequential decision. Experimental results show that the overall performance of the method is better than that of traditional methods under various SNR conditions and background noises. The method also has the advantages of low complexity, strong robustness, and adaptability to multiple languages.
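The pipeline summarized in the abstract (multi-window T-F analysis, a frame-level test statistic, K-means-derived double thresholds, and a sequential decision) can be sketched as follows. This is a minimal illustration only: the window set, the test statistic, and the hangover-style sequential rule are assumptions standing in for the authors' actual design.

```python
# Illustrative sketch of a multi-window VAD pipeline; the specific
# windows, statistic, and decision rule are assumptions, not the
# paper's exact implementation.
import numpy as np

def multiwindow_spectrogram(x, frame_len=256, hop=128):
    """Average the magnitude spectra obtained with several analysis
    windows, giving a multi-window T-F diagram."""
    windows = [np.hanning(frame_len), np.hamming(frame_len),
               np.blackman(frame_len)]
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.zeros((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = x[i * hop:i * hop + frame_len]
        for w in windows:
            spec[i] += np.abs(np.fft.rfft(frame * w))
    return spec / len(windows)

def frame_statistic(spec):
    """Per-frame test statistic: here simply log band magnitude,
    standing in for the paper's speech/noise feature difference."""
    return np.log(spec.sum(axis=1) + 1e-12)

def kmeans_thresholds(stat, iters=20):
    """1-D 2-means clustering of the statistic; the two centroids
    serve as global low/high decision thresholds."""
    c = np.array([stat.min(), stat.max()], dtype=float)
    for _ in range(iters):
        assign = np.abs(stat[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                c[k] = stat[assign == k].mean()
    low, high = np.sort(c)
    return low, high

def sequential_decision(stat, low, high, hangover=3):
    """Double-threshold sequential decision: a frame above `high`
    starts speech; speech continues while the statistic stays above
    `low`, with a short hangover bridging brief dips."""
    active = np.zeros(len(stat), dtype=bool)
    in_speech, count = False, 0
    for i, s in enumerate(stat):
        if s >= high:
            in_speech, count = True, hangover
        elif in_speech:
            if s >= low:
                count = hangover
            else:
                count -= 1
                in_speech = count > 0
        active[i] = in_speech
    return active

# Toy usage: low-level noise with a louder tonal burst in the middle.
rng = np.random.default_rng(0)
x = 0.05 * rng.standard_normal(8000)
x[3000:5000] += np.sin(2 * np.pi * 440 * np.arange(2000) / 8000)
stat = frame_statistic(multiwindow_spectrogram(x))
low, high = kmeans_thresholds(stat)
flags = sequential_decision(stat, low, high)  # True on active frames
```

In this toy run, frames covering the burst are flagged active while the surrounding noise-only frames are not; the K-means step removes the need to hand-tune the two thresholds per recording.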

References:

[1] Lamel L, Rabiner L, Rosenberg A, et al. An improved endpoint detector for isolated word recognition[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1981, 29(4): 777-785. DOI:10.1109/TASSP.1981.1163642.
[2] Lu L, Jiang H, Zhang H J. A robust audio classification and segmentation method[C]//Proceedings of the Ninth ACM International Conference on Multimedia. Ottawa, Canada, 2001: 203-211. DOI:10.1145/500141.500173.
[3] Song J L, Meng Y, Cao J M, et al. Research on digital hearing aid speech enhancement algorithm[C]//2018 37th Chinese Control Conference(CCC). Wuhan, 2018: 4316-4320. DOI:10.23919/chicc.2018.8482732.
[4] Çolak R, Akdeniz R. A novel voice activity detection for multi-channel noise reduction[J]. IEEE Access, 2021, 9: 91017-91026.
[5] Jaiswal R. Speech activity detection under adverse noisy conditions at low SNRs[C]//2021 6th International Conference on Communication and Electronics Systems(ICCES). Coimbatore, India, 2021: 97-101. DOI:10.1109/ICCES51350.2021.9488934.
[6] Masumura R, Matsui K, Koizumi Y, et al. Context-aware neural voice activity detection using auxiliary networks for phoneme recognition, speech enhancement and acoustic scene classification[C]//2019 27th European Signal Processing Conference(EUSIPCO). A Coruña, Spain, 2019: 1-5. DOI:10.23919/EUSIPCO.2019.8902703.
[7] Moldovan A, Stan A, Giurgiu M. Improving sentence-level alignment of speech with imperfect transcripts using utterance concatenation and VAD[C]//2016 IEEE 12th International Conference on Intelligent Computer Communication and Processing(ICCP). Cluj-Napoca, Romania, 2016: 171-174. DOI:10.1109/ICCP.2016.7737141.
[8] Rabiner L R, Sambur M R. An algorithm for determining the endpoints of isolated utterances[J]. The Bell System Technical Journal, 1975, 54(2): 297-315. DOI:10.1002/j.1538-7305.1975.tb02840.x.
[9] Nemer E, Goubran R, Mahmoud S. Robust voice activity detection using higher-order statistics in the LPC residual domain[J]. IEEE Transactions on Speech and Audio Processing, 2001, 9(3): 217-231. DOI:10.1109/89.905996.
[10] Marzinzik M, Kollmeier B. Speech pause detection for noise spectrum estimation by tracking power envelope dynamics[J]. IEEE Transactions on Speech and Audio Processing, 2002, 10(2): 109-118. DOI:10.1109/89.985548.
[11] Shi L, Ahmad I, He Y J, et al. Hidden Markov model based drone sound recognition using MFCC technique in practical noisy environments[J]. Journal of Communications and Networks, 2018, 20(5): 509-518. DOI:10.1109/JCN.2018.000075.
[12] Al-Ali A K H, Dean D, Senadji B, et al. Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions[J]. IEEE Access, 2017, 5: 15400-15413. DOI:10.1109/ACCESS.2017.2728801.
[13] Pham T V, Stark M, Rank E. Performance analysis of wavelet subband based voice activity detection in cocktail party environment[C]//The 2010 International Conference on Advanced Technologies for Communications. Ho Chi Minh City, Vietnam, 2010: 85-88. DOI:10.1109/ATC.2010.5672718.
[14] Ghosh P K, Tsiartas A, Narayanan S. Robust voice activity detection using long-term signal variability[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(3): 600-613. DOI:10.1109/TASL.2010.2052803.
[15] Tsiartas A, Chaspari T, Katsamanis N, et al. Multi-band long-term signal variability features for robust voice activity detection[C]//14th Annual Conference of the International Speech Communication Association(INTERSPEECH 2013). ISCA, 2013: 718-722. DOI:10.21437/interspeech.2013-201.
[16] Haider F, Luz S. Attitude recognition using multi-resolution cochleagram features[C]//2019 IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Brighton, UK, 2019: 3737-3741. DOI:10.1109/ICASSP.2019.8682974.
[17] Aneeja G, Yegnanarayana B. Single frequency filtering approach for discriminating speech and nonspeech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(4): 705-717. DOI:10.1109/TASLP.2015.2404035.
[18] Makowski R, Hossa R. Voice activity detection with quasi-quadrature filters and GMM decomposition for speech and noise[J]. Applied Acoustics, 2020, 166: 107344. DOI:10.1016/j.apacoust.2020.107344.
[19] Sohn J, Kim N S, Sung W. A statistical model-based voice activity detection[J]. IEEE Signal Processing Letters, 1999, 6(1): 1-3. DOI:10.1109/97.736233.
[20] Dey J, Bin Hossain M S, Haque M A. An ensemble SVM-based approach for voice activity detection[C]//2018 10th International Conference on Electrical and Computer Engineering(ICECE). Dhaka, Bangladesh, 2018: 297-300. DOI:10.1109/ICECE.2018.8636745.
[21] Krishnakumar H, Williamson D S. A comparison of boosted deep neural networks for voice activity detection[C]//2019 IEEE Global Conference on Signal and Information Processing(GlobalSIP). Ottawa, ON, Canada, 2019: 1-5. DOI:10.1109/GlobalSIP45357.2019.8969258.
[22] Germain F G, Sun D L, Mysore G J. Speaker and noise independent voice activity detection[C]//Interspeech 2013. Lyon, France, 2013: 732-736. DOI:10.21437/interspeech.2013-204.
[23] Tachioka Y. DNN-based voice activity detection using auxiliary speech models in noisy environments[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Calgary, AB, Canada, 2018: 5529-5533. DOI:10.1109/ICASSP.2018.8461551.
[24] Paseddula C, Gangashetty S V. DNN based acoustic scene classification using score fusion of MFCC and inverse MFCC[C]//2018 IEEE 13th International Conference on Industrial and Information Systems(ICIIS). Rupnagar, India, 2018: 18-21. DOI:10.1109/ICIINFS.2018.8721379.
[25] Sun Y N, Yen G G, Yi Z. Evolving unsupervised deep neural networks for learning meaningful representations[J]. IEEE Transactions on Evolutionary Computation, 2019, 23(1): 89-103. DOI:10.1109/TEVC.2018.2808689.
[26] Long J Y, Zhang S H, Li C. Evolving deep echo state networks for intelligent fault diagnosis[J]. IEEE Transactions on Industrial Informatics, 2020, 16(7): 4928-4937. DOI:10.1109/TII.2019.2938884.

Memo

Memo:
Biography: Luo Xinwei (1978—), male, Ph.D., associate professor, luoxinwei@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China (No. 12174053, 91938203, 11674057, 11874109), the Fundamental Research Funds for the Central Universities (No. 2242021k30019).
Citation: Luo Xinwei, Liu Ting, Huang Ming, et al. Speech detection method based on a multi-window analysis[J]. Journal of Southeast University (English Edition), 2021, 37(4): 343-349. DOI:10.3969/j.issn.1003-7985.2021.04.001.
Last Update: 2021-12-20