
[1] Luo Xinwei, Liu Ting, Huang Ming, Xu Xiaogang, Cao Hongli, et al. Speech detection method based on a multi-window analysis [J]. Journal of Southeast University (English Edition), 2021, 37(4): 343-349. [doi:10.3969/j.issn.1003-7985.2021.04.001]

Speech detection method based on a multi-window analysis

Journal of Southeast University (English Edition)[ISSN:1003-7985/CN:32-1325/N]

Volume:
37
Issue:
4
Page:
343-349
Research Field:
Information and Communication Engineering
Publishing date:
2021-12-20

Info

Title:
Speech detection method based on a multi-window analysis
Author(s):
Luo Xinwei1 Liu Ting1 Huang Ming1 Xu Xiaogang1 Cao Hongli1 Bai Xianghua2 Xu Dayong2
1Key Laboratory of Underwater Acoustic Signal Processing of Ministry of Education, Southeast University, Nanjing 210096, China
2The 1st Military Representative Office in Nanjing, Shanghai Bureau, PLA Navy Military Equipment Department, Nanjing 210000, China
Keywords:
voice activity detection; multi-window spectral analysis; K-means clustering; threshold adjustment; sequential decision
PACS:
TN911.72
DOI:
10.3969/j.issn.1003-7985.2021.04.001
Abstract:
To address the poor performance of speech signal detection at low signal-to-noise ratios (SNRs), a method is proposed to detect active speech frames based on multi-window time-frequency (T-F) diagrams. First, the T-F diagram of the signal is calculated by a multi-window T-F analysis, and a speech test statistic is constructed from the characteristic difference between the speech signal and the background noise. Second, dynamic double-threshold processing is used for a preliminary detection, and the global double thresholds are then obtained by K-means clustering. Finally, the detection results are obtained by a sequential decision. Experimental results show that the overall performance of the method is better than that of traditional methods under various SNR conditions and background noises. The method also has the advantages of low complexity, strong robustness, and adaptability to multiple languages.
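The pipeline summarized in the abstract (multi-window T-F analysis, a frame-level test statistic, K-means-derived double thresholds, and a sequential decision) can be sketched as follows. This is a minimal illustration only: the window set, the test statistic, and the hangover-style sequential rule are assumptions standing in for the authors' actual design.

```python
# Illustrative sketch of a multi-window VAD pipeline; the specific
# windows, statistic, and decision rule are assumptions, not the
# paper's exact implementation.
import numpy as np

def multiwindow_spectrogram(x, frame_len=256, hop=128):
    """Average the magnitude spectra obtained with several analysis
    windows, giving a multi-window T-F diagram."""
    windows = [np.hanning(frame_len), np.hamming(frame_len),
               np.blackman(frame_len)]
    n_frames = 1 + (len(x) - frame_len) // hop
    spec = np.zeros((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = x[i * hop:i * hop + frame_len]
        for w in windows:
            spec[i] += np.abs(np.fft.rfft(frame * w))
    return spec / len(windows)

def frame_statistic(spec):
    """Per-frame test statistic: here simply log band magnitude,
    standing in for the paper's speech/noise feature difference."""
    return np.log(spec.sum(axis=1) + 1e-12)

def kmeans_thresholds(stat, iters=20):
    """1-D 2-means clustering of the statistic; the two centroids
    serve as global low/high decision thresholds."""
    c = np.array([stat.min(), stat.max()], dtype=float)
    for _ in range(iters):
        assign = np.abs(stat[:, None] - c[None, :]).argmin(axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                c[k] = stat[assign == k].mean()
    low, high = np.sort(c)
    return low, high

def sequential_decision(stat, low, high, hangover=3):
    """Double-threshold sequential decision: a frame above `high`
    starts speech; speech continues while the statistic stays above
    `low`, with a short hangover bridging brief dips."""
    active = np.zeros(len(stat), dtype=bool)
    in_speech, count = False, 0
    for i, s in enumerate(stat):
        if s >= high:
            in_speech, count = True, hangover
        elif in_speech:
            if s >= low:
                count = hangover
            else:
                count -= 1
                in_speech = count > 0
        active[i] = in_speech
    return active

# Toy usage: low-level noise with a louder tonal burst in the middle.
rng = np.random.default_rng(0)
x = 0.05 * rng.standard_normal(8000)
x[3000:5000] += np.sin(2 * np.pi * 440 * np.arange(2000) / 8000)
stat = frame_statistic(multiwindow_spectrogram(x))
low, high = kmeans_thresholds(stat)
flags = sequential_decision(stat, low, high)  # True on active frames
```

In this toy run, frames covering the burst are flagged active while the surrounding noise-only frames are not; the K-means step removes the need to hand-tune the two thresholds per recording.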

References:

[1] Lamel L, Rabiner L, Rosenberg A, et al. An improved endpoint detector for isolated word recognition[J]. IEEE Transactions on Acoustics, Speech, and Signal Processing, 1981, 29(4): 777-785. DOI:10.1109/TASSP.1981.1163642.
[2] Lu L, Jiang H, Zhang H J. A robust audio classification and segmentation method[C]//Proceedings of the Ninth ACM International Conference on Multimedia. Ottawa, Canada, 2001: 203-211. DOI:10.1145/500141.500173.
[3] Song J L, Meng Y, Cao J M, et al. Research on digital hearing aid speech enhancement algorithm[C]//2018 37th Chinese Control Conference(CCC). Wuhan, 2018: 4316-4320. DOI:10.23919/chicc.2018.8482732.
[4] Çolak R, Akdeniz R. A novel voice activity detection for multi-channel noise reduction[J]. IEEE Access, 2021, 9: 91017-91026.
[5] Jaiswal R. Speech activity detection under adverse noisy conditions at low SNRs[C]//2021 6th International Conference on Communication and Electronics Systems(ICCES). Coimbatore, India, 2021: 97-101. DOI:10.1109/ICCES51350.2021.9488934.
[6] Masumura R, Matsui K, Koizumi Y, et al. Context-aware neural voice activity detection using auxiliary networks for phoneme recognition, speech enhancement and acoustic scene classification[C]//2019 27th European Signal Processing Conference(EUSIPCO). A Coruña, Spain, 2019: 1-5. DOI:10.23919/EUSIPCO.2019.8902703.
[7] Moldovan A, Stan A, Giurgiu M. Improving sentence-level alignment of speech with imperfect transcripts using utterance concatenation and VAD[C]//2016 IEEE 12th International Conference on Intelligent Computer Communication and Processing(ICCP). Cluj-Napoca, Romania, 2016: 171-174. DOI:10.1109/ICCP.2016.7737141.
[8] Rabiner L R, Sambur M R. An algorithm for determining the endpoints of isolated utterances[J]. The Bell System Technical Journal, 1975, 54(2): 297-315. DOI:10.1002/j.1538-7305.1975.tb02840.x.
[9] Nemer E, Goubran R, Mahmoud S. Robust voice activity detection using higher-order statistics in the LPC residual domain[J]. IEEE Transactions on Speech and Audio Processing, 2001, 9(3): 217-231. DOI:10.1109/89.905996.
[10] Marzinzik M, Kollmeier B. Speech pause detection for noise spectrum estimation by tracking power envelope dynamics[J]. IEEE Transactions on Speech and Audio Processing, 2002, 10(2): 109-118. DOI:10.1109/89.985548.
[11] Shi L, Ahmad I, He Y J, et al. Hidden Markov model based drone sound recognition using MFCC technique in practical noisy environments[J]. Journal of Communications and Networks, 2018, 20(5): 509-518. DOI:10.1109/JCN.2018.000075.
[12] Al-Ali A K H, Dean D, Senadji B, et al. Enhanced forensic speaker verification using a combination of DWT and MFCC feature warping in the presence of noise and reverberation conditions[J]. IEEE Access, 2017, 5: 15400-15413. DOI:10.1109/ACCESS.2017.2728801.
[13] Pham T V, Stark M, Rank E. Performance analysis of wavelet subband based voice activity detection in cocktail party environment[C]//The 2010 International Conference on Advanced Technologies for Communications. Ho Chi Minh City, Vietnam, 2010: 85-88. DOI:10.1109/ATC.2010.5672718.
[14] Ghosh P K, Tsiartas A, Narayanan S. Robust voice activity detection using long-term signal variability[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(3): 600-613. DOI:10.1109/TASL.2010.2052803.
[15] Tsiartas A, Chaspari T, Katsamanis N, et al. Multi-band long-term signal variability features for robust voice activity detection[C]//14th Annual Conference of the International Speech Communication Association(INTERSPEECH 2013). ISCA, 2013: 718-722. DOI:10.21437/interspeech.2013-201.
[16] Haider F, Luz S. Attitude recognition using multi-resolution cochleagram features[C]//2019 IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Brighton, UK, 2019: 3737-3741. DOI:10.1109/ICASSP.2019.8682974.
[17] Aneeja G, Yegnanarayana B. Single frequency filtering approach for discriminating speech and nonspeech[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(4): 705-717. DOI:10.1109/TASLP.2015.2404035.
[18] Makowski R, Hossa R. Voice activity detection with quasi-quadrature filters and GMM decomposition for speech and noise[J]. Applied Acoustics, 2020, 166: 107344. DOI:10.1016/j.apacoust.2020.107344.
[19] Sohn J, Kim N S, Sung W. A statistical model-based voice activity detection[J]. IEEE Signal Processing Letters, 1999, 6(1): 1-3. DOI:10.1109/97.736233.
[20] Dey J, Bin Hossain M S, Haque M A. An ensemble SVM-based approach for voice activity detection[C]//2018 10th International Conference on Electrical and Computer Engineering(ICECE). Dhaka, Bangladesh, 2018: 297-300. DOI:10.1109/ICECE.2018.8636745.
[21] Krishnakumar H, Williamson D S. A comparison of boosted deep neural networks for voice activity detection[C]//2019 IEEE Global Conference on Signal and Information Processing(GlobalSIP). Ottawa, ON, Canada, 2019: 1-5. DOI:10.1109/GlobalSIP45357.2019.8969258.
[22] Germain F G, Sun D L, Mysore G J. Speaker and noise independent voice activity detection[C]//Interspeech 2013. Lyon, France, 2013: 732-736. DOI:10.21437/interspeech.2013-204.
[23] Tachioka Y. DNN-based voice activity detection using auxiliary speech models in noisy environments[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing(ICASSP). Calgary, AB, Canada, 2018: 5529-5533. DOI:10.1109/ICASSP.2018.8461551.
[24] Paseddula C, Gangashetty S V. DNN based acoustic scene classification using score fusion of MFCC and inverse MFCC[C]//2018 IEEE 13th International Conference on Industrial and Information Systems(ICIIS). Rupnagar, India, 2018: 18-21. DOI:10.1109/ICIINFS.2018.8721379.
[25] Sun Y N, Yen G G, Yi Z. Evolving unsupervised deep neural networks for learning meaningful representations[J]. IEEE Transactions on Evolutionary Computation, 2019, 23(1): 89-103. DOI:10.1109/TEVC.2018.2808689.
[26] Long J Y, Zhang S H, Li C. Evolving deep echo state networks for intelligent fault diagnosis[J]. IEEE Transactions on Industrial Informatics, 2020, 16(7): 4928-4937. DOI:10.1109/TII.2019.2938884.

Memo

Memo:
Biography: Luo Xinwei (1978—), male, Ph.D., associate professor, luoxinwei@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China (No. 12174053, 91938203, 11674057, 11874109), the Fundamental Research Funds for the Central Universities (No. 2242021k30019).
Citation: Luo Xinwei, Liu Ting, Huang Ming, et al. Speech detection method based on a multi-window analysis[J]. Journal of Southeast University (English Edition), 2021, 37(4): 343-349. DOI:10.3969/j.issn.1003-7985.2021.04.001.
Last Update: 2021-12-20