
Auditory attention model based on Chirplet for cross-corpus speech emotion recognition

Journal of Southeast University (English Edition) [ISSN: 1003-7985 / CN: 32-1325/N]

Volume:
32
Issue:
2016(4)
Page:
402-407
Research Field:
Information and Communication Engineering
Publishing date:
2016-12-20

Info

Title:
Auditory attention model based on Chirplet for cross-corpus speech emotion recognition
Author(s):
Zhang Xinran¹, Song Peng², Zha Cheng¹, Tao Huawei¹, Zhao Li¹
¹Key Laboratory of Underwater Acoustic Signal Processing of Ministry of Education, Southeast University, Nanjing 210096, China
²School of Computer and Control Engineering, Yantai University, Yantai 264005, China
Keywords:
speech emotion recognition; selective attention mechanism; spectrogram feature; cross-corpus
CLC number:
TN912.34
DOI:
10.3969/j.issn.1003-7985.2016.04.002
Abstract:
To address the problem of feature mismatch between experimental databases, a key issue in cross-corpus speech emotion recognition, an auditory attention model based on Chirplet atoms is proposed for feature extraction. First, the auditory attention model is employed to detect diverse emotional features and extract spectrogram features. Then, a selective attention mechanism is applied to extract salient gist features, whose saliency information is closely related to cross-corpus recognition performance. Furthermore, Chirplet time-frequency atoms are introduced into the model: by forming an over-complete atom dictionary, the Chirplet representation increases the information content of the spectrogram features. Since samples drawn from multiple databases exhibit multi-component distributions, the Chirplet also expands the scale of the feature vector in the time-frequency domain. Experimental results show that, compared with traditional feature models, the proposed feature extraction approach combined with a prototypical classifier yields a significant improvement in cross-corpus speech emotion recognition. In addition, the proposed method is more robust when the training and testing sets come from inconsistent sources.
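
The core Chirplet idea in the abstract can be illustrated with a short sketch. The Python/NumPy code below builds an over-complete dictionary of Gaussian chirplet atoms (Gaussian envelopes with a linearly swept instantaneous frequency) and projects a single speech frame onto it; the function names, parameter grids, and frame size are illustrative assumptions rather than the authors' implementation, and the auditory attention/saliency stage that would consume these coefficients is omitted.

import numpy as np

def chirplet_atom(n, fs, t0, duration, f0, chirp_rate):
    # Unit-energy Gaussian chirplet: Gaussian envelope centred at t0 with
    # width 'duration' (s); instantaneous frequency f0 + chirp_rate * (t - t0).
    t = np.arange(n) / fs
    tau = t - t0
    envelope = np.exp(-0.5 * (tau / duration) ** 2)
    phase = 2 * np.pi * (f0 * tau + 0.5 * chirp_rate * tau ** 2)
    atom = envelope * np.exp(1j * phase)
    return atom / np.linalg.norm(atom)

def build_dictionary(n, fs, times, durations, freqs, chirp_rates):
    # Over-complete dictionary: one atom per parameter combination.
    return np.array([chirplet_atom(n, fs, t0, d, f0, c)
                     for t0 in times for d in durations
                     for f0 in freqs for c in chirp_rates])

def chirplet_features(frame, dictionary):
    # Magnitudes of the frame's inner products with every atom; coefficients
    # like these would feed the attention/saliency stage of the model.
    return np.abs(dictionary.conj() @ frame)

if __name__ == "__main__":
    fs, n = 16000, 400                           # one 25 ms frame at 16 kHz
    frame = np.random.randn(n)                   # placeholder speech frame
    D = build_dictionary(
        n, fs,
        times=np.linspace(0.0025, 0.0225, 9),    # envelope centres (s)
        durations=(0.002, 0.004, 0.008),         # envelope widths (s)
        freqs=np.linspace(100.0, 4000.0, 16),    # start frequencies (Hz)
        chirp_rates=(-2e5, 0.0, 2e5))            # frequency sweeps (Hz/s)
    feats = chirplet_features(frame, D)
    print(feats.shape)                           # (9 * 3 * 16 * 3,) -> (1296,)

With these grids the dictionary holds 1296 atoms for a 400-sample frame, so the representation is over-complete; in practice the parameter grids would be matched to the resolution of the spectrogram features used by the attention model.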


Memo

Memo:
Biographies: Zhang Xinran (1980—), male, graduate; Zhao Li (corresponding author), male, doctor, professor, zhaoli@seu.edu.cn.
Foundation items: The National Natural Science Foundation of China (No. 61273266, 61231002, 61301219, 61375028), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20110092130004), and the Natural Science Foundation of Shandong Province (No. ZR2014FQ016).
Citation: Zhang Xinran, Song Peng, Zha Cheng, et al. Auditory attention model based on Chirplet for cross-corpus speech emotion recognition[J]. Journal of Southeast University (English Edition), 2016, 32(4): 402-407. DOI: 10.3969/j.issn.1003-7985.2016.04.002.
Last Update: 2016-12-20